Elsevier

Applied Soft Computing

Volume 46, September 2016, Pages 398-406

Predicting the effectiveness of pattern-based entity extractor inference

https://doi.org/10.1016/j.asoc.2016.05.023

Highlights

  • Pattern-based entity extraction is an essential component of many digital workflows.

  • No methods exist for predicting the accuracy of extractors generated from examples.

  • We propose a predictor based on string similarity and machine learning.

  • In-depth experiments on real and challenging data give promising results.

Abstract

An essential component of any workflow leveraging digital data consists in the identification and extraction of relevant patterns from a data stream. We consider a scenario in which an extraction inference engine generates an entity extractor automatically from examples of the desired behavior, which take the form of user-provided annotations of the entities to be extracted from a dataset. We propose a methodology for predicting the accuracy of the extractor that may be inferred from the available examples. We propose several prediction techniques and analyze our proposals experimentally in great depth, with reference to extractors consisting of regular expressions. The results suggest that reliable predictions for tasks of practical complexity may indeed be obtained quickly and without actually generating the entity extractor.

Introduction

An essential component of any workflow leveraging digital data consists in the identification and extraction of relevant patterns from a data stream. This task occurs routinely in virtually every sector of business, government, science, technology, and so on. In this work we are concerned with extraction from an unstructured text stream of entities that adhere to a syntactic pattern. We consider a scenario in which an extractor is obtained by tailoring a generic tool to a specific problem instance. The extractor may consist, e.g., of a regular expression, or of an expression in a more general formalism [1], or of full programs suitable to be executed by NLP tools [2], [3]. The problem instance is characterized by a dataset from which a specified entity type is to be extracted, e.g., VAT numbers, IP addresses, or more complex entities.

The difficulty of generating an extractor clearly depends on the specific problem. However, we are not aware of any methodology for providing a practically useful answer to questions of this sort: is generating an extractor for IP addresses more or less difficult than generating one for email addresses? Is it possible to generate an extractor for drug dosages in medical prescriptions, or for ingredients in cake recipes, with a specified accuracy level? Does the difficulty of generating an extractor for a specified entity type depend on the properties of the text that is not to be extracted? Answering such questions may not only provide crucial insights into extractor generation techniques, but may also be of practical interest to end users. For example, a prediction of low effectiveness could be countered by providing more examples of the desired extraction behavior; the user might even decide to adopt a manual approach, perhaps in crowdsourcing, for problems that appear to be beyond the scope of the extractor generation technique being used.

In this work we propose an approach for addressing questions of this sort systematically. We consider a scenario of increasing interest in which the problem instance is specified by examples of the desired behavior and the target extractor is generated automatically from those examples [4], [5], [6], [7], [8], [9], [10], [11], [12]. We propose a methodology for predicting the accuracy of the extractor that may be inferred by a given extraction inference engine from the available examples. Our prediction methodology does not depend on the inference engine internals and can in principle be applied to any inference engine: indeed, we validate it on two different engines which infer different forms of extractors.

The basic idea is to use string similarity metrics to characterize the examples. In this respect, an “easy” problem instance is one in which (i) strings to be extracted are “similar” to each other, (ii) strings not to be extracted are “similar” to each other, and (iii) strings to be extracted are not “similar” to strings not to be extracted. Despite its apparent simplicity, implementing this idea is highly challenging for several reasons.
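As an illustration of this idea (a minimal sketch, not the paper's actual feature set), one might profile a problem instance by averaging pairwise similarities within and across the two classes of snippets; here the standard library's `difflib` ratio stands in for the similarity metrics investigated later:

```python
from itertools import combinations, product
from difflib import SequenceMatcher
from statistics import mean

def sim(a, b):
    # Normalized similarity in [0, 1]; a stand-in for the metrics
    # evaluated in the paper.
    return SequenceMatcher(None, a, b).ratio()

def instance_profile(positives, negatives):
    """Three statistics sketching how 'easy' an instance is: high
    intra-class similarity with low cross similarity suggests an
    easier extraction task."""
    return {
        "intra_pos": mean(sim(a, b) for a, b in combinations(positives, 2)),
        "intra_neg": mean(sim(a, b) for a, b in combinations(negatives, 2)),
        "cross": mean(sim(a, b) for a, b in product(positives, negatives)),
    }

dates = ["2-3-1979", "7-2-2011", "12-11-2003"]   # snippets to extract
other = ["foo bar", "hello world", "lorem ipsum"]  # snippets not to extract
profile = instance_profile(dates, other)
```

On this toy instance the intra-positive similarity clearly exceeds the cross similarity, matching intuition (i)-(iii) above.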

To be practically useful, a prediction methodology shall satisfy these requirements: (a) the prediction must be reliable; (b) it must be computed without actually generating the extractor; (c) it must be computed very quickly w.r.t. the time taken for inferring the extractor. First and foremost, predicting the performance of a solution without actually generating the solution is clearly very difficult (see also the related work section).

Second, it is not clear to which degree a string similarity metric can capture the actual difficulty of inferring an extractor for a given problem instance. Consider, for instance, the Levenshtein distance (string edit distance) applied to a problem instance in which the entities to be extracted are dates. Two dates (e.g., 2-3-1979 and 7-2-2011, whose edit distance is 6) could be as distant as a date and a snippet not to be extracted (e.g., 2-3-1979 and 19.79$, whose edit distance is also 6); yet the dates could be extracted by a regular expression that is very compact, does not extract any of the other snippets, and could be very easy to generate (\d+-\d+-\d+). However, many string similarity metrics exist and their effectiveness depends tightly on the specific application [13], [14]. Indeed, one of the contributions of our proposal is precisely to investigate which metric is the most suitable for assessing the difficulty of extractor inference.
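The date example can be verified directly; a short sketch with a textbook dynamic-programming Levenshtein implementation (not code from the paper):

```python
import re

def levenshtein(a, b):
    # Classic dynamic-programming edit distance over two rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Two dates are as far apart as a date and a non-date snippet...
assert levenshtein("2-3-1979", "7-2-2011") == 6
assert levenshtein("2-3-1979", "19.79$") == 6

# ...yet a compact regular expression separates the two classes.
pattern = re.compile(r"\d+-\d+-\d+")
assert pattern.fullmatch("2-3-1979") and pattern.fullmatch("7-2-2011")
assert pattern.fullmatch("19.79$") is None
```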

Third, the number of snippets in an input text grows quadratically with the text size and becomes huge very quickly—e.g., a text composed of just 10⁵ characters includes ≈10¹⁰ snippets. It follows that computing forms of similarity between all pairs of snippets may be feasible for snippets that are to be extracted but is not practically feasible for snippets that are not to be extracted.
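The quadratic growth is easy to quantify: a text of n characters contains one snippet per (start, end) pair, i.e., n(n+1)/2 candidate substrings.

```python
def snippet_count(n):
    # Every contiguous substring of an n-character text is a candidate
    # snippet: n(n+1)/2 distinct (start, end) pairs.
    return n * (n + 1) // 2

print(snippet_count(10 ** 5))  # 5000050000, on the order of 10^10
```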

We propose several prediction techniques and analyze our proposals experimentally in great depth, with reference to a number of different similarity metrics and of challenging problem instances. We validate our techniques with respect to a state-of-the-art extractor generation approach that we have recently proposed [5], [6], [9]; we further validate our predictor on a worse-performing alternative extractor generator [15] which works internally in a different way. The results are highly encouraging, suggesting that reliable predictions for tasks of practical complexity may indeed be obtained quickly.

Section snippets

Related work

Although we are not aware of any work specifically devoted to predicting the effectiveness of a pattern-based entity extractor inference method, several research fields have addressed similar issues. The underlying common motivation is twofold: inferring a solution to a given problem instance may be a lengthy procedure, and the inference procedure is based on heuristics that cannot provide any optimality guarantees. Consequently, lightweight methods for estimating the quality of a

Pattern-based entity extraction

The application problem consists in extracting entities that follow a syntactic pattern from a potentially large text. Extraction is performed by means of an extractor tailored to the specific pattern of interest. We consider a scenario in which the extractor is generated automatically by an extraction inference engine, based on examples of the desired behavior in the form of snippets to be extracted (i.e., the entities) and of snippets not to be extracted. Such examples usually consist of

Our prediction method

Our prediction method consists of three steps. First, we transform the input (s, X, s′) into an intermediate representation which is suitable for processing with string similarities. Second, we extract a set of numerical features consisting of several statistics of similarities among strings of the intermediate representation. Finally, we apply a regressor to the vector of features and obtain an estimate f̂ of the F-measure f′ which an extractor would have on X′.

In the following sections, we
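The third step might be sketched as follows, with a least-squares linear model (LM is one of the regressors considered) standing in for the learned predictor; the single feature and the training values are purely illustrative:

```python
from statistics import mean

def fit_linear_predictor(xs, ys):
    # Ordinary least squares for f_hat = a + b*x, where x is a single
    # similarity-derived feature of a solved problem instance and y is
    # the F-measure observed for the extractor inferred from it.
    xb, yb = mean(xs), mean(ys)
    b = (sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
         / sum((x - xb) ** 2 for x in xs))
    a = yb - b * xb
    return lambda x: a + b * x

# Illustrative training set of solved instances: one feature each
# (e.g., intra-positive minus cross similarity) and its F-measure.
x_train = [0.8, 0.1, 0.5]
y_train = [0.9, 0.2, 0.6]
predict = fit_linear_predictor(x_train, y_train)
f_hat = predict(0.6)  # estimated F-measure for a new instance
```

The real pipeline uses a vector of many such features and also considers random forest and SVM regressors.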

Experimental evaluation

We constructed and assessed experimentally all the 48 prediction model variants resulting from the combination of: 2 feature set construction methods (Sample and Rep, Section 4.2); 8 string similarity metrics (Section 4.2.1); 3 regressors (LM, RF, and SVM, Section 4.3). We trained each model variant with a set of solved problem instances Etrain and assessed the resulting predictor on a set of solved problem instances disjoint from Etrain, as detailed in the next sections.
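The 48-variant count is the Cartesian product of the three design dimensions; a sketch (the eight metric names are placeholders, since only the dimension sizes are stated here):

```python
from itertools import product

feature_sets = ["Sample", "Rep"]                  # Section 4.2
metrics = [f"metric_{i}" for i in range(1, 9)]    # 8 placeholder names
regressors = ["LM", "RF", "SVM"]                  # Section 4.3
variants = list(product(feature_sets, metrics, regressors))
print(len(variants))  # 48
```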

Concluding remarks

We have considered a scenario in which an extraction inference engine generates an extractor automatically from user-provided examples of the entities to be extracted from a dataset. We have addressed the problem of predicting the accuracy of the extractor that may be inferred from the available examples, by requiring that the prediction be obtained very quickly w.r.t. the time required for actually inferring the extractor. This problem is highly challenging and we are not aware of any earlier

Acknowledgements

The authors are grateful to the anonymous reviewers for their constructive comments.

References (35)

  • K. Smith-Miles et al.

    Measuring instance difficulty for combinatorial optimization problems

    Comput. Oper. Res.

    (2012)
  • Stanford NLP Group

    TokensRegex

    (2011)
  • Apache UIMA Project, UIMA Ruta rule-based text annotation,...
  • P. Kluegl et al.

UIMA Ruta: rapid development of rule-based information extraction applications

    Nat. Lang. Eng. FirstView

    (2015)
  • A. Bartoli et al.

    Data quality challenge: toward a tool for string processing by examples

    J. Data Inf. Qual.

    (2015)
  • A. Bartoli et al.

    Learning text patterns using separate-and-conquer genetic programming

  • A. Bartoli et al.

    Inference of regular expressions for text extraction from examples

    IEEE Trans. Knowl. Data Eng.

    (2016)
  • R.A. Cochran et al.

    Program boosting: program synthesis via crowd-sourcing

  • V. Le et al.

FlashExtract: a framework for data extraction by examples

  • A. Bartoli et al.

    Automatic synthesis of regular expressions from examples

    Computer

    (2014)
  • K. Davydov et al.

    Smart Autofill – Harnessing the Predictive Power of Machine Learning in Google Sheets

    (2014, October)
  • F. Brauer et al.

    Enabling information extraction by inference of regular expressions from sample entities

  • Y. Li et al.

    Regular expression learning for information extraction

  • M. Cheatham et al.

    String similarity metrics for ontology alignment

  • W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: Proceedings of the...
  • S.M. Lucas et al.

    Learning deterministic finite automata with a smart state labeling evolutionary algorithm

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2005)
  • J. Pihera et al.

    Application of machine learning to algorithm selection for TSP
