Dealing with Data Sparsity in Drug Named Entity Recognition
Created by W.Langdon from
gp-bibliography.bib Revision:1.8051
@InProceedings{Piliouras:2013:ICHI,
author = "Dimitrios Piliouras and Ioannis Korkontzelos and
Andrew Dowsey and Sophia Ananiadou",
title = "Dealing with Data Sparsity in Drug Named Entity
Recognition",
booktitle = "IEEE International Conference on Healthcare
Informatics (ICHI 2013)",
year = "2013",
month = sep,
pages = "14--21",
keywords = "genetic algorithms, genetic programming, artificial
intelligence, drugs, medical computing, natural
language processing, pattern classification, BioNLP
tasks, automatic annotations, biomedical natural
language processing tasks, data sparsity, drug named
entity recognition, drug-NER, gold-standard data,
manual annotations, voting system, Data models,
Dictionaries, Drugs, Proteins, Training, Training data,
data-sparsity",
DOI = "10.1109/ICHI.2013.9",
abstract = "Drug Named Entity Recognition (drug-NER) is a critical
step for complex Biomedical Natural Language Processing
(BioNLP) tasks such as the extraction of
pharmaco-genomic, pharmaco-dynamic and pharmaco-kinetic
parameters. Large quantities of high quality training
data are almost always a prerequisite for employing
supervised machine-learning (ML) techniques to achieve
high classification performance. However, the human
labour needed to produce and maintain such resources is
a detrimental limitation. In this study, we attempt to
improve the performance of drug NER without relying
exclusively on manual annotations. Instead, we use
either a small gold-standard corpus (120 abstracts) or
no corpus at all. In our approach, we use a voting
system to combine a number of heterogeneous models to
enhance performance. Moreover, 11 regular-expressions
that capture common drug suffixes were evolved via
genetic-programming. We evaluate our approach against
state-of-the-art recognisers trained on manual
annotations, automatic annotations and a mixture of
both. Aggregate classifiers are shown to improve
performance, achieving a maximum F-score of 95 percent.
In addition, combined models trained on mixed data are
shown to achieve comparable performance to models
trained exclusively on gold-standard data.",
notes = "DrugBank, PK PharmacoKinetic corpus, 360 articles,
maximum-entropy maxent.sf openNLP, 8 features per
token, ANN perceptron. p17 Silver data='annotated by
direct string matching dictionary entries', AcroMine
negative, p18 GP Evolving string-similarity patterns (ie
regular expressions) USAN stem grouping restricting to
_last_ 4, 5 or 6 characters gives 'major positive
effect'. 200*(pop=10000,generations=80). anti-bloat
(max tree depth=10 no space character in terminal set)
p19 best-evolved GP tree (fig and Table 3). p19
'gold-standard data not' needed for drug-NER. 2013 and
still says 'data sparsity is pervasive...' p20 data not
split into training and holdout sets. Also known as
\cite{6680456}",
}