Automated retrieval and extraction of training course information from unstructured web pages
Created by W.Langdon from
gp-bibliography.bib Revision:1.8178
@PhdThesis{Xhemali:thesis,
author = "Daniela Xhemali",
title = "Automated retrieval and extraction of training course
information from unstructured web pages",
school = "Loughborough University",
year = "2010",
type = "Engineering Doctorate",
address = "Leicestershire, LE11 3TU, UK",
month = "9 " # jul,
keywords = "genetic algorithms, genetic programming, Web page,
Information Retrieval, Information Extraction, Web
Classifier, Naive Bayes Classifiers, Regular
Expressions",
URL = "http://hdl.handle.net/2134/7022",
URL = "https://dspace.lboro.ac.uk/dspace-jspui/bitstream/2134/7022/2/EngD-Thesis-DanielaXhemali.pdf",
size = "236 pages",
abstract = "Web Information Extraction (WIE) is the discipline
dealing with the discovery, processing and extraction
of specific pieces of information from semi-structured
or unstructured web pages. The World Wide Web comprises
billions of web pages and there is much need for
systems that will locate, extract and integrate the
acquired knowledge into organisations' practices. There
are some commercial, automated web extraction software
packages; however, their success comes from heavily
involving their users in the process of finding the
relevant web pages, preparing the system to recognise
items of interest on these pages and manually dealing
with the evaluation and storage of the extracted
results. This research has explored WIE, specifically
with regard to the automation of the extraction and
validation of online training information. The work
also includes research and development in the area of
automated Web Information Retrieval (WIR), more
specifically in Web Searching (or Crawling) and Web
Classification. Different technologies were considered;
however, after much consideration, Naive Bayes Networks
were chosen as the most suitable for the development of
the classification system. The extraction part of the
system used Genetic Programming (GP) for the generation
of web extraction solutions. Specifically, GP was used
to evolve Regular Expressions, which were then used to
extract specific training course information from the
web, such as course names, prices, dates and locations.
The experimental results indicate that all three
aspects of this research perform very well, with the
Web Crawler outperforming existing crawling systems,
the Web Classifier performing with an accuracy of over
95 percent and a precision of over 98 percent, and the
Web Extractor achieving an accuracy of over 94 percent
for the extraction of course titles and an accuracy of
just under 67 percent for the extraction of other course
attributes such as dates, prices and locations.
Furthermore, the overall work is of great significance
to the sponsoring company, as it simplifies and
improves the existing time-consuming, labour-intensive
and error-prone manual techniques, as will be discussed
in this thesis. The prototype developed in this
research works in the background and requires very
little, often no, human assistance.",
notes = "Sorting programs p218-219. Daniela Birdsall",
}