An Improving Genetic Programming Approach Based Deduplication Using KFINDMR
Created by W.Langdon from
gp-bibliography.bib Revision:1.8187
@Article{Shanmugavadivu:2012:IJCTT,
title = "An Improving Genetic Programming Approach Based
Deduplication Using {KFINDMR}",
author = "P. Shanmugavadivu and N. Baskar",
journal = "International Journal of Computer Trends and
Technology",
year = "2012",
volume = "3",
number = "5",
pages = "694--701",
month = sep # "-" # oct,
keywords = "genetic algorithms, genetic programming, extracting
data, identifying duplication, deduplication",
publisher = "Seventh Sense Research Group",
ISSN = "2231-2803",
bibsource = "OAI-PMH server at www.doaj.org",
oai = "oai:doaj-articles:888f8d9f98d711833425c4b976780e4e",
URL = "http://www.ijcttjournal.org/volume-3/issue-5/IJCTT-V3I5P106.pdf",
size = "8 pages",
abstract = "Record deduplication is the task of identifying,
in a data repository, records that refer to the same
real-world entity or object in spite of misspelled
words, typos, different writing styles or even
different schema representations or data types. The
existing system provides an Unsupervised Duplication
Detection (UDD) method that can be used to identify
and remove duplicate records from different data
sources. Starting from the non-duplicate set, two
cooperating classifiers, a Weighted Component
Similarity Summing (WCSS) classifier and a Support
Vector Machine (SVM), are used to iteratively identify
duplicate records among the non-duplicate records. A
genetic programming (GP) approach to record
deduplication is also presented; this GP-based approach
is able to automatically find effective deduplication
functions. Because the genetic programming approach is
time consuming, we propose a new algorithm, KFINDMR
(KFIND using Most Represented data samples), to find
the most represented data samples and improve the
accuracy of the classifier. The proposed system
calculates the mean value of the most represented data
samples as the centroid of the record members; it
selects the first most represented data sample, the one
closest to the mean value, by calculating the minimum
distance. The system removes the duplicate data samples
and finds an optimised solution to the deduplication of
records or data samples.",
}