Entropy Based Measurement of Text Dissimilarity for Duplicate-Detection

Venkatesh Kumar, G. Rajendran

Abstract


Identifying approximate similarity between pairs of strings is an essential step in data cleansing and data integration. Most existing approaches rely on generic or manually tuned distance metrics to estimate the similarity of potential duplicates, but they do not report a similarity percentage for a pair of strings. In this paper we propose a method that uses entropy and information gain (IG) to measure the dissimilarity between a pair of strings, with the aim of improving data accuracy.
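
The abstract does not give the formulas, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that each string is summarised by its lowercase character frequencies, that entropy is Shannon entropy in bits, and that information gain is the entropy of the pooled characters minus the length-weighted entropies of the two strings. The function names `entropy` and `dissimilarity`, and the normalisation to a 0 to 1 score, are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a frequency table."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def dissimilarity(a, b):
    """Normalised information gain between two strings, in [0, 1].

    Assumption (not from the paper): characters are compared case-insensitively
    and order is ignored, so only character frequencies matter.
    """
    ca, cb = Counter(a.lower()), Counter(b.lower())
    na, nb = sum(ca.values()), sum(cb.values())
    if na == 0 and nb == 0:
        return 0.0  # two empty strings are treated as identical
    if na == 0 or nb == 0:
        return 1.0  # an empty string shares nothing with a non-empty one
    n = na + nb
    wa, wb = na / n, nb / n
    # Information gain: how much knowing the source string reduces
    # uncertainty about a character drawn from the pooled text.
    gain = entropy(ca + cb) - wa * entropy(ca) - wb * entropy(cb)
    # The gain cannot exceed the entropy of the source label itself,
    # so dividing by it yields a score between 0 and 1.
    label_entropy = -(wa * math.log2(wa) + wb * math.log2(wb))
    return min(1.0, max(0.0, gain / label_entropy))


if __name__ == "__main__":
    for left, right in [("John Smith", "Jon Smith"), ("John Smith", "Mary Jones")]:
        print(f"{left!r} vs {right!r}: {100 * dissimilarity(left, right):.1f}% dissimilar")
```

Under these assumptions, identical strings score 0 and strings with no characters in common score 1, so the result can be read as a dissimilarity percentage. Because character order is ignored, anagrams also score 0; counting character n-grams instead of single characters would be one way to reduce that effect.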


DOI: 10.5539/mas.v4n9p142

This work is licensed under a Creative Commons Attribution 3.0 License.

Modern Applied Science   ISSN 1913-1844 (Print)   ISSN 1913-1852 (Online)

Copyright © Canadian Center of Science and Education
