Almost Optimal Algorithms for Detecting Near-Duplicates in Domain-Independent Big Data

Fellah, Aziz (2022) Almost Optimal Algorithms for Detecting Near-Duplicates in Domain-Independent Big Data. In: Recent Advances in Mathematical Research and Computer Science Vol. 9. B P International, pp. 41-66. ISBN 978-93-5547-483-4

Full text not available from this repository.

Abstract

In this chapter, we propose Merge-Filter Representative-based Clustering (Merge-Filter-RC), a general domain-independent approach for detecting near-duplicate records within a single data source and across multiple data sources. We then develop three almost optimal classes of algorithms: constant threshold (CT), variable threshold (VT), and function threshold (FT), which we refer to collectively as the All-Three algorithms. Merge-Filter-RC and All-Three form the foundation of this work. Merge-Filter-RC works recursively in a divide-and-merge fashion to distill local and global near-duplicates into hierarchical clusters with prototype representatives. Each cluster is distinguished by one or more representatives, which are dynamically refined. New records are compared against these representatives rather than against every cluster member, which reduces the number of pairwise comparisons and thus the search space. Furthermore, we categorize the results of the comparisons using labels such as "very similar," "similar," and "not similar." We supplement the All-Three algorithms with a more thorough reexamination of the original well-tuned features of Monge-Elkan’s (ME) seminal work, which we circumvented using an affine variant of the Smith-Waterman (SW) similarity measure. We performed several experiments and extensive analysis using both real-world benchmarks and synthetically generated data sets to show that the All-Three algorithms based on the Merge-Filter-RC approach outperform Monge-Elkan’s algorithm in terms of accuracy in detecting near-duplicates. In addition, the All-Three algorithms are as computationally efficient as Monge-Elkan’s algorithm.
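
The full text is not available here, so the following is only an illustrative sketch of the general ideas named in the abstract, not the chapter's Merge-Filter-RC or All-Three algorithms. It assumes a Smith-Waterman local alignment score with affine gap penalties, normalized by the longer string, and a single-pass clustering with one fixed representative per cluster and a constant threshold. All function names, scoring parameters, and threshold values (`sw_affine`, `similarity`, `label`, `cluster`, 0.9, 0.7) are hypothetical choices made for this example.

```python
def sw_affine(a, b, match=2, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Best local alignment score of a and b with affine gap penalties.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    The scoring values are assumptions chosen for illustration.
    """
    neg = float("-inf")
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]    # best score ending at (i, j)
    E = [[neg] * (m + 1) for _ in range(n + 1)]  # alignments ending in a gap along b
    F = [[neg] * (m + 1) for _ in range(n + 1)]  # alignments ending in a gap along a
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i][j - 1] + gap_extend, H[i][j - 1] + gap_open)
            F[i][j] = max(F[i - 1][j] + gap_extend, H[i - 1][j] + gap_open)
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


def similarity(a, b, match=2):
    """Normalize the alignment score to [0, 1] by the longer string's length."""
    if not a or not b:
        return 0.0
    return sw_affine(a, b, match=match) / (match * max(len(a), len(b)))


def label(score, very=0.9, similar=0.7):
    """Map a score to the three labels quoted in the abstract (cutoffs assumed)."""
    if score >= very:
        return "very similar"
    if score >= similar:
        return "similar"
    return "not similar"


def cluster(records, threshold=0.7):
    """Single-pass clustering against one representative per cluster.

    Each record is compared only with cluster representatives, not with
    every member, which is what keeps the number of comparisons down.
    The representative here is simply the first member and is never refined.
    """
    clusters = []  # each cluster: {"rep": str, "members": [str]}
    for record in records:
        best_cluster, best_score = None, threshold
        for c in clusters:
            score = similarity(record.lower(), c["rep"].lower())
            if score >= best_score:
                best_cluster, best_score = c, score
        if best_cluster is None:
            clusters.append({"rep": record, "members": [record]})
        else:
            best_cluster["members"].append(record)
    return clusters
```

For example, `cluster(["John Smith", "Jon Smith", "Zzz Qqq"])` groups the first two records under the representative `"John Smith"` and leaves the third in its own cluster. A variable or function threshold, as the VT and FT names suggest, would presumably replace the fixed `threshold` argument, but how the chapter defines them cannot be determined from the abstract alone.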

Item Type: Book Section
Subjects: Impact Archive > Computer Science
Depositing User: Managing Editor
Date Deposited: 12 Oct 2023 07:43
Last Modified: 12 Oct 2023 07:43
URI: http://research.sdpublishers.net/id/eprint/3073
