A Parallel Pairwise-Clustering Matching Algorithm for Large-Scale Metadata Models Using Levenshtein Distance

Seham MOAWED¹*, Ali ELDOSOUKY¹, Amany SARHAN², Sally ELGHAMRAWY³
¹ Department of Computer Engineering, Mansoura University, Mansoura, Egypt
schemamatching2020@gmail.com, schema@std.mans.edu.eg (*Corresponding author)
² Department of Computer Engineering, Tanta University, Tanta, Egypt
am_sarhan_ya@yahoo.com
³ Computers Engineering Department, MISR Higher Institute for Engineering and Technology, Mansoura, Egypt
sally_elghamrawy@ieee.org

Abstract: Applications that process multiple data sources require data integration; as a result, semantic heterogeneity has become an increasingly severe problem. It makes metadata model matching, which discovers correspondences across different metadata models, unavoidable. In large-scale settings, however, current metadata model matching systems suffer from high memory consumption and a lack of scalability. Partitioning and parallelization techniques have been proposed to reduce space and time complexity. Nevertheless, few studies have sought to improve matching efficiency on Graphics Processing Units (GPUs) when clustering techniques are used. To this end, this paper proposes PPM (Parallel Pairwise Matching), an efficient, hardware-implementable GPU matching algorithm for large-scale metadata models that relies heavily on approximate string matching (ASM) with k differences. In PPM, parallel processing is based on row-flow computing, which eliminates the data dependencies that hinder parallelism in the matching space and reduces the amount of data transferred to the GPU. At most one individual matching task is assigned to the device at a time for parallel processing. This avoids sending all independent matching tasks to the device in bulk, since individual matching tasks can be separated and carried out in parallel. Extensive tests with various workloads of metadata models on a CUDA-enabled NVIDIA GeForce GTX 860M GPU support these claims. The results indicate that the proposed algorithm outperforms the sequential algorithm by a factor of 5.21 to 30.70.
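
To illustrate the row-wise idea, the following minimal Python sketch computes the Levenshtein distance one row at a time while removing the dependency of each cell on its left neighbour. The abstract does not specify how PPM removes this dependency, so the prefix-minimum reformulation used here is an assumption chosen for illustration, and the function name `levenshtein_rowwise` is hypothetical. The sketch runs sequentially on the CPU; it only shows that each row can be built from steps that are either independent per cell or expressible as a scan.

```python
from itertools import accumulate


def levenshtein_rowwise(a: str, b: str) -> int:
    """Levenshtein distance computed row by row (illustrative sketch only)."""
    # Row 0 of the DP matrix: distance from the empty prefix of `a`
    # to each prefix of `b`.
    prev = list(range(len(b) + 1))

    for i, ca in enumerate(a, start=1):
        # Step 1: every cell uses only the PREVIOUS row (deletion or
        # substitution/match), so all cells in this step are independent
        # of each other. temp[0] is the boundary value D[i][0] = i.
        temp = [i] + [
            min(prev[j] + 1, prev[j - 1] + (ca != b[j - 1]))
            for j in range(1, len(b) + 1)
        ]

        # Step 2: the insertion term D[i][j-1] + 1 unrolls to
        # D[i][j] = j + min over k <= j of (temp[k] - k),
        # which is a prefix-minimum scan over the shifted values.
        shifted = [t - j for j, t in enumerate(temp)]
        running_min = accumulate(shifted, min)
        prev = [m + j for j, m in enumerate(running_min)]

    return prev[-1]


if __name__ == "__main__":
    # A k-difference check then reduces to comparing the distance with k.
    k = 3
    print(levenshtein_rowwise("kitten", "sitting") <= k)
```

On a GPU, step 1 maps naturally to one thread per cell and step 2 to a parallel scan; whether PPM is organized this way is not stated in the abstract.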

Keywords: Metadata models, Large-scale matching, Partitioning-based matching, Scalability, Parallel computing.


CITE THIS PAPER AS:
Seham MOAWED, Ali ELDOSOUKY, Amany SARHAN, Sally ELGHAMRAWY, A Parallel Pairwise-Clustering Matching Algorithm for Large-Scale Metadata Models Using Levenshtein Distance, Studies in Informatics and Control, ISSN 1220-1766, vol. 32(3), pp. 17-30, 2023. https://doi.org/10.24846/v32i3y202302