Studies in Informatics and Control
Vol. 32, No. 3, 2023

A Parallel Pairwise-Clustering Matching Algorithm for Large-Scale Metadata Models Using Levenshtein Distance

Seham MOAWED, Ali ELDOSOUKY, Amany SARHAN, Sally ELGHAMRAWY

Abstract

Data integration is required by applications that process multiple data sources, and semantic heterogeneity across those sources has become increasingly severe. The problem is unavoidable and calls for metadata model matching, which discovers correspondences across different metadata models. At large scale, however, current metadata model matching systems suffer from high memory consumption and a lack of scalability. Partitioning and parallelization techniques have been proposed to reduce space and time complexity. Nevertheless, few studies have addressed improving matching efficiency on Graphics Processing Units (GPUs) when clustering techniques are used. To this end, the present paper proposes an efficient, hardware-implementable GPU matching algorithm dedicated to large-scale metadata models, named PPM (Parallel Pairwise Matching), which relies heavily on k-difference approximate string matching (ASM). In PPM, parallel processing is based on row-flow computing, organized so that it eliminates the data dependency obstacles in the matching space. It also decreases the amount of data transferred to the GPU. At most one individual matching task is assigned to the device at a time for parallel processing. This avoids shipping all independent matching tasks to the device in bulk, where the individual tasks would then be separated and carried out in parallel. Extensive tests using various workloads of metadata models on a CUDA-enabled NVIDIA GeForce GTX 860M GPU support these claims. The results indicate that the proposed algorithm outperforms the sequential algorithm by a factor of 5.21 to 30.70.
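
As an illustration of the row-wise idea only, the following Python sketch fills the Levenshtein table one row at a time and splits each row into a step that depends solely on the previous row and a prefix-minimum step that resolves the remaining in-row dependency. It is not the PPM algorithm from the paper: the function names `levenshtein_rowwise` and `match_pairs`, the case-insensitive comparison, and the reading of "k-difference" as a threshold on edit distance are all assumptions made for this example, and the code runs sequentially on the CPU.

```python
def levenshtein_rowwise(a: str, b: str) -> int:
    """Edit distance between a and b, filled one DP row at a time."""
    prev = list(range(len(b) + 1))  # row 0: distance from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        # Step 1: deletion and substitution read only the previous row,
        # so each column j could be handled by its own GPU thread.
        cand = [i] + [
            min(prev[j] + 1, prev[j - 1] + (ca != b[j - 1]))
            for j in range(1, len(b) + 1)
        ]
        # Step 2: insertion (cur[j-1] + 1) is the only in-row dependency.
        # Unrolled, cur[j] = min over k <= j of cand[k] + (j - k),
        # which is a prefix minimum of cand[k] - k, shifted back by j.
        cur = []
        best = cand[0]
        for j, c in enumerate(cand):
            best = min(best, c - j)
            cur.append(best + j)
        prev = cur
    return prev[-1]


def match_pairs(names_a, names_b, k=2):
    """Hypothetical pairwise matcher: keep pairs within k edits (assumed threshold)."""
    return [
        (x, y)
        for x in names_a
        for y in names_b
        if levenshtein_rowwise(x.lower(), y.lower()) <= k
    ]
```

Because step 2 is a prefix minimum, it maps onto a parallel scan, which is one known way to remove the left-neighbour dependency within a row; whether PPM uses this particular technique is not stated in the abstract.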

Keywords

Metadata models, Large-scale matching, Partitioning-based matching, Scalability, Parallel computing.
