An Integrated Cluster Analysis and
Validity Test Platform for the
Compression based Clustering Approach
Alexandra CERNIAN*, Dorin CÂRSTOIU,
Adriana OLTEANU, Valentin SGÂRCIU
University Politehnica of Bucharest,
313 Splaiul Independentei, Bucharest, Romania Alexandra.email@example.com
* Corresponding author
Abstract: This paper focuses on compression based clustering. It aims to determine the most suitable combinations of algorithms for different clustering contexts (text, heterogeneous data, Web pages, metadata and so on) and to establish whether using compression with traditional clustering methods leads to better performance. In this context, we propose an integrated cluster analysis test platform, called EasyClustering, which incorporates two subsystems: a clustering component and a cluster validity expert system, which automatically determines the quality of a clustering solution by computing the FScore value. The experimental results follow two main directions: determining the best approach for compression based clustering in terms of context, compression algorithms and clustering algorithms, and validating the ability of the cluster analysis expert system to determine the quality of the clustering solutions. After conducting a set of 324 clustering tests, we concluded that compressing the input when using traditional clustering methods increases the quality of the clustering solutions, leading to results comparable to those obtained with the NCD. The cluster analysis expert system has been 100% accurate in the tests conducted so far, so we estimate that any deviation that may occur will be minimal.
Keywords: Clustering, compression, cluster analysis, FScore, expert system.
CITE THIS PAPER AS:
Alexandra CERNIAN, Dorin CÂRSTOIU, Adriana OLTEANU, Valentin SGÂRCIU, An Integrated Cluster Analysis and Validity Test Platform for the Compression based Clustering Approach, Studies in Informatics and Control, ISSN 1220-1766, vol. 24 (2), pp. 151-158, 2015.
Clustering is an extremely powerful tool used for identifying patterns and groupings in datasets, based on the similarity between elements (Murty et al., 1999). It is considered an unsupervised process (Aggarwal and Reddy, 2013), since there is no predefined structure of the data. Clustering is applicable in many domains, ranging from biology and medicine to finance and marketing. It is used in fields such as data mining, pattern recognition, information retrieval, image analysis, market analysis, statistical data analysis and so on.
This paper presents the design, implementation and evaluation of a cluster analysis expert system, called EasyClustering, developed to assess the performance of different compression based clustering approaches and to automatically compute the quality of the solutions. The system has two main integrated components:
- A clustering component (Cernian et al., 2011), with 3 compression algorithms (ZIP, bzip2 and GZIP), 4 distance metrics (NCD, Jaro, Jaccard and Levenshtein) and 3 clustering algorithms (UPGMA, MQTC and k-means).
- A cluster analysis expert system, which performs an automatic evaluation of the quality of the clustering results, using one of the most representative quality measures, the FScore (van Rijsbergen, 1979).
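To make the NCD metric listed above concrete, the following is a minimal Python sketch of the normalized compression distance as defined by Cilibrasi and Vitányi (2005): NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed size. It is not the EasyClustering implementation. The function names `compressed_size` and `ncd` and the codec labels `"bzip2"` and `"deflate"` are assumptions made for illustration. The sketch uses Python's standard `bz2` and `zlib` modules, with DEFLATE (the algorithm underlying both ZIP and GZIP) standing in for those two formats.

```python
import bz2
import zlib


def compressed_size(data: bytes, codec: str = "bzip2") -> int:
    """Return the size in bytes of `data` after compression.

    `codec` is a hypothetical label chosen for this sketch:
    "bzip2" uses the bz2 module, "deflate" uses zlib (DEFLATE).
    """
    if codec == "bzip2":
        return len(bz2.compress(data))
    if codec == "deflate":
        return len(zlib.compress(data, 9))
    raise ValueError(f"unknown codec: {codec}")


def ncd(x: bytes, y: bytes, codec: str = "bzip2") -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest similar inputs, values near 1 dissimilar ones.
    """
    cx = compressed_size(x, codec)
    cy = compressed_size(y, codec)
    cxy = compressed_size(x + y, codec)  # size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Illustrative inputs (invented for this sketch, not taken from the paper).
doc_a = b"the quick brown fox jumps over the lazy dog " * 40
doc_b = b"the quick brown fox leaps over the lazy cat " * 40
doc_c = bytes(range(256)) * 7

for name, other in (("a vs b", doc_b), ("a vs c", doc_c)):
    print(name, round(ncd(doc_a, other), 3))
```

A pairwise matrix of such distances is the kind of input a hierarchical method such as UPGMA operates on. Because real compressors only approximate the ideal compressed size, the measured distance between a document and itself is usually small but not exactly zero.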
The research conducted with the EasyClustering platform has the following objectives:
- To establish which is the most appropriate clustering context for using the compression based approach
- To facilitate a comparative analysis of the clustering results produced by various combinations of compression algorithms, distance metrics and clustering algorithms
- To evaluate the benefits of the compression based clustering approach
- To provide an expert system component to automatically assess the quality of the clustering solutions
- To investigate if traditional clustering methods have improved performance when the input is compressed
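The automatic quality assessment named in these objectives relies on the FScore. This section does not give the exact formula used by the expert system, so the sketch below assumes one common clustering variant of the F-measure: for each reference class, take the best harmonic mean of precision and recall over all clusters, then average these values weighted by class size. The function name `fscore` and its arguments are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def fscore(true_labels: Sequence[Hashable],
           cluster_ids: Sequence[Hashable]) -> float:
    """Class-size-weighted FScore of a clustering against reference labels.

    Assumed definition (not confirmed by the paper): for class r and
    cluster i with n_ri shared items, precision = n_ri / n_i and
    recall = n_ri / n_r; each class keeps its best F value over clusters.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label and cluster sequences must have equal length")
    n = len(true_labels)
    if n == 0:
        raise ValueError("at least one item is required")

    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    overlap = Counter(zip(true_labels, cluster_ids))

    total = 0.0
    for cls, n_r in class_sizes.items():
        best = 0.0
        for clu, n_i in cluster_sizes.items():
            n_ri = overlap[(cls, clu)]
            if n_ri == 0:
                continue  # no shared items, F value is 0 for this pair
            precision = n_ri / n_i
            recall = n_ri / n_r
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_r / n) * best
    return total


# Illustrative labels (invented): a perfect clustering and a degenerate one.
reference = ["text", "text", "web", "web"]
print(fscore(reference, [0, 0, 1, 1]))
print(fscore(reference, [0, 0, 0, 0]))
```

Under this definition a clustering that reproduces the reference classes exactly scores 1, and lower values indicate that classes are split across clusters or merged together.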
The rest of the paper is structured as follows: Section 2 presents the theoretical background and some related work, Section 3 describes the EasyClustering platform and the methodology for using the platform, Section 4 presents some experimental results for validating the capabilities of this integrated system, and Section 5 draws the conclusions for this work.
REFERENCES:
- BOOCH, G., I. JACOBSON, J. RUMBAUGH, OMG Unified Modelling Language Specification, First Edition: 2010. http://omg.org/spec/UML/2.3/
- BZIP2 home page: http://bzip.org/, last accessed 23.06.2014.
- CILIBRASI, R., The CompLearn Toolkit, http://www.complearn.org/, 2003.
- CILIBRASI, R., P. M. B. VITÁNYI, Clustering by Compression, IEEE Trans. on Info. Th., vol. 51, 2005, pp. 1523-1545.
- CLUSTIO, http://www.softpedia.com/get/Science-CAD/ClusTIO.shtml, 2009.
- DE HOON, M. J. L., S. IMOTO, J. NOLAN, S. MIYANO, Open Source Clustering Software, Bioinformatics, vol. 20(9), 2004, pp. 1453-1454.
- AGGARWAL, C. C., C. K. REDDY, Data Clustering: Algorithms and Applications, CRC Press, 2013.
- MARMANIS, H., D. BABENKO, Algorithms of the Intelligent Web, Manning Publications, 2009.
- MURTY, M., A. JAIN, P. FLYNN, Data Clustering: A Review, ACM Computing Surveys, vol. 31(3), 1999.
- MILLIGAN, G. W., Clustering Validation: Results and Implications for Applied Analyses, World Scientific Publ., 1996.
- RAPIDMINER, http://rapid-i.com/content/view/181/196/, last accessed 10.06.2014.
- WANG, K., CVAP: Cluster Validity Analysis Platform (cluster analysis and validation tool), at: http://www.mathworks.com/matlabcentral/fileexchange/14620-cvap-cluster-validity-analysis-platform-cluster-analysis-and-validation-tool, 2009.
- HALL, M., E. FRANK, G. HOLMES, B. PFAHRINGER, P. REUTEMANN, I. H. WITTEN, The WEKA Data Mining Software: An Update, SIGKDD Explorations, vol. 11(1), 2009.
- CERNIAN, A., V. SGARCIU, D. CARSTOIU, Experimental Validation of the Clustering by Compression Technique, U. P. B. Scientific Bulletin, Series C, vol. 73(3), 2011, pp. 61-74.
- VAN RIJSBERGEN, C. J., Information Retrieval, 2nd ed., Butterworth, 1979.
- CERNIAN, A., D. CARSTOIU, Clustering Heterogeneous Data using Clustering by Compression, 13th WSEAS Intl. Conf. on Computer, 2009.
- CARSTOIU, D., A. CERNIAN, V. SGARCIU, A. OLTEANU, A New Method for Clustering Heterogeneous Data: Clustering by Compression, WSEAS Transactions on Computers, vol. 8(9), Sept. 2009, pp. 1461-1470.
- CERNIAN, A., D. CARSTOIU, A. OLTEANU, Clustering Heterogeneous Web Data using Clustering by Compression. Cluster Validity, 13th Intl. Symp. on Symbolic and Numeric Algorithms for Scientific Computing, 2011.
- MOCANU, S., R. DIN, D. SARU, C. POPA, Using Graphics Processing Units for Accelerated Information Retrieval, Studies in Informatics and Control, vol. 23 (3), 2014, pp. 249-256.