
Weights Space Exploration Using Genetic Algorithms for Meta-classifier in Text Document Classification*

Radu G. Creţulescu
“Lucian Blaga” University of Sibiu
10, Victoriei Street., 550024, Sibiu, ROMANIA

Daniel Ionel Morariu
“Lucian Blaga” University of Sibiu
10, Victoriei Street., 550024, Sibiu, ROMANIA

Macarie Breazu
“Lucian Blaga” University of Sibiu
10, Victoriei Street., 550024, Sibiu, ROMANIA

Lucian N. Vinţan
“Lucian Blaga” University of Sibiu
10, Victoriei Street., 550024, Sibiu, ROMANIA

Abstract:

Automatic document classification has become an important task because of the continually increasing number of text documents that users have to deal with. The aim of this paper is to develop a non-adaptive meta-classifier for text documents with increased classification accuracy. The meta-classifier combines several SVM classifiers and a Naïve Bayes classifier. We propose a new meta-classification method that takes into consideration the positions and confidence degrees obtained for all the classes. Using Genetic Algorithms, we search for the optimal weighting factors for the values returned by each individual classifier. Consequently, the meta-classifier can select as the winning class a class that none of the component classifiers ranks first. The experimental results show that the proposed method can improve classification accuracy.

Keywords:

Text Classification and Performance Evaluation, SVM, Meta-classification, Genetic Algorithms.

CITE THIS PAPER AS:
Radu G. CRETULESCU, Daniel I. MORARIU, Macarie BREAZU, Lucian N. VINŢAN, Weights Space Exploration Using Genetic Algorithms for Meta-classifier in Text Document Classification, Studies in Informatics and Control, ISSN 1220-1766, vol. 21 (2), pp. 147-154, 2012. https://doi.org/10.24846/v21i2y201204

1. Introduction

While more and more textual information is available online, effective retrieval is difficult without good indexing and summarization of document content. Document categorization is one solution to this problem. The task of document categorization is to assign a user-defined categorical label to a given document. In recent years, a growing number of categorization methods and machine learning techniques have been developed and applied in different contexts.

Documents are typically represented as vectors in a feature space. Each word in the vocabulary corresponds to a separate dimension, and the value of each component of a document's vector is the number of times the corresponding word occurs in that document.
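The representation described above can be sketched in a few lines. This is a minimal illustration only, assuming whitespace tokenization and lowercasing; the helper names `build_vocabulary` and `to_vector` and the two sample sentences are hypothetical and do not come from the paper.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct word to one dimension (sorted for a stable order)."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def to_vector(document, vocabulary):
    """Return the vector of word occurrence counts for one document."""
    counts = Counter(document.lower().split())
    # One component per vocabulary word, in vocabulary order.
    return [counts.get(word, 0) for word in vocabulary]


# Hypothetical sample documents, used only to exercise the functions.
documents = ["oil prices rise", "oil exports fall as oil prices fall"]
vocabulary = build_vocabulary(documents)
vectors = [to_vector(doc, vocabulary) for doc in documents]
```

Words that do not occur in a document get a zero component, which is why such vectors are usually very sparse for realistic vocabularies.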

In this paper we investigate a strategy, based on genetic algorithms, for combining classifiers' results in order to improve the classification accuracy. We used classifiers based on Support Vector Machine (SVM) techniques and on Naïve Bayes theory. These classifiers are less prone to performance degradation as the dimensionality of the feature space increases, and have been shown to be effective in many classification tasks. The SVM classifier is based on learning with kernels and support vectors.

We combine multiple classifiers hoping that the classification accuracy can be improved without a significant increase in response time. Instead of building just one highly accurate specialized classifier with much time and effort, we build and combine several simpler classifiers.

* A previous, shorter version of this paper was presented at “The Second International Conference on Information Science and Information Literacy”, with the title “Using Genetic Algorithms for Weight Space Exploration in an Eurovision-like weighted Meta-Classifier”.

Several combination schemes have been described in the literature [1], [5], [8]. A common approach is to build individual classifiers and later combine their judgments to make the final decision.

Another approach, which is less commonly used because it suffers from the “curse of dimensionality” [7], is to concatenate the features from each classifier into a longer feature vector and use it for the final decision. In any case, meta-classification is effective only if the synergies among its component classifiers can be exploited.

Previous studies mostly applied combination strategies ad hoc, such as majority vote, linear combination, winner-take-all [5], or Bagging and Adaboost [15]. Some rather complex strategies have also been suggested. For example, [7] and [10] present meta-classification strategies using SVM [14] and compare them with probability-based strategies.
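To make three of the simple strategies named above concrete, the sketch below gives one possible reading of majority vote, linear combination, and winner-take-all. It is illustrative only: the function names, the class labels, the confidence values, and the weights are all hypothetical, and winner-take-all is assumed here to mean "trust the single most confident classifier". It is not the weighting scheme developed in this paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def linear_combination(scores, weights):
    """Sum weighted confidences per class and return the best class.

    scores: one dict per classifier, mapping class label -> confidence.
    weights: one weight per classifier (assumed values, not tuned).
    """
    totals = {}
    for classifier_scores, weight in zip(scores, weights):
        for label, confidence in classifier_scores.items():
            totals[label] = totals.get(label, 0.0) + weight * confidence
    return max(totals, key=totals.get)


def winner_take_all(scores):
    """Return the class holding the single highest confidence overall."""
    pairs = ((label, conf) for s in scores for label, conf in s.items())
    return max(pairs, key=lambda pair: pair[1])[0]


# Hypothetical confidences from three classifiers over three classes.
scores = [
    {"grain": 0.55, "trade": 0.40, "oil": 0.05},
    {"grain": 0.50, "trade": 0.45, "oil": 0.05},
    {"grain": 0.05, "trade": 0.90, "oil": 0.05},
]
top_choices = [max(s, key=s.get) for s in scores]

by_vote = majority_vote(top_choices)
by_equal_weights = linear_combination(scores, [1.0, 1.0, 1.0])
by_skewed_weights = linear_combination(scores, [1.0, 1.0, 0.2])
by_winner = winner_take_all(scores)
```

With these made-up numbers the strategies disagree: two of three classifiers vote for one class while the most confident classifier prefers another, and the outcome of the linear combination depends on the chosen weights. This sensitivity to the weights is what motivates searching the weight space systematically rather than fixing the weights by hand.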

Sections 2 and 3 contain the prerequisites for the main work developed in this research. In Section 4 we present the methodology used for developing our experiments. Section 5 presents the experimental framework and Section 6 presents the main results of our experiments. Finally, the last section discusses the most important results, draws conclusions, and proposes some further work.

References:

  1. ANDREI, N., On Quadratic Internal Model Principle in Mathematical Programming, Studies in Informatics and Control, Vol. 18, No. 4, 2009, pp. 337-348.
  2. CHAKRABARTI, S., Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann Press, 2003.
  3. CRETULESCU, R., D. MORARIU, L. VINTAN, Eurovision-like weighted Non-Adaptive Meta-classifier for Text Documents, The 8th RoEduNet International Conference, Galati, Romania, 2009.
  4. CHEN, Q., D. ZHENG, T. ZHAO, S. LI, A Fusion of Multiple Classifiers Approach Based on Reliability Function for Text Categorization, 5th International Conference on Fuzzy Systems and Knowledge Discovery, IEEE, 2008.
  5. DIMITROVA, N., L. AGNIHOTRI, G. WEI, Video Classification Based on HMM Using Text and Face, Proceedings of the European Conference on Signal Processing, Finland, 2000.
  6. LEWIS, D., Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval. In Proceedings of the 10th European Conference on Machine Learning, 1998.
  7. LIN, W.-H., A. HAUPTMANN, News Video Classification Using SVM-based Multimodal Classifier and Combination Strategies, In Proceedings of the Tenth ACM international Conference on Multimedia, 2002.
  8. LIN, W.-H., R. JIN, A. HAUPTMANN, A Meta-classification of Multimedia Classifiers, International Workshop on Knowledge Discovery in Multimedia and Complex Data, Taiwan, 2002.
  9. MORARIU, D., L. VINTAN, V. TRESP, Meta-Classification using SVM Classifiers for Text Documents, The 3rd International Conference on Neural Computing and Pattern Recognition, Barcelona, October 2006.
  10. MORARIU, D., Text Mining Methods based on Support Vector Machine, MatrixRom, Bucharest, 2008.
  11. MORARIU, D., R. CRETULESCU, L. VINTAN, Improving a SVM Meta-classifier for Text Documents by using Naive Bayes, International Journal of Computers, Communications & Control, Vol. V, No. 3, 2010, pp. 351-361, ISSN 1841-9836, E-ISSN 1841-9844.
  12. CRISTIANINI, N., J. SHAWE-TAYLOR, An Introduction to Support Vector Machines, Cambridge University Press, 2000.
  13. Reuters Corpus: http://about.reuters.com/researchandstandards/corpus/. Released in November 2000.
  14. SCHOELKOPF, B., A. SMOLA, Learning with Kernels. Support Vector Machines, MIT Press, London, 2002.
  15. SIYANG, G., L. QUINGRUI, M. LIN, Meta-classifier in Text Classification, http://www.comp.nus.edu.sg/~zhouyong/papers/cs5228project.
  16. SUDUC, A. M., L. DUTA, G. GORGHIU, Interface Architecture for a Web-Based Group Decision Support System, Studies in Informatics and Control, Vol. 18, No. 3, 2009, pp. 241-246.
  17. GHOSH, A., L. C. JAIN, Evolutionary Computation in Data Mining, Springer Verlag Berlin Heidelberg, 2005.