Statistical Methods for Performance Evaluation of WEB Document Classification

Daniel VOLOVICI 
‘Lucian Blaga’ University of Sibiu,
10, Victoriei Blv., 550024, Sibiu, Romania

Macarie BREAZU
‘Lucian Blaga’ University of Sibiu,
10, Victoriei Blv., 550024, Sibiu, Romania

Gabriel Dacian CUREA
‘Lucian Blaga’ University of Sibiu,
10, Victoriei Blv., 550024, Sibiu, Romania

Daniel Ionel MORARIU
‘Lucian Blaga’ University of Sibiu,
10, Victoriei Blv., 550024, Sibiu, Romania

Abstract: The principal aim of this paper is to review the main statistical methods for classifying documents that can be easily adapted to the context of Web document retrieval. After presenting the most popular classification methods, we define the most accurate indicators for assessing classifier performance: recall, precision, F-score, sensitivity and specificity. We also describe how these indicators can be calculated in the context of Web documents.

Keywords: Information retrieval, Classification, Naïve Bayes, Evaluation metrics.

CITE THIS PAPER AS:
Daniel VOLOVICI, Macarie BREAZU, Dacian CUREA, Daniel Ionel MORARIU, Statistical Methods for Performance Evaluation of WEB Document Classification, Studies in Informatics and Control, ISSN 1220-1766, vol. 19 (1), pp. 169-176, 2010.

1. Introduction

Access to information is an increasingly frequent topic of discussion at both national and international level. Today we can no longer speak only of traditional information skills, now that each country can have access to a virtual planetary database.

The main reason why people seek information in a medium other than the traditional one is the need for specialised information. For example, in the field of science, it takes a long time to update information. Some specialised, cutting-edge information can be found only in this medium, because the lengthy publication process required for the traditional format makes printed books obsolete.

The development of online services must be the main concern in the library world. The “WWW Library Directory” magazine [15] identifies over 30 types of services that involve the use of the Internet, such as reference services, database and index sites, search guides, information services for trade and industry, image banks, and so on.

In December 1999 the European Commission launched an initiative entitled “eEurope: An Information Society for All” [16], which proposed ambitious targets, namely to bring the benefits of the information society to all Europeans. The initiative focuses on ten priority areas, from education to transportation and from health to disability issues. The idea behind this initiative was to build a strategy to modernize the European economy, in the hope that it would become “the most competitive and dynamic knowledge-based economy in the world” [4]. In the same spirit, a project was also started for Romania [1].

Recently, a new generation of Web technologies has emerged, designed around the concept of the Semantic Web, a project launched by Tim Berners-Lee [2]. The Semantic Web seeks to access data with heterogeneous semantics and to obtain useful knowledge from these data through the various services offered in the Web space. It aims to improve communication between people using different technologies, to extend the interoperability of databases, and to provide new mechanisms for agent-based data computation in which people and machines work together online, making possible a new level of interaction between scientific communities [5] [12].

References:

  1. Banciu, D., e-Romania - A Citizens’ Gateway towards Public Information, Journal of Studies in Informatics and Control, Vol. 18, No. 3, 2009.
  2. Berners-Lee, T., J. Hendler, O. Lassila, The Semantic Web, Scientific American, Vol. 284, No. 5, May 2001, pp. 34-43.
  3. Goller, G., J. Loning, T. Will, W. Wolff, Automatic Document Classification: A thorough Evaluation of Various Methods, Internationalen Symposiums für Informationswissenschaft, Darmstadt, Nov. 2000, pp. 145-162.
  4. Hand, D., H. Mannila, P. Smyth, Principles of Data Mining, MIT Press, Cambridge, MA, 2001, ISBN 0-262-08290-X.
  5. Hendler, J., Science and the Semantic Web, Science, 299, January 2003.
  6. Hong, S. J., Use of Contextual Information for Feature Ranking and Discretization, in proceedings of IEEE Transactions on Knowledge and Data Engineering – 1997, available at www.research.ibm.com/dar/papers/pdf/tkde-cm_with_cover.pdf, accessed February 2010.
  7. Huang, J., C. X. Ling, Constructing New and Better Evaluation Measures for Machine Learning, available at http://www.ijcai.org/papers07/Papers/IJCAI07-138.pdf, accessed August 2007.
  8. Kononenko, I., On Biases in Estimating Multi-Valued Attributes, International Joint Conference on Artificial Intelligence, 1995, pp. 1034-1040.
  9. Pop, I., Strategies for the Classification Cost Calculus, International Journal of Computers, Communications & Control, Vol. II (2007), No. 4, ISSN 1841-9844, accepted August 2007.
  10. Sokolova, M., Learning from Communication Data: Language in Electronic Business Negotiations, Ph.D. dissertation, 2006, available at www.etud.iro.umontreal.ca/~sokolovm/ThesisSokolova.pdf, accessed August 2007.
  11. Sokolova, M., G. Lapalme, Performance Measures in Classification of Human Communications, a study available at http://www-etud.iro.umontreal.ca/~sokolovm/PMF.pdf, accessed 2007.
  12. Oprean, C., C. Kifor, B. Barbat, D. Banciu, E-Maieutics in Post-industrial Engineering Education, Journal of Studies in Informatics and Control, Vol. 19, No. 1, 2010.
  13. Vilalta, R., M. Brodie, D. Oblinger, I. Rish, A Unified Framework for Evaluation Metrics in Classification Using Decision Trees, Machine Learning: ECML 2001: 12th European Conference on Machine Learning, Freiburg, Germany, September 2001, Proceedings, Lecture Notes in Computer Science, Springer Berlin/Heidelberg, Vol. 2167/2001, pp. 503-511.
  14. Witten, I. H., E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Academic Press, Morgan Kaufmann Publishers, ISBN: 1-55860-552-5, 1999.
  15. WWW Library Directory – http://travelinlibrarian.info/libdir/ – accessed in January 2010.
  16. eEurope: An Information Society For All – http://www.w3.org/WAI/References/eEurope – accessed in January 2010.

https://doi.org/10.24846/v19i2y201007