
Digital Information Retrieval

Daniel VOLOVICI
‘Lucian Blaga’ University of Sibiu,
10, Victoriei Blv., 550024, Sibiu, Romania

Macarie BREAZU
‘Lucian Blaga’ University of Sibiu,
10, Victoriei Blv., 550024, Sibiu, Romania

Daniel Ionel MORARIU
‘Lucian Blaga’ University of Sibiu,
10, Victoriei Blv., 550024, Sibiu, Romania

Adi-Cristina MITEA
‘Lucian Blaga’ University of Sibiu,
10, Victoriei Blv., 550024, Sibiu, Romania

Abstract: The retrieval of digital information requires special attention to the digital representation of documents. Data reduction, based on informational measures, provides the theoretical basis for optimizing the storage of digital documents. In this paper, besides analyzing data reduction and document storage, we also study the multimedia data models related to the MMDBMS architecture. These theoretical aspects have been applied in the development of SCRIBe – Information System for Processing and Visualization of Old Books Inventory.

Keywords: Information Theory, Document Representation, Information Retrieval, Digital Libraries.

CITE THIS PAPER AS:
Daniel VOLOVICI, Macarie BREAZU, Daniel Ionel MORARIU, Adi-Cristina MITEA, Digital Information Retrieval, Studies in Informatics and Control, ISSN 1220-1766, vol. 19 (2), pp. 185-192, 2010.

1. Introduction

Information Theory [3] answers two fundamental questions in communication theory: what is the ultimate data compression (the entropy H) and what is the ultimate rate of data transmission (the channel capacity C)? For this reason, some consider information theory a subfield of communication theory. In fact, it has made fundamental contributions to statistical physics (thermodynamics), computer science (Kolmogorov complexity and algorithmic complexity), statistical inference, and probability theory and statistics.

Shannon argued that random processes such as music and speech have an irreducible complexity below which the signal cannot be compressed. He called this quantity entropy, by analogy with its use in thermodynamics.
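
For illustration (a minimal sketch, not part of the original paper), the entropy of a discrete memoryless source can be computed directly from its symbol probabilities; it gives the number of bits per symbol below which the source cannot be compressed:

    import math

    def entropy(probs):
        # Shannon entropy H(X) = -sum p(x) * log2 p(x), in bits per symbol.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # A source emitting four symbols with these probabilities cannot be
    # compressed below 1.75 bits per symbol on average.
    print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75
    print(entropy([0.25, 0.25, 0.25, 0.25]))    # 2.0 (uniform: hardest to compress)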

Information Theory provides the means to establish the ultimate limits of communication. However, these theoretically optimal communication schemes, elegant as they are, prove not to be computationally feasible. Advances in integrated circuits and code design now allow us to obtain some of the gains suggested by Shannon's theory. A good example of applying these ideas from information theory is the use of error-correcting codes on CDs.
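
As a toy illustration of this idea (illustrative only; CDs actually use Reed-Solomon codes, and the Hamming(7,4) code below is just a minimal sketch), redundant parity bits allow a single corrupted bit in a block to be located and corrected:

    def hamming74_encode(d):
        # Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7).
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
        p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
        p3 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
        return [p1, p2, d1, p3, d2, d3, d4]

    def hamming74_correct(c):
        # Locate and flip a single-bit error using the parity-check syndrome.
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
        pos = s1 + 2 * s2 + 4 * s3   # 0 means no error detected
        if pos:
            c[pos - 1] ^= 1
        return c

    word = hamming74_encode([1, 0, 1, 1])
    word[5] ^= 1                     # corrupt one bit during "transmission"
    print(hamming74_correct(word))   # the original codeword is recovered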

Modern research on the communication aspects of information theory has focused on network information theory, the theory of the simultaneous rates of communication from many senders to many receivers in a network. Some of the gains in communication rates between senders and receivers were unexpected, yet have a certain mathematical simplicity. However, a unifying theory remains to be found.

Computer science (Kolmogorov complexity). Kolmogorov, Chaitin and Solomonoff put forward the idea that the complexity of a string of symbols can be defined as the length of the shortest binary program that generates that string. The complexity is thus the length of the minimal description. This definition of complexity is universal, independent of the particular computer, and of fundamental importance. Kolmogorov complexity underlies the descriptive theory of complexity.

Fortunately, Kolmogorov complexity is approximately equal to the Shannon entropy H if the sequence is drawn at random from a distribution with entropy H. We consider Kolmogorov complexity to be more fundamental than Shannon entropy: it is the ultimate limit of data compression and leads to a logically consistent inference procedure.
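
Kolmogorov complexity itself is not computable, but the length of any compressed description is a computable upper bound on it. The following sketch (illustrative only, using the general-purpose zlib compressor) shows how a highly structured string admits a much shorter description than a random string of the same length:

    import os
    import zlib

    def description_length(s: bytes) -> int:
        # Length of a zlib-compressed description: a computable upper bound
        # on the (uncomputable) Kolmogorov complexity of s.
        return len(zlib.compress(s, 9))

    regular = b"ab" * 5000        # highly structured: a short program generates it
    random_ = os.urandom(10000)   # incompressible with high probability

    print(description_length(regular))   # a few dozen bytes
    print(description_length(random_))   # close to 10000 bytes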

There is a pleasing complementary relationship between algorithmic complexity and computational complexity. Computational complexity (time complexity) and Kolmogorov complexity (program length, or descriptive complexity) can be viewed as two axes corresponding to the running time of a program and its length. Kolmogorov complexity focuses on minimizing along the second axis, while computational complexity focuses on minimizing along the first; few attempts have been made to minimize simultaneously along both axes.

Physics (thermodynamics). Statistical mechanics is the birthplace of entropy and of the second law of thermodynamics: entropy always increases.

Mathematics (probability theory and statistics). The fundamental measures of information theory, relative entropy and mutual information, describe the behavior of long sequences of random variables and allow us to estimate the probabilities of rare events and to find the best error exponent in hypothesis testing.
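
As a brief illustration (not part of the original paper), both quantities can be evaluated directly for small discrete distributions; mutual information is simply the relative entropy between the joint distribution and the product of its marginals:

    import math

    def kl(p, q):
        # Relative entropy D(p || q) in bits.
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    def mutual_information(joint):
        # I(X;Y) = D( p(x,y) || p(x)p(y) ) for a joint distribution given as a matrix.
        px = [sum(row) for row in joint]
        py = [sum(col) for col in zip(*joint)]
        return sum(pxy * math.log2(pxy / (px[i] * py[j]))
                   for i, row in enumerate(joint)
                   for j, pxy in enumerate(row) if pxy > 0)

    # Two binary variables that agree 90% of the time share about 0.53 bits.
    print(mutual_information([[0.45, 0.05],
                              [0.05, 0.45]]))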

Philosophy of Science (Occam's Razor). William of Occam said: “Causes shall not be multiplied beyond necessity”, or, paraphrasing him, “The simplest explanation is the best”. Solomonoff, and later Chaitin, argued that one obtains a universally good prediction procedure by taking a weighted combination of all the programs that explain the data and observing what they print next. Moreover, such an inference procedure works on many problems that cannot be solved by statistics. Applied to the stock market, it should find the rules of the market and extrapolate them optimally; in essence, such a procedure would have discovered Newton's laws of physics. Of course, such inference is not practical, because weeding out all the programs that fail to generate the data takes far too much time.

Economics. Repeated investment in a stationary market leads to exponential growth of wealth. The growth rate of wealth is a dual of the entropy rate of the market. There is thus a natural link between the theory of optimal investment in the stock market and information theory, and a theory of investment can be developed to explore this duality.
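
A minimal sketch of this duality (illustrative only): for a horse race with fair uniform odds, the optimal doubling rate achieved by proportional (Kelly) betting equals log2(m) - H(p), so higher market entropy means lower achievable growth:

    import math

    def doubling_rate(b, p, odds):
        # Expected log-growth of wealth: W(b, p) = sum_i p_i * log2(b_i * o_i).
        return sum(pi * math.log2(bi * oi) for pi, bi, oi in zip(p, b, odds))

    p    = [0.5, 0.25, 0.125, 0.125]   # true win probabilities of four horses
    odds = [4, 4, 4, 4]                # fair uniform odds: 4-for-1 on each horse

    # Proportional (Kelly) betting b = p maximizes the doubling rate; with
    # uniform fair odds the optimum is log2(4) - H(p) = 2 - 1.75 = 0.25.
    print(doubling_rate(p, p, odds))           # 0.25
    print(doubling_rate([0.25] * 4, p, odds))  # 0.0: uniform betting yields no growth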

Computation versus communication. As we build bigger and bigger computers out of smaller and smaller components, we reach both a computation limit and a communication limit. Computation is limited by communication, and communication is limited by computation; the two become interdependent, and therefore all developments in communication theory based on information theory should have a direct impact on the theory of computation.

References:

  1. http://finereader.abbyy.com/
  2. Banciu, D., e-Romania – A Citizens’ Gateway towards Public Information, Studies in Informatics and Control, Vol. 18, No. 3, 2009.
  3. Cover, T. M., J. A. Thomas, Elements of Information Theory, Wiley-Interscience, John Wiley & Sons, 1991.
  4. Forman, G., A Pitfall and Solution in Multi-Class Feature Selection for Text Classification, Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004.
  5. Guerra-Salcedo, C., S. Chen, D. Whitley, S. Smith, Fast and Accurate Feature Selection Using Hybrid Genetic Strategies, Proc. of the Congress on Evolutionary Computation (CEC00), July 2000.
  6. http://alice.ulbsibiu.ro:8080/scribe/
  7. Sifaoui, A., A. Abdelkrim, S. Alouane, M. Benrejeb, On New RBF Neural Network Construction Algorithm for Classification, Studies in Informatics and Control, Vol. 18, No. 2, 2009.

https://doi.org/10.24846/v19i2y201009