Wednesday , December 19 2018

Query Optimization on Random Databases

Silviu Laurenţiu VASILE1,2
1 University of Bucharest, Faculty of Mathematics and Computer Science, Academiei 14, RO-010014, Bucharest, Romania
2 Institute of Mathematical Statistics and Applied Mathematics,
Calea 13 Septembrie 13, RO-050711, Bucharest, Romania

vsl@fmi.unibuc.ro

Abstract: We analyze the cardinality of the ?-join operation into a random database for tables having column values following some representative types of probability distributions (uniform distribution, exponential distribution and normal distribution). The goal of our research is to use our analysis to make some query optimization on random databases. Some numerical results are studied and the conclusions we have drawn as a result of the practical experiments are presented.

Keywords: Random database, Join, Cardinality, Query optimization.

>>Full Text
CITE THIS PAPER AS:
Silviu Laurențiu VASILE, Query Optimization on Random Databases, Studies in Informatics and Control, ISSN 1220-1766, vol. 23 (3), pp. 257-264, 2014. https://doi.org/10.24846/v23i3y201404

1. Introduction

The relational data model is one of the main database models and the basis for most existing database management systems. It is based on the first-order predicate logic which was first formulated and proposed in 1969 by Edgar F. Codd [3], [4]. The relational database model is the most popular data model since it is very simple and easily understandable by the information systems professionals and by the end users. In the relational model of a database, all users` data is represented in terms of tuples (records) which are grouped into relations (tables). A database organized in terms of the relational model is called a relational database. The rows of relations (relational matrices) represent records and columns represent the domains or attributes, respectively. Records or tuples can be identified, recorded and searched by sets of attributes, so-called keys, in a unique way. Generally, a key is an attribute (or a combination of several attributes) that uniquely identifies a particular record. A given set of attributes is a minimal key if its proper subsets are not keys. A large variety of algorithms used in database technology have as a goal the identification of tuples through keys. Examples in this direction are algorithms for selection, joining, constructing and maintaining tuples. These algorithms are as simple as search algorithms if key indexes are used. Therefore, keys and minimal keys are absolutely fundamental to database models.

Analysts often need to work with large datasets in which some of the data is uncertain [2]. This is the case when the data is connected to hypothetical or future events. For example, the data of interest might be the customer order sizes for some product under a hypothetical price increase of 5%. Data uncertainty is frequently modeled as a probability distribution over possible data values. These distributions are usually specified by means of a complex stochastic model. Various system characteristics of interest to the analyst can be viewed as answers to queries over the uncertain data values [9], [10]. Because the data is uncertain, there is a probability distribution over the possible results of running a given query, and the analysis of the underlying stochastic model is equivalent to studying the features (mean, variance, and so forth) of the query-result distribution. The query-result distribution is often very complex, and must be analyzed using Monte Carlo methods [5]. Such analysis is important to assessing for example the enterprise risk, as well as making decisions under uncertainty.

Random databases are databases where the value of some attributes or the presence of some records are uncertain and known only with some probability. Applications of random databases can be found in many areas such as information extraction, Radio-frequency identification (RFID) and scientific data management, data cleaning, data integration, and financial risk assessment.

At present Data models and databases for uncertain and/or probabilistic data are a hot topic in data management research.

The traditional relational databases are deterministic. Every record stored in the database is meant to be present with certainty, and fields in that record have a precise value.

The theoretical foundations of relational databases are if a logical sentence is true or false, if a record is, or is not, in a relation, or in a query result [1], but they cannot make any less precise statement. Today, however, data management needs to include new data sources, where data are uncertain [8], and which are difficult or impossible to model with traditional database systems. In this situation we have the following question: what do the traditional relational operators become in the case of random databases?

Definition: Consider finite domains art04_clip_image002, not necessarily disjoint. The Cartesian product art04_clip_image002_0000 of the domains art04_clip_image002_0001 is defined by the set of the tuples art04_clip_image002_0002where art04_clip_image002_0003.

Definition: A relation R on the sets is a subset of the Cartesian product art04_clip_image002_0000.

We can represent a relation by a bi-dimensional table in which each line corresponds to a tuple and each column corresponds to a domain in the Cartesian product. A column corresponds to an attribute. The number of attributes defines the relation’s degree, and the number of tuples in the relation defines the relation’s cardinality. The relational databases are perceived by the users as a set of tables. We consider a table as a representation of a relation.

We consider a distance d(x, y) for the elements in the two sets Di and Dj, where  art04_clip_image002_0005, which are assumed to be subsets of a metric space where the distance d is defined [11]. Many µ-join techniques use as a metric the standard Euclidean distance. We denote byart04_clip_image002_0006the ball centered in x and with radius µ. In this context, the operations in the relational algebra will be based upon approximations in order to deal with uncertainty. The name of the operation we intend to investigate are µ-equi-join.

Definition: Consider two relations R and S. Then, the µ-operation above can be described as follows:  art04_clip_image002_0007, where takes all values of column A that belongs to relation R and take all values of column B that belongs to relation S. We denote by art04_clip_image002_0008the number of lines in the result of the µ-join operation [11]. Whenever the context in which we refer these values are clear, we denote them by for simplicity.

During our studies concerning the random databases, which involved different experiments with data sets and analysis of the results, we have noticed the existence of some relations between the cardinalities of the approximate join (µ-join) operation in case of various probability distributions of the columns of tables.

Our empirical observations encouraged to find the proofs of the relations we conjectured. We were able to state them as theoretical results. The practical experiments dealt with tables containing at least 1000 records, with columns having values distributed uniformly, exponentially and, respectively, normally.

Such relations have a great impact in random database query optimization, as we have shown in [13].

REFERENCES

  1. ABITEBOUL, S., R. HULL, V. VIANU, Foundations of Databases, Addison-Wesley, 1995.
  2. ANTOVA, L., T. JANSEN, C. KOCH, D. OLTEANU, Fast and Simple Relational Processing of Uncertain Data, ICDE, 2008, pp. 983-992.
  3. CODD, E. F., Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks, IBM Research Report, 1969.
  4. CODD, E. F., A Relational Model of Data for Large Shared Data Banks, Communications of the ACM, vol. 13(6), 1970, pp. 377-387.
  5. DAGUM, P., R. M. KARP, M. LUBY, S. M. ROSS, An Optimal Algorithm for Monte Carlo Estimation, SIAM Journal on Computing, vol. 29(5), 2000, pp. 1484-1496.
  6. DALVI, N., D. SUCIU, Efficient Query Evaluation on Probabilistic Databases, The VLDB Journal, vol. 16(4), 2007,     pp. 523-544.
  7. JOHNSON, N., A. KEMP, S. KOTZ, Univariate Discrete Distributions, 3rd edition, John Wiley & Sons, 2005.
  8. KALASHNIKOV, D., Super-EGO: Fast Multi-dimensional Similarity Join, The VLDB Journal, vol. 22(4), 2013, pp. 561-585.
  9. OSORIO, R., S. GARCIA, M. PENA, I. LOPEZ-JUAREZ, G. LEFRANC, Movement and Color Detection of a Dynamic Object: An application to a Mobile Robot, Studies in Informatics and Control, ISSN 1220-1766, vol. 21(1), 2012, pp. 33-40.
  10. LUNGU, R., C. ROTARU, M. LUNGU, New Systems for Identification, Estimation and Adaptive Control of the Aircrafts Movement, Studies in Informatics and Control, ISSN 1220-1766, vol. 20(3), 2011, pp. 273-284.
  11. SELEZNJEV, O., B. THALHEIM, Random Databases with Approximate Record Matching, Methodology and Computing in Applied Probability, Springer Verlag, vol. 12(1), 2008, pp. 63-89.
  12. TALAGRAND, M., The Glivenko-Cantelli Problem, Annals of Probability, vol. 15(3), 1987, pp. 837-870.
  13. VASILE, L., L. VELCESCU, Inequalities between the Relational Result Sets Cardinalities in Random Databases, Proc. of SACI, IEEE, 2012, pp. 209-212.
  14. WHITMORE, G. A., M. CHAPMAN FINDLAY, Stochastic Dominance: An Approach to Decision-making under Risk, Lexington Books, 1978.