Numerical Representations Involved in DNA Repeats Detection Using Spectral Analysis

Petre G. POP
Technical University of Cluj-Napoca, Comm. Dept.
G. Baritiu str., 26-28, Cluj-Napoca, 400027, Romania

Alin VOINA
Technical University of Cluj-Napoca, Comm. Dept.
G. Baritiu str., 26-28, Cluj-Napoca, 400027, Romania

Abstract: Sequence repeats are the simplest form of regularity, and the detection of repeats is important in biology and medicine as it can be used for phylogenetic studies and disease diagnosis. A major difficulty in identifying repeats is that the repeat units can be of unknown length and either exact or imperfect, in tandem or dispersed. Many of the methods for detecting repeated sequences belong to the digital signal processing (DSP) field. These methods involve a transformation whose main goal is to map the symbolic domain into the numeric domain without adding structure information to the symbolic sequence beyond that inherent to it. Therefore, the numerical representation of genomic signals is very important. This paper presents the results obtained by using different numerical representations (including two novel ones) and spectral analysis to isolate the position and length of DNA repeats in short sequences containing microsatellites and in long sequences with alpha DNA repeats.

Keywords: Genomic signal processing, sequence repeats, DNA representations, Fourier analysis, spectrograms.

CITE THIS PAPER AS:
Petre G. POP, Alin VOINA, Numerical Representations Involved in DNA Repeats Detection Using Spectral Analysis, Studies in Informatics and Control, ISSN 1220-1766, vol. 20 (2), pp. 163-180, 2011. https://doi.org/10.24846/v20i2y201109

1. Introduction

Over the past few decades, major progress in the field of molecular biology, combined with advances in genomic technologies, has led to an explosive growth in the biological information generated by scientists. There are databases which contain hundreds of billions of bases and sequence records. Therefore, computers have become an indispensable tool for biological research as they provide the means for storing large quantities of data and revealing the relationships between them.

A surprising genetic difference among species is the size of their genomes. Relatively simple organisms may have much larger genomes than complex organisms. These major differences might be due to the presence of repeats. In general, for eukaryotes duplicated genetic material is abundant and can represent up to 60% of the genome. Although some of the mechanisms that generate these repeats are known, from an evolutionary point of view the reasons for such redundancy remain unknown [1]. The presence of repeated sequences is a fundamental feature of all genomes.

A repeat is the simplest form of regularity and analyzing repeats can lead to first clues to discovering new biological phenomena. Tandem repeats are two or more contiguous, approximate copies of a pattern of nucleotides. Tandem duplication occurs as a result of mutational events in which an original segment of DNA (the pattern) is converted into a sequence of individual copies.

The centromere of most complex eukaryotic chromosomes is a specialized locus made up of repetitive DNA which is responsible for chromosome segregation at mitosis and meiosis.

A major challenge in genomic signal processing is to understand the information contained in the biological genomes. Almost all DSP techniques require two parts: mapping the symbolic data (symbols for nucleotides) into a numeric form in a non-arbitrary manner and calculating a kind of transform of that numeric sequence. Consequently, the numerical representation of genomic signals becomes very important.
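The first of these two parts, the symbolic-to-numeric mapping, can be illustrated with binary indicator sequences, a widely used representation in this field. The sketch below is only an illustration; the function name `indicator_sequences` is an assumption introduced here and is not taken from the paper.

```python
def indicator_sequences(dna):
    """Map a DNA string to four binary indicator sequences.

    For each nucleotide X in A, C, G, T the sequence holds 1 at the
    positions where X occurs and 0 elsewhere.
    """
    dna = dna.upper()
    return {base: [1 if symbol == base else 0 for symbol in dna]
            for base in "ACGT"}


if __name__ == "__main__":
    for base, values in indicator_sequences("ACACAC").items():
        print(base, values)
```

Each of the four sequences marks only the presence or absence of one nucleotide, so the mapping imposes no ordering or distance among the symbols.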

Fourier spectral analysis is used to reveal periodicity in symbolic sequences because it is rather robust in the presence of substitutions, insertions and deletions and may identify approximate periodicities in DNA sequences.
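As a rough illustration of this robustness, the sketch below sums the power spectra of the four binary indicator sequences and reads the dominant period from the strongest non-zero frequency. Summing the four spectra, the naive O(N^2) transform, and the names `power_spectrum` and `dominant_period` are assumptions made for this example, not details taken from the paper.

```python
import cmath


def power_spectrum(dna):
    """Sum of the DFT power spectra of the four indicator sequences."""
    dna = dna.upper()
    n = len(dna)
    spectrum = [0.0] * n
    for base in "ACGT":
        u = [1.0 if symbol == base else 0.0 for symbol in dna]
        for k in range(n):
            # Naive DFT coefficient, kept explicit for readability.
            coeff = sum(u[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                        for t in range(n))
            spectrum[k] += abs(coeff) ** 2
    return spectrum


def dominant_period(dna):
    """Period N/k of the strongest frequency k in 1..N/2 (k = 0 is skipped)."""
    n = len(dna)
    spectrum = power_spectrum(dna)
    k = max(range(1, n // 2 + 1), key=lambda i: spectrum[i])
    return n / k


if __name__ == "__main__":
    exact = "ACG" * 8
    mutated = exact[:9] + "T" + exact[10:]   # one substitution
    print(dominant_period(exact), dominant_period(mutated))
```

For a repeat of period p in a sequence of length N the peak appears at frequency index N/p; a single substitution lowers the peak but does not move it.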

This paper presents results obtained using different numerical representations (including two new ones) and spectral analysis to isolate the position and length of DNA repeats in short sequences containing microsatellites and in long sequences with alpha DNA repeats.

Most of the numerical representations used for repeat detection assign a numerical value to each position in the sequence, using values associated with each nucleotide, and ultimately reflect the presence or absence of a certain nucleotide at a specific position. In order to include information about the number of consecutive nucleotides and to generate only one numerical sequence for each DNA subsequence which may be associated with a repeat [9, 10], we introduced two novel representations. First, to emphasize subsequences with consecutive repeats of the same nucleotide, we used a modified form of indicator sequences which includes the repeating factor. Second, we proposed a novel sequence representation, based on a polynomial-like form, and a mapping algorithm which takes into account the length of the expected repeats and the number of possible mismatches due to point mutations.
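The exact definition of the modified indicator sequences is not given in this section. The sketch below shows one possible reading, stated purely as an assumption: each non-zero entry of a binary indicator sequence is replaced by the length of the run of identical nucleotides it belongs to. The function name `run_length_indicator` is hypothetical and the code should not be taken as the authors' algorithm.

```python
from itertools import groupby


def run_length_indicator(dna, base):
    """Indicator sequence for `base`, weighted by run length (assumed form).

    Positions holding `base` receive the length of the run of consecutive
    `base` symbols they belong to; all other positions receive 0.
    """
    values = []
    for symbol, run in groupby(dna.upper()):
        length = len(list(run))
        weight = length if symbol == base else 0
        values.extend([weight] * length)
    return values


if __name__ == "__main__":
    print(run_length_indicator("AAACAAG", "A"))
```

Under this reading, a homopolymer stretch such as AAA contributes larger values than an isolated A, which is one way the number of consecutive nucleotides could enter the numerical sequence.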

Grey-level spectrograms were used to validate numerical representations because they provide an overview of the informational content of the analyzed sequence and allow a fast and easy determination of the presence of repeated sequences. In addition, spectrograms do not require the length, the pattern or the number of mismatches of the target repeats to be specified. Thus, the spectrogram can be used for a qualitative assessment of a numerical representation. The main focus was on numerical representations and on qualitative differences that occur in spectrograms, not on spectral analysis itself. Our goal was not to compare the different ways of identifying repeated sequences but to compare the different numeric representations using one of the frequently used methods.
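A spectrogram of this kind can be sketched as a sliding-window Fourier transform whose power values are scaled to grey levels. The window length, the step, the use of binary indicator sequences, and the names `window_power`, `spectrogram` and `to_grey_levels` are all assumptions chosen for this example; the paper's own parameters are not stated in this section.

```python
import cmath


def window_power(window):
    """Summed indicator-sequence power at frequencies 0..W/2 for one window."""
    w = len(window)
    power = [0.0] * (w // 2 + 1)
    for base in "ACGT":
        u = [1.0 if symbol == base else 0.0 for symbol in window]
        for k in range(len(power)):
            coeff = sum(u[t] * cmath.exp(-2j * cmath.pi * k * t / w)
                        for t in range(w))
            power[k] += abs(coeff) ** 2
    return power


def spectrogram(dna, window=12, step=1):
    """One row of spectral power per window position along the sequence."""
    dna = dna.upper()
    return [window_power(dna[i:i + window])
            for i in range(0, len(dna) - window + 1, step)]


def to_grey_levels(spec, levels=256):
    """Scale power to integer grey levels; the k = 0 column is dropped."""
    peak = max(max(row[1:]) for row in spec) or 1.0
    return [[int((levels - 1) * value / peak) for value in row[1:]]
            for row in spec]


if __name__ == "__main__":
    dna = "ATGCGTTAGC" + "CAG" * 6 + "TTGACCGTAA"
    for row in to_grey_levels(spectrogram(dna, window=12)):
        print(" ".join(f"{g:3d}" for g in row))
```

In the printed matrix, rows are window positions and columns are frequencies 1..W/2. Windows that fall inside the CAG repeat show a bright entry at frequency index W/3, without the period, pattern or number of mismatches having been supplied in advance.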

Interests in DNA Repeats

Nucleotide sequences contain patterns or motifs that have been preserved throughout evolution because of their importance to the structure or function of the DNA molecule. Nucleotide sequences outside the coding regions generally tend to be less conserved among organisms, except where they have a functional importance, such as involvement in the regulation of gene expression. Motif discovery in protein and nucleotide sequences can lead to the determination of function and to the elucidation of the evolutionary relationships among sequences.

The interest in detecting tandem repeats can be summarized as follows [2]:
Theoretical interest: regarding their role in the structure and evolution of the genome.
Technical interest: they can be used as polymorphic markers, either to trace the propagation of genetic traits in populations or as genetic identifiers in forensic studies.
Medical interest: the appearance of specific tandem repeats has been linked to a number of different severe diseases (e.g. Huntington’s disease). In healthy individuals, the repeat size varies around a few tens of copies, while in affected individuals the number of copies at the same locus reaches at least hundreds.

Definitions

Nucleotide and protein sequences are represented by character strings, in which each element is one out of a finite number of possible symbols of an “alphabet.” In the case of DNA sequences, the alphabet has four symbols and consists of the letters A, T, C and G, corresponding to Adenine, Thymine, Cytosine and Guanine nucleotides.

A perfect (exact) repeat is a string that can be represented as a smaller string repeated contiguously twice or more. For example, ACACAC is a repeat, as it can be represented as the string AC repeated three times. The length of the repeated pattern is called the period (2 in the case of ACACAC), and the number of pattern copies is called the exponent (3 for ACACAC). If the exponent is 2 or more, the repeat is usually called a tandem repeat (TR). Repeats whose copies are distant in the genome, whether or not located on the same chromosome, are called distant or dispersed repeats. Among tandem repeats, biologists distinguish microsatellites, minisatellites and satellites, according to the length of their repeated unit.
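The definitions of period and exponent can be checked with a few lines of code. This is a minimal sketch; the function names `smallest_period` and `describe_repeat` are introduced here for illustration only.

```python
def smallest_period(s):
    """Length of the shortest pattern whose contiguous copies give s exactly."""
    if not s:
        raise ValueError("empty sequence")
    n = len(s)
    for p in range(1, n + 1):
        if n % p == 0 and s[:p] * (n // p) == s:
            return p


def describe_repeat(s):
    """Return (pattern, period, exponent) for a string."""
    p = smallest_period(s)
    return s[:p], p, len(s) // p


if __name__ == "__main__":
    print(describe_repeat("ACACAC"))
    print(describe_repeat("ACGT"))
```

For ACACAC the result is the pattern AC with period 2 and exponent 3, matching the example in the text. A string with exponent 1, such as ACGT, is not a repeat under this definition.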

However, perfect tandem repeats are of limited biological interest, since different biological events will often render the copies imperfect [3]. The result is an approximate tandem repeat (ATR), defined as a string of nucleotides repeated consecutively at least twice with small differences between the instances. Some of the algorithmic approaches for finding ATRs are limited by constraints on the input data, the search parameters, the type of allowed mutations and the number of such mutations. For other approaches, time requirements render the algorithm infeasible for the analysis of whole genomes containing millions of base pairs (bp).
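One simple way to make "small differences between the instances" concrete is to bound the Hamming distance between consecutive copies. The sketch below assumes that only substitutions occur (no insertions or deletions) and that the period is known in advance; both are simplifying assumptions for this example, and the names `hamming` and `is_approximate_tandem_repeat` are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_approximate_tandem_repeat(s, period, max_mismatches):
    """True if consecutive copies of length `period` differ by at most
    `max_mismatches` substitutions. A trailing partial copy is ignored."""
    copies = [s[i:i + period] for i in range(0, len(s) - period + 1, period)]
    if len(copies) < 2:
        return False
    return all(hamming(a, b) <= max_mismatches
               for a, b in zip(copies, copies[1:]))


if __name__ == "__main__":
    print(is_approximate_tandem_repeat("ACGTACGAACGT", 4, 1))
    print(is_approximate_tandem_repeat("ACGTACGAACGT", 4, 0))
```

The need to fix the period and the mismatch bound beforehand is an instance of the search-parameter constraints mentioned above.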

The centromere of most complex eukaryotic chromosomes is a specialized locus comprised of repetitive DNA that is responsible for chromosome segregation during mitosis and meiosis. Alpha satellite DNA has been identified at every human centromere. There are two major types of alpha satellite: higher-order and monomeric [4]. Higher-order alpha satellite is the predominant type in the genome (megabase quantities at each centromere) and is made up of ~171 bp monomers organized in arrays of multimeric repeat units that are highly homogeneous. Monomeric alpha satellite lies at the edges of higher-order arrays and lacks any higher-order periodicity; its monomers are on average only ~70% identical to each other [4].

REFERENCES

1. KRISHNAN, A., F. TANG, Exhaustive Whole-Genome Tandem Repeats Search, Bioinformatics Advance Access, vol. 20(16), 2004, pp. 2702-2710.
2. RIVALS, E., A Survey on Algorithmic Aspects of Tandem Repeats Evolution, Intl. J. Foundations of Computer Science, vol. 15, 2004, pp. 225-257.
3. WEXLER, Y., Z. YAKHINI, Y. KASHI, D. GEIGER, Finding Approximate Tandem Repeats in Genomic Sequences, RECOMB’04, March 27-31, 2004, San Diego, California, USA.
4. RUDD, M. K., H. F. WILLARD, Analysis of the Centromeric Regions of the Human Genome Assembly, TRENDS in Genetics, vol. 20(11), 2004, pp. 529-533.
5. COWARD, E., Equivalence of Two Fourier Methods for Biological Sequences, Journal of Mathematical Biology, vol. 36, 1997, pp. 64-70.
6. AFREIXO, V., P. J. S. G. FEREIRA, D. SANTOS, Fourier Analysis of Symbolic Data: A Brief Review, Digital Signal Processing, vol. 14, 2004, pp. 523-530.
7. ANASTASSIOU, D., Genomic Signal Processing, IEEE Signal Processing Magazine, vol. 18(4), pp. 8-20.
8. CHAKRAVARTHY, K. et al., Autoregressive Modelling and Feature Analysis of DNA Sequences, EURASIP Journal on Applied Signal Processing, vol. 1, 2004, pp. 13-28.
9. POP, G. P., E. LUPU, DNA Repeats Detection using BW Spectrograms, IEEE-TTTC Intl. Conf. on Automation, Quality and Testing, Robotics, AQTR 2008, May 22-25, 2008, Romania, Tome III, pp. 408-412.
10. POP, G. P., Spectral Representations of Alpha Satellite DNA, WSEAS Trans. Information Science and Applications, vol. 5(6), 2009, pp. 819-828.
11. PAAR, V., N. PAVIN, I. BASAR, M. ROSANDIC, M. GLUNCIC, N. PAAR, Hierarchical Structure of Cascade of Primary and Secondary Periodicities in Fourier Power Spectrum of Alphoid Higher Order Repeats, BMC Bioinformatics, vol. 9(1), Nov. 3, 2008, p. 466.
12. ACHUTHSANKAR, S. N., P. S. SIVARAMA, A Coding Measure Scheme Employing Electron-Ion Interaction Pseudopotential (EIIP), Bioinformation, vol. 1(6), 2006, pp. 197-202.
13. HAMMING, R. W., Error Detecting and Error Correcting Codes, Bell System Technical Journal, vol. 29(2), 1950, pp. 147-160.
14. SHARMA, D., B. ISSAC, G. P. S. RAGHVA, R. RAMASWAMY, Spectral Repeat Finder (SRF): Identification of Repetitive Sequences using Fourier Transformation, Bioinformatics, vol. 20(9), 2004, pp. 1405-1411.
15. DODIN, G., P. VANDERGHEYNST, P. LEVOIR, C. CORDIER, L. MARCOURT, Fourier and Wavelet Transform Analysis, A Tool for Visualizing Regular Patterns in DNA Sequences, Journal of Theoretical Biology, vol. 206, 2000, pp. 323-326.
16. SUSILLO, A., A. KUNDAJE, D. ANASTASSIOU, Spectrogram Analysis of Genomes, EURASIP Journal on Applied Signal Processing, vol. 1, 2004, pp. 29-42.
17. TIWARI, S., S. RAMACHANDRAN, A. BHATTACHARYA, S. BHATTACHARYA, R. RAMASWAMY, Prediction of Probable Genes by Fourier Analysis of Genomic Sequences, Computer Applications in the Bioscience, vol. 13(3), 1997, pp. 263-270.
18. VOSS, R., Evolution of Long-Range Fractal Correlations and 1/f Noise in DNA Base Sequences, Physical Review Letters, vol. 68, 1992, pp. 3805-3808.
19. HERZEL, H., O. WEISS, E. N. TRIFONOV, 10-11 bp Periodicities in Complete Genomes Reflect Protein Structure and Protein Folding, Bioinformatics, vol. 15, 1999, pp. 187-193.
20. TRAN, T. T., V. A. EMANUELE II, G. T. ZHOU, Techniques for Detecting Approximate Tandem Repeats in DNA, Proceedings of the International Conference for Acoustics, Speech, and Signal Processing (ICASSP), Montreal, Canada, May 17-21, 2004, vol. 5, pp. 449-452.
21. EMANUELE II, V. A., T. T. TRAN, G. T. ZHOU, A Fourier Product Method for Detecting Approximate Tandem Repeats in DNA, IEEE Workshop on Statistical Signal Processing, Bordeaux, July 17-20, 2005, pp. 1390-1395.
22. VAIDYANATHAN, P. P., B.-J. YOON, The Role of Signal-Processing Concepts in Genomics and Proteonomics, J. Franklin Institute (Special Issue on Genomics), 2004, pp. 1-27.