Exploring the Combinatorics of Motif Alignments Foraccurately Computing E-values from P-values
In biological and biomedical research motif finding tools are important in locating regulatory elements in DNA sequences. There are many such motif finding tools available, which often yield position weight matrices and significance indicators. These indicators, p-values and E-values, describe the likelihood that a motif alignment is generated by the background process, and the expected number of occurrences of the motif in the data set, respectively. The various tools often estimate these indicators differently, making them not directly comparable. One approach for comparing motifs from different tools, is computing the E-value as the product of the p-value and the number of possible alignments in the data set. In this paper we explore the combinatorics of the motif alignment models OOPS, ZOOPS, and ANR, and propose a generic algorithm for computing the number of possible combinations accurately. We also show that using the wrong alignment model can give E-values that significantly diverge from their true values.
 M. Tompa, N. Li, T. L. Bailey, G. M. Church, B. De Moor, E. Eskin,
A. V. Favorov, M. C. Frith, Y. Fu, W. J. Kent, V. J. Makeev, A. A.
Mironov, W. S. Noble, G. Pavesi, G. Pesole, M. Regnier, N. Simonis,
S. Sinha, G. Thijs, J. van Helden, M. Vandenbogaert, Z. Weng,
C. Workman, C. Ye, and Z. Zhu, "Assessing computational tools for the
discovery of transcription factor binding sites." Nat Biotechnol, vol. 23,
no. 1, pp. 137-144, Jan 2005.
 T. A. Down and T. J. P. Hubbard, "Nestedmica: sensitive inference of
over-represented motifs in nucleic acid sequence." Nucleic Acids Res,
vol. 33, no. 5, pp. 1445-1453, 2005.
 T. Bailey and C. Elkan, "Unsupervised learning of multiple motifs
in biopolymers using expectation maximization," Machine Learning,
vol. 21, no. 1-2, pp. 51-80, Oct-Nov 1995.
 G. Pavesi, G. Mauri, and G. Pesole, "An algorithm for finding signals
of unknown length in dna sequences." Bioinformatics, vol. 17 Suppl 1,
pp. S207-14, 2001.
 G. Z. Hertz and G. D. Stormo, "Identifying dna and protein patterns
with statistically significant alignments of multiple sequences." Bioinformatics,
vol. 15, no. 7-8, pp. 563-577, Jul-Aug 1999.
 C. Pizzia and E. Ukkonen, "Fast profile matching algorithms - a survey,"
Theoretical Computer Science, vol. 395, no. 2-3, pp. 137-157, MAY 1
 N. Nagarajan, P. Ng, and U. Keich, "Refining motif finders with e-value
calculations," Regulatory Genomics, p. 73, 2006.
 G. Andrews, The theory of partitions. Cambridge University Press,
 G. Andrews and K. Eriksson, Integer partitions. Cambridge University
 A. Zoghbi and I. Stojmenovic, "Fast algorithms for generating integer
partitions," International Journal of Computer Mathematics, vol. 70,
no. 2, pp. 319-332, 1998.