Constraint Based Frequent Pattern Mining Technique for Solving GCS Problem

Generalized Center String (GCS) problem are generalized from Common Approximate Substring problem and Common substring problems. GCS are known to be NP-hard allowing the problems lies in the explosion of potential candidates. Finding longest center string without concerning the sequence that may not contain any motifs is not known in advance in any particular biological gene process. GCS solved by frequent pattern-mining techniques and known to be fixed parameter tractable based on the fixed input sequence length and symbol set size. Efficient method known as Bpriori algorithms can solve GCS with reasonable time/space complexities. Bpriori 2 and Bpriori 3-2 algorithm are been proposed of any length and any positions of all their instances in input sequences. In this paper, we reduced the time/space complexity of Bpriori algorithm by Constrained Based Frequent Pattern mining (CBFP) technique which integrates the idea of Constraint Based Mining and FP-tree mining. CBFP mining technique solves the GCS problem works for all center string of any length, but also for the positions of all their mutated copies of input sequence. CBFP mining technique construct TRIE like with FP tree to represent the mutated copies of center string of any length, along with constraints to restraint growth of the consensus tree. The complexity analysis for Constrained Based FP mining technique and Bpriori algorithm is done based on the worst case and average case approach. Algorithm's correctness compared with the Bpriori algorithm using artificial data is shown.




References:
[1] C.N. Meneses, Z. Lu, A.S. Oliveria, and P.M. Pardolos, "Optimal
Solutions for the Closest String Problem via Integer Programming,"
INFORMS J. Computing, vol. 16, no. 4, pp. 419-429, 2004.
[2] E. Eskin and P. Pevzner, "Finding Composite Regulatory Patterns
in DNA Sequences," Bioinformatics, pp. 354-363, 2002.
[3] F.Y.L. Chin and H.C.M. Leung, "Voting Algorithms for
Discovering Long Motifs," Proc. Third Asia-Pacific Bioinformatics
Conf. (APBC -05), pp. 261-271, 2005.
[4] G. Grahne, L. Lakshmanan, and X. Wang, "Efficient mining of
constrained correlated sets," In Proc. 2000 Int. Conf. Data
Engineering (ICDE-00), San Diego, CA, pp. 512-521, 2000.
[5] G. Pavesi, G. Mauri, and G. Pesole, "An Algorithm for Finding
Signals of Unknown Length in DNA Sequences," Bioinformatics,
vol. 17, pp. 207-214, 2001.
[6] G. Pavesi, G. Mauri, and G. Pesole, "In Silico Representation and
Discovery of Transcription Factor Binding Sites," Brief in
Bioinformatics, vol. 5, no. 3, pp. 217-36, 2004.
[7] G.D. Stormo, "DNA Binding Sites: Representation and Discovery,"
Bionformatics, vol. 16, no. 1, pp. 16-23, 2004.
[8] H. Mannila, H. Toivonen, "Levelwise search and borders of theories
in knowledge discovery," Data Mining and Knowledge Discovery
1, pp 241-258, 1997.
[9] H.C.M. Leung and F.Y.L. Chin, "Algorithms for Challenging Motif
Problems," J. Bioinformatics and Computational Biology, vol. 4,
no. 1, pp. 43-58, 2006.
[10] J. Buhler and M. Tompa, "Finding Motifs Using Random
Projections," Proc. Fifth Ann. Int-l Conf. Computational Molecular
Biology (RECOMB), pp. 69-76, 2001.
[11] J. Davila, S. Balla, and S. Rajasekaran, "Space and Time Efficient
Algorithms for Planted Motif Search," Proc. Int-l Workshop
Bioinformatics Research and Applications (IWBRA -06), May
2006.
[12] J. Gramm, R. Niedermeier, and P. Rossmanith, "Exact Solutions for
CLOSEST STRING and Related Problems," Proc. 12th Int-l Symp.
Algorithms and Computation, pp. 441-453, 2001.
[13] J. Guan, D.Y. Liu, and D.A. Bell, "Discovering Motifs in DNA
Sequences," Fundamenta Informaticae, vol. 59, pp. 119-132, 2004.
[14] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without
candidate generation," In Proc. 2000 ACMSIGMOD, Int. Conf.
Management of Data (SIGMOD-00), Dallas, TX, pp. 1-12, 2000.
[15] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without
Candidate Generation," Proc. ACM SIGMOD, pp. 1-12, 2000.
[16] J. Han, L.V.S. Lakshmanan, T.N.Raymond, "Constraint-Based
Multidimensional Data Mining," IEEE Computer, 32(8), 46-
50,1999.
[17] J. Pei, J. Han, and L.V.S. Lakshmanan, "Mining frequent itemsets
with convertible constraints," In Proc. 2001 Int. Conf. Data
Engineering (ICDE-01), Heidelberg, Germany, pp. 433-332, 2001.
[18] J. Pei, J. Han, and R. Mao, "CLOSET: An efficient algorithm for
mining frequent closed itemsets," In Proc. 2000 ACM-SIGMOD
Int. Workshop Data Mining and Knowledge Discovery
(DMKD-00), Dallas, TX, pp. 11-20, 2000.
[19] K. Lanctot, M. Li, B. Ma, and L. Zhang, "Distinguishing String
Selection Problems," Information and Computation, vol. 185, no. 1,
pp. 44-51, 2003.
[20] L. De Raedt, S. Kramer, "The levelwise version space algorithm
and its application to molecular fragment finding," In. IJCA101: 7th
Int. Joint Conf. Artificial Intelligence, 2001.
[21] L. Marsan and M.F. Sagot, "Algorithms for Extracting Structured
Motifs Using a Suffix Tree with Application to Promoter and
Regulatory Site Consensus Identification," J. Computational
Biology, vol. 7, pp. 345-360, 2000.
[22] M. Ester and X. Zhang, "A Top-Down Method for Mining Most
Specific Frequent Patterns in Biological Sequence Data," Proc.
SIAM Int-l Conf. on Data Mining (SDM -04), Apr. 2004.
[23] M. Frances and A. Litman, "On Covering Problems of Codes,"
Theoretical Computer System, vol. 30, pp. 113-119, 1997.
[24] M. Li, B. Ma, and L. Wang, "Finding Similar Regions in Many
Strings," Proc. 31st Ann. ACM Symp. Theory of Computing
(STOC -99), pp. 473-482, 1999.
[25] M. Li, B. Ma, and L. Wang, "On the Closest String and Substring
Problems," J. ACM, vol. 49, no. 2, pp. 151-171, 2002.
[26] M. Tompa et al., "Assessing Computational Tools for the Discovery
of Transcription Factor Binding Sites," Nature Biotechnology, vol.
23, no. 1, pp. 137-144, 2005.
[27] M. Tompa, "An Exact Method for Finding Short Motifs in
Sequences, with Application to the Ribosome Binding Site
Problem," Proc. Seventh Int-l Conf. Intelligent Systems for
Molecular Biology, pp. 262-271, 1999.
[28] M.F. Sagot, "Spelling Approximate Repeated or Common Motifs
Using a Suffix Tree," Proc. Third Latin Am. Symp. Theoretical
Informatics, pp. 111-127, 1998.
[29] M.J. Zaki, and C.J. Hsiao, "CHARM: An efficient algorithm for
closed itemset mining," In Proc. 2002 SIAMInt. Conf. Data Mining,
Arlington, VA, pp. 457-473, 2002.
[30] M.N. Garofalakis, R. Rastogi, and K. Shim, "SPIRIT: Sequential
pattern mining with regular expression constraints," In Proc. 25th Int.
Conf. Very Large Data Bases (VLDB '99), San Francisco, Morgan
Kaufmann, pp 223-234, 1999.
[31] M.S. Waterman, R. Arratia, and D.J. Galas, "Pattern Recognition in
Several Sequences: Consensus and Alignment," Bull. Math. Biology,
vol. 46, pp. 515-527, 1984
[32] P. Pevzner and S. Sze, "Combinatorial Approaches to Finding Subtle
Signals in DNA Sequences," Proc. Int-l Conf. Intelligent Systems for
Molecular Biology, pp. 269-278, 2000.
[33] P.A. Evans and A.D. Smith, "Complexity of Approximating Closest
Substring Problems," Fundamentals of Computation Theory, pp.
210-221, 2003.
[34] P.A. Evans and A.D. Smith, "Toward Optimal Motif Enumeration,"
Proc. Algorithms and Data Structures, Eighth Int-l Workshop
(WADS -03), pp. 47-58, 2003.
[35] P.A. Evans and H.T. Wareham, "Practical Algorithms for Universal
DNA Primer Design: An Exercise in Algorithm Engineering,"
Currents in Computational Molecular Biology 2001, N. El-Mabrouk,
T. Lengauer, and D. Sankoff, eds., pp. 25-26, Les Publications CRM,
2001.
[36] P.A. Evans, A.D. Smith, and H.T. Wareham, "On the Complexity of
Finding Common Approximate Substrings," Theoretical Computer
Science, vol. 306, no. 1-3, pp. 407-430, 2003.
[37] R. Agarwal, C. Aggarwal, and V.V.V. Prasad, "A tree projection
algorithm for generation of frequentitemsets," Journal of Parallel and
Distributed Computing, 61:350-371, 2001.
[38] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association
Rules," Proc. 20th Int-l Conf. Very Large Data Bases (VLDB -94),
pp. 487-499, 1994.
[39] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and I. Verkamo,
"Fast Discovery of the Association Rules," Advances in Knowledge
Discovery and Data Mining, pp. 307-328, AAAI Press, 1996.
[40] R. Ng et al., "Exploratory Mining and Pruning Optimizations of
Constrained Associations Rules," Proc. 1998 ACM SIGMOD Int'l
Conf. Management of Data, ACM Press, New York, 1998, pp. 13-
24.
[41] R. Srikant, and R. Agrawal, "Mining sequential patterns:
Generalizations and performance improvements," In Proc. 5th Int.
Conf. Extending Database Technology (EDBT-96), Avignon,
France, pp. 3-17, 1996.
[42] R. Srikant, Q. Vu, and R. Agrawal, "Mining association rules with
item constraints," In Proc. 1997 Int. Conf. Knowledge Discovery and
Data Mining (KDD-97), Newport Beach, CA, pp. 67-73, 1997.
[43] Ruqian Lu, Caiyan Jia, Shaofang Zhang, Lusheng Chen, Hongyu
Zhang, "An Exact Data Mining Method for Finding Center Strings
and All Their Instances," IEEE Trans. Knowledge and Data
Engineering," 19(4), 509-522, 2007.
[44] S. Rajasekaran, S. Balla, and C.-H. Huang, "Exact Algorithms for
the Planted Motif Problem," J. Computational Biology, vol. 12, no.
8, pp. 1117-1128, 2005.
[45] S. Sinha and M. Tompa, "Discovery of Novel Transcription Factor
Binding Sites by Statistical Overrepresentation," Nucleic Acids
Research, vol. 30, no. 24, pp. 5549-5560, 2002.
[46] S. Sinha and M. Tompa, "YMF: A Program or Discovery of Novel
Transcription Factor Binding Sites by Statistical
Overrepresentation," Nucleic Acids Research, vol. 31, pp. 3586-
3588, 2003.
[47] S.T. Jensen, X.S. Liu, Q. Zhou, and J.S. Liu, "Computational
Discovery of Gene Regulatory Binding Motifs: A Bayesian
Perspective," Statistical Science, vol. 19, pp. 188-204, 2004.
[48] Sau Dan Lee, Luc De Raedt, "Constraint Based Mining of First
Order Sequences in SeqLog," Database Support for Data Mining
Applications 154-173, 2004.