Incremental Learning Probabilistic State Machine for Symbolic Sequences Classification
Main Article Content
Abstract
Symbolic sequence classification can be used in a variety of applications such as DNA sequences analysis, intrusion detection, electrocardiography (ECG) analysis. The convention methods can be applied to this field for example, probabilistic language model, support vector machine, and artificial neural network. However, learning unknown words in very long sequences such as DNA sequences need fix size of imitation words. Consequently, the probability and position of original words are distorted which can lead to incorrect results in long term. Moreover, learning update process may cause confliction with the former fix size knowledge. In this work, to optimise the probability and the position of original words as possible while the direction of data is still tractable, we propose a novel probabilistic language model using Markov assumption with variable memory length considered by unique and repeated substring. We propose an on-line algorithm to build the model with time and space complexity . The experiment of classification applied to DNA sequences of promoter and non-promoter bacteria E. Coli. The error of classification is only 3.77 % which is satisfied compared to the other methods.
Article Details
Ramkhamhaeng University
References
Cavnar, W. B. and Trenkle, J. M. 1994. N-Gram-Based Text Categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval , 161–175.
Dua, D. and Graff, C. 2019 UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Dietterich, T. G. 2002. Machine Learning for Sequential Data: A Review. Proceedings of the Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition , 15–30.
Galata, A. Johnson, N. and Hogg, D. 2001. Learning Variable Length Markov Models of Behaviour. Computer Vision and Image Understanding. 81(3): 398-413
Geppert, E. and Hammer, B. 2016. Incremental learning algorithms and applications. ESANN , 357-368
Ilie, L., Smyth, W. F. 2011. Minimum Unique Substrings and Maximum Repeats. Fundam. Inf, 110 (1-4),
183 - 195.
Kermorvant, C., Dupont, P. 2002. Improved Smoothing for Probabilistic Suffix Trees Seen As Variable Order Markov Chains. Proceedings of the 13th European Conference on Machine Learning , 185--194.
Khreich, W., Granger, E., Miri, A., and Sabourin, R. 2012. A Survey of Techniques for Incremental Learning of HMM Parameters. Information Sciences, 197 , 105–130.
Leslie, C., Eskin, E., and Noble, W. S. 2002. The spectrum kernel: a string kernel for SVM protein classification. PAC Symp Biocomput, , 564-575.
Lin, J., Keogh, E., Lonardi, S., and Chiu, B. 2003. A Symbolic Representation of Time Series, with Implications for Streaming Algorithms. Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery , 2--11
Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., and Watkins, C. 2002. Text classification using string kernels. Journal of Machine Learnining Research, 2 , 419-444.
Mingers, J. 1989. An Empirical Comparison of Pruning Methods for Decision Tree Induction, Machine Learning 4 227–243.
O'Neill, Michael. 1989. Escherichia coli promoters. I. Consensus as it relates to spacing class, specificity, repeat substructure, and three-dimensional organization. The Journal of biological chemistry. 264. 5522-30.
Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77 (2), 257-286.
Rayan Chikhi, and Paul Medvedev. 2014. Informed and automated k-mer size selection for genome assembly, Bioinformatics, 30(1), 31-37
Ron, D., Singer, Y., and Tishby, N. 1996. The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length. Machine learning , 117–149.
Rumelhart, D. E., G. Hinton, G. E., and Williams, R. J. 1986 Learning Internal Represen- tations by Error Propagation, in: Parallel Distributed Processing: Explorations in the microstructure of cognition. Volume 1: Foundations, D. E. Rumelhart and J. L. McClelland (Eds.), 318–363.
Towell, G., Shavlik, J., and Noordewier, M. 1990. Refinement of Approximate Domain Theories by Knowledge-Based Neural Networks. In Proceedings of the Eighth National Conference on Artificial Intelligence , 861-866.
Vidal, E., Thollard, F., de la Higuera, C., Casacuberta, F., and Carrasco, R. C. 2005. Probabilistic Finite-State Machines-Part I. IEEE Trans. Pattern Anal. Mach. Intell., 27 (7), 1013--1025.
Vidal, E., Thollard, F., de la Higuera, C., Casacuberta, F., and Carrasco, R. C. 2005. Probabilistic Finite-State Machines-Part II. IEEE Trans. Pattern Anal. Mach. Intell., 27 (7), 1026–1039.
Weiner, P. 1973. Linear Pattern Matching Algorithms. Proceedings of the 14th Annual Symposium on Switching and Automata Theory , 1--11.
Xing, Z., Pei, J., and Keogh, E. 2010. A Brief Survey on Sequence Classification. SIGKDD Explor. Newsl., 12 (1), 40--48.
Xing, Z., Pei, J., Dong, G., and Yu, P. S. 2008. Mining Sequence Classifiers for Early Prediction. Proceedings of the {SIAM} International Conference on Data Mining, 644--655.