Thai Text Segmentation Using Vowel-Centered Rules and Learning
Abstract
The vast majority of text processing algorithms share one assumption: that the input text is a sequence of words. In languages such as Thai, where word boundaries are not always explicit, text segmentation is therefore an issue of interest. This work presents a two-step algorithm for Thai text segmentation. The first step chops the input text into pieces centered around the vowels. The second step defines a set of features that may help determine whether two consecutive pieces from the first step belong together as a unit (word, syllable, etc.), and then uses learning algorithms to build a model from these features. Applying this model to an input text yields a sequence of units, each small (a few syllables) yet useful enough for further processing by other word-based algorithms.
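The two steps can be illustrated with a minimal sketch. The abstract does not give the actual chopping rules or the feature set, so everything below is an assumption made for illustration: the simplified vowel rule in `chop`, the four features in `pair_features`, and the hand-written `toy_rule`, which merely stands in for a learned model such as a C4.5 decision tree.

```python
import re

# Thai code point ranges (Unicode Thai block).
LEAD = "\u0E40-\u0E44"                      # vowels written before the consonant
CONS = "\u0E01-\u0E2E"                      # consonants
MARK = "\u0E31\u0E34-\u0E3A\u0E47-\u0E4E"   # above/below vowels, tone marks, other signs
FOLLOW = "\u0E30\u0E32\u0E33"               # vowels written after the consonant

VOWELS = re.compile(f"[{LEAD}{FOLLOW}\u0E31\u0E34-\u0E39]")
# One consonant with its attached vowel signs, or any other single character.
CLUSTER = re.compile(f"[{LEAD}]?[{CONS}][{MARK}]*[{FOLLOW}]?|.", re.DOTALL)


def chop(text):
    """Step 1 (assumed rule): cut the text into pieces built around written vowels."""
    pieces = []
    for m in CLUSTER.finditer(text):
        cluster = m.group()
        is_thai = "\u0E01" <= cluster[0] <= "\u0E4E"
        if (pieces and is_thai and not VOWELS.search(cluster)
                and VOWELS.search(pieces[-1])):
            pieces[-1] += cluster   # a vowel-less cluster joins the previous piece
        else:
            pieces.append(cluster)
    return pieces


def pair_features(left, right):
    """Step 2a (hypothetical features) describing two consecutive pieces."""
    return {
        "left_len": len(left),
        "right_len": len(right),
        "right_starts_with_lead_vowel": "\u0E40" <= right[0] <= "\u0E44",
        "left_ends_with_consonant": "\u0E01" <= left[-1] <= "\u0E2E",
    }


def segment(text, should_join):
    """Step 2b: merge consecutive pieces whenever the model says they belong together."""
    pieces = chop(text)
    units = pieces[:1]
    for left, right in zip(pieces, pieces[1:]):
        if should_join(pair_features(left, right)):
            units[-1] += right
        else:
            units.append(right)
    return units


def toy_rule(features):
    """Hand-written stand-in for a trained classifier; not a learned model."""
    return features["left_len"] >= 3 and features["right_starts_with_lead_vowel"]


# "ไปโรงเรียน" ("go to school"), spelled with escapes to keep the code points explicit.
text = "\u0E44\u0E1B\u0E42\u0E23\u0E07\u0E40\u0E23\u0E35\u0E22\u0E19"
print(chop(text))               # ['ไป', 'โรง', 'เรียน']
print(segment(text, toy_rule))  # ['ไป', 'โรงเรียน']
```

In a real system, `should_join` would be the prediction function of a classifier trained on labeled piece pairs. Keeping it a plain callable means the chopping and merging logic does not depend on which learner is used.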
Keywords: Thai, Text, Segmentation, Learning, Decision Trees, C4.5
Corresponding author e-mail: patrawadee@as.nida.ac.th
Copyright Transfer Statement
The copyright of this article is transferred to Current Applied Science and Technology journal with effect if and when the article is accepted for publication. The copyright transfer covers the exclusive right to reproduce and distribute the article, including reprints, translations, photographic reproductions, electronic form (offline, online) or any other reproductions of similar nature.
The author warrants that this contribution is original and that he/she has full power to make this grant. The author signs for and accepts responsibility for releasing this material on behalf of any and all co-authors.