Thai Text Segmentation Using Vowel-Centered Rules and Learning


Patrawadee Tanawongsuwan*

Abstract

The vast majority of text processing algorithms assume that the input text is a sequence of words. In languages in which word boundaries are not always explicit, such as Thai, text segmentation is therefore an issue of interest. This work presents a two-step algorithm for Thai text segmentation. The first step chops the input text into pieces centered around the vowels. In the second step, the algorithm defines a set of features that might help determine whether or not two consecutive pieces from the previous step belong together as a unit (word, syllable, etc.). It then uses learning algorithms to build a model from these features. Applying this model to an input text yields a sequence of units, each small (a few syllables) yet useful enough for further processing by other word-based algorithms.
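The two-step structure described above can be illustrated with a minimal Python sketch. It is not the article's implementation. The abstract does not state the actual chopping rules, the feature set, or the trained model, so everything here is an assumption made for illustration: `chop` uses a single simplified rule (a new piece starts at each Thai leading vowel, U+0E40 to U+0E44), `features` returns a few hypothetical features, and `should_join` is a caller-supplied stand-in for a learned classifier such as a decision tree.

```python
from typing import Callable, Dict, List

# Assumption: only the five Thai vowels written before their consonant
# (U+0E40..U+0E44) are used as split points. The article's rules are richer.
LEADING_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")


def chop(text: str) -> List[str]:
    """Step 1 (simplified): start a new piece at every leading vowel."""
    pieces: List[str] = []
    current = ""
    for ch in text:
        if ch in LEADING_VOWELS and current:
            pieces.append(current)
            current = ""
        current += ch
    if current:
        pieces.append(current)
    return pieces


def features(left: str, right: str) -> Dict[str, object]:
    """Hypothetical features describing two consecutive pieces."""
    return {
        "left_len": len(left),
        "right_len": len(right),
        "right_starts_with_leading_vowel": right[0] in LEADING_VOWELS,
    }


def merge(pieces: List[str],
          should_join: Callable[[Dict[str, object]], bool]) -> List[str]:
    """Step 2: join consecutive pieces whenever the model says they belong together."""
    if not pieces:
        return []
    units = [pieces[0]]
    for left, right in zip(pieces, pieces[1:]):
        if should_join(features(left, right)):
            units[-1] += right   # same unit as the previous piece
        else:
            units.append(right)  # start a new unit
    return units


if __name__ == "__main__":
    pieces = chop("ไปโรงเรียน")
    # Toy stand-in for a trained model; a real one would be learned from data.
    units = merge(pieces, lambda f: f["left_len"] >= 3)
    print(pieces)
    print(units)
```

With the toy rule above, the input "ไปโรงเรียน" is chopped into three pieces ("ไป", "โรง", "เรียน"), and the last two are joined into the single unit "โรงเรียน". Passing the decision function as a parameter keeps the chopping step independent of whichever learning algorithm produces the model.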


Keywords: Thai, Text, Segmentation, Learning, Decision Trees, C4.5


*Corresponding author. E-mail: patrawadee@as.nida.ac.th


 

Article Details

Section
Original Research Articles
