Thai Text Transformation for Compression


K. Sermkawinrak
S. Intakosum
V. Boonjing

Abstract

This paper presents a new Thai-text transformation algorithm that improves compression by using a list of frequently used Thai words and phrases. The approach increases redundancy in the text by encoding it into an intermediate form: each listed word or phrase found in the text is replaced by its fixed-length code. Algorithm performance is measured by compression ratio. Three implementations were evaluated. The first uses all 511 frequently used Thai words and phrases and assigns a three-byte code to each. The second covers only the 255 most frequently used words and phrases and assigns a two-byte code to each. The third covers the 109 most frequently used words and phrases and assigns a one-byte code to each. In the experiment, each text and its transformed version were given as input to standard compression programs. The results show that the transformed text achieves a significantly better compression ratio than the original text.
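
The substitution step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The five-entry word list, the TIS-620 encoding, the choice of byte values 0x80 and up as one-byte codes, and the function names are all assumptions made for illustration, and zlib merely stands in for the "standard compression programs" mentioned above. The sketch also assumes the input text contains no characters that already encode to the reserved code bytes.

```python
import zlib

# Hypothetical sample list; the paper's 109/255/511-entry lists are not reproduced here.
FREQUENT_WORDS = ["ประเทศไทย", "การ", "และ", "ที่", "ของ"]

ENCODING = "tis_620"  # single-byte Thai encoding (assumption)
FIRST_CODE = 0x80     # 0x80-0xA0 carry no Thai letters in TIS-620 (assumption)
LAST_CODE = 0xA0


def build_tables(words):
    """Assign one fixed-length (one-byte) code to each listed word."""
    if len(words) > LAST_CODE - FIRST_CODE + 1:
        raise ValueError("too many words for the one-byte code range in this sketch")
    encode_table = {
        word.encode(ENCODING): bytes([FIRST_CODE + i]) for i, word in enumerate(words)
    }
    decode_table = {code[0]: raw for raw, code in encode_table.items()}
    return encode_table, decode_table


def transform(text, encode_table):
    """Replace listed words with their codes, trying longer entries first."""
    data = text.encode(ENCODING)
    candidates = sorted(encode_table, key=len, reverse=True)
    out = bytearray()
    i = 0
    while i < len(data):
        for raw in candidates:
            if data.startswith(raw, i):
                out += encode_table[raw]
                i += len(raw)
                break
        else:
            # No listed word starts here; copy the byte unchanged.
            out.append(data[i])
            i += 1
    return bytes(out)


def inverse_transform(data, decode_table):
    """Expand every code byte back into the word it stands for."""
    out = bytearray()
    for byte in data:
        out += decode_table.get(byte, bytes([byte]))
    return out.decode(ENCODING)


def compressed_size(data):
    """Size in bytes after zlib compression (stand-in for a standard compressor)."""
    return len(zlib.compress(data, 9))


if __name__ == "__main__":
    sample = "การพัฒนาของประเทศไทยและการศึกษาที่ดี"
    enc, dec = build_tables(FREQUENT_WORDS)
    transformed = transform(sample, enc)
    assert inverse_transform(transformed, dec) == sample
    print("original bytes:   ", len(sample.encode(ENCODING)))
    print("transformed bytes:", len(transformed))
    print("zlib(original):   ", compressed_size(sample.encode(ENCODING)))
    print("zlib(transformed):", compressed_size(transformed))
```

On a string this short the compressed sizes say little; a comparison like the one reported in the abstract would need full-length texts and the complete word list.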




Corresponding author: E-mail: [email protected]

Article Details

Section
Original Research Articles
