Computational analysis of sugarcane ESTs for high-quality clusters and SSR mining

Authors

  • Piyarat Ponyared, Department of Biology, Faculty of Science, Khon Kaen University, Khon Kaen 40000, Thailand
  • Tawun Remsungnen, Department of Mathematics, Faculty of Science, Khon Kaen University, Khon Kaen 40000, Thailand
  • Ngamnij Arch-int, Department of Computer Science, Faculty of Science, Khon Kaen University, Khon Kaen 40000, Thailand
  • Wichai Neeratanaphan, Department of Biology, Faculty of Science, Khon Kaen University, Khon Kaen 40000, Thailand
  • Chutipong Akkasaeng, Department of Plant Science and Agricultural Resources, Faculty of Agriculture, Khon Kaen University, Khon Kaen 40000, Thailand
  • Napaporn Tantisuwichwong, Department of Biology, Faculty of Science, Khon Kaen University, Khon Kaen 40000, Thailand

DOI:

https://doi.org/10.14456/tjg.2009.3

Keywords:

expressed sequence tags (ESTs), sugarcane, EST clustering, SSR

Abstract

Expressed sequence tags (ESTs) provide an opportunity to develop powerful simple sequence repeat (SSR) markers when high-quality EST clusters are available. EST clustering is commonly performed on the basis of nucleotide similarity to reduce redundancy and improve sequence quality, and the degree of similarity required is one of the important parameters affecting cluster quality. This work aimed to assess EST cluster quality at various degrees of nucleotide similarity and to identify SSR loci within the resulting clusters. A collection of 2,268 ESTs from the mature stalk of the sugarcane (Saccharum spp.) hybrid cultivar CP72-2036, available in dbEST of GenBank, was pre-processed to remove sequencing errors and contaminant sequences, yielding 2,167 clean ESTs. Clustering the ESTs at sequence identity thresholds of P = 85, 90, 95 and 100% reduced the EST data set. The lowest number of clusters was obtained at P = 85%. The search for SSR loci likewise yielded the fewest SSRs in the EST clusters defined at P = 85%.
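
The SSR search mentioned above can be illustrated with a minimal sketch. This is not the tool or the parameter set used in the study: the function name find_ssrs and the minimum repeat counts in MIN_REPEATS (6 for dinucleotide, 5 for trinucleotide, 4 for tetranucleotide motifs) are assumptions chosen only for illustration, and only perfect (uninterrupted) repeats are reported.

```python
import re

# Hypothetical minimum repeat counts per motif length (assumed for this
# sketch; the abstract does not state the criteria used in the study).
MIN_REPEATS = {2: 6, 3: 5, 4: 4}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, motif, repeat_count) for each perfect SSR in seq."""
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in sorted(min_repeats.items()):
        # One motif of motif_len bases followed by at least
        # (min_rep - 1) further copies of the same motif.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves built from a shorter unit,
            # e.g. "AA" (mononucleotide) or "AGAG" (dinucleotide).
            is_compound = any(
                motif == motif[:k] * (motif_len // k)
                for k in range(1, motif_len)
                if motif_len % k == 0
            )
            if is_compound:
                continue
            hits.append((m.start(), motif, len(m.group(0)) // motif_len))
    return hits


# Toy sequence (not real sugarcane data): (AG)7 and (CTG)5 with short flanks.
example = "GGC" + "AG" * 7 + "TTC" + "CTG" * 5 + "A"
print(find_ssrs(example))  # [(3, 'AG', 7), (20, 'CTG', 5)]
```

Running such a search on the representative sequence of each cluster, once per identity threshold, gives SSR counts that can be compared across P values in the way the abstract describes.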

Published

2012-07-12

Section

Research Articles