A Comparison of Distance Measures in Cluster Analysis for Time-Series Data

นัท กุลวานิช
อัครินทร์ ไพบูลย์พานิช

Abstract

An important step in cluster analysis is choosing a distance or dissimilarity measure between data objects. For time-series clustering this choice is critical, because different dissimilarity measures designed for time-series data can lead to different cluster solutions. This research compares the effectiveness of eight distance measures for time-series data: (1) Euclidean, (2) Minkowski, (3) dynamic time warping, (4) Chouakria-Douzal, (5) Piccolo, (6) Maharaj, (7) discrete wavelet transform, and (8) cepstral-based distance. The study extends prior research by using both real and simulated data to compare clustering results. The simulated data are generated from 14 ARIMA processes. The real data are the daily closing stock prices of 68 listed companies included in the Stock Exchange of Thailand 100 (SET100) index during January-April 2018. The results suggest that dynamic time warping is the most effective measure for both the real and the simulated data.


Keywords: distance measure; cluster analysis; time-series data
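
As a rough illustration of why the choice of measure can change a clustering result, the sketch below contrasts two of the eight measures named in the abstract, Euclidean distance and dynamic time warping (DTW), on a pair of toy series. It is a minimal sketch and not the authors' implementation: the function names `euclidean` and `dtw`, the absolute-difference local cost, the unconstrained warping window, and the toy data are all assumptions made for this example.

```python
import numpy as np


def euclidean(x, y):
    """Euclidean distance between two equal-length series."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))


def dtw(x, y):
    """Textbook DTW distance (assumed: absolute-difference cost, no window)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    # cost[i, j] = cheapest cumulative cost of aligning x[:i] with y[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # repeat y[j-1]
                                 cost[i, j - 1],      # repeat x[i-1]
                                 cost[i - 1, j - 1])  # advance both
    return float(cost[n, m])


# Hypothetical toy data: the same pulse, shifted by one time step.
a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]

print(euclidean(a, b))  # 2.0: point-by-point comparison penalises the shift
print(dtw(a, b))        # 0.0: warping the time axis aligns the two pulses
```

Because DTW treats the shifted pulse as identical while Euclidean distance does not, a clustering algorithm fed these two distance matrices could group the same series differently, which is the kind of sensitivity the study sets out to compare.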

Article Details

Section
Physical Sciences
Author Biographies

นัท กุลวานิช

Department of Statistics, Faculty of Commerce and Accountancy, Chulalongkorn University, Wang Mai, Pathum Wan, Bangkok 10330

อัครินทร์ ไพบูลย์พานิช

Department of Statistics, Faculty of Commerce and Accountancy, Chulalongkorn University, Wang Mai, Pathum Wan, Bangkok 10330

References

[1] กัลยา วานิชย์บัญชา, 2008 (B.E. 2551), Multivariate Data Analysis, 3rd Ed., ธรรมสาร, Bangkok. (in Thai)
[2] สุชาติ ประสิทธิ์รัฐสินธุ์, 1997 (B.E. 2540), Multivariate Analysis Techniques for Social Science Research, 4th Ed., บริษัท เฟื่องฟ้า พริ้นติ้ง จำกัด, Bangkok. (in Thai)
[3] Patidar, A.K., Agrawal, J. and Mishra, N., 2012, Analysis of different similarity measure functions and their impacts on shared nearest neighbor clustering approach, Int. J. Comp. Appl. 40(16): 1-5.
[4] Tong, H. and Dabas, P., 1990, Cluster of time series models: An example, J. Appl. Stat. 17: 187-198.
[5] Rani, S. and Sikka, G., 2012, Recent techniques of clustering of time series data: A survey, Int. J. Comp. Appl. 52(15): 1-9.
[6] Galeano, P. and Peña, D.P., 2000, Multivariate analysis in vector time series, Resenhas do Instituto de Matemática e Estatística da Universidade de São Paulo 4: 383-403.
[7] Keogh, E. and Kasetty, S., 2003, On the need for time series data mining benchmarks: A survey and empirical demonstration, Data Min. Knowl. Disc. 7: 349-371.
[8] Wang, X., Mueen, A., Ding, H., Trajcevski, G., Scheuermann, P. and Keogh, E., 2013, Experimental comparison of representation methods and distance measures for time series data, Data Min. Knowl. Disc. 26: 275-309.
[9] Kianimajd, A., Ruano, M.G., Carvalho, P., Henriques, J., Rocha, T., Paredes, S. and Ruano, A.E., 2017, Comparison of different methods of measuring similarity in physiologic time series, IFAC-PapersOnLine 50(1): 11005-11010.
[10] Sankoff, D., 1983, Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley, Reading.
[11] Chouakria, A.D. and Nagabhushan, P.N., 2007, Adaptive dissimilarity index for measuring time series proximity, Adv. Data Anal. Classif. 1: 5-21.
[12] Piccolo, D., 1990, A distance measure for classifying ARIMA models, J. Time Ser. Anal. 11: 153-164.
[13] Maharaj, E.A., 1996, A significance test for classifying ARMA models, J. Stat. Comp. Simul. 54: 305-331.
[14] Maharaj, E.A., 2000, Cluster of time series, J. Classif. 17: 297-314.
[15] Shasha, D. and Zhu, Y., 2004, High Performance Discovery in Time Series: Techniques and Case Studies, Ph.D. Thesis, Available Source: https://cs.nyu.edu/media/publications/zhu_yunyue.pdf.
[16] Wei, W.S., 2006, Time Series Analysis: Univariate and Multivariate Methods, Pearson Addison Wesley, 605 p.
[17] Kalpakis, K., Gada, D. and Puttagunta, V., 2001, Distance measures for effective clustering of ARIMA time-series, pp. 273-280, In Data Mining, ICDM 2001, IEEE International Conference.
[18] Díaz, S.P. and Vilar, J.A., 2010, Comparing several parametric and nonparametric approaches to time series clustering: A simulation study, J. Classif. 27: 333-362.
[19] Khan, A., Khan, K. and Baharudin, B.B., 2009, Frequent patterns mining of stock data using hybrid clustering association algorithm, pp. 667-671, In Information Management and Engineering, ICIME 2009, International Conference.
[20] Kremer, H., Gunnemann, S. and Seidl, T., 2010, Detecting climate change in multivariate time series data by novel clustering and cluster tracing techniques, pp. 96-97, In Data Mining Workshops (ICDMW), 2010 IEEE International Conference.
[21] Niennattrakul, V. and Ratanamahatana, C.A., 2007, On clustering multimedia time series data using K-means and dynamic time warping, pp. 733-738, IEEE.
[22] Jixue, D., 2009, Data mining of time series based on wave cluster, pp. 697-699, In Information Technology and Applications, IFITA 2009, International Forum, IEEE.
[23] Verdoolaege, G. and Rosseel, Y., 2010, Activation detection in event-related fMRI through clustering of wavelet distributions, pp. 4393-4396, In Image Processing (ICIP), 2010 17th IEEE International Conference.
[24] Baragona, R., 2001, A simulation study on clustering time series with metaheuristic methods, Quaderni di Statistica 3: 1-26.
[25] Lhermitte, S., Verbesselt, J., Verstraeten, W.W. and Coppin, P., 2011, A comparison of time series similarity measures for classification and change detection of ecosystem dynamics, Remote Sens. Environ. 115: 3129-3152.
[26] Shumway, R.H. and Stoffer, D.S., 2010, Time Series Analysis and Its Applications with R Examples, 3rd Ed., Springer, 576 p.
[27] Cuturi, M., 2011, Fast global alignment kernels, pp. 929-936, In 28th International Conference on Machine Learning (ICML-11).
[28] Cuturi, M. and Blondel, M., 2017, Soft-DTW: A differentiable loss function for time-series, pp. 894-903, In 34th International Conference on Machine Learning.