Performance of Crossover Operators in Genetic Algorithm for Variable Selection in Regression Analysis

Main Article Content

พัทธ์ชนก ศรีสุรเดชชัย
ชญานิน อินกว่าง

Abstract

Variable selection is a challenging procedure when there is a large number of explanatory variables and interaction effects are expected in the model. The number of possible models can be so large that stepwise algorithms tend to return a locally optimal model. This paper applies the genetic algorithm with 6 types of crossover operators to 4 real datasets and to simulated data. Both linear regression and binomial logistic regression are of interest, and Akaike's information criterion (AIC) is used as the criterion for variable selection. For the simulated data, the explanatory variables are generated either without correlation or with a first-order autoregressive correlation structure in which the correlations equal 0.3, 0.5, and 0.8. The results are compared with those from the stepwise variable selection methods: forward selection, backward elimination, and alternating stepwise selection. Furthermore, we propose a new criterion that measures the percentage of independent variables correctly included in a model. The results show that, compared with stepwise variable selection, the genetic algorithm can find models with a lower AIC. Among the 6 crossover operators, the (m-1)-point crossover chooses less suitable models, and this difference from the other operators is statistically significant, whereas the difference between the shuffle crossover and the uniform crossover is not statistically significant in any case studied.
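As a minimal illustration of the approach described in the abstract, the sketch below implements a plain generational genetic algorithm for variable selection with an AIC fitness and a uniform crossover (one of the six operators compared), on predictors simulated with a first-order autoregressive correlation structure. This is an assumption-laden Python sketch, not the authors' implementation: the population size, mutation rate, selection scheme, the hypothetical set of true predictors, and the reading of the percentage-correctly-included criterion are all illustrative choices.

# Minimal sketch (not the authors' code): GA variable selection with an AIC
# fitness and uniform crossover; predictors follow an AR(1) correlation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulate p predictors with corr(x_i, x_j) = rho^|i-j| (AR(1) structure).
n, p, rho = 200, 10, 0.5                          # illustrative sizes
cov = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
true_idx = np.array([0, 2, 5])                    # hypothetical true predictors
y = X[:, true_idx] @ np.array([2.0, -1.5, 1.0]) + rng.normal(size=n)

def aic(mask):
    # Fit OLS on the selected columns; a lower AIC means a fitter chromosome.
    if not mask.any():
        return np.inf
    return sm.OLS(y, sm.add_constant(X[:, mask])).fit().aic

def uniform_crossover(a, b):
    # Uniform crossover: each gene comes from either parent with prob. 1/2.
    swap = rng.random(a.size) < 0.5
    c1, c2 = a.copy(), b.copy()
    c1[swap], c2[swap] = b[swap], a[swap]
    return c1, c2

# Binary chromosomes: bit j = 1 means variable j enters the model.
pop = rng.random((30, p)) < 0.5
for gen in range(50):
    fit = np.array([aic(ind) for ind in pop])
    parents = pop[np.argsort(fit)[: len(pop) // 2]]   # truncation selection
    children = []
    while len(children) < len(pop):
        i, j = rng.choice(len(parents), size=2, replace=False)
        children.extend(uniform_crossover(parents[i], parents[j]))
    pop = np.array(children[: len(pop)])
    pop ^= rng.random(pop.shape) < 0.01               # bit-flip mutation

best = pop[np.argmin([aic(ind) for ind in pop])]
# One plausible reading of the proposed criterion: the percentage of the
# truly relevant variables that the chosen model includes.
pct = 100 * best[true_idx].mean()
print("selected:", np.flatnonzero(best), "AIC:", round(aic(best), 2),
      "% true variables included:", pct)

Swapping uniform_crossover for a one-point, (m-1)-point, or shuffle variant changes only that single function, which is what makes the six operators directly comparable under the same AIC fitness.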

Article Details

Section
Physical Sciences
Author Biographies

พัทธ์ชนก ศรีสุรเดชชัย

Department of Mathematics and Statistics, Faculty of Science and Technology, Thammasat University, Rangsit Centre, Khlong Nueng Sub-district, Khlong Luang District, Pathum Thani 12120

ชญานิน อินกว่าง

Department of Mathematics and Statistics, Faculty of Science and Technology, Thammasat University, Rangsit Centre, Khlong Nueng Sub-district, Khlong Luang District, Pathum Thani 12120

References

Agresti, A., 2015, Foundations of Linear and Generalized Linear Models, John Wiley & Sons, Inc., New Jersey.

Bilder, C.R. and Loughin, T.M., 2015, Analysis of Categorical Data with R, CRC Press, Inc., Boca Raton.

Simon, D., 2013, Evolutionary Optimization Algorithms, John Wiley & Sons, Inc., New Jersey.

Paterlini, S. and Minerva, T., 2010, Regression Model Selection Using Genetic Algorithm, Available Source: http://www.wseas.us/e-library/conferences/2010/Iasi/NNECFS/NNECFS-01.pdf, October 19, 2018.

Vinterbo, S. and Ohno-Machado, L., 1999, A genetic algorithm to select variables in logistic regression: Example in the domain of myocardial infarction, Proc. AMIA Symp. 1999: 984-988.

Johnson, P., Vandewater, L., Wilson, W., Maruff, P., Savage, G., Graham, P. and Macaulay, L.S., 2014, Genetic algorithm with logistic regression for prediction of progression to Alzheimer's disease, BMC Bioinformatics 15(16): 1-14.

Picek, S. and Golub, M., 2010, Comparison of a crossover operator in binary-coded genetic algorithms, WSEAS Transact. Comput. 9: 1064-1073.

Holland, J.H., 1975, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, MI.

Givens, G.H. and Hoeting, J.A., 2013, Computational Statistics, 2nd Ed., John Wiley & Sons, Inc., New Jersey, 469 p.

de Jong, K.A., 1975, An Analysis of the Behavior of a Class of Genetic Adaptive Systems, Doctoral Dissertation, University of Michigan, Ann Arbor, MI.

Umbarkar, A.J. and Sheth, P.D., 2015, Crossover operators in genetic algorithms: A review, ICTACT J. Soft Comput. 6: 1083-1092.

Gwiazda, T.D., 2006, Genetic Algorithms Reference, TomaszGwiazda E-Book, Poland, 410 p.

Shodhganga: A Reservoir of Indian Theses, Chapter 9: Crossover, Available Source: http://shodhganga.inflibnet.ac.in/bitstream/10603/32680/19/19_chapter%209.pdf, October 9, 2018.

Soni, N. and Kumar, T., 2014, Study of various mutation operators in genetic algorithms, IJCSIT 5: 4519-4521.

Burnham, K.P. and Anderson, D.R., 2004, Multimodel inference: Understanding AIC and BIC in model selection, Sociol. Methods Res. 33: 261-304.

Vrieze, S.I., 2012, Model selection and psychological theory: A discussion of the differences between the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), Psychol. Methods 17: 228-243.

Yang, Y., 2005, Can the strengths of AIC and BIC be shared?, Biometrika 92: 937-950.

Johnson, R.W., 1996, Fitting percentage of body fat to simple body measurements, J. Stat. Edu. 4(1).

Cortez, P., Cerdeira, A., Almeida, F., Matos, T. and Reis, J., 2009, Modeling wine preferences by data mining from physicochemical properties, Decis. Supp. Syst. 47: 547-553.

De Cock, D., 2011, Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project, J. Stat. Edu. 19(3): 1-15.

National Institute of Diabetes and Digestive and Kidney Diseases, Pima Indian Diabetes, Available Source: https://www.kaggle.com/rnmehta5/pima-indian-diabetes-binary-classification/data, November 25, 2018.

Albright, J., Introduction to Random Effects Models Including HLM, Available Source: https://www.methodsconsultants.com/tutorial/introduction-to-random-effects-models-including-hlm, February 19, 2019.