Abstract
Predicting student failure at school is difficult because many factors can contribute to low student performance and because the datasets involved are typically imbalanced. In this paper, we propose a genetic programming algorithm and several data mining approaches to address these problems, using real data on 670 high school students from Zacatecas, Mexico. First, we select the best attributes to reduce the high dimensionality of the data. Then, we apply data rebalancing and cost-sensitive classification to address the problem of classifying imbalanced data. We also compare a genetic programming model with different white-box techniques in order to obtain classification rules that are both more comprehensible and more accurate. The outcomes of each approach are shown and compared in order to select the one that best improves classification accuracy, specifically with regard to identifying students who might fail.
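The two remedies for class imbalance named in the abstract, rebalancing and cost-sensitive classification, can be illustrated with a minimal sketch. It is not the method used in the paper: random oversampling stands in for whichever rebalancing technique the authors applied, the 90/10 class split is toy data rather than the real student records, and the function names and the 5:1 cost ratio are hypothetical choices made only for illustration.

```python
import random
from collections import Counter


def oversample_minority(rows, labels, seed=0):
    """Random oversampling: duplicate rows of the smaller classes until
    every class has as many rows as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def cost_sensitive_label(p_fail, cost_missed_fail=5.0, cost_false_alarm=1.0):
    """Pick the label with the lower expected misclassification cost.
    The default costs are hypothetical, not taken from the paper."""
    expected_cost_if_pass = p_fail * cost_missed_fail
    expected_cost_if_fail = (1.0 - p_fail) * cost_false_alarm
    return "fail" if expected_cost_if_pass > expected_cost_if_fail else "pass"


def accuracy_and_fail_recall(y_true, y_pred):
    """Overall accuracy, and the fraction of failing students detected."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    preds_for_fails = [p for t, p in zip(y_true, y_pred) if t == "fail"]
    recall = sum(p == "fail" for p in preds_for_fails) / len(preds_for_fails)
    return correct / len(y_true), recall


# Toy imbalanced data (illustrative only): 90 passing and 10 failing students.
labels = ["pass"] * 90 + ["fail"] * 10
rows = [[i] for i in range(100)]

# A classifier that always predicts "pass" looks accurate but finds no failures.
acc, recall = accuracy_and_fail_recall(labels, ["pass"] * 100)
print("always-pass accuracy:", acc, "fail recall:", recall)

# After oversampling, both classes have the same number of rows.
_, balanced = oversample_minority(rows, labels)
print("class counts after oversampling:", Counter(balanced))

# With a higher cost for a missed failure, a modest failure probability
# is already enough to flag a student.
print(cost_sensitive_label(0.2), cost_sensitive_label(0.1))
```

The always-pass baseline shows why overall accuracy alone is a poor yardstick on imbalanced data, which is why the focus falls on correctly identifying the students who might fail.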
Acknowledgements
This research has been supported by projects of the Regional Government of Andalucía and the Ministry of Science and Technology, P08-TIC-3720, TIN-2011-22408, FPU grant AP2010-0042 and FEDER funds.
Cite this article
Márquez-Vera, C., Cano, A., Romero, C. et al. Predicting student failure at school using genetic programming and different data mining approaches with high dimensional and imbalanced data. Appl Intell 38, 315–330 (2013). https://doi.org/10.1007/s10489-012-0374-8
DOI: https://doi.org/10.1007/s10489-012-0374-8