Predicting student failure at school using genetic programming and different data mining approaches with high dimensional and imbalanced data

Márquez-Vera, Carlos; Cano, Alberto; Romero, Cristóbal; Ventura, Sebastián

doi:10.1007/s10489-012-0374-8

Published: 26 August 2012

Applied Intelligence, volume 38, pages 315–330 (2013)

Carlos Márquez-Vera¹,
Alberto Cano²,
Cristóbal Romero² &
…
Sebastián Ventura²

3164 Accesses
111 Citations

Abstract

Predicting student failure at school is difficult for two reasons: many factors can contribute to low student performance, and the resulting datasets are imbalanced. In this paper, a genetic programming algorithm and different data mining approaches are proposed for solving these problems using real data about 670 high school students from Zacatecas, Mexico. First, we select the best attributes in order to resolve the problem of high dimensionality. Then, data rebalancing and cost-sensitive classification are applied in order to resolve the problem of classifying imbalanced data. We also propose using a genetic programming model, compared against different white box techniques, in order to obtain classification rules that are both more comprehensible and more accurate. The outcomes of each approach are shown and compared in order to select the one that best improves classification accuracy, specifically with regard to which students might fail.
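The sketch below illustrates, on invented toy data, why imbalance matters and what rebalancing and cost-sensitive classification each do. It is not the authors' method. Plain random oversampling stands in for whatever rebalancing technique the paper uses. The 90/10 class split, the function names, and the 4:1 cost ratio are all assumptions made for illustration. The cost rule assumes that correct predictions cost nothing.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class is as large as the largest one (simple random oversampling)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def rates(y_true, y_pred, positive="fail"):
    """Return (accuracy, true-positive rate, true-negative rate).
    Assumes both classes occur at least once in y_true."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    pos = sum(t == positive for t in y_true)
    neg = len(y_true) - pos
    return (tp + tn) / len(y_true), tp / pos, tn / neg


def cost_sensitive_label(p_fail, cost_missed_fail=4.0, cost_false_alarm=1.0):
    """Predict the label with the lower expected cost, given an estimated
    probability of failure. The costs are hypothetical."""
    expected_cost_of_pass = p_fail * cost_missed_fail
    expected_cost_of_fail = (1 - p_fail) * cost_false_alarm
    return "fail" if expected_cost_of_pass > expected_cost_of_fail else "pass"


# Invented toy data: 90 passing students and 10 failing students.
y_true = ["pass"] * 90 + ["fail"] * 10

# A classifier that always predicts the majority class.
y_majority = ["pass"] * 100
accuracy, tp_rate, tn_rate = rates(y_true, y_majority)
print(accuracy, tp_rate, tn_rate)

# Rebalance so that both classes have the same number of examples.
student_ids = list(range(100))
_, balanced_labels = random_oversample(student_ids, y_true)
print(Counter(balanced_labels))

# With a 4:1 cost ratio, a 30% estimated failure risk is enough to flag a student.
print(cost_sensitive_label(0.3), cost_sensitive_label(0.1))
```

On this toy data the majority-class baseline reaches 0.9 accuracy while identifying none of the failing students, which is why overall accuracy alone is a poor guide when the class of interest is rare.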



Acknowledgements

This research has been supported by projects of the Regional Government of Andalucía and the Ministry of Science and Technology, P08-TIC-3720, TIN-2011-22408, FPU grant AP2010-0042 and FEDER funds.

Author information

Authors and Affiliations

Autonomous University of Zacatecas, Zacatecas, México
Carlos Márquez-Vera
Department of Computer Science, University of Córdoba, Córdoba, Spain
Alberto Cano, Cristóbal Romero & Sebastián Ventura


Corresponding author

Correspondence to Cristóbal Romero.


About this article

Cite this article

Márquez-Vera, C., Cano, A., Romero, C. et al. Predicting student failure at school using genetic programming and different data mining approaches with high dimensional and imbalanced data. Appl Intell 38, 315–330 (2013). https://doi.org/10.1007/s10489-012-0374-8

Published: 26 August 2012
Issue Date: April 2013
DOI: https://doi.org/10.1007/s10489-012-0374-8
