Predicting student failure at school using genetic programming and different data mining approaches with high dimensional and imbalanced data

Applied Intelligence

Abstract

Predicting student failure at school is difficult because many factors can influence poor student performance and because the datasets involved are typically imbalanced. In this paper, we propose a genetic programming algorithm and several data mining approaches to address both problems, using real data on 670 high school students from Zacatecas, Mexico. First, we select the best attributes to address the problem of high dimensionality. Then, we apply data rebalancing and cost-sensitive classification to address the problem of classifying imbalanced data. We also propose comparing a genetic programming model against different white-box techniques in order to obtain classification rules that are both more comprehensible and more accurate. The outcomes of each approach are shown and compared in order to select the one that best improves classification accuracy, specifically with regard to identifying which students might fail.
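As a rough illustration of the two imbalance-handling ideas named in the abstract (rebalancing and cost-sensitive classification), the following minimal sketch uses only the Python standard library. It is not the authors' implementation: the function names, the toy labels, and the cost values are all assumptions made for this example, plain random oversampling stands in for whatever rebalancing method the paper applies, and correct decisions are assumed to cost nothing.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest one (illustrative technique)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def cost_sensitive_label(p_fail, cost_fn, cost_fp):
    """Predict 'fail' when the expected cost of predicting 'pass'
    (p_fail * cost_fn) is at least that of predicting 'fail'
    ((1 - p_fail) * cost_fp). Correct decisions are assumed to cost 0."""
    threshold = cost_fp / (cost_fp + cost_fn)
    return "fail" if p_fail >= threshold else "pass"


def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: hits[cls] / totals[cls] for cls in totals}


# Hypothetical toy data: 9 passing students and 1 failing student.
y_true = ["pass"] * 9 + ["fail"]
rows = [(i,) for i in range(len(y_true))]  # placeholder attribute vectors

# A classifier that always predicts "pass" looks accurate overall
# but never identifies the failing student.
y_pred = ["pass"] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)

# Rebalancing: both classes end up with the same number of rows.
balanced_rows, balanced_labels = random_oversample(rows, y_true)

# Cost-sensitive decision: if missing a failing student is assumed to be
# four times as costly as a false alarm, the decision threshold drops
# from 0.5 to 0.2.
decision_equal_costs = cost_sensitive_label(0.3, cost_fn=1, cost_fp=1)
decision_skewed_costs = cost_sensitive_label(0.3, cost_fn=4, cost_fp=1)
```

On the toy labels, overall accuracy is 0.9 while recall on the "fail" class is 0.0, which is why per-class measures matter when the goal is to identify students who might fail.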



Acknowledgements

This research has been supported by projects of the Regional Government of Andalucía and the Ministry of Science and Technology, P08-TIC-3720, TIN-2011-22408, FPU grant AP2010-0042 and FEDER funds.

Author information


Corresponding author

Correspondence to Cristóbal Romero.


About this article

Cite this article

Márquez-Vera, C., Cano, A., Romero, C. et al. Predicting student failure at school using genetic programming and different data mining approaches with high dimensional and imbalanced data. Appl Intell 38, 315–330 (2013). https://doi.org/10.1007/s10489-012-0374-8


  • DOI: https://doi.org/10.1007/s10489-012-0374-8
