Robust password security: a genetic programming approach with imbalanced dataset handling

  • Regular Contribution
International Journal of Information Security

Abstract

Developing a method for determining password strength using artificial intelligence (AI) is crucial because it enhances cybersecurity by providing a more robust defense against unauthorized access. AI can analyze complex patterns and trends, identifying weak passwords and potential vulnerabilities more effectively than traditional methods. This proactive approach helps users and organizations strengthen their security posture, reducing the risk of data breaches and unauthorized intrusions. In this paper, the genetic programming symbolic classifier (GPSC) was applied to a publicly available dataset to obtain a set of symbolic expressions (SEs) for password strength classification with high classification accuracy. One of the problems with the dataset was an imbalance between classes, so various oversampling and undersampling techniques were applied. The optimal GPSC hyperparameter values were found using the random hyperparameter value search (RHVS) method. The algorithm was trained using fivefold cross-validation (5FCV). To evaluate the obtained SEs, accuracy, area under the receiver operating characteristic curve, precision, recall, and F1-score were used as evaluation metrics. The SEs obtained on the balanced dataset variations achieved high classification accuracy (0.99), and applying all SEs to the entire original imbalanced dataset achieved the same accuracy.



Acknowledgements

This research was (partly) supported by the CEEPUS network CIII-HR-0108, the European Regional Development Fund under Grant KK.01.1.1.01.0009 (DATACROSS), the Erasmus+ project WICT under Grant 2021-1-HR01-KA220-HED-000031177, and the University of Rijeka Scientific Grants uniri-mladi-technic-22-61 and uniri-tehnic-18-275-1447.

Author information

Corresponding author

Correspondence to Nikola Andelić.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix


Appendix A.1. The modified mathematical functions used in GPSC

In the description of the GPSC and the RHVS method, it was mentioned that mathematical functions such as division, square root, natural logarithm, and logarithms with bases 2 and 10 had to be modified to avoid generating infinite or not-a-number values. The division and square root functions can be written in the following forms:

$$y_{\text{DIV}}(x) = \begin{cases} x_1/x_2, & |x_2| > 0.001 \\ 1, & |x_2| < 0.001 \end{cases}$$
(8)
$$y_{\text{SQRT}}(x) = \begin{cases} \sqrt{|x|}, & |x| > 0.001 \end{cases}$$
(9)

The natural logarithm and the logarithms with bases 2 and 10 can be defined as:

$$y_{i}(x) = \begin{cases} \log_i |x|, & |x| > 0.001 \\ 0, & |x| < 0.001 \end{cases}, \quad i = e, 2, 10$$
(10)
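
As an illustration, Eqs. (8)-(10) can be sketched in Python roughly as follows. This is a minimal sketch, not the authors' implementation: the function names and the constant `EPS` are hypothetical, the boundary case of a magnitude exactly equal to 0.001 (which the equations leave unspecified) is assumed to fall into the fallback branch, and the square root is assumed to return $\sqrt{|x|}$ for every input, since Eq. (9) lists no alternative value.

```python
import math

EPS = 0.001  # threshold appearing in Eqs. (8)-(10)


def protected_div(x1: float, x2: float) -> float:
    """Eq. (8): x1 / x2 when |x2| > EPS, otherwise 1."""
    # Assumption: |x2| == EPS is treated like the small-denominator case.
    return x1 / x2 if abs(x2) > EPS else 1.0


def protected_sqrt(x: float) -> float:
    """Eq. (9): square root of |x|, so negative inputs never produce NaN."""
    # Assumption: applied to all x, because sqrt(|x|) is always finite.
    return math.sqrt(abs(x))


def protected_log(x: float, base: float = math.e) -> float:
    """Eq. (10): log of |x| in the given base when |x| > EPS, otherwise 0."""
    if abs(x) <= EPS:
        return 0.0
    return math.log(abs(x), base)
```

With these definitions a call such as `protected_div(1.0, 0.0)` returns a finite value instead of raising `ZeroDivisionError`, which is the property the modified functions are meant to guarantee.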

Appendix A.2. How to obtain and use the SEs from this research

Because of the large number of SEs obtained in this research, they are not shown in the paper. The SEs can be obtained from the GitHub repository (https://github.com/nandelic2022/PasswordStrengthEquations.git). After downloading the SEs, the procedure for using them consists of the following steps:

  1. From the initial dataset, define the input variables and the output variable.

  2. Use the input variables to calculate the output of each SE.

  3. Use the output generated by each SE to calculate the sigmoid function value, i.e., to determine whether the dataset sample belongs to the class or not.

  4. Use the previously mentioned evaluation metrics to calculate the performance of the SEs.
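
The four steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: `example_se` is a made-up stand-in and not one of the SEs from the repository, the toy samples and labels are invented, and the 0.5 decision threshold on the sigmoid output is assumed rather than taken from the paper. The area under the receiver operating characteristic curve is omitted to keep the sketch short.

```python
import math


def example_se(x):
    """Hypothetical stand-in for one downloaded SE (not from the repository)."""
    return 0.8 * x[0] - 1.5 * x[1] + 0.3


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(se, samples, threshold=0.5):
    """Steps 2-3: evaluate the SE, apply the sigmoid, threshold to 0 or 1."""
    # Assumption: a sample belongs to the class when sigmoid(SE) >= threshold.
    return [1 if sigmoid(se(x)) >= threshold else 0 for x in samples]


def binary_metrics(y_true, y_pred):
    """Step 4: accuracy, precision, recall and F1 from confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Step 1: toy input variables and output variable (made-up values).
X = [(2.0, 0.0), (0.0, 2.0), (1.0, 0.5), (0.0, 0.0)]
y = [1, 0, 1, 0]

y_pred = predict(example_se, X)
metrics = binary_metrics(y, y_pred)
print(y_pred, metrics)
```

For a dataset with more than two strength classes, one plausible reading of step 3 is that each class has its own SEs and the procedure is repeated per class in a one-versus-rest fashion; the sketch covers a single class only.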

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Andelić, N., Baressi Šegota, S. & Car, Z. Robust password security: a genetic programming approach with imbalanced dataset handling. Int. J. Inf. Secur. (2024). https://doi.org/10.1007/s10207-024-00814-2


  • DOI: https://doi.org/10.1007/s10207-024-00814-2
