Abstract
Developing a method for determining password strength using artificial intelligence (AI) is crucial, as it enhances cybersecurity by providing a more robust defense against unauthorized access. AI can analyze complex patterns and trends, allowing weak passwords and potential vulnerabilities to be identified more effectively than with traditional methods. This proactive approach helps users and organizations strengthen their security posture, reducing the risk of data breaches and unauthorized intrusions. In this paper, the genetic programming symbolic classifier (GPSC) was applied to a publicly available dataset to obtain a set of symbolic expressions (SEs) for password strength classification with high classification accuracy. One of the problems with the dataset was an imbalance between classes, so various oversampling/undersampling techniques were utilized. The optimal GPSC hyperparameter values were found using the random hyperparameter value search method. The algorithm was trained using fivefold cross-validation (5FCV). To evaluate the obtained SEs, the evaluation metrics accuracy, area under the receiver operating characteristic curve, precision, recall, and F1-score were used. The SEs obtained on the balanced dataset variations achieved high classification accuracy (0.99), and applying all SEs to the entire original imbalanced dataset achieved the same accuracy.
Acknowledgements
This research was (partly) supported by the CEEPUS network CIII-HR-0108, the European Regional Development Fund under Grant KK.01.1.1.01.0009 (DATACROSS), the Erasmus+ project WICT under Grant 2021-1-HR01-KA220-HED-000031177, and the University of Rijeka Scientific Grants uniri-mladi-technic-22-61 and uniri-tehnic-18-275-1447.
Appendix
1.1 Appendix A.1. The modified mathematical functions used in GPSC
In the description of the GPSC and the random hyperparameter value search (RHVS) method, it was mentioned that mathematical functions such as division, square root, natural logarithm, and logarithms with bases 2 and 10 had to be modified to avoid generating infinite or not-a-number (NaN) values. The mathematical function of division can be written in the following form:
The natural logarithm and the logarithms with bases 2 and 10 can be defined as:
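As an illustration of what such modified ("protected") functions can look like, the following minimal sketch follows one common convention, similar to the protected functions built into the gplearn library. The threshold `EPS = 0.001`, the fallback values 1 and 0, and the function names are assumptions made for this sketch; they are not taken from the definitions used in this paper.

```python
import numpy as np

EPS = 0.001  # assumed threshold below which a value is treated as zero


def protected_div(x1, x2):
    """Return x1 / x2 where |x2| > EPS, and the fallback value 1.0 elsewhere."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # The raw quotient may contain inf/NaN; np.where discards those entries.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(np.abs(x2) > EPS, x1 / x2, 1.0)


def protected_sqrt(x):
    """Square root of the absolute value, so negative inputs never give NaN."""
    return np.sqrt(np.abs(np.asarray(x, dtype=float)))


def protected_log(x, base=np.e):
    """Logarithm of |x| in the given base where |x| > EPS, and 0.0 elsewhere."""
    x = np.abs(np.asarray(x, dtype=float))
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(x > EPS, np.log(x) / np.log(base), 0.0)


# The base-2 and base-10 variants reuse the same guard.
def protected_log2(x):
    return protected_log(x, base=2.0)


def protected_log10(x):
    return protected_log(x, base=10.0)
```

With guards of this kind, every function returns a finite value for any finite input, so an SE built from them cannot produce infinite or NaN values during evaluation.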
1.2 Appendix A.2. How to obtain and use the SEs from this research
Due to the large number of SEs obtained in this paper, the SEs are not shown. The SEs can be obtained from the GitHub repository (web-link: https://github.com/nandelic2022/PasswordStrengthEquations.git). After downloading the SEs, the procedure for using them consists of the following steps:
1. From the initial dataset, define the input variables and the output variable.
2. Use the input variables to calculate the output of each SE.
3. Use the output generated by each SE to calculate the sigmoid function value, i.e., to determine whether the dataset sample belongs to the class or not.
4. Use the previously mentioned evaluation metrics to calculate the performance of the SEs.
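The steps above can be sketched as follows. This is a minimal illustration, not the code from the repository: `example_se` is a hypothetical stand-in for one downloaded SE, the toy input matrix and labels are invented for the example, and the decision threshold of 0.5 on the sigmoid output is an assumption.

```python
import numpy as np


def sigmoid(z):
    """Map the raw SE output to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def example_se(X):
    """Hypothetical SE; a real one would be copied from the repository."""
    return 0.5 * X[:, 0] - 2.0 + np.sqrt(np.abs(X[:, 1]))


def classify(se, X, threshold=0.5):
    """Steps 2 and 3: evaluate the SE, apply the sigmoid, then threshold."""
    return (sigmoid(se(X)) >= threshold).astype(int)


def evaluate(y_true, y_pred):
    """Step 4: accuracy, precision, recall, and F1-score for one class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": float(np.mean(y_true == y_pred)),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Step 1: input variables (columns of X) and output variable (y); toy values.
X = np.array([[10.0, 4.0], [1.0, 0.0], [6.0, 1.0]])
y = np.array([1, 0, 1])

y_pred = classify(example_se, X)
scores = evaluate(y, y_pred)
```

Because each SE answers whether a sample belongs to one class or not, a multi-class label would presumably require one such evaluation per class, with the binary target built from the original output variable.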
About this article
Cite this article
Andelić, N., Baressi S̆egota, S. & Car, Z. Robust password security: a genetic programming approach with imbalanced dataset handling. Int. J. Inf. Secur. (2024). https://doi.org/10.1007/s10207-024-00814-2
DOI: https://doi.org/10.1007/s10207-024-00814-2