Numerical sensitive data recognition based on hybrid gene expression programming for active distribution networks

https://doi.org/10.1016/j.asoc.2020.106213

Abstract

Complex and flexible access modes and frequent data interaction introduce significant security risks to data transmission in active distribution networks. Ensuring data security is critical to the safe and stable operation of these networks. Traditional methods, such as access control, data encryption, and text filtering based on intelligent algorithms, struggle to secure the transmission of dynamically growing, high-dimensional numerical data in active distribution networks. In this paper, we first propose a rough feature selection algorithm based on the average importance measurement (RFS-AIM) to reduce the complexity of data recognition. Then, we propose a sensitive data recognition function mining algorithm based on RFS-AIM and improved gene expression programming (SDR-IGEP), in which the population update operation is constructed from chromosome similarity based on the Jaccard coefficient. This operation avoids local convergence of gene expression programming by increasing individual diversity in the new population. Finally, we present a new incremental mining algorithm for the sensitive data recognition function based on global function fitting (ISDR-GFF), using a grain granulation model for incremental datasets. Experimental results on IEEE benchmark datasets and real datasets show that the proposed algorithms outperform state-of-the-art algorithms in terms of average running time, precision, recall, F1 index, accuracy, specificity and speedup on all experimental datasets.
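As an illustration of the diversity-preserving population update mentioned in the abstract, the following minimal Python sketch computes a Jaccard-based chromosome similarity and greedily filters out near-duplicate chromosomes. The representation of a chromosome as a set of (position, symbol) pairs, the function names, and the similarity threshold are assumptions made for illustration only; this excerpt does not give the exact definition used by SDR-IGEP.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def chromosome_similarity(c1, c2):
    """Similarity of two chromosomes given as symbol strings.

    Assumption: each chromosome is represented by its set of
    (position, symbol) pairs, so only matching symbols at the
    same locus count as shared.
    """
    return jaccard(set(enumerate(c1)), set(enumerate(c2)))


def select_diverse(candidates, size, threshold=0.8):
    """Greedy diversity filter (hypothetical helper).

    `candidates` is assumed to be ordered from best to worst fitness.
    A candidate is kept only if its similarity to every chromosome
    already kept is below `threshold`.
    """
    kept = []
    for chromosome in candidates:
        if all(chromosome_similarity(chromosome, k) < threshold for k in kept):
            kept.append(chromosome)
        if len(kept) == size:
            break
    return kept


if __name__ == "__main__":
    population = ["+ab", "+ab", "-ab", "*cd"]
    print(chromosome_similarity("+ab", "-ab"))
    print(select_diverse(population, size=3, threshold=0.6))
```

With this representation, two equal-length chromosomes that agree at m of n positions have similarity m / (2n - m), so exact duplicates score 1 and are dropped by any threshold at or below 1.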

Introduction

As an important part of information security in active distribution networks, data security has received increasing attention from researchers in recent years [1], [2]. However, data security is more complicated in active distribution networks than in traditional distribution networks because of their complex network architecture and frequent data interaction.

With the extensive application of advanced information and communication technologies such as wireless communication and the Internet of Things, active distribution networks face an increasingly serious threat from viruses, Trojan horses and hacker attacks originating on the Internet [3], [4], [5], [6]. In particular, with the continued construction of a strong smart grid, the active distribution network has a more complex access environment, flexible and diverse access methods, a large number of intelligent access terminals, and massive, dynamically distributed access data. Notably, as a large number of distributed loads connect to the active distribution network, the interaction between the power grid and its users is greatly enhanced. This makes it possible for users' electricity data and the monitoring data of equipment operating states to be stolen, modified, or injected with bad data during network communication. Therefore, ensuring the confidentiality, integrity and non-repudiation of business data has become a research hotspot in data security protection for active distribution networks.

Traditional data security protection methods, such as data encryption [7] and access control [8], place high demands on the computing power, storage capacity, and network transmission bandwidth of users and servers in an active distribution network. More importantly, these methods cannot identify the business data transmitted in the active distribution network at the content level, so they provide only passive protection. To address this, many researchers have proposed intelligent data filtering based on content recognition [9], [10], which achieves content-level security protection and, in contrast to passive measures, is a form of proactive protection. However, existing data content filtering algorithms are based on text classification; they focus on text and cannot effectively prevent the leakage of numerical data from SCADA systems, AMI systems or smart meters in active distribution networks.

Gene expression programming (GEP), first proposed by Ferreira [11], is an evolutionary algorithm. GEP has powerful classification and function mining capabilities [12], [13], which make it well suited to sensitive data identification. Therefore, in this paper, we propose a novel parallel numerical data recognition algorithm for active distribution networks based on feature selection and improved gene expression programming. It combines the advantages of rough sets and gene expression programming to better protect the security of numerical data transmission in active distribution networks. The major contributions of our work are as follows:

  • To reduce the complexity of the sensitive data recognition model, this paper proposes a rough feature selection algorithm based on average importance measurement (RFS-AIM). Its purpose is to quantitatively analyze the importance of each feature that remains after reduction to the final decision feature.

  • Building on RFS-AIM, we propose a sensitive data recognition function mining algorithm based on feature selection and improved gene expression programming (SDR-IGEP). Using the concept of chromosome similarity, the algorithm improves the genetic operations of traditional GEP and prevents the GEP population from falling into a local optimum.

  • An incremental mining algorithm for the sensitive data recognition function based on global function fitting (ISDR-GFF) is proposed to handle recognition function mining on the continually growing data in active distribution networks. This algorithm constructs a parallel function mining architecture based on grain granulation. In addition, a multi-population grafting operation based on population similarity is proposed to improve population diversity on the computing nodes and to increase the convergence speed of the GEP population.

  • Experimental results on IEEE benchmark datasets and real datasets show that the proposed algorithms outperform the other, traditional algorithms in terms of the precision, recall, F1 index, accuracy and specificity of sensitive data recognition, as well as the average running time and speedup.
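For reference, the evaluation measures named in the last contribution can be sketched in Python as follows. This is a minimal illustration using the conventional confusion-matrix definitions, not the authors' evaluation code. The label encoding (1 = sensitive, 0 = non-sensitive), the function names, and the definition of speedup as serial time divided by parallel time are assumptions, since the excerpt does not state them.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (assumed 1 = sensitive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def recognition_metrics(y_true, y_pred):
    """Standard precision, recall, F1, accuracy and specificity."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "specificity": safe_div(tn, tn + fp),
    }


def speedup(serial_time, parallel_time):
    """Conventional speedup: serial running time over parallel running time."""
    return serial_time / parallel_time


if __name__ == "__main__":
    # Made-up labels purely for demonstration, not data from the paper.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    print(recognition_metrics(y_true, y_pred))
    print(speedup(100.0, 25.0))
```

Specificity is reported alongside recall because it measures how well non-sensitive records are left unflagged, which accuracy alone can hide when the two classes are imbalanced.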

The remainder of this paper is organized as follows. Section 2 gives a detailed overview of related work. Section 3 presents the rough feature selection algorithm based on average importance measurement. Section 4 proposes a sensitive data recognition function mining algorithm based on feature selection and improved gene expression programming. Section 5 designs an incremental mining algorithm for the sensitive data recognition function based on global function fitting. Section 6 evaluates the performance of the proposed algorithms, with experimental results and analyses on IEEE benchmark datasets and real load datasets. Conclusions are given in the last section.

Section snippets

Data security of smart grid

The access of various distributed energy sources and flexible loads makes the interaction of various intelligent terminals and users in smart grid more and more frequent and the distribution of data leakage points more and more extensive. Due to the insecurity of smart grid, the data in the cyber and physical system of the distribution network must address numerous security attacks and threats. Attia et al. [14] presented an intrusion detection system architecture for detecting illegal attacks

Rough feature selection algorithm based on average importance measurement

An active distribution network is a classical cyber-physical system. The sources of, and interaction between, the information flow and power flow in an active distribution network are shown in Fig. 1 [32]. Fig. 1 shows that the data in the active distribution network come from a wide range of sources, mainly power distribution SCADA systems or DMS systems, advanced metering infrastructure (AMI), smart meters and so on. There are many types of data, including the status data of distribution lines, alarm

Sensitive data recognition function mining algorithm based on feature selection and improved gene expression programming

Data from distribution SCADA systems, various status monitoring systems, AMIs, smart meters, and operational logs are the core assets of active distribution networks. These data come from various aspects such as power monitoring, equipment operation, and power consumption. In general, they are characterized by having a wide range of sources, having a large scale, and being composed of complex types. However, with the widespread use of various types of wireless communication technologies in

Incremental mining algorithm of the sensitive data recognition function based on global function fitting

The operation of the active distribution network is not static. Because of the unstable output of distributed generation and strong load fluctuation, all types of electrical information and topological data change dynamically during the operation of the active distribution network. Meanwhile, the active distribution network has a wide range of data sources and a wide geographical distribution. There are many system parameters that affect the safe and stable operation of

Experimental environment

To demonstrate the effectiveness and feasibility of the proposed algorithms, the related experiments are performed in a laboratory environment. The hardware in the experiments includes five algorithm servers and one administration server. The hardware and software configurations are shown in Tables 1 and 2, respectively.

In Table 1, the algorithm server is used mainly to execute the sensitive data recognition function mining algorithm, and the management server is used mainly to manage the

Conclusions

The safe and efficient transmission of data is critical to the safe and stable operation of active distribution network business systems. The existing data transmission security protection in an active distribution network focuses on unstructured and structured data such as text and database, and is mainly based on centralized text classification algorithms and traditional security defense methods such as access control and encryption. Therefore, to solve the intelligent recognition of

CRediT authorship contribution statement

Song Deng: Conceptualization, Methodology, Software, Writing - original draft, Formal analysis, Funding acquisition. Xiangpeng Xie: Data curation, Writing - review & editing. Changan Yuan: Resources, Investigation. Lechan Yang: Investigation. Xindong Wu: Project administration, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We would like to thank the anonymous reviewers for their comments and constructive suggestions, which have improved the paper. This work is sponsored by the National Natural Science Foundation of PR China (Nos. 51977113, 51507084) and the Science Foundation of Nanjing University of Posts and Telecommunications (NUPTSF), China (No. NY219095).

References (40)

  • Bou-Harb E. et al., Communication security for smart grid distribution networks, IEEE Commun. Mag. (2013)
  • Arefifar S.A. et al., Optimum microgrid design for enhancing reliability and supply-security, IEEE Trans. Smart Grid (2013)
  • Jayaweera D. et al., Steady-state security in distribution networks with large wind farms, J. Mod. Power Syst. Clean Energy (2014)
  • Guan Z. et al., Achieving efficient and secure data acquisition for cloud-supported internet of things in smart grid, IEEE Internet Things J. (2017)
  • Ruj S. et al., A decentralized security framework for data aggregation and access control in smart grids, IEEE Trans. Smart Grid (2013)
  • Aggarwal C.C., Data Classification: Algorithms and Applications (2014)
  • Gu X. et al., A new cross-multidomain classification algorithm and its fast version for large datasets, Acta Autom. Sin. (2014)
  • Ferreira C., Genetic representation and genetic neutrality in gene expression programming, Adv. Complex Syst. (2002)
  • Zhong J. et al., Gene expression programming: A survey, IEEE Comput. Intell. Mag. (2017)
  • Q. Li, S. Li, B. Xu, Y. Liu, Data-driven attacks and data recovery with noise on state estimation of smart grid, J....