Elsevier

Expert Systems with Applications

Volume 46, 15 March 2016, Pages 15-23

Combination of genetic network programming and knapsack problem to support record clustering on distributed databases

https://doi.org/10.1016/j.eswa.2015.10.006

Highlights

  • A decision support algorithm for record clustering in databases is proposed.

  • A capacity limitation problem is introduced to make the clustering application more general.

  • Rule extraction from datasets is realized by the proposed evolutionary algorithm.

  • Rule clustering under capacity limitation is solved as a knapsack problem.

  • The simulations of record clustering show some advantages of the proposed method.

Abstract

This research implements genetic network programming (GNP) and standard dynamic programming for the knapsack problem (KP) as a decision support system for record clustering in distributed databases. The background of the proposed method is the fragment allocation problem with storage capacity limitation, in which sets of fragments are distributed to several sites (clusters). The total amount of fragments in each site must not exceed the capacity of that site, while the distribution process must preserve the relation (similarity) between fragments within each site. The objective is to distribute big data to sites with limited capacities while considering the similarity of the data distributed to each site. To solve this problem, GNP is used to extract rules from big data by considering the characteristics (value ranges) of each attribute in a dataset. The proposed method also provides a partial random rule extraction method in GNP to discover frequent patterns in a database, which improves the clustering algorithm, especially for large data problems. The concept of KP is applied to the storage capacity problem, and standard dynamic programming is used to distribute rules to each site by considering the similarity (value) and data amount (weight) related to each rule so as to match the site capacities. The simulation results show that the proposed method has some advantages over conventional clustering algorithms. The proposed method therefore provides a new clustering method that handles an additional storage capacity problem.

Introduction

A distributed database management system (DDBMS) can be a solution for large-scale information systems with rapidly growing amounts of data and data accesses. A distributed database (DDB) is a collection of data that logically belongs to the same system but is spread over the sites of a computer network (Fig. 1). A DDBMS is then defined as a software system that permits the management of the DDB and makes the distribution of data between databases and software transparent to the users (Bhuyar, Gawande, Deshmukh, 2012, Zilio, Rao, Lightstone, Lohman, Storm, Garcia-Arellano, Fadden, 2004).

To handle the data proliferation, efficient access methods and data storage techniques have become increasingly critical to maintain an acceptable query response time. One way to improve query response time is to reduce the number of disk I/Os by clustering the database vertically (attribute clustering) and/or horizontally (record clustering) (Guinepain, Gruenwald, 2006, Guinepain, Gruenwald, 2008). Improvements in the retrieval time of multi-attribute records can be attained if similar records are grouped close together in the file space as a result of restructuring. This is because fewer page transfers are required as the probability of two or more of the target records residing in the same page of storage is increased (Lowden & Kitsopanidis, 1993).
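As a rough illustration of this effect, the following Python sketch counts how many storage pages hold at least one target record under two file layouts. The record identifiers, the page size of four records, and the helper name pages_touched are assumptions made for the example and do not come from the cited works.

```python
def pages_touched(layout, targets, page_size):
    """Count the distinct pages that hold at least one target record.

    layout    -- record ids in the order they are stored in the file
    targets   -- record ids a query needs to read
    page_size -- number of records per storage page (assumed fixed)
    """
    position = {rec: i for i, rec in enumerate(layout)}
    return len({position[rec] // page_size for rec in targets})


# Hypothetical file: records of four groups (a, b, c, d) stored interleaved.
scattered = ["a1", "b1", "c1", "d1",
             "a2", "b2", "c2", "d2",
             "a3", "b3", "c3", "d3"]

# The same records after grouping similar ones next to each other.
clustered = sorted(scattered)

# A query that retrieves all records of group "a".
targets = ["a1", "a2", "a3"]

print(pages_touched(scattered, targets, page_size=4))  # 3
print(pages_touched(clustered, targets, page_size=4))  # 1
```

With the interleaved layout the three target records fall on three different pages, whereas after grouping they share a single page, so the same query needs fewer page transfers.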

In this paper, a novel method for record clustering is proposed that combines genetic network programming (GNP) (Mabu, Chen, Lu, Shimada, Hirasawa, 2011, Shimada, Hirasawa, Hu, 2006) with standard dynamic programming for solving knapsack problems (KP) (Lai, Singh, 2011). The hypotheses of this research are that the implementation of GNP for data mining can create effective clusters from complicated datasets, and that the concept of KP can be used to define the problem of distributing fragments to several sites considering value (similarity of data) and mass (data size) in a DDBMS. Therefore, the proposed method could be a solution to the fragment allocation and site storage capacity problems.

This paper is organized as follows. Section 2 reviews the proposed framework, Section 3 reviews the literature, Section 4 describes the detailed algorithm of the proposed framework, Section 5 shows the simulation results, and finally Section 6 is devoted to conclusions.

Section snippets

Genetic network programming

GNP is an evolutionary optimization technique that uses directed graph structures instead of the strings of genetic algorithms (Holland, 1975) or the trees of genetic programming (Koza, 1992). This enhances the representation ability and yields compact programs, owing to the re-usability of nodes in a graph structure.
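The graph representation can be sketched in Python as follows. This is only an illustrative toy and not the authors' implementation: the node table, the attribute names age and income, the thresholds, the cluster labels, and the function run are all hypothetical, and the evolutionary operators that GNP applies to such graphs are omitted. The sketch uses nodes that test a record and choose a branch, and nodes that emit a result.

```python
# Hypothetical node table: node id -> (kind, payload, successor ids).
# "judge" nodes hold a function returning the index of the branch to follow;
# "act" nodes hold a label that is emitted as the result.
NODES = {
    0: ("judge", lambda rec: 0 if rec["age"] < 30 else 1, (1, 2)),
    1: ("judge", lambda rec: 0 if rec["income"] >= 50 else 1, (3, 2)),
    2: ("act", "cluster_B", (0,)),   # shared by nodes 0 and 1
    3: ("act", "cluster_A", (0,)),
}


def run(nodes, record, start=0, max_steps=10):
    """Follow node transitions from `start` until the first "act" node.

    Returns the emitted label (or None if max_steps is exhausted)
    together with the list of visited node ids.
    """
    path = []
    current = start
    for _ in range(max_steps):
        kind, payload, successors = nodes[current]
        path.append(current)
        if kind == "act":
            return payload, path
        current = successors[payload(record)]
    return None, path


print(run(NODES, {"age": 25, "income": 60}))  # ('cluster_A', [0, 1, 3])
print(run(NODES, {"age": 25, "income": 10}))  # ('cluster_B', [0, 1, 2])
print(run(NODES, {"age": 40, "income": 60}))  # ('cluster_B', [0, 2])
```

Node 2 is reached from two different nodes without being duplicated, which is the kind of node re-use that a tree representation would need a copied subtree to express.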

In GNP, nodes are interpreted as the minimum units of judgement and action, and node transition represents rules of the program. After starting the node transition from the start

Literature review

The proposed method uses the GNP algorithm for data mining proposed in (Mabu et al., 2011) and applies it to the storage capacity problem of fragment allocation in distributed databases introduced in (Özsu & Valduriez, 2011). This research involves the implementation of genetic network programming (GNP) for data mining and standard dynamic programming to solve the knapsack problem (KP) for rule-based clustering. Introducing storage capacity

Combination of GNP and knapsack problem

The implementation of record clustering is separated into two parts: GNP rule extraction, and rule distribution based on standard dynamic programming for solving the knapsack problem, which are explained in Sections 4.1 and 4.2, respectively. In addition, the complexity analysis of the entire clustering process is described in Section 4.3.
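The rule distribution part can be illustrated with the textbook 0/1 knapsack dynamic program, where each rule carries a value (similarity) and a weight (data amount) and each site has a capacity. The Python sketch below is a minimal example under stated assumptions, not the algorithm of Section 4.2: the function names, the integer-valued example rules, and in particular the site-by-site loop in distribute_rules are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def knapsack_select(values: Sequence[int], weights: Sequence[int],
                    capacity: int) -> Tuple[int, List[int]]:
    """Standard 0/1 knapsack DP.

    Returns the best total value and the indices of the chosen items,
    with total weight not exceeding `capacity` (integer weights assumed).
    """
    n = len(values)
    # best[i][c] = best value using only the first i items within capacity c
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v, w = values[i - 1], weights[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]                # skip item i-1
            if w <= c and best[i - 1][c - w] + v > best[i][c]:
                best[i][c] = best[i - 1][c - w] + v    # take item i-1
    # Backtrack through the table to recover which items were taken.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= weights[i - 1]
    chosen.reverse()
    return best[n][capacity], chosen


def distribute_rules(rules: Sequence[Tuple[int, int]],
                     site_capacities: Sequence[int]):
    """Assumed scheme: fill the sites one after another, each time solving
    a knapsack over the rules that have not been placed yet.

    rules -- (value, weight) pairs, i.e. (similarity, data amount)
    """
    remaining = list(range(len(rules)))
    allocation = []
    for cap in site_capacities:
        vals = [rules[i][0] for i in remaining]
        wts = [rules[i][1] for i in remaining]
        _, picked = knapsack_select(vals, wts, cap)
        site = [remaining[p] for p in picked]
        allocation.append(site)
        placed = set(site)
        remaining = [i for i in remaining if i not in placed]
    return allocation, remaining


# Hypothetical rules as (similarity value, data amount) and two site capacities.
rules = [(60, 10), (100, 20), (120, 30)]
print(knapsack_select([60, 100, 120], [10, 20, 30], 50))  # (220, [1, 2])
print(distribute_rules(rules, [50, 10]))                  # ([[1, 2], [0]], [])
```

The table has one row per rule and one column per unit of capacity, so filling a single site costs time proportional to the number of rules times the site capacity. Rules that fit no site are returned in the second element so that they can be handled separately.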

Simulations

First, the full random and partial random methods in the rule extraction of GNP are compared. Then, the knapsack rule distribution is carried out and its results are analyzed. Finally, clustering simulations using six datasets downloaded from the UCI Machine Learning Repository (archive.ics.uci.edu/ml/) are executed and their results are compared with those of five other conventional clustering algorithms.

Conclusions

This paper proposes a novel clustering method combining genetic network programming and the knapsack problem to handle record clustering. The proposed method can find good combinations of attributes to create rules for clustering, and also considers the capacity of sites when distributing rules.

The proposed method provides a new clustering method with an additional storage capacity problem that is compatible with big data with a large number of attributes, samples and clusters, and the clustering
