Combination of genetic network programming and knapsack problem to support record clustering on distributed databases
Introduction
A distributed database management system (DDBMS) can be a solution for large-scale information systems that face rapid growth in data volume and data accesses. A distributed database (DDB) is a collection of data that logically belongs to the same system but is spread over the sites of a computer network (Fig. 1). A DDBMS is then defined as a software system that manages a DDB and makes the distribution of data between databases and software transparent to users (Bhuyar, Gawande, & Deshmukh, 2012; Zilio, Rao, Lightstone, Lohman, Storm, Garcia-Arellano, & Fadden, 2004).
To handle data proliferation, efficient access methods and data storage techniques have become increasingly critical for maintaining an acceptable query response time. One way to improve query response time is to reduce the number of disk I/Os by clustering the database vertically (attribute clustering) and/or horizontally (record clustering) (Guinepain & Gruenwald, 2006, 2008). The retrieval time of multi-attribute records improves when restructuring places similar records close together in the file space, because the probability that two or more target records reside in the same storage page increases and fewer page transfers are required (Lowden & Kitsopanidis, 1993).
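To make the page-transfer argument concrete, the following minimal sketch counts how many storage pages a query has to read before and after similar records are placed next to each other. The page size, the toy records, and the helper name `pages_touched` are assumptions introduced for illustration only; they are not taken from the paper.

```python
def pages_touched(records, page_size, predicate):
    """Count the distinct pages that hold at least one matching record.

    Records are assumed to be stored in list order, page_size records per page.
    """
    return len({i // page_size for i, r in enumerate(records) if predicate(r)})


# Hypothetical file of 12 records described by a single categorical attribute.
unclustered = ["a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c"]
# Stand-in for record clustering: similar records are stored contiguously.
clustered = sorted(unclustered)


def is_a(record):
    return record == "a"


before = pages_touched(unclustered, page_size=4, predicate=is_a)
after = pages_touched(clustered, page_size=4, predicate=is_a)
print(before, after)
```

In this toy layout the same query reads three pages on the unclustered file and one page on the clustered file, which is the effect the paragraph above describes.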
In this paper, a novel method for record clustering is proposed that combines genetic network programming (GNP) (Mabu, Chen, Lu, Shimada, & Hirasawa, 2011; Shimada, Hirasawa, & Hu, 2006) with standard dynamic programming for solving knapsack problems (KP) (Lai & Singh, 2011). The hypotheses of this research are that GNP for data mining can create effective clusters from complicated datasets, and that the concept of KP can be used to define the problem of distributing fragments to several sites, considering value (similarity of data) and mass (data size) in a DDBMS. The method could therefore be a solution to the fragment allocation and site storage capacity problems.
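As a rough illustration of the knapsack view, the sketch below solves a single-site 0/1 knapsack by standard dynamic programming, treating each fragment as an item whose value stands for data similarity and whose mass stands for data size. The function name `knapsack`, the example numbers, and the restriction to one site with integer sizes are assumptions made here; the paper's actual formulation may differ.

```python
def knapsack(items, capacity):
    """0/1 knapsack by dynamic programming.

    items: list of (value, mass) pairs with non-negative integer mass.
    Returns (best_total_value, sorted indices of the chosen items).
    """
    n = len(items)
    # best[i][c] = maximum value using the first i items within capacity c
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i, (value, mass) in enumerate(items, start=1):
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # skip item i
            if mass <= c:  # or take it, if it fits
                best[i][c] = max(best[i][c], best[i - 1][c - mass] + value)

    # Walk the table backwards to recover which items were taken.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= items[i - 1][1]
    return best[n][capacity], sorted(chosen)


# Hypothetical fragments: (similarity value, data size).
fragments = [(60, 10), (100, 20), (120, 30)]
site_capacity = 50  # assumed storage capacity of one site
total, picked = knapsack(fragments, site_capacity)
print(total, picked)
```

The table has (n + 1) x (capacity + 1) entries, so the running time grows with the product of the number of fragments and the site capacity.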
This paper is organized as follows. Section 2 reviews the proposed framework, Section 3 reviews the literature, Section 4 describes the detailed algorithm of the proposed framework, Section 5 shows the simulation results, and Section 6 is devoted to conclusions.
Section snippets
Genetic network programming
GNP is an evolutionary optimization technique that uses directed graph structures instead of the strings of genetic algorithms (Holland, 1975) or the trees of genetic programming (Koza, 1992). Because nodes in a graph structure can be re-used, GNP yields compact programs with enhanced representation ability.
In GNP, nodes are interpreted as the minimum units of judgement and action, and node transition represents rules of the program. After starting the node transition from the start
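The node-transition idea can be sketched as a small directed graph in which judgement nodes branch on an attribute value and action nodes perform an action. The dictionary layout, the attribute names, and the function `run_gnp` are hypothetical and chosen only for illustration; the sketch shows a single fixed individual and omits the evolutionary operators entirely.

```python
def run_gnp(nodes, start_id, record, max_steps=10):
    """Follow node transitions from start_id and return the visited trace."""
    trace, current = [], start_id
    for _ in range(max_steps):
        node = nodes[current]
        if node["kind"] == "judgement":
            # A judgement node inspects one attribute and branches on its value.
            outcome = record[node["attribute"]]
            trace.append((node["attribute"], outcome))
            current = node["branches"][outcome]
        else:
            # An action node performs its action, then moves on or stops.
            trace.append(("action", node["action"]))
            if node["next"] is None:
                break
            current = node["next"]
    return trace


# Hypothetical graph: node 2 is reachable from both node 0 and node 1,
# which illustrates the re-use of nodes in a graph structure.
nodes = {
    0: {"kind": "judgement", "attribute": "colour",
        "branches": {"red": 1, "blue": 2}},
    1: {"kind": "judgement", "attribute": "size",
        "branches": {"small": 2, "large": 3}},
    2: {"kind": "action", "action": "cluster_A", "next": None},
    3: {"kind": "action", "action": "cluster_B", "next": None},
}

trace = run_gnp(nodes, 0, {"colour": "red", "size": "large"})
print(trace)
```

Each trace reads as a rule: the sequence of judgements is the condition and the final action is the consequence.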
Literature review
The proposed method uses the GNP algorithm for data mining proposed by Mabu et al. (2011) and applies it to the storage capacity problem of fragment allocation in distributed databases introduced by Özsu & Valduriez (2011). This research involves the implementation of genetic network programming (GNP) for data mining and standard dynamic programming to solve the knapsack problem (KP) for rule-based clustering. Introducing storage capacity
Combination of GNP and knapsack problem
The implementation of record clustering is separated into two parts: GNP rule extraction, and rule distribution based on standard dynamic programming for solving the knapsack problem. These parts are explained in Sections 4.1 and 4.2, respectively. In addition, the complexity analysis of the entire clustering process is described in Section 4.3.
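One way to picture the link between the two parts is that each extracted rule defines a cluster of the records it matches, and the size of that cluster might then serve as the mass of the rule during distribution. The sketch below shows only this matching step. The rule format (a mapping from attribute to required value), the first-match policy, and the name `assign_to_rules` are assumptions for illustration and are not the paper's algorithm.

```python
def assign_to_rules(records, rules):
    """Group record indices under the first rule whose conditions all hold.

    records: list of dicts mapping attribute name to value.
    rules:   dict mapping rule name to a dict of required attribute values.
    Returns (clusters, unmatched_indices).
    """
    clusters = {name: [] for name in rules}
    unmatched = []
    for idx, record in enumerate(records):
        for name, conditions in rules.items():
            if all(record.get(a) == v for a, v in conditions.items()):
                clusters[name].append(idx)
                break
        else:
            # No rule matched this record.
            unmatched.append(idx)
    return clusters, unmatched


# Hypothetical records and rules.
records = [
    {"dept": "sales", "region": "east"},
    {"dept": "sales", "region": "west"},
    {"dept": "hr", "region": "east"},
    {"dept": "it", "region": "west"},
]
rules = {
    "r1": {"dept": "sales", "region": "east"},
    "r2": {"region": "west"},
}

clusters, unmatched = assign_to_rules(records, rules)
print(clusters, unmatched)
```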
Simulations
First, the full random and partial random methods in the rule extraction of GNP are compared. Then, the knapsack rule distribution is carried out and its results are analyzed. Finally, clustering simulations using six datasets downloaded from the UCI Machine Learning Repository (archive.ics.uci.edu/ml/) are executed and their results are compared with those of five other conventional clustering algorithms.
Conclusions
This paper proposes a novel clustering method combining genetic network programming and the knapsack problem to handle record clustering. The proposed method can find good combinations of attributes to create rules for clustering, and also considers the capacity of sites when distributing rules.
The proposed method provides a new clustering method with an additional storage capacity problem that is compatible with big data with a large number of attributes, samples and clusters, and the clustering
References (23)
- et al. A k-mean clustering algorithm for mixed numeric and categorical data. Data & Knowledge Engineering (2007)
- et al. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences (1984)
- Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics (1987)
- et al. Automated feature weighting in clustering with separable distances and inner product induced norms–a theoretical generalization. Pattern Recognition Letters (2015)
- et al. DB2 design advisor: Integrated automatic physical database design. Proceedings of the 30th international conference on very large data bases - volume 30 (2004)
- et al. Horizontal fragmentation technique in distributed database. International Journal of Scientific and Research Publications (2012)
- et al. Evolutionary fine-tuning of automated semantic annotation systems. Expert Systems with Applications (2015)
- et al. Automatic database clustering using data mining. Proceedings of the 17th international workshop on database and expert systems applications (DEXA’06) (2006)
- et al. Using cluster computing to support automatic and dynamic database clustering. Proceedings of the 2008 IEEE international conference on cluster computing (2008)
- Adaptation in natural and artificial systems (1975)
- Chameleon: Hierarchical clustering using dynamic modeling. Computer