Multi-instance genetic programming for web index recommendation
Introduction
In the last few years, the quantity of information available on the Internet has been growing so rapidly that it now exceeds human processing capabilities. Users feel overwhelmed by the amount of information available and are usually unable to locate, in a limited amount of time, the information that is truly relevant to their individual needs. In this situation, there is a pressing need for tools that anticipate the preferences of users and provide recommendations about whether or not a particular item will be of interest to the user. Such systems, referred to in the literature as recommendation systems (Felfernig, Friedrich, & Schmidt-Thieme, 2007), have features similar to traditional information retrieval approaches but differ from them, especially in the use of models that contain information about user tastes, preferences and needs. This information differs according to the type of processing performed by the system. In collaborative filtering recommender systems (Schafer, Herlocker, & Sen, 2007) the model reflects the preferences or needs of similar users, while in content-based recommender systems (Pazzani & Billsus, 2007) it maps the relationship between the items to be recommended and the preferences of a given user.
In modelling user preferences, an interesting problem is classifying web index pages into two categories (according to whether or not they are pertinent for a user), because this allows us to build a user model for a content-based recommendation system. The main difficulty in this problem lies in the representation of the training set: web index pages are pages that contain references to, or brief summaries of, other pages, and the number of references differs from page to page. Moreover, the information available about the user is imprecise. We know only whether or not the user is interested in an index page, not which specific links the user actually considers to be of interest. Recently, Zhou, Jiang, and Li (2005) have solved the problem from a multi-instance learning perspective, adapting the well-known k-Nearest Neighbor (k-NN) algorithm to this new learning framework. Experimental results show that this approach greatly improves on traditional supervised learning approaches.
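To make the representation problem concrete, the following minimal sketch shows one way a web index page could be stored as a bag of instances with a single page-level label. It is an illustration only: the names `Instance`, `Bag` and `make_bag`, and the choice of describing each linked page by the set of words in its summary, are assumptions made for this example rather than the encoding used by Zhou et al. (2005) or by G3P-MI.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# Hypothetical types for illustration; these names do not come from the paper.
Instance = FrozenSet[str]  # the terms describing one linked page


@dataclass(frozen=True)
class Bag:
    """One web index page: a variable-size list of instances plus one label."""
    instances: List[Instance]
    interesting: bool  # known only for the whole index page, not per link


def make_bag(link_summaries: List[str], interesting: bool) -> Bag:
    # Assumption: each link summary becomes one instance, the set of its
    # lowercase words. Real systems would use a richer text representation.
    instances = [frozenset(summary.lower().split()) for summary in link_summaries]
    return Bag(instances, interesting)


# Two index pages with different numbers of links, each with one label.
page_a = make_bag(["Flu season tips", "Healthy diet basics", "Sleep and stress"], True)
page_b = make_bag(["Stock market news"], False)
```

Note that the bags differ in size and that the label is attached to the bag, not to any single instance, which is exactly what prevents a direct use of fixed-length supervised learning examples.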
In spite of the interesting results reported by Zhou et al. (2005), their proposal presents two major limitations. The first is related to sparsity and scalability: the k-NN algorithm requires computations that grow linearly with the number of items, which makes it hard to scale to a large number of items while maintaining reasonable prediction performance and accuracy. The second is related to the interpretability of the discovered knowledge. The k-NN algorithm is a black-box algorithm, that is, it simply classifies web index pages as being “of interest” or “not of interest”, without providing additional information about user preferences. This is not a desirable property in recommendation systems, where any information that allows us to learn more about the interests of the user is of utmost importance for facilitating new recommendations.
To overcome the aforementioned drawbacks, we propose the use of G3P-MI, a grammar-guided genetic programming algorithm for multiple instance learning. This algorithm learns prediction rules which provide information on whether any of the links contained on a given web index page are of interest to a given user. Experimental results on several benchmarks show that this approach obtains competitive results in terms of accuracy, recall and precision. Moreover, it adds comprehensibility and clarity to the knowledge discovery process, an important characteristic alongside high predictive accuracy: the system’s results can be interpreted easily (understandable user models), and this knowledge can be used to obtain further information about the user, thus generating even more appropriate recommendations.
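The idea of a rule that flags a page when any of its links is of interest can be sketched as follows. This is a hedged illustration, not the G3P-MI rule language: the helper names (`has_term`, `all_of`, `any_of`, `predict_bag`) and the term-set instances are assumptions introduced for this example.

```python
from typing import Callable, FrozenSet, Iterable

# Hypothetical types: an instance is the set of terms of one linked page.
Instance = FrozenSet[str]
Rule = Callable[[Instance], bool]


def has_term(term: str) -> Rule:
    # Atomic condition: the linked page mentions the given term.
    return lambda inst: term in inst


def all_of(*rules: Rule) -> Rule:
    # Conjunction of conditions on a single instance.
    return lambda inst: all(r(inst) for r in rules)


def any_of(*rules: Rule) -> Rule:
    # Disjunction of conditions on a single instance.
    return lambda inst: any(r(inst) for r in rules)


def predict_bag(rule: Rule, bag: Iterable[Instance]) -> bool:
    # The index page is predicted "of interest" if at least one of its
    # linked pages satisfies the rule.
    return any(rule(inst) for inst in bag)


# A readable rule: "health AND (diet OR sleep)".
rule = all_of(has_term("health"), any_of(has_term("diet"), has_term("sleep")))

health_page = [frozenset({"stock", "market"}), frozenset({"health", "diet", "tips"})]
finance_page = [frozenset({"stock", "market"}), frozenset({"health", "insurance"})]
```

Unlike a nearest-neighbour decision, a rule of this kind can be read directly as a statement about the user's interests.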
The rest of this paper is organized as follows. Section 2 is devoted to introducing the multi-instance learning paradigm, and Section 3 describes the proposed G3P-MI algorithm. Section 4 presents Web Index Recommendation as a multi-instance learning problem. Sections 5 and 6 present and analyse the experimental setup and the results of our system. Finally, Section 7 presents conclusions and future work.
Section snippets
Multiple instance learning
The term Multiple Instance Learning was coined by Dietterich, Lathrop, and Lozano-Perez (1997) when investigating a qualitative structure–activity relationship problem. In this problem, the task consisted of determining whether or not a given substance presents pharmacological activity from information about its molecular structure. The difficulty of this task is due to the fact that a substance can present more than one spatial configuration, each of which shows different structural
Grammar-guided genetic programming for multiple instance learning
In this section we introduce G3P-MI, a grammar-guided genetic programming algorithm for multi-instance learning. In the next sections, we will introduce the following design aspects: individual representation, genetic operators, fitness function and evolutionary process.
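As background for the individual representation, the sketch below shows the general mechanism that grammar-guided genetic programming relies on: individuals are generated by randomly deriving sentences from a context-free grammar, so every individual is syntactically valid by construction. The toy grammar, the `contains(...)` terminals and the `derive` function are hypothetical and are not the grammar or operators defined for G3P-MI.

```python
import random
from typing import Dict, List

# Hypothetical toy grammar for rule conditions (not the G3P-MI grammar).
GRAMMAR: Dict[str, List[List[str]]] = {
    "<cond>": [
        ["<term>"],
        ["(", "<cond>", "AND", "<cond>", ")"],
        ["(", "<cond>", "OR", "<cond>", ")"],
    ],
    "<term>": [["contains(health)"], ["contains(diet)"], ["contains(sports)"]],
}


def derive(symbol: str, rng: random.Random, depth: int = 0, max_depth: int = 3) -> List[str]:
    """Randomly expand `symbol` into a list of terminal tokens."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal symbol: emit as-is
    productions = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, allow only productions that do not
        # re-introduce the same non-terminal, so the derivation terminates.
        productions = [p for p in productions if symbol not in p]
    tokens: List[str] = []
    for child in rng.choice(productions):
        tokens.extend(derive(child, rng, depth + 1, max_depth))
    return tokens


rng = random.Random(0)  # fixed seed so the example is reproducible
individual = derive("<cond>", rng)
rule_text = " ".join(individual)
```

Genetic operators in this family typically work on the derivation tree rather than on the token string, which is what keeps offspring within the language of the grammar.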
Web index recommendation: a multiple instance problem
Web Index Pages are pages that provide titles or brief summaries of other pages. These pages contain a lot of information through references, leaving detailed presentations to their linked pages. An example of a web index page is http://health.yahoo.com as shown in Fig. 4.
The web index recommendation problem consists of building a model that establishes exactly which web index pages interest a given user from among the contents of a myriad of web index pages that have already been
Experimental setup
This section describes the data sets that have been used in the experimentation as well as several especially relevant methodological and configuration aspects.
Results and discussion
We carry out two types of experiments. The first experiment compares the performance of our proposals with respect to the problem of Web Index Recommendation. The second experiment compares the performance of our best algorithm to other classification techniques to solve this problem. This section describes these experiments and the results obtained. Also, at the end of the section we will comment on the type of knowledge discovered with G3P-MI algorithms.
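For reference, the three measures used to compare classifiers (accuracy, precision and recall) can be computed from page-level predictions as in the sketch below. The function name `evaluate` and the example labels are assumptions made for illustration; they are not results from the experiments.

```python
from typing import Dict, Sequence


def evaluate(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Dict[str, float]:
    """Accuracy, precision and recall for binary bag-level predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # interesting, recommended
    tn = sum(1 for t, p in pairs if not t and not p)  # not interesting, not recommended
    fp = sum(1 for t, p in pairs if not t and p)      # not interesting, recommended
    fn = sum(1 for t, p in pairs if t and not p)      # interesting, missed
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels for five index pages, purely to exercise the function.
scores = evaluate(
    y_true=[True, True, True, False, False],
    y_pred=[True, True, False, True, False],
)
```

Reporting precision and recall next to accuracy matters in recommendation settings, where recommending uninteresting pages and missing interesting ones have different costs.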
Conclusions and future work
This study describes the use of the G3P-MI algorithm for recommending Web Index Pages. This algorithm applies grammar-guided genetic programming to learn rules about whether or not a page referred to on a Web Index Page is of interest to a given user. To represent a Web Index Page, the algorithm applies the multi-instance concept: each Web Index Page is represented as a set of instances, where each instance represents one of the referenced pages and stores information about that page.
Acknowledgments
This work has been subsidised in part by the research project SAINFOWEB (P05-TIC-00602) and the TIN2005-08386-C05-02, TIN2007-61079 and TIN2008-06681-C06-03 projects of the Spanish Inter-Ministerial Commission of Science and Technology (CICYT) and FEDER funds.
References (36)
- et al. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence (1997)
- et al. An EM based multiple instance learning method for image classification. Expert Systems with Applications (2008)
- et al. Fault detection using genetic programming. Mechanical Systems and Signal Processing (2005)
- Andrews, S., Tsochantaridis, I., & Hofmann, T. (2002). Support vector machines for multiple-instance learning. In...
- On learning from multi-instance examples: Empirical evaluation of a theoretical approach
- et al. Genetic programming: An introduction (1998)
- Chai, Y.-M., & Yang, Z.-W. (2007). A multi-instance learning algorithm based on normalized radial basis function...
- et al. MILES: Multiple-instance learning via embedded instance selection. IEEE Transactions on Pattern Analysis and Machine Intelligence (2006)
- et al. Image categorization by learning and reasoning with regions. Journal of Machine Learning Research (2004)
- Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research (2006)
- Guest editors’ introduction: Recommender systems. IEEE Intelligent Systems
- Multi-instance kernels
- Fuzzy rule-based expert systems and genetic machine learning
- Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems
- A note on learning from multiple-instance examples. Machine Learning
- PAC learning axis-aligned rectangles with respect to product distributions from multiple-instance examples. Machine Learning