First, we propose contextual combinatorial cascading bandits. At each time step, a learning agent is given a set of contextual information, selects a list of items, and observes the stochastic outcomes of a prefix of the selected items through cascading click feedback. The cascade model assumes the user checks the list from the first item to the last and stops at the first satisfying one. We consider position discounts in the list order, so that the agent's reward is discounted depending on the position of the clicked item. Our setting generalizes existing studies with contextual information, position discounts, and a more general reward function. We design an algorithm with a proven sublinear regret bound, and experiments on synthetic and real datasets demonstrate the advantage of incorporating contextual information and position discounts.
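To make the cascade mechanism concrete, here is a minimal Python sketch of one round of discounted cascade feedback: the user scans the list in order, stops at the first attractive item, and the reward is discounted by that item's position. The geometric discount gamma**k and all names are illustrative assumptions, not the thesis's exact definitions.

    import random

    def cascade_round(click_probs, gamma=0.9):
        # click_probs: attraction probability of each item, in list order.
        # gamma: per-position discount; a click at position k earns gamma**k.
        for k, p in enumerate(click_probs):
            if random.random() < p:      # user is satisfied by item k and stops
                return gamma ** k, k     # reward discounted by position
        return 0.0, None                 # user examined every item, no click

Note that under this feedback the agent observes the outcomes of every item up to and including the click, while items after the clicked position remain unobserved.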
Next, we employ techniques from online clustering of bandits to improve recommendation quality under cascading click feedback. We consider a new setting of online clustering of contextual cascading bandits, an online learning problem where the underlying cluster structure over users is unknown and must be learned from cascade feedback. The previous work corresponds to the degenerate case of a single cluster, and our general regret bound also improves on earlier results in this degenerate case. Experiments on both synthetic and real data demonstrate the advantage of incorporating the online clustering structure.
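As a rough illustration of how a cluster structure over users can be maintained online, the sketch below keeps a per-user ridge-regression estimate and drops the edge between two users once their estimates provably differ, in the spirit of graph-based bandit clustering; the confidence width, constants, and names are assumptions for illustration only, not the thesis's exact algorithm.

    import numpy as np

    class UserEstimate:
        def __init__(self, d, lam=1.0):
            self.A = lam * np.eye(d)    # regularized Gram matrix
            self.b = np.zeros(d)        # reward-weighted feature sum
            self.t = 0                  # number of observations

        def update(self, x, reward):
            self.A += np.outer(x, x)
            self.b += reward * x
            self.t += 1

        def theta(self):
            return np.linalg.solve(self.A, self.b)  # ridge estimate

    def keep_edge(u, v, alpha=1.0):
        # Keep two users connected while the gap between their estimates is
        # within the sum of their (illustrative) confidence widths.
        width = lambda e: alpha * np.sqrt((1.0 + np.log(1.0 + e.t)) / (1.0 + e.t))
        return np.linalg.norm(u.theta() - v.theta()) <= width(u) + width(v)

Connected components of the surviving graph then serve as the current cluster estimates, and observations are pooled within each component.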
Existing online clustering methods only consider a uniform distribution over users and are not very efficient at separating dissimilar users. We therefore consider a general setting of online clustering of bandits that allows a non-uniform distribution over user frequencies, together with a more efficient algorithm that uses sets to represent cluster structures. We provide a regret bound for the new algorithm that is free of the minimal frequency over users. Experiments on both synthetic and real datasets consistently show the advantage of the new algorithm over existing methods.
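A minimal sketch of the set-based bookkeeping follows, assuming split and merge operations over clusters represented as Python sets; the predicates differs and similar stand in for the algorithm's actual statistical tests on parameter estimates and observed frequencies.

    def split(clusters, cluster, user, differs):
        # Move a user into a singleton once its estimated parameter or
        # observed frequency deviates from its current cluster's.
        if differs(cluster, user):
            cluster.remove(user)
            clusters.append({user})

    def merge(clusters, similar):
        # One greedy pass: absorb any cluster whose estimate is provably
        # close to an earlier cluster's.
        i = 0
        while i < len(clusters):
            j = i + 1
            while j < len(clusters):
                if similar(clusters[i], clusters[j]):
                    clusters[i] |= clusters[j]   # absorb the second cluster
                    del clusters[j]              # and drop it from the list
                else:
                    j += 1
            i += 1

Representing clusters directly as sets avoids maintaining a graph over all users, which is one source of the efficiency gain.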
Next, we introduce a new model for online learning to rank in which the click probability factors into an examination function and an attractiveness function, where the attractiveness function is linear in a feature vector with an unknown parameter. Only relatively mild assumptions are made on the examination function. The new click model covers many common click models, including the cascade model and the position-based model. We propose a novel recursive ranking algorithm for this setting and prove that its regret depends on the feature dimension rather than the number of items, which allows the algorithm to handle a large number of items. When reduced to the orthogonal case, the regret of the algorithm improves on the state of the art.
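For concreteness, a small sketch of the factored click model, assuming an examination curve chi over positions and a linear attractiveness theta^T x; the particular chi below is a placeholder for illustration, not a modeling claim from the thesis.

    import numpy as np

    def click_prob(chi, theta, features, k):
        # Probability of a click at position k: examination times attractiveness.
        attractiveness = float(theta @ features[k])   # linear in the features
        return chi(k) * attractiveness

    # Example: a position-based examination curve and a known parameter.
    chi = lambda k: 1.0 / (k + 1)
    theta = np.array([0.6, 0.3])
    features = [np.array([0.5, 0.2]), np.array([0.1, 0.9])]
    print(click_prob(chi, theta, features, 0))  # 1.0 * 0.36 = 0.36

Because the unknown parameter theta lives in the d-dimensional feature space, estimating it requires learning d coefficients rather than one attractiveness value per item, which is where the dimension-dependent regret comes from.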
Last, we consider evaluating new ranking policies offline and optimizing them before they are deployed, an important problem in practice. We address it by proposing evaluation algorithms that estimate the expected number of clicks on ranked lists from historical logged data. Existing algorithms are not guaranteed to be statistically efficient in our problem because the number of recommended lists can grow exponentially with their length. To overcome this challenge, we use click models to construct estimators that learn statistically efficiently. We analyze our estimators and prove that, under the assumption that the click model holds, they are more efficient than estimators that do not use the structure of the click model. We evaluate our estimators in a series of experiments on a real-world dataset and show that they consistently outperform prior estimators.
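The efficiency gap can be illustrated with two simple estimators, sketched below under an assumed data layout: a naive list-level importance-sampling baseline, whose variance scales with the exponentially large space of lists, against an item-level estimator that reweights each item-position pair separately by exploiting a click model in which clicks decompose per position. The names and propensity functions are hypothetical.

    def list_ips(logs, target_prob, logging_prob):
        # Naive baseline: reweight whole lists. The weights range over the
        # exponentially large space of ranked lists, inflating variance.
        total = 0.0
        for ranked_list, clicks in logs:
            w = target_prob(ranked_list) / logging_prob(ranked_list)
            total += w * sum(clicks)
        return total / len(logs)

    def item_ips(logs, target_item_prob, logging_item_prob):
        # Structured estimator: when the click model factors over
        # item-position pairs, each item can be reweighted on its own.
        total = 0.0
        for ranked_list, clicks in logs:
            for k, (item, c) in enumerate(zip(ranked_list, clicks)):
                w = target_item_prob(item, k) / logging_item_prob(item, k)
                total += w * c
        return total / len(logs)

The item-level weights range over items and positions rather than whole lists, which is the sense in which the click-model structure buys statistical efficiency.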
Supervisor: Kwong-Sak Leung