Scalable algorithms for graph-based semi-supervised learning with embedding
Created by W.Langdon from
gp-bibliography.bib Revision:1.8901
- @PhdThesis{kamalov:tel-04042245,
-
author = "Mikhail Kamalov",
-
title = "Scalable algorithms for graph-based semi-supervised
learning with embedding",
-
title_fr = "Mise a l'echelle des algorithmes pour l'apprentissage
semi-supervise base sur des graphes avec le
plongement",
-
school = "Universite Cote d'Azur",
-
year = "2022",
-
address = "France",
-
month = "1 " # dec,
-
keywords = "genetic algorithms, genetic programming, MyDataModels,
Personalized PageRank, Stochastic approximation, Neural
networks, ANN, Semi-supervised learning, Personalized
PageRank, Approximation stochastique, Reseaux de
neurones, Apprentissage semi-supervise, INRIA",
-
number = "2022COAZ4079",
-
hal_version = "v1",
-
URL = "
https://team.inria.fr/neo/thesis-defense-11/",
-
URL = "
https://theses.hal.science/tel-04042245/",
-
URL = "
https://theses.hal.science/tel-04042245v1/file/2022COAZ4079.pdf",
-
size = "150 pages",
-
abstract = "Nowadays, graph-based semi-supervised learning
(GB-SSL) is a fast-growing area of classifying nodes in
a graph with an extremely low number of labelled nodes.
However, the GB-SSL algorithms have two general
limitations: the first is the memory/time complexity
that arises in all state-of-the-art GB-SSL algorithms
on extremely large graphs. In particular, the high
memory consumption occurs in graph convolution networks
and leads to Out of Memory (OOM) issues on GPU or RAM;
(ii) the second one appears in all GB-SSL algorithms
based on Laplacian regularization loss. This thesis’
major contribution is divided into two parts in order
to suggest strategies that would guarantee to avoid the
restrictions mentioned above. In the first part of this
thesis, we propose a novel linear algorithm called
Markov-Batch Stochastic Approximation (MBSA) for
solving Personalized PageRank. MBSA updates nodes
batches and proposes a significantly better tradeoff
between memory consumption and convergence rate for an
optimal classification result than other linear models.
Then, we propose a novel scaling graph convolution
network, denoted as MBSA-NN, which embeds our linear
MBSA. MBSA-NN avoids OOM issues and significantly
reduces time and memory consumption on GPU and RAM. We
applied MBSA-NN on several very large datasets, and we
showed that it can handle graphs with more than 10M
nodes and 2M of features under one minute on one
standard machine, including preprocessing, training and
inference time. Furthermore, we show that it has
significantly improved memory/time consumption and
competitive accuracy concerning the latest best GB-SSL
scaling algorithms. The second part of this thesis
focuses on solutions to Laplacian regularization loss
issues. For that reason, we propose a novel framework
called Graph Diffusion & PCA (GDPCA). This framework
combines a modified Principal Component Analysis with
the classical supervised loss and Laplacian
regularization loss. GDPCA allows handling the case
where the adjacency matrix presents through Binary
edges and avoids the Curse of dimensionality. Also,
GDPCA can be applied to non-graph datasets, such as
images, by constructing a similarity graph.
Furthermore, we propose a framework that embeds
PageRank SSL in a generative model (GenPR). GenPR joint
training of nodes latent space representation and label
spreading through the reweighted adjacency matrix by
node similarities in the latent space. We demonstrate
that a generative model can improve accuracy and reduce
the number of iteration steps for PageRank SSL.
Moreover, we show how to embed MBSA into the GenPR
framework for providing the batch training regime of
GenPR. Finally, we propose a flexible SSL framework
based on stacking GDPCA and Zoetrope Genetic
Programming algorithms into a novel framework: PaZoe.
This self-labelling framework shows that graph-based
and non-graph based algorithms jointly improve the
quality of predictions and outperform each component
taken alone. We also show that PaZoe outperforms
state-of-the-art SSL algorithms on real datasets. Note
that the one of the datasets was generated in house,
taking data from industrial graded equipment to mimic
DC motors during operation",
-
notes = "In english
NNT : 2022COAZ4079⟩. ⟨tel-04042245⟩ Inria
Sophia-Antipolis,
Supervisor: Konstantin Avrachenkov",
- }
Genetic Programming entries for
Mikhail Kamalov
Citations