title = "Contribution to automatic text classification :
metrics and evolutionary algorithms",
school = "Universit{\'e} du Littoral C{\^o}te d'Opale",
year = "2018",
address = "France",
month = Nov,
keywords = "genetic algorithms, genetic programming, NLP, Machine
learning, Natural language processing, Text mining,
Classification of texts, Term Weighting
Schemes, Optimization, Apprentissage automatique,
Traitement du langage naturel, Exploration de texte,
Classification des textes, Sch{\'e}ma de
Pond{\'e}ration des Termes, Optimisation, Programmation
g{\'e}n{\'e}tique",
abstract = "This thesis deals with natural language processing and
text mining, at the intersection of machine learning
and statistics. We are particularly interested in Term
Weighting Schemes (TWS) in the context of supervised
learning and specifically the Text Classification (TC)
task. In TC, the multi-label classification task has
gained a lot of interest in recent years. Multi-label
classification of textual data arises in many
modern applications, such as news classification, where
the task is to find the categories that a newswire
story belongs to (e.g., politics, Middle East, oil)
based on its textual content; music genre
classification (e.g., jazz, pop, oldies, traditional
pop) based on customer reviews; film classification
(e.g., action, crime, drama); and product classification
(e.g., electronics, computers, accessories). Traditional
classification algorithms are generally binary
classifiers and are not suited to
multi-label classification. The multi-label
classification task is, therefore, transformed into
multiple single-label binary tasks. However, this
transformation introduces several issues. First, term
distributions are considered only in relation to the
positive and the negative categories (i.e., information
on the correlations between terms and categories is
lost). Second, it fails to consider any label
dependency (i.e., information on existing correlations
between classes is lost). Finally, since all categories
but one are grouped into one category (the negative
category), the newly created tasks are imbalanced. This
information is commonly used by supervised TWS to
improve the effectiveness of the classification system.
Hence, after presenting the process of multi-label text
classification, and more particularly the TWS, we make
an empirical comparison of these methods applied to the
multi-label text classification task. We find that the
superiority of the supervised methods over the
unsupervised methods is still not clear. We then show
that these methods are not fully adapted to the
multi-label classification problem and that they ignore much
statistical information that could be used to improve
the classification results. Thus, we propose a new TWS
based on information gain. This new method takes into
consideration the term distribution not only with respect to
the positive and the negative categories but also with
respect to all classes. Finally, aiming at finding
specialized TWS that also solve the issue of imbalanced
tasks, we study the benefits of using genetic
programming for generating TWS for the text
classification task. Unlike previous studies, we
generate formulas by combining statistical information
at a microscopic level (e.g., the number of documents
that contain a specific term) instead of using complete
TWS. Furthermore, we make use of categorical
information (e.g., the number of categories
in which a term occurs). Experiments are carried out to measure
the impact of these methods on the performance of the
model. We show through these experiments that the
results are positive.",
notes = "Reuters-21578, Oshumed, Webkb. Porter stemming p87
'The GP-Based TWSs outperforms the best baseline
schemes.'
Supervisors: Prof. Cyril Fonlupt and MCF Fabien
Teytaud",