Wikipedia-based hybrid document representation for textual news classification
DATE: 2016-11
UNIVERSAL IDENTIFIER: http://hdl.handle.net/11093/5442
EDITED VERSION: https://ieeexplore.ieee.org/document/8057457/
UNESCO SUBJECT: 1102.15 Theory of Formal Languages ; 5701.09 Machine Translation ; 1203.04 Artificial Intelligence
DOCUMENT TYPE: conferenceObject
ABSTRACT
Automatic classification of news articles is a relevant problem because of the large volume of news generated every day: articles must be classified so that users can access information of interest quickly and effectively. On the one hand, traditional classification systems represent documents as a bag-of-words (BoW), a representation that is oblivious to two problems of language: synonymy and polysemy. On the other hand, several authors propose a bag-of-concepts (BoC) representation of documents, which tackles synonymy and polysemy. This paper shows the benefits of using a hybrid representation of documents for the classification of textual news, leveraging the advantages of both approaches: the traditional BoW representation and a BoC approach based on Wikipedia knowledge. To evaluate the proposal, we used three of the most relevant algorithms in the state of the art (SVM, Random Forest and Naïve Bayes) and two corpora: the Reuters-21578 corpus and a purpose-built corpus, Reuters-27000. The results show that the performance of each classification algorithm depends on the dataset used, and also demonstrate that enriching the BoW representation with the concepts extracted from documents by a semantic annotator adds useful information to the classifier and improves its performance. The experiments show performance increases of up to 4.12% when classifying the Reuters-21578 corpus with the SVM algorithm and of up to 49.35% when classifying the Reuters-27000 corpus with the Random Forest algorithm.
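The hybrid representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `TOY_CONCEPTS` lookup is a hypothetical stand-in for a Wikipedia-based semantic annotator, and the function names and the `w:`/`c:` feature prefixes are assumptions made for illustration only. The sketch shows how two documents that use different words for the same thing share no word feature for that term but do share a concept feature.

```python
import re
from collections import Counter

# Hypothetical stand-in for a Wikipedia-based semantic annotator: it maps
# surface forms to concept identifiers. A real annotator would also use
# context to disambiguate polysemous terms; this toy lookup does not.
TOY_CONCEPTS = {
    "crude": "Petroleum",
    "oil": "Petroleum",
    "petroleum": "Petroleum",
    "wheat": "Cereal",
    "corn": "Cereal",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bow_features(tokens):
    """Bag-of-words: term counts, each key prefixed with 'w:'."""
    return Counter("w:" + t for t in tokens)


def boc_features(tokens, concepts=TOY_CONCEPTS):
    """Bag-of-concepts: concept counts, each key prefixed with 'c:'."""
    return Counter("c:" + concepts[t] for t in tokens if t in concepts)


def hybrid_features(text):
    """Hybrid representation: BoW features enriched with BoC features."""
    tokens = tokenize(text)
    return bow_features(tokens) + boc_features(tokens)


doc_a = hybrid_features("Crude prices rose")
doc_b = hybrid_features("Oil prices rose")

# Features present in both documents. The synonyms 'crude' and 'oil' do not
# match as words, but both contribute the same concept feature.
shared = set(doc_a) & set(doc_b)
```

The resulting feature counts could then be passed to any classifier that accepts sparse count vectors, such as the SVM, Random Forest and Naïve Bayes algorithms named in the abstract.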
Files in this item
- Name: 2016_wikipedia-based_hybrid_do ...
- Size: 540.0Kb
- Format: PDF
- Description: accepted manuscript