RT Journal Article
T1 Wikipedia-based hybrid document representation for textual news classification
A1 Mouriño García, Marcos Antonio
A1 Pérez Rodríguez, Roberto
A1 Anido Rifón, Luis Eulogio
A1 Vilares Ferro, Manuel
K1 1102.15 Formal Language Theory
K1 5701.09 Machine Translation
K1 1203.04 Artificial Intelligence
AB The sheer number of news items published every day makes it worthwhile to automate their classification. The common approach consists of representing news items by the frequency of the words they contain and using supervised learning algorithms to train a classifier. This bag-of-words (BoW) approach is oblivious to three aspects of natural language: synonymy, polysemy, and multiword terms. More sophisticated representations based on concepts—or units of meaning—have been proposed, following the intuition that document representations that better capture the semantics of text will lead to higher performance in automatic classification tasks. In practice, however, when classifying news items the BoW representation has proven remarkably strong, with several studies reporting it to perform above different ‘flavours’ of bag of concepts (BoC). In this paper, we propose a hybrid classifier that enriches the traditional BoW representation with concepts extracted from text, leveraging Wikipedia as background knowledge for the semantic analysis of text (WikiBoC). We benchmarked the proposed classifier, comparing it with BoW and several BoC approaches: Latent Dirichlet Allocation (LDA), Explicit Semantic Analysis, and word embeddings (doc2vec). We used two corpora: the well-known Reuters-21578, composed of newswire items, and a new corpus created ex professo for this study: the Reuters-27000. Results show that (1) the performance of concept-based classifiers is very sensitive to the corpus used, being higher on the more “concept-friendly” Reuters-27000; (2) the proposed Hybrid-WikiBoC approach offers performance increases over BoW of up to 4.12% and 49.35% when classifying the Reuters-21578 and Reuters-27000 corpora, respectively; and (3) in terms of average performance, the proposed Hybrid-WikiBoC outperforms all the other classifiers, achieving a performance increase of 15.56% over the best state-of-the-art approach (LDA) for the largest training sequence. Results indicate that concepts extracted with the help of Wikipedia add useful information that improves classification performance for news items.
PB Soft Computing
SN 1432-7643
YR 2018
FD 2018-09
LK http://hdl.handle.net/11093/5441
UL http://hdl.handle.net/11093/5441
LA eng
NO Soft Computing, 22: 6047-6065 (2018)
NO Atlantic Research Center for Information and Communication Technologies
DS Investigo
RD 04-Dec-2024