dc.contributor.author | Mouriño García, Marcos Antonio | |
dc.contributor.author | Pérez Rodríguez, Roberto | |
dc.contributor.author | Anido Rifón, Luis Eulogio | |
dc.contributor.author | Vilares Ferro, Manuel | |
dc.date.accessioned | 2023-12-01T08:07:18Z | |
dc.date.available | 2023-12-01T08:07:18Z | |
dc.date.issued | 2018-09 | |
dc.identifier.citation | Soft Computing, 22: 6047-6065 (2018) | spa |
dc.identifier.issn | 14327643 | |
dc.identifier.issn | 14337479 | |
dc.identifier.uri | http://hdl.handle.net/11093/5441 | |
dc.description.abstract | The sheer number of news items published every day makes automating their classification worthwhile. The common approach consists of representing news items by the frequency of the words they contain and using supervised learning algorithms to train a classifier. This bag-of-words (BoW) approach is oblivious to three aspects of natural language: synonymy, polysemy, and multiword terms. More sophisticated representations based on concepts (units of meaning) have been proposed, following the intuition that document representations that better capture the semantics of text will lead to higher performance in automatic classification tasks. In practice, when classifying news items, the BoW representation has proven to be very strong, with several studies reporting that it outperforms different ‘flavours’ of bag of concepts (BoC). In this paper, we propose a hybrid classifier that enriches the traditional BoW representation with concepts extracted from text, leveraging Wikipedia as background knowledge for the semantic analysis of text (WikiBoC). We benchmarked the proposed classifier against BoW and several BoC approaches: Latent Dirichlet Allocation (LDA), Explicit Semantic Analysis, and word embeddings (doc2vec). We used two corpora: the well-known Reuters-21578, composed of newswire items, and a new corpus created expressly for this study, the Reuters-27000. Results show that (1) the performance of concept-based classifiers is very sensitive to the corpus used, being higher in the more “concept-friendly” Reuters-27000; (2) the proposed Hybrid-WikiBoC approach offers performance increases over BoW of up to 4.12% and 49.35% when classifying the Reuters-21578 and Reuters-27000 corpora, respectively; and (3) for average performance, the proposed Hybrid-WikiBoC outperforms all the other classifiers, achieving a performance increase of 15.56% over the best state-of-the-art approach (LDA) for the largest training sequence. These results indicate that concepts extracted with the help of Wikipedia add useful information that improves classification performance for news items. | spa |
dc.description.sponsorship | Atlantic Research Center for Information and Communication Technologies | spa |
dc.description.sponsorship | Xunta de Galicia | Ref. R2014/034 (RedPlir) | spa |
dc.description.sponsorship | Xunta de Galicia | Ref. R2014/029 (TELGalicia) | spa |
dc.language.iso | eng | spa |
dc.publisher | Soft Computing | spa |
dc.rights | © Springer-Verlag GmbH Germany, part of Springer Nature 2018 | |
dc.title | Wikipedia-based hybrid document representation for textual news classification | eng |
dc.type | article | spa |
dc.rights.accessRights | openAccess | spa |
dc.identifier.doi | 10.1007/s00500-018-3101-5 | |
dc.identifier.editor | http://link.springer.com/10.1007/s00500-018-3101-5 | spa |
dc.publisher.departamento | Informática | spa |
dc.publisher.departamento | Enxeñaría telemática | spa |
dc.publisher.grupoinvestigacion | COmputational LEarning | spa |
dc.publisher.grupoinvestigacion | GIST (Grupo de Enxeñería de Sistemas Telemáticos) | spa |
dc.subject.unesco | 1102.15 Formal Language Theory | spa |
dc.subject.unesco | 5701.09 Machine Translation | spa |
dc.subject.unesco | 1203.04 Artificial Intelligence | spa |
dc.date.updated | 2023-11-09T16:54:55Z | |
dc.computerCitation | pub_title=Soft Computing|volume=22|journal_number=|start_pag=6047|end_pag=6065 | spa |
dc.references | This version of the article has been accepted for publication, after peer review and is subject to Springer Nature’s AM terms of use, but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: https://doi.org/10.1007/s00500-018-3101-5 | |
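
The hybrid representation described in the abstract (word counts enriched with concepts extracted from the text) can be illustrated with a minimal sketch. This is not the authors' implementation: the `CONCEPT_LEXICON` dictionary is a hypothetical stand-in for a Wikipedia-based semantic annotator, and the function names and the `w:`/`c:` key prefixes are assumptions made for illustration only.

```python
from collections import Counter
import re

# Hypothetical phrase-to-concept lexicon standing in for a Wikipedia-based
# semantic annotator; the entries are illustrative only.
CONCEPT_LEXICON = {
    "crude oil": "Petroleum",
    "petroleum": "Petroleum",
    "central bank": "Central_bank",
    "interest rate": "Interest_rate",
}


def bag_of_words(text):
    """Count lowercase word tokens (the BoW part)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def bag_of_concepts(text, lexicon=CONCEPT_LEXICON):
    """Count concepts whose surface phrases occur in the text (the BoC part)."""
    lowered = text.lower()
    concepts = Counter()
    for phrase, concept in lexicon.items():
        hits = len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        if hits:
            concepts[concept] += hits
    return concepts


def hybrid_features(text):
    """Merge both feature spaces, prefixing keys so words and concepts never collide."""
    features = {f"w:{word}": n for word, n in bag_of_words(text).items()}
    features.update({f"c:{concept}": n for concept, n in bag_of_concepts(text).items()})
    return features


doc = ("Crude oil prices rose after the central bank cut its interest rate. "
       "Petroleum exporters gained.")
feats = hybrid_features(doc)
print(sorted(k for k in feats if k.startswith("c:")))
```

In this toy example the synonyms "crude oil" and "petroleum" both contribute to the single concept feature `c:Petroleum`, and the multiword terms "central bank" and "interest rate" each become one feature, while the plain word counts remain available to the classifier. A feature dictionary of this kind could then be passed to any supervised learner.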