
dc.contributor.author: Mouriño García, Marcos Antonio
dc.contributor.author: Pérez Rodríguez, Roberto
dc.contributor.author: Anido Rifón, Luis Eulogio
dc.contributor.author: Vilares Ferro, Manuel
dc.date.accessioned: 2023-12-01T08:07:18Z
dc.date.available: 2023-12-01T08:07:18Z
dc.date.issued: 2018-09
dc.identifier.citation: Soft Computing, 22: 6047-6065 (2018)
dc.identifier.issn: 1432-7643
dc.identifier.issn: 1433-7479
dc.identifier.uri: http://hdl.handle.net/11093/5441
dc.description.abstract: The sheer number of news items published every day makes it worthwhile to automate their classification. The common approach consists in representing news items by the frequency of the words they contain and using supervised learning algorithms to train a classifier. This bag-of-words (BoW) approach is oblivious to three aspects of natural language: synonymy, polysemy, and multiword terms. More sophisticated representations based on concepts (units of meaning) have been proposed, following the intuition that document representations that better capture the semantics of text will lead to higher performance in automatic classification tasks. In practice, when classifying news items, the BoW representation has proven to be remarkably strong, with several studies reporting that it outperforms different ‘flavours’ of bag of concepts (BoC). In this paper, we propose a hybrid classifier that enriches the traditional BoW representation with concepts extracted from text, leveraging Wikipedia as background knowledge for the semantic analysis of text (WikiBoC). We benchmarked the proposed classifier against BoW and several BoC approaches: Latent Dirichlet Allocation (LDA), Explicit Semantic Analysis, and word embeddings (doc2vec). We used two corpora: the well-known Reuters-21578, composed of newswire items, and a new corpus created ex professo for this study, the Reuters-27000. Results show that (1) the performance of concept-based classifiers is very sensitive to the corpus used, being higher on the more “concept-friendly” Reuters-27000; (2) the proposed Hybrid-WikiBoC approach offers performance increases over BoW of up to 4.12% and 49.35% when classifying the Reuters-21578 and Reuters-27000 corpora, respectively; and (3) in terms of average performance, the proposed Hybrid-WikiBoC outperforms all the other classifiers, achieving a performance increase of 15.56% over the best state-of-the-art approach (LDA) for the largest training sequence. These results indicate that concepts extracted with the help of Wikipedia add useful information that improves classification performance for news items.
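As a rough, illustrative sketch of the bag-of-words baseline the abstract refers to (not the authors' actual pipeline), a minimal news classifier built with scikit-learn could look as follows; the example texts, labels, and the enrich_with_concepts helper are hypothetical placeholders:

    # Minimal bag-of-words news classifier, assuming scikit-learn is available.
    # Texts, labels, and the concept-enrichment helper are illustrative only;
    # they are not the pipeline used in the paper.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    train_texts = [
        "Oil prices rose sharply after the supply report.",
        "The central bank cut interest rates again.",
    ]
    train_labels = ["energy", "economy"]

    # BoW representation (tf-idf weighted word frequencies) + supervised classifier.
    bow_classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
    bow_classifier.fit(train_texts, train_labels)
    print(bow_classifier.predict(["Crude futures climbed on export news."]))

    # A hybrid representation in the spirit of the paper would append
    # Wikipedia-derived concept labels to each text before vectorisation, e.g.:
    def enrich_with_concepts(text, concept_labels):
        """Concatenate a document with the labels of matched Wikipedia concepts."""
        return text + " " + " ".join(concept_labels)

Under this sketch, the hybrid approach amounts to enriching each raw document with Wikipedia-derived concept labels before vectorisation, so that word features and concept features are learned jointly by the same classifier.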
dc.description.sponsorship: Atlantic Research Center for Information and Communication Technologies
dc.description.sponsorship: Xunta de Galicia | Ref. R2014/034 (RedPlir)
dc.description.sponsorship: Xunta de Galicia | Ref. R2014/029 (TELGalicia)
dc.language.iso: eng
dc.publisher: Soft Computing
dc.rights: © Springer-Verlag GmbH Germany, part of Springer Nature 2018
dc.title: Wikipedia-based hybrid document representation for textual news classification
dc.type: article
dc.rights.accessRights: openAccess
dc.identifier.doi: 10.1007/s00500-018-3101-5
dc.identifier.editor: http://link.springer.com/10.1007/s00500-018-3101-5
dc.publisher.departamento: Computer Science (Informática)
dc.publisher.departamento: Telematics Engineering (Enxeñaría telemática)
dc.publisher.grupoinvestigacion: COmputational LEarning
dc.publisher.grupoinvestigacion: GIST (Grupo de Enxeñería de Sistemas Telemáticos)
dc.subject.unesco: 1102.15 Formal Language Theory
dc.subject.unesco: 5701.09 Machine Translation
dc.subject.unesco: 1203.04 Artificial Intelligence
dc.date.updated: 2023-11-09T16:54:55Z
dc.computerCitation: pub_title=Soft Computing|volume=22|journal_number=|start_pag=6047|end_pag=6065
dc.references: This version of the article has been accepted for publication, after peer review, and is subject to Springer Nature’s AM terms of use, but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: https://doi.org/10.1007/s00500-018-3101-5


Files in this item: [PDF]