Wikipedia-based hybrid document representation for textual news classification

Mouriño García, Marcos Antonio (UVIGO author); Pérez Rodríguez, Roberto (UVIGO author); Anido Rifón, Luis Eulogio (UVIGO author); Vilares Ferro, Manuel (UVIGO author)
DATE: 2018-09
UNIVERSAL IDENTIFIER: http://hdl.handle.net/11093/5441
EDITED VERSION: http://link.springer.com/10.1007/s00500-018-3101-5
UNESCO SUBJECT: 1102.15 Theory of Formal Languages; 5701.09 Machine Translation; 1203.04 Artificial Intelligence
DOCUMENT TYPE: article

ABSTRACT

The sheer number of news items published every day makes automating their classification worthwhile. The common approach consists of representing news items by the frequency of the words they contain and using supervised learning algorithms to train a classifier. This bag-of-words (BoW) approach is oblivious to three aspects of natural language: synonymy, polysemy, and multiword terms. More sophisticated representations based on concepts (units of meaning) have been proposed, following the intuition that document representations that better capture the semantics of text will lead to higher performance in automatic classification tasks. In practice, however, the BoW representation has proven to be very strong when classifying news items, with several studies reporting that it outperforms different ‘flavours’ of bag of concepts (BoC). In this paper, we propose a hybrid classifier that enriches the traditional BoW representation with concepts extracted from text, leveraging Wikipedia as background knowledge for the semantic analysis of text (WikiBoC). We benchmarked the proposed classifier against BoW and several BoC approaches: Latent Dirichlet Allocation (LDA), Explicit Semantic Analysis, and word embeddings (doc2vec). We used two corpora: the well-known Reuters-21578, composed of newswire items, and a new corpus created expressly for this study, the Reuters-27000. Results show that (1) the performance of concept-based classifiers is very sensitive to the corpus used, being higher on the more “concept-friendly” Reuters-27000; (2) the proposed Hybrid-WikiBoC approach offers performance increases over BoW of up to 4.12% and 49.35% when classifying the Reuters-21578 and Reuters-27000 corpora, respectively; and (3) in terms of average performance, the proposed Hybrid-WikiBoC outperforms all the other classifiers, achieving a performance increase of 15.56% over the best state-of-the-art approach (LDA) for the largest training sequence. These results indicate that concepts extracted with the help of Wikipedia add useful information that improves classification performance for news items.
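
The idea of enriching a bag-of-words vector with concept features can be illustrated with a minimal sketch. This is not the authors' implementation: the `CONCEPTS` dictionary, the function names, and the `w:`/`c:` feature prefixes are all hypothetical, and a toy phrase lookup stands in for a real Wikipedia-based semantic annotator. It only shows how word counts and concept counts might be merged into one feature vector, and how two surface forms can map to the same concept.

```python
import re
from collections import Counter

# Hypothetical toy mapping from surface phrases to Wikipedia-style concept
# titles. A real system would use a Wikipedia-based annotator instead.
CONCEPTS = {
    "central bank": "Central_bank",
    "interest rate": "Interest_rate",
    "interest rates": "Interest_rate",
    "crude oil": "Petroleum",
    "petroleum": "Petroleum",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bow_features(text):
    """Bag of words: count each token, prefixed with 'w:'."""
    return Counter(f"w:{tok}" for tok in tokenize(text))


def concept_features(text):
    """Bag of concepts: greedy longest-match lookup of known phrases."""
    tokens = tokenize(text)
    max_len = max(len(phrase.split()) for phrase in CONCEPTS)
    feats = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first so multiword terms win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in CONCEPTS:
                feats[f"c:{CONCEPTS[phrase]}"] += 1
                i += n
                break
        else:
            i += 1  # no concept starts at this token
    return feats


def hybrid_features(text):
    """Hybrid representation: word counts plus concept counts."""
    return bow_features(text) + concept_features(text)


doc = "The central bank raised interest rates as crude oil and petroleum prices rose."
features = hybrid_features(doc)
```

In this sketch, "crude oil" and "petroleum" both contribute to the single feature `c:Petroleum`, while the individual words remain available as `w:` features. The resulting counts could then be passed to any supervised learning algorithm.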
Files in this item

Name: 2018_wikipedia_based_hybrid.pdf
Size: 1.055Mb
Format: PDF
Description: accepted manuscript
