Automatic classification using DDC on the Swedish union catalogue

Research output: Chapter in Book/Report/Conference proceeding › Paper in conference proceeding

Abstract

As more digital collections of information resources become available, the challenge of assigning subject index terms and classes from quality knowledge organization systems grows. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, this paper evaluates the performance of two machine learning algorithms on Swedish catalogue records from the Swedish union catalogue (LIBRIS). The algorithms are tested on the top three hierarchical levels of the DDC. Based on a data set of 143,838 records, the evaluation shows that a Support Vector Machine with a linear kernel outperforms the Multinomial Naïve Bayes algorithm. In addition, using keywords, or combining titles and keywords, gives better results than using titles alone as input. Class imbalance, where many DDC classes have only a few records, greatly affects classification performance: 81.37% accuracy on the training set is achieved when at least 1,000 records per class are available, and 66.13% when few records are available for training. Proposed future research involves exploring how the intellectual effort put into creating the DDC, as commonly applied in string matching, can further improve algorithm performance, and testing the best approach on new digital collections that do not have DDC assigned.
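The two data-preparation ideas mentioned in the abstract, evaluating at the top three DDC hierarchy levels and restricting training to classes with enough records, can be illustrated with a minimal sketch. This is not the authors' code: the function names `ddc_level` and `frequent_classes` are hypothetical, and the zero-padded three-digit representation of each level (for example, 839.738 becoming 800, 830, and 839) is an assumption made for illustration. Only the 1,000-record threshold comes from the abstract.

```python
from collections import Counter


def ddc_level(notation: str, level: int) -> str:
    """Truncate a DDC notation to hierarchy level 1, 2, or 3.

    Assumption: a level is written as its leading digits padded with
    zeros to three digits, e.g. "839.738" -> "800", "830", "839".
    """
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2, or 3")
    main = notation.split(".")[0].strip()  # digits before the decimal point
    if len(main) != 3 or not main.isdigit():
        raise ValueError(f"unexpected DDC notation: {notation!r}")
    return main[:level].ljust(3, "0")


def frequent_classes(labels, min_records):
    """Return the set of classes that have at least `min_records` records."""
    counts = Counter(labels)
    return {cls for cls, n in counts.items() if n >= min_records}


if __name__ == "__main__":
    # Toy records, not taken from LIBRIS.
    notations = ["839.738", "839.7", "025.431", "839.738"]
    level3 = [ddc_level(n, 3) for n in notations]
    print(level3)
    # The abstract reports results for classes with at least 1,000 records;
    # a threshold of 2 is used here only because the toy list is tiny.
    print(frequent_classes(level3, min_records=2))
```

Filtering on a minimum record count is one simple way to examine the class imbalance effect the abstract describes; the paper itself may have partitioned the data differently.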

Details

Authors
  • Koraljka Golub
  • Johan Hagelbäck
  • Anders Ardö
Organisations
External organisations
  • Linnaeus University
Research areas and keywords

Subject classification (UKÄ)

  • Information Studies

Keywords

  • Automatic classification, Dewey Decimal Classification, LIBRIS, Machine learning, Multinomial Naïve Bayes, Subject access, Support Vector Machine
Original language: English
Title of host publication: Proceedings of the 18th European Networked Knowledge Organization Systems (NKOS) Workshop co-located with the 22nd International Conference on Theory and Practice of Digital Libraries 2018 (TPDL 2018)
Publisher: CEUR
Pages: 4-16
Number of pages: 13
Volume: 2200
Publication status: Published - 2018
Publication category: Research
Peer-reviewed: Yes
Event: 18th European Networked Knowledge Organization Systems Workshop, NKOS 2018 - Porto, Portugal
Duration: 2018 Sep 13 → …

Publication series

Name: CEUR Workshop Proceedings
ISSN (Print): 1613-0073

Conference

Conference: 18th European Networked Knowledge Organization Systems Workshop, NKOS 2018
Country: Portugal
City: Porto
Period: 2018/09/13 → …