The role of different thesauri terms and captions in automated subject classification

Koraljka Golub

Research output: Chapter in Book/Report/Conference proceeding; Paper in conference proceeding; peer-review

Abstract

The paper aims to explore to what degree different types of terms in the Engineering Information (Ei) thesaurus and classification scheme influence automated subject classification performance. Preferred terms, their synonyms, broader, narrower, and related terms, and captions are examined in combination with a stemmer and a stop-word list. The algorithm comprises string-to-string matching between words in the documents to be classified and words in term lists derived from the Ei thesaurus and classification scheme. The data collection for evaluation consists of some 35,000 scientific paper abstracts from the Compendex database. A subset of the Ei thesaurus and classification scheme is used, comprising 92 classes at up to five hierarchical levels from general engineering. The results show that preferred terms perform best, whereas captions perform worst. Stemming improves performance in most cases, whereas the stop-word list does not have a significant impact.
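
The matching approach described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the class identifiers, term lists, stop-word list, and the crude suffix stripper `toy_stem` are all hypothetical stand-ins, and the scoring (a plain count of term occurrences per class) is an assumption made for illustration only.

```python
import re
from collections import Counter

# Hypothetical stop-word list and term lists (NOT taken from the Ei thesaurus).
STOP_WORDS = {"the", "of", "and", "in", "a", "for", "to", "is", "are"}

TERM_LISTS = {
    "class_A": ["heat transfer", "thermal conductivity"],
    "class_B": ["signal processing", "digital filters"],
}


def toy_stem(word):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text, use_stemming=True, use_stop_words=True):
    """Lowercase, tokenize, then optionally drop stop words and stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if use_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    if use_stemming:
        words = [toy_stem(w) for w in words]
    return words


def count_matches(doc_words, term_words):
    """Count positions where the term's word sequence occurs in the document."""
    n = len(term_words)
    return sum(
        1
        for i in range(len(doc_words) - n + 1)
        if doc_words[i : i + n] == term_words
    )


def classify(text, term_lists, **options):
    """Score each class by the number of term matches; return non-zero scores."""
    doc_words = normalize(text, **options)
    scores = Counter()
    for class_id, terms in term_lists.items():
        for term in terms:
            term_words = normalize(term, **options)
            if term_words:
                scores[class_id] += count_matches(doc_words, term_words)
    return [(c, s) for c, s in scores.most_common() if s > 0]
```

In this toy setup, calling `classify` with `use_stemming=False` lets one compare matches with and without stemming, since a term such as "digital filters" only matches the text "digital filter" once both are stemmed.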
Original language: English
Title of host publication: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence
Publisher: IEEE - Institute of Electrical and Electronics Engineers Inc.
Pages: 961-965
Number of pages: 5
ISBN (Print): 0-7695-2747-7
Publication status: Published - 2006
Event: 2006 IEEE/WIC/ACM International Conference on Web Intelligence - Hong Kong, China
Duration: 2006 Dec 18 to 2006 Dec 22

Conference

Conference: 2006 IEEE/WIC/ACM International Conference on Web Intelligence
Country/Territory: China
City: Hong Kong
Period: 2006/12/18 to 2006/12/22

Subject classification (UKÄ)

  • Electrical Engineering, Electronic Engineering, Information Engineering

Free keywords

  • thesauri term
  • automated subject classification
  • engineering information
  • string-to-string matching
  • document classification
  • compendex database
  • data collection
