KOSHIK: A large-scale distributed computing framework for NLP

Research output: Chapter in Book/Report/Conference proceeding › Paper in conference proceeding

Abstract

In this paper, we describe KOSHIK, an end-to-end framework to process the unstructured natural language content of multilingual documents. We used the Hadoop distributed computing infrastructure to build this framework as it enables KOSHIK to easily scale by adding inexpensive commodity hardware. We designed an annotation model that allows the processing algorithms to incrementally add layers of annotation without modifying the original document. We used the Avro binary format to serialize the documents. Avro is designed for Hadoop and allows other data warehousing tools to directly query the documents. This paper reports the implementation choices and details of the framework, the annotation model, the options for querying processed data, and the parsing results on the English and Swedish editions of Wikipedia.
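
The layered annotation idea in the abstract can be illustrated with a minimal stand-off sketch: annotations refer to character offsets in the text, so each processing stage adds a new layer while the text stays unchanged. This is only an illustration under stated assumptions. The names `Annotation`, `Document`, `add_layer`, and `tokenize` are hypothetical and are not taken from KOSHIK, and the sketch uses plain Python objects rather than the Avro schema the paper describes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    """A labelled span over the document text, given as character offsets."""
    start: int
    end: int
    label: str


@dataclass
class Document:
    """Source text plus named annotation layers (hypothetical layout)."""
    text: str
    layers: Dict[str, List[Annotation]] = field(default_factory=dict)

    def add_layer(self, name: str, annotations: List[Annotation]) -> None:
        # Layers only reference offsets into `text`; the text is never edited.
        if name in self.layers:
            raise ValueError(f"layer {name!r} already exists")
        self.layers[name] = list(annotations)

    def covered_text(self, annotation: Annotation) -> str:
        """Return the slice of the original text that an annotation spans."""
        return self.text[annotation.start:annotation.end]


def tokenize(doc: Document) -> List[Annotation]:
    """Toy whitespace tokenizer that emits one annotation per token."""
    tokens: List[Annotation] = []
    pos = 0
    for word in doc.text.split():
        start = doc.text.index(word, pos)
        end = start + len(word)
        tokens.append(Annotation(start, end, "token"))
        pos = end
    return tokens


doc = Document("KOSHIK parses Wikipedia")
doc.add_layer("tokens", tokenize(doc))

# A later stage builds on the token layer without touching the text.
capitalized = [
    Annotation(t.start, t.end, "capitalized")
    for t in doc.layers["tokens"]
    if doc.covered_text(t)[0].isupper()
]
doc.add_layer("capitalized", capitalized)
```

Because each layer stores only offsets and labels, a downstream stage can read earlier layers and append its own, which is the property the abstract attributes to the annotation model.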

Details

Research areas and keywords

Subject classification (UKÄ)

  • Computer Science
Original language: English
Title of host publication: 3rd International Conference on Pattern Recognition Applications and Methods (ICPRAM 2014)
Publisher: SciTePress
Pages: 464-470
ISBN (Print): 978-989-758-018-5
Publication status: Published - 2014
Publication category: Research
Peer-reviewed: Yes
Event: 3rd International Conference on Pattern Recognition Applications and Methods (ICPRAM 2014) - Angers, France
Duration: 2014 Mar 6 to 2014 Mar 8

Conference

Conference: 3rd International Conference on Pattern Recognition Applications and Methods (ICPRAM 2014)
Country: France
City: Angers
Period: 2014/03/06 to 2014/03/08
