100 miljoner ord: Reflektioner kring forskningsarbete med storskaliga dataset som historisk empiri

Translated title of the contribution: A hundred million words: Reflections on historical research with large-scale textual datasets as empirical evidence

Research output: Contribution to journal › Article › peer-review

Abstract


The research project Welfare State Analytics: Text Mining and Modelling Swedish Politics, Media & Culture, 1945–1989 uses probabilistic methods and text-mining models to study three massive textual datasets from Swedish politics, news media, and literary culture. Topic modelling and distant reading of a dataset of some 3,100 Swedish Government Official Reports have yielded findings that previous historical scholarship has neglected – or rather, could not detect, given the limitations of traditional, small-scale examinations of only a few such reports. This article presents some of the project’s findings, but concentrates on the practical issues of curating large-scale textual datasets, and thus on the possibilities – and shortcomings – of digital history research practices.

Large-scale textual datasets, often containing hundreds of millions of words, are a new type of empirical material that presents the historian with fresh challenges. The preparation of datasets is usually a resource-intensive task, where algorithmic machine learning is combined with the manual curation of data, a process that compiles the empirical material into datasets (in different versions).

Plainly, historical empirical material must be compiled into datasets to enable large-scale analyses, and such work can be laborious, as it depends on extensive programming efforts. What may come as a surprise is how complicated the relationship between data and empirical material can be in a digital-historical context, and the fact that preparing datasets is usually an iterative procedure that fundamentally changes the historical sources. In this type of research, compiled empirical material will usually result in several datasets, depending not only on how effective the available software is at curating and correcting errors but also on the specific research questions – given that data can be modelled in many ways. The relationship between empirical material and curated datasets is therefore complex, and highly dependent on both software and research practices.
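The iterative, version-producing character of curation described above can be sketched as follows. This is an illustrative toy pipeline, not the project's actual code; the correction steps and version names are assumptions:

```python
# Illustrative sketch of iterative dataset curation: each pass applies a
# correction step and yields a new dataset version, so the same source
# material exists in several curated forms.
import re

def fix_hyphenation(text: str) -> str:
    # Rejoin words broken across line ends, e.g. "utred-\nning" -> "utredning".
    return re.sub(r"-\n(\w)", r"\1", text)

def normalize_whitespace(text: str) -> str:
    # Collapse runs of whitespace left over from layout extraction.
    return re.sub(r"\s+", " ", text).strip()

raw = "en statlig utred-\nning   om   välfärd"

# Each curation step produces a new, named dataset version; earlier
# versions are kept so analyses can be tied to a specific state of the data.
versions = {"v1_raw": raw}
versions["v2_dehyphenated"] = fix_hyphenation(versions["v1_raw"])
versions["v3_normalized"] = normalize_whitespace(versions["v2_dehyphenated"])

print(versions["v3_normalized"])  # -> "en statlig utredning om välfärd"
```

Keeping every version, rather than overwriting the source, is what makes the transformation of the historical material inspectable afterwards.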
Original language: Swedish
Pages (from-to): 320-352
Journal: Historisk Tidskrift
Volume: 142
Issue number: 3
Publication status: Published - 23 September 2022

Subject classification (UKÄ)

  • Language Technology (Computational Linguistics)
  • Other Computer and Information Science
  • Software Engineering

Free keywords

  • data curation
  • machine learning
  • textual datasets
  • digital history
