University of Oulu

Coats, S. A new corpus of geolocated ASR transcripts from Germany. Lang Resources & Evaluation (2023).

A new corpus of geolocated ASR transcripts from Germany

Author: Coats, Steven¹
Organizations: ¹ English, Faculty of Humanities, University of Oulu, Oulu, Finland
Format: article
Version: published version
Access: open
Online Access: PDF Full Text (PDF, 1.4 MB)
Persistent link:
Language: English
Published: Springer Nature, 2023
Publish Date: 2023-10-24


This report describes the Corpus of German Speech (CoGS), a 56-million-word corpus of automatic speech recognition transcripts from YouTube channels of local government entities in Germany. Transcripts have been annotated with latitude and longitude coordinates, making the resource potentially useful for geospatial analyses of lexical, morpho-syntactic, and pragmatic variation; this is exemplified with an exploratory geospatial analysis of grammatical variation in the encoding of past temporal reference. Additional corpus metadata include video identifiers and timestamps on individual word tokens, making it possible to search for specific discourse content or utterance sequences in the corpus and download the underlying video and audio from the web, using open-source tools. The discourse content of the transcripts in CoGS touches upon a wide range of topics, making the resource potentially interesting as a data source for research in digital humanities and social science. The report also briefly discusses the permissibility of reuse of data sourced from German municipalities for corpus-building purposes in the context of EU, German, and American law, which clearly authorize such a use case.

Series: Language resources and evaluation
ISSN: 1574-020X
ISSN-E: 1574-0218
ISSN-L: 1574-020X
Issue: Latest articles
DOI: 10.1007/s10579-023-09686-9
Type of Publication: A1 Journal article – refereed
Field of Science: 6121 Languages
Copyright information: © The Author(s) 2023. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit