A new corpus of geolocated ASR transcripts from Germany
English, Faculty of Humanities, University of Oulu, Oulu, Finland
|Persistent link:|| http://urn.fi/urn:nbn:fi-fe20231024141107
|Publish Date:|| 2023-10-24
This report describes the Corpus of German Speech (CoGS), a 56-million-word corpus of automatic speech recognition transcripts from YouTube channels of local government entities in Germany. Transcripts have been annotated with latitude and longitude coordinates, making the resource potentially useful for geospatial analyses of lexical, morpho-syntactic, and pragmatic variation; this is exemplified with an exploratory geospatial analysis of grammatical variation in the encoding of past temporal reference. Additional corpus metadata include video identifiers and timestamps on individual word tokens, making it possible to search for specific discourse content or utterance sequences in the corpus and download the underlying video and audio from the web, using open-source tools. The discourse content of the transcripts in CoGS touches upon a wide range of topics, making the resource potentially interesting as a data source for research in digital humanities and social science. The report also briefly discusses the permissibility of reuse of data sourced from German municipalities for corpus-building purposes in the context of EU, German, and American law, which clearly authorize such a use case.
|Journal:|| Language Resources and Evaluation
|Type of Publication:|| A1 Journal article – refereed
|Field of Science:||
© The Author(s) 2023. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.