University of Oulu

Coats, S. (2023). Dialect Corpora from YouTube. In B. Busse, N. Dumrukcic & I. Kleiber (Eds.), Language and Linguistics in a Complex World (pp. 79-102). Berlin, Boston: De Gruyter.

Dialect corpora from YouTube

Author: Coats, Steven
Format: article
Version: published version
Access: open
Online Access: PDF Full Text (PDF, 9.3 MB)
Language: English
Published: De Gruyter, 2023
Publish Date: 2023-02-27


This paper introduces two new large corpora comprised of YouTube Automatic Speech Recognition (ASR) transcripts of the speech of videos from geographically localized channels in the United States, Canada, and the British Isles, a promising resource for more in-depth study of regional language variation in spoken English. The procedure used to create the corpora bypasses the web API for YouTube, instead relying on web scraping and open-source scripts or software for the automatic identification and downloading of suitable channel content as well as dealing with the rate-limiting issues that arise thereby. In order to assess the accuracy of downloaded transcripts, word frequency statistics are compared for ASR and manual transcripts of city council meetings of Philadelphia, Pennsylvania, USA, and a transcript classification task is undertaken using vector-based distributed representations of transcript content. Despite errors, corpora of ASR transcripts may prove useful for the characterization and study of regional language variation, particularly when analytical techniques are employed that are relatively robust to low-frequency phenomena.


Series: Diskursmuster
ISSN: 2701-0260
ISSN-E: 2701-0279
ISSN-L: 2701-0260
ISBN: 978-3-11-101789-1
ISBN Print: 978-3-11-101727-3
Volume: 32
Pages: 79 - 102
DOI: 10.1515/9783111017433-005
Host publication: Language and Linguistics in a Complex World
Host publication editor: Busse, Beatri
Dumrukcic, Nina
Kleiber, Ingo
Type of Publication: A3 Book chapter
Field of Science: 6121 Languages
Copyright information: © 2023 the author(s), published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 International License.