Dialect corpora from YouTube |
|
Author: | Coats, Steven |
Format: | article |
Version: | published version |
Access: | open |
Online Access: | PDF Full Text (PDF, 9.3 MB) |
Persistent link: | http://urn.fi/urn:nbn:fi-fe2023022728751 |
Language: | English |
Published: | De Gruyter, 2023 |
Publish Date: | 2023-02-27 |
Description: |
Abstract: This paper introduces two new large corpora comprised of YouTube Automatic Speech Recognition (ASR) transcripts of the speech of videos from geographically localized channels in the United States, Canada, and the British Isles, a promising resource for more in-depth study of regional language variation in spoken English. The procedure used to create the corpora bypasses the web API for YouTube, instead relying on web scraping and open-source scripts or software for the automatic identification and downloading of suitable channel content as well as dealing with the rate-limiting issues that arise thereby. In order to assess the accuracy of downloaded transcripts, word frequency statistics are compared for ASR and manual transcripts of city council meetings of Philadelphia, Pennsylvania, USA, and a transcript classification task is undertaken using vector-based distributed representations of transcript content. Despite errors, corpora of ASR transcripts may prove useful for the characterization and study of regional language variation, particularly when analytical techniques are employed that are relatively robust to low-frequency phenomena.
|
Series: |
Diskursmuster |
ISSN: | 2701-0260 |
ISSN-E: | 2701-0279 |
ISSN-L: | 2701-0260 |
ISBN: | 978-3-11-101789-1 |
ISBN Print: | 978-3-11-101727-3 |
Volume: | 32 |
Pages: | 79 - 102 |
DOI: | 10.1515/9783111017433-005 |
OADOI: | https://oadoi.org/10.1515/9783111017433-005 |
Host publication: |
Language and Linguistics in a Complex World |
Host publication editor: |
Busse, Beatrix; Dumrukcic, Nina; Kleiber, Ingo |
Type of Publication: |
A3 Book chapter |
Field of Science: |
6121 Languages |
Copyright information: |
© 2023 the author(s), published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 International License. https://doi.org/10.1515/9783111017433-005 |
https://creativecommons.org/licenses/by/4.0/ |