University of Oulu

Mohamad Mehdi, Chitu Okoli, Mostafa Mesgari, Finn Årup Nielsen, Arto Lanamäki, Excavating the mother lode of human-generated text: A systematic review of research that uses the Wikipedia corpus, Information Processing & Management, Volume 53, Issue 2, 2017, Pages 505-529, ISSN 0306-4573.

Excavating the mother lode of human-generated text : a systematic review of research that uses the Wikipedia corpus

Author: Mehdi, Mohamad1; Okoli, Chitu2; Mesgari, Mostafa3; Nielsen, Finn Årup4; Lanamäki, Arto5
Organizations: 1Computer Science, Concordia University, Montreal, Canada
2John Molson School of Business, Concordia University, Montreal, Canada
3Love School of Business, Elon University, Elon, NC, USA
4DTU Compute, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark
5Interact research unit, University of Oulu, Oulu, Finland
Format: article
Version: accepted version
Access: open
Online Access: PDF Full Text (PDF, 0.4 MB)
Persistent link:
Language: English
Published: Elsevier, 2017
Publish Date: 2020-03-05


Although primarily an encyclopedia, Wikipedia’s expansive content provides a knowledge base that has been continuously exploited by researchers in a wide variety of domains. This article systematically reviews the scholarly studies that have used Wikipedia as a data source, and investigates the means by which Wikipedia has been employed in three main computer science research areas: information retrieval, natural language processing, and ontology building. We report and discuss the research trends of the identified and examined studies. We further identify and classify a list of tools that can be used to extract data from Wikipedia, and compile a list of currently available data sets extracted from Wikipedia.

Series: Information processing & management
ISSN: 0306-4573
ISSN-E: 1873-5371
ISSN-L: 0306-4573
Volume: 53
Issue: 2
Pages: 505 - 529
DOI: 10.1016/j.ipm.2016.07.003
Type of Publication: A1 Journal article – refereed
Field of Science: 113 Computer and information sciences
222 Other engineering and technologies
518 Media and communications
Copyright information: © 2016 Published by Elsevier Ltd. This manuscript version is made available under the CC-BY-NC-ND 4.0 license