Demographic inference and affect estimation of microbloggers

Pandya, Abhinay

University of Oulu

13 November 2020

This item is protected by copyright and/or related rights. You may use the item in the ways permitted by the copyright and related-rights legislation that applies to your use. For any other use, you need permission from the rights holders.

The permanent address of the publication is
https://urn.fi/URN:ISBN:9789526227467

Description

Academic dissertation to be presented with the assent of the Doctoral Training Committee of Information Technology and Electrical Engineering of the University of Oulu for public defence in the Tönning auditorium (L4), Linnanmaa, on 20 November 2020, at 12 noon

Abstract

Owing to the peculiar nature of discourse on Twitter, developing analytical frameworks that derive useful insights from it remains challenging, as evidenced by poor performance on tasks such as demographic inference, affect estimation, and event detection. One focal problem lies in analyzing short texts in general, and tweets in particular. Such analysis is difficult because of the vagaries of linguistic expression, and Twitter exacerbates the difficulty by enabling the use of emojis, hashtags, URLs, and embedded media. While previous research has demonstrated ways of extracting useful information from individual tweet texts to some extent, the role of metadata has not yet been investigated systematically and in detail. Furthermore, most previous work has paid little or no attention to the emerging role of deep learning approaches in Twitter-based analytics. These observations motivate this thesis and inform its scientific objectives: it aims to enhance machine understanding of tweets in order to derive deeper insights from public Twitter data. First, the thesis empirically investigates the impact and efficacy of deep learning approaches that integrate message text and metadata by leveraging distributed semantic representations of textual entities. Second, it contributes to capturing richer semantics from tweets by harnessing external, open-source knowledge graphs and other crowd-sourced lexical resources. Third, it examines and quantifies the role of user-created metadata, such as hashtags and URLs, in machine understanding of tweets. In addition, computational models are introduced to derive the conversational, topical, and temporal contexts of tweets and to use them in machine learning models to improve Twitter-based analytics.
The proposed machine learning models, which integrate the diverse footprints of users' online activity and behavior, are validated by applying them in several case-study applications. The datasets and tools developed during this thesis have also been made publicly available to the scientific community.

Abstract (translated from Finnish)

Twitter-based analytics has entered the toolkit of several disciplines in recent years. However, developing systematic analytical frameworks is challenging because of the special nature of microblog discourse. Analysis methods have been found to perform poorly in several applications, such as demographic and affect analysis of authors, and tasks that aim to detect important events from microblog posts. The analyses have to be performed on very short pieces of text, in this study microblog posts in particular. Idiosyncratic and personal linguistic expressions, together with Twitter's emojis, hashtags, external links (URLs), and embedded images and videos, further diversify the problem space. Previous studies have succeeded to some extent in deriving useful information from individual microblog posts, but the role and significance of metadata have not yet been studied systematically or in detail. In addition, the use of deep learning in analyses of Twitter-based data has been studied little or not at all. The aim of this dissertation is to improve the ability of computers to process microblog posts so that a better and more meaningful machine understanding of public Twitter data becomes possible. First, the study empirically tests the impact and efficacy of a deep learning model in integrating distributed semantic representations of, among other things, textual entities. Second, the work improves the content analysis of microblog posts with the help of external, open-source knowledge graphs and other crowd-sourced lexicons. Third, the roles of user-created metadata, such as hashtags and external links, in analytical frameworks are studied and quantified.
The work presents computational models for inferring the conversational, topical, and temporal contexts of microblog posts and uses these models to improve the performance of machine learning models in analytics based on Twitter data. The diverse data obtained from microbloggers' online behavior is integrated by means of machine learning models. The datasets used in the work and the tools developed in the study have been made publicly available to the scientific community.

Original papers

Original papers are not included in the electronic version of the dissertation.

Pandya, A., & Oussalah, M. (2017). Novel semantics-based distributed representations for message polarity classification using deep convolutional neural networks. Proceedings of the 9th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, 71–82. https://doi.org/10.5220/0006500800710082
Pandya, A., Oussalah, M., Monachesi, P., & Kostakos, P. (2020). On the use of distributed semantics of tweet metadata for user age prediction. Future Generation Computer Systems, 102, 437–452. https://doi.org/10.1016/j.future.2019.08.018
Pandya, A., Oussalah, M., Monachesi, P., Kostakos, P., & Loven, L. (2018). On the use of URLs and hashtags in age prediction of Twitter users. 2018 IEEE International Conference on Information Reuse and Integration (IRI), 62–69. https://doi.org/10.1109/IRI.2018.00017
Kostakos, P., Pandya, A., Kyriakouli, O., & Oussalah, M. (2018). Inferring demographic data of marginalized users in Twitter with computer vision APIs. 2018 European Intelligence and Security Informatics Conference (EISIC), 81–84. https://doi.org/10.1109/EISIC.2018.00022
Pandya, A., Oussalah, M., Kostakos, P., & Fatima, U. (2020). MaTED: Metadata-assisted Twitter event detection system. Communications in Computer and Information Science, 1237, 402–414. https://doi.org/10.1007/978-3-030-50146-4_30
Kostakos, P., Sprachalova, L., Pandya, A., Aboeleinen, M., & Oussalah, M. (2018). Covert online ethnography and machine learning for detecting individuals at risk of being drawn into online sex work. 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 1096–1099. https://doi.org/10.1109/ASONAM.2018.8508276

Collections

Open access [32043]