Evaluation and prediction of content quality in Stack Overflow with logistic regression
University of Oulu, Faculty of Information Technology and Electrical Engineering, Department of Information Processing Science
Online Access: PDF Full Text (PDF, 1.2 MB)
Persistent link: http://urn.fi/URN:NBN:fi:oulu-201510282094
Publish Date: 2015-10-28
Thesis type: Master's thesis
Collaborative questioning and answering (CQA) sites such as Stack Overflow are platforms where community members can ask and answer questions, as well as interact with existing questions and answers. A question may receive multiple answers, of which only one can be selected as the best answer, indicating that it is the most suitable answer for the given question. For effective information retrieval, it would be beneficial to predict and select the best answer automatically.
This thesis presents a study that evaluates content quality on a CQA site using logistic regression and features extracted from questions, answers, and user information. A review of previous research identified the features that could be used to evaluate and predict content quality in the research case. Stack Overflow was chosen as the research case, and a sample of questions and answers was extracted for further analysis. Human-rated question scores were collected with the assistance of three people working in the field of information technology. Classifiers were then trained on various features from questions, answers, and owners’ information to choose the best answer or identify high-quality questions.
The results indicate that the models built in this research for evaluating answer quality have high predictive ability and strong robustness, while the models for evaluating question quality have low predictive ability. In addition, several features from questions, answers, and owners’ information proved to be valuable components in evaluating and predicting content quality, such as the owner’s reputation points and the question or answer score, whereas the human-rated question score had no significant influence on evaluating answer quality. This research contributes to science and has implications for practice. For example, one main contribution is that, based on the models built in this study, CQA sites could automatically suggest the best answers to their users, saving time for users seeking help on such sites.
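The classification approach described above can be illustrated with a minimal logistic regression trained by batch gradient descent. This is a sketch only: the two features (normalised owner reputation and answer score), the toy data values, and the training hyperparameters are illustrative assumptions, not the actual feature set or data from the thesis.

```python
import math

def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=1000):
    """Fit feature weights plus a bias term with batch gradient descent."""
    n_features = len(X[0])
    w = [0.0] * (n_features + 1)  # last entry is the bias
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]
            err = sigmoid(z) - yi  # derivative of the log-loss w.r.t. z
            for j in range(n_features):
                grad[j] += err * xi[j]
            grad[-1] += err
        for j in range(n_features + 1):
            w[j] -= lr * grad[j] / len(X)
    return w

def predict(w, x):
    """Probability that an answer is the 'best' answer."""
    z = sum(wj * xj for wj, xj in zip(w[:-1], x)) + w[-1]
    return sigmoid(z)

# Hypothetical toy data: [normalised owner reputation, normalised answer score]
X = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.1, 0.3]]
y = [1, 1, 0, 0]  # 1 = accepted ("best") answer, 0 = not accepted

w = train_logistic(X, y)
print(predict(w, [0.85, 0.90]))  # high reputation and score: probability > 0.5
print(predict(w, [0.10, 0.20]))  # low reputation and score: probability < 0.5
```

In practice, a study like this one would use many more features and a standard library implementation with regularisation; the sketch only shows how reputation- and score-style features feed a logistic model that ranks candidate answers.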
© Daoying Qiu, 2015. This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.