Knowledge discovery method for deriving conditional probabilities from large datasets
1University of Oulu, Faculty of Technology, Department of Electrical and Information Engineering
|Online Access:||PDF Full Text (PDF, 2.3 MB)|
|Persistent link:|| http://urn.fi/urn:isbn:9789514286698
|Publish Date:|| 2007-12-04
|Thesis type:||Doctoral Dissertation
|Defence Note:||Academic dissertation to be presented, with the assent of the Faculty of Technology of the University of Oulu, for public defence in Auditorium TS101, Linnanmaa, on December 14th, 2007, at 12 noon
Doctor Bogdan Filipi&#x10d;
Doctor Jukka Kömi
In today's world, enormous amounts of data are being collected everyday. Thus, the problems of storing, handling, and utilizing the data are faced constantly. As the human mind itself can no longer interpret the vast datasets, methods for extracting useful and novel information from the data are needed and developed. These methods are collectively called knowledge discovery methods.
In this thesis, a novel combination of feature selection and data modeling methods is presented in order to help with this task. This combination includes the methods of basic statistical analysis, linear correlation, self-organizing map, parallel coordinates, and k-means clustering. The presented method can be used, first, to select the most relevant features from even hundreds of them and, then, to model the complex inter-correlations within the selected ones. The capability to handle hundreds of features opens up the possibility to study more extensive processes instead of just looking at smaller parts of them. The results of k-nearest-neighbors study show that the presented feature selection procedure is valid and appropriate.
A second advantage of the presented method is the possibility to use thousands of samples. Whereas the current rules of selecting appropriate limits for utilizing the methods are theoretically proved only for small sample sizes, especially in the case of linear correlation, this thesis gives the guidelines for feature selection with thousands of samples. A third positive aspect is the nature of the results: given that the outcome of the method is a set of conditional probabilities, the derived model is highly unrestrictive and rather easy to interpret.
In order to test the presented method in practice, it was applied to study two different cases of steel manufacturing with hot strip rolling. In the first case, the conditional probabilities for different types of retentions were derived and, in the second case, the rolling conditions for the occurrence of wedge were revealed. The results of both of these studies show that steel manufacturing processes are indeed very complex and highly dependent on the various stages of the manufacturing. This was further confirmed by the fact that with studies of k-nearest-neighbors and C4.5, it was impossible to derive useful models concerning the datasets as a whole. It is believed that the reason for this lies in the nature of these two methods, meaning that they are unable to grasp such manifold inter-correlations in the data. On the contrary, the presented method of conditional probabilities allowed new knowledge to be gained of the studied processes, which will help to better understand these processes and to enhance them.
Acta Universitatis Ouluensis. C, Technica
© University of Oulu, 2007. This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.