Research

PhD thesis

Title: Dataset Selection for Aggregate Model Implementation in Predictive Data Mining

Date of graduation: 2 September 2010

PhD thesis abstract

In her thesis, Patricia Lutu studied the problem of designing aggregate classification models, and of selecting the associated training datasets, from the large amounts of data commonly encountered in data mining. The objectives of the study were to establish aggregate model design methods, feature selection methods and training instance selection methods that result in classification models with a high level of predictive performance. New methods were designed for feature ranking, and a new algorithm was designed for feature subset search to identify the best predictive features for classification modeling. Two methods of aggregate modeling were studied: One-Vs-All (OVA) and positive-Vs-negative (pVn) modeling. While OVA is an existing method that has been used for small datasets, pVn is a new method of aggregate modeling proposed by Patricia Lutu. The sparse confusion matrix property was defined and used as the basis for a new construct, called a confusion graph, which in turn served as a basis for the design of base models for aggregation. Two new algorithms were designed to process confusion graphs, and a further new algorithm was designed to resolve tied predictions that arise in aggregate modeling. Experimental work was conducted to demonstrate the performance of the proposed feature and training instance selection methods, aggregate modeling methods, and new algorithms. Theoretical models were developed to specify the relationships between the factors that affect the quality of selected features and aggregate model performance. Guidelines were developed for feature selection, training instance selection and aggregate modeling from large datasets.
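For readers unfamiliar with One-Vs-All aggregation, the general idea can be illustrated with a minimal sketch. This is not the thesis's implementation: the centroid-based base scorer, the function names (train_ova, predict_ova) and the toy data are all assumptions made for illustration, and ties are simply left to Python's max rather than resolved by the tie-resolution algorithm developed in the thesis. The pVn method is not shown because its details are not given in the abstract.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def centroid(rows):
    """Mean point of a non-empty list of equal-length numeric tuples."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


def train_ova(X, y):
    """Build one base model per class: that class versus all the others.

    Each hypothetical base model is just a pair of centroids
    (positive instances, negative instances). Assumes at least two classes.
    """
    models = {}
    for label in sorted(set(y)):
        pos = [x for x, t in zip(X, y) if t == label]
        neg = [x for x, t in zip(X, y) if t != label]
        models[label] = (centroid(pos), centroid(neg))
    return models


def predict_ova(models, x):
    """Aggregate the base models by returning the class with the top score.

    A base model's score is higher when x is closer to its positive
    centroid than to its negative centroid.
    """
    scores = {
        label: dist(x, neg_c) - dist(x, pos_c)
        for label, (pos_c, neg_c) in models.items()
    }
    # Placeholder tie handling: max() keeps the first best label it sees.
    return max(scores, key=scores.get)


# Toy three-class data, invented for this example.
X = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.0, 5.0), (0.1, 5.2)]
y = ["a", "a", "b", "b", "c", "c"]
models = train_ova(X, y)
print(predict_ova(models, (0.1, 0.0)))
```

Each base model only has to separate one class from the rest, so the training instances and features for each base model can in principle be selected separately, which is the kind of dataset selection question the thesis addresses.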

PhD thesis document

The thesis document is available on the University of Pretoria UPeTD website.