Local and personalised models for prediction, classification and knowledge discovery on real world data modelling problems
MetadataShow full metadata
This thesis presents several novel methods to address some of the real world data modelling issues through the use of local and individualised modelling approaches. A set of real world data modelling issues such as modelling evolving processes, defining unique problem subspaces, identifying and dealing with noise, outliers, missing values, imbalanced data and irrelevant features, are reviewed and their impact on the models are analysed. The thesis has made nine major contributions to information science, includes four generic modelling methods, three real world application systems that apply these methods, a comprehensive review of the real world data modelling problems and a data analysis and modelling software. Four novel methods have been developed and published in the course of this study. They are: (1) DyNFIS – Dynamic Neuro-Fuzzy Inference System, (2) MUFIS – A Fuzzy Inference System That Uses Multiple Types of Fuzzy Rules, (3) Integrated Temporal and Spatial Multi-Model System, (4) Personalised Regression Model. DyNFIS addresses the issue of unique problem subspaces by identifying them through a clustering process, creating a fuzzy inference system based on the clusters and applies supervised learning to update the fuzzy rules, both antecedent and consequent part. This puts strong emphasis on the unique problem subspaces and allows easy to understand rules to be extracted from the model, which adds knowledge to the problem. MUFIS takes DyNFIS a step further by integrating a mixture of different types of fuzzy rules together in a single fuzzy inference system. In many real world problems, some problem subspaces were found to be more suitable for one type of fuzzy rule than others and, therefore, by integrating multiple types of fuzzy rules together, a better prediction can be made. The type of fuzzy rule assigned to each unique problem subspace also provides additional understanding of its characteristics. The Integrated Temporal and Spatial Multi-Model System is a different approach to integrating two contrasting views of the problem for better results. The temporal model uses recent data and the spatial model uses historical data to make the prediction. By combining the two through a dynamic contribution adjustment function, the system is able to provide stable yet accurate prediction on real world data modelling problems that have intermittently changing patterns. The personalised regression model is designed for classification problems. With the understanding that real world data modelling problems often involve noisy or irrelevant variables and the number of input vectors in each class may be highly imbalanced, these issues make the definition of unique problem subspaces less accurate. The proposed method uses a model selection system based on an incremental feature selection method to select the best set of features. A global model is then created based on this set of features and then optimised using training input vectors in the test input vector’s vicinity. This approach focus on the definition of the problem space and put emphasis the test input vector’s residing problem subspace. The novel generic prediction methods listed above have been applied to the following three real world data modelling problems: 1. Renal function evaluation which achieved higher accuracy than all other existing methods while allowing easy to understand rules to be extracted from the model for future studies. 2. Milk volume prediction system for Fonterra achieved a 20% improvement over the method currently used by Fonterra. 3. Prognoses system for pregnancy outcome prediction (SCOPE), achieved a more stable and slightly better accuracy than traditional statistical methods. These solutions constitute a contribution to the area of applied information science. In addition to the above contributions, a data analysis software package, NeuCom, was primarily developed by the author prior and during the PhD study to facilitate some of the standard experiments and analysis on various case studies. This is a full featured data analysis and modelling software that is freely available for non-commercial purposes (see Appendix A for more details). In summary, many real world problems consist of many smaller problems. It was found beneficial to acknowledge the existence of these sub-problems and address them through the use of local or personalised models. The rules extracted from the local models also brought about the availability of new knowledge for the researchers and allowed more in-depth study of the sub-problems to be carried out in future research.