TUWHERA Open Theses & Dissertations
AUT University

Effect of imbalanced data on document classification algorithms

Paul, Amrita
Permanent link
http://hdl.handle.net/10292/7413
Abstract
Text classification is the task of assigning predefined categories to free-text documents. With the ever-increasing volume of electronic documents, digital libraries and web resources, document classification has become critical to higher-level document processing tasks such as information extraction, named entity recognition and event modelling. Text categorization is considered challenging because of the large number of features in a typical text document. In spite of this, various categorization algorithms have reached accuracies in the vicinity of 90%. Probability-based algorithms have generally been found to perform better on Natural Language Processing tasks than other types of algorithms, and they are also highly extensible.

In this thesis, a tool called MALLET (MAchine Learning for LanguagE Toolkit) was used to perform document classification with a set of probabilistic algorithms, in order to determine how imbalanced data affects their performance compared with balanced data. The data was taken from the Reuters Corpus (RCV1), which contains categorized newspaper articles. Although the corpus has many fine-grained levels of categorization, this research used four upper-level topic codes, each recast as a binary category: a document either belongs to the category or does not. The documents were then converted into a form acceptable to MALLET and tested for categorization with the chosen algorithms.
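The binary recasting described above can be sketched in a few lines of Python. This is only an illustration, not the thesis's preprocessing code: the topic codes are assumed to be the four RCV1 top-level codes (CCAT, ECAT, GCAT, MCAT), and the document ids, topic assignments and helper names are hypothetical.

```python
from collections import Counter

# Hypothetical toy documents: (document id, set of top-level topic codes).
# The code names are assumed RCV1 top-level codes; the assignments are invented.
DOCS = [
    ("d1", {"CCAT"}),
    ("d2", {"GCAT", "ECAT"}),
    ("d3", {"MCAT"}),
    ("d4", {"CCAT", "MCAT"}),
    ("d5", {"GCAT"}),
]


def to_binary_labels(docs, target):
    """Label each document 'in' if it carries the target code, else 'out'."""
    return [(doc_id, "in" if target in codes else "out") for doc_id, codes in docs]


def class_balance(labelled):
    """Count how many documents fall into each binary class."""
    return Counter(label for _, label in labelled)


if __name__ == "__main__":
    # One binary task per topic code; the printed counts show how skewed each task is.
    for code in ("CCAT", "ECAT", "GCAT", "MCAT"):
        print(code, dict(class_balance(to_binary_labels(DOCS, code))))
```

Treating each code as its own in/out task means the class balance can differ from one task to the next, which is what makes a balanced versus imbalanced comparison possible.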

The algorithms used were Naïve Bayes, Balanced Winnow and three variations of Max Ent, namely Max Ent, Max Ent L1 and MC Max Ent. First, these probability-based algorithms were found to perform marginally better than other algorithms reported in previous work on a similar genre of input data. More significantly, the algorithms performed similarly, or in some cases even better, on imbalanced data compared with balanced data. This was due to the vocabulary properties of the documents used for training, and it attests to the resilience of probability-based algorithms for text categorization.
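As a rough illustration of how vocabulary can offset a skewed class prior, the sketch below implements a small multinomial Naïve Bayes classifier with add-one smoothing and trains it on an imbalanced toy set (one "in" document against four "out" documents). It is not MALLET's implementation and does not reproduce the thesis's experiments; the training data, the function names and the choice of smoothing are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, deliberately imbalanced toy training set: 1 "in" vs 4 "out".
TRAIN = [
    (["shares", "profit", "market"], "in"),
    (["election", "vote"], "out"),
    (["match", "goal", "team"], "out"),
    (["weather", "rain"], "out"),
    (["election", "team"], "out"),
]


def train_nb(examples):
    """Fit a multinomial Naive Bayes model from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = sum(class_counts.values())
    log_prior = {c: math.log(k / n_docs) for c, k in class_counts.items()}
    totals = {c: sum(word_counts[c].values()) for c in class_counts}
    return log_prior, word_counts, totals, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log posterior (add-one smoothing)."""
    log_prior, word_counts, totals, vocab = model
    v = len(vocab)
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for t in tokens:
            if t in vocab:  # tokens never seen in training are ignored
                score += math.log((word_counts[c][t] + 1) / (totals[c] + v))
        scores[c] = score
    return max(scores, key=scores.get)


if __name__ == "__main__":
    model = train_nb(TRAIN)
    print(predict_nb(model, ["profit", "market"]))
    print(predict_nb(model, ["election", "vote"]))
```

In this toy setting the minority class has a prior of only 0.2, yet a document made of words seen only in that class is still assigned to it, because the word likelihoods outweigh the prior. Whether the same mechanism accounts for the thesis's findings on RCV1 is not something this sketch can establish.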
Keywords
Classify; Texts
Date
2014
Item Type
Thesis
Supervisor(s)
Nand, Parma
Degree Name
Master of Computer and Information Sciences
Publisher
Auckland University of Technology

Hosted by Tuwhera, an initiative of the Auckland University of Technology Library

 

 
