Abstract:
The goal of Web page categorization is to classify Web documents into a fixed number of predefined categories. Previous work in this area relied on a large number of labeled training documents for supervised learning. The problem is that labeled training documents are difficult to create: manually categorizing documents to build training data is hard, whereas unlabeled documents are easy to collect. Therefore, a new machine learning algorithm is investigated to overcome this difficulty and make effective use of unlabeled documents. In this thesis, we propose a novel approach called Iterative Cross-Training (ICT) and apply it to Web page categorization problems on four data sets. The performance of ICT was evaluated and compared with that of supervised learning, Co-Training, and Expectation Maximization algorithms. We found that the ICT algorithm is an effective approach for the Web page categorization task. We also studied the effect of noise on the Web page categorization problem and found that the ICT algorithm was robust to noise when domain knowledge was given. When no domain knowledge was available, ICT's performance loss was smaller than that of the other learning algorithms. Furthermore, an enhanced version of ICT was developed by integrating an Inductive Logic Programming (ILP) system with the ICT algorithm. The experimental results showed that the ILP system was able to increase the overall performance of ICT.