Induction in Noisy Domains

Reference: P. Clark and T. Niblett. Induction in Noisy Domains. In I. Bratko and N. Lavrac, editors, Progress in Machine Learning: Proc. 2nd European ML Conference (EWSL-87), pages 11-30. Sigma, Wilmslow, UK, 1987.

Abstract: This paper examines the induction of classification rules from examples using real-world data. Real-world data is almost always characterized by two features that are important for the design of an induction algorithm. Firstly, noise is often present, for example because of imperfect measuring equipment used to collect the data. Secondly, the description language is often incomplete, so that examples with identical descriptions in the language will not always be members of the same class.

Many induction systems make the 'noiseless domain' assumption that the examples do not contain errors and the description language is complete, and consequently constrain their search for rules to those for which no counter-examples exist in the data used for induction. However, in real-world domains, correlations between attributes and classes in a data set are rarely without exceptions. To locate such correlations and induce rules describing them, it is also necessary to consider rules which may not classify all the training examples correctly.

This paper first discusses some of the problems presented by noise and proposes a top-down algorithm for induction in real-world domains. It then presents an experimental comparison of this algorithm with other induction systems, using three sets of real-world medical data.
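Illustration (not from the paper): the point about counter-examples can be made concrete with a minimal Python sketch. The data set, attribute names, and the helpers covers, rule_stats, and is_consistent are all hypothetical and chosen only for illustration; this is not the algorithm proposed in the paper. The sketch shows a rule that captures a real correlation yet has one exception, so a search restricted to rules without counter-examples would discard it.

```python
# Hypothetical toy data: (attribute description, class). Not taken from the paper.
EXAMPLES = [
    ({"fever": "yes", "cough": "yes"}, "ill"),
    ({"fever": "yes", "cough": "no"}, "ill"),
    ({"fever": "yes", "cough": "yes"}, "ill"),
    # Same description as the second example but a different class:
    # the description language is incomplete (or the label is noisy).
    ({"fever": "yes", "cough": "no"}, "healthy"),
    ({"fever": "no", "cough": "no"}, "healthy"),
    ({"fever": "no", "cough": "yes"}, "healthy"),
]


def covers(condition, attributes):
    """True if every attribute test in the condition holds for the example."""
    return all(attributes.get(name) == value for name, value in condition.items())


def rule_stats(condition, predicted, examples):
    """Return (number of examples covered, number of those classified correctly)."""
    covered = [cls for attrs, cls in examples if covers(condition, attrs)]
    correct = sum(1 for cls in covered if cls == predicted)
    return len(covered), correct


def is_consistent(condition, predicted, examples):
    """The 'noiseless domain' criterion: covers something and has no counter-examples."""
    covered, correct = rule_stats(condition, predicted, examples)
    return covered > 0 and correct == covered


# Rule under test: IF fever = yes THEN class = ill
condition, predicted = {"fever": "yes"}, "ill"
covered, correct = rule_stats(condition, predicted, EXAMPLES)
print(f"covered={covered}, correct={correct}, accuracy={correct / covered:.2f}")
# prints: covered=4, correct=3, accuracy=0.75
print("accepted under noiseless assumption:",
      is_consistent(condition, predicted, EXAMPLES))
# prints: accepted under noiseless assumption: False
```

Because the second and fourth examples have identical descriptions but different classes, no specialization of the rule that still covers them can remove the exception. A system that tolerates counter-examples can keep the rule and report its accuracy instead.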