
CLASSIFICATION

Consider a set of training data samples in which each sample is labelled as belonging to one of several pre-specified 'classes'. Data classification then assigns newly presented data samples to these classes on the basis of mathematical 'models' built for each class.

For example, suppose there are five pre-defined classes and one test point which must be assigned to one of them. When the classes are well separated this is simple, but when data is noisy or the classes overlap, perfect classification becomes impossible. The goal then is to choose the model which minimises the classification error. For any given classification problem there is a fundamental limit to the classification accuracy achievable, and the corresponding minimum error rate is called the "Bayes error" for that problem.
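To make the idea of a Bayes error concrete, here is a minimal sketch for one idealised case: two equally likely classes, each a one-dimensional Gaussian with the same standard deviation. These assumptions, and the function name `bayes_error_two_gaussians`, are chosen purely for illustration and are not taken from Chameleon. In this case the best possible decision boundary is the midpoint between the two means, and the error is the probability mass each class places on the wrong side of it.

```python
import math

def bayes_error_two_gaussians(mu0, mu1, sigma):
    """Bayes error for two equally likely 1-D Gaussian classes sharing sigma.

    Illustrative assumption: equal priors and equal variances, so the
    optimal boundary is the midpoint between the two means.
    """
    # Separation of the means, measured in standard deviations.
    d = abs(mu1 - mu0) / sigma
    # Each class misclassifies the tail lying beyond the midpoint, i.e.
    # Phi(-d/2), where Phi is the standard normal CDF.
    return 0.5 * math.erfc(d / (2.0 * math.sqrt(2.0)))
```

With the means two standard deviations apart, the function returns about 0.159, so no classifier can do better than roughly 16% error on that problem. With identical means it returns 0.5, which is no better than guessing.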

There are two basic types of classifier: (a) those which attempt to minimise the error rate without regard to density estimation, such as (i) neural networks, (ii) decision trees, and (iii) support vector machines; and (b) those which use density estimates to derive a classification, such as (i) nearest-neighbour methods, (ii) (Gaussian) mixture models, and (iii) kernel-based methods.
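As a minimal sketch of a type (b) classifier, the code below fits a single one-dimensional Gaussian density to each class and combines the densities with the class priors using Bayes' rule. It is a deliberately simplified stand-in for the mixture-model approach, not Chameleon's own algorithm, and the function names `fit_gaussian_classes`, `posteriors` and `classify` are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_classes(samples, labels):
    """Estimate (prior, mean, variance) for each class from 1-D training data."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    total = len(samples)
    models = {}
    for y, xs in grouped.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        # Floor the variance so a class with identical samples stays usable.
        models[y] = (len(xs) / total, mean, max(var, 1e-9))
    return models

def posteriors(models, x):
    """Return P(class | x) for every class via Bayes' rule."""
    joint = {}
    for y, (prior, mean, var) in models.items():
        density = math.exp(-((x - mean) ** 2) / (2.0 * var))
        density /= math.sqrt(2.0 * math.pi * var)
        joint[y] = prior * density
    evidence = sum(joint.values())
    if evidence == 0.0:
        # x is so far from every class that all densities underflow;
        # fall back to the priors rather than divide by zero.
        return {y: prior for y, (prior, _, _) in models.items()}
    return {y: p / evidence for y, p in joint.items()}

def classify(models, x):
    """Assign x to the class with the highest posterior probability."""
    probs = posteriors(models, x)
    return max(probs, key=probs.get)
```

The point of the design is that `classify` is only a thin wrapper: the per-class probabilities from `posteriors` are available alongside the final assignment, which is what distinguishes density-based classifiers from those that return a label alone.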

The former methods (a) typically give only the class assignment, while the latter methods (b) also give an estimated probability of a sample belonging to each class. This means that the former methods, despite often giving good classification accuracy, are not recommended when accountability is essential (e.g. medical image analysis) or when ranked probabilities are required (e.g. speech recognition). See the Algorithms page for classification research carried out at Seventh Sense Software.
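The sketch below illustrates what per-class probabilities make possible: a ranked list of candidate classes, and the option to withhold a decision when the best candidate is not convincing. The function name `rank_classes` and the `min_confidence` threshold of 0.8 are assumptions made for this example only, as are the probability values in the usage note.

```python
def rank_classes(posteriors, min_confidence=0.8):
    """Rank classes by probability and abstain when the top one is weak.

    `posteriors` maps each class label to its estimated probability.
    Returns (ranked, decision), where `decision` is None if no class
    reaches `min_confidence` (an illustrative threshold).
    """
    # Most probable class first.
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    best_class, best_prob = ranked[0]
    # Only commit to a class when its probability is high enough.
    decision = best_class if best_prob >= min_confidence else None
    return ranked, decision
```

For hypothetical probabilities `{"a": 0.55, "b": 0.40, "c": 0.05}` the ranking is a, b, c, but the decision is `None`, because the leading class falls short of the threshold. A classifier that returns only a label cannot support either the ranking or the abstention.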