Feature Selection in High Dimension Sample Spaces

Bertoncelli, D; Caianiello, Pasquale; Costantini, Stefania

Classification and multivariate class prediction problems arise in many scientific disciplines and often rely on large data sets of high-dimensional sample vectors. To reduce dimensionality and make the problem tractable, one can pre-process the data to extract a minimum subset of feature variables that suffices to identify the items. This problem is addressed in the literature as Minimum Test Set and is known to be NP-hard by a reduction from Set Cover. We use a procedure based on computing the mutual information between sets of symbolic feature variables. We present results obtained by applying the procedure to a previously discretised data set from the integrated circuit industry, which collects wafer data by sampling chips over a large number of real-valued attribute variables (typically about a thousand). The experiments give evidence that the procedure is feasible and effective: it yields a small number of features that provide sufficient information for chip failure prediction.
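As an illustration of what selecting features by mutual information between sets of symbolic variables can look like, the following minimal sketch greedily grows a feature subset S so as to maximise the joint mutual information I(X_S; Y) with the class label Y. The abstract does not specify the procedure, so this is an assumption and not the authors' method: the greedy forward search, the stopping tolerance, the function names (`entropy`, `joint_mutual_information`, `greedy_select`) and the toy data are all hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Empirical Shannon entropy (in bits) of a sequence of hashable symbols."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def joint_mutual_information(rows, feature_idx, target):
    """Empirical I(X_S; Y) = H(X_S) + H(Y) - H(X_S, Y) for the feature set S."""
    xs = [tuple(row[i] for i in feature_idx) for row in rows]
    return entropy(xs) + entropy(target) - entropy(list(zip(xs, target)))


def greedy_select(rows, target, max_features=None, tol=1e-9):
    """Greedy forward selection (an assumed heuristic, not an exact solver).

    Adds the feature that most increases I(X_S; Y) and stops when the
    selected set explains the target entropy, when no candidate improves
    the score, or when max_features is reached.
    """
    n_features = len(rows[0])
    limit = n_features if max_features is None else max_features
    h_target = entropy(target)
    selected, best_mi = [], 0.0
    while len(selected) < limit and h_target - best_mi > tol:
        candidates = [i for i in range(n_features) if i not in selected]
        if not candidates:
            break
        score, idx = max(
            (joint_mutual_information(rows, selected + [i], target), i)
            for i in candidates
        )
        if score - best_mi <= tol:
            break  # no remaining feature adds information
        selected.append(idx)
        best_mi = score
    return selected, best_mi


# Toy discretised data (hypothetical): a chip fails only when
# features 0 and 2 are both "hi"; feature 1 is irrelevant.
levels = ("lo", "hi")
rows = [(a, b, c) for a in levels for b in levels for c in levels]
target = ["fail" if a == "hi" and c == "hi" else "pass" for a, b, c in rows]

selected, mi = greedy_select(rows, target)
print(sorted(selected), round(mi, 4), round(entropy(target), 4))
```

On this toy data the search stops once the selected features jointly carry all the information in the label. A greedy search of this kind is only a heuristic: it can miss features that are informative only in combination, and it does not guarantee a minimum subset, which is consistent with the NP-hardness of the exact problem.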