Invited session on:
Intelligent Multivariate Data Mining Using Genetic Algorithms and Informational Complexity

Organizer:
Hamparsum Bozdogan
University of Tennessee, Knoxville

Chair:
Stan Sclove
University of Illinois at Chicago

Invited Session Papers:

  1. Inductive Inference via the Shortest Explanation
    Rohan Baxter, Ultimode Inc., Berkeley, CA
  2. Information Complexity Criteria for Detecting Influential Observations in Dynamic Multivariate Linear Models Using the Genetic Algorithm
    Hamparsum Bozdogan, University of Tennessee, and
    Peter M. Bearse, University of North Carolina at Greensboro
  3. Combinatorial Cluster Analysis of Variables of High Dimensional Data Using the Genetic Algorithm and Information Complexity Criteria as the Fitness Function
    Hamparsum Bozdogan, University of Tennessee, and
    Stanley L. Sclove, University of Illinois at Chicago

Abstracts follow.

Inductive Inference via the Shortest Explanation

Rohan Baxter
Ultimode Inc., Berkeley, CA
rohan@ultimode.com 

This talk describes a general theory of inductive inference. Of all possible theories about the available data, we prefer the theory that leads to the shortest explanation. This is a form of Occam's Razor, but information theory provides a precise and quantifiable definition of the length of an explanation. An "explanation" is a message that begins with a general assertion (a "theory") about the data, and then continues with an exact statement of the available data, encoded in a code that would be efficient were the "theory" true.
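As a minimal sketch of the two-part message-length idea (not from the talk itself; the half-log-n cost of stating a parameter to data-appropriate precision is a standard MML-style approximation assumed here), consider choosing among candidate bias values for a sequence of coin flips:

```python
import math

def message_length_bits(k, n, theta):
    """Two-part explanation length, in bits, for n coin flips with k heads.

    Part 1 (theory): state theta to a precision of about 1/sqrt(n); using a
    grid of sqrt(n) candidate values, stating one costs 0.5*log2(n) bits
    (an MML-style approximation assumed for this sketch).
    Part 2 (data): optimal code length for the data if theta were true.
    """
    theory_len = 0.5 * math.log2(n)
    data_len = -(k * math.log2(theta) + (n - k) * math.log2(1 - theta))
    return theory_len + data_len

# Prefer the theory whose full explanation is shortest.
n, k = 100, 63
candidates = [0.5, 0.63, 0.9]
best = min(candidates, key=lambda t: message_length_bits(k, n, t))
```

Here the middle candidate wins: the extra bits spent asserting a non-uniform theory are repaid by a shorter encoding of the data.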

A code can be considered "efficient" under two approaches. The first uses Shannon's theory of information to quantify the message length, and subtly extends Bayesian inference. The second is based on Kolmogorov's theory of algorithmic complexity. The two approaches are equivalent under weak assumptions. Note also that the two-part structure of an explanation is different from the one-part codes used in model selection with Stochastic Complexity, leading to differences in principle and practice.

Desirable properties have been proven for this inference approach by Solomonoff, Wallace, Rissanen, Cover and Barron. The search for short explanations has been automated in simple but non-trivial domains including mixture modeling, probabilistic finite state automata, decision trees, decision graphs, string alignment, and logic programs.

Keywords: Minimum message length, MML, MDL, inductive inference, model selection, parameter estimation


Information Complexity Criteria for Detecting Influential Observations in Dynamic Multivariate Linear Models Using the Genetic Algorithm

Hamparsum Bozdogan
Department of Statistics
The University of Tennessee
Knoxville, TN 37996-0532 USA
bozdogan@utk.edu

Peter M. Bearse
Department of Economics
University of North Carolina at Greensboro
Greensboro, NC 27402-6165 USA
bearse@uncg.edu
 

We develop a new information-theoretic approach to detecting influential observations in dynamic linear models known as Vector Autoregressions (VARs) for multivariate time series. Our approach builds on recent advances in information-theoretic model selection (Bozdogan, 1988, 1990, 1994) and globalizing strategies for optimal predictor selection in subset VAR modeling (Bearse and Bozdogan, 1998). We use a Genetic Algorithm (GA) with Bozdogan's ICOMP criterion as the fitness function to select a near-optimal subset VAR model. We then jackknife the ICOMP criterion given the subset VAR model chosen by our GA, and form a sequence of ratios -- one for each multivariate observation -- of the jackknifed ICOMP scores to the full-data ICOMP score. Our approach yields an intuitive, practical, and rigorous two-dimensional graphical representation of influential observations in multivariate data by accounting for lack of fit and model complexity simultaneously in one criterion function, while eliminating the need for arbitrarily chosen significance levels and statistical table look-up. We demonstrate our new approach using quarterly US macroeconomic data on the civilian unemployment rate, the consumer price index, the federal funds rate, and the M2 money supply between 1960:Q1 and 1996:Q4.
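The jackknifed-ratio diagnostic can be sketched as follows (a simplified illustration, not the authors' implementation: an ordinary multivariate regression stands in for the subset VAR chosen by the GA, and the C1 complexity of the residual covariance stands in for the complexity of the full inverse-Fisher information matrix):

```python
import numpy as np

def c1_complexity(S):
    """Bozdogan's C1 information complexity of a covariance matrix:
    C1(S) = (p/2) * log(trace(S)/p) - (1/2) * log(det(S))."""
    p = S.shape[0]
    return 0.5 * p * np.log(np.trace(S) / p) - 0.5 * np.log(np.linalg.det(S))

def icomp_score(Y, X):
    """ICOMP-style score for a multivariate linear model Y = X B + E:
    lack of fit (-2 log-likelihood, up to an additive constant) plus twice
    the C1 complexity of the residual covariance (a simplified stand-in for
    the complexity of the model's inverse-Fisher information matrix)."""
    n, p = Y.shape
    B = np.linalg.lstsq(X, Y, rcond=None)[0]
    R = Y - X @ B
    Sigma = R.T @ R / n
    neg2loglik = n * np.log(np.linalg.det(Sigma)) + n * p  # + const
    return neg2loglik + 2.0 * c1_complexity(Sigma)

def jackknife_ratios(Y, X):
    """One ratio per observation: leave-one-out ICOMP over full-data ICOMP.
    Observations whose ratio deviates sharply from the rest are flagged
    as potentially influential."""
    full = icomp_score(Y, X)
    idx = np.arange(len(Y))
    return np.array([icomp_score(Y[idx != i], X[idx != i]) / full
                     for i in idx])

# Example on synthetic data: one ratio per observation.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
Y = X @ rng.normal(size=(3, 2)) + rng.normal(size=(40, 2))
ratios = jackknife_ratios(Y, X)
```

Plotting the ratios against the observation index gives the two-dimensional influence display described in the abstract, with no significance level or table look-up required.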

Keywords: Information complexity, influential observations, vector autoregressions (VARs), Genetic Algorithm (GA).

References

Bearse, P.M. and Bozdogan, H. (1998). Subset Selection in Vector Autoregressive Models Using the Genetic Algorithm with Informational Complexity as the Fitness Function, Systems Analysis, Modelling, and Simulation (SAMS), 31, pp. 61-91.

Bozdogan, H. (1988). ICOMP: A New Model Selection Criterion, in H. Bock (ed.), Classification and Related Methods of Data Analysis, North-Holland, Amsterdam, 599-608.

Bozdogan, H. (1990). On the Information-Based Measure of Covariance Complexity and its Application to the Evaluation of Multivariate Linear Models, Communications in Statistics, Theory and Methods, 19(1), 221-78.

Bozdogan, H. (1994). Mixture-Model Cluster Analysis Using Model Selection Criteria and a New Informational Measure of Complexity, in H. Bozdogan (ed.), Multivariate Statistical Modeling, Vol. 2, Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach, Kluwer Academic Publishers, Dordrecht, 69-113.


Combinatorial Cluster Analysis of Variables of High Dimensional Data Using the Genetic Algorithm and Information Complexity Criteria as the Fitness Function

Hamparsum Bozdogan
Department of Statistics
The University of Tennessee
Knoxville, TN 37996-0532 USA
bozdogan@utk.edu
 

Stanley L. Sclove
Information & Decision Sciences Dept.
M/C 294
University of Illinois at Chicago
601 S. Morgan Street
Chicago, IL 60607-7124, USA
slsclove@uic.edu
 

Heuristic algorithms have been applied to the problem of clustering variables, based on the use of correlations as similarity measures and applying the maximum, minimum, or average pairwise correlation as a measure of similarity between two clusters of variables.

In this paper, we generalize the ideas presented in Sclove (1987, p. 338) to the clustering of variables in high-dimensional multivariate data. We propose one, or a combination, of the following two strategies for clustering variables.

(i) For large, complex, high-dimensional data, we first introduce and use the Genetic Algorithm (GA) (Bearse and Bozdogan, 1998) as our preprocessing scheme to reduce the "curse of dimensionality" in the multivariate data, using information complexity (ICOMP) (Bozdogan, 1988, 1990, 1994) as the fitness function; and then

(ii) For the best variable-complex extracted by our GA, we consider various combinatorial configurations of the covariance matrix as corresponding to clusterings of the variables.

We evaluate the special configurations of the covariance matrix as alternative models using information complexity (or other model selection criteria) under several different smoothed (or improved) covariance estimators. Using the smoothed covariance estimators, we circumvent the degeneracy of the usual maximum likelihood (ML) covariance estimator when the number of variables exceeds the number of observations. This further reduces the bias in the data.
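A minimal sketch of one such smoothed estimator (a simple shrinkage toward a scaled identity, chosen here purely for illustration; the paper considers several such estimators):

```python
import numpy as np

def smoothed_cov(X, lam=0.1):
    """Shrink the ML covariance toward a scaled identity:
    S_lam = (1 - lam) * S + lam * (trace(S)/p) * I.
    For lam > 0 this is positive definite even when the number of
    variables p exceeds the number of observations n, in which case the
    ML estimator S itself is singular."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n                # ML estimator; rank at most n - 1
    return (1 - lam) * S + lam * (np.trace(S) / p) * np.eye(p)

# p = 20 variables but only n = 10 observations:
# S is singular, S_lam is invertible.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 20))
S_lam = smoothed_cov(X)
```

Because the smoothed estimator is nonsingular, criteria that require its determinant or inverse (such as ICOMP) remain computable in the p > n regime.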

Such an approach provides a more rigorous technique for clustering variables as compared to using principal components.

We illustrate our new approach on real data sets and show the versatility of these techniques in reducing simultaneously the dimensionality and the bias in high dimensional multivariate data.

Keywords: Clustering variables, Genetic Algorithm (GA), and Information Complexity Criteria.

References

Bearse, P.M. and Bozdogan, H. (1998). Subset Selection in Vector Autoregressive Models Using the Genetic Algorithm with Informational Complexity as the Fitness Function, Systems Analysis, Modelling, and Simulation (SAMS), 31, pp. 61-91.

Bozdogan, H. (1988). ICOMP: A New Model Selection Criterion, in H. Bock (ed.), Classification and Related Methods of Data Analysis, North-Holland, Amsterdam, 599-608.

Bozdogan, H. (1990). On the Information-Based Measure of Covariance Complexity and its Application to the Evaluation of Multivariate Linear Models, Communications in Statistics, Theory and Methods, 19(1), 221-78.

Bozdogan, H. (1994). Mixture-Model Cluster Analysis Using Model Selection Criteria and a New Informational Measure of Complexity, in H. Bozdogan (ed.), Multivariate Statistical Modeling, Vol. 2, Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach, Kluwer Academic Publishers, Dordrecht, 69-113.

Sclove, Stanley L. (1987). Application of Model-Selection Criteria to Some Problems in Multivariate Analysis, Psychometrika, 52, 333-343.