Organizer:

*Hamparsum Bozdogan*

University of Tennessee, Knoxville

Chair:

*Stan Sclove*

University of Illinois at Chicago

Invited Session Papers:

**Inductive Inference via the Shortest Explanation**

*Rohan Baxter*, Ultimode Inc, Berkeley, CA

**Information Complexity Criteria for Detecting Influential Observations in Dynamic Multivariate Linear Models Using the Genetic Algorithm**

*Hamparsum Bozdogan*, University of Tennessee, and

*Peter M. Bearse*, University of North Carolina at Greensboro

**Combinatorial Cluster Analysis of Variables of High Dimensional Data Using the Genetic Algorithm and Information Complexity Criteria as the Fitness Function**

*Hamparsum Bozdogan*, University of Tennessee, and

*Stanley L. Sclove*, University of Illinois at Chicago

**Inductive Inference via the Shortest Explanation**

*Rohan Baxter*

Ultimode Inc., Berkeley, CA

rohan@ultimode.com

This talk describes a general theory of inductive inference. Of all possible theories about the available data, we prefer the theory which leads to the shortest explanation. This is a form of Occam's Razor, but information theory provides a precise and quantifiable definition of the length of an explanation. An "explanation" is a message that begins with a general assertion ("theory") about the data, and then continues with an exact statement of the available data encoded in a code which would be efficient were the "theory" true.
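The two-part structure above can be made concrete with a toy calculation. The sketch below, a pure illustration and not the talk's exact construction, measures an "explanation" of coin-flip data as: bits to assert a Bernoulli parameter to a fixed precision, plus bits to encode the data under the Shannon-optimal code that parameter implies. The function name, the grid precision, and the data are all assumptions made for the example.

```python
import math

def two_part_length(data, precision_bits=6):
    """Length in bits of a two-part explanation: first assert theta on a
    uniform grid of 2**precision_bits values, then encode the data with the
    code that would be efficient were that theta true."""
    n, k = len(data), sum(data)
    grid = 2 ** precision_bits
    # State theta to fixed precision, avoiding the degenerate endpoints 0 and 1.
    theta = max(1, min(grid - 1, round(k / n * grid))) / grid
    first_part = precision_bits
    # Second part: -log2 P(data | theta), the Shannon-optimal data encoding.
    second_part = -(k * math.log2(theta) + (n - k) * math.log2(1 - theta))
    return first_part + second_part

data = [1] * 32 + [0] * 8  # 32 heads in 40 flips
biased = two_part_length(data)
# A "fair coin" theory asserts nothing but encodes each flip at 1 bit:
fair = -sum(math.log2(0.5) for _ in data)
print(f"biased-coin explanation: {biased:.2f} bits; fair-coin: {fair:.2f} bits")
```

Here the biased-coin theory "pays" for stating its parameter but more than recovers that cost in the data encoding, so it yields the shorter explanation and is preferred.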

A code can be considered to be "efficient" under two approaches. The
first uses Shannon's theory of information to quantify the message length,
and subtly extends Bayesian inference. The second is based on Kolmogorov's
theory of algorithmic complexity. The two approaches are equivalent under
weak assumptions. Note also that the two-part structure of an explanation
differs from the one-part codes used in model selection with Stochastic
Complexity, leading to differences in both principle and practice.
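To illustrate the contrast (under simplifying assumptions of my own, not the talk's), a one-part code for the same coin-flip data sends the data directly using its marginal probability under a uniform prior on the parameter, with no theory asserted along the way:

```python
import math

def one_part_length(n, k):
    """-log2 of the marginal likelihood of k heads in n flips under a
    uniform prior on theta: the integral of theta**k * (1-theta)**(n-k)
    over [0, 1] equals 1 / ((n + 1) * C(n, k))."""
    return math.log2((n + 1) * math.comb(n, k))

print(f"one-part code for 8 heads in 10 flips: {one_part_length(10, 8):.2f} bits")
```

The one-part length can be shorter, but it names no theory; the two-part explanation's extra cost buys an explicit assertion about the data, which is the point of the inference scheme described above.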

Desirable properties have been proven for this inference approach by
Solomonoff, Wallace, Rissanen, Cover and Barron. The search for short
explanations has been automated in simple but non-trivial domains
including mixture modeling, probabilistic finite state automata, decision
trees, decision graphs, string alignment, and logic programs.

**Keywords:** Minimum message length, MML, MDL, inductive inference, model selection, parameter estimation.

**Information Complexity Criteria for Detecting Influential Observations
in Dynamic Multivariate Linear Models Using the Genetic Algorithm**

*Hamparsum Bozdogan*

Department of Statistics

The University of Tennessee

Knoxville, TN 37996-0532 USA

bozdogan@utk.edu

*Peter M. Bearse*

Department of Economics

University of North Carolina at Greensboro

Greensboro, NC 27402-6165 USA

bearse@uncg.edu

We develop a new information-theoretic approach to detecting influential observations in dynamic linear models known as Vector Autoregressions (VARs) for multivariate time series. Our approach builds on recent advances in information-theoretic model selection (Bozdogan, 1988, 1990, 1994) and on globalizing strategies for optimal predictor selection in subset VAR modeling (Bearse and Bozdogan, 1998). We use a Genetic Algorithm (GA) with Bozdogan's ICOMP criterion as the fitness function to select a near-optimal subset VAR model. We then jackknife the ICOMP criterion given the subset VAR model chosen by our GA, and form a sequence of ratios -- one for each multivariate observation -- of the jackknifed ICOMP scores to the full-data ICOMP score.

Our approach yields an intuitive, practical, and rigorous two-dimensional graphical representation of influential observations in multivariate data by accounting for lack of fit and model complexity simultaneously in one criterion function, while eliminating the need for arbitrarily chosen significance levels and statistical table look-up. We demonstrate the new approach on quarterly US macroeconomic data on the civilian unemployment rate, the consumer price index, the federal funds rate, and the M2 money supply between 1960:Q1 and 1996:Q4.
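The jackknifed-ratio idea can be sketched in a few lines. The code below is a minimal illustration, not the authors' method: it uses a simplified information criterion of the form n·log det(Σ̂) + 2·C1(Σ̂), where C1 is the maximal covariance complexity, fits a plain VAR(1) by least squares rather than the GA-selected subset VAR, and runs on synthetic data. The function names and the criterion's exact form are assumptions for the sketch.

```python
import numpy as np

def c1_complexity(sigma):
    """Maximal covariance complexity: (s/2)*log(trace/s) - (1/2)*log det."""
    s = sigma.shape[0]
    return 0.5 * s * np.log(np.trace(sigma) / s) - 0.5 * np.linalg.slogdet(sigma)[1]

def icomp_var1(Y):
    """Simplified ICOMP-style score for a VAR(1) fitted by least squares."""
    X, Z = Y[1:], Y[:-1]                       # regress y_t on y_{t-1}
    B, *_ = np.linalg.lstsq(Z, X, rcond=None)
    resid = X - Z @ B
    sigma = resid.T @ resid / len(X)
    n = len(X)
    # Lack of fit (up to constants) plus a penalty for covariance complexity.
    return n * np.linalg.slogdet(sigma)[1] + 2.0 * c1_complexity(sigma)

rng = np.random.default_rng(0)
Y = 1.5 * rng.standard_normal((80, 3))         # synthetic 3-variable series
full = icomp_var1(Y)
# One ratio per observation: score with observation t deleted / full-data score.
ratios = [icomp_var1(np.delete(Y, t, axis=0)) / full for t in range(len(Y))]
```

Plotting the ratios against the observation index gives the two-dimensional influence display described above: observations whose deletion moves the criterion furthest from 1 are flagged as influential, with no significance level or table look-up required.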

**Keywords:** Information complexity, influential observations, vector
autoregressions (VARs), Genetic Algorithm (GA).

**References**

Bearse, P.M. and Bozdogan, H. (1998). Subset Selection in Vector Autoregressive Models Using the Genetic Algorithm with Informational Complexity as the Fitness Function, Systems Analysis, Modelling, and Simulation (SAMS), 31, pp. 61-91.

Bozdogan, H. (1988). ICOMP: A New Model Selection Criterion, in H. Bock (ed.), Classification and Related Methods of Data Analysis, North-Holland, Amsterdam, 599-608.

Bozdogan, H. (1990). On the Information-Based Measure of Covariance Complexity and its Application to the Evaluation of Multivariate Linear Models, Communications in Statistics, Theory and Methods, 19(1), 221-278.

Bozdogan, H. (1994). Mixture-Model Cluster Analysis Using Model Selection Criteria and a New Informational Measure of Complexity, in H. Bozdogan (ed.), Multivariate Statistical Modeling, Vol. 2, Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach, Kluwer Academic Publishers, Dordrecht, 69-113.

**Combinatorial Cluster Analysis of Variables of High Dimensional Data
Using the Genetic Algorithm and Information Complexity Criteria as the
Fitness Function**

*Hamparsum Bozdogan*

Department of Statistics

The University of Tennessee

Knoxville, TN 37996-0532 USA

bozdogan@utk.edu

*Stanley L. Sclove*

Information & Decision Sciences Dept.

M/C 294

University of Illinois at Chicago

601 S. Morgan Street

Chicago, IL 60607-7124, USA

slsclove@uic.edu

Heuristic algorithms have been applied to the problem of clustering variables, using correlations as similarity measures and taking the maximum, minimum, or average pairwise correlation as the measure of similarity between two clusters of variables.
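A minimal version of that heuristic can be written directly: agglomerative merging of variable clusters with absolute correlation as the similarity and average linkage between clusters. This is only an illustration of the baseline heuristic the paper starts from; the GA and information-complexity machinery are not reproduced, and the function name and threshold are assumptions for the sketch.

```python
import numpy as np

def cluster_variables(X, threshold=0.5):
    """Greedily merge variable clusters while the best average |correlation|
    between two clusters exceeds `threshold` (average linkage)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    clusters = [[j] for j in range(X.shape[1])]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise |corr| between the two clusters of variables.
                sim = corr[np.ix_(clusters[a], clusters[b])].mean()
                if sim > best:
                    best, pair = sim, (a, b)
        if best < threshold:
            break
        a, b = pair
        clusters[a] += clusters.pop(b)
    return clusters

rng = np.random.default_rng(1)
base = rng.standard_normal((200, 2))
# Four variables: columns 0,1 are noisy copies of one factor; 2,3 of another.
X = np.column_stack([base[:, 0], base[:, 0] + 0.1 * rng.standard_normal(200),
                     base[:, 1], base[:, 1] + 0.1 * rng.standard_normal(200)])
print(cluster_variables(X))
```

On this synthetic example the two noisy copies of each factor end up in the same cluster, which is the behavior the correlation-based heuristic is designed to produce.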

In this paper, we generalize the ideas presented in Sclove (1987, p. 338) to clustering variables of high-dimensional multivariate data. We propose one or a combination of the following two strategies for clustering variables.

(ii) For the best variable-complex extracted by our GA, we consider various combinatorial configurations of the covariance matrix as corresponding to clusterings of variables.

Such an approach provides a more rigorous technique for clustering variables as compared to using principal components.

We illustrate our new approach on real data sets and show the versatility of these techniques in simultaneously reducing the dimensionality and the bias in high-dimensional multivariate data.

**Keywords:** Clustering variables, Genetic Algorithm (GA), and Information Complexity Criteria.

**References**

Bearse, P.M. and Bozdogan, H. (1998). Subset Selection in Vector Autoregressive Models Using the Genetic Algorithm with Informational Complexity as the Fitness Function, Systems Analysis, Modelling, and Simulation (SAMS), 31, pp. 61-91.

Bozdogan, H. (1988). ICOMP: A New Model Selection Criterion, in H. Bock (ed.), Classification and Related Methods of Data Analysis, North-Holland, Amsterdam, 599-608.

Bozdogan, H. (1990). On the Information-Based Measure of Covariance Complexity and its Application to the Evaluation of Multivariate Linear Models, Communications in Statistics, Theory and Methods, 19(1), 221-278.

Bozdogan, H. (1994). Mixture-Model Cluster Analysis Using Model Selection Criteria and a New Informational Measure of Complexity, in H. Bozdogan (ed.), Multivariate Statistical Modeling, Vol. 2, Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach, Kluwer Academic Publishers, Dordrecht, 69-113.

Sclove, Stanley L. (1987). Application of Model-Selection Criteria to Some Problems in Multivariate Analysis, Psychometrika, 52, 333-343.