CSNA'95 Program Abstracts

CLASSIFICATION SOCIETY OF NORTH AMERICA
1995 ANNUAL MEETING

Executive Tower Inn
Denver, Colorado
June 22 - 25, 1995

PRELIMINARY LIST OF ABSTRACTS OF PAPERS TO BE PRESENTED AT CSNA-95

The section immediately following gives abstracts of the papers to be presented, listed in alphabetical order by the last name of the first -listed author. The tentative schedule of presentations at the conference and information on registration is also available.

Categorization Efficiency Measured with Trivariate Stimuli

Leola A. Alfonso-Reese, F. Gregory Ashby & David H. Brainard
University of California, Santa Barbara

reese@condor.psych.ucsb.edu

We report experiments in which the human subject's task is to view a continuous-valued three-dimensional stimulus and classify it into one of two categories. In each experiment the categories are represented by trivariate normal distributions with equal covariance matrices. The stimuli are randomly generated from these trivariate normal distributions. Across the experiments we manipulate complexity of category structure where complexity is defined as a function of the category means and covariance matrices. We measure each subject's overall performance in terms of percent correct and efficiency. From our results we establish that subjects can optimally categorize three-dimensional stimuli generated from simple category structures. Also, subjects integrate information from all three stimulus dimensions rather than ignoring information in one of them. Finally, we demonstrate that as the complexity of the category structure increases, the subject's categorization performance decreases. In one experimental condition the category structure is highly complex and all subjects perform far worse than an ideal classifier. This result contrasts with previous two -dimensional studies in which subjects show high efficiencies regardless of the category structure. Altogether, our results suggest that humans can successfully classify three-dimensional continuous-valued stimuli; however, performance is significantly affected by complexity of the category structure.

KEYWORDS: Category structure; Efficiency; Ideal observer; Human classification; Decision processes.

A Monte-Carlo Study to Determine the Validity of a [new] Fuzzy Clustering Algorithm used in Market Definition and Segmentation

Miguel A. Arrufat
Manacon, Inc.
P. O. Box 7158, Station J
Ottawa, Ontario K2A 4C5

manacon@server.iii.net

Market definition and market segmentation in high technology business sectors over the past 10 years led me to conclude in [Arrufat,1987] that the highly popular ``disjoint'' clustering methods and a couple of the newly emerging ones such as the ``additive'' type, were not appropriate. In a research paper published in [Arrufat and Haines, 1992], a ``modified c-means fuzzy clustering method'' based on work by Roubens [3] and Dunn [4] was proposed as a much better way to deal with ``high tech/volatile'' market definitions and segmentation using Rouben's NFI criterion. Besides the successful applications of the method described in [Arrufat and Haines, 1992] and [Petit, 1994], there are a few others not reported yet. In spite of its success based on ``expert opinion'', the method had not formally been validated until now. So over the past 3 years, a Monte-Carlo study was undertaken to examine not only his original NFI but his newly proposed NFI'' Roubens criterion against a uniform random distribution. The paper describes a Monte-Carlo simulation study undertaken over the past 3 years which examines the statistical validity of Roubens NFI and NFI'' in all the cases reported. The results obtained prove that both the NFI and NFI'' are statistically valid measures of clustering of great use in marketing applications. In addition, the paper also touches on the issue of random number generation in Monte-Carlo experiments using digital computers, and describes how painful these experiments could be.

The Formula of Lance & Williams and its Extension to Classification Models of VL Type

[Invited Presentation]

Helena Bacelar-Nicolau%
Laboratorio de Estatistica e Analise de Dados
Faculdade de Psicologia e de Ciencias da Educacao
Universidade de Lisboa, and

Dep. Matematica, Faculdade Ciencias e Tecnologia,
Universidade Nova de Lisboa
Lisboa, Portugal

The formula of Lance and Wiliams is a recursive formula for hierarchical agglomerative clustering criteria. It has been widely applied to generate Models of Data Representation based on Distances, that is, of MDRD type. In our approach to classification we are mostly interested in classifying variables and we must very often deal with hierarchical models of MDR type, based on Similarities, and specifically on the probabilistic VL similarity coefficient. VL stands for ``Validity of Link'' (Bacelar-Nicolau 1987, Nicolau 1983), where the similarity is evaluated by the cumulative distribution function of some basic similarity coefficient, under a suitable reference hypothesis ( Vraisemblance du Lien, in French, see Lerman 1970, Bacelar-Nicolau 1973). Therefore we validate the basic similarity values (which measure the degree of link between variables) by reference to the same probabilistic scale.

In this work, after reviewing some properties of the L & W formula, we use the corresponding adapted formula for analyzing data structures of MDRS type, and study an extended formula we propose for that case. It incorporates a generalised parametric probabilistic family of VL type (a previous extension to a more restricted family was previously studied by F. C. Nicolau).

Developing clustering models into general parametric families unifies the study of the methodology and simplifies the application of the algorithms. It allows us to perform in an easier way comparative studies among different models of either simulated or real data sets, searching for robustness of models in the family (that is in some parts of the ``super model'' which the family in fact represents) or of data structures being analyzed (and choosing the model(s) which best fit the data structure). Of course, introducing VL models in the generalized recursive formula reinforces those properties and improves robustness studies with the new methodology.

KEYWORDS: Aggregation criterion; Hierarchical clustering; Recursive formula; Model of data representation; Probabilistic approach.

Clustering Problems That Arise in the Analysis of Data on Pornography Consumption on the Internet

David Banks
Department of Statistics
Carnegie Mellon University
Pittsburgh PA 15213

banks@stat.cmu.edu

Electronic access to pornography has exploded in recent years, both through the Internet and commercial bulletin boards. This talk describes joint work undertaken at Carnegie Mellon to assess quantitative aspects of the demand and supply of such material. Key problems involve the development of a classification scheme for pornographic material, the identification of patterns of user preferences from the log-files of commercial bulletin board systems operators, and the measurement of the demand for such material based on data obtained from usage records of students. The results of this study have interesting potential ramifications in law, psychology, and sociology. In particular, this is the first study that does not rely upon self-reports to measure demand for particular types and amounts of pornography.

Resubstitution Criteria in the Design of Linear Classifiers

Leon Bobrowski, Mark Vriesenda and Jack Sklansky%
Department of Electrical and Computer Engineering
University of California at Irvine
Irvine, California 92717

Pattern classifiers are often designed on the basis of a finite data set of labeled feature vectors xj, where xj belongs to class k. The data set is usually used in two modes: training and evaluation. In the training mode, the classifier's parameters are chosen so as to optimize a performance criterion on the training set, which consists of the entire or a designated subset of the given data set. In the evaluation mode the performance of the classifier on future (unknown) data is estimated.

Often the performance criterion for the training mode is the classifier's error rate on the training set. We refer to this as the resubstitution error rate. Other criteria on the training set are possible ---e.g., an error rate in which the errors are weighted by their degrees of certainty (a high degree of certainty for xj indicates that xj is very likely misclassified). We refer to these as resubstitution criteria.

In this paper we compare the computational effectiveness and reliability of three resubstitution criteria for the design of linear classifiers: error rate, perceptron and window.

Since the cardinality of xj is finite, the resubstitution error rate e is a piecewise constant function of w, where w denotes the parameters determining the classifier. It follows that the gradient of e is either zero or indeterminate. Hence gradient descent cannot be applied to minimize e. Genetic algorithms offer a means of minimizing e. We evaluate the effectiveness and reliability of the resubstitution error rate criterion on the assumption that genetic algorithms are used for the search process.

The resubstitution perceptron and window criteria involve the use of a penalty function phi(xj) for each point xj in the data set. The perceptron penalty function phi(xj) is greater than zero if xj is situated on the wrong side of the decision hyperplane and increases linearly with the distance between xj and this hyperplane. The window penalty function phi(xj) increases only within a window enclosing the decision hyperplane. The perceptron and the window functions are piecewise linear. Finding the minima of these functions can be carried out effectively by basis exchange algorithms --- forms of linear programming --- even when the training sets are large. We evaluate the effectiveness and reliability of the resubstitution perceptron and window criteria on the assumption that basis exchange is used for the minimization process.

Using a Genetic Algorithm to Evolve a Set of Rules for Classifying Roads in Satellite Images

Julian Eugene Boggess%
Department of Computer Science
Mississippi State University P. O. Drawer CS
Mississippi State, MS 39762

It is relatively easy for humans to identify roads in satellite images through visual inspection. However, it turns out to be rather difficult to devise a computer algorithm that approaches the accuracy of a human classifier. In general, computerized approaches to road identification look at the characteristics of a given pixel in the image, and perhaps its surrounding pixels (its context), and then use a set of explicit or implicit rules to determine whether the given pixel should be classified as a member of the set of road pixels or not. Unfortunately, determining a useful set of rules is a difficult task.

A genetic algorithm is able to evolve a solution to a problem through a process of selection, crossover, and mutation, similar to the biological process upon which it is modeled. In this study, a genetic algorithm is used to evolve a good set of rules, and a good set of parameter settings for these rules, which are used to identify road pixels in LandSat satellite images. The results are compared to other techniques, including a pure neural network technique and a hybrid neural network.

KEYWORDS: Genetic algorithms; Image processing

Determining the Number of Mixture Components: A Simulation Study

Melinda Smith de Borrero
College of Business Administration & Economics
Division of Marketing
Lehigh University Bethlehem, PA 18015 USA

meb3@lehigh.edu In mixture analysis, the data are taken from a mixture distribution without knowledge of the component membership of each observation and the density of y is of the form:

f(y, U)= (PLEASE SEE THE HARD COPY)

where the unknown parameter vector U consists of S component parameters (insert from hard copy) and S component proportions, often referred to as mixing proportions or mixing weights, (insert from hard copy), so that

U= (INSERT FROM HARD COPY).

One of the most challenging problems in mixture modeling is estimating S, the number of components in the mixture distribution. A situation where the researcher has knowledge of S a priori is rare. A review of the extant marketing literature shows that a variety of model fit heuristics have been used in determining how many mixture components to retain, including Akaike's (1973) Information Criterion (AIC) and Bozdogan's (1987) consistent AIC (CAIC). Other information-theoretic fit functionals, such as Windham and Cutler's (1992, 1994) minimum information ratio (MIR) and the Bozdogan's (1988, 1990, 1994) information theoretic measure of complexity (ICOMP), have yet to appear in the marketing literature, but are also potential candidates for component determination.

In this paper, the performance of AIC, CAIC, MIR, and ICOMP is systematically assessed through a series of controlled simulations. The factors varied in the simulation procedure include: number of mixture components, sample size, and level of component separation. The ML method and EM algorithm is used for parameter estimation of a Poisson mixture density. Performance is measured by computing the squared difference between the estimated number of mixture components using the various fit measures and the true number of components.

The simulation results indicate that ICOMP and CAIC are consistently the superior model fit measures. CAIC is the favored measure in most marketing applications primarily because of its seemingly computational simplicity. The performance, however, of ICOMP is superior and is computationally manageable given the capabilities of today's computer resources, especially in situations where the inverse of the Fisher's information matrix has a closed form solution. Furthermore, ICOMP is robust to distributional assumptions and different covariance structures in the mixture model, particularly in the multivariate case.

Conceptual Clustering Revision: a Generalization Space-Based Approach

Isabelle Bournaud%
Universite Paris VI
75252 Paris Cedex 05 - France

bournaud@laforia.ibp.fr

Artificial Intelligence (AI) approaches to clustering and traditional techniques developed in both cluster analysis and numerical taxonomy differ in two major aspects. Firstly, the objective of AI approaches is clearly not only to build a set of clusters each defined extensionally as a set of objects, but also to associate an intentional definition, called a concept, to each cluster. Secondly, the quality of a clustering does not only depend on a within-cluster similarity and between-clusters dissimilarity, but also on the comprehensibility of the clusters descriptions. From a set of observations, conceptual clustering systems build a hierarchical organization of the acquired concepts, called a conceptual hierarchy. These approaches have proven to be efficient and are used in many domains. Examples of such systems include cluster/2 [Michalsky and Stepp, 83], cobweb [Fisher, 87] or labyrinth [Thompson and Langley, 91].

However, experts in systematics are more interested in modifying acquired or existing hierarchies than in acquiring them. We propose an original approach to conceptual clustering that provides ways to both build conceptual hierarchies and easily revise them. The method consists in extracting conceptual hierarchies from an explicitly built space, called Generalization Space. A Generalization Space (GS) is an inheritance network that contains a computable subset of all the potential clusters over the observations, represented as nodes. Because of the inheritance structure of the GS, each node may have more than one father. Extracting a hierarchy consists in choosing a unique father for each node based on heuristics. Revising a hierarchy consists in choosing another father for some nodes, i.e. modifying the heuristics used to extract the hierarchy from the GS. The heuristics are expressed in terms of hierarchy structure (large, deep, ...) or in terms of domain knowledge (e.g. ``do not choose the proposed father f for the node n''). In restricting the building of the GS to a computable subset of all the potential clusters over the observations, as in [Mineau, 90], we have developed a low complexity algorithm to support the proposed method.

KEYWORDS: Conceptual clustering; Restructuring; Heuristics for extraction; Systematics.

Title and Abstract to be Provided

Yudong Cai%
Chinese Academy of Sciences
Shanghai Research Center, 500 Caobao Road
Shanghai, China 200233

An Overlapping K-Centroids Approach to Cluster Analysis

Anil Chaturvedi, J. Douglas Carroll, Paul E. Green and John A. Rotondo
AT&T; Bell Laboratories, Rutgers University, and University of Pennsylvania

In this paper, we present a very general approach to Market Segmentation via a generalization of the standard K-means clustering procedure that allows for overlap in the extracted cluster structure. This approach assumes a bilinear model which is a special case of a general multilinear model called ``CANDCLUS'', (for CANonical Decomposition CLUStering; see Carroll and Chaturvedi (in press), with mixed discrete and real parameters. Parameters of this two-way CANDCLUS model can be estimated by optimizing a loss function based on any L_p-norm, although we have used only L_1- and L_2% -norms to date. In the case of an L_2-norm based loss function, fitting this model with constraints imposed to make the clustering a partition results in a procedure equivalent to K-means clustering, while fitting without such partitioning constraints results in a method we call ``overlapping K-means''. In the case of fitting via an L_1-norm based loss function, the unconstrained model yields what we call an overlapping K -medians solution, while, when partitioning constraints are imposed, this results in a K-medians solution (Hartigan, 1975). The benefit of overlapping K-means and K-medians over existing overlapping clustering procedures is their ability to handle relatively large datasets, and to provide robust estimates of cluster membership and cluster centroids or ``generalized centroids'' (such as the vector of medians which replaces the standard centroid if an L_1 rather than L_2-norm is minimized). We will use the term ``centroid'' henceforth to refer to either a standard centroid (vector of means), or such a generalized centroid.

An application of the overlapping K-centroids approach to market segmentation is presented, entailing minimization of both L_1 and L_2 based loss functions -- thus using overlapping K-medians and overlapping K -means, respectively.

Consensus Rules for Molecular Sequences: Theory vs. Biology

William H. E. Day%
Box 17 (Shore Road)
Port Maitland, NS B0W 2V0 Canada

whday@ac.dal.ca

The analysis of long molecular (e.g., DNA, protein) sequences is a rapidly developing branch of computational molecular biology. One of this branch's challenging problems is to align a set of n sequences such that aligned positions match in an optimal way and are comparable. Any such alignment can be represented by a matrix of n rows and m columns in which the n -tuple of symbols (e.g., bases, amino acids) in a given column represent homologous states of a biological character. At each column of the matrix, a problem of consensus description is to identify a set of symbols (e.g., ambiguity codes) that best represents the n symbols in the column. There have been few investigations of the features --- or relevance --- of rules for calculating consensus molecular sequences. Do consensus rules based on voting paradigms make sense in a biological context? Should biologists be encouraged to use consensus rules that are based on minimizing angles in Euclidean spaces? I will introduce these issues and (as time permits) discuss whether angle minimization is an appropriate model for calculating consensus molecular sequences.

KEYWORDS: Biological application; Consensus; Sequence analysis.

Growing-Sphere Graphs as a Generalization of Sphere of Influence Graphs and the Union of Mintrees

Donald W. Dearholt
Department of Computer Science
Mississippi State University MS 39762

dearholt@cs.MsState.edu

Growing-sphere graphs (GSGs) are introduced as a means of providing a coherent model for the clustering and representational capabilities of both the sphere-of-influence graphs (SIGs) and the union of minimum-cost spanning trees in a coordinate space. The clustering properties of SIGs and the union of mintrees are reviewed, and a novel way of viewing these graphs as growing-sphere graphs (GSGs) is presented. Under appropriate constraints, it is shown that GSGs can generate either SIGs or the union of mintrees. Applications include modeling dynamic aspects of energy propagation, such as electromagnetic or acoustic waves.

And Then There Were None: A Computer-Aided Analysis of the Shakespeare Claimants

Ward Elliott
Claremont McKenna College, Department of Government
850 Columbia Avenue
Claremont, California 97111-6420

The CMC Shakespeare Clinic has applied a battery of 52 computer-based tests to plays by 17 claimed ``True Authors'' of Shakespeare's work, to 27 plays of the Shakespeare Apocrypha, and to all of the plays of the Shakespeare Canon. It has applied up to 14 tests to poems by 28 poet claimants. Its assignment from the Shakespeare Authorship Roundtable was to find out which, if any, of the 37 testable claimants on the Reader's Encyclopedia of Shakespeare's list actually matched Shakespeare. Its conclusions:

1. None of the 37 claimants, playwright or poet, matches Shakespeare.

2. Two long-accepted plays in the Shakespeare Canon, Titus Andronicus and Henry VI, Part 3, don't match the rest and appear to have been written, in substantial part, by someone other than Shakespeare. Shakespeare's accepted authorship of the short poem A Lover's Complaint likewise seems doubtful.

3. Apart from the two outliers mentioned, no core Shakespeare play has more than three rejections in 52 tests. No claimant or Shakespeare Apocrypha play has less than eight rejections. Of the Shakespeare Dubitanda, only Shakespeare's part of Two Noble Kinsmen and the ``Late Stratum'' of Titus Andronicus have fewer than eight rejections.

4. Every 3,000-word block of Shakespeare's accepted poems, and the 2,600-word A Lover's Complaint, pass our powerful modal test. No claimant poem corpus of 3,000 or more words passes the modal test. Two shorter poem corpora, of only 1-2,000 words passed the modal test, but -- along with A Lover's Complaint -- failed non-modal testing with four or more rejections. L.C. aside, no block of Shakespeare had more than two rejections. No claimant block had fewer than three rejections from all available tests combined.

5. If all, or even most, of its tests are valid, the Clinic has eliminated every testable claimant, including 13 who are considered ``full claimants,'' not just supposed collaborators on one work or another. 21 other claimants, four of them ``full,'' have not, to our knowledge, left any work comparable to Shakespeare's and cannot be tested using internal evidence. Among the claimants eliminated by the clinic's tests are some of the most prominent ones: the Earl of Oxford, Bacon, Marlowe, and members of the Sidney Circle.

KEYWORDS: Shakespeare; Authorship; Shakespeare claimants; Shakespeare Apocrypha; Shakespeare Dubitanda

Gender Discrimination and Prediction on the Basis of Facial Metric Information

Jean-Marc Fellous%
University of Southern California
Los Angeles, CA 90089-2520
(213) 740-1922 (office) (213) 740-5687 (fax)

fellous@selforg.usc.edu

We show that gender discrimination and prediction can be achieved on the basis of purely metric information extracted from frontal pictures of faces. We show that vertical and horizontal metric facial measurements can be considered as statistically independent. Five appropriately normalized facial distances are derived using discriminant analysis that explain over 95% of the gender differences of training samples and predict the gender of 85% novel test faces exhibiting various facial expressions. Robustness of the method and its results are assessed. It is argued that these distances (termed fiducial) are compatible with those found experimentally by psycho-physical and neuro-physiological studies, and that in consequence, partial explanations for the effects observed then, can be found in the intrinsic nature of the facial stimuli used.

A learning method for text categorization

Jeffrey L. Goldberg Department of Computer Science
Texas A&M; University
College Station, TX 77843-3112

jeff@cs.tamu.edu

Text categorization is the classification of units of natural language text with respect to a set of pre-existing categories. There has been two primary approaches, to create text categorizers manually, via human engineering, or automatically, via machine learning.

However, there have been statistical problems associated with natural language text when it is applied as input to existing, non-domain specific machine learning algorithms such as Bayesian classification and decision tree learning (too much noise, too many features, skewed distribution). The Category Discrimination Method (CDM) is a new learning algorithm designed specifically for text categorization that addresses the short comings of these existing learning methods.

The bases of the CDM are research results about the way that humans learn categories and concepts vis-a-vis contrasting concepts. The essential formula is cue validity borrowed from cognitive psychology, and used to select from all possible single word-based features the `best' predictors of a given category.

The CDM learns text categorizers. A precategorized test collection of text documents is divided into two parts, a training set and a test set.

For each category relevant to the test collection:

Using the training documents, the CDM determines the best predictors (i.e. features) for the category. A metric called cue validity, borrowed from cognitive psychology, is used to select the features.

Then the CDM conducts a multi-stage search to learn how these features might best be organized into a logical structure that can be used as a text categorizer. During each stage of search, literally hundreds of text categorizers are tested against a set of training documents. The best categorizers produced by each of the intermediate search stages are used to set the parameters for the next search stage.

Then the categorizer that is best at predicting the desired category on the training documents, the output of the final stage of the search, is used to predict that category on the unseen test documents. The performance of the CDM algorithm for that category can then be determined.

An overall performance figure for the CDM on the test collection is obtained from the individual per category figures by microaveraging. The hypothesis that the CDM's performance will exceed two non-domain specific algorithms, Bayesian classification and decision tree learners, both of which having been previously evaluated on the same test collection of documents under similar conditions, is thereby tested.

The preliminary results have thus far confirmed the hypothesis. After evaluating 89% of the Reuters-22173 test collection, and using only the CDM's first search stage (without the benefit of the second and third fine tuning stages), its overall performance is significantly higher (72%) than that of Bayesian classification (65%) or decision tree learners (67%), based on published figures from comparable studies.

KEYWORDS: Text categorization; Machine learning; Supervised learning.

Predictiveness of Higher Order Graph Structure

Timothy E. Goldsmith and Peder J. Johnson%
Department of Psychology, Logan Hall,
University of New Mexico, Albuquerque, NM 87131

e-mail: gold@unm.edu

Graphs are increasingly used to represent proximity data. Several algorithms now exist for deriving graphs from proximities. Once a graph is derived, its structure can be described either by noting the node pairs that are directly linked (adjacency matrix) or by noting the minimum path distance between node pairs (graph distance matrix). In subsequent applications of graphs, one of these characterizations may be more useful than another. We investigated whether higher-order structural information in graphs offer any advantage over first-order adjacency values. More specifically, we asked whether graph distances contain any unique predictive information over adjacency distances.

We investigated this issue in the domain of knowledge graphs. Graphs were derived from individuals' relatedness ratings of the central concepts of a domain. Domain competence was then predicted from the similarity of each individual's graph to the graph representation of a domain expert. We computed similarity with both adjacency values and graph distances. Previous work had confounded path distance and path type. In contrast, we defined a match to occur only when a pair of nodes had both the same minimum distance and the same intervening path nodes.

The analysis of two data sets N=101 and N=40 showed that the predictiveness of second-order similarity did not significantly add to criterion predictiveness after first-order similarity was partialled out. These results were based on proximity graphs, which emphasize the most related pairs (Schvaneveldt, 1990). We are currently extending the analysis to graphs computed from a least-squares approach (Klauer & Carroll, 1989). If the findings hold, the implication is that the essence of graph structure lies simply in knowing a graph's direct links.

KEYWORDS: Proximity data; Graphs; Higher-order information; Graph theory; Applications.

A Comparison of Alternative Approaches to Cluster-Based Market Segmentation

Paul E. Green and Abba M. Krieger
The Wharton School, University of Pennsylvania
email: Paul.Green@marketing.wharton.upenn.edu

The common practice of factor analyzing data matrices, followed by a clustering of rotated, standardized factor scores has recently been questioned by several researchers in marketing. This paper presents comparative analyses of nine large-scale data sets in which two different types of transformations are employed prior to K-means clustering.

The study results suggest that there is some empirical evidence for the recent claims. Clusterings based on rotated factor scores generally show poorer associations with exogenous (background) variables than those based on either the original data or a less severe preliminary transformation. Implications of these findings for post hoc market segmentation are discussed.

Probability Models and Limit Theorems for Random Interval Graphs with Applications to Cluster Analysis

[Invited Presentation]

Bernard Harris and Erhard Godehart%
University of Wisconsin, Madison and Heinrich-Heine Universitaet, Duesseldorf

harris@stat.wisc.edu

N independent and identically distributed random variables, with a distribution absolutely continuous with respect to Lebesgue measure are observed. The random interval graph determined by their distances is calculated. Asymptotic distributions for the number of edges and the number of complete subgraphs of order k is obtained, k

Tree-Based Classification of Predictors in Regression Models and Structure Determination via Self-Organization: The GMDH

Cheryl Hild and Hamparsum Bozdogan

Management Science Program and Department of Statistics
University of Tennessee, Knoxville, TN 37996 USA

bozdogan@utkvx.utk.edu%

This paper studies a tree-based classification of predictor variables in regression models with one response variable y, and m predictor variables x_1,...,x_m via the self-organization method called the Ivakhnenko (1968) Polynomials of Several Predictors or the Group Method of Data Handling (GMDH). The GMDH is an elegant approach to statistical data modeling to determine the relationships among the m given predictors and to describe the structure of high-order regression type models in a hierarchical fashion.

Self-organization uses combinatorial methods beginning with a few basic quadratic equations in a series of hierarchical tree-based approaches to classify predictors. This is called ``threshold self-selections''. Thresholds are employed at each layer (or generation) in the hierarchy to construct new generations of complex equations. The survival-of-the-fittest principle is used to classify those predictors which best fit into the desired hypersurface where more complex combinations of predictors are formed. Each element in each layer in the hierarchy implements a nonlinear function of two predictors based on the results of the previous layer. As this algorithm continues for several generations, a set of mathematical models is developed that behave more and more like the actual system of interest (Farlow, 1981). As opposed to assuming a known functional form of the model, the GMDH algorithm considers only the data collected. The functional relationships between the response and predictor variables and the overall structure is learned directly from self-organization of the data.

Model selection strategies are introduced and developed using information-based model evaluation criteria for the use in the GMDH algorithm to study the quality of the competing offspring models at each generation. More specifically, Informational Complexity (ICOMP) criterion of Bozdogan (1988, 1990, 1994) and the classic Akaike (1973) Information Criterion (AIC) are developed and integrated into the technology that chooses the model which ``best'' approximates the given data set among the set of competing alternative models with different numbers of parameters.

A symbolic backtracing algorithm is designed for the hierarchical decision tree to classify the best predictors and to determine the best polynomial structure in terms of the m original predictor variables.

Several real numerical examples are shown along with an open architecture symbolic computational toolbox to illustrate the utility of the new proposed approach.

Multivariate Approaches to Authorship Attribution

Dr. David I Holmes%
Department of Mathematical Sciences
University of the West of England
Coldharbour Lane, Bristol BS16 1QY, England

di-holmes@csm.uwe.ac.uk

Multivariate statistical techniques such as principal component analysis, discriminant analysis and cluster analysis have found increasing application in the analysis of literary style in recent years. This paper commences by reviewing examples of such applications, in particular the work of Burrows (1992, 1994) and Binongo (1994) on the rate of occurrence of the most frequent words in a text, and the work of Holmes (1991, 1992) on variables measuring the richness of vocabulary of a writer.

The paper then moves to work in progress on testing the efficacy of both of these approaches by applying them to a rather severe testing ground, namely the authorship of the Federalist Papers. These were a series of essays appearing over the pseudonym `Publius' during 1787 and 1788 attempting to persuade New Yorkers to support ratification of the proposed new United States Constitution. The essays were the subject of a well-known investigation by Mosteller and Wallace (1984) using Bayesian techniques.

A sample of 33 papers is taken from the Federalist corpus and run on the Oxford Concordance Program, after which a series of measures of vocabulary richness is computed. Multivariate statistical analyses indicate that such measures would seem, collectively, to be a good set of discriminators. Two separate investigations are then conducted using a set of 49 high frequency function words and the 30 Mosteller and Wallace `marker' words. Plots in the space of the first two principal components appear to add weight to Burrows' contention that multivariate word frequency analysis of large sets of common words is a stylometric technique which can discriminate between writers.

The paper suggests that the selection of multiple criteria must now be favoured to tackle problems of authorship attribution, although they must be sensitive enough to distinguish between samples from the same genre as well as being able to discriminate between samples from different genres.

An Empirical Investigation of the Robustness of the ``Heuristic Identification of Noise Variables (HINoV)'' Algorithm

Ali Kara
Pennsylvania State University at York
1031 Edgecomb Avenue, York, Pennsylvania 17403-3398

axk19@psuvm.psu.edu

One of the most frequently used phrases in the marketing literature is the identification of market segments which usually entails the identification of homogeneous subsets of the mass market by clustering customers on a set of variables. Mainly, cluster analysis has been used as a classification tool (all segmentation research, regardless of the method used to identify groups of people, markets, or organizations that share common characteristics), to improve understanding of buyer behavior and developing taxonomies of buyer behavior (Claxton, Fry, and Portis 1974), to identify potential product opportunities (Srivastava, Leone, and Shocker 1981; Srivastava, Shocker, and Day 1978), and as a general data reduction technique to apply to data sets with many variables.

Unfortunately, cluster analysis is not a deterministic technique, i.e., it requires subjective interpretation of the researcher. One of the major problems concerns the selection of variables to cluster on (Green, Carmone, and Kim 1990; Milligan 1989; DeSarbo, Carroll, Clark, and Green 1984). If this task is not done carefully, variables containing little clustering information could be included in the analysis. These ``noisy'' variables can cause misleading results (``poor clusters''). The researchers argue that ``...in any given array of variables only a proper subset of the variables contribute in an important way to the resultant clustering. Hence, one should only use these variables and ignore the others.'' However, clustering methods tend to be used in an exploratory fashion without utilizing as much information about the background and structure of the data as one would use in dependence methods, e.g regression and discriminant analysis. Indeed, these noisy variables can mask the clear structure portrayed by other variables (Fowlkes, Gnanadesikan, and Kettenring 1988; Milligan 1980; Punj and Stewart 1983).

Carmone (1990) developed an algorithm called ``Heuristic Identification of Noisy Variables (HINoV)'' to help isolate the noisy variables prior to clustering. If the noisy variables could be identified and removed prior to clustering, better clusters could be obtained. HINoV identifies the noisy variables that may exist in a data set before starting the partitioning process with a very straightforward and easy to understand and implement procedure. The purpose of this study is to empirically test the robustness of HINoV. The artificial data sets with known structures were generated with the use of a routine developed by Milligan (1985). Several cluster recovery measures were used to test the performance of the algorithm. Results of this study indicate that the HINoV algorithm identified noisy variables in both synthetic and real data sets successfully. Recovery of true cluster structure was enhanced. The algorithm should prove quite helpful not only in market segmentation problems encountered for classifying consumers into groups or segments where purchase behavior or general characteristics are similar for those consumers in the same segment, but also several other behavioral science applications.

Consumers' Price-Quality Schemas for Services

Susan M. Keaveney
Graduate School of Business Administration, Campus Box 165
University of Colorado at Denver
Denver, Colorado 80217-3364

skeaveney@castle.cudenver.edu

Consumers are faced with numerous market-related decisions daily, including advertisements, product choices, service choices, prices, promotions, and so on. To simplify these everyday market decisions, consumers are believed to develop a repertoire of decision rules or ``schemas.'' For example, consumers learn that larger sizes of products are less expensive on a per unit basis (therefore, they can ``safely'' buy the larger size without checking unit-prices) or that signs and in-store displays signal sale prices (so, consumers can ``safely'' buy the sale product without the effort of comparing prices). Consumers' schemas may or may not be accurate: although the two schemas described above were once generally true, marketers have since changed the rules of the game.

A price-quality schema is a decision rule that states: ``The higher the price, the higher the quality.'' Consumers who develop and apply price-quality schemas simplify decision- making by accepting the belief that price is a cue for quality. Using cluster analytic techniques, product researchers have identified four clusters of consumers, grouped by their tendencies to develop and apply price-quality schemas about products: (1) Consumers who believe that the rule ``the higher the price, the higher the quality'' can be applied to most products; (2) consumers who do not form price-quality schemas at all (but believe that price-quality relationships can be determined only after evaluating product information) (3) consumers who form price-quality schemas about non-durable goods but not durable goods; and (4) consumers who form price-quality schemas about durable goods but not non-durable goods.

Conventional wisdom in services (versus goods) marketing holds that consumers have strong price-quality schemas about services. The rationale is as follows: Because services are intangibles --- experiences rather than objects --- it is difficult to evaluate services prior to actually experiencing them. Therefore, when evaluating a service prior to purchase and consumption, consumers must use other more tangible cues, such as price.

Prior to this study, price-quality schema hypotheses had not been empirically tested in the context of services. Do consumers form price-quality schemas about services ? If consumers form price-quality schemas about services, should services marketers use price to signal quality? Are consumers more likely to form price-quality schemas about services than about products? Are certain types of services more likely than others to stimulate the use of price-quality schemas? For example, if a consumer can observe service outcomes (landscaping, a haircut, house cleaning), he/she can make decisions based on observations as well as price. But when a consumer cannot observe service outcomes (how much was learned in a class) then there are fewer cues to inform decision-making. Does price then increase in information value and price-quality schemas become more useful? Finally, if consumers form price-quality schemas, about products or services, do they actually pay higher prices on average than consumers who do not form price-quality schemas?

The research surveyed approximately 400 consumers of services regarding their price-quality beliefs. Cluster analysis of the responses revealed the following preliminary results:

One cluster of consumers forms price-quality schemas about all services.

A second cluster of consumers appears to believe that price-quality relationships are not true or do not exist (i.e., show an inverse relationship).

A third cluster shows random patterns of price-quality beliefs; i.e., this group appears to make decisions independently.

A fourth cluster of consumers forms price-quality schemas about non-personal services (landscaping, auto mechanic, house cleaning) but not about personal services (doctors, dentists, hotels).

A fifth cluster forms price-quality schemas about personal services (doctors, dentists, hotels) but not about non-personal services (landscaping, auto mechanic, house cleaning).

Finally, the research examined the question: If consumers have price-quality schemas about products or services, do they actually pay higher prices on average than consumers who do not form price-quality schemas? Initial results indicate that they do not.

KEYWORDS: Cluster analysis; Marketing; K-means clustering; Price-quality schema.

A Correlated Evolution Model Using a Multiple Regression with Phylogenetically Autoregressive Error Components

Hang-Kwang Luh John Gittleman and Hamparsum Bozdogan
Department of Mathematics, Department of Zoology and Department of Statistics
University of Tennesse, Knoxville, TN 37996 USA

luh@mathsun5.math.utk.edu

The search for a functional relationship between an organism and its environment is one of the most important themes in evolutionary biology. This functional relationship, as Darwin found a century ago, evolves through a combination of genealogical and environmental processes. This causes difficulties for biologists attempting to test the hypothesis that the evolution of a phenotypic trait is under the mechanism of adaptation, because the similarity between species may be due to common environmental factors (adaptation) or due to common inheritance (phylogenetic constraint). Comparative statistics with phylogenetic analysis can distinguish various kinds of similarity by using the information of phylogenetic classification. This approach requires two steps:

(1) removal of phylogenetic effects, and

(2) apply comparative statistics to the trait which is free of phylogenetic effects.

In this study, we present a flexible framework to deal with these two steps simultaneously. We first postulate a regression model representing the correlated evolution among traits. Since the observed data in a trait are phylogeneticly correlated, the random error terms in the regression model are not independent of one another. Thus, the proposed model should incorporate the first order phylogenetically autoregressive process in standard error components in order to deal with the several sources associated with the phylogenetic dependencies among traits evolution. After we build the comparative model structure, we can further ask which traits should be included in the model in order to best describe the relationships between the traits.

Here we introduce information-based model selection criteria to estimate the performances of each of the candidate models. Information-based criteria as ''figures-of-merit'' take into account the principle of parsimony which is the cornerstone of biological classification. The best model representing the evolutionary pattern of correlated traits should provide most the information with the simplest model structure.

Classification of Patients with Germ Cell Tumors (GCT) Using Factors Predictive of Response to Platinum-Based Chemotherapy

M. Mazumdar, D.F. Bajorin, R.J. Motzer, V. Vlamis, and G.J. Bosl
Department of Epidemiology and Biostatistics, and Genitourinary Oncology
Service, Department of Medicine
Memorial Sloan-Kettering Cancer Center (MSKCC)
New York, NY 10021

mazumdar@biosta.mskcc.org

The management of patients with GCT has been based on statistical models from the early 80's, even though different rules were used by different institutions making it difficult to compare the published literature. Currently platinum based chemotherapies have approximately an 80% cure rate. It was well accepted that allocation of patients to different risk categories should be done on the basis of the values of pretreatment biochemical markers (eg., LDH, HCG and AFP ) and extent of disease. There were many assays for each marker and various ways of defining bulk of disease (eg., visceral organ involvement, number of metastatic sites, primary sites of disease) contributing to the variability. In the current environment of managed care, it is necessary that we conserve our resources by collaborating nationally and if possible internationally. In order to do that, we needed to integrate the classification rules.

The objective of this project was to classify these patients into three risk categories (say, good, intermediate and poor) which will be acceptable to all the primary groups in this field. We wanted to get patients with higher than 90% of CR rate in the good-risk category to be able to treat them with the standard therapy since their efficacy cannot be improved any more. Intermediate-risk category will capture the patients whose efficacy can be improved by performing clinical trials with new regimens and moderate sample size requirements. The poor-risk category will consist of extremely poor prognosis patients so that very aggressive therapy such as high dose chemotherapy with bone-marrow transplantation can be used to get some response.

There were 796 patients treated at MSKCC between 1975-1990 on consecutive protocols. The endpoint of interest was response (Complete (CR) vs. Incomplete responders (IR)). We randomly chose 75% of the data to be the developmental set and the remaining was kept for the validation set. Multivariate logistic regression was used to find the significant variables. Classification rules were formed using these variables, experience of the clinicians and other factors to be described below. Akaike's Information Criterion was used to compare different models. ROC analysis was used to check the performance of the model in the validation set.

AFP, even though not significant in our database, was used for classification because of its statistical significance found in the analysis of the International GCT collaboration group. Also to get rid of the variability in the units of assays for LDH and HCG, a transformation as a multiple to its normal value was used. Visceral mets (mets to liver, lung or bone) was used instead of total number of mets because of its acceptance by other groups.

Statistical classification is not always based only on statistics. Many factors need to be incorporated to make it useful.

KEYWORDS: Response; Bone-marrow transplantation; Visceral metastasis.

An Information-theoretic Approach to Optimal Design of Statistical Classifiers

David Miller, Ajit Rao, Kenneth Rose, and Allen Gersho
Department of Electrical and Computer Engineering
University of California Santa Barbara, CA 93106

rose@ece.ucsb.edu

A global optimization method is introduced for the design of statistical classifiers minimizing the probability of misclassification. The method, which is based on ideas from information theory and analogies to statistical physics, is inherently probabilistic. During the design, data are assigned to classes in probability, with the assignments chosen to minimize the expected classification error given a specified level of randomness, as measured by Shannon's entropy. The constrained optimization is equivalent to a free energy minimization, motivating a deterministic annealing approach in which the entropy and expected misclassification cost are reduced with the temperature while enforcing the classifier's structure. In the limit, a hard classifier is obtained. This deterministic learning approach is applicable to a variety of classifier structures, including the widely used prototype-based, radial basis function, and multilayer perceptron classifiers. The method has been compared with learning vector quantization, back propagation, several radial basis function design techniques, as well as with paradigms for more directly optimizing these structures to minimize probability of error. The annealing method achieves substantial performance gains over all the other design methods on a number of benchmark examples from the literature [e.g., Ripley (1994), J. R. Statist. Soc. B, 56, pp. 409-456,] while often retaining design complexity comparable to or only moderately greater than that of strict descent methods. Remarkably substantial performance gains, both inside and outside the training set, are achieved for complicated examples involving high-dimensional data and large class overlap.

KEYWORDS: Classification; Information theory; Neural networks; Deterministic annealing; Shannon entropy; Statistical physics.

An Evaluation of Within-Cluster Variable Standardization

Glenn W. Milligan and Richard Cheng
Dept. of Management Sciences, 302 Hagerty Hall
The Ohio State University
Columbus, Ohio 43210 USA

gmilliga@magnus.acs.ohio-state.edu

The problem of standardizing within-cluster is explored. Standardization within-cluster results in a circularity problem. That is, before variable standardization can be conducted, cluster memberships must be known. However, before determining the cluster memberships, the data must be standardized. One approach to dealing with the circularity problem involves using tentative cluster memberships for preliminary standardization. Subsequently, an iterative routine can be followed that alternates between clustering and re-computation of the within-cluster standardized data. Such a routine was incorporated into a K-Means clustering algorithm for testing purposes. Bivariate simulated data sets were constructed to evaluate various iterative strategies. The presentation will focus on the effect of within-cluster standardization by using snapshot graphical displays showing movement of the data points and tentative cluster assignments.

KEYWORDS: Variable standardization; K-means clustering; Monte Carlo methods.

A Multi-layer Radial Basis Function Network that can improve generalization

P. Antonio Olmos-Gallo
Department of Psychology, University of Denver.

polmos@pstar.psy.du.edu

The ability of Radial Basis Function Networks (RBFN) in classification have been studied for some time (Moody & Darken, 1988; 1989). This architecture is considered a very fast learner and very accurate classifier (Singer & Lippmann, 1991; R"oscheisen, Hoffman & Tresp, 1991). However, its shortcomings are also well known. It is known that computation with these type of networks is very expensive in terms of the number of units that need to be utilized (Hartman & Keeler, 1991; Hartman Keeler & Kowalski, 1990), as well as their strong tendency to overfit. Several modifications have been proposed to deal with this shortcoming, including growing neural-gas algorithms (Fritzke, 1994; 1995), and arbitrarily reducing the number of units to improve generalization (Bishop, 1991; Omohundro, 1991; Wettschereck & Dietterich, 1991). In this paper, we explore a biologically inspired alternative: the addition of a second layer to the network. It is suggested that this addition may help to improve generalization by removing complexity, reducing the dimensionality of the problem space, and providing a reorganization of the representation in such a way that features that influence each other get closer. The network was tested on a pattern classification problem known as Hyperacuity. Its fit to psychophysical data as well as its shortcomings are discussed.

A Comparison of Linear and Nonlinear Classification Methods in the Diagnosis of Dyslexia

Michael P. Peacock and Pablo Antonio Olmos-Gallo
Experimental-Cognitive Psychology Laboratory
University of Denver
Denver, Colorado

mpeacock@synapse.psy.du.edu and polmos@pstar.psy.du.edu

Two nonlinear classification systems were explored for their utility in assisting in the classification and prediction of developmental dyslexia. The performance of two different neural network architectures, multi-layer perceptron (MLP) and radial basis function (RBF), was compared to the performance of linear discriminant analysis (LDA). The database consisted of eleven assessment scores used by expert clinicians in the diagnosis of dyslexia for a sample of 121 children between the ages of nine and fifteen.

For this problem, we were only interested in individuals who were either diagnosed with dyslexia, or were diagnosed with no specific reading disorder. Such a binary classification problem may seem trivial at first glance, but a caveat is in order: since we believe that the mapping from clinical assessments to a diagnosis is nonlinear, we expect an implicitly nonlinear classification system, like a neural network, to outperform a more traditional linear classification system. Better classification tools are useful to clinical psychologists since they would lead to more accurate and consistent diagnoses of psychological disorders.

Preliminary results suggest that the networks outperform LDA in classification of cases, but only the MLP network outperforms LDA in prediction of dyslexia. These results demonstrate that for this type of problem, we can improve generalization performance through the use of nontraditional classification techniques. In addition, hidden unit analysis through the course of network training may prove useful in training clinical psychology students to perform diagnoses of dyslexia.

KEYWORDS: Linear discriminant analysis; Multi-layer perceptron; Radial basis function; Diagnosis of dyslexia.

Genetic Algorithms for Classification and Feature Extraction

Min Pei, Ying Ding, William F. Punch III, and Erik D. Goodman
Beijing Union University, Beijing, China,
Case Center for Computer-Aided Engineering and Manufacturing, and Intelligent Systems Laboratory, Department of Computer Science Michigan State University, 112 Engineering Building East Lansing, Michigan 48824

pei@egr.msu.edu

In a decision-theoretic or statistical approach to pattern recognition, the classification or description of data is based on the set of data features used. Therefore, feature slection and extraction are crucial in optimizing performance, and strongly affect classifier design. Defining appropriate features requires interaction with experts in the application area. In practice, there is much noise and redundancy in most high dimensionality, complex patterns. Therefore, it is sometimes difficult even for experts to determine a minimum or optimum feature set. The ``curse of dimensionality'' becomes an annoying phenomenon in both statistical pattern recognition (SPR) and artificial Neural Network (ANN) techniques. Researchers have discovered that many learning procedures lack the property of ``scaling,'' i.e., these procedures either fail or produce unsatisfactory results when applied to problems of larger size.

To address this problem, we have developed two approaches, both based on genetic algorithms (GA's). The basic operation of these approaches utilizes a feedback linkage between feature evaluation and classification. That is, we carry out feature extraction (with dimensionaltiy reduction) and classifier design simultaneously, though ``genetic learning and evolution.'' The objective of these approaches is to find a reduced subset from the original N features such that the useful class discriminatory information is included and redundant class information and/or noise is excluded. We take the following general approach. The data's original feature space is transformed into a new feature space with fewer features that (potentially) offer better separation of the pattern classes, which, in turn, improves the performance of the decision-making classifier. The criterion for optimality of the feature subset selected is usually the probability of misclassification. Since the number of different subsets of N available features can be very large, exhaustive search is computationally infeasible and other methods must be examined. For example, a number of heuristic techniques have been used, but it is not clear under what circumstances any one heuristic should be applied, as each has its good and bad points. Genetic algorithms are good candidates for this task, since GAs are most useful in multiclass, high-dimensionality problems when knowledge about performance of heuristics is sparse or incomplete. In order to apply a GA to classification/discrimination tasks, we combined the GA with two different approaches: the K Nearest Neighbor decision rule and a production decision rule.

GA/KNN hybrid approach (Genetic algorithms combined with K Nearest Neighbor). Here the GA defines a population of weight vectors W, each of which is multiplied with every sample's data pattern vector X, yielding new feature vectors Y for the given data. The KNN agorithm then classifies the entire Y-transformed data set. The results of the classification are fed back as part of the fitness function, to guide the genetic learning or search for the best transformation. The GA answer is the ``best'' vector W under the, potentially conflicting, constraints of evaluation function. For example, the ``best'' W might be both parsimonious (has the fewest features) and gives fewest misclassifications. Thus the GA searches for an optimal transformation from pattern space to feature space for improving the performance of the KNN classifier.

GA/RULE approach (Genetic algorithms combined with a production rule system). This approach is based on the general structure of classifier systems, and focuses on the classification of binary feature patterns. A single, integrated rule format is used. The inductive learning procedure of the GA evolves a rule classifier using a known labeled ``training'' sample set. The result of training is a small set of ``best'' classification rules which classify the training set more accurately than other rule sets. By examining these ``good'' rules, one can determine the features which are most useful for classification. This provides information useful to domain experts for deriving new hypotheses.

The two methods described above were applied successfully to three high-dimensionality (96 or more features) biological binary pattern data sets with multiple classes. The results achieved indicate potential for appliation in many other fields. They can also be further developed using a GA combined with other classifiers, such as neural networks, to solve a fairly broad class of complex pattern recognition problems.

KEYWORDS: Genetic algorithms; Classification; Feature extraction; High dimensionality.

Forensic Hair Analysis Using Pattern Classification Techniques

L. Pratt and C. Medina%
Department of Mathematical and Computer Sciences
Colorado School of Mines
Golden, Colorado 80401

lpratt@mines.colorado.edu and cmedina@mines.colorado.edu

An important forensic task is to analyze and compare hair evidence in criminal cases. A forensic expert compares sets of hair images from evidence at a crime scene to a set from hair samples from a suspect. Under a miroscope, hairs are fairly distinctive. Forensic analysis, however, is a tedious procedure. The forensic expert must manually compare hundreds of hair images to determine whether or not two sets of samples came from the same person.

This project explores the automation of hair analysis. Our system takes as input microscopic images of two hairs and produces a classification decision as to whether or not the hairs came from the same person.

Each image is segmented into a number of pieces appropriate for classification of different features. A variety of image processing techniques are used to enhance this information. Neural networks are then used for feature classification. Finally statistical tests determine the degree of match between the resulting collection of hair feature vectors.

An important issue in the automation of any task used in criminal investigations is the reliability and understandability of the resulting system. To address this concern, we have also developed methods to facilitate explanation of the neural networks' behavior using a decision tree.

KEYWORDS: Neural networks, Pattern Classification, Decision Trees, Statistics

Market Segmentation on Multiple Bases with Pick-Any Data: An Integrated Approach

Venkatram Ramaswamy, Rabikar Chatterjee, and Steven H. Cohen
The University of Michigan, Ann Arbor, Michigan and Stratford Associates, Boston, Massachusetts

Pick-any data are easy to collect and are indeed commonly collected by marketing researchers. This paper proposes an approach to market segmentation given pick-any data on multiple potential data bases for segmentation (e.g., benefits sought, products or services used, occasions for use) that may be interdependent. The proposed "integrated" latent segmentation approach simultaneously extracts segments on multiple bases while capturing and explicitly estimating the structure of interdependence between the bases. A hypothetical example motivates how the approach works and demonstrates its benefits relative to other potential approaches. An empirical application is presented, using pick-any data collected by a regional bank on two popular, conceptually appealing, and interdependent bases for segmenting customers of financial services -- benefits (i.e., desired financial goals) and product usage (of an array of banking services).

Statistics and Authorship Attribution Studies

Joseph Rudman
Department of English
Carnegie Mellon University
Pittsburgh, PA 15221 USA

Rudman@cmphys.phys.cmu.edu

The use of statistics in authorship attribution studies can be as dangerous as it is valuable. This paper reports on the who, what, when, where, why and how of statistics in attribution studies:

An historical overview. The value of statistics to attribution studies. When statistics should not be employed. Some problem studies. Making assumptions about the data. An example from research on the canon of Daniel Defoe.

A substantial bibliography is provided.

KEYWORDS: Authorship; Statistics; Defoe.

Incorporating a Neural Network into Logistic Regression

Brad Warner
Dept. of Preventive Medicine and Biometrics
University of Colorado
Health Sciences Center,Denver, CO 80220

bw@rover.hsc.colorado.edu

In prediction or classification modeling problems we often have two contrasting situations of interest: (1) we either build models solely for prediction purposes because we are interested only in accurate estimation of a response variable or (2) we build models to understand the relationship between the independent (or predictor) variables and the outcome. In the second case, we often want to statistically test this association. The model complexity is sacrificed for ease of interpretation. Sometimes, a situation between these two extremes is of interest. We may have some confounding variables that must be accounted for in the model but which will not be used for testing purposes. For example, we may be interested in understanding whether serum cholesterol affects the outcome of a coronary bypass operation. We know that age and sex affect this outcome so we want to account for them in our model; however, we only want to test for an association between cholesterol and bypass mortality. The confounding variables can be thought of as nuisance variables.

Neural networks have received a great deal of attention over the last few years in the areas of prediction and classification, areas where statistical techniques have traditionally been used. The disadvantage with neural network models is that they are difficult to interpret at the independent variable level. In our example, if we used a neural network to build our model, it would be difficult to test only for an association between cholesterol and bypass surgery outcome. Logistic regression on the other hand has gained a great deal of popularity in prediction and classification problems because of the capability to easily interpret its parameters. This simplicity comes from the modeling constraint of linearity on the logit scale. The price paid for the ease of interpretability is loss of predictive power in cases where the linearity assumption is weak. A neural network could potentially have better predictive performance in this situation as no assumption is made about the structure of the problem. One way to think of this relationship is that the linearity constraint causes logistic regression to be a special case of a neural network. Since the neural network is more general, it is able to model more complex relationships.

In a scenario where our modeling has nuisance (or confounding) variables, it might be advantageous to exploit the predictive power of a model such as a neural network in the confines of a logistic regression model. Thus, we may be able to test variables of interest as in a traditional logistic regression model, but after we have accounted for as much of the variability in the response due to the nuisance variables as possible.

This paper briefly describes the two models, logistic regression and neural networks, pointing out their similarities and differences. Then we demonstrate three methods to incorporate a neural network that accounts for nuisance variables into a logistic regression model. We discuss situations where these methods might be employed. We also discuss how other complex predictive models such as projection pursuit regression or multivariate adaptive regression splines could be incorporated into this framework. Finally, we discuss the limitations of this modeling approach.

KEYWORDS: Combining models; Generalized linear model.

Logistic Regressions for Graphs and Social Networks

Stanley Wasserman and Philippa Pattison
University of Illinois and University of Melbourne

Spanning nearly sixty years of research, statistical analysis of graphs has passed through (at least) two generations of researchers and models. Beginning in the late 1930's, the first generation of research dealt with the distribution of various graph statistics, under a variety of null models. The second generation, beginning in the 1970's and continuing into the 1980's, worked on models, usually for probabilities of relational ties among very small subsets of nodes, in which various simple substantive tendencies were parameterized. Much of this research, most of which utilized log linear models, first appeared in applied statistics publications.

But recent developments in social network analysis promise to bring us into a third generation. The Markov random graphs of Frank and Strauss (1986) and especially the estimation strategy for these models developed by Strauss and Ikeda (1990) (described in brief in Strauss 1992), are very recent and promising contributions to this field. These models have not yet been widely applied, nor have they been generalized to a wider range of data. It appears that one can now model easily, succinctly, and most importantly, accurately, a wide range of structural features of network relations, including valued, multivariate, and those subject to algebraic constraints.

In this talk we review and extend these models and demonstrate how they can be used to address a variety of substantive questions about structure in social networks. We show how Strauss and Ikeda's (1990) strategy may be used to obtain approximate model fits using logistic regression. These generalizations permit different types of heterogeneous structural subgroup effects of many different kinds. In addition, we outline the relationship between Holland and Leinhardt's (1981) p_1 log linear model and a corresponding logit formulation. We describe applications of these models to the problem of classification of actors into distinct subgroups.

KEYWORDS: Categorical data analysis; Network analysis; Graph theory; Markov random fields; Markov graphs; Logit models.

Applications of Spline Transformations to Building Models for Classification

Ming Zhang, Ph.D
Department of Quantitative Solutions
Equifax Credit Information Services, Inc.
1525 Windward Concourse
Alpharetta, Georgia 30302

In many business applications of discriminant analysis it is often not sufficient to simply classify an observation of a random vector into one of several categories. Rather it is more desirable to provide a set of estimates of probabilities with which an observation falls into each category. These estimates of probabilities are functions of the random vector associated with an observation. One example of such an approach to classification problems in the business world is to use scoring models to classify the general population of accounts into ``good'' and ``bad'' subpopulations in the marketing and banking industries. Traditionally generalized linear models such as logit and probit have been employed to build these models for classification. But one often finds that the former models do not provide an adequate description of what are sometimes quite complex nonlinear relationships between the probabilities and their predictor variables and the latter models provide too crude a description of these relationships to satisfactorily discriminate among the categories. In this paper a more flexible method based on piecewise spline transformations of predictor variables is proposed for building models for classification. An actual application of this method to building a scoring model for evaluation of bankcard attrition is presented. Some of the problems associated with this method are briefly outlined.

End of Abstracts

Go to Program CSNA Home Page Prepared by S. C. Hirtle, hirtle+@pitt.edu

CLASSIFICATION SOCIETY OF NORTH AMERICA 1995 ANNUAL MEETING