Leola A. Alfonso-Reese, F. Gregory Ashby & David H. Brainard
University of California, Santa Barbara
reese@condor.psych.ucsb.edu
We report experiments in which the human subject's task is to view a continuous-valued three-dimensional stimulus and classify it into one of two categories. In each experiment the categories are represented by trivariate normal distributions with equal covariance matrices. The stimuli are randomly generated from these trivariate normal distributions. Across the experiments we manipulate complexity of category structure, where complexity is defined as a function of the category means and covariance matrices. We measure each subject's overall performance in terms of percent correct and efficiency. From our results we establish that subjects can optimally categorize three-dimensional stimuli generated from simple category structures. Also, subjects integrate information from all three stimulus dimensions rather than ignoring information in one of them. Finally, we demonstrate that as the complexity of the category structure increases, the subject's categorization performance decreases. In one experimental condition the category structure is highly complex and all subjects perform far worse than an ideal classifier. This result contrasts with previous two-dimensional studies in which subjects show high efficiencies regardless of the category structure. Altogether, our results suggest that humans can successfully classify three-dimensional continuous-valued stimuli; however, performance is significantly affected by complexity of the category structure.
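With equal covariance matrices and equal base rates, the ideal classifier against which efficiency is measured reduces to a linear decision rule. A minimal Python sketch (the category means and covariance below are invented for illustration, not the ones used in the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented category parameters: two trivariate normals, equal covariance.
mu_a = np.array([0.0, 0.0, 0.0])
mu_b = np.array([1.0, 1.0, 1.0])
cov = np.eye(3)

# With equal covariance matrices and equal priors the optimal rule is
# linear: assign x to category A when w . x > c.
w = np.linalg.solve(cov, mu_a - mu_b)
c = 0.5 * w @ (mu_a + mu_b)

def ideal_classify(x):
    """Optimal (ideal observer) decision for a single stimulus x."""
    return "A" if x @ w > c else "B"

# Ideal percent correct, estimated by Monte Carlo.
n = 20000
xa = rng.multivariate_normal(mu_a, cov, n)
xb = rng.multivariate_normal(mu_b, cov, n)
pc_ideal = ((xa @ w > c).sum() + (xb @ w <= c).sum()) / (2 * n)
print(f"ideal percent correct ~ {pc_ideal:.3f}")
```

A subject's percent correct can then be related to this ideal figure to obtain an efficiency measure.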
KEYWORDS: Category structure; Efficiency; Ideal observer; Human classification; Decision processes.
Miguel A. Arrufat
Manacon, Inc.
P. O. Box 7158, Station J
Ottawa, Ontario K2A 4C5
manacon@server.iii.net
Market definition and market segmentation work in high-technology business sectors over the past 10 years led me to conclude in [Arrufat, 1987] that the highly popular ``disjoint'' clustering methods, and some of the newly emerging ones such as the ``additive'' type, were not appropriate. In [Arrufat and Haines, 1992], a ``modified c-means fuzzy clustering method'' based on work by Roubens and Dunn was proposed as a much better way to deal with ``high tech/volatile'' market definition and segmentation, using Roubens' NFI criterion. Besides the successful applications of the method described in [Arrufat and Haines, 1992] and [Petit, 1994], there are a few others not yet reported. In spite of this success, based on ``expert opinion'', the method had not been formally validated until now. The paper describes a Monte-Carlo simulation study undertaken over the past 3 years which examines the statistical validity of both Roubens' original NFI and his newly proposed NFI'' criterion against a uniform random distribution, in all the cases reported. The results show that both the NFI and NFI'' are statistically valid measures of clustering of great use in marketing applications. In addition, the paper touches on the issue of random number generation in Monte-Carlo experiments on digital computers, and describes how painful these experiments can be.
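For readers unfamiliar with the quantities involved, the following Python sketch runs a plain fuzzy c-means and computes a nonfuzziness index of the resulting membership matrix. The index formula shown is one common formulation attributed to Roubens (via the partition coefficient) and is stated here as an assumption, not a quotation of the paper's NFI or NFI'':

```python
import numpy as np

rng = np.random.default_rng(1)

def fuzzy_c_means(X, c, m=2.0, iters=100):
    """Plain fuzzy c-means; returns the membership matrix U (n x c)."""
    U = rng.dirichlet(np.ones(c), size=len(X))      # random initial memberships
    for _ in range(iters):
        w = U ** m
        centers = (w.T @ X) / w.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / d ** (2 / (m - 1))                # standard membership update
        U /= U.sum(axis=1, keepdims=True)
    return U

def nfi(U):
    """Nonfuzziness index: 0 for a maximally fuzzy partition, 1 for a
    crisp one (one common formulation, via the partition coefficient F)."""
    n, c = U.shape
    F = (U ** 2).sum() / n
    return (c * F - 1) / (c - 1)

# Two well-separated groups should yield a nearly crisp partition.
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(3, 0.2, (50, 2))])
print(f"NFI on separated clusters: {nfi(fuzzy_c_means(X, c=2)):.3f}")
```

A Monte-Carlo validation in the spirit of the paper would compare such NFI values against those obtained from uniformly distributed (structureless) data.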
The Formula of Lance & Williams and its Extension to Classification Models of VL Type [Invited Presentation]
Helena Bacelar-Nicolau
Laboratorio de Estatistica e Analise de Dados
Faculdade de Psicologia e de Ciencias da Educacao
Universidade de Lisboa, and
Dep. Matematica, Faculdade Ciencias e Tecnologia,
Universidade Nova de Lisboa
Lisboa, Portugal
The formula of Lance and Williams is a recursive formula for hierarchical agglomerative clustering criteria. It has been widely applied to generate Models of Data Representation based on Distances, that is, of MDRD type. In our approach to classification we are mostly interested in classifying variables, and we must very often deal with hierarchical models of MDRS type, based on Similarities, and specifically on the probabilistic VL similarity coefficient. VL stands for ``Validity of Link'' (Bacelar-Nicolau 1987, Nicolau 1983), where the similarity is evaluated by the cumulative distribution function of some basic similarity coefficient, under a suitable reference hypothesis (Vraisemblance du Lien, in French; see Lerman 1970, Bacelar-Nicolau 1973). Therefore we validate the basic similarity values (which measure the degree of link between variables) by reference to the same probabilistic scale.
In this work, after reviewing some properties of the L & W formula, we use the corresponding adapted formula for analyzing data structures of MDRS type, and study an extended formula we propose for that case. It incorporates a generalised parametric probabilistic family of VL type (an extension to a more restricted family was previously studied by F. C. Nicolau).
Developing clustering models into general parametric families unifies the study of the methodology and simplifies the application of the algorithms. It allows us to perform more easily comparative studies among different models on either simulated or real data sets, searching for robustness of models in the family (that is, in some parts of the ``super model'' which the family in fact represents) or of the data structures being analyzed (choosing the model(s) which best fit the data structure). Of course, introducing VL models in the generalized recursive formula reinforces those properties and improves robustness studies with the new methodology.
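The recursive formula itself is compact enough to state in code. A minimal Python sketch of one Lance & Williams merge step; the parameter choices shown recover single and complete linkage as special cases of the parametric family:

```python
import numpy as np

def lance_williams_merge(D, i, j, a_i, a_j, beta, gamma):
    """One agglomerative step: merge clusters i and j using the recurrence
       d(k, i+j) = a_i d(k,i) + a_j d(k,j) + beta d(i,j)
                   + gamma |d(k,i) - d(k,j)|
    and return the reduced distance matrix (merged cluster placed last)."""
    new = a_i * D[i] + a_j * D[j] + beta * D[i, j] + gamma * np.abs(D[i] - D[j])
    keep = [r for r in range(len(D)) if r not in (i, j)]
    row = new[keep]
    top = np.hstack([D[np.ix_(keep, keep)], row[:, None]])
    return np.vstack([top, np.append(row, 0.0)])

D = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 2.0],
              [4.0, 2.0, 0.0]])
# Single linkage: a_i = a_j = 1/2, beta = 0, gamma = -1/2 gives min(d(k,i), d(k,j));
single = lance_williams_merge(D, 0, 1, 0.5, 0.5, 0.0, -0.5)
# complete linkage: gamma = +1/2 gives max(d(k,i), d(k,j)).
complete = lance_williams_merge(D, 0, 1, 0.5, 0.5, 0.0, 0.5)
print(single[0, 1], complete[0, 1])
```

An adaptation to similarities of VL type, as proposed in the talk, would apply the same recurrence on the probabilistic similarity scale rather than on distances.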
KEYWORDS: Aggregation criterion; Hierarchical clustering; Recursive formula; Model of data representation; Probabilistic approach.
David Banks
Department of Statistics
Carnegie Mellon University
Pittsburgh PA 15213
banks@stat.cmu.edu
Electronic access to pornography has exploded in recent years, both through the Internet and commercial bulletin boards. This talk describes joint work undertaken at Carnegie Mellon to assess quantitative aspects of the demand and supply of such material. Key problems involve the development of a classification scheme for pornographic material, the identification of patterns of user preferences from the log-files of commercial bulletin board systems operators, and the measurement of the demand for such material based on data obtained from usage records of students. The results of this study have interesting potential ramifications in law, psychology, and sociology. In particular, this is the first study that does not rely upon self-reports to measure demand for particular types and amounts of pornography.
Leon Bobrowski, Mark Vriesenda and Jack Sklansky
Department of Electrical and Computer Engineering
University of California at Irvine
Irvine, California 92717
Pattern classifiers are often designed on the basis of a finite data set of labeled feature vectors xj, where xj belongs to class k. The data set is usually used in two modes: training and evaluation. In the training mode, the classifier's parameters are chosen so as to optimize a performance criterion on the training set, which consists of either the entire given data set or a designated subset of it. In the evaluation mode, the performance of the classifier on future (unknown) data is estimated.
Often the performance criterion for the training mode is the classifier's error rate on the training set. We refer to this as the resubstitution error rate. Other criteria on the training set are possible ---e.g., an error rate in which the errors are weighted by their degrees of certainty (a high degree of certainty for xj indicates that xj is very likely misclassified). We refer to these as resubstitution criteria.
In this paper we compare the computational effectiveness and reliability of three resubstitution criteria for the design of linear classifiers: error rate, perceptron and window.
Since the data set is finite, the resubstitution error rate e is a piecewise constant function of w, where w denotes the parameters determining the classifier. It follows that the gradient of e is either zero or indeterminate; hence gradient descent cannot be applied to minimize e. Genetic algorithms offer a means of minimizing e. We evaluate the effectiveness and reliability of the resubstitution error rate criterion on the assumption that genetic algorithms are used for the search process.
The resubstitution perceptron and window criteria involve the use of a penalty function phi(xj) for each point xj in the data set. The perceptron penalty function phi(xj) is greater than zero if xj is situated on the wrong side of the decision hyperplane and increases linearly with the distance between xj and this hyperplane. The window penalty function phi(xj) increases only within a window enclosing the decision hyperplane. The perceptron and the window functions are piecewise linear. Finding the minima of these functions can be carried out effectively by basis exchange algorithms --- forms of linear programming --- even when the training sets are large. We evaluate the effectiveness and reliability of the resubstitution perceptron and window criteria on the assumption that basis exchange is used for the minimization process.
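A minimal Python sketch of the two penalty-based resubstitution criteria as described above (the window half-width h and the saturation of the penalty outside the window are illustrative assumptions):

```python
import numpy as np

def perceptron_criterion(w, b, X, y):
    """Sum of perceptron penalties phi(x_j): zero for correctly classified
    points, growing linearly with distance past the hyperplane for errors
    (y in {-1, +1}; the signed margin of x is y * (w . x + b))."""
    margins = y * (X @ w + b)
    return np.where(margins < 0, -margins, 0.0).sum()

def window_criterion(w, b, X, y, h=1.0):
    """Window penalty: phi(x_j) grows only inside a band of half-width h
    around the hyperplane and saturates outside it (an assumed form)."""
    margins = y * (X @ w + b)
    return np.clip(h - margins, 0.0, 2 * h).sum()

# Both criteria are piecewise linear in (w, b), which is what makes
# basis-exchange (linear-programming style) minimization applicable.
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1, -1])
print(perceptron_criterion(np.array([1.0, 0.0]), 0.0, X, y))   # separating w
print(perceptron_criterion(np.array([-1.0, 0.0]), 0.0, X, y))  # reversed w
```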
Julian Eugene Boggess
Department of Computer Science
Mississippi State University
P. O. Drawer CS
Mississippi State, MS 39762
It is relatively easy for humans to identify roads in satellite images through visual inspection. However, it turns out to be rather difficult to devise a computer algorithm that approaches the accuracy of a human classifier. In general, computerized approaches to road identification look at the characteristics of a given pixel in the image, and perhaps its surrounding pixels (its context), and then use a set of explicit or implicit rules to determine whether the given pixel should be classified as a member of the set of road pixels or not. Unfortunately, determining a useful set of rules is a difficult task.
A genetic algorithm is able to evolve a solution to a problem through a process of selection, crossover, and mutation, similar to the biological process upon which it is modeled. In this study, a genetic algorithm is used to evolve a good set of rules, and a good set of parameter settings for these rules, which are used to identify road pixels in LandSat satellite images. The results are compared to other techniques, including a pure neural network technique and a hybrid neural network.
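The selection/crossover/mutation loop can be sketched compactly. The toy fitness below (agreement with a fixed target bit string) merely stands in for the actual fitness, which would be the classification accuracy of the evolved rule set on training pixels:

```python
import random

random.seed(0)

# Invented stand-in target; in the road-pixel setting, fitness would be
# measured against ground-truth pixel labels instead.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1]

def fitness(bits):
    """Fraction of positions where the candidate agrees with the target."""
    return sum(b == t for b, t in zip(bits, TARGET)) / len(TARGET)

def evolve(pop_size=30, gens=60, p_mut=0.05):
    n = len(TARGET)
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]              # selection (elitist)
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, n)             # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if random.random() < p_mut else g for g in child]  # mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(f"best fitness: {fitness(best):.2f}")
```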
KEYWORDS: Genetic algorithms; Image processing
Melinda Smith de Borrero
College of Business Administration & Economics
Division of Marketing
Lehigh University
Bethlehem, PA 18015 USA
meb3@lehigh.edu
In mixture analysis, the data are taken from a mixture distribution without knowledge of the component membership of each observation, and the density of y has the form:
f(y; U) = sum_{s=1}^{S} pi_s f_s(y; theta_s),
where the unknown parameter vector U consists of the S component parameters theta_1, ..., theta_S and the S component proportions pi_1, ..., pi_S, often referred to as mixing proportions or mixing weights (with pi_s >= 0 and pi_1 + ... + pi_S = 1), so that
U = (theta_1, ..., theta_S, pi_1, ..., pi_S).
One of the most challenging problems in mixture modeling is estimating S, the number of components in the mixture distribution. A situation where the researcher has knowledge of S a priori is rare. A review of the extant marketing literature shows that a variety of model fit heuristics have been used in determining how many mixture components to retain, including Akaike's (1973) Information Criterion (AIC) and Bozdogan's (1987) consistent AIC (CAIC). Other information-theoretic fit functionals, such as Windham and Cutler's (1992, 1994) minimum information ratio (MIR) and Bozdogan's (1988, 1990, 1994) information theoretic measure of complexity (ICOMP), have yet to appear in the marketing literature, but are also potential candidates for component determination.
In this paper, the performance of AIC, CAIC, MIR, and ICOMP is systematically assessed through a series of controlled simulations. The factors varied in the simulation procedure include: number of mixture components, sample size, and level of component separation. The ML method and the EM algorithm are used for parameter estimation of a Poisson mixture density. Performance is measured by computing the squared difference between the estimated number of mixture components under the various fit measures and the true number of components.
The simulation results indicate that ICOMP and CAIC are consistently the superior model fit measures. CAIC is the favored measure in most marketing applications primarily because of its apparent computational simplicity. The performance of ICOMP, however, is superior, and ICOMP is computationally manageable given the capabilities of today's computer resources, especially in situations where the inverse of the Fisher information matrix has a closed-form solution. Furthermore, ICOMP is robust to distributional assumptions and different covariance structures in the mixture model, particularly in the multivariate case.
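As an illustration of the kind of comparison involved, the following Python sketch fits Poisson mixtures by EM for several values of S and scores them with AIC and CAIC. The data and settings are invented; MIR and ICOMP are omitted because their exact forms are not given here:

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(2)

def em_poisson_mixture(y, S, iters=200):
    """EM for an S-component Poisson mixture; returns (loglik, n_params)."""
    lgam = np.array([lgamma(int(v) + 1) for v in y])
    log_pois = lambda lam: y * np.log(lam) - lam - lgam
    lam = np.quantile(y, np.linspace(0.2, 0.8, S)) + rng.uniform(0.01, 0.1, S)
    pi = np.full(S, 1.0 / S)
    for _ in range(iters):
        logp = np.stack([np.log(pi[s]) + log_pois(lam[s]) for s in range(S)])
        r = np.exp(logp - logp.max(axis=0))
        r /= r.sum(axis=0)                            # E-step responsibilities
        pi = np.clip(r.mean(axis=1), 1e-10, None)
        pi /= pi.sum()                                # M-step: mixing weights
        lam = (r @ y + 1e-10) / (r.sum(axis=1) + 1e-10)   # M-step: rates
    logp = np.stack([np.log(pi[s]) + log_pois(lam[s]) for s in range(S)])
    m = logp.max(axis=0)
    loglik = (m + np.log(np.exp(logp - m).sum(axis=0))).sum()
    return loglik, 2 * S - 1                          # S rates + (S-1) free weights

# Two well-separated Poisson components; score S = 1..3 with AIC and CAIC.
y = np.concatenate([rng.poisson(2, 300), rng.poisson(15, 300)])
n = len(y)
scores = {}
for S in (1, 2, 3):
    ll, k = em_poisson_mixture(y, S)
    scores[S] = {"AIC": -2 * ll + 2 * k, "CAIC": -2 * ll + k * (np.log(n) + 1)}
print("CAIC picks S =", min(scores, key=lambda S: scores[S]["CAIC"]))
```

CAIC's heavier penalty, k(ln n + 1) rather than 2k, is what makes it more conservative about adding components.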
Isabelle Bournaud
Universite Paris VI
75252 Paris Cedex 05 - France
bournaud@laforia.ibp.fr
Artificial Intelligence (AI) approaches to clustering and traditional techniques developed in both cluster analysis and numerical taxonomy differ in two major aspects. Firstly, the objective of AI approaches is clearly not only to build a set of clusters, each defined extensionally as a set of objects, but also to associate an intensional definition, called a concept, with each cluster. Secondly, the quality of a clustering depends not only on within-cluster similarity and between-cluster dissimilarity, but also on the comprehensibility of the cluster descriptions. From a set of observations, conceptual clustering systems build a hierarchical organization of the acquired concepts, called a conceptual hierarchy. These approaches have proven to be efficient and are used in many domains. Examples of such systems include cluster/2 [Michalski and Stepp, 83], cobweb [Fisher, 87] and labyrinth [Thompson and Langley, 91].
However, experts in systematics are more interested in modifying acquired or existing hierarchies than in acquiring them. We propose an original approach to conceptual clustering that provides ways both to build conceptual hierarchies and to revise them easily. The method consists in extracting conceptual hierarchies from an explicitly built space, called the Generalization Space. A Generalization Space (GS) is an inheritance network that contains a computable subset of all the potential clusters over the observations, represented as nodes. Because of the inheritance structure of the GS, each node may have more than one father. Extracting a hierarchy consists in choosing a unique father for each node based on heuristics. Revising a hierarchy consists in choosing another father for some nodes, i.e. modifying the heuristics used to extract the hierarchy from the GS. The heuristics are expressed in terms of hierarchy structure (large, deep, ...) or in terms of domain knowledge (e.g. ``do not choose the proposed father f for the node n''). By restricting the building of the GS to a computable subset of all the potential clusters over the observations, as in [Mineau, 90], we have developed a low-complexity algorithm to support the proposed method.
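A toy Python sketch of the extraction and revision steps (the node names and the depth-based heuristic are invented for illustration):

```python
# Invented GS: each node maps to its candidate fathers in the network.
gs_parents = {
    "animal": [],
    "vertebrate": ["animal"],
    "mammal": ["vertebrate"],
    "winged": ["animal"],
    "bat": ["mammal", "winged"],   # multiple inheritance in the GS
}

def depth(node):
    """Distance from a node to the root, following first listed fathers."""
    d = 0
    while gs_parents[node]:
        node = gs_parents[node][0]
        d += 1
    return d

def extract_hierarchy(parents, prefer):
    """Choose one father per node with the heuristic `prefer` (a key
    function on candidate fathers), turning the GS into a tree."""
    return {n: min(ps, key=prefer) for n, ps in parents.items() if ps}

shallow = extract_hierarchy(gs_parents, prefer=depth)               # broad tree
revised = extract_hierarchy(gs_parents, prefer=lambda f: -depth(f)) # deep tree
print(shallow["bat"], revised["bat"])
```

Revision, as in the abstract, amounts to swapping in a different heuristic rather than rebuilding the GS.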
KEYWORDS: Conceptual clustering; Restructuring; Heuristics for extraction; Systematics.
Yudong Cai
Chinese Academy of Sciences
Shanghai Research Center, 500 Caobao Road
Shanghai, China 200233
Anil Chaturvedi, J. Douglas Carroll, Paul E. Green and John A. Rotondo
AT&T Bell Laboratories, Rutgers University, and University of Pennsylvania
In this paper, we present a very general approach to market segmentation via a generalization of the standard K-means clustering procedure that allows for overlap in the extracted cluster structure. This approach assumes a bilinear model which is a special case of a general multilinear model called ``CANDCLUS'' (for CANonical Decomposition CLUStering; see Carroll and Chaturvedi, in press), with mixed discrete and real parameters. Parameters of this two-way CANDCLUS model can be estimated by optimizing a loss function based on any L_p-norm, although we have used only L_1- and L_2-norms to date. In the case of an L_2-norm based loss function, fitting this model with constraints imposed to make the clustering a partition results in a procedure equivalent to K-means clustering, while fitting without such partitioning constraints results in a method we call ``overlapping K-means''. In the case of fitting via an L_1-norm based loss function, the unconstrained model yields what we call an overlapping K-medians solution, while, when partitioning constraints are imposed, this results in a K-medians solution (Hartigan, 1975). The benefit of overlapping K-means and K-medians over existing overlapping clustering procedures is their ability to handle relatively large datasets, and to provide robust estimates of cluster membership and of cluster centroids or ``generalized centroids'' (such as the vector of medians, which replaces the standard centroid if an L_1 rather than an L_2-norm is minimized). We will use the term ``centroid'' henceforth to refer to either a standard centroid (vector of means) or such a generalized centroid.
An application of the overlapping K-centroids approach to market segmentation is presented, entailing minimization of both L_1- and L_2-based loss functions -- thus using overlapping K-medians and overlapping K-means, respectively.
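A minimal Python sketch of the L_2 (overlapping K-means) case: binary memberships with several 1s per row are chosen by enumeration, and centroids are refit by least squares; the L_1 variant would replace the squared loss and the centroid update accordingly. The data, the additive form of the reconstruction, and the initialization are illustrative:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)

def overlapping_kmeans(X, K, init_idx, iters=20):
    """Alternating fit of the bilinear model X ~ P C, where rows of the
    binary membership matrix P may contain several 1s (overlap)."""
    C = X[init_idx].astype(float)
    patterns = np.array(list(product([0, 1], repeat=K)))[1:]  # skip all-zero
    P = None
    for _ in range(iters):
        recon = patterns @ C                       # profile implied by each pattern
        errs = ((X[:, None, :] - recon[None, :, :]) ** 2).sum(axis=2)
        P = patterns[errs.argmin(axis=1)]          # best pattern per object
        C, *_ = np.linalg.lstsq(P, X, rcond=None)  # least-squares centroids
    return P, C

# Objects near c0, near c1, and near c0 + c1 (a genuinely overlapping group).
c0, c1 = np.array([5.0, 0.0]), np.array([0.0, 5.0])
X = np.vstack([c0 + rng.normal(0, 0.1, (20, 2)),
               c1 + rng.normal(0, 0.1, (20, 2)),
               c0 + c1 + rng.normal(0, 0.1, (20, 2))])
P, C = overlapping_kmeans(X, K=2, init_idx=[0, 20])
print("memberships of an overlap object:", P[-1])
```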
William H. E. Day
Box 17 (Shore Road)
Port Maitland, NS B0W 2V0 Canada
whday@ac.dal.ca
The analysis of long molecular (e.g., DNA, protein) sequences is a rapidly developing branch of computational molecular biology. One of this branch's challenging problems is to align a set of n sequences such that aligned positions match in an optimal way and are comparable. Any such alignment can be represented by a matrix of n rows and m columns in which the n-tuple of symbols (e.g., bases, amino acids) in a given column represent homologous states of a biological character. At each column of the matrix, a problem of consensus description is to identify a set of symbols (e.g., ambiguity codes) that best represents the n symbols in the column. There have been few investigations of the features --- or relevance --- of rules for calculating consensus molecular sequences. Do consensus rules based on voting paradigms make sense in a biological context? Should biologists be encouraged to use consensus rules that are based on minimizing angles in Euclidean spaces? I will introduce these issues and (as time permits) discuss whether angle minimization is an appropriate model for calculating consensus molecular sequences.
KEYWORDS: Biological application; Consensus; Sequence analysis.
Donald W. Dearholt
Department of Computer Science
Mississippi State University
Mississippi State, MS 39762
dearholt@cs.MsState.edu
Growing-sphere graphs (GSGs) are introduced as a means of providing a coherent model for the clustering and representational capabilities of both sphere-of-influence graphs (SIGs) and the union of minimum-cost spanning trees in a coordinate space. The clustering properties of SIGs and of the union of mintrees are reviewed, and a novel way of viewing these graphs as GSGs is presented. Under appropriate constraints, it is shown that GSGs can generate either SIGs or the union of mintrees. Applications include modeling dynamic aspects of energy propagation, such as electromagnetic or acoustic waves.
Ward Elliott
Claremont McKenna College, Department of Government
850 Columbia Avenue
Claremont, California 97111-6420
The CMC Shakespeare Clinic has applied a battery of 52 computer-based tests to plays by 17 claimed ``True Authors'' of Shakespeare's work, to 27 plays of the Shakespeare Apocrypha, and to all of the plays of the Shakespeare Canon. It has applied up to 14 tests to poems by 28 poet claimants. Its assignment from the Shakespeare Authorship Roundtable was to find out which, if any, of the 37 testable claimants on the Reader's Encyclopedia of Shakespeare's list actually matched Shakespeare. Its conclusions:
1. None of the 37 claimants, playwright or poet, matches Shakespeare.
2. Two long-accepted plays in the Shakespeare Canon, Titus Andronicus and Henry VI, Part 3, don't match the rest and appear to have been written, in substantial part, by someone other than Shakespeare. Shakespeare's accepted authorship of the short poem A Lover's Complaint likewise seems doubtful.
3. Apart from the two outliers mentioned, no core Shakespeare play has more than three rejections in 52 tests. No claimant or Shakespeare Apocrypha play has fewer than eight rejections. Of the Shakespeare Dubitanda, only Shakespeare's part of Two Noble Kinsmen and the ``Late Stratum'' of Titus Andronicus have fewer than eight rejections.
4. Every 3,000-word block of Shakespeare's accepted poems, and the 2,600-word A Lover's Complaint, pass our powerful modal test. No claimant poem corpus of 3,000 or more words passes the modal test. Two shorter poem corpora, of only 1,000-2,000 words, passed the modal test but -- along with A Lover's Complaint -- failed non-modal testing with four or more rejections. A Lover's Complaint aside, no block of Shakespeare had more than two rejections. No claimant block had fewer than three rejections from all available tests combined.
5. If all, or even most, of its tests are valid, the Clinic has eliminated every testable claimant, including 13 who are considered ``full claimants,'' not just supposed collaborators on one work or another. 21 other claimants, four of them ``full,'' have not, to our knowledge, left any work comparable to Shakespeare's and cannot be tested using internal evidence. Among the claimants eliminated by the clinic's tests are some of the most prominent ones: the Earl of Oxford, Bacon, Marlowe, and members of the Sidney Circle.
KEYWORDS: Shakespeare; Authorship; Shakespeare claimants; Shakespeare Apocrypha; Shakespeare Dubitanda
Jean-Marc Fellous
University of Southern California
Los Angeles, CA 90089-2520
(213) 740-1922 (office) (213) 740-5687 (fax)
fellous@selforg.usc.edu
We show that gender discrimination and prediction can be achieved on the basis of purely metric information extracted from frontal pictures of faces. We show that vertical and horizontal metric facial measurements can be considered statistically independent. Five appropriately normalized facial distances, derived using discriminant analysis, explain over 95% of the gender differences of training samples and predict the gender of 85% of novel test faces exhibiting various facial expressions. The robustness of the method and of its results is assessed. It is argued that these distances (termed fiducial) are compatible with those found experimentally by psychophysical and neurophysiological studies, and that, in consequence, partial explanations for the effects observed in those studies can be found in the intrinsic nature of the facial stimuli used.
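A sketch of the discriminant-analysis step in Python, on invented distance features (the actual fiducial distances and their values are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented data: 5 normalized facial distances per face for two groups.
n = 200
male   = rng.normal([1.00, 0.80, 0.55, 0.42, 0.30], 0.05, (n, 5))
female = rng.normal([0.95, 0.84, 0.52, 0.45, 0.28], 0.05, (n, 5))

# Fisher linear discriminant: w = Sw^{-1} (mu_m - mu_f).
mu_m, mu_f = male.mean(axis=0), female.mean(axis=0)
Sw = np.cov(male, rowvar=False) + np.cov(female, rowvar=False)
w = np.linalg.solve(Sw, mu_m - mu_f)
c = 0.5 * w @ (mu_m + mu_f)

def predict_male(x):
    """Classify a batch of distance vectors by the linear rule w . x > c."""
    return x @ w > c

acc = (predict_male(male).mean() + (~predict_male(female)).mean()) / 2
print(f"training-set accuracy: {acc:.2f}")
```

A proper evaluation would, as in the abstract, report accuracy on held-out test faces rather than on the training samples.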
Jeffrey L. Goldberg
Department of Computer Science
Texas A&M University
College Station, TX 77843-3112
jeff@cs.tamu.edu
Text categorization is the classification of units of natural language text with respect to a set of pre-existing categories. There have been two primary approaches: to create text categorizers manually, via human engineering, or automatically, via machine learning.
However, there have been statistical problems associated with natural language text when it is used as input to existing, non-domain-specific machine learning algorithms such as Bayesian classification and decision tree learning (too much noise, too many features, skewed distributions). The Category Discrimination Method (CDM) is a new learning algorithm designed specifically for text categorization that addresses the shortcomings of these existing learning methods.
The bases of the CDM are research results about the way humans learn categories and concepts vis-a-vis contrasting concepts. Its essential formula is cue validity, borrowed from cognitive psychology and used to select, from all possible single-word-based features, the `best' predictors of a given category.
The CDM learns text categorizers. A precategorized test collection of text documents is divided into two parts, a training set and a test set.
For each category relevant to the test collection:
Using the training documents, the CDM determines the best predictors (i.e. features) for the category. A metric called cue validity, borrowed from cognitive psychology, is used to select the features.
Then the CDM conducts a multi-stage search to learn how these features might best be organized into a logical structure that can be used as a text categorizer. During each stage of search, literally hundreds of text categorizers are tested against a set of training documents. The best categorizers produced by each of the intermediate search stages are used to set the parameters for the next search stage.
Then the categorizer that is best at predicting the desired category on the training documents, the output of the final stage of the search, is used to predict that category on the unseen test documents. The performance of the CDM algorithm for that category can then be determined.
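The feature-selection step can be made concrete with a small sketch: the cue validity of a word w for a category c is estimated as P(c | w appears), from document counts (the toy documents below are invented):

```python
from collections import Counter

# Invented training collection: (set of words in document, category).
train = [
    ({"wheat", "export", "tonnes"}, "grain"),
    ({"wheat", "harvest", "price"}, "grain"),
    ({"crude", "barrel", "price"},  "oil"),
    ({"crude", "opec", "export"},   "oil"),
]

def cue_validities(docs, category):
    """Cue validity of word w for a category: P(category | w occurs),
    estimated by simple counts over the training documents."""
    n_with_word = Counter()
    n_with_word_in_cat = Counter()
    for words, cat in docs:
        for w in words:
            n_with_word[w] += 1
            if cat == category:
                n_with_word_in_cat[w] += 1
    return {w: n_with_word_in_cat[w] / n_with_word[w] for w in n_with_word}

cv = cue_validities(train, "grain")
# Perfectly diagnostic words score 1.0, anti-diagnostic words 0.0.
print(cv["wheat"], cv["price"], cv["crude"])
```

The highest-scoring words would then be the candidate features fed to the multi-stage search described above.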
An overall performance figure for the CDM on the test collection is obtained from the individual per-category figures by microaveraging. This tests the hypothesis that the CDM's performance will exceed that of two non-domain-specific algorithms, Bayesian classification and decision tree learners, both of which have previously been evaluated on the same test collection of documents under similar conditions.
The preliminary results have thus far confirmed the hypothesis. After evaluating 89% of the Reuters-22173 test collection, and using only the CDM's first search stage (without the benefit of the second and third fine tuning stages), its overall performance is significantly higher (72%) than that of Bayesian classification (65%) or decision tree learners (67%), based on published figures from comparable studies.
KEYWORDS: Text categorization; Machine learning; Supervised learning.
Timothy E. Goldsmith and Peder J. Johnson
Department of Psychology, Logan Hall,
University of New Mexico, Albuquerque, NM 87131
e-mail: gold@unm.edu
Graphs are increasingly used to represent proximity data. Several algorithms now exist for deriving graphs from proximities. Once a graph is derived, its structure can be described either by noting the node pairs that are directly linked (adjacency matrix) or by noting the minimum path distance between node pairs (graph distance matrix). In subsequent applications of graphs, one of these characterizations may be more useful than the other. We investigated whether higher-order structural information in graphs offers any advantage over first-order adjacency values. More specifically, we asked whether graph distances contain any unique predictive information over adjacency values.
We investigated this issue in the domain of knowledge graphs. Graphs were derived from individuals' relatedness ratings of the central concepts of a domain. Domain competence was then predicted from the similarity of each individual's graph to the graph representation of a domain expert. We computed similarity with both adjacency values and graph distances. Previous work had confounded path distance and path type. In contrast, we defined a match to occur only when a pair of nodes had both the same minimum distance and the same intervening path nodes.
The analysis of two data sets (N=101 and N=40) showed that second-order similarity did not significantly add to criterion predictiveness after first-order similarity was partialled out. These results were based on proximity graphs, which emphasize the most related pairs (Schvaneveldt, 1990). We are currently extending the analysis to graphs computed from a least-squares approach (Klauer & Carroll, 1989). If the findings hold, the implication is that the essence of graph structure lies simply in knowing a graph's direct links.
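The two kinds of similarity can be illustrated with a small Python sketch comparing two graphs over the same concepts by shared adjacencies (first-order) and by agreement of minimum path distances (second-order); for brevity it matches on distance only, not on intervening path nodes as in the stricter definition above:

```python
from collections import deque
from itertools import combinations

def graph_distances(adj):
    """All-pairs minimum path lengths by BFS (adj: node -> set of neighbours)."""
    dist = {}
    for s in adj:
        d = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    q.append(v)
        dist[s] = d
    return dist

# Two invented 'knowledge graphs' over the same four concepts.
expert  = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
student = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}

pairs = list(combinations(sorted(expert), 2))
de, ds = graph_distances(expert), graph_distances(student)
first  = sum((v in expert[u]) == (v in student[u]) for u, v in pairs) / len(pairs)
second = sum(de[u][v] == ds[u][v] for u, v in pairs) / len(pairs)
print(f"first-order similarity {first:.2f}, second-order {second:.2f}")
```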
KEYWORDS: Proximity data; Graphs; Higher-order information; Graph theory; Applications.
Paul E. Green and Abba M. Krieger
The Wharton School, University of Pennsylvania
email: Paul.Green@marketing.wharton.upenn.edu
The common practice of factor analyzing data matrices, followed by a clustering of the rotated, standardized factor scores, has recently been questioned by several researchers in marketing. This paper presents comparative analyses of nine large-scale data sets in which two different types of transformations are employed prior to K-means clustering.
The study results suggest that there is some empirical evidence for the recent claims. Clusterings based on rotated factor scores generally show poorer associations with exogenous (background) variables than those based on either the original data or a less severe preliminary transformation. Implications of these findings for post hoc market segmentation are discussed.
Bernard Harris and Erhard Godehart
University of Wisconsin, Madison and Heinrich-Heine Universitaet, Duesseldorf
harris@stat.wisc.edu
N independent and identically distributed random variables, with a distribution absolutely continuous with respect to Lebesgue measure, are observed. The random interval graph determined by their distances is calculated. Asymptotic distributions for the number of edges and the number of complete subgraphs of order k are obtained.
Cheryl Hild and Hamparsum Bozdogan
Management Science Program and Department of Statistics
bozdogan@utkvx.utk.edu
This paper studies a tree-based classification of predictor variables in
regression models with one response variable y, and m predictor
variables x_1,...,x_m via the self-organization method called the
Ivakhnenko (1968) Polynomials of Several Predictors or the Group Method of
Data Handling (GMDH). The GMDH is an elegant approach to
statistical data modeling to determine the relationships among the m given
predictors and to describe the structure of high-order regression type
models in a hierarchical fashion.
Self-organization uses combinatorial methods beginning with a few basic
quadratic equations in a series of hierarchical tree-based approaches to
classify predictors. This is called ``threshold self-selections''.
Thresholds are employed at each layer (or generation) in the hierarchy to
construct new generations of complex equations. The survival-of-the-fittest
principle is used to classify those predictors which best fit into the
desired hypersurface where more complex combinations of predictors are
formed. Each element in each layer in the hierarchy implements a nonlinear
function of two predictors based on the results of the previous layer. As
this algorithm continues for several generations, a set of mathematical
models is developed that behave more and more like the actual system of
interest (Farlow, 1981). As opposed to assuming a known functional form of
the model, the GMDH algorithm considers only the data collected.
The functional relationships between the response and predictor variables
and the overall structure is learned directly from self-organization of the
data.
Model selection strategies are introduced and developed using
information-based model evaluation criteria for the use in the GMDH
algorithm to study the quality of the competing offspring models at each
generation. More specifically, Informational Complexity (ICOMP) criterion of
Bozdogan (1988, 1990, 1994) and the classic Akaike (1973) Information
Criterion (AIC) are developed and integrated into the
technology that chooses the model which ``best'' approximates the given data
set among the set of competing alternative models with different numbers of
parameters.
A symbolic backtracing algorithm is designed for the hierarchical decision
tree to classify the best predictors and to determine the best polynomial
structure in terms of the m original predictor variables.
Several real numerical examples are shown, along with an open-architecture
symbolic computational toolbox, to illustrate the utility of the newly
proposed approach.
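The layer-building step can be sketched in a few lines of Python. This is a hedged, hypothetical simplification: each pair of predictors is combined by a least-squares linear form rather than the full quadratic polynomial of standard GMDH, and in-sample error stands in for an external selection criterion such as ICOMP or AIC.

```python
from itertools import combinations

def fit_pair(x1, x2, y):
    """Least-squares fit of z = a*x1 + b*x2 via the 2x2 normal equations."""
    s11 = sum(u * u for u in x1)
    s22 = sum(v * v for v in x2)
    s12 = sum(u * v for u, v in zip(x1, x2))
    s1y = sum(u * t for u, t in zip(x1, y))
    s2y = sum(v * t for v, t in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:          # collinear pair: skip it
        return None
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det

def gmdh_layer(features, y, keep=2):
    """One self-organizing layer: combine every pair of predictors,
    rank the offspring models by error, and keep only the best few
    (the 'threshold self-selection'); survivors feed the next layer."""
    offspring = []
    for i, j in combinations(range(len(features)), 2):
        coeffs = fit_pair(features[i], features[j], y)
        if coeffs is None:
            continue
        a, b = coeffs
        z = [a * u + b * v for u, v in zip(features[i], features[j])]
        err = sum((t - p) ** 2 for t, p in zip(y, z))
        offspring.append((err, z))
    offspring.sort(key=lambda o: o[0])
    return [z for _, z in offspring[:keep]]
```

Stacking such layers, with each layer's survivors serving as the next layer's predictors, yields the hierarchy of increasingly complex models described above.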
Dr. David I Holmes
di-holmes@csm.uwe.ac.uk
Multivariate statistical techniques such as principal component analysis,
discriminant analysis and cluster analysis have found increasing application
in the analysis of literary style in recent years. This paper commences by
reviewing examples of such applications, in particular the work of Burrows
(1992, 1994) and Binongo (1994) on the rate of occurrence of the most
frequent words in a text, and the work of Holmes (1991, 1992) on variables
measuring the richness of vocabulary of a writer.
The paper then moves to work in progress on testing the efficacy of both of
these approaches by applying them to a rather severe testing ground, namely
the authorship of the Federalist Papers. These were a series of essays
appearing over the pseudonym `Publius' during 1787 and 1788 attempting to
persuade New Yorkers to support ratification of the proposed new United
States Constitution. The essays were the subject of a well-known
investigation by Mosteller and Wallace (1984) using Bayesian techniques.
A sample of 33 papers is taken from the Federalist corpus and run on the
Oxford Concordance Program, after which a series of measures of vocabulary
richness is computed. Multivariate statistical analyses indicate that such
measures would seem, collectively, to be a good set of discriminators. Two
separate investigations are then conducted using a set of 49 high frequency
function words and the 30 Mosteller and Wallace `marker' words. Plots in the
space of the first two principal components appear to add weight to Burrows'
contention that multivariate word frequency analysis of large sets of common
words is a stylometric technique which can discriminate between writers.
The paper suggests that the selection of multiple criteria must now be
favoured to tackle problems of authorship attribution, although these
criteria must be sensitive enough to distinguish between samples from the
same genre as well as able to discriminate between samples from different
genres.
Ali Kara
axk19@psuvm.psu.edu
One of the most frequently used phrases in the marketing literature is the
identification of market segments, which usually entails the identification
of homogeneous subsets of the mass market by clustering customers on a set
of variables. Mainly, cluster analysis has been used as a classification
tool (all segmentation research, regardless of the method used, identifies
groups of people, markets, or organizations that share common
characteristics), to improve understanding of buyer behavior and to develop
taxonomies of buyer behavior (Claxton, Fry, and Portis 1974), to identify
potential product opportunities (Srivastava, Leone, and Shocker 1981;
Srivastava, Shocker, and Day 1978), and as a general data reduction
technique for data sets with many variables.
Unfortunately, cluster analysis is not a deterministic technique; i.e., it
requires subjective interpretation by the researcher. One of the major
problems concerns the selection of variables to cluster on (Green, Carmone,
and Kim 1990; Milligan 1989; DeSarbo, Carroll, Clark, and Green 1984). If
this task is not done carefully, variables containing little clustering
information could be included in the analysis. These ``noisy'' variables can
cause misleading results (``poor clusters''). The researchers argue that
``...in any given array of variables only a proper subset of the variables
contribute in an important way to the resultant clustering. Hence, one
should only use these variables and ignore the others.'' However, clustering
methods tend to be used in an exploratory fashion without utilizing as much
information about the background and structure of the data as one would use
in dependence methods, e.g., regression and discriminant analysis. Indeed,
these noisy variables can mask the clear structure portrayed by other
variables (Fowlkes, Gnanadesikan, and Kettenring 1988; Milligan 1980; Punj
and Stewart 1983).
Carmone (1990) developed an algorithm called ``Heuristic Identification of
Noisy Variables (HINoV)'' to help isolate the noisy variables prior to
clustering. If the noisy variables could be identified and removed prior to
clustering, better clusters could be obtained. HINoV identifies the noisy
variables that may exist in a data set before the partitioning process
starts, using a straightforward procedure that is easy to understand and
implement. The purpose of this study is to empirically test the robustness
of HINoV. Artificial data sets with known structures were generated with
the use of a routine developed by Milligan (1985). Several cluster recovery
measures were used to test the performance of the algorithm. Results of this
study indicate that the HINoV algorithm identified noisy variables in both
synthetic and real data sets successfully. Recovery of true cluster
structure was enhanced. The algorithm should prove quite helpful not only in
market segmentation problems, where consumers are classified into groups or
segments whose members share similar purchase behavior or general
characteristics, but also in several other behavioral science applications.
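The core of the HINoV idea can be sketched as follows. This is a hedged, hypothetical simplification: a plain Rand index replaces the adjusted Rand index of the actual algorithm, and each single variable is ``clustered'' by a crude median split rather than a proper clustering routine.

```python
from itertools import combinations

def median_split(values):
    """Trivial stand-in for clustering the objects on a single variable."""
    m = sorted(values)[len(values) // 2]
    return [0 if v < m else 1 for v in values]

def rand_index(p, q):
    """Fraction of object pairs on which two partitions agree."""
    agree = total = 0
    for i, j in combinations(range(len(p)), 2):
        total += 1
        agree += (p[i] == p[j]) == (q[i] == q[j])
    return agree / total

def noise_scores(columns):
    """For each variable, sum its agreement with every other variable;
    a low sum flags a candidate noisy variable to drop before clustering."""
    parts = [median_split(col) for col in columns]
    k = len(parts)
    return [sum(rand_index(parts[i], parts[j]) for j in range(k) if j != i)
            for i in range(k)]
```

Variables whose single-variable partitions agree with the others carry clustering information; a variable whose partition agrees with no other is the kind of ``noisy'' variable the procedure removes.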
Susan M. Keaveney
skeaveney@castle.cudenver.edu
Consumers are faced with numerous market-related decisions daily, involving
advertisements, product choices, service choices, prices, promotions, and so
on. To simplify these everyday market decisions, consumers are believed to
develop a repertoire of decision rules or ``schemas.'' For example,
consumers learn that larger sizes of products are less expensive on a per
unit basis (therefore, they can ``safely'' buy the larger size without
checking unit-prices) or that signs and in-store displays signal sale prices
(so, consumers can ``safely'' buy the sale product without the effort of
comparing prices). Consumers' schemas may or may not be accurate: although
the two schemas described above were once generally true, marketers have
since changed the rules of the game.
A price-quality schema is a decision rule that states: ``The higher the
price, the higher the quality.'' Consumers who develop and apply
price-quality schemas simplify decision-making by accepting the belief that
price is a cue for quality. Using cluster analytic techniques, product
researchers have identified four clusters of consumers, grouped by their
tendencies to develop and apply price-quality schemas about products: (1)
Consumers who believe that the rule ``the higher the price, the higher the
quality'' can be applied to most products; (2) consumers who do not form
price-quality schemas at all (but believe that price-quality relationships
can be determined only after evaluating product information); (3) consumers
who form price-quality schemas about non-durable goods but not durable
goods; and (4) consumers who form price-quality schemas about durable goods
but not non-durable goods.
Conventional wisdom in services (versus goods) marketing holds that
consumers have strong price-quality schemas about services. The rationale is
as follows: Because services are intangibles --- experiences rather than
objects --- it is difficult to evaluate services prior to actually
experiencing them. Therefore, when evaluating a service prior to purchase
and consumption, consumers must use other more tangible cues, such as price.
Prior to this study, price-quality schema hypotheses had not been
empirically tested in the context of services. Do consumers form
price-quality schemas about services? If consumers form price-quality
schemas about services, should services marketers use price to signal
quality? Are consumers more likely to form price-quality schemas about
services than about products? Are certain types of services more likely than
others to stimulate the use of price-quality schemas? For example, if a
consumer can observe service outcomes (landscaping, a haircut, house
cleaning), he/she can make decisions based on observations as well as price.
But when a consumer cannot observe service outcomes (how much was learned in
a class) then there are fewer cues to inform decision-making. Does price
then increase in information value and price-quality schemas become more
useful? Finally, if consumers form price-quality schemas, about products or
services, do they actually pay higher prices on average than consumers who
do not form price-quality schemas?
The research surveyed approximately 400 consumers of services regarding
their price-quality beliefs. Cluster analysis of the responses revealed the
following preliminary results:
One cluster of consumers forms price-quality schemas about all services.
A second cluster of consumers appears to believe that price-quality
relationships are not true or do not exist (i.e., they show an inverse
relationship).
A third cluster shows random patterns of price-quality beliefs; i.e., this
group appears to make decisions independently.
A fourth cluster of consumers forms price-quality schemas about non-personal
services (landscaping, auto mechanic, house cleaning) but not about personal
services (doctors, dentists, hotels).
A fifth cluster forms price-quality schemas about personal services
(doctors, dentists, hotels) but not about non-personal services
(landscaping, auto mechanic, house cleaning).
Finally, the research examined the question: If consumers have price-quality
schemas about products or services, do they actually pay higher prices on
average than consumers who do not form price-quality schemas? Initial
results indicate that they do not.
KEYWORDS: Cluster analysis; Marketing; K-means clustering;
Price-quality schema.
Hang-Kwang Luh, John Gittleman, and Hamparsum Bozdogan
luh@mathsun5.math.utk.edu
The search for a functional relationship between an organism and its
environment is one of the most important themes in evolutionary biology.
This functional relationship, as Darwin found a century ago, evolves through
a combination of genealogical and environmental processes. This causes
difficulties for biologists attempting to test the hypothesis that the
evolution of a phenotypic trait is under the mechanism of adaptation,
because the similarity between species may be due to common environmental
factors (adaptation) or due to common inheritance (phylogenetic constraint).
Comparative statistics with phylogenetic analysis can distinguish various
kinds of similarity by using the information of phylogenetic classification.
This approach requires two steps:
(1) removal of phylogenetic effects, and
(2) application of comparative statistics to the trait which is free of
phylogenetic effects.
In this study, we present a flexible framework to deal with these two steps
simultaneously. We first postulate a regression model representing the
correlated evolution among traits. Since the observed data in a trait are
phylogenetically correlated, the random error terms in the regression model
are not independent of one another. Thus, the proposed model should
incorporate a first-order phylogenetically autoregressive process in its
error components in order to deal with the several sources of phylogenetic
dependency in trait evolution. After
we build the comparative model structure, we can further ask which traits
should be included in the model in order to best describe the relationships
between the traits.
Here we introduce information-based model selection criteria to estimate the
performance of each of the candidate models. Information-based criteria as
``figures-of-merit'' take into account the principle of parsimony, which is
the cornerstone of biological classification. The best model representing
the evolutionary pattern of correlated traits should provide the most
information with the simplest model structure.
M. Mazumdar, D.F. Bajorin, R.J. Motzer, V. Vlamis, and G.J.
Bosl
mazumdar@biosta.mskcc.org
The management of patients with GCT has been based on statistical models
from the early 80's, even though different rules were used by different
institutions, making it difficult to compare the published literature.
Currently, platinum-based chemotherapies have approximately an 80% cure
rate. It was well accepted that allocation of patients to different risk
categories should be done on the basis of the values of pretreatment
biochemical markers (e.g., LDH, HCG, and AFP) and extent of disease. There
were many assays for each marker and various ways of defining bulk of
disease (e.g., visceral organ involvement, number of metastatic sites,
primary sites of disease) contributing to the variability. In the current
environment of managed care, it is necessary that we conserve our resources
by collaborating nationally and if possible internationally. In order to do
that, we needed to integrate the classification rules.
The objective of this project was to classify these patients into three risk
categories (say, good, intermediate and poor) which will be acceptable to
all the primary groups in this field. We wanted the good-risk category to
capture patients with a CR rate higher than 90%, so that they can be treated
with the standard therapy, since its efficacy can hardly be improved
further. The intermediate-risk category will capture the patients for whom
efficacy can be improved by performing clinical trials with new regimens and
moderate sample size requirements. The poor-risk category will consist of
patients with extremely poor prognosis, so that very aggressive therapy,
such as high-dose chemotherapy with bone-marrow transplantation, can be used
to get some response.
There were 796 patients treated at MSKCC between 1975 and 1990 on consecutive
protocols. The endpoint of interest was response (Complete (CR) vs.
Incomplete responders (IR)). We randomly chose 75% of the data to be the
developmental set and the remainder was kept as the validation set.
Multivariate logistic regression was used to find the significant variables.
Classification rules were formed using these variables, experience of the
clinicians and other factors to be described below. Akaike's Information
Criterion was used to compare different models. ROC analysis was used to
check the performance of the model in the validation set.
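As a hedged illustration of the model-comparison step only (hypothetical data and a single-predictor model, nothing like the authors' multivariate fits), AIC for a logistic regression is simply 2k minus twice the maximized log-likelihood, and the candidate with the lower value is preferred:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, steps=5000, lr=0.1):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by batch gradient ascent."""
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            r = y - sigmoid(b0 + b1 * x)   # residual drives both gradients
            g0 += r
            g1 += r * x
        b0 += lr * g0 / len(xs)
        b1 += lr * g1 / len(xs)
    return b0, b1

def aic(xs, ys):
    """AIC = 2k - 2*logL, with k = 2 parameters here; lower is better."""
    b0, b1 = fit_logistic(xs, ys)
    ll = sum(y * math.log(sigmoid(b0 + b1 * x)) +
             (1 - y) * math.log(1.0 - sigmoid(b0 + b1 * x))
             for x, y in zip(xs, ys))
    return 2 * 2 - 2 * ll
```

Fitting one such model per candidate marker set and comparing the AIC values is the flavor of comparison described above, though the actual analysis also weighed clinical considerations.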
AFP, even though not significant in our database, was used for
classification because of its statistical significance found in the analysis
of the International GCT collaboration group. Also, to remove the
variability in the units of the assays for LDH and HCG, each marker value
was transformed into a multiple of its normal value. Visceral metastasis
(metastasis to liver, lung, or bone) was used instead of the total number of
metastatic sites because of its acceptance by other groups.
Statistical classification is not always based only on statistics. Many
factors need to be incorporated to make it useful.
KEYWORDS: Response; Bone-marrow transplantation; Visceral
metastasis.
David Miller, Ajit Rao, Kenneth Rose, and Allen Gersho
rose@ece.ucsb.edu
A global optimization method is introduced for the design of statistical
classifiers minimizing the probability of misclassification. The method,
which is based on ideas from information theory and analogies to statistical
physics, is inherently probabilistic. During the design, data are assigned
to classes in probability, with the assignments chosen to minimize the
expected classification error given a specified level of randomness, as
measured by Shannon's entropy. The constrained optimization is equivalent to
a free energy minimization, motivating a deterministic annealing approach in
which the entropy and expected misclassification cost are reduced with the
temperature while enforcing the classifier's structure. In the limit, a hard
classifier is obtained. This deterministic learning approach is applicable
to a variety of classifier structures, including the widely used
prototype-based, radial basis function, and multilayer perceptron
classifiers. The method has been compared with learning vector quantization,
back propagation, several radial basis function design techniques, as well
as with paradigms for more directly optimizing these structures to minimize
probability of error. The annealing method achieves substantial performance
gains over all the other design methods on a number of benchmark examples
from the literature [e.g., Ripley (1994), J. R. Statist. Soc. B, 56,
pp. 409-456] while often retaining design complexity comparable to or only
moderately greater than that of strict descent methods. Remarkably
substantial performance gains, both inside and outside the training set, are
achieved for complicated examples involving high-dimensional data and large
class overlap.
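The Gibbs-form soft assignment at the heart of this scheme can be sketched as follows. This is a hedged illustration of the annealing mechanism only, with hypothetical scalar prototypes; the full method jointly optimizes the classifier structure while the temperature is lowered.

```python
import math

def soft_assign(x, prototypes, T):
    """P(class c | x) proportional to exp(-||x - m_c||^2 / T)."""
    d = [(x - m) ** 2 for m in prototypes]
    dmin = min(d)                      # subtract the min for numerical stability
    w = [math.exp(-(di - dmin) / T) for di in d]
    s = sum(w)
    return [wi / s for wi in w]

def entropy(p):
    """Shannon entropy of an assignment distribution (nats)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)
```

At high temperature the assignments are nearly uniform (high entropy); as T is lowered the entropy shrinks and, in the limit, each point is assigned to its nearest prototype, recovering a hard classifier.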
KEYWORDS: Classification; Information theory; Neural networks; Deterministic annealing; Shannon entropy; Statistical physics.
Glenn W. Milligan and Richard Cheng
gmilliga@magnus.acs.ohio-state.edu
The problem of within-cluster standardization is explored. Standardizing
within clusters results in a circularity problem. That is, before variable
standardization can be conducted, cluster memberships must be known.
However, before determining the cluster memberships, the data must be
standardized. One approach to dealing with the circularity problem involves
using tentative cluster memberships for preliminary standardization.
Subsequently, an iterative routine can be followed that alternates between
clustering and re-computation of the within-cluster standardized data. Such
a routine was incorporated into a K-Means clustering algorithm for testing
purposes. Bivariate simulated data sets were constructed to evaluate various
iterative strategies. The presentation will focus on the effect of
within-cluster standardization by using snapshot graphical displays showing
movement of the data points and tentative cluster assignments.
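The alternating routine can be sketched as follows. Hypothetical details for brevity: k = 2, tuple-valued bivariate points, a crude extreme-point initialization, and each variable rescaled by its pooled within-cluster standard deviation; the study's actual routine and data generation are richer.

```python
def assign(pts, c):
    """Label each point by the nearer of two centers."""
    return [0 if sum((a - b) ** 2 for a, b in zip(p, c[0])) <=
                 sum((a - b) ** 2 for a, b in zip(p, c[1])) else 1
            for p in pts]

def centers(pts, lab):
    out = []
    for k in (0, 1):
        mem = [p for p, l in zip(pts, lab) if l == k] or pts
        out.append(tuple(sum(v) / len(mem) for v in zip(*mem)))
    return out

def kmeans2(pts, iters=10):
    """Plain K-Means with k=2."""
    lab = assign(pts, [min(pts), max(pts)])
    for _ in range(iters):
        lab = assign(pts, centers(pts, lab))
    return lab

def within_sd(pts, lab, dim):
    """Pooled within-cluster standard deviation of one variable."""
    ss = 0.0
    for k in (0, 1):
        vals = [p[dim] for p, l in zip(pts, lab) if l == k]
        if vals:
            m = sum(vals) / len(vals)
            ss += sum((v - m) ** 2 for v in vals)
    return (ss / len(pts)) ** 0.5 or 1.0

def standardized_kmeans(pts, rounds=3):
    lab = kmeans2(pts)              # tentative memberships break the circle
    for _ in range(rounds):         # then alternate: rescale, re-cluster
        sds = [within_sd(pts, lab, d) for d in range(len(pts[0]))]
        scaled = [tuple(v / s for v, s in zip(p, sds)) for p in pts]
        lab = kmeans2(scaled)
    return lab
```

The tentative memberships from the first pass supply the within-cluster standard deviations; re-clustering the rescaled data then revises the memberships, which is exactly the circularity-breaking iteration described above.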
KEYWORDS: Variable standardization; K-means clustering; Monte Carlo methods.
P. Antonio Olmos-Gallo
polmos@pstar.psy.du.edu
The ability of Radial Basis Function Networks (RBFNs) to classify has
been studied for some time (Moody & Darken, 1988; 1989). This architecture
is considered a very fast learner and a very accurate classifier (Singer &
Lippmann, 1991; Röscheisen, Hoffman & Tresp, 1991). However, its
shortcomings are also well known: computation with this type of network is
very expensive in terms of the number of units that need to be utilized
(Hartman & Keeler, 1991; Hartman, Keeler & Kowalski, 1990), and such
networks have a strong tendency to overfit. Several modifications have been
proposed to deal with these shortcomings, including growing neural-gas
algorithms (Fritzke, 1994; 1995), and arbitrarily reducing the number of
units to improve generalization (Bishop, 1991; Omohundro, 1991; Wettschereck
& Dietterich, 1991). In this paper, we explore a biologically inspired
alternative: the addition of a second layer to the network. It is suggested
that this addition may help to improve generalization by removing
complexity, reducing the dimensionality of the problem space, and providing
a reorganization of the representation in such a way that features that
influence each other get closer. The network was tested on a pattern
classification problem known as Hyperacuity. Its fit to psychophysical data
as well as its shortcomings are discussed.
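A minimal RBF network along these lines can be sketched as follows. Hypothetical simplifications throughout: one input dimension, two fixed Gaussian units, and output weights from a 2x2 least-squares solve; the paper's networks, and the proposed second layer, are of course richer.

```python
import math

def rbf(x, c, width=1.0):
    """One Gaussian hidden unit centered at c."""
    return math.exp(-((x - c) ** 2) / (2 * width ** 2))

def train(xs, ys, c1, c2):
    """Least-squares output weights for y = w1*h1(x) + w2*h2(x)."""
    h1 = [rbf(x, c1) for x in xs]
    h2 = [rbf(x, c2) for x in xs]
    a = sum(v * v for v in h1)
    d = sum(v * v for v in h2)
    b = sum(u * v for u, v in zip(h1, h2))
    p = sum(u * y for u, y in zip(h1, ys))
    q = sum(v * y for v, y in zip(h2, ys))
    det = a * d - b * b
    return (p * d - q * b) / det, (q * a - p * b) / det

def predict(x, c1, c2, w1, w2):
    return w1 * rbf(x, c1) + w2 * rbf(x, c2)
```

The hidden layer here is fixed; the cost the abstract refers to comes from needing many such units for complex problems, which is what the added second layer is meant to mitigate.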
Michael P. Peacock and Pablo Antonio Olmos-Gallo
mpeacock@synapse.psy.du.edu and polmos@pstar.psy.du.edu
Two nonlinear classification systems were explored for their utility in
assisting in the classification and prediction of developmental dyslexia.
The performance of two different neural network architectures, multi-layer
perceptron (MLP) and radial basis function (RBF), was compared to the
performance of linear discriminant analysis (LDA). The database consisted of
eleven assessment scores used by expert clinicians in the diagnosis of
dyslexia for a sample of 121 children between the ages of nine and fifteen.
For this problem, we were only interested in individuals who were either
diagnosed with dyslexia, or were diagnosed with no specific reading
disorder. Such a binary classification problem may seem trivial at first
glance, but a caveat is in order: since we believe that the mapping from
clinical assessments to a diagnosis is nonlinear, we expect an implicitly
nonlinear classification system, like a neural network, to outperform a more
traditional linear classification system. Better classification tools are
useful to clinical psychologists since they would lead to more accurate and
consistent diagnoses of psychological disorders.
Preliminary results suggest that the networks outperform LDA in
classification of cases, but only the MLP network outperforms LDA in
prediction of dyslexia. These results demonstrate that for this type of
problem, we can improve generalization performance through the use of
nontraditional classification techniques. In addition, hidden unit analysis
through the course of network training may prove useful in training clinical
psychology students to perform diagnoses of dyslexia.
KEYWORDS: Linear discriminant analysis; Multi-layer perceptron; Radial basis function; Diagnosis of dyslexia.
Min Pei, Ying Ding, William F. Punch III, and Erik D. Goodman
pei@egr.msu.edu
In a decision-theoretic or statistical approach to pattern recognition, the
classification or description of data is based on the set of data features
used. Therefore, feature selection and extraction are crucial in optimizing
performance, and strongly affect classifier design. Defining appropriate
features requires interaction with experts in the application area. In
practice, there is much noise and redundancy in most high dimensionality,
complex patterns. Therefore, it is sometimes difficult even for experts to
determine a minimum or optimum feature set. The ``curse of dimensionality''
becomes an annoying phenomenon in both statistical pattern recognition (SPR)
and artificial Neural Network (ANN) techniques. Researchers have discovered
that many learning procedures lack the property of ``scaling,'' i.e., these
procedures either fail or produce unsatisfactory results when applied to
problems of larger size.
To address this problem, we have developed two approaches, both based on
genetic algorithms (GA's). The basic operation of these approaches utilizes
a feedback linkage between feature evaluation and classification. That is,
we carry out feature extraction (with dimensionality reduction) and
classifier design simultaneously, through ``genetic learning and evolution.''
The objective of these approaches is to find a reduced subset from the
original N features such that the useful class discriminatory information
is included and redundant class information and/or noise is excluded. We
take the following general approach. The data's original feature space is
transformed into a new feature space with fewer features that (potentially)
offer better separation of the pattern classes, which, in turn, improves the
performance of the decision-making classifier. The criterion for optimality
of the feature subset selected is usually the probability of
misclassification. Since the number of different subsets of N available
features can be very large, exhaustive search is computationally infeasible
and other methods must be examined. For example, a number of heuristic
techniques have been used, but it is not clear under what circumstances any
one heuristic should be applied, as each has its good and bad points.
Genetic algorithms are good candidates for this task, since GAs are most
useful in multiclass, high-dimensionality problems when knowledge about
performance of heuristics is sparse or incomplete. In order to apply a GA to
classification/discrimination tasks, we combined the GA with two different
approaches: the K Nearest Neighbor decision rule and a production decision
rule.
GA/KNN hybrid approach (Genetic algorithms combined with K Nearest
Neighbor). Here the GA defines a population of weight vectors W, each of
which is multiplied with every sample's data pattern vector X, yielding
new feature vectors Y for the given data. The KNN algorithm then classifies
the entire Y-transformed data set. The results of the classification are
fed back as part of the fitness function, to guide the genetic learning or
search for the best transformation. The GA answer is the ``best'' vector W
under the potentially conflicting constraints of the evaluation function. For
example, the ``best'' W might be both parsimonious (having the fewest
features) and give the fewest misclassifications. Thus the GA searches for an
optimal transformation from pattern space to feature space for improving the
performance of the KNN classifier.
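The feedback linkage between weighting and classification can be sketched as follows. This is a heavily hedged simplification: a bare-bones (1+1) evolutionary loop stands in for a full GA population, leave-one-out k-NN error is the entire fitness function, and all data here are hypothetical.

```python
import random

def weighted_dist(a, b, w):
    return sum(wi * (ai - bi) ** 2 for ai, bi, wi in zip(a, b, w))

def loo_errors(pts, labels, w, k=3):
    """Leave-one-out k-NN error count under feature weights w."""
    errs = 0
    for i, p in enumerate(pts):
        near = sorted((j for j in range(len(pts)) if j != i),
                      key=lambda j: weighted_dist(p, pts[j], w))[:k]
        votes = [labels[j] for j in near]
        if max(set(votes), key=votes.count) != labels[i]:
            errs += 1
    return errs

def evolve_weights(pts, labels, gens=200, seed=0):
    """Mutate the weight vector W; keep a candidate whenever the fed-back
    k-NN error does not get worse (ties kept to allow drift)."""
    rng = random.Random(seed)
    w = [1.0] * len(pts[0])
    best = loo_errors(pts, labels, w)
    for _ in range(gens):
        cand = [max(0.0, wi + rng.gauss(0, 0.3)) for wi in w]
        e = loo_errors(pts, labels, cand)
        if e <= best:
            w, best = cand, e
    return w, best
```

A weight driven to zero effectively deletes a feature, which is how the search can arrive at a W that is both parsimonious and accurate.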
GA/RULE approach (Genetic algorithms combined with a production rule
system). This approach is based on the general structure of classifier
systems, and focuses on the classification of binary feature patterns. A
single, integrated rule format is used. The inductive learning procedure of
the GA evolves a rule classifier using a known labeled ``training'' sample
set. The result of training is a small set of ``best'' classification rules
which classify the training set more accurately than other rule sets. By
examining these ``good'' rules, one can determine the features which are
most useful for classification. This provides information useful to domain
experts for deriving new hypotheses.
The two methods described above were applied successfully to three
high-dimensionality (96 or more features) biological binary pattern data
sets with multiple classes. The results achieved indicate potential for
application in many other fields. They can also be further developed using a
GA combined with other classifiers, such as neural networks, to solve a
fairly broad class of complex pattern recognition problems.
KEYWORDS: Genetic algorithms; Classification; Feature extraction; High dimensionality.
L. Pratt and C. Medina
lpratt@mines.colorado.edu and cmedina@mines.colorado.edu
An important forensic task is to analyze and compare hair evidence in
criminal cases. A forensic expert compares sets of hair images from evidence
at a crime scene to a set from hair samples from a suspect. Under a
microscope, hairs are fairly distinctive. Forensic analysis, however, is a
tedious procedure. The forensic expert must manually compare hundreds of
hair images to determine whether or not two sets of samples came from the
same person.
This project explores the automation of hair analysis. Our system takes as
input microscopic images of two hairs and produces a classification decision
as to whether or not the hairs came from the same person.
Each image is segmented into a number of pieces appropriate for
classification of different features. A variety of image processing
techniques are used to enhance this information. Neural networks are then
used for feature classification. Finally, statistical tests determine the
degree of match between the resulting collection of hair feature vectors.
An important issue in the automation of any task used in criminal
investigations is the reliability and understandability of the resulting
system. To address this concern, we have also developed methods to
facilitate explanation of the neural networks' behavior using a decision
tree.
KEYWORDS: Neural networks; Pattern classification; Decision trees; Statistics.
Venkatram Ramaswamy, Rabikar Chatterjee, and Steven H. Cohen
Pick-any data are easy to collect and are indeed commonly
collected by marketing researchers. This paper proposes an
approach to market segmentation given pick-any data on
multiple potential bases for segmentation (e.g.,
benefits sought, products or services used, occasions for
use) that may be interdependent. The proposed ``integrated''
latent segmentation approach simultaneously extracts
segments on multiple bases while capturing and explicitly
estimating the structure of interdependence between the
bases. A hypothetical example motivates how the approach
works and demonstrates its benefits relative to other
potential approaches. An empirical application is
presented, using pick-any data collected by a regional bank
on two popular, conceptually appealing, and interdependent
bases for segmenting customers of financial services --
benefits (i.e., desired financial goals) and product usage
(of an array of banking services).
Joseph Rudman
Rudman@cmphys.phys.cmu.edu
The use of statistics in authorship attribution studies can be as dangerous
as it is valuable. This paper reports on the who, what, when, where, why and
how of statistics in attribution studies:
An historical overview.
The value of statistics to attribution studies.
When statistics should not be employed.
Some problem studies.
Making assumptions about the data.
An example from research on the canon of Daniel Defoe.
A substantial bibliography is provided.
KEYWORDS: Authorship; Statistics; Defoe.
Brad Warner
bw@rover.hsc.colorado.edu
In prediction or classification modeling problems we often have two
contrasting situations of interest: (1) we either build models solely for
prediction purposes because we are interested only in accurate estimation of
a response variable or (2) we build models to understand the relationship
between the independent (or predictor) variables and the outcome. In the
second case, we often want to statistically test this association. The model
complexity is sacrificed for ease of interpretation. Sometimes, a situation
between these two extremes is of interest. We may have some confounding
variables that must be accounted for in the model but which will not be used
for testing purposes. For example, we may be interested in understanding
whether serum cholesterol affects the outcome of a coronary bypass
operation. We know that age and sex affect this outcome so we want to
account for them in our model; however, we only want to test for an
association between cholesterol and bypass mortality. The confounding
variables can be thought of as nuisance variables.
Neural networks have received a great deal of attention over the last few
years in the areas of prediction and classification, areas where statistical
techniques have traditionally been used. The disadvantage with neural
network models is that they are difficult to interpret at the independent
variable level. In our example, if we used a neural network to build our
model, it would be difficult to test only for an association between
cholesterol and bypass surgery outcome. Logistic regression, on the other
hand, has gained a great deal of popularity in prediction and classification
problems because its parameters are easy to interpret. This
simplicity comes from the modeling constraint of linearity on the logit
scale. The price paid for the ease of interpretability is loss of predictive
power in cases where the linearity assumption is weak. A neural network
could potentially have better predictive performance in this situation as no
assumption is made about the structure of the problem. One way to think of
this relationship is that the linearity constraint causes logistic
regression to be a special case of a neural network. Since the neural
network is more general, it is able to model more complex relationships.
In a scenario where our modeling has nuisance (or confounding) variables, it
might be advantageous to exploit the predictive power of a model such as a
neural network in the confines of a logistic regression model. Thus, we may
be able to test variables of interest as in a traditional logistic
regression model, but after we have accounted for as much of the variability
in the response due to the nuisance variables as possible.
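One way to make this concrete, as a hedged sketch of the general idea rather than any of the paper's three methods: fit a flexible model to the nuisance variables first, then hold its linear predictor fixed as an offset while estimating (and testing) only the coefficient of the variable of interest.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_offset_logistic(xs, ys, offsets, steps=4000, lr=0.1):
    """Fit P(y=1) = sigmoid(offset_i + b * x_i) by gradient ascent.
    The offsets are assumed precomputed by a model of the nuisance
    variables (e.g., a neural network fit to age and sex) and stay
    fixed; only b, the coefficient of interest, is estimated."""
    b = 0.0
    for _ in range(steps):
        g = sum((y - sigmoid(o + b * x)) * x
                for x, y, o in zip(xs, ys, offsets))
        b += lr * g / len(xs)
    return b
```

Because the nuisance variability is absorbed by the fixed offset, the estimated b reflects only the association between the variable of interest (cholesterol, in the example) and the outcome.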
This paper briefly describes the two models, logistic regression and neural
networks, pointing out their similarities and differences. Then we
demonstrate three methods to incorporate a neural network that accounts for
nuisance variables into a logistic regression model. We discuss situations
where these methods might be employed. We also discuss how other complex
predictive models such as projection pursuit regression or multivariate
adaptive regression splines could be incorporated into this framework.
Finally, we discuss the limitations of this modeling approach.
KEYWORDS: Combining models; Generalized linear model.
Stanley Wasserman and Philippa Pattison
Spanning nearly sixty years of research, statistical analysis of graphs has
passed through (at least) two generations of researchers and models.
Beginning in the late 1930's, the first generation of research dealt with
the distribution of various graph statistics, under a variety of null
models. The second generation, beginning in the 1970's and continuing into
the 1980's, worked on models, usually for probabilities of relational ties
among very small subsets of nodes, in which various simple substantive
tendencies were parameterized. Much of this research, most of which utilized
log linear models, first appeared in applied statistics publications.
But recent developments in social network analysis promise to bring us into
a third generation. The Markov random graphs of Frank and Strauss (1986) and
especially the estimation strategy for these models developed by Strauss and
Ikeda (1990) (described in brief in Strauss 1992), are very recent and
promising contributions to this field. These models have not yet been widely
applied, nor have they been generalized to a wider range of data. It appears
that one can now model easily, succinctly, and most importantly, accurately,
a wide range of structural features of network relations, including
valued relations, multivariate relations, and relations subject to
algebraic constraints.
In this talk we review and extend these models and demonstrate how they can
be used to address a variety of substantive questions about structure in
social networks. We show how Strauss and Ikeda's (1990) strategy may be used
to obtain approximate model fits using logistic regression. These
generalizations permit heterogeneous structural subgroup effects of
many different kinds. In addition, we outline the relationship
between Holland and Leinhardt's (1981) p_1 log linear model and a
corresponding logit formulation. We describe applications of these models to
the problem of classification of actors into distinct subgroups.
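The Strauss and Ikeda strategy can be illustrated with a small sketch of our own; the graph and the choice of statistics are purely illustrative. Each dyad's tie indicator is regressed, by ordinary logistic regression, on the "change statistics" obtained by toggling that tie, here for the edge and triangle counts:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n = 30
A = rng.binomial(1, 0.2, size=(n, n))  # random graph for illustration
A = np.triu(A, 1)
A = A + A.T                            # symmetric 0/1 adjacency, no self-ties

# Change statistics: toggling tie (i, j) on changes the edge count by 1
# and the triangle count by the number of common neighbours of i and j.
common = A @ A
X, y = [], []
for i in range(n):
    for j in range(i + 1, n):
        X.append([1.0, float(common[i, j])])
        y.append(float(A[i, j]))
X = np.asarray(X)
y = np.asarray(y)

# Maximum pseudolikelihood: ordinary logistic regression of each dyad's
# tie indicator on its change statistics, fit by Newton-Raphson.
theta = np.zeros(2)
for _ in range(25):
    p = sigmoid(X @ theta)
    W = p * (1.0 - p)
    theta += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - p))
```

The resulting coefficients approximate the Markov graph parameters; standard logistic regression software suffices, which is exactly what makes the approach attractive.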
KEYWORDS: Categorical data analysis; Network analysis; Graph
theory; Markov random fields; Markov graphs; Logit models.
Ming Zhang, Ph.D.
In many business applications of discriminant analysis it is often not
sufficient to simply classify an observation of a random vector into one of
several categories. Rather it is more desirable to provide a set of
estimates of probabilities with which an observation falls into each
category. These estimates of probabilities are functions of the random
vector associated with an observation. One example of such an approach to
classification problems in the business world is to use scoring models to
classify the general population of accounts into ``good'' and ``bad''
subpopulations in the marketing and banking industries. Traditionally
generalized linear models such as logit and probit have been employed to
build these models for classification. One often finds, however, that
these models do not provide an adequate description of the sometimes
quite complex nonlinear relationships between the probabilities and
their predictor variables, and that they consequently describe these
relationships too crudely to discriminate satisfactorily among the
categories. In
this paper a more flexible method based on piecewise spline transformations
of predictor variables is proposed for building models for classification.
An actual application of this method to building a scoring model for
evaluation of bankcard attrition is presented. Some of the problems
associated with this method are briefly outlined.
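A minimal sketch of the idea follows, assuming linear (hinge) splines with hand-picked knots rather than whatever basis the paper actually uses; the data are synthetic. The predictor is expanded into a piecewise spline basis and an ordinary logit model is fit on the transformed variables, capturing a nonlinear relationship a plain logit model would miss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def linear_spline_basis(x, knots):
    """Piecewise-linear spline expansion of one predictor:
    columns [1, x, (x - k)_+ for each knot k]."""
    cols = [np.ones_like(x), x]
    cols += [np.maximum(x - k, 0.0) for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=2000)
# A U-shaped relationship on the logit scale, which a model linear in x
# could not capture
y = rng.binomial(1, sigmoid(np.abs(x) - 1.5)).astype(float)

knots = [-1.5, 0.0, 1.5]
X = linear_spline_basis(x, knots)

# Fit the logit model on the spline-transformed predictor (Newton-Raphson)
beta = np.zeros(X.shape[1])
for _ in range(30):
    p = sigmoid(X @ beta)
    W = p * (1.0 - p)
    beta += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - p))
```

The fitted probabilities recover the U-shape because the hinge basis spans piecewise-linear functions of the predictor; in a scoring application the same expansion would be applied to each predictor before fitting.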
End of Abstracts
Tree-Based Classification of Predictors in Regression Models and Structure Determination via Self-Organization: The GMDH
University of Tennessee, Knoxville, TN 37996 USA
Multivariate Approaches to Authorship Attribution
Department of Mathematical Sciences
University of the West of England
Coldharbour Lane, Bristol BS16 1QY, England
An Empirical Investigation of the Robustness of the ``Heuristic Identification of Noise Variables (HINoV)'' Algorithm
Pennsylvania State University at York
1031 Edgecomb Avenue, York, Pennsylvania 17403-3398
Consumers' Price-Quality Schemas for Services
Graduate School of Business Administration, Campus Box 165
University of Colorado at Denver
Denver, Colorado 80217-3364
A Correlated Evolution Model Using a Multiple Regression with Phylogenetically Autoregressive Error Components
Department of Mathematics, Department of Zoology and Department of Statistics
University of Tennessee, Knoxville, TN 37996 USA
Classification of Patients with Germ Cell Tumors (GCT) Using
Factors Predictive of Response to Platinum-Based Chemotherapy
Department of Epidemiology and Biostatistics, and Genitourinary Oncology
Service, Department of Medicine
Memorial Sloan-Kettering Cancer Center (MSKCC)
New York, NY 10021
An Information-theoretic Approach to Optimal Design of Statistical Classifiers
Department of Electrical and Computer Engineering
University of California Santa Barbara, CA 93106
An Evaluation of Within-Cluster Variable Standardization
Dept. of Management Sciences, 302 Hagerty Hall
The Ohio State University
Columbus, Ohio 43210 USA
A Multi-Layer Radial Basis Function Network That Can Improve Generalization
Department of Psychology, University of Denver.
A Comparison of Linear and Nonlinear Classification Methods in the Diagnosis of Dyslexia
Experimental-Cognitive Psychology Laboratory
University of Denver
Denver, Colorado
Genetic Algorithms for Classification and Feature Extraction
Beijing Union University, Beijing, China,
Case Center for Computer-Aided Engineering and Manufacturing, and
Intelligent Systems Laboratory, Department of Computer Science
Michigan State University, 112 Engineering Building
East Lansing, Michigan 48824
Forensic Hair Analysis Using Pattern Classification Techniques
Department of Mathematical and Computer Sciences
Colorado School of Mines
Golden, Colorado 80401
Market Segmentation on Multiple Bases with Pick-Any Data: An
Integrated Approach
The University of Michigan, Ann Arbor, Michigan and
Stratford Associates, Boston, Massachusetts
Statistics and Authorship Attribution Studies
Department of English
Carnegie Mellon University
Pittsburgh, PA 15221 USA
Incorporating a Neural Network into Logistic Regression
Dept. of Preventive Medicine and Biometrics
University of Colorado
Health Sciences Center, Denver, CO 80220
Logistic Regressions for Graphs and Social Networks
University of Illinois and University of Melbourne
Applications of Spline Transformations to Building Models for Classification
Department of Quantitative Solutions
Equifax Credit Information Services, Inc.
1525 Windward Concourse
Alpharetta, Georgia 30302
Prepared by S. C. Hirtle,
hirtle+@pitt.edu