- Recent writings of J.-P. Benzécri: Revue MODULAD,
"Si j'avais un laboratoire ...", No. 38; and
"L'Analyse des données : histoire, bilan, projets, ...,
perspective", No. 35.
(English translation of
part of page 1 of "Si j'avais un laboratoire ..." in text - LaTeX format.)
- A recent issue of
Electronic Journal for History of Probability and Statistics,
Vol.4/2 December 2008: "About the History of Multivariate Exploratory Data
Analysis", with many articles dealing with the history and development of
Benzécri's approach to analysis of data and science.
This includes a
version in English of the article L'Âme au bout d'un rasoir, The
Soul at the Razor's Edge, originally published in the journal
Les Cahiers de l'Analyse des Données, vol. V, no. 2, 1980.
Updated version (10 May 2009) of
L'Âme au bout d'un rasoir, The
Soul at the Razor's Edge.
6th Annual Public Boole Lecture in Informatics
Boole Centre for Research in Informatics,
Annual Boole Lecture Series, University College Cork,
Ireland, 29 April 2008
Fionn Murtagh, "The Correspondence Analysis Platform for Uncovering Deep
Structure in Data and Information"
Computer Journal, in press,
Advance Access 9 Sept. 2008
Correspondence Analysis and Data Coding
with R and Java
The software and data presented here accompany the book
Correspondence Analysis and Data Coding with R and Java, by Fionn
Murtagh, Chapman & Hall/CRC, 2005, pp 250+xviii.
Benzécri, from Foreword:
"Physics progresses, mainly, by constituting corpora of rare
phenomena among immense sets of ordinary cases. The simple
observation of one of these ordinary cases requires detection
apparatus based on millions of small elementary detectors.
Yet physics is, in part, a computational science, as evidenced
by the conclusion of a paper on the theory of generalized zeta
functions: "Our results are secure, numerically, yet appear very
hard to prove by analysis".
I repeat: the statistician has to be modest. The work of my generation
has been exalting. A new statistical and data analysis is there to be
invented, now that one has inexpensive means of computation that could
not be dreamed of just thirty years ago."
- Table of Contents (in PDF).
- Avant-Propos, Foreword (in French and English) by J.P. Benzécri
- Theory of Correspondence Analysis
- Input Data Coding
- Examples and Case Studies
- Content Analysis of Text
- Concluding Remarks, References, Index
- Link to online description - Chapman & Hall/CRC
Computer Science & Data Analysis series
- Link to Amazon
- Link to Amazon.co.uk
Some of the programs, especially
the R and C ones, are in ASCII text. Others
are binary (e.g. the clustering DLL program, and the Java class files).
The Java code and the data sets are collected together in tar files, to
be extracted using WinZip, tar, or a similar utility.
1. Software in R
The R software can be obtained for most computer platforms from
The R Project for Statistical Computing.
- Correspondence analysis
- Hierarchical clustering
- Interpretation aids
- Utilities and data
2. Text Processing
The text processing support programs are all in C.
- Analysis of multiple text files.
47 aviation accident reports, the list of these files, programs
txtanalysis.c and xtabulate.c, and output files words.txt, words0.txt,
and xtabulate.txt, used as examples in the following.
- txtanalysis.c program, which is run
as follows: txtanalysis filelist.txt [words.txt]
- xtabulate.c program, which is run as
follows: xtabulate words.txt filelist.txt [xtabulate.txt]
- word_analysis.c, a program that checks
for a sufficient number of occurrences of words in all texts. Hence,
this program yields a common word-list, which can be
used by xtabulate.
- Analysis of a single (large) text file.
- arist10.txt, Aristotle's Categories.
Note: we removed the legal information (so as not to influence the
analysis) to yield the file arist10x.txt.
- docanalysis.c program, to produce a word
list from a single text file. Example of use: docanalysis
arist10x.txt words.txt. For the Categories, 1260 words are found.
It is best to filter or cull these (or else the cross-tabulations
to follow will be very large).
- doctabulate.c program, to produce a
cross-tabulation for each of the chapter and section levels in the
Categories. Use: doctabulate arist10x.txt words.txt out. This
produces the cross-tabulations, or contingency tables, out1.txt,
out2.txt, out3.txt, out4.txt, corresponding to the different section
levels in this book.
- Notes: programs txtanalysis and xtabulate should acceptably handle
(i) accented characters, and (ii) use in a Mac OS X environment. (On
the latter: memory allocation is already catered for there, so
the line "#include <malloc.h>" at the
start of the file should not be present.)
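As a rough illustration of the kind of texts-by-words cross-tabulation that
xtabulate and doctabulate are described as producing, the following C sketch
counts how often each word of a word-list occurs in each text. It is not the
code of those programs. The function names count_word, cross_tabulate and
occurs_in_every_text are hypothetical, tokens are taken to be runs of ASCII
letters only (so accented characters are not handled here), and the threshold
test is just one reading of "sufficient number of occurrences of words in all
texts". Memory allocation comes from <stdlib.h>, so no <malloc.h> line is
needed.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>   /* calloc and free; no <malloc.h> required */
#include <string.h>

/* Count case-insensitive occurrences of `word` in `text`.
   A token is a maximal run of ASCII letters (an assumption of this sketch). */
int count_word(const char *text, const char *word)
{
    size_t wlen = strlen(word);
    int count = 0;
    const char *p = text;

    while (*p != '\0') {
        /* Skip separators to reach the start of the next token. */
        while (*p != '\0' && !isalpha((unsigned char)*p)) p++;
        const char *start = p;
        while (*p != '\0' && isalpha((unsigned char)*p)) p++;
        size_t len = (size_t)(p - start);

        if (len > 0 && len == wlen) {
            size_t k = 0;
            while (k < len &&
                   tolower((unsigned char)start[k]) ==
                   tolower((unsigned char)word[k])) k++;
            if (k == len) count++;
        }
    }
    return count;
}

/* Build an n_texts x n_words contingency table, stored row by row:
   entry (i, j) is at table[i * n_words + j].
   The caller frees the result; NULL is returned if allocation fails. */
int *cross_tabulate(const char *const texts[], size_t n_texts,
                    const char *const words[], size_t n_words)
{
    assert(n_texts > 0 && n_words > 0);
    int *table = calloc(n_texts * n_words, sizeof *table);
    if (table == NULL) return NULL;

    for (size_t i = 0; i < n_texts; i++)
        for (size_t j = 0; j < n_words; j++)
            table[i * n_words + j] = count_word(texts[i], words[j]);
    return table;
}

/* Return 1 if word j occurs at least min_count times in every text,
   0 otherwise. Words passing this test would form a common word-list. */
int occurs_in_every_text(const int *table, size_t n_texts, size_t n_words,
                         size_t j, int min_count)
{
    for (size_t i = 0; i < n_texts; i++)
        if (table[i * n_words + j] < min_count) return 0;
    return 1;
}
```

Each row of the returned table corresponds to one text and each column to one
word. Columns for which occurs_in_every_text returns 0 could be dropped before
analysis, which keeps the table small, in the spirit of the advice above to
filter or cull the word list.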
3. Software in Java
To install JDK or JRE (see below), check
Sunsoft Sun Developer Network Site.
4. Updates to the Book
- P. 35, 2 lines above expressions (2.4),
change j \in J to: i \in I.
- P. 37, line -14, change f_K to
f_I. And in paragraphs on lines -11, -10, delete opening words and
terms to begin sentence with: We can right-multiply the
eigen-equation above ....
- P. 113, line 6 of text, 40% should be 48%.
- See above for changes to the C program for hierarchical clustering
(minimum variance/Ward's with weighting of rows/cases), and associated R
- A new version of facor, with an example of
use at the start of the program. Input data set,
casa2.prn (text file), a characterization with 13 person and place
attributes of the 77 scenes of the film, Casablanca.
5. Other R Software for Correspondence Analysis
6. Book Reviews and Survey Papers
Origins of modern data analysis
linked to the beginnings and early development of computer science and
http://arxiv.org/abs/0811.2519, submitted, 2008.
Jan de Leeuw's
review of this book in Journal of Statistical Software, Vol. 14, Sept.
2005. My reply to Jan's review
focusing on the continuing ground-breaking innovation underlying data
coding in the correspondence analysis and associated data analysis
Review of M. Greenacre and
J. Blasius, Multiple Correspondence Analysis and Related Methods,
Chapman-Hall, 2006. To appear in Psychometrika.
Review, Journal of
Classification, 25, 137-141, 2008, of
Brigitte Le Roux and Henry Rouanet,
Geometric Data Analysis, From Correspondence Analysis to Structured
Data Analysis, Kluwer, Dordrecht, 2004.
7. Other Data Analysis and Signal Processing Software
data analysis resources, code by F Murtagh, in C, Fortran, R/S-Plus,
and Java, for cluster analysis, and other purposes.
- MR, a large suite of
programs for wavelet transform analysis of images and signals,
together with other multiresolution transform analysis approaches
(curvelet and ridgelet transform), and general image and signal
processing (edge detection, fractal analysis, etc.). For filtering
and noise modeling, compression, deconvolution, visualization, and other
8. Online Writings of J.P. Benzécri
- Earlier versions of work that has since appeared in Revue MODULAD
(see elsewhere on this page).
- L'avenir de l'analyse des
données, Behaviormetrika, 10, 1-11 (1983).
A number of papers between 1954 and 1972 are available from
www.numdam.org. Use "benzécri" as a search term. These include:
"Description mathématique des classifications" (1972);
"Sur les algorithmes de classification" (1971);
"Combattre pour la linguistique" (1971);
"Problèmes et méthodes de la taxinomie" (1970);
and two other papers from 1967. Elsewhere www.numdam.org has e.g.
"Sur l'analyse des matrices de confusion" (1970).
- Back issues of
Les Cahiers de l'Analyse des Données, 1997, 1996, 1994, 1993.
9. Linnaeus, Huyghens, Laplace
From J.-P. Benzécri et coll., L'Analyse des Données.
Tome I, Taxinomie. Tome II, Correspondances.
Dunod, 1973 (2nd edn., 1976).
Contact: f murtagh at acm dot org (user name: one word)