Attribute Interactions in Machine Learning
Abstract
Attribute interactions are the irreducible dependencies between
attributes. Interactions underlie feature relevance and selection, as well as
the structure of joint probability and classification models: attributes should
be connected in a model if and only if they interact. While the issue of 2-way
interactions, especially those between an attribute and the label, has already
been addressed, we introduce an operational definition of a generalized n-way
interaction by contrasting two models: the
reductionistic part-to-whole approximation, where the model of the whole
is reconstructed from models of the parts, and the holistic reference
model, where the whole is modelled directly. An interaction is deemed
significant if these two models are significantly different. Correlation
is a special case of attribute interaction.
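For the 3-way case of two attributes A, B and the label C, a concrete quantity of this kind, used in the publications below, is interaction information. As a sketch in standard information-theoretic notation, with the sign convention that positive values indicate synergy and negative values redundancy:

    I(A;B;C) = I(A;B|C) - I(A;B)
             = H(A,B) + H(A,C) + H(B,C) - H(A) - H(B) - H(C) - H(A,B,C)

Equivalently, I(A;B;C) is the expected log-ratio between the joint distribution P(A,B,C) and the part-to-whole (Kirkwood superposition) approximation P(A,B)P(A,C)P(B,C) / (P(A)P(B)P(C)) reconstructed from the pairwise marginals.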
Keywords
- machine learning, data mining
- classification, pattern recognition
- interaction, dependence, dependency
- independence, independence assumption
- constructive induction, feature construction
- feature selection, attribute selection, myopic, information gain
- naive Bayes, simple Bayes
- naive Bayesian classifier, simple Bayesian classifier
- information theory, entropy, relative entropy, mutual information
Text
A. Jakulin, "Machine Learning Based on Attribute Interactions." PhD Dissertation, 2005. [PDF - 2.7Mb - 252pp] Presentation: [PPT]
Earlier Publications
A. Jakulin, I. Bratko,
"Testing the Significance of Attribute Interactions."
Proceedings of the Twenty-first International Conference on Machine Learning (ICML-2004), Eds. R. Greiner and D. Schuurmans. Pp. 409-416. Banff, Canada, 2004. [PDF] Presentation: [PPT] Poster: [PDF]
A. Jakulin, I. Bratko,
"Quantifying and Visualizing Attribute Interactions."
Working paper (Nov. 2003): [ARXIV] [PDF]
A. Jakulin, G. Leban, "Interactive Interaction Analysis."
Proceedings A of the 6th Information Society Conference (IS 2003), Ljubljana, Slovenia, October 13-17, 2003.
Paper (in Slovene): [PDF (Slovene)] Presentation: [PPT]
A. Jakulin, I. Bratko, "Analyzing Attribute Dependencies." Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2003), Cavtat-Dubrovnik, Croatia, September 22-26, 2003.
N. Lavrac, D. Gamberger, H. Blockeel, L. Todorovski (Eds.) Lecture Notes in Artificial Intelligence, Vol. 2838, Springer. Pp. 229-240.
Preprint (Apr. 2003): [PS.GZ] [PDF]
Final version: SpringerLink
Presentation: [PPT]
A. Jakulin, I. Bratko, D. Smrke, J. Demsar and B. Zupan,
"Attribute Interactions in Medical Data Analysis."
Proceedings of the 9th Conference on Artificial Intelligence in Medicine in Europe (AIME 2003), Protaras, Cyprus, October 18-22, 2003. M. Dojat, E. Keravnou, P. Barahona (Eds.) Lecture Notes in Artificial Intelligence, Vol. 2780, Springer. Pp. 229-238.
Preprint (Mar. 2003): [PS.GZ] [PDF]
Final version: SpringerLink
Presentation: [PPT]
A. Jakulin, "Attribute Interactions in Machine Learning." Master's thesis, University of Ljubljana, December 2002. [outdated]
A. Jakulin,
"Qualitative Aspects in Numerical Data: Structure, Interaction, and Correlation."
[outdated] [PS.GZ] [PDF]
Examples
Interaction analysis assists feature construction and feature selection, and provides a way of reliably constructing attribute taxonomies based on the similarity of the information that the attributes provide about the label. In the illustrations below, blue arcs mark negative interactions (redundancies) and red arcs mark positive interactions (synergies).
The examples illustrate the CMC dataset from the UCI repository, where the attributes describe the
demographic and socioeconomic characteristics of a couple, while the label
records the contraceptive method used.
Interaction dendrogram:
Interpretation:
Although the dendrogram may look like ordinary attribute clustering, the grouping
is not based on the similarity of attribute values: it is based on dependence. Red clusters indicate
synergies, groups of attributes that should be treated holistically. Blue clusters indicate redundancies: if we wanted to select features, we would pick only the best attribute from each cluster.
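A minimal sketch of how such a dendrogram could be built. The CMC column names, the use of the absolute interaction information as the dissimilarity, and average-linkage clustering are assumptions for illustration, not necessarily the exact procedure used in the publications above:

    import itertools
    import numpy as np
    import pandas as pd
    from scipy.cluster.hierarchy import linkage, dendrogram

    def entropy(df, cols):
        """Empirical joint entropy (in bits) of the given columns."""
        p = df.groupby(cols).size() / len(df)
        return -np.sum(p * np.log2(p))

    def interaction_information(df, a, b, label):
        """I(A;B;Y): positive values indicate synergy, negative redundancy."""
        return (entropy(df, [a, b]) + entropy(df, [a, label]) + entropy(df, [b, label])
                - entropy(df, [a]) - entropy(df, [b]) - entropy(df, [label])
                - entropy(df, [a, b, label]))

    # Hypothetical loading of the UCI CMC data; the column names are assumed here.
    # Continuous attributes (e.g. the wife's age) should be discretized first.
    cmc = pd.read_csv("cmc.data", names=[
        "wife_age", "wife_edu", "husband_edu", "children", "wife_religion",
        "wife_working", "husband_occ", "living_std", "media", "method"])
    attrs = [c for c in cmc.columns if c != "method"]

    # Dissimilarity: strongly interacting attribute pairs (positively or
    # negatively) are "close"; shift so all dissimilarities are non-negative.
    n = len(attrs)
    dist = np.zeros((n, n))
    for i, j in itertools.combinations(range(n), 2):
        dist[i, j] = dist[j, i] = -abs(
            interaction_information(cmc, attrs[i], attrs[j], "method"))
    dist -= dist.min()

    condensed = dist[np.triu_indices(n, k=1)]
    dendrogram(linkage(condensed, method="average"), labels=attrs)  # plotting needs matplotlib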
Interaction graph:
A detail from the interaction graph:
Interpretation:
- A positive interaction:
The wife's age attribute alone eliminates 3.33% of the uncertainty about the
label, and the number of children attribute alone eliminates 5.82%. If we treat
these two attributes as dependent and model them holistically (Cartesian product,
classification tree, but not, e.g., naive Bayesian classifier, linear or logistic
regression, linear SVM), we eliminate an additional 1.85% of the label
uncertainty. We say that they interact positively, since they are synergistic
(a sketch of this computation follows the list).
- A negative interaction:
The husband's education attribute alone eliminates 2.60% of the uncertainty
about the label, and the wife's education attribute alone eliminates 4.60%.
These two attributes are partly redundant, as 1.15% of the label information is
shared between them. When estimating the remaining uncertainty, we should be
careful not to count this shared information or evidence twice. The problem is
addressed in three ways: feature selection, feature weighting (e.g., linear SVM,
logistic regression), and explicit modelling of the dependence (tree-augmented
naive Bayesian classifier, Bayesian networks).
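A minimal sketch of how such percentages could be computed, reusing the entropy and interaction_information helpers from the dendrogram sketch above. The column names are again assumed, and the exact figures depend on how continuous attributes are discretized, so this sketch need not reproduce the quoted numbers exactly. Each attribute's individual contribution is its mutual information with the label; the additional or shared part is the 3-way interaction information; everything is expressed relative to the label entropy:

    def mutual_information(df, a, label):
        """I(A;Y) = H(A) + H(Y) - H(A,Y): the information gain of attribute a."""
        return entropy(df, [a]) + entropy(df, [label]) - entropy(df, [a, label])

    h_label = entropy(cmc, ["method"])

    def relative(bits):
        """Express an information quantity as a percentage of the label entropy."""
        return 100.0 * bits / h_label

    for a, b in [("wife_age", "children"), ("husband_edu", "wife_edu")]:
        print(f"{a}: {relative(mutual_information(cmc, a, 'method')):.2f}%, "
              f"{b}: {relative(mutual_information(cmc, b, 'method')):.2f}%, "
              f"interaction: {relative(interaction_information(cmc, a, b, 'method')):.2f}%")

A positive interaction percentage corresponds to a red (synergistic) arc in the interaction graph, a negative one to a blue (redundant) arc.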
Links
For an introduction to the actual computation of entropy and information, refer to
Paul Meagher's tutorials: