Attribute Interactions in Machine Learning
Abstract
Attribute interactions are the irreducible dependencies between
attributes. Interactions underlie feature relevance and selection, as well as
the structure of joint probability and classification models: attributes should
be connected in a model if and only if they interact. While the issue of 2-way
interactions, especially those between an attribute and the label, has already
been addressed, we introduce an operational definition of a generalized n-way
interaction by contrasting two models: the
reductionistic part-to-whole approximation, where the model of the whole
is reconstructed from models of the parts, and the holistic reference
model, where the whole is modelled directly. An interaction is deemed
significant if these two models are significantly different. Correlation
is a special case of attribute interaction.
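For the 3-way case of two attributes A, B and the label C, a concrete quantity of this kind, used in the publications below, is interaction information. As a sketch in standard information-theoretic notation, with the sign convention that positive values indicate synergy and negative values redundancy:

    I(A;B;C) = I(A;B|C) - I(A;B)
             = H(A,B) + H(A,C) + H(B,C) - H(A) - H(B) - H(C) - H(A,B,C)

Equivalently, I(A;B;C) is the expected log-ratio between the joint distribution P(A,B,C) and the part-to-whole (Kirkwood superposition) approximation P(A,B)P(A,C)P(B,C) / (P(A)P(B)P(C)) reconstructed from the pairwise marginals.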
Keywords
- machine learning, data mining
- classification, pattern recognition
- interaction, dependence, dependency
- independence, independence assumption
- constructive induction, feature construction
- feature selection, attribute selection, myopic, information gain
- naive Bayes, simple Bayes
- naive Bayesian classifier, simple Bayesian classifier
- information theory, entropy, relative entropy, mutual information
Text
A. Jakulin, "Machine Learning Based on Attribute Interactions." PhD Dissertation, 2005. [PDF - 2.7Mb - 252pp] Presentation: [PPT]
Earlier Publications
A. Jakulin, I. Bratko,
"Testing the Significance of Attribute Interactions."
Proceedings of the Twenty-first International Conference on Machine Learning (ICML-2004), Eds. R. Greiner and D. Schuurmans. Pp. 409-416. Banff, Canada, 2004. [PDF] Presentation: [PPT] Poster: [PDF]
A. Jakulin, I. Bratko,
"Quantifying and Visualizing Attribute Interactions."
Working paper (Nov. 2003): [ARXIV] [PDF]
A. Jakulin, G. Leban, "Interactive Interaction Analysis."
Proceedings A of the 6th Information Society Conference (IS 2003), Ljubljana, Slovenia, October 13-17, 2003.
Paper (in Slovene): [PDF (Slovene)] Presentation: [PPT]
A. Jakulin, I. Bratko, "Analyzing Attribute Dependencies." Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2003), Cavtat-Dubrovnik, Croatia, September 22-26, 2003.
N. Lavrac, D. Gamberger, H. Blockeel, L. Todorovski (Eds.) Lecture Notes in Artificial Intelligence, Vol. 2838, Springer. Pp. 229-240.
Preprint (Apr. 2003): [PS.GZ] [PDF]
Final version: SpringerLink
Presentation: [PPT]
A. Jakulin, I. Bratko, D. Smrke, J. Demsar and B. Zupan,
"Attribute Interactions in Medical Data Analysis."
Proceedings of the 9th Conference on Artificial Intelligence in Medicine in Europe (AIME 2003), Protaras, Cyprus, October 18-22, 2003. M. Dojat, E. Keravnou, P. Barahona (Eds.) Lecture Notes in Artificial Intelligence, Vol. 2780, Springer. Pp. 229-238.
Preprint (Mar. 2003): [PS.GZ] [PDF]
Final version: SpringerLink
Presentation: [PPT]
A. Jakulin, "Attribute Interactions in Machine Learning." Master's thesis, University of Ljubljana, December 2002. [outdated]
A. Jakulin,
"Qualitative Aspects in Numerical Data: Structure, Interaction, and Correlation."
[outdated] [PS.GZ] [PDF]
Examples
Interaction analysis assists feature construction and feature selection, and provides a way of reliably constructing attribute taxonomies based on the similarity of the information that the attributes provide about the label. In the illustrations below, blue arcs mark negative interactions (redundancies) and red arcs mark positive interactions (synergies).
The examples illustrate the CMC dataset from the UCI repository, where the attributes describe the
demographic and socioeconomic characteristics of a couple, while the label
records the contraceptive method used.
Interaction dendrogram:
Interpretation:
Although the dendrogram may look like ordinary attribute clustering, the grouping
is not based on the similarity of attribute values: it is based on dependence. Red clusters indicate
synergies, groups of attributes that should be treated holistically. Blue clusters indicate redundancies: if we wanted to select features, we would pick only the best attribute from each cluster.
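A minimal sketch of how such a dendrogram could be built. The CMC column names, the use of the absolute interaction information as the dissimilarity, and average-linkage clustering are assumptions for illustration, not necessarily the exact procedure used in the publications above:

    import itertools
    import numpy as np
    import pandas as pd
    from scipy.cluster.hierarchy import linkage, dendrogram

    def entropy(df, cols):
        """Empirical joint entropy (in bits) of the given columns."""
        p = df.groupby(cols).size() / len(df)
        return -np.sum(p * np.log2(p))

    def interaction_information(df, a, b, label):
        """I(A;B;Y): positive values indicate synergy, negative redundancy."""
        return (entropy(df, [a, b]) + entropy(df, [a, label]) + entropy(df, [b, label])
                - entropy(df, [a]) - entropy(df, [b]) - entropy(df, [label])
                - entropy(df, [a, b, label]))

    # Hypothetical loading of the UCI CMC data; the column names are assumed here.
    # Continuous attributes (e.g. the wife's age) should be discretized first.
    cmc = pd.read_csv("cmc.data", names=[
        "wife_age", "wife_edu", "husband_edu", "children", "wife_religion",
        "wife_working", "husband_occ", "living_std", "media", "method"])
    attrs = [c for c in cmc.columns if c != "method"]

    # Dissimilarity: strongly interacting attribute pairs (positively or
    # negatively) are "close"; shift so all dissimilarities are non-negative.
    n = len(attrs)
    dist = np.zeros((n, n))
    for i, j in itertools.combinations(range(n), 2):
        dist[i, j] = dist[j, i] = -abs(
            interaction_information(cmc, attrs[i], attrs[j], "method"))
    dist -= dist.min()

    condensed = dist[np.triu_indices(n, k=1)]
    dendrogram(linkage(condensed, method="average"), labels=attrs)  # plotting needs matplotlib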
Interaction graph:
A detail from the interaction graph:
Interpretation:
- A positive interaction:
The wife's age attribute alone eliminates 3.33% of the uncertainty about the
label, and the number of children attribute alone eliminates 5.82%. If we treat
these two attributes as dependent and model them holistically (Cartesian product,
classification tree, but not, e.g., naive Bayesian classifier, linear or logistic
regression, linear SVM), we eliminate an additional 1.85% of the label
uncertainty. We say that they interact positively, since they are synergistic
(a sketch of this computation follows the list).
- A negative interaction:
The husband's education attribute alone eliminates 2.60% of the uncertainty
about the label, and the wife's education attribute alone eliminates 4.60%.
These two attributes are partly redundant, as 1.15% of the label information is
shared between them. When estimating the remaining uncertainty, we should be
careful not to count this shared information or evidence twice. The problem is
addressed in three ways: feature selection, feature weighting (e.g., linear SVM,
logistic regression), and explicit modelling of the dependence (tree-augmented
naive Bayesian classifier, Bayesian networks).
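A minimal sketch of how such percentages could be computed, reusing the entropy and interaction_information helpers from the dendrogram sketch above. The column names are again assumed, and the exact figures depend on how continuous attributes are discretized, so this sketch need not reproduce the quoted numbers exactly. Each attribute's individual contribution is its mutual information with the label; the additional or shared part is the 3-way interaction information; everything is expressed relative to the label entropy:

    def mutual_information(df, a, label):
        """I(A;Y) = H(A) + H(Y) - H(A,Y): the information gain of attribute a."""
        return entropy(df, [a]) + entropy(df, [label]) - entropy(df, [a, label])

    h_label = entropy(cmc, ["method"])

    def relative(bits):
        """Express an information quantity as a percentage of the label entropy."""
        return 100.0 * bits / h_label

    for a, b in [("wife_age", "children"), ("husband_edu", "wife_edu")]:
        print(f"{a}: {relative(mutual_information(cmc, a, 'method')):.2f}%, "
              f"{b}: {relative(mutual_information(cmc, b, 'method')):.2f}%, "
              f"interaction: {relative(interaction_information(cmc, a, b, 'method')):.2f}%")

A positive interaction percentage corresponds to a red (synergistic) arc in the interaction graph, a negative one to a blue (redundant) arc.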
Links
For an introduction to the actual computation of entropy and information, refer to
Paul Meagher's tutorials: