Ideas for Projects
1. Bayesian Data Mining – finding interestingly large counts in massive tables.
DuMouchel, W. (1999). Bayesian data mining in large frequency tables, with an application to the FDA spontaneous reporting systems. American Statistician, 53, 177-202
Paper here. Slides available here.
DuMouchel, W. and Pregibon, D. (2001). Empirical Bayes screening for multi-item associations. Proc. KDD 2001, ACM Press, San Diego, CA.
Available here.
2. Data Squashing – compressing large datasets to facilitate statistical analysis.
DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., and Pregibon, D. (1999). Squashing flat files flatter. In: Proceedings of the Fifth ACM Conference on Knowledge Discovery and Data Mining, 6-15.
available here
See also this paper and this paper.
3. Recommender Systems – making recommendations based on past shopping/rating behavior
http://www.cs.umbc.edu/~ian/sigir99-rec/
http://www.cis.upenn.edu/~ungar/CF/
http://vtopus.cs.vt.edu/~ramakris/recsys-course.html
http://www.jamesthornton.com/hotlist/collabfilters.html
4. Delegate Sampling – ideas for tree building with massive data
Breiman, L. and Friedman, J. (1984). Tools for large data set analysis. In: Statistical Signal Processing, E.J. Wegman and J.G. Smith (Eds.), New York, M. Dekker, 191-197.
Domingos, P. and Hulten, G. (2000). Mining High-Speed Data Streams. In: Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining, 71-80.
Hulten, G., Spencer, L. and Domingos, P. (2001). Mining Time-Changing Data Streams. In: Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, 97-106. San Francisco, CA: ACM Press.
5. Big Bayesian Networks – scaling algorithms for learning Bayesian networks
Nice tutorial on David Heckerman’s home page http://www.research.microsoft.com/~heckerman/
6. Multiclass Classification – using coding theory ideas for multiclass classification
Dietterich, T. G., Bakiri, G. (1995) Solving Multiclass Learning Problems via Error-Correcting Output Codes. Journal of Artificial Intelligence Research 2: 263-286.
PDF
7. Markov Transition Distributions – models for higher order Markov chains
Berchtold, A. and Raftery, A.E. (1999).
The Mixture Transition Distribution (MTD) Model for High-Order Markov Chains and Non-Gaussian Time Series. Technical Report no. 360, Department of Statistics, University of Washington, August 1999.
8. Global partial orders from sequential data
Mannila, H. and Meek, C. (2000). Global partial orders from sequential data. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining, 161-168.
http://riss.keris.or.kr:8080/pubs/citations/proceedings/ai/347090/p161-mannila/
http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-2000-62
9. Probabilistic models for query approximation
Pavlov, D., Mannila, H. and Smyth, P. (2000). Probabilistic models for query approximation with large sparse binary data sets. UAI-2000.
http://www.ics.uci.edu/~pavlovd/uai2000.ps
10. Adaptive bagging
ftp://ftp.stat.berkeley.edu/pub/users/breiman/adaptbag99.ps.Z
ftp://ftp.stat.berkeley.edu/pub/users/breiman/adaptbag99.abstract
11. Pasting bites together for prediction in large data sets and on-line
ftp://ftp.stat.berkeley.edu/pub/users/breiman/pastebite.ps.Z
http://dnkweb.denken.or.jp/boosting/papers/Bre96b.ps
12. Hierarchical model-based clustering for large datasets
http://www.stat.washington.edu/www/research/reports/1999/tr363.ps
13. Monte Carlo importance sampling for large-scale Bayesian analysis
Ask me and I can give you a paper about this.
14. Retrieval properties of large text corpora
Information retrieval investigators have noticed that retrieval performance
(i.e., the proportion of relevant documents returned in response to
a query) improves as the size of the document collection increases.
The project should conduct some simulations to see whether this phenomenon
is universal. Come and talk to me if you are interested in this.
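One possible starting point for such a simulation is sketched below. Everything in it is an assumption made for illustration: the fixed fraction of relevant documents, the Gaussian retrieval scores (relevant documents score higher on average), and the function name `precision_at_k`. It only shows how precision among the top k results can be tracked as the collection grows under one toy model.

```python
import random

def precision_at_k(n_docs, k=10, frac_relevant=0.05, seed=0):
    """Simulate one query against a collection of n_docs documents.

    Toy assumptions (not from any paper): each document is relevant with
    probability frac_relevant, and its retrieval score is Gaussian with
    mean 1 if relevant and mean 0 otherwise.
    """
    rng = random.Random(seed)
    scored = []
    for _ in range(n_docs):
        relevant = rng.random() < frac_relevant
        score = rng.gauss(1.0 if relevant else 0.0, 1.0)
        scored.append((score, relevant))
    # Return the k highest-scoring documents and count the relevant ones.
    top = sorted(scored, reverse=True)[:k]
    return sum(1 for _, rel in top if rel) / k

if __name__ == "__main__":
    for n in (100, 1000, 10000):
        runs = [precision_at_k(n, seed=s) for s in range(20)]
        print(n, sum(runs) / len(runs))
```

Varying the score distributions, the relevant fraction, and k is one way to probe whether the effect persists under different assumptions.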
15. Bayesian model averaging for logistic regression
"Bayesian model averaging" is a technique to improve the
predictive performance of models such as classifiers. It has not
really been evaluated in the context of logistic regression. Project:
evaluate Bayesian model averaging (BMA) for logistic regression using
datasets from the UCI machine learning repository. Software to do BMA
is at
http://www.research.att.com/~volinsky/bma.html
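As a rough illustration of the averaging step, the sketch below uses the BIC approximation to posterior model probabilities with equal prior weight on each candidate model. This is one common approximation and is not claimed to be what the linked software does. The function names (`bic`, `bma_weights`, `bma_predict`) and all numbers are hypothetical; the log-likelihoods would come from fitting each candidate logistic regression.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion for one fitted model."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

def bma_weights(bics):
    """Approximate posterior model probabilities from BIC values,
    assuming equal prior probability for every model."""
    best = min(bics)  # subtract the minimum for numerical stability
    raw = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(raw)
    return [r / total for r in raw]

def bma_predict(probs_per_model, weights):
    """Average the models' predicted class probabilities for one case."""
    return sum(w * p for w, p in zip(weights, probs_per_model))

if __name__ == "__main__":
    # Hypothetical fits: (log-likelihood, number of parameters) on 200 cases.
    fits = [(-95.0, 2), (-92.5, 3), (-92.0, 5)]
    weights = bma_weights([bic(ll, k, 200) for ll, k in fits])
    # Hypothetical predicted probabilities from each model for a new case.
    print(weights, bma_predict([0.61, 0.70, 0.74], weights))
```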
16. Bayesian model averaging versus Mixtures of Experts
Compare the predictive performance of Bayesian model averaging
to so-called mixtures-of-experts models. BMA software is at:
http://www.research.att.com/~volinsky/bma.html. ME
software is at
http://www.oigeeza.com/steve/code/moe/code.shtml.
17. Text categorization for disputed authorship
Text categorization concerns the automatic assignment of documents
to categories. The idea is to learn a categorization algorithm from
a set of hand-labeled documents. There are lots of interesting
statistical/data mining ideas in this area. One application
concerns disputed authorship: given samples of particular authors'
works, assign disputed samples to the right author. The classic
work in this area is very old and updating it would make for a nice
project. See: http://links.jstor.org/sici?sici=0162-1459%28196306%2958%3A302%3C275%3AIIAAP%3E2.0.CO%3B2-V
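A minimal sketch of one way to set this up is shown below: a multinomial naive Bayes classifier over counts of a few function words, a feature choice often used in authorship studies. The word list, the add-one smoothing, the equal author priors, and the toy texts are all assumptions for illustration, not a reconstruction of the classic analysis.

```python
import math
from collections import Counter

# Hypothetical feature set: a handful of common function words.
FUNCTION_WORDS = ["the", "of", "to", "upon", "while", "whilst", "by", "on"]

def word_counts(text):
    """Count occurrences of each function word in a text."""
    tokens = (t.strip(".,;:!?\"'()") for t in text.lower().split())
    counts = Counter(tokens)
    return [counts[w] for w in FUNCTION_WORDS]

def train(samples):
    """samples maps author -> list of texts; returns author -> log word probabilities."""
    model = {}
    for author, texts in samples.items():
        totals = [1] * len(FUNCTION_WORDS)  # add-one smoothing
        for text in texts:
            for i, c in enumerate(word_counts(text)):
                totals[i] += c
        denom = sum(totals)
        model[author] = [math.log(t / denom) for t in totals]
    return model

def classify(model, text):
    """Assign the text to the author with the highest log-likelihood (equal priors)."""
    counts = word_counts(text)
    scores = {a: sum(c * lp for c, lp in zip(counts, logp))
              for a, logp in model.items()}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    toy = {"A": ["upon the hill upon the road upon the sea"],
           "B": ["whilst the rain fell whilst the wind blew whilst"]}
    print(classify(train(toy), "upon my word upon it"))
```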
18. Text mining adverse event reports
Various public databases record adverse reactions to medical products.
The free-text component of these data is of considerable interest to
various parties. Key problems include: (1) automatic assignment of
adverse event codes to the free text reports, (2) recovering the temporal
sequence of the adverse events, and (3) mapping verbatim drug names to
a canonical form. Lots of possible projects here.
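For problem (3), a simple baseline is approximate string matching against a list of canonical names. The sketch below uses Python's standard `difflib` module; the canonical list, the similarity cutoff, and the function name `canonicalize` are hypothetical, and a real project would need a proper drug dictionary and handling for brand names, dosages, and abbreviations.

```python
import difflib

# Hypothetical canonical drug names; a real list would come from a drug dictionary.
CANONICAL = ["acetaminophen", "ibuprofen", "warfarin"]

def canonicalize(verbatim, canonical=CANONICAL, cutoff=0.8):
    """Map a verbatim drug name to its closest canonical form.

    Returns None when no canonical name is similar enough.
    """
    cleaned = verbatim.strip().lower()
    matches = difflib.get_close_matches(cleaned, canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else None

if __name__ == "__main__":
    for raw in ["Ibuprofin", "  WARFARIN ", "xyz"]:
        print(repr(raw), "->", canonicalize(raw))
```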
19. Compare regularized logistic regression to random forests
Logistic regression is a long-established statistical method.
Recent work has shown that logistic regression,
suitably regularized, works as well as more trendy methods like
boosted trees and support vector machines on very large-scale problems.
There are a few possible projects here that could result in a nice paper.
20. Location determination in wireless networks
The basic idea here is to try to figure out a user's physical location
from the strengths of the signals to various access points. There
is a lot of interest in this problem. Here is a
web page on location determination. The papers by Bahl et al. describing
the RADAR system are especially nice.
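A simple baseline for this problem is fingerprinting: record signal strengths at known locations, then match a new observation to the nearest recorded fingerprints in signal space. The sketch below illustrates that idea only; the fingerprint table, the coordinates, and the function name `locate` are hypothetical, and it is not claimed to match any published system in detail.

```python
import math

# Hypothetical fingerprints: (x, y) location -> signal strengths in dBm
# measured from three access points.
FINGERPRINTS = {
    (0.0, 0.0): [-40, -70, -80],
    (10.0, 0.0): [-70, -40, -75],
    (5.0, 8.0): [-65, -65, -45],
}

def locate(observed, fingerprints=FINGERPRINTS, k=1):
    """Estimate location as the centroid of the k fingerprints closest
    to the observed signal strengths (Euclidean distance in signal space)."""
    ranked = sorted(fingerprints.items(),
                    key=lambda item: math.dist(observed, item[1]))
    nearest = [loc for loc, _ in ranked[:k]]
    x = sum(p[0] for p in nearest) / len(nearest)
    y = sum(p[1] for p in nearest) / len(nearest)
    return (x, y)

if __name__ == "__main__":
    print(locate([-42, -69, -79]))       # closest to the first fingerprint
    print(locate([-60, -60, -60], k=3))  # centroid of all three locations
```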
21. Orthographic analysis
Given a list of Hispanic first and last names and a list of non-Hispanic
first and last names, build a predictive model that accurately
classifies future names as Hispanic or non-Hispanic.
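One way to begin is with character n-gram features. The sketch below is a naive Bayes classifier over character bigrams with add-one smoothing and equal class priors; the tiny training lists, the label strings, and the function names are placeholders for illustration, and a real project would use the full name lists and held-out evaluation.

```python
import math
from collections import Counter

def bigrams(name):
    """Character bigrams of a name, with start (^) and end ($) markers."""
    padded = "^" + name.lower() + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def train(names_by_label):
    """Count bigrams per label; returns (counts per label, vocabulary)."""
    counts = {label: Counter(g for n in names for g in bigrams(n))
              for label, names in names_by_label.items()}
    vocab = set().union(*counts.values())
    return counts, vocab

def classify(model, name):
    """Pick the label with the highest smoothed log-likelihood (equal priors)."""
    counts, vocab = model
    best_label, best_score = None, -math.inf
    for label, c in counts.items():
        # Add-one smoothing, with one extra slot for unseen bigrams.
        total = sum(c.values()) + len(vocab) + 1
        score = sum(math.log((c[g] + 1) / total) for g in bigrams(name))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

if __name__ == "__main__":
    # Toy training data for illustration only.
    toy = {
        "hispanic": ["Garcia", "Martinez", "Rodriguez", "Hernandez", "Lopez"],
        "non_hispanic": ["Smith", "Johnson", "Williams", "Brown", "Miller"],
    }
    model = train(toy)
    print(classify(model, "Gonzalez"), classify(model, "Wilson"))
```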