Data Mining (V0.3)
NOTE: Most of the material (slides, dates, etc.) is on
http://courseworks.columbia.edu/. This page is merely intended to guide you
there.
STAT W4240.001 DATA MINING
TR 06:10P-07:25P
SCHOOL OF SO 903
Instructor: Aleks Jakulin, PhD
Purpose: To introduce students to the cutting edge of data mining, focusing on
data analysis problems and data types. A student will be able to approach a
data mining problem, analyze it, and identify tools to solve it quickly and
efficiently. A distinguishing characteristic of data mining is its emphasis on
the pragmatic employment of an algorithm as a tool, rather than on the
underlying mathematics.
Method: Provide a conceptual framework for navigating the methodological
cornucopia of data mining. With the framework, one can pick any tool or piece
of software and apply it to the problem at hand.
- basic knowledge of probability and statistics
- some knowledge of programming
- quantitative attitude
- passion for truth
- helpful: an eye for graphics, communication/presentation skills, logic,
  linear algebra, numerical mathematics, advanced programming, computer
  graphics, visual design
Practical: Students will form teams and address a data mining problem of
practical importance. All projects will be published online in the format of a
web page. Teams should combine presentation, computational and mathematical
skills.
Lecture Format: Each lecture will be centered on a case study. The case will
be examined through four aspects: goal, computation, model and data. The
aspects will be explained as needed for the case. There will be guest lectures
on specific tools, guest talks and demonstrations (15 minutes, followed by a
discussion), and just guests sitting in.
Grading: 60% project, 15% midterm (Mar 11), 25% final. The exams will focus on
your ability to solve practical problems and find flaws in other types of
analysis.
Ethics: Collaboration on the midterm and final is not allowed. All projects
will be posted permanently online with a list of credits (who did what). If you
use external help, credit it. Your own contributions need to be listed, and
everyone will ask questions to establish whether you did the work yourself.
Self study: I'll be posting good applied books and links here.
Conceptual Structure: Most data mining tasks can be reduced to using tools of
four basic categories:

Purpose:
- Summarization
- Prediction
  - Compression
  - Decision-making
- System identification or reverse engineering
  - Exploration
  - Action, manipulation and causality
- Anomaly detection
- Active learning and experimentation
- Anti-Data Mining: Privacy protection, obfuscation and data camouflage

Models:
- Visualization
- Determinism and probability
- Linear models
- Interaction
- Clustering
- Kernels and exemplars
- Logic, rules, trees
- Geometry
- Nonlinearity

Process:
- Fitting, over-fitting, under-fitting, complexity
- Model evaluation: cross-validation, test/training
- Model stabilization: priors and regularization
- Model comparison
- Model addition: ensembles
- Model subtraction: holding things constant
- Computation: scalability, parallelization, numeric precision

Data:
- Data types:
  - Tabular
  - Relational
  - Temporal
  - Text
  - Network
  - Geographical
  - Structured
  - Exotic
  - Multi-instance
  - Multi-task
- Data preparation:
  - Collecting
  - Cleaning
  - Structuring
  - Missing data
  - Wrong data
  - Uncertain data
Links to previous classes: Chris Volinsky (2009) (/DataMining on the website)