DATA MINING FOR
BUSINESS ANALYTICS
INTRODUCTION TO DATA SCIENCE
INFO-GB.3336/ DS-GA 1001
Fall 2013
SYLLABUS
Professor Claudia Perlich, Information, Operations & Management Sciences Department
Office Hours: Monday 5-6pm, KMC 8-171, and by appt. Friday between 11am and 5pm, Union Square
Email [email protected]
Begin subject: [DM GRAD] note!
Telephone Mobile: 914-409-5609
Classroom KMC 4-120
Meeting time Monday, 6PM-9PM
First/Last Class 9/23-12/16
Final Quiz Take home after class
Course Assistant
CA Office Hours
Primary: Samuel Fraiberger [email protected] 617-959-9809
Office: 19 W 4th St. #723 Hours: Wed 5-6:30pm and by appt.
Also: Jessica Clark [email protected] 212-998-0812
Office: KMC 8-186 Hours: Tues 2-3:30pm, Thurs 5-6:30pm and by appt.
1. Course Overview
This course will change the way you think about data and its role in business. Big data
has received a lot of attention in the last 3 years as businesses, governments, and
individuals create massive collections of data as a by-product of their activity.
Increasingly, decision-makers rely on intelligent technology to analyze data
systematically to improve decision-making. In many cases automating analytical and
decision-making processes is necessary because of the volume of data and the speed
with which new data are generated.
We will examine how data mining and predictive modeling technologies can be used to
improve decision-making. We will study the fundamental principles and techniques of
data mining, and we will examine real-world examples and cases to place data-mining
techniques in context, to develop data-analytic thinking, and to illustrate that proper
application is as much an art as it is a science. In addition, we will work hands-on with
data mining software.
After taking this course you should:
1. Approach business problems data-analytically. Think carefully & systematically about
whether & how data can improve business performance, to make better-informed
decisions for management, marketing, investment, etc.
2. Be able to interact competently on the topic of data mining for business intelligence.
Know the basics of data mining processes, algorithms, & systems well enough to
interact with CTOs, expert data miners, consultants, etc. Envision opportunities.
3. Have had hands-on experience mining data. Be prepared to follow up on ideas or
opportunities that present themselves, e.g., by performing pilot studies.
2. Focus and interaction
The course will explain through lectures and real-world examples the fundamental
principles, uses, and some technical details of data mining techniques. The emphasis
primarily is on understanding the business application of data mining techniques, and
secondarily on the variety of techniques. We will discuss the mechanics of how the
methods work as is necessary to understand the fundamental concepts and business
application. This is not an algorithms course. However, many techniques are the
embodiment of one or more of the fundamental principles.
We will cover a number of cases during the course of the semester. They are primarily
illustrative, and you are not expected to learn the details by heart but rather to understand
how certain modeling decisions were made in the context of the particular case.
I will expect you to be prepared for class discussions by having satisfied yourself that
you understand what we have done in the prior classes. The assigned readings will
cover the fundamental material.
You are expected to attend every class session, to arrive prior to the starting time, to
remain for the entire class, and to follow basic classroom etiquette, including having all
electronic devices turned off and put away for the duration of the class (this is Stern
policy, see below) and refraining from chatting or doing other work or reading during
class. In general, we will follow Stern default policies unless I state otherwise. I will
assume that you have read them and agree to abide by them:
http://w4.stern.nyu.edu/academic/affairs/policies.cfm?doc_id=7511
The NYU Classes site for this course will contain lecture notes, reading materials,
assignments, extra-class discussions, and late-breaking news. You should check the NYU
Classes site daily, and I will assume that you have read all announcements and class
discussion.
If you have questions about class material that you do not want to ask in class, or that
would take us well off topic, please detain me after class, come to office hours to see me
or the TAs, or ask on the discussion board. The discussion board is much better than
sending me email, which frankly I have a hard time keeping up with. Also, if you have
the question, someone else may too and everyone may benefit from the answers being
available on NYU Classes. Also, please try to answer your classmates' questions. In
grading your class participation I will include your contributions to the discussion board.
You will not be penalized for being wrong in trying to participate on the discussion board
(or in class).
Worth repetition: It is your responsibility to check NYU Classes (and your email) at least
once a day during the week (M-F), and you will be expected to be aware of any
announcements within 24 hours of the time the message was sent.
I will check my email at least once a day during the week (M-F). Your email will get
priority if you include the special tag [DM Grad] in the subject header. I use this
tag to make sure to process class email first. If you do not include the special tag, I may
not read the email for a while (maybe a long while). If you forget and send without the
tag and then remember, just send it again including the tag.
3. Lecture Notes and Readings
Book: The textbook for the class will be:
Data Science for Business: Fundamental principles of data mining and data analytic
thinking, by Provost & Fawcett (O'Reilly, 2013).
The book is now available, and you can purchase it in the bookstore or online (see
http://data-science-for-biz.com/). I wrote it over the past couple of years, in response to
feedback from this course, in particular that the available books were not adequate.
This book covers the fundamental material that will provide the basis for you to think and
communicate about data science and business analytics. We will complement the book
with discussions of applications, cases, and demonstrations.
Lecture notes: For many classes I will hand out lecture notes. I intend that the notes on
the fundamental material will follow the book very closely. I expect you to ask questions
about any material in the notes that is unclear after our class discussion and reading the
book. I wrote the book to follow the class closely, and to free us up for more discussion of
applications, cases, etc., so many of your questions may be answered in the book. If not,
please let me know! Depending on the direction our class discussion takes, we may not
cover all material in the class notes for any particular session. If the notes and the book are
not adequate to explain a topic we skip, you should ask about it on the discussion board. I
will be happy to follow up.
I may hand out or post some additional required readings as we go along. Note that some
of these readings may be accessible for free only from an NYU computer. If you can't
access a link from home, please try it from school.
For those interested in going further, these following supplemental books give alternative
perspectives on and additional details about the topics we cover. These are completely
optional; you will not be required to know anything in these readings that is not in the
primary materials or lectures. I have many other books that I can recommend, for
example if you want a reference to a more mathematical treatment of the topics. Please
don't hesitate to come and talk to me about what supplemental material might be best
for you, if you want to go further.
Data Mining Techniques (optional)
by Michael Berry and Gordon Linoff, Wiley, 2004
ISBN: 0-471-47064-3
available as ebook for free: http://site.ebrary.com/lib/nyulibrary
Many students find this book to be an excellent supplemental resource
The Third Edition is out. I have not read it yet. Berry says it has been improved
substantially. I have a copy in my office if you want to take a look at it before buying.
available from Amazon
Python for Data Analysis (optional)
by Wes McKinney 2012
ISBN: 1-449-31979-3
Weka Book (optional):
Data Mining: Practical Machine Learning Tools and Techniques, Third Edition
by Ian Witten, Eibe Frank, Mark Hall
ISBN-10: 0123748569
available from Amazon
This book provides many more technical details of the data mining techniques and is a very
nice supplement for the student who wants to dig more deeply into the technical details. It
also provides a comprehensive introduction to the Weka toolkit.
4. Requirements and Grading
The grade breakdown is as follows:
1. Homeworks: 20%
2. Term Project: 30%
3. Participation & Class Contribution: 20%
4. Final Quiz: 30%
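As a rough illustration of how these weights combine, here is a minimal Python sketch. It assumes each component is scored on a 0-100 scale and that the overall score is a plain weighted average; the syllabus does not spell out either detail, and the function name `weighted_score` and the example scores are hypothetical.

```python
# Weights taken from the grade breakdown above (fractions of the final grade).
WEIGHTS = {
    "homeworks": 0.20,
    "term_project": 0.30,
    "participation": 0.20,
    "final_quiz": 0.30,
}


def weighted_score(scores):
    """Combine component scores (assumed 0-100 each) into one weighted score."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical component scores, for illustration only.
example = {
    "homeworks": 90,
    "term_project": 85,
    "participation": 100,
    "final_quiz": 80,
}
print(round(weighted_score(example), 2))
```

Letter grades are not derived mechanically from such a score; as noted below, the actual distribution depends on how the class performs.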
At NYU Stern we seek to teach challenging courses that allow students to demonstrate
differential mastery of the subject matter. Assigning grades that reward excellence and
reflect differences in performance is important to ensuring the integrity of our curriculum.
In my experience, students generally become engaged with this course and do excellent
or very good work, receiving As and Bs, and only one or two perform only adequately or
below and receive Cs or lower. Note that the actual distribution for this course and your
own grade will depend upon how well each of you actually performs this particular
semester.
Homework Assignments
The homework assignments are listed (by due date) in the class schedule below. Each
homework comprises questions to be answered and/or hands-on tasks. Except as
explicitly noted otherwise (see next paragraph), you are expected to complete your
assignments on your own, without interacting with others on the completion of your
assignment.
For the hands-on parts of the assignments (with Weka or Python), I encourage you to
work with your group members and other classmates to understand how to get Weka or
Python to do what you need to do, and then to complete your assignment on your own.
So, for example, you could have a classmate help you do something similar, such that
then you would be able to complete the assignment.
With support from me, the TAs, and your classmates, I hope we can operate under a
"diligence but limited frustration" policy: (1) If you get stuck on something, spend some
time Googling to try to find the answer. If you seem to be moving forward, keep going.
That search and discovery will pay off, both in terms of the direct learning about how to
do what you need to do, and also in terms of your learning how to find such things out.
(E.g., if you don't know what stackoverflow is, you will learn!) BUT, (2) limit frustration:
start your assignments early enough that if you run into a wall, you can just stop
searching and ask about it. Let's say, if you feel like you have not moved forward after
15 minutes of being stuck, just stop and ask your classmates, the discussion board, or
the TAs. If you don't get a solution, escalate it to me.
Completed assignments must be handed in on Blackboard at least one hour prior to the
start of class on the due date (that is, by 5pm), unless otherwise indicated. Assignments
will be graded and returned promptly. Answers to homework questions should be well
thought out and communicated precisely, avoiding sloppy language, poor diagrams, and
irrelevant discussion.
The hands-on tasks will be based on data that we will provide. You will mine the data to
get hands-on experience in formulating problems and using the various techniques
discussed in class. You will use these data to build and evaluate predictive models.
For the hands-on assignments you will use either the (award-winning) toolkit Weka or
Python and its data science/analytics libraries.
http://www.cs.waikato.ac.nz/ml/weka/ (download the latest stable version, 3.6.10,
which is the version associated with the 3rd edition of the Weka Book)
For Python we will provide installation instructions to make sure that you have the right
setup.
IMPORTANT: You must have access to a computer on which you can install
software. If you do not have such a computer, please see me immediately so we
can make alternative arrangements. You should bring your computer to class. During
class we will have a lab session during which we will install and configure the software,
get it running, and deal with the inevitable glitches that a few of you might experience. If
you need additional help with using the data mining software, please see the Course
Assistant.
Generally the Course Assistant should be the first point of contact for questions about
and issues with the homeworks. If they cannot help you to your satisfaction, please do
not hesitate to come see me.
Late Assignments
Assignments are due prior to the start of the lecture on the due date. Turn in your
assignment early if there is any uncertainty about your ability to turn it in on the due date.
Assignments up to 24 hours late will have their grade reduced by 25%; assignments up
to one week late will have their grade reduced by 50%. After one week, late
assignments will receive no credit.
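The late policy can be sketched as a small Python function. This is only an illustration, not an official grading tool: the function name `late_multiplier` is hypothetical, the 5pm due time is taken from the hand-in rule above, and treating a submission exactly 24 hours (or exactly one week) late as still inside the lower-penalty band is an assumption.

```python
from datetime import datetime, timedelta


def late_multiplier(due, submitted):
    """Return the fraction of the earned grade kept under the late policy."""
    delay = submitted - due
    if delay <= timedelta(0):
        return 1.0   # on time: no reduction
    if delay <= timedelta(hours=24):
        return 0.75  # up to 24 hours late: grade reduced by 25%
    if delay <= timedelta(weeks=1):
        return 0.5   # up to one week late: grade reduced by 50%
    return 0.0       # more than one week late: no credit


# Example: an assignment due at 5pm on 9/30/2013.
due = datetime(2013, 9, 30, 17, 0)
for submitted in (
    datetime(2013, 9, 30, 16, 30),  # half an hour early
    datetime(2013, 10, 1, 9, 0),    # 16 hours late
    datetime(2013, 10, 4, 17, 0),   # four days late
    datetime(2013, 10, 8, 9, 0),    # more than a week late
):
    print(submitted, late_multiplier(due, submitted))
```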
Term Project
A term project report will be prepared by student teams. Student teams should comprise
4 students. Every team should contain a mix of both MSDS and MBA students. We will
help with seeding the groups, and you should decide on your teams by the end of the third
class and submit them to me (see the deadlines in the syllabus). Teams are
encouraged to interact with the instructor and TA electronically or face-to-face in
developing their project reports. You will submit a pre-proposal for your project about
halfway through the course. Each team will present its project at the end of the
semester. We will discuss the project requirements and presentations in class.
Final Quiz
The final quiz will be a take-home to be completed during the week following the last
class. The subject matter covered and the exact dates will be discussed in class.
Participation/Contribution/Attendance/Punctuality
Please see Section 2.
Regrading
If you feel that a calculation, factual, or judgment error has been made in the grading of
an assignment or exam, please write a formal memo to me describing the error, within
one week after the class date on which that assignment was returned. Include
documentation (e.g., a photocopy of class notes). I will make a decision and get back to
you as soon as I can. Please remember that grading any assignment requires the
grader to make many judgments as to how well you have answered the question.
Inevitably, some of these go in your favor and possibly some go against. In fairness to
all students, the entire assignment or exam will be regraded.
FOR STUDENTS WITH DISABILITIES: If you have a qualified disability and will require
academic accommodation during this course, please contact the Moses Center for
Students with Disabilities (CSD, 998-4980) and provide me with a letter from them
verifying your registration and outlining the accommodations they recommend. If you
will need to take an exam at the CSD, you must submit a completed Exam
Accommodations Form to them at least one week prior to the scheduled exam time to be
guaranteed accommodation.
Please read the policies for Stern courses:
http://w4.stern.nyu.edu/academic/affairs/policies.cfm?doc_id=7511
Please keep in mind the Stern Honor Code:
http://www.stern.nyu.edu/mba/studact/mjc/hc.html
Class Schedule
Each entry lists: Class, Date, Topics, Book Sections, Deliverables (Preliminary)

Class 1, 9/23
Topics: Introduction and Predictive Modeling 101
What is DM? DM process, What is a model? basic terminology, predictive modeling, classification, regression, use vs training
Some Class Logistics
Churn Case & KDD 98 Case
Deliverables: Survey

Class 2, 9/30
Topics: Data Mining Fundamentals
Tree Induction, Rule Induction, Python and WEKA lab
Book Sections: Ch. 1, 2 & 3
Deliverables: Homework 1; read the Target case; install WEKA and Python

Class 3, 10/7
Topics: Data Mining Fundamentals: Logistic Regression & evaluation 101
Logistic regression
Geometric interpretation, linear model versus tree induction, logistic regression?
KDD 98 Case revisited: Simple models!
Foster Provost: NYNEX
Book Sections: Ch. 4 & 5
Deliverables: Leads for dataset for project

Class 4, 10/14
Topics: Evaluation
How do I know my model is any good? evaluation, in-sample versus out-of-sample, overfitting, cross-validation, ROC analysis, expected value framework, domain knowledge validation
Breast Cancer KDD CUP + Leakage
IBM Default identification M6D
Book Sections: Ch. 7 & 8
Deliverables: Homework 2

Class 5, 10/21
Topics: Text and Naive Bayes
text classification, Naive Bayes, spam filtering, L1 logistic
Proposal evaluation
Spam & Banter & M6D & Whatson
Book Sections: Ch. 9 & 10
Deliverables: Project Proposal

Class 6, 10/28
Topics: Causal & Fraud
Ori Stitelman
M6D case
Book Sections: Ch. 12
Deliverables: Homework 3

Class 7, 11/4
Topics: kNN, Recommender, Clustering & Proposal review
Wallet, Freightliner
Book Sections: Ch. 6 & A
Deliverables: Project Update

Class 8, 11/11
Topics: Feature Selection & Engineering
Feature Selection, Social Network, Time Series, Relational Learning
Stacking & Ensembles
Orange KDD CUP (FS)
Netflix KDD CUP
Book Sections: Ch. 7
Deliverables: Homework 4

Class 9, 11/18
Topics: Visualization
Principles and hands-on experience with Python
Guest: Prof. Kristen Sosulski
Deliverables: Homework 5

11/25
Thanksgiving Break

Class 10, 12/2
Topics: Data Science and Business Strategy
Analytical decom, The paradox of PM, Data Science Teams
Case: M6D
Book Sections: Ch. 11

Class 11, 12/9
Topics: Deployment & Summary
many of the above
Book Sections: Ch. 13 & 14
Deliverables: Project Report

Class 12, 12/16
Topics: Project Presentations