Unit - I & II
Chapter 1
What is Data Science?
• Big Data and Data Science Hype
• Getting Past the Hype / Why Now?
• Datafication
• The Current Landscape (with a Little History)
• Data Science Jobs
• A Data Science Profile
• Thought Experiment: Meta-Definition
• OK, So What Is a Data Scientist, Really?
– In Academia
– In Industry
Big Data and Data Science Hype
• Big Data, how big?
• Data Science, who is doing it?
• Academia has been doing this for years.
• Statisticians have been doing this work.
• For example:
– Google's augmented-reality glasses "datafy" the gaze.
– Twitter "datafies" stray thoughts.
– LinkedIn "datafies" professional networks.
Current Landscape of Data Science
• Drew Conway's Venn diagram of data science from 2010:
[Figure: Drew Conway's data science Venn diagram. Three overlapping circles: Hacking Skills, Math & Statistics Knowledge, and Substantive Expertise. Hacking + Math/Stats = Machine Learning; Math/Stats + Substantive Expertise = Traditional Research; Hacking + Substantive Expertise = Danger Zone!; the center where all three meet = Data Science.]
Data Science Jobs
Job descriptions ask for:
• experts in computer science,
• statistics,
• communication,
• data visualization, and
• extensive domain expertise.
Chapter 2
Big Data Statistics
• Statistical Inference
• Modeling
Statistical Thinking – Age of Big Data
Chapter 2
Exploratory Data Analysis (EDA)
• “It is an attitude, a state of flexibility, a willingness to
look for those things that we believe are not there, as
well as those we believe to be there.”
-John Tukey
Chapter 3
What is an algorithm?
• Series of steps or rules to accomplish a task,
such as:
– Sorting
– Searching
– Graph-based computational problems
• Because one problem can be solved by several
algorithms, the "best" algorithm is the one that
solves it most efficiently, in the least
computational time (one example is sketched below).
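• For instance, searching a sorted vector: here is a minimal binary search sketch in R (the function name and test data are our own illustration, not from the text):

  # Binary search: return the index of target in a sorted vector, or NA
  binary_search <- function(v, target) {
    lo <- 1
    hi <- length(v)
    while (lo <= hi) {
      mid <- (lo + hi) %/% 2            # integer midpoint
      if (v[mid] == target) return(mid)
      if (v[mid] < target) lo <- mid + 1 else hi <- mid - 1
    }
    NA                                  # target not present
  }
  binary_search(c(2, 3, 5, 7, 11), 7)   # returns 4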
Three Categories of Algorithms
• Data munging, preparation, and processing
– Sorting, MapReduce, Pregel
– Considered data engineering
• Optimization
– Parameter estimation
– Newton’s Method, least squares (see the sketch after this list)
• Machine learning
– Predict, classify, cluster
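• As promised above, a minimal Newton's Method root-finding sketch in R (our own illustrative example, not from the text):

  # Newton's Method: iterate x <- x - f(x)/f'(x) until the step is tiny
  newton <- function(f, fprime, x0, tol = 1e-10, max_iter = 100) {
    x <- x0
    for (i in seq_len(max_iter)) {
      step <- f(x) / fprime(x)
      x <- x - step
      if (abs(step) < tol) break
    }
    x
  }
  newton(function(x) x^2 - 2, function(x) 2 * x, x0 = 1)  # ~1.414214, sqrt(2)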
Data Scientists
• Good data scientists use both statistical
modeling and machine learning algorithms.
• Statisticians:
– Want to apply parameters to real-world scenarios.
– Provide confidence intervals, and acknowledge uncertainty in these.
– Make explicit assumptions about data generation.
• Software engineers:
– Want to create production code from a model without interpreting its parameters.
– Machine learning algorithms don't have notions of uncertainty.
– Don't make assumptions about the probability distribution; it stays implicit.
Linear Regression (supervised)
• Determine whether there is causation and, if we
think so, build a model.
• Does X (explanatory var) cause Y (response var)?
• Assumptions:
– Quantitative variables
– Linear form
Linear Regression (supervised)
• Steps:
– Create a scatterplot of data
– Ensure that data looks linear (maybe apply
transformation?)
– Find the “least squares” line, or fitted line.
• This is the line with the lowest sum of squared
residuals (residual = actual value – predicted value)
– Check your model for “goodness” with R-squared,
p-values, etc.
– Apply your model within reason (see the sketch below).
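• A minimal sketch of these steps in R, on simulated data (the variable names and coefficients here are assumptions for illustration, not the book's dataset):

  set.seed(42)
  new_friends <- rpois(100, lambda = 4)                        # simulated predictor
  time_spent  <- -32 + 46 * new_friends + rnorm(100, sd = 30)  # simulated response
  plot(new_friends, time_spent)            # step 1: scatterplot, check linearity
  model <- lm(time_spent ~ new_friends)    # step 3: least squares fit
  abline(model)                            # draw the fitted line
  summary(model)                           # step 4: R-squared, p-values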
Simplistic example
• Suppose you run a social networking site that charges a monthly
subscription fee of $25; then revenue is a perfectly deterministic
function of the number of users.
• If you showed this to someone else who didn't even know how much
you charged, they could recover that relationship exactly from a
plot of the data.
There is no perfectly deterministic relationship between number of new friends and time spent on the site, but it makes sense that there is an association between these two variables.
Contd…
• First, let’s start by assuming there actually is a relationship and that it’s linear.
Fitting the model
Contd…
• But it’s up to you, the data scientist, whether you think
you’d actually want to use this linear model to describe the
relationship or predict new outcomes.
• If a new x-value of 5 came in, meaning the user had five
new friends, how confident are you in the output value of
–32.08 + 45.92*5 = 197.52 seconds?
• In order to get at this question of confidence, you need to
extend your model.
• So far you have modeled the trend; you haven't yet
modeled the variation.
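• One way to get at that confidence in R is a prediction interval, sketched here on simulated data (assumed names and numbers, not the book's):

  set.seed(42)
  new_friends <- rpois(100, lambda = 4)
  time_spent  <- -32 + 46 * new_friends + rnorm(100, sd = 30)
  model <- lm(time_spent ~ new_friends)
  # interval = "prediction" widens the point estimate to reflect the noise
  predict(model, newdata = data.frame(new_friends = 5),
          interval = "prediction", level = 0.95)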
Extending beyond least squares
• You currently have a simple linear regression
model (one output, one predictor) that uses
least squares estimation to estimate your βs.
You can build on that model in three
primary ways:
– Adding in modeling assumptions about the errors
– Adding in more predictors
– Transforming the predictors
Adding in modeling assumptions about
the errors
• To capture the variability in your model (shown in Figure 3-5), you should
add a noise term:
y = β₀ + β₁x + ϵ
where the new term ϵ is referred to as noise: the stuff that you haven't
accounted for by the relationships you've modeled so far.
• It's also called the error term. ϵ represents the actual error, the difference
between the observations and the true regression line, which you'll never
know.
• The average squared difference between observed and predicted
values, (1/n) Σ (yᵢ – ŷᵢ)², is called the mean squared error and
captures how much the predicted value varies from the observed.
Mean squared error is a useful quantity for any prediction problem.
• In regression in particular, it’s also an estimator for your
variance, but it can’t always be used or interpreted that way.
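• On the simulated fit from the earlier sketch, the mean squared error falls out of the residuals directly (a sketch, not the book's code):

  set.seed(42)
  new_friends <- rpois(100, lambda = 4)
  time_spent  <- -32 + 46 * new_friends + rnorm(100, sd = 30)
  model <- lm(time_spent ~ new_friends)
  mean(residuals(model)^2)   # mean squared error: average squared residual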
Evaluation metrics
• You have a couple of values in the output of the R function
that help you get at the issue of how confident you can
be in the estimates: p-values and R-squared.
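• Both values can be pulled out of the fitted model object; a sketch on the same simulated data (assumed names, not the book's output):

  set.seed(42)
  new_friends <- rpois(100, lambda = 4)
  time_spent  <- -32 + 46 * new_friends + rnorm(100, sd = 30)
  s <- summary(lm(time_spent ~ new_friends))
  s$r.squared               # R-squared: share of variance explained
  coef(s)[, "Pr(>|t|)"]     # p-values for the intercept and slope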
Other models for error terms
• The mean squared error is an example of what is called a loss
function.
• This is the standard one to use in linear regression because it
gives us a pretty nice measure of closeness of fit.
• There are other loss functions such as one that relies on
absolute value rather than squaring.
• It's also possible to build custom loss functions specific to your
particular problem or context.
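• As a sketch of that idea, an absolute-value (L1) loss can be minimized numerically; the setup below is our own illustration, not the book's code:

  set.seed(42)
  new_friends <- rpois(100, lambda = 4)
  time_spent  <- -32 + 46 * new_friends + rnorm(100, sd = 30)
  # L1 loss: sum of absolute residuals for candidate coefficients beta
  l1_loss <- function(beta) {
    sum(abs(time_spent - (beta[1] + beta[2] * new_friends)))
  }
  optim(c(0, 0), l1_loss)$par   # compare with coef(lm(time_spent ~ new_friends))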
Adding other predictors
• What we just looked at was simple linear regression: one outcome or
dependent variable and one predictor.
• Adding more predictors extends the model to
y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ϵ, and you can use the same methods to
evaluate your model as discussed previously: looking at R-squared,
p-values, and using training and testing sets (see the sketch below).
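• A sketch of the extended fit in R, with a hypothetical second predictor "age" (simulated; not the book's data):

  set.seed(42)
  new_friends <- rpois(100, lambda = 4)
  age         <- sample(18:60, 100, replace = TRUE)   # hypothetical predictor
  time_spent  <- -32 + 46 * new_friends + rnorm(100, sd = 30)
  model2 <- lm(time_spent ~ new_friends + age)   # same lm() call, more terms
  summary(model2)                                # same metrics: R-squared, p-values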
Transformations
• Going back to one x predicting one y, why did we assume a
linear relationship? Instead, maybe, a better model would be
a polynomial relationship like this:
y = β₀ + β₁x + β₂x² + β₃x³
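• In R, such a transformation can be sketched by fitting polynomial terms of the predictor (simulated data, assumed names; not the book's code):

  set.seed(42)
  new_friends <- rpois(100, lambda = 4)
  time_spent  <- -32 + 46 * new_friends + rnorm(100, sd = 30)
  poly_model <- lm(time_spent ~ poly(new_friends, 3))  # cubic polynomial fit
  summary(poly_model)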