Unit - I & II

Doing Data Science

Chapter 1
What is Data Science?
• Big Data and Data Science Hype
• Getting Past the Hype / Why Now?
• Datafication
• The Current Landscape (with a Little History)
• Data Science Jobs
• A Data Science Profile
• Thought Experiment: Meta-Definition
• OK, So What Is a Data Scientist, Really?
– In Academia
– In Industry
Big Data and Data Science Hype
• Big Data, how big?
• Data Science, who is doing it?
• Academia has been doing this for years
• Statisticians have been doing this work.

Conclusion: The terms have lost their basic meaning and have become so ambiguous that, today, they are effectively meaningless.
Getting Past the Hype / Why Now
• The Hype: Understanding the cultural phenomenon of data science and how others were experiencing it. Study how companies and universities are "doing data science".

• Why Now: Technology makes this possible: infrastructure for large-scale data processing, increased memory and bandwidth, as well as a cultural acceptance of technology in the fabric of our lives. This wasn't true a decade ago.

• Consideration should be given to the ethical and technical responsibilities of the people responsible for the process.
Datafication
• Definition: A process of "taking all aspects of life and turning them into data."

• For example:
– Google's augmented-reality glasses "datafy" the gaze.
– Twitter "datafies" stray thoughts.
– LinkedIn "datafies" professional networks.
Current Landscape of Data Science
• Drew Conway's Venn diagram of data science from 2010.

[Venn diagram: three overlapping circles labeled Hacking Skills, Math & Statistics Knowledge, and Substantive Expertise. Hacking Skills + Math & Statistics = Machine Learning; Math & Statistics + Substantive Expertise = Traditional Research; Hacking Skills + Substantive Expertise = Danger Zone; the intersection of all three = Data Science.]
Data Science Jobs
Job descriptions:
• experts in computer science,
• statistics,
• communication,
• data visualization, and
• extensive domain expertise

Observation: Nobody is an expert in everything, which is why it makes more sense to create teams of people who have different profiles and different expertise; together, as a team, they can specialize in all those things.
Data Science Profile
Data Science Team
What is Data Science, Really?
• In Academia: an academic data scientist is a scientist, trained in
anything from social science to biology, who works with large amounts
of data, and must grapple with computational problems posed by the
structure, size, messiness, and the complexity and nature of the data,
while simultaneously solving a real-world problem.

• In Industry: Someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning, as well as being human. She spends a lot of time in the process of collecting, cleaning, and "munging" data, because data is never clean. This process requires persistence, statistics, and software engineering skills; these skills are also necessary for understanding biases in the data, and for debugging logging output from code.
Doing Data Science

Chapter 2
Big Data Statistics

• Statistical thinking in the Age of Big Data

• Statistical Inference

• Populations and Samples

• Big Data Examples

• Big Assumptions due to Big Data

• Modeling
Statistical Thinking – Age of Big Data

• Prerequisites – massive skills!!


– Math/Comp Science: stats, linear algebra, coding.
– Analytical: Data preparation, modeling,
visualization, communication.
Statistical Inference
• The World – complex, random, uncertain.
– Data are small traces of real-world processes.

• Note: two forms of randomness exist:


– Underlying the process (system property)

– Collection methods (human errors)

• Need a solid method to extract meaning and information from random, dubious data.
– This is Statistical Inference!
Big Data Domain - Sampling
• Scientific Validity Issues with “Big Data”
populations and samples. (Engineering problems
+ Bias)
– Incompleteness Assumptions
• All statistics and analyses must assume that samples do not represent the population and that, therefore, scientifically tenable conclusions cannot be drawn.
• i.e., it's a guess at best. These types of assertions will stand up better against academic/scientific scrutiny.
Big Data Domain - Assumptions
• Other Bad or Wrong Assumptions
– N = 1 vs. N = ALL (multiple layers)
• Big Data introduces a 2nd degree to the data context.
• There are infinite levels of depth and breadth in the data.
• Individuals become populations. Populations become populations of
populations – to the nth degree. (meta-data)
– My Example:
• 1 billion Facebook posts (one from each user) vs. 1 billion Facebook
posts from one unique user.
• 1 billion tweets vs. 1 billion images from one unique user.
• Danger: Drawing conclusions from incomplete
populations. Understand the boundaries/context.
Modeling
• What’s a model?
– An attempt to understand the population of interest and to represent it in a compact form that can be used to experiment, analyze, and study, and to determine cause-and-effect and similar relationships among the variables under study IN THE POPULATION.
• Data model
• Statistical model – fitting?
• Mathematical model
Probability Distributions
Doing Data Science

Chapter 2
Exploratory Data Analysis (EDA)
• “It is an attitude, a state of flexibility, a willingness to
look for those things that we believe are not there, as
well as those we believe to be there.”

-John Tukey

• Traditionally presented as a bunch of histograms and stem-and-leaf plots.
Features
• EDA is a critical part of the data science process.
• Represents a philosophy or way of doing statistics.
• There are no hypotheses and there is no model.
• The "exploratory" aspect means that your understanding of the problem you are solving, or might solve, is changing as you go.
Basic Tools of EDA
• Plots, graphs, and summary statistics.

• A method of systematically going through the data, plotting distributions of all variables.

• EDA is a set of tools, but it's also a mindset.

• The mindset is about your relationship with the data. (These tools are sketched in R below.)
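A minimal sketch of these basic tools in R, using an invented data frame of user activity (the variable names and values are illustrative, not from the text):

# Invented data frame standing in for a real dataset
users <- data.frame(
  num_visits = rpois(1000, lambda = 5),
  time_spent = rexp(1000, rate = 1/300)   # seconds; deliberately skewed
)

summary(users)                             # summary statistics for every variable

hist(users$num_visits)                     # distribution of one variable
hist(users$time_spent)                     # distribution of another
plot(users$num_visits, users$time_spent)   # relationship between two variables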


Philosophy of EDA
• There are many reasons anyone working with data should do EDA.

• EDA helps with debugging the logging process.

• EDA helps ensure that the product is performing as intended.

• EDA is done toward the beginning of the analysis.


Data Science Process
A Data Scientist’s Role in This process
Doing Data Science

Chapter 3
What is an algorithm?
• Series of steps or rules to accomplish a tasks
such as:
– Sorting
– Searching
– Graph-based computational problems
• Because one problem could be solved by
several algorithms, the “best” is the one that
can do it with most efficiency and least
computational time.
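As a concrete illustration (not from the text), here is a minimal R sketch of one classic searching algorithm, binary search on a sorted vector:

# Binary search: return the index of target in a sorted vector x, or NA if absent
binary_search <- function(x, target) {
  lo <- 1
  hi <- length(x)
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2                 # middle position
    if (x[mid] == target) return(mid)
    if (x[mid] < target) lo <- mid + 1 else hi <- mid - 1
  }
  NA
}

binary_search(c(2, 3, 5, 7, 11, 13), 7)    # returns 4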
Three Categories of Algorithms
• Data munging, preparation, and processing
– Sorting, MapReduce, Pregel
– Considered data engineering
• Optimization
– Parameter estimation
– Newton’s Method, least squares
• Machine learning
– Predict, classify, cluster
Data Scientists
• Good data scientists use both statistical
modeling and machine learning algorithms.
• Statisticians:
– Want to apply parameters to real-world scenarios.
– Provide confidence intervals and have uncertainty in these.
– Make explicit assumptions about data generation.
• Software engineers:
– Want to create production code from a model without interpreting parameters.
– Machine learning algorithms don't have notions of uncertainty.
– Don't make assumptions about the probability distribution; it is implicit.
Linear Regression (supervised)
• Determine if there is causation and build a model
if we think so.
• Does X (explanatory var) cause Y (response var)?
• Assumptions:
– Quantitative variables
– Linear form
Linear Regression (supervised)
• Steps:
– Create a scatterplot of data
– Ensure that data looks linear (maybe apply
transformation?)
– Find the least squares line (the line of best fit).
• This is the line that has the lowest sum of squared residuals (residual = actual value – expected value).
– Check your model for "goodness" with R-squared, p-values, etc.
– Apply your model within reason. (These steps are sketched in R below.)
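A minimal sketch of these steps in R, on made-up data (the numbers are invented for illustration only):

set.seed(42)
x <- runif(200, 0, 20)                   # invented explanatory variable
y <- -32 + 46 * x + rnorm(200, sd = 30)  # invented response with a linear trend plus noise

plot(x, y)                               # steps 1-2: scatterplot; check it looks linear
model <- lm(y ~ x)                       # step 3: fit the least squares line
abline(model)                            # draw the fitted line on the scatterplot
summary(model)                           # step 4: check R-squared, p-values, etc.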
Simplistic example
• Suppose you run a social networking site that charges a monthly subscription fee of $25, where x = number of users and y = total revenue.

• If you showed this to someone else who didn't even know how much you charged or anything about your business model:
Contd…
• They figured out that:
• There’s a linear pattern.
• The coefficient relating x and y is 25, i.e., y = 25x (illustrated briefly below).
• It seems deterministic.
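A quick R illustration of this deterministic case (the user counts are invented):

x <- c(10, 100, 1000, 10000)   # x = number of users
y <- 25 * x                    # y = total revenue at $25 per user

plot(x, y)                     # the points fall exactly on a line
coef(lm(y ~ x))                # slope is 25, intercept is (numerically) zero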
One more example
• Looking at data at the user level

Say you have a dataset keyed by user (meaning each row


contains data for a single user), and the columns represent
user behavior on a social networking site over a period of a week.

The names of the columns are total_num_friends,


total_new_friends_this_week, num_visits, time_spent,
number_apps_downloaded, number_ads_shown, gender, age,
and so on.
Contd…
• For example, x = total_new_friends and y = time_spent (in seconds).

• The business context might be that eventually you want to be able to promise advertisers who bid for space on your website in advance a certain number of users, so you want to be able to forecast.
Contd…

There is no
perfectly
deterministic
relationship
between number of
new friends and
time spent on the
site, but it makes
sense that there is
an association
between these two
variables.
Contd…
• First, let’s start by assuming there actually is a relationship and that it’s linear.
Fitting the model
Contd…
• But it’s up to you, the data scientist, whether you think
you’d actually want to use this linear model to describe the
relationship or predict new outcomes.
• If a new x-value of 5 came in, meaning the user had five new friends, how confident are you in the output value of –32.08 + 45.92*5 ≈ 197.5 seconds?
• In order to get at this question of confidence, you need to extend your model (a small sketch using a prediction interval follows below).
• So far you have modeled the trend; you haven't yet modeled the variation.
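One hedged way to see this in R, assuming a fitted simple regression object called model whose predictor is named x (as in the earlier made-up sketch), is to ask for a prediction interval rather than a single number:

new_user <- data.frame(x = 5)                                  # user with five new friends
predict(model, newdata = new_user, interval = "prediction")    # fit plus a 95% prediction interval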
Extending beyond least squares
• Once you have a simple linear regression model down (one outcome, one predictor) using least squares estimation to estimate your βs, you can build upon that model in three primary ways:
– Adding in modeling assumptions about the errors
– Adding in more predictors
– Transforming the predictors
Adding in modeling assumptions about
the errors
• To capture the variability (shown in Figure 3-5) in your model, you should extend your model to:

y = β0 + β1x + ϵ

where the new term ϵ is referred to as noise: the stuff that you haven't accounted for by the relationships you've figured out so far.

• It's also called the error term. ϵ represents the actual error, the difference between the observations and the true regression line, which you'll never know and can only estimate with your estimated errors (the residuals).
Contd…
• One often makes the modeling assumption that the noise is normally distributed, which is denoted:

ϵ ∼ N(0, σ²)
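A small R sketch of how one might eyeball this assumption on a fitted model (assuming the lm object called model from the earlier made-up sketch):

res <- residuals(model)        # residuals stand in for the unobserved noise ϵ

hist(res)                      # roughly bell-shaped if the normality assumption is reasonable
qqnorm(res); qqline(res)       # points close to the line suggest approximate normality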
Contd…

• This is called the mean squared error and captures how much
the predicted value varies from the observed. Mean squared
error is a useful quantity for any prediction problem.
• In regression in particular, it’s also an estimator for your
variance, but it can’t always be used or interpreted that way.
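For reference, the standard form of the mean squared error, for n observations y_i with fitted values ŷ_i, is MSE = (1/n) Σ (y_i − ŷ_i)². In R, assuming a fitted lm object called model, this is simply:

mean(residuals(model)^2)       # mean squared error of the fitted model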
Evaluation metrics
• You have a couple values in the output of the R function
that help you get at the issue of how confident you can
be in the estimates: p-values and R-squared.
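For example, assuming a fitted lm object called model (the name is illustrative), both values can be pulled out of the model summary:

s <- summary(model)

s$r.squared        # R-squared: the share of variance in y explained by the model
s$coefficients     # table of estimates, standard errors, t-values, and p-values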
Contd…
Contd…
Other models for error terms
• The mean squared error is an example of what is called a loss
function.
• This is the standard one to use in linear regression because it
gives us a pretty nice measure of closeness of fit.
• There are other loss functions such as one that relies on
absolute value rather than squaring.
• It’s also possible to build custom loss functions specific to your
particular problem or context
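A small R sketch comparing the squared-error loss with an absolute-value loss, plus one invented custom loss, on the residuals of a fitted model (assuming the lm object called model from earlier):

res <- residuals(model)

mean(res^2)                            # squared-error loss (mean squared error)
mean(abs(res))                         # absolute-value loss, less sensitive to outliers

# An invented, problem-specific loss: penalize under-predictions twice as heavily
mean(ifelse(res > 0, 2 * res, -res))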
Adding other predictors
• What we just looked at was simple linear regression— one outcome or
dependent variable and one predictor.

• But we can extend this model by building in other predictors, which is called multiple linear regression:

y = β0 + β1x1 + β2x2 + β3x3 + ϵ

• The R code will just be:

model <- lm(y ~ x_1 + x_2 + x_3)

• You can use the same methods to evaluate your model as discussed previously: looking at R², p-values, and using training and testing sets (sketched below).
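A hedged sketch of the training/testing idea in R, with made-up data and the variable names x_1, x_2, x_3 from the formula above:

set.seed(1)

n   <- 500
dat <- data.frame(x_1 = rnorm(n), x_2 = rnorm(n), x_3 = rnorm(n))
dat$y <- 1 + 2 * dat$x_1 - 0.5 * dat$x_2 + 0.1 * dat$x_3 + rnorm(n)   # invented relationship

train_rows <- sample(n, size = 0.8 * n)     # 80/20 split into training and testing sets
train <- dat[train_rows, ]
test  <- dat[-train_rows, ]

model_multi <- lm(y ~ x_1 + x_2 + x_3, data = train)        # fit on the training set
summary(model_multi)$r.squared                              # R-squared on the training data
mean((test$y - predict(model_multi, newdata = test))^2)     # mean squared error on the test set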
Transformations
• Going back to one x predicting one y, why did we assume a
linear relationship? Instead, maybe, a better model would be
a polynomial relationship like this:

y = β0 + β1x + β2x² + β3x³

• Wait, but isn't this linear regression?

• To think of it as linear, you transform or create new variables (for example, z = x²) and build a regression model based on z. Other common transformations are to take the log or to pick a threshold and turn it into a binary predictor instead (sketched below).
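A minimal R sketch of the transformation idea, using made-up data in which the true relationship is quadratic:

set.seed(7)
x <- runif(200, 0, 10)
y <- 3 + 0.5 * x^2 + rnorm(200)    # invented data: y is quadratic, not linear, in x

z <- x^2
model_z <- lm(y ~ z)               # create the new variable z = x^2 and regress on it

model_poly <- lm(y ~ I(x^2))       # equivalently, let the formula apply the transformation

x_big <- as.numeric(x > 5)         # pick a threshold and turn x into a binary predictor
model_bin <- lm(y ~ x_big)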
k-Nearest Neighbor/k-NN (supervised)
• Used when you have many objects that are
classified into categories but have some
unclassified objects (e.g. movie ratings).
• Assumptions:
– Data is of a type where "distance" makes sense.
– Training data is in two or more classes.
– Observed features and the labels are associated (though not necessarily).
– You pick k.
k-Nearest Neighbor/k-NN (supervised)
• Pick a k value (usually a low odd number, but
up to you to pick).
• Find the k closest points to the unclassified point (using any of various distance measures).
• Assign the new point to the class where the majority of the closest points lie.
• Run the algorithm again and again using different values of k. (A sketch in R follows below.)
k-means (unsupervised)
• Goal is to segment data into clusters or strata
– Important for marketing research where you need
to determine your sample space.
• Assumptions:
– Labels are not known.
– You pick k (more of an art than a science).
k-means (unsupervised)
• Randomly pick k centroids (centers of data)
and place them near “clusters” of data.
• Assign each data point to a centroid.
• Move each centroid to the average location of the data points assigned to it.
• Repeat the previous two steps until the data point assignments don't change. (A sketch in R follows below.)
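A minimal sketch with base R's kmeans() on invented two-dimensional data:

set.seed(9)
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),   # three loose, unlabeled clusters
             matrix(rnorm(100, mean = 4), ncol = 2),
             matrix(rnorm(100, mean = 8), ncol = 2))

fit <- kmeans(pts, centers = 3)         # you pick k; here k = 3

fit$centers                             # final centroid locations
table(fit$cluster)                      # number of points assigned to each cluster
plot(pts, col = fit$cluster)            # points colored by assigned cluster
points(fit$centers, pch = 8, cex = 2)   # mark the centroids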
