Unit - I & II

Doing Data Science

Chapter 1
What is Data Science?
• Big Data and Data Science Hype
• Getting Past the Hype / Why Now?
• Datafication
• The Current Landscape (with a Little History)
• Data Science Jobs
• A Data Science Profile
• Thought Experiment: Meta-Definition
• OK, So What Is a Data Scientist, Really?
– In Academia
– In Industry
Big Data and Data Science Hype
• Big Data, how big?
• Data Science, who is doing it?
• Academia has been doing this for years
• Statisticians have been doing this work.

Conclusion: The terms have lost their basic meaning and have become so ambiguous that, today, they are effectively meaningless.
Getting Past the Hype / Why Now
• The Hype: Understanding the cultural phenomenon of data science and how others were experiencing it. Study how companies and universities are "doing data science".

• Why Now: Technology makes this possible: infrastructure for large-scale data processing, increased memory and bandwidth, as well as a cultural acceptance of technology in the fabric of our lives. This wasn't true a decade ago.

• Consideration should be given to the ethical and technical responsibilities of the people responsible for the process.
Datafication
• Definition: A process of "taking all aspects of life and turning them into data."

• For example:
– Google's augmented-reality glasses "datafy" the gaze.
– Twitter "datafies" stray thoughts.
– LinkedIn "datafies" professional networks.
Current Landscape of Data Science
• Drew Conway's Venn diagram of data science from 2010.

[Venn diagram: three overlapping circles labeled Hacking Skills, Math & Statistics Knowledge, and Substantive Expertise. Hacking Skills + Math & Statistics = Machine Learning; Math & Statistics + Substantive Expertise = Traditional Research; Hacking Skills + Substantive Expertise = Danger Zone; the intersection of all three = Data Science.]
Data Science Jobs
Job descriptions:
• experts in computer science,
• statistics,
• communication,
• data visualization, and
• extensive domain expertise

Observation: Nobody is an expert in everything, which is why it makes more sense to create teams of people who have different profiles and different expertise; together, as a team, they can specialize in all those things.
Data Science Profile
Data Science Team
What is Data Science, Really?
• In Academia: an academic data scientist is a scientist, trained in
anything from social science to biology, who works with large amounts
of data, and must grapple with computational problems posed by the
structure, size, messiness, and the complexity and nature of the data,
while simultaneously solving a real-world problem.

• In Industry: Someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning, as well as being human. She spends a lot of time in the process of collecting, cleaning, and "munging" data, because data is never clean. This process requires persistence, statistics, and software engineering skills; these skills are also necessary for understanding biases in the data, and for debugging logging output from code.
Doing Data Science

Chapter 2
Big Data Statistics

• Statistical thinking in the Age of Big Data

• Statistical Inference

• Populations and Samples

• Big Data Examples

• Big Assumptions due to Big Data

• Modeling
Statistical Thinking – Age of Big Data

• Prerequisites – massive skills!!


– Math/Comp Science: stats, linear algebra, coding.
– Analytical: Data preparation, modeling,
visualization, communication.
Statistical Inference
• The World – complex, random, uncertain.
– Data are small traces of real-world processes.

• Note: two forms of randomness exist:


– Underlying the process (system property)

– Collection methods (human errors)

• Need a solid method to extract meaning and information from random, dubious data.
– This is Statistical Inference!
Big Data Domain - Sampling
• Scientific Validity Issues with “Big Data”
populations and samples. (Engineering problems
+ Bias)
– Incompleteness Assumptions
• All statistics and analyses must assume that samples do not represent the population and that, therefore, scientifically tenable conclusions cannot be drawn.
• i.e., it's a guess at best. These types of assertions will stand up better against academic/scientific scrutiny.
Big Data Domain - Assumptions
• Other Bad or Wrong Assumptions
– N = 1 vs. N = ALL (multiple layers)
• Big Data introduces a 2nd degree to the data context.
• There are infinite levels of depth and breadth in the data.
• Individuals become populations. Populations become populations of
populations – to the nth degree. (meta-data)
– My Example:
• 1 billion Facebook posts (one from each user) vs. 1 billion Facebook
posts from one unique user.
• 1 billion tweets vs. 1 billion images from one unique user.
• Danger: Drawing conclusions from incomplete
populations. Understand the boundaries/context.
Modeling
• What’s a model?
– An attempt to understand the population of interest and to represent it in a compact form that can be used to experiment, analyze, and study, and to determine cause-and-effect and similar relationships among the variables under study IN THE POPULATION.
• Data model
• Statistical model – fitting?
• Mathematical model
Probability Distributions
Doing Data Science

Chapter 2
Exploratory Data Analysis (EDA)
• “It is an attitude, a state of flexibility, a willingness to
look for those things that we believe are not there, as
well as those we believe to be there.”

-John Tukey

• Traditionally presented as a bunch of histograms and stem-and-leaf plots.
Features
• EDA is a critical part of the data science process.
• Represents a philosophy or way of doing statistics.
• There are no hypotheses and there is no model.
• The "exploratory" aspect means that your understanding of the problem you are solving, or might solve, is changing as you go.
Basic Tools of EDA
• Plots, graphs, and summary statistics.

• A method of systematically going through the data, plotting distributions of all variables.

• EDA is a set of tools, but it's also a mindset.

• The mindset is about your relationship with the data. (These tools are sketched in R below.)
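A minimal sketch of these basic tools in R, using an invented data frame of user activity (the variable names and values are illustrative, not from the text):

# Invented data frame standing in for a real dataset
users <- data.frame(
  num_visits = rpois(1000, lambda = 5),
  time_spent = rexp(1000, rate = 1/300)   # seconds; deliberately skewed
)

summary(users)                             # summary statistics for every variable

hist(users$num_visits)                     # distribution of one variable
hist(users$time_spent)                     # distribution of another
plot(users$num_visits, users$time_spent)   # relationship between two variables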


Philosophy of EDA
• There are many reasons anyone working with data should do EDA.

• EDA helps with debugging the logging process.

• EDA helps ensure that the product is performing as intended.

• EDA is done toward the beginning of the analysis.


Data Science Process
A Data Scientist’s Role in This process
Doing Data Science

Chapter 3
What is an algorithm?
• Series of steps or rules to accomplish a tasks
such as:
– Sorting
– Searching
– Graph-based computational problems
• Because one problem could be solved by
several algorithms, the “best” is the one that
can do it with most efficiency and least
computational time.
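As a concrete illustration (not from the text), here is a minimal R sketch of one classic searching algorithm, binary search on a sorted vector:

# Binary search: return the index of target in a sorted vector x, or NA if absent
binary_search <- function(x, target) {
  lo <- 1
  hi <- length(x)
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2                 # middle position
    if (x[mid] == target) return(mid)
    if (x[mid] < target) lo <- mid + 1 else hi <- mid - 1
  }
  NA
}

binary_search(c(2, 3, 5, 7, 11, 13), 7)    # returns 4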
Three Categories of Algorithms
• Data munging, preparation, and processing
– Sorting, MapReduce, Pregel
– Considered data engineering
• Optimization
– Parameter estimation
– Newton’s Method, least squares
• Machine learning
– Predict, classify, cluster
Data Scientists
• Good data scientists use both statistical
modeling and machine learning algorithms.
• Statisticians:
– Want to apply parameters to real-world scenarios.
– Provide confidence intervals and have uncertainty in these.
– Make explicit assumptions about data generation.
• Software engineers:
– Want to create production code from a model without interpreting parameters.
– Machine learning algorithms don't have notions of uncertainty.
– Don't make assumptions about the probability distribution; it is implicit.
Linear Regression (supervised)
• Determine if there is causation and build a model
if we think so.
• Does X (explanatory var) cause Y (response var)?
• Assumptions:
– Quantitative variables
– Linear form
Linear Regression (supervised)
• Steps:
– Create a scatterplot of data
– Ensure that data looks linear (maybe apply
transformation?)
– Find the least squares line (the line of best fit).
• This is the line that has the lowest sum of squared residuals (residual = actual value – expected value).
– Check your model for "goodness" with R-squared, p-values, etc.
– Apply your model within reason. (These steps are sketched in R below.)
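A minimal sketch of these steps in R, on made-up data (the numbers are invented for illustration only):

set.seed(42)
x <- runif(200, 0, 20)                   # invented explanatory variable
y <- -32 + 46 * x + rnorm(200, sd = 30)  # invented response with a linear trend plus noise

plot(x, y)                               # steps 1-2: scatterplot; check it looks linear
model <- lm(y ~ x)                       # step 3: fit the least squares line
abline(model)                            # draw the fitted line on the scatterplot
summary(model)                           # step 4: check R-squared, p-values, etc.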
Simplistic example
• Suppose you run a social networking site that charges a monthly subscription fee of $25, where x = number of users and y = total revenue.

• If you showed this to someone else who didn't even know how much you charged or anything about your business model:
Contd…
• They figured out that:
• There’s a linear pattern.
• The coefficient relating x and y is 25, i.e., y = 25x (illustrated briefly below).
• It seems deterministic.
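A quick R illustration of this deterministic case (the user counts are invented):

x <- c(10, 100, 1000, 10000)   # x = number of users
y <- 25 * x                    # y = total revenue at $25 per user

plot(x, y)                     # the points fall exactly on a line
coef(lm(y ~ x))                # slope is 25, intercept is (numerically) zero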
One more example
• Looking at data at the user level

Say you have a dataset keyed by user (meaning each row


contains data for a single user), and the columns represent
user behavior on a social networking site over a period of a week.

The names of the columns are total_num_friends,


total_new_friends_this_week, num_visits, time_spent,
number_apps_downloaded, number_ads_shown, gender, age,
and so on.
Contd…
• For example, x = total_new_friends and y = time_spent (in seconds).

• The business context might be that eventually you want to be able to promise advertisers who bid for space on your website in advance a certain number of users, so you want to be able to forecast.
Contd…

There is no
perfectly
deterministic
relationship
between number of
new friends and
time spent on the
site, but it makes
sense that there is
an association
between these two
variables.
Contd…
• First, let’s start by assuming there actually is a relationship and that it’s linear.
Fitting the model
Contd…
• But it’s up to you, the data scientist, whether you think
you’d actually want to use this linear model to describe the
relationship or predict new outcomes.
• If a new x-value of 5 came in, meaning the user had five new friends, how confident are you in the output value of –32.08 + 45.92*5 ≈ 197.5 seconds?
• In order to get at this question of confidence, you need to extend your model (a small sketch using a prediction interval follows below).
• So far you have modeled the trend; you haven't yet modeled the variation.
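One hedged way to see this in R, assuming a fitted simple regression object called model whose predictor is named x (as in the earlier made-up sketch), is to ask for a prediction interval rather than a single number:

new_user <- data.frame(x = 5)                                  # user with five new friends
predict(model, newdata = new_user, interval = "prediction")    # fit plus a 95% prediction interval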
Extending beyond least squares
• Once you have a simple linear regression model down (one outcome, one predictor) using least squares estimation to estimate your βs, you can build upon that model in three primary ways:
– Adding in modeling assumptions about the errors
– Adding in more predictors
– Transforming the predictors
Adding in modeling assumptions about
the errors
• To capture the variability (shown in Figure 3-5) in your model, you should extend your model to:

y = β0 + β1x + ϵ

where the new term ϵ is referred to as noise: the stuff that you haven't accounted for by the relationships you've figured out so far.

• It's also called the error term. ϵ represents the actual error, the difference between the observations and the true regression line, which you'll never know and can only estimate with your estimated errors (the residuals).
Contd…
• One often makes the modeling assumption that the noise is normally distributed, which is denoted:

ϵ ∼ N(0, σ²)
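A small R sketch of how one might eyeball this assumption on a fitted model (assuming the lm object called model from the earlier made-up sketch):

res <- residuals(model)        # residuals stand in for the unobserved noise ϵ

hist(res)                      # roughly bell-shaped if the normality assumption is reasonable
qqnorm(res); qqline(res)       # points close to the line suggest approximate normality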
Contd…

• This is called the mean squared error and captures how much
the predicted value varies from the observed. Mean squared
error is a useful quantity for any prediction problem.
• In regression in particular, it’s also an estimator for your
variance, but it can’t always be used or interpreted that way.
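For reference, the standard form of the mean squared error, for n observations y_i with fitted values ŷ_i, is MSE = (1/n) Σ (y_i − ŷ_i)². In R, assuming a fitted lm object called model, this is simply:

mean(residuals(model)^2)       # mean squared error of the fitted model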
Evaluation metrics
• You have a couple values in the output of the R function
that help you get at the issue of how confident you can
be in the estimates: p-values and R-squared.
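For example, assuming a fitted lm object called model (the name is illustrative), both values can be pulled out of the model summary:

s <- summary(model)

s$r.squared        # R-squared: the share of variance in y explained by the model
s$coefficients     # table of estimates, standard errors, t-values, and p-values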
Contd…
Contd…
Other models for error terms
• The mean squared error is an example of what is called a loss
function.
• This is the standard one to use in linear regression because it
gives us a pretty nice measure of closeness of fit.
• There are other loss functions such as one that relies on
absolute value rather than squaring.
• It’s also possible to build custom loss functions specific to your
particular problem or context
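A small R sketch comparing the squared-error loss with an absolute-value loss, plus one invented custom loss, on the residuals of a fitted model (assuming the lm object called model from earlier):

res <- residuals(model)

mean(res^2)                            # squared-error loss (mean squared error)
mean(abs(res))                         # absolute-value loss, less sensitive to outliers

# An invented, problem-specific loss: penalize under-predictions twice as heavily
mean(ifelse(res > 0, 2 * res, -res))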
Adding other predictors
• What we just looked at was simple linear regression— one outcome or
dependent variable and one predictor.

• But we can extend this model by building in other predictors, which is called multiple linear regression:

y = β0 + β1x1 + β2x2 + β3x3 + ϵ

• The R code will just be:

model <- lm(y ~ x_1 + x_2 + x_3)

• You can use the same methods to evaluate your model as discussed previously: looking at R², p-values, and using training and testing sets (sketched below).
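A hedged sketch of the training/testing idea in R, with made-up data and the variable names x_1, x_2, x_3 from the formula above:

set.seed(1)

n   <- 500
dat <- data.frame(x_1 = rnorm(n), x_2 = rnorm(n), x_3 = rnorm(n))
dat$y <- 1 + 2 * dat$x_1 - 0.5 * dat$x_2 + 0.1 * dat$x_3 + rnorm(n)   # invented relationship

train_rows <- sample(n, size = 0.8 * n)     # 80/20 split into training and testing sets
train <- dat[train_rows, ]
test  <- dat[-train_rows, ]

model_multi <- lm(y ~ x_1 + x_2 + x_3, data = train)        # fit on the training set
summary(model_multi)$r.squared                              # R-squared on the training data
mean((test$y - predict(model_multi, newdata = test))^2)     # mean squared error on the test set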
Transformations
• Going back to one x predicting one y, why did we assume a
linear relationship? Instead, maybe, a better model would be
a polynomial relationship like this:

y = β0 + β1x + β2x² + β3x³

• Wait, but isn't this linear regression?

• To think of it as linear, you transform or create new variables (for example, z = x²) and build a regression model based on z. Other common transformations are to take the log or to pick a threshold and turn it into a binary predictor instead (sketched below).
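A minimal R sketch of the transformation idea, using made-up data in which the true relationship is quadratic:

set.seed(7)
x <- runif(200, 0, 10)
y <- 3 + 0.5 * x^2 + rnorm(200)    # invented data: y is quadratic, not linear, in x

z <- x^2
model_z <- lm(y ~ z)               # create the new variable z = x^2 and regress on it

model_poly <- lm(y ~ I(x^2))       # equivalently, let the formula apply the transformation

x_big <- as.numeric(x > 5)         # pick a threshold and turn x into a binary predictor
model_bin <- lm(y ~ x_big)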
k-Nearest Neighbor/k-NN (supervised)
• Used when you have many objects that are
classified into categories but have some
unclassified objects (e.g. movie ratings).
• Assumptions:
– Data is of a type where "distance" makes sense.
– Training data is in two or more classes.
– Observed features and the labels are associated (though not necessarily).
– You pick k.
k-Nearest Neighbor/k-NN (supervised)
• Pick a k value (usually a low odd number, but
up to you to pick).
• Find the k closest points to the unclassified point (using any of various distance measures).
• Assign the new point to the class where the majority of the closest points lie.
• Run the algorithm again and again using different values of k. (A sketch in R follows below.)
k-means (unsupervised)
• Goal is to segment data into clusters or strata
– Important for marketing research where you need
to determine your sample space.
• Assumptions:
– Labels are not known.
– You pick k (more of an art than a science).
k-means (unsupervised)
• Randomly pick k centroids (centers of data)
and place them near “clusters” of data.
• Assign each data point to a centroid.
• Move each centroid to the average location of the data points assigned to it.
• Repeat the previous two steps until the data point assignments don't change. (A sketch in R follows below.)
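A minimal sketch with base R's kmeans() on invented two-dimensional data:

set.seed(9)
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),   # three loose, unlabeled clusters
             matrix(rnorm(100, mean = 4), ncol = 2),
             matrix(rnorm(100, mean = 8), ncol = 2))

fit <- kmeans(pts, centers = 3)         # you pick k; here k = 3

fit$centers                             # final centroid locations
table(fit$cluster)                      # number of points assigned to each cluster
plot(pts, col = fit$cluster)            # points colored by assigned cluster
points(fit$centers, pch = 8, cex = 2)   # mark the centroids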
