Unit 3 Part 1
Unit 3
TCS 421
Big Data
Big Data is a term for collections of data sets so large and
complex that they are difficult to store and process using
available database management tools or traditional data
processing applications.
4/3/2024
Big Data Characteristics
1. Volume
2. Velocity
3. Variety
4. Veracity
5. Value
DATA SCIENCE EVERYWHERE!...
DATA SCIENCE VOCABULARY
WHAT IS DATA SCIENCE?
and method.”
WHAT IS DATA SCIENCE?
• Fourth paradigm
• “… change of all sciences moving from observational,
to theoretical, to computational and now to the 4th
Paradigm – Data-Intensive Scientific Discovery”
WHAT IS IMPORTANT?
[Figure: Data Science at the center, surrounded by]
• Data Management
• Machine/Statistical Learning
• Visualization
• Mathematical Optimization
• Application Domain Expertise
• Humanities, Law, Social Science
DATA SCIENCE AND BIG DATA
• They are not the “same thing”
• Big data = crude oil
• Big data is about extracting “crude oil”, transporting it in “mega tankers”,
siphoning it through “pipelines”, and storing it in “massive silos”
• Data science is about refining the “crude oil”
Carlos Somohano
Founder, Data Science London
DATA SCIENCE APPLICATION EXAMPLES
• Fraud detection
• Investigate fraud patterns in past data
• Early detection is important
• Before damage propagates
• Harder than late detection
• Precision is important
• False positive and false negative are both bad
• Real-time analytics
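To make the point about precision and the two error types concrete, here is a minimal sketch that counts false positives and false negatives and derives precision and recall from them. The labels below are made-up illustration data (1 = fraud, 0 = legitimate), not results from any real system, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred):
    """Precision: share of flagged cases that are fraud.
    Recall: share of fraud cases that were flagged."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical transactions: actual labels vs. a detector's flags.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A false positive blocks a legitimate customer, while a false negative lets fraud through, which is why both numbers matter and neither alone is enough.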
DATA SCIENCE APPLICATION EXAMPLES
• Recommender systems
• The ability to offer unique
personalized service
• Increase sales, click-through rates,
conversions, …
• Netflix recommender system valued at
$1B per year
• Amazon recommender system drives a
20-35% lift in sales annually
• Collaborative filtering at scale
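As a rough illustration of collaborative filtering, the sketch below predicts a user's rating for an unseen item from the ratings of similar users. It is a toy user-based variant with cosine similarity, one of several possible formulations; the users, items, and ratings are invented, and nothing here reflects how Netflix or Amazon actually implement their systems.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "ana": {"A": 5, "B": 4, "C": 1},
    "ben": {"A": 4, "B": 5, "D": 5},
    "cy": {"A": 1, "C": 5, "D": 2},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def predict(user, item, ratings):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        w = cosine(ratings[user], their)
        num += w * their[item]
        den += w
    return num / den if den else None


# "ana" has not rated item "D"; her tastes are closer to "ben" than to "cy",
# so the prediction leans toward ben's rating.
print(predict("ana", "D", ratings))
```

At production scale the pairwise loop is replaced by approximate or matrix-factorization methods, but the underlying idea of borrowing ratings from similar users is the same.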
DATA SCIENCE APPLICATION EXAMPLES
• Predicting why patients are being
readmitted
• Reduce costs
• Improve population health
• Find the “why” behind specific
populations being readmitted
• Data lakes of multiple data
sources
• Investigate ties between readmission and
socioeconomic data points, patient history,
genetics, …
DATA SCIENCE APPLICATION EXAMPLES
• “Smart cities”
• Not well-defined
• Generally refers to using data and
ICT to:
• Better plan communities
• Better manage assets
• Reduce costs
• Deploy open data to better engage
with community
DATA SCIENCE APPLICATION EXAMPLES
• Moneyball
• How to build a baseball team on a very low budget by relying on data
• Sabermetrics: the statistical analysis of baseball data to objectively evaluate
performance
• The Oakland Athletics' 2002 record of 103-59 was joint best in MLB
• Team salary budget: $40 million
• Other team: Yankees
• Team salary budget: $120 million
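The budget comparison above can be expressed as cost per win. This is a back-of-the-envelope sketch using the two payroll figures from the slide; it assumes both teams finished with 103 wins, which is how "joint best" is read here, and the function name is hypothetical.

```python
def cost_per_win(payroll_musd, wins):
    """Payroll (in millions of USD) spent per regular-season win."""
    return payroll_musd / wins


# Payrolls from the slide; 103 wins for both teams is an assumption
# based on the "joint best" record.
low_budget = cost_per_win(40, 103)
high_budget = cost_per_win(120, 103)

print(f"low-budget team:  ${low_budget:.2f}M per win")
print(f"high-budget team: ${high_budget:.2f}M per win")
print(f"ratio: {high_budget / low_budget:.1f}x")
```

Under these assumptions the higher-budget team paid roughly three times as much for each win, which is the core of the Moneyball argument.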
HOLISTIC APPROACH TO DATA SCIENCE
Core
CORE RESEARCH ISSUES & INTERACTIONS
• Data lakes
• Batch & online access
• Platforms
Making Data Trustable & Usable
• Data cleaning
• Sampling
• Data provenance
Data Visualization & Dissemination
Relational Data
(Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
Social Network, Semantic Web (RDF), …
Streaming Data
You can afford to scan the data once
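The single-scan constraint on streaming data can be illustrated with reservoir sampling, which keeps a uniform random sample of fixed size while reading each item exactly once. The slides do not name this technique; it is used here only as one well-known example of a one-pass algorithm, and the stream below is synthetic.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from a stream of unknown
    length, reading the stream once and storing only k items."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen slot in the reservoir.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


# A generator stands in for a stream that cannot be rewound.
stream = (x * x for x in range(10_000))
picked = reservoir_sample(stream, k=5, seed=42)
print(picked)
```

Memory use stays at k items no matter how long the stream runs, which is the property that matters when the data can be scanned only once.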
What To Do With These Data?
Data Scientist
They find stories, extract knowledge. They
are not reporters
Data Scientists
National Security
Cyber Security
Business Analytics
Engineering
Healthcare
And more …
Concentration in Data Science