Unit 3 Part 1

Uploaded by

jevono3360
Copyright
© © All Rights Reserved

Introduction to Data Science

Unit 3
TCS 421
Big Data

Big Data is a term for collections of data sets that are so large
and complex that they are difficult to store and process using
available database management tools or traditional data
processing applications.

4/3/2024
Big Data Characteristics

1. Volume
2. Velocity
3. Variety
4. Veracity
5. Value

DATA SCIENCE EVERYWHERE!...

DATA SCIENCE VOCABULARY

WHAT IS DATA SCIENCE?
• “Data science, also known as data-driven science, is an
interdisciplinary field of scientific methods, processes,
algorithms and systems to extract knowledge or insights
from data in various forms, either structured or
unstructured, similar to data mining.”
• “Data science intends to analyze and understand actual
phenomena with ‘data’. In other words, the aim of data science
is to reveal the features or the hidden structure of complicated
natural, human, and social phenomena with data from a
different point of view from the established or traditional theory
and method.”
WHAT IS DATA SCIENCE?
• Fourth paradigm
• “… change of all sciences moving from observational,
to theoretical, to computational and now to the 4th
Paradigm – Data-Intensive Scientific Discovery”
WHAT IS IMPORTANT?

• Need to solve a real problem using data
• No applications, no data science


DATA SCIENCE AS A UNIFIER
[Diagram: Data Science at the center, linking the following areas]
• Data Management
• Machine/Statistical Learning
• Visualization
• Mathematical Optimization
• Application Domain Expertise
• Humanities
• Law
• Social Science
DATA SCIENCE AND BIG DATA
• They are not the “same thing”
• Big data = crude oil
• Big data is about extracting “crude oil”, transporting it in “mega tankers”,
siphoning it through “pipelines”, and storing it in “massive silos”
• Data science is about refining the “crude oil”

Carlos Somohano
Founder, Data Science London
DATA SCIENCE APPLICATION EXAMPLES
• Fraud detection
• Investigate fraud patterns in past data
• Early detection is important
• Before damage propagates
• Harder than late detection
• Precision is important
• False positive and false negative are both bad
• Real-time analytics
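
The point that false positives and false negatives are both bad can be made concrete with precision and recall. The following is a minimal sketch, not a method from the slides: the function name precision_recall and the eight labels are made up for illustration.

```python
def precision_recall(actual, predicted):
    """Return (precision, recall) for binary labels (1 = fraud, 0 = legitimate)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # fraud correctly flagged
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # legitimate, wrongly flagged
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # fraud that was missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels for eight transactions (illustrative only).
actual = [1, 0, 0, 1, 0, 1, 0, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 0]

precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision means many legitimate transactions are blocked (false positives); low recall means fraud slips through (false negatives).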

DATA SCIENCE APPLICATION EXAMPLES
• Recommender systems
• The ability to offer unique
personalized service
• Increase sales, click-through rates,
conversions, …
• Netflix recommender system valued at
$1B per year
• Amazon recommender system drives a
20-35% lift in sales annually
• Collaborative filtering at scale
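
Collaborative filtering, mentioned in the last bullet, can be sketched in a few lines. This is a toy user-based variant under stated assumptions: the ratings table, the user names, and the functions cosine and predict are hypothetical, and similarity is computed only over items both users rated. It does not represent how Netflix or Amazon implement their systems, and it ignores the "at scale" part entirely.

```python
from math import sqrt

# Hypothetical user -> {item: rating} table, made up for illustration.
ratings = {
    "ann": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 4, "B": 3, "C": 5, "D": 4},
    "cara": {"A": 1, "B": 5, "D": 2},
}


def cosine(u, v):
    """Cosine similarity restricted to the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = 0.0
    den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[item]
        den += sim
    return num / den if den else None


score = predict("ann", "D")
print(score)
```

Because ann's ratings resemble bob's more than cara's, the predicted score for item D lands closer to bob's rating of 4 than to cara's rating of 2.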

DATA SCIENCE APPLICATION EXAMPLES
• Predicting why patients are being
readmitted
• Reduce costs
• Improve population health
• Find the “why” behind specific
populations being readmitted
• Data lakes of multiple data
sources
• Investigate ties between readmission and
socioeconomic data points, patient history,
genetics, …
DATA SCIENCE APPLICATION EXAMPLES
• “Smart cities”
• Not well-defined
• Generally refers to using data and
ICT to:
• Better plan communities
• Better manage assets
• Reduce costs
• Deploy open data to better engage
with community

DATA SCIENCE APPLICATION EXAMPLES
• Moneyball
• How to build a baseball team on a very low budget by relying on data
• Sabermetrics: the statistical analysis of baseball data to objectively evaluate
performance
• Oakland Athletics' 2002 record of 103-59 was joint best in MLB
• Team salary budget: $40 million
• Other team: Yankees
• Team salary budget: $120 million

HOLISTIC APPROACH TO DATA SCIENCE
[Diagram: layered view of data science]
• Core: Data Acquisition; Making Data Trustable & Usable; Management of Big Data; Modeling & Analysis; Dissemination & Visualization; Data Preservation
• Data Security & Privacy
• Ethics, Policy & Social Impact
• Applications (four shown)

Canadian Data Science Workshop


CORE RESEARCH ISSUES & INTERACTIONS
[Diagram: four core areas and the interactions between them]
• Making Data Trustable & Usable: data cleaning; sampling; data provenance
• Big Data Management: data lakes; batch & online access; platforms
• Modelling & Analysis: models & methods for data lakes; unsupervised classification & AI
• Data Visualization & Dissemination: visualization for wider audience; visualization for data exploration; open data technologies
• Interactions between areas: DM support for provenance; data preparation for big data management; cleaning for data analysis; DM for ML; ML for DM; visual analytics; …
Data, Big Data and Challenges
Data Science
Introduction
Why Data Science
Data Scientists
What do they do?
Major/Concentration in Data Science
What courses to take.
Data All Around

Lots of data is being collected and warehoused
Web data, e-commerce
Financial transactions, bank/credit transactions
Online trading and purchasing
Social Network
How Much Data Do We have?

Google processes 20 PB a day (2008)
Facebook has 60 TB of daily logs
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
1000 Genomes Project: 200 TB

Cost of 1 TB of disk: $35
Time to read 1 TB disk: 3 hrs (100 MB/s)
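
The read-time figure can be checked with simple arithmetic. The sketch below uses only the numbers quoted on this slide and assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a sustained sequential read with no overhead. The variable names are illustrative.

```python
# Figures from the slide: a 1 TB disk read at 100 MB/s.
# Assumption: decimal units and perfectly sustained throughput.
disk_bytes = 1 * 10**12
throughput_bytes_per_s = 100 * 10**6

seconds = disk_bytes / throughput_bytes_per_s
hours = seconds / 3600
print(f"{seconds:.0f} s = {hours:.2f} h")  # roughly the 3 hrs quoted above

# Same rate applied to the 20 PB/day figure: how many such disks would
# have to be read in parallel to get through one day's data in one day?
daily_bytes = 20 * 10**15
drives_needed = daily_bytes / throughput_bytes_per_s / 86_400
print(f"about {drives_needed:.0f} drives in parallel")
```

The second calculation shows why storage being cheap does not make the data easy to process: at this rate, a single disk cannot keep up, so the work has to be spread across thousands of drives.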
Big Data
Big Data is any data that is expensive to manage
and hard to extract value from
Volume
The size of the data
Velocity
The latency of data processing relative to the
growing demand for interactivity
Variety and Complexity
The diversity of sources, formats, quality, and structures
Types of Data We Have

Relational Data
(Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
Social Network, Semantic Web (RDF), …
Streaming Data
You can afford to scan the data only once
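
The single-scan constraint on streaming data shapes the algorithms that can be used. As one illustration (not taken from the slides), Welford's method computes a running mean and variance in a single pass without storing the stream. The function name running_stats and the sample readings are hypothetical.

```python
def running_stats(stream):
    """Single pass over the stream; returns (count, mean, population variance)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance


# A generator can be consumed only once, much like a real stream.
readings = (x for x in [2, 4, 4, 4, 5, 5, 7, 9])
count, mean, variance = running_stats(readings)
print(count, mean, variance)
```

Memory use stays constant regardless of stream length, because only three numbers are kept between elements.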
What To Do With These Data?

Aggregation and Statistics
Data warehousing and OLAP
Indexing, Searching, and Querying
Keyword based search
Pattern matching (XML/RDF)
Knowledge discovery
Data Mining
Statistical Modeling
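
To make one of the items above concrete, keyword-based search is commonly built on an inverted index that maps each word to the documents containing it. The sketch below is a minimal illustration under that assumption: the three-document corpus and the functions build_index and search are made up, and real search engines add tokenization, ranking, and much more.

```python
from collections import defaultdict

# Hypothetical mini-corpus, made up for illustration.
docs = {
    1: "big data needs data management",
    2: "data science refines big data",
    3: "statistical modeling and data mining",
}


def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word in the query (AND)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


index = build_index(docs)
hits = search(index, "big data")
print(sorted(hits))
```

Building the index costs one pass over the corpus; after that, each query touches only the posting sets of its own words rather than every document.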
Big Data and Data Science

The U.S. will need 140,000-190,000 predictive analysts and 1.5 million managers/analysts by 2018 (McKinsey Global Institute, June 2011)

New Data Science institutes being created or repurposed – NYU, Columbia, Washington, UCB,...
New degree programs, courses, boot-camps:
e.g., at Berkeley: Stats, I-School, CS, Astronomy…
One proposal (elsewhere) for an MS in “Big Data Science”
What is Data Science?

An area that manages, manipulates, extracts, and interprets knowledge from
tremendous amounts of data
Data science (DS) is a multidisciplinary field of study whose goal is to address the
challenges in big data
Data science principles apply to all data, big and small
What is Data Science?

Theories and techniques from many fields and disciplines are used to investigate and
analyze large amounts of data to help decision makers in many areas, such as science,
engineering, economics, politics, finance, and education
Computer Science
Pattern recognition, visualization, data warehousing, high-performance computing, databases, AI
Mathematics
Mathematical modeling
Statistics
Statistical and stochastic modeling, probability
Data Science
Real Life Examples

Companies learn your secrets, shopping patterns, and preferences
For example, can we tell whether you want a certain type of car based on your online browsing?
Data science and elections (2008, 2012)
1 million people installed the Obama Facebook app that gave access to info on “friends”
Data Scientists

Data scientists find stories and extract knowledge. They are not reporters.
Data Scientists

Data scientists are the key to realizing the opportunities presented by big data. They
bring structure to it, find compelling patterns in it, and advise executives on the
implications for products, processes, and decisions
What do Data Scientists do?

National Security
Cyber Security
Business Analytics
Engineering
Healthcare
And more ….
Concentration in Data Science

Mathematics and Applied Mathematics
Applied Statistics - Data Analysis (SPSS, R)
Programming Skills (R, Python, Julia, MySQL)
Data Mining (Weka, Tableau)
Database Storage and Management (NoSQL, MySQL)
Machine Learning and discovery (Python)
