Introduction to Big Data
& Basic Data Analysis
Basic Concepts in Big Data
What is big data?
"Big Data are high-volume, high-velocity,
and/or high-variety information assets that
require new forms of processing to enable
enhanced decision making, insight discovery
and process optimization (Gartner 2012)
Complicated (intelligent) analysis of data
may make a small data appear to be big
Bottom line: Any data that exceeds our
current capability of processing can be
regarded as big
Why is big data a big deal?
Government
The Obama administration announced a big data initiative
Many different big data programs were launched
Private Sector
Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data
Facebook handles 40 billion photos from its user base.
Falcon Credit Card Fraud Detection System protects 2.1 billion active
accounts world-wide
Science
Large Synoptic Survey Telescope will generate 140 terabytes of data every 5 days
Biomedical computation, like decoding the human genome & personalized medicine
Social science revolution
Lifecycle of Data: 4 As
[Figure: the data lifecycle as a cycle of four As (Acquisition, Aggregation, Analysis, Application), moving scattered data to logged data to integrated data and finally to knowledge, which feeds back into new acquisition]
Computational View of Big Data
[Figure: a layered stack, from top to bottom: Data Visualization; Data Access and Data Analysis; Data Understanding and Data Integration; Formatting and Cleaning; Data Storage]
Big Data & Related Topics/Courses
[Figure: the layered stack above annotated with related topics/courses for CS199: Human-Computer Interaction, Data Visualization, Machine Learning, Databases, Information Retrieval, Data Mining, Computer Vision, Speech Recognition, Natural Language Processing, Data Warehousing, Signal Processing, Information Theory, and many applications]
Some Data Analysis Techniques
Visualization
Classification
Predictive Modeling
Time Series
Clustering
Big Data Everywhere!
Lots of data is being collected
and warehoused
Web data, e-commerce
purchases at department/
grocery stores
Bank/Credit Card
transactions
Social Network
How much data?
Google processes 20 PB a day (2008)
Facebook has 2.5 PB of user data + 15
TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day
(5/2009)
640K ought to be
enough for
anybody.
The Earthscope
The Earthscope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data, and much more.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)
Type of Data
Relational Data
(Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
Social Network, Semantic Web (RDF),
What to do with these data?
Aggregation and Statistics
Data warehouse and OLAP
Indexing, Searching, and Querying
Keyword based search
Pattern matching (XML/RDF)
Knowledge discovery
Data Mining
Statistical Modeling
OLAP and Data Mining
Warehouse Architecture
[Figure: data sources at the bottom feed an Integration layer, which loads the Warehouse (with its Metadata); a Query & Analysis layer on top serves multiple Clients]
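A minimal sketch of this flow in Python, assuming three made-up in-memory sources and using sqlite3 to stand in for the warehouse (table and column names are ours, not from the slide):

```python
# Sketch of the warehouse flow above: extract from operational sources,
# integrate into one schema, load the warehouse, then run analytical queries.
import sqlite3

# Hypothetical operational sources (in reality: OLTP systems, files, feeds).
source_a = [("o100", "p1", 12.0)]
source_b = [("o102", "p2", 11.0)]
source_c = [("o105", "p1", 50.0)]

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sale (orderId TEXT, prodId TEXT, amt REAL)")

# Integration step: normalize each source into the common schema, then load.
for source in (source_a, source_b, source_c):
    warehouse.executemany("INSERT INTO sale VALUES (?, ?, ?)", source)

# Query & analysis step: clients issue analytical queries against the warehouse.
for row in warehouse.execute("SELECT prodId, SUM(amt) FROM sale GROUP BY prodId"):
    print(row)   # e.g. ('p1', 62.0) and ('p2', 11.0)
```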
Star Schemas
A star schema is a common
organization for data at a
warehouse. It consists of:
1. Fact table : a very large accumulation
of facts such as sales.
Often insert-only.
2. Dimension tables : smaller, generally
static information about the entities
involved in the facts.
Terms
[Figure: the fact table sale(orderId, date, custId, prodId, storeId, qty, amt) in the middle, linked to the dimension tables product(prodId, name, price), customer(custId, name, address, city) and store(storeId, city); the measures are qty and amt]
Star
product:  prodId  name  price
          p1      bolt  10
          p2      nut   5

store:    storeId  city
          c1       nyc
          c2       sfo
          c3       la

sale:     orderId  date    custId  prodId  storeId  qty  amt
          o100     1/7/97  53      p1      c1       1    12
          o102     2/7/97  53      p2      c1       2    11
          o105     3/8/97  111     p1      c3       5    50

customer: custId  name   address    city
          53      joe    10 main    sfo
          81      fred   12 main    sfo
          111     sally  80 willow  la
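For illustration, a small sketch of an analytical query over this star schema, using pandas joins in place of SQL (the library choice and variable names are assumptions; the data is copied from the tables above):

```python
# Sketch: "total sales amount per store city" over the star schema above,
# joining the fact table to the store dimension and aggregating the measure.
import pandas as pd

sale = pd.DataFrame({
    "orderId": ["o100", "o102", "o105"],
    "date":    ["1/7/97", "2/7/97", "3/8/97"],
    "custId":  [53, 53, 111],
    "prodId":  ["p1", "p2", "p1"],
    "storeId": ["c1", "c1", "c3"],
    "qty":     [1, 2, 5],
    "amt":     [12, 11, 50],
})
store = pd.DataFrame({"storeId": ["c1", "c2", "c3"],
                      "city":    ["nyc", "sfo", "la"]})

joined = sale.merge(store, on="storeId")          # fact-to-dimension join
print(joined.groupby("city")["amt"].sum())        # la 50, nyc 23
```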
Cube
Fact table view:
  sale:  prodId  storeId  amt
         p1      c1       12
         p2      c1       11
         p1      c3       50
         p2      c2       8

Multi-dimensional cube (dimensions = 2):
        c1   c2   c3
  p1    12        50
  p2    11    8
3-D Cube
Fact table view:
  sale:  prodId  storeId  date  amt
         p1      c1       1     12
         p2      c1       1     11
         p1      c3       1     50
         p2      c2       1     8
         p1      c1       2     44
         p1      c2       2     4

Multi-dimensional cube (dimensions = 3):
  day 1:        c1   c2   c3
          p1    12        50
          p2    11    8
  day 2:        c1   c2   c3
          p1    44    4
          p2
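One way to see the cube concretely, assuming pandas: pivot_table rolls the fact table above up into a (date x prodId x storeId) cube, which is roughly what a MOLAP system materializes:

```python
# Sketch of the 3-D cube above built from the fact table with a pivot table.
import pandas as pd

sale = pd.DataFrame({
    "prodId":  ["p1", "p2", "p1", "p2", "p1", "p1"],
    "storeId": ["c1", "c1", "c3", "c2", "c1", "c2"],
    "date":    [1, 1, 1, 1, 2, 2],
    "amt":     [12, 11, 50, 8, 44, 4],
})

cube = sale.pivot_table(values="amt", index=["date", "prodId"],
                        columns="storeId", aggfunc="sum")
print(cube)
# storeId        c1   c2    c3
# date prodId
# 1    p1      12.0  NaN  50.0
#      p2      11.0  8.0   NaN
# 2    p1      44.0  4.0   NaN
```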
ROLAP vs. MOLAP
ROLAP:
Relational On-Line Analytical
Processing
MOLAP:
Multi-Dimensional On-Line Analytical
Processing
Aggregates
Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE
WHERE date = 1
  sale:  prodId  storeId  date  amt
         p1      c1       1     12
         p2      c1       1     11
         p1      c3       1     50
         p2      c2       1     8
         p1      c1       2     44
         p1      c2       2     4

  Result: 81
Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date
  sale:  prodId  storeId  date  amt
         p1      c1       1     12
         p2      c1       1     11
         p1      c3       1     50
         p2      c2       1     8
         p1      c1       2     44
         p1      c2       2     4

  ans:   date  sum
         1     81
         2     48
Another Example
Add up amounts by day, product
In SQL: SELECT prodId, date, sum(amt) FROM SALE
GROUP BY prodId, date

  sale:  prodId  storeId  date  amt
         p1      c1       1     12
         p2      c1       1     11
         p1      c3       1     50
         p2      c2       1     8
         p1      c1       2     44
         p1      c2       2     4

  result:  prodId  date  amt
           p1      1     62
           p2      1     19
           p1      2     48

Grouping by fewer attributes (e.g. dropping prodId) is a roll-up; grouping by more attributes again is a drill-down.
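A small sketch of roll-up vs. drill-down on the same fact table, using pandas group-bys as a stand-in for the SQL above (the library choice is an assumption):

```python
# Roll-up and drill-down as group-bys over the fact table from the slides.
import pandas as pd

sale = pd.DataFrame({
    "prodId":  ["p1", "p2", "p1", "p2", "p1", "p1"],
    "storeId": ["c1", "c1", "c3", "c2", "c1", "c2"],
    "date":    [1, 1, 1, 1, 2, 2],
    "amt":     [12, 11, 50, 8, 44, 4],
})

by_day_product = sale.groupby(["prodId", "date"])["amt"].sum()  # finer grain
by_day = sale.groupby("date")["amt"].sum()                      # roll-up: drop prodId
print(by_day_product)   # (p1,1)=62, (p1,2)=48, (p2,1)=19
print(by_day)           # day 1 -> 81, day 2 -> 48
```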
Aggregates
Operators: sum, count, max, min, median, avg
Having clause
Using dimension hierarchy
  average by region (within store)
  maximum by month (within date)
What is Data Mining?
Discovery of useful, possibly unexpected, patterns in data
Extraction of implicit, previously unknown and potentially useful information from data
Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Data Mining Tasks
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Classification: Definition
Given a collection of records (training set)
  Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
  A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Decision Trees
Example:
  Conducted a survey to see which customers were interested in a new model car
  Want to select customers for an advertising campaign
[Figure: a decision tree built from the survey training set]
Clustering
[Figure: customers plotted along income, education and age axes, falling into natural groups (clusters)]
K-Means Clustering
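A minimal k-means sketch on the three attributes from the clustering figure (income, education, age); scikit-learn and the data values are our assumptions, not part of the slide:

```python
# Group customers into k clusters by repeatedly assigning points to the
# nearest centroid and recomputing centroids (handled inside KMeans).
from sklearn.cluster import KMeans

customers = [
    [30000, 12, 25], [32000, 12, 27], [90000, 16, 45],
    [95000, 18, 50], [60000, 14, 35], [61000, 14, 36],
]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster index assigned to each customer
print(kmeans.cluster_centers_)  # centroid (income, education, age) per cluster
```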
Association Rule Mining
[Figure: sales records laid out as market-basket data, with columns: transaction id, customer id, products bought]
Trend: products p5, p8 often bought together
Trend: customer 12 likes product p9
Association Rule Discovery
Marketing and Sales Promotion:
Let the rule discovered be
{Bagels, …} --> {Potato Chips}
Potato Chips as consequent => Can be used to
determine what should be done to boost its sales.
Bagels in the antecedent => can be used to see which
products would be affected if the store discontinues
selling bagels.
Bagels in antecedent and Potato chips in consequent
=> Can be used to see what products should be sold
with Bagels to promote sale of Potato chips!
Supermarket shelf management.
Inventory Management
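A small sketch of how the support and confidence of such a rule would be measured over made-up market-basket data (item names and numbers are illustrative only):

```python
# Measure one rule, {bagels} -> {potato chips}:
# support    = fraction of all baskets containing both items
# confidence = fraction of bagel baskets that also contain potato chips
baskets = [
    {"bagels", "potato chips", "milk"},
    {"bagels", "cream cheese"},
    {"potato chips", "beer"},
    {"bagels", "potato chips"},
]

with_bagels = [b for b in baskets if "bagels" in b]
with_both = [b for b in with_bagels if "potato chips" in b]

support = len(with_both) / len(baskets)          # 2/4 = 0.5
confidence = len(with_both) / len(with_bagels)   # 2/3 ~ 0.67
print(support, confidence)
```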
Other Types of Mining
Text mining: application of data mining to textual documents
  cluster Web pages to find related pages (see the sketch below)
  cluster pages a user has visited to organize their visit history
  classify Web pages automatically into a Web directory
Graph mining: deals with graph data
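A minimal sketch of the Web-page clustering example above, assuming scikit-learn: TF-IDF features plus k-means group toy page texts with similar vocabulary:

```python
# Cluster (toy) Web page texts into related groups.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

pages = [
    "cheap flights and hotel deals for your next trip",
    "book flights hotels and car rentals online",
    "python tutorial for data mining and machine learning",
    "machine learning course with python examples",
]

X = TfidfVectorizer(stop_words="english").fit_transform(pages)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # pages with similar vocabulary land in the same cluster
```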
The Meaning of Big Data: the 3 Vs
Big Volume
With simple (SQL) analytics
With complex (non-SQL) analytics
Big Velocity
Drink from the fire hose
Big Variety
Large number of diverse data sources to
integrate
The Participants
Row storage and row executor
  Microsoft Madison, DB2, Netezza, Oracle(!)
Column store grafted onto a row executor (wannabes)
  Teradata/Aster Data, EMC/Greenplum
Column store and column executor
  HP/Vertica, Sybase IQ, ParAccel
Oracle Exadata is not:
  a column store
  a scalable shared-nothing architecture
Hadoop…
Simple analytics
  roughly 100x slower than a parallel DBMS
Complex analytics (Mahout or roll-your-own)
  roughly 100x slower than ScaLAPACK
Parallel programming
  Parallel grep (great)
  Everything else (awful)
Hadoop lacks
  Stateful computations
  Point-to-point communication
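A rough sketch of the "parallel grep" pattern that fits MapReduce well: each worker filters its own chunk independently and the results are simply concatenated, with no shared state or point-to-point communication (plain Python multiprocessing stands in for Hadoop here):

```python
# Map-only "parallel grep": split the input, filter each split in parallel, merge.
from multiprocessing import Pool

def grep_chunk(args):
    pattern, lines = args
    return [line for line in lines if pattern in line]

if __name__ == "__main__":
    lines = ["error: disk full", "ok", "error: timeout", "ok", "warning"]
    chunks = [lines[:3], lines[3:]]                      # stand-in for input splits
    with Pool(2) as pool:
        parts = pool.map(grep_chunk, [("error", c) for c in chunks])
    matches = [m for part in parts for m in part]
    print(matches)   # ['error: disk full', 'error: timeout']
```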
Big Velocity
Sensor tagging everything of value sends velocity through the roof
  e.g. car insurance
Smart phones as a mobile platform send velocity through the roof
State of multi-player internet games must be recorded, which sends velocity through the roof
New OLTP
You need to ingest a fire hose in real time
You need to perform high-volume OLTP
You often need real-time analytics
VoltDB: an example of
New SQL
A main memory SQL engine
Open source
Shared nothing, Linux, TCP/IP on jelly beans
Light-weight transactions
Run-to-completion with no locking
Single-threaded
Multi-core by splitting main memory
About 100x RDBMS on TPC-C
Big Variety
Typical enterprise has 5000 operational systems
Only a few get into the data warehouse
What about the rest?
And what about all the rest of your data?
Spreadsheets
Access databases
Web pages
And public data from the web?
The World of Data Integration
[Figure: the landscape integration must cover: the enterprise data warehouse, text, and the rest of your data]
Summary
The rest of your data (public and private)
Is a treasure trove of incredibly valuable
information
Largely untapped
IoT Meets Big Data
Big Data Value Chain
Collection → Ingestion → Discovery & Cleansing → Integration → Analysis → Delivery

Collection: structured, unstructured and semi-structured data from multiple sources
Ingestion: loading vast amounts of data onto a single data store
Discovery & Cleansing: understanding format and content; clean-up and formatting
Integration: linking, entity extraction, entity resolution, indexing and data fusion
Analysis: intelligence, statistics, predictive and text analytics, machine learning
Delivery: querying, visualization, real-time delivery on enterprise-class availability

Need for standardized approaches at each step
Source: O'Reilly Strata 2012
Considerations for Big Data Standardization
Variety of use cases
Security & privacy
Lifecycle management & data quality
System management & other issues
Data characteristics: distributed/centralized, mobility, the 4 Vs (Volume, Velocity, Variety, Veracity)
Data collection
Data visualization
Data quality
Data analytics & action
Data Sources
Sources: sensors, applications, software agents, individuals, organizations, hardware resources
Any*: anytime, anything, any device, any context, any place, anywhere, anyone
Big Data Standardization Challenges (1)
Big Data use cases, definitions, vocabulary and reference architectures
(e.g. system, data, platforms, online/offline)
Specifications and standardization of metadata including data
provenance
Application models (e.g. batch, streaming)
Query languages including non-relational queries to support diverse
data types (XML, RDF, JSON, multimedia) and Big Data operations (e.g.
matrix operations)
Domain-specific languages
Semantics of eventual consistency
Advanced network protocols for efficient data transfer
General and domain specific ontologies and taxonomies for describing
data semantics including interoperation between ontologies
Source: ISO
Big Data Standardization Challenges (2)
Big Data security and privacy access controls
Remote, distributed, and federated analytics (taking the
analytics to the data) including data and processing
resource discovery and data mining
Data sharing and exchange
Data storage, e.g. memory storage system, distributed file
system, data warehouse, etc.
Human consumption of the results of big data analysis (e.g.
visualization)
Interface between relational (SQL) and non-relational
(NoSQL)
Big Data Quality and Veracity description and management
Source: ISO
The Structure of Big Data
Structured
Most traditional data sources
Semi-structured
Many sources of big data
Unstructured
Video data, audio data
Benefits of Big Data
Big Data is already an important part of the $64 billion
database and data analytics market
It offers commercial opportunities of a comparable
Sekhar Kondepudi
[email protected]
www.kondepudi-group.info
M : +65 98566472