Computing Tools for Big Data in Astronomy
Lionel Fillatre
Université Nice Sophia Antipolis
Polytech Nice Sophia
Laboratoire I3S
École d'été thématique CNRS BasMatI
1
June 3, 2015
Outline
What is Big Data? (including the Hadoop ecosystem)
HDFS (Hadoop Distributed File System)
What is MapReduce?
Image Coaddition with MapReduce
What is NoSQL?
What is Pig?
What is Hive?
What is Spark?
Conclusion
2
What is Big Data?
3
Big Data Definition
No single standard definition…
“Big Data” is data whose scale, diversity, and complexity
require new architecture, techniques, algorithms, and analytics
to manage it and extract value and hidden knowledge from it…
4
Characteristics of Big Data:
1-Scale (Volume)
Data Volume
44x increase from 2009 to 2020
From 0.8 zettabytes to 35 zettabytes (ZB)
Data volume is increasing exponentially
Exponential increase in
collected/generated data
5
Characteristics of Big Data:
2-Complexity (Variety)
Various formats, types, and structures
Text, numerical, images, audio, video,
sequences, time series, social media
data, multi-dim arrays, etc…
Static data vs. streaming data
A single application can be
generating/collecting many types of
data
To extract knowledge, all these types of data need to be linked together
6
Characteristics of Big Data:
3-Speed (Velocity)
Data is generated fast and needs to be processed fast
Online Data Analytics
Late decisions mean missing opportunities
Examples
E-Promotions: based on your current location, your purchase history, and what you like,
send promotions right now for the store next to you
Healthcare monitoring: sensors monitoring your activities and body;
any abnormal measurement requires an immediate reaction
7
Some Make it 5V’s
8
What technology for Big Data?
9
10
11
12
Hadoop Origins
Apache Hadoop is a framework that allows for the
distributed processing of large data sets across clusters of
commodity computers using a simple programming model.
Hadoop is an open-source implementation of Google
MapReduce and Google File System (GFS).
Hadoop fulfills the need for a common infrastructure:
Efficient, reliable, easy to use,
Open Source, Apache License.
13
Hadoop Ecosystem (main elements)
14
Data Storage
Storage capacity has grown exponentially but read
speed has not kept up
1990:
Store 1,400 MB
Transfer speed of 4.5MB/s
Read the entire drive in ~ 5 minutes
2010:
Store 1 TB
Transfer speed of 100MB/s
Read the entire drive in ~ 3 hours
Hadoop - 100 drives working at the same time can
read 1TB of data in 2 minutes
15
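A quick back-of-the-envelope check of these figures in Python (a sketch; the helper function is mine, the capacities and transfer speeds are the ones quoted on the slide):

# Rough check of the read times quoted above.
def read_time_hours(capacity_mb, speed_mb_per_s, n_drives=1):
    """Time to scan the full capacity when n_drives read in parallel."""
    return capacity_mb / (speed_mb_per_s * n_drives) / 3600

print(f"1990 drive: {read_time_hours(1_400, 4.5) * 60:.1f} minutes")                       # ~5 minutes
print(f"2010 drive: {read_time_hours(1_000_000, 100):.1f} hours")                          # ~3 hours
print(f"100 drives in parallel: {read_time_hours(1_000_000, 100, 100) * 60:.1f} minutes")  # ~2 minutes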
Hadoop Cluster
A set of "cheap" commodity hardware
No need for super-computers, use commodity unreliable hardware
Not desktops
Networked together
May reside in the same location
– Set of servers in a set of racks in a data center
16
Scale-Out Instead of Scale-Up
Scale-Up
Add additional resources to an existing node (CPU, RAM)
It is harder and more expensive to scale up
Moore's Law can't keep up with data growth
New units must be purchased if the required resources cannot be added
Also known as scaling vertically
Scale-Out
Add more nodes/machines to an existing distributed application
Software layer is designed for node additions or removal
Hadoop takes this approach - A set of nodes are bonded together as a
single distributed system
Very easy to scale down as well
17
Code to Data
Traditional data processing architecture
Nodes are broken up into separate processing and storage nodes, connected by a high-capacity link
Many data-intensive applications are not CPU demanding, which causes bottlenecks in the network
18
Code to Data
Hadoop co-locates processors and storage
Code is moved to data (size is tiny, usually in KBs)
Processors execute code and access underlying local storage
19
Failures are Common
Given a large number of machines, failures are common
Large warehouses may see machine failures weekly or even daily
Hadoop is designed to cope with node failures
Data is replicated
Tasks are retried
20
Comparison to RDBMS
Relational Database Management Systems
(RDBMS) for batch processing
Oracle, Sybase, MySQL, Microsoft SQL Server, etc.
Hadoop doesn’t fully replace relational products; many
architectures would benefit from both Hadoop and a Relational
product
RDBMS products scale up
Expensive to scale for larger installations
Hits a ceiling when storage reaches 100s of terabytes
Structured Relational vs. Semi-Structured vs. Unstructured
Hadoop was not designed for real-time or low latency queries
21
HDFS
(Hadoop Distributed File System)
22
HDFS
Appears as a single disk
Runs on top of a native filesystem
Fault Tolerant
Can handle disk crashes, machine crashes, etc...
Based on Google's Filesystem (GFS or GoogleFS)
23
HDFS is Good for...
Storing large files
Terabytes, Petabytes, etc...
Millions rather than billions of files
100MB or more per file
Streaming data
Write once and read-many times patterns
Optimized for streaming reads rather than random reads
“Cheap” Commodity Hardware
No need for super-computers, use less reliable commodity hardware
24
HDFS is not so good for...
Low-latency reads
High-throughput rather than low latency for small chunks of data
HBase addresses this issue
Large amounts of small files
Better suited to millions of large files than billions of small files
For example, each file can be 100 MB or more
Multiple Writers
Single writer per file
Writes only at the end of the file; no support for writes at arbitrary offsets
25
HDFS Daemons
26
Files and Blocks
27
HDFS File Write
28
HDFS File Read
29
What is MapReduce?
30
Hadoop MapReduce
Model for processing large amounts of data in
parallel
On commodity hardware
Lots of nodes
Derived from functional programming
Map and reduce functions
Can be implemented in multiple languages
Java, C++, Ruby, Python, etc.
31
Hadoop MapReduce History
32
Main principle
Map: ( f, [a, b, c, ...]) -> [ f(a), f(b), f(c), ... ]
Apply a function to all the elements of a list
ex.: map((f: x->x + 1), [1, 2, 3]) = [2, 3, 4]
Intrinsically parallel
Reduce: ( g, [a, b, c, ...] ) -> g(a, g(b, g(c, ... )))
Apply a function to a list recursively
ex.: reduce(sum, [1, 2, 3, 4]) = sum(1, sum(2, sum(3, 4)))
Purely functional
No global variables, no side effects
33
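Both primitives exist directly in most languages. A minimal Python illustration of the definitions above, using the built-in map and functools.reduce (reduce folds from the left, which gives the same result here because addition is associative):

from functools import reduce

# Map: apply a function to every element of a list (intrinsically parallel).
print(list(map(lambda x: x + 1, [1, 2, 3])))      # [2, 3, 4]

# Reduce: combine the elements of a list recursively with a binary function.
print(reduce(lambda a, b: a + b, [1, 2, 3, 4]))   # 10, i.e. sum(1, sum(2, sum(3, 4)))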
WordCount example
34
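The WordCount figure itself is not reproduced here. As an illustration only, a minimal sketch in the spirit of Hadoop Streaming: the mapper emits (word, 1) pairs, the reducer sums the counts per word, and Hadoop's shuffle-and-sort phase is simulated by a plain sort (the function names and sample text are mine, not Hadoop's actual API):

# wordcount_sketch.py -- WordCount in the map / shuffle / reduce style.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1          # emit (word, 1) for every word

def reducer(pairs):
    # pairs must arrive grouped by key, as they do after Hadoop's shuffle and sort
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    text = ["the quick brown fox", "the lazy dog", "the fox"]
    shuffled = sorted(mapper(text))        # stand-in for the shuffle & sort barrier
    for word, count in reducer(shuffled):
        print(f"{word}\t{count}")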
MapReduce Framework
Takes care of distributed processing and coordination
Scheduling
Jobs are broken down into smaller chunks called tasks.
These tasks are scheduled.
Task localization with Data
Framework strives to place tasks on the nodes that host the
segment of data to be processed by that specific task
Code is moved to where the data is
35
MapReduce Framework
Error Handling
Failures are an expected behavior so tasks are automatically re-tried
on other machines
Data Synchronization
Shuffle and Sort barrier re-arranges and moves data between
machines
Input and output are coordinated by the framework
36
Map Reduce 2.0 on YARN
Yet Another Resource Negotiator (YARN)
Various applications can run on YARN
MapReduce is just one choice (the main choice at this point)
http://wiki.apache.org/hadoop/PoweredByYarn
37
YARN Cluster
38
YARN: Running an Application
39
YARN: Running an Application
40
YARN: Running an Application
41
YARN: Running an Application
42
YARN: Running an Application
43
YARN and MapReduce
YARN does not know or care what kind of application it is
running
MapReduce uses YARN
Hadoop includes a MapReduce ApplicationMaster to manage
MapReduce jobs
Each MapReduce job is an instance of an application
44
Running a MapReduce2 Application
45
Running a MapReduce2 Application
46
Running a MapReduce2 Application
47
Running a MapReduce2 Application
48
Running a MapReduce2 Application
49
Running a MapReduce2 Application
50
Running a MapReduce2 Application
51
Running a MapReduce2 Application
52
Running a MapReduce2 Application
53
Image Coaddition with
MapReduce
54
What is Astronomical Survey Science
from a Big Data point of view?
Gather millions of images and TBs/PBs of storage.
Require high-throughput data reduction pipelines.
Require sophisticated off-line data analysis tools
The following example is extracted from
Wiley K., Connolly A., Gardner J., Krughoff S., Balazinska M., Howe B., Kwon Y., Bu Y.
Astronomy in the Cloud: Using MapReduce for Image Co-Addition.
Publications of the Astronomical Society of the Pacific, 2011, vol. 123, no. 901, pp. 366-380.
55
FITS (Flexible Image Transport System)
An image format that knows where it is looking.
Common astronomical image representation file format.
Metadata tags (like EXIF):
Most importantly: Precise astrometry (position on sky)
Other:
Geolocation (telescope location)
Sky conditions, image quality, etc.
56
Image Coaddition
Given multiple partially overlapping images and a query
(color and sky bounds):
Find images’ intersections with the query bounds.
Project bitmaps to the bounds.
Stack and mosaic into a final product.
57
Image Stacking (Signal Averaging)
Stacking improves SNR: makes
fainter objects visible.
Example (SDSS, Stripe 82):
Top: Single image, R-band
Bottom: 79-deep stack (~9x
SNR improvement)
Variable conditions (e.g., atmosphere, PSF, haze) mean stacking algorithm complexity can exceed a mere sum.
58
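The ~9x figure is consistent with the sqrt(N) gain expected from averaging N independent exposures (sqrt(79) ≈ 8.9). Below is a toy NumPy sketch of the simplest possible coadd — simulated exposures, pixel-wise averaging, no registration, weighting or PSF matching:

# Toy coaddition: averaging N noisy exposures of the same simulated sky
# improves the signal-to-noise ratio by roughly sqrt(N).
import numpy as np

rng = np.random.default_rng(0)
truth = np.zeros((64, 64))
truth[32, 32] = 5.0                               # one faint point source

n_exposures = 79
exposures = [truth + rng.normal(0.0, 1.0, truth.shape) for _ in range(n_exposures)]

single = exposures[0]
stack = np.mean(exposures, axis=0)                # the simple "stack" (signal averaging)

def snr(img):
    background = img.copy()
    background[32, 32] = 0.0                      # mask the source before measuring the noise
    return img[32, 32] / background.std()

print(f"single exposure SNR: {snr(single):.1f}")
print(f"{n_exposures}-deep stack SNR: {snr(stack):.1f} (expected gain ~ sqrt(79) = {np.sqrt(79):.1f}x)")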
Advantages of MapReduce
High-level problem description. No effort spent on
internode communication, message-passing, etc.
Programmed in Java (accessible to most science researchers,
not just computer scientists and engineers).
Runs on cheap commodity hardware, potentially in the
cloud, e.g., Amazon’s EC2.
Scalable: 1000s of nodes can be added to the cluster with no
modification to the researcher’s software.
Large community of users/support.
59
Coaddition in Hadoop
60
What is NoSQL?
61
What is NoSQL?
Stands for Not Only SQL
Class of non-relational data storage systems
Usually do not require a fixed table schema nor do they use the
concept of joins
All NoSQL offerings relax one or more of the ACID properties
(CAP theorem)
For data storage, an RDBMS cannot be the be-all/end-all
Just as there are different programming languages, need to have
other data storage tools in the toolbox
A NoSQL solution is more acceptable to a client now
62
The CAP Theorem
Theorem: You can have at most two of these properties for any shared-data system:
Consistency, Availability, Partition tolerance
63
The CAP Theorem
Consistency: once a writer has written, all readers will see that write.
64
Consistency
Two kinds of consistency:
strong consistency – ACID (Atomicity Consistency
Isolation Durability)
weak consistency – BASE (Basically Available Soft-state
Eventual consistency)
• Basically Available: The database system always seems to work!
• Soft State: It does not have to be consistent all the time.
• Eventually Consistent: The system will eventually become
consistent when the updates propagate, in particular, when there
are not too many updates.
65
The CAP Theorem
Availability: the system is available during software and hardware upgrades and node failures.
66
Availability
A guarantee that every request receives a response about whether
it succeeded or failed.
Traditionally thought of as the server/process being available "five nines" (99.999%) of the time.
However, for large node system, at almost any point in time
there’s a good chance that a node is either down or there is a
network disruption among the nodes.
67
The CAP Theorem
Partition tolerance: the system can continue to operate in the presence of network partitions.
68
Failure is the rule
Amazon:
A datacenter with 100,000 disks
From 6,000 to 10,000 disks fail per year (about 25 disks per day)
Sources of failures are numerous:
Hardware (disk)
Network
Power
Software
Software and OS updates.
69
The CAP Theorem
70
Different Types of NoSQL Systems
• Distributed Key-Value Systems - Lookup a single value for a key
• Amazon’s Dynamo
• Document-based Systems - Access data by key or by search of “document” data.
• CouchDB
• MongoDB
• Column-based Systems
• Google’s BigTable
• HBase
• Facebook’s Cassandra
• Graph-based Systems - Use a graph structure
• Google’s Pregel
• Neo4j
71
Key-Value Pair (KVP) Stores
“Value” is stored as a “blob”
• Without caring or knowing what is inside
• Application is responsible for understanding the data
In simple terms, a NoSQL Key-Value store is a single table with two columns: one
being the (Primary) Key, and the other being the Value.
72
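Conceptually, such a store behaves like a dictionary whose values are opaque blobs that only the application knows how to decode. A minimal Python sketch (the key format and the JSON serialization are arbitrary choices for illustration):

import json

store = {}                                         # the "single table with two columns"

def put(key, value):
    store[key] = json.dumps(value).encode()        # the store only sees an opaque blob

def get(key):
    return json.loads(store[key])                  # the application interprets the blob

put("user:42", {"name": "Ada", "cart": ["book", "telescope"]})
print(get("user:42")["cart"])                      # ['book', 'telescope']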
Document storage
• Records within a single table can have different structures: each record may have a different schema.
• An example record from Mongo, using JSON format, might look like
{
  "_id" : ObjectId("4fccbf281168a6aa3c215443"),
  "first_name" : "Thomas",
  "last_name" : "Jefferson",
  "address" : {                <-- embedded object
    "street" : "1600 Pennsylvania Ave NW",
    "city" : "Washington",
    "state" : "DC"
  }
}
• Records are called documents.
• You can also modify the structure of any document on the fly by adding and removing
members from the document.
• Unlike simple key-value stores, both keys and values are fully searchable in document
databases.
73
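With a document store such as MongoDB, inserting and querying a record like the one above takes only a few calls. Below is a sketch using the pymongo driver; the connection string, database and collection names are placeholders, and a running MongoDB instance is assumed:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")      # placeholder connection string
people = client["demo_db"]["people"]                   # database and collection are created lazily

people.insert_one({
    "first_name": "Thomas",
    "last_name": "Jefferson",
    "address": {"street": "1600 Pennsylvania Ave NW", "city": "Washington", "state": "DC"},
})

# Both keys and values are searchable, including fields of the embedded object.
doc = people.find_one({"address.city": "Washington"})
print(doc["first_name"], doc["last_name"])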
Column-based Stores
• Based on Google’s BigTable store:
• Each record = (row:string, column:string, time:int64)
• Distributed data storage, especially versioned data (time-stamps).
• What is a column-based store? - Data tables are stored as sections of
columns of data, rather than as rows of data.
74
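The (row, column, timestamp) → value cell model can be mimicked with nested dictionaries. A purely conceptual Python sketch (the row key, column names and timestamps are invented for illustration):

from collections import defaultdict

# BigTable-style cell addressing: table[row][column][timestamp] -> value
table = defaultdict(lambda: defaultdict(dict))

table["com.example/index.html"]["contents:html"][1303000000] = "<html>v1</html>"
table["com.example/index.html"]["contents:html"][1303100000] = "<html>v2</html>"
table["com.example/index.html"]["anchor:news.com"][1303050000] = "Example Site"

# Reading a cell usually means "latest version of this row/column".
versions = table["com.example/index.html"]["contents:html"]
print(versions[max(versions)])                    # -> <html>v2</html>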
Graph Database
• Applies graph theory to the storage of information about the relationships between entries
• A graph database is a database that uses graph structures with nodes,
edges, and properties to represent and store data.
• In general, graph databases are useful when you are more interested in
relationships between data than in the data itself:
• for example, in representing and traversing social networks,
generating recommendations, or conducting forensic investigations
(e.g. pattern detection).
75
Example
76
What is Pig?
77
Pig
In brief:
“is a platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs, coupled with infrastructure
for evaluating these programs.”
Top Level Apache Project
http://pig.apache.org
Pig is an abstraction on top of Hadoop
Provides a high-level programming language designed for data processing
Pig scripts are converted into MapReduce jobs and executed on Hadoop clusters
Pig is widely accepted and used
Yahoo!, Twitter, Netflix, etc...
At Yahoo!, 70% of MapReduce jobs are written in Pig
78
Disadvantages of Raw MapReduce
1. Extremely rigid data flow: Map → Reduce
• Other flows (joins, unions, splits, chains) are constantly hacked in
2. Common operations must be coded by hand
• Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize
• Resulting code is difficult to reuse and maintain; shifts focus and attention away from data analysis
79
Pig and MapReduce
MapReduce requires programmers
Must think in terms of map and reduce functions
More than likely will require Java programmers
Pig provides high-level language that can be used by
Analysts
Data Scientists
Statisticians
Etc...
Originally implemented at Yahoo! to allow analysts to
access data
80
Pig’s Features
Main operators:
Join Datasets
Sort Datasets
Filter
Data Types
Group By
User Defined Functions
Etc..
Example:
>movies = LOAD '/home/movies_data.csv' USING PigStorage(',') AS (id, name, year, rating, duration);
>movies_greater_than_four = FILTER movies BY (float)rating > 4.0;
>DUMP movies_greater_than_four;
81
What is Hive?
82
Hive
Data Warehousing Solution built on top of Hadoop
Provides SQL-like query language named HiveQL
Minimal learning curve for people with SQL expertise
Data analysts are target audience
Early Hive development work started at Facebook in 2007
Today Hive is an Apache project under Hadoop
http://hive.apache.org
83
Advantages and Drawbacks
Hive provides
Ability to bring structure to various data formats
Simple interface for ad hoc querying, analyzing and summarizing large
amounts of data
Access to files on various data stores such as HDFS and HBase
Hive does not provide
Low-latency or real-time queries
Even querying small amounts of data may take minutes
Designed for scalability and ease-of-use rather than low-latency responses
84
Hive
Translates HiveQL statements into a set of MapReduce Jobs which are
then executed on a Hadoop Cluster
85
What is Spark?
86
A Brief History: Spark
87
A general view of Spark
88
Current programming models
Current popular programming models for clusters transform
data flowing from stable storage to stable storage
E.g., MapReduce: input (stable storage) → map tasks → reduce tasks → output (stable storage)
Benefits of data flow: runtime can decide where to run tasks and can automatically recover from failures
89
MapReduce I/O
90
Spark
Acyclic data flow is a powerful abstraction, but is not efficient for
applications that repeatedly reuse a working set of data:
Iterative algorithms (many in machine learning)
Interactive data mining tools (R, Excel, Python)
Spark makes working sets a first-class concept to efficiently
support these apps.
91
Goal: Sharing at Memory Speed
92
Resilient Distributed Dataset (RDD)
Provide distributed memory abstractions for clusters to support apps
with working sets.
Retain the attractive properties of MapReduce:
Fault tolerance (for crashes & stragglers)
Data locality
Scalability
Solution: augment data flow model with “resilient distributed
datasets” (RDDs)
93
Programming Model with RDD
Resilient distributed datasets (RDDs)
Immutable collections partitioned across cluster that can be rebuilt
if a partition is lost
Created by transforming data in stable storage using data flow
operators (map, filter, group-by, …)
Can be cached across parallel operations
Parallel operations on RDDs
Reduce, collect, count, save, …
Restricted shared variables
Accumulators, broadcast variables
94
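A minimal PySpark sketch of this model: an RDD is built from stable storage, transformed with data-flow operators, cached because it is reused, and then consumed by parallel actions. The input path is a placeholder and a configured Spark installation is assumed:

from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")

# RDD created by transforming data in stable storage (here, a text file on HDFS).
lines = sc.textFile("hdfs:///data/events.log")                 # placeholder path
errors = lines.filter(lambda line: "ERROR" in line).cache()    # reused below, so cache it

# Parallel operations on the cached RDD.
print(errors.count())
print(errors.map(lambda line: len(line)).reduce(lambda a, b: a + b))

sc.stop()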
Example: Logistic Regression
Goal: find the best line separating two sets of points
(the figure shows a random initial line converging to the target)
95
Logistic Regression (SCALA Code)
// Load the points from stable storage once and cache the working set in memory
val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)   // random initial separating plane

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
96
Conclusion
97
Conclusion
Data storage needs are rapidly increasing
Hadoop has become the de-facto standard for handling these
massive data sets.
Storage of Big Data requires new storage models
NoSQL solutions.
Parallel processing of Big Data requires a new programming
paradigm
MapReduce programming model.
“Big data” is moving beyond one-pass batch jobs to low-latency apps that need data sharing
Apache Spark is an alternative solution.
98