Computing Tools for Big Data in Astronomy
Lionel Fillatre
Université Nice Sophia Antipolis
Polytech Nice Sophia
Laboratoire I3S
École d'été thématique CNRS BasMatI
1
June 3, 2015
Outline
What is Big Data? (including the Hadoop ecosystem)
HDFS (Hadoop Distributed File System)
What is MapReduce?
Image Coaddition with MapReduce
What is NoSQL?
What is Pig?
What is Hive?
What is Spark?
Conclusion
2
What is Big Data?
3
Big Data Definition
No single standard definition…
“Big Data” is data whose scale, diversity, and complexity
require new architecture, techniques, algorithms, and analytics
to manage it and extract value and hidden knowledge from it…
4
Characteristics of Big Data:
1-Scale (Volume)
Data Volume
44x increase from 2009 to 2020
From 0.8 zettabytes to 35 zettabytes (ZB)
Data volume is increasing exponentially
Exponential increase in
collected/generated data
5
Characteristics of Big Data:
2-Complexity (Variety)
Various formats, types, and structures
Text, numerical, images, audio, video,
sequences, time series, social media
data, multi-dim arrays, etc…
Static data vs. streaming data
A single application can be
generating/collecting many types of
data
To extract knowledge, all these types of data need to be linked together
6
Characteristics of Big Data:
3-Speed (Velocity)
Data is generated fast and needs to be processed fast
Online Data Analytics
Late decisions mean missing opportunities
Examples
E-Promotions: based on your current location, your purchase history, and what you like,
send promotions right now for the store next to you
Healthcare monitoring: sensors monitoring your activities and body;
any abnormal measurement requires an immediate reaction
7
Some Make it 5V’s
8
What technology for Big Data?
9
10
11
12
Hadoop Origins
Apache Hadoop is a framework that allows for the
distributed processing of large data sets across clusters of
commodity computers using a simple programming model.
Hadoop is an open-source implementation of Google
MapReduce and Google File System (GFS).
Hadoop fulfills the need for a common infrastructure:
Efficient, reliable, easy to use,
Open Source, Apache License.
13
Hadoop Ecosystem (main elements)
14
Data Storage
Storage capacity has grown exponentially but read
speed has not kept up
1990:
Store 1,400 MB
Transfer speed of 4.5MB/s
Read the entire drive in ~ 5 minutes
2010:
Store 1 TB
Transfer speed of 100MB/s
Read the entire drive in ~ 3 hours
Hadoop - 100 drives working at the same time can
read 1TB of data in 2 minutes
15
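A quick back-of-the-envelope check of these figures in Python (a sketch; the helper function is mine, the capacities and transfer speeds are the ones quoted on the slide):

# Rough check of the read times quoted above.
def read_time_hours(capacity_mb, speed_mb_per_s, n_drives=1):
    """Time to scan the full capacity when n_drives read in parallel."""
    return capacity_mb / (speed_mb_per_s * n_drives) / 3600

print(f"1990 drive: {read_time_hours(1_400, 4.5) * 60:.1f} minutes")                       # ~5 minutes
print(f"2010 drive: {read_time_hours(1_000_000, 100):.1f} hours")                          # ~3 hours
print(f"100 drives in parallel: {read_time_hours(1_000_000, 100, 100) * 60:.1f} minutes")  # ~2 minutes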
Hadoop Cluster
A set of "cheap" commodity hardware
No need for super-computers, use commodity unreliable hardware
Not desktops
Networked together
May reside in the same location
– Set of servers in a set of racks in a data center
16
Scale-Out Instead of Scale-Up
Scale-Up
Add additional resources to an existing node (CPU, RAM)
It is harder and more expensive to scale up
Moore's Law can't keep up with data growth
New units must be purchased if the required resources cannot be added
Also known as scaling vertically
Scale-Out
Add more nodes/machines to an existing distributed application
Software layer is designed for node additions or removal
Hadoop takes this approach - A set of nodes are bonded together as a
single distributed system
Very easy to scale down as well
17
Code to Data
Traditional data processing architecture
Nodes are broken up into separate processing and storage nodes, connected by a high-capacity link
Many data-intensive applications are not CPU demanding, which causes bottlenecks in the network
18
Code to Data
Hadoop co-locates processors and storage
Code is moved to data (size is tiny, usually in KBs)
Processors execute code and access underlying local storage
19
Failures are Common
Given a large number of machines, failures are common
Large warehouses may see machine failures weekly or even daily
Hadoop is designed to cope with node failures
Data is replicated
Tasks are retried
20
Comparison to RDBMS
Relational Database Management Systems
(RDBMS) for batch processing
Oracle, Sybase, MySQL, Microsoft SQL Server, etc.
Hadoop doesn’t fully replace relational products; many
architectures would benefit from both Hadoop and a Relational
product
RDBMS products scale up
Expensive to scale for larger installations
Hits a ceiling when storage reaches 100s of terabytes
Structured Relational vs. Semi-Structured vs. Unstructured
Hadoop was not designed for real-time or low latency queries
21
HDFS
(Hadoop Distributed File System)
22
HDFS
Appears as a single disk
Runs on top of a native filesystem
Fault Tolerant
Can handle disk crashes, machine crashes, etc...
Based on Google's Filesystem (GFS or GoogleFS)
23
HDFS is Good for...
Storing large files
Terabytes, Petabytes, etc...
Millions rather than billions of files
100MB or more per file
Streaming data
Write once and read-many times patterns
Optimized for streaming reads rather than random reads
“Cheap” Commodity Hardware
No need for super-computers, use less reliable commodity hardware
24
HDFS is not so good for...
Low-latency reads
High-throughput rather than low latency for small chunks of data
HBase addresses this issue
Large amounts of small files
Better suited to millions of large files than billions of small files
For example, each file can be 100 MB or more
Multiple Writers
Single writer per file
Writes only at the end of the file; no support for writes at arbitrary offsets
25
HDFS Daemons
26
Files and Blocks
27
HDFS File Write
28
HDFS File Read
29
What is MapReduce?
30
Hadoop MapReduce
Model for processing large amounts of data in
parallel
On commodity hardware
Lots of nodes
Derived from functional programming
Map and reduce functions
Can be implemented in multiple languages
Java, C++, Ruby, Python, etc.
31
Hadoop MapReduce History
32
Main principle
Map: ( f, [a, b, c, ...]) -> [ f(a), f(b), f(c), ... ]
Apply a function to all the elements of a list
ex.: map((f: x->x + 1), [1, 2, 3]) = [2, 3, 4]
Intrinsically parallel
Reduce: ( g, [a, b, c, ...] ) -> g(a, g(b, g(c, ... )))
Apply a function to a list recursively
ex.: reduce(sum, [1, 2, 3, 4]) = sum(1, sum(2, sum(3, 4)))
Purely functional
No global variables, no side effects
33
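Both primitives exist directly in most languages. A minimal Python illustration of the definitions above, using the built-in map and functools.reduce (reduce folds from the left, which gives the same result here because addition is associative):

from functools import reduce

# Map: apply a function to every element of a list (intrinsically parallel).
print(list(map(lambda x: x + 1, [1, 2, 3])))      # [2, 3, 4]

# Reduce: combine the elements of a list recursively with a binary function.
print(reduce(lambda a, b: a + b, [1, 2, 3, 4]))   # 10, i.e. sum(1, sum(2, sum(3, 4)))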
WordCount example
34
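The WordCount figure itself is not reproduced here. As an illustration only, a minimal sketch in the spirit of Hadoop Streaming: the mapper emits (word, 1) pairs, the reducer sums the counts per word, and Hadoop's shuffle-and-sort phase is simulated by a plain sort (the function names and sample text are mine, not Hadoop's actual API):

# wordcount_sketch.py -- WordCount in the map / shuffle / reduce style.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1          # emit (word, 1) for every word

def reducer(pairs):
    # pairs must arrive grouped by key, as they do after Hadoop's shuffle and sort
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    text = ["the quick brown fox", "the lazy dog", "the fox"]
    shuffled = sorted(mapper(text))        # stand-in for the shuffle & sort barrier
    for word, count in reducer(shuffled):
        print(f"{word}\t{count}")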
MapReduce Framework
Takes care of distributed processing and coordination
Scheduling
Jobs are broken down into smaller chunks called tasks.
These tasks are scheduled.
Task localization with Data
Framework strives to place tasks on the nodes that host the
segment of data to be processed by that specific task
Code is moved to where the data is
35
MapReduce Framework
Error Handling
Failures are an expected behavior so tasks are automatically re-tried
on other machines
Data Synchronization
Shuffle and Sort barrier re-arranges and moves data between
machines
Input and output are coordinated by the framework
36
Map Reduce 2.0 on YARN
Yet Another Resource Negotiator (YARN)
Various applications can run on YARN
MapReduce is just one choice (the main choice at this point)
http://wiki.apache.org/hadoop/PoweredByYarn
37
YARN Cluster
38
YARN: Running an Application
39
YARN: Running an Application
40
YARN: Running an Application
41
YARN: Running an Application
42
YARN: Running an Application
43
YARN and MapReduce
YARN does not know or care what kind of application it is
running
MapReduce uses YARN
Hadoop includes a MapReduce ApplicationMaster to manage
MapReduce jobs
Each MapReduce job is an instance of an application
44
Running a MapReduce2 Application
45
Running a MapReduce2 Application
46
Running a MapReduce2 Application
47
Running a MapReduce2 Application
48
Running a MapReduce2 Application
49
Running a MapReduce2 Application
50
Running a MapReduce2 Application
51
Running a MapReduce2 Application
52
Running a MapReduce2 Application
53
Image Coaddition with
MapReduce
54
What is Astronomical Survey Science
from a Big Data point of view?
Gather millions of images and TBs/PBs of storage.
Require high-throughput data reduction pipelines.
Require sophisticated off-line data analysis tools
The following example is extracted from
Wiley K., Connolly A., Gardner J., Krughoff S., Balazinska M., Howe B., Kwon Y., Bu Y.
Astronomy in the Cloud: Using MapReduce for Image Co-Addition.
Publications of the Astronomical Society of the Pacific, 2011, vol. 123, no. 901, pp. 366-380.
55
FITS (Flexible Image Transport System)
An image format that knows where it is looking.
Common astronomical image representation file format.
Metadata tags (like EXIF):
Most importantly: Precise astrometry (position on sky)
Other:
Geolocation (telescope location)
Sky conditions, image quality, etc.
56
Image Coaddition
Given multiple partially overlapping images and a query
(color and sky bounds):
Find images’ intersections with the query bounds.
Project bitmaps to the bounds.
Stack and mosaic into a final product.
57
Image Stacking (Signal Averaging)
Stacking improves SNR: makes
fainter objects visible.
Example (SDSS, Stripe 82):
Top: Single image, R-band
Bottom: 79-deep stack (~9x
SNR improvement)
Variable conditions (e.g., atmosphere, PSF, haze) mean stacking algorithm complexity can exceed a mere sum.
58
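The ~9x figure is consistent with the sqrt(N) gain expected from averaging N independent exposures (sqrt(79) ≈ 8.9). Below is a toy NumPy sketch of the simplest possible coadd — simulated exposures, pixel-wise averaging, no registration, weighting or PSF matching:

# Toy coaddition: averaging N noisy exposures of the same simulated sky
# improves the signal-to-noise ratio by roughly sqrt(N).
import numpy as np

rng = np.random.default_rng(0)
truth = np.zeros((64, 64))
truth[32, 32] = 5.0                               # one faint point source

n_exposures = 79
exposures = [truth + rng.normal(0.0, 1.0, truth.shape) for _ in range(n_exposures)]

single = exposures[0]
stack = np.mean(exposures, axis=0)                # the simple "stack" (signal averaging)

def snr(img):
    background = img.copy()
    background[32, 32] = 0.0                      # mask the source before measuring the noise
    return img[32, 32] / background.std()

print(f"single exposure SNR: {snr(single):.1f}")
print(f"{n_exposures}-deep stack SNR: {snr(stack):.1f} (expected gain ~ sqrt(79) = {np.sqrt(79):.1f}x)")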
Advantages of MapReduce
High-level problem description. No effort spent on
internode communication, message-passing, etc.
Programmed in Java (accessible to most science researchers,
not just computer scientists and engineers).
Runs on cheap commodity hardware, potentially in the
cloud, e.g., Amazon’s EC2.
Scalable: 1000s of nodes can be added to the cluster with no
modification to the researcher’s software.
Large community of users/support.
59
Coaddition in Hadoop
60
What is NoSQL?
61
What is NoSQL?
Stands for Not Only SQL
Class of non-relational data storage systems
Usually do not require a fixed table schema nor do they use the
concept of joins
All NoSQL offerings relax one or more of the ACID properties
(CAP theorem)
For data storage, an RDBMS cannot be the be-all/end-all
Just as there are different programming languages, need to have
other data storage tools in the toolbox
A NoSQL solution is more acceptable to a client now
62
The CAP Theorem
Theorem: You can have at most two of these properties for any shared-data system:
Consistency, Availability, Partition tolerance
63
The CAP Theorem
Consistency: once a writer has written, all readers will see that write.
64
Consistency
Two kinds of consistency:
strong consistency – ACID (Atomicity Consistency
Isolation Durability)
weak consistency – BASE (Basically Available Soft-state
Eventual consistency)
• Basically Available: The database system always seems to work!
• Soft State: It does not have to be consistent all the time.
• Eventually Consistent: The system will eventually become
consistent when the updates propagate, in particular, when there
are not too many updates.
65
The CAP Theorem
Availability: the system is available during software and hardware upgrades and node failures.
66
Availability
A guarantee that every request receives a response about whether
it succeeded or failed.
Traditionally thought of as the server/process being available "five nines" (99.999%) of the time.
However, for large node system, at almost any point in time
there’s a good chance that a node is either down or there is a
network disruption among the nodes.
67
The CAP Theorem
Partition tolerance: the system can continue to operate in the presence of network partitions.
68
Failure is the rule
Amazon:
A datacenter with 100,000 disks
From 6,000 to 10,000 disks fail per year (about 25 disks per day)
Sources of failures are numerous:
Hardware (disk)
Network
Power
Software
Software and OS updates.
69
The CAP Theorem
70
Different Types of NoSQL Systems
• Distributed Key-Value Systems - Lookup a single value for a key
• Amazon’s Dynamo
• Document-based Systems - Access data by key or by search of “document” data.
• CouchDB
• MongoDB
• Column-based Systems
• Google’s BigTable
• HBase
• Facebook’s Cassandra
• Graph-based Systems - Use a graph structure
• Google’s Pregel
• Neo4j
71
Key-Value Pair (KVP) Stores
“Value” is stored as a “blob”
• Without caring or knowing what is inside
• Application is responsible for understanding the data
In simple terms, a NoSQL Key-Value store is a single table with two columns: one
being the (Primary) Key, and the other being the Value.
72
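Conceptually, such a store behaves like a dictionary whose values are opaque blobs that only the application knows how to decode. A minimal Python sketch (the key format and the JSON serialization are arbitrary choices for illustration):

import json

store = {}                                         # the "single table with two columns"

def put(key, value):
    store[key] = json.dumps(value).encode()        # the store only sees an opaque blob

def get(key):
    return json.loads(store[key])                  # the application interprets the blob

put("user:42", {"name": "Ada", "cart": ["book", "telescope"]})
print(get("user:42")["cart"])                      # ['book', 'telescope']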
Document storage
• Records within a single table can have different structures: each record may have a different schema.
• An example record from Mongo, using JSON format, might look like
{
  "_id" : ObjectId("4fccbf281168a6aa3c215443"),
  "first_name" : "Thomas",
  "last_name" : "Jefferson",
  "address" : {                <-- embedded object
    "street" : "1600 Pennsylvania Ave NW",
    "city" : "Washington",
    "state" : "DC"
  }
}
• Records are called documents.
• You can also modify the structure of any document on the fly by adding and removing
members from the document.
• Unlike simple key-value stores, both keys and values are fully searchable in document
databases.
73
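With a document store such as MongoDB, inserting and querying a record like the one above takes only a few calls. Below is a sketch using the pymongo driver; the connection string, database and collection names are placeholders, and a running MongoDB instance is assumed:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")      # placeholder connection string
people = client["demo_db"]["people"]                   # database and collection are created lazily

people.insert_one({
    "first_name": "Thomas",
    "last_name": "Jefferson",
    "address": {"street": "1600 Pennsylvania Ave NW", "city": "Washington", "state": "DC"},
})

# Both keys and values are searchable, including fields of the embedded object.
doc = people.find_one({"address.city": "Washington"})
print(doc["first_name"], doc["last_name"])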
Column-based Stores
• Based on Google’s BigTable store:
• Each record = (row:string, column:string, time:int64)
• Distributed data storage, especially versioned data (time-stamps).
• What is a column-based store? - Data tables are stored as sections of
columns of data, rather than as rows of data.
74
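The (row, column, timestamp) → value cell model can be mimicked with nested dictionaries. A purely conceptual Python sketch (the row key, column names and timestamps are invented for illustration):

from collections import defaultdict

# BigTable-style cell addressing: table[row][column][timestamp] -> value
table = defaultdict(lambda: defaultdict(dict))

table["com.example/index.html"]["contents:html"][1303000000] = "<html>v1</html>"
table["com.example/index.html"]["contents:html"][1303100000] = "<html>v2</html>"
table["com.example/index.html"]["anchor:news.com"][1303050000] = "Example Site"

# Reading a cell usually means "latest version of this row/column".
versions = table["com.example/index.html"]["contents:html"]
print(versions[max(versions)])                    # -> <html>v2</html>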
Graph Database
• Applies graph theory to the storage of information about the relationships between entries
• A graph database is a database that uses graph structures with nodes,
edges, and properties to represent and store data.
• In general, graph databases are useful when you are more interested in
relationships between data than in the data itself:
• for example, in representing and traversing social networks,
generating recommendations, or conducting forensic investigations
(e.g. pattern detection).
75
Example
76
What is Pig?
77
Pig
In brief:
“is a platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs, coupled with infrastructure
for evaluating these programs.”
Top Level Apache Project
http://pig.apache.org
Pig is an abstraction on top of Hadoop
Provides a high-level programming language designed for data processing
Pig scripts are converted into MapReduce jobs and executed on Hadoop clusters
Pig is widely accepted and used
Yahoo!, Twitter, Netflix, etc...
At Yahoo!, 70% of MapReduce jobs are written in Pig
78
Disadvantages of Raw MapReduce
1. Extremely rigid data flow: Map → Reduce
• Other flows (joins, unions, splits, chains) are constantly hacked in
2. Common operations must be coded by hand
• Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize
• Resulting code is difficult to reuse and maintain; shifts focus and attention away from data analysis
79
Pig and MapReduce
MapReduce requires programmers
Must think in terms of map and reduce functions
More than likely will require Java programmers
Pig provides high-level language that can be used by
Analysts
Data Scientists
Statisticians
Etc...
Originally implemented at Yahoo! to allow analysts to
access data
80
Pig’s Features
Main operators:
Join Datasets
Sort Datasets
Filter
Data Types
Group By
User Defined Functions
Etc..
Example:
>movies = LOAD '/home/movies_data.csv' USING PigStorage(',') AS (id, name, year, rating, duration);
>movies_greater_than_four = FILTER movies BY (float)rating > 4.0;
>DUMP movies_greater_than_four;
81
What is Hive?
82
Hive
Data Warehousing Solution built on top of Hadoop
Provides SQL-like query language named HiveQL
Minimal learning curve for people with SQL expertise
Data analysts are target audience
Early Hive development work started at Facebook in 2007
Today Hive is an Apache project under Hadoop
http://hive.apache.org
83
Advantages and Drawbacks
Hive provides
Ability to bring structure to various data formats
Simple interface for ad hoc querying, analyzing and summarizing large
amounts of data
Access to files on various data stores such as HDFS and HBase
Hive does not provide
Low-latency or real-time queries
Even querying small amounts of data may take minutes
Designed for scalability and ease-of-use rather than low-latency responses
84
Hive
Translates HiveQL statements into a set of MapReduce Jobs which are
then executed on a Hadoop Cluster
85
What is Spark?
86
A Brief History: Spark
87
A general view of Spark
88
Current programming models
Current popular programming models for clusters transform
data flowing from stable storage to stable storage
E.g., MapReduce: input (stable storage) → map tasks → reduce tasks → output (stable storage)
Benefits of data flow: runtime can decide where to run tasks and can automatically recover from failures
89
MapReduce I/O
90
Spark
Acyclic data flow is a powerful abstraction, but is not efficient for
applications that repeatedly reuse a working set of data:
Iterative algorithms (many in machine learning)
Interactive data mining tools (R, Excel, Python)
Spark makes working sets a first-class concept to efficiently
support these apps.
91
Goal: Sharing at Memory Speed
92
Resilient Distributed Dataset (RDD)
Provide distributed memory abstractions for clusters to support apps
with working sets.
Retain the attractive properties of MapReduce:
Fault tolerance (for crashes & stragglers)
Data locality
Scalability
Solution: augment data flow model with “resilient distributed
datasets” (RDDs)
93
Programming Model with RDD
Resilient distributed datasets (RDDs)
Immutable collections partitioned across cluster that can be rebuilt
if a partition is lost
Created by transforming data in stable storage using data flow
operators (map, filter, group-by, …)
Can be cached across parallel operations
Parallel operations on RDDs
Reduce, collect, count, save, …
Restricted shared variables
Accumulators, broadcast variables
94
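A minimal PySpark sketch of this model: an RDD is built from stable storage, transformed with data-flow operators, cached because it is reused, and then consumed by parallel actions. The input path is a placeholder and a configured Spark installation is assumed:

from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")

# RDD created by transforming data in stable storage (here, a text file on HDFS).
lines = sc.textFile("hdfs:///data/events.log")                 # placeholder path
errors = lines.filter(lambda line: "ERROR" in line).cache()    # reused below, so cache it

# Parallel operations on the cached RDD.
print(errors.count())
print(errors.map(lambda line: len(line)).reduce(lambda a, b: a + b))

sc.stop()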
Example: Logistic Regression
Goal: find the best line separating two sets of points
(the figure shows a random initial line converging to the target)
95
Logistic Regression (SCALA Code)
// Load the points from stable storage once and cache the working set in memory
val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)   // random initial separating plane

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
96
Conclusion
97
Conclusion
Data storage needs are rapidly increasing
Hadoop has become the de-facto standard for handling these
massive data sets.
Storage of Big Data requires new storage models
NoSQL solutions.
Parallel processing of Big Data requires a new programming
paradigm
MapReduce programming model.
“Big data” is moving beyond one-pass batch jobs to low-latency apps that need data sharing
Apache Spark is an alternative solution.
98