
DATA2001 – Data Science, Big Data, and Data Diversity

Week 12: Big Data

Presented by A/Prof Uwe Roehm
School of Computer Science

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 1
Learning Objectives
– Big Data
  – The three V's: Volume, Velocity and Variety
  – Ethical challenges for Big Data processing
– Scale-Agnostic Data Analytics Platforms
  – Scale-Up vs. Scale-Out
  – MapReduce principle; similarities to SQL
  – Role of modern Big Data platforms: MapReduce, Apache Spark, Flink or Hive

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 2
Big Data

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 3
Big Data: the three Vs
[cf. article by Doug Laney, 2001]
[Barton Poulson "Techniques and Concepts of Big Data", 2014]


DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 4
Big Data: Volume
– very relative due to Moore's Law
  – what once was considered Big Data is a main-memory problem nowadays
  – e.g. Excel: in 2003 max. 65,000 rows, now max. 1 million rows, still ...
– nowadays: Terabytes to Exabytes

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 5
Big Data: Velocity
– conventional scientific research:
  – months to gather data from 100s of cases, weeks to analyse the data, and years to publish
  – Example: Iris flower data set by Edgar Anderson and Ronald Fisher from 1936
– on the other end of the scale: Twitter
  – on average 6,000 tweets/sec, 500 million per day, or 200 billion per year
  – cf. live Twitter Usage Statistics:
    http://www.internetlivestats.com/twitter-statistics/

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 6
Big Data: Variety
– Structured data, such as CSV or RDBMS
– Semi-structured data, such as JSON or XML
– Unstructured data, i.e. text, e-mails, images, video
  – an estimated 80% of enterprise data is unstructured
– study by Forrester Research: variety is the biggest challenge in Big Data

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 7
Big Data Examples
Big Data for Consumers (examples)
– Siri, Yelp!, Spotify, Amazon, Netflix, Google Now
– Some Big Data Variety examples:
– "Neighborland" App [https://neighborland.com]
– "WalkScore.com" [https://www.walkscore.com]

Big Data for Businesses (examples)
– Google Ads Searches
– Predictive Marketing
  – Example "EDITED.com": predicting fashion trends
– Fraud Detection
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 8
Big Data Examples: Big Data for Research
– Astronomy: Sloan Digital Sky Survey (SDSS) SkyServer
– CERN's Large Hadron Collider (LHC)
– The Human Brain Project
– Personalities in the United States
  (cf. Journal of the American Psychological Association)
– Google Flu Trends (only historic data; stopped publishing new trends)
– Apple COVID-19 Mobility Trends (https://www.apple.com/covid19/mobility)
  (discontinued April 2022)
– Google Books project
  – e.g. changes of word usage over time (e.g. maths vs arithmetic vs algebra):
    https://books.google.com/ngrams/graph?content=math,arithmetic,algebra&case_insensitive=off&year_start=1800
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 9
Big Data Challenges beyond Technical Aspects
"[…] consider that great responsibility follows inseparably from great power" [French National Convention, 1793]
– Data Privacy
  – Some data sources, such as the "Internet of Things", allow tracking anyone
    • Do you really need to know who was travelling a route in order to predict, e.g., traffic densities?
    • Personal data can sometimes be inferred => New York Taxi data set example
– Privacy laws
  • Always check: Are you allowed to use some data or process it anywhere?
  • Some personal data, especially regarding health or tax, is specially protected;
    e.g., not allowed to leave a jurisdiction
  • e.g. the EU's General Data Protection Regulation (GDPR) applies to any company
    holding data about any European Union citizen
– Data Security
  – Can your users trust you to keep their data safe?
  – Big Data can expose your organisation to serious privacy and security attacks!
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 10
Use Case: COVIDSafe App
https://www.health.gov.au/resources/apps-and-tools/covidsafe-app
– Tool to help contact tracing – who was in close contact with a known COVID-19 case?
– The app does not collect location data, but just events of being in close proximity
  to another COVIDSafe app user (via Bluetooth)
– Data is stored encrypted locally on the phone for 21 days, then overwritten
– Data is only uploaded to the cloud (AWS…) on request, after personal permission
– Benefit to society vs. privacy concerns
  – Which data is collected and how is it stored?
    • Locally: anonymised close contacts (date, time, distance, duration, and the other user's refcode);
      cloud: meta-data (refcode, phone#, nickname, age range, postcode)
  – Where is the data processed? => cloud, resp. by contact tracers
  – Who has access to this data? => only contact tracers; protected by biosecurity privacy laws
  – Does it work? False positives/negatives are possible => risk of a false sense of security…
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 11
Big Data Challenges beyond Technical Aspects (cont'd)
– Data Discrimination
  – Is it acceptable to discriminate against people based on data about their lives?
  – Credit card scoring? Health insurance?
  – Cf. FTC: "Big Data – A Tool for Inclusion or Exclusion?"
    [https://www.ftc.gov/system/files/documents/reports/big-data-tool-inclusion-or-exclusion-understanding-issues/160106big-data-rpt.pdf]
– Check:
  – Are you working on a representative sample of users/consumers?
  – Do your algorithms prioritise fairness? Are you aware of the biases in the data?
  – Check your Big Data outcomes against traditionally applied statistics practices
– Keep in mind other Vs of Big Data:
  – Validity (data quality), Veracity (data accuracy / trustworthiness), …
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 12
Analysing Big Data

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 14
Data Science Workflow
Analysing ‘Big Data’ with “scripts”?

[Source: http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/]
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 15
Case for Data Science Platforms
– Data is either
  – too large (volume),
  – too fast (velocity), or
  – needs to be combined from diverse sources (variety)
  for processing with scripts or on a single server.
– Need for
  – a scalable platform
  – processing abstractions
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 16
Jupyter Notebooks as Platform for Big Data?
[Diagram: Web Browser –(http)– Jupyter Server (Python) –(network: CSV file read, psycopg)– Database System,
 with a small sample CSV table (a, b, c / 1, foo, 4.5 / 2, bar, 1.3 / 3, oho, 9.0)]
– This does not scale to petabytes of data …
– Which approach for Amazon? Facebook?
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 17
Case Study: LinkedIn
[Source: https://engineering.linkedin.com/architecture/brief-history-scaling-linkedin]
– Started in 2003
  – 2,700 members in the first week
  – single database and web server
– For years experienced exponential growth…
– As of Jan 2018 (https://www.omnicoreagency.com/linkedin-statistics/):
  – 500 million members
  – 250 million active users / month
  – many users with hundreds of connections => huge graph
  – fun fact: Statistical Analysis and Data Mining are top skills on LinkedIn
– World's 34th-most-popular website in terms of overall visitor traffic (Alexa, Dec-16)
  (https://www.alexa.com/topsites)
  • for comparison: Microsoft is #37
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 18
Scale-Up
– The traditional approach: to scale with increasing load, buy more powerful, larger hardware
  • from a single workstation
  • to a dedicated DB server
  • to a large massively-parallel database appliance
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) [source: Jim Gray, HPTS99] 19
The Alternative: Scale-Out [recall Wk 5]
– A single server has limits…
– For real Big Data processing, we need to scale out to a cluster of multiple servers (nodes)
  [Source: Server.png from PinClipart.com]
– State of the art: shared-nothing architecture
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 20
Case Study: LinkedIn Analytical Architecture
"We have multiple grids divided up based upon purpose.
Hardware:
~800 Westmere-based HP SL 170x, with 2x4 cores, 24GB RAM, 6x2TB SATA
~1900 Westmere-based SuperMicro X8DTT-H, with 2x6 cores, 24GB RAM, 6x2TB SATA
~1400 Sandy Bridge-based SuperMicro with 2x6 cores, 32GB RAM, 6x2TB SATA

We use these things for discovering People You May Know and other fun facts."
LinkedIn via https://wiki.apache.org/hadoop/PoweredBy/

[Diagram: multiple Hadoop grids with Hive on top, workflows scheduled with Azkaban and data fed in via Kafka]
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 21
Challenges
– Scale-Agnostic Data Management
  – sharding for performance
  – replication for availability
  – ideally such that applications are unaware of the underlying complexities
  – cf. Week 5
– Scale-Agnostic Data Processing
  – Nowadays we collect massive amounts of data; how can we analyse it?
  – Answer: use lots of machines… (hundreds/thousands of CPUs, can grow)
  – Performance: parallel processing
  – Availability: ideally, the system is never down and handles failures transparently
  => Map/Reduce processing paradigm
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 22
Scale-Agnostic Data Analysis

The MapReduce Principle

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 23
Big Data Analytics Stack
– Layered stack of frameworks for distributed data management and processing
– Many choices of distributed data processing platforms
[Diagram: layered stack – Application / Data Processing (e.g. MapReduce) / Storage / Infrastructure]

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 24
[slide by Ion Stoica, UCB, 2013]
MapReduce Overview
– Scan large volumes of data
– Map: Extract some interesting information
– Shuffle and sort intermediate results
– Reduce: aggregate intermediate results
– Generate final output

– Key idea: provide an abstraction at the point of these two operations (map
and reduce)
– Higher-order functions
– Cf. map functions in functional programming languages such as Lisp or
Haskell

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 25
MapReduce Paradigm
– Functional programming approach to data processing
– map(): applies a given function f to all elements of a collection; returns a new collection
  – map(f, originalList) (see the sketch below)

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) Diagram from Yahoo! Hadoop Tutorial 26
MapReduce Paradigm: reduce()
– reduce(): applies a given function g to all elements of an input list;
  produces, starting from a given initial value, a single (aggregate) output value
– Keys divide the reduce space:
  not all output values are usually reduced together; all values
  with the same key are presented to a single reducer together

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) Diagram from Yahoo! Hadoop Tutorial 27
Similarities between SQL Queries and MapReduce
– A standard map-reduce task is similar in its functionality to a declarative
  aggregation query in SQL:

    SELECT out_key, reduce(out_value)
    FROM map(InputData)
    GROUP BY out_key

– New in the MR paradigm: map() and reduce() are higher-order functions
  which take a user-defined function with the actual functionality.
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 28
Example: Word Count program
– Word Count programmed as a standard sequential program: two nested for loops (see the sketch below)
– Difficult to generalise or parallelise
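The slide's code listing is not reproduced in this text export; a minimal sketch of the same idea in plain Python (assuming each document is a plain string) could look like this:

    # word count as a plain, sequential program: two nested loops,
    # one over the documents and one over the words in each document
    def word_count(documents):
        counts = {}                                  # word -> number of occurrences
        for text in documents:                       # outer loop: every document
            for word in text.lower().split():        # inner loop: every word in it
                counts[word] = counts.get(word, 0) + 1
        return counts

    print(word_count(["Google File System",
                      "Decentralised Structured Storage System"]))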

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 29
Example: Word Count
– Input:
– List of documents that contain text
– Provided to MapReduce in the form of
(k: documentID, v: textcontent) pairs

– Goal:
– Determine which words occur in the documents and how often
– E.g. for text indexing…

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 30
MapReduce Approach
To solve the same problem using MapReduce, we need:
1. a map() function (aka mapper)
2. a reduce() function (aka reducer)
3. some control code that connects mappers and reducers

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 31
Example: Word Count in MapReduce
– Word Count programmed using the Map/Reduce paradigm (see the sketch below)
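The original listing is an image in the slides; a hedged plain-Python sketch of a word-count mapper and reducer (not tied to any particular framework) might be:

    def mapper(doc_id, text):
        # map: emit an intermediate (word, 1) pair for every word in the document
        for word in text.lower().split():
            yield (word, 1)

    def reducer(word, counts):
        # reduce: all intermediate values for one key arrive together; aggregate them
        yield (word, sum(counts))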

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 32
MapReduce Generalised
– The previous example was hard-coded for word count
– We can generalise the pattern for the driver code even further:
  mapper and reducer are now also inputs
– The call to the function needs 3 arguments: data, mapper and reducer (see the sketch below)
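Again, the slide's listing is not in this export; a hypothetical generic driver in plain Python (the shuffle step is simulated with an in-memory dictionary) could be:

    from collections import defaultdict

    def map_reduce(data, mapper, reducer):
        # data: iterable of (in_key, in_value) pairs
        groups = defaultdict(list)
        for in_key, in_value in data:
            for out_key, value in mapper(in_key, in_value):   # map phase
                groups[out_key].append(value)                 # shuffle: group values by key
        results = []
        for out_key, values in groups.items():                # reduce phase
            results.extend(reducer(out_key, values))
        return results

    docs = [("doc1", "Google File System"),
            ("doc2", "Decentralised Structured Storage System"),
            ("doc3", "Distributed Storage System Structured Data")]
    print(map_reduce(docs, mapper, reducer))   # reuses the mapper/reducer sketched above

In a real framework the grouping and the reduce calls run distributed over many nodes; this sketch only mirrors the control flow.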


DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 33
Example: Word Count with MapReduce

Input (in_key, in_value):
  doc1: "Google File System"
  doc2: "Decentralised Structured Storage System"
  doc3: "Distributed Storage System Structured Data"

Map phase – output (out_key, value):
  (Google, 1), (File, 1), (System, 1)
  (Decentralized, 1), (Structured, 1), (Storage, 1), (System, 1)
  (Distributed, 1), (Storage, 1), (System, 1), (Structured, 1), (Data, 1)

Shuffle – grouped (out_key, list(value)):
  (Google, {1}), (File, {1}), (System, {1,1,1}), (Decentralized, {1}),
  (Structured, {1,1}), (Storage, {1,1}), (Distributed, {1}), (Data, {1})

Reduce phase – output (out_key, out_value):
  (Google, 1), (File, 1), (System, 3), (Decentralized, 1),
  (Structured, 2), (Storage, 2), (Distributed, 1), (Data, 1)
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 34
Why Scale-Agnostic?
– Note that the functions given to map() and reduce() only rely on local input
  – functions without side-effects and independent of each other
  – a function invocation is agnostic to the scale (size) of the overall dataset
– Can hence be parallelised easily (see the sketch below)
  – partition the dataset over multiple nodes
  – apply different instances of the same map/reduce functions to each partition
    independently / in parallel
– Fits perfectly to a scale-out approach
  – bigger data => more nodes and data partitions => more parallelism
    => same or faster speed
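A toy illustration of this partition-parallelism with Python's multiprocessing module (a stand-in for a real cluster; the partitioning and worker count are made up):

    from multiprocessing import Pool

    def count_partition(texts):
        # apply the same mapper + local aggregation to one partition, independently
        counts = {}
        for text in texts:
            for word in text.lower().split():
                counts[word] = counts.get(word, 0) + 1
        return counts

    def merge(partial_counts):
        # final reduce: merge the per-partition results
        total = {}
        for part in partial_counts:
            for word, n in part.items():
                total[word] = total.get(word, 0) + n
        return total

    if __name__ == "__main__":
        partitions = [["Google File System"],
                      ["Distributed Storage System", "Structured Data"]]
        with Pool(processes=2) as pool:
            print(merge(pool.map(count_partition, partitions)))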
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 35
MapReduce Discussion
Pros:
– very flexible due to the user-defined functions
– great scalability because of the FP approach
– easy parallelism due to stateless functions
– fault-tolerance
Cons:
– requires programming skills and functional thinking
– relatively low-level; even filtering has to be coded manually
– complex frameworks
– batch-processing oriented

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 36
Distributed, Dataflow-Oriented Analytics Platforms

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 37
Challenge: Iterative Algorithms
– Many data mining and machine learning algorithms rely on global state and iterations
  (see the sketch after the example list below)
– Examples:
– data clustering (eg. k-Means)
– frequent itemset mining (eg. Apriori algorithm)
– linear regression
– collaborative filtering
– PageRank
– …
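To make the challenge concrete, here is a hedged 1-D k-Means sketch on top of the hypothetical map_reduce() driver from earlier: every iteration is a separate MapReduce job, and the global state (the current centroids) has to be re-distributed to all mappers.

    # iterative use of the map_reduce() sketch from above; the data points are made up
    points = [(None, x) for x in [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]]

    def make_mapper(centroids):
        def kmeans_mapper(_, x):
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            yield (nearest, x)                    # key = index of the nearest centroid
        return kmeans_mapper

    def mean_reducer(cluster, xs):
        yield (cluster, sum(xs) / len(xs))        # new centroid = mean of its points

    centroids = [0.0, 5.0]
    for _ in range(10):                           # a fixed number of iterations, for simplicity
        new = dict(map_reduce(points, make_mapper(centroids), mean_reducer))
        centroids = [new.get(i, c) for i, c in enumerate(centroids)]
    print(centroids)                              # roughly [1.5, 9.5]

Platforms such as Spark and Flink (next slides) address this by keeping data and state in main memory across iterations instead of re-launching a job per pass.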
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 38
Distributed Data Analytics Frameworks
– Apache Hadoop
  – Open-source implementation of the original MapReduce from Google; Apache top-level project
  – Java framework, but nowadays also provides a Python interface
  – Parts: own distributed file system (HDFS), job scheduler (YARN), MR framework (Hadoop)

– Apache Spark
  – Distributed cluster computing framework on top of HDFS/YARN
  – Concentrates on main-memory processing and more high-level data flow control
  – Originates from a research project at UC Berkeley

– Apache Flink
  – Efficient data flow runtime on top of HDFS/YARN
  – Similar to Spark, but more emphasis on a built-in dataflow optimiser and pipelined processing
  – Strong for data stream processing
  – Origin: Stratosphere research project by TU Berlin, Humboldt University Berlin and HPI Potsdam

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 39
Distributed Data Analytics Frameworks (continued)
– Apache Hive
  – Provides an SQL-like interface on top of Hadoop / HDFS
  – Allows one to define a relational schema on top of HDFS files, and to query and analyse the data
    with HiveQL (an SQL dialect)
  – Queries are automatically translated to MR jobs and executed in parallel in the cluster
  – Example: WordCount in Hive

    DROP TABLE IF EXISTS docs;
    CREATE TABLE docs (line STRING);
    LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;

    CREATE TABLE word_counts AS
    SELECT word, count(1) AS count
    FROM (SELECT explode(split(line, '\s')) AS word FROM docs) temp
    GROUP BY word
    ORDER BY word;

– Many more high-level frameworks for advanced data analytics.


DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 40
Example: WordCount in Apache Flink (Python)

from flink.plan.Environment import get_environment
from flink.functions.GroupReduceFunction import GroupReduceFunction

env = get_environment()
data = env.read_text("hdfs://…")
data.flat_map(lambda x, c: [(word, 1) for word in x.lower().split()]) \
    .group_by(0) \
    .sum(1) \
    .write_csv("hdfs://…")
env.execute()

[Cf.: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/batch/python.html]
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 42
Summary
– Big Data
  – The three V's: Volume, Velocity and Variety
  – Ethical challenges for Big Data processing
  – Scale-Up versus Scale-Out
– Scale-Agnostic Computation
  – Parallelisable higher-order functions map & reduce
  – MapReduce principle; similarities to SQL
– Scale-Agnostic Data Analytics Platforms
  – Data scientists need more high-level tools and interfaces than MapReduce
  – Examples: Apache Spark, Apache Flink or Apache Hive
  – Componentised infrastructure: SQL querying, ML libraries, streaming, etc.
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 43
Learn More
– DATA3404 Data Science Platforms

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 44
