
DATA2001 – Data Science, Big Data, and Data Diversity

Week 12: Big Data

Presented by A/Prof Uwe Roehm
School of Computer Science

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 1
Learning Objectives
– Big Data
  – The three V's: Volume, Velocity and Variety
  – Ethical challenges for Big Data processing
– Scale-Agnostic Data Analytics Platforms
  – Scale-Up vs. Scale-Out
  – MapReduce principle; similarities to SQL
  – Role of modern Big Data platforms: MapReduce, Apache Spark, Flink or Hive

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 2
Big Data

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 3
Big Data: the three Vs
[cf. article by Doug Laney, 2001]
[Barton Poulson "Techniques and Concepts of Big Data", 2014]


DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 4
Big Data: Volume
– very relative due to Moore's Law
  – what once was considered Big Data is a main-memory problem nowadays
  – e.g. Excel: in 2003 max. 65,000 rows, now max. 1 million rows, still ...
– nowadays: Terabytes to Exabytes

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 5
Big Data: Velocity
– conventional scientific research:
  – months to gather data from 100s of cases, weeks to analyse the data, and years to publish
  – Example: Iris flower data set by Edgar Anderson and Ronald Fisher from 1936
– on the other end of the scale: Twitter
  – on average 6,000 tweets/sec, 500 million per day, or 200 billion per year
  – cf. live Twitter Usage Statistics:
    http://www.internetlivestats.com/twitter-statistics/

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 6
Big Data: Variety
– Structured data, such as CSV or RDBMS
– Semi-structured data, such as JSON or XML
– Unstructured data, i.e. text, e-mails, images, video
  – an estimated 80% of enterprise data is unstructured
– study by Forrester Research: variety is the biggest challenge in Big Data

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 7
Big Data Examples
Big Data for Consumers (examples)
– Siri, Yelp!, Spotify, Amazon, Netflix, Google Now
– Some Big Data Variety examples:
– "Neighborland" App [https://neighborland.com]
– "WalkScore.com" [https://www.walkscore.com]

Big Data for Businesses (examples)
– Google Ads Searches
– Predictive Marketing
  – Example "EDITED.com": predicting fashion trends
– Fraud Detection
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 8
Big Data Examples: Big Data for Research
– Astronomy: Sloan Digital Sky Survey (SDSS) SkyServer
– CERN's Large Hadron Collider (LHC)
– The Human Brain Project
– Personalities in the United States
  (cf. Journal of the American Psychological Association)
– Google Flu Trends (only historic data; stopped publishing new trends)
– Apple COVID-19 Mobility Trends (https://www.apple.com/covid19/mobility)
  (discontinued April 2022)
– Google Books project
  – e.g. changes of word usage over time (e.g. maths vs arithmetic vs algebra):
    https://books.google.com/ngrams/graph?content=math,arithmetic,algebra&case_insensitive=off&year_start=1800
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 9
Big Data Challenges beyond Technical Aspects
"[…] consider that great responsibility follows inseparably from great power" [French National Convention, 1793]
– Data Privacy
  – Some data sources, such as the "Internet of Things", allow tracking anyone
    • Do you really need to know who was travelling a route in order to predict, e.g., traffic densities?
    • Personal data can sometimes be inferred => New York Taxi data set example
– Privacy laws
  • Always check: Are you allowed to use some data or process it anywhere?
  • Some personal data, especially regarding health or tax, is specially protected;
    e.g., not allowed to leave a jurisdiction
  • e.g. the EU's General Data Protection Regulation (GDPR) applies to any company
    holding data about any European Union citizen
– Data Security
  – Can your users trust you to keep their data safe?
  – Big Data can expose your organisation to serious privacy and security attacks!
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 10
Use Case: COVIDSafe App
https://www.health.gov.au/resources/apps-and-tools/covidsafe-app
– Tool to help contact tracing – who was in close contact with a known COVID-19 case?
– The app does not collect location data, but just events of being in close proximity
  to another COVIDSafe app user (via Bluetooth)
– Data is stored encrypted locally on the phone for 21 days, then overwritten
– Data is only uploaded to the cloud (AWS…) on request, after personal permission
– Benefit to society vs. privacy concerns
  – Which data is collected and how is it stored?
    • Locally: anonymised close contacts (date, time, distance, duration, and the other user's refcode);
      cloud: meta-data (refcode, phone#, nickname, age range, postcode)
  – Where is the data processed? => cloud, resp. by contact tracers
  – Who has access to this data? => only contact tracers; protected by biosecurity privacy laws
  – Does it work? False positives/negatives are possible => risk of a false sense of security…
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 11
Big Data Challenges beyond Technical Aspects (cont'd)
– Data Discrimination
  – Is it acceptable to discriminate against people based on data about their lives?
  – Credit card scoring? Health insurance?
  – Cf. FTC: "Big Data – A Tool for Inclusion or Exclusion?"
    [https://www.ftc.gov/system/files/documents/reports/big-data-tool-inclusion-or-exclusion-understanding-issues/160106big-data-rpt.pdf]
– Check:
  – Are you working on a representative sample of users/consumers?
  – Do your algorithms prioritise fairness? Are you aware of the biases in the data?
  – Check your Big Data outcomes against traditionally applied statistics practices
– Keep in mind other Vs of Big Data:
  – Validity (data quality), Veracity (data accuracy / trustworthiness), …
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 12
Analysing Big Data

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 14
Data Science Workflow
Analysing ‘Big Data’ with “scripts”?

[Source: http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/]
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 15
Case for Data Science Platforms
– Data is either
  – too large (volume),
  – too fast (velocity), or
  – needs to be combined from diverse sources (variety)
  for processing with scripts or on a single server.
– Need for
  – a scalable platform
  – processing abstractions
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 16
Jupyter Notebooks as Platform for Big Data?
[Diagram: Web Browser –(http)– Jupyter Server (Python) –(network: CSV file read, psycopg)– Database System,
 with a small sample CSV table (a, b, c / 1, foo, 4.5 / 2, bar, 1.3 / 3, oho, 9.0)]
– This does not scale to petabytes of data …
– Which approach for Amazon? Facebook?
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 17
Case Study: LinkedIn
[Source: https://engineering.linkedin.com/architecture/brief-history-scaling-linkedin]
– Started in 2003
  – 2,700 members in the first week
  – single database and web server
– For years experienced exponential growth…
– As of Jan 2018 (https://www.omnicoreagency.com/linkedin-statistics/):
  – 500 million members
  – 250 million active users / month
  – many users with hundreds of connections => huge graph
  – fun fact: Statistical Analysis and Data Mining are top skills on LinkedIn
– World's 34th-most-popular website in terms of overall visitor traffic (Alexa, Dec-16)
  (https://www.alexa.com/topsites)
  • for comparison: Microsoft is #37
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 18
Scale-Up
– The traditional approach: to scale with increasing load, buy more powerful, larger hardware
  • from a single workstation
  • to a dedicated DB server
  • to a large massively-parallel database appliance
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) [source: Jim Gray, HPTS99] 19
The Alternative: Scale-Out [recall Wk 5]
– A single server has limits…
– For real Big Data processing, we need to scale out to a cluster of multiple servers (nodes)
  [Source: Server.png from PinClipart.com]
– State of the art: shared-nothing architecture
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 20
Case Study: LinkedIn Analytical Architecture
"We have multiple grids divided up based upon purpose.
Hardware:
~800 Westmere-based HP SL 170x, with 2x4 cores, 24GB RAM, 6x2TB SATA
~1900 Westmere-based SuperMicro X8DTT-H, with 2x6 cores, 24GB RAM, 6x2TB SATA
~1400 Sandy Bridge-based SuperMicro with 2x6 cores, 32GB RAM, 6x2TB SATA

We use these things for discovering People You May Know and other fun facts."
LinkedIn via https://wiki.apache.org/hadoop/PoweredBy/

[Diagram: multiple Hadoop grids with Hive on top, workflows scheduled with Azkaban and data fed in via Kafka]
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 21
Challenges
– Scale-Agnostic Data Management
  – sharding for performance
  – replication for availability
  – ideally such that applications are unaware of the underlying complexities
  – cf. Week 5
– Scale-Agnostic Data Processing
  – Nowadays we collect massive amounts of data; how can we analyse it?
  – Answer: use lots of machines… (hundreds/thousands of CPUs, can grow)
  – Performance: parallel processing
  – Availability: ideally, the system is never down and handles failures transparently
  => Map/Reduce processing paradigm
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 22
Scale-Agnostic Data Analysis

The MapReduce Principle

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 23
Big Data Analytics Stack
– Layered stack of frameworks for distributed data management and processing
– Many choices of distributed data processing platforms
[Diagram: layered stack – Application / Data Processing (e.g. MapReduce) / Storage / Infrastructure]

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 24
[slide by Ion Stoica, UCB, 2013]
MapReduce Overview
– Scan large volumes of data
– Map: Extract some interesting information
– Shuffle and sort intermediate results
– Reduce: aggregate intermediate results
– Generate final output

– Key idea: provide an abstraction at the point of these two operations (map
and reduce)
– Higher-order functions
– Cf. map functions in functional programming languages such as Lisp or
Haskell

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 25
MapReduce Paradigm
– Functional programming approach to data processing
– map(): applies a given function f to all elements of a collection; returns a new collection
  – map(f, originalList) (see the sketch below)

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) Diagram from Yahoo! Hadoop Tutorial 26
MapReduce Paradigm: reduce()
– reduce(): applies a given function g to all elements of an input list;
  produces, starting from a given initial value, a single (aggregate) output value
– Keys divide the reduce space:
  not all output values are usually reduced together; all values
  with the same key are presented to a single reducer together

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) Diagram from Yahoo! Hadoop Tutorial 27
Similarities between SQL Queries and MapReduce
– A standard map-reduce task is similar in its functionality to a declarative
  aggregation query in SQL:

    SELECT out_key, reduce(out_value)
    FROM map(InputData)
    GROUP BY out_key

– New in the MR paradigm: map() and reduce() are higher-order functions
  which take a user-defined function with the actual functionality.
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 28
Example: Word Count program
– Word Count programmed as a standard sequential program: two nested for loops (see the sketch below)
– Difficult to generalise or parallelise
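The slide's code listing is not reproduced in this text export; a minimal sketch of the same idea in plain Python (assuming each document is a plain string) could look like this:

    # word count as a plain, sequential program: two nested loops,
    # one over the documents and one over the words in each document
    def word_count(documents):
        counts = {}                                  # word -> number of occurrences
        for text in documents:                       # outer loop: every document
            for word in text.lower().split():        # inner loop: every word in it
                counts[word] = counts.get(word, 0) + 1
        return counts

    print(word_count(["Google File System",
                      "Decentralised Structured Storage System"]))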

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 29
Example: Word Count
– Input:
– List of documents that contain text
– Provided to MapReduce in the form of
(k: documentID, v: textcontent) pairs

– Goal:
– Determine which words occur in the documents and how often
– E.g. for text indexing…

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 30
MapReduce Approach
To solve the same problem using MapReduce, we need:
1. a map() function (aka mapper)
2. a reduce() function (aka reducer)
3. some control code that connects mappers and reducers

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 31
Example: Word Count in MapReduce
– Word Count programmed using the Map/Reduce paradigm (see the sketch below)
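The original listing is an image in the slides; a hedged plain-Python sketch of a word-count mapper and reducer (not tied to any particular framework) might be:

    def mapper(doc_id, text):
        # map: emit an intermediate (word, 1) pair for every word in the document
        for word in text.lower().split():
            yield (word, 1)

    def reducer(word, counts):
        # reduce: all intermediate values for one key arrive together; aggregate them
        yield (word, sum(counts))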

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 32
MapReduce Generalised
– The previous example was hard-coded for word count
– We can generalise the pattern for the driver code even further:
  mapper and reducer are now also inputs
– The call to the function needs 3 arguments: data, mapper and reducer (see the sketch below)
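Again, the slide's listing is not in this export; a hypothetical generic driver in plain Python (the shuffle step is simulated with an in-memory dictionary) could be:

    from collections import defaultdict

    def map_reduce(data, mapper, reducer):
        # data: iterable of (in_key, in_value) pairs
        groups = defaultdict(list)
        for in_key, in_value in data:
            for out_key, value in mapper(in_key, in_value):   # map phase
                groups[out_key].append(value)                 # shuffle: group values by key
        results = []
        for out_key, values in groups.items():                # reduce phase
            results.extend(reducer(out_key, values))
        return results

    docs = [("doc1", "Google File System"),
            ("doc2", "Decentralised Structured Storage System"),
            ("doc3", "Distributed Storage System Structured Data")]
    print(map_reduce(docs, mapper, reducer))   # reuses the mapper/reducer sketched above

In a real framework the grouping and the reduce calls run distributed over many nodes; this sketch only mirrors the control flow.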


DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 33
Example: Word Count with MapReduce

Input (in_key, in_value):
  doc1: "Google File System"
  doc2: "Decentralised Structured Storage System"
  doc3: "Distributed Storage System Structured Data"

Map phase – output (out_key, value):
  (Google, 1), (File, 1), (System, 1)
  (Decentralized, 1), (Structured, 1), (Storage, 1), (System, 1)
  (Distributed, 1), (Storage, 1), (System, 1), (Structured, 1), (Data, 1)

Shuffle – grouped (out_key, list(value)):
  (Google, {1}), (File, {1}), (System, {1,1,1}), (Decentralized, {1}),
  (Structured, {1,1}), (Storage, {1,1}), (Distributed, {1}), (Data, {1})

Reduce phase – output (out_key, out_value):
  (Google, 1), (File, 1), (System, 3), (Decentralized, 1),
  (Structured, 2), (Storage, 2), (Distributed, 1), (Data, 1)
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 34
Why Scale-Agnostic?
– Note that the functions given to map() and reduce() only rely on local input
  – functions without side-effects and independent of each other
  – a function invocation is agnostic to the scale (size) of the overall dataset
– Can hence be parallelised easily (see the sketch below)
  – partition the dataset over multiple nodes
  – apply different instances of the same map/reduce functions to each partition
    independently / in parallel
– Fits perfectly to a scale-out approach
  – bigger data => more nodes and data partitions => more parallelism
    => same or faster speed
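A toy illustration of this partition-parallelism with Python's multiprocessing module (a stand-in for a real cluster; the partitioning and worker count are made up):

    from multiprocessing import Pool

    def count_partition(texts):
        # apply the same mapper + local aggregation to one partition, independently
        counts = {}
        for text in texts:
            for word in text.lower().split():
                counts[word] = counts.get(word, 0) + 1
        return counts

    def merge(partial_counts):
        # final reduce: merge the per-partition results
        total = {}
        for part in partial_counts:
            for word, n in part.items():
                total[word] = total.get(word, 0) + n
        return total

    if __name__ == "__main__":
        partitions = [["Google File System"],
                      ["Distributed Storage System", "Structured Data"]]
        with Pool(processes=2) as pool:
            print(merge(pool.map(count_partition, partitions)))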
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 35
MapReduce Discussion
Pros:
– very flexible due to the user-defined functions
– great scalability because of the FP approach
– easy parallelism due to stateless functions
– fault-tolerance
Cons:
– requires programming skills and functional thinking
– relatively low-level; even filtering has to be coded manually
– complex frameworks
– batch-processing oriented

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 36
Distributed, Dataflow-Oriented Analytics Platforms

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 37
Challenge: Iterative Algorithms
– Many data mining and machine learning algorithms rely on global state and iterations
  (see the sketch after the example list below)
– Examples:
– data clustering (eg. k-Means)
– frequent itemset mining (eg. Apriori algorithm)
– linear regression
– collaborative filtering
– PageRank
– …
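To make the challenge concrete, here is a hedged 1-D k-Means sketch on top of the hypothetical map_reduce() driver from earlier: every iteration is a separate MapReduce job, and the global state (the current centroids) has to be re-distributed to all mappers.

    # iterative use of the map_reduce() sketch from above; the data points are made up
    points = [(None, x) for x in [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]]

    def make_mapper(centroids):
        def kmeans_mapper(_, x):
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            yield (nearest, x)                    # key = index of the nearest centroid
        return kmeans_mapper

    def mean_reducer(cluster, xs):
        yield (cluster, sum(xs) / len(xs))        # new centroid = mean of its points

    centroids = [0.0, 5.0]
    for _ in range(10):                           # a fixed number of iterations, for simplicity
        new = dict(map_reduce(points, make_mapper(centroids), mean_reducer))
        centroids = [new.get(i, c) for i, c in enumerate(centroids)]
    print(centroids)                              # roughly [1.5, 9.5]

Platforms such as Spark and Flink (next slides) address this by keeping data and state in main memory across iterations instead of re-launching a job per pass.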
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 38
Distributed Data Analytics Frameworks
– Apache Hadoop
  – Open-source implementation of the original MapReduce from Google; Apache top-level project
  – Java framework, but nowadays also provides a Python interface
  – Parts: own distributed file system (HDFS), job scheduler (YARN), MR framework (Hadoop)

– Apache Spark
  – Distributed cluster computing framework on top of HDFS/YARN
  – Concentrates on main-memory processing and more high-level data flow control
  – Originates from a research project at UC Berkeley

– Apache Flink
  – Efficient data flow runtime on top of HDFS/YARN
  – Similar to Spark, but more emphasis on a built-in dataflow optimiser and pipelined processing
  – Strong for data stream processing
  – Origin: Stratosphere research project by TU Berlin, Humboldt University Berlin and HPI Potsdam

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 39
Distributed Data Analytics Frameworks (continued)
– Apache Hive
  – Provides an SQL-like interface on top of Hadoop / HDFS
  – Allows one to define a relational schema on top of HDFS files, and to query and analyse the data
    with HiveQL (an SQL dialect)
  – Queries are automatically translated to MR jobs and executed in parallel in the cluster
  – Example: WordCount in Hive

    DROP TABLE IF EXISTS docs;
    CREATE TABLE docs (line STRING);
    LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;

    CREATE TABLE word_counts AS
    SELECT word, count(1) AS count
    FROM (SELECT explode(split(line, '\s')) AS word FROM docs) temp
    GROUP BY word
    ORDER BY word;

– Many more high-level frameworks for advanced data analytics.


DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 40
Example: WordCount in Apache Flink (Python)

from flink.plan.Environment import get_environment
from flink.functions.GroupReduceFunction import GroupReduceFunction

env = get_environment()
data = env.read_text("hdfs://…")
data.flat_map(lambda x, c: [(word, 1) for word in x.lower().split()]) \
    .group_by(0) \
    .sum(1) \
    .write_csv("hdfs://…")
env.execute()

[Cf.: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/batch/python.html]
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 42
Summary
– Big Data
  – The three V's: Volume, Velocity and Variety
  – Ethical challenges for Big Data processing
  – Scale-Up versus Scale-Out
– Scale-Agnostic Computation
  – Parallelisable higher-order functions map & reduce
  – MapReduce principle; similarities to SQL
– Scale-Agnostic Data Analytics Platforms
  – Data scientists need more high-level tools and interfaces than MapReduce
  – Examples: Apache Spark, Apache Flink or Apache Hive
  – Componentised infrastructure: SQL querying, ML libraries, streaming, etc.
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 43
Learn More
– DATA3404 Data Science Platforms

DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 44
