12 BigData
DATA2001 "Data Science, Big Data, and Data Diversity" – 2022 (Roehm) 1
Learning Objectives
– Big Data
– The three V's: Volume, Velocity and Variety
– Ethical challenges for Big Data Processing
Big Data
the three Vs: Volume, Velocity and Variety
[cf. article by Doug Laney, 2001]
Big Data: Velocity
– conventional scientific research:
– months to gather data from hundreds of cases, weeks to analyze the data, and years to publish.
– Example: Iris flower data set by Edgar Anderson and Ronald Fisher from 1936
Big Data: Variety
– Structured Data, such as CSV or RDBMS
– Semi-structured Data, such as JSON or XML
– Unstructured Data, e.g. text, e-mails, images, video
– an estimated 80% of enterprise data is unstructured
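To make the difference between structured and semi-structured data concrete, here is a small Python sketch using only the standard library. The records, field names and values are invented for illustration: the CSV rows all share one fixed set of columns, while the JSON records describe themselves and may nest or omit fields.

```python
import csv
import io
import json

# Structured: every row follows the same fixed columns (hypothetical data).
csv_text = "id,name,city\n1,Alice,Sydney\n2,Bob,Perth\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: records are self-describing and may nest or omit fields.
json_text = '[{"id": 1, "name": "Alice", "tags": ["a", "b"]}, {"id": 2, "name": "Bob"}]'
records = json.loads(json_text)

# With CSV every row has a "city" column; with JSON we must handle missing keys.
csv_cities = [row["city"] for row in rows]
json_tags = [rec.get("tags", []) for rec in records]

print(csv_cities)
print(json_tags)
```

Unstructured data such as free text or images has no such field structure at all, so it typically needs an extraction step before it can be queried this way.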
Big Data Examples
Big Data for Consumers (examples)
– Siri, Yelp!, Spotify, Amazon, Netflix, Google Now
– Some Big Data Variety examples:
– "Neighborland" App [https://neighborland.com]
– "WalkScore.com" [https://www.walkscore.com]
Data Science Workflow
Analysing ‘Big Data’ with “scripts”?
[Source: http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/]
Case for Data Science Platforms
– Data is either
– too large (volume),
– too fast (velocity), or
– needs to be combined from diverse sources (variety)
for processing with scripts or on a single server.
– Need for
– scalable platform
– processing abstractions
Jupyter Notebooks as Platform for Big Data?
[Diagram labels: Web Browser, http, Jupyter Server (Python), network]
Case Study: LinkedIn
[Source: https://engineering.linkedin.com/architecture/brief-history-scaling-linkedin]
– Started in 2003
– 2700 members in first week
– Single database and web server
[source: Jim Gray, HPTS99]
The Alternative: Scale-Out [recall Wk5]
State-of-the-Art:
shared-nothing architecture
Case Study: LinkedIn Analytical Architecture
“We have multiple grids divided up based upon purpose.
Hardware:
~800 Westmere-based HP SL 170x, with 2x4 cores, 24GB RAM, 6x2TB SATA
~1900 Westmere-based SuperMicro X8DTT-H, with 2x6 cores, 24GB RAM, 6x2TB SATA
~1400 Sandy Bridge-based SuperMicro with 2x6 cores, 32GB RAM, 6x2TB SATA
…
We use these things for discovering People You May Know and other fun facts.”
LinkedIn via https://wiki.apache.org/hadoop/PoweredBy/
Azkaban
Challenges
Scale-Agnostic Data Management
sharding for performance
replication for availability
ideally such that applications are unaware of underlying complexities
cf. Week 5
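As a rough illustration of sharding and replication, the following Python sketch assigns each key to a primary shard with a stable hash and places copies on the following shards. The function names, the use of MD5 and the "next shards" replica placement are assumptions made for this example, not the scheme of any particular system.

```python
import hashlib
from typing import List


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a stable hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def replicas_for(key: str, num_shards: int, replication: int) -> List[int]:
    """Return the primary shard followed by the next (replication - 1) shards, wrapping around."""
    primary = shard_for(key, num_shards)
    return [(primary + i) % num_shards for i in range(replication)]


# Hypothetical keys: each user record lives on two of four shards.
placement = {user: replicas_for(user, num_shards=4, replication=2)
             for user in ["alice", "bob", "carol"]}
print(placement)
```

An application that only calls a lookup such as `replicas_for` never needs to know how many machines exist, which is the sense in which the data management is scale-agnostic.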
Scale-Agnostic Data Processing
Nowadays we collect massive amounts of data; how can we analyze it?
Answer: use lots of machines… (hundreds/thousands of CPUs, can grow)
Performance: parallel processing
Availability: ideally, the system is never down and handles failures transparently
=> Map/Reduce processing paradigm
Scale-Agnostic Data Analysis
Big Data Analytics Stack
– Layered stack of frameworks for distributed data management and processing
– Many choices of distributed data processing platforms
Application
Storage
Infrastructure
[slide by Ion Stoica, UCB, 2013]
MapReduce Overview
– Scan large volumes of data
– Map: Extract some interesting information
– Shuffle and sort intermediate results
– Reduce: aggregate intermediate results
– Generate final output
– Key idea: provide an abstraction at the point of these two operations (map
and reduce)
– Higher-order functions
– Cf. map functions in functional programming languages such as Lisp or
Haskell
MapReduce Paradigm
– Functional Programming approach to data processing
– map() : applies a given function f to all elements of a collection;
returns a new collection
– map (f, originalList)
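A minimal Python sketch of map as a higher-order function, using the built-in `map`. The function `square` and the list contents are chosen only for illustration.

```python
def square(x):
    """The function f that is applied to every element."""
    return x * x


original_list = [1, 2, 3, 4]

# map(f, originalList): apply f to each element and collect a new list.
squared = list(map(square, original_list))

print(squared)        # [1, 4, 9, 16]
print(original_list)  # [1, 2, 3, 4] (the input collection is not modified)
```

Because `square` looks at one element at a time and keeps no state, the elements could be processed in any order or on different machines.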
[Diagram from Yahoo! Hadoop Tutorial]
MapReduce Paradigm: Reduce()
reduce(): applies a given function g to all elements of an input list;
produces, starting from a given initial value, a single (aggregate) output value
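A minimal Python sketch of reduce, using `functools.reduce` with an explicit initial value. The functions and inputs are illustrative only.

```python
from functools import reduce


def add(accumulator, element):
    """The function g: combine the running result with the next element."""
    return accumulator + element


# Starting from the initial value 0: ((((0 + 1) + 2) + 3) + 4)
total = reduce(add, [1, 2, 3, 4], 0)

# The same pattern with a different g: keep the longest word seen so far.
longest = reduce(lambda best, w: w if len(w) > len(best) else best,
                 ["map", "shuffle", "reduce"], "")

print(total)    # 10
print(longest)  # shuffle
```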
[Diagram from Yahoo! Hadoop Tutorial]
Similarities between SQL-Queries and MapReduce
– A standard map-reduce task is similar in its functionality to declarative
aggregation queries in SQL:
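The correspondence can be sketched in Python by computing the same aggregate twice: once as a SQL `GROUP BY` query (here run through the standard-library `sqlite3` module) and once with explicit map, shuffle and reduce steps. The `sales` table and its values are invented for this example.

```python
import sqlite3
from collections import defaultdict
from functools import reduce

# Hypothetical input rows: (category, amount)
sales = [("books", 12.0), ("music", 5.0), ("books", 8.0), ("music", 7.5)]

# Declarative version: SELECT ... GROUP BY ... with an aggregate function.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)
sql_result = dict(conn.execute(
    "SELECT category, SUM(amount) FROM sales GROUP BY category"))
conn.close()

# MapReduce-style version of the same computation.
mapped = map(lambda row: (row[0], row[1]), sales)   # map: pick key and value (like SELECT)
groups = defaultdict(list)
for key, value in mapped:                           # shuffle: group values by key (like GROUP BY)
    groups[key].append(value)
mr_result = {key: reduce(lambda a, b: a + b, values, 0.0)   # reduce: aggregate (like SUM)
             for key, values in groups.items()}

print(sql_result)
print(mr_result)
```

Roughly, the map step plays the role of the SELECT list and any WHERE filter, the shuffle corresponds to GROUP BY, and the reduce step to the aggregate function.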
Example: Word Count program
– Word Count programmed as standard linear program
– Two nested for loops
– Difficult to generalise or parallelise
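A sequential version along these lines might look as follows in Python. The document contents are made up, and this is a sketch of the nested-loop approach rather than the exact program from the lecture.

```python
# Hypothetical input: document ID -> text content
documents = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy dog and the fox",
}

counts = {}
for doc_id, text in documents.items():      # outer loop: over all documents
    for word in text.lower().split():       # inner loop: over the words of one document
        counts[word] = counts.get(word, 0) + 1

print(counts)
```

Both loops update the single shared dictionary `counts`, which is what makes this version hard to split across several machines.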
Example: Word Count
– Input:
– List of documents that contain text
– Provided to MapReduce in the form of
(k: documentID, v: textcontent) pairs
– Goal:
– Determine which words occur in the documents and how often
– E.g. for text indexing…
MapReduce Approach
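To solve the same problem using MapReduce, we need a map function, a reduce function, and driver code that passes the intermediate results from one to the other.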
Example: Word Count in MapReduce
– Word Count programmed using Map/Reduce paradigm
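One possible single-machine sketch of word count in the map/reduce style, written in plain Python. The names `mapper`, `reducer` and `word_count` are chosen for this example; a real framework would run the mappers and reducers in parallel and perform the shuffle over the network.

```python
from collections import defaultdict


def mapper(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield (word, 1)


def reducer(word, counts):
    """Reduce: aggregate all intermediate values that share the same key."""
    return (word, sum(counts))


def word_count(documents):
    """Driver: run the mappers, shuffle by key, then run the reducers."""
    intermediate = defaultdict(list)
    for doc_id, text in documents.items():
        for word, one in mapper(doc_id, text):
            intermediate[word].append(one)          # shuffle: group values by word
    return dict(reducer(word, counts)
                for word, counts in sorted(intermediate.items()))


result = word_count({"doc1": "the quick brown fox",
                     "doc2": "the lazy dog and the fox"})
print(result)
```

Each call to `mapper` depends only on its own document and each call to `reducer` only on the values of one word, so both phases can be distributed without shared state.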
MapReduce Generalised
– Previous example was hard-coded for word count
– We can generalise the pattern for the driver code even further
mapper and reducer are now also inputs
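A generalised driver could be sketched as below: the hypothetical function `map_reduce` knows nothing about word counting and receives the mapper and reducer as arguments. The two usages (word count and a per-sensor maximum) use invented data.

```python
from collections import defaultdict


def map_reduce(inputs, mapper, reducer):
    """Generic driver.

    inputs:  iterable of (key, value) pairs
    mapper:  function (key, value) -> iterable of (out_key, out_value) pairs
    reducer: function (out_key, list_of_out_values) -> aggregated value
    """
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)       # shuffle: group by output key
    return {out_key: reducer(out_key, values)
            for out_key, values in sorted(groups.items())}


# Usage 1: word count
docs = [("d1", "a b a"), ("d2", "b c")]
wc = map_reduce(docs,
                lambda doc_id, text: ((w, 1) for w in text.split()),
                lambda word, ones: sum(ones))

# Usage 2: maximum reading per sensor, with the same driver
readings = [("s1", 3.0), ("s2", 7.5), ("s1", 4.5)]
mx = map_reduce(readings,
                lambda sensor, value: [(sensor, value)],
                lambda sensor, values: max(values))

print(wc)
print(mx)
```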
Pros:
– very flexible due to the user-defined functions
– great scalability because of the FP approach
– easy parallelism due to stateless functions
– fault-tolerance
Cons:
– requires programming skills and functional thinking
– relatively low-level; even filtering has to be coded manually
– complex frameworks
– batch-processing oriented
Distributed, Dataflow-Oriented Analytics Platforms
Challenge: Iterative Algorithms
– Many data mining and machine learning algorithms rely on global state and
iterations
– Examples:
– data clustering (eg. k-Means)
– frequent itemset mining (eg. Apriori algorithm)
– linear regression
– collaborative filtering
– PageRank
– …
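To see why iteration is awkward for plain MapReduce, consider this Python sketch of one-dimensional k-Means, where every iteration is one complete map/shuffle/reduce pass and the centroids are global state carried between passes. The function names, data points and starting centroids are assumptions for illustration.

```python
from collections import defaultdict


def kmeans_iteration(points, centroids):
    """One iteration of k-Means expressed as a single map/shuffle/reduce pass."""
    # Map: emit (index of nearest centroid, point) for every point.
    assigned = [(min(range(len(centroids)), key=lambda i: abs(p - centroids[i])), p)
                for p in points]
    # Shuffle: group the points by centroid index.
    groups = defaultdict(list)
    for idx, p in assigned:
        groups[idx].append(p)
    # Reduce: the new centroid is the mean of its group (unchanged if the group is empty).
    return [sum(groups[i]) / len(groups[i]) if groups[i] else centroids[i]
            for i in range(len(centroids))]


def kmeans(points, centroids, max_iterations=20):
    """Driver loop: repeat the pass until the centroids stop changing."""
    for _ in range(max_iterations):
        new_centroids = kmeans_iteration(points, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids               # global state passed to the next pass
    return centroids


result = kmeans([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
print(result)  # [2.0, 11.0]
```

In a batch-oriented MapReduce system each pass would be a separate job that re-reads its input and writes its output to storage, which is one motivation for the dataflow-oriented platforms discussed next.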
Distributed Data Analytics Frameworks
– Apache Hadoop
– Open-source implementation of original MapReduce from Google; Apache top-level project
– Java framework, but also provides a Python interface nowadays
– Parts: own distributed file system (HDFS), job scheduler (YARN), MR framework (Hadoop)
– Apache Spark
– Distributed cluster computing framework on top of HDFS/YARN
– Concentrates on main-memory processing and more high-level data flow control
– Originates from research project from UC Berkeley
– Apache Flink
– Efficient data flow runtime on top of HDFS/YARN
– Similar to Spark, but with more emphasis on a built-in dataflow optimiser and pipelined processing
– Strong for data stream processing
– Origin: Stratosphere research project by TU Berlin, Humboldt University Berlin and HPI Potsdam
Distributed Data Analytics Frameworks (continued)
– Apache Hive
– Provides an SQL-like interface on top of Hadoop / HDFS
– Allows defining a relational schema on top of HDFS files, and querying and analysing the data with HiveQL (an SQL dialect)
– Queries are automatically translated to MR jobs and executed in parallel in the cluster
– Example: WordCount in HIVE
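Example: WordCount in Apache Flink (Python batch API)
# Legacy Flink Python batch API, as documented for release 1.2 (see link below)
from flink.plan.Environment import get_environment

env = get_environment()
data = env.read_text("hdfs://…")
# emit a (word, 1) tuple per word, group by the word (field 0), sum the counts (field 1)
data.flat_map(lambda x, c: [(word, 1) for word in x.lower().split()]) \
    .group_by(0) \
    .sum(1) \
    .write_csv("hdfs://…")
env.execute()
[Cf.: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/batch/python.html]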
Summary
– Big Data
– The three V's: Volume, Velocity and Variety
– Ethical challenges for Big Data Processing
– Scale-Up versus Scale-Out
– Scale-Agnostic Computation
– Parallelisable higher-order functions map & reduce
– MapReduce principle; similarities to existing material and SQL
– Scale-Agnostic Data Analytics Platforms
– Data Scientists need more high-level tools and interfaces than MapReduce
– Examples: Apache Spark or Apache Flink or Apache Hive
– Componentized infrastructure: SQL querying, ML-Libraries, Streaming, etc.
Learn More
– DATA3404 Data Science Platforms