Donghui Zhang
dzhang@BigAnalyticsPlatform.com
2017-5-4
Host: NECINA DIG
Co-Host: MIT CSSA
Your Background
• Familiar with big-data analytics?
  • Value = show you what’s “under the hood”.
• Familiar with big-data platforms?
  • Mostly review; Value = think about my opinions.
• Just curious?
  • Value = general awareness.
• Not interested in big data?
  • You are in the wrong room.
http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
Disclaimer
• The opinions expressed on this site are mine and do not necessarily represent those of my employer.
• BigAnalyticsPlatform.com is my personal blogging site. I currently work at Facebook.
Content
• Why
• What
• History
• Technical How-Tos
• Career Advice
• Conclusions
Why Big Data? Data Grows Fast
• Data in the world:
  • 10 billion TB
  • 90% was produced in the last 2 years!
Source: Mikal Khoso. “How Much Data is Produced Every Day?”
http://www.northeastern.edu/levelblog/2016/05/13/how-much-data-produced-every-day
Why Big-Data Platform?
• Platform can be a competitive advantage.
• Enable junior developers to quickly create robust applications.
• Google thinks of itself as a systems engineering company.
Quote source: Todd Hoff. “Google Architecture”.
http://highscalability.com/google-architecture
All biggies have big data platforms
[Bar chart: market cap (billion $) by company and year founded. Data source: Yahoo Finance on 1/3/2017.]
Company (year founded): market cap in billion $
IBM (1911): 159
Samsung (1938): 208
Intel (1968): 174
SAP (1972): 106
Microsoft (1975): 504
Apple (1976): 616
Oracle (1977): 156
Amazon (1994): 357
Google (1998): 547
Tencent (1998): 234
Alibaba (1999): 222
Facebook (2004): 338
All biggies have big data platforms
Chart annotation: top 3 cloud service providers
All biggies have big data platforms
Chart annotation: Larry Ellison: “Amazon’s lead is over”
All biggies have big data platforms
Chart annotation: Apple “Pie”
All biggies have big data platforms
Chart annotation: Samsung bought Joyent
All biggies have big data platforms
Chart annotation (GraySort results; see http://sortbenchmark.org):
• Alibaba 2015: 377 sec (3,377 nodes, Apsara)
• Tencent 2016: 134 sec (512 nodes, OpenPower)
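The two GraySort times are easier to compare as throughput. The sketch below is only a back-of-the-envelope calculation: it assumes both runs sorted 100 TB (the GraySort minimum input size, which the slide does not state), and the function name `throughput` is made up for illustration.

```python
# Assumption: both runs sorted 100 TB (the GraySort minimum input size).
DATA_TB = 100

# (label, elapsed seconds, node count) as quoted on the slide.
runs = [
    ("Alibaba 2015 (Apsara)", 377, 3377),
    ("Tencent 2016 (OpenPower)", 134, 512),
]

def throughput(elapsed_sec, nodes, data_tb=DATA_TB):
    """Return (cluster TB per minute, GB per second per node)."""
    tb_per_min = data_tb / elapsed_sec * 60
    gb_per_sec_per_node = data_tb * 1000 / elapsed_sec / nodes
    return tb_per_min, gb_per_sec_per_node

for label, sec, nodes in runs:
    tb_min, gb_node = throughput(sec, nodes)
    print(f"{label}: {tb_min:.1f} TB/min, {gb_node:.2f} GB/s per node")
```

Under that assumption, the 2016 run is faster overall and also moves far more data per node, which suggests the gain came from more than just adding machines.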
What is Big Data?
• Big data sets
  • e.g. “This year our users uploaded 10X more videos; we have big data now.”
  • big volume, big variety, or big velocity
  • exceed existing data processing capabilities
• Big data analytics
  • e.g. “We use big data to predict stock trends.”
• Big data stack
  • software
  • platform
  • infrastructure
The Big Data Stack
[Layer diagram, top to bottom]
• Analytics: Reports, docs, ad hoc scripts...
• Analytics Software: Think SaaS such as Microsoft Office 365. Software that data scientists can use.
• Platform: Think PaaS such as Google App Engine. A platform for developing software.
• Infrastructure: Think IaaS such as AWS EC2. Networked VMs.
Google Stack
[Layer diagram, top to bottom]
• Products: search, advertising, Gmail, Docs, Maps, YouTube, cloud platform, …
• Platform: GFS/Colossus, BigTable, Spanner, MapReduce/Cloud Dataflow, Chubby, Borg/Omega
• Infrastructure: Custom-built machines; Red Hat Linux
Sample Open-Source Stack
[Layer diagram, top to bottom]
• Analytics: Python scripts
• Analytics Software: Tableau, scikit-learn
• Platform: Spark on YARN with Hive
• Infrastructure: VMs
5 V’s of Big Data
• Volume
• Variety
• Velocity
• Veracity
• Value
5 V’s source: Jason Williamson. “The 4 V’s of Big Data”.
http://www.dummies.com/careers/find-a-job/the-4-vs-of-big-data
5 V’s of Big Data: Volume
“Your small data can be my big data!”
5 V’s of Big Data: Variety
Lessons
• A key feature missing in RDBMS is variety.
RDBMS guru: “Put your data in a database!”
Scientist: “My data is not relational.”
RDBMS guru: “Make your data relational!”
Scientist: “But it is not relational!”
5 V’s of Big Data: Velocity
• Streaming.
• ETL → ELT: Load first, transform later.
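"Load first, transform later" can be illustrated with a small sketch. The events, the table name `raw_events`, and the function `upload_bytes_per_user` are all hypothetical. An in-memory SQLite database stands in for whatever store actually receives the data.

```python
import json
import sqlite3

# Hypothetical raw events; in ELT they are stored exactly as they arrive.
raw_events = [
    '{"user": "alice", "action": "upload", "bytes": 1200}',
    '{"user": "bob", "action": "view"}',
    '{"user": "alice", "action": "upload", "bytes": 800}',
]

conn = sqlite3.connect(":memory:")

# Load: no parsing and no schema enforcement, just keep the payload.
conn.execute("CREATE TABLE raw_events (payload TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?)", [(e,) for e in raw_events])

# Transform: done later, and only for the question being asked.
def upload_bytes_per_user(connection):
    totals = {}
    for (payload,) in connection.execute("SELECT payload FROM raw_events"):
        event = json.loads(payload)
        if event.get("action") == "upload":
            user = event["user"]
            totals[user] = totals.get(user, 0) + event.get("bytes", 0)
    return totals

print(upload_bytes_per_user(conn))
```

Because the raw payloads are kept, a different question later (say, views per user) needs only a new transform, not a reload.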
5 V’s of Big Data: Value
“If you are going to use a big data system for yourself, see if it is faster than your laptop.” (Frank McSherry)
Source: Frank McSherry. “Scalability! But at what COST?”
http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
Lessons
• Do big data to increase business value, not for the sake of tech.
• Read a book on building a startup.
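McSherry's article calls this comparison COST: the Configuration that Outperforms a Single Thread. As a rough sketch of the check, with entirely made-up timings and a hypothetical helper named `cost`:

```python
# Hypothetical measurements: seconds to finish the same job.
single_thread_sec = 300.0  # a competent single-threaded run on a laptop
cluster_sec_by_cores = {1: 1500.0, 8: 420.0, 16: 260.0, 64: 110.0, 128: 90.0}

def cost(baseline_sec, scaling):
    """Smallest core count whose runtime beats the baseline, or None if none does."""
    for cores in sorted(scaling):
        if scaling[cores] < baseline_sec:
            return cores
    return None

print(cost(single_thread_sec, cluster_sec_by_cores))
```

With these invented numbers the system scales well (1,500 s down to 90 s), yet it needs 16 cores before it beats the laptop. That gap is what the quote warns about.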
5 V’s of Big Data: Veracity
Data scientist: “My analysis suggested this billion-dollar action.”
Manager: “Where was the data from?”
Lessons
• Use Data Lakes, not Data Swamps.
• Read Russom’s “Best Practices for Data Lake Management”.
Source: Philip Russom. “Best Practices for Data Lake Management”.
https://tdwi.org/research/2016/10/checklist-data-lake-management.aspx
Big Data History
“What goes around comes around.” (Mike Stonebraker)
“Everything has prior art.” (David DeWitt)
Big Data History
• 1969: relational model (Edgar F. Codd*)
• 1976: System R by IBM (Jim Gray*; transactions)
• 1986: Postgres (Mike Stonebraker*; ADT)
• 1990: Gamma (David DeWitt; shared nothing)
• 2004: MapReduce (Jeff Dean; flexibility)
• 2005: “One size doesn’t fit all” (Mike Stonebraker)
• 2006: Hadoop (Doug Cutting)
• 2011: Spark (Matei Zaharia)
• 2017: Death of shared nothing (David DeWitt)
* Turing Award winners (1981, 1998, 2014). http://amturing.acm.org/byyear.cfm
Big Data History
Lessons
• Don’t reinvent the wheel.
• Read the editors’ intro for “the red book”.
• Read “Architecture of a Database System”.
• Study favorite posts on HighScalability.
The red book: Bailis, Hellerstein, Stonebraker. “Readings in Database Systems”, 5th Ed. http://www.redbook.io
HighScalability: http://highscalability.com/all-time-favorites
How to Scale to Many Servers?
• When your data is small
[Diagram: clients, server]
How to Scale to Many Servers?
• Use a load balancer
[Diagram: clients, LB, servers]
How to Scale to Many Servers?
• Round-Robin DNS, Point of Presence, multi-level LB.
[Diagram: clients, several POPs, LB, servers]
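The simplest policy a load balancer (or round-robin DNS) can apply is a fixed rotation over the servers. A minimal sketch follows; the class `RoundRobinBalancer` and the server names are hypothetical. A production balancer would also track health and load.

```python
import itertools

class RoundRobinBalancer:
    """Hand each incoming request to the next server in a fixed rotation."""

    def __init__(self, servers):
        servers = list(servers)
        if not servers:
            raise ValueError("need at least one server")
        self._rotation = itertools.cycle(servers)

    def pick(self):
        """Return the server that should handle the next request."""
        return next(self._rotation)

# Hypothetical server names.
lb = RoundRobinBalancer(["server-a", "server-b", "server-c"])
assignments = [lb.pick() for _ in range(5)]
print(assignments)
```

Multi-level balancing repeats the same idea at each tier: DNS rotates across POPs, and each POP rotates across its own servers.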
Google Cluster at the Beginning
Image source: Abhijeet Desai. “Google Cluster Architecture”.
http://www.slideshare.net/abhijeetdesai/google-cluster-architecture
Google Belgium Data Center
Image source: Malte Schwarzkopf. “What does it take to make Google work at scale”.
https://docs.google.com/presentation/d/1OvJStE8aohGeI3y5BcYX8bBHwoHYCPu99A3KTTZElr0
Google Data Centers
• About 40 data centers
• About 2 million machines
• Machines are organized in containers, each having 1,160 machines
  • 30 racks of 40 machines
  • Sometimes double stacked
Data sources:
James Pearn, “How many servers does Google have?”
https://plus.google.com/+JamesPearn/posts/VaQu9sNxJuY
“Learn How Google Works: in Gory Detail”.
http://www.ppcblog.com/how-google-works
Google Data Size
• Data too large
  • 130 trillion pages
  • Index 100 PB (stacking 2 TB drives up: 0.8 mile)
• Demand too much
  • 3 billion searches per day (or 35K per second)
Data sources:
https://www.google.com/insidesearch/howsearchworks/thestory
http://www.seobook.com/learn-seo/infographics/how-search-works.php
http://www.ppcblog.com/how-google-works
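The two derived figures on this slide (35K searches per second and a 0.8-mile stack of drives) can be checked with a few lines of arithmetic. The drive height of about one inch is an assumption not stated on the slide, as is taking 1 PB = 1000 TB.

```python
SECONDS_PER_DAY = 24 * 60 * 60
INCHES_PER_MILE = 63_360

searches_per_day = 3_000_000_000
index_tb = 100 * 1000      # 100 PB, assuming 1 PB = 1000 TB
drive_tb = 2
drive_height_in = 1.0      # assumption: roughly 1 inch per 3.5-inch drive

searches_per_sec = searches_per_day / SECONDS_PER_DAY
drives = index_tb // drive_tb
stack_miles = drives * drive_height_in / INCHES_PER_MILE

print(f"{searches_per_sec:,.0f} searches/s, {drives:,} drives, {stack_miles:.2f} miles")
```

Both results round to the numbers quoted on the slide.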
How to Evaluate a Distributed System
• Well-known goals
  • Useful (solve your business need)
  • Performant (high throughput, low latency)
  • Elastic (you may add/remove nodes)
  • Scalable (adding nodes improves performance)
  • Fault tolerant (deal with failures)
• In addition, I’d advocate
  • Flexible (scaling, model, interface, architecture)
Shared Nothing → Shared Storage
• For 30 years, data warehouses (DW) were shared nothing: Gamma, Teradata, Netezza, Vertica, DB2/PE, SQL Server PDW, Greenplum, Asterdata, SciDB
• Now they are all shared storage: Redshift Spectrum, Snowflake, Microsoft SQL DW, Google BigQuery
Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
Why Shared Storage? Flexible Scaling!
[Figure annotation: in minutes]
Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
Case Study: Snowflake (flexible scaling)
[Architecture diagram, three layers]
• Cloud services: authentication & access control, query optimizer, transaction manager, infrastructure manager, security, metadata storage
• Compute layer: virtual warehouses, each a cluster of EC2 instances (shown with 4, 2, and 8 nodes) with a data cache. These disks are strictly used as caches.
• Data storage: S3. Database tables stored here.
Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
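The property the diagram highlights is that tables live only in shared storage while compute nodes hold disposable caches. A toy model can show why that makes resizing cheap. This is not Snowflake's implementation: the classes `SharedStore` and `VirtualWarehouse`, the block keys, and the cache-reset-on-resize behavior are all simplifying assumptions.

```python
class SharedStore:
    """Stands in for S3: the only place table data lives."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.reads = 0  # how many times compute had to go to shared storage

    def get(self, key):
        self.reads += 1
        return self.blocks[key]


class VirtualWarehouse:
    """A resizable group of compute nodes; local disks only cache blocks."""

    def __init__(self, store, nodes):
        self.store = store
        self.caches = [{} for _ in range(nodes)]

    def resize(self, nodes):
        # No table data is repartitioned: nodes simply start with empty caches.
        self.caches = [{} for _ in range(nodes)]

    def read(self, key):
        cache = self.caches[hash(key) % len(self.caches)]
        if key not in cache:
            cache[key] = self.store.get(key)  # cache miss: fetch from shared storage
        return cache[key]


store = SharedStore({"t1/part0": [1, 2], "t1/part1": [3, 4]})
small = VirtualWarehouse(store, nodes=2)
big = VirtualWarehouse(store, nodes=8)

small.read("t1/part0")
small.read("t1/part0")   # served from the local cache
big.read("t1/part0")     # a second warehouse reads the same table independently
print(store.reads)
```

In a shared-nothing design the equivalent of `resize` would have to move table partitions between nodes. Here it only discards caches, which is consistent with the "in minutes" scaling claimed for shared storage.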
Case Study: Spark
[Architecture diagram, layer by layer]
SparkSQL | ML | Streaming | GraphX
Spark Core
RDD API | DataFrame API
Standalone | YARN | Mesos | Local
Java/Scala/Python/R shell/script
Case Study: Spark (Flexible Model)
• Not only SQL, but also ML, streaming, graph.
Case Study: Spark (Flexible Interface)
• You could access Spark using traditional JDBC.
• Also, interactive session (in multiple languages).
• Also, submit a script as a task.
Case Study: Spark (Flexible Architecture)
• May deploy on top of existing YARN or Mesos.
• Could also be standalone.
• Possible to add components.
How to Evaluate a Distributed System
• In addition to well-known goals
  • Useful, Performant, Elastic, Scalable, Fault tolerant
• I’d advocate
  • Flexible (scaling, model, interface, architecture)
Lessons
• Flexibility is an important metric.
• Spark is a flexible system.
• Cloud DW: shared storage.
Growing Need for Big Data Jobs
[Job-trend chart annotation: 10X in 5 years]
Source: https://www.indeed.com/jobtrends
Big Data Roles
• Chief Data Officer
• Data Scientist
• Data Engineer
• Solutions Architect
• Big Data Strategist
• ...... at least 15 more
Source: “Top 20 Big Data jobs and their responsibilities”.
http://bigdata-madesimple.com/top-20-big-data-jobs-and-their-responsibilities
If You Want to Do Analytics
• Python
  • NumPy, Jupyter Notebook
• Machine Learning
  • scikit-learn
• Practice at http://DrivenData.org
If You Want to Do Big Data Platform
• Only for senior engineers
• Practice at http://LeetCode.com
• Embrace open source
• Assemble a solution; don’t build from scratch
• Consulting business: target medium-sized companies
If You Want to Build a Startup
• Read some books about building a startup
• Don’t assume you know users’ pain points
• Throw away prototype code
• Three key people must have a good working relationship: What-To-Do, How-To-Do, and When-To-Do
• When in doubt, keep it simple
• Strive for a clean API (external and internal)
• Do one thing really well first
Stonebraker’s Startup Loop
while (true)
{
1. Talk with users to find their pain;
2. Brainstorm with professors;
3. Recruit students to build a prototype;
4. Draw a quadrant; E.g. axes Small/Big and Simple/Complex
5. Co-found a VC-backed startup; E.g. Streambase, Vertica, VoltDB, Paradigm4, Tamr, …
6. Play banjo; write papers; give talks; receive awards; E.g. Received ACM Turing Award 2014
}
Conclusions
• All “biggies” have big-data platforms
• Shared nothing → shared storage
• Leverage open source: pick/compose/expand
• Flexibility is a key metric for distributed systems

Big Data Platform Landscape by 2017

  • 2. Your Background  Familiar with big-data analytics?  Value = show you what’s “under the hood”.  Familiar with big-data platform?  Mostly review; Value = think about my opinions.  Just curious?  Value = general awareness.  Not interested in big data?  You are in the wrong room. http://BigAnalyticsPlatform.com 2(C) 2017 Donghui Zhang
  • 3. Disclaimer  The opinions expressed on this site are mine and do not necessarily represent those of my employer.  BigAnalyticsPlatform.com is my personal blogging site. I currently work at Facebook. 3http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
  • 4. Content  Why  What  History  Technical How-Tos  Career Advice  Conclusions http://BigAnalyticsPlatform.com 4(C) 2017 Donghui Zhang
  • 5. Why Big Data? Data Grows Fast  Data in the world:  10 billion TB  90% was produced in the last 2 years! 5 Source: Mikal Khoso. “How Much Data is Produced Every Day?” http://www.northeastern.edu/levelblog/2016/05/13/how-much-data-produced-every-day http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
  • 6. Why Big-Data Platform?  Platform can be a competitive advantage.  Enable junior developers to quickly create robust applications.  Google thinks of itself as a systems engineering company. 6 Quote source: Todd Hoff. “Google Architecture”. http://highscalability.com/google-architecture http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
  • 7. 7 Data source: Yahoo Finance on 1/3/2017. 159 208 174 106 504 616 156 357 547 234 222 338 0 100 200 300 400 500 600 700 IBM Samsung Intel SAP Microsoft Apple Oracle Amazon Google Tencent Alibaba Facebook 1911 193819681972 19751976197719941998199819992004 Marketcap(billion$) Company + year founded All biggies have big data platforms http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
  • 8. 8 Data source: Yahoo Finance on 1/3/2017. 159 208 174 106 504 616 156 357 547 234 222 338 0 100 200 300 400 500 600 700 IBM Samsung Intel SAP Microsoft Apple Oracle Amazon Google Tencent Alibaba Facebook 1911 193819681972 19751976197719941998199819992004 Marketcap(billion$) Company + year founded All biggies have big data platforms top 3 cloud service providers http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
  • 9. 9 Data source: Yahoo Finance on 1/3/2017. 159 208 174 106 504 616 156 357 547 234 222 338 0 100 200 300 400 500 600 700 IBM Samsung Intel SAP Microsoft Apple Oracle Amazon Google Tencent Alibaba Facebook 1911 193819681972 19751976197719941998199819992004 Marketcap(billion$) Company + year founded All biggies have big data platforms Larry Ellison: “Amazon’s lead is over” http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
  • 10. 10 Data source: Yahoo Finance on 1/3/2017. 159 208 174 106 504 616 156 357 547 234 222 338 0 100 200 300 400 500 600 700 IBM Samsung Intel SAP Microsoft Apple Oracle Amazon Google Tencent Alibaba Facebook 1911 193819681972 19751976197719941998199819992004 Marketcap(billion$) Company + year founded All biggies have big data platforms Apple “Pie” http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
  • 11. 11 Data source: Yahoo Finance on 1/3/2017. 159 208 174 106 504 616 156 357 547 234 222 338 0 100 200 300 400 500 600 700 IBM Samsung Intel SAP Microsoft Apple Oracle Amazon Google Tencent Alibaba Facebook 1911 193819681972 19751976197719941998199819992004 Marketcap(billion$) Company + year founded All biggies have big data platforms Samsung bought Joyant http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
  • 12. 12 Data source: Yahoo Finance on 1/3/2017. 159 208 174 106 504 616 156 357 547 234 222 338 0 100 200 300 400 500 600 700 IBM Samsung Intel SAP Microsoft Apple Oracle Amazon Google Tencent Alibaba Facebook 1911 193819681972 19751976197719941998199819992004 Marketcap(billion$) Company + year founded All biggies have big data platforms Alibaba 2015: 377 sec (3,377 nodes Apsara) Tencent 2016: 134 sec (512 nodes OpenPower) Gray sort. See http://sortbenchmark.org http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
  • 13. Content  Why  What  History  Technical How-Tos  Career Advice  Conclusions http://BigAnalyticsPlatform.com 13(C) 2017 Donghui Zhang
  • 14. What is Big Data?  Big data sets  e.g. “This year our users uploaded 10X more videos; we have big data now.”  big volume, big variety, or big velocity  exceed existing data processing capabilities  Big data analytics  e.g. “We use big data to predict stock trends.”  Big data stack  software  platform  infrastructure 14http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
  • 15. The Big Data Stack 15 Analytics Infrastructure Think IaaS such as AWS EC2. Networked VMs. Platform Think PaaS such as Google App Engine. A platform for developing software. Analytics Software Think SaaS such as Microsoft Office 365. Software that Data Scientists can use. Reports, docs, ad hoc scripts... http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
  • 16. Google Stack 16 Infrastructure Platform Products Custom-built machines; RedHat Linux GFS/Colossus, BigTable, Spanner, MapReduce/Cloud Dataflow, Chubby, Borg/Omega search, advertising, gmail, docs, maps, youtube, cloud platform, … http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
  • 17. Sample Open-Source Stack 17 Infrastructure Platform Analytics Software Analytics VMs Spark on YARN with Hive Tableau, scikit-learn Python scripts http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
  • 18. 5 V’s of Big Data  Volume  Variety  Velocity  Veracity  Value 18 5V’s source: Jason Williamson. “The 4 V’s of Big Data”. http://www.dummies.com/careers/find-a-job/the-4-vs-of-big-data http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
  • 19. 5 V’s of Big Data  Volume  Variety  Velocity  Value  Veracity 19 “Your small data can be my big data!” http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
  • 20. 5 V’s of Big Data  Volume  Variety  Velocity  Value  Veracity 20 Lessons • A key feature missing in RDBMS is variety. RDBMS guru: “Put you data in a database!” Scientist: “My data is not relational.” RDBMS guru: “Make your data relational!” Scientist: “But it is not relational!” http://BigAnalyticsPlatform.com(C) 2017 Donghui Zhang
  • 21. 5 V’s of Big Data (Volume, Variety, Velocity, Value, Veracity). On Velocity: streaming. ETL → ELT: load first, transform later.
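
The "load first, transform later" idea from slide 21 can be sketched as follows. This is a minimal illustration, not taken from the talk: the table names `raw_events` and `clicks`, the JSON record shape, and the use of an in-memory SQLite database are all assumptions chosen to keep the example self-contained.

```python
import json
import sqlite3


def load_raw(conn, lines):
    """Load step: store every record verbatim, with no parsing or cleaning."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")
    conn.executemany(
        "INSERT INTO raw_events (payload) VALUES (?)",
        [(line,) for line in lines],
    )


def transform(conn):
    """Transform step, run later: parse raw payloads into a typed table."""
    conn.execute("CREATE TABLE IF NOT EXISTS clicks (user TEXT, n INTEGER)")
    conn.execute("DELETE FROM clicks")  # rebuild the derived table from scratch
    rows = []
    for (payload,) in conn.execute("SELECT payload FROM raw_events"):
        try:
            record = json.loads(payload)
            rows.append((str(record["user"]), int(record["clicks"])))
        except (ValueError, KeyError, TypeError):
            continue  # malformed rows remain in raw_events for later inspection
    conn.executemany("INSERT INTO clicks VALUES (?, ?)", rows)
    return len(rows)


# Hypothetical input: two well-formed records and one malformed line.
conn = sqlite3.connect(":memory:")
load_raw(conn, ['{"user": "a", "clicks": 3}', "not json", '{"user": "b", "clicks": 5}'])
loaded = conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
transformed = transform(conn)
total = conn.execute("SELECT SUM(n) FROM clicks").fetchone()[0]
```

Because the raw payloads are kept, the transform can be changed and rerun without re-ingesting the data, which is the practical appeal of ELT over ETL when data arrives quickly.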
  • 22. 5 V’s of Big Data (Volume, Variety, Velocity, Value, Veracity). On Value: “If you are going to use a big data system for yourself, see if it is faster than your laptop.” (Frank McSherry) Lessons: do big data to increase business value, not for the sake of technology; read a book on building a startup. Source: Frank McSherry. “Scalability! But at what COST?” http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
  • 23. 5 V’s of Big Data (Volume, Variety, Velocity, Value, Veracity). On Veracity: Data scientist: “My analysis suggested this billion-dollar action.” Manager: “Where was the data from?” Lessons: use Data Lakes, not Data Swamps; read Russom’s “Best Practices for Data Lake Management”. Source: Philip Russom. “Best Practices for Data Lake Management”. https://tdwi.org/research/2016/10/checklist-data-lake-management.aspx
  • 24. Content: Why, What, History, Technical How-Tos, Career Advice, Conclusions.
  • 25. Big Data History. “What goes around comes around.” (Mike Stonebraker) “Everything has prior art.” (David DeWitt)
  • 26. Big Data History. 1969: relational model (Edgar F. Codd*). 1976: System R by IBM (Jim Gray*; transactions). 1986: Postgres (Mike Stonebraker*; ADT). 1990: Gamma (David DeWitt; shared nothing). 2004: MapReduce (Jeff Dean; flexibility). 2005: “One size doesn’t fit all” (Mike Stonebraker). 2006: Hadoop (Doug Cutting). 2011: Spark (Matei Zaharia). 2017: Death of shared nothing (David DeWitt). * Turing Award winners (1981, 1998, 2014). http://amturing.acm.org/byyear.cfm
  • 27. Big Data History. Lessons: don’t reinvent the wheel; read the editors’ intro for “the red book”; read "Architecture of a Database System"; study favorite posts on HighScalability. The red book: Bailis, Hellerstein, Stonebraker. “Readings in Database Systems”, 5th Ed. http://www.redbook.io HighScalability: http://highscalability.com/all-time-favorites
  • 28. Content: Why, What, History, Technical How-Tos, Career Advice, Conclusions.
  • 29. How to Scale to Many Servers? When your data is small: a single server. [Diagram: clients, server.]
  • 30. How to Scale to Many Servers? Use a load balancer. [Diagram: clients, LB, servers.]
  • 31. How to Scale to Many Servers? Round-Robin DNS, Point of Presence, multi-level LB. [Diagram: clients, POPs, LB, servers.]
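
The progression in slides 29 to 31 (one server, then a load balancer, then several levels of balancing) can be sketched as a toy model. This is only an illustration under stated assumptions: the class names `RoundRobinBalancer` and `TwoLevelBalancer`, the POP names, and the server names are hypothetical, and real deployments balance by DNS, geography, and health checks rather than by a simple in-process rotation.

```python
from collections import Counter
from itertools import cycle


class RoundRobinBalancer:
    """Hands out targets in a fixed rotation, one per request."""

    def __init__(self, targets):
        targets = list(targets)
        if not targets:
            raise ValueError("need at least one target")
        self._rotation = cycle(targets)

    def pick(self):
        return next(self._rotation)


class TwoLevelBalancer:
    """First level picks a POP; that POP's own balancer picks a server."""

    def __init__(self, pops):
        # pops maps a POP name to the list of servers behind it.
        self._pop_level = RoundRobinBalancer(pops.keys())
        self._server_level = {
            name: RoundRobinBalancer(servers) for name, servers in pops.items()
        }

    def route(self):
        pop = self._pop_level.pick()
        return pop, self._server_level[pop].pick()


# Hypothetical topology: two POPs with two servers each.
pops = {"pop-east": ["e1", "e2"], "pop-west": ["w1", "w2"]}
lb = TwoLevelBalancer(pops)
routes = [lb.route() for _ in range(8)]
load = Counter(server for _, server in routes)
```

With eight requests and four servers, the rotation spreads the load evenly, two requests per server. Adding a server to one POP changes only that POP's rotation, which is one reason multi-level balancing scales more gracefully than a single flat list.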
  • 32. Google Cluster at the Beginning. Image source: Abhijeet Desai. "Google Cluster Architecture". http://www.slideshare.net/abhijeetdesai/google-cluster-architecture
  • 33. Google Belgium Data Center. Image source: Malte Schwarzkopf. "What does it take to make Google work at scale". https://docs.google.com/presentation/d/1OvJStE8aohGeI3y5BcYX8bBHwoHYCPu99A3KTTZElr0
  • 34. Google Belgium Data Center. Image source: Malte Schwarzkopf. "What does it take to make Google work at scale". https://docs.google.com/presentation/d/1OvJStE8aohGeI3y5BcYX8bBHwoHYCPu99A3KTTZElr0
  • 35. Google Data Centers. About 40 data centers. About 2 million machines. Machines are organized in containers, each having 1,160 machines; 30 racks of 40 machines; sometimes double stacked. Data sources: James Pearn, “How many servers does Google have?” https://plus.google.com/+JamesPearn/posts/VaQu9sNxJuY “Learn How Google Works: in Gory Detail”. http://www.ppcblog.com/how-google-works
  • 36. Google Data Size. Data too large: 130 trillion pages; index of 100 PB (stacking 2 TB drives up: 0.8 mile). Demand too much: 3 billion searches per day (or 35K per second). Data sources: https://www.google.com/insidesearch/howsearchworks/thestory http://www.seobook.com/learn-seo/infographics/how-search-works.php http://www.ppcblog.com/how-google-works
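
The two back-of-the-envelope figures on slide 36 can be checked with a few lines of arithmetic. The slide does not say how the 0.8 mile was derived; the sketch below assumes decimal units (1 PB = 10^15 bytes) and a drive height of about 26.1 mm, which is typical for a 3.5-inch drive. Both are assumptions, not figures from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Demand: 3 billion searches per day, expressed per second.
searches_per_day = 3_000_000_000
searches_per_second = searches_per_day / SECONDS_PER_DAY  # roughly 35K

# Size: how many 2 TB drives hold a 100 PB index (decimal units assumed).
index_bytes = 100 * 10**15
drive_bytes = 2 * 10**12
drives = index_bytes // drive_bytes

# Stack height, assuming each drive is about 26.1 mm tall (an assumption).
DRIVE_HEIGHT_M = 0.0261
METERS_PER_MILE = 1609.344
stack_miles = drives * DRIVE_HEIGHT_M / METERS_PER_MILE
```

Under these assumptions the index needs 50,000 drives, and the stack comes out close to the 0.8 mile quoted on the slide.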
  • 37. How to Evaluate a Distributed System. Well-known goals: Useful (solves your business need); Performant (high throughput, low latency); Elastic (you may add or remove nodes); Scalable (adding nodes improves performance); Fault tolerant (deals with failures). In addition, I’d advocate: Flexible (scaling, model, interface, architecture).
  • 38. Shared Nothing → Shared Storage. For 30 years, data warehouses (DW) were shared nothing. Now they are all shared storage. Shared nothing: Gamma, Teradata, Netezza, Vertica, DB2/PE, SQL Server PDW, Greenplum, Asterdata, SciDB. Shared storage: Redshift Spectrum, Snowflake, Microsoft SQL DW, Google BigQuery. Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
  • 39. Why Shared Storage? Flexible Scaling! (Figure annotation: “in minutes”.) Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
  • 40. Case Study: Snowflake (flexible scaling). Cloud services: authentication & access control, query optimizer, transaction manager, infrastructure manager, security, metadata storage. Compute layer: virtual warehouses, each a cluster of EC2 instances (shown with 2, 4, and 8 nodes) with a data cache; these disks are strictly used as caches. Data storage: S3; database tables are stored here. Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
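
The separation on slide 40 (tables live in shared storage, compute nodes hold only caches) can be illustrated with a toy model. This is a sketch under loud assumptions: `SharedStorage` and `VirtualWarehouse` are hypothetical classes invented here, a dict stands in for S3, and nothing below reflects Snowflake's actual API or internals.

```python
class SharedStorage:
    """Stands in for S3: the single durable home of all table data."""

    def __init__(self, blocks):
        self._blocks = dict(blocks)
        self.reads = 0  # counts how often compute had to go to storage

    def get(self, key):
        self.reads += 1
        return self._blocks[key]


class VirtualWarehouse:
    """A resizable group of compute nodes whose local disks only cache."""

    def __init__(self, storage, num_nodes):
        self.storage = storage
        self.caches = [dict() for _ in range(num_nodes)]

    def resize(self, num_nodes):
        # Only caches are discarded; no table data is moved or repartitioned.
        self.caches = [dict() for _ in range(num_nodes)]

    def read(self, key):
        cache = self.caches[hash(key) % len(self.caches)]
        if key not in cache:
            cache[key] = self.storage.get(key)  # cache miss: fetch from storage
        return cache[key]

    def scan(self, keys):
        return sum(self.read(key) for key in keys)


# Hypothetical table of ten blocks whose values are 0..9.
keys = [f"block-{i}" for i in range(10)]
storage = SharedStorage({key: i for i, key in enumerate(keys)})
warehouse = VirtualWarehouse(storage, num_nodes=2)

first = warehouse.scan(keys)    # cold caches: every block comes from storage
again = warehouse.scan(keys)    # warm caches: no new storage reads
reads_when_warm = storage.reads

warehouse.resize(8)             # scale out; the data itself stays put
after_resize = warehouse.scan(keys)
reads_after_resize = storage.reads
```

Resizing costs only a cold cache, not a data reshuffle. In a shared-nothing design the same resize would require repartitioning the tables across the new set of nodes, which is why the slide ties shared storage to flexible scaling.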
  • 41. Case Study: Spark. Libraries: SparkSQL, ML, Streaming, GraphX. Core: Spark Core, with the RDD API and the DataFrame API. Deployment: Standalone, YARN, Mesos, Local. Access: Java/Scala/Python/R, shell or script.
  • 42. Case Study: Spark (Flexible Model). (Same stack diagram as slide 41.) Not only SQL, but also ML, streaming, and graph.
  • 43. Case Study: Spark (Flexible Interface). (Same stack diagram as slide 41.) You can access Spark using traditional JDBC. You can also use an interactive session (in multiple languages), or submit a script as a task.
  • 44. Case Study: Spark (Flexible Architecture). (Same stack diagram as slide 41.) May be deployed on top of an existing YARN or Mesos cluster. Can also run standalone. Possible to add components.
  • 45. How to Evaluate a Distributed System. In addition to the well-known goals (Useful, Performant, Elastic, Scalable, Fault tolerant), I’d advocate: Flexible (scaling, model, interface, architecture). Lessons: flexibility is an important metric; Spark is a flexible system; cloud DW use shared storage.
  • 46. Content: Why, What, History, Technical How-Tos, Career Advice, Conclusions.
  • 47. Growing Need for Big Data Jobs: 10X in 5 years. Source: https://www.indeed.com/jobtrends
  • 48. Big Data Roles: Chief Data Officer, Data Scientist, Data Engineer, Solutions Architect, Big Data Strategist, ... and at least 15 more. Source: “Top 20 Big Data jobs and their responsibilities”. http://bigdata-madesimple.com/top-20-big-data-jobs-and-their-responsibilities
  • 49. If You Want to Do Analytics. Python: NumPy, Jupyter Notebook. Machine learning: scikit-learn. Practice at http://DrivenData.org
  • 50. If You Want to Do Big Data Platform. Only for senior engineers. Practice at http://LeetCode.com. Embrace open source. Assemble a solution; don’t build from scratch. Consulting business: target medium-sized companies.
  • 51. If You Want to Build a Startup. Read some books about building a startup. Don’t assume you know users’ pain points. Throw away prototype code. Three key people must have a good working relationship: What-To-Do, How-To-Do, and When-To-Do. When in doubt, keep it simple. Strive for a clean API (external and internal). Do one thing really well first.
  • 52. Stonebraker’s Startup Loop. while (true) { 1. Talk with users to find their pain; 2. Brainstorm with professors; 3. Recruit students to build a prototype; 4. Draw a quadrant (e.g. with axes Small/Big and Simple/Complex); 5. Co-found a VC-backed startup (e.g. Streambase, Vertica, VoltDB, Paradigm4, Tamr, …); 6. Play banjo; write papers; give talks; receive awards (e.g. received the ACM Turing Award 2014); }
  • 53. Content: Why, What, History, Technical How-Tos, Career Advice, Conclusions.
  • 54. Conclusions. All “biggies” have a big-data platform. Shared nothing → shared storage. Leverage open source: pick, compose, expand. Flexibility is a key metric for distributed systems.