Objective:
Key Takeaways:
BI and analytics
Data warehouse
ETL
Data sources
* Donald Feinberg, Mark Beyer, Merv Adrian, Roxane Edjlali (Gartner), The State of Data Warehousing in 2012 (Stamford, CT.: Gartner, 2012) 4
* Gartner, Big Data (Stamford, CT.: Gartner, 2016), URL: http://www.gartner.com/it-glossary/big-data/
Data Cost Culture
Characteristics
Traditional Big Data
Data
Relational All Data
Characteristics (with highly modeled schema) (with schema agility)
Expensive Commodity
Cost (storage and compute capacity) (storage and compute capacity)
Rear-view reporting Intelligent action
Culture (using relational algebra) (using relational algebra AND ML,
graph, streaming, image processing)
Tangerine instantly adapts to
customer feedback to offer customers
what they want, when they want it
Lack of insight for targeted campaigns
Scenario Inability to support data growth
Azure HDInsight (Hadoop-as-a-service) with the Analytics
Platform System enables instant analysis of social sentiment
Solution and customer feedback across digital, face-to-face and
phone interactions.
• Reduced time to customer insight
• Ability to make changes to campaigns or adjust product
Result rollouts based on real-time customer reactions “I can see us…creating predictive, context-aware financial
• Ability to offer incentives and new services to retain—and services applications that give information based on the time
grow—its customer base and where the customer is.”
Billy Lo
Head of Enterprise Architecture
1. Start with end-user requirements to identify desired reports
and analysis
2. Define corresponding database schema and queries
3. Identify the required data sources
4. Create a Extract-Transform-Load (ETL) pipeline to extract
required data (curation) and transform it to target schema (‘schema-
on-write’)
5. Create reports, analyze data
Dedicated ETL tools (e.g. SSIS)
Relational Queries
ETL pipeline
Results
LOB Applications Defined schema
All data not immediately required is discarded or archived
All data has value
• All data has potential value
• Data hoarding
• No defined schema—stored in native format
• Schema is imposed and transformations are done at query time (schema-on-read).
• Apps and users interpret the data as they see fit
Iterate
Gather data
Store indefinitely Analyze See results
from all sources
Devices
Batch queries Dashboards
Reports
Interactive queries Exploration
Real-time analytics
Machine Learning Queries
Meta-Data, Cooked
Joins Data
Relational Results
ETL pipeline
LOB Applications Defined schema
Obtaining skills Determining how Integrating with
and capabilities to get value existing IT investments
*Gartner: Survey Analysis – Hadoop Adoption Drivers and Challenges (Stamford, CT.: Gartner, 2015)
Data Stored
Xbox Live
Office365
LCA
Live
Yammer SMSG Bing
CRM/Dynamics
Skype
Exchange
Windows
Malware Protection Microsoft Stores
Commerce Risk
1 2 3 4 5 6 7
Big Data Analytics – Data Flow
Ingestion
Discovery
Azure
Data Catalog
Business
apps
Bulk Ingestion Preparation, Analytics and
Machine Learning Visualization
People
Power BI
Custom
apps
Sensors
and devices
Event Ingestion
Azure Data Lake Store
DATA INTELLIGENCE ACTION
Business
apps
Azure Event
Hubs
Events Events Transformed
Data
Kafka
Custom
apps
Sensors
and devices
Raw Events
Lambda architecture
DATA
INGEST PREPARE ANALYZE PUBLISH CONSUME
SOURCES
Hot Path Machine Learning
Real-time
Scoring
Reference Data
Event hubs ASA Job Rule
Cortana
Event hubs Flatten & Event hubs ASA Job Rule
Sensors (IoT, Metadata Join
Devices, Mobile)
Aggregated Hourly, Daily,
Archived Data Data Monthly Roll-Ups Power BI
Data Lake Data Lake Data Lake Data Lake
Store Store Analytics Store
Offline
Training
Machine
Batch Scoring Web/LOB
Learning Azure SQL
Data Warehouse Dashboards
Logs (CSV, JSON,
Data Factory: Move Data, Orchestrate, Schedule, and Monitor
XML…)
On Premise Cold Path
Clickstream,
Leading Computer Manufacturer / Retailer
Recommendation
How They Did It: Analyzing Clickstream to Provide Real-time Recommendations Online
HDInsight Cluster
AzureML
• How They Did It
• Collect clickstream data
• In tab separated text files
• Adding 22 new files per hour ~5-18
Blog Blog MB ase
MB/file
• Currently 1TB and growing
Storage Storage
IaaS VM
• Spin up Hadoop
Email Server
Visitor
• Use Hive scripts because of SQL-like
Omniture Product Information Website.com
Catalog Service
Targeted Email syntax
• Extracts click behavior like buys,
additions to carts, reviews etc. and
NRT
assigns scores
AzureML
• Jobs run hourly
Event Hub
Azure SQL
DE
Blog
Storage
NoSQL
Storage
Persisted
Storage
• Currently 8-nodes with plans to 16
BK1
• •
•
•
•
•
•
•
•
•
Information Big Data Stores Machine Learning Intelligence
Data Management and Analytics
People
Sources
Machine Cognitive
Data Factory Data Lake Store
Learning Services
SQL Data Data Lake Bot Web
Data Catalog Warehouse Analytics Framework
Apps HDInsight
Event Hubs (Hadoop and Cortana Mobile
Spark) Apps
Stream Analytics Bots
Dashboards &
Visualizations
Sensors Automated
and Power BI Systems
devices
Data Intelligence Action
Information Big Data Stores Machine Learning Intelligence
Data Management and Analytics
People
Sources
Machine Cognitive
Data Factory Data Lake Store
Learning Services
SQL Data Data Lake Bot Web
Data Catalog Warehouse Analytics Framework
Apps HDInsight
Event Hubs (Hadoop and Cortana Mobile
Spark) Apps
Stream Analytics Bots
Dashboards &
Visualizations
Sensors Automated
and Power BI Systems
devices
Data Intelligence Action
Control Ease of use
User Adoption
Data Lake Analytics
Specific Applications in a
multi-tenant form factor
HDInsight
HDP | CDH | MapR Workload optimized,
(Azure Marketplace) managed clusters
Any Hadoop technology
IaaS Hadoop Managed Hadoop Big Data as-a-service
Azure Data Lake
Analytics
Azure Storage Data Lake Store
Fully-managed Hadoop and Spark
Azure for the cloud
HDInsight 100% Open Source Hortonworks
data platform
Hadoop and Spark
Clusters up and running in minutes
as a Service on Azure
Managed, monitored and supported
by Microsoft with the industry’s best SLA
Familiar BI tools for analysis, or open source
notebooks for interactive data science
63% lower TCO than deploy your own
Hadoop on-premises*
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
Azure
Data Lake Store Hadoop File System (HDFS) for the cloud
No limits to scale
A hyper-scale
repository for Big Data Store any data in its native format
analytics workloads Enterprise-grade access control,
encryption at rest
Optimized for analytic workload performance
Distributed analytics service built on
Azure Apache YARN
Data Lake Analytics Elastic scale per query lets users focus on
business goals—not configuring hardware
A new distributed
analytics service Includes U-SQL—a language that unifies the
benefits of SQL with the expressive
power of C#
Integrates with Visual Studio to develop,
debug, and tune code faster
Federated query across Azure data sources
Enterprise-grade role based access control
Highest availability • Managed, monitored and
guarantee in the supported by Microsoft
industry for peace of • Enterprise-leading SLA—
99.9% uptime
mind • No IT resources needed for
upgrades and patching
• Microsoft monitors your
deployment so you don’t
have to
99.9% SLA
*Applies to HDInsight only
Runs in the most datacenters worldwide
North Central US
Illinois
West Europe
Netherlands
Central US
Iowa
China North*
Beijing
Japan East
North Europe China South* Tokyo, Saitama
Ireland Shanghai
West US East US
California Virginia Japan West
India Central
Pune Osaka
East US 2
South Central US Virginia
Texas
East Asia
Hong Kong
SE Asia
Singapore
Australia East
Azure doubling compute New South Wales
and storage every 6 months Brazil South
Sao Paulo State Australia South East
Victoria
*Applies to HDInsight only
Manage and • Auditing, alerting, access
secure your data control—all from within
a single web-based portal
by leveraging
• Azure Active Directory
existing IT integration for identity
investments and access management
• Leverage existing investment
in Active Directory on-
premises
Lower total cost • No hardware
• Hadoop support included with
of ownership Azure support
• Pay only for what you use
• Independently scale storage
and compute
• No need to hire specialized
operations team
• 63% lower total cost of
ownership than on-premises*
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with
Microsoft Azure HDInsight”
Learn more on the Data Lake website:
http://azure.com/datalake
Watch videos on Azure Data Lake:
https://channel9.msdn.com/Series/AzureDataLake
Take courses and read documentation
on Azure Data Lake:
http://aka.ms/hditraining
http://aka.ms/adlanalytics
http://aka.ms/adlstore