
Transforming Data with Azure Solutions

This document discusses data warehousing and big data concepts. It provides an overview of key differences between traditional data warehousing and big data approaches, including data characteristics, costs, and culture. Traditional approaches focus on structured data and predefined schemas, while big data embraces all data types and schema flexibility.

Uploaded by

Srikanth
Copyright © All Rights Reserved

Objective:

Key Takeaways:
[Diagram: traditional data warehousing stack, top to bottom: BI and analytics, Data warehouse, ETL, Data sources]

* Donald Feinberg, Mark Beyer, Merv Adrian, Roxane Edjlali (Gartner), The State of Data Warehousing in 2012 (Stamford, CT: Gartner, 2012)
* Gartner, Big Data (Stamford, CT: Gartner, 2016), URL: http://www.gartner.com/it-glossary/big-data/
Traditional vs. Big Data: Data Characteristics, Cost, Culture

                      Traditional                       Big Data
Data characteristics  Relational                        All data
                      (with highly modeled schema)      (with schema agility)
Cost                  Expensive                         Commodity
                      (storage and compute capacity)    (storage and compute capacity)
Culture               Rear-view reporting               Intelligent action
                      (using relational algebra)        (using relational algebra AND ML,
                                                        graph, streaming, image processing)
Tangerine instantly adapts to customer feedback to offer customers what they want, when they want it

Scenario
• Lack of insight for targeted campaigns
• Inability to support data growth

Solution
Azure HDInsight (Hadoop-as-a-service) with the Analytics Platform System enables instant analysis of social sentiment and customer feedback across digital, face-to-face and phone interactions.

Result
• Reduced time to customer insight
• Ability to make changes to campaigns or adjust product rollouts based on real-time customer reactions
• Ability to offer incentives and new services to retain, and grow, its customer base

“I can see us…creating predictive, context-aware financial services applications that give information based on the time and where the customer is.”
Billy Lo, Head of Enterprise Architecture
1. Start with end-user requirements to identify desired reports and analysis
2. Define corresponding database schema and queries
3. Identify the required data sources
4. Create an Extract-Transform-Load (ETL) pipeline to extract required data (curation) and transform it to the target schema (‘schema-on-write’)
5. Create reports, analyze data
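The steps above can be sketched in a few lines of Python. This is a minimal illustration of schema-on-write, not the tooling the slides refer to: the `orders` table, its columns, and the sample CSV are all hypothetical, and `sqlite3` stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; column names are illustrative only.
RAW_CSV = """\
order_id,customer,amount,notes
1,alice,19.99,gift wrap
2,bob,5.00,
3,carol,not-a-number,call first
"""


def run_etl(raw_csv: str, conn: sqlite3.Connection) -> int:
    """Extract rows, transform them to the target schema, and load them."""
    # The target schema is fixed before any data is loaded (schema-on-write).
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, "
        "customer TEXT NOT NULL, "
        "amount REAL NOT NULL)"
    )
    loaded = 0
    for row in csv.DictReader(io.StringIO(raw_csv)):
        try:
            # Curation: only the columns the reports need are kept;
            # the free-text "notes" column is dropped here.
            record = (int(row["order_id"]), row["customer"], float(row["amount"]))
        except ValueError:
            continue  # rows that do not fit the schema are discarded
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", record)
        loaded += 1
    conn.commit()
    return loaded


conn = sqlite3.connect(":memory:")
loaded_rows = run_etl(RAW_CSV, conn)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

Anything that does not match the schema (the malformed third row, the `notes` column) never reaches the warehouse. That loss is the trade-off the big data approach is meant to avoid.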
[Diagram: LOB Applications → ETL pipeline (dedicated ETL tools, e.g. SSIS) → Defined schema (relational) → Queries → Results]

All data not immediately required is discarded or archived


All data has value
• All data has potential value
• Data hoarding
• No defined schema—stored in native format
• Schema is imposed and transformations are done at query time (schema-on-read).
• Apps and users interpret the data as they see fit
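As a rough illustration of schema-on-read, the sketch below keeps raw events exactly as they arrived and lets each reader impose its own schema at query time. The event fields (`device`, `temp_c`, `temp_f`, `user`, `action`) and both reader functions are hypothetical, chosen only to show two consumers interpreting the same stored data differently.

```python
import json

# Hypothetical raw events, stored in native format (one JSON object per line).
RAW_EVENTS = """\
{"device": "t-1", "temp_c": 21.5, "ts": "2016-01-01T00:00:00"}
{"device": "t-2", "temp_f": 70.7, "ts": "2016-01-01T00:00:05"}
{"user": "alice", "action": "login"}
"""


def read_temperatures(raw: str):
    """Impose a (device, celsius) schema while reading."""
    for line in raw.splitlines():
        event = json.loads(line)
        if "temp_c" in event:
            yield event["device"], event["temp_c"]
        elif "temp_f" in event:
            # The transformation happens at query time, not at load time.
            yield event["device"], round((event["temp_f"] - 32) * 5 / 9, 1)
        # Events without a temperature are skipped by this reader only;
        # they stay in the store for other readers.


def read_actions(raw: str):
    """A second consumer interprets the same raw data differently."""
    for line in raw.splitlines():
        event = json.loads(line)
        if "action" in event:
            yield event["user"], event["action"]


temperatures = list(read_temperatures(RAW_EVENTS))
actions = list(read_actions(RAW_EVENTS))
```

Nothing is discarded at ingestion, so a new question only needs a new reader rather than a new pipeline.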
[Diagram: Gather data from all sources → Store indefinitely → Analyze → See results, then iterate. Sources include devices; analysis covers batch queries, interactive queries, real-time analytics, and machine learning; results appear as dashboards, reports, and exploration.]

[Diagram labels: LOB Applications, ETL pipeline, Defined schema, Relational, Queries, Results, Meta-Data, Joins, Cooked Data]
• Obtaining skills and capabilities
• Determining how to get value
• Integrating with existing IT investments

*Gartner: Survey Analysis – Hadoop Adoption Drivers and Challenges (Stamford, CT: Gartner, 2015)
[Figure: Data Stored, by Microsoft business: Xbox Live, Office365, LCA, Live, Yammer, SMSG, Bing, CRM/Dynamics, Skype, Exchange, Windows, Malware Protection, Microsoft Stores, Commerce Risk]
Big Data Analytics – Data Flow

[Diagram: Sources (business apps, custom apps, sensors and devices) feed Ingestion (bulk ingestion, event ingestion) into Azure Data Lake Store. Discovery uses Azure Data Catalog. Preparation, Analytics and Machine Learning lead to Visualization in Power BI for people. Stages: DATA → INTELLIGENCE → ACTION]


[Diagram: event ingestion. Business apps, custom apps, and sensors and devices send events to Azure Event Hubs or Kafka, which deliver raw events and transformed data.]
Lambda architecture

Stages: DATA SOURCES → INGEST → PREPARE → ANALYZE → PUBLISH → CONSUME

Data sources: Sensors (IoT, Devices, Mobile); Logs (CSV, JSON, XML…); Clickstream; On Premise

Hot Path: Event hubs → Flatten & Metadata Join (with Reference Data) → Event hubs → ASA Job Rule; Machine Learning Real-time Scoring; consumed through Cortana and Power BI

Cold Path: Archived Data (Data Lake Store) → Aggregated Data (Data Lake Store) → Hourly, Daily, Monthly Roll-Ups (Data Lake Analytics, Data Lake Store); Machine Learning Offline Training and Batch Scoring; Azure SQL Data Warehouse; consumed through Web/LOB Dashboards

Data Factory: Move Data, Orchestrate, Schedule, and Monitor
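The split between the two paths can be sketched as follows. This is a toy, in-memory illustration under stated assumptions: the event fields, the `THRESHOLD` rule, and both functions are hypothetical, and plain Python lists stand in for Event Hubs, Stream Analytics, and Data Lake Store.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sensor events; field names and the threshold are assumptions.
EVENTS = [
    {"device": "d1", "ts": "2016-01-01T10:05:00", "value": 20.0},
    {"device": "d1", "ts": "2016-01-01T10:40:00", "value": 95.0},
    {"device": "d1", "ts": "2016-01-01T11:10:00", "value": 30.0},
]
THRESHOLD = 90.0


def hot_path(event, alerts):
    """Evaluate a rule on each event as it arrives (low latency)."""
    if event["value"] > THRESHOLD:
        alerts.append((event["device"], event["ts"]))


def cold_path_rollup(archive):
    """Batch job: average value per device per hour over archived events."""
    buckets = defaultdict(list)
    for event in archive:
        hour = datetime.fromisoformat(event["ts"]).strftime("%Y-%m-%d %H:00")
        buckets[(event["device"], hour)].append(event["value"])
    return {key: sum(values) / len(values) for key, values in buckets.items()}


alerts, archive = [], []
for event in EVENTS:
    hot_path(event, alerts)  # hot path: act on the event immediately
    archive.append(event)    # cold path: the raw event is also archived

hourly = cold_path_rollup(archive)
```

Every event takes both routes: the hot path answers "does this need action now?", while the cold path later recomputes aggregates from the complete archive.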

Leading Computer Manufacturer / Retailer: Recommendation

How They Did It: Analyzing Clickstream to Provide Real-time Recommendations Online

• Collect clickstream data
  • In tab separated text files
  • Adding 22 new files per hour, ~5-18 MB/file
  • Currently 1TB and growing
• Spin up Hadoop
  • Use Hive scripts because of SQL-like syntax
  • Extracts click behavior like buys, additions to carts, reviews etc. and assigns scores
  • Jobs run hourly
  • Currently 8 nodes, with plans to grow to 16

[Diagram labels: Visitor, Website.com, Omniture, Product Information Catalog Service, Blob Storage, HDInsight Cluster, AzureML, IaaS VM, Email Server, Targeted Email, NRT, Event Hub, Azure SQL, NoSQL Storage, Persisted Storage]
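The scoring step can be sketched as below. The case study used Hive scripts; Python is swapped in here purely for illustration. The row layout (visitor, product, action) and the weights in `ACTION_SCORES` are assumptions, since the slides do not give the actual columns or scores. The last lines work out the ingest volume implied by the figures quoted above.

```python
import csv
import io
from collections import Counter

# Hypothetical tab-separated clickstream rows: visitor, product, action.
CLICKS_TSV = (
    "v1\tlaptop-a\tview\n"
    "v1\tlaptop-a\tadd_to_cart\n"
    "v1\tlaptop-a\tbuy\n"
    "v2\tlaptop-a\treview\n"
    "v2\tmouse-b\tview\n"
)

# Assumed weights; the source does not state the real scores.
ACTION_SCORES = {"buy": 5, "add_to_cart": 3, "review": 2}


def score_clicks(tsv_text: str) -> Counter:
    """Sum a score per (visitor, product) pair from click behavior."""
    scores = Counter()
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for visitor, product, action in reader:
        # Actions without a weight (e.g. a plain view) contribute 0.
        scores[(visitor, product)] += ACTION_SCORES.get(action, 0)
    return scores


scores = score_clicks(CLICKS_TSV)

# Ingest volume implied by 22 files per hour at roughly 5-18 MB each.
FILES_PER_HOUR = 22
low_mb_per_hour = FILES_PER_HOUR * 5      # 110 MB per hour
high_mb_per_hour = FILES_PER_HOUR * 18    # 396 MB per hour
low_gb_per_day = low_mb_per_hour * 24 / 1000    # about 2.6 GB per day
high_gb_per_day = high_mb_per_hour * 24 / 1000  # about 9.5 GB per day
```

At a few gigabytes per day, an hourly batch job is a plausible fit, which is consistent with the "jobs run hourly" note.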








[Diagram: platform overview, Data → Intelligence → Action]

Data Sources: Apps; Sensors and devices
Information Management: Data Factory; Data Catalog; Event Hubs
Big Data Stores: Data Lake Store; SQL Data Warehouse
Machine Learning and Analytics: Machine Learning; Data Lake Analytics; HDInsight (Hadoop and Spark); Stream Analytics
Intelligence: Cognitive Services; Bot Framework; Cortana
Dashboards & Visualizations: Power BI
Action: People; Web; Mobile Apps; Bots; Automated Systems




[Diagram: spectrum from Control to Ease of use (user adoption)]

IaaS Hadoop: HDP | CDH | MapR (Azure Marketplace). Any Hadoop technology.
Managed Hadoop: HDInsight. Workload optimized, managed clusters.
Big Data as-a-service: Azure Data Lake Analytics. Specific applications in a multi-tenant form factor.

Storage: Azure Storage, Data Lake Store


Azure HDInsight
Hadoop and Spark as a Service on Azure

• Fully-managed Hadoop and Spark for the cloud
• 100% Open Source Hortonworks data platform
• Clusters up and running in minutes
• Managed, monitored and supported by Microsoft with the industry’s best SLA
• Familiar BI tools for analysis, or open source notebooks for interactive data science
• 63% lower TCO than deploying your own Hadoop on-premises*

*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
Azure Data Lake Store
A hyper-scale repository for Big Data analytics workloads

• Hadoop File System (HDFS) for the cloud
• No limits to scale
• Store any data in its native format
• Enterprise-grade access control, encryption at rest
• Optimized for analytic workload performance
Azure Data Lake Analytics
A new distributed analytics service

• Distributed analytics service built on Apache YARN
• Elastic scale per query lets users focus on business goals, not configuring hardware
• Includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C#
• Integrates with Visual Studio to develop, debug, and tune code faster
• Federated query across Azure data sources
• Enterprise-grade role based access control
Highest availability guarantee in the industry for peace of mind

• Managed, monitored and supported by Microsoft
• Enterprise-leading SLA: 99.9% uptime
• No IT resources needed for upgrades and patching
• Microsoft monitors your deployment so you don’t have to

99.9% SLA
*Applies to HDInsight only
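To put the 99.9% figure in concrete terms, the short calculation below converts an uptime percentage into the downtime it still permits. The slides do not say how the SLA defines or measures uptime, so the yearly and 30-day periods here are illustrative assumptions, and the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(sla_percent: float, period_hours: float) -> float:
    """Hours of downtime that still satisfy the given uptime percentage."""
    return period_hours * (1 - sla_percent / 100)


downtime_per_year_hours = allowed_downtime_hours(99.9, HOURS_PER_YEAR)
downtime_per_30_days_minutes = allowed_downtime_hours(99.9, 30 * 24) * 60
```

Under these assumptions, 99.9% uptime allows roughly 8.76 hours of downtime per year, or about 43 minutes in a 30-day period.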
Runs in the most datacenters worldwide

[Map: Azure regions]
North Central US: Illinois
Central US: Iowa
South Central US: Texas
West US: California
East US: Virginia
East US 2: Virginia
Brazil South: Sao Paulo State
North Europe: Ireland
West Europe: Netherlands
India Central: Pune
East Asia: Hong Kong
SE Asia: Singapore
China North*: Beijing
China South*: Shanghai
Japan East: Tokyo, Saitama
Japan West: Osaka
Australia East: New South Wales
Australia South East: Victoria

Azure doubling compute and storage every 6 months
*Applies to HDInsight only
Manage and secure your data by leveraging existing IT investments

• Auditing, alerting, access control: all from within a single web-based portal
• Azure Active Directory integration for identity and access management
• Leverage existing investment in Active Directory on-premises
Lower total cost of ownership

• No hardware
• Hadoop support included with Azure support
• Pay only for what you use
• Independently scale storage and compute
• No need to hire specialized operations team
• 63% lower total cost of ownership than on-premises*

*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
Learn more on the Data Lake website:
http://azure.com/datalake

Watch videos on Azure Data Lake:
https://channel9.msdn.com/Series/AzureDataLake

Take courses and read documentation on Azure Data Lake:
http://aka.ms/hditraining
http://aka.ms/adlanalytics
http://aka.ms/adlstore
