Week 1 Lecture 1

The document provides an overview of the data analytics landscape, emphasizing the significance of big data characterized by its volume, velocity, variety, and veracity. It discusses the challenges associated with traditional data processing methods, the role of Hadoop in managing big data, and the differences between business intelligence and data science. Additionally, it highlights the importance of understanding big data technologies and the ecosystem for effective data management and analysis.

Data Analytics Landscape

Data analytics sits at the intersection of people, science, data, technology, processes, and business.
Data Never Sleeps! (2023)

This is what happens in one minute on the internet.
Big Data: Changing the Game of Organizations

Transactions + Interactions + Observations = BIG DATA


Big Data Definition
✔ Big Data is a massive volume of both structured and unstructured data that is so large that it is difficult to process with traditional database and software techniques.

✔ Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

✔ Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it.
The 4 V’s of Big Data

• VOLUME – Unprecedented amount of data; massive and continuous data flow.
• VELOCITY – Real-time and stream processing.
• VARIETY – Structured / unstructured / semi-structured; raw data from heterogeneous sources.
• VERACITY – Data bias, data noise, data abnormality.
The 4 V‘s of Big Data - Volume

Scale of data:
• It is estimated that 2.5 quintillion bytes (2.3 trillion gigabytes) of data are created each day.
• World data volume is on the order of 40 zettabytes (43 trillion gigabytes).
• 6.65 billion people have cell phones (world population: 7.8 billion).
• Most companies in the U.S. have at least 162 terabytes (166,000 gigabytes) of data stored.

Notes:
• Most of the world‘s current data is in the form of unstructured data: natural text, images, videos, raw sensory motor data.
• An autonomous car can generate as much as 4 terabytes of data per day.
• Facebook‘s last analysis, on 60 petabytes of data, used Spark.
• Most businesses have data in classical formats and of relatively small size.
The 4 V‘s of Big Data - Volume

• Huge amount of data generation, from kilobytes to terabytes
• Requires large storage
• Saved in records, tables, files
• Online or offline transactions

https://www.whishworks.com/blog/big-data/understanding-the-3-vs-of-big-data-volume-velocity-and-variety
The 4 V‘s of Big Data - Velocity

Analysis of streaming data:
• The New York Stock Exchange captures 8 TB of trade information per day (2022).
• Modern cars have close to 100 sensors that monitor items such as fuel level and tire pressure (2 ms).
• 18.9 billion network connections: almost 2.5 connections per person on earth.
The 4 V‘s of Big Data - Velocity

• Speed of generating data
• Generated in real time
• Online or offline data
• In streams, batches, or bits

https://www.whishworks.com/blog/big-data/understanding-the-3-vs-of-big-data-volume-velocity-and-variety
The 4 V‘s – Variety

Structured:
• Has a data model; ends up in tabular form
• Relational databases / data warehouses
• CSVs / Excels

Semi-structured:
• Partially modeled data
• XML, JSON, MongoDB
• Variable-length time series, e.g. sensor readings

Unstructured:
• Free text
• Images
• Videos
• Audio
• Sensory motor data

Notes:
• In advanced AI fields, unstructured data becomes vital.
• In classical business, most of the value still lies in structured data.
The 4 V‘s – Veracity

Veracity means data is accurate, precise, and trusted. Common veracity problems:

• Invalid or outdated data not representing the present
• Incomplete data can yield unintended results
• Different databases do not always have the same unique identifier
• Local/silo databases
• Manually created reports lead to invalid data
• Lack of a data model or data dictionary
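To make these veracity problems concrete, here is a minimal sketch (standard library only; the record fields and thresholds are hypothetical, not from the lecture) of the kind of validation checks that catch invalid, incomplete, or abnormal records before they reach analysis:

```python
from datetime import date

# Hypothetical customer records from two silo databases; the field names
# are illustrative only.
records = [
    {"id": "C-001", "revenue": 1200.0, "updated": date(2024, 5, 1)},
    {"id": None,    "revenue": 800.0,  "updated": date(2019, 1, 3)},   # missing identifier
    {"id": "C-003", "revenue": -50.0,  "updated": date(2024, 4, 20)},  # abnormal value
]

def check(rec, stale_before=date(2023, 1, 1)):
    """Return a list of veracity problems found in one record."""
    problems = []
    if rec["id"] is None:
        problems.append("incomplete: no unique identifier")
    if rec["revenue"] < 0:
        problems.append("abnormal: negative revenue")
    if rec["updated"] < stale_before:
        problems.append("outdated: not representing the present")
    return problems

for rec in records:
    for p in check(rec):
        print(rec["id"], "->", p)
```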
The 4 V‘s – The 5th V: Value

• Cost reduction
• Time saving
• Development & optimization
• Strategic planning & operations
Four A’s of Big Data

Acquisition → Aggregation → Analysis → Application

(Diagram: a cycle connecting scattered data, logged data, integrated data, and knowledge through the four A’s.)
Why Does Business Need Big Data?
The Problem With ETL (Extract, Transform, Load)
• With exploding amounts of stored data, the ETL process starts to become a real problem.
• Data sets get very big, like 100 GB or terabytes. That means billions or trillions of rows.
• As a result, the ETL process for large data sets takes longer and longer.
• Very quickly, ETL performance gets so bad that it no longer delivers results to analytics in time.

Common SQL Platform Architecture
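A minimal sketch of the classic ETL pattern the slide describes, using only Python's standard library; the table, columns, and exchange rates are invented for illustration. The point is that every row passes through extract, transform, and load steps, so runtime grows with row count, which is exactly what stops scaling at billions of rows:

```python
import csv, io, sqlite3

# Hypothetical source export standing in for an operational system.
raw = io.StringIO("order_id,amount,currency\n1,10.50,usd\n2,99.00,eur\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount_usd REAL)")

FX = {"usd": 1.0, "eur": 1.08}  # illustrative fixed rates

for row in csv.DictReader(raw):                              # Extract
    amount_usd = float(row["amount"]) * FX[row["currency"]]  # Transform
    conn.execute("INSERT INTO orders VALUES (?, ?)",         # Load
                 (int(row["order_id"]), amount_usd))

conn.commit()
print(conn.execute("SELECT * FROM orders").fetchall())
```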
Scaling Up & Scaling Out Database
Storage Area Network

Scaling Up Database

Scaling Out Database


• Increasing RAM for row Caching.
• Using More powerful CPU • Scaling out is the opposite of scaling up.
• Increasing Optimizing networking • Use Storage Area Network to store the
performance data. You can then use up to eight SQL
• Scaling up the system is fairly easy. servers, attach them to SAN and let them
handle queries.
18
Do Not Use Big Data When!

• If you don't run into scaling issues, please do not use big data tools!
• Big data is an expensive thing. A Hadoop cluster, for instance, needs at least five servers to work properly; more is better.
• This costs a lot of money, especially when you take the maintenance and development on top of big data tools into account.
• If you don't need it, it makes absolutely no sense at all!
Difference Between Business Intelligence & Data Science

What is Business Intelligence?
DATA SCIENCE & BUSINESS INTELLIGENCE
Business
Value
Data Science
• Predictive analysis &
prescriptive analysis
• Why….? What will….? What
High should I do?

Data Science Business Intelligence


• Descriptive analysis, standard
reporting
• What happened?
Low Business
Intelligence

Past Future Time

http://hashtaggers586.blogspot.com/ 22
Difference Between Business Intelligence & Data Science

| | Business Intelligence | Data Science |
|---|---|---|
| Data Analysis | Yes | Yes |
| Statistics | Yes | Yes |
| Visualization | Yes | Yes |
| Data Sources | Usually SQL, often a data warehouse | Less structured (logs, cloud data, SQL, NoSQL, text) |
| Tools | Statistics, visualization | Statistics, machine learning, graph analysis, NLP |
| Focus | Present and past | Future |
| Method | Analytic | Scientific |
| Goal | Better strategic decisions | Proactive decisions for planning and operations |
Technology

(Recap: data analytics sits at the intersection of people, science, data, technology, processes, and business.)
Big Data Architecture Main Components

• Ingestion – Based on the nature and velocity of the data, we decide which data ingestion tool we should use.
• Storage – Based on the nature and volume of the data, we decide which database structure we should use.
• Processing – Based on business requirements, we choose a processing platform to process the data.
• Visualization – The last but very important part is to visualize the data, so we can take decisions based on that visualization.
Data Platform – Data Life Cycle

Collect (streaming, data flow) → Enrich (data engineering) → Report (data warehouse) → Serve (operational database) → Predict (machine learning & AI)

Security | Governance | Lineage | Management | Automation
How Do Organizations Generate Data?

Data ingestion: data reaches the data platform in batch or in real time.
The Rise of Big Data

Earlier, with limited data, only one processor and one storage unit were needed.

Structured Data
The Rise of Big Data

Data generation increased, leading to a high volume of data along with different data formats. Soon, a single processor was not enough to process such a high volume of different kinds of data, as it was very time consuming.

Structured Data
Semi-Structured Data
Unstructured Data
The Rise of Big Data

Hence, multiple processors were used to process the high volume of data, and this saved time. But the single storage unit became the bottleneck, due to which network overhead was generated.

Structured Data
Semi-Structured Data
Unstructured Data
The Rise of Big Data

The solution was to use distributed storage for each processor. This enabled easy storage and access of data, and no network overhead was generated. This method is known as parallel processing with distributed storage.

Parallel Processing + Distributed Storage

Structured Data
Semi-Structured Data
Unstructured Data
Big Data Challenges and Solution

Massive amounts of data cannot be stored, processed, and analyzed using the traditional ways.

• Single central storage → Distributed storage
• Serial processing (one input, one process, one output) → Parallel processing (multiple inputs, multiple processes, combined output)
• Lack of ability to process unstructured data → Ability to process every type of data
Hadoop As Solution

• Single central storage → Distributed storage
• Serial processing → Parallel processing
• Lack of ability to process unstructured data → Ability to process every type of data
What is Hadoop?

Hadoop is a framework that manages big data storage in a distributed way and processes it in parallel.

Big Data → Storing → Processing → Analyzing
Components of Hadoop

Hadoop Distributed File System (HDFS) is specially designed for storage of huge datasets on commodity hardware (distributed storage).
HDFS

Features of HDFS:
• Provides distributed storage
• Implemented on commodity hardware
• Provides data security
• Highly fault tolerant
What is MapReduce?

Hadoop MapReduce is a programming technique where huge data is processed in a parallel and distributed fashion.

MapReduce is used for parallel processing of the big data stored in HDFS.
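As a concrete sketch of the MapReduce idea, here is the classic word count written in the Hadoop Streaming style: the mapper emits (word, 1) pairs, the framework sorts and groups by key across the cluster, and the reducer sums the counts. The file names and the streaming invocation shown in the comment are typical, not prescribed by the lecture.

```python
#!/usr/bin/env python3
# mapper.py -- reads raw text from stdin, emits one "word\t1" line per word.
# A typical (illustrative) cluster run with Hadoop Streaming:
#   hadoop jar hadoop-streaming.jar -input /data/in -output /data/out \
#       -mapper mapper.py -reducer reducer.py
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- receives mapper output sorted by key; sums counts per word.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```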
Hadoop Use Case – Combating Fraudulent Activities

Detecting fraudulent transactions is one of the various problems any bank faces.
Hadoop Use Case – Combating Fraudulent Activities

Approaches previously used by Zions' security team to combat fraudulent activities:

• Security Information Management (SIM) tools – Problem: unable to store the huge data which needed to be analyzed.
• Parallel processing system – Problem: it was based on RDBMS, so analyzing unstructured data was not possible.
Hadoop Use Case – Combating Fraudulent Activities

How Hadoop solved the problems (storing, processing, analyzing, detecting):

• Zions could now store massive amounts of data using Hadoop.
• Processing and analyzing unstructured data (like server logs, customer data, customer transactions) was now possible.
Hadoop EcoSystem

• Hadoop has four main modules: Hadoop Common, HDFS, MapReduce, and YARN. The way these modules are woven together is what makes Hadoop so successful.
• Hadoop's core functionality is the driver of Hadoop's adoption. Many Apache side projects use its core functions, which makes it so popular.
• Because of all those side projects, Hadoop has turned into an ecosystem, which also adds to its popularity.

Ecosystem tools include: Tableau, Impala, Sqoop, KNIME, Chukwa, RapidMiner, Kafka, Flume, HBase, HDFS, Arrow, Hive, MapReduce, Pig, Spark, YARN, Flink, Drill, Storm, ZooKeeper, Ambari, Oozie.
Hadoop Is Everywhere?

1. Although Hadoop is very popular, it is not a silver bullet. It's not the tool you should use for everything.
2. Often it does not make sense to deploy a Hadoop cluster, because it can be overkill.
3. Hadoop does not run on a single server. You basically need at least five servers, better six, to run a small cluster. Because of that, the initial platform costs are quite high. One option you have is to use specialized systems like Cassandra, MongoDB, or other NoSQL DBs for storage. Or you move to Amazon and use Amazon's Simple Storage Service, or S3.
4. Guess what the tech behind S3 is. Yes, HDFS. That's why AWS also has the equivalent to MapReduce, named Elastic MapReduce.
5. The great thing about S3 is that you can start very small. When your system grows, you don't have to worry about S3's server scaling.
Should You Learn the Big Data Ecosystem?

1. Yes, I definitely recommend you get to know how Hadoop works and how to use it.
2. As I have explained, the ecosystem is quite large. Many big data projects use Hadoop or can interface with it. That's why it is generally a good idea to know as many big data technologies as possible.
3. Not in depth, but to the point that you know how they work and how you can use them.
4. Your main goal should be to be able to hit the ground running when you join a big data project. Plus, most of the technologies are open source, so you can try them out for free.
A State-of-the-Art Enterprise Big Data Platform

Vision:
• A holistic and comprehensive view of all internal and external data, with advanced data management and analytical capabilities
• Improve oversight and efficiency
• Act as a single source of truth for all business units

Key Capabilities:
• Data ingestion: real time/batch, CDC, structured/unstructured, transformation, standardization, API integration
• AI/ML models, data quality, MDM, metadata management, cataloging, archival/purging, data security
• Data lake, data warehouse, reporting, dashboarding, self-service, data exchange, internal/external users

Scope of Work:
1. Design and build an enterprise data management platform comprising a data lake/DWH
2. Design and build data governance, including data quality, master data management, metadata management, data catalog, and security
3. Design and build consumption layers, including reporting, dashboarding, and AI/ML models
Data Warehousing vs Data Lake

| Aspect | Data Warehouse | Data Lake |
|---|---|---|
| Data | Structured, processed | Structured / semi-structured / unstructured, raw |
| Processing | Schema-on-write | Schema-on-read |
| Storage | Expensive for large data volumes | Designed for low-cost storage |
| Agility | Less agile, fixed configuration | Highly agile, configure and reconfigure as needed |
| Security | Mature | Maturing |
| Users | Business professionals | Data scientists |
Enterprise Big Data Platform Architecture

Data flows through six layers: Data Producers → Data Acquisition → Data Curation → Data Hub → Data Access → Consumption.

• Data Producers: internal and external source systems; real-time/batch feeds and APIs; structured, semi-structured, and unstructured data.
• Data Acquisition: ingestion into the data lake's raw layer and staging.
• Data Curation: ELT/ETL transformation, data quality, data preparation, standardization, cleansing, profiling.
• Data Hub: curated layer, MDM repository, DWH, cubes, archival/purging.
• Data Access: APIs and data exchange.
• Consumption: reports/dashboards for report users; analytics model preparation, building, training, deployment, and ops for data scientists; downstream applications for admin/operations.

Data Management, Security and Controls Framework: data governance, reference data, metadata & data lineage, data quality, privacy and IAM controls, data security.
Various Key Personas to Run the Program

Data Engineer Persona
• Challenges: data residing in silos; lack of automation; variety of data structures (structured, unstructured, and semi-structured); lack of a consolidated view.
• Goals: provide a consolidated view of data; provide accurate and integrated data for consumption as per SLAs.

Data Governance Persona
• Challenges: lack of data quality, data standards and policies, and data lineage across systems; lack of master data management; lack of data sharing and cataloging.
• Goals: create data quality dashboards for monitoring; ensure data cataloguing and democratization; ensure data security and access levels.

Analytics Persona (Citizen Data Science and Reporting)
• Challenges: inaccurate data leading to inaccurate decisions; lack of real-time information for making decisions; lack of self-service features.
• Goals: access to the correct information in the form of reports and dashboards; ability to perform deeper analytics, algorithms, and data science models; ability to create self-service dashboards.
How to Choose Technology to Implement a Data & AI Project in Any Organization?

• Know the customer's budget.
• Gather insights from the customer's website and services.
• Know the customer's available skills.
• Does the selected technology fulfill your requirements?
• Study their experiences and look into other initiatives.
• Identify the stakeholders (decision makers, influencers).
• Learn their workflow and best methodologies.

Data Ingestion
What is Data Ingestion?

The process of importing, transferring, loading, and processing data for later use or storage in a database.
• Involves connecting to various data sources, extracting the data, and detecting changed data.
• Data ingestion subsystems need to fetch data from a variety of sources (such as RDBMS, web logs, application logs, streaming data, social media, etc.).

Data reaches Hadoop in three modes: batch, real time, or streaming.
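A minimal sketch contrasting the batch and real-time modes above; the file path and record source are hypothetical. Batch ingestion moves an accumulated set of records on a schedule, while streaming ingestion handles each record the moment it arrives:

```python
import json, time

def ingest_batch(rows, target_path):
    """Batch: write an accumulated set of records in one scheduled load."""
    with open(target_path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

def ingest_stream(source, handle):
    """Streaming: process each record as it arrives."""
    for event in source:          # e.g. a generator fed by a message queue
        handle(event)

# Illustrative use (path and records are made up):
ingest_batch([{"id": 1}, {"id": 2}], "/tmp/landing_2024-05-01.jsonl")
ingest_stream(iter([{"id": 3}]), lambda e: print("got", e, "at", time.time()))
```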
Apache Kafka Introduction

Kafka is a high-performance, real-time messaging system. It is an open-source tool and is part of the Apache projects. The characteristics of Kafka are:
• It is a distributed and partitioned messaging system.
• It is highly fault-tolerant.
• It is highly scalable.
• It can process and send millions of messages per second to several receivers.
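A minimal sketch of publishing and consuming messages, assuming the kafka-python client package and a broker running at localhost:9092; the topic name and payload are invented for the example:

```python
from kafka import KafkaProducer, KafkaConsumer  # assumes the kafka-python package

# Producer: publish a message to a topic (broker address is an assumption).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", b'{"user": 42, "url": "/home"}')
producer.flush()  # block until the message is actually delivered

# Consumer: subscribe and read messages from the beginning of the topic.
consumer = KafkaConsumer("page-views",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for msg in consumer:
    print(msg.topic, msg.partition, msg.offset, msg.value)
    break  # read just one message for the demo
```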
Kafka Use Cases

Kafka can be used for various purposes in an organization, such as:

• Messaging service – Millions of messages can be sent and received in real time using Kafka.
• Real-time stream processing – Kafka can be used to process a continuous stream of information in real time and pass it to stream processing systems such as Storm.
• Log aggregation – Kafka can be used to collect physical log files from multiple systems and store them in a central location such as HDFS.
• Event sourcing – A time-ordered sequence of events can be maintained through Kafka.
Some of the uses of Kafka at LinkedIn are as follows:

• Monitoring – Collect metrics; create monitoring dashboards.
• Messaging – Used for message queues in content feeds, and as a publish-subscribe system for searches and content feeds.
• Analytics – Collection of page views and clicks; stored in a central Hadoop-based analytics system.
• A building block for distributed apps – For distributed databases and distributed log systems.
Data Storage
What is SQL?

• SQL (Structured Query Language) is a computer language for storing, manipulating, and retrieving data stored in a relational database.
• SQL is the standard language for relational database systems. All relational database management systems like MySQL, MS Access, Oracle, Sybase, and SQL Server use SQL as their standard database language.
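To illustrate, here is the same standard SQL running through Python's built-in sqlite3 module; the table and column names are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
conn.executemany("INSERT INTO employees (name, salary) VALUES (?, ?)",
                 [("Ada", 95000), ("Lin", 87000)])

# The same declarative query style works on MySQL, Oracle, SQL Server, etc.
for row in conn.execute("SELECT name, salary FROM employees WHERE salary > 90000"):
    print(row)
```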
What is NoSQL?

NoSQL is a class of database management systems (DBMS) that do not follow all of the rules of a relational DBMS and cannot use traditional SQL to query data.
Features of NoSQL

Non-relational:
• NoSQL databases never follow the relational model
• Never provide tables with flat fixed-column records
• Work with self-contained aggregates or BLOBs
• Don't require object-relational mapping and data normalization
• No complex features like query languages, query planners, referential integrity joins, ACID

Schema-free:
• NoSQL databases are either schema-free or have relaxed schemas
• Do not require any sort of definition of the schema of the data
• Offer heterogeneous structures of data in the same domain

Simple API:
• Offer easy-to-use interfaces for storage and querying of data
• APIs allow low-level data manipulation and selection methods
• Text-based protocols, mostly used with HTTP and REST with JSON
• Mostly no standards-based query language used
• Web-enabled databases running as internet-facing services

Distributed:
• Multiple NoSQL databases can be executed in a distributed fashion
• Offer auto-scaling and fail-over capabilities
• Often the ACID concept is sacrificed for scalability and throughput
• Mostly no synchronous replication between distributed nodes; asynchronous multi-master replication, peer-to-peer, HDFS replication
• Only providing eventual consistency
• Shared-nothing architecture, which enables less coordination and higher distribution
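A minimal sketch of the schema-free, simple-API style described above, assuming the pymongo driver and a MongoDB instance at localhost:27017; the database, collection, and field names are illustrative:

```python
from pymongo import MongoClient  # assumes the pymongo package and a local MongoDB

client = MongoClient("mongodb://localhost:27017")
users = client["demo_db"]["users"]

# Schema-free: documents in the same collection can have different shapes.
users.insert_one({"name": "Ada", "languages": ["python", "sql"]})
users.insert_one({"name": "Lin", "city": "Lahore", "age": 31})

# Simple API: query by example rather than a full query language.
print(users.find_one({"name": "Lin"}))
```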
Types of NoSQL Databases
Advantages of NoSQL
• Can be used as a primary or analytic data source
• Big data capability
• No single point of failure
• Easy replication
• Provides fast performance and horizontal scalability
• Can handle structured, semi-structured, and unstructured data with equal effect
• Handles big data, managing data velocity, variety, volume, and complexity
Disadvantages of NoSQL
• No standardization rules
• Limited query capabilities
• RDBMS databases and tools are comparatively mature
• Does not offer some traditional database capabilities, like consistency when multiple transactions are performed simultaneously
• When the volume of data increases, it is difficult to maintain unique values as keys become difficult to manage
• Doesn't work as well with relational data
• Mostly open-source options, so not as popular with enterprises
Difference Between Structured & Unstructured Data
Features Structured Data Unstructured Data

Representation Discrete –rows and columns Less defined boundaries and


easily addressable

Storage DBMS or file formats Unmanaged file structures

Metadata Syntax Semantics

Integration Tools ETL or ELT Batch processing or manual


data entry that involves codes

Standards SQL, ADO.net, ODBC, e.t.c. Open XML, SMTP, SMS,


CSV, e.t.c
Databases SQL, MySQL, Oracle, MangoDB, Cassendra,
Postgres and other RDBMS CouchDB can be use for
databases unstructured databases

62
Data Warehousing vs Data Lake

| Aspect | Data Warehouse | Data Lake |
|---|---|---|
| Data | Structured, processed | Structured / semi-structured / unstructured, raw |
| Processing | Schema-on-write | Schema-on-read |
| Storage | Expensive for large data volumes | Designed for low-cost storage |
| Agility | Less agile, fixed configuration | Highly agile, configure and reconfigure as needed |
| Security | Mature | Maturing |
| Users | Business professionals | Data scientists |
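To make the schema-on-write / schema-on-read row concrete, here is a small sketch: the warehouse path forces rows into a predefined table before storage (the extra "amount" field is silently lost), while the lake path stores raw lines as-is and applies a schema only when the data is read. All names and values are invented for the example.

```python
import json, sqlite3

raw_events = ['{"user": 1, "action": "login"}',
              '{"user": 2, "action": "buy", "amount": 9.99}']

# Schema-on-write (warehouse): rows must fit a predefined table before storage;
# fields outside the schema (here, "amount") are dropped at load time.
dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE events (user INTEGER, action TEXT)")
for line in raw_events:
    e = json.loads(line)
    dwh.execute("INSERT INTO events VALUES (?, ?)", (e["user"], e["action"]))

# Schema-on-read (lake): store the raw lines untouched; interpret at query time.
lake = list(raw_events)
big_buys = [e for e in (json.loads(line) for line in lake)
            if e.get("amount", 0) > 5]
print(big_buys)   # -> [{'user': 2, 'action': 'buy', 'amount': 9.99}]
```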
