Week 1 Lecture 1
Data Analytics Landscape
[Venn diagram: Data Analytics sits at the intersection of People, Science and Data]
Data Never Sleeps!
This is what happens in one minute on the Internet (2023).
Big Data: Changing the Game of Organizations
✔ Big Data is the term for a collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.
✔ Big Data is data whose scale, diversity, and complexity require new
architecture, techniques, algorithms, and analytics to manage it and
extract value and hidden knowledge from it…
The 4 V’s of Big Data
The 4 V’s of Big Data - Volume
[Mind map around Volume:]
• Data measured in kilobytes to terabytes
• Large amount of data generation
• Huge storage required
• Saved in records, tables, files
• Online or offline transactions
https://www.whishworks.com/blog/big-data/understanding-the-3-vs-of-big-data-volume-velocity-and-variety
The 4 V’s of Big Data - Velocity
[Infographic around Velocity:]
• The New York Stock Exchange captures 8 TB of trade information per day (2022)
• Analysis of streaming data
• Modern cars have close to 100 sensors that monitor items such as fuel level and tire pressure
• 18.9 billion network connections (almost 2.5 connections per person on earth)
The 4 V’s – Variety
The 4 V’s – Veracity
Data is accurate, precise and trusted
The 4 V’s – The 5th V: Value
• Cost reduction
• Time saving
Four A’s of Big Data
[Cycle diagram: the four A’s (Acquisition, Aggregation, Analysis, Application) turn scattered data into integrated data and, ultimately, data into knowledge]
Why Does Business Need Big Data?
The Problem With ETL (Extract, Transform, Load)
• With exploding amounts of stored data, the ETL process becomes a real bottleneck.
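As a concrete (if toy) illustration, here is a minimal ETL sketch in Python; the file name sales.csv, the column names, and the SQLite target are all hypothetical stand-ins:

import csv
import sqlite3

# Extract: read raw rows from a source file (hypothetical sales.csv)
with open("sales.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean and reshape in memory; this is the step that stops
# scaling once the data no longer fits on one machine
cleaned = [
    {"region": r["region"].strip().upper(), "amount": float(r["amount"])}
    for r in rows
    if r["amount"]  # drop rows with missing amounts
]

# Load: write the transformed rows into the target store
con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (:region, :amount)", cleaned)
con.commit()

The pain point the slide refers to: both the extract and the transform step here materialize the whole dataset in memory on a single machine, which is exactly what breaks as volumes explode.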
Scaling Up & Scaling Out a Database
[Diagrams: Storage Area Network; Scaling Up a Database]
Difference between Business Intelligence & Data Science?
What is Business Intelligence?
DATA SCIENCE & BUSINESS INTELLIGENCE
[Chart: analytics maturity vs. business value]
Data Science (high business value)
• Predictive analysis & prescriptive analysis
• Why…? What will…? What should I do?
Source: http://hashtaggers586.blogspot.com/
Difference between Business Intelligence & Data Science

                 Business Intelligence              Data Science
Data Analysis    Yes                                Yes
Statistics       Yes                                Yes
Visualization    Yes                                Yes
Data Sources     Usually SQL, often a data          Less structured (logs, cloud data,
                 warehouse                          SQL, NoSQL, text)
[Section divider: People, Science, Data Venn diagram]
Big Data Architecture Main Components
Data Platform – Data Life Cycle
[Diagram: Data Platform life cycle]
• Data Ingestion: batch, real time, streaming
• Enrich: data flow & data engineering
• Serve: data warehouse, operational database
• Machine Learning & AI
The Rise of Big Data
Earlier, with limited data, only one processor and one storage unit were needed.
[Diagram: one processor, one storage unit holding structured data]
The Rise of Big Data
Data generation increased, leading to a high volume of data along with different data formats. Soon, a single processor was not enough to process such a high volume and variety of data, as it was very time consuming.
[Diagram: one processor, storage holding structured and unstructured data]
The Rise of Big Data
Hence, multiple processors were used to process the high volume of data, and this saved time. But the single storage unit became the bottleneck, due to which network overhead was generated.
[Diagram: multiple processors sharing one storage unit with structured and unstructured data]
The Rise of Big Data
The solution was to use distributed storage for each processor. This enabled easy storage and access of data, and no network overhead was generated. This method is known as parallel processing with distributed storage.
[Diagram: multiple processors, each with its own storage for structured and unstructured data]
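To make "parallel processing with distributed storage" concrete, here is a minimal sketch using Python's multiprocessing module; the chunks and the line-counting job are hypothetical stand-ins for real data splits and real processing:

from multiprocessing import Pool

# Each "storage unit" is modeled as a separate chunk of data that
# one worker processes locally, so no data moves between workers.
chunks = [
    ["alpha", "beta", "gamma"],
    ["delta", "epsilon"],
    ["zeta", "eta", "theta", "iota"],
]

def process_chunk(chunk):
    # Stand-in for real work (parsing, filtering, aggregating)
    return len(chunk)

if __name__ == "__main__":
    with Pool(processes=3) as pool:
        # Workers run in parallel, each on its own chunk
        partial_counts = pool.map(process_chunk, chunks)
    print(sum(partial_counts))  # combine partial results: 9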
Big Data Challenges and Solution
Hadoop as a Solution
What is Hadoop?
Hadoop is a framework that manages big data storage in a distributed way and
processes it in parallel.
Components of Hadoop
[Diagram: Hadoop components: HDFS for distributed storage, MapReduce for distributed processing]
HDFS
Features of HDFS
• Provides distributed storage
• Can be implemented on commodity hardware
• Provides data security
• Highly fault tolerant
What is MapReduce?
Hadoop MapReduce is a programming technique in which huge volumes of data are
processed in a parallel and distributed fashion.
[Diagram: Big Data split across processors, producing combined Output]
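A classic way to see the model is a word count. The sketch below simulates the map, shuffle and reduce phases in plain Python; the input lines are hypothetical, and in a real Hadoop job the shuffle step is done by the framework between the two phases:

from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input split
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Reduce: combine all values that the shuffle grouped under one key
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group intermediate pairs by key (Hadoop does this for you)
grouped = defaultdict(list)
for line in lines:  # in Hadoop, each mapper sees one input split
    for word, one in map_phase(line):
        grouped[word].append(one)

results = dict(reduce_phase(w, c) for w, c in grouped.items())
print(results)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}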
Hadoop Use Case – Combating Fraudulent Activities
Detecting fraudulent transactions is one of the many problems any bank faces.
Hadoop Use Case – Combating Fraudulent Activities
Approaches used by Zions’ security team to combat fraudulent activities:
• Security Information Management (SIM) tools. Problem: processing of unstructured data (like server logs, customer data, customer transactions) was not feasible at scale.
• Parallel processing system. With Hadoop, Zions could now store massive amounts of data, and processing of unstructured data (like server logs, customer data, customer transactions) was now possible.
Hadoop EcoSystem
• Hadoop has four main modules: Hadoop Common, HDFS, MapReduce and YARN. The way these modules are woven together is what makes Hadoop so successful.
• Hadoop's core functionality is the driver of Hadoop's adoption. Many Apache side projects use its core functions, which makes it so popular.
• Because of all those side projects, Hadoop has turned into an ecosystem, which also adds to its popularity.
[Ecosystem diagram: Tableau, Impala, Sqoop, KNIME, Chukwa, RapidMiner, Kafka, Flume, HBase, HDFS, Arrow, Hive, MapReduce, Pig, Spark, YARN, Flink, Drill, Storm, ZooKeeper, Ambari, Oozie]
Hadoop Is Everywhere?
1. Although Hadoop is very popular, it is not a silver bullet. It's not the tool that you should use for everything.
2. Often it does not make sense to deploy a Hadoop cluster, because it can be overkill.
3. Hadoop does not run on a single server. You basically need at least five servers, better six, to run a small cluster. Because of that, the initial platform costs are quite high. One option you have is to use specialized systems like Cassandra, MongoDB or other NoSQL DBs for storage. Or you move to Amazon and use Amazon's Simple Storage Service, or S3.
4. Guess what the tech behind S3 is. Yes, HDFS. That's why AWS also has the equivalent of MapReduce, named Elastic MapReduce.
5. The great thing about S3 is that you can start very small. When your system grows, you don't have to worry about S3's server scaling.
Should You Learn the Big Data Ecosystem?
1. Yes, I definitely recommend you get to know how Hadoop works and how to use it.
2. As I have explained, the ecosystem is quite large. Many big data projects use Hadoop or can interface with it. That's why it is generally a good idea to know as many big data technologies as possible.
3. Not in depth, but to the point that you know how they work and how you can use them.
4. Your main goal should be to be able to hit the ground running when you join a big data project. Plus, most of the technologies are open source, so you can try them out for free.
A State-of-the-Art Enterprise Big Data Platform
Vision
1. Design and build an Enterprise Data Management platform comprising a Data Lake/DWH
2. Design and build data governance, including Data Quality, Master Data Management, Metadata Management, Data Catalog and Security
3. Design and build consumption layers, including Reporting, Dashboarding and AI/ML models
Data Warehousing vs Data Lake

          Data Warehouse                       Data Lake
Storage   Expensive for large data volumes     Designed for low-cost storage
Agility   Less agile, fixed configuration      Highly agile, configure and reconfigure as needed
[Architecture diagram: Data Producers → Data Acquisition → Data Curation → Data Hub → Data Access → Consumption]
• Source systems: internal and external sources, APIs; structured, semi-structured and unstructured data
• Ingestion: real-time, batch and streaming ingestion into staging
• Data Lake: raw, silver and curated layers
• Transformation: ELT/ETL transformation, data quality, standardization, cleansing, profiling, archival and purging; MDM repository
• Data storage: DWH, cubes
• Data consumption: reports/dashboards, analytics models, data preparation/building/training/deployment/ops, downstream applications; access via APIs and exchange
• Users: report users, data scientists, admin/operations
Various Key Personas to Run the Program
• Data Engineer Persona
• Data Governance Persona
• Analytics Persona (Citizen Data Science and Reporting)
How to choose technology to implement a Data & AI project in any organization?
What is Data Ingestion
The process of importing, transferring, loading and processing data for later use or
storage in a database.
• Involves connecting to various data sources, extracting the data, and detecting
changed data.
• Data ingestion subsystems need to fetch data from a variety of sources (such as
RDBMS, web logs, application logs, streaming data, social media, etc.); a minimal change-detection sketch follows the diagram below.
[Diagram: data ingested into Hadoop in batch, real-time and streaming modes]
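To illustrate the "detecting changed data" point, here is a minimal incremental-ingestion sketch in Python; the source_table, the updated_at column, and the checkpoint file are all hypothetical:

import json
import sqlite3
from pathlib import Path

CHECKPOINT = Path("ingest_checkpoint.json")

def load_watermark():
    # The watermark records how far we ingested on the previous run
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())["last_seen"]
    return "1970-01-01 00:00:00"

def ingest_changed_rows(con):
    last_seen = load_watermark()
    # Only fetch rows that changed since the previous run
    rows = con.execute(
        "SELECT id, payload, updated_at FROM source_table "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    for row in rows:
        pass  # hand each changed row to the downstream sink here
    if rows:
        # Advance the watermark so the next run skips these rows
        CHECKPOINT.write_text(json.dumps({"last_seen": rows[-1][2]}))
    return len(rows)

# Usage: ingest_changed_rows(sqlite3.connect("source.db"))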
Apache Kafka Introduction
Kafka Use Cases
Kafka can be used for various purposes in an organization, such as:
• Messaging Service: millions of messages can be sent and received in real time using Kafka.
• Real-Time Stream Processing: Kafka can be used to process a continuous stream of information in real time and pass it to stream processing systems such as Storm.
• Log Aggregation: Kafka can be used to collect physical log files from multiple systems and store them in a central location such as HDFS.
• Event Sourcing: a time-ordered sequence of events can be maintained through Kafka.
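As a small sketch of the messaging use case, using the kafka-python client (assuming a broker running on localhost:9092; the topic name and payload are hypothetical):

from kafka import KafkaConsumer, KafkaProducer

# Producer side: publish a message to a topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("transactions", b'{"account": 42, "amount": 99.5}')
producer.flush()  # make sure the message actually left the client

# Consumer side: read the stream from the beginning of the topic
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)  # raw bytes; deserialize as needed
    break  # stop after one message in this sketch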
Some of the uses of Kafka at LinkedIn are as follows:
• Monitoring: collect metrics and create monitoring dashboards.
What is SQL?
What is NoSQL?
Features of NoSQL
Non-Relational
• NoSQL databases never follow the relational model
• Never provide tables with flat fixed-column records
• Work with self-contained aggregates or BLOBs
• Don't require object-relational mapping and data normalization
• No complex features like query languages, query planners, referential integrity joins, ACID

Schema-free
• NoSQL databases are either schema-free or have relaxed schemas
• Do not require any sort of definition of the schema of the data
• Offer heterogeneous structures of data in the same domain

Simple API
• Offer easy-to-use interfaces for storage and querying of data
• APIs allow low-level data manipulation & selection methods
• Text-based protocols, mostly used with HTTP and REST with JSON
• Mostly no standards-based query language
• Web-enabled databases running as internet-facing services

Distributed
• Multiple NoSQL databases can be executed in a distributed fashion
• Offer auto-scaling and fail-over capabilities
• Often the ACID concept can be sacrificed for scalability and throughput
• Mostly no synchronous replication between distributed nodes; asynchronous multi-master replication, peer-to-peer, HDFS replication
• Only providing eventual consistency
• Shared-nothing architecture, which enables less coordination and higher distribution
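To make "schema-free" concrete, here is a small sketch using the pymongo client (assuming a local MongoDB instance; the database and collection names are hypothetical):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["demo_db"]["users"]  # hypothetical database/collection

# No schema declaration needed: documents in the same collection
# can have heterogeneous structures
users.insert_one({"name": "Alice", "email": "alice@example.com"})
users.insert_one({"name": "Bob", "tags": ["vip"], "visits": 17})

# Simple API: low-level selection by example document
print(users.find_one({"name": "Bob"}))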
Types of NoSQL Databases
Advantages of NoSQL?
• Can be used as a primary or analytic data source
• Big data capability
• No single point of failure
• Easy replication
• Provides fast performance and horizontal scalability
• Can handle structured, semi-structured, and unstructured data with equal effect
• Handles big data across its velocity, variety, volume, and complexity
Disadvantages of NoSQL?
• No standardization rules
• Limited query capabilities
• RDBMS databases and tools are comparatively mature
• Does not offer traditional database capabilities, like consistency when multiple transactions are performed simultaneously
• When the volume of data increases, it becomes difficult to maintain unique keys
• Doesn't work as well with relational data
• Mostly open-source options, which are not yet as established in enterprises
Difference Between Structured & Unstructured Data
[Table: features of structured vs. unstructured data]
Data Warehousing vs Data Lake