Second Exam Summary

Big Data Processing Concepts Lecture 10:

Chapter 6 Part 1
• What is MapReduce?
- MapReduce is a programming model and framework used to process
large amounts of data in parallel across many computers.
- It was developed to handle the vast amounts of data that modern
applications generate and need to process efficiently.

• Key Concepts
- Parallel Processing: This means breaking a task into smaller parts
and processing them simultaneously across multiple computers,
making it faster to handle large datasets.
- Key-Value Pairs: In MapReduce, data is represented as pairs of keys
and values.
- Think of a key as a unique identifier and the value as the data
associated with it.
- For example, in a list of words, the word itself could be the key, and
the number of times it appears could be the value.

• How MapReduce Works


The Two Main Functions
- Map Function: This function takes input data and converts it into a
set of key-value pairs.
- The goal is to divide the data into manageable chunks.
- For instance, if you're counting words in a book, the map function
would produce pairs like ("word", 1) for each word.
- Reduce Function: After the map function processes the data, the
reduce function takes the intermediate key-value pairs, groups them
by key, and performs calculations or aggregations.
- Continuing the word count example, the reduce function would sum up
all the values for each word, giving you the total count for each word.

• The Workflow
1. Input Splitting: The input data is divided into smaller pieces called
splits. Each split is processed by a separate map task, allowing for
parallel processing.
2. Mapping: Each map task processes its split of data and outputs
intermediate key-value pairs. These pairs are then sorted by key.
3. Shuffling and Sorting: The system automatically groups all the
values associated with the same key together, ensuring that they are
sent to the same reduce task.
4. Reducing: The reduce tasks process the grouped data, performing
operations like summing, averaging, or filtering, and produce the
final output.
5. Output: The results are written to an output file, which contains the
processed data.

Example: Counting Words in a Text


- Imagine you have a document with the following lines:
"cat dog"
"dog cat"
"dog fish"
- Step-by-Step Process

1. Input Splits:
Split 1: "cat dog"
Split 2: "dog cat" and "dog fish"

2. Map Phase:
Split 1 produces: ("cat", 1), ("dog", 1)
Split 2 produces: ("dog", 1), ("cat", 1), ("dog", 1), ("fish", 1)
3. Shuffle and Sort:
Grouped by key:
"cat": [1, 1]
"dog": [1, 1, 1]
"fish": [1]

4. Reduce Phase:
"cat": 2
"dog": 3
"fish": 1

5. Final Output:
("cat", 2), ("dog", 3), ("fish", 1)

• Apache Hadoop
- Apache Hadoop is an open-source framework that uses MapReduce for
processing large datasets. It consists of:
- HDFS (Hadoop Distributed File System): Stores data across multiple
machines, ensuring fault tolerance and scalability.
- YARN (Yet Another Resource Negotiator): Manages computing
resources and schedules jobs.

• Why Hadoop?
- Hadoop is designed to handle big data efficiently by distributing tasks
across many nodes (computers), offering fault tolerance (data is
replicated across nodes), and allowing for the processing of diverse data
types.
• Differences Between Hadoop and MapReduce
- Hadoop: A comprehensive framework for distributed storage and
processing.
- MapReduce: A specific programming model used within Hadoop for
processing data.
Big Data Processing Concepts Lecture 11:
Chapter 6 Part 2
• Introduction to Big Data Processing with MapReduce
- MapReduce is a powerful programming model for processing large
datasets in parallel across distributed clusters.
- It divides tasks into smaller sub-tasks, enabling efficient data processing.
However, despite its strengths, there are several pitfalls that need
addressing for optimal performance.

• Pitfalls of MapReduce
1. Assumption of Homogeneous Nodes
- Default Scheduler: The original Hadoop MapReduce scheduler
assumes that all nodes in a cluster are homogeneous, meaning they have
similar processing power and capabilities.
- Straggler Tasks: These are tasks that take much longer to complete than
others. The scheduler attempts to mitigate this by speculatively launching
duplicate copies of these tasks on idle nodes.
- Challenge: In reality, clusters often consist of heterogeneous nodes with
varying capabilities. This assumption can lead to inefficiencies in
resource utilization and task execution.

2. Complexity in Iterative Computation


- Iterative Algorithms: Many data mining and graph analysis tasks
require iterative computations, where results from one step are used in
subsequent steps.
- Single Job Limitation: Implementing these complex tasks in a single
MapReduce job is difficult and often inefficient.
- Multiple Jobs: Running multiple MapReduce jobs to handle these tasks
is computationally expensive and time-consuming.
- Need for Extension: There's a demand for extending the MapReduce
model to better support iterative computations, possibly through
frameworks that integrate with MapReduce.

3. Real-Time Computing Challenges


- Batch-Oriented Design: MapReduce was initially designed for batch
processing, where data is processed in large chunks.
- Data Loading Requirement: All data must be loaded into the
distributed file system before processing can begin, which is inefficient
for real-time applications.
- Limitation: This design is not suitable for applications requiring
immediate data processing, such as real-time analytics or interactive
querying.

4. Underutilization of Modern Hardware


- Linear Execution: Map and reduce tasks in the original MapReduce
framework are executed linearly, without exploiting the parallel
capabilities of modern hardware like multi-core CPUs and GPUs.
- Hardware Potential: This linear execution model limits the ability to
fully utilize the computing power available, resulting in suboptimal
performance.

5. Configuration Complexity
- Cluster Environment: MapReduce applications typically run in clusters
composed of many computers, requiring intricate configuration and
setup.
- Challenges: The complexity involved in setting up and tuning these
clusters poses significant challenges for users, leading to a demand for
tools and optimizations that can simulate MapReduce contexts and
analyze performance dynamically.
6. Security and Data Protection
- Shared Resources: In cloud computing environments, resources are
shared among multiple users, increasing the risk of data breaches.
- Authentication Mechanisms: The original MapReduce framework
provides basic authentication methods like Token-based and Kerberos,
which may not be sufficient for protecting sensitive data.
- Security Concerns: There is a need for stronger data protection
measures and enhanced authentication and authorization protocols to
safeguard large-scale sensitive data.

• Introduction to YARN (Yet Another Resource Negotiator)


- YARN is a significant enhancement to the Hadoop ecosystem,
addressing many of the limitations of the original MapReduce model by
offering improved resource management and scheduling capabilities.

• Advantages of YARN
1. Scalability: YARN enhances the scalability of Hadoop clusters,
allowing them to efficiently handle larger volumes of data and more
complex tasks.
2. Cluster Utilization: It optimizes resource usage, ensuring that all
available resources in the cluster are utilized effectively.
3. User Agility: YARN allows users to run different processing
frameworks alongside MapReduce, providing greater flexibility in data
processing.

• New Services in YARN


- Resource Manager: A central component responsible for managing and
allocating resources across the cluster.
- Application Master: Each application has its own Application Master,
which negotiates resources with the Resource Manager and coordinates
task execution with Node Managers.
- Backward Compatibility: YARN maintains backward compatibility
with existing MapReduce applications, allowing them to run within the
new framework without modification.

• YARN Components

1. Resource Manager
- Role: Acts as the central authority for resource management in a YARN
cluster.
- Functions: Optimizes cluster utilization by balancing constraints such
as capacity guarantees, fairness, and service level agreements.
- Pluggable Scheduler: Allows different scheduling algorithms to be
used, such as those focusing on capacity or fairness, to meet specific
needs.
2. Application Master
- Purpose: Manages the lifecycle of applications, including negotiating
resources from the Resource Manager and coordinating with Node
Managers to execute tasks.
- Responsibilities: Requests resource containers, tracks their status,
monitors progress, and adapts to changing requirements.

3. Resource Model
- Flexibility: YARN supports a flexible resource model, allowing
applications to request specific resources based on their requirements.
- Resource Specifications: Applications can specify requirements such as
memory, CPU cores, and even network or GPU resources, enabling
precise resource allocation.
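As a rough illustration of the resource model, the request an Application Master sends to the Resource Manager can be thought of as a structured description of the containers it needs. The sketch below is conceptual only; the field names are hypothetical and do not reflect YARN's real API:

# Conceptual sketch of a container request (field names are illustrative,
# not YARN's actual request format)
container_request = {
    "memory_mb": 4096,       # memory per container
    "vcores": 2,             # CPU cores per container
    "num_containers": 10,    # how many containers the application needs
    "node_label": "gpu",     # optional placement hint, e.g. GPU-equipped nodes
}
print(container_request)
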
Processing Systems for Big Data Lecture 12:
Chapter 6 Part 3
• Introduction to Big Data Processing Systems
- Big data processing systems are essential for handling vast volumes of
information efficiently.
- As data grows exponentially, different paradigms have been developed
to cater to specific processing needs.
- Each paradigm is designed to optimize performance based on the nature
and requirements of the data being processed.

• Main Paradigms of Big Data Processing


1. Continuous Processing
- Continuous processing systems are designed to handle data as it arrives,
operating indefinitely without waiting for the entire dataset to be
available.
- This approach is ideal for applications where data is generated
continuously, such as IoT devices and live user interactions.

• Key Characteristics:
- Unbounded Data: Continuous processing works with streams of data
that flow indefinitely, processing each piece as soon as it arrives. This is
crucial for applications where data is constantly being generated and
needs immediate attention.
- Low Latency: While prioritizing throughput, continuous processing
maintains relatively low latency, typically ranging from milliseconds to
seconds. The emphasis is on efficient processing rather than meeting
strict deadlines.
• Applications:
- IoT Device Monitoring: For example, sensors in a smart home system
continuously send data about temperature, humidity, and motion, which
needs to be processed in real-time.
- Live User Interactions: Online gaming platforms process user actions
and interactions continuously to provide seamless gaming experiences.
• Example Tools:
- Apache Kafka Streams: Facilitates real-time data streaming and
processing.
- Apache Flink: Offers high-throughput, low-latency stream processing
capabilities.
- Apache Storm: Designed for processing unbounded data streams in
real-time.

• Programming Model:
- Dataflow Model: Represents the computation as a series of transformations applied to the data.
- For instance, Google Dataflow and Apache Beam use this model to
efficiently process data streams by applying a series of transformations
to each incoming data element.
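A tiny sketch of this dataflow style using the Apache Beam Python SDK (assumes the apache-beam package is installed; a small bounded input stands in for a real stream):

import apache_beam as beam

# Each element flows through a chain of transformations, which is the essence
# of the dataflow model used by Apache Beam and Google Dataflow.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.Create(["cat dog", "dog cat", "dog fish"])
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "Pair" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )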

2. Real-Time Processing
- Real-time processing ensures data is processed immediately or within
tight deadlines.
- This paradigm is crucial for applications that require strict timing
constraints and immediate responses.

• Key Characteristics:
- Timing Guarantees: Real-time processing must adhere to predefined
deadlines, often within milliseconds. This is essential for applications
where timing is critical.
- Predictability: Ensures deterministic and consistent processing of
events under strict timing requirements, providing reliability and
stability.

• Hard and Soft Real-Time:


- Hard Real-Time: Deadlines must always be met, such as in medical
devices and autonomous vehicles where delays can lead to catastrophic
outcomes.
- Soft Real-Time: Occasional delays are tolerable, such as in video
streaming and gaming where minor latency does not significantly impact
the user experience.

• Applications:
- Automotive Safety Systems: Real-time processing is vital for systems
like collision detection and avoidance in self-driving cars.
- Financial Trading Platforms: Immediate processing of market data is
crucial for executing trades at optimal times.
- Control Systems in Self-Driving Vehicles: Real-time data from sensors
and cameras is processed to make instantaneous driving decisions.

• Example Tools:
- Apache Kafka (with real-time configurations): Provides robust event
processing capabilities, ensuring timely data handling.

• Programming Model:
- Event-Driven Model: Events trigger immediate processing. For
example, obstacle avoidance systems in self-driving vehicles rely on
real-time data processing to react to environmental changes instantly.
3. Event Processing
- Event processing focuses on detecting, analyzing, and responding to
individual events.
- It can occur in continuous, real-time, or batch contexts, emphasizing the
identification and handling of specific events.

• Key Characteristics:
- Event-Driven: Systems are triggered by events rather than periodic data
processing, allowing for immediate responses to specific occurrences.

• Event Processing Patterns:


- Complex Event Processing (CEP): Detects patterns or sequences of
events over time, such as fraud detection where multiple transactions are
analyzed for suspicious activity.
- Event Correlation: Links events based on time, causality, or context to
find patterns, such as detecting cybersecurity threats through correlated
network activity.

• Applications:
- Fraud Detection: Systems monitor transactions in real-time to identify
and respond to fraudulent activities.
- Anomaly Detection: Identifying deviations from normal behavior in
systems, such as unusual network traffic patterns indicating a potential
security breach.
- User Activity Tracking: Monitoring user interactions to provide
personalized experiences or detect unusual behaviors.

• Example Tools:
- Apache Flink (with CEP libraries): Offers tools for complex event
processing, enabling the detection of intricate event patterns.
• Programming Model:
- CEP Model: Defines rules for detecting patterns or relationships
between events, such as security information and event management
systems that analyze logs for potential threats.
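A toy sketch of a CEP-style rule in plain Python (illustrative only; real engines such as Flink CEP express this through dedicated pattern APIs). The rule flags a card that makes more than three transactions within a 60-second window:

from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_TXNS = 3

recent = defaultdict(deque)  # card_id -> timestamps of recent transactions

def on_transaction(card_id, timestamp):
    window = recent[card_id]
    window.append(timestamp)
    # Drop events that fell out of the sliding window
    while window and timestamp - window[0] > WINDOW_SECONDS:
        window.popleft()
    # Rule: too many transactions inside the window triggers an alert
    if len(window) > MAX_TXNS:
        print(f"ALERT: suspicious activity on card {card_id}")

# Simulated event stream: (timestamp in seconds, card id)
for t, card in [(0, "A"), (10, "A"), (20, "A"), (25, "A"), (90, "A")]:
    on_transaction(card, t)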

4. Batch Processing
- Batch processing involves processing large volumes of data as a single
unit or "batch".
- It operates on a finite dataset available all at once, making it suitable for
scenarios where data can be collected and processed in bulk.

• Key Characteristics:
- Bounded Data: Works on fixed datasets, such as logs, transactions, or
files collected over time.
- This is ideal for applications where data does not need immediate
processing.
- High Throughput: Optimized for efficiency and scalability, focusing
on processing large volumes of data quickly rather than minimizing
latency.

• Applications:
- Historical Data Analysis: Batch processing is ideal for analyzing past
data to identify trends or generate reports.
- Data Warehousing: Large datasets are processed to populate data
warehouses for business intelligence and analytics.

• Example Tools:
- Apache Hadoop: Utilizes the MapReduce model for batch processing,
efficiently handling vast amounts of data.
• Programming Model:
- MapReduce Model: Processes data in two steps—mapping
(transforming) and reducing (aggregating).
- For example, Apache Hadoop uses this model to process log files and
generate summary reports.

• True Real-Time Processing vs. Near Real-Time Processing

1. True Real-Time Processing


- Characteristics: Processes data with minimal latency (often in
milliseconds or seconds) to provide instant results, crucial for
applications where immediate response is necessary.
Example: Fraud detection in financial transactions, where the system must
react instantly to prevent unauthorized activities.

2. Near Real-Time Processing


- Characteristics: Processes data with a slight delay, usually due to
system buffering or mini-batch processing. Latency may be in seconds to
minutes, suitable for applications where immediate response is not
critical.
Example: Social media trend analysis, where data is processed quickly but
not instantly to identify emerging trends.

• Factors Impacting Real-Time Performance


- Latency Tolerance: Some applications can tolerate slight delays, while
others require instant responses to function effectively.
- Data Volume: Higher volumes of data can introduce processing delays
if the system is not scalable enough to handle the load efficiently.
- System Design: Systems optimized for low-latency processing will have
fewer delays compared to batch-oriented systems, which prioritize
throughput over immediate response.
• True Real-Time Processing Platforms
- Apache Kafka: Known for fast, scalable, distributed real-time
processing, ideal for applications requiring immediate data handling.
- Apache Flink: Offers high-throughput, low-latency processing
capabilities.
- MillWheel: Provides real-time event processing with strong consistency
guarantees.
- Apache Apex: Supports real-time stream processing with low latency.
- Apache Samza: Designed for processing streams of data in real-time.

1. Apache Kafka
- Apache Kafka is a leading technology for real-time data processing,
renowned for its fast, scalable, distributed platform and fault-tolerant
system.

• Key Features
- Low Latency and High Throughput: Kafka is engineered to handle
large volumes of data quickly and efficiently, making it suitable for
applications requiring real-time data processing.
- Cluster Architecture: Kafka runs as a cluster on one or more servers,
storing streams of records in categories known as topics.
- Each record contains a key, a value, and a timestamp, ensuring
organized and efficient data handling.

• Apache Kafka Applications


- Data Pipelines: Kafka builds reliable pipelines for real-time data
streams, ensuring data flows quickly and in order between systems or
applications.
Example: Sharing order data from a website to a shipping system, ensuring
timely and accurate information transfer.
- Streaming Applications: Kafka enables applications to react to record
streams in real-time.
Example: An app that watches live stock prices and sends alerts based on
changes, providing timely information to users.

• Kafka's Architecture

1. Producers: Send messages (data) to Kafka.


Example: Self-driving vehicles send sensor and camera data to Kafka,
providing real-time information for processing.

2. Topics: Named categories (think of mailboxes) to which producers send
messages. Each topic can hold many messages, organized for efficient access.
Example: Vehicle-sensors topic contains sensor data from cars, categorized
for specific processing tasks.

3. Partitions: Topics are split into smaller parts for faster and scalable
processing. This division enhances performance by distributing load
across multiple servers.
Example: Partition 1 handles obstacle detection data, ensuring focused and
efficient processing.

4. Brokers: Servers that store partitions and manage data. Kafka clusters
have multiple brokers for reliability and scalability.
Example: If one broker fails, another takes over, ensuring continuous data
availability.

5. Consumers: Subscribe to topics and read messages from partitions,
enabling real-time data access and processing.
Example: Path-planning systems read sensor data to calculate routes,
providing immediate navigation solutions.
6. Zookeeper (or KRaft): Coordinates brokers, producers, and consumers,
ensuring smooth operations and reliable data management.
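A minimal producer/consumer sketch using the kafka-python client (assumes a broker reachable at localhost:9092; the vehicle-sensors topic and the message fields are illustrative):

import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: a vehicle publishes a sensor reading to the vehicle-sensors topic
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("vehicle-sensors", key=b"car-42",
              value={"speed_kmh": 63, "obstacle_distance_m": 12.5})
producer.flush()

# Consumer: a path-planning service subscribes to the same topic
consumer = KafkaConsumer(
    "vehicle-sensors",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.key, message.value)  # process each record as it arrives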

• How Kafka Maintains Low Latency


- Zero-Copy I/O: Kafka uses zero-copy transfers, sending data directly from
the operating system's page cache to the network socket without copying it
through the application, which reduces CPU overhead and improves speed.
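Kafka itself runs on the JVM; the snippet below only illustrates the underlying operating system facility, using Python's socket.sendfile as a stand-in (the file name is hypothetical):

import socket

def serve_file_once(path, port=9000):
    # sendfile() asks the OS to move bytes from the file (page cache) straight
    # to the socket, avoiding an extra copy through user-space buffers.
    with socket.create_server(("127.0.0.1", port)) as server:
        conn, _ = server.accept()
        with conn, open(path, "rb") as f:
            conn.sendfile(f)  # zero-copy transfer of the whole file

# Usage (blocks until a client connects): serve_file_once("segment.log")
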
Lecture 13: Data Warehouses and Data
Lakes
• Data Warehouses
Introduction to Data Warehouses
- In the late 1980s, IBM researchers Barry Devlin and Paul Murphy
introduced the concept of data warehouses.
- Their goal was to create an architectural model that would streamline the
flow of data into decision support environments.
- A data warehouse is defined as a subject-oriented, nonvolatile,
integrated, time-variant collection of data that supports management
decisions.

• Definition and Purpose


- A data warehouse (DW) is a large data repository where data can be
stored and integrated from various sources in a structured manner.
- It aids in decision-making processes through effective data analytics.
- The process of compiling information into a data warehouse is known as
data warehousing.

• Importance in Business Intelligence


- Data warehouses serve as crucial tools in business intelligence.
- They are used in enterprise data management by most medium and large
organizations.
- In a data warehouse framework, data is periodically extracted from the
applications that support business operations and copied onto dedicated
systems for analysis.
- The resulting data warehouse then becomes the primary source for
producing, analyzing, and presenting information, whether through
scheduled reports, web portals, or dashboards.
• Online Analytical Processing (OLAP)
- Data warehouses employ Online Analytical Processing (OLAP), which
differs from Online Transaction Processing (OLTP) systems.
- OLTP systems automate administrative data processes like order entry
and banking transactions, which are essential organizational activities.
- In contrast, data warehouses focus on decision support.
- They integrate data from various sources, aiding in analysis, data
mining, and reporting.

• Data Warehouse Architecture

A typical data warehouse architecture organizes the flow of data as follows.
• Components and Flow
1. Data Sources
- Customer Relationship Management (CRM): Systems that manage a
company's interactions with current and potential customers.
- Enterprise Resource Planning (ERP): Integrated management systems
that handle the main business processes.
- Supply Chain Management (SCM): Systems that manage the flow of
goods and services, including all processes that transform raw materials
into final products.
- External Sources: Data from outside the organization, such as market
research or social media.
- Other Sources: Any additional data sources relevant to the business.

2. ETL Process
Extract, Transform, Load (ETL): This process involves:
- Extraction: Retrieving data from various sources.
- Transformation: Converting data into a consistent format suitable for
analysis.
- Loading: Storing the transformed data in the data warehouse.

3. Data Warehouse
- Meta Data: Information about the data, such as its source, format, and
structure.
- Summary Data: Aggregated information that provides an overview of
key metrics.
- Raw Data: Unprocessed data stored for detailed analysis.

4. Outputs and Usage


- Analysis: The data warehouse supports complex queries and data
analysis for strategic decision-making.
- Reporting: Enables the generation of reports that provide insights into
business performance.
- Mining: Facilitates data mining to discover patterns and relationships
within the data.
• Example
- Consider a retail company that uses a data warehouse to consolidate
sales data from multiple stores.
- This warehouse allows the company to analyze sales trends, customer
preferences, and inventory levels, enabling informed decision-making
and strategic planning.

• Data Lakes
Introduction to Data Lakes
- By the early 21st century, new types of diverse data were emerging in
increasing volumes.
- The need for better solutions to store and analyze large amounts of semi-
structured and unstructured data became apparent.
- Traditional schema-on-write approaches like ETL were inefficient for
these requirements, leading to the development of data lakes.

• Definition and Purpose


- Data lakes are centralized storage repositories that allow users to store
raw, unprocessed data in its original format.
- This includes unstructured, semi-structured, and structured data.
- Data lakes help enterprises make better business decisions through
visualizations or dashboards derived from big data analysis, machine
learning, and real-time analytics.

• Architecture
- Data lakes use a flat architecture to store data in its raw format.
- Each data entity in the lake is associated with a unique identifier and a
set of extended metadata.
- Consumers can use purpose-built schemas to query relevant data,
resulting in a smaller dataset that can be analyzed to answer specific
questions.
• Architecture of a data lake

• Data Types
- Structured Data: Data that is organized in a predefined format, such as
databases or spreadsheets.
- Semi-structured Data: Data with some organizational properties, but
not as rigid as structured data, like JSON or XML files.
- Unstructured Data: Data without a predefined format, such as text
documents, images, or videos.
- Binary Data: Raw data in binary format, such as images or audio files.

• Data Lake
- Central Repository: The data lake acts as a centralized storage system
that holds all types of data in its raw form.
- It allows for the storage of vast amounts of data without the need for
immediate processing or structuring.
• End Users and Applications
- Data Filtering: Users can filter data to retrieve specific subsets relevant
to their needs.
- Machine Learning: The raw data in the lake can be used for training
machine learning models, enabling advanced analytics and predictions.
- Data Visualizations: Tools can be applied to visualize data, helping to
uncover insights and trends.
- Analytics Dashboard: End users can create dashboards to monitor and
analyze data in real-time, facilitating informed decision-making.

• Features and Challenges


- Data lakes offer flexibility and scalability, allowing organizations to
scale them according to their needs by separating storage from
computation.
- However, they also present challenges related to implementation and
data analytics.
- Because complex transformation and preprocessing are not required at
ingestion time, the upfront cost of loading data is reduced.

• Example
- An example of a data lake is a social media company storing raw user
interaction data, such as likes, comments, and shares.
- This data can be analyzed to understand user engagement patterns,
identify popular content, and personalize user experiences.
• Differences Between Data Warehouses and Data Lakes

1. Data
- Data Warehouse
Focus: Stores processed data related to specific business processes.
Example: A bank uses a data warehouse to store transactional data, customer
account details, and financial reports. This data is structured and processed for
generating monthly statements and compliance reports.
- Data Lake
Focus: Stores all types of data, including raw, unprocessed data.
Example: A media company uses a data lake to store video files, text articles,
and social media feeds. This raw data can be used for sentiment analysis or
content recommendation systems.

2. Processing
- Data Warehouse
Nature: Data is cleaned, transformed, and loaded into structured formats.
Example: An e-commerce platform processes order details and customer
interactions to create structured tables for sales analysis.
- Data Lake
Nature: Data remains mostly in its original form, allowing flexibility in
analysis.
Example: Sensor data from IoT devices is stored raw for later analysis to
optimize manufacturing processes.

3. Type of Data
- Data Warehouse
Format: Structured and tabular, often in relational databases.
Example: Tables containing customer demographics, purchase history, and
loyalty program details.
- Data Lake
Format: Supports unstructured, semi-structured, and structured data.
Example: XML files, JSON documents, images, and audio recordings stored
together for diverse analytical needs.

4. Task
- Data Warehouse
Optimization: Efficient data retrieval for specific queries.
Example: Generating end-of-year financial summaries and forecasts using
aggregated data.
- Data Lake
Stewardship: Facilitates collaborative data management and exploratory
analysis.
Example: Data scientists use raw datasets to experiment with machine
learning models without predefined constraints.

5. Agility
- Data Warehouse
Configuration: Fixed setups make changes cumbersome.
Example: Altering table structures requires significant planning and
downtime.
- Data Lake
Configuration: Easily adaptable to new data types and analytical
requirements.
Example: Adding new data sources like social media feeds or logs without
restructuring existing data.

6. Users
- Data Warehouse
Users: Business professionals and analysts focus on predefined reports and
KPIs.
Example: Marketing teams analyze customer segmentation data to tailor
campaigns.
- Data Lake
Users: Data scientists and developers explore data for insights and innovation.
Example: Developers build predictive models using diverse datasets stored in
the lake.

7. Storage
- Data Warehouse
Cost: High-performance storage for quick access.
Example: Using enterprise-grade SSDs to ensure rapid query responses.
- Data Lake
Cost: Economical storage solutions for large volumes.
Example: Cloud storage like Amazon S3 offers scalable, cost-efficient
options for storing massive datasets.

8. Security
- Data Warehouse
Control: Robust access controls and security measures.
Example: Implementing encryption and user authentication to protect
sensitive financial data.
- Data Lake
Control: Requires additional security layers to manage diverse data.
Example: Using access logs and encryption to secure multimedia content and
user-generated data.

9. Schema
- Data Warehouse
Schema: Defined before data entry, requiring structured formats.
Example: Predefined schemas for transaction tables and customer profiles.
- Data Lake
Schema: Applied during data analysis, allowing flexibility.
Example: Data scientists define schemas as needed when querying raw logs
or text files.

10. Data Processing


- Data Warehouse
Data ingestion: New data is integrated through extensive ETL processes.
Example: Transforming and loading new customer data from CRM systems
into structured tables.
- Data Lake
Data ingestion: New data is ingested quickly, including real-time streaming.
Example: Ingesting real-time data from online platforms for immediate
analysis.

11. Data Granularity


- Data Warehouse
Detail: Stores aggregated data for high-level insights.
Example: Summarized sales data by region and product category.
- Data Lake
Detail: Maintains detailed, granular data for in-depth analysis.
Example: Detailed logs of every user interaction on an e-commerce site for
behavioral analysis.
12. Tools
- Data Warehouse
Tools: Commercial solutions for structured data management.
Example: Using Oracle or Microsoft SQL Server for database management
and reporting.
- Data Lake
Tools: Open-source tools for handling diverse data types.
Example: Leveraging Hadoop and Spark for processing large, complex
datasets.
Lecture 14: Data Warehouse and Data Lake
Architecture Part 1
• Data Warehouse Architecture
A Data Warehouse is a centralized repository designed to store, manage, and
analyze large volumes of data for organizational decision-making.
- Its architecture determines how data is processed, stored, and accessed.
- The different architectural models of data warehouses, their
components, and their roles are explained below with examples:
1. Single-Tier Architecture
Definition: A single-layer model where the focus is on minimizing data
storage and redundancy.
- Advantages:
Minimization of Data Redundancy: By storing data in a single layer,
duplication of data is reduced.
Simplified Design: The architecture is straightforward and easy to manage.
- Disadvantages:
Lack of Separation Between Analytical and Transactional Processing:
• Analytical processing: involves querying and analyzing historical data
for insights.
• Transactional processing: focuses on daily operations like order
processing or inventory updates.
- In a single-tier architecture, both processes share the same system,
leading to performance bottlenecks.
Usage: Rarely used in practice due to its limitations.
Example: Consider a small business with a single database for both sales
transactions and reporting. As the business grows, the system becomes
inefficient because frequent reporting queries slow down transactional
processes.
2. Two-Tier Architecture
Definition: This model separates the data warehouse from the data sources
using a staging area.
• Components:
Data Sources: Operational databases, external sources, etc.
Staging Area: A temporary storage space where data is cleansed,
transformed, and prepared for loading into the data warehouse.
Data Warehouse: The central repository for processed data.
- Advantages:
Ensures that only cleansed and formatted data enters the warehouse.
Reduces the complexity of integrating data from multiple sources.
- Disadvantages:
Limited Scalability: The architecture struggles to support a growing number
of users and queries.
Not Expandable: Adding new components or functionalities is challenging.
- Usage: Suitable for small to medium-sized organizations with limited
data and user needs.
Example: A retail company uses a two-tier architecture where data from
point-of-sale systems is first processed in a staging area before being loaded
into the warehouse for reporting.

3. Three-Tier Architecture
Definition: The most commonly used architecture for data warehouses,
consisting of three layers: bottom, middle, and top tiers.
• Components:
i. Bottom Tier:
Role: Acts as the database layer where raw data is stored.
Processes: Data is extracted, cleansed, transformed, and loaded (ETL process)
into the warehouse.
Technology: Typically implemented using relational database management
systems (RDBMS).
- Data Cleansing and Transformation:
Ensures data quality by removing inconsistencies, duplicates, and errors.
Example: Standardizing customer names and addresses from multiple
sources.
- Role of RDBMS:
Traditional RDBMS is optimized for transactional processing (e.g., adding,
updating, deleting records).
Challenges:
- Not designed for handling large-scale analytical queries.
Example: Running a query to calculate total sales across all regions for the
past five years can be slow.

• Alternatives to RDBMS:
i. Parallel Database Systems:
Distributes the processing load across multiple servers.
Example: Splitting a large sales database into regional subsets, each
processed by a separate server.

ii. Multidimensional Databases (MDDBs):


Stores data in a multidimensional format, enabling faster analytical queries.
Example: A cube structure where dimensions include time, product, and
region.

ii. Middle Tier:


Role: Acts as a mediator between the database and end-users.
Technology: Online Analytical Processing (OLAP) server.
Functions:
- Provides an abstract view of the database.
- Supports multidimensional data operations like slicing, dicing, and
drilling.
- Enables efficient access to aggregated data for analytics.
• OLAP Server:
- Provides a user-friendly view of the database.
- Supports complex analytics through multidimensional operations:
1. Slicing: Extracting a subset of data based on a specific dimension.
Example: Viewing sales data for a specific year.
2. Dicing: Creating a smaller cube by selecting specific dimensions.
Example: Analyzing sales data for a specific product category and region.
3. Drilling: Navigating through data hierarchies.
Example: Drilling down from yearly sales to monthly sales.
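A small pandas sketch of the three operations on a toy sales table (illustrative only; a real OLAP server would run them against a multidimensional cube rather than a DataFrame):

import pandas as pd

sales = pd.DataFrame({
    "year":     [2023, 2023, 2024, 2024, 2024],
    "month":    [1, 2, 1, 1, 2],
    "region":   ["East", "West", "East", "West", "East"],
    "category": ["Toys", "Toys", "Books", "Toys", "Books"],
    "revenue":  [100, 150, 120, 90, 200],
})

# Slicing: fix a single dimension (a specific year)
slice_2024 = sales[sales["year"] == 2024]

# Dicing: select a smaller cube on several dimensions (category and region)
dice = sales[(sales["category"] == "Books") & (sales["region"] == "East")]

# Drilling down: move from yearly totals to monthly totals
yearly = sales.groupby("year")["revenue"].sum()
monthly = sales.groupby(["year", "month"])["revenue"].sum()
print(yearly, monthly, sep="\n")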

iii. Top Tier:


Role: The front-end client layer that interacts with users.
Components: Query tools, reporting tools, analysis tools, data mining tools,
and APIs.
Functions:
- Allows users to retrieve, analyze, and visualize data.
- Provides a user-friendly interface for business users and decision-makers
without requiring technical expertise.
Example: A multinational corporation uses a three-tier architecture to manage
its global sales data. The bottom tier stores raw sales data, the middle tier
processes this data for multidimensional analysis, and the top tier provides
dashboards and reports for decision-makers.
Tools and APIs:
- Enable users to interact with the data warehouse.
Examples:
- Query Tools: Allow users to write SQL queries to retrieve data.
- Reporting Tools: Generate operational reports (e.g., daily sales reports).
- Data Mining Tools: Discover patterns and correlations in large datasets.
Example: Identifying customer segments based on purchasing behavior.
- OLAP Tools: Facilitate multidimensional analysis.
- APIs: Allow external applications to connect to the data warehouse.
• Key Components of Data Warehouse Architecture
1. Data Sources
- Operational Data:
Refers to the data generated by day-to-day business operations, such as sales
transactions, inventory updates, customer interactions, etc.
Example: A retail store’s point-of-sale (POS) system recording daily sales.
- External Data:
Includes data from outside the organization, such as market trends, competitor
data, or social media insights.
Example: Weather data for predicting sales of seasonal products or social
media sentiment analysis for brand perception.

2. ETL Tools
Role: Convert raw data into a unified format for the data warehouse.
Steps:
i. Extraction:
Collecting data from operational systems and external sources.
Example: Extracting sales data from the POS system and customer data from
a CRM system.

ii. Transformation:
Cleansing and standardizing the data to ensure consistency and compatibility.
Example: Converting all date formats to a standard format (e.g., YYYY-MM-
DD) or aggregating sales data by region.

iii. Loading:
Importing the transformed data into the data warehouse database.
Example: Loading the cleaned sales and customer data into a centralized
repository for analysis.
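A minimal ETL sketch in Python using pandas and SQLite as stand-ins (the file, table, and column names are illustrative; a real pipeline would read from the POS/CRM systems and load into the warehouse database):

import sqlite3
import pandas as pd

# Extraction: in practice this would be pd.read_csv("pos_sales.csv") or a
# query against the operational systems; a small inline sample stands in here.
raw = pd.DataFrame({
    "order_date": ["01/02/2024", "02/02/2024", "03/02/2024"],  # DD/MM/YYYY
    "region":     ["East", "East", "West"],
    "amount":     [120.0, 80.0, 45.5],
})

# Transformation: standardize dates to YYYY-MM-DD and aggregate by region
raw["order_date"] = pd.to_datetime(raw["order_date"], dayfirst=True).dt.strftime("%Y-%m-%d")
by_region = raw.groupby(["order_date", "region"], as_index=False)["amount"].sum()

# Loading: write the cleansed data into the warehouse (SQLite as a stand-in)
conn = sqlite3.connect("warehouse.db")
by_region.to_sql("daily_sales_by_region", conn, if_exists="append", index=False)
conn.close()
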
3. Data Warehouse Database
Core Component:
- The central repository where all cleansed and transformed data is stored.
- Acts as the foundation for all analytics and reporting.
Technology: Typically implemented using RDBMS (Relational Database
Management Systems) or multidimensional databases (MDDBs) for better
analytical performance.

4. Data Marts
Definition:
- Subsets of the data warehouse that are tailored to meet the specific needs
of departments or business units.
Example: A marketing data mart containing campaign performance data or a
sales data mart with regional sales figures.
Purpose:
- Improve performance by allowing faster access to relevant data.
- Provide a focused view of data for specific analytical needs.

5. Metadata
Definition: Data about data that defines the structure and usage of the data
warehouse.
Functions:
- Describes the source, format, and relationships of data.
- Helps in building and maintaining the warehouse.
Example: Metadata might include information about the schema of the
warehouse, such as table names, column names, and data types.

6. Query Tools
Types:
- Query and Reporting Tools:
Generate operational reports.
Example: A daily sales report showing revenue by region.
- Application Development Tools:
Support custom analytical applications.
Example: A tool to analyze customer churn rates.
- Data Mining Tools:
Discover patterns and correlations.
Example: Identifying factors influencing customer loyalty.
- OLAP Tools:
Enable multidimensional analysis.
Example: Analyzing sales trends across different time periods.
Examples:
- SQL query tools for writing custom queries.
- Reporting tools for generating predefined reports (e.g., monthly sales
reports).

7. Reporting/Analysis/OLAP/Data Mining Tools


Purpose:
- These tools provide advanced capabilities for analyzing and visualizing
data stored in the warehouse.
Components:
i. Reporting Tools:
Generate static and ad-hoc reports for operational and strategic decision-
making.
Example: A report showing quarterly revenue by region.

ii. OLAP Tools:


Enable multidimensional analysis of data through slicing, dicing, and drilling.
Example: Analyzing sales trends by product category and time period.

iii. Data Mining Tools:


Discover patterns, correlations, and insights in large datasets.
Example: Identifying factors influencing customer churn.
8. Presentation Layer
Purpose:
- The final layer where data is presented to end-users in a meaningful and
actionable format.
Types of Reports:
i. Interactive Reports:
Allow users to interact with the data (e.g., drill down into specific details).
Example: A dashboard that lets users filter sales data by region and time.

ii. Ad-Hoc Reports:


Custom reports generated on-demand to answer specific business questions.
Example: A report showing the impact of a recent marketing campaign.

iii. Static Reports:


Predefined reports that remain unchanged.
Example: A daily sales report emailed to managers.
Lecture 15: Data Warehouse and Data Lake
Architecture Part 2
• Introduction to Data Lake Architecture
A data lake is a centralized repository designed to store vast amounts of raw,
structured, semi-structured, and unstructured data.
- It is built for handling diverse data types and is highly scalable to
support modern big data applications.
- Below is a detailed explanation of its key features, architecture, and
components.

• Key Features of a Data Lake


i. Centralized Repository:
- Data lakes store all kinds of data (structured, semi-structured, and
unstructured) in a single, centralized system.
Examples:
- Structured: Relational databases (e.g., customer purchase records).
- Semi-structured: JSON or XML files (e.g., web logs).
- Unstructured: Images, videos, and audio files (e.g., surveillance
footage).

ii. Schema-on-Read:
- Unlike traditional data warehouses (which use schema-on-write), data
lakes use schema-on-read.
This means:
- Data is stored in its raw format without predefined schemas.
- The schema is applied only when the data is accessed for analysis.
Example: A social media company storing raw user posts and applying
different schemas for sentiment analysis, keyword extraction, or trend
detection.
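A small sketch of the schema-on-read idea in pandas (the post fields are hypothetical): the raw records are stored as-is, and each analysis projects only the structure it needs at read time.

import pandas as pd

# Raw records land in the lake exactly as produced, with no predefined schema.
raw_posts = pd.DataFrame([
    {"post_id": 1, "text": "great product", "likes": 10, "shares": 2},
    {"post_id": 2, "text": "not happy",     "likes": 1,  "shares": 0},
])

# A schema (column selection) is applied only when the data is read:
sentiment_view  = raw_posts[["post_id", "text"]]             # for sentiment analysis
engagement_view = raw_posts[["post_id", "likes", "shares"]]  # for trend/engagement metrics
print(sentiment_view, engagement_view, sep="\n")
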
iii. Layered Architecture:
- Data lakes are organized into zones or layers (e.g., raw, processed,
curated) to manage the data lifecycle effectively.
- This ensures proper governance, security, and accessibility.

iv. Decoupled Compute and Storage:


- The compute layer (e.g., processing engines like Apache Spark, Presto,
or AWS Athena) is separated from the storage layer (e.g., Amazon S3,
Hadoop Distributed File System).
Advantages:
- Independent scaling: Storage and compute resources can be scaled
separately.
- Cost optimization: Only pay for the compute resources when they are
in use.
- Flexibility: Multiple analytics engines can access the same data without
duplication.

v. Scalable Storage:
- Data lakes are built on scalable platforms like Hadoop, Amazon S3, or
Azure Data Lake.
- They handle massive data volumes without requiring fixed schemas.
Data Lake Architecture

Data flows from diverse sources, through the various layers of the data
lake, to its final use in analytics and applications.
- Each component is explained in detail below:

1. Data Sources
A variety of data sources feed into the data lake.
These sources include:
i. Streaming Data:
Real-time data generated by IoT devices, sensors, or applications.
Example: Live stock prices or temperature readings from a sensor.

ii. Social Media:


Data from platforms like Twitter, Facebook, or Instagram.
Example: User-generated posts, likes, shares, and comments.
iii. Web Scraping:
Data extracted from websites using automated tools.
Example: Collecting product prices from e-commerce websites.

iv. Word Documents:


Text-based files that may contain reports, contracts, or other
structured/unstructured information.
v. Files:
Generic file-based data sources, such as CSV, Excel, or log files.
vi. Photos:
Image data, often used in machine learning or computer vision applications.
vii. PDF Documents:
Documents in PDF format, often requiring parsing to extract meaningful data.

2. Data Lake Layers


The layered architecture of the data lake supports the lifecycle of data from
ingestion to processing and consumption.
i. Raw Layer
Purpose:
- Acts as the initial landing zone for all ingested data.
- Stores raw, unprocessed data in its native format.
Characteristics:
- No transformations are applied at this stage.
- End-users typically do not have access to this layer.
Example: Raw JSON files from social media or unprocessed logs from a web
server.

ii. Standardized Layer


Purpose:
- Optional layer used to standardize data formats for better performance in
subsequent stages.
Characteristics:
- Data from the raw layer is converted into a format suitable for cleansing
and analysis (e.g., Parquet or ORC).
Example: Converting raw CSV files into Parquet format for efficient
querying.
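A one-step sketch of this standardization in Python (assumes pandas plus the pyarrow or fastparquet package; paths and columns are illustrative):

import os
import pandas as pd

os.makedirs("standardized", exist_ok=True)

# Stand-in for data read from a raw CSV landing file, e.g. pd.read_csv("raw/clickstream.csv")
raw = pd.DataFrame({"user_id": [1, 2], "page": ["/home", "/cart"]})

# Rewrite it as Parquet in the standardized layer for efficient columnar queries
raw.to_parquet("standardized/clickstream.parquet", index=False)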

iii. Cleansed Layer


Purpose:
- Transforms raw or standardized data into structured, consumable
datasets.
Characteristics:
- Includes data cleansing, normalization, and consolidation.
- Organized by purpose, type, and file structure.
- End-users and applications primarily interact with this layer.
Example: A cleansed dataset of customer transactions, ready for reporting.

iv. Sandbox Layer


Purpose:
- Provides a workspace for analysts and data scientists to experiment with
and enrich data.
Characteristics:
- Temporary and exploratory in nature.
- Allows integration with external data sources for testing and analysis.
Example: Analysts testing a new machine learning model using enriched
datasets.

v. Application Layer
Purpose:
- Known as the trusted or production layer, it provides ready-to-use data
for business applications.
Characteristics:
- Data is enriched with business logic and prepared for operational use.
- Often used for machine learning models, dashboards, and reporting.
Example: A dataset prepared for a predictive model identifying customer
churn.

vi. Orchestration
- Alongside the data lake layers, two orchestration processes manage data
movement:
Orchestration (Data Lake):
- Refers to managing the flow of data between layers within the data lake.
Example: Using Apache Airflow to automate the movement of data from the
raw layer to the cleansed layer.
Orchestration (Applications):
- Refers to managing the flow of data from the data lake to external
applications or systems.
Example: Scheduling the export of cleansed data for use in a business
intelligence tool.
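A toy Airflow DAG sketch for the raw-to-cleansed movement (assumes Apache Airflow 2.x is installed; the DAG name, schedule, and task body are illustrative, and parameter names can differ slightly between Airflow versions):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def promote_raw_to_cleansed():
    # Placeholder: read from the raw zone, cleanse, and write to the cleansed zone
    print("moving data from the raw layer to the cleansed layer")

with DAG(
    dag_id="lake_raw_to_cleansed",      # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="promote_raw_to_cleansed",
        python_callable=promote_raw_to_cleansed,
    )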

3. Security, Governance, Metadata, and Stewardship


- These foundational components underpin all of the data lake layers and
ensure the reliability, security, and usability of the data lake:
i. Security:
Protects data from unauthorized access.
Example: Role-based access control (RBAC) to restrict access to sensitive
data.

ii. Governance:
Ensures data quality, compliance, and monitoring.
Example: Logging all data access and transformations for audit purposes.

iii. Metadata:
Provides information about the data, such as its source, structure, and purpose.
Example: Metadata describing the schema of a dataset and its intended use.
iv. Stewardship:
Involves managing and overseeing data to ensure it is accurate and accessible.
Example: Assigning a data steward to monitor the quality of customer data.

4. Integration with Other Systems


- The data lake also integrates with other systems and applications for
analytics and operational use.
i. Archive and Offload
a. Archive:
- Stores historical data for long-term retention.
Example: Archiving old transaction logs that are rarely accessed.

b. Offload:
- Offloads resource-intensive ETL processes from traditional data
warehouses to the data lake.
Example: Moving data cleansing tasks from a data warehouse to the data
lake.

ii. Enterprise Data Warehouse (EDW)


- The data lake can feed cleansed and transformed data into an EDW for
traditional business intelligence use cases.
Example: Loading cleansed sales data into an EDW for monthly financial
reporting.

iii. OLAP and BI Tools


- OLAP (Online Analytical Processing):
Used for multidimensional analysis of data.
Example: Analyzing sales trends by region and product category.
- BI (Business Intelligence):
Tools used for generating reports and dashboards.
Example: A dashboard showing KPIs like revenue growth and customer
acquisition.

iv. Advanced Analytics
Supports advanced data analysis, such as predictive modeling and machine
learning.
Example: Using machine learning to predict customer churn based on past
behavior.

v. Operationalized Data Science


Integrates machine learning models and analytics into production systems.
Example: Deploying a recommendation engine for an e-commerce website.
