Second Exam Summary
Chapter 6 Part 1
• What is MapReduce?
- MapReduce is a programming model and framework used to process
large amounts of data in parallel across many computers.
- It was developed to handle the vast amounts of data that modern
applications generate and need to process efficiently.
• Key Concepts
- Parallel Processing: This means breaking a task into smaller parts
and processing them simultaneously across multiple computers,
making it faster to handle large datasets.
- Key-Value Pairs: In MapReduce, data is represented as pairs of keys
and values.
- Think of a key as a unique identifier and the value as the data
associated with it.
- For example, in a list of words, the word itself could be the key, and
the number of times it appears could be the value.
• The Workflow
1. Input Splitting: The input data is divided into smaller pieces called
splits. Each split is processed by a separate map task, allowing for
parallel processing.
2. Mapping: Each map task processes its split of data and outputs
intermediate key-value pairs. These pairs are then sorted by key.
3. Shuffling and Sorting: The system automatically groups all the
values associated with the same key together, ensuring that they are
sent to the same reduce task.
4. Reducing: The reduce tasks process the grouped data, performing
operations like summing, averaging, or filtering, and produce the
final output.
5. Output: The results are written to an output file, which contains the
processed data.
• Example: Word Count
1. Input Splits:
Split 1: "cat dog"
Split 2: "dog cat" and "dog fish"
2. Map Phase:
Split 1 produces: ("cat", 1), ("dog", 1)
Split 2 produces: ("dog", 1), ("cat", 1), ("dog", 1), ("fish", 1)
3. Shuffle and Sort:
Grouped by key:
"cat": [1, 1]
"dog": [1, 1, 1]
"fish": [1]
4. Reduce Phase:
"cat": 2
"dog": 3
"fish": 1
5. Final Output:
("cat", 2), ("dog", 3), ("fish", 1)
• Apache Hadoop
- Apache Hadoop is an open-source framework that uses MapReduce for
processing large datasets. It consists of:
- HDFS (Hadoop Distributed File System): Stores data across multiple
machines, ensuring fault tolerance and scalability.
- YARN (Yet Another Resource Negotiator): Manages computing
resources and schedules jobs.
• Why Hadoop?
- Hadoop is designed to handle big data efficiently by distributing tasks
across many nodes (computers), offering fault tolerance (data is
replicated across nodes), and allowing for the processing of diverse data
types.
• Differences Between Hadoop and MapReduce
- Hadoop: A comprehensive framework for distributed storage and
processing.
- MapReduce: A specific programming model used within Hadoop for
processing data.
Big Data Processing Concepts Lecture 11:
Chapter 6 Part 2
• Introduction to Big Data Processing with MapReduce
- MapReduce is a powerful programming model for processing large
datasets in parallel across distributed clusters.
- It divides tasks into smaller sub-tasks, enabling efficient data processing.
However, despite its strengths, there are several pitfalls that need
addressing for optimal performance.
• Pitfalls of MapReduce
1. Assumption of Homogeneous Nodes
- Default Scheduler: The original Hadoop MapReduce scheduler
assumes that all nodes in a cluster are homogeneous, meaning they have
similar processing power and capabilities.
- Straggler Tasks: These are tasks that take much longer to complete than
others. The scheduler attempts to mitigate this through speculative
execution: re-running copies of these tasks on idle nodes.
- Challenge: In reality, clusters often consist of heterogeneous nodes with
varying capabilities. This assumption can lead to inefficiencies in
resource utilization and task execution.
5. Configuration Complexity
- Cluster Environment: MapReduce applications typically run in clusters
composed of many computers, requiring intricate configuration and
setup.
- Challenges: The complexity involved in setting up and tuning these
clusters poses significant challenges for users, leading to a demand for
tools and optimizations that can simulate MapReduce contexts and
analyze performance dynamically.
6. Security and Data Protection
- Shared Resources: In cloud computing environments, resources are
shared among multiple users, increasing the risk of data breaches.
- Authentication Mechanisms: The original MapReduce framework
provides basic authentication methods like Token-based and Kerberos,
which may not be sufficient for protecting sensitive data.
- Security Concerns: There is a need for stronger data protection
measures and enhanced authentication and authorization protocols to
safeguard large-scale sensitive data.
• Advantages of YARN
1. Scalability: YARN enhances the scalability of Hadoop clusters,
allowing them to efficiently handle larger volumes of data and more
complex tasks.
2. Cluster Utilization: It optimizes resource usage, ensuring that all
available resources in the cluster are utilized effectively.
3. User Agility: YARN allows users to run different processing
frameworks alongside MapReduce, providing greater flexibility in data
processing.
• YARN Components
1. Resource Manager
- Role: Acts as the central authority for resource management in a YARN
cluster.
- Functions: Optimizes cluster utilization by balancing constraints such
as capacity guarantees, fairness, and service level agreements.
- Pluggable Scheduler: Allows different scheduling algorithms to be
used, such as those focusing on capacity or fairness, to meet specific
needs.
2. Application Master
- Purpose: Manages the lifecycle of applications, including negotiating
resources from the Resource Manager and coordinating with Node
Managers to execute tasks.
- Responsibilities: Requests resource containers, tracks their status,
monitors progress, and adapts to changing requirements.
3. Resource Model
- Flexibility: YARN supports a flexible resource model, allowing
applications to request specific resources based on their requirements.
- Resource Specifications: Applications can specify requirements such as
memory, CPU cores, and even network or GPU resources, enabling
precise resource allocation.
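As an illustration of this flexible resource model, the sketch below shows
the kind of information an Application Master might include in a container
request. The field names are hypothetical and do not correspond to the
actual YARN client API; they only show that memory, CPU, and other
resources can be specified per container.
```python
# Hypothetical container request an Application Master could send to the
# Resource Manager (illustrative field names, not a real YARN API).
container_request = {
    "priority": 1,
    "resource": {
        "memory_mb": 4096,   # memory per container
        "vcores": 2,         # CPU cores per container
        "gpus": 0,           # optional accelerator resources
    },
    "locality": ["rack-1"],  # preferred placement hint
    "num_containers": 10,    # how many containers of this shape are needed
}
print(container_request)
```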
Processing Systems for Big Data Lecture 12:
Chapter 6 Part 3
• Introduction to Big Data Processing Systems
- Big data processing systems are essential for handling vast volumes of
information efficiently.
- As data grows exponentially, different paradigms have been developed
to cater to specific processing needs.
- Each paradigm is designed to optimize performance based on the nature
and requirements of the data being processed.
1. Continuous Processing
• Key Characteristics:
- Unbounded Data: Continuous processing works with streams of data
that flow indefinitely, processing each piece as soon as it arrives. This is
crucial for applications where data is constantly being generated and
needs immediate attention.
- Low Latency: While prioritizing throughput, continuous processing
maintains relatively low latency, typically ranging from milliseconds to
seconds. The emphasis is on efficient processing rather than meeting
strict deadlines.
• Applications:
- IoT Device Monitoring: For example, sensors in a smart home system
continuously send data about temperature, humidity, and motion, which
needs to be processed in real-time.
- Live User Interactions: Online gaming platforms process user actions
and interactions continuously to provide seamless gaming experiences.
• Example Tools:
- Apache Kafka Streams: Facilitates real-time data streaming and
processing.
- Apache Flink: Offers high-throughput, low-latency stream processing
capabilities.
- Apache Storm: Designed for processing unbounded data streams in
real-time.
• Programming Model:
- Dataflow Model: Represents data as a series of transformations.
- For instance, Google Dataflow and Apache Beam use this model to
efficiently process data streams by applying a series of transformations
to each incoming data element.
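A minimal sketch of the dataflow model using the Apache Beam Python SDK
(assuming the apache_beam package is installed and the default local
runner): data flows through a pipeline as a series of transformations,
here counting words from a small in-memory source.
```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["cat dog", "dog cat", "dog fish"])   # bounded toy source
        | "Split" >> beam.FlatMap(lambda line: line.split())            # one element per word
        | "Count" >> beam.combiners.Count.PerElement()                  # (word, count) pairs
        | "Print" >> beam.Map(print)
    )
```
The same pipeline code can be pointed at a streaming source and a
distributed runner; the series of transformations stays unchanged.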
2. Real-Time Processing
- Real-time processing ensures data is processed immediately or within
tight deadlines.
- This paradigm is crucial for applications that require strict timing
constraints and immediate responses.
• Key Characteristics:
- Timing Guarantees: Real-time processing must adhere to predefined
deadlines, often within milliseconds. This is essential for applications
where timing is critical.
- Predictability: Ensures deterministic and consistent processing of
events under strict timing requirements, providing reliability and
stability.
• Applications:
- Automotive Safety Systems: Real-time processing is vital for systems
like collision detection and avoidance in self-driving cars.
- Financial Trading Platforms: Immediate processing of market data is
crucial for executing trades at optimal times.
- Control Systems in Self-Driving Vehicles: Real-time data from sensors
and cameras is processed to make instantaneous driving decisions.
• Example Tools:
- Apache Kafka (with real-time configurations): Provides robust event
processing capabilities, ensuring timely data handling.
• Programming Model:
- Event-Driven Model: Events trigger immediate processing. For
example, obstacle avoidance systems in self-driving vehicles rely on
real-time data processing to react to environmental changes instantly.
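A toy Python sketch of the event-driven model (illustrative only, not a
real vehicle or framework API): handlers are registered per event type and
invoked as soon as a matching event arrives.
```python
# Registry mapping event types to the handlers that react to them.
handlers = {}

def on(event_type, handler):
    handlers.setdefault(event_type, []).append(handler)

def dispatch(event):
    # Each incoming event immediately triggers its registered handlers.
    for handler in handlers.get(event["type"], []):
        handler(event)

on("obstacle_detected", lambda e: print("brake! distance =", e["distance_m"]))

incoming_events = [
    {"type": "obstacle_detected", "distance_m": 3.2},
    {"type": "lane_change", "direction": "left"},   # no handler registered, ignored
]
for event in incoming_events:
    dispatch(event)
```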
3. Event Processing
- Event processing focuses on detecting, analyzing, and responding to
individual events.
- It can occur in continuous, real-time, or batch contexts, emphasizing the
identification and handling of specific events.
• Key Characteristics:
- Event-Driven: Systems are triggered by events rather than periodic data
processing, allowing for immediate responses to specific occurrences.
• Applications:
- Fraud Detection: Systems monitor transactions in real-time to identify
and respond to fraudulent activities.
- Anomaly Detection: Identifying deviations from normal behavior in
systems, such as unusual network traffic patterns indicating a potential
security breach.
- User Activity Tracking: Monitoring user interactions to provide
personalized experiences or detect unusual behaviors.
• Example Tools:
- Apache Flink (with CEP libraries): Offers tools for complex event
processing, enabling the detection of intricate event patterns.
• Programming Model:
- CEP Model: Defines rules for detecting patterns or relationships
between events, such as security information and event management
systems that analyze logs for potential threats.
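A minimal Python sketch of a CEP-style rule (illustrative, not tied to
Flink's CEP library): flag a user when three failed logins occur within a
60-second window. The event fields, threshold, and window size are
assumptions made for the example.
```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60
THRESHOLD = 3
recent_failures = defaultdict(deque)  # per-user timestamps of failed logins

def process(event):
    if event["kind"] != "login_failed":
        return
    window = recent_failures[event["user"]]
    window.append(event["ts"])
    # Drop timestamps that fell out of the 60-second window.
    while window and event["ts"] - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= THRESHOLD:
        print(f"ALERT: possible brute force against {event['user']}")

for ts in (0, 10, 25):
    process({"kind": "login_failed", "user": "alice", "ts": ts})
```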
4. Batch Processing
- Batch processing involves processing large volumes of data as a single
unit or "batch".
- It operates on a finite dataset available all at once, making it suitable for
scenarios where data can be collected and processed in bulk.
• Key Characteristics:
- Bounded Data: Works on fixed datasets, such as logs, transactions, or
files collected over time.
- This is ideal for applications where data does not need immediate
processing.
- High Throughput: Optimized for efficiency and scalability, focusing
on processing large volumes of data quickly rather than minimizing
latency.
• Applications:
- Historical Data Analysis: Batch processing is ideal for analyzing past
data to identify trends or generate reports.
- Data Warehousing: Large datasets are processed to populate data
warehouses for business intelligence and analytics.
• Example Tools:
- Apache Hadoop: Utilizes the MapReduce model for batch processing,
efficiently handling vast amounts of data.
• Programming Model:
- MapReduce Model: Processes data in two steps—mapping
(transforming) and reducing (aggregating).
- For example, Apache Hadoop uses this model to process log files and
generate summary reports.
1. Apache Kafka
- Apache Kafka is a leading technology for real-time data processing,
renowned for being a fast, scalable, distributed, and fault-tolerant
platform.
• Key Features
- Low Latency and High Throughput: Kafka is engineered to handle
large volumes of data quickly and efficiently, making it suitable for
applications requiring real-time data processing.
- Cluster Architecture: Kafka runs as a cluster spanning any number of
servers, storing streams of records in categories known as topics.
- Each record contains a key, a value, and a timestamp, ensuring
organized and efficient data handling.
• Kafka's Architecture
2. Topics: Mailboxes where producers send messages. Each topic can hold
many messages, organized for efficient access.
Example: Vehicle-sensors topic contains sensor data from cars, categorized
for specific processing tasks.
3. Partitions: Topics are split into smaller parts for faster and scalable
processing. This division enhances performance by distributing load
across multiple servers.
Example: Partition 1 handles obstacle detection data, ensuring focused and
efficient processing.
4. Brokers: Servers that store partitions and manage data. Kafka clusters
have multiple brokers for reliability and scalability.
Example: If one broker fails, another takes over, ensuring continuous data
availability.
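A short producer/consumer sketch using the kafka-python client, assuming a
broker is reachable at localhost:9092 and reusing the vehicle-sensors
topic from the example above; the record fields are invented for
illustration.
```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: send a keyed JSON record to the vehicle-sensors topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("vehicle-sensors", key="car-42", value={"speed_kmh": 63, "obstacle": False})
producer.flush()

# Consumer: read records back; each one carries a key, a value, and a timestamp.
consumer = KafkaConsumer(
    "vehicle-sensors",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5 s of silence
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for record in consumer:
    print(record.key, record.value, record.timestamp)
```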
• Data Warehouses
- The following describes the architecture and flow of data within a data
warehouse system.
• Components and Flow
1. Data Sources
- Customer Relationship Management (CRM): Systems that manage a
company's interactions with current and potential customers.
- Enterprise Resource Planning (ERP): Integrated management systems
that handle the main business processes.
- Supply Chain Management (SCM): Systems that manage the flow of
goods and services, including all processes that transform raw materials
into final products.
- External Sources: Data from outside the organization, such as market
research or social media.
- Other Sources: Any additional data sources relevant to the business.
2. ETL Process
Extract, Transform, Load (ETL): This process involves:
- Extraction: Retrieving data from various sources.
- Transformation: Converting data into a consistent format suitable for
analysis.
- Loading: Storing the transformed data in the data warehouse.
3. Data Warehouse
- Meta Data: Information about the data, such as its source, format, and
structure.
- Summary Data: Aggregated information that provides an overview of
key metrics.
- Raw Data: Unprocessed data stored for detailed analysis.
• Data Lakes
Introduction to Data Lakes
- By the early 21st century, new types of diverse data were emerging in
increasing volumes.
- The need for better solutions to store and analyze large amounts of semi-
structured and unstructured data became apparent.
- Traditional schema-on-write approaches, such as ETL pipelines feeding a
data warehouse, were inefficient for these requirements, leading to the
development of data lakes.
• Architecture
- Data lakes use a flat architecture to store data in its raw format.
- Each data entity in the lake is associated with a unique identifier and a
set of extended metadata.
- Consumers can use purpose-built schemas to query relevant data,
resulting in a smaller dataset that can be analyzed to answer specific
questions.
• Architecture of a data lake
• Data Types
- Structured Data: Data that is organized in a predefined format, such as
databases or spreadsheets.
- Semi-structured Data: Data with some organizational properties, but
not as rigid as structured data, like JSON or XML files.
- Unstructured Data: Data without a predefined format, such as text
documents, images, or videos.
- Binary Data: Raw data in binary format, such as images or audio files.
• Data Lake
- Central Repository: The data lake acts as a centralized storage system
that holds all types of data in its raw form.
- It allows for the storage of vast amounts of data without the need for
immediate processing or structuring.
• End Users and Applications
- Data Filtering: Users can filter data to retrieve specific subsets relevant
to their needs.
- Machine Learning: The raw data in the lake can be used for training
machine learning models, enabling advanced analytics and predictions.
- Data Visualizations: Tools can be applied to visualize data, helping to
uncover insights and trends.
- Analytics Dashboard: End users can create dashboards to monitor and
analyze data in real-time, facilitating informed decision-making.
• Example
- An example of a data lake is a social media company storing raw user
interaction data, such as likes, comments, and shares.
- This data can be analyzed to understand user engagement patterns,
identify popular content, and personalize user experiences.
• Differences Between Data Warehouses and Data Lakes
1. Data
- Data Warehouse
Focus: Stores processed data related to specific business processes.
Example: A bank uses a data warehouse to store transactional data, customer
account details, and financial reports. This data is structured and processed for
generating monthly statements and compliance reports.
- Data Lake
Focus: Stores all types of data, including raw, unprocessed data.
Example: A media company uses a data lake to store video files, text articles,
and social media feeds. This raw data can be used for sentiment analysis or
content recommendation systems.
2. Processing
- Data Warehouse
Nature: Data is cleaned, transformed, and loaded into structured formats.
Example: An e-commerce platform processes order details and customer
interactions to create structured tables for sales analysis.
- Data Lake
Nature: Data remains mostly in its original form, allowing flexibility in
analysis.
Example: Sensor data from IoT devices is stored raw for later analysis to
optimize manufacturing processes.
3. Type of Data
- Data Warehouse
Format: Structured and tabular, often in relational databases.
Example: Tables containing customer demographics, purchase history, and
loyalty program details.
- Data Lake
Format: Supports unstructured, semi-structured, and structured data.
Example: XML files, JSON documents, images, and audio recordings stored
together for diverse analytical needs.
4. Task
- Data Warehouse
Optimization: Efficient data retrieval for specific queries.
Example: Generating end-of-year financial summaries and forecasts using
aggregated data.
- Data Lake
Stewardship: Facilitates collaborative data management and exploratory
analysis.
Example: Data scientists use raw datasets to experiment with machine
learning models without predefined constraints.
5. Agility
- Data Warehouse
Configuration: Fixed setups make changes cumbersome.
Example: Altering table structures requires significant planning and
downtime.
- Data Lake
Configuration: Easily adaptable to new data types and analytical
requirements.
Example: Adding new data sources like social media feeds or logs without
restructuring existing data.
6. Users
- Data Warehouse
Users: Business professionals and analysts focus on predefined reports and
KPIs.
Example: Marketing teams analyze customer segmentation data to tailor
campaigns.
- Data Lake
Users: Data scientists and developers explore data for insights and innovation.
Example: Developers build predictive models using diverse datasets stored in
the lake.
7. Storage
- Data Warehouse
Cost: High-performance storage for quick access.
Example: Using enterprise-grade SSDs to ensure rapid query responses.
- Data Lake
Cost: Economical storage solutions for large volumes.
Example: Cloud storage like Amazon S3 offers scalable, cost-efficient
options for storing massive datasets.
8. Security
- Data Warehouse
Control: Robust access controls and security measures.
Example: Implementing encryption and user authentication to protect
sensitive financial data.
- Data Lake
Control: Requires additional security layers to manage diverse data.
Example: Using access logs and encryption to secure multimedia content and
user-generated data.
9. Schema
- Data Warehouse
Schema: Defined before data entry, requiring structured formats.
Example: Predefined schemas for transaction tables and customer profiles.
- Data Lake
Schema: Applied during data analysis, allowing flexibility.
Example: Data scientists define schemas as needed when querying raw logs
or text files.
3. Three-Tier Architecture
Definition: The most commonly used architecture for data warehouses,
consisting of three layers: bottom, middle, and top tiers.
• Components:
i. Bottom Tier:
Role: Acts as the database layer where raw data is stored.
Processes: Data is extracted, cleansed, transformed, and loaded (ETL process)
into the warehouse.
Technology: Typically implemented using relational database management
systems (RDBMS).
- Data Cleansing and Transformation:
Ensures data quality by removing inconsistencies, duplicates, and errors.
Example: Standardizing customer names and addresses from multiple
sources.
- Role of RDBMS:
Traditional RDBMS is optimized for transactional processing (e.g., adding,
updating, deleting records).
Challenges:
- Not designed for handling large-scale analytical queries.
Example: Running a query to calculate total sales across all regions for the
past five years can be slow.
• Alternatives to RDBMS:
i. Parallel Database Systems:
Distributes the processing load across multiple servers.
Example: Splitting a large sales database into regional subsets, each
processed by a separate server.
1. ETL Tools
Role: Convert raw data into a unified format for the data warehouse.
Steps:
i. Extraction:
Collecting data from operational systems and external sources.
Example: Extracting sales data from the POS system and customer data from
a CRM system.
ii. Transformation:
Cleansing and standardizing the data to ensure consistency and compatibility.
Example: Converting all date formats to a standard format (e.g., YYYY-MM-
DD) or aggregating sales data by region.
iii. Loading:
Importing the transformed data into the data warehouse database.
Example: Loading the cleaned sales and customer data into a centralized
repository for analysis.
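A compact Python sketch of the three ETL steps, using invented sample
records and an in-memory SQLite database as a stand-in for the warehouse:
dates are standardized to YYYY-MM-DD during transformation, then the
cleaned rows are loaded.
```python
import sqlite3
from datetime import datetime

# Extract: raw records as they might arrive from a POS export (invented data).
raw_sales = [
    {"order_id": "1001", "date": "03/15/2024", "region": "north", "amount": "19.99"},
    {"order_id": "1002", "date": "2024-03-16", "region": "North",  "amount": "5.50"},
]

def standardize_date(value):
    # Transform: try common source formats and emit YYYY-MM-DD.
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value  # leave unparseable dates unchanged

rows = [
    (r["order_id"], standardize_date(r["date"]), r["region"].title(), float(r["amount"]))
    for r in raw_sales
]

# Load: insert the cleaned rows into the warehouse table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (order_id TEXT, sale_date TEXT, region TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
print(db.execute("SELECT * FROM sales").fetchall())
```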
2. Data Warehouse Database
Core Component:
- The central repository where all cleansed and transformed data is stored.
- Acts as the foundation for all analytics and reporting.
Technology: Typically implemented using RDBMS (Relational Database
Management Systems) or multidimensional databases (MDDBs) for better
analytical performance.
3. Data Marts
Definition:
- Subsets of the data warehouse that are tailored to meet the specific needs
of departments or business units.
Example: A marketing data mart containing campaign performance data or a
sales data mart with regional sales figures.
Purpose:
- Improve performance by allowing faster access to relevant data.
- Provide a focused view of data for specific analytical needs.
4. Metadata
Definition: Data about data that defines the structure and usage of the data
warehouse.
Functions:
- Describes the source, format, and relationships of data.
- Helps in building and maintaining the warehouse.
Example: Metadata might include information about the schema of the
warehouse, such as table names, column names, and data types.
5. Query Tools
Types:
- Query and Reporting Tools:
Generate operational reports.
Example: A daily sales report showing revenue by region.
- Application Development Tools:
Support custom analytical applications.
Example: A tool to analyze customer churn rates.
- Data Mining Tools:
Discover patterns and correlations.
Example: Identifying factors influencing customer loyalty.
- OLAP Tools:
Enable multidimensional analysis.
Example: Analyzing sales trends across different time periods.
Examples:
- SQL query tools for writing custom queries.
- Reporting tools for generating predefined reports (e.g., monthly sales
reports).
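As a small illustration of what a query/reporting tool produces, the
sketch below runs a daily revenue-by-region query against an in-memory
SQLite table populated with invented sample rows.
```python
import sqlite3

# Sample data standing in for the warehouse's sales table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (sale_date TEXT, region TEXT, amount REAL)")
db.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("2024-03-15", "North", 19.99), ("2024-03-15", "South", 42.00), ("2024-03-16", "North", 5.50)],
)

# Daily revenue by region: the kind of predefined report a query tool generates.
report = db.execute(
    "SELECT sale_date, region, SUM(amount) AS revenue "
    "FROM sales GROUP BY sale_date, region ORDER BY sale_date, region"
).fetchall()
for row in report:
    print(row)
```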
• Key Characteristics of Data Lakes
ii. Schema-on-Read:
- Unlike traditional data warehouses (which use schema-on-write), data
lakes use schema-on-read.
This means:
- Data is stored in its raw format without predefined schemas.
- The schema is applied only when the data is accessed for analysis.
Example: A social media company storing raw user posts and applying
different schemas for sentiment analysis, keyword extraction, or trend
detection.
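A brief Python sketch of schema-on-read: raw JSON posts are stored without
any enforced schema, and a sentiment-analysis job applies its own minimal
schema (field selection plus defaults) only at read time. The data and
field names are invented for the example.
```python
import json

# Raw posts stored as JSON lines in the lake; no schema enforced at write time.
raw_lines = [
    '{"user": "a", "text": "loving the new phone!", "likes": 12, "lang": "en"}',
    '{"user": "b", "text": "meh", "likes": 1}',
]

# Schema applied at read time for a sentiment-analysis job: keep only the
# fields that job needs, supplying defaults for anything missing.
sentiment_view = [
    {"text": rec.get("text", ""), "lang": rec.get("lang", "unknown")}
    for rec in map(json.loads, raw_lines)
]
print(sentiment_view)
```
A keyword-extraction or trend-detection job could read the same raw lines
and apply a different schema, without changing anything in storage.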
iii. Layered Architecture:
- Data lakes are organized into zones or layers (e.g., raw, processed,
curated) to manage the data lifecycle effectively.
- This ensures proper governance, security, and accessibility.
v. Scalable Storage:
- Data lakes are built on scalable platforms like Hadoop, Amazon S3, or
Azure Data Lake.
- They handle massive data volumes without requiring fixed schemas.
Data Lake Architecture
The diagram highlights the flow of data from diverse sources, through
various layers of the data lake, to its final use in analytics and applications.
- Below is a detailed explanation of each component in the image:
1. Data Sources
The leftmost section of the diagram shows the various data sources that feed
into the data lake.
These sources include:
i. Streaming Data:
Real-time data generated by IoT devices, sensors, or applications.
Example: Live stock prices or temperature readings from a sensor.
v. Application Layer
Purpose:
- Known as the trusted or production layer, it provides ready-to-use data
for business applications.
Characteristics:
- Data is enriched with business logic and prepared for operational use.
- Often used for machine learning models, dashboards, and reporting.
Example: A dataset prepared for a predictive model identifying customer
churn.
vi. Orchestration
- Above the data lake layers, the diagram highlights orchestration
processes:
Orchestration (Data Lake):
- Refers to managing the flow of data between layers within the data lake.
Example: Using Apache Airflow to automate the movement of data from the
raw layer to the cleansed layer.
Orchestration (Applications):
- Refers to managing the flow of data from the data lake to external
applications or systems.
Example: Scheduling the export of cleansed data for use in a business
intelligence tool.
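A minimal orchestration sketch, assuming Apache Airflow 2.x: a daily DAG
with a single task that would move data from the raw layer to the cleansed
layer. The DAG name, zone paths, and task logic are placeholders.
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def move_raw_to_cleansed():
    # Placeholder: read from the raw zone, apply cleansing rules,
    # and write the result to the cleansed zone.
    print("raw -> cleansed transfer would run here")

with DAG(
    dag_id="raw_to_cleansed",       # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",     # run the transfer once per day
    catchup=False,
) as dag:
    PythonOperator(
        task_id="move_raw_to_cleansed",
        python_callable=move_raw_to_cleansed,
    )
```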
ii. Governance:
Ensures data quality, compliance, and monitoring.
Example: Logging all data access and transformations for audit purposes.
iii. Metadata:
Provides information about the data, such as its source, structure, and purpose.
Example: Metadata describing the schema of a dataset and its intended use.
iv. Stewardship:
Involves managing and overseeing data to ensure it is accurate and accessible.
Example: Assigning a data steward to monitor the quality of customer data.
b. Offload:
- Offloads resource-intensive ETL processes from traditional data
warehouses to the data lake.
Example: Moving data cleansing tasks from a data warehouse to the data
lake.
- Advanced Analytics
Supports advanced data analysis, such as predictive modeling and machine
learning.
Example: Using machine learning to predict customer churn based on past
behavior.