
Moving Data In and Out of Hadoop

• Moving data in and out of Hadoop is referred to as ingress and egress.


• Hadoop supports ingress and egress at a low level in HDFS and MapReduce.
• Files can be moved in and out of Hadoop at the HDFS level (writing external data at the HDFS level is a data push), and data can be pulled from external data sources or pushed to external data sinks using MapReduce (reading external data at the MapReduce level is a data pull), as sketched below.
Key Elements of Ingress and Egress
• Idempotence
• Aggregation
• Data Format Transformation
• Recoverability
• Correctness
• Resource Consumption and Performance
• Monitoring
Hadoop Ingress with Different Data Sources: Log Files, Semi-Structured/Binary Files, and HBase

• Flume, Chukwa, and Scribe are log collection and distribution frameworks that use HDFS as a data sink for log data.
• Flume
• Flume is a distributed system for collecting streaming data.
• It is highly customizable and supports a plugin architecture (a minimal agent configuration is sketched below).
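For illustration only, a minimal Flume agent configuration might look like the properties file below, which tails a local application log and delivers the events to HDFS. The agent, source, channel, and sink names, the log path, and the HDFS path are all hypothetical.

```
# Hypothetical Flume agent: tail a local log and sink it into HDFS
agent1.sources  = logsrc
agent1.channels = memch
agent1.sinks    = hdfssink

# Source: follow an application log file
agent1.sources.logsrc.type = exec
agent1.sources.logsrc.command = tail -F /var/log/app/app.log
agent1.sources.logsrc.channels = memch

# Channel: buffer events in memory
agent1.channels.memch.type = memory

# Sink: write events into HDFS
agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.hdfs.path = hdfs://namenode:8020/flume/app-logs
agent1.sinks.hdfssink.channel = memch
```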
Chukwa (an Apache subproject for collecting and storing data in HDFS)
Scribe
• Purpose: Scribe is used for collecting and distributing log data
across multiple nodes.
• Functionality: A Scribe server runs on each node and forwards
logs to a central Scribe server.
• Reliability: Logs are persisted to a local disk if the downstream
server is unreachable.
• Supported Data Sinks: It can store logs in various storage
backends, including HDFS, NFS, and regular filesystems.
Difference from Other Log Collectors:
• Unlike Flume or Chukwa, Scribe does not pull logs automatically.
• Instead, the user must push log data to the Scribe server.
• For example, Apache logs require writing a daemon (background
process) to forward logs to Scribe.
Technique 2: An automated mechanism to copy files into HDFS

• Existing tools like Flume, Scribe, and Chukwa are mainly designed for log file transportation. What if you need to transfer files in other formats, such as semi-structured or binary files?

Solution:
• The HDFS File Slurper is an open-source utility that can copy any file format into or out of
HDFS.

How the HDFS File Slurper Works:

The HDFS File Slurper is a simple tool that automates copying files from a local directory into HDFS, and vice versa. It follows a structured five-step process (a simplified sketch in code follows the list):
• Scan: The Slurper reads files from the source directory.
• Determine HDFS destination: Optionally, it consults a script to determine where in HDFS
the file should be placed.
• Write: The file is copied to HDFS.
• Verify: An optional verification step ensures successful transfer.
• Relocate file: The original file is moved to a completed directory after a successful copy.
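The real HDFS File Slurper is more involved, but a minimal sketch of the same five-step loop, using only the standard Hadoop FileSystem API, might look like the code below. The directory paths and the size-comparison verification are assumptions made for illustration, not the tool's actual implementation.

```java
import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SlurperSketch {
    public static void main(String[] args) throws Exception {
        File sourceDir    = new File("/data/inbound");      // hypothetical local source directory
        File completedDir = new File("/data/completed");    // hypothetical "completed" directory
        Path hdfsDestDir  = new Path("/ingest/incoming");   // hypothetical HDFS destination

        FileSystem fs = FileSystem.get(new Configuration());

        // Step 1: scan the source directory for files to copy.
        File[] candidates = sourceDir.listFiles(File::isFile);
        if (candidates == null) return;

        for (File local : candidates) {
            // Step 2: determine the HDFS destination (the real Slurper can delegate this to a script).
            Path target = new Path(hdfsDestDir, local.getName());

            // Step 3: write the file to HDFS.
            fs.copyFromLocalFile(new Path(local.getAbsolutePath()), target);

            // Step 4: verify the transfer (here, a simple file-size comparison).
            long hdfsLen = fs.getFileStatus(target).getLen();
            if (hdfsLen != local.length()) {
                throw new IllegalStateException("Verification failed for " + local);
            }

            // Step 5: relocate the original file to the completed directory.
            File done = new File(completedDir, local.getName());
            if (!local.renameTo(done)) {
                throw new IllegalStateException("Could not relocate " + local);
            }
        }
        fs.close();
    }
}
```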
Technique 3: Scheduling Regular Ingress Activities with Oozie

• If your data resides on a filesystem, web server, or other system, you need a way to regularly pull it into Hadoop.

The challenge consists of two tasks:
• Importing data into Hadoop.
• Scheduling regular data transfers.
Oozie is used to automate data ingress into HDFS.
• It can also trigger post-ingress activities, such as launching a
MapReduce job to process the data.
• Oozie is an Apache project that originated at Yahoo! and acts as a
workflow engine for Hadoop.
• Oozie’s coordinator engine can schedule tasks based on time and data triggers (an example coordinator definition is sketched below).
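By way of illustration, a coordinator definition that triggers an ingress workflow once per day might look like the sketch below. The application name, path, and date range are hypothetical, and the referenced workflow (which would perform the actual copy into HDFS) is assumed to exist separately.

```xml
<!-- Hypothetical daily-ingress coordinator: runs an ingest workflow once per day -->
<coordinator-app name="daily-ingress" frequency="${coord:days(1)}"
                 start="2024-01-01T00:00Z" end="2024-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- Workflow that performs the actual copy/ingest into HDFS -->
      <app-path>hdfs://namenode:8020/apps/ingress-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>
```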
• We want to move data from a relational database into HDFS using
MapReduce while managing concurrent database connections effectively.
• Solution: This technique uses the DBInputFormat class to import data
from a relational database into HDFS. It ensures mechanisms are in place
to handle the load on the database.
• Key Classes:
• DBInputFormat: Reads data from the database via JDBC (Java Database Connectivity).
• DBOutputFormat: Writes data back to the database.
• How It Works: DBInputFormat reads data from a relational database and maps it into the Hadoop ecosystem. To do this, it requires a bean representation of the table that implements the Writable and DBWritable interfaces. Writable is Hadoop's own serialization/deserialization interface, while DBWritable maps JDBC rows to the bean's fields; a minimal bean and job setup are sketched below.
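The sketch below shows such a bean and the corresponding job setup. The table name ("users"), column names, JDBC URL, and credentials are hypothetical, and a real job would also configure a mapper and an output format.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

/** Bean representing one row of a hypothetical "users" table. */
public class UserRecord implements Writable, DBWritable {
    private int id;
    private String name;

    // DBWritable: read one row from the JDBC ResultSet.
    public void readFields(ResultSet rs) throws SQLException {
        id = rs.getInt("id");
        name = rs.getString("name");
    }

    // DBWritable: bind the fields to a PreparedStatement (used when writing back).
    public void write(PreparedStatement stmt) throws SQLException {
        stmt.setInt(1, id);
        stmt.setString(2, name);
    }

    // Writable: Hadoop's own serialization between map and reduce stages.
    public void readFields(DataInput in) throws IOException {
        id = in.readInt();
        name = in.readUTF();
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeUTF(name);
    }

    /** Job setup: point DBInputFormat at the table via JDBC. */
    public static Job createJob() throws IOException {
        Configuration conf = new Configuration();
        DBConfiguration.configureDB(conf,
                "com.mysql.jdbc.Driver",                          // hypothetical JDBC driver
                "jdbc:mysql://db.example.com/app", "user", "password");

        Job job = Job.getInstance(conf, "db-import");
        job.setInputFormatClass(DBInputFormat.class);
        // Columns to read, in the order expected by readFields(ResultSet).
        DBInputFormat.setInput(job, UserRecord.class,
                "users", null /* conditions */, "id" /* order by */, "id", "name");
        return job;
    }
}
```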
• We want to load relational data into the Hadoop cluster in an efficient, scalable, and idempotent way, without the complexity of implementing custom MapReduce logic.
• Sqoop is a tool designed for bulk data transfer between relational
databases and Hadoop.
• It supports importing data into HDFS, Hive, or HBase and exporting
data back into relational databases.
• Created by Cloudera, it’s an Apache project in incubation.
• Importing Process: Importing data with Sqoop involves two main activities:
• Connecting to the Data Source: Sqoop gathers metadata and
statistics from the source database.
• Executing the Import: A MapReduce job is launched to bring the data
into Hadoop.
• Sqoop uses connectors to interact with databases. There are two types:
• Common Connector: Handles regular reads and writes.
• Fast Connector: Uses database-specific optimizations for bulk data imports, making the process more efficient (an example import command follows).
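For illustration, a typical Sqoop import invocation might look like the command below; the connection string, credentials file, table, and target directory are hypothetical.

```
sqoop import \
  --connect jdbc:mysql://db.example.com/app \
  --username user --password-file /user/hadoop/.db-password \
  --table users \
  --target-dir /data/users \
  --num-mappers 4
```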
