
Moving Data In and Out of Hadoop

• Moving data in and out of Hadoop is referred to as ingress and egress.


• Hadoop supports ingress and egress at a low level in HDFS and MapReduce.
• Files can be moved in and out of Hadoop at the HDFS level (writing external data at the HDFS level is a data push), and data can be pulled from external data sources or pushed to external data sinks using MapReduce (reading external data at the MapReduce level is a data pull), as sketched below.
Key Elements of Ingress and Egress
• Idempotence
• Aggregation
• Data Format Transformation
• Recoverability
• Correctness
• Resource Consumption and Performance
• Monitoring
Hadoop Ingress with Different Data Sources: Log Files, Semi-Structured/Binary Files, and HBase

• Flume, Chukwa, and Scribe are log collection and distribution frameworks that use HDFS as a data sink for log data.
• Flume
• Flume is a distributed system for collecting streaming data.
• It is highly customizable and supports a plugin architecture (a minimal agent configuration is sketched below).
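For illustration only, a minimal Flume agent configuration might look like the properties file below, which tails a local application log and delivers the events to HDFS. The agent, source, channel, and sink names, the log path, and the HDFS path are all hypothetical.

```
# Hypothetical Flume agent: tail a local log and sink it into HDFS
agent1.sources  = logsrc
agent1.channels = memch
agent1.sinks    = hdfssink

# Source: follow an application log file
agent1.sources.logsrc.type = exec
agent1.sources.logsrc.command = tail -F /var/log/app/app.log
agent1.sources.logsrc.channels = memch

# Channel: buffer events in memory
agent1.channels.memch.type = memory

# Sink: write events into HDFS
agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.hdfs.path = hdfs://namenode:8020/flume/app-logs
agent1.sinks.hdfssink.channel = memch
```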
Chukwa (an Apache subproject for collecting and storing data in HDFS)
Scribe
• Purpose: Scribe is used for collecting and distributing log data
across multiple nodes.
• Functionality: A Scribe server runs on each node and forwards
logs to a central Scribe server.
• Reliability: Logs are persisted to a local disk if the downstream
server is unreachable.
• Supported Data Sinks: It can store logs in various storage
backends, including HDFS, NFS, and regular filesystems.
Difference from Other Log Collectors:
• Unlike Flume or Chukwa, Scribe does not pull logs automatically.
• Instead, the user must push log data to the Scribe server.
• For example, Apache logs require writing a daemon (background
process) to forward logs to Scribe.
Technique 2: An automated mechanism to copy files into HDFS

• Existing tools like Flume, Scribe, and Chukwa are mainly designed for log file transportation. What if you need to transfer files in other formats, such as semi-structured or binary files?

Solution:
• The HDFS File Slurper is an open-source utility that can copy any file format into or out of
HDFS.

How the HDFS File Slurper Works:

The HDFS File Slurper is a simple tool that automates copying files from a local directory into HDFS, and vice versa. It follows a structured five-step process (a simplified sketch in code follows the list):
• Scan: The Slurper reads files from the source directory.
• Determine HDFS destination: Optionally, it consults a script to determine where in HDFS
the file should be placed.
• Write: The file is copied to HDFS.
• Verify: An optional verification step ensures successful transfer.
• Relocate file: The original file is moved to a completed directory after a successful copy.
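The real HDFS File Slurper is more involved, but a minimal sketch of the same five-step loop, using only the standard Hadoop FileSystem API, might look like the code below. The directory paths and the size-comparison verification are assumptions made for illustration, not the tool's actual implementation.

```java
import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SlurperSketch {
    public static void main(String[] args) throws Exception {
        File sourceDir    = new File("/data/inbound");      // hypothetical local source directory
        File completedDir = new File("/data/completed");    // hypothetical "completed" directory
        Path hdfsDestDir  = new Path("/ingest/incoming");   // hypothetical HDFS destination

        FileSystem fs = FileSystem.get(new Configuration());

        // Step 1: scan the source directory for files to copy.
        File[] candidates = sourceDir.listFiles(File::isFile);
        if (candidates == null) return;

        for (File local : candidates) {
            // Step 2: determine the HDFS destination (the real Slurper can delegate this to a script).
            Path target = new Path(hdfsDestDir, local.getName());

            // Step 3: write the file to HDFS.
            fs.copyFromLocalFile(new Path(local.getAbsolutePath()), target);

            // Step 4: verify the transfer (here, a simple file-size comparison).
            long hdfsLen = fs.getFileStatus(target).getLen();
            if (hdfsLen != local.length()) {
                throw new IllegalStateException("Verification failed for " + local);
            }

            // Step 5: relocate the original file to the completed directory.
            File done = new File(completedDir, local.getName());
            if (!local.renameTo(done)) {
                throw new IllegalStateException("Could not relocate " + local);
            }
        }
        fs.close();
    }
}
```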
Technique 3: Scheduling Regular Ingress Activities with Oozie

• If your data resides on a filesystem, web server, or other system, you need a way to regularly pull it into Hadoop.

The challenge consists of two tasks:
• Importing data into Hadoop.
• Scheduling regular data transfers.
Oozie is used to automate data ingress into HDFS.
• It can also trigger post-ingress activities, such as launching a
MapReduce job to process the data.
• Oozie is an Apache project that originated at Yahoo! and acts as a
workflow engine for Hadoop.
• Oozie’s coordinator engine can schedule tasks based on time and data triggers (an example coordinator definition is sketched below).
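By way of illustration, a coordinator definition that triggers an ingress workflow once per day might look like the sketch below. The application name, path, and date range are hypothetical, and the referenced workflow (which would perform the actual copy into HDFS) is assumed to exist separately.

```xml
<!-- Hypothetical daily-ingress coordinator: runs an ingest workflow once per day -->
<coordinator-app name="daily-ingress" frequency="${coord:days(1)}"
                 start="2024-01-01T00:00Z" end="2024-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- Workflow that performs the actual copy/ingest into HDFS -->
      <app-path>hdfs://namenode:8020/apps/ingress-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>
```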
• We want to move data from a relational database into HDFS using
MapReduce while managing concurrent database connections effectively.
• Solution: This technique uses the DBInputFormat class to import data
from a relational database into HDFS. It ensures mechanisms are in place
to handle the load on the database.
• Key Classes:
• DBInputFormat: Reads data from the database via JDBC (Java Database Connectivity).
• DBOutputFormat: Writes data back to the database.
• How It Works: DBInputFormat reads data from a relational database and maps it into the Hadoop ecosystem. To do this, it requires a bean representation of the table that implements the Writable and DBWritable interfaces. Writable is Hadoop's own serialization/deserialization interface, while DBWritable maps JDBC rows to the bean's fields; a minimal bean and job setup are sketched below.
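The sketch below shows such a bean and the corresponding job setup. The table name ("users"), column names, JDBC URL, and credentials are hypothetical, and a real job would also configure a mapper and an output format.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

/** Bean representing one row of a hypothetical "users" table. */
public class UserRecord implements Writable, DBWritable {
    private int id;
    private String name;

    // DBWritable: read one row from the JDBC ResultSet.
    public void readFields(ResultSet rs) throws SQLException {
        id = rs.getInt("id");
        name = rs.getString("name");
    }

    // DBWritable: bind the fields to a PreparedStatement (used when writing back).
    public void write(PreparedStatement stmt) throws SQLException {
        stmt.setInt(1, id);
        stmt.setString(2, name);
    }

    // Writable: Hadoop's own serialization between map and reduce stages.
    public void readFields(DataInput in) throws IOException {
        id = in.readInt();
        name = in.readUTF();
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeUTF(name);
    }

    /** Job setup: point DBInputFormat at the table via JDBC. */
    public static Job createJob() throws IOException {
        Configuration conf = new Configuration();
        DBConfiguration.configureDB(conf,
                "com.mysql.jdbc.Driver",                          // hypothetical JDBC driver
                "jdbc:mysql://db.example.com/app", "user", "password");

        Job job = Job.getInstance(conf, "db-import");
        job.setInputFormatClass(DBInputFormat.class);
        // Columns to read, in the order expected by readFields(ResultSet).
        DBInputFormat.setInput(job, UserRecord.class,
                "users", null /* conditions */, "id" /* order by */, "id", "name");
        return job;
    }
}
```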
• We want to load relational data into the Hadoop cluster in an efficient, scalable, and idempotent way, without the complexity of implementing custom MapReduce logic.
• Sqoop is a tool designed for bulk data transfer between relational
databases and Hadoop.
• It supports importing data into HDFS, Hive, or HBase and exporting
data back into relational databases.
• Created by Cloudera, it’s an Apache project in incubation.
• Importing Process: Importing data with Sqoop involves two main activities:
• Connecting to the Data Source: Sqoop gathers metadata and
statistics from the source database.
• Executing the Import: A MapReduce job is launched to bring the data
into Hadoop.
• Sqoop uses connectors to interact with databases. There are two types:
• Common Connector: Handles regular reads and writes.
• Fast Connector: Uses database-specific optimizations for bulk data imports, making the process more efficient (an example import command follows).
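For illustration, a typical Sqoop import invocation might look like the command below; the connection string, credentials file, table, and target directory are hypothetical.

```
sqoop import \
  --connect jdbc:mysql://db.example.com/app \
  --username user --password-file /user/hadoop/.db-password \
  --table users \
  --target-dir /data/users \
  --num-mappers 4
```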
