Data Ingest

The document discusses using HDFS commands, Sqoop, and other tools to ingest and access data in Hadoop. Key points:
• HDFS commands like hdfs dfs can be used to copy files to/from HDFS, list directories, delete files, and more.
• Sqoop allows importing and exporting bulk data between HDFS and structured data stores like MySQL. It can import an entire database or single tables.
• Other tools like Flume, Kafka, and Spark Streaming can ingest real-time streaming data into HDFS.


H. EL GHAZI, 10/13/2021

Learning objectives (Skill: Data Ingest)
• Load data into and out of HDFS using the Hadoop File System commands (Technology: HDFS Command Line)
• Import data from a MySQL database into HDFS using Sqoop (Technology: Sqoop)
• Export data to a MySQL database from HDFS using Sqoop (Technology: Sqoop)
• Change the delimiter and file format of data during import using Sqoop (Technology: Sqoop)
• Ingest real-time and near-real-time streaming data into HDFS (Technology: Flume or Kafka or Spark Streaming)
• Process streaming data as it is loaded onto the cluster (Technology: Flume or Kafka or Spark Streaming)

Apache Hadoop ecosystem

HDFS
• Scalable and economical data storage, processing, and analysis

HDFS Basic Concepts

Options for Accessing HDFS
• From the command line
  hdfs dfs
• In Spark
  hdfs://nnhost:port/file…
• Other programs
  • Java API
    • Used by Hadoop tools such as MapReduce, Impala, Hue, Sqoop, Flume
  • RESTful interface (see the WebHDFS sketch below)
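The RESTful interface mentioned above is exposed as the WebHDFS HTTP API. A minimal sketch, assuming WebHDFS is enabled and the NameNode answers at nnhost on the default HTTP port (9870 on Hadoop 3; older releases use 50070); the path reuses the sample file from the examples that follow:

  # List the contents of /user/me over HTTP
  curl -s "http://nnhost:9870/webhdfs/v1/user/me?op=LISTSTATUS"

  # Read /user/me/grades.txt (-L follows the redirect to the DataNode serving the data)
  curl -s -L "http://nnhost:9870/webhdfs/v1/user/me/grades.txt?op=OPEN"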


HDFS Command Line Examples (1)
• Copy the local file localfname.txt to the user's home directory in HDFS as targetfname.txt
  hdfs dfs -put localfname.txt targetfname.txt
  • This copies the file to /user/username/targetfname.txt
• Get a directory listing of the user's home directory in HDFS
  hdfs dfs -ls
• Get a directory listing of the HDFS root directory
  hdfs dfs -ls /

HDFS Command Line Examples (2)
• Display the contents of the HDFS file /user/me/grades.txt
  hdfs dfs -cat /user/me/grades.txt
• Copy that file to the local disk, saving it as notes.txt
  hdfs dfs -get /user/me/grades.txt notes.txt
• Create a directory called input under the user's home directory
  hdfs dfs -mkdir input


HDFS Command Line Examples (3)
• Delete a file
  hdfs dfs -rm input_old/file1
• Delete a set of files using a wildcard
  hdfs dfs -rm input_old/*
• Delete the directory input_old and all its contents
  hdfs dfs -rm -r input_old

The Hue HDFS File Browser
• The File Browser in Hue lets you view and manage your HDFS directories and files


Sqoop

Apache Sqoop Overview
• Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between HDFS and structured datastores such as relational databases
• Can import all tables, a single table, or a partial table into HDFS
  • Data can be imported in a variety of formats
• Sqoop can also export data from HDFS to a database


Basic Syntax
• Sqoop is a command-line utility with several subcommands, called tools
  • There are tools for import, export, listing database contents, and more
• Run sqoop help to see a list of all tools
• Run sqoop help tool-name for help on using a specific tool
• Basic syntax of a Sqoop invocation (see the sketch below)

Exploring a Database with Sqoop
• This command will list all tables in the retails database in MySQL
  sqoop list-tables \
    --connect jdbc:mysql://dbhost/retails \
    --username dbuser \
    --password pw
• We can perform database queries using the eval tool
  sqoop eval \
    --query "SELECT * FROM my_table LIMIT 5" \
    --connect jdbc:mysql://dbhost/retails \
    --username dbuser \
    --password pw
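To make the invocation pattern under Basic Syntax concrete, a minimal sketch of the general form: the tool name comes first, followed by that tool's options; sqoop help tool-name, noted above, shows the options each tool accepts.

  # General form: the tool name, then the options for that tool
  sqoop tool-name [tool-options]

  # For example, list the options accepted by the import tool
  sqoop help import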


Importing an Entire Database with Sqoop
• The import-all-tables tool imports an entire database
  • Stored as comma-delimited files
  • Default base location is your HDFS home directory
  • Data will be in subdirectories corresponding to name of each table
  sqoop import-all-tables \
    --connect jdbc:mysql://dbhost/retails \
    --username dbuser --password pw
• Use the --warehouse-dir option to specify a different base directory
  sqoop import-all-tables \
    --connect jdbc:mysql://dbhost/retails \
    --username dbuser --password pw \
    --warehouse-dir /retails

Importing a Single Table with Sqoop
• The import tool imports a single table
• This example imports the product table
  • It stores the data in HDFS as comma-delimited fields
  sqoop import --table product \
    --connect jdbc:mysql://dbhost/retails \
    --username dbuser --password pw

Importing Partial Tables with Sqoop
• Import only specified columns from customers table
  sqoop import --table customers \
    --connect jdbc:mysql://dbhost/retails \
    --username dbuser --password pw \
    --columns "id,first_name,last_name,state"
• Import only matching rows from customers table
  sqoop import --table customers \
    --connect jdbc:mysql://dbhost/retails \
    --username dbuser --password pw \
    --where "state='CA'"

Specifying an Alternate Delimiter
• By default, Sqoop generates text files with comma-delimited fields
• This example writes tab-delimited fields instead
  sqoop import --table customers \
    --connect jdbc:mysql://dbhost/retails \
    --username dbuser --password pw \
    --fields-terminated-by "\t"


Storing Data in Other Data Formats
• By default, Sqoop stores data in text format files
• Sqoop supports importing data as Parquet or Avro files
  sqoop import --table customers \
    --connect jdbc:mysql://dbhost/retails \
    --username dbuser --password pw \
    --as-parquetfile

  sqoop import --table customers \
    --connect jdbc:mysql://dbhost/retails \
    --username dbuser --password pw \
    --as-avrodatafile

Exporting Data from Hadoop to RDBMS with Sqoop
• It is sometimes necessary to push data in HDFS back to an RDBMS
• Sqoop supports this via the export tool
  • The RDBMS table must already exist prior to export
  sqoop export \
    --connect jdbc:mysql://dbhost/retails \
    --username dbuser --password pw \
    --export-dir /retails/new_users \
    --update-mode allowinsert \
    --table customers


Flume

What Is Apache Flume?
• Apache Flume is a high-performance system for data collection
  • Name derives from the original use case of near-real-time log data ingestion
  • Now widely used for collection of any streaming event data
• Supports aggregating data from many sources into HDFS
• Benefits of Flume
  • Horizontally scalable
  • Extensible
  • Reliable


Common Flume Data Sources

Large-Scale Deployment Example
• Flume collects data using configurable agents
  • Agents can receive data from many sources, including other agents
• Large-scale deployments use multiple tiers for scalability and reliability
• Flume supports inspection and modification of in-flight data


Flume Events
• An event is the fundamental unit of data in Flume
  • Consists of a body (payload) and a collection of headers (metadata)
• Headers consist of name-value pairs
  • Headers are mainly used for directing output

Components in Flume's Architecture
• Source
  • Receives events from the external actor that generates them
• Sink
  • Sends an event to its destination
• Channel
  • Buffers events from the source until they are drained by the sink
• Agent
  • Configures and hosts the source, channel, and sink
  • A Java process that runs in a JVM


Flume Data Flow
• This diagram illustrates how syslog data might be captured to HDFS
  1. Server running a syslog daemon logs a message
  2. Flume agent configured with syslog source retrieves event
  3. Source pushes event to the channel, where it is buffered in memory
  4. Sink pulls data from the channel and writes it to HDFS

Built-In Flume Sources
• Syslog
  • Captures messages from UNIX syslog daemon over the network
• Netcat
  • Captures any data written to a socket on an arbitrary TCP port
• Exec
  • Executes a UNIX program and reads events from standard output
• Spooldir
  • Extracts events from files appearing in a specified (local) directory
• HTTP Source
  • Retrieves events from HTTP requests
• Kafka
  • Retrieves events by consuming messages from a Kafka topic


Built-In Flume Sinks
• Null
  • Discards all events (Flume equivalent of /dev/null)
• Logger
  • Logs event to INFO level using SLF4J
• IRC
  • Sends event to a specified Internet Relay Chat channel
• HDFS
  • Writes event to a file in the specified directory in HDFS
• Kafka
  • Sends event as a message to a Kafka topic
• HBaseSink
  • Stores event in HBase

Built-In Flume Channels
• Memory
  • Stores events in the machine's RAM
  • Extremely fast, but not reliable (memory is volatile)
• File
  • Stores events on the machine's local disk
  • Slower than RAM, but more reliable (data is written to disk)
• Kafka
  • Uses Kafka as a scalable, reliable, and highly available channel between any source and sink type


Flume Agent Configuration File
• Configure Flume agents through a Java properties file
  • You can configure multiple agents in a single file
• The configuration file uses hierarchical references
  • Assign each component a user-defined ID
  • Use that ID in the names of additional properties

  # Define sources, sinks, and channel for agent named 'agent1'
  agent1.sources = mysource
  agent1.sinks = mysink
  agent1.channels = mychannel

  # Sets a property "prop1" for the source associated with agent1
  agent1.sources.mysource.prop1 = pp1

  # Sets a property "prop2" for the sink associated with agent1
  agent1.sinks.mysink.prop2 = pp2

Example: Configuring Flume Components (1)
• Example: Configure a Flume agent to collect data from remote spool directories and save to HDFS

Example: Configuring Flume Components (2)

  agent1.sources = src1
  agent1.sinks = sink1
  agent1.channels = ch1

  agent1.channels.ch1.type = memory

  agent1.sources.src1.type = spooldir
  agent1.sources.src1.spoolDir = /var/flume/incoming
  # Connects source and channel
  agent1.sources.src1.channels = ch1

  agent1.sinks.sink1.type = hdfs
  agent1.sinks.sink1.hdfs.path = /loudacre/logdata
  # Connects sink and channel
  agent1.sinks.sink1.channel = ch1

Starting a Flume Agent
• Typical command line invocation
  flume-ng agent \
    --conf /etc/flume-ng/conf \
    --conf-file /path/to/flume.conf \
    --name agent1 \
    -Dflume.root.logger=INFO,console
• The --name argument must match the agent's name in the configuration file
• Setting root logger as shown will display log messages in the terminal


Kafka

What Is Apache Kafka?
• Apache Kafka is a distributed commit log service
  • Widely used for data ingest
  • Conceptually similar to a publish-subscribe messaging system
  • Offers scalability, performance, reliability, and flexibility
• Kafka is used for a variety of use cases, such as
  • Log aggregation
  • Messaging
  • Web site activity tracking
  • Stream processing
  • Event sourcing


Key Terminology
• Message
  • A single data record passed by Kafka
• Topic
  • A named log or feed of messages within Kafka
• Producer
  • A program that writes messages to Kafka
• Consumer
  • A program that reads messages from Kafka

Example: High-Level Architecture


Messages
• Messages in Kafka are variable-size byte arrays
  • Represent arbitrary user-defined content
• Use any format your application requires
  • Common formats include free-form text, JSON, and Avro
• There is no explicit limit on message size
  • Optimal performance at a few KB per message
  • Practical limit of 1 MB per message
• Kafka retains all messages for a defined time period and/or total size (see the sketch below)

Topics
• There is no explicit limit on the number of topics
  • However, Kafka works better with a few large topics than many small ones
• A topic can be created explicitly or simply by publishing to the topic
  • This behavior is configurable
  • It is recommended that administrators disable auto-creation of topics to avoid accidental creation of large numbers of topics
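Retention, mentioned under Messages above, is normally controlled per topic. A sketch assuming the ZooKeeper connection string and topic name used in the command-line examples later in this deck, with illustrative limits of seven days or roughly 1 GB per partition, whichever is reached first:

  # Create a topic with explicit retention limits (values are illustrative only)
  kafka-topics --create \
    --zookeeper zkhost1:2181,zkhost2:2181,zkhost3:2181 \
    --replication-factor 3 \
    --partitions 5 \
    --topic device_status \
    --config retention.ms=604800000 \
    --config retention.bytes=1073741824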


Producers
• Producers publish messages to Kafka topics
  • They communicate with Kafka, not a consumer
• Kafka persists messages to disk on receipt

Consumers
• A consumer reads messages that were published to Kafka topics
  • They communicate with Kafka, not any producer
• Consumer actions do not affect other consumers
  • For example, having one consumer display the messages in a topic as they are published does not change what is consumed by other consumers
• They can come and go without impact on the cluster or other consumers


Producers and Consumers
• Tools available as part of Kafka
  • Command-line producer and consumer tools
  • Client (producer and consumer) Java APIs
• A growing number of other APIs are available from third parties
  • Client libraries in many languages including Python, PHP, C/C++, Go, .NET, and Ruby
• Integrations with other tools and projects include
  • Apache Flume
  • Apache Spark
  • Amazon AWS
  • syslog
• Kafka also has a large and growing ecosystem

Topic Partitioning
• Kafka divides each topic into some number of partitions
  • Topic partitioning improves scalability and throughput
• A topic partition is an ordered and immutable sequence of messages
  • New messages are appended to the partition as they are received
  • Each message is assigned a unique sequential ID known as an offset


Consumer Groups
• One or more consumers can form a consumer group that works together to consume the messages in a topic
• Each partition is consumed by only one member of a consumer group
• Message ordering is preserved per partition, but not across the topic

Kafka Clusters
• A Kafka cluster consists of one or more brokers (servers running the Kafka broker daemon)
• Kafka depends on the Apache ZooKeeper service for coordination


Apache ZooKeeper
• Apache ZooKeeper is a coordination service for distributed applications
• Kafka depends on the ZooKeeper service for coordination
  • Typically running three or five ZooKeeper instances
• Kafka uses ZooKeeper to keep track of brokers running in the cluster (see the check below)
• Kafka uses ZooKeeper to detect the addition or removal of consumers

Kafka Brokers
• Brokers are the fundamental daemons that make up a Kafka cluster
• A broker fully stores a topic partition on disk, with caching in memory
• A single broker can reasonably host 1000 topic partitions
• One broker is elected controller of the cluster (for assignment of topic partitions to brokers, and so on)
• Each broker daemon runs in its own JVM
  • A single machine can run multiple broker daemons
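As a small illustration of the broker tracking described above, the registered brokers can be inspected directly in ZooKeeper. A sketch assuming a ZooKeeper shell such as zookeeper-client (or zkCli.sh) is available and that Kafka uses the default znode layout:

  # List the IDs of the brokers currently registered with ZooKeeper
  zookeeper-client -server zkhost1:2181 ls /brokers/ids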


Topic Replication
• At topic creation, a topic can be set with a replication count
  • Doing so is recommended, as it provides fault tolerance
• Each broker can act as a leader for some topic partitions and a follower for others (see the describe sketch below)
  • Followers passively replicate the leader
  • If the leader fails, a follower will automatically become the new leader

Messages Are Replicated
• Configure the producer with a list of one or more brokers
  • The producer asks the first available broker for the leader of the desired topic partition
• The producer then sends the message to the leader
  • The leader writes the message to its local log
  • Each follower then writes the message to its own log
  • After acknowledgements from followers, the message is committed
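To see which broker currently leads each partition and which replicas are in sync, the kafka-topics tool can describe a topic. A sketch reusing the device_status topic and ZooKeeper connection string from the examples on the following pages:

  # Show partition leaders, replica assignments, and in-sync replicas (ISR)
  kafka-topics --describe \
    --zookeeper zkhost1:2181,zkhost2:2181,zkhost3:2181 \
    --topic device_status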


Kafka cluster components
• (Diagram) Producers produce messages to topic partitions, which brokers store; consumers consume them; each partition has a leader and followers; ZooKeeper coordinates the brokers, one of which acts as controller.

Creating Topics from the Command Line
• The kafka-topics command offers a simple way to create Kafka topics
  • Provide the topic name of your choice, such as device_status
  • You must also specify the ZooKeeper connection string for your cluster
  kafka-topics --create \
    --zookeeper zkhost1:2181,zkhost2:2181,zkhost3:2181 \
    --replication-factor 3 \
    --partitions 5 \
    --topic device_status


Displaying Topics from the Command Line
• Use the --list option to list all topics
  kafka-topics --list \
    --zookeeper zkhost1:2181,zkhost2:2181,zkhost3:2181
• Use the --help option to list all kafka-topics options
  kafka-topics --help

Running a Producer from the Command Line
• You can run a producer using the kafka-console-producer tool
• Specify one or more brokers in the --broker-list option
  • Each broker consists of a hostname, a colon, and a port number
  • If specifying multiple brokers, separate them with commas
• You must also provide the name of the topic
  kafka-console-producer \
    --broker-list brokerhost1:9092,brokerhost2:9092 \
    --topic device_status


Writing File Contents to Topics Using the Command Line
• Using UNIX pipes or redirection, you can read input from files
  • The data can then be sent to a topic using the command line producer
• This example shows how to read input from a file named alerts.txt
  • Each line in this file becomes a separate message in the topic
  cat alerts.txt | kafka-console-producer \
    --broker-list brokerhost1:9092,brokerhost2:9092 \
    --topic device_status
• This technique can be an easy way to integrate with existing programs

Running a Consumer from the Command Line
• You can run a consumer with the kafka-console-consumer tool
• This requires the ZooKeeper connection string for your cluster
  • Unlike starting a producer, which instead requires a list of brokers
• The command also requires a topic name
• Use --from-beginning to read all available messages
  • Otherwise, it reads only new messages
  kafka-console-consumer \
    --zookeeper zkhost1:2181,zkhost2:2181,zkhost3:2181 \
    --topic device_status \
    --from-beginning


Spark Streaming
Homework

