Hadoop introduction

Background
● Big data challenge - (link)

Use Case - Retail
● "Know yourself and know your enemy, and you will never be defeated in a
hundred battles" (Sun Tzu) - Retail and the Big Data Evolution

Use Case - Retail
● Data provides gains in three main ways: opening new channels,
tailoring services, and driving revenue.

Pain Point
● Big data is the term for a collection of data sets so large and complex that it
becomes difficult to process using on-hand database management tools or
traditional data processing applications.

So what is Hadoop?
● Hadoop was created by Doug Cutting and Mike Cafarella.
● Hadoop provides a reliable shared storage and analysis system. (Storage: HDFS;
analysis: MapReduce)
● It is designed to scale up from a single server to thousands of machines, with a
high degree of fault tolerance. (HDFS)
● It provides a programming model and an associated implementation for
processing and generating large data sets with a parallel, distributed
algorithm on a cluster. (MapReduce)
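
The MapReduce model described above can be illustrated with a word count. This is a minimal single-process sketch in plain Python, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and the shuffle step stands in for the grouping the framework does between the map and reduce phases on a real cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is everywhere"])
```

Because each map call sees one line and each reduce call sees one key, both phases can in principle run in parallel on different machines, which is the point of the model.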

Hadoop Distributed FileSystem (HDFS)
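● A given file is broken down into blocks (default 64 MB in Hadoop 1.x, 128 MB in
Hadoop 2.x and later), then the blocks are replicated across the cluster
(default replication factor = 3).
● There is no fixed limit on the number or size of files; in practice, the file
count is bounded by NameNode memory.
● HDFS allows you to put/get/delete files. (No in-place update!)
● Follows the philosophy: "Write once, read many times!"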
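
The block and replication arithmetic above can be worked through for a single file. The sketch below uses the 64 MB block size and replication factor of 3 cited on this slide as defaults; the function name hdfs_footprint is hypothetical, and the calculation assumes that the last block only occupies as much disk space as the data it holds.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=64, replication=3):
    """Return (blocks, block_replicas, raw_storage_mb) for one file.

    blocks          number of blocks the file is split into
    block_replicas  total block copies stored across the cluster
    raw_storage_mb  disk space consumed by all copies together
    """
    if file_size_mb <= 0:
        return 0, 0, 0
    blocks = math.ceil(file_size_mb / block_size_mb)
    block_replicas = blocks * replication
    # A partially filled last block is not padded, so raw usage is size x replication.
    raw_storage_mb = file_size_mb * replication
    return blocks, block_replicas, raw_storage_mb


blocks, replicas, raw_mb = hdfs_footprint(1000)
```

For a 1000 MB file this gives 16 blocks (15 full blocks plus one partial), 48 block replicas, and 3000 MB of raw storage.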

Hadoop Ecosystem
● The Hadoop Ecosystem Table

Flume
● Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data.
● Each Flume agent has a source, a channel, and a sink.
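
The source, channel, and sink roles can be pictured with a toy model. This is a hypothetical single-process sketch, not Flume's API or configuration format: the class ToyAgent and its methods are invented for illustration, and the delivered list stands in for whatever destination a real sink writes to.

```python
from collections import deque


class ToyAgent:
    """Hypothetical stand-in for a Flume agent: source -> channel -> sink."""

    def __init__(self):
        self.channel = deque()   # buffers events between source and sink
        self.delivered = []      # stands in for the sink's destination

    def source(self, lines):
        """Source: accept incoming log lines and put them on the channel."""
        for line in lines:
            self.channel.append(line.rstrip("\n"))

    def sink(self, batch_size=2):
        """Sink: take up to batch_size events off the channel and deliver them."""
        batch = []
        while self.channel and len(batch) < batch_size:
            batch.append(self.channel.popleft())
        self.delivered.extend(batch)
        return len(batch)


agent = ToyAgent()
agent.source(["GET /index\n", "GET /cart\n", "POST /order\n"])
while agent.sink():
    pass  # keep draining until the channel is empty
```

The channel decouples the two sides: the source can accept events faster than the sink delivers them, and undelivered events wait in the buffer.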

Sqoop
● Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data
between Apache Hadoop and structured datastores such as relational databases.

Hive
● Apache Hive is a high-level abstraction on top of MapReduce
● Uses an SQL-like language called HiveQL
● Generates MapReduce jobs that run on the Hadoop cluster
● Originally developed by Facebook for data warehousing - Now an open-source
Apache project.

HBase
● Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.
● Linear scalability, capable of storing hundreds of terabytes of data
● Automatic and configurable sharding of tables
● Automatic failover support
● Block cache and Bloom Filters for real-time queries
● Provides real-time random read/write access to data stored in HDFS.
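
The Bloom filters mentioned above let a store answer "is this key definitely absent?" without reading the data, so files that cannot contain a requested row can be skipped. The following is a minimal sketch of the general data structure, not HBase's implementation: the class name, bit-array size, hash count, and the use of SHA-256 are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity over compactness

    def _positions(self, key):
        """Derive num_hashes bit positions for a key by salting one hash function."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        """Set every bit position derived from the key."""
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(key))


row_filter = BloomFilter()
for row_key in ["row-001", "row-002"]:
    row_filter.add(row_key)
```

A lookup for "row-001" returns True, so the data must be read to confirm. A False answer for any key is final, which is what saves the disk read.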

Pig
● Apache Pig is a platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs, coupled with infrastructure for
evaluating these programs.
● The data flow language (Pig Latin)
● The interactive shell where you can type Pig Latin statements (Grunt)
● The Pig interpreter and execution engine

Oozie
● Oozie is a 'workflow engine' that runs on a server, typically outside the
cluster. It runs workflows of Hadoop jobs, including Pig, Hive, and Sqoop jobs,
and submits those jobs to the cluster based on a workflow definition.
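
The idea of submitting jobs according to a workflow definition can be sketched as a dependency graph that is walked in order. This is not Oozie's workflow format or API: the action names (sqoop_import, pig_clean, hive_report), the dictionary layout, and the run_workflow helper are hypothetical, and the sketch relies on Python's standard graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical workflow definition: each action maps to the actions it depends on.
workflow = {
    "sqoop_import": set(),             # load data from a relational database
    "pig_clean": {"sqoop_import"},     # clean the imported data
    "hive_report": {"pig_clean"},      # query the cleaned data
}


def run_workflow(definition, submit):
    """Submit each action only after all of its dependencies have been submitted."""
    order = []
    for action in TopologicalSorter(definition).static_order():
        submit(action)  # in a real engine this would hand the job to the cluster
        order.append(action)
    return order


submitted = []
order = run_workflow(workflow, submitted.append)
```

Here the actions form a simple chain, so they are submitted as sqoop_import, then pig_clean, then hive_report.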

Appendix
● What is Hadoop?
● https://www.youtube.com/watch?v=4DgTLaFNQq0
● https://www.youtube.com/watch?v=9s-vSeWej1U
● Intro to MapReduce
● https://www.youtube.com/watch?v=HFplUBeBhcM
● What is Big Data? Big Data Explained (Hadoop & MapReduce)
● Big Data University - Hadoop Fundamentals I - v2
● Big Data Challenges

Appendix
● Hadoop Tutorial 1 - What is Hadoop?
● Hadoop Tutorial 2 - Challenges Created by Big Data
● Hadoop Tutorial 3 - History Behind Creation of Hadoop (Google, Yahoo, and Apac
● Hadoop Tutorial 4 - Overview of Hadoop Projects
● Hadoop Tutorial 5 - Steps to Install Hadoop on a Personal Computer (Windows/OS
● Hadoop Tutorial 6 - Downloading and Installing Oracle VirtualBox
● Hadoop Tutorial 7 - Downloading Hadoop Appliance for Oracle VirtualBox