Background
● Big data challenge - (link)
Use Case - Retail
● "Know the enemy and know yourself, and you will never be in peril in a hundred battles" – Retail and The Big Data Evolution:
Use Case - Retail
● Data is providing gains in three main ways: opening new channels,
tailoring services, and driving revenue:
Pain Point
● Big data is the term for a collection of data sets so large and complex that it
becomes difficult to process using on-hand database management tools or
traditional data processing applications.
So what is Hadoop?
● Hadoop was created by Doug Cutting and Mike Cafarella.
● Hadoop provides a reliable shared storage and analysis system. (HDFS and MapReduce)
● It is designed to scale up from a single server to thousands of machines, with a
high degree of fault tolerance. (HDFS)
● It provides a programming model and an associated implementation for
processing and generating large data sets with a parallel, distributed
algorithm on a cluster. (MapReduce)
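The programming model in the last bullet can be illustrated with the classic word count. This is a minimal single-process sketch in plain Python, not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and on a real cluster the framework would run many mappers and reducers in parallel on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "hadoop stores big data"]))
```

Because each mapper sees only its own lines and each reducer sees only its own keys, both phases can be spread across machines without coordination beyond the shuffle.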
History
Hadoop Distributed FileSystem (HDFS)
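● A given file is broken down into blocks (default 64 MB in Hadoop 1.x, 128 MB in Hadoop 2.x and later), then each block is
replicated across the cluster (default replication factor = 3).
● There is no fixed limit on the number or size of files; in practice both are bounded by cluster capacity and NameNode memory.
● HDFS allows you to put/get/delete files. (No in-place update!)
● Follows the philosophy – "Write Once, Read Many Times!"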
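The block and replication defaults above translate into simple arithmetic. The sketch below is a back-of-the-envelope helper, not part of any Hadoop API; the name `hdfs_footprint` is hypothetical, and the defaults are the slide's values (64 MB blocks, 3 replicas). It assumes the last block of a file only occupies as much space as the data it holds.

```python
MB = 1024 * 1024


def hdfs_footprint(file_size_bytes, block_size_bytes=64 * MB, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster) for one file."""
    if file_size_bytes < 0 or block_size_bytes <= 0 or replication < 1:
        raise ValueError("sizes must be non-negative and replication at least 1")
    # Integer ceiling division: a partially filled last block still counts as a block.
    blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    # Every byte of the file is stored once per replica.
    raw_bytes = file_size_bytes * replication
    return blocks, raw_bytes


if __name__ == "__main__":
    blocks, raw = hdfs_footprint(1024 * MB)
    print(f"1 GB file: {blocks} blocks, {raw // MB} MB of raw storage")
```

With these defaults a 1 GB file becomes 16 blocks and consumes 3 GB of raw disk, which is worth remembering when sizing a cluster.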
MapReduce Flow
Hadoop Ecosystem
● The Hadoop Ecosystem Table
Flume
● Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data.
● Each Flume agent has a source, a channel, and a sink: events flow from the source, through the channel, to the sink.
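The source, channel, and sink roles can be pictured with a toy in-memory pipeline. This is only a conceptual sketch in plain Python and does not use Flume's actual classes or configuration; `Channel`, `run_source`, and `run_sink` are hypothetical names, and the "destination" is just a list standing in for something like HDFS.

```python
from collections import deque


class Channel:
    """Bounded buffer that decouples the source from the sink."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._events = deque()

    def put(self, event):
        if len(self._events) >= self.capacity:
            raise OverflowError("channel is full")
        self._events.append(event)

    def take(self):
        """Return the oldest buffered event, or None when the channel is empty."""
        return self._events.popleft() if self._events else None


def run_source(lines, channel):
    """Source: turn each incoming log line into an event and put it on the channel."""
    for line in lines:
        channel.put({"body": line.rstrip("\n")})


def run_sink(channel, destination):
    """Sink: drain the channel and deliver each event body to the destination."""
    delivered = 0
    while (event := channel.take()) is not None:
        destination.append(event["body"])
        delivered += 1
    return delivered


if __name__ == "__main__":
    channel = Channel(capacity=100)
    stored = []
    run_source(["GET /index.html\n", "GET /cart\n"], channel)
    print(run_sink(channel, stored), stored)
```

The channel is what lets the two sides run at different speeds: the source can keep accepting log lines while the sink is temporarily slow, up to the channel's capacity.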
Sqoop
● Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data
between Apache Hadoop and structured datastores such as relational databases.
Hive
● Apache Hive is a high-level abstraction on top of MapReduce
● Uses an SQL-like language called HiveQL
● Generates MapReduce jobs that run on the Hadoop cluster
● Originally developed by Facebook for data warehousing; now an open-source
Apache project.
HBase
● Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.
● Linear scalability, capable of storing hundreds of terabytes of data
● Automatic and configurable sharding of tables
● Automatic failover support
● Block cache and Bloom filters for real-time queries
● Provides real-time random read/write access to data stored in HDFS.
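The Bloom filter bullet is easier to follow with a small example. A Bloom filter answers "might this key be here?" with no false negatives, so a store can skip reading files that definitely do not contain a row. The class below is a minimal illustrative sketch, not HBase's implementation; the sizes and the SHA-256 based hashing are arbitrary assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # If any required bit is unset, the key was certainly never added.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


if __name__ == "__main__":
    row_keys = BloomFilter()
    row_keys.add("user#1001")
    print(row_keys.might_contain("user#1001"))
```

A `False` answer lets the reader skip a file entirely; a `True` answer only means the file has to be checked.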
Pig
● Apache Pig is a platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs, coupled with infrastructure for
evaluating these programs.
● The data flow language (Pig Latin)
● The interactive shell where you can type Pig Latin statements (Grunt)
● The Pig interpreter and execution engine
Oozie
● Oozie is a "workflow engine" that runs on a server, typically outside the
cluster. It runs workflows of Hadoop jobs, including Pig, Hive, and Sqoop jobs, and
submits those jobs to the cluster based on a workflow definition.
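A workflow definition is essentially a dependency graph of jobs. The sketch below models that idea in plain Python using the standard-library `graphlib` module (Python 3.9+); it is not Oozie's XML format or API. The job names and the `submit` callback are hypothetical stand-ins for real Sqoop, Pig, and Hive actions.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must finish before it can start.
workflow = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_report": {"pig_clean"},
}


def execution_order(definition):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(definition).static_order())


def run_workflow(definition, submit):
    """Submit each job in dependency order via the caller-supplied callback."""
    for job in execution_order(definition):
        submit(job)


if __name__ == "__main__":
    run_workflow(workflow, submit=lambda job: print(f"submitting {job} to the cluster"))
```

Keeping the definition separate from the submission logic mirrors the split described above: the engine owns ordering, while the cluster does the actual work.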
Why?
Appendix
● What is Hadoop?
● https://www.youtube.com/watch?v=4DgTLaFNQq0
● https://www.youtube.com/watch?v=9s-vSeWej1U
● Intro to MapReduce
● https://www.youtube.com/watch?v=HFplUBeBhcM
● What is Big Data? Big Data Explained (Hadoop & MapReduce)
● Big Data University - Hadoop Fundamentals I - v2
● Big Data Challenges
Appendix
● Hadoop Tutorial 1 - What is Hadoop?
● Hadoop Tutorial 2 - Challenges Created by Big Data
● Hadoop Tutorial 3 - History Behind Creation of Hadoop (Google, Yahoo, and Apac
● Hadoop Tutorial 4 - Overview of Hadoop Projects
● Hadoop Tutorial 5 - Steps to Install Hadoop on a Personal Computer (Windows/OS
● Hadoop Tutorial 6 - Downloading and Installing Oracle VirtualBox
● Hadoop Tutorial 7 - Downloading Hadoop Appliance for Oracle VirtualBox

Hadoop introduction
