HADOOP
Glossary
Scalable: A system is said to be scalable if adding hardware capacity improves its performance in proportion to the capacity added.
Node: A point of intersection, connection, or union where several elements converge.
Cluster: A group of computers linked together by a high-speed network that behave as if they were a single computer.
Open source: A software development model based on open collaboration.
How did we get to Hadoop?
• From the early years of the 21st century, Google worked on new methods of access to information, focusing on the massive processing of large volumes of data on parallel systems.
• Three Google publications shaped the development of Hadoop:
1. The Google File System (GFS) (2003)
2. MapReduce: Simplified Data Processing on Large Clusters (2004)
3. Bigtable: A Distributed Storage System for Structured Data (2006)
Google File System
A scalable distributed file system for large, distributed, data-intensive applications.
MapReduce
A programming model and associated implementation for processing and generating large data sets.
Bigtable
A distributed storage system for managing structured data, designed by Google to scale to a very large size.
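To make the MapReduce model concrete, here is a minimal word-count sketch written against Hadoop's standard org.apache.hadoop.mapreduce API (the class names WordCountMapper and WordCountReducer are our own, illustrative choices): the map phase turns each input line into (word, 1) pairs, and the reduce phase sums the pairs grouped by word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in an input line.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // one count per occurrence
        }
    }
}

// Reduce phase: the framework groups pairs by word; sum each group.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```

The appeal of the model is that these two small functions run unchanged whether the input is one file or petabytes spread over thousands of nodes; the framework handles partitioning, scheduling, and fault tolerance.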
History of Hadoop
2004 - 2006 • Google publishes the GFS and MapReduce papers.
2006 • Doug Cutting, a software engineer then working at Yahoo!, had already built open source implementations of GFS and MapReduce inside the Nutch web-crawler project.
• In 2006, Hadoop formally appears as its own project.
2007 • Google and IBM form an alliance for university research, creating a joint research group around MapReduce and GFS.
2008 • Hadoop becomes popular, commercial exploitation begins, and the Apache Software Foundation takes responsibility for the project.
2009 • In July, Cutting joins the Board of Directors of the Apache Software Foundation.
2010 • Cutting leaves Yahoo! for Cloudera, one of the most active organizations in the development and deployment of Hadoop.
• He is currently Chairman of the Board of the Apache Software Foundation and works at Cloudera as a software architect; Cloudera's Hadoop distribution leads the market.
• Cloudera provides training and certification services, support, and tools for managing Hadoop clusters.
• The current era of Hadoop began in
2011 when the three major database
providers (Oracle, IBM and Microsoft)
adopted it.
What is
Hadoop?
• Hadoop is an open source implementation of MapReduce, originally developed at Yahoo! in early 2006 by Doug Cutting.
• Hadoop represents the most complete ecosystem for solving data scalability efficiently and economically, especially for large volumes (terabytes and petabytes).
• Hadoop is currently led by the Apache Software Foundation.
• Hadoop is a framework that allows you to process large amounts of data at very low cost.
• Hadoop runs on low-cost commodity hardware, reducing cost compared with other commercial data storage and processing alternatives.
• Hadoop was designed to run on a large number of machines that share neither memory nor disks.
Characteristics
• Hadoop is a distributed system whose main task is to solve the problem of storing information that exceeds the capacity of a single machine.
• The core of Hadoop is MapReduce.
• Hadoop consists of two fundamental parts:
• A file system (HDFS)
• The MapReduce programming paradigm
Hadoop components
1. Hadoop Distributed File System (HDFS). Inspired by the Google File System (GFS); see the sketch after this list.
2. Hadoop MapReduce (map-reduce). Distributes data manipulation across the nodes of a cluster to achieve high parallelism in processing.
3. Hadoop Common. A set of libraries that support the other Hadoop processes.
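As a hedged illustration of how applications touch HDFS, here is a minimal round trip using the org.apache.hadoop.fs.FileSystem API; the NameNode address hdfs://namenode:8020 and the path /tmp/hello.txt are hypothetical placeholders for a real cluster.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS round trip: write a file, then read it back.
public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; a real deployment would set this
        // in core-site.xml rather than in code.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/hello.txt");

        // Behind this ordinary stream interface, HDFS splits the file
        // into blocks and replicates them across DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello, hdfs".getBytes(StandardCharsets.UTF_8));
        }

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}
```

Note that the caller never sees blocks or replicas: distribution across servers is entirely the file system's concern.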
To understand the Hadoop system better, it is important to know its fundamental infrastructure: the file system and the programming model.
The file system allows applications to run across different servers.
The programming model is the framework that processes the data stored there.
Apache Hadoop Project
 Run by the Apache Software Foundation, the project develops open source software for distributed, reliable, and scalable computing.
 The Apache Hadoop software library is a framework that allows the distributed processing of large data sets.
 It is designed to scale from a few servers to thousands of machines.
 It comprises four components (a driver sketch showing how they fit together follows this list):
1. Hadoop Common
2. Hadoop Distributed File System (HDFS)
3. Hadoop YARN
4. Hadoop MapReduce
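A minimal sketch of how the four components cooperate, reusing the illustrative WordCountMapper and WordCountReducer classes from the earlier sketch: the driver below submits a MapReduce job whose input and output live in HDFS; on a cluster configured with mapreduce.framework.name=yarn (the usual setting), YARN schedules the work, and Hadoop Common supplies the shared libraries underneath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: wires the mapper and reducer into a job and submits it.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output are HDFS paths supplied on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Blocks until YARN reports the job finished.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, it would be launched with something like hadoop jar wordcount.jar WordCountDriver /input /output (paths illustrative).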
Applications that use Hadoop
• Facebook
• Twitter
• eBay
• eHarmony
• Netflix
• AOL
• Apple
• LinkedIn
• Tuenti
Projects related to Apache Hadoop
• Avro
• Cassandra
• Chukwa
• HBase
• Hive
• Mahout
• Pig
• ZooKeeper
Hadoop
platforms
The consulting firm Forrester published a study of Hadoop solutions in 2012; its conclusions were:
• Amazon Web Services holds the leadership thanks to Elastic MapReduce, its proven, feature-rich subscription service
• IBM and EMC Greenplum offer Hadoop solutions backed by strong EDW portfolios
• MapR and Cloudera impress with the best enterprise-scale distributions
• Hortonworks offers an impressive portfolio of professional services based on Hadoop
Thank You
