Hadoop Ecosystem Overview
About Neev

Web: Magento eCommerce, SaaS Applications, Video Streaming Portals, Rich Internet Apps, Custom Development
Mobile: iPhone, Android, Windows Phone 7, HTML5 Apps
Cloud: AWS Consulting Partner, Rackspace, Joyent, Heroku, Google App Engine

Key Company Highlights
• Neev Technologies established in Jan ’05
• VC funding in 2009 by Basil Partners
• Part of Publicis Groupe
• Member of NASSCOM
• 250+ team with experience in managing offshore, distributed development
• Development Centers in Bangalore and Pune
• Offices at Bangalore, USA, Delhi, Pune, Singapore and Stockholm

Practices
• User Interface Design and User Experience Design
• Performance Consulting
• Quality Assurance & Testing
• Outsourced Product Development
Hadoop in a Nutshell : An Overview
• Hadoop, as we know, is a Java-based, massively scalable, distributed framework for processing large data sets (several petabytes) across clusters of thousands of commodity computers.
• The Hadoop ecosystem has grown over the last few years, and there is a lot of jargon in terms of tools as well as frameworks.
• Many organizations are investing and innovating heavily in Hadoop to make it better and easier to use. The mind map on the next slide gives a high-level picture of the ecosystem.
Hadoop : The Big Picture
Hadoop Core
The core consists of:
1) HDFS (Hadoop Distributed File System) is designed to run on a cluster of commodity machines. It is highly fault tolerant and well suited to processing large data sets. Files stored in HDFS are split into blocks, typically 64 MB or 128 MB, which are distributed across the nodes of the cluster. Each block is also replicated on several nodes, generally 3, to avoid data loss in case of failure (a short API sketch follows this list).
2) MapReduce is a software framework for processing large data sets (petabyte scale) on a cluster of commodity hardware. When a MapReduce job runs, Hadoop splits the input and locates the nodes in the cluster that hold each split. Tasks are then run on or close to the nodes where the data resides, keeping the computation near the data. This prevents the network from being flooded with data or becoming a bottleneck (see the word-count sketch below).
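
To make the block and replication behavior concrete, here is a minimal sketch using Hadoop's Java FileSystem API. It writes a small file and prints its block size, replication factor, and the hosts holding each block. The namenode URI and file path are hypothetical placeholders, and the values reported depend on the cluster's configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlocksDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS normally comes from core-site.xml; hypothetical host here
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file (hypothetical path)
        Path path = new Path("/data/sample.txt");
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("hello hdfs");
        }

        // Inspect how HDFS stored it: block size, replication, block hosts
        FileStatus status = fs.getFileStatus(path);
        System.out.println("Block size:  " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block hosts: " + String.join(",", loc.getHosts()));
        }
    }
}
```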
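And here is the canonical word-count example in Hadoop's Java MapReduce API, showing the map phase (emit each word with a count of 1) and the reduce phase (sum the counts per word). Class names and command-line paths are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every token in the input line
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class); // local pre-aggregation on map nodes
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```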
Hadoop : Distributions

Apache: The purely open-source distribution, maintained by the Apache Software Foundation.
Cloudera: The leading distribution, with capabilities like management, security, high availability, and integration with many other software and hardware solutions.
HortonWorks: The only distribution available for Windows Server.
MapR: Offers unique features like mounting the cluster over NFS.
GreenPlum: Uses an SQL-based database engine.
Intel: Intel's open-source distribution.
Amazon EMR: Amazon's hosted MapReduce offering, Elastic MapReduce, part of AWS. EMR allows a Hadoop cluster to be deployed and MapReduce jobs to be run in the cloud with just a few clicks.
Related Projects

Avro: Data serialization framework used in Hadoop and other systems.
Pig: Framework for analyzing large data sets using a high-level language called Pig Latin.
Hive: Data warehouse framework that supports querying of large data sets stored in Hadoop (a JDBC sketch follows this list).
HBase: Distributed, scalable data store built on Hadoop.
Mahout: Scalable machine learning library.
YARN: The next generation of MapReduce.
Oozie: Workflow scheduler that runs sequences of MapReduce and other pre- and post-processing jobs at scheduled times or based on data availability.
Flume: Distributed, reliable, and available service for collecting, aggregating, and moving log data to HDFS.
Sqoop: Designed for transferring data between Hadoop and relational databases.
Cascading: Application framework for building applications on Hadoop.
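
As an illustration of querying data in Hadoop through Hive, here is a hedged sketch that runs a HiveQL query from Java over JDBC against HiveServer2. The driver class and jdbc:hive2 URL scheme ship with Hive's JDBC client, but the host, port, credentials, and the pageviews table are assumptions for the example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // Register Hive's JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 listens on port 10000 by default; host and credentials
        // are hypothetical for this sketch
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // pageviews is a hypothetical table whose data lives in HDFS;
             // Hive compiles this query into jobs on the cluster
             ResultSet rs = stmt.executeQuery(
                 "SELECT url, COUNT(*) AS hits FROM pageviews GROUP BY url")) {
            while (rs.next()) {
                System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```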
Related Technologies

Twitter Storm: As opposed to Hadoop, which is a batch processing system, Storm is a distributed real-time processing system developed at Twitter. Storm is fast, scalable, and easy to use (a topology sketch follows this list).
HPCC: High Performance Computing Cluster, an MPP (massively parallel processing) computing platform that helps solve problems involving huge volumes of data.
Dremel: A scalable, interactive, ad-hoc query system for analysis of read-only nested data, built by Google.
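
To make the batch-versus-real-time contrast concrete, below is a minimal sketch of a Storm topology, assuming the Storm 1.x Java API: a spout continuously emits a sentence (standing in for a live feed) and a bolt maintains running word counts in memory. All class and component names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class StreamingWordCount {

    // Spout: continuously emits the same sentence (a stand-in for a live feed)
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }
        public void nextTuple() {
            Utils.sleep(100); // throttle the demo feed
            collector.emit(new Values("the quick brown fox"));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // Bolt: splits each sentence and keeps running word counts in memory
    public static class CountBolt extends BaseRichBolt {
        private final Map<String, Integer> counts = new HashMap<>();
        public void prepare(Map conf, TopologyContext ctx, OutputCollector collector) {}
        public void execute(Tuple input) {
            for (String word : input.getStringByField("sentence").split("\\s+")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {}
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout());
        builder.setBolt("counter", new CountBolt()).shuffleGrouping("sentences");

        // Run in-process for ten seconds, then shut down
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("streaming-word-count", new Config(), builder.createTopology());
        Thread.sleep(10_000);
        cluster.shutdown();
    }
}
```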
Clients
Partnerships
Neev Information Technologies Pvt. Ltd.

India - Bangalore
The Estate, # 121, 6th Floor,
Dickenson Road,
Bangalore - 560042
Phone: +91 80 25594416

India - Pune
#13 L’Square, 3rd Floor,
Parihar Chowk, Aundh,
Pune - 411007
Phone: +91-64103338

USA
1121 Boyce Rd Ste 1400,
Pittsburgh, PA 15241
Phone: +1 888-979-7860

Sweden
Neev AB, Birger Jarlsgatan 53, 6tr,
11145, Stockholm
Phone: +46723250723

Singapore
#08-03 SGX Centre 2,
4 Shenton Way,
Singapore 068807
Phone: +65 6435 1961

sales@neevtech.com
For more info on our offerings, visit www.neevtech.com
