{Python} in Big Data World
Objectives

•   What is Big Data
•   What is Hadoop and its ecosystem
•   Writing Hadoop jobs using MapReduce
    programming
Structured data         {Relational data with well-defined schemas}

Multi-structured data   {Social data, blogs, clickstream,
                         machine-generated data, XML, etc.}
Trends … Gartner

•   Mobility (mobile analytics, app stores and marketplaces)
•   Big Data
•   Human-computer interfaces (multi-touch UI)
•   Personal cloud
•   In-memory computing
•   Advanced analytics
•   Green data centres
•   Flash memory / solid-state drives
•   Social CRM
•   HTML5
•   Context-aware computing
The Problem…
Source : The Economist
The Problem…

Facebook

     955 million active users as of March 2012;
     1 in 3 Internet users has a Facebook
     account

     More than 30 billion pieces of content (web
     links, news stories, blog posts, notes, photo
     albums, etc.) shared each month

     Holds 30 PB of data for analysis; adds 12 TB
     of compressed data daily
The Problem…

Twitter

     500 million users, 340 million tweets daily
     1.6 billion search queries a day
     7 TB of data for analysis generated daily



  Traditional data storage, techniques & analysis
  tools just do not work at these scales !
Big Data Dimensions (the 3 Vs)

                     Volume

                 Variety    Velocity

                    + Value
Hadoop
What is Hadoop …


 A flexible, highly available architecture for
 large-scale distributed batch processing
 on a network of commodity hardware.
Apache top level project
              http://hadoop.apache.org/

                  500 contributors

It has one of the strongest ecosystems, with a large number of sub-projects

       Yahoo! has one of the biggest Hadoop installations,
       running thousands of servers
Inspired by …


   {Google GFS + Map Reduce + Big Table}

         Architecture behind Google’s

                Search Engine


        Doug Cutting, creator of the Hadoop project
Use cases … What is Hadoop used for

    Big/Social data analysis
    Text mining, patterns search
    Machine log analysis
    Geo-spatial analysis
    Trend Analysis
    Genome Analysis
    Drug Discovery
    Fraud and compliance management
    Video and image analysis
Who uses Hadoop … long list

• Amazon/A9
• Facebook
• Google
• IBM
• Disney
• Last.fm
• New York Times
• Yahoo!
• Twitter
• LinkedIn
What is Hadoop used for?

 • Search
       Yahoo, Amazon, Zvents
 • Log processing
       Facebook, Yahoo, ContextWeb, Last.fm
 • Recommendation Systems
        Facebook , Disney
 • Data Warehouse
       Facebook, AOL , Disney
 • Video and Image Analysis
       New York Times
 • Computing Carbon Foot Print
       Opower.
Our own …




     AADHAAR uses Hadoop and HBase
     for its data processing …
Hadoop ecosystem …


                      ZooKeeper


      Flume              Oozie                 Whirr


       Chukwa             Avro                Sqoop


               Hive                     Pig


              HBase               MapReduce/HDFS
Hive: Data warehouse infrastructure built on top of
Hadoop for data summarization and aggregation,
queried with an SQL-like language called HiveQL.
HBase: A NoSQL columnar database, modelled on
Google's Bigtable. It can scale to store billions of
rows.
Flume: Apache Flume is a distributed, reliable, and
available service for efficiently collecting, aggregating,
and moving large amounts of log data.
Avro: A data serialization system.
Sqoop: Used for transferring bulk data between
Hadoop and traditional structured data stores.
Hadoop distribution …
Scale up




                                   App




           Traditional Databases
Scale out
                    App




            Hadoop distributed file system
Why Hadoop ?




   How is Hadoop different from other
   parallel-processing architectures such
   as MPI, OpenMP, or Globus?
Hadoop moves compute to the data,
while other parallel-processing approaches
distribute the data to the compute nodes.
Hadoop Components …

HDFS
MapReduce
JobTracker
TaskTracker
NameNode
Python + Analytics   High-level language
                     Highly interactive
                     Highly extensible
                     Functional
                     Many extension libraries,
                       like
                            • SciPy
                            • NumPy
                            • Matplotlib
                            • pandas
                            • IPython
                            • statsmodels
                            • NLTK
                     to name a few.
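The "Functional" bullet above can be shown with nothing but the standard library: Python's `map`, `filter`, and `reduce` builtins are the same idea that MapReduce scales out across a cluster. A minimal sketch (no third-party libraries needed):

```python
from functools import reduce

# Functional building blocks in plain Python:
squares = list(map(lambda x: x * x, range(5)))       # map: transform each item
evens = list(filter(lambda x: x % 2 == 0, squares))  # filter: keep matching items
total = reduce(lambda acc, x: acc + x, evens)        # reduce: fold to one value

print(squares, evens, total)  # -> [0, 1, 4, 9, 16] [0, 4, 16] 20
```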
What is common between the Mumbai
Dabbawalas and Apache Hadoop?

Source : Cloudstory.in
Author : Janakiram MSV
What is MapReduce



MapReduce is a programming model for
processing large data sets on a distributed
computing cluster.
Map reduce steps




    Map        Shuffle   Reduce
Map   Shuffle   Reduce
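The three steps can be rehearsed in a few lines of plain Python. This is a toy in-memory sketch, not Hadoop itself (the phase function names are ours):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group values by key (Hadoop does this by sorting on the key)
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    # Reduce: sum the counts for each word
    for word, counts in grouped:
        yield word, sum(counts)

lines = ["the quick brown fox", "the lazy dog"]
counts = dict(reduce_phase(shuffle_phase(map_phase(lines))))
print(counts["the"])  # -> 2
```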
Map Reduce
  Java
  Hive
  Pig Scripts
  Datameer
  Cascading
       Cascalog
       Scalding
  Streaming frameworks
       Wukong
       Dumbo
       mrjob
       Happy
Pig Script (Word Count)
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
 words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- filter out any words that are just white spaces
 filtered_words = FILTER words BY word MATCHES '\\w+';

-- create a group for each word
word_groups = GROUP filtered_words BY word;

-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count,
             group AS word;

-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
 STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
Hive (WordCount)
CREATE TABLE input (line STRING);
LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input;

-- temporary table to hold words
CREATE TABLE words (word STRING);

SELECT word, COUNT(*) FROM input LATERAL VIEW explode(split(line, ' ')) lTable
AS word GROUP BY word;
Hadoop Streaming…




  http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
Map: mapper.py
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
Reduce: reducer.py
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
Reduce: reducer.py ( cont )
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word
Reduce: reducer.py (cont)

# do not forget to output the last word if needed!
if current_word == word:
        print '%s\t%s' % (current_word, current_count)
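The whole streaming pipeline, `cat input | mapper.py | sort | reducer.py`, can be rehearsed locally before submitting a job. A sketch of the same mapper/reducer logic rewritten as generators (Python 3 syntax; `sorted()` stands in for the `sort` step Hadoop performs between map and reduce):

```python
def mapper(lines):
    # Same logic as mapper.py: emit "word<TAB>1" for every word
    for line in lines:
        for word in line.strip().split():
            yield '%s\t%s' % (word, 1)

def reducer(lines):
    # Same logic as reducer.py: input arrives sorted, so equal words are adjacent
    current_word, current_count = None, 0
    for line in lines:
        word, count = line.strip().split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            continue  # not a number; skip the line
        if current_word == word:
            current_count += count
        else:
            if current_word is not None:
                yield '%s\t%s' % (current_word, current_count)
            current_word, current_count = word, count
    if current_word is not None:
        yield '%s\t%s' % (current_word, current_count)

text = ["foo foo bar", "bar foo"]
output = list(reducer(sorted(mapper(text))))
print(output)  # -> ['bar\t2', 'foo\t3']
```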
Thank You

Editor's Notes

  • #4: This slide breaks down the proportions of the different data types. Of all the data in the world, enterprises process only a very small percentage; a great deal of multi-structured data is yet to be processed.
  • #5: This slide shows the various trends predicted by Gartner. Two of the biggest are Big Data and cloud.
  • #7: Businesses, governments and society are only starting to tap its vast potential.
  • #28: This slide draws the architectural parallel between the Mumbai Dabbawalas and Hadoop.