Python can be used for big data applications and processing on Hadoop. Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers using simple programming models. MapReduce is the programming model Hadoop uses for processing and generating large datasets in a distributed computing environment.
The document discusses using Python with Hadoop frameworks. It introduces Hadoop Distributed File System (HDFS) and MapReduce, and how to use the mrjob library to write MapReduce jobs in Python. It also covers using Python with higher-level Hadoop frameworks like Pig, accessing HDFS with snakebite, and using Python clients for HBase and the PySpark API for the Spark framework. Key advantages discussed are Python's rich ecosystem and ability to access Hadoop frameworks.
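As a rough illustration of the mrjob approach mentioned above, here is a minimal word-count job sketch; the class name, regex, and file names are illustrative, and it assumes mrjob is installed (pip install mrjob):

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # emit (word, 1) for every word found in the input line
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer(self, word, counts):
        # sum all the 1s emitted for this word
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()

Such a job can be run locally (python wordcount.py input.txt) or submitted to a cluster with mrjob's -r hadoop runner.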
IPython Notebook as a Unified Data Science Interface for Hadoop (DataWorks Summit)
This document discusses using IPython Notebook as a unified data science interface for Hadoop. It proposes that a unified environment needs: 1) mixed local and distributed processing via Apache Spark, 2) access to languages like Python via PySpark, 3) seamless SQL integration via SparkSQL, and 4) visualization and reporting via IPython Notebook. The document demonstrates this environment by exploring open payments data between doctors/hospitals and manufacturers.
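For context, a minimal PySpark plus Spark SQL sketch of the kind of exploration described above might look like the following; the file path, column names, and the use of the newer SparkSession API (rather than the SQLContext of the IPython Notebook era) are assumptions, not details from the talk:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("open-payments-demo").getOrCreate()

# load a CSV of payments into a DataFrame (path and schema inference assumed)
payments = spark.read.csv("hdfs:///data/open_payments.csv", header=True, inferSchema=True)

# register the DataFrame as a temporary view so it can be queried with SQL
payments.createOrReplaceTempView("payments")

top_payers = spark.sql("""
    SELECT manufacturer, SUM(amount) AS total_paid
    FROM payments
    GROUP BY manufacturer
    ORDER BY total_paid DESC
    LIMIT 10
""")
top_payers.show()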
This document discusses Apache Pig and its role in data science. It begins with an introduction to Pig, describing it as a high-level scripting language for operating on large datasets in Hadoop. It transforms data operations into MapReduce/Tez jobs and optimizes the number of jobs required. The document then covers using Pig for understanding data through statistics and sampling, machine learning by sampling large datasets and applying models with UDFs, and natural language processing on large unstructured data.
This is part of an introductory course to Big Data Tools for Artificial Intelligence. These slides introduce students to the use of Apache Pig as an ETL tool over Hadoop.
Owen O'Malley is an architect at Yahoo who works full-time on Hadoop. He discusses Hadoop's origins, how it addresses the problem of scaling applications to large datasets, and its key components including HDFS and MapReduce. Yahoo uses Hadoop extensively, including for building its Webmap and running experiments on large datasets.
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch... (Simplilearn)
The document discusses key concepts related to the Pig analytics framework: why Pig was developed, what Pig is, and how Pig compares to MapReduce and Hive. It covers Pig's architecture, involving Pig Latin scripts, a runtime engine, and execution via the Grunt shell or a Pig server; how Pig works by loading data and executing Pig Latin scripts; Pig's data model of atoms and tuples; and features such as its ability to process structured, semi-structured, and unstructured data without requiring complex coding.
This document provides an overview and objectives of a Python course for big data analytics. It discusses why Python is well-suited for big data tasks due to its libraries like PyDoop and SciPy. The course includes demonstrations of web scraping using Beautiful Soup, collecting tweets using APIs, and running word count on Hadoop using Pydoop. It also discusses how Python supports key aspects of data science like accessing, analyzing, and visualizing large datasets.
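As a rough sketch of the Pydoop word count mentioned above (the exact Mapper/Reducer/Factory signatures vary between Pydoop releases, so treat this as illustrative rather than definitive):

import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pipes

class Mapper(api.Mapper):
    def map(self, context):
        # context.value holds one line of input text
        for word in context.value.split():
            context.emit(word, 1)

class Reducer(api.Reducer):
    def reduce(self, context):
        # context.values iterates over all counts emitted for context.key
        context.emit(context.key, sum(context.values))

def __main__():
    pipes.run_task(pipes.Factory(Mapper, reducer_class=Reducer))

The script would then typically be packaged and submitted to the cluster with Pydoop's submit tool.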
10 concepts the enterprise decision maker needs to understand about Hadoop (Donald Miner)
Way too many enterprise decision makers have clouded and uninformed views of how Hadoop works and what it does. Donald Miner offers high-level observations about Hadoop technologies and explains how Hadoop can shift the paradigms inside of an organization, based on his report Hadoop: What You Need To Know—Hadoop Basics for the Enterprise Decision Maker, forthcoming from O’Reilly Media.
After a basic introduction to Hadoop and the Hadoop ecosystem, Donald outlines 10 basic concepts you need to understand to master Hadoop:
Hadoop masks being a distributed system: what it means for Hadoop to abstract away the details of distributed systems and why that’s a good thing
Hadoop scales out linearly: why Hadoop’s linear scalability is a paradigm shift (but one with a few downsides)
Hadoop runs on commodity hardware: an honest definition of commodity hardware and why this is a good thing for enterprises
Hadoop handles unstructured data: why Hadoop is better for unstructured data than other data systems from a storage and computation perspective
In Hadoop, you load data first and ask questions later: the differences between schema-on-read and schema-on-write and the drawbacks this represents
Hadoop is open source: what it really means for Hadoop to be open source from a practical perspective, not just a “feel good” perspective
HDFS stores the data but has some major limitations: an overview of HDFS (replication, not being able to edit files, and the NameNode)
YARN controls everything going on and is mostly behind the scenes: an overview of YARN and the pitfalls of sharing resources in a distributed environment and the capacity scheduler
MapReduce may be getting a bad rap, but it’s still really important: an overview of MapReduce (what it’s good at and bad at and why, while it isn’t used as much these days, it still plays an important role)
The Hadoop ecosystem is constantly growing and evolving: an overview of current tools such as Spark and Kafka and a glimpse of some things on the horizon
Hadoop, Pig, and Twitter (NoSQL East 2009), Kevin Weil
A talk on the use of Hadoop and Pig inside Twitter, focusing on the flexibility and simplicity of Pig, and the benefits of that for solving real-world big data problems.
This is a talk I gave at Data Science MD meetup. It was based on the talk I gave about a month before at Data Science NYC (http://www.slideshare.net/DonaldMiner/data-scienceandhadoop). I talk about data exploration, NLP, Classifiers, and recommendation systems, plus some other things. I tried to depict a realistic view of Hadoop here.
Pig is a platform for analyzing large datasets that sits on top of Hadoop. It provides a simple language called Pig Latin for expressing data analysis processes. Pig Latin scripts are compiled into a series of MapReduce jobs that process and analyze data in parallel across a Hadoop cluster. Pig aims to be easier to use than raw MapReduce programs by providing high-level operations like JOIN, FILTER, and GROUP, and by allowing analysis to be expressed without writing Java code. Common use cases for Pig include log and web data analysis, ETL processes, and quick prototyping of algorithms for large-scale data.
Hadoop and Pig are tools for analyzing large datasets. Hadoop uses MapReduce and HDFS for distributed processing and storage. Pig provides a high-level language for expressing data analysis jobs that are compiled into MapReduce programs. Common tasks like joins, filters, and grouping are built into Pig for easier programming compared to lower-level MapReduce.
This document provides an introduction to Apache Pig including:
- What Pig is and how it offers a high-level language called PigLatin for analyzing large datasets.
- How PigLatin provides common data operations and types and is more natural for analysts than MapReduce.
- Examples of how WordCount looks in PigLatin versus Java MapReduce.
- How Pig works by parsing, optimizing, and executing PigLatin scripts as MapReduce jobs on Hadoop.
- Considerations for developing, running, and optimizing PigLatin scripts.
A talk I gave on what Hadoop does for the data scientist. I talk about data exploration, NLP, Classifiers, and recommendation systems, plus some other things. I tried to depict a realistic view of Hadoop here.
This document summarizes a presentation about using Hadoop for large scale data analysis. It introduces Hadoop's architecture which uses a distributed file system and MapReduce programming model. It discusses how Hadoop can handle large amounts of data reliably across commodity hardware. Examples shown include word count and stock analysis algorithms in MapReduce. The document concludes by mentioning other Hadoop projects like HBase, Pig and Hive that extend its capabilities.
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
This Edureka Pig Tutorial ( Pig Tutorial Blog Series: https://goo.gl/KPE94k ) will help you understand the concepts of Apache Pig in depth.
Check our complete Hadoop playlist here: https://goo.gl/ExJdZs
Below are the topics covered in this Pig Tutorial:
1) Entry of Apache Pig
2) Pig vs MapReduce
3) Twitter Case Study on Apache Pig
4) Apache Pig Architecture
5) Pig Components
6) Pig Data Model
7) Running Pig Commands and Pig Scripts (Log Analysis)
Pig programming is more fun: New features in Pig (daijy)
In the last year, we added lots of new language features to Pig, and Pig programming is much easier than before. With Pig macros, we can write functions for Pig and modularize Pig programs. Pig embedding allows us to embed Pig statements in Python and make use of rich language features of Python such as loops and branches (see the sketch after this paragraph). Java is no longer the only choice for writing Pig UDFs; we can write UDFs in Python, JavaScript, and Ruby. Nested foreach and cross give us more ways to manipulate data that were not possible before. We also added tons of syntactic sugar to simplify Pig syntax, for example direct syntax support for map, tuple, and bag, and project-range expressions in foreach. We also revived support for the illustrate command to ease debugging. In this paper, I give an overview of all these features and illustrate how to use them to program more efficiently in Pig. I also give concrete examples to demonstrate how the Pig language has evolved over time with these improvements.
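A minimal sketch of the Pig embedding mentioned above, assuming a word-count script driven from Python and executed by Pig's Jython interpreter (e.g. pig wordcount.py); the paths and relation names are placeholders:

#!/usr/bin/python
from org.apache.pig.scripting import Pig

# compile a parameterized Pig Latin script
P = Pig.compile("""
lines  = LOAD '$input' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO '$output';
""")

# bind the parameters and run the script once
result = P.bind({'input': '/tmp/input.txt', 'output': '/tmp/wordcount_out'}).runSingle()

if result.isSuccessful():
    print('word count succeeded')
else:
    print('word count failed')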
So you want to get started with Hadoop, but how? This session will show you how to get started with Hadoop development using Pig. Prior Hadoop experience is not needed.
Thursday, May 8th, 02:00pm-02:50pm
The document discusses a presentation about practical problem solving with Hadoop and Pig. It provides an agenda that covers introductions to Hadoop and Pig, including the Hadoop distributed file system, MapReduce, performance tuning, and examples. It discusses how Hadoop is used at Yahoo, including statistics on usage. It also provides examples of how Hadoop has been used for applications like log processing, search indexing, and machine learning.
This document introduces Pig, an open source platform for analyzing large datasets that sits on top of Hadoop. It provides an example of using Pig Latin to find the top 5 most visited websites by users aged 18-25 from user and website data. Key points covered include who uses Pig, how it works, performance advantages over MapReduce, and upcoming new features. The document encourages learning more about Pig through online documentation and tutorials.
Hadoop Summit 2016 presentation.
As Yahoo continues to grow and diversify its mobile products, its platform team faces an ever-expanding, sophisticated group of analysts and product managers, who demand faster answers: to measure key engagement metrics like Daily Active Users, app installs and user retention; to run longitudinal analysis; and to re-calculate metrics over rolling time windows. All of this, quickly enough not to need a coffee break while waiting for results. The optimal solution for this use-case would have to take into account raw performance, cost, security implications and ease of data management. Could the benefits of Hive, ORC and Tez, coupled with a good data design provide the performance our customers crave? Or would it make sense to use more nascent, off-grid querying systems? This talk will examine the efficacy of using Hive for large-scale mobile analytics. We will quantify Hive performance on a traditional, shared, multi-tenant Hadoop cluster, and compare it with more specialized analytics tools on a single-tenant cluster. We will also highlight which tuning parameters yield maximum benefits, and analyze the surprisingly ineffectual ones. Finally, we will detail several enhancements made by Yahoo's Hive team (in split calculation, stripe elimination and the metadata system) to successfully boost performance.
1. The document discusses using Hadoop and Hive at Zing to build a log collecting, analyzing, and reporting system.
2. Scribe is used for fast log collection and storing data in Hadoop/Hive. Hive provides SQL-like queries to analyze large datasets.
3. The system transforms logs into Hive tables, runs analysis jobs in Hive, then exports data to MySQL for web reporting. This provides a scalable, high performance solution compared to the initial RDBMS-only system.
Programmers love Python because of how fast and easy it is to use. Python cuts development time with its simple, readable syntax and the absence of a separate compilation step, and debugging programs is a breeze with its built-in debugger. Python continues to be a favourite option for data scientists, who use it for building machine learning applications and other scientific computations.
Python has evolved into the most preferred language for data analytics, and the increasing search trends for Python also indicate that it is the next "Big Thing" and a must for professionals in the data analytics domain.
Introduction to data processing using Hadoop and Pig (Ricardo Varela)
In this talk we give an introduction to big data processing and review the basic concepts of MapReduce programming with Hadoop. We also comment on the use of Pig to simplify the development of data processing applications.
YDN Tuesdays are geek meetups organized on the first Tuesday of each month by YDN in London.
Hadoop is an open source framework for distributed storage and processing of large datasets across clusters of computers. It allows data to be stored reliably in its Hadoop Distributed File System (HDFS) and processed in parallel using MapReduce. HDFS stores data redundantly across nodes for fault tolerance, while MapReduce breaks jobs into smaller tasks that can run across a cluster in parallel. Together HDFS and MapReduce provide scalable and fault-tolerant data storage and processing.
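For instance, listing files in HDFS from Python with the snakebite client mentioned earlier might look like this sketch; the NameNode host and port are assumptions for a default setup, and snakebite is a Python 2 library:

from snakebite.client import Client

# connect to the HDFS NameNode (host and port assumed for a default install)
client = Client('localhost', 8020)

# ls() takes a list of paths and yields one dict per entry,
# with fields such as 'path' and 'length'
for entry in client.ls(['/user']):
    print('%s %d' % (entry['path'], entry['length']))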
Big data raises challenges about how to process such a vast pool of raw data and how to extract value from it. To address these demands, an ecosystem of tools named Hadoop was conceived.
Introduction to Hadoop.
What are Hadoop, MapReduce, and the Hadoop Distributed File System?
Who uses Hadoop?
How to run Hadoop?
What are Pig, Hive, Mahout?
This document summarizes Tagomori Satoshi's presentation on handling "not so big data" at the YAPC::Asia 2014 conference. It discusses different types of data processing frameworks for various data sizes, from sub-gigabytes up to petabytes. It provides overviews of MapReduce, Spark, Tez, and stream processing frameworks. It also discusses what Hadoop is and how the Hadoop ecosystem has evolved to include these additional frameworks.
This document provides an overview of Hadoop and how it can be used for data consolidation, schema flexibility, and query flexibility compared to a relational database. It describes the key components of Hadoop including HDFS for storage and MapReduce for distributed processing. Examples of industry use cases are also presented, showing how Hadoop enables affordable long-term storage and scalable processing of large amounts of structured and unstructured data.
This document provides an overview and summary of Hadoop and related big data technologies:
- It describes what big data is in terms of volume, velocity, and variety of data. Hadoop is a framework for distributed storage and processing of large datasets across clusters of computers.
- Core Hadoop components like HDFS for storage and MapReduce for processing are introduced. Popular Hadoop tools like Pig, Hive, Sqoop and testing approaches are also summarized briefly.
- The document provides examples of using MapReduce jobs, PigLatin scripts, loading data into Hive tables and exporting data between HDFS and MySQL using Sqoop. It highlights key differences between Hive and Pig as well.
The document provides statistics on the amount of data generated and shared on various digital platforms each day: over 1 terabyte of data from NYSE, 144.8 billion emails sent, 340 million tweets, 684,000 pieces of content shared on Facebook, 72 hours of new video uploaded to YouTube per minute, and more. It outlines the massive scale of data creation and sharing occurring across social media, financial, and other digital platforms.
Big data refers to large amounts of data from various sources that is analyzed to solve problems. It is characterized by volume, velocity, and variety. Hadoop is an open source framework used to store and process big data across clusters of computers. Key components of Hadoop include HDFS for storage, MapReduce for processing, and HIVE for querying. Other tools like Pig and HBase provide additional functionality. Together these tools provide a scalable infrastructure to handle the volume, speed, and complexity of big data.
Hive is used at Facebook for data warehousing and analytics tasks on a large Hadoop cluster. It allows SQL-like queries on structured data stored in HDFS files. Key features include schema definitions, data summarization and filtering, extensibility through custom scripts and functions. Hive provides scalability for Facebook's rapidly growing data needs through its ability to distribute queries across thousands of nodes.
This document provides an overview of big data and Hadoop. It discusses the scale of big data, noting that Facebook handles 180PB per year and Twitter handles 1.2 million tweets per second. It also covers the volume, variety, and velocity challenges of big data. Hadoop and MapReduce are introduced as the leading solutions for distributed storage and processing of big data using a scale-out architecture. Key ideas of Hadoop include storing large data across multiple machines in HDFS and processing that data in parallel using MapReduce jobs.
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune (amrutupre)
MindScripts Technologies is the leading Big-Data Hadoop training institute in Pune, providing a complete Big-Data Hadoop course with Cloudera certification.
A short overview of big data, including its popularity and its ups and downs from past to present. It also looks at its needs, challenges, and risks, the architectures involved, and the vendors associated with it.
This presentation simplifies the concepts of big data, NoSQL databases, and Hadoop components.
The Original Source:
http://zohararad.github.io/presentations/big-data-introduction/
Big data refers to large volumes of data that are diverse in type and are produced rapidly. It is characterized by the V's: volume, velocity, variety, veracity, and value. Hadoop is an open-source software framework for distributed storage and processing of big data across clusters of commodity servers. It has two main components: HDFS for storage and MapReduce for processing. Hadoop allows for the distributed processing of large data sets across clusters in a reliable, fault-tolerant manner. The Hadoop ecosystem includes additional tools like HBase, Hive, Pig and Zookeeper that help access and manage data. Understanding Hadoop is a valuable skill as many companies now rely on big data and Hadoop technologies.
Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It provides reliable storage through its distributed file system HDFS and scalable processing of large datasets through its MapReduce programming model. Hadoop has a master/slave architecture with a single NameNode master and multiple DataNode slaves. The NameNode manages the file system namespace and regulates access to files by clients. DataNodes store file system blocks and service read/write requests. MapReduce allows programmers to write distributed applications by implementing map and reduce functions. It automatically parallelizes tasks across clusters and handles failures. Hadoop is widely used by companies like Yahoo and Amazon to process massive amounts of data.
This document discusses building big data solutions using Microsoft's HDInsight platform. It provides an overview of big data and Hadoop concepts like MapReduce, HDFS, Hive and Pig. It also describes HDInsight and how it can be used to run Hadoop clusters on Azure. The document concludes by discussing some challenges with Hadoop and the broader ecosystem of technologies for big data beyond just Hadoop.
This document provides an overview of Hadoop and Big Data. It begins with introducing key concepts like structured, semi-structured, and unstructured data. It then discusses the growth of data and need for Big Data solutions. The core components of Hadoop like HDFS and MapReduce are explained at a high level. The document also covers Hadoop architecture, installation, and developing a basic MapReduce program.
This document provides an overview of big data and Hadoop. It defines big data using the 3Vs - volume, variety, and velocity. It describes Hadoop as an open-source software framework for distributed storage and processing of large datasets. The key components of Hadoop are HDFS for storage and MapReduce for processing. HDFS stores data across clusters of commodity hardware and provides redundancy. MapReduce allows parallel processing of large datasets. Careers in big data involve working with Hadoop and related technologies to extract insights from large and diverse datasets.
This is the basis for some talks I've given at Microsoft Technology Center, the Chicago Mercantile exchange, and local user groups over the past 2 years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
2. Objectives
• What is Big Data
• What is Hadoop and its ecosystem
• Writing Hadoop jobs using MapReduce programming
3. Structured data {relational data with well-defined schemas}
Multi-structured data {social data, blogs, click stream, machine-generated, XML, etc.}
4. Trends … Gartner
Mobile analytics, mobility, app stores and marketplaces, human-computer interfaces, Big Data, personal cloud, multi-touch UI, in-memory computing, advanced analytics, green data centres, flash memory, social CRM, solid state drives, HTML5, context-aware computing
7. The Problem… Facebook
955 million active users as of March 2012; 1 in 3 Internet users have a Facebook account
More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month
Holds 30 PB of data for analysis; adds 12 TB of compressed data daily
8. The Problem… Twitter
500 million users, 340 million daily tweets
1.6 billion search queries a day
7 TB of data for analysis generated daily
Traditional data storage, techniques and analysis tools just do not work at these scales!
11. What is Hadoop …
A flexible and available architecture for large-scale distributed batch processing on a network of commodity hardware.
12. Apache top-level project
http://hadoop.apache.org/
500 contributors
It has one of the strongest ecosystems, with a large number of sub-projects
Yahoo has one of the biggest Hadoop installations, running thousands of servers on Hadoop
13. Inspired by … {Google GFS + MapReduce + BigTable}
The architecture behind Google's search engine
Creator of the Hadoop project
14. Use cases … What is Hadoop used for
Big/Social data analysis
Text mining, patterns search
Machine log analysis
Geo-spatial analysis
Trend Analysis
Genome Analysis
Drug Discovery
Fraud and compliance management
Video and image analysis
15. Who uses Hadoop … long list
• Amazon/A9
• Facebook
• Google
• IBM
• Disney
• Last.fm
• New York Times
• Yahoo!
• Twitter
• LinkedIn
16. What is Hadoop used for?
• Search: Yahoo, Amazon, Zvents
• Log processing: Facebook, Yahoo, ContextWeb, Last.fm
• Recommendation systems: Facebook, Disney
• Data warehouse: Facebook, AOL, Disney
• Video and image analysis: New York Times
• Computing carbon footprint: Opower
17. Our own …
AADHAAR uses Hadoop and HBase for its data processing …
19. Hive: data warehouse infrastructure built on top of Hadoop for data summarization and aggregation, queried with an SQL-like language called HiveQL.
HBase: HBase is a NoSQL columnar database and an implementation of Google BigTable. It can scale to store billions of rows (a small Python client sketch follows this list).
Flume: Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Avro: a data serialization system.
Sqoop: used for transferring bulk data between Hadoop and traditional structured data stores.
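As a small illustration of driving HBase from Python, the following happybase sketch talks to HBase through its Thrift gateway; the host, table name, and column family are made up for the example:

import happybase

# connect to the HBase Thrift server (host assumed to be localhost)
connection = happybase.Connection('localhost')
table = connection.table('web_metrics')

# write a single cell, then read the whole row back
table.put(b'row-2012-03-01', {b'stats:visits': b'42'})
print(table.row(b'row-2012-03-01'))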
32. Pig Script (Word Count)
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\w+';
-- create a group for each word
word_groups = GROUP filtered_words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
33. Hive (WordCount)
CREATE TABLE input (line STRING);
LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input;
-- temporary table to hold words
CREATE TABLE words (word STRING);
SELECT word, COUNT(*) FROM input LATERAL VIEW explode(split(line, ' ')) lTable AS word GROUP BY word;
35. Map: mapper.py
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
36. Reduce: reducer.py
#!/usr/bin/env python
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
37. Reduce: reducer.py (cont)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word
38. Reduce: reducer.py (cont)
# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
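These streaming scripts can be tested locally with a simple shell pipeline such as cat input.txt | python mapper.py | sort | python reducer.py, and then submitted to the cluster through Hadoop Streaming by passing mapper.py and reducer.py via the -mapper and -reducer options of the hadoop-streaming jar (the jar's exact location varies by Hadoop distribution).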
#4: This slide talks about the percentages of different kinds of data. Of all the data available, only a very small percentage is processed by enterprises; there is a lot of multi-structured data yet to be processed.
#5: This slide shows the various trends predicted by Gartner. One of the big trends is Big Data and cloud.
#7: Businesses, governments and society are only starting to tap its vast potential.
#28: This slide draws a parallel between the Mumbai Dabbawala system and Hadoop's parallel architecture.