Apache Storm Tutorial Point
Apache Storm Tutorial Point
Apache Storm
This tutorial will explore the principles of Apache Storm, distributed messaging, installation,
creating Storm topologies and deploy them to a Storm cluster, workflow of Trident, real-time
applications and finally concludes with some useful examples.
Audience
This tutorial has been prepared for professionals aspiring to make a career in Big Data Analytics
using Apache Storm framework. This tutorial will give you enough understanding on creating
and deploying a Storm cluster in a distributed environment.
Prerequisites
Before proceeding with this tutorial, you must have a good understanding of Core Java and any
of the Linux flavors.
All the content and graphics published in this e-book are the property of Tutorials Point (I) Pvt.
Ltd. The user of this e-book is prohibited to reuse, retain, copy, distribute or republish any
contents or a part of contents of this e-book in any manner without written consent of the
publisher.
We strive to update the contents of our website and tutorials as timely and as precisely as
possible, however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. Ltd.
provides no guarantee regarding the accuracy, timeliness or completeness of our website or its
contents including this tutorial. If you discover any errors on our website or in this tutorial,
please notify us at [email protected]
i
Apache Storm
Table of Contents
About the Tutorial..............................................................................................................................................i
Audience............................................................................................................................................................i
Prerequisites ......................................................................................................................................................i
Topology ...........................................................................................................................................................5
Tasks .................................................................................................................................................................5
Workers ............................................................................................................................................................6
ii
Apache Storm
Local Cluster....................................................................................................................................................28
Non-JVM languages.........................................................................................................................................30
Twitter ............................................................................................................................................................43
Submitting a Topology.....................................................................................................................................50
iii
Apache Storm
Submitting a Topology.....................................................................................................................................57
Klout ...............................................................................................................................................................59
Telecom Industry.............................................................................................................................................59
iv
1. Apache Storm – Introduction Apache Storm
Apache Storm is continuing to be a leader in real-time data analytics. Storm is easy to setup,
operate and it guarantees that every message will be processed through the topology at least
once.
Storm Hadoop
Stateless Stateful
Storm topology runs until shutdown by the MapReduce jobs are executed in a sequential
user or an unexpected unrecoverable failure. order and completed eventually.
5
Apache Storm
If nimbus / supervisor dies, restarting makes If the JobTracker dies, all the running jobs are
it continue from where it stopped, hence lost.
nothing gets affected.
Twitter – Twitter is using Apache Storm for its range of “Publisher Analytics products”.
“Publisher Analytics Products” process each and every tweets and clicks in the Twitter
Platform. Apache Storm is deeply integrated with Twitter infrastructure.
NaviSite – NaviSite is using Storm for Event log monitoring/auditing system. Every logs
generated in the system will go through the Storm. Storm will check the message against the
configured set of regular expression and if there is a match, then that particular message will
be saved to the database.
Wego – Wego is a travel metasearch engine located in Singapore. Travel related data comes
from many sources all over the world with different timing. Storm helps Wego to search real-
time data, resolves concurrency issues and find the best match for the end-user.
Storm is open source, robust, and user friendly. It could be utilized in small companies
as well as large corporations.
Storm is fault tolerant, flexible, reliable, and supports any programming language.
Storm is unbelievably fast because it has enormous power of processing the data.
Storm can keep up the performance even under increasing load by adding resources
linearly. It is highly scalable.
Storm performs data refresh and end-to-end delivery response in seconds or minutes
depends upon the problem. It has very low latency.
6
Apache Storm
Storm provides guaranteed data processing even if any of the connected nodes in the
cluster die or messages are lost.
7
2. Apache Storm – Core Concepts Apache Storm
Apache Storm reads raw stream of real-time data from one end and passes it through a
sequence of small processing units and output the processed / useful information at the other
end.
Components Description
Source of stream. Generally, Storm accepts input data from raw data
sources like Twitter Streaming API, Apache Kafka queue, Kestrel queue,
Spouts etc. Otherwise you can write spouts to read data from datasources.
“ISpout" is the core interface for implementing spouts. Some of the
specific interfaces are IRichSpout, BaseRichSpout, KafkaSpout, etc.
8
Apache Storm
Bolts are logical processing units. Spouts pass data to bolts and bolts
process and produce a new output stream. Bolts can perform the
operations of filtering, aggregation, joining, interacting with data sources
Bolts
and databases. Bolt receives data and emits to one or more bolts. “IBolt”
is the core interface for implementing bolts. Some of the common
interfaces are IRichBolt, IBasicBolt, etc.
Let’s take a real-time example of “Twitter Analysis” and see how it can be modelled in Apache
Storm. The following diagram depicts the structure.
The input for the “Twitter Analysis” comes from Twitter Streaming API. Spout will read the
tweets of the users using Twitter Streaming API and output as a stream of tuples. A single
tuple from the spout will have a twitter username and a single tweet as comma separated
values. Then, this steam of tuples will be forwarded to the Bolt and the Bolt will split the tweet
into individual word, calculate the word count, and persist the information to a configured
datasource. Now, we can easily get the result by querying the datasource.
9
Apache Storm
Topology
Spouts and bolts are connected together and they form a topology. Real-time application logic
is specified inside Storm topology. In simple words, a topology is a directed graph where
vertices are computation and edges are stream of data.
A simple topology starts with spouts. Spout emits the data to one or more bolts. Bolt
represents a node in the topology having the smallest processing logic and the output of a
bolt can be emitted into another bolt as input.
Storm keeps the topology always running, until you kill the topology. Apache Storm’s main
job is to run the topology and will run any number of topology at a given time.
Tasks
Now you have a basic idea on spouts and bolts. They are the smallest logical unit of the
topology and a topology is built using a single spout and an array of bolts. They should be
executed properly in a particular order for the topology to run successfully. The execution of
each and every spout and bolt by Storm is called as “Tasks”. In simple words, a task is either
the execution of a spout or a bolt. At a given time, each spout and bolt can have multiple
instances running in multiple separate threads.
Workers
A topology runs in a distributed manner, on multiple worker nodes. Storm spreads the tasks
evenly on all the worker nodes. The worker node’s role is to listen for jobs and start or stop
the processes whenever a new job arrives.
Stream Grouping
Stream of data flows from spouts to bolts or from one bolt to another bolt. Stream grouping
controls how the tuples are routed in the topology and helps us to understand the tuples flow
in the topology. There are four in-built groupings as explained below.
Shuffle Grouping
10
Apache Storm
In shuffle grouping, an equal number of tuples is distributed randomly across all of the
workers executing the bolts. The following diagram depicts the structure.
11
Apache Storm
Field Grouping
The fields with same values in tuples are grouped together and the remaining tuples kept
outside. Then, the tuples with the same field values are sent forward to the same worker
executing the bolts. For example, if the stream is grouped by the field “word”, then the tuples
with the same string, “Hello” will move to the same worker. The following diagram shows how
Field Grouping works.
Global Grouping
All the streams can be grouped and forward to one bolt. This grouping sends tuples generated
by all instances of the source to a single target instance (specifically, pick the worker with
lowest ID).
12
Apache Storm
All Grouping
All Grouping sends a single copy of each tuple to all instances of the receiving bolt. This kind
of grouping is used to send signals to bolts. All grouping is useful for join operations.
13
Apache Storm
14
3. Storm – Cluster Architecture Apache Storm
One of the main highlight of the Apache Storm is that it is a fault-tolerant, fast with no “Single
Point of Failure” (SPOF) distributed application. We can install Apache Storm in as many
systems as needed to increase the capacity of the application.
Let’s have a look at how the Apache Storm cluster is designed and its internal architecture.
The following diagram depicts the cluster design.
Apache Storm has two type of nodes, Nimbus (master node) and Supervisor (worker node).
Nimbus is the central component of Apache Storm. The main job of Nimbus is to run the
Storm topology. Nimbus analyzes the topology and gathers the task to be executed. Then, it
will distributes the task to an available supervisor.
A supervisor will have one or more worker process. Supervisor will delegate the tasks to
worker processes. Worker process will spawn as many executors as needed and run the task.
Apache Storm uses an internal distributed messaging system for the communication between
nimbus and supervisors.
Components Description
15
Apache Storm
The nodes that follow instructions given by the nimbus are called as
Supervisors. A supervisor has multiple worker processes and it
Supervisor
governs worker processes to complete the tasks assigned by the
nimbus.
Storm is stateless in nature. Even though stateless nature has its own disadvantages, it
actually helps Storm to process real-time data in the best possible and quickest way.
Storm is not entirely stateless though. It stores its state in Apache ZooKeeper. Since the state
is available in Apache ZooKeeper, a failed nimbus can be restarted and made to work from
where it left. Usually, service monitoring tools like monit will monitor Nimbus and restart it
if there is any failure.
Apache Storm also have an advanced topology called Trident Topology with state
maintenance and it also provides a high-level API like Pig. We will discuss all these features
in the coming chapters.
16
4. Apache Storm – Workflow Apache Storm
A working Storm cluster should have one nimbus and one or more supervisors. Another
important node is Apache ZooKeeper, which will be used for the coordination between the
nimbus and the supervisors.
Initially, the nimbus will wait for the “Storm Topology” to be submitted to it. The
Once a topology is submitted, it will process the topology and gather all the tasks that
are to be carried out and the order in which the task is to be executed.
Then, the nimbus will evenly distribute the tasks to all the available supervisors.
At a particular time interval, all supervisors will send heartbeats to the nimbus to
inform that they are still alive.
When a supervisor dies and doesn’t send a heartbeat to the nimbus, then the nimbus
assigns the tasks to another supervisor.
When the nimbus itself dies, supervisors will work on the already assigned task without
any issue.
Once all the tasks are completed, the supervisor will wait for a new task to come in.
The restarted nimbus will continue from where it stopped. Similarly, the dead
supervisor can also be restarted automatically. Since both the nimbus and the
supervisor can be restarted automatically and both will continue as before, Storm is
guaranteed to process all the task at least once.
Once all the topologies are processed, the nimbus waits for a new topology to arrive
and similarly the supervisor waits for new tasks.
Local mode: This mode is used for development, testing, and debugging because it
is the easiest way to see all the topology components working together. In this mode,
we can adjust parameters that enable us to see how our topology runs in different
Storm configuration environments. In Local mode, storm topologies run on the local
machine in a single JVM.
Production mode: In this mode, we submit our topology to the working storm
cluster, which is composed of many processes, usually running on different machines.
17
Apache Storm
As discussed in the workflow of storm, a working cluster will run indefinitely until it is
shutdown.
18
Apache Storm
19