
BCDS501

Introduction to Data Analytics and Visualization

UNIT 3 Mining Data Streams

SYLLABUS

Introduction to streams concepts, stream data model and architecture, stream computing, sampling data in a stream, filtering streams, counting distinct elements in a stream, estimating moments, counting ones in a window, decaying window, Real-time Analytics Platform (RTAP) applications, Case studies – real-time sentiment analysis, stock market predictions.
Data stream refers to the continuous flow of data generated by various sources in real-time. It
plays a crucial role in modern technology, enabling applications to process and analyze
information as it arrives, leading to timely insights and actions.

In this article, we discuss the concept of data streams in data analytics in detail: what data streams are, why they are important, and how they are used in fields like finance, telecommunications, and IoT (Internet of Things).

Introduction to stream concepts


A data stream is a continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is not feasible to control the order in which items arrive, nor is it feasible to store the stream locally in its entirety.
The volume of data is enormous, and items arrive at a high rate.
Types of Data Streams
•Data stream –
A data stream is a (possibly unbounded) sequence of tuples. Each tuple comprises a set of attributes, similar to a row in a database table.
•Transactional data stream –
A log of interactions between entities, for example:
1.Credit card – purchases by consumers from merchants
2.Telecommunications – phone calls by callers to the dialed parties
3.Web – accesses by clients of information at servers
•Measurement data streams –
1.Sensor networks – physical or natural phenomena, e.g., road traffic
2.IP networks – traffic at router interfaces
3.Earth climate – temperature and humidity levels at weather stations
Examples of Stream Sources
1.Sensor Data –
Sensor data is used, for example, in navigation systems. Imagine a temperature sensor floating in the ocean, sending back to the base station a reading of the surface temperature each hour. The data generated by such sensors is a stream of real numbers. We might have 3.5 terabytes arriving every day, and we certainly need to think about what can be kept for continuous processing and what can only be archived.

2.Image Data –
Satellites frequently send down to Earth streams containing many terabytes of images per day. Surveillance cameras produce images with lower resolution than satellites, but there can be very many of them, each producing a stream of images at intervals of about one second.

3.Internet and Web Traffic –

A switching node in the middle of the Internet receives streams of IP packets from many inputs and routes them to its outputs. Websites receive streams of heterogeneous types; for example, Google receives on the order of a hundred million search queries per day.
Characteristics of Data Streams
1.Large volumes of continuous data, possibly infinite.
2.Continuously changing, requiring fast, real-time responses.
3.The data stream model captures well today's data processing needs.
4.Random access is expensive – single-scan algorithms are required.
5.Only a summary of the data seen so far can be stored.
6.Most stream data are at a rather low level of abstraction or multidimensional in nature, and need multilevel and multidimensional treatment.
Applications of Data Streams
1.Fraud detection
2.Real-time trading
3.Customer analytics
4.Monitoring and reporting on internal IT systems
Advantages of Data Streams
•Streaming data is helpful in improving sales
•Helps in recognizing errors and fraud early
•Helps in minimizing costs
•Provides the detail needed to react swiftly to risk
Disadvantages of Data Streams
•Lack of security of data in the cloud
•Dependence on the cloud provider
•Off-premises storage of data introduces the possibility of disconnection or outages

What is DSMS Architecture?


DSMS stands for Data Stream Management System. It is a software application, just like a DBMS (database management system), but it involves processing and management of a continuously flowing data stream rather than static data such as Excel or PDF files. It is generally used to deal with data streams from various sources, which include sensor data, social media feeds, financial reports, etc.
Just like a DBMS, a DSMS also provides a wide range of operations such as storage, processing, analysis, and integration, and it also helps to generate visualizations and reports, but for data streams only.
There is a wide range of DSMS applications available in the market, among them Apache Flink, Apache Kafka, Apache Storm, Amazon Kinesis, etc. A DSMS processes two types of queries: standing (continuous) queries and ad hoc queries.

A DSMS consists of various layers, each dedicated to a particular operation, as follows:
1. Data Source Layer
The first layer of a DSMS is the data source layer. As its name suggests, it comprises all the data sources, which include sensors, social media feeds, financial markets, stock markets, etc. Capturing and parsing of the data stream happens in this layer. Basically, it is the collection layer that collects the data.
2. Data Ingestion Layer
You can consider this layer as a bridge between the data source layer and the processing layer. The main purpose of this layer is to handle the flow of data, i.e., data flow control, data buffering, and data routing.
3. Processing Layer
This layer is considered the heart of the DSMS architecture; it is the functional layer of DSMS applications. It processes the data streams in real time, using processing engines such as Apache Flink or Apache Storm. The main function of this layer is to filter, transform, aggregate, and enrich the data stream in order to derive insights and detect patterns.
4. Storage Layer
Once data is processed, we need to store the processed data in some storage unit. The storage layer consists of various stores such as NoSQL databases, distributed databases, etc. It helps to ensure data durability and availability in case of system failure.
5. Querying Layer
As mentioned above, a DSMS supports two types of queries: ad hoc queries and standing queries. This layer provides the tools that can be used for querying and analyzing the stored data stream. It also offers SQL-like query languages or programming APIs. These queries can answer questions like: how many entries have been made? What types of data have been inserted?
6. Visualization and Reporting Layer
This layer provides tools for visualization such as charts, pie charts, histograms, etc. On the basis of this visual representation, it also helps to generate reports for analysis.
7. Integration Layer
This layer is responsible for integrating the DSMS application with traditional systems, business intelligence tools, data warehouses, ML applications, and NLP applications. It helps to improve applications that are already running.
Together, these layers make a DSMS application work, providing a scalable and fault-tolerant system that can handle huge volumes of streaming data. The layers can change according to business requirements: some deployments may include all layers, while others may exclude a few.
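
To make the flow through these layers concrete, the following is a minimal, illustrative sketch in Python (not a real DSMS): a toy pipeline that generates sensor readings (data source layer), filters them (processing layer), stores them in an in-memory SQLite table (storage layer), and answers a simple ad hoc query (querying layer). All names, values, and the 30-degree threshold are invented for illustration.

```python
# Toy DSMS-style pipeline: source -> processing -> storage -> querying.
import sqlite3
import random
import time

def sensor_source(n=20):
    """Data source layer: emit (timestamp, sensor_id, temperature) tuples."""
    for _ in range(n):
        yield (time.time(), random.choice(["s1", "s2"]), round(random.uniform(15, 45), 1))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts REAL, sensor TEXT, temp REAL)")

for ts, sensor, temp in sensor_source():
    # Processing layer: keep only readings above an (arbitrary) threshold.
    if temp >= 30.0:
        # Storage layer: persist the filtered reading.
        conn.execute("INSERT INTO readings VALUES (?, ?, ?)", (ts, sensor, temp))
conn.commit()

# Querying layer: an ad hoc SQL query over what has been stored so far.
for sensor, count, avg_temp in conn.execute(
        "SELECT sensor, COUNT(*), AVG(temp) FROM readings GROUP BY sensor"):
    print(f"{sensor}: {count} hot readings, average {avg_temp:.1f} °C")
```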
5 Main Components Of Data Streaming Architecture

A modern data streaming architecture refers to a collection of tools and components designed to receive and handle high-volume data streams from various origins. Streaming data is data that is continuously generated and transmitted by various devices or applications, such as IoT sensors, security logs, web clicks, etc.

A data streaming architecture enables organizations to analyze and act on a data stream in real time, rather than waiting for batch processing.

A typical data streaming architecture consists of 5 main components:

A. Stream Processing Engine

This is the core component that processes streaming data. It can perform various operations
such as filtering, aggregation, transformation, enrichment, windowing, etc. A stream processing
engine can also support Complex Event Processing (CEP) which is the ability to detect
patterns or anomalies in streaming data and trigger actions accordingly.

Some popular stream processing tools are Apache Spark Streaming, Apache Flink, Apache
Kafka Streams, etc.
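
As a rough, framework-free illustration of what such an engine does internally, the sketch below applies a filter and a tumbling-window count over an iterator of events in plain Python. The event shape (timestamp, key, value) and the 10-second window are assumptions for demonstration; real engines such as Spark Streaming, Flink, or Kafka Streams provide these operations as built-in, distributed operators.

```python
# Filtering plus tumbling-window aggregation over a (timestamp-ordered) event stream.
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=10):
    """Yield (window_start, {key: count}) for each completed window."""
    current_window = None
    counts = defaultdict(int)
    for ts, key, value in events:           # events: iterable of (timestamp, key, value)
        window_start = ts - (ts % window_seconds)
        if current_window is None:
            current_window = window_start
        if window_start != current_window:   # window closed: emit result and reset
            yield current_window, dict(counts)
            counts = defaultdict(int)
            current_window = window_start
        if value > 0:                        # a trivial "filter" step
            counts[key] += 1
    if counts:
        yield current_window, dict(counts)

# Example usage with a small, hard-coded stream:
stream = [(1, "clicks", 1), (4, "views", 2), (12, "clicks", 3), (15, "clicks", 0)]
for window, result in tumbling_window_counts(stream):
    print(window, result)   # 0 {'clicks': 1, 'views': 1} then 10 {'clicks': 1}
```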
How To Choose A Stream Processing Engine?
Honestly, there is no definitive answer to this question as different stream processing engines
have different strengths and weaknesses and different use cases have different requirements
and constraints.

However, there are some general factors that you must consider when choosing a
stream processing engine. Let’s take a look at them:

•Data volume and velocity: Look at how much data you need to process per second
or per minute and how fast you need to process it. Depending on their architecture and
design, some stream processing engines can handle higher throughput and lower
latency than others.
•Data variety and quality: The type of data you need to process is a very important
factor. Depending on their schema support and data cleansing capabilities, different
stream processing engines can handle data of different complexity or diverse data
types.
•Processing complexity and functionality: The type and complexity of data
processing needed are also very important. Some stream processing engines can
support more sophisticated or flexible processing logic than others which depends on
their programming model and API.
•Scalability and reliability: The level of reliability you need for your stream
processing engine depends on how you plan to use it. Other factors like your company’s
future plans determine how scalable the stream processing engine needs to be. Some
stream processing engines can scale up or down more easily than others.
•Integration and compatibility: Stream processing engines don’t work alone. Your
stream processing engine must integrate with other components of your data streaming
architecture. Depending on their connectors and formats, some stream processing
engines can interoperate more seamlessly than others.

B. Message Broker

A message broker acts as a buffer between the data sources and the stream processing
engine. It collects data from various sources, converts it to a standard message format (such
as JSON or Avro), and then streams it continuously for consumption by other
components.

A message broker also provides features such as scalability, fault tolerance, load
balancing, partitioning, etc. Some examples of message brokers are Apache Kafka, Amazon
Kinesis Streams, etc.
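
The sketch below is a toy, in-memory stand-in for a message broker, meant only to illustrate the buffering and decoupling role described above; a real deployment would use Apache Kafka, Amazon Kinesis, or similar. The queue size, source names, and end-of-stream sentinel are all assumptions made for the example.

```python
# Producers publish JSON messages into a bounded buffer; a consumer reads them asynchronously.
import json
import queue
import threading

broker = queue.Queue(maxsize=1000)   # bounded buffer decoupling producers from the consumer

def producer(source_name, n=3):
    for i in range(n):
        # Convert each event to a standard message format (JSON) before publishing.
        broker.put(json.dumps({"source": source_name, "seq": i}))
    broker.put(None)                 # sentinel: this producer is done

def consumer(expected_producers=2):
    finished = 0
    while finished < expected_producers:
        msg = broker.get()
        if msg is None:
            finished += 1
            continue
        print("consumed:", json.loads(msg))

threads = [threading.Thread(target=producer, args=("sensors",)),
           threading.Thread(target=producer, args=("clicks",)),
           threading.Thread(target=consumer)]
for t in threads: t.start()
for t in threads: t.join()
```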

How To Choose A Message Broker?


With so many message brokers, selecting one can feel overwhelming. Here are factors you
should consider when selecting a message broker for your data streaming architecture.
•Broker scale: With an ever-increasing amount of messages sent every second, it’s
important to choose a message broker that can handle your data without sacrificing
performance or reliability.
•Data persistency: When looking for a message broker, you want to make sure it can
store your messages securely and reliably. Whether it’s a disk or memory-based storage,
having the ability to recover data in case of unexpected failures gives you peace of mind
that your important information won’t be lost.
•Integration and compatibility: It is very important to check how well your selected
message broker integrates with other components of your data streaming architecture.
Always choose a message broker that can interoperate seamlessly with your chosen
components and supports common formats and protocols.

C. Data Storage
This component stores the processed or raw streaming data for later use. Data
storage can be either persistent or ephemeral, relational or non-relational, structured or
unstructured, etc. Because of the large amount and diverse format of event streams, many
organizations opt to store their streaming event data in cloud object stores as an
operational data lake. A standard method of loading data into the data storage is using ETL
pipelines.

Some examples of data storage systems are Amazon S3, Hadoop Distributed File System
(HDFS), Apache Cassandra, Elasticsearch, etc.

Read more about different types of data storage systems such as databases, data warehouses,
and data lakes here.

How To Choose A Data Storage Service?


There are various factors that you must consider when deciding on a data storage service.
Some of these include:

•Data volume and velocity: When it comes to data storage, you need a service that
can keep up with your needs – without sacrificing performance or reliability. How much
do you have to store and how quickly must it be accessed? Your selection should meet
both of these criteria for the best results.
•Data variety and quality: It pays to choose a storage service that can accommodate
your data type, and format, as well as offer features like compression for smoother
operation. What sort of data are you looking to store? How organized and consistent is
it? Are there any issues with quality or integrity? Plus encryption and deduplication
mean improved security – so make sure it’s on the table.
•Data access and functionality: What kind of access do you need to your data? Do
you just want basic read/write operations or something more intricate like queries and
analytics? If it’s the latter, a batch processing system or real-time might be necessary.
You’ll have to find a service that can provide whatever your required pattern is as well
as features such as indexing, partitioning, and caching – all these extras could take up
any slack in functionality.
•Integration and compatibility: Pick something capable of smooth integration and
compatibility across different components, like message brokers to data stream
processing engines. Plus, it should support common formats/protocols for optimal
performance.
D. Data Ingestion Layer

The data ingestion layer is a crucial part that collects data from various sources and
transfers it to a data storage system for further data manipulation or analysis. This
layer is responsible for processing different types of data, including structured, unstructured, or
semi-structured, and formats like CSV, JSON, or XML. It is also responsible for ensuring that the
data is accurate, secure, and consistent.

Some examples of data ingestion tools are Apache Flume, Logstash, Amazon Kinesis Firehose,
etc.

How To Choose A Data Ingestion Tool?


There are many tools and technologies available in the market that can help you implement
your desired data ingestion layer, such as Apache Flume, Apache Kafka, Azure Data Factory,
etc.

Here are some factors that you can consider when selecting a tool or technology for your data
ingestion layer:

•Data source and destination compatibility: When searching for the right data
ingestion tool, make sure it can easily link up with your data sources and destinations.
To get this done without a hitch, double-check if there are any connectors or adapters
provided along with the tech. These will allow everything to come together quickly and
seamlessly.
•Data transformation capability: When selecting a tool or technology for data
transformation, it is important to consider whether it supports your specific logic. Check
if it offers pre-built functions or libraries for common transformations, or if it allows you
to write your custom code for more complex transformations. Your chosen tool or
technology should align with your desired data transformation logic to ensure smooth
and efficient data processing.
•Scalability and reliability: It is important to consider its scalability and reliability
when selecting your data ingestion tool. Consider if it can handle the amount of data
you anticipate without affecting its performance or dependability. Check if it offers
features like parallelism, partitioning, or fault tolerance to ensure it can scale and
remain dependable.

While you can use dedicated data ingestion tools such as Flume or Kafka, a better option would
be to use a tool like Estuary Flow that combines multiple components of streaming
architecture. It includes data ingestion, stream processing, and message broker; and it contains
data-lake-style storage in the cloud.

Estuary Flow supports a wide range of streaming data sources and formats so you can
easily ingest and analyze data from social media feeds, IoT sensors, clickstream data,
databases, and file systems.

This means you can get access to insights from your data sources faster than ever before.
Whether you need to run a historic analysis or react quickly to changes, our stream processing
engines will provide you with the necessary support.

And the best part?

You don’t need to be a coding expert to use Estuary Flow. Our powerful no-code solution
makes it easy for organizations to create and manage data pipelines so you can focus on
getting insights from your data instead of wrestling with code.

E. Data Visualization & Reporting Tools


These are the components that provide a user interface for exploring and analyzing streaming
data. They can also generate reports or dashboards based on predefined metrics or queries.
Together, all of these components form a complete data streaming architecture that
can handle various types of streaming data and deliver real-time insights.

Some examples of data visualization and reporting tools are Grafana, Kibana, Tableau, Power
BI, etc.

How To Choose Data Visualization & Reporting Tools For Your Data Streaming Architecture
Here are some factors to consider when choosing data visualization and reporting tools for your
data streaming architecture.

•Type and volume of data you want to stream: When choosing data visualization
and reporting tools, you might need different tools for different types of data, such as
structured or unstructured data, and for different formats, such as JSON or CSV.
•Latency and reliability requirements: If you’re looking to analyze data quickly and
with precision, check whether the tools match your latency and reliability requirements.
•Scalability and performance requirements: Select data visualization and reporting
tools that can adapt and scale regardless of how much your input increases or
decreases.
•Features and functionality of the tools: If you’re trying to make the most of your
data, you must select tools with features catered specifically to what you need. These
might include filtering abilities and interactive visualization options along with alerts for
collaboration purposes.

Now that we are familiar with the main components, let’s discuss the data streaming
architecture diagram.

Data Streaming Architecture Diagrams

Data streaming architecture is a powerful way to unlock insights and make real-time decisions
from continuous streams of incoming data. This innovative setup includes three key
components: data sources that provide the raw information, pipelines for processing it all in an
orderly manner, and finally applications or services consuming the processed results.

With these elements combined, you can track important trends as they happen.

•Data sources: Data sources, such as IoT devices, web applications, and social media
platforms are the lifeblood of data pipelines. They can also use different protocols, like
HTTP or MQTT, to continuously feed info up into your pipeline whether it’s through push
mechanisms or pull processes.
•Data pipelines: Data pipelines are powerful systems that enable you to receive and
store data from a multitude of sources while having the ability to scale up or down.
These tools can also transform your raw data into useful information through operations
like validation, enrichment, filtering, and aggregation.
•Data consumers: Data consumers are the individuals, organizations, or even
computer programs that access and make use of data from pipelines for a variety of
applications, like real-time analytics, reporting, visualizations, decision making, and
more.

With such information at their fingertips, users can analyze it further through
descriptive analysis, which examines trends in historical data to anticipate what may
happen next. Additionally, others choose to utilize predictive as well as
prescriptive techniques, both crucial aspects of any business decision-making process.

There are many ways to design and implement a data streaming architecture depending on
your needs and goals. Here are some examples of a modern streaming data architecture:

I. Lambda Architecture

This is a hybrid architecture that uses two layers: a batch (historical) layer utilizing traditional
technologies like Spark, and a speed layer for near-real-time processing with fast
streaming tools such as Kafka or Storm.

These are unified together in an extra serving layer for optimal accuracy, scalability, and fault
tolerance – though this complexity does come at some cost when it comes to latency and
maintenance needs.

II. Kappa Architecture


This is a simplified approach that uses only stream processing for both historical and
real-time data. It uses one layer: a stream layer that processes all the incoming
data using stream technologies such as Kafka Streams or Flink.

The processed data is stored in queryable storage that supports both batch and
stream queries. This approach not only offers low latency, simplicity, and consistency but
also requires high performance, reliability, and idempotency.

Stream Computing :-
•Stream computing is a computing paradigm that reads data from collections of software or hardware
sensors in stream form and computes continuous data streams.

•Stream computing uses software programs that compute continuous data streams.

•Stream computing uses software algorithms that analyze the data in real time.

•Stream computing is one effective way to support Big Data by providing extremely low-latency velocities
with massively parallel processing architectures.

•It is becoming the fastest and most efficient way to obtain useful knowledge from Big Data.

Data Sampling is a statistical method used to analyze and observe a subset of data taken from
a larger dataset, and to extract from that subset the meaningful information needed to draw
conclusions about the larger (parent) dataset.
•Sampling in data science helps in finding better and more accurate results, and works best
when the data size is big.
•Sampling helps in identifying the pattern on which the subset of the dataset is based; on the
basis of that smaller dataset, the entire population is presumed to hold the same properties.
•It is a quicker and more effective method of drawing conclusions.
Why is Data Sampling important?
Data sampling is important for a couple of key reasons:
1.Cost and Time Efficiency: Sampling allows researchers to collect and analyze a subset of
data rather than the entire population. This reduces the time and resources required for data
collection and analysis, making it more cost-effective, especially when dealing with large
datasets.

2.Feasibility: In many cases, it's impractical or impossible to analyze the entire population due
to constraints such as time, budget, or accessibility. Sampling makes it feasible to study a
representative portion of the population while still yielding reliable results.

3.Risk Reduction: Sampling helps mitigate the risk of errors or biases that may occur when
analyzing the entire population. By selecting a random or systematic sample, researchers can
minimize the impact of outliers or anomalies that could skew the results.

4.Accuracy: In some cases, examining the entire population might not even be possible. For
instance, testing every single item in a large batch of manufactured goods would be
impractical. Data sampling allows researchers to get a good understanding of the whole
population by examining a well-chosen subset.
Types of Data Sampling Techniques
There are mainly two types of Data Sampling techniques which are further divided into 4
sub-categories each. They are as follows:
Probability Data Sampling Technique
Probability Data Sampling technique involves selecting data points from a dataset in such
a way that every data point has an equal chance of being chosen. Probability sampling
techniques ensure that the sample is representative of the population from which it is drawn,
making it possible to generalize the findings from the sample to the entire population with a
known level of confidence.
1.Simple Random Sampling: In simple random sampling, every data point has an equal chance
or probability of being selected. For example, selecting heads or tails: both outcomes of the
event have equal probabilities of being selected.

2.Systematic Sampling: In systematic sampling, a regular interval is chosen, and a data point is
taken for the sample after each interval. It is easier and more regular than the previous method
of sampling and reduces inefficiency while improving speed. For example, in a series of 10
numbers, we take a sample after every 2nd number; here we use the process of systematic sampling.

3.Stratified Sampling: In stratified sampling, we follow a divide-and-conquer strategy. We
divide the data into groups (strata) on the basis of similar properties and then perform sampling
within each group. This ensures better accuracy. For example, in workplace data, the total
number of employees is divided into men and women.

4.Cluster Sampling: Cluster sampling is more or less like stratified sampling. However, in
cluster sampling we randomly choose whole groups (clusters) of data, whereas in stratified
sampling an orderly division into strata takes place and items are drawn from every stratum.
For example, picking the users of a few randomly chosen networks from the total population of users.
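
As a small, hedged illustration of the probability techniques listed above, the sketch below draws simple random, systematic, and stratified samples from a toy dataset using only the Python standard library; the dataset, sample sizes, and strata are made up for demonstration.

```python
import random

population = list(range(1, 101))        # a toy "dataset" of 100 records

# Simple random sampling: every record has an equal chance of being chosen.
simple_random = random.sample(population, k=10)

# Systematic sampling: pick a random start, then take every k-th record.
k = len(population) // 10               # sampling interval
start = random.randrange(k)
systematic = population[start::k]

# Stratified sampling: split into strata (here: odd vs. even values), sample each stratum.
strata = {"odd": [x for x in population if x % 2],
          "even": [x for x in population if x % 2 == 0]}
stratified = {name: random.sample(group, k=5) for name, group in strata.items()}

print(simple_random, systematic, stratified, sep="\n")
```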
Non-Probability Data Sampling
Non-probability data sampling means that the selection happens on a non-random basis, and it
depends on the individual which data they want to pick. There is no random selection; every
selection is made with a thought and an idea behind it.
1.Convenience Sampling: As the name suggests, the data checker selects the data based on
his/her convenience. They may choose the data sets that require fewer calculations and
save time, while bringing results roughly on par with probability data sampling techniques. For example,
in a dataset about recruitment in the IT industry, the convenient choice would be to
pick the most recent data, or the data that mostly covers younger people.

2.Voluntary Response Sampling: As the name suggests, this sampling method depends on
the voluntary response of the audience. For example, if a survey is being conducted on the
blood groups found in the majority at a particular place, and only the people who are willing to
take part in the survey respond, the resulting sampling is referred to as voluntary response sampling.

3.Purposive Sampling: A sampling method that involves a special purpose falls under
purposive sampling. For example, if we need to study the need for education, we may conduct a
survey in rural areas and then create a dataset based on people's responses. Such
sampling is called purposive sampling.

4.Snowball Sampling: The snowball sampling technique works via contacts. For example, if we
wish to conduct a survey of people living in slum areas, and one person puts us in contact with
another, and so on, the process is called snowball sampling.
Data Sampling Process

The process of data sampling involves the following steps:


•Find a Target Dataset: Identify the dataset that you want to analyze or draw conclusions
about. This dataset represents the larger population from which a sample will be drawn.
•Select a Sample Size: Determine the size of the sample you will collect from the target
dataset. The sample size is the subset of the larger dataset on which the sampling process will
be performed.
•Decide the Sampling Technique: Choose a suitable sampling technique from options such
as Simple Random Sampling, Systematic Sampling, Cluster Sampling, Snowball Sampling, or
Stratified Sampling. The choice of technique depends on factors such as the nature of the
dataset and the research objectives.
•Perform Sampling: Apply the selected sampling technique to collect data from the target
dataset. Ensure that the sampling process is carried out systematically and according to the
chosen method.
•Draw Inferences for the Entire Dataset: Analyze the properties and characteristics of the
sampled data subset. Use statistical methods and analysis techniques to draw inferences and
insights that are representative of the entire dataset.
•Extend Properties to the Entire Dataset: Extend the findings and conclusions derived
from the sample to the entire target dataset. This involves extrapolating the insights gained
from the sample to make broader statements or predictions about the larger population.
Advantages of Data Sampling
•Data sampling helps draw conclusions, or inferences, about a larger dataset using a smaller
sample space that represents the entire dataset.
•It helps save time and is a quicker, faster approach.
•It is better in terms of cost-effectiveness, as it reduces the cost of data collection, observation,
and analysis: obtain the data, apply the sampling method, and draw the conclusion.
•It gives reasonably accurate results and conclusions when done well.
Disadvantages of Data Sampling
•Sampling Error: the difference between a characteristic of the sample and that of the entire
population. Differences in characteristics or properties between the two reduce accuracy, and
the sample set may fail to represent the larger body of information. Sampling error mostly
occurs by chance rather than through a mistake in method.
•It becomes difficult in a few data sampling methods, such as forming clusters of similar
properties.
•Sampling Bias: choosing a sample set which does not represent the entire population as a
whole. It occurs mostly due to an incorrect sampling method and results in errors, as the chosen
dataset cannot properly support conclusions about the larger set of data.
Sample Size Determination
The sample is drawn from the universal (parent) dataset, and its size determines how well the
smaller dataset can be used to infer the properties of the entire dataset. The following steps
are involved in sample size determination:
1.First, calculate the population size, i.e., the total size of the sample space on which the sampling
has to be performed.

2.Decide the confidence level, which represents how certain you want to be about the accuracy of the results.
3.Decide the margin of error that is acceptable with respect to the population.
4.Estimate the standard deviation, i.e., the expected deviation of values from the mean.
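
The steps above do not prescribe a specific formula, but a commonly used one for proportions is Cochran's formula with a finite population correction. The sketch below works through it with assumed values (95% confidence, 5% margin of error, p = 0.5, population of 10,000); these numbers are not from the text above.

```python
import math

def sample_size(population, confidence_z=1.96, margin_of_error=0.05, p=0.5):
    # Steps 2 and 3: confidence level (as a z-score) and margin of error.
    n0 = (confidence_z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    # Step 1: adjust for the finite population size.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

print(sample_size(population=10_000))   # about 370 records needed out of 10,000
```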
Best Practices for Effective Data Sampling
Before applying data sampling methods, keep the following considerations in mind for effective sampling.
1.Statistical Regularity: A larger sample space, or parent dataset, means more accurate
results, because the probability of every data point being chosen is equal, i.e., regular.
When picked at random, a larger sample ensures regularity across the data.
2.The dataset must be accurate and verified from the respective sources.
3.In the stratified data sampling technique, one needs to be clear about the kind of strata or groups
that will be formed.

4.Inertia of Large Numbers: Like the first principle, this too states that the
parent dataset must be large enough to obtain better and clearer results.


Distinct Elements in a Stream


Given an array of integers arr[], the task is to return the number of distinct elements in the
data after processing arr[0..i], for each 0 <= i < arr.size().
The array contains positive and negative values: a positive value means you have to append it
to your data, and a negative value means you have to remove one occurrence of it from your data.
Note: If the element is not present in the data and you get the negative of that element, then no
changes should occur.
Examples:

Input: arr[] = [5, 5, 7, -5, -7, 1, 2, -2]

Output: [1, 1, 2, 2, 1, 2, 3, 2]
Explanation: Proper adding and removal of integers gives this output.
Input: arr[] = [9, 9, 3, -9, -3, -9]
Output: [1, 1, 2, 2, 1, 0]
Explanation: Proper adding and removal of integers gives this output.

Expected Time Complexity: O(n).

Expected Auxiliary Space: O(n).
Constraints:
1 ≤ arr.size() ≤ 10^6
-10^6 ≤ arr[i] ≤ 10^6
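
A straightforward O(n) solution sketch for the problem above is shown below: maintain a dictionary of counts (a multiset) and report the number of distinct keys after each operation. Function and variable names are chosen for illustration.

```python
from collections import defaultdict

def distinct_after_each(arr):
    counts = defaultdict(int)
    distinct = 0
    result = []
    for x in arr:
        if x > 0:                       # positive value: append to the data
            counts[x] += 1
            if counts[x] == 1:
                distinct += 1
        else:                           # negative value: remove one occurrence, if present
            v = -x
            if counts[v] > 0:
                counts[v] -= 1
                if counts[v] == 0:
                    distinct -= 1
        result.append(distinct)
    return result

print(distinct_after_each([5, 5, 7, -5, -7, 1, 2, -2]))  # [1, 1, 2, 2, 1, 2, 3, 2]
print(distinct_after_each([9, 9, 3, -9, -3, -9]))        # [1, 1, 2, 2, 1, 0]
```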

Estimating Moments :-
•Estimating moments is a generalization of the problem of counting distinct elements in a stream. The
problem, called computing "moments," involves the distribution of frequencies of different elements in the
stream.

•Suppose a stream consists of elements chosen from a universal set. Assume the universal set is ordered, so
we can speak of the i-th element for any i.
•Let m_i be the number of occurrences of the i-th element. Then the k-th-order moment of the stream is the
sum over all i of (m_i)^k, i.e., $\sum_i (m_i)^k$.

For example :-

•The 0th moment is the sum of 1 for each m_i that is greater than 0, i.e., the 0th moment is a count of the
number of distinct elements in the stream.
•The 1st moment is the sum of the m_i's, which must be the length of the stream. Thus, first moments are
especially easy to compute: just count the length of the stream seen so far.
•The second moment is the sum of the squares of the m_i's. It is sometimes called the surprise number,
since it measures how uneven the distribution of elements in the stream is.
•To see the distinction, suppose we have a stream of length 100, in which eleven different elements appear.
The most even distribution of these eleven elements would have one appearing 10 times and the other ten
appearing 9 times each.

•In this case, the surprise number is 10^2 + 10 × 9^2 = 910. At the other extreme, one of the eleven
elements could appear 90 times and the other ten appear 1 time each. Then, the surprise number would
be 90^2 + 10 × 1^2 = 8110.
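
The sketch below computes the 0th, 1st, and 2nd moments exactly from element counts, matching the definitions above, and reproduces the surprise-number example. (Streaming algorithms such as AMS estimate moments without storing all counts; that is beyond this simple sketch, which assumes the counts fit in memory.)

```python
from collections import Counter

def moment(stream, k):
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())

# Reproduce the example: eleven elements in a stream of length 100.
even_stream = ["a"] * 10 + [f"e{i}" for i in range(10) for _ in range(9)]
skewed_stream = ["a"] * 90 + [f"e{i}" for i in range(10)]

print(moment(even_stream, 0), moment(even_stream, 1), moment(even_stream, 2))  # 11 100 910
print(moment(skewed_stream, 2))                                                # 8110
```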


Decaying Window Algorithm


This algorithm allows you to identify the most popular elements (trending, in other words) in an
incoming data stream.
The decaying window algorithm not only tracks the most recurring elements in an incoming data stream, but also
discounts any random spikes or spam requests that might have boosted an element’s frequency. In a decaying window,
you assign a score or weight to every element of the incoming data stream. Further, you need to calculate the aggregate
sum for each distinct element by adding all the weights assigned to that element. The element with the highest total
score is listed as trending or the most popular.

1. Assign each element a weight/score.


2. Calculate the aggregate sum for each distinct element by adding all the weights assigned to that element.

In a decaying window algorithm, you assign more weight to newer elements. For each new element, you first
reduce the weight of all the existing elements by a constant factor (1 − c) and then assign the new element its
own weight. The aggregate sum of the exponentially decaying weights can be calculated using the following formula:

$\sum_{i=0}^{t-1} a_{t-i}\,(1-c)^i$

Here, c is usually a small constant, of the order of $10^{-6}$ or $10^{-9}$. Whenever a new element, say $a_{t+1}$,
arrives in the data stream, you perform the following steps to obtain the updated sum:

1. Multiply the current sum/score by the value (1 − c).

2. Add the weight corresponding to the new element.

Weight decays exponentially over time

In a data stream consisting of various elements, you maintain a separate sum for each distinct element. For every
incoming element, you multiply the sum of all the existing elements by a value of (1−c). Further, you add the weight of
the incoming element to its corresponding aggregate sum.

A threshold can be kept, and elements whose weight falls below that threshold can be ignored.
Finally, the element with the highest aggregate score is listed as the most popular element.
Example
For example, consider the sequence of Twitter tags below:
fifa, ipl, fifa, ipl, ipl, ipl, fifa

Also, let's say each arriving element has a weight of 1, and let c be 0.1.
A separate running score is kept for each tag. At every step the existing score is multiplied by (1 − c) = 0.9,
and 1 is added only if the arriving tag matches. The aggregate score of each tag at the end of the above stream is calculated as below:
fifa
fifa - 0 * 0.9 + 1 = 1
ipl - 1 * 0.9 + 0 = 0.9 (adding 0 because the current tag is not fifa)
fifa - 0.9 * 0.9 + 1 = 1.81 (adding 1 because the current tag is fifa)
ipl - 1.81 * 0.9 + 0 = 1.629
ipl - 1.629 * 0.9 + 0 = 1.4661
ipl - 1.4661 * 0.9 + 0 = 1.3195
fifa - 1.3195 * 0.9 + 1 = 2.1875
ipl
fifa - 0 * 0.9 + 0 = 0
ipl - 0 * 0.9 + 1 = 1
fifa - 1 * 0.9 + 0 = 0.9 (adding 0 because the current tag is not ipl)
ipl - 0.9 * 0.9 + 1 = 1.81
ipl - 1.81 * 0.9 + 1 = 2.629
ipl - 2.629 * 0.9 + 1 = 3.3661
fifa - 3.3661 * 0.9 + 0 = 3.0295

At the end of the sequence, the score of fifa is about 2.19 while the score of ipl is about 3.03.
So, ipl is trending more than fifa.
Note that the final scores reflect not only how many times each tag occurred but also how recently it occurred,
which is exactly what makes the decaying window useful for spotting trends.
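
A compact Python sketch of this scoring scheme is given below; it reproduces the fifa/ipl example above with c = 0.1. In practice c is much smaller (e.g. 10^-6) and scores below a threshold are dropped, as noted earlier.

```python
def decaying_window_scores(stream, c=0.1):
    scores = {}
    for element in stream:
        # Step 1: decay every existing score by a factor of (1 - c).
        for key in scores:
            scores[key] *= (1 - c)
        # Step 2: add the new element's weight (1) to its own score.
        scores[element] = scores.get(element, 0.0) + 1.0
    return scores

tags = ["fifa", "ipl", "fifa", "ipl", "ipl", "ipl", "fifa"]
print(decaying_window_scores(tags))   # fifa ≈ 2.19, ipl ≈ 3.03 — ipl ends up higher
```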
Advantages of Decaying Window Algorithm:
1. Sudden spikes or spam data are taken care of.
2. New elements are given more weight by this mechanism, which produces the right trending output.

What Is Real-Time Sentiment Analysis?

Real-time Sentiment Analysis is a machine learning (ML) technique that automatically recognizes and extracts the

sentiment in a text whenever it occurs. It is most commonly used to analyze brand and product mentions in live social

comments and posts. An important thing to note is that real-time sentiment analysis can be done only from social media

platforms that share live feeds like Twitter does.

The real-time sentiment analysis process uses several ML tasks such as natural language processing, text analysis,

semantic clustering, etc to identify opinions expressed about brand experiences in live feeds and extract business

intelligence from them.


Why Do We Need Real-Time Sentiment Analysis?

Real-time sentiment analysis has several applications for brand and customer analysis. These include the following.

1.Live social feeds from video platforms like Instagram or Facebook

2.Real-time sentiment analysis of text feeds from platforms such as Twitter. This is immensely helpful in

prompt addressing of negative or wrongful social mentions as well as threat detection in cyberbullying.

3.Live monitoring of Influencer live streams.

4.Live video streams of interviews, news broadcasts, seminars, panel discussions, speaker events, and lectures.

5.Live audio streams such as in virtual meetings on Zoom or Skype, or at product support call centers

for customer feedback analysis.

6.Live monitoring of product review platforms for brand mentions.

7.Up-to-date scanning of news websites for relevant news through keywords and hashtags along with

the sentiment in the news.

Read in detail about the applications of real-time sentiment analysis.


How Is Real-Time Sentiment Analysis Done?

Live sentiment analysis is done through machine learning algorithms that are trained to recognize and analyze all data

types from multiple data sources, across different languages, for sentiment.

A real-time sentiment analysis platform needs to be first trained on a data set based on your industry and needs. Once

this is done, the platform performs live sentiment analysis of real-time feeds effortlessly.

Below are the steps involved in the process.


Step 1 - Data collection

To extract sentiment from live feeds from social media or other online sources, we first need to add live APIs of those

specific platforms, such as Instagram or Facebook. In case of a platform or online scenario that does not have a live

API, such as can be the case of Skype or Zoom, repeat, time-bound data pull requests are carried out. This gives the

solution the ability to constantly track relevant data based on your set criteria.

Step 2 - Data processing

All the data from the various platforms thus gathered is now analyzed. All text data in comments are cleaned up and

processed for the next stage. All non-text data from live video or audio feeds is transcribed and also added to the text

pipeline. In this case, the platform extracts semantic insights by first converting the audio, and the audio in the video

data, to text through speech-to-text software.

This transcript has timestamps for each word and is indexed section by section based on pauses or changes in the

speaker. A granular analysis of the audio content like this gives the solution enough context to correctly identify entities,

themes, and topics based on your requirements. This time-bound mapping of the text also helps with semantic search.

Even though this may seem like a long drawn-out process, the algorithms complete this in seconds.

Step 3 - Data analysis

All the data is now analyzed using native natural language processing (NLP), semantic clustering, and aspect-based

sentiment analysis. The platform derives sentiment from aspects and themes it discovers from the live feed, giving you

the sentiment score for each of them.

It can also give you an overall sentiment score in percentile form and tell you sentiment based on language and data

sources, thus giving you a break-up of audience opinions based on various demographics.

Step 4 - Data visualization

All the intelligence derived from the real-time sentiment analysis in step 3 is now showcased on a reporting dashboard

in the form of statistics, graphs, and other visual elements. It is from this sentiment analysis dashboard that you can set

alerts for brand mentions and keywords in live feeds as well.

Learn more about the steps in sentiment analysis.
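
As a heavily simplified, hedged sketch of the four steps above, the code below "collects" comments from a hard-coded feed, "processes" the text, "analyzes" it with a tiny made-up sentiment lexicon, and reports one aggregate metric. A real platform would use live platform APIs, speech-to-text, and trained NLP models rather than this toy word list.

```python
import re

POSITIVE = {"love", "great", "amazing", "good"}
NEGATIVE = {"hate", "bad", "terrible", "broken"}

def collect():
    # Stand-in for Step 1 (data collection): a hard-coded "live" feed.
    yield from ["I love this product!", "Terrible support, broken app", "Great update"]

def process(comment):
    # Step 2 (data processing): clean and tokenize the text.
    return re.findall(r"[a-z']+", comment.lower())

def analyze(tokens):
    # Step 3 (data analysis): score = positive hits minus negative hits.
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

scores = [analyze(process(c)) for c in collect()]
positive_share = 100 * sum(s > 0 for s in scores) / len(scores)
# Step 4 (reporting): a single headline metric for the dashboard.
print(f"Overall sentiment: {positive_share:.0f}% of comments positive")
```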


What Are The Most Important Features Of A Real-Time Sentiment Analysis Platform?

A live feed sentiment analysis solution must have certain features that are necessary to extract and determine real-time

insights. These are:

•Multiplatform
One of the most important features of a real-time sentiment analysis tool is its ability to analyze multiple social media

platforms. This multiplatform capability means that the tool is robust enough to handle API calls from different

platforms, which have different rules and configurations so that you get accurate insights from live data.

This gives you the flexibility to choose whether you want to have a combination of platforms for live feed analysis such

as from a Ted talk, live seminar, and Twitter, or just a single platform, say, live Youtube video analysis.

•Multimedia

Being multi-platform also means that the solution needs to have the capability to process multiple data types such as

audio, video, and text. In this way, it allows you to discover brand and customer sentiment through live TikTok social

listening, real-time Instagram social listening, or live Twitter feed analysis, effortlessly, regardless of the data format.

•Multilingual

Another important feature is a multilingual capability. For this, the platform needs to have part-of-speech taggers for

each language that it is analyzing. Machine translations can lead to a loss of meanings and nuances when translating

non-Germanic languages such as Korean, Chinese, or Arabic into English. This can lead to inaccurate insights from live

conversations.

•Web scraping

While metrics from a social media platform can tell you numerical data like the number of followers, posts, likes,

dislikes, etc, a real-time sentiment analysis platform can perform data scraping for more qualitative insights. The tool’s

in-built web scraper automatically extracts data from the social media platform you want to extract sentiment from. It

does so by sending HTTP requests to the different web pages it needs to target for the desired information, downloads

them, and then prepares them for analysis.

It parses the saved data and applies various ML tasks such as NLP, semantic classification, and sentiment analysis. And

in this way gives you customer insights beyond the numerical metrics that you are looking for.

•Alerts

The sentiment analysis tool for live feeds must have the capability to track and simplify complex data sets as it conducts

repeat scans for brand mentions, keywords, and hashtags. These repeat scans, ultimately, give you live updates based on

comments, posts, and audio content on various channels. Through this feature, you can set alerts for particular keywords

or when there is a spike in your mentions. You can get these notifications on your mobile device or via email.

•Reporting

Another major feature of a real-time sentiment analysis platform is the reporting dashboard. The insights visualization

dashboard is needed to give you the insights that you require in a manner that is easily understandable. Color-coded pie
charts, bar graphs, word clouds, and other formats make it easy for you to assess sentiment in topics, aspects, and the

overall brand, while also giving you metrics in percentile form.

The user-friendly customer experience analysis solution, Repustate IQ, has a very comprehensive reporting dashboard

that gives numerous insights based on various aspects, topics, and sentiment combinations. In addition, it is also

available as an API that can be easily integrated with a dashboard such as Power BI or Tableau that you are already

using. This gives you the ability to leverage a high-precision sentiment analysis API without having to invest in yet

another end-to-end solution that has a fixed reporting dashboard.
