Unit 1.1 Data Science Technology Stack
Unit I
Outline
Data Science Storage Tools
Data Lake
Data Vault
Data Warehouse Bus Matrix
Data Science Processing Tools
Spark
Mesos
Akka
Cassandra
Kafka
Elastic Search
R
Scala
Python
MQTT
The Future
Introduction
In 1960, Peter Naur started using the term data science as
a substitute for computer science. He stated that to work with
data, you require more than just computer science.
Data science is an interdisciplinary science that
incorporates practices and methods with actionable
knowledge and insights from data in heterogeneous
schemas (structured, semi-structured, or unstructured).
It amalgamates the scientific fields of data exploration
with thought-provoking research fields such as data
engineering, information science, computer science,
statistics, artificial intelligence, machine learning, data
mining, and predictive analytics.
Data Analytics
Data analytics is the science of fact-finding analysis of raw data,
with the goal of drawing conclusions from the data lake.
Data analytics is driven by certified algorithms to statistically
define associations between data that produce insights.
Machine Learning
Machine learning is the capability of systems to learn without
explicit software development. It evolved from the study of
pattern recognition and computational learning theory.
Data Mining
Data mining is processing data to isolate patterns and establish
relationships between data entities within the data lake.
For data mining to be successful, there are a small number of critical data-mining theories about data patterns that you must know.
Statistics
Statistics is the study of the collection, analysis, interpretation,
presentation, and organization of data.
Statistics deals with all aspects of data, including the planning of
data collection, in terms of the design of surveys and
experiments.
Data science and statistics are closely related.
Algorithms
An algorithm is a self-contained step-by-step set of processes
to achieve a specific outcome.
Algorithms execute calculations, data processing, or automated
reasoning tasks with repeatable outcomes.
Algorithms are the backbone of the data science process.
They assemble a series of methods and procedures that ease the complexity and processing of your specific data lake.
Data Visualization
Data visualization is a key communication channel with the
business.
It consists of the creation and study of the visual
representation of business insights.
Data science’s principal deliverable is visualization.
You will have to take your highly technical results and
transform them into a format that you can show to non-
specialists.
The successful transformation of data results to actionable
knowledge is a skill set.
Storytelling
Data storytelling is the process of translating data analyses into
layperson’s terms, in order to influence a business decision or
action.
You can have the finest data science, but without the business
story to translate your findings into business-relevant actions,
you will not succeed.
Rapid Information Factory
Ecosystem
The Data Science Technology Stack covers the data
processing requirements in the Rapid Information Factory
ecosystem.
The Rapid Information Factory ecosystem is a
convention of techniques used for processing
developments.
Data Science Storage Tools
Schema-on-Write:-
A traditional relational database management system
(RDBMS) requires a schema before you can load the data.
To retrieve data from any structured data schemas, you
may have been running standard SQL queries for a
number of years.
Advantages :-
In traditional data ecosystems, tools assume schemas and can only
work once the schema is described, so there is only one view on the
data.
The approach is extremely valuable in articulating relationships
between data points, so there are already relationships configured.
It is an efficient way to store “dense” data.
All the data is in the same data store.
Schema-on-Write:-
Disadvantage :-
We can’t upload data until the table is created and we can’t
create tables until we understand the schema of the data that
will be in this table.
Its schemas are typically purpose-built, which makes them hard
to change and maintain.
It generally loses the raw/atomic data as a source for future
analysis.
It requires considerable modeling/implementation effort before
being able to work with the data.
If a specific type of data can’t be stored in the schema, you
can’t effectively process it from the schema.
Schema-on-Read :-
This data storage methodology does not require a schema before you load the data; the schema is created only when reading the data.
Initially, you store the data with minimal structure.
The essential schema is applied during the query phase.
Advantages :-
It provides flexibility to store unstructured, semi-structured, and
disorganized data.
It allows for unlimited flexibility when querying data from the structure.
Leaf-level data is kept intact and untransformed for reference and use
for the future.
The methodology encourages experimentation and exploration.
It increases the speed of generating fresh actionable knowledge.
It reduces the cycle time from data generation to the availability of actionable knowledge.
The exploding growth of unstructured data and the ETL overhead of storing data in an RDBMS are the main reasons for the shift to schema-on-read.
Disadvantage :-
Inaccuracies and slow query speeds.
Because the data is not subjected to rigorous ETL, data cleaning, or validation, inaccurate or incomplete query results are possible.
Which is better?
It depends on the use case.
Is the workload mostly known, repetitive reporting, where results need to be fast?
Use schema-on-write.
Is the workload mostly unknown data and constantly new sources?
Use schema-on-read.
(A small sketch contrasting the two approaches follows.)
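To make the contrast concrete, here is a minimal sketch in Python using only the standard library; the field names and sample records are hypothetical.

```python
# Schema-on-write vs. schema-on-read, illustrated with the standard library.
import json
import sqlite3

# Schema-on-write: the table (schema) must exist before any data is loaded.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (item TEXT, amount REAL)")
db.execute("INSERT INTO sales VALUES (?, ?)", ("book", 12.5))
print(db.execute("SELECT SUM(amount) FROM sales").fetchone())

# Schema-on-read: raw JSON lines are stored as-is; structure is applied only
# at query time, so records with extra or missing fields still load.
raw_lines = [
    '{"item": "book", "amount": 12.5}',
    '{"item": "pen", "amount": 1.2, "colour": "blue"}',  # extra field is fine
]
records = [json.loads(line) for line in raw_lines]        # schema applied here
print(sum(r["amount"] for r in records))
```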
Data Lake
A data lake is a storage repository for a massive amount of raw data.
It stores data in its native (flat-file) format, in anticipation of future requirements.
A data lake uses a less restricted schema-on-read-based architecture
to store data.
Each data element in the data lake is assigned a distinctive identifier
and tagged with a set of metadata tags.
A data lake is typically deployed using distributed data object
storage, to enable the schema-on-read structure.
This means that business analytics and data mining tools access the
data without a complex schema.
Using a schema-on-read methodology enables you to load your data
as it comes and start to get value from it instantaneously.
For deployment onto the cloud, it is a cost-effective solution to use
Amazon’s Simple Storage Service (Amazon S3) to store the base
data for the data lake.
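As an illustration of landing raw, tagged objects in S3, here is a minimal sketch using the boto3 AWS SDK; the bucket name, object key, local file, and tag values are hypothetical, and valid AWS credentials are assumed.

```python
# Landing a raw file in a hypothetical S3-backed data lake with boto3.
import boto3

s3 = boto3.client("s3")

# Land a raw file in the data lake's "raw zone", untransformed.
s3.upload_file("clicks-2024-01-01.json", "my-data-lake", "raw/clicks/2024-01-01.json")

# Tag the object with simple metadata so it can be discovered later.
s3.put_object_tagging(
    Bucket="my-data-lake",
    Key="raw/clicks/2024-01-01.json",
    Tagging={"TagSet": [{"Key": "source", "Value": "web-clickstream"}]},
)
```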
Key Data Lake Concepts
Why Data Lake?
The main objective of building a data lake is to offer an
unrefined view of data to data scientists.
Reasons for using Data Lake are:
With the advent of storage engines like Hadoop, storing disparate information has become easy. With a Data Lake, there is no need to model data into an enterprise-wide schema.
With the increase in data volume, data quality, and metadata,
the quality of analyses also increases.
Machine Learning and Artificial Intelligence can be used to
make profitable predictions.
It offers a competitive advantage to the implementing
organization.
There is no data silo structure. Data Lake gives 360 degrees
view of customers and makes analysis more robust.
Data Lake Implementation
The design of a Data Lake should be driven by what is available instead of what is required.
The schema and data requirements are not defined until the data is queried.
Data discovery, ingestion, storage, administration, quality,
transformation, and visualization should be managed
independently.
The Data Lake architecture should be tailored to a specific
industry.
Faster on-boarding of newly discovered data sources is important.
The Data Lake should support existing enterprise data management techniques and methods.
Challenges in Data Lake Implementation
In a Data Lake, data volume is higher, so processes must rely more on programmatic administration.
It is difficult to deal with sparse, incomplete, and volatile data.
A wider scope of datasets and sources requires more data governance and support.
Data Lake vs. Data Warehouse
Data: A data lake stores everything; a data warehouse focuses only on business processes.
Processing: Data lake data is mainly unprocessed; data warehouse data is highly processed.
Type of data: A data lake can hold unstructured, semi-structured, and structured data; a data warehouse is mostly tabular in form and structure.
Task: A data lake shares data stewardship; a data warehouse is optimized for data retrieval.
Agility: A data lake is highly agile and can be configured and reconfigured as needed; a data warehouse is less agile, with a fixed configuration.
Users: A data lake is mostly used by data scientists; a data warehouse is widely used by business professionals.
Storage: A data lake is designed for low-cost storage; a data warehouse uses expensive storage that gives fast response times.
Security: A data lake offers lesser control; a data warehouse allows better control of the data.
Replacement of EDW: A data lake can be a source for the EDW; a data warehouse is complementary to the EDW (not a replacement).
Schema: A data lake uses schema-on-read (no predefined schemas); a data warehouse uses schema-on-write (predefined schemas).
Data processing: A data lake enables fast ingestion of new data; a data warehouse is time-consuming for introducing new content.
Data granularity: A data lake holds data at a low level of detail or granularity; a data warehouse holds data at a summary or aggregated level of detail.
Tools: A data lake can use open-source tools like Hadoop/MapReduce; a data warehouse mostly uses commercial tools.
Benefits of using Data Lake
Helps fully with productionizing & advanced analytics
Offers cost-effective scalability and flexibility
Offers value from unlimited data types
Reduces long-term cost of ownership
Allows economic storage of files
Quickly adaptable to changes
The main advantage of data lake is the centralization of
different content sources
Users from various departments, who may be scattered around the globe, can have flexible access to the data
Risk of using Data Lake
After some time, Data Lake may lose relevance and
momentum
It also increases storage and compute costs
There is no way to get insights from others who have
worked with the data because there is no account of the
lineage of findings by previous analysts
The biggest risk of data lakes is security and access
control. Sometimes data can be placed into a lake without
any oversight, even though some of the data may have privacy and
regulatory requirements
Data Vault
Data vault modeling, designed by Dan Linstedt, is a database
modeling method that is intentionally structured to be in
control of long-term historical storage of data from multiple
operational systems.
The data vaulting processes transform the schema-on-
read data lake into a schema-on-write data vault.
The data vault is designed into the schema-on-read query
request and then executed against the data lake.
It supports chronological historical data tracking for full
auditing and enables parallel loading of the structures.
Definition:- “A detail oriented, historical tracking and uniquely
linked set of normalized tables that support one or more functional
areas of business. It is a hybrid approach encompassing the best of
breed between 3NF and Star Schemas. The design is flexible,
scalable, consistent and adaptable to the needs of the enterprise.”
Data Vault
It is a data model that is architected specifically to meet
the needs of today’s enterprise data warehouses.
Extensive possibilities for data attribution.
All data relationships are key driven.
Relationships can be dropped and created on-the-fly.
Data Mining can discover new relationships between elements
Artificial Intelligence can be utilized to rank the relevancy of
the relationships to the user configured outcome.
The structure is built from three basic data structures:
HUBS
LINKS
SATELLITES
Dimensional Modeling to Data Vault
https://www.youtube.com/watch?v=--OJpdPeH80
(watch this video to understand shift from DM to Data
Vault)
The Data Vault model comprises three basic table types:
HUB
LINK
SATELLITE
Data Vault
The process of building a Data Vault in 5 easy steps.
Step 1: Establish the Business Keys, Hubs
Step 2: Establish the relationships between the Business Keys,
Links
Step 3: Establish description around the Business Keys,
Satellites.
Step 4: Add Standalone components like Calendars and
code/descriptions for decoding in Data Marts
Step 5: Tune for query optimization, add performance tables
such as Bridge tables and Point-In-Time structures
Data Vault
HUB :-
Data vault hubs contain a set of unique business keys that
normally do not change over time and, therefore, are stable
data entities to store in hubs.
Hubs hold a surrogate key for each hub data entry and
metadata labeling the source of the business key.
The hub is the core backbone of the data vault.
LINKS :-
Data vault links are associations between business keys.
These links are essentially many-to-many joins, with
additional metadata to enhance the particular link.
Links are often used to deal with changes in data
granularity, reducing the impact of adding a new
business key to a linked hub.
Data Vault
SATELLITES :-
Data vault satellites hold the chronological and descriptive
characteristics for a specific section of business data.
The hubs and links form the structure of the model but have
no chronological characteristics and hold no descriptive
characteristics.
Satellites consist of characteristics and metadata linking them
to their specific hub.
Metadata labeling the origin of the association and
characteristics, along with a time line with start and end dates
for the characteristics, is put in safekeeping, for future use from
the data section.
Each satellite holds an entire chronological history of the data
entities within the specific satellite.
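To make the three structures concrete, here is a minimal, illustrative sketch of a hub/link/satellite layout using Python's built-in sqlite3 module; the table and column names are hypothetical and do not follow any formal Data Vault standard.

```python
# A toy hub/link/satellite layout for a customer-buys-product relationship.
import sqlite3

db = sqlite3.connect(":memory:")

# HUB: unique business keys plus a surrogate key and source metadata.
db.execute("""CREATE TABLE hub_customer (
    customer_hkey INTEGER PRIMARY KEY,
    customer_business_key TEXT UNIQUE,
    record_source TEXT, load_date TEXT)""")

db.execute("""CREATE TABLE hub_product (
    product_hkey INTEGER PRIMARY KEY,
    product_business_key TEXT UNIQUE,
    record_source TEXT, load_date TEXT)""")

# LINK: a many-to-many association between business keys.
db.execute("""CREATE TABLE link_purchase (
    purchase_hkey INTEGER PRIMARY KEY,
    customer_hkey INTEGER REFERENCES hub_customer(customer_hkey),
    product_hkey INTEGER REFERENCES hub_product(product_hkey),
    record_source TEXT, load_date TEXT)""")

# SATELLITE: descriptive attributes with start/end dates for history tracking.
db.execute("""CREATE TABLE sat_customer_details (
    customer_hkey INTEGER REFERENCES hub_customer(customer_hkey),
    name TEXT, city TEXT,
    start_date TEXT, end_date TEXT, record_source TEXT)""")
```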
Data Warehouse Bus Matrix
The Enterprise Bus Matrix is a data warehouse planning
tool and model created by Ralph Kimball and used by
numerous people worldwide over the last 40+ years.
The bus matrix and architecture builds upon the concept
of conformed dimensions that are interlinked by facts.
It is a component of, or is considered part of, the data
warehouse architecture.
It is a visual picture of the business processes and conformed
dimensions.
A data warehouse is a consolidated, organized and
structured repository for storing data.
R
R is a programming language and software environment
for statistical computing and graphics.
The R language is widely used by data scientists,
statisticians, data miners, and data engineers for
developing statistical software and performing data
analysis.
The capabilities of R are extended through user-created
packages using specialized statistical techniques and
graphical procedures.
A core set of packages is contained within the core
installation of R, with additional packages accessible from
the Comprehensive R Archive Network (CRAN).
Knowledge of the following packages is a must:
sqldf (data frames using SQL): This package reads a file into R
while filtering data with an SQL statement. Only the filtered part is
processed by R, so files larger than those R can natively import
can be used as data sources.
forecast (forecasting of time series): This package provides
forecasting functions for time series and linear models.
dplyr (data aggregation): Tools for splitting, applying, and combining
data within R.
stringr (string manipulation): Simple, consistent wrappers for
common string operations.
randomForest (random forest predictive models): Leo Breiman
and Adele Cutler’s random forests for classification and
regression
ggplot2 (data visualization): Creates elegant data
visualizations, using the grammar of graphics. This is a
super-visualization capability.
reshape2 (data restructuring): Flexibly restructures and
aggregates data, using just two functions: melt and dcast
(or acast).
Scala
Scala stands for Scalable language.
Scala is a general-purpose, high-level, multi-paradigm
programming language.
It is a pure object-oriented programming language that also
supports the functional programming approach.
Scala programs compile to bytecode and run on
the JVM (Java Virtual Machine).
It also provides JavaScript runtimes.
Scala is highly influenced by Java and other
programming languages such as Lisp, Haskell, and Pizza.
Scala has many reasons for being popular among programmers. A few of the reasons
are:
Easy to Start: Scala is a high-level language, so it is close to other popular
programming languages like Java, C, and C++. This makes Scala easy to learn for
anyone, and easier still for Java programmers.
Contains the best features: Scala combines features of different languages like C,
C++, and Java, which makes it more useful, scalable, and productive.
Close integration with Java: The Scala compiler is designed so that it can interpret
Java classes and can utilize Java frameworks, libraries, and tools. After compilation,
Scala programs run on the JVM.
Web-Based & Desktop Application Development: For web applications, Scala
provides support by compiling to JavaScript. Similarly, for desktop applications,
it compiles to JVM bytecode.
Used by Big Companies: Popular companies such as Apple, Twitter, Walmart, and
Google have moved much of their code to Scala from other languages, because it is
highly scalable and well suited to backend operations.
Python
Python is a popular programming language. It was created by Guido
van Rossum, and released in 1991.
It is used for:
web development (server-side),
software development,
mathematics,
system scripting.
Python can be used on a server to create web applications.
Python can be used alongside software to create workflows.
Python can connect to database systems. It can also read and modify
files.
Python can be used to handle big data and perform complex
mathematics.
Python can be used for rapid prototyping, or for production-ready
software development.
Python works on different platforms (Windows, Mac,
Linux, Raspberry Pi, etc).
Python has a simple syntax similar to the English language.
Python has syntax that allows developers to write
programs with fewer lines than some other programming
languages.
Python can be used in a procedural way, an object-oriented
way, or a functional way, as the short sketch below illustrates.
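As a small illustration of those three styles, here is a hedged sketch computing the same total in procedural, object-oriented, and functional Python; the function and class names are hypothetical.

```python
# The same task (total price with tax) in three Python styles.
from functools import reduce

# Procedural style
def total_price(prices, tax):
    total = 0.0
    for p in prices:
        total += p
    return total * (1 + tax)

# Object-oriented style
class Basket:
    def __init__(self, prices):
        self.prices = prices

    def total(self, tax):
        return sum(self.prices) * (1 + tax)

# Functional style
total_functional = lambda prices, tax: reduce(lambda a, b: a + b, prices, 0.0) * (1 + tax)

prices = [10.0, 20.0, 5.5]
print(total_price(prices, 0.18))
print(Basket(prices).total(0.18))
print(total_functional(prices, 0.18))
```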
MQTT
MQTT stands for Message Queuing Telemetry Transport.
It is designed as a lightweight messaging protocol that
uses publish/subscribe operations to exchange data
between clients and the server.
Its small size, low power usage, minimized data packets,
and ease of implementation make the protocol ideal for
the “machine-to-machine” or “Internet of Things” world.
MQTT-enabled devices include handheld scanners,
advertising boards, footfall counters, and other machines.
Mosquitto is an open source message broker that
implements the MQTT protocol.
MQTT server is called a broker and the clients are simply
the connected devices.
When a device (a client) wants to send data to the
broker, we call this operation a “publish”.
When a device (a client) wants to receive data from the
broker, we call this operation a “subscribe”.
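Here is a minimal publish/subscribe sketch, assuming the Eclipse paho-mqtt 1.x Python client and a local Mosquitto broker on the default port 1883; the topic name and payload are hypothetical.

```python
# Publish and subscribe against a local MQTT broker (e.g. Mosquitto).
import time
import paho.mqtt.client as mqtt

def on_message(client, userdata, message):
    # Called by the client loop whenever a subscribed message arrives.
    print(message.topic, message.payload.decode())

client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 1883)

client.subscribe("sensors/footfall")      # subscribe: receive data from the broker
client.publish("sensors/footfall", "42")  # publish: send data to the broker

client.loop_start()                       # process network traffic in a background thread
time.sleep(2)
client.loop_stop()
client.disconnect()
```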
Data Science Processing Tools
Today’s world is flooded with data from different sources.
So companies are trying to find the best tools to manage
this data and derive profit from it.
There are many data processing tools and software.
Hadoop
Spark
Mesos
Akka
Cassandra
Kafka
One of the biggest challenges with respect to Big Data is analyzing
the data.
There are multiple solutions available to do this. The most popular
one is Apache Hadoop.
Apache Hadoop is an open-source framework written in Java that
allows us to store and process Big Data in a distributed
environment, across various clusters of computers using simple
programming constructs.
Hadoop uses an algorithm called MapReduce, which divides the
task into small parts and assigns them to a set of computers.
Hadoop also has its own file system, Hadoop Distributed File
System (HDFS), which is based on Google File System (GFS).
HDFS is designed to run on low-cost hardware.
Hadoop data architecture/ecosystem
Hadoop core components
Complementary/other components
Core Hadoop Components
Hadoop Common: It is a set of
common utilities and libraries that
support the other Hadoop modules. It
makes sure that hardware failures
are managed by the Hadoop cluster
automatically.
HDFS: The Hadoop Distributed File
System stores data in the form of
small blocks and distributes
them across the cluster. Each block is
replicated multiple times to ensure
data availability.
Hadoop YARN: It allocates
resources which in turn allow
different users to execute various
applications without worrying about
the increased workloads.
Hadoop MapReduce: It executes
tasks in a parallel fashion by
distributing the data as small blocks.
What is MapReduce?
MapReduce is a processing technique built on divide and conquer algorithm.
It is made of two different tasks - Map and Reduce.
Map breaks the input into intermediate key/value tuples to perform a job;
Reduce collects and combines the output from the Map task to produce the result.
MapReduce is the processing engine of Apache Hadoop and was directly
derived from Google MapReduce.
MapReduce applications are typically written in Java; a small word-count sketch of the two phases follows at the end of this section.
It conveniently computes huge amounts of data by the applications
of mapping and reducing steps in order to come up with the solution for the
required problem.
Enables parallel processing required to perform Big Data jobs
Applicable to a wide variety of business data processing applications
A cost-effective solution for centralized processing frameworks
Can be integrated with SQL to facilitate parallel processing capability
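To illustrate the two phases, here is a minimal word-count sketch of map and reduce in plain Python (Hadoop would run the same phases in parallel, in Java, across a cluster); the sample documents are hypothetical.

```python
# A toy word-count showing the Map and Reduce phases on a single machine.
from collections import defaultdict

documents = ["big data is big", "data science uses big data"]

# Map: break each element into key/value tuples.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle/Reduce: combine the values for each key into a smaller set of tuples.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))  # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}
```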
Complementary/Other Hadoop Components
Following are some of the companies that have
implemented this open-source infrastructure,
Hadoop:
Social Networking Websites: Facebook, Twitter, LinkedIn,
etc.
Online Portals: Yahoo, AOL, etc.
E-commerce: eBay, Alibaba, etc.
IT Developer: Cloudspace
Drawbacks of Hadoop
Low Processing Speed: In Hadoop, the MapReduce algorithm, which is a parallel and distributed
algorithm, processes really large datasets. These are the tasks that need to be performed here:
Map: Map takes some amount of data as input and converts it into another set of data, which again is
divided into key/value pairs.
Reduce: The output of the Map task is fed into Reduce as input. In the Reduce task, as the name
suggests, those key/value pairs are combined into a smaller set of tuples. The Reduce task is always
done after Mapping.
Batch Processing: Hadoop deploys batch processing, which is collecting data and then processing it in
bulk later. Although batch processing is efficient for processing high volumes of data, it does not process
streamed data. Because of this, the performance is lower.
No Data Pipelining: Hadoop does not support data pipelining (i.e., a sequence of stages where the
previous stage’s output is the next stage’s input).
Not Easy to Use: MapReduce developers need to write their own code for each and every operation,
which makes it really difficult to work with. And also, MapReduce has no interactive mode.
Latency: In Hadoop, the MapReduce framework is slower, since it supports different formats, structures,
and huge volumes of data.
Lengthy Line of Code: Since Hadoop is written in Java, the code is lengthy. And, this takes more time to
execute the program.
Spark
Apache Spark is an open source cluster computing framework.
Originally developed at the AMP Lab of the University of
California, Berkeley, the Spark code base was donated to the
Apache Software Foundation, which now maintains it as an
open source project.
Spark offers an interface for programming distributed clusters
with implicit data parallelism and fault-tolerance.
Spark can be deployed in numerous ways like in Machine
Learning, streaming data, and graph processing.
Spark supports programming languages like Python, Scala, Java,
and R.
Spark core is the base of the whole project.
Spark uses a specialized fundamental data structure known as the
RDD (Resilient Distributed Dataset), which is a logical collection
of data partitioned across machines.
Components of Spark
Spark Core
Spark Core is the basic building block of Spark, which
includes all components for job scheduling, performing
various memory operations, fault tolerance, and more.
Spark Core is also home to the RDD API. Resilient Distributed
Datasets (RDDs) are a distributed memory abstraction that lets
programmers perform in-memory computations on large clusters
in a fault-tolerant manner.
Spark Core provides APIs for building and manipulating data
in RDDs, as in the sketch below.
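As a small illustration of working with RDDs, here is a minimal PySpark sketch; it assumes a local Spark installation and uses hypothetical data.

```python
# A minimal RDD example: transformation (lazy) followed by an action.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

# An RDD is a partitioned, fault-tolerant collection processed in memory.
rdd = sc.parallelize([1, 2, 3, 4, 5])
squares = rdd.map(lambda x: x * x)          # transformation (lazy)
print(squares.reduce(lambda a, b: a + b))   # action triggers the computation: 55

sc.stop()
```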
Spark SQL
Spark SQL is a component on top of the Spark Core that presents a
data abstraction called DataFrames.
Spark SQL makes accessible a domain-specific language (DSL) to
manipulate data frames. This feature of Spark enables ease of transition
from your traditional SQL environments into the Spark environment.
Apache Spark works with the unstructured data using its ‘go to’ tool,
Spark SQL.
Spark SQL allows querying data via SQL, as well as via Apache Hive’s
form of SQL called Hive Query Language (HQL).
It also supports data from various sources, such as Hive tables, log files,
JSON, etc.
Spark SQL allows programmers to combine SQL queries
with programmable changes or manipulations supported by RDD
in Python, Java, Scala, and R.
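Here is a minimal sketch showing the same query written with the DataFrame DSL and with Spark SQL, using PySpark; the data and application name are hypothetical.

```python
# The same filter expressed with the DataFrame DSL and with SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)], ["name", "age"]
)

# DSL style
df.filter(df.age > 30).select("name").show()

# SQL style on the same DataFrame
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```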
Spark Streaming
Spark Streaming leverages Spark Core’s fast scheduling capability to perform
streaming analytics.
Spark Streaming has built-in support to consume from Kafka, Flume, Twitter,
ZeroMQ, Kinesis, and TCP/IP sockets.
The process of streaming is the primary technique for importing data from the
data source to the data lake.
Streaming is becoming the leading technique to load from multiple data sources.
There are connectors available for many data sources.
Running on top of Spark, Spark Streaming enables powerful interactive and
analytical applications across both streaming and historical data, while inheriting
Spark’s ease of use and fault tolerance characteristics.
Spark Streaming processes live streams of data.
Examples of this data include log files, messages containing status updates posted
by users, etc.
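Here is a minimal sketch of the older DStream API in PySpark, assuming a text source on localhost:9999 (for example, `nc -lk 9999`); newer applications would typically use Structured Streaming instead, and the host, port, and batch interval are hypothetical.

```python
# Micro-batch word counts over a socket text stream with the DStream API.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-sketch")
ssc = StreamingContext(sc, batchDuration=5)      # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                   # print each batch's word counts

ssc.start()
ssc.awaitTermination()
```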
MLlib Machine Learning Library
Spark MLlib is a distributed machine learning framework used on top of
the Spark Core by means of the distributed memory-based Spark
architecture.
Common machine learning and statistical algorithms have been
implemented and are shipped with MLlib, which simplifies large-scale
machine learning pipelines, including
Dimensionality reduction techniques, such as singular value
decomposition (SVD) and principal component analysis (PCA)
Summary statistics, correlations, stratified sampling, hypothesis testing,
and random data generation
Collaborative filtering techniques, including alternating least squares (ALS)
Classification and regression: support vector machines, logistic regression,
linear regression, decision trees, and naïve Bayes classification
Cluster analysis methods, including k-means and latent Dirichlet allocation
(LDA)
Optimization algorithms, such as stochastic gradient descent and limited-
memory BFGS (L-BFGS)
Feature extraction and transformation functions
MLlib
Machine learning has quickly emerged as a critical piece in
mining Big Data for actionable insights.
Built on top of Spark, MLlib is a scalable machine learning
library that delivers both high-quality algorithms (e.g., multiple
iterations to increase accuracy) and blazing speed (up to 100x
faster than MapReduce).
The library is usable in Java, Scala, and Python as part of Spark
applications, so that you can include it in complete workflows.
It provides various types of ML algorithms including regression,
clustering, and classification, which can perform various
operations on data to get meaningful insights out of it.
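As a small illustration, here is a hedged sketch of a classification pipeline using the PySpark ML API; the training data, feature names, and application name are hypothetical.

```python
# Train a logistic regression classifier on a tiny, made-up dataset.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Hypothetical training data: a label plus two numeric features.
data = spark.createDataFrame(
    [(0.0, 1.0, 0.5), (1.0, 3.0, 2.5), (0.0, 0.5, 1.0), (1.0, 2.5, 3.0)],
    ["label", "f1", "f2"],
)

# Assemble the feature columns into a single vector column, then fit the model.
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = features.transform(data)
model = LogisticRegression(maxIter=10).fit(train)

model.transform(train).select("label", "prediction").show()
spark.stop()
```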
GraphX
GraphX is a graph computation engine built on top of Spark
that enables users to interactively build, transform and reason
about graph structured data at scale.
It comes complete with a library of common algorithms.
GraphX provides outstanding speed and capacity for
running massively parallel and machine-learning
algorithms.
Use Cases of Apache Spark in Real Life
Many companies use Apache Spark to improve their business insights.
These companies gather terabytes of data from users and use it to enhance consumer services.
Some of the Apache Spark use cases are as follows:
E-commerce: Many e-commerce giants use Apache Spark to improve their consumer experience.
eBay: eBay deploys Apache Spark to provide discounts or offers to its customers based on their
earlier purchases.
Alibaba: Alibaba runs the largest Spark jobs in the world. Some of these jobs analyze big data,
while the rest perform extraction on image data. These components are displayed on a large
graph, and Spark is used for deriving results.
Healthcare: Apache Spark is being deployed by many healthcare companies to provide their
customers with better services. One such company which uses Spark is MyFitnessPal, which helps
people achieve a healthier lifestyle through diet and exercises.
Media and Entertainment: Some of the video streaming websites use Apache Spark, along with
MongoDB, to show relevant ads to their users based on their previous activity on that website.
For example, Netflix, one of the major players in the video streaming industry, uses Apache
Spark to recommend shows to its users based on the previous shows they have watched.
How Spark Is Better than Hadoop?
In-memory Processing: In-memory processing is faster when compared to
Hadoop, as there is no time spent in moving data/processes in and out of the disk.
Spark is 100 times faster than MapReduce as everything is done here in memory.
Stream Processing: Apache Spark supports stream processing, which involves
continuous input and output of data. Stream processing is also called real-time
processing.
Less Latency: Apache Spark is relatively faster than Hadoop, since it caches most
of the input data in memory by the Resilient Distributed Dataset (RDD). RDD
manages distributed processing of data and the transformation of that data. This is
where Spark does most of the operations such as transformation and managing the
data. Each dataset in an RDD is partitioned into logical portions, which can then be
computed on different nodes of a cluster.
Lazy Evaluation: Apache Spark starts evaluating only when it is absolutely needed.
This plays an important role in contributing to its speed.
Fewer Lines of Code: Although Spark is written in both Scala and Java, the core
implementation is in Scala, so the number of lines of code is relatively smaller in Spark
when compared to Hadoop.
Mesos
Apache Mesos is an open source cluster manager that was developed at the University of
California, Berkeley.
Apache Mesos handles workloads in a distributed environment through dynamic
resource sharing and isolation in a fine-grained manner, improving cluster utilization.
It sits between the application layer and the operating system and makes it easier to deploy
and manage applications in large-scale clustered environments more efficiently.
Mesos brings together the existing resources of the machines/nodes in a cluster into a single
pool that a variety of workloads can utilize.
Also known as node abstraction, this removes the need to allocate specific machines for
different workloads.
Mesos decides how many resources to offer each framework, while frameworks decide which
resources to accept and which computations to run on them.
Mesos isolates the processes running in a cluster, such as memory, CPU, file system, rack
locality and I/O, to keep them from interfering with each other. Such isolation allows Mesos to
create a single, large pool of resources to offer workloads.
The idea is to deploy multiple distributed systems to a shared pool of nodes in order to
increase resource utilization.
A lot of modern workloads and frameworks can run on Mesos, including Hadoop,
Memcached, Ruby on Rails, Storm, JBoss Data Grid, MPI, Spark and Node.js, as well as various
web servers, databases and application servers.
Companies such as Twitter, Airbnb and Xogito utilize Apache Mesos.
Mesos Architecture
Mesos consists of a master process that manages slave
daemons running on each cluster node, and frameworks
that run tasks on these slaves.
The master implements fine-grained sharing across
frameworks using resource offers.
Each resource offer is a list of free resources on multiple
slaves.
The master decides how many resources to offer to each
framework according to an organizational policy, such as
fair sharing or priority.
Framework scheduling in Mesos
Each framework running on Mesos consists of two
components:
a scheduler that registers with the master to be offered
resources,
an executor process that is launched on slave nodes to run
the framework's tasks.
While the master determines how many resources to
offer to each framework, the frameworks' schedulers
select which of the offered resources to use.
When a framework accepts offered resources, it passes
Mesos a description of the tasks it wants to launch on
them.
Slave 1 reports to the master that it has
4 CPUs and 4 GB of memory free. The
master then invokes the allocation
module, which tells it that framework 1
should be offered all available resources.
The master sends a resource offer
describing these resources to framework
1.
The framework's scheduler replies to the
master with information about two tasks
to run on the slave: 2 CPUs and 1 GB
RAM for the first task, and 1 CPU and 2 GB
RAM for the second task.
The master sends the tasks to the slave,
which allocates appropriate resources to
the framework's executor, which in turn
launches the two tasks (depicted with
dotted borders).
Because 1 CPU and 1 GB of RAM are still
free, the allocation module may now offer
them to framework 2.
Akka
Writing concurrent programs is hard.
Having to deal with threads, locks, race conditions, and so on is highly error-
prone and can lead to code that is difficult to read, test, and maintain.
Akka is an open-source library, or toolkit.
It is used to create concurrent, distributed, and fault-tolerant
applications.
You can integrate this library into any JVM (Java Virtual Machine)-supported
language.
Akka is written in Scala.
It implements the Actor-Based Model.
The Actor Model provides a higher level of abstraction for writing concurrent
and distributed applications.
It spares the developer from dealing with explicit locking and thread management.
Akka makes it easier to write correct concurrent and parallel applications.
Akka
An actor is an entity that communicates with other actors by message
passing. An actor has its own state and behavior.
An actor is essentially nothing more than an object that receives messages
and takes actions to handle them.
It is decoupled from the source of the message and its only responsibility
is to properly recognize the type of message it has received and take action
accordingly.
Upon receiving a message, an actor may take one or more of the following
actions:
Execute some operations itself (such as performing calculations, persisting data,
calling an external web service, and so on)
Forward the message, or a derived message, to another actor.
Instantiate a new actor and forward the message to it.
All the complexity of creating and scheduling threads, receiving and
dispatching messages, and handling race conditions and synchronization, is
relegated to the framework to handle transparently.
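To illustrate the actor idea only (this is not Akka's API), here is a minimal Python sketch of an actor with a mailbox processed by a single thread; the class names and messages are hypothetical.

```python
# A minimal, illustrative actor: one thread, one mailbox, messages handled
# one at a time, so the actor's state needs no explicit locks.
import queue
import threading
import time

class Actor:
    def __init__(self):
        self.mailbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, message):
        # Other actors communicate only by putting messages in the mailbox.
        self.mailbox.put(message)

    def _run(self):
        while True:
            message = self.mailbox.get()
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError

class Printer(Actor):
    def receive(self, message):
        print("got:", message)

if __name__ == "__main__":
    printer = Printer()
    printer.send("hello")
    printer.send("world")
    time.sleep(0.1)   # give the daemon thread time to drain the mailbox
```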
Akka
Features of AKKA :-
Event-driven. Using Actors, one can write code that handles
requests asynchronously and employs non-blocking operations
exclusively.
Scalable. In Akka, adding nodes without having to modify the
code is possible, thanks both to message passing and location
transparency.
Resilient. Any application will encounter errors and fail at
some point in time. Akka provides “supervision” (fault
tolerance) strategies to facilitate a self-healing system.
Responsive. Many of today’s high performance and rapid
response applications need to give quick feedback to the user
and therefore need to react to events in an extremely timely
manner. Akka’s non-blocking, message-based strategy helps
achieve this.
Akka
Akka provides:
Multi-threaded behavior without the use of low-level concurrency
constructs like atomics or locks.
Transparent remote communication between systems and their
components — relieving you from writing and maintaining difficult
networking code.
Reference link :-
https://doc.akka.io/docs/akka/current/guide/introduction.html
https://www.javatpoint.com/akka-tutorial
https://www.toptal.com/scala/concurrency-and-fault-tolerance-
made-easy-an-intro-to-akka
Cassandra
Apache Cassandra is an open source, distributed and
decentralized/distributed storage system (database), designed for handling a
high volume of structured data across commodity servers.
Data is placed on different machines with more than one replication factor
that provides high availability and no single point of failure.
Cassandra was first developed at Facebook for inbox search.
It is a type of NoSQL database.
NoSQL databases are called "Not Only SQL" or "Non-relational" databases.
NoSQL databases store and retrieve data other than the tabular relations used in
relational databases.
These databases are schema-free, support easy replication, have simple APIs,
are eventually consistent, and can handle huge amounts of data.
The primary objective of a NoSQL database is to have
simplicity of design,
horizontal scaling, and
finer control over availability.
NoSQL databases include MongoDB, HBase, and Cassandra.
Apache HBase :-
HBase is an open source, non-relational, distributed database
modeled after Google’s BigTable and is written in Java.
It is developed as a part of Apache Hadoop project and runs
on top of HDFS, providing BigTable-like capabilities for
Hadoop.
MongoDB :-
MongoDB is a cross-platform document-oriented database
system that avoids using the traditional table-based relational
database structure in favor of JSON-like documents with
dynamic schemas making the integration of data in certain
types of applications easier and faster.
Features Of Cassandra
Elastic scalability − Cassandra is highly scalable; it allows you to add more hardware
to accommodate more customers and more data as per requirement.
Always on architecture − Cassandra has no single point of failure and it is
continuously available for business-critical applications that cannot afford a failure.
Fast linear-scale performance − Cassandra is linearly scalable, i.e., it increases
your throughput as you increase the number of nodes in the cluster. Therefore it
maintains a quick response time.
Flexible data storage − Cassandra accommodates all possible data formats
including: structured, semi-structured, and unstructured. It can dynamically
accommodate changes to your data structures according to your need.
Easy data distribution − Cassandra provides the flexibility to distribute data
where you need by replicating data across multiple data centers.
Transaction support − Cassandra supports properties like Atomicity,
Consistency, Isolation, and Durability (ACID).
Fast writes − Cassandra was designed to run on cheap commodity hardware. It
performs blazingly fast writes and can store hundreds of terabytes of data, without
sacrificing the read efficiency.
Cassandra Query Language: Cassandra provides a query language similar
to SQL, called CQL; a short sketch follows.
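As a small illustration of CQL's SQL-like feel, here is a minimal sketch using the DataStax cassandra-driver package; the keyspace, table, and contact point are hypothetical and assume a single local Cassandra node.

```python
# Create a keyspace and table, then insert and read rows with CQL.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # connect to a local node
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("demo")
session.execute("""
    CREATE TABLE IF NOT EXISTS users (user_id int PRIMARY KEY, name text)
""")

# CQL statements look very much like SQL.
session.execute("INSERT INTO users (user_id, name) VALUES (%s, %s)", (1, "Asha"))
for row in session.execute("SELECT user_id, name FROM users"):
    print(row.user_id, row.name)

cluster.shutdown()
```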
Cassandra Use Cases/Application
Messaging :- Cassandra is a great database for companies
that provide mobile phone and messaging services. These
companies have a huge amount of data, so Cassandra is a good
fit for them.
Internet of things Application:- Cassandra is a great
database for the applications where data is coming at very high
speed from different devices or sensors.
Product Catalogs and retail apps:- Cassandra is used by
many retailers for durable shopping cart protection and fast
product catalog input and output.
Social Media Analytics and recommendation engine :-
Cassandra is a great database for many online companies and
social media providers for analysis and recommendation to
their customers.
Kafka
This is a high-scale messaging backbone that enables
communication between data processing entities.
Apache Kafka is a distributed streaming platform.
The Apache Kafka streaming platform, consisting of
Kafka Core, Kafka Streams, and Kafka Connect, is
the foundation of the Confluent Platform.
Kafka components empower the capture, transfer,
processing, and storage of data streams in a distributed,
fault-tolerant manner throughout an organization in real
time.
Kafka
A streaming platform has three key capabilities:
Publish and subscribe to streams of records, similar to a
message queue or enterprise messaging system.
Store streams of records in a fault-tolerant durable way.
Process streams of records as they occur.
Kafka is generally used for two broad classes of
applications:
Building real-time streaming data pipelines that reliably get data
between systems or applications.
Building real-time streaming applications that transform or
react to the streams of data.
Kafka
Kafka Core
At the core of the Confluent Platform is Apache Kafka. Confluent
extends that core to make configuring, deploying, and managing Kafka less
complex.
Kafka Streams
Kafka Streams is an open source solution that you can integrate into
your application to build and execute powerful stream-processing functions.
Kafka Connect
This ensures Confluent-tested and secure connectors for numerous
standard data systems.
Connectors make it quick and stress-free to start setting up consistent data
pipelines.
These connectors are completely integrated with the platform, via the schema
registry.
Kafka Connect enables the data processing capabilities that accomplish the
movement of data into the core of the data solution from the edge of the
business ecosystem.
Kafka
Kafka is used to build real-time streaming data pipelines and
real-time streaming applications.
A data pipeline reliably processes and moves data from one
system to another, and a streaming application is an application
that consumes streams of data.
For example, if you want to create a data pipeline that takes in
user activity data to track how people use your website in
real-time, Kafka would be used to ingest and store streaming
data while serving reads for the applications powering the data
pipeline.
Kafka is also often used as a message broker solution, which is
a platform that processes and mediates communication
between two applications.
Kafka Core Components
Kafka has four core APIs:
The Producer API allows an application to publish a stream
of records to one or more Kafka topics.
The Consumer API allows an application to subscribe to
one or more topics and process the stream of records
produced to them.
The Streams API allows an application to act as a stream
processor, consuming an input stream from one or more topics
and producing an output stream to one or more output topics,
effectively transforming the input streams to output streams.
The Connector API allows building and running reusable
producers or consumers that connect Kafka topics to existing
applications or data systems. For example, a connector to a
relational database might capture every change to a table.
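To illustrate the Producer and Consumer APIs, here is a minimal sketch using the third-party kafka-python package; the broker address, topic name, and payload are hypothetical.

```python
# Publish records to a topic and then consume them.
from kafka import KafkaProducer, KafkaConsumer

# Producer API: publish a stream of records to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", key=b"user-42", value=b'{"page": "/home"}')
producer.flush()

# Consumer API: subscribe to the topic and process the records.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,   # stop iterating if no new messages arrive
)
for record in consumer:
    print(record.key, record.value)
```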
Kafka vs Confluent: What are the
differences?
Kafka: Distributed, fault tolerant, high throughput pub-sub messaging system. Kafka
is a distributed, partitioned, replicated commit log service. It provides the
functionality of a messaging system, but with a unique design; Confluent: We
make a stream data platform to help companies harness their high volume real-time
data streams. It is a data streaming platform based on Apache Kafka: a full-scale
streaming platform, capable of not only publish-and-subscribe, but also the
storage and processing of data within the stream.
Kafka belongs to the "Message Queue" category of the tech stack, while Confluent
is primarily classified under "Stream Processing".
Some of the features offered by Kafka are:
• Written at LinkedIn in Scala
• Used by LinkedIn to offload processing of all page and other views
• Defaults to using persistence, uses OS disk cache for hot data (has higher
throughput than any of the above with persistence enabled)
On the other hand, Confluent provides the following key features:
• Reliable
• High-performance stream data platform
• Manage and organize data from different sources.
Elastic Search
Elasticsearch is an open-source, broadly distributable, readily
scalable, enterprise-grade search engine. Accessible through an
extensive and elaborate API, Elasticsearch can power extremely
fast searches that support your data discovery applications.
Elasticsearch is a database that stores, retrieves, and manages document-
oriented and semi-structured data.
Elasticsearch is built on Apache Lucene and was first released in 2010 by
Elasticsearch N.V. (now known as Elastic). Known for its simple REST APIs,
distributed nature, speed, and scalability, Elasticsearch is the central
component of the Elastic Stack, a set of free and open tools for data
ingestion, enrichment, storage, analysis, and visualization.
Elasticsearch (ES) is a NoSQL distributed database.
Elasticsearch relies on flexible data models to build and update visitor
profiles to meet the demanding workloads and low latency required for
real-time engagement.
When you use Elasticsearch, you store data in JSON document form. Then,
you query them for retrieval.
Real-world projects require search on different fields by applying some
conditions, different weights, recent documents, values of some predefined
fields, and so on.
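As a small illustration of storing and searching JSON documents, here is a minimal sketch using the official elasticsearch Python client (8.x-style arguments); the index name, document, and local URL are hypothetical.

```python
# Index a JSON document, then run a full-text query against a field.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Store a JSON document (no table definition is required up front).
es.index(index="articles", id=1, document={
    "title": "Data lakes and schema-on-read",
    "tags": ["data-science", "storage"],
})
es.indices.refresh(index="articles")

# Full-text query with a condition on a field.
hits = es.search(index="articles", query={"match": {"title": "schema-on-read"}})
for hit in hits["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```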