
Data Science Tutorial for Beginners

Data science has become the most in-demand job of the 21st century. Every organization is
looking for candidates with knowledge of data science. This tutorial gives an introduction to
data science, covering data science job roles, tools for data science, components of data
science, applications, and more.

So let's start,

What is Data Science?

Data science is the in-depth study of massive amounts of data. It involves extracting meaningful
insights from raw, structured, and unstructured data, which is processed using the scientific
method, different technologies, and algorithms.

It is a multidisciplinary field that uses tools and techniques to manipulate data so that you can
find something new and meaningful.

Data science uses powerful hardware, programming systems, and efficient algorithms to solve
data-related problems. It is the future of artificial intelligence.

In short, we can say that data science is all about:

o Asking the correct questions and analyzing the raw data.
o Modeling the data using various complex and efficient algorithms.
o Visualizing the data to get a better perspective.
o Understanding the data to make better decisions and find the final result.
Example:

Suppose we want to travel from station A to station B by car. We need to make some decisions,
such as which route will get us to the destination fastest, which route will have no traffic
jams, and which route will be the most cost-effective. All these decision factors act as input
data, and from them we arrive at an appropriate answer. This analysis of data is called data
analysis, which is a part of data science.

Need for Data Science:

Some years ago, data was less abundant and mostly available in a structured form, which could
easily be stored in Excel sheets and processed using BI tools.

But in today's world, data has become so vast that approximately 2.5 quintillion bytes of data
are generated every day, leading to a data explosion. Researchers estimated that by 2020, 1.7 MB
of data would be created every second by every person on earth. Every company requires data to
work, grow, and improve its business.

Handling such a huge amount of data is a challenging task for every organization. To handle,
process, and analyze it, we need complex, powerful, and efficient algorithms and technology, and
that technology is data science. The following are some of the main reasons for using data
science technology:
o With the help of data science technology, we can convert massive amounts of raw, unstructured
data into meaningful insights.
o Data science technology is being adopted by all kinds of companies, from big brands to
startups. Google, Amazon, Netflix, and others, which handle huge amounts of data, use data
science algorithms to improve the customer experience.
o Data science is being used to automate transportation, for example by creating self-driving
cars, which are the future of transportation.
o Data science can help with different kinds of predictions, such as surveys, elections, flight
ticket confirmation, etc.

Data science Jobs:

As per various surveys, data scientist is becoming the most in-demand job of the 21st century due
to the increasing demand for data science. Some people have even called it "the hottest job title
of the 21st century". Data scientists are experts who use various statistical tools and machine
learning algorithms to understand and analyze data.

The average salary range for a data scientist is approximately $95,000 to $165,000 per annum,
and according to various studies, about 11.5 million jobs will be created in the field by the
year 2026.

Types of Data Science Job

If you learn data science, you will have the opportunity to take on various exciting job roles in
this domain. The main job roles are given below:

1. Data Scientist
2. Data Analyst
3. Machine learning expert
4. Data engineer
5. Data Architect
6. Data Administrator
7. Business Analyst
8. Business Intelligence Manager

Below are explanations of some of the key data science job titles.

1. Data Analyst:
A data analyst is an individual who mines huge amounts of data, models the data, and looks for
patterns, relationships, trends, and so on. At the end of the day, he or she produces
visualizations and reports that support decision making and problem solving.

Skill required: To become a data analyst, you need a good background in mathematics, business
intelligence, data mining, and basic statistics. You should also be familiar with computer
languages and tools such as MATLAB, Python, SQL, Hive, Pig, Excel, SAS, R, JS, Spark, etc.

2. Machine Learning Expert:

A machine learning expert works with the various machine learning algorithms used in data
science, such as regression, clustering, classification, decision trees, random forests, etc.

Skill required: Computer programming languages such as Python, C++, R, Java, and Hadoop. You
should also have an understanding of various algorithms, problem-solving and analytical skills,
probability, and statistics.

3. Data Engineer:

A data engineer works with massive amounts of data and is responsible for building and
maintaining the data architecture of a data science project. Data engineers also create the
dataset processes used in modeling, mining, acquisition, and verification.

Skill required: A data engineer must have in-depth knowledge of SQL, MongoDB, Cassandra, HBase,
Apache Spark, Hive, and MapReduce, along with programming knowledge of Python, C/C++, Java,
Perl, etc.

4. Data Scientist:

A data scientist is a professional who works with enormous amounts of data to come up with
compelling business insights through the deployment of various tools, techniques, methodologies,
algorithms, etc.

Skill required: To become a data scientist, one should have technical language skills such as R,
SAS, SQL, Python, Hive, Pig, Apache Spark, and MATLAB. Data scientists must also have an
understanding of statistics, mathematics, and visualization, along with communication skills.

Prerequisite for Data Science

Non-Technical Prerequisite:

o Curiosity: To learn data science, one must be curious. When you are curious and ask questions,
you can understand the business problem more easily.
o Critical thinking: Critical thinking is also required so that a data scientist can find
multiple new ways to solve a problem efficiently.
o Communication skills: Communication skills are essential for a data scientist because, after
solving a business problem, you need to communicate it to the team.

Technical Prerequisite:

o Machine learning: To understand data science, one needs to understand the concepts of machine
learning. Data science uses machine learning algorithms to solve various problems.
o Mathematical modeling: Mathematical modeling is required to make fast mathematical calculations
and predictions from the available data.
o Statistics: A basic understanding of statistics, such as mean, median, and standard deviation,
is required. It is needed to extract knowledge and obtain better results from the data (see the
short sketch after this list).
o Computer programming: For data science, knowledge of at least one programming language is
required. R, Python, and Spark are some of the programming languages commonly used for data
science.
o Databases: A deep understanding of databases, such as SQL, is essential for data science in
order to retrieve and work with data.
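
As a small illustration of the statistics prerequisite referenced above, the following is a
minimal sketch in Python with NumPy; the numbers are made up purely for illustration.

import numpy as np

marks = np.array([56, 61, 73, 73, 80, 95])   # hypothetical exam scores

print("mean:", np.mean(marks))               # average value
print("median:", np.median(marks))           # middle value when sorted
print("standard deviation:", np.std(marks))  # spread around the mean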

Difference between BI and Data Science

BI stands for business intelligence, which is also used for the analysis of business data. Below
are some differences between BI and data science:

Data source:
o Business intelligence deals with structured data, e.g., a data warehouse.
o Data science deals with structured and unstructured data, e.g., weblogs, feedback, etc.

Method:
o Business intelligence is analytical (works on historical data).
o Data science is scientific (goes deeper to find the reasons behind the data).

Skills:
o Statistics and visualization are the two skills required for business intelligence.
o Statistics, visualization, and machine learning are the skills required for data science.

Focus:
o Business intelligence focuses on both past and present data.
o Data science focuses on past and present data as well as future predictions.

Data Science Components:

The main components of Data Science are given below:

1. Statistics: Statistics is one of the most important components of data science. Statistics is
a way to collect and analyze numerical data in large amounts and find meaningful insights from
it.

2. Domain expertise: Domain expertise binds data science together. Domain expertise means
specialized knowledge or skills in a particular area. In data science, there are various areas
for which we need domain experts.

3. Data engineering: Data engineering is the part of data science that involves acquiring,
storing, retrieving, and transforming data. Data engineering also includes adding metadata (data
about data) to the data.

4. Visualization: Data visualization means representing data in a visual context so that people
can easily understand its significance. Data visualization makes it easy to take in huge amounts
of data through visuals.

5. Advanced computing: Advanced computing does the heavy lifting of data science. It involves
designing, writing, debugging, and maintaining the source code of computer programs.

6. Mathematics: Mathematics is a critical part of data science. Mathematics involves the study of
quantity, structure, space, and change. For a data scientist, good knowledge of mathematics is
essential.

7. Machine learning: Machine learning is the backbone of data science. Machine learning is all
about training a machine so that it can act like a human brain. In data science, we use various
machine learning algorithms to solve problems.

Tools for Data Science

Following are some tools required for data science:

o Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R Studio, MATLAB, Excel,
RapidMiner.
o Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift
o Data Visualization tools: R, Jupyter, Tableau, Cognos.
o Machine learning tools: Spark, Mahout, Azure ML studio.

Machine learning in Data Science

To become a data scientist, one should also be aware of machine learning and its algorithms,
as in data science, there are various machine learning algorithms which are broadly being
used. The following are the names of some machine learning algorithms used in data science:

o Regression
o Decision tree
o Clustering
o Principal component analysis
o Support vector machines
o Naive Bayes
o Artificial neural network
o Apriori

Below is a brief introduction to a few of the important algorithms.

1. Linear Regression Algorithm: Linear regression is the most popular machine learning algorithm
and is based on supervised learning. It performs regression, which is a method of modeling a
target value based on independent variables. It takes the form of a linear equation that relates
a set of inputs to the predicted output. This algorithm is mostly used in forecasting and
prediction. Since it models a linear relationship between the input and output variables, it is
called linear regression.

The equation below describes the relationship between the x and y variables:

Y = mx + c

Where, y = dependent variable
x = independent variable
m = slope
c = intercept
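
A minimal sketch of fitting this line in Python with NumPy is shown below; the data points are
invented purely for illustration.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # independent variable
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])   # dependent variable

# np.polyfit with degree 1 returns the slope (m) and intercept (c)
# that minimize the squared error.
m, c = np.polyfit(x, y, 1)
print("slope m =", round(m, 3), "intercept c =", round(c, 3))

# Predict y for a new x value using the fitted line.
x_new = 6.0
print("predicted y:", m * x_new + c)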

2. Decision Tree: The decision tree algorithm is another machine learning algorithm, belonging to
the supervised learning family. It is one of the most popular machine learning algorithms and can
be used for both classification and regression problems.

In the decision tree algorithm, we solve the problem using a tree representation in which each
node represents a feature, each branch represents a decision, and each leaf represents an
outcome.

The following is an example for a job offer problem:

In a decision tree, we start at the root of the tree and compare the value of the root attribute
with the record's attribute. Based on this comparison, we follow the corresponding branch and
move to the next node. We continue comparing values in this way until we reach a leaf node with
the predicted class value.
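
A minimal sketch of training a decision tree with scikit-learn is shown below; the toy job-offer
style dataset and feature encoding are invented purely for illustration.

from sklearn.tree import DecisionTreeClassifier

# Features: [salary_in_lakhs, commute_time_minutes] (hypothetical encoding)
X = [[10, 30], [4, 60], [12, 20], [5, 45], [9, 25], [3, 90]]
# Labels: 1 = accept the offer, 0 = decline
y = [1, 0, 1, 0, 1, 0]

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Predict for a new offer: salary of 8 lakhs with a 40-minute commute.
print(clf.predict([[8, 40]]))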

3. K-Means Clustering: K-means clustering is one of the most popular machine learning algorithms
and belongs to the unsupervised learning family. It solves the clustering problem.

If we are given a dataset of items with certain features and values, and we need to categorize
those items into groups, this type of problem can be solved using the k-means clustering
algorithm.

The k-means clustering algorithm aims to minimize an objective function, known as the squared
error function, which is given as:

J(V) = Σ_{j=1..C} Σ_{i=1..c_j} ( ||x_i − v_j|| )²

Where, J(V) = objective function
||x_i − v_j|| = Euclidean distance between data point x_i and cluster centre v_j
c_j = number of data points in the j-th cluster
C = number of clusters
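
A minimal sketch of k-means clustering with scikit-learn is shown below; the 2-D points are made
up purely for illustration.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Ask for 2 clusters; KMeans iteratively minimizes the squared-error objective J(V).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print("cluster centres:", kmeans.cluster_centers_)
print("cluster labels:", kmeans.labels_)
print("objective J(V) (inertia):", kmeans.inertia_)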

How to solve a problem in Data Science using Machine learning algorithms?

Now let's understand the most common types of problems that occur in data science and the
approach to solving them. In data science, problems are solved using algorithms, and below is a
mapping of algorithms to the kinds of questions they can answer:

Is this A or B? :

This refers to problems that have only two fixed answers, such as yes or no, 1 or 0, may or may
not. This type of problem can be solved using classification algorithms.

Is this different? :

This refers to questions where, among various patterns, we need to find the odd one out. Such
problems can be solved using anomaly detection algorithms.

How much or how many?

Problems that ask for numerical values or figures, such as what the temperature will be today,
can be solved using regression algorithms.

How is this organized?

If you have a problem that deals with organizing data, it can be solved using clustering
algorithms.

Clustering algorithms organize and group data based on features, colors, or other common
characteristics.

Data Science Lifecycle

The data science life cycle is illustrated in the diagram below.


The main phases of data science life cycle are given below:

1. Discovery: The first phase is discovery, which involves asking the right questions. When you
start any data science project, you need to determine the basic requirements, priorities, and
project budget. In this phase, we need to determine all the requirements of the project, such as
the number of people, technology, time, data, and the end goal; then we can frame the business
problem at a first hypothesis level.

2. Data preparation: Data preparation is also known as data munging. In this phase, we need to
perform the following tasks:

o Data cleaning
o Data reduction
o Data integration
o Data transformation

After performing all of the above tasks, we can easily use the data for our further processes.
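
A minimal sketch of these data preparation tasks with pandas is shown below; the small table and
column names are invented purely for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 45, 45],
    "city":   ["Delhi", "Mumbai", None, "Delhi", "Delhi"],
    "salary": [50000, 62000, 58000, 75000, 75000],
})

df = df.drop_duplicates()                          # data reduction: drop duplicate rows
df["age"] = df["age"].fillna(df["age"].median())   # data cleaning: impute missing ages
df["city"] = df["city"].fillna("Unknown")          # data cleaning: fill missing categories
df["salary_k"] = df["salary"] / 1000               # data transformation: rescale a column

print(df)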

3. Model Planning: In this phase, we determine the various methods and techniques to establish
the relationships between the input variables. We apply exploratory data analysis (EDA) using
various statistical formulas and visualization tools to understand the relations between
variables and to see what the data can tell us. Common tools used for model planning are:

o SQL Analysis Services


o R
o SAS
o Python

4. Model building: In this phase, the process of model building starts. We create datasets for
training and testing purposes, and apply different techniques, such as association,
classification, and clustering, to build the model (a short sketch of this step follows the tool
list below).
Following are some common model building tools:

o SAS Enterprise Miner


o WEKA
o SPSS Modeler
o MATLAB
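
As referenced above, the following is a minimal Python sketch of the model-building step:
splitting the data into training and test sets and fitting a simple classifier. The dataset and
model choice are illustrative assumptions, not a prescription.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))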

5. Operationalize: In this phase, we deliver the final reports of the project, along with
briefings, code, and technical documents. This phase gives you a clear overview of the complete
project's performance and other components on a small scale before full deployment.

6. Communicate results: In this phase, we check whether we have reached the goal that was set in
the initial phase. We communicate the findings and final results to the business team.

Applications of Data Science:

o Image recognition and speech recognition:
Data science is currently used for image and speech recognition. When you upload an image on
Facebook, you start getting suggestions to tag your friends. This automatic tagging suggestion
uses an image recognition algorithm, which is part of data science.
When you say something to "Ok Google", Siri, Cortana, etc., and these devices respond to your
voice, this is made possible by speech recognition algorithms.
o Gaming world:
In the gaming world, the use of machine learning algorithms is increasing day by day. EA Sports,
Sony, and Nintendo are widely using data science to enhance the user experience.
o Internet search:
When we want to search for something on the internet, we use search engines such as Google,
Yahoo, Bing, Ask, etc. All of these search engines use data science technology to improve the
search experience, and you can get results in a fraction of a second.
o Transport:
Transport industries are also using data science technology to create self-driving cars. With
self-driving cars, it will be easier to reduce the number of road accidents.
o Healthcare:
In the healthcare sector, data science provides many benefits. Data science is being used for
tumor detection, drug discovery, medical image analysis, virtual medical bots, etc.
o Recommendation systems:
Most companies, such as Amazon, Netflix, and Google Play, use data science technology to provide
a better user experience through personalized recommendations. For example, when you search for
something on Amazon and start getting suggestions for similar products, that is because of data
science technology.
o Risk detection:
Finance industries have always faced issues of fraud and risk of losses, but with the help of
data science this can be reduced. Most finance companies are looking for data scientists to
avoid risk and losses while increasing customer satisfaction.

Data Science Process


Data science is a field that involves working with huge amounts of data, developing algorithms,
working with machine learning, and more, in order to come up with business insights. It involves
working with enormous volumes of data, and several processes are needed to derive information
from the source, such as data extraction, data preparation, model planning, model building, and
many more. The image below depicts the various processes of data science.

Let's go through each process briefly.

• Discovery
To begin with, it is very important to understand the various specifications, requirements,
priorities, and budget related to the project. You must be able to ask the right questions, such
as whether you have the required resources. These resources can be in terms of people,
technology, time, and data. In this stage, you also need to frame the business problem and
formulate initial hypotheses (IH) to test.
• Data Preparation
In this stage, you need to explore, preprocess, and condition the data for modeling. You can
perform data cleaning, transformation, and visualization. This will help you spot outliers and
establish relationships between the variables. Once you have cleaned and prepared the data, it
is time to do exploratory analytics on it.
• Model Planning
Here, you determine the methods and techniques to draw the relationships between variables.
These relationships will set the base for the algorithms that you will implement in the
following stage. You apply exploratory data analytics (EDA) using various statistical formulas
and visualization tools.
• Model Building
In this stage, you create datasets for training and testing purposes. You analyze different
learning techniques, such as classification, association, and clustering, and finally implement
the best-fitting technique to build the model.
• Operationalize
In this stage, you deliver the final briefings, code, and technical reports. In addition, a pilot
project is also implemented in a real-time production environment. This will give you a clear
picture of the performance and other related constraints.
• Communicate Results
Now, it is important to evaluate the outcome against the objective. So, in the final stage, you
identify all the key findings, communicate them to the stakeholders, and determine whether the
results of the project are a success or a failure based on the criteria developed in stage 1.

A Data Scientist’s Tool Kit


Essential tools for data science

A data scientist's primary role is to apply machine learning, statistical methods, and
exploratory analysis to data to extract insights and aid decision making. Programming and the use
of computational tools are essential to this role. In fact, many people have described the field
using something along the lines of this famous quote:

A data scientist is someone who is better at statistics than any software engineer and better at
software engineering than any statistician.

If you are beginning your journey in learning data science or want to improve your existing
skills, it is essential to have a good understanding of the tools you need to perform this role
effectively.

Python for data science has gradually grown in popularity over the last ten years and is now by
far the most popular programming language for practitioners in the field. In the following
article, I am going to give an overview of the core tools used by data scientists, largely
focused on Python-based tools.
NumPy

NumPy is a powerful library for performing mathematical and scientific computations with
python. You will find that many other data science libraries require it as a dependency to run
as it is one of the fundamental scientific packages.

This tool interacts with data as an N-dimensional array object. It provides tools for
manipulating arrays, performing array operations, basic statistics and common linear algebra
calculations such as cross and dot product operations.
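
A minimal sketch of these NumPy features is shown below; the arrays are illustrative.

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])   # a 2-D (N-dimensional) array object
b = np.array([5.0, 6.0])

print(a.mean(), a.std())                 # basic statistics over the whole array
print(a + 10)                            # element-wise array operations
print(np.dot(a, b))                      # dot product (linear algebra)
print(np.cross([1, 0, 0], [0, 1, 0]))    # cross product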

Pandas

The Pandas library simplifies the manipulation and analysis of data in python. Pandas works
with two fundamental data structures. They are Series, which is a one-dimensional labelled
array, and a DataFrame, which is a two-dimensional labelled data structure. The Pandas
package has a multitude of tools for reading data from various sources, including CSV files
and relational databases.

Once data has been made available as one of these data structures, pandas provides a wide range
of very simple functions for cleaning, transforming, and analysing data. These include built-in
tools for handling missing data, simple plotting functionality, and Excel-like pivot tables.
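
A minimal sketch of the two core pandas structures is shown below; the data is illustrative, and
the CSV path in the comment is hypothetical.

import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])   # one-dimensional labelled array

df = pd.DataFrame({                                  # two-dimensional labelled data
    "product": ["pen", "book", "pen", "lamp"],
    "price":   [1.5, 12.0, 1.7, 25.0],
})

print(df.describe())                          # quick summary statistics
print(df.groupby("product")["price"].mean())  # split-apply-combine analysis
# df = pd.read_csv("sales.csv")               # reading data from a CSV file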

SciPy

SciPy is another core scientific computational Python library. It is built to interact with NumPy
arrays and depends on much of the functionality made available through NumPy. Although you need
NumPy installed to use this package, you do not need to import NumPy's functionality directly, as
SciPy makes it available automatically.

SciPy effectively builds on the mathematical functionality available in NumPy. Where NumPy
provides very fast array manipulation, SciPy works with these arrays and enables the application
of advanced mathematical and scientific computations.
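
A minimal sketch of SciPy building on NumPy arrays is shown below; the data is illustrative.

import numpy as np
from scipy import optimize, stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 4.1, 6.0, 8.3, 9.9])

# Statistical computation: a least-squares linear fit with a correlation value.
fit = stats.linregress(x, y)
print(fit.slope, fit.intercept, fit.rvalue)

# Numerical optimization: minimize a simple one-variable function.
opt = optimize.minimize_scalar(lambda z: (z - 3.0) ** 2)
print(opt.x)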

Scikit-learn

Scikit-learn is a user friendly, comprehensive and powerful library for machine learning. It
contains functions to apply most machine learning techniques to data and has a consistent user
interface for each.

This library also provides tools for data cleaning, data pre-processing, and model validation.
One of its most powerful features is the concept of machine learning pipelines. These pipelines
enable the various steps in machine learning, e.g. preprocessing, training and so on, to be
chained together into one object.
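
A minimal sketch of such a pipeline is shown below; the dataset and estimators are illustrative
choices.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),   # preprocessing step
    ("model", SVC()),              # estimator step
])

pipe.fit(X_train, y_train)                       # runs both steps in order
print("test accuracy:", pipe.score(X_test, y_test))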
Keras

Keras is a Python API which aims to provide a simple interface for working with neural networks.
Popular deep learning libraries such as TensorFlow are notorious for not being very
user-friendly. Keras sits on top of these frameworks to provide a friendlier way to interact
with them.

Keras supports both convolutional and recurrent networks, provides support for multiple
backends, and runs on both CPU and GPU.
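
A minimal sketch of defining and compiling a small network with Keras (here via the tf.keras API
bundled with TensorFlow) is shown below; the layer sizes and input shape are illustrative
assumptions.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(10,)),               # ten input features (assumed)
    layers.Dense(32, activation="relu"),     # hidden layer
    layers.Dense(1, activation="sigmoid"),   # binary output
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.summary()
# model.fit(X_train, y_train, epochs=5)      # training call, given suitable arrays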

Matplotlib

Matplotlib is one of the fundamental plotting libraries in python. Many other popular plotting
libraries depend on the matplotlib API including the pandas plotting functionality and
Seaborn.

Matplotlib is a very rich plotting library and contains functionality to create a wide range of
charts and visualisations. Additionally, it contains functions to create animated and interactive
charts.
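
A minimal sketch of creating a simple chart with matplotlib is shown below; the data is
illustrative.

import matplotlib.pyplot as plt

years = [2016, 2017, 2018, 2019, 2020]
users = [1.2, 1.8, 2.6, 3.1, 4.0]

plt.plot(years, users, marker="o", label="users (millions)")
plt.xlabel("Year")
plt.ylabel("Users (millions)")
plt.title("Example line chart")
plt.legend()
plt.show()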

Jupyter notebooks

Jupyter notebooks are an interactive python programming interface. The benefit of writing
python in a notebook environment is that it allows you to easily render visualisations, datasets
and data summaries directly within the program.

These notebooks are also ideal for sharing data science work as they can be highly annotated
by including markdown text directly in line with the code and visualisations.

Python IDE

Jupyter notebooks are a useful place to write code for data science. However, there will be
many instances when writing code into reusable modules will be needed. This will particularly
be the case if you are writing code to put a machine learning model into production.

In these instances, an IDE (Integrated Development Environment) is useful, as IDEs provide lots
of useful features such as integrated Python style guides, unit testing, and version control. I
personally use PyCharm, but there are many others available.

Github

Github is a very popular version control platform. One of the fundamental principles of data
science is that code and results should be reproducible either by yourself at a future point in
time or by others. Version control provides a mechanism to track and record changes to your
work online.

Additionally, Github enables a safe form of collaboration on a project. This is achieved by a
person cloning a branch (effectively a copy of your project), making changes locally, and then
uploading these for review before they are integrated into the project. For an introductory
guide to Github for data scientists, see my previous article here.

This article has given a brief introduction to the core toolkit for data science work. In my next
article, I am going to cover how to set up your computer for effective data science work and
will run through these tools and others in more detail.

Data Analytics and its type


Analytics is the discovery and communication of meaningful patterns in data. Especially valuable
in areas rich with recorded information, analytics relies on the simultaneous application of
statistics, computer programming, and operations research to quantify performance. Analytics
often favors data visualization to communicate insight.
Firms commonly apply analytics to business data to describe, predict, and improve business
performance. Areas within analytics include predictive analytics, enterprise decision
management, etc. Since analytics can require extensive computation (because of big data), the
algorithms and software used for analytics harness the most current methods in computer science.
In a nutshell, analytics is the scientific process of transforming data into insight for making
better decisions. The goal of data analytics is to get actionable insights resulting in smarter
decisions and better business outcomes.
It is critical to design and build a data warehouse or business intelligence (BI) architecture
that provides a flexible, multi-faceted analytical ecosystem, optimized for efficient ingestion
and analysis of large and diverse data sets.
There are four types of data analytics:
1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics
Predictive Analytics: Predictive analytics turns data into valuable, actionable information.
Predictive analytics uses data to determine the probable outcome of an event or the likelihood
of a situation occurring.
Predictive analytics encompasses a variety of statistical techniques from modeling, machine
learning, data mining, and game theory that analyze current and historical facts to make
predictions about future events. Techniques that are used for predictive analytics are:
• Linear Regression
• Time series analysis and forecasting
• Data Mining
There are three basic cornerstones of predictive analytics:
• Predictive modeling
• Decision analysis and optimization
• Transaction profiling
Descriptive Analytics: Descriptive analytics looks at data and analyzes past events for insight
into how to approach future events. It looks at past performance and understands it by mining
historical data to understand the causes of success or failure in the past. Almost all management
reporting, such as sales, marketing, operations, and finance, uses this type of analysis.
A descriptive model quantifies relationships in data in a way that is often used to classify
customers or prospects into groups. Unlike a predictive model that focuses on predicting the
behavior of a single customer, descriptive analytics identifies many different relationships
between customers and products.
Common examples of Descriptive analytics are company reports that provide historic
reviews like:
• Data Queries
• Reports
• Descriptive Statistics
• Data dashboard
Prescriptive Analytics: Prescriptive analytics automatically synthesizes big data, mathematical
science, business rules, and machine learning to make a prediction and then suggests decision
options to take advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions that
benefit from the predictions and showing the decision maker the implications of each decision
option. Prescriptive analytics not only anticipates what will happen and when it will happen,
but also why it will happen. Further, prescriptive analytics can suggest decision options on how
to take advantage of a future opportunity or mitigate a future risk, and illustrate the
implication of each decision option.
For example, prescriptive analytics can benefit healthcare strategic planning by using analytics
to leverage operational and usage data combined with data on external factors such as economic
data, population demography, etc.
Diagnostic Analytics: In this analysis, we generally use historical data over other data to
answer a question or solve a problem. We try to find dependencies and patterns in the historical
data of the particular problem.
For example, companies go for this analysis because it gives great insight into a problem, and
they also keep detailed information at their disposal; otherwise, data collection may have to be
done individually for every problem, which would be very time-consuming. Common techniques used
for diagnostic analytics are:

• Data discovery
• Data mining
• Correlations

DATA SCIENCE APPLICATIONS AND EXAMPLES


• Healthcare: Data science can identify and predict disease, and personalize healthcare
recommendations.
• Transportation: Data science can optimize shipping routes in real-time.
• Sports: Data science can accurately evaluate athletes’ performance.
• Government: Data science can prevent tax evasion and predict incarceration rates.
• E-commerce: Data science can automate digital ad placement.
• Gaming: Data science can improve online gaming experiences.
• Social media: Data science can create algorithms to pinpoint compatible partners.
What is Data Collection? A Definition

Before we define what data collection is, it's essential to ask the question, "What is data?"
The abridged answer is that data is various kinds of information formatted in a particular way.
Therefore, data collection is the process of gathering, measuring, and analyzing accurate data
from a variety of relevant sources to find answers to research problems, answer questions,
evaluate outcomes, and forecast trends and probabilities.

Our society is highly dependent on data, which underscores the importance of collecting it.
Accurate data collection is necessary to make informed business decisions, ensure quality
assurance, and keep research integrity.

During data collection, the researchers must identify the data types, the sources of data, and
what methods are being used. We will soon see that there are many different data collection
methods. There is heavy reliance on data collection in research, commercial, and government
fields.

Before an analyst begins collecting data, they must answer three questions first:

• What’s the goal or purpose of this research?

• What kinds of data are they planning on gathering?

• What methods and procedures will be used to collect, store, and process the information?

Additionally, we can break up data into qualitative and quantitative types. Qualitative data
covers descriptions such as color, size, quality, and appearance. Quantitative data,
unsurprisingly, deals with numbers, such as statistics, poll numbers, percentages, etc.

Why Do We Need Data Collection?

Before a judge makes a ruling in a court case or a general creates a plan of attack, they must
have as many relevant facts as possible. The best courses of action come from informed
decisions, and information and data are synonymous.

The concept of data collection isn’t a new one, as we’ll see later, but the world has changed.
There is far more data available today, and it exists in forms that were unheard of a century
ago. The data collection process has had to change and grow with the times, keeping pace
with technology.

Whether you’re in the world of academia, trying to conduct research, or part of the
commercial sector, thinking of how to promote a new product, you need data collection to
help you make better choices.
Now that you know what is data collection and why we need it, let's take a look at the
different methods of data collection. While the phrase “data collection” may sound all high-
tech and digital, it doesn’t necessarily entail things like computers, big data, and the internet.
Data collection could mean a telephone survey, a mail-in comment card, or even some guy
with a clipboard asking passersby some questions. But let’s see if we can sort the different
data collection methods into a semblance of organized categories.

What Are the Different Methods of Data Collection?

The following are seven primary methods of collecting data in business analytics.

• Surveys

• Transactional Tracking

• Interviews and Focus Groups

• Observation

• Online Tracking

• Forms

• Social Media Monitoring

Data collection breaks down into two methods. As a side note, many terms, such as techniques,
methods, and types, are interchangeable depending on who uses them. One source may call data
collection techniques "methods," for instance. But whatever labels we use, the general concepts
and breakdowns apply across the board, whether we're talking about marketing analysis or a
scientific research project.

The two methods are:

• Primary

As the name implies, this is original, first-hand data collected by the data researchers. This
process is the initial information gathering step, performed before anyone carries out any
further or related research. Primary data results are highly accurate provided the researcher
collects the information. However, there’s a downside, as first-hand research is potentially
time-consuming and expensive.
• Secondary

Secondary data is second-hand data collected by other parties and already having undergone
statistical analysis. This data is either information that the researcher has tasked other people
to collect or information the researcher has looked up. Simply put, it’s second-hand
information. Although it’s easier and cheaper to obtain than primary information, secondary
information raises concerns regarding accuracy and authenticity. Quantitative data makes up
a majority of secondary data.

Sources of Data Collection


Data is a collection of measurements and facts and a tool that helps an individual or a group of
individuals reach a sound conclusion by providing them with some information. It helps the
analyst understand, analyze, and interpret different socio-economic problems like unemployment,
poverty, inflation, etc. Besides understanding the issues, it also helps in determining the
reasons behind a problem in order to find possible solutions. Data not only includes theoretical
information but also some numerical facts that can support the information. The collection of
data is the first step of a statistical investigation, and data can be gathered through two
different sources, namely, primary sources and secondary sources.
Sources of Collection of Data
1. Primary Source
It is a collection of data from the source of origin. It provides the researcher with first-hand
quantitative and raw information related to the statistical study. In short, the primary sources
of data give the researcher direct access to the subject of research. For example, statistical
data, works of art, and interview transcripts.
2. Secondary Source
It is a collection of data from some institutions or agencies that have already collected the
data through primary sources. It does not provide the researcher with first-hand quantitative
and raw information related to the study. Hence, the secondary source of data collection
interprets, describes, or synthesizes the primary sources. For example, reviews, government
websites containing surveys or data, academic books, published journals, articles, etc.
Even though primary sources provide more credibility to the collected data because of the
presence of evidence, good research will require both primary and secondary sources of data
collection.
Primary and Secondary Data
1. Primary Data
The data collected by the investigator from primary sources for the first time from scratch is
known as primary data. This data is collected directly from the source of origin. It is real-time
data and is always specific to the researcher’s needs. The primary data is available in raw
form. The investigator has to spend a long period of time collecting primary data, and hence it
is also expensive. However, the accuracy and reliability of primary data are greater than those
of secondary data. Some examples of sources for the collection of primary data are
observations, surveys, experiments, personal interviews, questionnaires, etc.
2. Secondary Data
The data already in existence which has been previously collected by someone else for other
purposes is known as secondary data. It does not include any real-time data as the research
has already been done on that information. However, the cost of collecting secondary data is
less. As the data has already been collected in the past, it can be found in refined form. The
accuracy and reliability of secondary data are relatively less than the primary data. The
chances of finding the exact information or data specific to the researcher’s needs are less.
However, the time required to collect secondary data is short and hence is a quick and easy
process. Some examples of sources for the collection of secondary data are books, journals,
internal records, government records, articles, websites, government publications, etc.
Principle Difference between Primary and Secondary Data
• Difference in Objective: The primary data collected by the investigator is always for the
specific objective. Therefore, there is no need to make any adjustments for the purpose of
the study. However, the secondary data collected by the investigator has already been
collected by someone else for some other purpose. Therefore, the investigator has to
make necessary adjustments to the data to suit the main objective of the present study.
• Difference in Originality: As the primary data is collected from the beginning from the
source of origin, the data is original. However, the secondary data is already present
somewhere and hence is not original.
• Difference in Cost of Collection: The cost of collecting primary data is higher than the
cost of collecting secondary data in terms of time, effort and money. It is because the data
is being collected for the first time from the source of origin. However, the cost of
collecting secondary data is less as the data is gathered from published or unpublished
sources.
Methods of Collecting Primary Data
• Direct Personal Investigation: As the name suggests, the method of direct personal
investigation involves collecting data personally from the source of origin. In simple
words, the investigator makes direct contact with the person from whom he/she wants to
obtain information. This method can attain success only when the investigator collecting
data is efficient, diligent, tolerant and impartial. For example, direct contact with the
household women to obtain information about their daily routine and schedule.
• Indirect Oral Investigation: In this method of collecting primary data, the investigator
does not make direct contact with the person from whom he/she needs information,
instead, they collect the data orally from some other person who has the necessary
required information. For example, collecting data of employees from their superiors or
managers.
• Information from Local Sources or Correspondents: In this method, for the collection of data,
the investigator appoints correspondents or local persons at various places, who collect the
required information and furnish it to the investigator. With the help of correspondents and
local persons, the investigator can cover a wide area.
• Information through Questionnaires and Schedules: In this method of collecting
primary data, the investigator, while keeping in mind the motive of the study, prepares a
questionnaire. The investigator can collect data through the questionnaire in two ways:
Mailing Method: This method involves mailing the questionnaires to the informants for
the collection of data. The investigator attaches a letter with the questionnaire in the mail
to define the purpose of the study or research. The investigator also assures the
informants that their information would be kept secret, and then the informants note the
answers to the questionnaire and return the completed file.
Enumerator’s Method: This method involves the preparation of a questionnaire
according to the purpose of the study or research. However, in this case, the enumerator
reaches out to the informants himself with the prepared questionnaire. Enumerators are
not the investigators themselves; they are the people who help the investigator in the
collection of data.
Sources of Collecting Secondary Data
1. Published Sources
• Government Publications: The government publishes different documents consisting of different
varieties of information or data published by the Ministries and the Central and State
Governments in India as part of their routine activity. As the government publishes these
statistics, they are fairly reliable for the investigator. Examples of government publications on
statistics are the Annual Survey of Industries, the Statistical Abstract of India, etc.
• Semi-Government Publications: Different Semi-Government bodies also publish data
related to health, education, deaths and births. These kinds of data are also reliable and
used by different informants. Some examples of semi-government bodies are
Metropolitan Councils, Municipalities, etc.
• Publications of Trade Associations: Various big trade associations collect and publish
data from their research and statistical divisions of different trading activities and their
aspects. For example, data published by Sugar Mills Association regarding different sugar
mills in India.
• Journals and Papers: Different newspapers and magazines provide a variety of
statistical data in their writings, which are used by different investigators for their studies.
• International Publications: Different international organizations like IMF, UNO, ILO,
World Bank, etc., publish a variety of statistical information which are used as secondary
data.
• Publications of Research Institutions: Research institutions and universities also
publish their research activities and their findings, which are used by different
investigators as secondary data. For example National Council of Applied Economics, the
Indian Statistical Institute, etc.
2. Unpublished Sources
Another source of collecting secondary data is unpublished sources. The data in unpublished
sources is collected by different government organizations and other organizations. These
organizations usually collect data for their self-use and are not published anywhere. For
example, research work done by professors, professionals, teachers and records maintained
by business and private enterprises.
Data Collection With API — For Beginners
A simple guide to leveraging APIs to obtain data using Python

The Application Programming Interface (API) has become a core component of many of the
products and services we’ve become accustomed to using.

It is able to bolster relations between companies and clients. For companies, it is a convenient
way to promote their own business to their clients while ensuring the security of their backend
systems. For clients, APIs provide the means to access data that can be used to fuel their
research or product development.

Here, I will give a brief overview of APIs and demonstrate how you can use this resource for
your own data collection using Python.
Forms of Data Collection

Before discussing APIs, let’s quickly go over the options you have when it comes to procuring
data.

1. Collecting your own data

This one seems like a no-brainer; if you want some data, why not collect your own? After all, no
one understands your requirements better than you do. So, just get out there and start
collecting, right?

Wrong.

For most cases, collecting your own data is an absurd notion. Procuring information of the
required quantity and quality requires considerable time, money, manpower, and resources.

This is an infeasible (if not impossible) undertaking.

2. Using ready-made datasets

Why go through the trouble of collecting and processing data when you can just use someone
else’s preprocessed datasets?

Ready-made datasets can be appealing since someone has already done all the hard work for
you in making them. You’ve no doubt encountered plenty of them on sites like Kaggle.com
and Data.gov.

Unfortunately, the convenience of this approach comes at the cost of flexibility and control.
When you use a ready-made dataset, you are restricted by the preprocessing performed on that
dataset prior to its upload.

Chances are that some of the records or features that would have been useful to you were
discarded by the source.

These sources of data certainly have their merits, but their limitations will add constraints to
any subsequent analysis or modeling and can hamper the success of your project.

3. Web scraping

Web scraping is somewhat the middle ground between collecting your own data and using
someone else’s.
You get to access other people’s data by going to their websites and choosing exactly what
parts you want to collect.

On paper, this seems like a good deal, but web scraping comes with its own caveats.

For starters, extracting data from websites through scraping can be challenging. Web scraping
tools such as Selenium require a strong grasp of HTML and XML Path Language (XPath).
Furthermore, the scripts required to navigate websites and procure the needed information can
be long and may require a lot of time to write.

In addition, web scraping can at times be unethical or even illegal. While some websites have
no qualms with scraping, others may be less tolerant. It isn’t uncommon for websites to upload
copyrighted data or set terms that stipulate conditions for scraping.

Web scraping without sufficient caution and care could get you in trouble.

Benefits of APIs

APIs offer a way to obtain needed data while avoiding the disadvantages of the
aforementioned data collection methods.

It spares you the trouble of having to collect data yourself as you can directly procure data
from another entity. You get the freedom to select the raw data that you can process as you
wish.

You also don’t need to worry about any legal ramifications. Companies require you to possess an
identifier known as an API key before granting you access to their API. You can obtain an API
key by applying for it directly. The API key acts as a barrier to entry, ensuring that only
clients that have been approved by the company can reap the benefits of the API.

Finally, the best part of this resource is that it can facilitate data extraction with just a few lines
of code!

What APIs do in a nutshell

So, what role does an API play in your efforts to collect data from an external source? To
answer this, we should first introduce a little terminology.

When a client wants access to certain data from a foreign server, they make a request to that
server.
When the server receives the request, it generates a response that it sends back to the client.

An API plays the role of a middle man in this exchange. It is responsible for delivering your
request to the server and then delivering the corresponding response back to you.
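
A minimal sketch of this request/response exchange in Python using the requests library is shown
below; the URL and parameters are placeholders, not a real API.

import requests

response = requests.get(
    "https://api.example.com/v1/articles",     # hypothetical endpoint
    params={"q": "data science", "page": 0},   # query parameters sent with the request
    headers={"Accept": "application/json"},
)

print(response.status_code)   # 200 means the server fulfilled the request
data = response.json()        # many APIs return the response body as JSON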

Using APIs

Unfortunately, there is no one-size-fits-all approach towards handling APIs.

APIs of different sources vary in terms of accessibility and implementation.

Most websites have their own unique requirements for requests and set a unique format for
their responses. They also differ in regards to their restrictions. For instance, some APIs limit
the number of requests you can make in a day.

Thus, to gain an understanding of a specific API, you will need to read their documentation.

Case Study

Although APIs are simple to use, they might be hard to understand at first. Let’s perform a
demonstration to see how they can be used to collect data.

We will use an API from the New York Times called the Article Search API. This API allows
you to search for articles by a query.

Their documentation clearly explains how to create URIs for requests. You can customize the URI
to make requests for specific articles. For example, you can specify the articles' date of
publication and apply other filters.

First, we create a function in Python that generates a URI for the API given the query, sorting
order, page number, and API key.

Data Mining vs Data Exploration

There are two main methodologies or techniques used to retrieve relevant data from large,
unorganized pools. They are manual and automatic methods. The manual method is another
name for data exploration, while the automatic method is also known as data mining.

Data mining generally refers to gathering relevant data from large databases. On the other
hand, data exploration generally refers to a data user finding their way through large amounts
of data to gather necessary information. Let's study both methods in detail and compare their
differences.
What is Data Exploration?

Data exploration refers to the initial step in data analysis. Data analysts use data
visualization and statistical techniques to describe dataset characteristics, such as size,
quantity, and accuracy, in order to better understand the nature of the data.

Data exploration techniques include both manual analysis and automated data exploration
software solutions that visually explore and identify relationships between different data
variables, the structure of the dataset, the presence of outliers, and the distribution of data
values to reveal patterns and points of interest, enabling data analysts to gain greater insight
into the raw data.

Data is often gathered in large, unstructured volumes from various sources. Data analysts
must first understand and develop a comprehensive view of the data before extracting
relevant data for further analysis, such as univariate, bivariate, multivariate, and principal
components analysis.

Why is Data Exploration Important?

Humans process visual data better than numerical data. Therefore it is extremely challenging
for data scientists and data analysts to assign meaning to thousands of rows and columns of
data points and communicate that meaning without any visual components.

Data visualization in data exploration leverages familiar visual cues such as shapes,
dimensions, colors, lines, points, and angles so that data analysts can effectively visualize and
define the metadata and then perform data cleansing. Performing the initial step of data
exploration enables data analysts to understand better and visually identify anomalies and
relationships that might otherwise go undetected.

Data Exploration Tools

Manual data exploration methods entail writing scripts to analyze raw data or manually
filtering data into spreadsheets. Automated data exploration tools, such as data visualization
software, help data scientists easily monitor data sources and perform big data exploration on
otherwise overwhelmingly large datasets. Graphical displays of data, such as bar charts and
scatter plots, are valuable tools in visual data exploration.

A popular tool for manual data exploration is Microsoft Excel spreadsheets, which can create
basic charts for data exploration, view raw data, and identify the correlation between
variables. To identify the correlation between two continuous variables in Excel, use the
CORREL() function to return the correlation. To identify the correlation between two
categorical variables in Excel, the two-way table method, the stacked column chart method,
and the chi-square test are effective.

There is a wide variety of proprietary automated data exploration solutions, including business
intelligence tools, data visualization software, data preparation software vendors, and data
exploration platforms. There are also open-source data exploration tools that include regression
capabilities and visualization features, which can help businesses integrate diverse data
sources to enable faster data exploration. Most data analytics software includes data
visualization tools.
What can Data Exploration Do?

In general, the goals of data exploration fall into these three categories.

1. Archival: Data exploration can convert data from physical formats (such as books,
newspapers, and invoices) into digital formats (such as databases) for backup.
2. Transferring the data format: If you want to transfer the data from your current website
into a new website under development, you can collect data from your own website by
extracting it.
3. Data analysis: As the most common goal, the extracted data can be further analyzed to
generate insights. This may sound similar to the data analysis process in data mining, but
note that data analysis is the goal of data exploration, not part of its process. What's
more, the data is analyzed differently. One example is that e-store owners extract product
details from eCommerce websites like Amazon to monitor competitors' strategies.

Use Cases of Data Exploration

Data Exploration has been widely used in multiple industries serving different purposes.
Besides monitoring prices in eCommerce, data Exploration can help in individual paper
research, news aggregation, marketing, real estate, travel and tourism, consulting, finance,
and many more.

o Lead generation: Companies can extract data from directories like Yelp,
Crunchbase, and Yellowpages to generate leads for business development.
o Content & news aggregation: Content aggregation websites can get regular data
feeds from multiple sources and keep their sites fresh and up-to-date.
o Sentiment analysis: After extracting the online reviews/comments/feedback from
social media websites like Instagram and Twitter, people can analyze the underlying
attitudes and understand how they perceive a brand, product, or phenomenon.

What is Data Mining?

Data mining can be considered a subset of data analysis. It explores and analyzes large
volumes of data to find important patterns and rules.

Data mining is a systematic and sequential process of identifying and discovering hidden
patterns and information within a large dataset. Moreover, it is used to build machine
learning models that are further used in artificial intelligence.
What Can Data Mining Do?

Data mining tools can sweep through the databases and identify hidden patterns efficiently by
automating the mining process. For businesses, data mining is often used to discover patterns
and relationships in data to help make optimal business decisions.

Use Cases of Data Mining

After data mining became widespread in the 1990s, companies in various industries -
including retail, finance, healthcare, transportation, telecommunication, E-commerce, etc.,
started to use data mining techniques to generate insights from data. Data mining can help
segment customers, detect fraud, forecast sales, etc. Specific uses of data mining include:

o Customer segmentation: Through mining customer data and identifying the
characteristics of target customers, companies can group them into distinct segments and
provide special offers that cater to their needs.
o Market basket analysis: This is a technique based on the theory that you are likely to
buy another group of products if you buy a certain group of products. One famous
example is that when fathers buy diapers for their infants, they tend to buy beers
together with the diapers.
o Forecasting sales: It may sound similar to market basket analysis, but data mining is
used to predict when a customer will buy a product again in the future. For instance, a
coach buys a bucket of protein powder that should last 9 months. The store that sold
the protein powder would plan to release new protein powder 9 months later so that
the coach would buy it again.
o Detecting frauds: Data mining aids in building models to detect fraud. By collecting
samples of fraudulent and non-fraudulent reports, businesses are empowered to
identify which transactions are suspicious.
o Discover patterns in manufacturing: In the manufacturing industry, data mining is
used to help design systems by uncovering the relationships between product
architecture, portfolio, and customer needs. It can also predict future product
development time and costs.

Difference between Data Exploration and Data Mining

There are two primary approaches to working with data from disparate sources in data
science: data exploration and data mining. Data exploration can be part of data mining, where
the aim is to collect and integrate data from different sources. Data mining, a relatively
complex process, is about discovering patterns in order to make sense of the data and predict
the future. The two require different skill sets and expertise, yet the increasing popularity of
non-coding data exploration and data mining tools greatly enhances productivity and makes
people's lives much easier.
Data Mining vs. Data Exploration

o Naming: Data mining is also named knowledge discovery in databases, extraction,
data/pattern analysis, and information harvesting. Data exploration is used interchangeably
with web exploration, web scraping, web crawling, data retrieval, data harvesting, etc.
o Data type: Data mining studies are mostly on structured data. Data exploration usually
retrieves data out of unstructured or poorly structured data sources.
o Aim: Data mining aims to make available data more useful for generating insights. Data
exploration is to collect data and gather it into a place where it can be stored or further
processed.
o Method: Data mining is based on mathematical methods to reveal patterns or trends. Data
exploration is based on programming languages or data exploration tools to crawl the data
sources.
o Purpose: The purpose of data mining is to find facts that are previously unknown or
ignored, while data exploration deals with existing information.
o Complexity: Data mining is much more complicated and requires large investments in
staff training. Data exploration can be extremely easy and cost-effective when conducted
with the right tool.

What is data storage management?

Data is more valuable than ever before. As several outlets such as The
Economist and Forbes have pointed out, data has surpassed oil as the most valuable
commodity in the current global market, and current estimates suggest this trend will
continue well into the future.

Data storage is the process of preserving information in a digital form with the use of hard
drives or other data storage systems compatible with computer interfaces. Data storage
management is the act of keeping stored data properly archived, cataloged, and secure.

Why is data storage management important?


Proper data storage management makes sure your data will be available whenever
someone needs access to it. Be it for your customers or your employees, having your data
organized is the best way to always have it within reach at a moment’s notice.
Data management strategies
Before searching for which data management solution best suits your business, consider the
following tips:

• Understand your data needs
• Do not rely on cheap storage management solutions
• Use a tiered approach as a data storage strategy
• Map out disaster recovery plans
• Use intelligent data storage solutions

Various data sources in Data Science — Overview and Usage

Text files

Text files, such as log files, store data as plain-text entries that can be extracted using a
programming language. String parsing or Regex can be used to split each parameter in the
entry.

Example for structured data in a text file (log file):

# Regex for parsing the log file
import re

regex = r"<your_regex_here>"

# Read the text file and use re.findall() to extract each parameter
file = open("text.log", "r")
for line in file.readlines():
    print(re.findall(regex, line))
file.close()

Database

A database is a commonly used way to securely store data, generally in tabular form, and to
efficiently manage huge amounts of data. MySQL is a commonly used relational database. In
a database, queries are sent by the user to the DBMS, which executes the query and returns
the result. In contrast with text files, databases require a user-password combination to gain
access and perform tasks. In this section, we will use Python to access an existing MySQL
database. Since databases are a vast topic, check the tutorial link in the references for further
learning.

Data storage in a SQL table:

import mysql.connector

# Connect to an existing MySQL database
mydb = mysql.connector.connect(
    host="hostname_here",
    user="username",
    password="password_here"
)
cursor = mydb.cursor()

# Execute a query
cursor.execute("QUERY_HERE")

# Committing confirms the changes to the database
mydb.commit()

CSV files

One of the most common ways in which data is stored is in a CSV file. A CSV file consists of
data that are comma-separated. When opened in software like Excel, a CSV file displays like
an Excel sheet, where data is stored column-wise and row-wise. CSV files can be easily
accessed and processed using a programming language like Python with libraries like Pandas.
Pandas also allows a wide range of mathematical and analytics operations on a data frame.
Check the references section for links.


Data storage in CSV files:

import pandas as pd

data = pd.read_csv("file.csv")
print(data)                   # prints the full dataframe
print(data['column_name'])    # prints a single column of the dataframe

Cloud Data warehouses / Cloud Databases

Data science often correlates with cloud platforms. With the ability to set up huge machines
and scale elastically, cloud computing is growing globally and has great future potential. The
major storage solutions offered by the cloud are data warehouses and cloud databases.
Although the functionality remains similar, warehouses are used to store large amounts of
incoming data for analytics purposes, while cloud databases store the usual customer data in
the cloud. Both of these can be accessed from an application using their respective APIs. GCP
provides the Google Cloud API, which requires you to send credentials to connect to any
service, while for AWS you can use the pyodbc library to connect to a service using the
connection string that the platform provides.
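
As a rough illustration of the pyodbc approach mentioned above, the sketch below connects
with a connection string and runs a query; the connection string, table name, and query are
placeholders supplied by your cloud platform, not values from this text.

# A minimal sketch (pyodbc assumed); connection string and query are placeholders.
import pyodbc

conn = pyodbc.connect("connection_string_here")
cursor = conn.cursor()

cursor.execute("SELECT * FROM your_table")   # hypothetical query
for row in cursor.fetchall():
    print(row)

conn.close()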


Pipeline design for an analytics system

Miscellaneous Sources

Multimedia data

Certain data will be present in multimedia forms like images or audio. These are usually
present as files in folders. For such data, libraries like OpenCV are used to read an image and
convert it into an array. Audio is generally converted to an image (a spectrogram), which
turns the task back into an image-processing problem.
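
A minimal OpenCV sketch of the idea above is shown below; the image file name is a
placeholder.

# A minimal sketch (OpenCV assumed); "sample_image.jpg" is a placeholder file.
import cv2

image = cv2.imread("sample_image.jpg")           # read the image into a NumPy array
print(image.shape)                               # (height, width, channels)

gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)   # a simple preprocessing step
print(gray.shape)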

Multimedia data often exists in folders

Social Media / APIs

Certain data science problems require you to connect to an API or another platform, such as
social media, to obtain specific data. A common scenario is fetching data from Twitter
hashtags and performing sentiment analysis on it for natural language processing. APIs can
also provide live streaming data (e.g., COVID data or election results data), which helps in
performing live analytics and dashboard generation.

What is Data Analytics? Introduction to Data Analysis

Data has been the buzzword for ages now. Whether the data is generated by large-scale
enterprises or by an individual, every aspect of it needs to be analyzed in order to benefit from
it. But how do we do it? Well, that's where the term 'Data Analytics' comes in. In this blog on
'What is Data Analytics?', you will get an insight into this term.

Why is Data Analytics important?


Data Analytics has a key role in improving your business as it is used to gather hidden
insights, generate reports, perform market analysis, and improve business requirements.

What is the role of Data Analytics?


You can refer below:

• Gather Hidden Insights – Hidden insights from data are gathered and then analyzed
with respect to business requirements.
• Generate Reports – Reports are generated from the data and are passed on to the
respective teams and individuals to deal with further actions for a high rise in
business.
• Perform Market Analysis – Market Analysis can be performed to understand the
strengths and weaknesses of competitors.
• Improve Business Requirement – Analysis of Data allows improving Business to
customer requirements and experience.

Now that you know the need for Data Analytics, let me quickly elaborate on what is Data
Analytics for you.

What is Data Analytics for Beginners?


Data Analytics refers to the techniques used to analyze data to enhance productivity and
business gain. Data is extracted from various sources and is cleaned and categorized to
analyze various behavioral patterns. The techniques and the tools used vary according to the
organization or individual.

So, in short, if you understand your Business Administration and have the capability to
perform Exploratory Data Analysis, to gather the required information, then you are good to
go with a career in Data Analytics.

So, now that you know what is Data Analytics, let me quickly cover the top tools used in this
field.
What are the tools used in Data Analytics?
With the increasing demand for Data Analytics in the market, many tools have emerged with
various functionalities for this purpose. Ranging from open-source platforms to user-friendly
commercial software, the top tools in the data analytics market are as follows.

• R programming – This tool is the leading analytics tool used for statistics and data
modeling. R compiles and runs on various platforms such as UNIX, Windows, and
Mac OS. It also provides tools to automatically install all packages as per user requirements.
• Python – Python is an open-source, object-oriented programming language that is
easy to read, write, and maintain. It provides various machine learning and
visualization libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas, Keras,
etc. It can also work with data from different platforms, such as a SQL Server, a
MongoDB database, or JSON files.
• Tableau Public – This is a free software that connects to any data source such as
Excel, corporate Data Warehouse, etc. It then creates visualizations, maps,
dashboards etc with real-time updates on the web.
• QlikView – This tool offers in-memory data processing with the results delivered to
the end-users quickly. It also offers data association and data visualization with data
being compressed to almost 10% of its original size.
• SAS – A programming language and environment for data manipulation and
analytics, this tool is easily accessible and can analyze data from different sources.
• Microsoft Excel – This tool is one of the most widely used tools for data analytics.
Mostly used for clients' internal data, this tool summarizes data with features such as
pivot tables.
• RapidMiner – A powerful, integrated platform that can integrate with any data
source types such as Access, Excel, Microsoft SQL, Tera data, Oracle, Sybase etc.
This tool is mostly used for predictive analytics, such as data mining, text
analytics, machine learning.
• KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics
platform, which allows you to analyze and model data. With the benefit of visual
programming, KNIME provides a platform for reporting and integration through its
modular data pipeline concept.
• OpenRefine – Also known as GoogleRefine, this data cleaning software will help
you clean up data for analysis. It is used for cleaning messy data, the transformation
of data and parsing data from websites.
• Apache Spark – One of the most widely used large-scale data processing engines, this tool
executes applications in Hadoop clusters up to 100 times faster in memory and 10 times
faster on disk. This tool is also popular for data pipelines and machine learning model
development.

Introduction of Statistics and its Types


Statistics simply means numerical data and is the field of mathematics that generally deals
with the collection, tabulation, and interpretation of numerical data. It is a form of
mathematical analysis that uses different quantitative models to produce a set of experimental
data or studies of real life. It is an area of applied mathematics concerned with data collection,
analysis, interpretation, and presentation. Statistics deals with how data can be used to solve
complex problems. Some people consider statistics to be a distinct mathematical science
rather than a branch of mathematics.
Statistics makes work easy and simple and provides a clear and clean picture of the work you
do on a regular basis.
Basic terminology of Statistics :
• Population –
It is a collection or set of individuals, objects, or events whose properties are to be
analyzed.
• Sample –
It is the subset of a population.
Types of Statistics :

1. Descriptive Statistics :
Descriptive statistics uses data that provides a description of the population, either through
numerical calculations or graphs or tables. It provides a graphical summary of the data and is
simply used for summarizing it. There are two categories, as described below.
• (a). Measure of central tendency –
Measures of central tendency, also known as summary statistics, are used to represent the
center point or a typical value of a data set or sample set.
In statistics, there are three common measures of central tendency, as shown below:
• (i) Mean :
It is the average of all values in a sample set.
For example, the mean of {2, 4, 6, 8} is (2 + 4 + 6 + 8) / 4 = 5.
• (ii) Median :
It is the central value of a sample set. The data set is ordered from lowest to highest
value, and then the exact middle value is found.
For example, the median of {1, 3, 5, 7, 9} is 5.
• (iii) Mode :
It is the value that occurs most frequently in a sample set. The value repeated most of
the time in the set is the mode.
For example, the mode of {2, 3, 3, 5, 7} is 3.

• (b). Measure of Variability –


Measure of Variability is also known as measure of dispersion and is used to describe the
variability in a sample or population. In statistics, there are three common measures of
variability, as shown below (a short Python illustration follows this list):
• (i) Range :
It is a measure of how spread apart the values in a sample set or data set are.
Range = Maximum value - Minimum value
• (ii) Variance :
It describes how much a random variable differs from its expected value and is
computed as the average of the squared deviations.
S² = [ Σ from i=1 to n of (xi - x̄)² ] / n
In this formula, n represents the total number of data points, x̄ represents the mean of
the data points, and xi represents an individual data point.
• (iii) Standard deviation (dispersion) :
It is a measure of the dispersion of a set of data from its mean.
σ = √( (1/n) * Σ from i=1 to n of (xi - μ)² )
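
The measures above can be illustrated with Python's built-in statistics module; this is a small
sketch with made-up numbers, not part of the original notes.

# A minimal sketch using Python's standard statistics module on made-up data.
import statistics

data = [2, 3, 3, 5, 7, 10]

print(statistics.mean(data))        # mean (average)
print(statistics.median(data))      # median (middle value)
print(statistics.mode(data))        # mode (most frequent value)
print(max(data) - min(data))        # range
print(statistics.pvariance(data))   # population variance
print(statistics.pstdev(data))      # population standard deviation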
2. Inferential Statistics :
Inferential statistics makes inferences and predictions about a population based on a sample
of data taken from that population. It generalizes from the sample to a larger dataset and
applies probability theory to draw conclusions. It is used to explain the meaning of descriptive
statistics, to analyze and interpret results, and to draw conclusions. Inferential statistics is
mainly related to and associated with hypothesis testing, whose main target is to test (and
possibly reject) a null hypothesis.
Hypothesis testing is a type of inferential procedure that uses sample data to evaluate and
assess the credibility of a hypothesis about a population. Inferential statistics is generally used
to determine how strong a relationship is within the sample, since it is very difficult to obtain
a complete population list and draw a truly random sample.
Inferential statistics can be carried out with the help of the steps given below:
1. Obtain and start with a theory.
2. Generate a research hypothesis.
3. Operationalize the variables.
4. Identify the population to which the study results can be applied.
5. Form a null hypothesis for this population.
6. Collect a sample from the population and run the study.
7. Perform statistical tests to check whether the obtained sample characteristics are
sufficiently different from what would be expected under the null hypothesis, so that the
null hypothesis can be rejected.
Types of inferential statistics –
Various types of inferential statistics are used widely nowadays and are very easy to interpret.
These are given below:
• One sample test of difference/One sample hypothesis test
• Confidence Interval
• Contingency Tables and Chi-Square Statistic
• T-test or ANOVA
• Pearson Correlation
• Bi-variate Regression
• Multi-variate Regression

What is Central Tendency?

Central tendency is a descriptive summary of a dataset through a single value that reflects the
center of the data distribution. Along with the variability (dispersion) of a dataset, central
tendency is a branch of descriptive statistics.

The central tendency is one of the most quintessential concepts in statistics. Although it does
not provide information regarding the individual values in the dataset, it delivers a
comprehensive summary of the whole dataset.

Measures of Central Tendency

Generally, the central tendency of a dataset can be described using the following measures:

• Mean (Average): Represents the sum of all values in a dataset divided by the total
number of the values.
• Median: The middle value in a dataset that is arranged in ascending order (from the
smallest value to the largest value). If a dataset contains an even number of values, the
median of the dataset is the mean of the two middle values.
• Mode: Defines the most frequently occurring value in a dataset. In some cases, a
dataset may contain multiple modes, while some datasets may not have any mode at
all.

Even though the measures above are the most commonly used to define central tendency,
there are some other measures, including, but not limited to, geometric mean, harmonic
mean, midrange, and geometric median.

The selection of a central tendency measure depends on the properties of a dataset. For
instance, the mode is the only central tendency measure for categorical data, while a median
works best with ordinal data.

Although the mean is regarded as the best measure of central tendency for quantitative data,
that is not always the case. For example, the mean may not work well with quantitative
datasets that contain extremely large or extremely small values. The extreme values may
distort the mean. Thus, you may consider other measures.

The measures of central tendency can be found using a formula or definition. Also, they can
be identified using a frequency distribution graph. Note that for datasets that follow a normal
distribution, the mean, median, and mode are located on the same spot on the graph.

What Is Variance in Statistics? Definition, Formula, and Example

What Is Variance?
The term variance refers to a statistical measurement of the spread between numbers in a
data set. More specifically, variance measures how far each number in the set is from
the mean (average), and thus from every other number in the set. Variance is often depicted
by this symbol: σ2. It is used by both analysts and traders to determine volatility and market
security.

The square root of the variance is the standard deviation (SD or σ), which helps determine
the consistency of an investment’s returns over a period of time.
Advantages and Disadvantages of Variance
Statisticians use variance to see how individual numbers relate to each other within a data
set, rather than using broader mathematical techniques such as arranging numbers into
quartiles. The advantage of variance is that it treats all deviations from the mean the same,
regardless of their direction. Because the deviations are squared, they cannot sum to zero and
give the appearance of no variability at all in the data.

One drawback to variance, though, is that it gives added weight to outliers. These are the
numbers far from the mean. Squaring these numbers can skew the data. Another pitfall of
using variance is that it is not easily interpreted. Users often employ it primarily to take the
square root of its value, which indicates the standard deviation of the data. As noted above,
investors can use standard deviation to assess how consistent returns are over time.

Introduction

Welcome to the world of Probability in Data Science! Let me start things off with an intuitive

example.

Suppose you are a teacher at a university. After checking assignments for a week, you graded
all the students. You gave these graded papers to a data entry person at the university and told
him to create a spreadsheet containing the grades of all the students. But he only stored the
grades and not the corresponding students.

He made another blunder: he missed a couple of entries in a hurry, and we have no idea
whose grades are missing. Let's find a way to solve this.

One way is to visualize the grades and see if you can find a trend in the data.
The graph that you have plotted is called the frequency distribution of the data. You see that
there is a smooth, curve-like structure that defines our data, but do you notice an anomaly?
We have an abnormally low frequency at a particular score range. So the best guess would be
that the missing values are the ones that would remove the dent in the distribution.

This is how you would try to solve a real-life problem using data analysis. For any data
scientist, student, or practitioner, distributions are a must-know concept. They provide the
basis for analytics and inferential statistics.

While the concept of probability gives us the mathematical calculations, distributions help us
actually visualize what's happening underneath.

In this article, I have covered some important types of probability distributions, explained in
a lucid as well as comprehensive manner.

Note: This article assumes you have a basic knowledge of probability. If not, you may want
to review the basics of probability first.

Table of Contents

1. Common Data Types


2. Types of Distributions in Statistics
1. Bernoulli Distribution
2. Uniform Distribution
3. Binomial Distribution
4. Normal Distribution
5. Poisson Distribution
6. Exponential Distribution
3. Relations between the Distributions
4. Test your Knowledge!

Common Data Types

Before we jump on to the explanation of distributions, let's see what kind of data we can
encounter. The data can be discrete or continuous.

Discrete data, as the name suggests, can take only specified values. For example, when you
roll a die, the possible outcomes are 1, 2, 3, 4, 5, or 6 and not 1.5 or 2.45.

Continuous data can take any value within a given range. The range may be finite or
infinite. For example, a girl's weight or height, or the length of a road. The weight of a girl
can be any value such as 54 kg, 54.5 kg, or 54.5436 kg.
Now let us start with the types of distributions.

Types of Distributions
Bernoulli Distribution

Let’s start with the easiest distribution that is Bernoulli Distribution. It is actually easier to

understand than it sounds!

All you cricket junkies out there! At the beginning of any cricket match, how do you decide
who is going to bat or bowl first? A toss! It all depends on whether you win or lose the toss,
right? Let's say if the toss results in a head, you win. Else, you lose. There's no midway.

A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure),

and a single trial. So the random variable X which has a Bernoulli distribution can take value
1 with the probability of success, say p, and the value 0 with the probability of failure, say q

or 1-p.

Here, the occurrence of a head denotes success, and the occurrence of a tail denotes failure.

Probability of getting a head = 0.5 = Probability of getting a tail since there are only two

possible outcomes.

The probability mass function is given by: P(X = x) = p^x * (1 - p)^(1 - x), where x ∈ {0, 1}.

It can also be written as P(X = 1) = p and P(X = 0) = q = 1 - p.

The probabilities of success and failure need not be equally likely, like the result of a fight

between me and Undertaker. He is pretty much certain to win. So in this case, the probability
of my success is 0.15, while that of my failure is 0.85.

Here, the probability of success(p) is not same as the probability of failure. So, the chart

below shows the Bernoulli Distribution of our fight.

Here, the probability of success = 0.15 and the probability of failure = 0.85. The expected
value is exactly what it sounds like. If I punch you, I may expect you to punch me back.
Basically, the expected value of any distribution is the mean of the distribution. The expected
value of a

random variable X from a Bernoulli distribution is found as follows:

E(X) = 1*p + 0*(1-p) = p

The variance of a random variable from a bernoulli distribution is:

V(X) = E(X²) – [E(X)]² = p – p² = p(1-p)

There are many examples of Bernoulli distribution such as whether it’s going to rain

tomorrow or not where rain denotes success and no rain denotes failure and Winning

(success) or losing (failure) the game.
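
As an optional illustration (not part of the original article), the "fight" example with p = 0.15
can be simulated with NumPy; np.random.binomial with n = 1 draws Bernoulli samples.

# A minimal sketch: simulating a Bernoulli(p = 0.15) variable with NumPy.
import numpy as np

p = 0.15
samples = np.random.binomial(n=1, p=p, size=10000)   # n=1 makes each draw Bernoulli

print(samples.mean())   # close to E(X) = p = 0.15
print(samples.var())    # close to V(X) = p * (1 - p) = 0.1275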

Uniform Distribution

When you roll a fair die, the outcomes are 1 to 6. The probabilities of getting these outcomes

are equally likely and that is the basis of a uniform distribution. Unlike Bernoulli

Distribution, all the n number of possible outcomes of a uniform distribution are equally

likely.

A variable X is said to be uniformly distributed if its density function is:

f(x) = 1 / (b - a) for -∞ < a ≤ x ≤ b < ∞, and f(x) = 0 otherwise.

The graph of a uniform distribution curve looks like


You can see that the shape of the Uniform distribution curve is rectangular, the reason why

Uniform distribution is called rectangular distribution.

For a Uniform Distribution, a and b are the parameters.

The number of bouquets sold daily at a flower shop is uniformly distributed with a maximum

of 40 and a minimum of 10.

Let’s try calculating the probability that the daily sales will fall between 15 and 30.

The probability that daily sales will fall between 15 and 30 is (30-15)*(1/(40-10)) = 0.5

Similarly, the probability that daily sales are greater than 20 is (40 - 20) * (1 / (40 - 10)) = 0.667

The mean and variance of X following a uniform distribution is:

Mean -> E(X) = (a+b)/2

Variance -> V(X) = (b-a)²/12

The standard uniform density has parameters a = 0 and b = 1, so the PDF for the standard
uniform density is given by:

f(x) = 1 for 0 ≤ x ≤ 1, and f(x) = 0 otherwise.
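
The flower-shop calculations above can be reproduced with scipy.stats (an assumption; the
same numbers follow directly from the 1/(b - a) formula). Note that scipy parameterizes the
uniform distribution by loc = a and scale = b - a.

# A minimal sketch (scipy assumed) for the uniform distribution on [10, 40].
from scipy.stats import uniform

a, b = 10, 40
dist = uniform(loc=a, scale=b - a)

print(dist.cdf(30) - dist.cdf(15))   # P(15 <= X <= 30) = 0.5
print(1 - dist.cdf(20))              # P(X > 20) ≈ 0.667
print(dist.mean(), dist.var())       # (a + b) / 2 = 25, (b - a)^2 / 12 = 75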

Binomial Distribution

Let's get back to cricket. Suppose that you won the toss today, and this indicates a successful
event. You toss again, but you lose this time. If you win a toss today, this does not guarantee
that you will win the toss tomorrow. Let's assign a random variable, say X, to the number of
times you won the toss. What can be the possible value of X? It can be any number
depending on the number of times you tossed a coin.


There are only two possible outcomes. Head denoting success and tail denoting failure.

Therefore, probability of getting a head = 0.5 and the probability of failure can be easily

computed as: q = 1- p = 0.5.

A distribution where only two outcomes are possible, such as success or failure, gain or loss,

win or lose and where the probability of success and failure is same for all the trials is called

a Binomial Distribution.

The outcomes need not be equally likely. Remember the example of a fight between me and

Undertaker? So, if the probability of success in an experiment is 0.2 then the probability of

failure can be easily computed as q = 1 – 0.2 = 0.8.

Each trial is independent since the outcome of the previous toss doesn’t determine or affect

the outcome of the current toss. An experiment with only two possible outcomes repeated n

number of times is called binomial. The parameters of a binomial distribution are n and p

where n is the total number of trials and p is the probability of success in each trial.

On the basis of the above explanation, the properties of a Binomial Distribution are

1. Each trial is independent.


2. There are only two possible outcomes in a trial- either a success or a failure.
3. A total number of n identical trials are conducted.
4. The probability of success and failure is same for all trials. (Trials are identical.)

The mathematical representation of the binomial distribution is given by:

P(X = x) = nCx * p^x * (1 - p)^(n - x), for x = 0, 1, ..., n, where nCx = n! / (x! * (n - x)!).

A binomial distribution graph where the probability of success does not equal the probability

of failure looks like


Now, when probability of success = probability of failure, in such a situation the graph of

binomial distribution looks like

The mean and variance of a binomial distribution are given by:

Mean -> µ = n*p

Variance ->Var(X) = n*p*q
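
As a small, optional sketch (scipy assumed), the toss example with n = 10 trials and p = 0.5
can be explored as follows.

# A minimal sketch (scipy assumed) for a Binomial(n = 10, p = 0.5) variable.
from scipy.stats import binom

n, p = 10, 0.5
dist = binom(n, p)

print(dist.pmf(6))    # P(exactly 6 successes in 10 trials) ≈ 0.205
print(dist.mean())    # n * p = 5
print(dist.var())     # n * p * q = 2.5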

Normal Distribution

Normal distribution represents the behavior of most of the situations in the universe (That is

why it’s called a “normal” distribution. I guess!). The large sum of (small) random variables

often turns out to be normally distributed, contributing to its widespread application. Any

distribution is known as Normal distribution if it has the following characteristics:

1. The mean, median and mode of the distribution coincide.


2. The curve of the distribution is bell-shaped and symmetrical about the line x=μ.
3. The total area under the curve is 1.
4. Exactly half of the values are to the left of the center and the other half to the right.
A normal distribution is highly different from Binomial Distribution. However, if the number

of trials approaches infinity then the shapes will be quite similar.

The PDF of a random variable X following a normal distribution is given by:

f(x) = (1 / (σ * √(2π))) * e^( -(x - µ)² / (2σ²) ), for -∞ < x < ∞

The mean and variance of a random variable X which is said to be normally distributed is

given by:

Mean -> E(X) = µ

Variance ->Var(X) = σ^2

Here, µ (mean) and σ (standard deviation) are the parameters.

The graph of a random variable X ~ N (µ, σ) is shown below.

A standard normal distribution is defined as the distribution with mean 0 and standard
deviation 1. For such a case, the PDF becomes:

f(x) = (1 / √(2π)) * e^( -x² / 2 )
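
A short sketch of the standard normal distribution (scipy assumed, not part of the original
article):

# A minimal sketch (scipy assumed) for the standard normal N(0, 1).
from scipy.stats import norm

z = norm(loc=0, scale=1)     # mean = 0, standard deviation = 1

print(z.pdf(0))              # density at the mean: 1 / sqrt(2*pi) ≈ 0.3989
print(z.cdf(1.96))           # ≈ 0.975, i.e. about 97.5% of values lie below 1.96
print(z.mean(), z.std())     # 0.0, 1.0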


Machine Learning Algorithms

Machine Learning algorithms are the programs that can learn the hidden patterns from the
data, predict the output, and improve the performance from experiences on their own.
Different algorithms can be used in machine learning for different tasks, such as simple linear
regression that can be used for prediction problems like stock market prediction, and the
KNN algorithm can be used for classification problems.

In this topic, we will see the overview of some popular and most commonly used machine
learning

algorithms along with their use cases and categories.

Types of Machine Learning Algorithms

Machine Learning Algorithms can be broadly classified into three types:

1. Supervised Learning Algorithms


2. Unsupervised Learning Algorithms
3. Reinforcement Learning algorithm

The below diagram illustrates the different ML algorithms, along with their categories

1) Supervised Learning Algorithm

Supervised learning is a type of Machine learning in which the machine needs external
supervision to learn. The supervised learning models are trained using the labeled dataset.
Once the training and processing are done, the model is tested by providing a sample test data
to check whether it predicts the correct output.

The goal of supervised learning is to map input data with the output data. Supervised learning
is based on supervision, and it is the same as when a student learns things in the teacher's
supervision. The example of supervised learning is spam filtering.
Supervised learning can be divided further into two categories of problem:

o Classification
o Regression

Examples of some popular supervised learning algorithms are Simple Linear regression,
Decision Tree, Logistic Regression, KNN algorithm, etc. Read more..

2) Unsupervised Learning Algorithm

It is a type of machine learning in which the machine does not need any external supervision
to learn from the data, hence called unsupervised learning. The unsupervised models can be
trained using the unlabelled dataset that is not classified, nor categorized, and the algorithm
needs to act on that data without any supervision. In unsupervised learning, the model doesn't
have a predefined output, and it tries to find useful insights from the huge amount of data.
These are used to solve the Association and Clustering problems. Hence further, it can be
classified into two types:

o Clustering
o Association

Examples of some Unsupervised learning algorithms are K-means Clustering, Apriori


Algorithm, Eclat, etc. Read more..

3) Reinforcement Learning

In reinforcement learning, an agent interacts with its environment by producing actions and
learns with the help of feedback. The feedback is given to the agent in the form of rewards:
for each good action, it gets a positive reward, and for each bad action, it gets a negative
reward. There is no labeled supervision provided to the agent. The Q-Learning algorithm is
used in reinforcement learning. Read more…

List of Popular Machine Learning Algorithms


1. Linear Regression Algorithm
2. Logistic Regression Algorithm
3. Decision Tree
4. SVM
5. Naïve Bayes
6. KNN
7. K-Means Clustering
8. Random Forest
9. Apriori
10. PCA

1. Linear Regression

Linear regression is one of the most popular and simple machine learning algorithms that is
used for predictive analysis. Here, predictive analysis defines prediction of something, and
linear regression makes predictions for continuous numbers such as salary, age, etc.

It shows the linear relationship between the dependent and independent variables, and shows
how the dependent variable(y) changes according to the independent variable (x).

It tries to best fit a line between the dependent and independent variables, and this best fit line
is knowns as the regression line.

The equation for the regression line is:

y = a0 + a1*x

Here, y = dependent variable

x = independent variable

a0 = intercept of the line

a1 = slope of the line (the linear regression coefficient)

Linear regression is further divided into two types:

o Simple Linear Regression: In simple linear regression, a single independent variable


is used to predict the value of the dependent variable.
o Multiple Linear Regression: In multiple linear regression, more than one
independent variable is used to predict the value of the dependent variable.

The below diagram shows the linear regression for prediction of weight according to
height: Read more..
2. Logistic Regression

Logistic regression is the supervised learning algorithm, which is used to predict the
categorical variables or discrete values. It can be used for the classification problems in
machine learning, and the output of the logistic regression algorithm can be either Yes or
NO, 0 or 1, Red or Blue, etc.

Logistic regression is similar to linear regression except in how it is used: linear regression is
used to solve regression problems and predict continuous values, whereas logistic regression
is used to solve classification problems and predict discrete values.

Instead of fitting the best fit line, it forms an S-shaped curve that lies between 0 and 1. The S-
shaped curve is also known as a logistic function that uses the concept of the threshold. Any
value above the threshold will tend to 1, and below the threshold will tend to 0. Read more..
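
As a rough sketch of the S-shaped logistic function and the threshold idea described above
(the input values below are arbitrary, not from the text):

# A minimal sketch of the logistic (sigmoid) function with a 0.5 threshold.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

for z in [-4, -1, 0, 1, 4]:
    p = sigmoid(z)
    label = 1 if p >= 0.5 else 0   # values above the threshold tend towards class 1
    print(z, round(p, 3), label)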

3. Decision Tree Algorithm

A decision tree is a supervised learning algorithm that is mainly used to solve the
classification problems but can also be used for solving the regression problems. It can work
with both categorical variables and continuous variables. It shows a tree-like structure that
includes nodes and branches, and starts with the root node, which expands into further
branches until the leaf nodes are reached. The internal nodes are used to represent the features
of the dataset, branches show the decision rules, and leaf nodes represent the outcome of the
problem.

Some real-world applications of decision tree algorithms are identification between cancerous
and non-cancerous cells, suggestions to customers to buy a car, etc. Read more..
4. Support Vector Machine Algorithm

A support vector machine or SVM is a supervised learning algorithm that can also be used for
classification and regression problems. However, it is primarily used for classification
problems. The goal of SVM is to create a hyperplane or decision boundary that can segregate
datasets into different classes.

The data points that help to define the hyperplane are known as support vectors, and hence it
is named as support vector machine algorithm.

Some real-life applications of SVM are face detection, image classification, Drug
discovery, etc. Consider the below diagram:

As we can see in the above diagram, the hyperplane has classified datasets into two different
classes. Read more..

5. Naïve Bayes Algorithm:

Naïve Bayes classifier is a supervised learning algorithm, which is used to make predictions
based on the probability of the object. The algorithm is named Naïve Bayes as it is based
on Bayes' theorem, and it follows the naïve assumption that the variables are independent of
each other.

The Bayes theorem is based on conditional probability; it gives the likelihood that
event A will happen, given that event B has already happened. The equation for Bayes'
theorem is given as:

P(A|B) = [ P(B|A) * P(A) ] / P(B)

The Naïve Bayes classifier is one of the best classifiers and provides good results for a given
problem. It is easy to build a Naïve Bayes model, and it is well suited for huge datasets. It is
mostly used for text classification. Read more..
6. K-Nearest Neighbour (KNN)

K-Nearest Neighbour is a supervised learning algorithm that can be used for both
classification and regression problems. This algorithm works by assuming the similarities
between the new data point and available data points. Based on these similarities, the new
data points are put in the most similar categories. It is also known as the lazy learner
algorithm as it stores all the available datasets and classifies each new case with the help of
K-neighbours. The new case is assigned to the nearest class with the most similarities, and a
distance function measures the distance between the data points. The distance function can
be Euclidean, Minkowski, Manhattan, or Hamming distance, based on the
requirement. Read more..

7. K-Means Clustering

K-means clustering is one of the simplest unsupervised learning algorithms, which is used to
solve clustering problems. The data points are grouped into K different clusters based on
similarities and dissimilarities; that is, points with the most commonalities remain in one
cluster, which has few or no commonalities with the other clusters. In K-means, K refers to
the number of clusters, and means refers to averaging the data in order to find the centroid.

It is a centroid-based algorithm, and each cluster is associated with a centroid. This algorithm
aims to reduce the distance between the data points and their centroids within a cluster.

This algorithm starts with a group of randomly selected centroids that form the initial
clusters and then performs an iterative process to optimize the positions of these centroids.

It can be used for spam detection and filtering, identification of fake news, etc. Read more..
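
A minimal scikit-learn sketch of K-means on a handful of made-up 2-D points (scikit-learn
and the toy data are assumptions, not from the text):

# A minimal sketch (scikit-learn assumed): grouping toy 2-D points into K = 2 clusters.
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)            # cluster assigned to each point
print(kmeans.cluster_centers_)   # centroid of each cluster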

8. Random Forest Algorithm

Random forest is a supervised learning algorithm that can be used for both classification
and regression problems in machine learning. It is an ensemble learning technique that
provides predictions by combining multiple classifiers to improve the performance of the
model.

It contains multiple decision trees built on subsets of the given dataset and averages their
results to improve the predictive accuracy of the model. A random forest typically contains
64-128 trees, and a greater number of trees leads to higher accuracy of the algorithm.

To classify a new dataset or object, each tree gives the classification result and based on the
majority votes, the algorithm predicts the final output.

Random forest is a fast algorithm, and can efficiently deal with the missing & incorrect
data. Read more..
9. Apriori Algorithm

The Apriori algorithm is an unsupervised learning algorithm that is used to solve association
problems. It uses frequent itemsets to generate association rules, and it is designed to work on
databases that contain transactions. With the help of these association rules, it determines
how strongly or how weakly two objects are connected to each other. This algorithm uses a
breadth-first search and a Hash Tree to calculate the itemsets efficiently.

The algorithm works iteratively to find the frequent itemsets from the large dataset.

The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994. It is
mainly used for market basket analysis and helps to understand the products that can be
bought together. It can also be used in the healthcare field to find drug reactions in
patients. Read more..

10. Principal Component Analysis

Principal Component Analysis (PCA) is an unsupervised learning technique that is used
for dimensionality reduction. It helps in reducing the dimensionality of a dataset that
contains many features correlated with each other. It is a statistical process that converts the
observations of correlated features into a set of linearly uncorrelated features with the help of
an orthogonal transformation. It is one of the popular tools used for exploratory data
analysis and predictive modeling.

PCA works by considering the variance of each attribute because the high variance shows the
good split between the classes, and hence it reduces the dimensionality.
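
A minimal scikit-learn sketch of PCA (the random data below is purely illustrative):

# A minimal sketch (scikit-learn assumed): projecting 4 features onto 2 principal components.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 4)               # 100 samples, 4 features (made-up data)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # variance captured by each component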

Linear Regression in Machine Learning

Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent (y) variable
and one or more independent (x) variables, hence it is called linear regression. Since linear
regression shows a linear relationship, it finds how the value of the dependent variable
changes according to the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:
Mathematically, we can represent a linear regression as:

y= a0+a1x+ ε

Here,

Y= Dependent Variable (Target Variable)


X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value).
ε = random error

The values for x and y variables are training datasets for Linear Regression model
representation.

Types of Linear Regression

Linear regression can be further divided into two types of the algorithm:

o Simple Linear Regression:


If a single independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Multiple Linear
Regression.

Linear Regression Line

A linear line showing the relationship between the dependent and independent variables is
called a regression line. A regression line can show two types of relationship:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis and independent variable increases
on X-axis, then such a relationship is termed as a Positive linear relationship.

o Negative Linear Relationship:


If the dependent variable decreases on the Y-axis and independent variable increases
on the X-axis, then such a relationship is called a negative linear relationship.

Finding the best fit line:

When working with linear regression, our main goal is to find the best fit line, which means
the error between the predicted values and the actual values should be minimized. The best
fit line will have the least error.

Different values for the weights or coefficients of the line (a0, a1) give different regression
lines, so we need to calculate the best values for a0 and a1 to find the best fit line. To
calculate this, we use the cost function.

Cost function-
o The different values for weights or coefficient of lines (a0, a1) gives the different line
of regression, and the cost function is used to estimate the values of the coefficient for
the best fit line.
o Cost function optimizes the regression coefficients or weights. It measures how a
linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also known
as Hypothesis function.

For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the
average of the squared errors between the predicted values and the actual values. For the
above linear equation, MSE can be calculated as:

MSE = (1/N) * Σ from i=1 to N of (Yi - (a1*xi + a0))²

Where,

N=Total number of observation


Yi = Actual value
(a1xi+a0)= Predicted value.

Residuals: The distance between the actual value and the predicted value is called the
residual. If the observed points are far from the regression line, then the residuals will be
high, and so the cost function will be high. If the scatter points are close to the regression
line, then the residuals will be small, and hence the cost function will be small.

Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o It is done by randomly selecting initial values of the coefficients and then iteratively
updating them to reach the minimum of the cost function.

Model Performance:

The Goodness of fit determines how the line of regression fits the set of observations. The
process of finding the best model out of various models is called optimization. It can be
achieved by the method below:

1. R-squared method:

o R-squared is a statistical method that determines the goodness of fit.


o It measures the strength of the relationship between the dependent and independent
variables on a scale of 0-100%.
o The high value of R-square determines the less difference between the predicted
values and actual values and hence represents a good model.
o It is also called a coefficient of determination, or coefficient of multiple
determination for multiple regression.
o It can be calculated from the below formula:
R-squared = Explained variation / Total variation
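
As an optional sketch (scikit-learn assumed), the snippet below fits y = a0 + a1*x on a few
made-up points and reports the R-squared score discussed above.

# A minimal sketch (scikit-learn assumed); the data points are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # independent variable
y = np.array([3, 5, 7, 9, 11])            # dependent variable

model = LinearRegression().fit(X, y)

print(model.intercept_)    # a0 (intercept)
print(model.coef_)         # a1 (regression coefficient)
print(model.score(X, y))   # R-squared (goodness of fit)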

Assumptions of Linear Regression

Below are some important assumptions of Linear Regression. These are some formal checks
while building a Linear Regression model, which ensures to get the best possible result from
the given dataset.

o Linear relationship between the features and target:


Linear regression assumes the linear relationship between the dependent and
independent variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due to
multicollinearity, it may be difficult to find the true relationship between the predictors
and the target variable. Or we can say, it is difficult to determine which predictor
variable is affecting the target variable and which is not. So, the model assumes either
little or no multicollinearity between the features or independent variables.
o Homoscedasticity Assumption:
Homoscedasticity is a situation when the error term is the same for all the values of
independent variables. With homoscedasticity, there should be no clear pattern
distribution of data in the scatter plot.
o Normal distribution of error terms:
Linear regression assumes that the error term should follow the normal distribution
pattern. If error terms are not normally distributed, then confidence intervals will
become either too wide or too narrow, which may cause difficulties in finding
coefficients.
It can be checked using a Q-Q plot. If the plot shows a straight line without any
deviation, it means the errors are normally distributed.
o No autocorrelations:
The linear regression model assumes no autocorrelation in the error terms. If there is
any correlation in the error terms, it will drastically reduce the accuracy of the model.
Autocorrelation usually occurs if there is a dependency between residual errors.

Support Vector Machine Algorithm

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is
used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in the
correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed a Support Vector
Machine. Consider the below diagram in which there are two different categories that are
classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs. If we want a model that can
accurately identify whether it is a cat or a dog, such a model can be created by using the SVM
algorithm. We will first train our model with lots of images of cats and dogs so that it can
learn the different features of cats and dogs, and then we test it with this strange creature. The
SVM creates a decision boundary between these two classes (cat and dog) and chooses the
extreme cases (support vectors) of cat and dog. On the basis of the support vectors, it will
classify the new example as a cat. Consider the below diagram.
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means if a
dataset can be classified into two classes by using a single straight line, then such data
is termed as linearly separable data, and classifier is used called as Linear SVM
classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separated data, which
means if a dataset cannot be classified by using a straight line, then such data is
termed as non-linear data and classifier used is called as Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find out the best decision boundary that helps to classify
the data points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the number of features present in the dataset,
which means if there are 2 features (as shown in the image), then the hyperplane will be a
straight line. And if there are 3 features, then the hyperplane will be a 2-dimensional plane.

We always create the hyperplane that has a maximum margin, which means the maximum
distance between the hyperplane and the nearest data points.

Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the position
of the hyperplane are termed as Support Vector. Since these vectors support the hyperplane,
hence called a Support vector.
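
A minimal scikit-learn sketch of a linear SVM on a toy, linearly separable dataset (the points
and labels below are made up for illustration):

# A minimal sketch (scikit-learn assumed): a linear SVM on hypothetical 2-D points.
from sklearn.svm import SVC

X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.support_vectors_)    # the extreme points that define the hyperplane
print(clf.predict([[4, 4]]))   # predicted class of a new data point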

Naïve Bayes Classifier Algorithm

o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes


theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training
dataset.
o Naïve Bayes Classifier is one of the simple and most effective Classification
algorithms which helps in building the fast machine learning models that can make
quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental
analysis, and classifying articles.

Why is it called Naïve Bayes?

The Naïve Bayes algorithm is comprised of two words Naïve and Bayes, Which can be
described as:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on
the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as
an apple. Hence each feature individually contributes to identifying it as an apple
without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem

Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = [ P(B|A) * P(A) ] / P(B)
Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

Working of Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the below example:

Suppose we have a dataset of weather conditions and corresponding target variable "Play".
So using this dataset we need to decide that whether we should play or not on a particular day
according to the weather conditions. So to solve this problem, we need to follow the below
steps:

1. Convert the given dataset into frequency tables.


2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, then the Player should play or not?

Solution: To solve this, first consider the below dataset:

Outlook Play

0 Rainy Yes

1 Sunny Yes

2 Overcast Yes

3 Overcast Yes

4 Sunny No
5 Rainy Yes

6 Sunny Yes

7 Overcast Yes

8 Rainy No

9 Sunny No

10 Sunny Yes

11 Rainy No

12 Overcast Yes

13 Overcast Yes

Frequency table for the Weather Conditions:

Weather      Yes    No
Overcast      5      0
Rainy         2      2
Sunny         3      2
Total        10      4

Likelihood table for the Weather Conditions:

Weather      No             Yes            P(Weather)
Overcast     0              5              5/14 = 0.35
Rainy        2              2              4/14 = 0.29
Sunny        2              3              5/14 = 0.35
All          4/14 = 0.29    10/14 = 0.71

Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)

P(Sunny|Yes) = 3/10 = 0.3

P(Sunny) = 0.35

P(Yes) = 0.71

So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)

P(Sunny|No) = 2/4 = 0.5

P(No) = 0.29

P(Sunny) = 0.35

So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
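
The same calculation can be reproduced with a few lines of Python; this minimal sketch mirrors the frequency and likelihood tables above (pandas is assumed to be installed):

# Minimal sketch: reproducing P(Yes|Sunny) from the weather/play table above
import pandas as pd

outlook = ['Rainy', 'Sunny', 'Overcast', 'Overcast', 'Sunny', 'Rainy', 'Sunny',
           'Overcast', 'Rainy', 'Sunny', 'Sunny', 'Rainy', 'Overcast', 'Overcast']
play = ['Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes']
df = pd.DataFrame({'Outlook': outlook, 'Play': play})

p_yes = (df['Play'] == 'Yes').mean()                 # P(Yes)   = 10/14
p_sunny = (df['Outlook'] == 'Sunny').mean()          # P(Sunny) = 5/14
p_sunny_given_yes = ((df['Outlook'] == 'Sunny') & (df['Play'] == 'Yes')).sum() / (df['Play'] == 'Yes').sum()

print(p_sunny_given_yes * p_yes / p_sunny)           # P(Yes|Sunny) = 0.6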

Advantages of Naïve Bayes Classifier:


o Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:


o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.

Applications of Naïve Bayes Classifier:


o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager
learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal distribution.
This means if predictors take continuous values instead of discrete, then the model
assumes that these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is
multinomially distributed. It is primarily used for document classification problems, i.e.,
deciding which category a particular document belongs to, such as Sports, Politics,
Education, etc.
The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the
predictor variables are independent Boolean variables, such as whether a particular word
is present or not in a document. This model is also well known for document
classification tasks.
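
These three variants map directly onto classes in scikit-learn (GaussianNB, MultinomialNB, BernoulliNB). The sketch below is illustrative only; the toy feature matrices are assumptions, not data from the notes:

# Minimal sketch: the three Naive Bayes variants in scikit-learn (toy data)
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

X_continuous = np.array([[1.2, 3.4], [2.1, 0.5], [0.3, 2.2], [3.3, 1.1]])   # real-valued features
X_counts = np.array([[2, 0, 1], [0, 3, 1], [1, 1, 0], [0, 0, 4]])           # word-count features
X_boolean = (X_counts > 0).astype(int)                                      # word present / absent
y = np.array([0, 1, 0, 1])

GaussianNB().fit(X_continuous, y)     # assumes features follow a normal distribution
MultinomialNB().fit(X_counts, y)      # frequency/count features, e.g. document classification
BernoulliNB().fit(X_boolean, y)       # independent Boolean features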

Module 4

What is Data Visualization?

Data visualization is the graphical representation of quantitative information and data using
visual elements like graphs, charts, and maps.

Data visualization converts large and small data sets into visuals that are easy for humans to
understand and process.

Data visualization tools provide accessible ways to understand outliers, patterns, and trends in
the data.

In the world of Big Data, the data visualization tools and technologies are required to analyze
vast amounts of information.

Data visualizations are common in everyday life and most often appear in the form of graphs
and charts. A combination of multiple visualizations and bits of information is referred to as
an infographic.

Data visualizations are used to discover unknown facts and trends. You can see visualizations
in the form of line charts to display change over time. Bar and column charts are useful for
observing relationships and making comparisons. A pie chart is a great way to show parts-of-
a-whole. And maps are the best way to share geographical data visually.

Today's data visualization tools go beyond the charts and graphs used in Microsoft Excel
spreadsheets, displaying data in more sophisticated ways such as dials and gauges, geographic
maps, heat maps, pie charts, and fever charts.

What makes Data Visualization Effective?

Effective data visualizations are created where communication, data science, and design
collide. Done right, data visualizations distill key insights from complicated data sets into
something meaningful and intuitive.

American statistician and Yale professor Edward Tufte believes useful data visualizations
consist of "complex ideas communicated with clarity, precision, and efficiency."

To craft an effective data visualization, you need to start with clean data that is well-sourced
and complete. After the data is ready to visualize, you need to pick the right chart.

After you have decided the chart type, you need to design and customize your visualization to
your liking. Simplicity is essential - you don't want to add any elements that distract from the
data.

Introduction to Types of Data Visualization

Data Visualization is defined as the pictorial representation of data to provide fact-based
analysis to decision-makers, since text alone might not reveal the patterns or trends needed to
understand the data. Based on the kind of visualization, it is classified into 6 different types:
Temporal (data is linear and one-dimensional), Hierarchical (visualizes ordered groups within
a larger group), Network (visualizes the connections between datasets), Multidimensional (in
contrast to the temporal type), Geospatial (involves geospatial or spatial maps), and
Miscellaneous.
What is Data Visualization?
Data visualization is a methodology by which data in raw format is portrayed to bring out its
meaning. With the advent of big data, it has become imperative to build a meaningful way of
showcasing data so that its sheer amount doesn't become overwhelming. Portraying the data
can serve various purposes, such as finding trends/commonalities/patterns in the data, building
models for machine learning, or supporting a simple operation like aggregation.

Different Types of Data Visualization


Data visualization is broadly classified into 6 different types. Though the area of data
visualization is ever-growing, it won’t be a surprise if the number of categories increases.

Temporal: Data for these types of visualization should satisfy both conditions: the data represented
should be linear and one-dimensional. These visualization types are represented through lines that
might overlap and have a common start and finish data point.

Scatter Plots: Use dots to represent data points. Most commonly used today in machine learning
during exploratory data analysis.

Pie Chart: This type of visualization uses circular graphics where the arc length signifies the
magnitude.

Polar area diagram: Like the pie chart, the polar area diagram is a circular plot, except that the
sector angles are equal and the distance a sector extends from the center signifies the magnitude.

Line graphs: Like the scatter plot, the data is represented by points, except that the points are
joined by lines to maintain continuity.

Timelines: Display a list of data points in chronological order of time.

Time series sequences: Represent the magnitude of the data in a 2-D graph in chronological order of
the timestamps in the data.

Hierarchical: These types of visualizations portray ordered groups within a larger group. In simple
language, the main intuition behind these visualizations is that clusters can be displayed if the
flow of the clusters starts from a single point.

Tree Diagram: The hierarchical flow is represented in the form of a tree, as the name suggests. A
few terminologies for this representation are:
– Root node: origination point.
– Child node: has a parent above it.
– Leaf node: has no more child nodes.

Ring Charts / Sunburst Diagram: The tree representation of the tree diagram is converted onto a
radial basis. This type helps in presenting the tree in a concise size. The innermost circle is the
root node, and the area of each child node signifies its share of the data.

TreeMap: The tree is represented in the form of closely packed rectangles. The area signifies the
quantity contained.

Circle Packing: Similar to a treemap, but it uses circular packing instead of rectangles.
Network: These visualizations connect datasets to datasets. They portray how these datasets relate
to one another within a network.

Matrix charts: This type of visualization is widely used to find the connections between different
variables, for example a correlation plot.

Alluvial diagrams: A type of flow diagram in which the changes in the flow of the network are
represented over intervals as desired by the user.

Word cloud: Typically used for representing text data. The words are closely packed, and the size
of the text signifies the frequency of the word.

Node-link diagrams: The nodes are represented as dots, and the connections between nodes are
presented as links.

Multidimensional: In contrast to the temporal type of visualization, these types can have multiple
dimensions. Here we can use 2 or more features to create a 3-D visualization through concurrent
layers. These enable the user to present key takeaways by breaking down a lot of non-useful data.

Scatter plots: In multi-dimensional data, we select any 2 features and plot them in a 2-D scatter
plot. Doing this for every pair of features gives nC2 = n(n-1)/2 graphs.

Stacked bar graphs: The representation stacks segment bars on top of each other. It can be either a
100% stacked bar graph, where the segregation is represented as percentages, or a simple stacked
bar graph, which denotes the actual magnitudes.

Parallel co-ordinate plot: In this representation, a backdrop is drawn and n parallel axes are drawn
(for n-dimensional data).

Geospatial: These visualizations relate data to real-life physical locations by overlaying it on
maps (geospatial or spatial maps). The intuition behind these visualizations is to create a holistic
view of performance.

Flow map: The movement of information or objects from one location to another is presented, where
the size of the arrow signifies the amount.

Choropleth Map: The geospatial map is colored on the basis of a particular data variable.

Cartogram: This type of representation uses a thematic variable for mapping. These maps distort
reality to present information, meaning the map is exaggerated along a particular variable; for
example, a spatial map distorted into a bee-hive structure.

Heat Map: Very similar to the choropleth in the geospatial genre, but heat maps can also be used in
areas other than geospatial data.

Miscellaneous: These visualizations can't be generalized into a particular large group, so instead
of forming smaller groups for each individual type, we group them as miscellaneous. A few examples
are below:

Open-High-Low-Close chart: This type of graph is typically used for stock price representation. An
increasing trend is called bullish and a decreasing trend bearish.

Kagi chart: Typically the demand and supply of an asset is represented using this chart.

Types of Encoding Techniques

The process of conversion of data from one form to another form is known as Encoding. It is
used to transform the data so that data can be supported and used by different systems.
Encoding works similarly to converting temperature from centigrade to Fahrenheit: the value is
just converted into another form, but the original information always remains the same. Encoding is
mainly used in two fields:

o Encoding in Electronics: In electronics, encoding refers to converting analog signals


to digital signals.
o Encoding in Computing: In computing, encoding is a process of converting data to
an equivalent cipher by applying specific code, letters, and numbers to the data.

Note: Encoding is different from encryption as its main purpose is not to hide the data but
to convert it into a format so that it can be properly consumed.

In this topic, we are going to discuss the different types of encoding techniques that are used
in computing.

Type of Encoding Technique

o Character Encoding
o Image & Audio and Video Encoding

Character Encoding

Character encoding encodes characters into bytes. It informs the computer how to interpret zeros
and ones as real characters, numbers, and symbols. The computer understands only
binary data; hence it is required to convert these characters into numeric codes. To achieve
this, each character is converted into binary code, and for this, text documents are saved with
encoding types. It can be done by pairing numbers with characters. If we don't apply
character encoding, our website will not display the characters and text in a proper format.
Hence it will decrease the readability, and the machine would not be able to process data
correctly. Further, character encoding makes sure that each character has a proper
representation in computer or binary format.

There are different types of Character Encoding techniques, which are given below:

1. HTML Encoding
2. URL Encoding
3. Unicode Encoding
4. Base64 Encoding
5. Hex Encoding
6. ASCII Encoding

HTML Encoding

HTML encoding is used to display an HTML page in a proper format. With encoding, a web
browser gets to know which character set is to be used.

In HTML, various characters such as < and > are used in the markup itself. To display these
characters as content, we need to encode them.
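
As an illustration, Python's standard-library html module performs this kind of encoding (this example is an assumption added here, not taken from the notes):

# Minimal sketch: HTML-encoding markup characters with Python's standard library
import html

print(html.escape('<b>5 > 3 & 2 < 4</b>'))   # &lt;b&gt;5 &gt; 3 &amp; 2 &lt; 4&lt;/b&gt;
print(html.unescape('&lt;b&gt;'))            # <b>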

URL Encoding

URL (Uniform resource locator) Encoding is used to convert characters in such a format
that they can be transmitted over the internet. It is also known as percent-encoding. The
URL Encoding is performed to send the URL to the internet using the ASCII character-set.
Non-ASCII characters are replaced with a %, followed by the hexadecimal digits.
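
A minimal sketch of percent-encoding using Python's urllib (the example string is illustrative):

# Minimal sketch: URL (percent) encoding with Python's standard library
from urllib.parse import quote, unquote

encoded = quote('data science & ML')
print(encoded)            # data%20science%20%26%20ML
print(unquote(encoded))   # data science & ML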

UNICODE Encoding

Unicode is an encoding standard for a universal character set. It allows encoding, representing,
and handling text written in most of the languages and writing systems available worldwide. It
provides a code point, or number, for each character in every supported language and can
represent almost all characters of all languages. A particular sequence of bits is known as a
coding unit.

A UNICODE standard can use 8, 16, or 32 bits to represent the characters.

The Unicode standard defines Unicode Transformation Format (UTF) to encode the code
points.

UNICODE Encoding standard has the following UTF schemes:

o UTF-8 Encoding
UTF-8 is a variable-width character encoding defined by the Unicode standard and widely
used in electronic communication. UTF-8 can encode all 1,112,064 valid code points in
Unicode using one to four one-byte (8-bit) code units.
o UTF-16 Encoding
UTF-16 represents a character's code point using one or two 16-bit code units.
o UTF-32 Encoding
UTF-32 represents each code point as a 32-bit integer.
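
The difference between the schemes can be seen by encoding the same string in Python; a minimal sketch (the sample text is illustrative):

# Minimal sketch: the same text under different UTF schemes
text = 'Hi \u2602'                 # includes a non-ASCII code point (umbrella symbol)

print(text.encode('utf-8'))        # variable width: one to four bytes per code point
print(text.encode('utf-16'))       # one or two 16-bit code units per code point (plus BOM)
print(text.encode('utf-32'))       # a fixed 32 bits per code point (plus BOM)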
Base64 Encoding

Base64 Encoding is used to encode binary data into equivalent ASCII Characters. The
Base64 encoding is used in the Mail system as mail systems such as SMTP can't work with
binary data because they accept ASCII textual data only. It is also used in simple HTTP
authentication to encode the credentials. Moreover, it is also used to transfer the binary data
into cookies and other parameters to make data unreadable to prevent tampering. If an image
or another file is transferred without Base64 encoding, it will get corrupted as the mail system
is not able to deal with binary data.

Base64 represents the data in blocks of 3 bytes, where each byte contains 8 bits; hence each
block represents 24 bits. These 24 bits are divided into four groups of 6 bits, and each group is
converted into its equivalent Base64 character.
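
A minimal sketch with Python's standard library illustrates the 3-byte-to-4-character grouping described above:

# Minimal sketch: Base64 encoding and decoding with Python's standard library
import base64

raw = b'Man'                        # 3 bytes = 24 bits
encoded = base64.b64encode(raw)     # 24 bits -> four 6-bit groups -> 4 Base64 characters
print(encoded)                      # b'TWFu'
print(base64.b64decode(encoded))    # b'Man'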

ASCII Encoding

American Standard Code for Information Interchange (ASCII) is a type of character-


encoding. It was the first character encoding standard released in the year 1963.

The ASCII code is used to represent English characters as numbers, where each letter is
assigned a number from 0 to 127. Most modern character-encoding schemes are based on
ASCII, though they support many additional characters. It is a single-byte encoding that uses
only the bottom 7 bits. In an ASCII file, each alphabetic, numeric, or special character is
represented with a 7-bit binary number, and each character on the keyboard has an equivalent
ASCII value.
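
Python's built-in ord() and chr() expose these numeric codes directly; a minimal sketch:

# Minimal sketch: ASCII codes of a few keyboard characters
print(ord('A'), ord('a'), ord('0'))   # 65 97 48
print(chr(65), chr(97), chr(48))      # A a 0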

Image and Audio & Video Encoding

Image and audio & video encoding are performed to save storage space. A media file such as
image, audio, and video are encoded to save them in a more efficient and compressed format.

These encoded files contain the same content with usually similar quality, but in compressed
size, so that they can be saved within less space, can be transferred easily via mail, or can be
downloaded on the system.

For example, a .WAV audio file can be converted into an .MP3 file to reduce its size to roughly
1/10th of the original.

Retinal Variables
The most fundamental choice in any data visualization project is how your real-world values
will be translated into marks on the page or screen. In this exercise we’ll be encoding
an extremely simple data set repeatedly in order to exhaustively catalog the different ways a
handful of numbers can be represented.

Refer to Jacques Bertin's cheat sheet from The Semiology of Graphics for all the ways
quantitative and qualitative data can be encoded visually:
Table of content

• What is Categorical Data?


• Label Encoding or Ordinal Encoding
• One hot Encoding
• Dummy Encoding
• Effect Encoding
• Binary Encoding
• BaseN Encoding
• Hash Encoding
• Target Encoding

What is categorical data?

Since we are going to be working on categorical variables in this article, here is a quick

refresher on the same with a couple of examples. Categorical variables are usually

represented as ‘strings’ or ‘categories’ and are finite in number. Here are a few examples:

1. The city where a person lives: Delhi, Mumbai, Ahmedabad, Bangalore, etc.
2. The department a person works in: Finance, Human resources, IT, Production.
3. The highest degree a person has: High school, Diploma, Bachelors, Masters, PhD.
4. The grades of a student: A+, A, B+, B, B- etc.

In the above examples, the variables only have definite possible values. Further, we can see

there are two kinds of categorical data-

• Ordinal Data: The categories have an inherent order


• Nominal Data: The categories do not have an inherent order
In Ordinal data, while encoding, one should retain the information regarding the order in

which the category is provided. Like in the above example the highest degree a person

possesses, gives vital information about his qualification. The degree is an important feature

to decide whether a person is suitable for a post or not.

While encoding Nominal data, we have to consider the presence or absence of a feature. In

such a case, no notion of order is present. For example, the city a person lives in. For the data,

it is important to retain where a person lives. Here, We do not have any order or sequence. It

is equal if a person lives in Delhi or Bangalore.

For encoding categorical data, we have a python package category_encoders. The following

code helps you install easily.

pip install category_encoders

Label Encoding or Ordinal Encoding

We use this categorical data encoding technique when the categorical feature is ordinal. In

this case, retaining the order is important. Hence encoding should reflect the sequence.

In Label encoding, each label is converted into an integer value. We will create a variable that

contains the categories representing the education qualification of a person.

Python Code: once an encoder object has been created (a complete sketch follows below), the
training data is encoded with a single fit-and-transform call:

# Fit and transform train data
df_train_transformed = encoder.fit_transform(train_df)
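
A complete, minimal sketch of ordinal encoding with category_encoders is given below; the Degree values and the explicit ordering in the mapping are illustrative assumptions:

# Minimal sketch: Label/Ordinal Encoding with category_encoders (illustrative data)
import category_encoders as ce
import pandas as pd

train_df = pd.DataFrame({'Degree': ['High school', 'Masters', 'Diploma',
                                    'Bachelors', 'Bachelors', 'Masters',
                                    'PhD', 'High school']})

# Ordinal encoder with an explicit ordering of the education levels
encoder = ce.OrdinalEncoder(cols=['Degree'], return_df=True,
                            mapping=[{'col': 'Degree',
                                      'mapping': {'High school': 1, 'Diploma': 2,
                                                  'Bachelors': 3, 'Masters': 4, 'PhD': 5}}])

# Fit and transform train data
df_train_transformed = encoder.fit_transform(train_df)
print(df_train_transformed)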

One Hot Encoding

We use this categorical data encoding technique when the features are nominal(do not have

any order). In one hot encoding, for each level of a categorical feature, we create a new

variable. Each category is mapped with a binary variable containing either 0 or 1. Here, 0

represents the absence, and 1 represents the presence of that category.

These newly created binary features are known as Dummy variables. The number of dummy

variables depends on the levels present in the categorical variable. This might sound

complicated. Let us take an example to understand this better. Suppose we have a dataset

with a category animal, having different animals like Dog, Cat, Sheep, Cow, Lion. Now we

have to one-hot encode this data.

After encoding, in the second table, we have dummy variables each representing a category

in the feature Animal. Now for each category that is present, we have 1 in the column of that

category and 0 for the others. Let’s see how to implement a one-hot encoding in python.
import category_encoders as ce
import pandas as pd

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore',
                              'Delhi', 'Hyderabad', 'Bangalore', 'Delhi']})

# Create object for one-hot encoding
encoder = ce.OneHotEncoder(cols='City', handle_unknown='return_nan',
                           return_df=True, use_cat_names=True)

# Original Data
data

# Fit and transform Data
data_encoded = encoder.fit_transform(data)
data_encoded

Now let's move to another very interesting and widely used encoding technique, i.e., Dummy
encoding.
Dummy Encoding

Dummy coding scheme is similar to one-hot encoding. This categorical data encoding

method transforms the categorical variable into a set of binary variables (also known as

dummy variables). In the case of one-hot encoding, for N categories in a variable, it uses N

binary variables. The dummy encoding is a small improvement over one-hot-encoding.

Dummy encoding uses N-1 features to represent N labels/categories.

To understand this better let’s see the image below. Here we are coding the same data using

both one-hot encoding and dummy encoding techniques. While one-hot uses 3 variables to

represent the data whereas dummy encoding uses 2 variables to code 3 categories.

Let us implement it in python.

import category_encoders as ce
import pandas as pd

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai',
                              'Bangalore', 'Delhi', 'Hyderabad']})

# Original Data
data

# Encode the data (dummy encoding drops the first level)
data_encoded = pd.get_dummies(data=data, drop_first=True)
data_encoded

Here, using the drop_first argument, the first label (Bangalore) is represented by all zeros.

Drawbacks of One-Hot and Dummy Encoding

One hot encoder and dummy encoder are two powerful and effective encoding schemes.

They are also very popular among data scientists, but they may not be as effective when:

1. A large number of levels are present in the data. If there are many categories in a feature
variable, we need a similar number of dummy variables to encode the data. For example, a
column with 30 different values will require 30 new variables for coding.
2. If we have multiple categorical features in the dataset, a similar situation will occur and we
will again end up with several binary features, each representing a categorical feature and
its multiple categories, e.g. a dataset having 10 or more categorical columns.

In both of the above cases, these two encoding schemes introduce sparsity in the dataset, i.e.,
several columns contain 0s and only a few contain 1s. In other words, they create multiple
dummy features in the dataset without adding much information.


Also, they might lead to a Dummy variable trap. It is a phenomenon where features are

highly correlated. That means using the other variables, we can easily predict the value of a

variable.

Due to the massive increase in the size of the dataset, these encodings slow down the learning of
the model and deteriorate overall performance, ultimately making the model computationally
expensive. Further, these encodings are not an optimal choice for tree-based models.

Effect Encoding:

This encoding technique is also known as Deviation Encoding or Sum Encoding. Effect

encoding is almost similar to dummy encoding, with a little difference. In dummy coding, we

use 0 and 1 to represent the data but in effect encoding, we use three values i.e. 1,0, and -1.

The row containing only 0s in dummy encoding is encoded as -1 in effect encoding. In the

dummy encoding example, the city Bangalore at index 4 was encoded as 0000. Whereas in

effect encoding it is represented by -1-1-1-1.

Let us see how we implement it in python-

import category_encoders as ce
import pandas as pd

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai',
                              'Bangalore', 'Delhi', 'Hyderabad']})

# Create object for sum (effect) encoding
encoder = ce.sum_coding.SumEncoder(cols='City', verbose=False)

# Original Data
data

# Fit and transform Data
encoder.fit_transform(data)

Effect encoding is an advanced technique. In case you are interested to know more about

effect encoding, refer to this interesting paper.

Hash Encoder

To understand Hash encoding it is necessary to know about hashing. Hashing is the

transformation of arbitrary size input in the form of a fixed-size value. We use hashing

algorithms to perform hashing operations i.e to generate the hash value of an input. Further,

hashing is a one-way process, in other words, one can not generate original input from the

hash representation.

Hashing has several applications like data retrieval, checking data corruption, and in data

encryption also. We have multiple hash functions available for example Message Digest

(MD, MD2, MD5), Secure Hash Function (SHA0, SHA1, SHA2), and many more.
Just like one-hot encoding, the hash encoder represents categorical features using new
dimensions. Here, the user can fix the number of dimensions after transformation using the
n_components argument. Here is what I mean: a feature with 5 categories can be represented
using N new features, and similarly, a feature with 100 categories can also be transformed using
N new features. Doesn't this sound amazing?

By default, the Hashing encoder uses the md5 hashing algorithm but a user can pass any

algorithm of his choice. If you want to explore the md5 algorithm, I suggest this paper.

import category_encoders as ce
import pandas as pd

# Create the dataframe
data = pd.DataFrame({'Month': ['January', 'April', 'March', 'April', 'February',
                               'June', 'July', 'June', 'September']})

# Create object for hash encoder
encoder = ce.HashingEncoder(cols='Month', n_components=6)

# Fit and Transform Data
encoder.fit_transform(data)
Since hashing transforms the data into fewer dimensions, it may lead to loss of information.
Another issue faced by the hashing encoder is collision: since a large number of categories are
mapped into a smaller number of dimensions, multiple values can be represented by the same hash
value. This is known as a collision.

Moreover, hashing encoders have been very successful in some Kaggle competitions. It is

great to try if the dataset has high cardinality features.

Binary Encoding

Binary encoding is a combination of Hash encoding and one-hot encoding. In this encoding

scheme, the categorical feature is first converted into numerical using an ordinal encoder.

Then the numbers are transformed in the binary number. After that binary value is split into

different columns.

Binary encoding works really well when there are a high number of categories. For example

the cities in a country where a company supplies its products.

# Import the libraries
import category_encoders as ce
import pandas as pd

# Create the Dataframe
data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore',
                              'Delhi', 'Hyderabad', 'Mumbai', 'Agra']})

# Create object for binary encoding
encoder = ce.BinaryEncoder(cols=['City'], return_df=True)

# Original Data
data

# Fit and Transform Data
data_encoded = encoder.fit_transform(data)
data_encoded
Binary encoding is a memory-efficient encoding scheme as it uses fewer features than one-

hot encoding. Further, It reduces the curse of dimensionality for data with high cardinality.

Base N Encoding

Before diving into BaseN encoding let’s first try to understand what is Base here?

In the numeral system, the Base or the radix is the number of digits or a combination of digits

and letters used to represent the numbers. The most common base we use in our life is 10 or

decimal system as here we use 10 unique digits i.e 0 to 9 to represent all the numbers.

Another widely used system is binary i.e. the base is 2. It uses 0 and 1 i.e 2 digits to express

all the numbers.

For binary encoding, the base is 2, which means it converts the numerical value of a category
into its respective binary form. If you want to change the base of the encoding scheme, you may
use the Base N encoder. When there are many categories and binary encoding cannot handle the
dimensionality, we can use a larger base such as 4 or 8.

# Import the libraries
import category_encoders as ce
import pandas as pd

# Create the dataframe
data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore',
                              'Delhi', 'Hyderabad', 'Mumbai', 'Agra']})

# Create an object for Base N Encoding
encoder = ce.BaseNEncoder(cols=['City'], return_df=True, base=5)

# Original Data
data

# Fit and Transform Data
data_encoded = encoder.fit_transform(data)
data_encoded

In the above example, I have used base 5 also known as the Quinary system. It is similar to

the example of Binary encoding. While Binary encoding represents the same data by 4 new

features the BaseN encoding uses only 3 new variables.

Hence the BaseN encoding technique further reduces the number of features required to
efficiently represent the data and improves memory usage. The default base for BaseN is 2,
which is equivalent to binary encoding.

Target Encoding

Target encoding is a Bayesian encoding technique.

Bayesian encoders use information from the dependent/target variable to encode the categorical
data.

In target encoding, we calculate the mean of the target variable for each category and replace
the category with that mean value. In the case of a categorical target variable, the posterior
probability of the target replaces each category.

# Import the libraries
import pandas as pd
import category_encoders as ce

# Create the Dataframe
data = pd.DataFrame({'class': ['A', 'B', 'C', 'B', 'C', 'A', 'A', 'A'],
                     'Marks': [50, 30, 70, 80, 45, 97, 80, 68]})

# Create target encoding object
encoder = ce.TargetEncoder(cols='class')

# Original Data
data

# Fit and Transform Train Data
encoder.fit_transform(data['class'], data['Marks'])
We perform target encoding on the train data only and encode the test data using the results
obtained from the training dataset. Although it is a very efficient encoding scheme, it has the
following issues, which can deteriorate model performance:

1. It can lead to target leakage or overfitting. To address overfitting we can use different
techniques (a leave-one-out sketch is given after this list):
1. In leave-one-out encoding, the current row's target value is excluded from the mean of
the target to avoid leakage.
2. In another method, we may introduce some Gaussian noise into the target statistics. The
amount of noise is a hyperparameter of the model.
2. The second issue we may face is an improper distribution of categories between the train and
test data. In such a case, the categories may assume extreme values, so the target mean for
each category is mixed with the marginal mean of the target.
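
The leave-one-out variant mentioned in point 1.1 is available in category_encoders as LeaveOneOutEncoder; the following minimal sketch reuses the illustrative class/Marks data from the target-encoding example:

# Minimal sketch: leave-one-out target encoding to reduce target leakage
import category_encoders as ce
import pandas as pd

data = pd.DataFrame({'class': ['A', 'B', 'C', 'B', 'C', 'A', 'A', 'A'],
                     'Marks': [50, 30, 70, 80, 45, 97, 80, 68]})

# Each row's own target value is left out when computing its category mean
encoder = ce.LeaveOneOutEncoder(cols=['class'])
print(encoder.fit_transform(data['class'], data['Marks']))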

Unit 5
Major Applications of Data Science
Data Science is the deep study of a large quantity of data, which involves extracting meaningful
insights from raw, structured, and unstructured data. Extracting meaningful insights from large
amounts of data requires processing, which can be done using statistical techniques and
algorithms, scientific techniques, different technologies, etc. Data Science uses various tools
and techniques to extract meaningful information from raw data. It is also known as the future
of Artificial Intelligence.
For example, Jagroop loves to read books, but every time he wants to buy some, he is confused
about which book to choose because there are so many options in front of him. Data science
techniques are useful here: when he opens Amazon, he gets product recommendations based on his
previous data, and when he chooses one of them, he also gets a recommendation to buy a set of
books that are frequently bought together. Product recommendations and showing sets of books
purchased together are examples of data science.
Applications of Data Science
1. In Search Engines
The most useful application of Data Science is in search engines. When we want to search for
something on the internet, we mostly use search engines like Google, Yahoo, Bing, etc., and Data
Science is used to make these searches faster and more relevant.
For example, when we search for something like "Data Structure and algorithm courses", the first
link shown is often a GeeksforGeeks Courses link. This happens because the GeeksforGeeks website
is visited most often for information regarding Data Structure courses and computer-related
subjects. This analysis is done using Data Science, and the most-visited web links appear at the
top.
2. In Transport
Data Science has also entered the transport field, for example with driverless cars. With the
help of driverless cars, it becomes easier to reduce the number of accidents.
For example, in driverless cars the training data is fed into the algorithm, and with the help of
data science techniques the data is analyzed, such as the speed limit on highways, busy streets,
and narrow roads, and how to handle different situations while driving.
3. In Finance
Data Science plays a key role in the financial industry, which always faces issues of fraud and
risk of losses. Financial companies therefore need to automate risk-of-loss analysis in order to
carry out strategic decisions. They also use data science analytics tools to predict the future,
which allows them to predict customer lifetime value and stock market moves.
For example, in the stock market, data science is used to examine past behavior with past data in
order to estimate future outcomes. Data is analyzed in such a way that it becomes possible to
predict future stock prices over a set timetable.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use data science to create a better user
experience with personalized recommendations.
For example, when we search for something on an e-commerce website, we get suggestions similar to
our past choices and also recommendations based on the most bought, most rated, and most searched
products. This is all done with the help of data science.
5. In Health Care
In the Healthcare Industry data science act as a boon. Data Science is used for:
• Detecting Tumor.
• Drug discoveries.
• Medical Image Analysis.
• Virtual Medical Bots.
• Genetics and Genomics.
• Predictive Modeling for Diagnosis etc.
6. Image Recognition
Currently, data science is also used in image recognition. For example, when we upload an image
with a friend on Facebook, Facebook suggests tagging who is in the picture. This is done with the
help of machine learning and data science. When an image is recognized, data analysis is done on
one's Facebook friends, and if a face present in the picture matches someone's profile, Facebook
suggests auto-tagging.
7. Targeting Recommendation
Targeted recommendation is one of the most important applications of data science. Whatever a
user searches for on the internet, he/she will then see related posts everywhere. This can be
explained with an example: suppose I want a mobile phone, so I search for it on Google, and after
that I change my mind and decide to buy offline. Data science helps the companies that pay for
advertisements for that phone, so everywhere on the internet, in social media, on websites, and in
apps, I will see recommendations for the mobile phone I searched for, which nudges me to buy it
online.
8. Airline Routing Planning
With the help of data science, the airline sector is also growing; for example, it becomes easier
to predict flight delays. It also helps to decide whether to fly directly to the destination or
take a halt in between, such as a flight having a direct route from Delhi to the U.S.A. or halting
in between before reaching the destination.
9. Data Science in Gaming
In most games where a user plays against a computer opponent, data science concepts are used
together with machine learning, so that with the help of past data the computer improves its
performance. Many games, such as Chess and EA Sports titles, use data science concepts.
10. Medicine and Drug Development
The process of creating medicine is very difficult and time-consuming and has to be done with
full discipline because someone's life is at stake. Without data science, it takes a lot of time,
resources, and money to develop a new medicine or drug, but with data science it becomes easier
because the probability of success can be estimated from biological data and other factors.
Algorithms based on data science can forecast how a compound will react in the human body without
lab experiments.
11. In Delivery Logistics
Various Logistics companies like DHL, FedEx, etc. make use of Data Science. Data Science
helps these companies to find the best route for the Shipment of their Products, the best time
suited for delivery, the best mode of transport to reach the destination, etc.
12. Autocomplete
The autocomplete feature is an important application of data science, where the user types only a
few letters or words and gets a suggestion to complete the whole line. In Gmail, for instance,
when we write a formal mail, the data science concept of autocomplete suggests an efficient way to
complete the whole sentence. The autocomplete feature is also widely used in search engines,
social media, and various apps.

Python – Data visualization using Bokeh


Bokeh is a data visualization library in Python that provides high-performance interactive
charts and plots. Bokeh output can be obtained in various mediums like notebook, html and
server. It is possible to embed bokeh plots in Django and flask apps.

Bokeh provides two visualization interfaces to users:


bokeh.models : A low level interface that provides high flexibility to application developers.
bokeh.plotting : A high level interface for creating visual glyphs.
To install bokeh package, run the following command in the terminal:
pip install bokeh
The dataset used for generating bokeh graphs is collected from Kaggle.
Code #1: Scatter Markers
To create scatter circle markers, circle() method is used.

# import modules
from bokeh.plotting import figure, output_notebook, show

# output to notebook
output_notebook()

# create figure
p = figure(plot_width=400, plot_height=400)

# add a circle renderer with size, color and alpha
p.circle([1, 2, 3, 4, 5], [4, 7, 1, 6, 3],
         size=10, color="navy", alpha=0.5)

# show the results
show(p)
Output :

Code #2: Single line


To create a single line, line() method is used.

# import modules
from bokeh.plotting import figure, output_notebook, show

# output to notebook
output_notebook()

# create figure
p = figure(plot_width=400, plot_height=400)

# add a line renderer
p.line([1, 2, 3, 4, 5], [3, 1, 2, 6, 5],
       line_width=2, color="green")

# show the results
show(p)

Output :

Code #3: Bar Chart


Bar chart presents categorical data with rectangular bars. The length of the bar is proportional
to the values that are represented.

# import necessary modules
import pandas as pd
from bokeh.charts import Bar, output_notebook, show

# output to notebook
output_notebook()

# read data in dataframe
df = pd.read_csv(r"D:/kaggle/mcdonald/menu.csv")

# create bar chart
p = Bar(df, "Category", values="Calories",
        title="Total Calories by Category",
        legend="top_right")

# show the results
show(p)

Output :

Code #4: Box Plot


Box plot is used to represent statistical data on a plot. It helps to summarize statistical
properties of various data groups present in the data.

# import necessary modules
from bokeh.charts import BoxPlot, output_notebook, show
import pandas as pd

# output to notebook
output_notebook()

# read data in dataframe
df = pd.read_csv(r"D:/kaggle/mcdonald/menu.csv")

# create box plot
p = BoxPlot(df, values="Protein", label="Category",
            color="yellow", title="Protein Summary (grouped by category)",
            legend="top_right")

# show the results
show(p)

Output :
Code #5: Histogram
Histogram is used to represent distribution of numerical data. The height of a rectangle in a
histogram is proportional to the frequency of values in a class interval.

# import necessary modules
from bokeh.charts import Histogram, output_notebook, show
import pandas as pd

# output to notebook
output_notebook()

# read data in dataframe
df = pd.read_csv(r"D:/kaggle/mcdonald/menu.csv")

# create histogram
p = Histogram(df, values="Total Fat",
              title="Total Fat Distribution",
              color="navy")

# show the results
show(p)
Output :

Code #6: Scatter plot


Scatter plot is used to plot values of two variables in a dataset. It helps to find correlation
among the two variables that are selected.

# import necessary modules
from bokeh.charts import Scatter, output_notebook, show
import pandas as pd

# output to notebook
output_notebook()

# read data in dataframe
df = pd.read_csv(r"D:/kaggle/mcdonald/menu.csv")

# create scatter plot
p = Scatter(df, x="Carbohydrates", y="Saturated Fat",
            title="Saturated Fat vs Carbohydrates",
            xlabel="Carbohydrates", ylabel="Saturated Fat",
            color="orange")

# show the results
show(p)

Output :

Unit 6
Top 10 Data Analytics Trends For 2022
In today's market, data is driving organizations in countless ways. Data Science, Big Data
Analytics, and Artificial Intelligence are the key trends in today's accelerating market.
As more organizations are adopting data-driven models to
streamline their business processes, the data analytics industry is seeing humongous growth.
From fueling fact-based decision-making to adopting data-driven models to expanding data-
focused product offerings, organizations are inclining more towards data analytics.
These progressing data analytics trends can help organizations deal with many changes
and uncertainties. So, let’s take a look at a few of these Data Analytics trends that are
becoming an inherent part of the industry.

Trend 1: Smarter and Scalable Artificial Intelligence

COVID-19 has changed the business landscape in myriad ways and historical data is no
more relevant. So, in place of traditional AI techniques, arriving in the market are some
scalable and smarter Artificial Intelligence and Machine Learning techniques that can
work with small data sets. These systems are highly adaptive, protect privacy, are much
faster, and also provide a faster return on investment. The combination of AI and Big
data can automate and reduce most of the manual tasks.

Trend 2: Agile and Composed Data & Analytics

Agile data and analytics models are capable of digital innovation, differentiation, and
growth. The goal of edge and composable data analytics is to provide a user-friendly,
flexible, and smooth experience using multiple data analytics, AI, and ML solutions. This
will not only enable leaders to connect business insights and actions but also, encourage
collaboration, promote productivity, agility and evolve the analytics capabilities of the
organization.

Trend 3: Hybrid Cloud Solutions and Cloud Computing

One of the biggest data trends for 2022 is the increase in the use of hybrid cloud
services and cloud computation. Public clouds are cost-effective but do not provide high
security whereas a private cloud is secure but more expensive. Hence, a hybrid cloud is a
balance of both a public cloud and a private cloud where cost and security are balanced to
offer more agility. This is achieved by using artificial intelligence and machine learning.
Hybrid clouds are bringing change to organizations by offering a centralized database,
data security, scalability of data, and much more at such a cheaper cost.

Trend 4: Data Fabric

A data fabric is a powerful architectural framework and set of data services that standardize
data management practices and consistent capabilities across hybrid multi-cloud
environments. With the current accelerating business trend as data becomes more
complex, more organizations will rely on this framework since this technology can reuse
and combine different integration styles, data hub skills, and technologies. It also reduces
design, deployment, and maintenance time by 30%, 30%, and 70%, respectively, thereby
reducing the complexity of the whole system. By 2026, it will be highly adopted as a re-
architect solution in the form of an IaaS (Infrastructure as a Service) platform.

Trend 5: Edge Computing For Faster Analysis

There are many big data analytics tools available in the market, but the problem of processing
enormous amounts of data persists. This has led to the development of the concept of quantum
computing. By applying the laws of quantum mechanics, quantum computation speeds up the
processing of enormous amounts of data, using less bandwidth while also offering better security
and data privacy. This goes beyond classical computing: decisions can be taken using the quantum
bits of a processor called Sycamore, which can solve a problem in just 200 seconds.
However, Edge Computing will need a lot of fine-tuning before it can be significantly
adopted by organizations. Nevertheless, with the accelerating market trend, it will soon
make its presence felt and become an integral part of business processes.
Trend 6: Augmented Analytics

Augmented Analytics is another leading business analytics trend in today’s corporate


world. This is a concept of data analytics that uses Natural Language Processing,
Machine Learning, and Artificial Intelligence to automate and enhance data analytics,
data sharing, business intelligence, and insight discovery.
From assisting with data preparation to automating and processing data and deriving
insights from it, Augmented Analytics is now doing the work of a Data Scientist. Data
within the enterprise and outside the enterprise can also be combined with the help of
augmented analytics and it makes the business processes relatively easier.

Trend 7: The Death of Predefined Dashboards

Earlier businesses were restricted to predefined static dashboards and manual data
exploration restricted to data analysts or citizen data scientists. But it seems dashboards
have outlived their utility due to the lack of their interactivity and user-friendliness.
Questions are being raised about the utility and ROI of dashboards, leading organizations
and business users to look for solutions that will enable them to explore data on their own
and reduce maintenance costs.
It seems static dashboards will slowly be replaced by modern, automated, and dynamic BI tools
that present insights customized to a user's needs and delivered to their point of consumption.

Trend 8: XOps

XOps has become a crucial part of business transformation processes with the adoption of
Artificial Intelligence and Data Analytics across organizations. XOps started with DevOps, a
combination of development and operations, and its goal is to improve business operations,
efficiencies, and customer experiences by using the best practices of DevOps. It aims to ensure
reliability, reusability, and repeatability, and to reduce the duplication of technology and
processes. Overall, the primary aim of XOps is to enable economies of scale and help
organizations drive business value by delivering a flexible design and agile orchestration in
affiliation with other software disciplines.

Trend 9: Engineered Decision Intelligence

Decision intelligence is gaining a lot of attention in today’s market. It includes a wide


range of decision-making and enables organizations to more quickly gain insights needed
to drive actions for the business. It also includes conventional analytics, AI, and complex
adaptive system applications. When combined with composability and common data fabric,
engineering decision intelligence has great potential to help organizations rethink how they
optimize decision-making. In other words, engineered decision analytics is not made to
replace humans, rather it can help to augment decisions taken by humans.
Trend 10: Data Visualization

With evolving market trends and business intelligence, data visualization has quickly captured
the market. Data visualization is often described as the last mile of the analytics process and
assists enterprises in perceiving vast chunks of complex data. It has made it easier for companies
to make decisions using visually interactive methods. It influences the methodology of analysts by
allowing data to be observed and presented in the form of patterns, charts, graphs, etc. Since the
human brain interprets and remembers visuals better, it is a great way to anticipate future trends
for the firm.

What is Data Visualization?

Data visualization is defined as a graphical representation that contains

the information and the data.

By using visual elements like charts, graphs, and maps, data visualization techniques

provide an accessible way to see and understand trends, outliers, and patterns in data.

In modern days we have a lot of data in our hands i.e, in the world of Big Data, data

visualization tools, and technologies are crucial to analyze massive amounts of information

and make data-driven decisions.

It is used in many areas such as:

• To model complex events.


• Visualize phenomenons that cannot be observed directly, such as weather patterns, medical
conditions, or mathematical relationships.

Benefits of Good Data Visualization

Since our eyes are drawn to colors and patterns, we can quickly tell red from blue and a square
from a circle. Our culture is visual, including everything from art and advertisements to TV and
movies.

So, Data visualization is another technique of visual art that grabs our interest and keeps our

main focus on the message captured with the help of eyes.

Whenever we visualize a chart, we quickly identify the trends and outliers present in the

dataset.
The basic uses of the Data Visualization technique are as follows:

• It is a powerful technique to explore the data with presentable and interpretable results.
• In the data mining process, it acts as a primary step in the pre-processing portion.
• It supports the data cleaning process by finding incorrect data and corrupted or missing
values.
• It also helps to construct and select variables, which means we have to determine which
variable to include and discard in the analysis.
• In the process of Data Reduction, it also plays a crucial role while combining the categories.


Different Types of Analysis for Data Visualization

Mainly, there are three different types of analysis for Data Visualization:

Univariate Analysis: In the univariate analysis, we will be using a single feature to analyze

almost all of its properties.

Bivariate Analysis: When we compare the data between exactly 2 features then it is known

as bivariate analysis.

Multivariate Analysis: In the multivariate analysis, we will be comparing more than 2

variables.

NOTE:

In this article, our main goal is to understand the following concepts:


• How do find some inferences from the data visualization techniques?
• In which condition, which technique is more useful than others?

We are not going to deep dive into the coding/implementation of the different techniques on a
particular dataset; instead, we try to answer the above questions and understand only the snippet
code, with the help of sample plots, for each of the data visualization techniques.

Now, let’s started with the different Data Visualization techniques:

Univariate Analysis Techniques for Data Visualization


1. Distribution Plot

• It is one of the best univariate plots to know about the distribution of data.
• When we want to analyze the impact on the target variable(output) with respect to an
independent variable(input), we use distribution plots a lot.
• This plot gives us a combination of both probability density functions(pdf) and histogram in a
single plot.

Implementation:

• The distribution plot is present in the Seaborn package.

The code snippet is as follows:

Python Code:
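
A minimal sketch of such a distribution plot with Seaborn is shown below, assuming the data is the Haberman survival dataset loaded into a dataframe hb with the columns Age, op_yr, axil_nodes, and SurvStat (the same column names used later in this section); the file path is an assumption:

# Minimal sketch: distribution plot (histogram + KDE) of Age, colored by class label
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed file and column names for the Haberman survival dataset
hb = pd.read_csv('haberman.csv', names=['Age', 'op_yr', 'axil_nodes', 'SurvStat'])

sns.displot(data=hb, x='Age', hue='SurvStat', kde=True)
plt.show()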

Some conclusions inferred from the above distribution plot:


From the above distribution plot we can conclude the following observations:

• We have observed that we created a distribution plot on the feature ‘Age’(input variable) and
we used different colors for the Survival status(output variable) as it is the class to be
predicted.
• There is a huge overlapping area between the PDFs for different combinations.
• In this plot, the sharp block-like structures are called histograms, and the smoothed curve is
known as the Probability density function(PDF).

NOTE:

The Probability density function(PDF) of a curve can help us to capture the underlying

distribution of that feature which is one major takeaway from Data visualization or

Exploratory Data Analysis(EDA).

2. Box and Whisker Plot

• This plot can be used to obtain more statistical details about the data.
• The straight lines at the maximum and minimum are also called whiskers.
• Points that lie outside the whiskers will be considered as an outlier.
• The box plot also gives us a description of the 25th, 50th, and 75th percentiles (quartiles).
• With the help of a box plot, we can also determine the Interquartile range(IQR) where
maximum details of the data will be present. Therefore, it can also give us a clear idea about
the outliers in the dataset.

Fig. General Diagram for a Box-plot

Implementation:

• Boxplot is available in the Seaborn library.


• Here x is considered as the dependent variable and y is considered as the independent
variable. These box plots come under univariate analysis, which means that we are
exploring data only with one variable.
• Here we are trying to check the impact of a feature named “axil_nodes” on the class
named “Survival status” and not between any two independent features.

The code snippet is as follows:

sns.boxplot(x='SurvStat',y='axil_nodes',data=hb)

Some conclusions inferred from the above box plot:

From the above box and whisker plot we can conclude the following observations:

• How much data is present in the 1st quartile and how many points are outliers etc.
• For class 1, we can see that very little or no data is present between the median and the
1st quartile.
• There are more outliers for class 1 in the feature named axil_nodes.

NOTE:

We can get details about outliers that will help us to well prepare the data before feeding it to

a model since outliers influence a lot of Machine learning models.

3. Violin Plot

• The violin plots can be considered as a combination of Box plot at the middle and distribution
plots(Kernel Density Estimation) on both sides of the data.
• This can give us the description of the distribution of the dataset like whether the distribution
is multimodal, Skewness, etc.
• It also gives us useful information like a 95% confidence interval.
Fig. General Diagram for a Violin-plot

Implementation:

• The Violin plot is present in the Seaborn package.

The code snippet is as follows:

sns.violinplot(x='SurvStat',y='op_yr',data=hb,size=6)

Some conclusions inferred from the above violin plot:

From the above violin plot we can conclude the following observations:

• The median of both classes is close to 63.


• The maximum number of persons with class 2 has an op_yr value of 65 whereas, for persons
in class1, the maximum value is around 60.
• Also, the 3rd quartile to median has a lesser number of data points than the median to the 1st
quartile.
Bivariate Analysis Techniques for Data Visualization

1. Line Plot

• This is the plot that you will see in almost every kind of analysis between two variables.
• A line plot simply connects the values of a series of data points with straight lines.
• The plot may seem very simple, but it has many applications, not only in machine learning but also in many other areas.

Implementation:

• The line plot is available in the Matplotlib package.

The code snippet is as follows:

plt.plot(x, y)

Some common uses of the line plot:

• Line plots are used for everything from distribution comparison using Q-Q plots to CV tuning using the elbow method.
• They are also used to analyze the performance of a model using the ROC-AUC curve.
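A slightly fuller, hypothetical example is sketched below: it plots a model's validation error against a tuning parameter, the kind of elbow-style curve mentioned above. The numbers are invented purely for illustration.

import matplotlib.pyplot as plt

# Hypothetical validation errors for increasing values of a tuning parameter k.
k_values = [1, 2, 3, 4, 5, 6, 7, 8]
val_error = [0.42, 0.31, 0.24, 0.21, 0.20, 0.20, 0.21, 0.22]

plt.plot(k_values, val_error, marker='o')   # points joined by straight lines
plt.xlabel('k (tuning parameter)')
plt.ylabel('Validation error')
plt.title('Elbow-style curve drawn with a line plot')
plt.show()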
2. Bar Plot

• This is one of the most widely used plots; we see it many times not just in data analysis but in any field where trend analysis is done.
• Though it may seem simple, it is powerful for analyzing data such as weekly sales figures, revenue from a product, the number of visitors to a site on each day of a week, etc.

Implementation:

• The bar plot is available in the Matplotlib package.

The code snippet is as follows:

plt.bar(x, y)

Some observations about the bar plot:

• We can visualize the data in a clean plot and convey the details to others in a straightforward way.
• This plot is simple and clear, but it is not used very frequently in core data science applications.
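A small, hypothetical example of the visitors-per-day use case mentioned above (the figures are invented purely for illustration):

import matplotlib.pyplot as plt

# Hypothetical number of visitors on each day of a week.
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
visitors = [1200, 1350, 1280, 1500, 1750, 2100, 1900]

plt.bar(days, visitors)
plt.xlabel('Day of the week')
plt.ylabel('Visitors')
plt.title('Visitors per day (hypothetical data)')
plt.show()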
3. Scatter Plot

• It is one of the most commonly used plots for visualizing simple data in machine learning and data science.
• In this plot, each point of the dataset is placed according to its values for any 2 or 3 chosen features (columns).
• Scatter plots are available in both 2-D and 3-D. The 2-D scatter plot is the more common one, where we primarily try to find patterns, clusters, and the separability of the data.

Implementation:

• The scatter plot is available in the Matplotlib package.

The code snippet is as follows:

plt.scatter(x, y)

Some observations about the scatter plot:

• Colors are assigned to the data points based on the target column of the dataset.
• In other words, we can color the data points according to their class label, which makes it easier to see how well the classes separate.
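A minimal sketch of a 2-D scatter plot colored by class label, assuming the same hb DataFrame and columns as in the earlier snippets (any two numeric columns and a label column would work the same way):

import matplotlib.pyplot as plt

# Each point is positioned by two features; the color encodes the class label.
plt.scatter(hb['Age'], hb['axil_nodes'], c=hb['SurvStat'], cmap='coolwarm')
plt.xlabel('Age')
plt.ylabel('axil_nodes')
plt.title('2-D scatter plot colored by Survival status')
plt.show()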
Data Science in application development

Data Science is a field of study that focuses on using scientific methods and algorithms to extract knowledge from data. The Data Scientist's role may differ depending on the project. Some associate this position with application analytics, others with vaguely defined AI, and the truth lies somewhere in between. Ultimately, the Data Scientist's primary goal is to improve the quality of application development and bring value to a product.
The role of a Data Scientist

A Data Scientist is generally required to have knowledge of data analysis, data transformations, and machine learning. However, several related positions exist around this role, such as:

• Data Analyst,
• Data Engineer,
• Machine Learning Engineer,
• MLOps,
• or DataOps.
Data Scientists may often be perceived as the full-stack developers of the machine learning world. Therefore, many companies prefer hiring Data Scientists whose particular skill set covers the roles mentioned above, so that they fit the project requirements. In small teams, Data Scientists are responsible for designing the architecture and building data processing pipelines, preparing application analytics, developing machine learning solutions, deploying them to the production environment, and monitoring the results.
Transforming data into value

The primary purpose of a Data Scientist's work is to solve problems such as reducing costs, increasing revenue, and improving user experience. This can be achieved either by maintaining and investigating application analytics or by introducing AI systems into a project.

Application analytics usually includes components addressing the following questions:
User demographics

• Where do the users come from?
• What age are they?
• What devices and systems are they using?

User activity

• How many active users does the application have?
• At what times does the application experience increased traffic?
• What does the cohort analysis look like?
• What is the users' engagement time?

User paths

• Which application features are used most frequently?
• Where do bottlenecks occur in application flows?

Additional application KPIs

• What is the overall user engagement?
• What is the application revenue?

A/B test results

• What are the results of A/B testing?
• How do the results change across different user segments?

Crashes and errors

• How many users are affected?
• Is there any pattern in the segment of affected users?
Analytics can give plenty of information to the development team and the client. Therefore, application development can be accelerated through task prioritization, feature validation, and the detection of hidden issues.
Although analytics is an important part of application development, Data Scientists are also responsible for delivering machine learning solutions. Machine learning is a branch of science that focuses on automatically extracting insights from data in order to build a knowledge model that can perform a certain task. On the other hand, AI (Artificial Intelligence) is a much broader term that is often used by marketers. As a result, the expression has become a buzzword and is loosely used in the business world as an equivalent of machine learning.
There is a wide variety of applications that can utilize machine learning. Some common AI systems, with examples, are presented below:

Recommender systems - profiling a user to propose the items that best fit their interests;

Customer segmentation - assigning users to different segments (e.g., based on their behavior) to maximize the profit from marketing campaigns;

Image recognition - detecting particular objects in images or videos (may be used for censoring inappropriate content);

Anomaly and fraud detection - automated detection of anomalies (detecting changes in user behavior or transaction flows, or detecting cyberattack attempts by analyzing network traffic);

Text mining - e.g., sentiment analysis that provides information about positive or negative attitudes toward a product based on content provided by the user (e.g., user opinions);

Churn prediction - detecting and preventing users from leaving the application or canceling the service subscription;

Other systems:

• Antispam filters;
• Forecasting methods (e.g., predicting future sales);
• Chatbots;
• and the automation of various other processes.
Data Scientist during the application development life cycle

There are two approaches to hiring a Data Scientist. Preparing an application MVP may be financially difficult for a client, and during the initial development there is an obvious need for developers rather than Data Scientists. In this scenario, a Data Scientist is usually hired once the application is publicly available; the gathered data can then be utilized for further application development, and the application might require some AI-centered features.
On the other hand, a Data Scientist's knowledge and experience may be beneficial from the start of the development cycle. Although introducing new machine learning solutions may not be crucial for a new application, applying those solutions later requires proper data collection. That means the Data Scientist should be included in the work related to designing databases and data flows. This way, it will be much easier to develop machine learning solutions in the future.