Data Science Notes (M.Tech)
Data science has become one of the most in-demand fields of the 21st century, and every organization is looking for candidates with data science skills. This tutorial gives an introduction to data science, covering job roles, tools, components of data science, applications, and more.
So let's start,
Data science is the deep study of massive amounts of data. It involves extracting meaningful insights from raw, structured, and unstructured data using the scientific method, different technologies, and algorithms.
It is a multidisciplinary field that uses tools and techniques to manipulate data so that you can find something new and meaningful.
Data science uses powerful hardware, programming systems, and efficient algorithms to solve data-related problems, and it is closely tied to the future of artificial intelligence.
Suppose we want to travel from station A to station B by car. We need to make several decisions, such as which route will get us to the destination fastest, which route is likely to be free of traffic jams, and which will be most cost-effective. All of these decision factors act as input data, and the answer we arrive at comes from analyzing that data. This kind of analysis is called data analysis, and it is one part of data science.
Some years ago, data was scarce and mostly available in a structured form that could be easily stored in Excel sheets and processed using BI tools.
In today's world, however, data has become so vast that approximately 2.5 quintillion bytes are generated every day, leading to a data explosion. Researchers estimated that by 2020, 1.7 MB of data would be created every second for every person on earth.
Every company requires data to work, grow, and improve its business.
Handling such a huge amount of data is a challenging task for every organization. To handle, process, and analyze it, we need complex, powerful, and efficient algorithms and technology, and that technology is data science.
Following are some main reasons for using data science technology:
o With the help of data science, we can convert massive amounts of raw and unstructured data into meaningful insights.
o Data science is being adopted by companies of every size, from big brands to startups. Google, Amazon, Netflix, and others that handle huge amounts of data use data science algorithms to improve the customer experience.
o Data science is helping to automate transportation, for example by enabling self-driving cars, which are seen as the future of transportation.
o Data science can help with different kinds of predictions, such as surveys, elections, flight ticket confirmation, and so on.
As per various surveys, the data scientist role is becoming one of the most sought-after jobs of the 21st century due to the increasing demand for data science. Some even call it "the hottest job title of the 21st century". Data scientists are experts who use statistical tools and machine learning algorithms to understand and analyze data.
The average salary range for a data scientist is approximately $95,000 to $165,000 per annum, and various studies estimate that about 11.5 million jobs will be created in this field by the year 2026.
If you learn data science, you will find a variety of exciting job roles in this domain. The main job roles are given below:
1. Data Scientist
2. Data Analyst
3. Machine learning expert
4. Data engineer
5. Data Architect
6. Data Administrator
7. Business Analyst
8. Business Intelligence Manager
1. Data Analyst:
A data analyst mines large amounts of data, models it, and looks for patterns, relationships, trends, and so on. At the end of the day, they produce visualizations and reports that support decision making and problem solving.
Skill required: To become a data analyst, you need a good background in mathematics, business intelligence, and data mining, along with a basic knowledge of statistics. You should also be familiar with computer languages and tools such as MATLAB, Python, SQL, Hive, Pig, Excel, SAS, R, JS, Spark, etc.
2. Machine Learning Expert:
A machine learning expert works with the various machine learning algorithms used in data science, such as regression, clustering, classification, decision trees, random forests, etc.
Skill Required: Computer programming languages such as Python, C++, R, Java, and Hadoop, along with an understanding of various algorithms, problem-solving and analytical skills, probability, and statistics.
3. Data Engineer:
A data engineer works with massive amounts of data and is responsible for building and maintaining the data architecture of a data science project. Data engineers also create the dataset processes used in modeling, mining, acquisition, and verification.
Skill required: A data engineer needs in-depth knowledge of SQL, MongoDB, Cassandra, HBase, Apache Spark, Hive, and MapReduce, along with programming knowledge of Python, C/C++, Java, Perl, etc.
4. Data Scientist:
A data scientist is a professional who works with an enormous amount of data to come up
with compelling business insights through the deployment of various tools, techniques,
methodologies, algorithms, etc.
Skill required: To become a data scientist, one should have technical language skills such as R, SAS, SQL, Python, Hive, Pig, Apache Spark, and MATLAB. Data scientists must also have an understanding of statistics and mathematics, along with visualization and communication skills.
Non-Technical Prerequisite:
o Curiosity: To learn data science, one must be curious. When you are curious and ask various questions, you can understand the business problem more easily.
o Critical Thinking: Critical thinking is also required for a data scientist so that you can find multiple new ways to solve a problem efficiently.
o Communication skills: Communication skills are very important for a data scientist because, after solving a business problem, you need to communicate the results to the team.
Technical Prerequisite:
o Machine learning: To understand data science, one needs to understand the concept
of machine learning. Data science uses machine learning algorithms to solve various
problems.
o Mathematical modeling: Mathematical modeling is required to make fast
mathematical calculations and predictions from the available data.
o Statistics: Basic understanding of statistics is required, such as mean, median, or
standard deviation. It is needed to extract knowledge and obtain better results from the
data.
o Computer programming: For data science, knowledge of at least one programming
language is required. R, Python, Spark are some required computer programming
languages for data science.
o Databases: An in-depth understanding of databases such as SQL is essential for data science, in order to retrieve and work with data.
BI stands for business intelligence, which is also used for the analysis of business data. Below are some differences between BI and data science:
o Data source: Business intelligence deals with structured data, e.g., data warehouses, whereas data science deals with both structured and unstructured data, e.g., weblogs, feedback, etc.
o Skills: Statistics and visualization are the two main skills required for business intelligence, whereas data science also requires machine learning in addition to statistics and visualization.
o Focus: Business intelligence focuses on both past and present data, whereas data science focuses on past and present data as well as future predictions.
1. Statistics: Statistics is one of the most important components of data science. Statistics is a way to collect and analyze large amounts of numerical data and find meaningful insights in it.
2. Domain Expertise: In data science, domain expertise binds data science together. Domain
expertise means specialized knowledge or skills of a particular area. In data science, there are
various areas for which we need domain experts.
3. Data engineering: Data engineering is a part of data science that involves acquiring, storing, retrieving, and transforming data. Data engineering also includes adding metadata (data about data) to the data.
4. Machine learning: Machine learning is the backbone of data science. Machine learning is all about training a machine so that it can act like a human brain. In data science, we use various machine learning algorithms to solve problems.
o Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R Studio, MATLAB, Excel,
RapidMiner.
o Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift
o Data Visualization tools: R, Jupyter, Tableau, Cognos.
o Machine learning tools: Spark, Mahout, Azure ML studio.
To become a data scientist, one should also be aware of machine learning and its algorithms, as data science makes broad use of various machine learning algorithms. Following are the names of some machine learning algorithms used in data science:
o Regression
o Decision tree
o Clustering
o Principal component analysis
o Support vector machines
o Naive Bayes
o Artificial neural network
o Apriori
Below is a brief introduction to a few of the important algorithms.
1. Linear Regression Algorithm: Linear regression is one of the most popular machine learning algorithms and is based on supervised learning. It performs regression, which is a method of modeling a target value based on independent variables. It takes the form of a linear equation relating a set of inputs to the predicted output. This algorithm is mostly used in forecasting and prediction. Since it models a linear relationship between the input and output variables, it is called linear regression.
The equation below describes the relationship between the x and y variables:
y = mx + c
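As a quick illustration of this equation, the sketch below fits y = mx + c to a small made-up dataset using NumPy's least-squares polynomial fit; the numbers are invented purely for demonstration.

    import numpy as np

    # Made-up inputs (x) and observed outputs (y)
    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([52, 55, 61, 64, 70], dtype=float)

    # Fit y = m*x + c by least squares (a degree-1 polynomial)
    m, c = np.polyfit(x, y, 1)
    print(f"slope m = {m:.2f}, intercept c = {c:.2f}")

    # Use the fitted line for a prediction
    x_new = 6
    print("predicted y:", m * x_new + c)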
2. Decision Tree: The decision tree algorithm is another machine learning algorithm that belongs to the supervised learning family. It is one of the most popular machine learning algorithms and can be used for both classification and regression problems.
In the decision tree algorithm, we solve a problem using a tree representation in which each internal node represents a feature, each branch represents a decision, and each leaf represents an outcome.
If we are given a dataset of items with certain features and values, and we need to categorize those items into groups, this type of problem can be solved using the k-means clustering algorithm.
Now, let's understand the most common types of problems that occur in data science and the approach to solving them. In data science, problems are solved using algorithms, and the questions below map to the applicable algorithms:
Is this A or B?:
This refers to problems that have only two possible answers, such as yes or no, 1 or 0, may or may not. Such problems can be solved using classification algorithms.
Is this different?:
This refers to questions where we are given various patterns and need to find the odd one out. Such problems can be solved using anomaly detection algorithms.
Another type of problem asks for a numerical value or figure, such as what the temperature will be today; such problems can be solved using regression algorithms.
If you have a problem that requires organizing the data, it can be solved using clustering algorithms. Clustering algorithms organize and group data based on features, colors, or other common characteristics.
1. Discovery: The first phase is discovery, which involves asking the right questions. When you start a data science project, you need to determine the basic requirements, priorities, and project budget. In this phase, we determine all the requirements of the project, such as the number of people, technology, time, data, and the end goal, and then we can frame the business problem at a first hypothesis level.
2. Data preparation: Data preparation is also known as Data Munging. In this phase, we
need to perform the following tasks:
o Data cleaning
o Data Reduction
o Data integration
o Data transformation
After performing all the above tasks, we can easily use this data for our further processes.
3. Model Planning: In this phase, we determine the various methods and techniques needed to establish the relationships between the input variables. We apply exploratory data analysis (EDA), using various statistical formulas and visualization tools, to understand the relationships between variables and to see what the data can tell us. Statistical computing environments and visualization tools are commonly used in this phase.
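As a small sketch of what EDA can look like in Python, assuming the project data sits in a hypothetical file named project_data.csv:

    import pandas as pd

    df = pd.read_csv("project_data.csv")       # hypothetical input file

    print(df.shape)                            # number of rows and columns
    print(df.describe())                       # summary statistics for numeric columns
    print(df.isna().sum())                     # missing values per column
    print(df.select_dtypes("number").corr())   # pairwise correlations between numeric variables

    df.hist(figsize=(8, 6))                    # quick look at each variable's distribution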
4. Model-building: In this phase, the actual model building starts. We create datasets for training and testing purposes and apply different techniques, such as association, classification, and clustering, to build the model. Machine learning and statistical libraries are the common model-building tools.
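A minimal sketch of this phase with scikit-learn, assuming a feature matrix X and labels y have already been prepared in the earlier phases:

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # X and y are assumed to come from the data preparation phase
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)                            # train on the training split
    print("test accuracy:", model.score(X_test, y_test))   # evaluate on held-out data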
5. Operationalize: In this phase, we deliver the final reports of the project, along with briefings, code, and technical documents. This phase gives a clear overview of the complete project's performance and other components on a small scale before full deployment.
6. Communicate results: In this phase, we check whether we have reached the goal we set in the initial phase, and we communicate the findings and final results to the business team.
A data scientist's primary role is to apply machine learning, statistical methods, and exploratory analysis to data in order to extract insights and aid decision making. Programming and the use of computational tools are essential to this role. In fact, many people have described the field using something along the lines of this famous quote:
"A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician."
If you are beginning your journey in learning data science, or want to improve your existing skills, it is essential to have a good understanding of the tools you need to perform this role effectively.
Python has gradually grown in popularity for data science over the last ten years and is now by far the most popular programming language among practitioners in the field. In the following sections, I am going to give an overview of the core tools used by data scientists, largely focused on Python-based tools.
NumPy
NumPy is a powerful library for performing mathematical and scientific computations with
python. You will find that many other data science libraries require it as a dependency to run
as it is one of the fundamental scientific packages.
This tool represents data as an N-dimensional array object. It provides tools for manipulating arrays, performing array operations, computing basic statistics, and carrying out common linear algebra calculations such as cross and dot products.
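For illustration, a few of the NumPy operations mentioned above:

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([4.0, 5.0, 6.0])

    print(a + b)                     # element-wise array operation
    print(a.mean(), a.std())         # basic statistics
    print(np.dot(a, b))              # dot product
    print(np.cross(a, b))            # cross product

    m = np.arange(6).reshape(2, 3)   # a 2x3 array (N-dimensional array object)
    print(m.T)                       # transpose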
Pandas
The Pandas library simplifies the manipulation and analysis of data in python. Pandas works
with two fundamental data structures. They are Series, which is a one-dimensional labelled
array, and a DataFrame, which is a two-dimensional labelled data structure. The Pandas
package has a multitude of tools for reading data from various sources, including CSV files
and relational databases.
Once data has been loaded into one of these data structures, pandas provides a wide range of very simple functions for cleaning, transforming and analysing it. These include built-in tools to handle missing data, simple plotting functionality and Excel-like pivot tables.
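A small sketch of these features, using a DataFrame built in memory (in practice the data would usually come from a CSV file or a database):

    import pandas as pd

    df = pd.DataFrame({
        "city": ["London", "Paris", "London", "Berlin"],
        "sales": [250, 300, None, 180],
    })

    df["sales"] = df["sales"].fillna(df["sales"].mean())   # handle missing data
    print(df.groupby("city")["sales"].sum())               # simple aggregation
    print(pd.pivot_table(df, values="sales", index="city", aggfunc="mean"))  # Excel-like pivot table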
SciPy
SciPy is another core scientific computational python library. This library is built to interact
with NumPy arrays and depends on much of the functionality made available through NumPy.
However, although NumPy must be installed for SciPy to work, you do not need to import NumPy's functionality separately just to use SciPy, as SciPy makes the array machinery it needs available internally.
Scipy effectively builds on the mathematical functionality available in NumPy. Where NumPy
provides very fast array manipulation, SciPy works with these arrays and enables the
application of advanced mathematical and scientific computations.
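A brief sketch of SciPy building on NumPy arrays, here with a one-sample t-test and a simple optimisation:

    import numpy as np
    from scipy import stats, optimize

    data = np.array([2.1, 2.5, 2.2, 2.8, 2.4, 2.6])

    print(stats.describe(data))                              # descriptive statistics
    t_stat, p_value = stats.ttest_1samp(data, popmean=2.0)   # one-sample t-test
    print(t_stat, p_value)

    result = optimize.minimize_scalar(lambda x: (x - 3) ** 2 + 1)  # numerical optimisation
    print(result.x)                                          # close to 3, the function's minimum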
Scikit-learn
Scikit-learn is a user friendly, comprehensive and powerful library for machine learning. It
contains functions to apply most machine learning techniques to data and has a consistent user
interface for each.
This library also provides tools for data cleaning, data pre-processing and model validation.
One of its most powerful features is the concept of machine learning pipelines. These pipelines enable the various steps in machine learning, e.g. preprocessing, training and so on, to be chained together into one object.
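A minimal pipeline sketch, chaining a preprocessing step and a model into one object on the built-in iris dataset:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    pipe = Pipeline([
        ("scale", StandardScaler()),                 # preprocessing step
        ("clf", LogisticRegression(max_iter=200)),   # model step
    ])
    pipe.fit(X_train, y_train)
    print("test accuracy:", pipe.score(X_test, y_test))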
Keras
Keras is a python API which aims to provide a simple interface for working with neural
networks. Popular deep learning libraries such as Tensorflow are notorious for not being very
user-friendly. Keras sits on top of these frameworks to provide a friendly way to interact with
them.
Keras supports both convolutional and recurrent networks, provides multi-backend support and runs on both CPU and GPU.
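A minimal sketch of defining a small feed-forward network with Keras (assumes TensorFlow is installed; the layer sizes and the 20-feature input are arbitrary examples, not taken from these notes):

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(20,)),               # 20 input features (assumed)
        layers.Dense(16, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # binary classification output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()
    # model.fit(X_train, y_train, epochs=10) would train it on real data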
Matplotlib
Matplotlib is one of the fundamental plotting libraries in python. Many other popular plotting
libraries depend on the matplotlib API including the pandas plotting functionality and
Seaborn.
Matplotlib is a very rich plotting library and contains functionality to create a wide range of charts and visualisations. Additionally, it contains functions to create animated and interactive charts.
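For example, a line plot and a histogram side by side:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 10, 100)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

    ax1.plot(x, np.sin(x))                    # line chart
    ax1.set_title("Line plot")
    ax2.hist(np.random.randn(500), bins=20)   # histogram of random data
    ax2.set_title("Histogram")

    plt.tight_layout()
    plt.show()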
Jupyter notebooks
Jupyter notebooks are an interactive python programming interface. The benefit of writing
python in a notebook environment is that it allows you to easily render visualisations, datasets
and data summaries directly within the program.
These notebooks are also ideal for sharing data science work as they can be highly annotated
by including markdown text directly in line with the code and visualisations.
Python IDE
Jupyter notebooks are a useful place to write code for data science. However, there will be
many instances when writing code into reusable modules will be needed. This will particularly
be the case if you are writing code to put a machine learning model into production.
In these instances, an IDE (Integrated Development Environment) is useful, as IDEs provide lots of useful features such as integrated Python style guides, unit testing and version control. I personally use PyCharm but there are many others available.
GitHub
GitHub is a very popular version control platform. One of the fundamental principles of data
science is that code and results should be reproducible either by yourself at a future point in
time or by others. Version control provides a mechanism to track and record changes to your
work online.
This article has given a brief introduction to the core toolkit for data science work. In my next
article, I am going to cover how to set up your computer for effective data science work and
will run through these tools and others in more detail.
• Data discovery
• Data mining
• Correlations
Our society is highly dependent on data, which underscores the importance of collecting it. Accurate data collection is necessary to make informed business decisions, ensure quality assurance, and maintain research integrity.
During data collection, the researchers must identify the data types, the sources of data, and
what methods are being used. We will soon see that there are many different data collection
methods. There is heavy reliance on data collection in research, commercial, and government
fields.
Before an analyst begins collecting data, they must answer three questions first:
• What methods and procedures will be used to collect, store, and process the information?
Additionally, we can break up data into qualitative and quantitative types. Qualitative data
covers descriptions such as color, size, quality, and appearance. Quantitative data,
unsurprisingly, deals with numbers, such as statistics, poll numbers, percentages, etc.
Before a judge makes a ruling in a court case or a general creates a plan of attack, they must
have as many relevant facts as possible. The best courses of action come from informed
decisions, and information and data are synonymous.
The concept of data collection isn’t a new one, as we’ll see later, but the world has changed.
There is far more data available today, and it exists in forms that were unheard of a century
ago. The data collection process has had to change and grow with the times, keeping pace
with technology.
Whether you’re in the world of academia, trying to conduct research, or part of the
commercial sector, thinking of how to promote a new product, you need data collection to
help you make better choices.
Now that you know what data collection is and why we need it, let's take a look at the different methods of data collection. While the phrase "data collection" may sound all high-tech and digital, it doesn't necessarily entail things like computers, big data, and the internet.
Data collection could mean a telephone survey, a mail-in comment card, or even some guy
with a clipboard asking passersby some questions. But let’s see if we can sort the different
data collection methods into a semblance of organized categories.
The following are some of the primary methods of collecting data in business analytics.
• Surveys
• Transactional Tracking
• Observation
• Online Tracking
• Forms
Data collection breaks down into two methods. As a side note, many terms, such as techniques, methods, and types, are used interchangeably, depending on who uses them. One source may call data collection techniques "methods," for instance. But whatever labels we use, the general concepts and breakdowns apply across the board, whether we're talking about marketing analysis or a scientific research project.
• Primary
As the name implies, this is original, first-hand data collected by the data researchers. This process is the initial information-gathering step, performed before anyone carries out any further or related research. Primary data results are highly accurate provided the researcher collects the information. However, there's a downside: first-hand research is potentially time-consuming and expensive.
• Secondary
Secondary data is second-hand data collected by other parties and already having undergone
statistical analysis. This data is either information that the researcher has tasked other people
to collect or information the researcher has looked up. Simply put, it’s second-hand
information. Although it’s easier and cheaper to obtain than primary information, secondary
information raises concerns regarding accuracy and authenticity. Quantitative data makes up
a majority of secondary data.
The Application Programming Interface (API) has become a core component of many of the
products and services we’ve become accustomed to using.
It is able to bolster relations between companies and clients. For companies, it is a convenient
way to promote their own business to their clients while ensuring the security of their backend
systems. For clients, APIs provide the means to access data that can be used to fuel their
research or product development.
Here, I will give a brief overview of APIs and demonstrate how you can use this resource for
your own data collection using Python.
Forms of Data Collection
Before discussing APIs, let’s quickly go over the options you have when it comes to procuring
data.
1. Collecting your own data
This one seems like a no-brainer; if you want some data, why not collect your own? After all, no one understands your requirements better than you do. So, just get out there and start collecting, right?
Wrong.
For most cases, collecting your own data is an absurd notion. Procuring information of the
required quantity and quality requires considerable time, money, manpower, and resources.
Why go through the trouble of collecting and processing data when you can just use someone
else’s preprocessed datasets?
2. Ready-made datasets
Ready-made datasets can be appealing, since someone has already done all the hard work of making them. You've no doubt encountered plenty of them on sites like Kaggle.com and Data.gov.
Unfortunately, the convenience of this approach comes at the cost of flexibility and control.
When you use a ready-made dataset, you are restricted by the preprocessing performed on that
dataset prior to its upload.
Chances are that some of the records or features that would have been useful to you were
discarded by the source.
These sources of data certainly have their merits, but their limitations will add constraints to
any subsequent analysis or modeling and can hamper the success of your project.
3. Web scraping
Web scraping is somewhat the middle ground between collecting your own data and using
someone else’s.
You get to access other people’s data by going to their websites and choosing exactly what
parts you want to collect.
On paper, this seems like a good deal, but web scraping comes with its own caveats.
For starters, extracting data from websites through scraping can be challenging. Web scraping
tools such as Selenium require a strong grasp of HTML and XML Path Language (XPath).
Furthermore, the scripts required to navigate websites and procure the needed information can
be long and may require a lot of time to write.
In addition, web scraping can at times be unethical or even illegal. While some websites have
no qualms with scraping, others may be less tolerant. It isn’t uncommon for websites to upload
copyrighted data or set terms that stipulate conditions for scraping.
Web scraping without sufficient caution and care could get you in trouble.
Benefits of APIs
APIs offer a way to obtain needed data while avoiding the disadvantages of the
aforementioned data collection methods.
APIs spare you the trouble of having to collect data yourself, as you can directly procure data from another entity, and you get the freedom to select the raw data and process it as you wish.
You also don't need to worry about legal ramifications. Companies require you to possess an identifier known as an API key before granting you access to their API. You can obtain an API key by applying for it directly. The API key acts as a barrier to entry, ensuring that only clients that have been approved by the company can reap the benefits of the API.
Finally, the best part of this resource is that it can facilitate data extraction with just a few lines
of code!
So, what role does an API play in your efforts to collect data from an external source? To
answer this, we should first introduce a little terminology.
When a client wants access to certain data from a foreign server, they make a request to that
server.
When the server receives the request, it generates a response that it sends back to the client.
An API plays the role of a middle man in this exchange. It is responsible for delivering your
request to the server and then delivering the corresponding response back to you.
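A generic sketch of a client request in Python using the requests library; the URL, parameters, and key below are placeholders, not a real endpoint:

    import requests

    response = requests.get(
        "https://api.example.com/v1/data",                  # placeholder endpoint
        params={"q": "data science"},                       # request parameters
        headers={"Authorization": "Bearer YOUR_API_KEY"},   # placeholder credential
    )

    print(response.status_code)   # 200 means the request succeeded
    data = response.json()        # most APIs return the response body as JSON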
Using APIs
Most websites have their own unique requirements for requests and set a unique format for their responses. They also differ with regard to their restrictions; for instance, some APIs limit the number of requests you can make in a day.
Thus, to gain an understanding of a specific API, you will need to read its documentation.
Case Study
Although APIs are simple to use, they might be hard to understand at first. Let’s perform a
demonstration to see how they can be used to collect data.
We will use an API from the New York Times called the Article Search API. This API allows
you to search for articles by a query.
Their documentation clearly explains how to create URIs for requests. You can customize the URI to make requests for specific articles; for example, you can specify the articles' publication dates and apply other filters.
First, we create a function in Python that generates a URI for the API given the query, sorting
order, page number, and API key.
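The original code is not reproduced in these notes, but a sketch of such a function might look like the following. The base URL and parameter names should be verified against the Article Search documentation, and the API key is a placeholder.

    def build_article_search_uri(query, sort="newest", page=0, api_key="YOUR_KEY"):
        """Build a request URI for the NYT Article Search API (sketch)."""
        base = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
        return f"{base}?q={query}&sort={sort}&page={page}&api-key={api_key}"

    # Example usage with the requests library:
    # import requests
    # response = requests.get(build_article_search_uri("data science", page=1))
    # articles = response.json()["response"]["docs"]   # structure per the API docs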
There are two main methodologies or techniques used to retrieve relevant data from large,
unorganized pools. They are manual and automatic methods. The manual method is another
name for data exploration, while the automatic method is also known as data mining.
Data mining generally refers to gathering relevant data from large databases. On the other
hand, data exploration generally refers to a data user finding their way through large amounts
of data to gather necessary information. Let's study both methods in detail and compare their
differences.
What is Data Exploration?
Data exploration refers to the initial step in data analysis. Data analysts use data visualization
and statistical techniques to describe dataset characterizations, such as size, quantity, and
accuracy, to understand the nature of the data better.
Data exploration techniques include both manual analysis and automated data exploration
software solutions that visually explore and identify relationships between different data
variables, the structure of the dataset, the presence of outliers, and the distribution of data
values to reveal patterns and points of interest, enabling data analysts to gain greater insight
into the raw data.
Data is often gathered in large, unstructured volumes from various sources. Data analysts
must first understand and develop a comprehensive view of the data before extracting
relevant data for further analysis, such as univariate, bivariate, multivariate, and principal
components analysis.
Humans process visual data better than numerical data. Therefore it is extremely challenging
for data scientists and data analysts to assign meaning to thousands of rows and columns of
data points and communicate that meaning without any visual components.
Data visualization in data exploration leverages familiar visual cues such as shapes,
dimensions, colors, lines, points, and angles so that data analysts can effectively visualize and
define the metadata and then perform data cleansing. Performing the initial step of data
exploration enables data analysts to understand better and visually identify anomalies and
relationships that might otherwise go undetected.
Manual data exploration methods entail writing scripts to analyze raw data or manually
filtering data into spreadsheets. Automated data exploration tools, such as data visualization
software, help data scientists easily monitor data sources and perform big data exploration on
otherwise overwhelmingly large datasets. Graphical displays of data, such as bar charts and
scatter plots, are valuable tools in visual data exploration.
A popular tool for manual data exploration is Microsoft Excel spreadsheets, which can create
basic charts for data exploration, view raw data, and identify the correlation between
variables. To identify the correlation between two continuous variables in Excel, use the
CORREL() function to return the correlation. To identify the correlation between two
categorical variables in Excel, the two-way table method, the stacked column chart method,
and the chi-square test are effective.
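The same kinds of checks can be done in Python; a rough sketch with made-up file and column names:

    import pandas as pd
    from scipy.stats import chi2_contingency

    df = pd.read_csv("survey.csv")                     # hypothetical data

    # Correlation between two continuous variables (equivalent to Excel's CORREL)
    print(df["income"].corr(df["spending"]))

    # Association between two categorical variables: two-way table plus chi-square test
    table = pd.crosstab(df["gender"], df["preferred_brand"])
    chi2, p_value, dof, expected = chi2_contingency(table)
    print(p_value)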
In general, the goals of data exploration fall into these three categories.
1. Archival: Data exploration can convert data from physical formats (such as books, newspapers, and invoices) into digital formats (such as databases) for backup.
2. Transferring the data format: If you want to transfer the data from your current website into a new website under development, you can collect the data from your own website by extracting it.
3. Data analysis: As the most common goal, the extracted data can be further analyzed to generate insights. This may sound similar to the data analysis process in data mining, but note that data analysis is the goal of data exploration, not part of its process. What's more, the data is analyzed differently; one example is that e-store owners extract product details from eCommerce websites like Amazon to monitor competitors' strategies.
Data exploration has been widely used in multiple industries, serving different purposes. Besides monitoring prices in eCommerce, data exploration can help with individual paper research, news aggregation, marketing, real estate, travel and tourism, consulting, finance, and many more.
o Lead generation: Companies can extract data from directories like Yelp,
Crunchbase, and Yellowpages and generate leads for business development. You can
check out this video to see how to extract data from Yellowpages with a web scraping
template.
o Content & news aggregation: Content aggregation websites can get regular data
feeds from multiple sources and keep their sites fresh and up-to-date.
o Sentiment analysis: After extracting the online reviews/comments/feedback from
social media websites like Instagram and Twitter, people can analyze the underlying
attitudes and understand how they perceive a brand, product, or phenomenon.
Data mining could be called a subset of data analysis. It explores and analyzes large amounts of data to find important patterns and rules.
Data mining is a systematic and sequential process of identifying and discovering hidden patterns and information throughout a big dataset. Moreover, it is used to build the machine learning models that are further used in artificial intelligence.
What Can Data Mining Do?
Data mining tools can sweep through the databases and identify hidden patterns efficiently by
automating the mining process. For businesses, data mining is often used to discover patterns
and relationships in data to help make optimal business decisions.
After data mining became widespread in the 1990s, companies in various industries, including retail, finance, healthcare, transportation, telecommunications, e-commerce, etc., started to use data mining techniques to generate insights from data. Data mining can help segment customers, detect fraud, forecast sales, and more.
There are two primary methods for extracting data from disparate sources in data science: data exploration and data mining. Data exploration can be part of data mining, where the aim is to collect and integrate data from different sources. Data mining, as a relatively complex process, is about discovering patterns in order to make sense of data and predict the future. The two require different skill sets and expertise, yet the increasing popularity of no-code data exploration tools and data mining tools greatly enhances productivity and makes people's lives much easier.
Data mining vs. data exploration:
o Terminology: Data mining is also called knowledge discovery in databases, extraction, data/pattern analysis, and information harvesting. Data exploration is used interchangeably with web exploration, web scraping, web crawling, data retrieval, data harvesting, etc.
o Data type: Data mining studies are mostly on structured data, while data exploration usually retrieves data from unstructured or poorly structured data sources.
o Aim: Data mining aims to make the available data more useful for generating insights, whereas data exploration collects data and gathers it into a place where it can be stored or further processed.
o Purpose: The purpose of data mining is to find facts that were previously unknown or ignored, while data exploration deals with existing information.
o Effort: Data mining is much more complicated and requires large investments in staff training, whereas data exploration can be extremely easy and cost-effective when conducted with the right tool.
Data is more valuable than ever before. As several outlets such as The Economist and Forbes have pointed out, data has surpassed oil as the most valuable commodity in the current global market, and current estimates suggest this trend will continue well into the future.
Text files
Some data is stored in plain text files, which can be extracted using a programming language. String parsing or regular expressions can be used to split each line into the required fields.
Database
A database is a commonly used way to securely store data, generally in tabular form, and to efficiently manage huge amounts of data. SQL-based relational databases such as MySQL are commonly used. In a database, the user sends queries to the DBMS, which executes the query and returns the result. In contrast with text files, databases require a user-password combination to gain access and perform tasks. In this section, we use Python to access an existing MySQL database. Since databases are a vast topic, check the tutorial link in the references for further learning.
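A minimal sketch of querying a MySQL database from Python, assuming the mysql-connector-python package is installed; the host, credentials, and table are placeholders:

    import mysql.connector

    conn = mysql.connector.connect(
        host="localhost", user="user", password="password", database="sales_db")

    cursor = conn.cursor()
    cursor.execute("SELECT id, amount FROM orders LIMIT 5")   # hypothetical table
    for row in cursor.fetchall():
        print(row)

    cursor.close()
    conn.close()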
CSV files
One of the most common ways in which data is stored is in a CSV file, which consists of comma-separated values. When opened in software like Excel, a CSV file displays like a spreadsheet, with data stored column-wise and row-wise. CSV files can be easily accessed and processed in a programming language like Python using libraries such as Pandas. Pandas also allows a wide range of mathematical and analytics operations on a data frame; check the Pandas documentation for more details.
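For example, assuming a hypothetical sales.csv file with an amount column:

    import pandas as pd

    df = pd.read_csv("sales.csv")      # load the CSV into a DataFrame
    print(df.head())                   # first few rows
    print(df.dtypes)                   # column types
    print(df["amount"].sum())          # a simple analytic operation

    df.to_csv("sales_clean.csv", index=False)   # write a processed copy back out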
Cloud storage
Data science often goes hand in hand with cloud platforms. With the ability to spin up huge machines and scale elastically, cloud computing is growing globally and has great future potential. The major storage solutions offered by the cloud are data warehouses and cloud databases. Although the functionality remains similar, warehouses are used to store large amounts of incoming data for analytics purposes, while a cloud database stores the usual customer data in the cloud. Both of these can be accessed from an application using their respective APIs. GCP provides the Google Cloud API, which requires you to send credentials to connect to a service, while for AWS you can use libraries such as pyodbc to connect to a service using the appropriate connection string.
Miscellaneous Sources
Multimedia data
Certain data will be present in multimedia forms like images or audio, stored as-is in folders. For this type of data, libraries like OpenCV are used to read an image and convert it into an array. Audio is generally converted to an image (a spectrogram), which can then be handled like image data.
Certain data science problems require you to connect to an API or another platform, such as social media, to obtain specific data. One Medium article demonstrates this scenario, wherein data is fetched from Twitter hashtags and sentiment analysis is performed on it for natural language processing. APIs can also provide live streams of data (e.g., COVID data or election results data).
Data has been the buzzword for ages now. Whether it is data generated by large-scale enterprises or by an individual, every aspect of data needs to be analyzed in order to benefit from it. But how do we do it? Well, that's where the term 'data analytics' comes in. In this section on 'What is Data Analytics?', you will get an insight into the term along with a hands-on view.
• Gather Hidden Insights – Hidden insights are gathered from data and then analyzed with respect to business requirements.
• Generate Reports – Reports are generated from the data and passed on to the respective teams and individuals for further action and business growth.
• Perform Market Analysis – Market analysis can be performed to understand the strengths and weaknesses of competitors.
• Improve Business Requirements – Analysis of data allows the business to improve how it meets customer requirements and expectations.
Now that you know the need for Data Analytics, let me quickly elaborate on what is Data
Analytics for you.
So, in short, if you understand your Business Administration and have the capability to
perform Exploratory Data Analysis, to gather the required information, then you are good to
go with a career in Data Analytics.
So, now that you know what is Data Analytics, let me quickly cover the top tools used in this
field.
What are the tools used in Data Analytics?
With the increasing demand for data analytics in the market, many tools with various functionalities have emerged for this purpose. Ranging from open-source tools to commercial ones, the top tools in the data analytics market are as follows.
• R programming – This tool is the leading analytics tool used for statistics and data modeling. R compiles and runs on various platforms such as UNIX, Windows, and macOS. It also provides tools to automatically install all packages as per user requirements.
• Python – Python is an open-source, object-oriented programming language that is easy to read, write, and maintain. It provides various machine learning and visualization libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas, Keras, etc. It can also connect to almost any data platform, such as a SQL Server database, MongoDB, or JSON data.
• Tableau Public – This is a free software that connects to any data source such as
Excel, corporate Data Warehouse, etc. It then creates visualizations, maps,
dashboards etc with real-time updates on the web.
• QlikView – This tool offers in-memory data processing with the results delivered to
the end-users quickly. It also offers data association and data visualization with data
being compressed to almost 10% of its original size.
• SAS – A programming language and environment for data manipulation and
analytics, this tool is easily accessible and can analyze data from different sources.
• Microsoft Excel – This tool is one of the most widely used tools for data analytics. Mostly used for clients' internal data, it analyzes and summarizes the data with a preview of pivot tables.
• RapidMiner – A powerful, integrated platform that can integrate with many data source types such as Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase, etc. This tool is mostly used for predictive analytics, such as data mining, text analytics, and machine learning.
• KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics
platform, which allows you to analyze and model data. With the benefit of visual
programming, KNIME provides a platform for reporting and integration through its
modular data pipeline concept.
• OpenRefine – Also known as GoogleRefine, this data cleaning software will help
you clean up data for analysis. It is used for cleaning messy data, the transformation
of data and parsing data from websites.
• Apache Spark – One of the largest large-scale data processing engines, this tool executes applications in Hadoop clusters 100 times faster in memory and 10 times faster on disk. It is also popular for data pipelines and machine learning model development.
1. Descriptive Statistics:
Descriptive statistics describes a population through numerical calculations, graphs, or tables. It provides a graphical summary of data and is used simply for summarizing a dataset. There are two categories, as described below.
• (a) Measure of central tendency –
A measure of central tendency, also known as a summary statistic, is used to represent the center point or a typical value of a data set or sample set.
In statistics, there are three common measures of central tendency, as shown below:
• (i) Mean:
The mean is the average of all values in a sample set.
For example, the mean of {2, 4, 6, 8} is 5.
• (ii) Median:
The median is the central value of a sample set. The data set is ordered from lowest to highest value, and the exact middle value is taken; for an even number of values, the two middle values are averaged.
For example, the median of {3, 5, 9} is 5.
• (iii) Mode:
The mode is the value that occurs most frequently in a sample set; the value repeated most often is the mode.
For example, the mode of {2, 3, 3, 5} is 3.
Central tendency is a descriptive summary of a dataset through a single value that reflects the
center of the data distribution. Along with the variability (dispersion) of a dataset, central
tendency is a branch of descriptive statistics.
The central tendency is one of the most quintessential concepts in statistics. Although it does
not provide information regarding the individual values in the dataset, it delivers a
comprehensive summary of the whole dataset.
Generally, the central tendency of a dataset can be described using the following measures:
• Mean (Average): Represents the sum of all values in a dataset divided by the total
number of the values.
• Median: The middle value in a dataset that is arranged in ascending order (from the
smallest value to the largest value). If a dataset contains an even number of values, the
median of the dataset is the mean of the two middle values.
• Mode: Defines the most frequently occurring value in a dataset. In some cases, a
dataset may contain multiple modes, while some datasets may not have any mode at
all.
Even though the measures above are the most commonly used to define central tendency,
there are some other measures, including, but not limited to, geometric mean, harmonic
mean, midrange, and geometric median.
The selection of a central tendency measure depends on the properties of a dataset. For
instance, the mode is the only central tendency measure for categorical data, while a median
works best with ordinal data.
Although the mean is regarded as the best measure of central tendency for quantitative data,
that is not always the case. For example, the mean may not work well with quantitative
datasets that contain extremely large or extremely small values. The extreme values may
distort the mean. Thus, you may consider other measures.
The measures of central tendency can be found using a formula or definition. Also, they can
be identified using a frequency distribution graph. Note that for datasets that follow a normal
distribution, the mean, median, and mode are located on the same spot on the graph.
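A quick sketch of computing the three measures in Python on a made-up dataset:

    import statistics

    data = [2, 3, 3, 5, 7, 9, 9, 9, 12]

    print(statistics.mean(data))     # arithmetic mean
    print(statistics.median(data))   # middle value of the sorted data (7 here)
    print(statistics.mode(data))     # most frequently occurring value (9 here)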
What Is Variance?
The term variance refers to a statistical measurement of the spread between numbers in a
data set. More specifically, variance measures how far each number in the set is from
the mean (average), and thus from every other number in the set. Variance is often depicted
by the symbol σ². It is used by both analysts and traders to determine volatility and market security.
The square root of the variance is the standard deviation (SD or σ), which helps determine
the consistency of an investment’s returns over a period of time.
Advantages and Disadvantages of Variance
Statisticians use variance to see how individual numbers relate to each other within a data set, rather than using broader mathematical techniques such as arranging numbers into quartiles. One advantage of variance is that it treats all deviations from the mean the same, regardless of their direction; because the deviations are squared, they cannot sum to zero and give the appearance of no variability at all in the data.
One drawback to variance, though, is that it gives added weight to outliers. These are the
numbers far from the mean. Squaring these numbers can skew the data. Another pitfall of
using variance is that it is not easily interpreted. Users often employ it primarily to take the
square root of its value, which indicates the standard deviation of the data. As noted above,
investors can use standard deviation to assess how consistent returns are over time.
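A short sketch of variance and standard deviation in Python, using made-up return values:

    import statistics

    returns = [0.02, -0.01, 0.03, 0.015, -0.005]   # hypothetical investment returns

    var = statistics.variance(returns)   # sample variance (divides by n - 1)
    sd = statistics.stdev(returns)       # standard deviation = square root of the variance
    print(var, sd)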
Introduction
Welcome to the world of Probability in Data Science! Let me start things off with an intuitive
example.
Suppose you are a teacher at a university. After checking assignments for a week, you graded all the students. You gave these graded papers to a data entry guy at the university and told him to create a spreadsheet containing the grades of all the students. But the guy only stores the grades, not the corresponding students.
He made another blunder: he missed a couple of entries in a hurry, and we have no idea whose grades are missing. One way to deal with this is to visualize the grades and see if you can find a trend in the data.
The graph that you have plotted is called the frequency distribution of the data. You see that there is a smooth curve-like structure that defines our data, but do you notice an anomaly? There is an abnormally low frequency at a particular score range, so the best guess would be that the missing entries lie in that range.
This is how you would try to solve a real-life problem using data analysis. For any data scientist, student, or practitioner, distributions are a must-know concept: they provide the basis for analytics and inferential statistics. While the concept of probability gives us the mathematical calculations, distributions help us visualize what is happening underneath.
This section covers some important types of probability distributions that are commonly used in data science.
Note: This section assumes you have a basic knowledge of probability. If not, refer to an introductory resource on probability first.
Before we jump into the explanation of distributions, let's see what kinds of data we can encounter: discrete and continuous.
Discrete data, as the name suggests, can take only specified values. For example, when you roll a die, the possible outcomes are 1, 2, 3, 4, 5 or 6, and not 1.5 or 2.45.
Continuous data can take any value within a given range, and the range may be finite or infinite. For example, a girl's weight or height, or the length of a road: a girl's weight can be any value within a range, not just a whole number.
Types of Distributions
Bernoulli Distribution
Let’s start with the easiest distribution that is Bernoulli Distribution. It is actually easier to
All you cricket junkies out there! At the beginning of any cricket match, how do you decide
who is going to bat or ball? A toss! It all depends on whether you win or lose the toss, right?
Let’s say if the toss results in a head, you win. Else, you lose. There’s no midway.
A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure),
and a single trial. So the random variable X which has a Bernoulli distribution can take value
1 with the probability of success, say p, and the value 0 with the probability of failure, say q
or 1-p.
Here, the occurrence of a head denotes success, and the occurrence of a tail denotes failure.
Probability of getting a head = 0.5 = Probability of getting a tail since there are only two
possible outcomes.
The probability mass function is given by: px(1-p)1-x where x € (0, 1).
The probabilities of success and failure need not be equally likely, like the result of a fight
between me and Undertaker. He is pretty much certain to win. So in this case probability of
Here, the probability of success(p) is not same as the probability of failure. So, the chart
Here, the probability of success = 0.15 and probability of failure = 0.85. The expected value
is exactly what it sounds. If I punch you, I may expect you to punch me back. Basically
expected value of any distribution is the mean of the distribution. The expected value of a
There are many examples of Bernoulli distribution such as whether it’s going to rain
tomorrow or not where rain denotes success and no rain denotes failure and Winning
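A small sketch of the Bernoulli distribution with SciPy, using the p = 0.15 example above:

    from scipy.stats import bernoulli

    p = 0.15                                  # probability of success

    print(bernoulli.pmf(1, p))                # P(X = 1) = 0.15
    print(bernoulli.pmf(0, p))                # P(X = 0) = 0.85
    print(bernoulli.mean(p))                  # expected value = p

    print(bernoulli.rvs(p, size=10, random_state=0))   # simulate 10 trials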
Uniform Distribution
When you roll a fair die, the outcomes are 1 to 6. The probabilities of getting these outcomes
are equally likely and that is the basis of a uniform distribution. Unlike Bernoulli
Distribution, all the n number of possible outcomes of a uniform distribution are equally
likely.
Suppose the number of bouquets sold daily at a flower shop is uniformly distributed with a minimum of 10 and a maximum of 40.
Let's try calculating the probability that the daily sales will fall between 15 and 30. That probability is (30 - 15) * (1/(40 - 10)) = 0.5. Similarly, the probability that daily sales are greater than 20 is (40 - 20) * (1/(40 - 10)) ≈ 0.667.
The standard uniform distribution has parameters a = 0 and b = 1, so its probability density is f(x) = 1 for 0 ≤ x ≤ 1 and 0 otherwise.
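The flower-shop probabilities above can be reproduced with SciPy's uniform distribution (loc is the lower bound and scale is the width of the interval):

    from scipy.stats import uniform

    dist = uniform(loc=10, scale=30)          # uniform on [10, 40]

    print(dist.cdf(30) - dist.cdf(15))        # P(15 <= X <= 30) = 0.5
    print(1 - dist.cdf(20))                   # P(X > 20) ≈ 0.667
    print(dist.mean())                        # 25, the midpoint of the interval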
Binomial Distribution
Let’s get back to cricket. Suppose that you won the toss today and this indicates a successful
event. You toss again but you lost this time. If you win a toss today, this does not necessitate
that you will win the toss tomorrow. Let’s assign a random variable, say X, to the number of
times you won the toss. What can be the possible value of X? It can be any number
Therefore, probability of getting a head = 0.5 and the probability of failure can be easily
A distribution where only two outcomes are possible, such as success or failure, gain or loss,
win or lose and where the probability of success and failure is same for all the trials is called
a Binomial Distribution.
The outcomes need not be equally likely. Remember the example of a fight between me and
Undertaker? So, if the probability of success in an experiment is 0.2 then the probability of
Each trial is independent since the outcome of the previous toss doesn’t determine or affect
the outcome of the current toss. An experiment with only two possible outcomes repeated n
number of times is called binomial. The parameters of a binomial distribution are n and p
where n is the total number of trials and p is the probability of success in each trial.
On the basis of the above explanation, the properties of a Binomial Distribution are
A binomial distribution graph where the probability of success does not equal the probability
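A quick sketch with SciPy for 10 fair coin tosses (n = 10, p = 0.5):

    from scipy.stats import binom

    n, p = 10, 0.5

    print(binom.pmf(6, n, p))                  # probability of exactly 6 successes
    print(binom.cdf(4, n, p))                  # probability of at most 4 successes
    print(binom.mean(n, p), binom.var(n, p))   # mean = n*p = 5, variance = n*p*(1-p) = 2.5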
Normal Distribution
The normal distribution represents the behaviour of most situations in the universe (that is why it's called a "normal" distribution, I guess!). The sum of many (small) random variables often turns out to be normally distributed, which contributes to its widespread application. A distribution is described as normal when its mean, median, and mode coincide and its curve is bell-shaped and symmetric about the mean.
The mean and variance of a normally distributed random variable X are given by its parameters µ and σ², i.e., E(X) = µ and Var(X) = σ².
A standard normal distribution is defined as the distribution with a mean of 0 and a standard deviation of 1.
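A short sketch of the normal distribution in SciPy; the height figures are made up for illustration:

    from scipy.stats import norm

    # Standard normal: mean 0, standard deviation 1
    print(norm.cdf(1.96))                        # ≈ 0.975
    print(norm.ppf(0.975))                       # ≈ 1.96, the inverse of the CDF

    heights = norm(loc=170, scale=10)            # hypothetical heights: mean 170, sd 10
    print(heights.cdf(180) - heights.cdf(160))   # ≈ 0.68, within one standard deviation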
Machine learning algorithms are programs that can learn hidden patterns from data, predict outputs, and improve their performance from experience on their own. Different algorithms can be used in machine learning for different tasks: for example, simple linear regression can be used for prediction problems like stock market prediction, while the KNN algorithm can be used for classification problems.
In this topic, we will see an overview of some popular and commonly used machine learning algorithms, along with the categories they belong to.
1) Supervised Learning
Supervised learning is a type of machine learning in which the machine needs external supervision to learn. Supervised learning models are trained using labelled datasets. Once training and processing are done, the model is tested with sample test data to check whether it predicts the correct output.
The goal of supervised learning is to map input data to output data. Supervised learning is based on supervision, just as a student learns under a teacher's supervision. An example of supervised learning is spam filtering.
Supervised learning can be divided further into two categories of problem:
o Classification
o Regression
Examples of some popular supervised learning algorithms are simple linear regression, decision trees, logistic regression, the KNN algorithm, etc.
2) Unsupervised Learning
Unsupervised learning is a type of machine learning in which the machine does not need any external supervision to learn from the data. Unsupervised models are trained using unlabelled datasets that are neither classified nor categorized, and the algorithm must act on the data without any supervision. In unsupervised learning, the model does not have a predefined output; it tries to find useful insights from huge amounts of data.
Unsupervised learning is used to solve association and clustering problems. Hence, it can be further classified into two types:
o Clustering
o Association
3) Reinforcement Learning
In reinforcement learning, an agent interacts with its environment by producing actions and learns with the help of feedback. The feedback is given to the agent in the form of rewards: for each good action, it gets a positive reward, and for each bad action, it gets a negative reward. There is no supervision provided to the agent. The Q-learning algorithm is commonly used in reinforcement learning.
1. Linear Regression
Linear regression is one of the most popular and simple machine learning algorithms used for predictive analysis. Predictive analysis is about predicting something, and linear regression makes predictions for continuous numbers such as salary, age, etc.
It shows the linear relationship between the dependent and independent variables, and shows how the dependent variable (y) changes according to the independent variable (x).
It tries to fit the best line between the dependent and independent variables, and this best-fit line is known as the regression line. The regression line is given by the equation
y = a0 + a1*x + ε
where y is the dependent (target) variable, x is the independent variable, a0 is the intercept of the line, a1 is the linear regression coefficient (the slope), and ε is the random error. For example, linear regression can be used to predict weight from height.
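A minimal scikit-learn sketch of fitting such a line, with made-up height and weight values:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    heights = np.array([[150], [160], [170], [180], [190]])    # cm (made up)
    weights = np.array([50, 58, 66, 75, 84])                   # kg (made up)

    reg = LinearRegression().fit(heights, weights)
    print(reg.intercept_, reg.coef_)     # a0 and a1 of the fitted line
    print(reg.predict([[175]]))          # predicted weight for a 175 cm person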
2. Logistic Regression
Logistic regression is a supervised learning algorithm used to predict categorical variables or discrete values. It can be used for classification problems in machine learning, and the output of the logistic regression algorithm can be Yes or No, 0 or 1, Red or Blue, etc.
Logistic regression is similar to the linear regression except how they are used, such as Linear
regression is used to solve the regression problem and predict continuous values, whereas
Logistic regression is used to solve the Classification problem and used to predict the discrete
values.
Instead of fitting a best-fit line, logistic regression forms an S-shaped curve that lies between 0 and 1. This S-shaped curve is known as the logistic (sigmoid) function and uses the concept of a threshold: any value above the threshold tends to 1, and any value below the threshold tends to 0.
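A minimal sketch, assuming scikit-learn (the tiny dataset is illustrative, e.g. hours studied versus pass/fail):
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6]])   # feature, e.g. hours studied
y = np.array([0, 0, 0, 1, 1, 1])               # discrete output: fail (0) / pass (1)

clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.5], [4.5]]))             # predicted class labels (0 or 1)
print(clf.predict_proba([[4.5]]))              # S-shaped probability output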
3. Decision Tree Algorithm
A decision tree is a supervised learning algorithm that is mainly used to solve classification problems but can also be used for regression problems. It can work with both categorical and continuous variables. It shows a tree-like structure that includes nodes and branches, starting with the root node, which expands into further branches till the leaf nodes. The internal nodes represent the features of the dataset, branches show the decision rules, and leaf nodes represent the outcome of the problem.
Some real-world applications of decision tree algorithms are identification of cancerous and non-cancerous cells, suggestions to customers on which car to buy, etc.
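A minimal sketch, assuming scikit-learn and its built-in iris dataset:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)   # builds root, branches and leaf nodes
print(tree.predict(X[:5]))                              # predicted class of the first few samples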
4. Support Vector Machine Algorithm
A support vector machine or SVM is a supervised learning algorithm that can also be used for
classification and regression problems. However, it is primarily used for classification
problems. The goal of SVM is to create a hyperplane or decision boundary that can segregate
datasets into different classes.
The data points that help to define the hyperplane are known as support vectors, and hence the algorithm is named the support vector machine algorithm.
Some real-life applications of SVM are face detection, image classification, drug discovery, etc. In a typical illustration, the hyperplane separates the dataset into two different classes.
5. Naïve Bayes Classifier Algorithm
The Naïve Bayes classifier is a supervised learning algorithm that is used to make predictions based on the probability of an object. The algorithm is named Naïve Bayes because it is based on Bayes' theorem and follows the naïve assumption that the variables are independent of each other.
Bayes' theorem is based on conditional probability: it gives the likelihood that event A will happen given that event B has already happened. The equation for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
The Naïve Bayes classifier is one of the best classifiers for providing a good result for a given problem. It is easy to build a naïve Bayesian model, and it is well suited for large datasets. It is mostly used for text classification.
6. K-Nearest Neighbour (KNN)
K-Nearest Neighbour is a supervised learning algorithm that can be used for both classification and regression problems. The algorithm works by measuring the similarity between the new data point and the available data points; based on these similarities, the new data point is put in the most similar category. It is also known as the lazy learner algorithm because it stores all the available data and classifies each new case with the help of its K neighbours. The new case is assigned to the class with the most similar neighbours, and a distance function measures the distance between the data points. The distance function can be Euclidean, Minkowski, Manhattan, or Hamming distance, based on the requirement.
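A minimal sketch, assuming scikit-learn (Euclidean distance is the default metric):
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)   # "lazy" learner: it simply stores the data
print(knn.predict(X[:3]))                              # assigns each case to the most similar class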
7. K-Means Clustering
K-means clustering is one of the simplest unsupervised learning algorithms, which is used to solve clustering problems. The data points are grouped into K different clusters based on similarities and dissimilarities, meaning that points with the most commonalities remain in one cluster, and different clusters have few or no commonalities with each other. In K-means, K refers to the number of clusters, and "means" refers to averaging the data in order to find the centroid.
It is a centroid-based algorithm, and each cluster is associated with a centroid. This algorithm
aims to reduce the distance between the data points and their centroids within a cluster.
This algorithm starts with a group of randomly selected centroids that form the clusters at
starting and then perform the iterative process to optimize these centroids' positions.
It can be used for spam detection and filtering, identification of fake news, etc.
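A minimal sketch, assuming scikit-learn (K, i.e. n_clusters, is chosen as 2 for this illustrative data):
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster assigned to each point
print(km.cluster_centers_)   # the optimized centroids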
8. Random Forest Algorithm
Random forest is a supervised learning algorithm that can be used for both classification and regression problems in machine learning. It is an ensemble learning technique that provides predictions by combining multiple classifiers, thereby improving the performance of the model.
It contains multiple decision trees trained on subsets of the given dataset and combines their results to improve the predictive accuracy of the model. A random forest typically contains around 64-128 trees, and a greater number of trees generally leads to higher accuracy.
To classify a new dataset or object, each tree gives the classification result and based on the
majority votes, the algorithm predicts the final output.
Random forest is a fast algorithm and can deal efficiently with missing and incorrect data.
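A minimal sketch, assuming scikit-learn (100 trees is an illustrative choice within the typical range mentioned above):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.predict(X[:5]))     # final output is the majority vote across the individual trees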
9. Apriori Algorithm
The Apriori algorithm is an unsupervised learning algorithm that is used to solve association problems. It uses frequent itemsets to generate association rules and is designed to work on databases that contain transactions. With the help of these association rules, it determines how strongly or how weakly two objects are connected to each other. The algorithm uses a breadth-first search and a hash tree to compute the itemsets efficiently.
The algorithm proceeds iteratively to find the frequent itemsets in the large dataset.
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994. It is mainly used for market basket analysis and helps to understand which products are frequently bought together. It can also be used in the healthcare field to find drug reactions in patients.
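A minimal sketch, assuming the mlxtend library (the tiny transaction list is illustrative):
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [['milk', 'bread', 'butter'],
                ['milk', 'bread'],
                ['bread', 'butter'],
                ['milk', 'butter'],
                ['milk', 'bread', 'butter']]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
frequent = apriori(onehot, min_support=0.4, use_colnames=True)            # frequent itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])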
10. Principal Component Analysis (PCA)
Principal Component Analysis is an unsupervised learning technique used for dimensionality reduction. PCA works by considering the variance of each attribute, because high variance indicates a good split between the classes, and hence it reduces the dimensionality of the dataset.
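A minimal sketch, assuming scikit-learn (features are standardized before PCA, which is common practice):
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)                       # reduce 4 features to 2 components
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)            # variance captured by each component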
Linear Regression in Machine Learning
Linear regression is one of the easiest and most popular machine learning algorithms. It is a statistical method that is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more independent (x) variables, hence the name linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:
Mathematically, we can represent a linear regression as:
y = a0 + a1x + ε
Here,
y = dependent (target) variable
x = independent (predictor) variable
a0 = intercept of the line
a1 = linear regression coefficient (slope)
ε = random error
The values for the x and y variables are the training dataset used for the linear regression model representation.
Linear regression can be further divided into two types of algorithm:
o Simple Linear Regression: a single independent variable is used to predict the value of the numerical dependent variable.
o Multiple Linear Regression: more than one independent variable is used to predict the value of the numerical dependent variable.
A linear line showing the relationship between the dependent and independent variables is
called a regression line. A regression line can show two types of relationship:
o Positive Linear Relationship: if the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, the relationship is termed a positive linear relationship.
o Negative Linear Relationship: if the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, the relationship is termed a negative linear relationship.
When working with linear regression, our main goal is to find the best fit line that means the
error between predicted values and actual values should be minimized. The best fit line will
have the least error.
Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best-fit line; to calculate this, we use a cost function.
Cost function-
o The different values for weights or coefficient of lines (a0, a1) gives the different line
of regression, and the cost function is used to estimate the values of the coefficient for
the best fit line.
o Cost function optimizes the regression coefficients or weights. It measures how a
linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also known
as Hypothesis function.
For linear regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. It can be written as:
MSE = (1/N) * Σ (yi - (a0 + a1xi))²
Where,
N = total number of observations, yi = actual value of the i-th observation, and (a0 + a1xi) = predicted value.
Residuals: The distance between an actual value and the corresponding predicted value is called a residual. If the observed points are far from the regression line, the residuals are large, and so the cost function is high. If the scatter points are close to the regression line, the residuals are small, and hence the cost function is small.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o This is done by randomly selecting initial values of the coefficients and then iteratively updating them to reach the minimum of the cost function, as illustrated in the sketch below.
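A minimal sketch of gradient descent for simple linear regression, using NumPy; the data, learning rate, and number of iterations are illustrative assumptions:
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # independent variable
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])       # dependent variable
a0, a1 = 0.0, 0.0                               # intercept and slope, randomly/zero initialized
lr, n = 0.01, len(x)

for _ in range(5000):
    y_pred = a0 + a1 * x
    # gradients of the MSE cost with respect to a0 and a1
    d_a0 = (-2 / n) * np.sum(y - y_pred)
    d_a1 = (-2 / n) * np.sum((y - y_pred) * x)
    a0 -= lr * d_a0
    a1 -= lr * d_a1

print(a0, a1)   # approaches the least-squares solution (roughly 1.14 and 1.96 for this data)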
Model Performance:
The goodness of fit determines how well the regression line fits the set of observations. The process of finding the best model out of various models is called optimization. It can be assessed by the method below:
1. R-squared method:
R-squared is a statistical method that determines the goodness of fit. It measures the strength of the relationship between the dependent and independent variables on a scale of 0-100% and is also known as the coefficient of determination. A high R-squared value indicates a smaller difference between predicted and actual values and hence a better model.
Assumptions of Linear Regression
Below are some important assumptions of linear regression. These are formal checks while building a linear regression model, which help to get the best possible result from the given dataset:
o A linear relationship between the features and the target.
o Little or no multicollinearity between the features.
o Homoscedasticity (constant variance of the error terms).
o Normal distribution of the error terms.
o Little or no autocorrelation in the errors.
Support Vector Machine Algorithm
A Support Vector Machine or SVM is one of the most popular supervised learning algorithms, which is used for classification as well as regression problems. However, it is primarily used for classification problems in machine learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in the
correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called as support vectors, and hence algorithm is termed as Support Vector
Machine. Consider the below diagram in which there are two different categories that are
classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We first train our model with lots of images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it with this strange creature. The SVM algorithm creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) of cats and dogs. On the basis of the support vectors, it will classify the creature as a cat.
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Hyperplane:
The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features, the hyperplane is a straight line, and if there are 3 features, the hyperplane is a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e., the maximum distance between the hyperplane and the nearest data points of either class.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the position
of the hyperplane are termed as Support Vector. Since these vectors support the hyperplane,
hence called a Support vector.
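A minimal sketch, assuming scikit-learn (a linear kernel keeps the decision boundary a straight hyperplane):
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
svm = SVC(kernel='linear').fit(X, y)
print(svm.support_vectors_[:3])   # the extreme points (support vectors) that define the hyperplane
print(svm.predict(X[:3]))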
Naïve Bayes Classifier Algorithm
The Naïve Bayes algorithm is comprised of two words, Naïve and Bayes, which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. Such as if the fruit is identified on the
bases of color, shape, and taste, then red, spherical, and sweet fruit is recognized as an
apple. Hence each feature individually contributes to identify that it is an apple
without depending on each other.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is the Likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the Marginal probability: the probability of the evidence.
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the steps below:
Problem: If the weather is sunny, should the player play or not?
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table of the weather conditions:
Weather    Yes   No
Overcast    5     0
Rainy       2     2
Sunny       3     2
Total      10     4
Likelihood table of the weather conditions:
Weather    No            Yes
Overcast   0             5             5/14 = 0.35
Rainy      2             2             4/14 = 0.29
Sunny      2             3             5/14 = 0.35
All        4/14 = 0.29   10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), the player should play on a sunny day.
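A minimal sketch, assuming pandas, that reproduces the hand calculation above from the same 14-row dataset:
import pandas as pd

outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play =    ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
           "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]
df = pd.DataFrame({"Outlook": outlook, "Play": play})

p_sunny = (df["Outlook"] == "Sunny").mean()                            # P(Sunny)
p_yes = (df["Play"] == "Yes").mean()                                   # P(Yes)
p_sunny_given_yes = (df[df["Play"] == "Yes"]["Outlook"] == "Sunny").mean()
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny                # Bayes' theorem
print(round(p_yes_given_sunny, 2))                                     # ~0.6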
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution.
This means if predictors take continuous values instead of discrete, then the model
assumes that these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is
multinomial distributed. It is primarily used for document classification problems, it
means a particular document belongs to which category such as Sports, Politics,
education, etc.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similar to the Multinomial classifier, but the
predictor variables are the independent Booleans variables. Such as if a particular
word is present or not in a document. This model is also famous for document
classification tasks.
Module 4
Data visualization converts large and small data sets into visuals, which are easier for humans to understand and process.
Data visualization tools provide accessible ways to understand outliers, patterns, and trends in
the data.
In the world of Big Data, the data visualization tools and technologies are required to analyze
vast amounts of information.
Data visualizations are common in everyday life, and they often appear in the form of graphs and charts. A combination of multiple visualizations and bits of information is referred to as an infographic.
Data visualizations are used to discover unknown facts and trends. You can see visualizations
in the form of line charts to display change over time. Bar and column charts are useful for
observing relationships and making comparisons. A pie chart is a great way to show parts-of-
a-whole. And maps are the best way to share geographical data visually.
Today's data visualization tools go beyond the charts and graphs used in the Microsoft Excel
spreadsheet, which displays the data in more sophisticated ways such as dials and gauges,
geographic maps, heat maps, pie chart, and fever chart.
Effective data visualizations are created where communication, data science, and design collide. Done right, data visualizations distill complicated data sets into meaningful and natural insights.
American statistician and Yale professor Edward Tufte believes useful data visualizations consist of "complex ideas communicated with clarity, precision, and efficiency."
To craft an effective data visualization, you need to start with clean data that is well-sourced
and complete. After the data is ready to visualize, you need to pick the right chart.
After you have decided the chart type, you need to design and customize your visualization to
your liking. Simplicity is essential - you don't want to add any elements that distract from the
data.
Data visualization is defined as the pictorial representation of data to provide fact-based analysis to decision-makers, as plain text might not reveal the patterns or trends needed to recognize what is in the data. Based on what is visualized, it is classified into 6 different types:
o Temporal (data is linear and one-dimensional)
o Hierarchical (visualizes ordered groups within a larger group)
o Network (visualizes the connections between datasets)
o Multidimensional (in contrast to the temporal type, it has multiple dimensions)
o Geospatial (involves geospatial or spatial maps)
o Miscellaneous
What is Data Visualization?
Data visualization is a methodology by which the data in raw format is portrayed to bring out
the meaning of that. With the advent of big data, it has become imperative to build a
meaningful way of showcasing the data so that the amount of data doesn’t become
overwhelming. The part of portraying the data can be used for various purposes, such as
finding trends/commonalities/patterns in data, building models for machine learning, or being
used for a simple operation like aggregation.
Temporal: Data for these types of visualization should satisfy both conditions: data represented
should be linear and should be one-dimensional. These visualization types are represented through
lines that might overlap and have a common start and finish data point.
Hierarchical: These types of visualizations portray ordered groups within a larger group. In simple
language, the main intuition behind these visualizations is the clusters can be displayed if the flow
of the clusters starts from a single point.
Multidimensional: In contrast to the temporal type of visualization, these types can have multiple
dimensions. In this, we can use 2 or more features to create a 3-D visualization through concurrent
layers. These will enable the user to present key takeaways by breaking a lot of non-useful data.
Scatter plots: in multidimensional data, we can select any 2 features and plot them in a 2-D scatter plot; doing this for every pair of n features would give nC2 = n(n-1)/2 graphs.
Geospatial: These visualizations relates to present real-life physical location by crossing it over
with maps (It may be a geospatial or spatial map). The intuition behind these visualizations is to
create a holistic view of performance.
The process of conversion of data from one form to another form is known as Encoding. It is
used to transform the data so that data can be supported and used by different systems.
Encoding works similarly to converting temperature from centigrade to Fahrenheit: the value is just converted into another form, but the original information remains the same. Encoding is mainly used in two fields, which are discussed below.
Note: Encoding is different from encryption as its main purpose is not to hide the data but
to convert it into a format so that it can be properly consumed.
In this topic, we are going to discuss the different types of encoding techniques that are used
in computing.
o Character Encoding
o Image & Audio and Video Encoding
Character Encoding
Character encoding encodes characters into bytes. It informs the computers how to interpret
the zero and ones into real characters, numbers, and symbols. The computer understands only
binary data; hence it is required to convert these characters into numeric codes. To achieve
this, each character is converted into binary code, and for this, text documents are saved with
encoding types. It can be done by pairing numbers with characters. If we don't apply
character encoding, our website will not display the characters and text in a proper format.
Hence it will decrease the readability, and the machine would not be able to process data
correctly. Further, character encoding makes sure that each character has a proper
representation in computer or binary format.
There are different types of Character Encoding techniques, which are given below:
1. HTML Encoding
2. URL Encoding
3. Unicode Encoding
4. Base64 Encoding
5. Hex Encoding
6. ASCII Encoding
HTML Encoding
HTML encoding is used to display an HTML page in a proper format. With encoding, a web browser gets to know which character set is to be used.
In HTML, various characters such as < and > are used in the markup itself. To display these characters as content, we need to use HTML encoding (character entities).
URL Encoding
URL (Uniform Resource Locator) encoding is used to convert characters into a format that can be transmitted over the internet. It is also known as percent-encoding. URL encoding is performed to send a URL over the internet using the ASCII character set: non-ASCII and reserved characters are replaced with a %, followed by hexadecimal digits, as in the sketch below.
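A minimal sketch using Python's standard urllib.parse module (the sample string is illustrative):
from urllib.parse import quote, unquote

encoded = quote("price list 100%")   # unsafe characters become %XX hexadecimal escapes
print(encoded)                        # price%20list%20100%25
print(unquote(encoded))               # price list 100%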
UNICODE Encoding
Unicode is an encoding standard for a universal character set. It allows encoding, representing, and handling text written in most of the languages and writing systems used worldwide. It provides a code point (a number) for each character in every supported language and can represent approximately all the possible characters of those languages. A particular sequence of bits is known as a code unit.
The Unicode standard defines the Unicode Transformation Formats (UTF) to encode the code points; a small example follows the list below.
o UTF-8 Encoding
UTF-8 is defined by the Unicode standard as a variable-width character encoding widely used in electronic communication. UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units.
o UTF-16 Encoding
UTF-16 encoding represents a character's code point using one or two 16-bit code units.
o UTF-32 Encoding
UTF-32 encoding represents each code point as a single 32-bit integer.
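As a small illustration of the three UTF forms, the same character can be encoded with Python's built-in str.encode():
text = "€"                           # code point U+20AC
print(text.encode("utf-8"))          # b'\xe2\x82\xac'       -> three 8-bit code units
print(text.encode("utf-16-le"))      # bytes 0xAC 0x20       -> one 16-bit code unit
print(text.encode("utf-32-le"))      # bytes 0xAC 0x20 0x00 0x00 -> one 32-bit code unit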
Base64 Encoding
Base64 Encoding is used to encode binary data into equivalent ASCII Characters. The
Base64 encoding is used in the Mail system as mail systems such as SMTP can't work with
binary data because they accept ASCII textual data only. It is also used in simple HTTP
authentication to encode the credentials. Moreover, it is also used to transfer the binary data
into cookies and other parameters to make data unreadable to prevent tampering. If an image
or another file is transferred without Base64 encoding, it will get corrupted as the mail system
is not able to deal with binary data.
Base64 represents the data in blocks of 3 bytes, where each byte contains 8 bits; hence each block represents 24 bits. These 24 bits are divided into four groups of 6 bits, and each of these groups or chunks is converted into its equivalent Base64 character.
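A minimal sketch using Python's standard base64 module to verify this 3-byte-to-4-character grouping:
import base64

raw = b"Hi!"                          # 3 bytes = 24 bits
encoded = base64.b64encode(raw)       # 24 bits -> four 6-bit groups -> 4 characters
print(encoded)                        # b'SGkh'
print(base64.b64decode(encoded))      # b'Hi!'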
ASCII Encoding
The ASCII code is used to represent English characters as numbers, where each letter is assigned a number from 0 to 127. Most modern character-encoding schemes are based
on ASCII, though they support many additional characters. It is a single byte encoding only
using the bottom 7 bits. In an ASCII file, each alphabetic, numeric, or special character is
represented with a 7-bit binary number. Each character of the keyboard has an equivalent
ASCII value.
Image and audio & video encoding are performed to save storage space. A media file such as
image, audio, and video are encoded to save them in a more efficient and compressed format.
These encoded files contain the same content with usually similar quality, but in compressed
size, so that they can be saved within less space, can be transferred easily via mail, or can be
downloaded on the system.
We can understand it through an example: a .WAV audio file may be converted into a .MP3 file to reduce its size to roughly 1/10th of the original.
Retinal Variables
The most fundamental choice in any data visualization project is how your real-world values
will be translated into marks on the page or screen. In this exercise we’ll be encoding
an extremely simple data set repeatedly in order to exhaustively catalog the different ways a
handful of numbers can be represented.
Refer to Jacques Bertin’s cheat sheet from The Semiology of Graphics for all the ways
quantitative and qualitative can be encoded visually:
Since we are going to be working on categorical variables in this article, here is a quick
refresher on the same with a couple of examples. Categorical variables are usually
represented as ‘strings’ or ‘categories’ and are finite in number. Here are a few examples:
1. The city where a person lives: Delhi, Mumbai, Ahmedabad, Bangalore, etc.
2. The department a person works in: Finance, Human resources, IT, Production.
3. The highest degree a person has: High school, Diploma, Bachelors, Masters, PhD.
4. The grades of a student: A+, A, B+, B, B- etc.
In the above examples, the variables only have a definite set of possible values. Further, we can see that there are two kinds of categorical data:
o Ordinal data: the categories have an inherent order. While encoding ordinal data, we have to retain the information about the order in which the categories are provided. For example, the highest degree a person possesses gives vital information about their qualification, so the degree is an important ordered feature.
o Nominal data: the categories do not have an inherent order. While encoding nominal data, we only have to consider the presence or absence of a feature. For example, for the city a person lives in, it is important to retain where the person lives, but there is no order or sequence: it is equal whether a person lives in Delhi or Bangalore.
For encoding categorical data, we have a Python package, category_encoders. The following sections walk through its most commonly used encoders.
Label Encoding or Ordinal Encoding
We use this categorical data encoding technique when the categorical feature is ordinal. In this case, retaining the order is important, and hence the encoding should reflect the sequence.
In label encoding, each label is converted into an integer value. We will create a variable that contains the categories representing the qualification of a person and then fit and transform the training data with an ordinal encoder.
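A minimal sketch, assuming the category_encoders package and a hypothetical 'Degree' column; the mapping encodes the order of the qualifications:
import category_encoders as ce
import pandas as pd

train_df = pd.DataFrame({'Degree': ['High school', 'Masters', 'Diploma',
                                    'Bachelors', 'Bachelors', 'Masters',
                                    'PhD', 'High school', 'High school']})

# create an ordinal encoder with an explicit order for the categories
encoder = ce.OrdinalEncoder(cols=['Degree'], return_df=True,
                            mapping=[{'col': 'Degree',
                                      'mapping': {'High school': 1, 'Diploma': 2,
                                                  'Bachelors': 3, 'Masters': 4,
                                                  'PhD': 5}}])

# Fit and transform train data
df_train_transformed = encoder.fit_transform(train_df)
print(df_train_transformed)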
One Hot Encoding
We use this categorical data encoding technique when the features are nominal (do not have any order). In one hot encoding, for each level of a categorical feature, we create a new variable. Each category is mapped to a binary variable containing either 0 or 1; here, 0 represents the absence and 1 represents the presence of that category.
These newly created binary features are known as dummy variables. The number of dummy variables depends on the levels present in the categorical variable. This might sound complicated, so let us take an example to understand it better. Suppose we have a dataset with a category Animal, having different animals like Dog, Cat, Sheep, Cow, and Lion; we then create one binary variable per animal.
After encoding, in the second table, we have dummy variables, each representing a category in the feature Animal. Now, for each category that is present, we have 1 in the column of that category and 0 for the others. Let's see how to implement one-hot encoding in Python.
import category_encoders as ce
import pandas as pd
data = pd.DataFrame({'City': [
    'Delhi','Mumbai','Hydrabad','Chennai','Bangalore','Delhi','Hydrabad','Bangalore','Delhi'
]})
#Original Data
data
#create and apply a one-hot encoder on the 'City' column
encoder = ce.OneHotEncoder(cols='City', handle_unknown='return_nan', return_df=True, use_cat_names=True)
data_encoded = encoder.fit_transform(data)
data_encoded
Now let's move to another very interesting and widely used encoding technique, i.e., dummy encoding.
Dummy Encoding
The dummy coding scheme is similar to one-hot encoding. This categorical data encoding method transforms the categorical variable into a set of binary variables (also known as dummy variables). In the case of one-hot encoding, for N categories in a variable, it uses N binary variables, while dummy encoding uses N-1 features to represent N categories.
To understand this better, consider coding the same data using both one-hot encoding and dummy encoding: while one-hot encoding uses 3 variables to represent 3 categories, dummy encoding uses only 2 variables to code the same 3 categories.
import pandas as pd
data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']})
#Original Data
data
#encode the data with dummy (N-1) encoding
data_encoded = pd.get_dummies(data=data, drop_first=True)
data_encoded
Here, using the drop_first argument, the first label (Bangalore, which comes first alphabetically) is dropped and is represented by a row of all 0s.
One hot encoder and dummy encoder are two powerful and effective encoding schemes.
They are also very popular among the data scientists, But may not be as effective when-
1. A large number of levels are present in data. If there are multiple categories in a
feature variable in such a case we need a similar number of dummy variables to
encode the data. For example, a column with 30 different values will require 30 new
variables for coding.
2. If we have multiple categorical features in the dataset, a similar situation will occur and again we will end up having several binary features, each representing a categorical feature and its multiple categories, e.g., a dataset having 10 or more categorical columns.
In both the above cases, these two encoding schemes introduce sparsity in the dataset, i.e., several columns having 0s and a few of them having 1s. In other words, they create multiple dummy variables that are highly correlated, which means that using the other variables we can easily predict the value of a variable.
Due to the massive increase in the dataset, coding slows down the learning of the model
along with deteriorating the overall performance that ultimately makes the model
computationally expensive. Further, while using tree-based models these encodings are not
an optimum choice.
Effect Encoding:
This encoding technique is also known as Deviation Encoding or Sum Encoding. Effect encoding is almost similar to dummy encoding, with a little difference: in dummy coding we use 0 and 1 to represent the data, but in effect encoding we use three values, i.e., 1, 0, and -1.
The row containing only 0s in dummy encoding is encoded as -1 in effect encoding. In the dummy encoding example, the city Bangalore was encoded as a row of all 0s, whereas in effect encoding it is represented by a row of -1s.
import category_encoders as ce
import pandas as pd
data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']})
#create the sum (effect) encoder
encoder = ce.sum_coding.SumEncoder(cols='City', verbose=False)
#Original Data
data
encoder.fit_transform(data)
Effect encoding is an advanced technique; in case you are interested to know more about it, refer to the documentation on contrast (sum) coding schemes.
Hash Encoder
Hashing is the transformation of an arbitrary-size input into a fixed-size value. We use hashing algorithms to perform hashing operations, i.e., to generate the hash value of an input. Further, hashing is a one-way process; in other words, one cannot recover the original input from its hash representation.
Hashing has several applications like data retrieval, checking data corruption, and in data
encryption also. We have multiple hash functions available for example Message Digest
(MD, MD2, MD5), Secure Hash Function (SHA0, SHA1, SHA2), and many more.
Just like one-hot encoding, the hash encoder represents categorical features using new dimensions. Here, the user can fix the number of dimensions after transformation using the n_components argument. This means a feature with 5 categories can be represented using N new features, and similarly, a feature with 100 categories can also be transformed into N new features.
By default, the hashing encoder uses the md5 hashing algorithm, but a user can pass any algorithm of their choice; refer to the original MD5 specification if you want to explore that algorithm.
import category_encoders as ce
import pandas as pd
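Continuing from the imports above, a minimal sketch with the hypothetical 'City' column used throughout this section; n_components fixes the number of output columns:
data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai',
                              'Bangalore', 'Delhi', 'Hyderabad']})
#create a hash encoder that maps the City categories to 6 hashed feature columns
encoder = ce.HashingEncoder(cols='City', n_components=6)
data_encoded = encoder.fit_transform(data)
print(data_encoded)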
Another issue faced by the hashing encoder is collision: since a large number of categories are depicted in fewer dimensions, multiple values can end up represented by the same hash value. Moreover, hashing encoders have been very successful in some Kaggle competitions, and they are worth trying when a dataset has high-cardinality features.
Binary Encoding
Binary encoding is a combination of Hash encoding and one-hot encoding. In this encoding
scheme, the categorical feature is first converted into numerical using an ordinal encoder.
Then the numbers are transformed in the binary number. After that binary value is split into
different columns.
Binary encoding works really well when there is a high number of categories, for example, the cities in a country. It requires fewer new features than one-hot encoding, and it further reduces the curse of dimensionality for data with high cardinality.
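A minimal sketch, again assuming the category_encoders package and the hypothetical 'City' column:
import category_encoders as ce
import pandas as pd

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai',
                              'Bangalore', 'Delhi', 'Hyderabad']})
#each city is first ordinal-encoded, then the integer is written out across binary columns
encoder = ce.BinaryEncoder(cols=['City'], return_df=True)
data_encoded = encoder.fit_transform(data)
print(data_encoded)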
Base N Encoding
Before diving into BaseN encoding let’s first try to understand what is Base here?
In the numeral system, the Base or the radix is the number of digits or a combination of digits
and letters used to represent the numbers. The most common base we use in our life is 10 or
decimal system as here we use 10 unique digits i.e 0 to 9 to represent all the numbers.
Another widely used system is binary, i.e., the base is 2. It uses 0 and 1, i.e., 2 digits, to express all the numbers.
For binary encoding, the base is 2, which means it converts the numerical values of a
category into its respective Binary form. If you want to change the Base of encoding scheme
you may use Base N encoder. In the case when categories are more and binary encoding is
not able to handle the dimensionality then we can use a larger base such as 4 or 8.
import category_encoders as ce
import pandas as pd
data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']})
#Original Data
data
#create an encoder with base 5, then fit and transform the data
encoder = ce.BaseNEncoder(cols=['City'], base=5)
data_encoded = encoder.fit_transform(data)
data_encoded
In the above example, I have used base 5, also known as the quinary system. It is similar to the binary encoding example, but while binary encoding represents the same data with 4 new features, the base-5 encoding needs fewer columns.
Hence, the BaseN encoding technique further reduces the number of features required to represent the data efficiently and improves memory usage. The default base for the BaseN encoder is 2, which is equivalent to binary encoding.
Target Encoding
Target encoding is a supervised encoding technique that uses information from the target variable to encode the categorical data.
In target encoding, we calculate the mean of the target variable for each category and replace the category variable with that mean value. In the case of a categorical target variable, the posterior probability of the target replaces each category, so the encoded values are learned from the training dataset. A small sketch using the category_encoders TargetEncoder is given after the list of issues below. Although target encoding is a very efficient coding system, it has the following issues:
1. It can lead to target leakage or overfitting. To address overfitting we can use different
techniques.
1. In the leave one out encoding, the current target value is reduced from the
overall mean of the target to avoid leakage.
2. In another method, we may introduce some Gaussian noise in the target
statistics. The value of this noise is hyperparameter to the model.
2. The second issue, we may face is the improper distribution of categories in train and
test data. In such a case, the categories may assume extreme values. Therefore the
target means for the category are mixed with the marginal mean of the target.
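A minimal sketch, assuming the category_encoders TargetEncoder with a hypothetical 'City' feature and a hypothetical binary 'Target' column:
import category_encoders as ce
import pandas as pd

data = pd.DataFrame({'City': ['Delhi', 'Hyderabad', 'Delhi', 'Delhi', 'Bangalore', 'Hyderabad'],
                     'Target': [1, 0, 1, 0, 1, 0]})
#each city is replaced by a (smoothed) mean of the Target values observed for that city
encoder = ce.TargetEncoder(cols='City')
data_encoded = encoder.fit_transform(data['City'], data['Target'])
print(data_encoded)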
Unit 5
Major Applications of Data Science
Data science is the deep study of a large quantity of data, which involves extracting meaningful insights from raw, structured, and unstructured data. Extracting meaningful insights from large amounts of data requires processing, and this processing can be done using statistical techniques and algorithms, scientific techniques, different technologies, etc. Data science uses various tools and techniques to extract meaningful information from raw data, and it is also known as the future of artificial intelligence.
For example, Jagroop loves to read books, but every time he wants to buy some, he is confused about which book to choose because there are plenty of options in front of him. This is where data science techniques are useful: when he opens Amazon, he gets product recommendations on the basis of his previous data, and when he chooses one of them, he also gets a recommendation to buy other books that are mostly bought together with it. This recommendation of products and the display of sets of books purchased together is one example of data science.
Applications of Data Science
1. In Search Engines
The most visible application of data science is in search engines. When we want to search for something on the internet, we mostly use search engines such as Google, Yahoo, or Bing, and data science is used to return relevant results faster.
For example, when we search for something like "Data Structure and algorithm courses", the first link shown is often the GeeksforGeeks Courses page. This happens because the GeeksforGeeks website is visited most often for information regarding data structure courses and computer-related subjects. This analysis is done using data science, which surfaces the most-visited web links at the top.
2. In Transport
Data Science also entered into the Transport field like Driverless Cars. With the help of
Driverless Cars, it is easy to reduce the number of Accidents.
For Example, In Driverless Cars the training data is fed into the algorithm and with the help
of Data Science techniques, the Data is analyzed like what is the speed limit in Highway,
Busy Streets, Narrow Roads, etc. And how to handle different situations while driving etc.
3. In Finance
Data science plays a key role in the financial industry, which constantly deals with fraud and the risk of losses. Financial companies therefore need to automate risk-of-loss analysis in order to carry out strategic decisions. They also use data science analytics tools to predict the future, which allows them to estimate customer lifetime value and anticipate stock market moves.
For Example, In Stock Market, Data Science is the main part. In the Stock Market, Data
Science is used to examine past behavior with past data and their goal is to examine the
future outcome. Data is analyzed in such a way that it makes it possible to predict future
stock prices over a set timetable.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use data science to create a better user experience with personalized recommendations.
For example, when we search for something on an e-commerce website, we get suggestions similar to our choices according to our past data, and we also get recommendations based on the most-bought, most-rated, and most-searched products. This is all done with the help of data science.
5. In Health Care
In the Healthcare Industry data science act as a boon. Data Science is used for:
• Detecting Tumor.
• Drug discoveries.
• Medical Image Analysis.
• Virtual Medical Bots.
• Genetics and Genomics.
• Predictive Modeling for Diagnosis etc.
6. Image Recognition
Currently, data science is also used in image recognition. For example, when we upload a picture with a friend on Facebook, Facebook suggests tagging the people in the picture. This is done with the help of machine learning and data science: when an image is recognized, the analysis is run against one's Facebook friends, and if a face in the picture matches someone's profile, Facebook suggests auto-tagging.
7. Targeting Recommendation
Targeted recommendation is one of the most important applications of data science. Whatever a user searches for on the internet, they will later see related posts almost everywhere. For example, suppose I search for a mobile phone on Google but then decide to buy it offline. Data science helps the companies that pay to advertise that phone: everywhere on the internet, in social media, on websites, and in apps, I will see recommendations for the phone I searched for, which nudges me to buy it online.
8. Airline Routing Planning
With the help of Data Science, Airline Sector is also growing like with the help of it, it
becomes easy to predict flight delays. It also helps to decide whether to directly land into the
destination or take a halt in between like a flight can have a direct route from Delhi to the
U.S.A or it can halt in between after that reach at the destination.
9. Data Science in Gaming
In most games where a user plays against a computer opponent, data science concepts are used together with machine learning, so that with the help of past data the computer can improve its performance. Many games, such as chess engines and EA Sports titles, use data science concepts.
10. Medicine and Drug Development
The process of creating a medicine is very difficult and time-consuming and has to be done with full discipline, because it is a matter of someone's life. Without data science, it takes a lot of time, resources, and money to develop a new medicine or drug, but with data science it becomes easier because the probability of success can be estimated from biological data and other factors. Data-science-based algorithms can forecast how a compound will react in the human body before extensive lab experiments.
11. In Delivery Logistics
Various Logistics companies like DHL, FedEx, etc. make use of Data Science. Data Science
helps these companies to find the best route for the Shipment of their Products, the best time
suited for delivery, the best mode of transport to reach the destination, etc.
12. Autocomplete
The autocomplete feature is an important application of data science: the user types only a few letters or words, and the system suggests how to complete the line. In Gmail, for example, when we are writing a formal mail, the data-science-based autocomplete feature offers an efficient choice to complete the whole sentence. Autocomplete is also widely used in search engines, social media, and various apps.
# import modules
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
# output to notebook
output_notebook()
# create figure (the original glyph lines are truncated; a simple scatter of
# illustrative points is used here)
p = figure(width=400, height=400)
p.circle([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], size=10, color="navy")
show(p)
Output :
import pandas as pd
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
# output to notebook
output_notebook()
df = pd.read_csv(r"D:/kaggle/mcdonald/menu.csv")
# create bar chart of total Calories per Category
# (the original snippet used the legacy bokeh.charts.Bar API, which has been
#  removed from Bokeh; the current bokeh.plotting API is used here instead)
grouped = df.groupby("Category")["Calories"].sum()
p = figure(x_range=list(grouped.index), height=400)
p.vbar(x=list(grouped.index), top=grouped.values, width=0.9, legend_label="Calories")
p.legend.location = "top_right"
show(p)
Output :
import pandas as pd
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
# output to notebook
output_notebook()
# create a second bar chart (the value column of this truncated snippet is not
# recoverable; mean Calories per Category is assumed here)
grouped = df.groupby("Category")["Calories"].mean()
p = figure(x_range=list(grouped.index), height=400)
p.vbar(x=list(grouped.index), top=grouped.values, width=0.9, legend_label="Calories")
p.legend.location = "top_right"
show(p)
Output :
Code #5: Histogram
A histogram is used to represent the distribution of numerical data. The height of a rectangle in a histogram is proportional to the frequency of values in a class interval.
import pandas as pd
import numpy as np
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
# output to notebook
output_notebook()
# create histogram of Calories (the original used the legacy bokeh.charts API;
# np.histogram plus quad glyphs are used here instead)
hist, edges = np.histogram(df["Calories"].dropna(), bins=30)
p = figure(height=400)
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], color="navy")
show(p)
Output :
import pandas as pd
import numpy as np
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
# output to notebook
output_notebook()
# a second histogram variant (the feature and bin count are assumptions)
hist, edges = np.histogram(df["Calories"].dropna(), bins=50)
p = figure(height=400)
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], color="orange")
show(p)
Output :
Unit 6
Top 10 Data Analytics Trends For 2022
In today’s current market trend, data is driving any organization in a countless number of
ways. Data Science, Big Data Analytics, and Artificial Intelligence are the key trends in
today’s accelerating market. As more organizations are adopting data-driven models to
streamline their business processes, the data analytics industry is seeing humongous growth.
From fueling fact-based decision-making to adopting data-driven models to expanding data-
focused product offerings, organizations are inclining more towards data analytics.
These progressing data analytics trends can help organizations deal with many changes
and uncertainties. So, let’s take a look at a few of these Data Analytics trends that are
becoming an inherent part of the industry.
COVID-19 has changed the business landscape in myriad ways and historical data is no
more relevant. So, in place of traditional AI techniques, arriving in the market are some
scalable and smarter Artificial Intelligence and Machine Learning techniques that can
work with small data sets. These systems are highly adaptive, protect privacy, are much
faster, and also provide a faster return on investment. The combination of AI and Big
data can automate and reduce most of the manual tasks.
Agile data and analytics models are capable of digital innovation, differentiation, and
growth. The goal of edge and composable data analytics is to provide a user-friendly,
flexible, and smooth experience using multiple data analytics, AI, and ML solutions. This
will not only enable leaders to connect business insights and actions but also, encourage
collaboration, promote productivity, agility and evolve the analytics capabilities of the
organization.
One of the biggest data trends for 2022 is the increase in the use of hybrid cloud
services and cloud computation. Public clouds are cost-effective but do not provide high
security whereas a private cloud is secure but more expensive. Hence, a hybrid cloud is a
balance of both a public cloud and a private cloud where cost and security are balanced to
offer more agility. This is achieved by using artificial intelligence and machine learning.
Hybrid clouds are bringing change to organizations by offering a centralized database,
data security, scalability of data, and much more at such a cheaper cost.
A data fabric is a powerful architectural framework and set of data services that standardize
data management practices and consistent capabilities across hybrid multi-cloud
environments. With the current accelerating business trend as data becomes more
complex, more organizations will rely on this framework since this technology can reuse
and combine different integration styles, data hub skills, and technologies. It also reduces
design, deployment, and maintenance time by 30%, 30%, and 70%, respectively, thereby
reducing the complexity of the whole system. By 2026, it will be highly adopted as a re-
architect solution in the form of an IaaS (Infrastructure as a Service) platform.
There are many big data analytics tools available in the market, but the problem of processing enormous volumes of data persists. This has led to the development of quantum computing. By applying the laws of quantum mechanics, quantum computers can speed up the processing of enormous amounts of data while using less bandwidth and also offering better security and data privacy. Decisions here are taken using quantum bits (qubits); Google's Sycamore processor, for example, solved a sampling problem in about 200 seconds that was estimated to take classical supercomputers far longer.
However, Edge Computing will need a lot of fine-tuning before it can be significantly
adopted by organizations. Nevertheless, with the accelerating market trend, it will soon
make its presence felt and will become an integral part of business processes.
Trend 6: Augmented Analytics
Earlier businesses were restricted to predefined static dashboards and manual data
exploration restricted to data analysts or citizen data scientists. But it seems dashboards
have outlived their utility due to the lack of their interactivity and user-friendliness.
Questions are being raised about the utility and ROI of dashboards, leading organizations
and business users to look for solutions that will enable them to explore data on their own
and reduce maintenance costs.
It seems dashboards will slowly be replaced by modern, automated, and dynamic BI tools that present insights customized to a user's needs and delivered at their point of consumption.
Trend 8: XOps
XOps has become a crucial part of business transformation processes with the adoption
of Artificial Intelligence and Data Analytics across any organization. XOps started with
DevOps that is a combination of development and operations and its goal is to improve
business operations, efficiencies, and customer experiences by using the best practices of
DevOps. It aims in ensuring reliability, re-usability, and repeatability and also ensure a
reduction in the duplication of technology and processes. Overall, the primary aim of XOps
is to enable economies of scale and help organizations to drive business values by
delivering a flexible design and agile orchestration in affiliation with other software
disciplines.
With evolving market trends and business intelligence, data visualization has captured the
market in a go. Data Visualization is indicated as the last mile of the analytics process and
assists enterprises to perceive vast chunks of complex data. Data Visualization has made it
easier for companies to make decisions by using visually interactive ways. It influences the
methodology of analysts by allowing data to be observed and presented in the form of
patterns, charts, graphs, etc. Since the human brain interprets and remembers visuals more,
hence it is a great way to predict future trends for the firm.
By using visual elements like charts, graphs, and maps, data visualization techniques
provide an accessible way to see and understand trends, outliers, and patterns in data.
In modern days we have a lot of data in our hands; in the world of Big Data, data visualization tools and technologies are crucial to analyze massive amounts of information.
Since our eyes can quickly capture colors and patterns, we can rapidly distinguish the red portion from the blue or a square from a circle; our culture is visual, including everything from art to advertisements to TV and movies.
So, data visualization is another form of visual art that grabs our interest and keeps our eyes on the message. Whenever we look at a chart, we quickly identify the trends and outliers present in the dataset.
The basic uses of the Data Visualization technique are as follows:
• It is a powerful technique to explore the data with presentable and interpretable results.
• In the data mining process, it acts as a primary step in the pre-processing portion.
• It supports the data cleaning process by finding incorrect data and corrupted or missing
values.
• It also helps to construct and select variables, which means we have to determine which
variable to include and discard in the analysis.
• In the process of Data Reduction, it also plays a crucial role while combining the categories.
Mainly, there are three different types of analysis for data visualization:
Univariate Analysis: in univariate analysis, we use a single feature to analyze its distribution and properties.
Bivariate Analysis: when we compare the data between exactly 2 features, it is known as bivariate analysis.
Multivariate Analysis: when we compare the data between more than 2 features (variables), it is known as multivariate analysis.
NOTE:
We are not going to deep dive into the coding/implementation part of different techniques on
a particular dataset but we try to find the answer to the above questions and understand only
the snippet code with the help of sample plots for each of the data visualization techniques.
Plots for Univariate Analysis:
1. Distribution Plot
• It is one of the best univariate plots to know about the distribution of data.
• When we want to analyze the impact on the target variable(output) with respect to an
independent variable(input), we use distribution plots a lot.
• This plot gives us a combination of both probability density functions(pdf) and histogram in a
single plot.
Implementation:
Python Code:
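A minimal sketch, assuming seaborn/matplotlib and that the dataframe hb (with the 'Age' and 'SurvStat' columns used in the box-plot and violin-plot examples below) is already loaded:
import seaborn as sns
import matplotlib.pyplot as plt

# overlay one histogram + PDF curve per survival-status class
sns.FacetGrid(hb, hue='SurvStat', height=5) \
   .map(sns.histplot, 'Age', kde=True) \
   .add_legend()
plt.show()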
• We have observed that we created a distribution plot on the feature ‘Age’(input variable) and
we used different colors for the Survival status(output variable) as it is the class to be
predicted.
• There is a huge overlapping area between the PDFs for different combinations.
• In this plot, the sharp block-like structures are called histograms, and the smoothed curve is
known as the Probability density function(PDF).
NOTE:
The probability density function (PDF) of a curve can help us to capture the underlying distribution of that feature, which is one major takeaway from data visualization / exploratory data analysis.
2. Box Plot
• This plot can be used to obtain more statistical details about the data.
• The straight lines at the maximum and minimum are also called whiskers.
• Points that lie outside the whiskers will be considered as an outlier.
• The box plot also gives us a description of the 25th, 50th,75th quartiles.
• With the help of a box plot, we can also determine the Interquartile range(IQR) where
maximum details of the data will be present. Therefore, it can also give us a clear idea about
the outliers in the dataset.
Implementation:
sns.boxplot(x='SurvStat',y='axil_nodes',data=hb)
From the above box and whisker plot we can conclude the following observations:
• How much data is present in the 1st quartile and how many points are outliers etc.
• For class 1, we can see that it is very little or no data is present between the median and the
1st quartile.
• There are more outliers for class 1 in the feature named axil_nodes.
NOTE:
We can get details about outliers that will help us to prepare the data well before feeding it to a model.
3. Violin Plot
• The violin plots can be considered as a combination of Box plot at the middle and distribution
plots(Kernel Density Estimation) on both sides of the data.
• This can give us the description of the distribution of the dataset like whether the distribution
is multimodal, Skewness, etc.
• It also gives us useful information like a 95% confidence interval.
Fig. General Diagram for a Violin-plot
Implementation:
sns.violinplot(x='SurvStat',y='op_yr',data=hb,size=6)
From the above violin plot, we can observe the distribution, spread, and skewness of each class.
Plots for Bivariate Analysis:
1. Line Plot
• This is the plot that you can see in the nook and corners of any sort of analysis between 2 variables.
• The line plots are nothing but the values on a series of data points will be connected with
straight lines.
• The plot may seem very simple but it has more applications not only in machine learning but
in many other areas.
Implementation:
plt.plot(x,y)
From the above line plot we can conclude the following observations:
• These are used right from performing distribution Comparison using Q-Q plots to CV tuning
using the elbow method.
• Used to analyze the performance of a model using the ROC- AUC curve.
2. Bar Plot
• This is one of the widely used plots, that we would have seen multiple times not just in data
analysis, but we use this plot also wherever there is a trend analysis in many fields.
• Though it may seem simple it is powerful in analyzing data like sales figures every
week, revenue from a product, Number of visitors to a site on each day of a week, etc.
Implementation:
plt.bar(x,y)
From the above bar plot we can conclude the following observations:
• We can visualize the data in a cool plot and can convey the details straight forward to others.
• This plot may be simple and clear but it’s not much frequently used in Data science
applications.
3. Scatter Plot
• It is one of the most commonly used plots used for visualizing simple data in Machine
learning and Data Science.
• This plot gives us a representation where each point in the entire dataset is plotted with respect to any 2 to 3 features (columns).
• Scatter plots are available in both 2-D as well as in 3-D. The 2-D scatter plot is the common
one, where we will primarily try to find the patterns, clusters, and separability of the data.
Implementation:
• The scatter plot is present in the Matplotlib package.
plt.scatter(x,y)
From the above Scatter plot we can conclude the following observations:
• The colors are assigned to different data points based on how they were present in the
dataset i.e, target column representation.
• We can color the data points as per their class label given in the dataset.
Data Science is a field of study that focuses on using scientific methods and algorithms to
extract knowledge from data. The Data Scientist role may differ depending on the project.
Some associate this position with application analytics, others with vaguely defined AI, and
the truth lies somewhere in between. Ultimately, the Data Scientist's primary goal is to
improve the quality of application development and bring value to a product.
Data Scientist is generally required to have knowledge in data analysis, data transformations,
and machine learning. However, different positions are related to this role, such as:
• Data Analyst,
• Data Engineer,
• Machine Learning Engineer,
• MLOps,
• or DataOps.
Data Scientists may often be perceived as full-stack developers of a machine learning world.
Therefore, many companies prefer hiring Data Scientists with particular skills that involve
mentioned roles to fit project requirements. In small teams, Data Scientists are responsible for
designing architecture and building data processing pipelines, preparing application analytics,
developing machine learning solutions, deploying these to the production environment and
monitoring results.
The primary purpose of a Data Scientist's work is to solve problems that include reducing
costs, increasing revenue and improving user experience. It can be either achieved by
maintaining and investigating application analytics or introducing AI systems in a project.
Application analytics can, for example, cover:
• User demographics,
• User activity,
• User paths.
Analytics can give plenty of information to the development team and the client. Therefore,
application development can be accelerated with tasks prioritization, features validation, and
detection of hidden issues.
Although analytics is an important part of application development, Data Scientists are also
responsible for delivering machine learning solutions. Machine learning is a branch of
science that focuses on automatic insights extraction in order to build a knowledge model that
can perform a certain task. On the other hand, AI (Artificial Intelligence) is a much broader
term often used by marketers. As a result, that expression has become a buzzword and is
loosely used as a machine learning term equivalent in the business world.
There is a wide variety of applications that can utilize machine learning. Some common AI
systems with examples are presented below:
Recommender systems - profiling a user to propose the best items that fit their interests;
Image recognition - detect the particular object on images/videos (may be used for censoring
inappropriate content);
Text mining - e.g., sentiment analysis that provides information concerning positive or
negative attitudes toward a product based on the content provided by the user (e.g., user
opinion);
Churn prediction - detecting and preventing users from leaving the application or canceling
the service subscription;
Other systems:
• Antispam filters;
• Forecasting methods (e.g., predicting future sales);
• Chatbots;
• and various processes automation.
Data Scientist during the application development life cycle
There are two approaches to hiring a Data Scientist. Preparing application MVP may be
difficult for a client financially. During the initial development, there is an obvious need for
developers rather than Data Scientists. In this scenario, Data Scientist is usually hired when
the application is publicly available. Gathered data can be utilized for further application
development and the application might require some AI-centered features.
On the other hand, Data Scientist knowledge and experience may be beneficial from the start
of the development cycle. Although introducing new machine learning solutions may not be
crucial for a new application, to apply these solutions proper data collection is required. That
means that Data Scientist should be included in the work related to designing databases and
data flows. This way, it will be more effortless to develop machine learning solutions in the
future.