IDS 1 IntroDataScience

The document provides an introduction to a Data Science course offered by BITS Pilani, outlining its objectives, structure, and evaluation schedule. It covers fundamental concepts of data science, its applications in various industries, and the skills required for data scientists. Additionally, it discusses the challenges in data science and the roles within a data science team.

Uploaded by

AtindranathGhosh


INTRODUCTION TO DATA SCIENCE

MODULE # 1 : INTRODUCTION
IDS Course Team
BITS Pilani
The instructor gratefully acknowledges the authors who
made their course materials freely available online.

INTRODUCTION TO DATA SCIENCE 3 / 79

COURSE OBJECTIVES
• Gain a basic understanding of the role of Data Science in various
real-world scenarios in business, industry and government.
• Understand the various roles and stages in a Data Science project and the
ethical issues to be considered.
• Explore the processes, tools and technologies for collection and
analysis of structured and unstructured data.
• Appreciate the importance of techniques like data visualization and
storytelling with data for effectively presenting outcomes to
stakeholders.
• Understand techniques for preparing real-world data for data analytics.
• Implement data analytic techniques for discovering interesting patterns
from data.
COURSE STRUCTURE
• M1 Introduction to Data Science
• M2 Data Quality and Data Infrastructure
• M3 Data Preprocessing
• M4 Classification and Prediction
• M5 Association Analysis
• M6 Clustering
• M7 Anomaly Detection
• M8 Storytelling with Data
• M9 Ethics for Data Science


TEXT BOOKS

T1 Introduction to Data Mining, by Tan, Steinbach and Vipin Kumar
T2 Introducing Data Science, by Cielen, Meysman and Ali
T3 Storytelling with Data: A Data Visualization Guide for Business
Professionals, by Cole Nussbaumer Knaflic; Wiley
T4 Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han
and Micheline Kamber; Morgan Kaufmann Publishers, 2006

EVALUATION SCHEDULE
No    Name        Type    Duration      Weight  Remarks
EC1   Quiz I      Online  1 hr          5%      Sum of both quizzes
      Quiz II     Online  1 hr          5%
      Assignment  Online                20%
EC2   Mid-sem     Online  As announced  25%
EC3   End-sem     Online  As announced  45%

Fundamentals of Data Science
What is Data Science
Data science is a ‘concept to unify statistics, data analysis, machine learning
and their related methods’ in order to ‘understand and analyze actual phenomena’
with data. - Wikipedia
https://www.zeolearn.com/magazine/data‐science‐vs‐machine‐learning ‐artificial ‐intelligence

“Data science is the discipline of making data useful”
- Cassie Kozyrkov, Chief Decision Scientist, Google
What is Data Science
• Discovering what we don’t know from data
• Obtaining predictive insights
• Creating data products
• Making actionable decisions
• Communicating stories from data
• Increasing confidence in decisions that increase business value
Data Science, AI and ML

https://www.sciencedirect.com/topics/physics-and-astronomy/artificial-intelligence
Data Science, AI and ML
• Artificial Intelligence
  • AI involves making machines capable of mimicking human behavior,
    particularly cognitive functions like facial recognition, automated driving…
• Machine Learning
  • Considered a sub-field of, or one of the tools of, AI.
  • Involves providing machines with the capability of learning from experience.
• Data Science
  • Data science is the application of machine learning, artificial intelligence, and
    other quantitative fields like statistics, visualization, and mathematics to
    uncover insights from data to enable better decision making.

Why Data Science ?
Case study - Moneyball: The Art of Winning an Unfair Game
• Moneyball: The Art of Winning an Unfair Game, by Michael Lewis, was
published in 2003
• Moneyball told the story of how the Oakland Athletics, under general manager
Billy Beane, employed data and analytics to field a competitive baseball team
on a low budget
• The book was later made into a 2011 film
• It is the story of how existing data can be examined for meaning in ways that
were never intended or imagined when they were originally collected
• http://dataanalyticsedge.com/2019/11/14/moneyball-the-must-watch-movie-key-learning-for-every-aspiring-data-analyst-and-data-scientist/
Why Data Science ? - Data-Driven Decisions

• “As widely-familiar as the story is, it is almost as widely
misunderstood”. “Moneyball succeeded for the Oakland A’s not
because of data analytics but because of Beane, the leader who
understood the analytics’ potential and changed the organization so
it could deliver on that potential” -- Forbes
• Decisions no longer have to be made in the dark or based on gut
instinct; they can be based on data, evidence, experiments and more
accurate forecasts -- McKinsey
TABLE OF CONTENTS

1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING


USE CASES OF DATA SCIENCE

Source: DataFlair
DATA SCIENCE IN FACEBOOK

Social Analytics
Quantitative research
Makes use of deep learning
DeepText
Targeted Advertising


DATA SCIENCE IN AMAZON

Predictive Analysis
Anticipatory shipping model
Price discounts
Fraud Detection
Improving Packaging Efficiency


DATA SCIENCE IN UBER
Improving Rider Experience
Uber maintains a large database of drivers, customers, and several other
records.
Makes extensive use of Big Data and crowdsourcing to derive insights and
provide the best services to its customers.
Dynamic pricing
• Uses Big Data and data science to calculate fares based on specific parameters.
• Uber matches a customer profile with the most suitable driver and charges
based on the time it takes to cover the distance rather than the distance itself.
• The time of travel is calculated using algorithms that make use of data related
to traffic density and weather conditions.
• When demand (riders) is higher than supply (drivers), the price of
the ride goes up.
DATA SCIENCE IN BANK OF AMERICA
Improving Customer Experience
Erica – a virtual financial assistant (BoA)
• Erica serves as a customer advisor to over 45 million users around the world.
• Erica makes use of speech recognition to take customer inputs.
Fraud detection
• Uses data science and predictive analytics to detect fraud in payments,
insurance, credit cards, and customer information.
Risk modeling
• Uses data science for risk modeling to regulate financial activities.
Customer segmentation
• Segments its customers into high-value and low-value segments.
• Data scientists make use of clustering, logistic regression, and decision trees to
help the bank understand Customer Lifetime Value (CLV) and
group customers into the appropriate segments.
DATA SCIENCE IN AIRBNB

Improving Customer Experience
Providing better search results
• Uses big data of customer and host information, homestays and lodge
records, and website traffic.
• Uses data science to provide better search results to its customers and find
compatible hosts.
Detecting bounce rates
• Uses demographic analytics to analyze bounce rates from its websites.
Providing ideal lodgings and localities
• Uses knowledge graphs where the user’s preferences are matched with the
various parameters to provide ideal lodgings and localities.
DATA SCIENCE IN SPOTIFY

Improving Customer Experience and recommendation
Providing better music streaming experience
• Provides personalized music recommendations.
• Uses over 600 GB of daily data generated by its users to build its algorithms
to boost user experience.
Improving experience for artists and managers
• The Spotify for Artists application allows artists and managers to analyze their
streams, fan approval and the hits they are generating through Spotify’s
playlists.


DATA SCIENCE IN SPOTIFY... CONTD..

Spotify uses data science to gain insights about which universities had the
highest percentage of party playlists and which ones spent the most time
on them.
“Spotify Insights” publishes information about ongoing trends in
music.
Spotify’s Niland, an API-based product, uses machine learning to
provide better searches and recommendations to its users.
Spotify analyzes the listening habits of its users to predict the Grammy Award
winners.


APPLICATIONS OF DATA SCIENCE

Source: DataFlair
DATA SCIENCE CHALLENGES

Data science challenges can be :

Finding the data
Getting Access to data
Understanding the data
Data Cleaning
Communicating


COGNITIVE BIAS

Cognitive Biases are the distortions of reality because of the lens through
which we view the world.
Each of us sees things differently based on our preconceptions, past
experiences, cultural, environmental, and social factors. This doesn’t
necessarily mean that the way we think or feel about something is truly
representative of reality.


ROLES IN DATA SCIENCE TEAM [1-7]

[1] Chief Analytics Officer / Chief Data Officer
[2] Data analyst
[3] Business analyst
[4] Data scientist
[4a] Machine Learning Engineer
[4b] Data Journalist
[5] Data architect
[6] Data engineer
[7] Application/data visualization engineer

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
SKILLSET FOR A DATA SCIENTIST

• PROGRAMMING
• QUANTITATIVE ANALYSIS
• PRODUCT INTUITION
• COMMUNICATION
• TEAMWORK


SKILLS REQUIRED FOR A DATA SCIENTIST

• Communicative
• Qualitative
• Curious
• Technical
• Creative
• Skeptical


SKILLSET OF A DATA SCIENTIST


TOOLS AVAILABLE TO A DATA SCIENTIST

• R
• SQL
• Python
• Scala
• SAS
• Hadoop
• Julia
• Tableau
• Weka

ALGORITHMS FOR A DATA SCIENTIST

• Logistic Regression
• K-means clustering
• Linear Regression
• PCA
• Apriori
• Decision Tree
• SVM
• ANN


Data Science Activity Examples

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Classification: Definition
Given a collection of records (training set )
– Each record contains a set of attributes, one of the attributes is the class.
Find a model for class attribute as a function of the values of other attributes.
Goal: previously unseen records should be assigned a class as accurately as
possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into
training and test sets, with training set used to build the model and test set used to validate it.


Classification Example
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set → Learn Classifier → Model ← Test Set
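
The idea of building a model on a training set and checking it on held-out records can be sketched as follows. This is a minimal illustration, not the course's method: the ten labeled records are copied from the example table, the 7/3 split is an arbitrary choice, `fit_majority` is a deliberately trivial learner, and `rule_model` is a hand-written rule set (an assumption, standing in for a model that a decision-tree learner might produce).

```python
from collections import Counter

# (Refund, Marital Status, Taxable Income in K, Cheat), from the example table.
RECORDS = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def split(records, n_train):
    """Divide the labeled data into a training part and a held-out test part."""
    return records[:n_train], records[n_train:]


def fit_majority(train):
    """Trivial learner: always predict the most common class in the training set."""
    majority = Counter(record[-1] for record in train).most_common(1)[0][0]
    return lambda refund, marital, income: majority


def rule_model(refund, marital, income):
    """Hand-written rules (hypothetical, not learned by this code)."""
    if refund == "Yes" or marital == "Married":
        return "No"
    return "Yes" if income >= 80 else "No"


def accuracy(model, records):
    """Fraction of records whose predicted class equals the recorded class."""
    hits = sum(model(*record[:-1]) == record[-1] for record in records)
    return hits / len(records)


train, test = split(RECORDS, 7)
baseline_acc = accuracy(fit_majority(train), test)
rules_acc = accuracy(rule_model, test)
print(f"majority baseline: {baseline_acc:.2f}, rule model: {rules_acc:.2f}")
```

Measuring both models on the same held-out records is what makes the comparison meaningful; accuracy on the training records alone would not show how a model behaves on previously unseen data.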


Classification: Application 1
Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of consumers likely to
buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class
attribute.
• Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
• Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier
model.
From [Berry & Linoff] Data Mining Techniques, 1997


Classification: Application 2
Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
• Use credit card transactions and the information on its account-holder as attributes.
• When does a customer buy, what does he buy, how often he pays on time, etc
• Label past transactions as fraud or fair transactions. This forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card transactions on an account.


Classification: Application 3

Customer Attrition/Churn:
– Goal: To predict whether a customer is likely to be lost to a
competitor.
– Approach:
• Use detailed record of transactions with each of the past and
present customers, to find attributes.
• How often the customer calls, where he calls, what time-of-the day
he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997

Clustering Definition

Given a set of data points, each having a set of attributes, and
a similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.
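
A minimal sketch of clustering with Euclidean distance, using k-means (one of the algorithms named earlier in the deck) as the grouping method. The six 2-D points and the choice of initial centroids are made up for illustration; `math.dist` requires Python 3.8 or later.

```python
import math


def assign(points, centroids):
    """Label each point with the index of its nearest centroid (Euclidean distance)."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))
        for point in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of its members; keep it if it has none."""
    moved = []
    for index, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == index]
        if not members:
            moved.append(old)
            continue
        moved.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return moved


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical data: two visibly separate groups of points.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(POINTS, centroids=[POINTS[0], POINTS[3]])
print(labels, centroids)
```

Passing the initial centroids explicitly keeps the run deterministic; practical implementations usually pick them at random and repeat the procedure several times.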


Illustrating Clustering

Intracluster distances are minimized
Intercluster distances are maximized

Euclidean distance based clustering in 3-D space


Plotting cholera cases on a map of London

A famous instance of clustering to solve a problem took place
long ago in London, and it was done entirely without computers. The
physician John Snow, dealing with a cholera outbreak, plotted the cases on
a map of the city.

Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman


Clustering: Application 1
Market Segmentation:
– Goal: subdivide a market into distinct subsets of customers where any subset may conceivably
be selected as a market target to be reached with a distinct marketing mix.
– Approach:
• Collect different attributes of customers based on their geographical and lifestyle related
information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of customers in same cluster
vs. those from different clusters.


Clustering: Application 2
Document Clustering:
– Goal: To find groups of documents that are similar to each other
based on the important terms appearing in them.
– Approach: To identify frequently occurring terms in each document.
Form a similarity measure based on the frequencies of different
terms. Use it to cluster.
– Gain: Information Retrieval can utilize the clusters to relate a new
document or search term to clustered documents.


Association Rule Discovery: Definition
Given a set of records, each of which contains some number of items from a given
collection:
– Produce dependency rules which will predict the occurrence of an item based on
occurrences of other items.

TID  Items
1    Bread, Milk
2    Bread, Diaper, Butter, Beans
3    Milk, Diaper, Butter, Coke
4    Bread, Milk, Diaper, Butter
5    Bread, Milk, Diaper, Coke

Example of Association Rules
{Diaper} --> {Butter},
{Milk, Bread} --> {Beans, Coke},
{Butter, Bread} --> {Milk}
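
A rule's strength is commonly measured by its support (the fraction of transactions containing every item in the rule) and its confidence (among transactions containing the antecedent, the fraction that also contain the consequent). These two measures are standard in association analysis but are not defined on this slide; the sketch below computes them for the five transactions listed above, with function names chosen for illustration.

```python
# The five example transactions, keyed implicitly by TID 1..5.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Butter", "Beans"},
    {"Milk", "Diaper", "Butter", "Coke"},
    {"Bread", "Milk", "Diaper", "Butter"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= transaction for transaction in transactions)


def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the share with the consequent."""
    covered = count(antecedent, transactions)
    if covered == 0:
        return 0.0
    return count(antecedent | consequent, transactions) / covered


rule_support = support({"Diaper"}, {"Butter"}, TRANSACTIONS)
rule_confidence = confidence({"Diaper"}, {"Butter"}, TRANSACTIONS)
print(f"{{Diaper}} --> {{Butter}}: support={rule_support}, confidence={rule_confidence}")
```

For {Diaper} --> {Butter}, three of the five transactions contain both items and four contain Diaper, so the support is 0.6 and the confidence is 0.75.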


Association Rule Discovery: Application 1

Marketing and Sales Promotion:

– Let the rule discovered be
{Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
– Bagels in the antecedent => Can be used to see which products would be affected if the store
discontinues selling bagels.
– Bagels in antecedent and Potato chips in consequent => Can be used to see what products
should be sold with Bagels to promote sale of Potato chips!

Association Rule Discovery: Application 2
Inventory Management:
– Goal: A consumer appliance repair company wants to anticipate the nature
of repairs on its consumer products and keep the service vehicles equipped
with right parts to reduce on number of visits to consumer households.
– Approach: Process the data on tools and parts required in previous repairs at
different consumer locations and discover the co-occurrence patterns.


Prediction/Regression

Predict the value of a given continuous-valued variable based on the values of other
variables, assuming a linear or nonlinear model of dependency.
Extensively studied in statistics and in the neural network field.
Examples:
• Predicting sales amounts of new product based on advertising expenditure.
• Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
• Time series prediction of stock market indices.
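
For the linear case, the dependency can be estimated with ordinary least squares. The sketch below fits a straight line to a handful of made-up (advertising spend, sales) pairs and forecasts sales for a new spend level; the numbers and names are hypothetical and only illustrate the first example above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: advertising spend and resulting sales.
AD_SPEND = [1.0, 2.0, 3.0, 4.0, 5.0]
SALES = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(AD_SPEND, SALES)
forecast = intercept + slope * 6.0  # predicted sales at a spend of 6.0
print(f"sales = {intercept:.2f} + {slope:.2f} * spend; forecast at 6.0 = {forecast:.2f}")
```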


Deviation/Anomaly Detection

Detect significant deviations from normal behavior

Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection
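
One simple way to flag a significant deviation in a single numeric attribute is the modified z-score, which measures distance from the median in units of the median absolute deviation (MAD). This is one common choice among many, not a method prescribed by the slides; the transaction amounts are made up, and the 3.5 cut-off is a conventional default treated here as an assumption.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds threshold in magnitude."""
    median = statistics.median(values)
    mad = statistics.median([abs(value - median) for value in values])
    if mad == 0:
        # More than half the values are identical; the score is undefined.
        return []
    return [
        value
        for value in values
        if abs(0.6745 * (value - median) / mad) > threshold
    ]


# Hypothetical card transaction amounts with one unusual purchase.
AMOUNTS = [20, 22, 19, 21, 23, 20, 250]
flagged = mad_outliers(AMOUNTS)
print(flagged)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value inflates the latter two and can hide itself.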


DATAOPS

DATAOPS AS DEFINED BY GARTNER

DataOps is a collaborative data management practice focused on
improving communication, integration, and automation of data flow
between managers and consumers of data within an organization.


DATAOPS
DataOps applies Agile development, DevOps and lean
manufacturing to data analytics development and operations.
DataOps considers the end-to-end data analytics process as a
sequence of operations, or a data pipeline.


Agile
• Iterative and incremental software development methodologies
• Team and its processes and tools are organized around the goal
of publishing releases to the users every few weeks
• A development cycle is called a sprint or an iteration
• Non-sequential product development where market
requirements are quickly evolving
• Analogous to a data-analytics environment where each new
analysis and report of the data inspires requests for additional
queries

Source: The DataOps Cookbook by Christopher Bergh, Gil Benghiat, and Eran Strod
DevOps
• Merging of development and IT/Operations
• Seeks to reduce time to deployment, decrease time to
market, minimize defects, and shorten the time required to
fix problems
• Focuses on continuous delivery by leveraging on-demand IT
resources and by automating test and deployment of code
• Borrowing methods from DevOps, DataOps brings these
same improvements to data science.

Source: The DataOps Cookbook by Christopher Bergh, Gil Benghiat, and Eran Strod
DATAOPS


MLOPS

MLOps is an ML engineering culture and practice that aims at unifying ML
system development (Dev) and ML system operation (Ops).

[Diagram: MLOps at the centre of Machine Learning, Data Engineering and DevOps]


MLOPS
The real challenge isn’t building an ML model, but building an integrated ML
system and continuously operating it in production.
MLOps is an ML engineering culture and practice that aims at unifying ML
system development (Dev) and ML system operation (Ops).
Its goal is to deploy and maintain ML systems in production reliably and efficiently.
It automates continuous integration (CI), continuous delivery (CD), and
continuous training (CT) for machine learning (ML) systems.
Frameworks
• Kubeflow and Cloud Build
• Amazon AWS MLOps
• Microsoft Azure MLOps

https://ml-ops.org/content/mlops-principles
Data science steps for ML
Steps can be completed manually or by an automated pipeline.

• Data extraction: select and integrate the relevant data from various data sources for
the ML task.
• Data analysis: perform exploratory data analysis.
• Data preparation: prepare data for the ML task; involves data cleaning, data
annotation, data transformation, splitting and feature engineering.
• Model training: implement different algorithms with the prepared data to train
various ML models.
• Model evaluation: evaluate on a holdout test set to assess the model quality.
• Model validation: confirm whether the model is adequate for deployment, i.e. its predictive
performance is better than a certain baseline.
• Model serving: deploy to a target environment to serve predictions.
• Model monitoring: monitor the model’s predictive performance to potentially invoke a
new iteration in the ML process.

https://ml-ops.org/content/mlops-principles
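
The steps from extraction through validation can be sketched as a chain of small functions, which is roughly what an automated pipeline strings together. Everything here is hypothetical: the in-memory data, the 80/20 split, the least-squares line used as the model, and the mean predictor used as the baseline are assumptions chosen only to keep the sketch self-contained.

```python
def extract():
    """Data extraction: hypothetical (x, y) rows; None marks a missing target."""
    return [(1, 3), (2, 5), (3, None), (4, 9), (5, 11),
            (6, 13), (7, 15), (8, 17), (9, 19), (10, 21)]


def prepare(rows, train_fraction=0.8):
    """Data preparation: drop incomplete rows, then split into train and holdout."""
    clean = [(x, y) for x, y in rows if y is not None]
    cut = int(len(clean) * train_fraction)
    return clean[:cut], clean[cut:]


def train(rows):
    """Model training: fit y = intercept + slope * x by least squares."""
    xs, ys = zip(*rows)
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def evaluate(model, rows):
    """Model evaluation: mean absolute error on the given rows."""
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)


def validate(model_error, baseline_error):
    """Model validation: adequate only if it beats the baseline."""
    return model_error < baseline_error


train_rows, test_rows = prepare(extract())
model = train(train_rows)
train_mean = sum(y for _, y in train_rows) / len(train_rows)
model_error = evaluate(model, test_rows)
baseline_error = evaluate(lambda x: train_mean, test_rows)
ready_to_deploy = validate(model_error, baseline_error)
print(model_error, baseline_error, ready_to_deploy)
```

Serving and monitoring are left out because they depend on the target environment; in a monitored system, a drop in the evaluated performance would trigger the same chain again.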
MLOPS

https://builtin.com/machine-learning/mlops
Three phases

The MLOps maturity of a project falls into three general phases: MLOps level 0, 1 or 2.

• The levels measure the degree of automation used to push data transformations, model
training and a final model to production.
• Level 0: a completely manual process based on scripts and notebooks, with rare
(and technically complicated) deployments of models for predictions.
• Level 1: the pipeline for model training and deployment is automated.
• Level 2: the process of changing and experimenting with this pipeline is itself automated.
DATAOPS AND MLOPS


SELF READING


DS Process

The standard process involves

1. understanding the problem,
2. preparing the data (samples),
3. developing the model,
4. applying the model on a data set to see how the model may work in
real world, and
5. production deployment.
A popular data mining process framework is CRISP-DM (Cross Industry
Standard Process for Data Mining). This framework was developed by a
consortium of companies involved in data mining.


CRISP-DM
CRISP-DM Phases

Cross Industry Standard Process for Data Mining
Conceived around 1996
6 high-level phases
Used in IBM SPSS Modeler tool
Iterative approach to the development of analytical models.

CRISP-DM PHASES
Business Understanding
• Understand project objectives and requirements.
• Data mining problem definition.
Data Understanding
• Initial data collection and familiarization.
• Identify data quality issues.
• Identify initial obvious results.
Data Preparation
• Record and attribute selection.
• Data cleansing.

CRISP-DM PHASES
Modeling
• Run the data mining tools.
Evaluation
• Determine if results meet business objectives.
• Identify business issues that should have been addressed earlier.
Deployment
• Put the resulting models into practice.
• Set up for continuous mining of the data.

CRISP-DM PHASES AND TASKS

A Case Study of Evaluating Job Readiness with Data
Mining Tools and CRISP-DM Methodology

Objectives
• Use of data mining techniques in evaluating job readiness of unemployed population in
Ireland
• whether to automate the classification system with regard to job readiness
Step 1 – Business Understanding
• Job readiness is one of the basic characteristics of a customer identified in the process of
registration.
• Existing system:
• evaluated by a case officer
• hundreds of officers making independent judgments

Ref: Wowczko, I. (2015). A case study of evaluating job readiness with data mining tools and CRISP-DM methodology. International Journal for Infonomics, 8(3), 1066-1070.
Step 1 – Business Understanding

a. Business Objectives:
• To analyze the registered unemployed population and examine the
relationships between its various features captured in the database
• Job readiness will decide the type of support offered to a client: job
matching or further training opportunities

b. Data Mining Objectives:
• Analysis of attributes, subsets creation
• Use of data mining tools in order to identify the underlying patterns
(RapidMiner, Tableau, SQL Server)
Step 2 – Data Understanding
• Records of clients seeking guidance and support from the public employment services in
Ireland
• Sample was representative of all unemployed people registered with public employment
services in Ireland within a full year cycle

Exploratory Statistics (RapidMiner):
• A few numeric variables provide more insight into the demographics of the registered customers
• 48% women and 52% men
• The majority of customers had been interviewed (62%) [INERVIEW_DATE attribute]
• The average customer is about 35.5 years old; the youngest is 16 and the oldest is 101
Step 2 – Data Understanding

Data Visualisation (Tableau)

relationship between the amount of information provided
by a client and their job readiness
Step 2 – Data Understanding

Average customer age

Step 3 – Data Preparation
• 139 attributes and 60775 rows
• Obsolete attributes, attributes with an extremely high number of missing values, and
attributes with no information value (69 attributes in total) were excluded from further processing.
• The remaining 70 attributes (10 integer attributes, 56 nominal attributes and 4 text attributes)
were pre-processed

Data Cleaning and Pre-Processing:
• A large number of missing values and incorrect values such as zeros, multiple zeros, ‘none’,
‘no’, spaces, multiple spaces, NULLs, etc.
• Among the 70 attributes, only 8 did not require any data cleaning.
• Wrongly typed attributes, e.g. integer attributes were regarded as nominal and dates were stored as
integers
Step 3 – Data Preparation
Data transformation:

• In some attributes, missing and noisy values stand for ‘no’, ‘zero’, ‘did not happen’, etc.

• Some attributes contain potentially useful information that was captured by converting them
into binomial attributes (Y, N). 7 attributes were cleaned and transformed
with this method (GENERAL_COMMENTS, EMAIL_ADDRESS, MOBILE_NUMBER,
PHONE_NUMBER, WORK_SKILLS, COMPUTER_PACKAGES and SPECIAL_NEEDS_REQS).

• A missing INERVIEW_DATE simply means that a client has not yet been interviewed. Filtering
out those examples would have lost too much data, so the attribute was converted into a binomial
attribute (Y, N).
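
The conversion to binomial attributes can be sketched as below. The set of "missing" markers follows the incorrect values listed under data cleaning (zeros, ‘none’, ‘no’, spaces, NULLs); the function names and the sample record values are hypothetical, and the case study itself performed this step in RapidMiner rather than in code like this.

```python
MISSING_MARKERS = {"", "none", "no", "null"}


def is_missing(value):
    """True for None, blanks, 'none'/'no'/'NULL' in any case, or a run of zeros."""
    if value is None:
        return True
    text = str(value).strip().lower()
    if text in MISSING_MARKERS:
        return True
    return text.strip("0") == ""


def to_binomial(value):
    """Map an attribute value to 'Y' (information present) or 'N' (absent)."""
    return "N" if is_missing(value) else "Y"


# Hypothetical client record using attribute names from the case study.
record = {
    "EMAIL_ADDRESS": "client@example.ie",
    "MOBILE_NUMBER": "000",
    "WORK_SKILLS": "   ",
    "INERVIEW_DATE": None,
}
flags = {name: to_binomial(value) for name, value in record.items()}
print(flags)
```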
Step 4 - Modelling

• Classification with a Decision Tree achieved an accuracy of 81.97%

• The tree made one split, on attribute EXP1 (experience declared by a client in relation to their main
profession, MANCO1)

• The EXP1 attribute was then excluded from the dataset to let the algorithm find a new split
point.
• Another two variables were identified with this method: EXP2 (experience declared by a
client in relation to their secondary profession, MANCO2) and FULL_TIME (willingness
to work full-time)

• The study also tried Random Forest, SVM, KNN, etc.

Step 5 – Evaluation and Conclusion
• The examined sample is consistent: the top discriminators between the two class labels are
attributes such as EXP1, EXP2 and FULL_TIME

• There is a very straightforward relationship among the features of the customers: a person
is job ready if they have some work experience and are willing to be employed full-time

• The observed pattern can be easily recognized and there was no previously unknown
information discovered in the process of mining the dataset

• It has been concluded that automating the system would not offer any advantage over the
existing classification based on simple heuristics

• Therefore, no step 6 - Deployment

WHY CRISP-DM?

The data mining process must be reliable and repeatable by people with
little data mining skill.
CRISP-DM provides a uniform framework for
• guidelines.
• experience documentation.
CRISP-DM is flexible to account for differences.
• Different business/agency problems.
• Different data.

DATA SCIENCE TEAM BUILDING

Get to know each other for better communication
Foster team cohesion and teamwork
Encourage collaboration to boost team productivity and
performance.

https://towardsdatascience.com/why-team-building-is-important-to-data-scientists-a8fa74dbc09b
ORGANISATION OF DATA SCIENCE TEAM
[1] Decentralized
[2] Functional
[3] Consulting
[4] Centralized
[5] Center of Excellence
[6] Federated


SOFTWARE ENGINEERING
In general,
Software engineering is an engineering discipline that is concerned with all
aspects of software production.
Software includes computer programs, all associated
documentation, and configuration data that are needed for
software to work correctly.
Waterfall model, Iterative models, Agile models


DATA SCIENCE PROCESS


DATA SCIENCE VS. SOFTWARE ENGINEERING

Data Science: Involves analyzing huge amounts of data, with some aspects of programming and development.
Software Engineering: Focuses on creating software that serves a specific purpose.

Data Science: Uses a methodology involving various phases, beginning from requirements specification through model deployment to better decision making.
Software Engineering: Uses a methodology involving various phases, beginning from requirements specification through software deployment into production.
DATA SCIENCE VS. SOFTWARE ENGINEERING

Data Science: Involves collecting and analyzing data.
Software Engineering: Concerned with creating useful applications.

Data Science: Data scientists utilize the ETL (Extract, Transform, Load) process.
Software Engineering: Software engineers use the SDLC process.

Data Science: More process-oriented.
Software Engineering: Uses frameworks like Waterfall, Agile, and Spiral.

Data Science: Data scientists use tools like Amazon S3, MongoDB, Hadoop, and …
Software Engineering: Software engineers use tools like Rails, Django, Flask, and Vue.js.
DATA SCIENCE VS. BUSINESS INTELLIGENCE


DATA SCIENCE VS. BUSINESS INTELLIGENCE

                Data Science                    Business Intelligence

Perspective     Looking forward                 Looking backward
Analysis        Predictive                      Descriptive
                Explorative                     Comparative
Data            Same data, new analysis         New data, same analysis
                Listens to data                 Speaks for data
                Distributed                     Warehoused
Scope           Specific to business question   Unlimited
Expertise       Data scientist                  Business analyst
Deliverable     Insight or story                Table or report
Applicability   Future, correction for          Historic, confounding
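
The "Descriptive" versus "Predictive" row can be illustrated with a small Python sketch. The monthly sales figures below are made up for the example, and the straight-line fit is just one simple way to look forward; it is an assumption for illustration, not a method prescribed by the course.

```python
from statistics import mean

# Hypothetical monthly sales figures, for illustration only.
months = [1, 2, 3, 4, 5, 6]
sales = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]

# Descriptive (looking backward): summarize what has already happened.
average_sales = mean(sales)
best_month = months[sales.index(max(sales))]

# Predictive (looking forward): fit a least-squares line and extrapolate it.
x_bar, y_bar = mean(months), mean(sales)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(months, sales)) / sum(
    (x - x_bar) ** 2 for x in months
)
intercept = y_bar - slope * x_bar
forecast_month_7 = intercept + slope * 7

print("Average so far:", average_sales)
print("Best month so far:", best_month)
print("Forecast for month 7:", forecast_month_7)
```

The first two values are the kind of figures a report or table would show, while the forecast is a statement about a month that has not been observed yet.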
DATA SCIENTIST VS. BUSINESS ANALYST


DATA SCIENCE VS. STATISTICS
                     Data Science                        Statistics

Type of problem      Semi-structured or unstructured     Well structured
Inference model      Explicit inference                  No inference
Analysis objective   Need not be well formed             Well-formed objective
Type of analysis     Explorative                         Confirmative
Data collection      Not linked to the objective         Based on the objective
Size of dataset      Large                               Small
                     Heterogeneous                       Homogeneous
Paradigm             Theory and heuristic                Theory based
THANK YOU