
IDS 1 IntroDataScience

The document provides an introduction to a Data Science course offered by BITS Pilani, outlining its objectives, structure, and evaluation schedule. It covers fundamental concepts of data science, its applications in various industries, and the skills required for data scientists. Additionally, it discusses the challenges in data science and the roles within a data science team.

Uploaded by

AtindranathGhosh

INTRODUCTION TO DATA SCIENCE

MODULE # 1 : INTRODUCTION
IDS Course Team
BITS Pilani
The instructor gratefully acknowledges the authors who made their course materials freely available online.



COURSE OBJECTIVES
• Gain a basic understanding of the role of Data Science in various real-world scenarios in business, industry and government.
• Understand the various roles and stages in a Data Science project and the ethical issues to be considered.
• Explore the processes, tools and technologies for collection and analysis of structured and unstructured data.
• Appreciate the importance of techniques like data visualization and storytelling with data for effective presentation of outcomes to stakeholders.
• Understand techniques for preparing real-world data for data analytics.
• Implement data analytic techniques for discovering interesting patterns from data.

COURSE STRUCTURE
• M1 Introduction to Data Science
• M2 Data Quality and Data Infrastructure
• M3 Data Preprocessing
• M4 Classification and Prediction
• M5 Association Analysis
• M6 Clustering
• M7 Anomaly Detection
• M8 Storytelling with Data
• M9 Ethics for Data Science



TEXT BOOKS

T1 Introduction to Data Mining, by Tan, Steinbach and Vipin Kumar
T2 Introducing Data Science, by Cielen, Meysman and Ali
T3 Storytelling with Data: A Data Visualization Guide for Business Professionals, by Cole Nussbaumer Knaflic; Wiley
T4 Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han and Micheline Kamber; Morgan Kaufmann Publishers, 2006


EVALUATION SCHEDULE
No    Name         Type     Duration       Weight   Remarks
EC1   Quiz I       Online   1 hr           5%       Sum of both quizzes
      Quiz II      Online   1 hr           5%
      Assignment   Online                  20%
EC2   Mid-sem      Online   As announced   25%
EC3   End-sem      Online   As announced   45%

Fundamentals of Data Science
What is Data Science
Data science is a ‘concept to unify statistics, data analysis, machine learning and their related methods’ in order to ‘understand and analyze actual phenomena’ with data. - Wikipedia
https://www.zeolearn.com/magazine/data-science-vs-machine-learning-artificial-intelligence

“Data science is the discipline of making data useful”
- Cassie Kozyrkov, Chief Decision Scientist, Google
What is Data Science
• Discovering what we don’t know from data
• Obtaining predictive insights
• Helps to create data products
• Helps to make actionable decisions
• Communicating stories from data
• Increases confidence in making valuable decisions that increase business value
Data Science, AI and ML

https://www.sciencedirect.com/topics/physics-and-astronomy/artificial-intelligence
Data Science, AI and ML
• Artificial Intelligence
  – AI involves making machines capable of mimicking human behavior, particularly cognitive functions such as facial recognition and automated driving.
• Machine Learning
  – Considered a sub-field of AI, or one of its tools.
  – Involves providing machines with the capability of learning from experience.
• Data Science
  – Data science is the application of machine learning, artificial intelligence, and other quantitative fields like statistics, visualization, and mathematics to uncover insights from data and enable better decision making.

Why Data Science ?
Case study - Moneyball: The Art of Winning an Unfair Game
• The 2003 publication of Moneyball: The Art of Winning an Unfair Game, by
Michael Lewis

• Moneyball told the story of how the Oakland Athletics, under general manager
Billy Beane, employed data and analytics to field a competitive baseball team
on a low budget

• The book was later made into a 2011 film

• It is the story of how existing data can be examined for meaning in ways that
were never intended or imagined when they were originally collected

• http://dataanalyticsedge.com/2019/11/14/moneyball-the-must-watch-movie-key-learning-for-every-aspiring-data-analyst-and-data-scientist/
Why Data Science ? - Data-Driven Decisions

• “As widely-familiar as the story is, it is almost as widely misunderstood”. “Moneyball succeeded for the Oakland A’s not because of data analytics but because of Beane, the leader who understood the analytics’ potential and changed the organization so it could deliver on that potential” -- Forbes
• Decisions no longer have to be made in the dark or based on gut
instinct; they can be based on data, evidence, experiments and more
accurate forecasts -- McKinsey
TABLE OF CONTENTS

1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING



USE CASES OF DATA SCIENCE

[Figure: use cases of data science. Source: DataFlair]
DATA SCIENCE IN FACEBOOK

Social Analytics
Quantitative research
Makes use of deep learning
Deeptext
Targeted Advertising



DATA SCIENCE IN AMAZON

Predictive Analysis
Anticipatory shipping model
Price discounts
Fraud Detection
Improving Packaging Efficiency



DATA SCIENCE IN UBER
Improving Rider Experience
• Uber maintains a large database of drivers, customers, and several other records.
• Makes extensive use of Big Data and crowdsourcing to derive insights and provide the best services to its customers.
• Dynamic pricing
  – Uses Big Data and data science to calculate fares based on specific parameters.
  – Uber matches the customer profile with the most suitable driver and charges based on the time it takes to cover the distance rather than the distance itself.
  – The time of travel is calculated using algorithms that make use of data related to traffic density and weather conditions.
  – When demand (riders) is higher than supply (drivers), the price of the ride goes up.
DATA SCIENCE IN BANK OF AMERICA
Improving Customer Experience
• Erica – a virtual financial assistant (BoA)
  – Erica serves as a customer advisor to over 45 million users around the world.
  – Erica makes use of speech recognition to take customer inputs.
• Fraud detection
  – Uses data science and predictive analytics to detect fraud in payments, insurance, credit cards, and customer information.
• Risk modeling
  – Uses data science for risk modeling to regulate financial activities.
• Customer segmentation
  – Segments customers into high-value and low-value segments.
  – Data scientists make use of clustering, logistic regression, and decision trees to help banks understand the Customer Lifetime Value (CLV) and group customers into the appropriate segments.
DATA SCIENCE IN AIRBNB

Improving Customer Experience
• Providing better search results
  – Uses big data of customer and host information, homestay and lodge records, and website traffic.
  – Uses data science to provide better search results to its customers and find compatible hosts.
• Detecting bounce rates
  – Uses demographic analytics to analyze bounce rates from its websites.
• Providing ideal lodgings and localities
  – Uses knowledge graphs in which the user’s preferences are matched with various parameters to provide ideal lodgings and localities.
DATA SCIENCE IN SPOTIFY

Improving Customer Experience and Recommendation
• Providing better music streaming experience
  – Provides personalized music recommendations.
  – Uses over 600 GB of data generated daily by users to build its algorithms and boost user experience.
• Improving experience for artists and managers
  – The Spotify for Artists application allows artists and managers to analyze their streams, fan approval and the hits they are generating through Spotify’s playlists.



DATA SCIENCE IN SPOTIFY... CONTD..

• Spotify uses data science to gain insights about which universities had the highest percentage of party playlists and which ones spent the most time on them.
• “Spotify Insights” publishes information about ongoing trends in music.
• Spotify’s Niland, an API-based product, uses machine learning to provide better searches and recommendations to its users.
• Spotify analyzes the listening habits of its users to predict the Grammy Award winners.



APPLICATIONS OF DATA SCIENCE

[Figure: applications of data science. Source: DataFlair]
DATA SCIENCE CHALLENGES

Data science challenges include:
• Finding the data
• Getting access to the data
• Understanding the data
• Data cleaning
• Communicating



COGNITIVE BIAS

• Cognitive biases are distortions of reality caused by the lens through which we view the world.
• Each of us sees things differently based on our preconceptions, past experiences, and cultural, environmental, and social factors. The way we think or feel about something is therefore not necessarily representative of reality.



ROLES IN DATA SCIENCE TEAM [1-7]

[1] Chief Analytics Officer / Chief Data Officer
[2] Data analyst
[3] Business analyst
[4] Data scientist
[4a] Machine Learning Engineer
[4b] Data Journalist
[5] Data architect
[6] Data engineer
[7] Application/data visualization engineer

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
SKILLSET FOR A DATA SCIENTIST

• PROGRAMMING
• QUANTITATIVE ANALYSIS
• PRODUCT INTUITION
• COMMUNICATION
• TEAMWORK



SKILLS REQUIRED FOR A DATA SCIENTIST

[Figure: traits surrounding "Data Scientist"]
• Communicative
• Qualitative
• Curious
• Technical
• Creative
• Skeptical



SKILLSET OF A DATA SCIENTIST



TOOLS AVAILABLE TO A DATA SCIENTIST

[Figure: tools]
• R
• SQL
• Python
• Scala
• SAS
• Hadoop
• Julia
• Tableau
• Weka
ALGORITHMS FOR A DATA SCIENTIST

[Figure: algorithms]
• Logistic Regression
• K-means clustering
• Linear Regression
• PCA
• Apriori
• Decision Tree
• SVM
• ANN



Data Science Activity Examples


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Classification: Definition
• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
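
The hold-out procedure described here can be sketched in plain Python. The toy records, the 70/30 split and the 1-nearest-neighbour classifier are all assumptions chosen for illustration; the slide does not prescribe a particular model or split ratio.

```python
import random

def split_train_test(records, test_fraction=0.3, seed=0):
    """Shuffle the records and divide them into a training set and a test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(train, value):
    """Assign the class of the closest training record (1-nearest neighbour)."""
    nearest = min(train, key=lambda record: abs(record[0] - value))
    return nearest[1]

def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    hits = sum(predict_1nn(train, value) == label for value, label in test)
    return hits / len(test)

# Hypothetical records: (attribute value, class label).
records = [(1, "low"), (2, "low"), (3, "low"), (4, "low"), (5, "low"),
           (101, "high"), (102, "high"), (103, "high"), (104, "high"), (105, "high")]

train, test = split_train_test(records)
print(len(train), len(test), accuracy(train, test))
```

The model is built only from `train`; `test` is touched only when measuring accuracy, which is what makes the estimate meaningful for previously unseen records.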



Classification Example
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Training Set
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Test Set
Refund   Marital Status   Taxable Income   Cheat
No       Single           75K              ?
Yes      Married          50K              ?
No       Married          150K             ?
Yes      Divorced         90K              ?
No       Single           40K              ?
No       Married          80K              ?

[Figure: the Training Set is used to learn a classifier (Model), which is then applied to the Test Set]
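
As an illustration of applying a model to these records, the sketch below encodes the ten training records and six test records and runs a small hand-written decision tree over them. The tree (including the 80K income threshold) is an assumption made for this example; it is consistent with the training records but is not stated on the slide, and it is written by hand rather than learned.

```python
# Training records from the slide: (refund, marital_status, taxable_income_k, cheat).
training_set = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test records from the slide; the class ("Cheat") is unknown.
test_set = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]

def predict_cheat(refund, marital_status, income_k):
    """Hand-written decision tree (an assumption, not taken from the slide)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Refund is "No" and the person is Single or Divorced: split on income.
    return "Yes" if income_k >= 80 else "No"

correct = sum(predict_cheat(r, m, inc) == cheat for r, m, inc, cheat in training_set)
training_accuracy = correct / len(training_set)
test_predictions = [predict_cheat(*record) for record in test_set]
print(training_accuracy, test_predictions)
```

Accuracy on the training set is only a sanity check; a fair accuracy estimate needs test records whose true class is known.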


Classification: Application 1
Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of consumers likely to
buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class
attribute.
• Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
• Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier
model.
From [Berry & Linoff] Data Mining Techniques, 1997



Classification: Application 2
Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
• Use credit card transactions and the information on its account-holder as attributes.
• When does a customer buy, what does he buy, how often he pays on time, etc
• Label past transactions as fraud or fair transactions. This forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card transactions on an account.



Classification: Application 3

Customer Attrition/Churn:
– Goal: To predict whether a customer is likely to be lost to a
competitor.
– Approach:
• Use detailed record of transactions with each of the past and
present customers, to find attributes.
• How often the customer calls, where he calls, what time-of-the day
he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997


Clustering Definition

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.
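
A minimal sketch of clustering with Euclidean distance as the similarity measure. It uses basic k-means, which is an assumption here (the definition above does not name an algorithm); the sample points and the starting centroids are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, centroids, max_iterations=100):
    """Basic k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iterations):
        labels = [min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
                  for p in points]
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid unchanged
        if updated == centroids:
            break  # centroids stopped moving: converged
        centroids = updated
    return labels, centroids

# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
labels, centroids = k_means(points, centroids=[points[0], points[3]])
print(labels, centroids)
```

Points that end up with the same label are close to each other and far from the other group, which is the property the definition asks for.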



Illustrating Clustering

[Figure: Euclidean distance based clustering in 3-D space. Intracluster distances are minimized; intercluster distances are maximized.]


Plotting cholera cases on a map of London

A famous instance of clustering to solve a problem took place long ago in London, and it was done entirely without computers. The physician John Snow, dealing with a cholera outbreak, plotted the cases on a map of the city.

Source: Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman


Clustering: Application 1
Market Segmentation:
– Goal: subdivide a market into distinct subsets of customers where any subset may conceivably
be selected as a market target to be reached with a distinct marketing mix.
– Approach:
• Collect different attributes of customers based on their geographical and lifestyle related
information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of customers in same cluster
vs. those from different clusters.



Clustering: Application 2
Document Clustering:
– Goal: To find groups of documents that are similar to each other
based on the important terms appearing in them.
– Approach: To identify frequently occurring terms in each document.
Form a similarity measure based on the frequencies of different
terms. Use it to cluster.
– Gain: Information Retrieval can utilize the clusters to relate a new
document or search term to clustered documents.



Association Rule Discovery: Definition
Given a set of records each of which contain some number of items from a given
collection;
– Produce dependency rules which will predict occurrence of an item based on
occurrences of other items.
Example of Association Rules
{Diaper} --> {Butter}
{Milk, Bread} --> {Beans, Coke}
{Butter, Bread} --> {Milk}

Transactions:
TID   Items
1     Bread, Milk
2     Bread, Diaper, Butter, Beans
3     Milk, Diaper, Butter, Coke
4     Bread, Milk, Diaper, Butter
5     Bread, Milk, Diaper, Coke
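
How strongly a rule holds in these five transactions can be checked with a short sketch. Support and confidence are the usual measures for association rules, but they are not defined on this slide, so treat their use here as an assumption; the function names are illustrative.

```python
# Transactions from the slide's table, keyed by TID.
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Butter", "Beans"},
    3: {"Milk", "Diaper", "Butter", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Butter"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}

def count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

# Rule {Diaper} --> {Butter}
rule_support = support({"Diaper", "Butter"})
rule_confidence = confidence({"Diaper"}, {"Butter"})
print(rule_support, rule_confidence)
```

Diaper and Butter appear together in transactions 2, 3 and 4, and Diaper appears in four transactions overall, which is where the two figures come from.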



Association Rule Discovery: Application 1

Marketing and Sales Promotion:


– Let the rule discovered be
{Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
– Bagels in the antecedent => Can be used to see which products would be affected if the store
discontinues selling bagels.
– Bagels in antecedent and Potato chips in consequent => Can be used to see what products
should be sold with Bagels to promote sale of Potato chips!



Association Rule Discovery: Application 2
Inventory Management:
– Goal: A consumer appliance repair company wants to anticipate the nature
of repairs on its consumer products and keep the service vehicles equipped
with right parts to reduce on number of visits to consumer households.
– Approach: Process the data on tools and parts required in previous repairs at
different consumer locations and discover the co-occurrence patterns.



Prediction/Regression

Predict a value of a given continuous valued variable based on the values of other
variables, assuming a linear or nonlinear model of dependency.
Extensively studied in the statistics and neural network fields.
Examples:
• Predicting sales amounts of new product based on advertising expenditure.
• Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
• Time series prediction of stock market indices.
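
A minimal sketch of the linear case, using ordinary least squares with a single predictor. The advertising and sales figures are hypothetical and were generated from a known straight line so the fitted values are easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: advertising expenditure vs. sales amount (arbitrary units).
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]  # generated as sales = 2 * advertising + 1

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 6.0 + intercept  # prediction for an unseen expenditure
print(slope, intercept, predicted_sales)
```

Real data will not lie exactly on a line, so the fitted slope and intercept then describe the best straight-line approximation rather than an exact relationship.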



Deviation/Anomaly Detection

Detect significant deviations from normal behavior


Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection
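
One simple way to flag a significant deviation is a z-score test: measure how many standard deviations a value lies from the mean. This is only one of many possible techniques and is an assumption here; the transaction amounts and the threshold of 2.0 are hypothetical.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical card-transaction amounts for one account.
amounts = [10, 11, 9, 10, 12, 10, 11, 95]
anomalies = find_anomalies(amounts)
print(anomalies)
```

Because the outlier itself inflates the mean and standard deviation, robust variants based on the median are often preferred in practice.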



DATAOPS

DATAOPS AS DEFINED BY GARTNER


DataOps is a collaborative data management practice focused on improving the communication, integration, and automation of data flows between managers and consumers of data within an organization.



DATAOPS
• DataOps applies Agile development, DevOps and lean manufacturing to data analytics development and operations.
• DataOps considers the end-to-end data analytics process as a sequence of operations, or a data pipeline.
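
The idea of a pipeline as a sequence of operations can be sketched as a list of functions applied in order. The step names and sample rows are hypothetical; real DataOps pipelines add testing, orchestration and monitoring around each step.

```python
from functools import reduce

def run_pipeline(data, steps):
    """Pass the data through each step in order; each step's output feeds the next."""
    return reduce(lambda current, step: step(current), steps, data)

# Hypothetical steps for a tiny analytics pipeline.
def drop_missing(rows):
    return [row for row in rows if row is not None]

def parse_amounts(rows):
    return [float(row) for row in rows]

def total(amounts):
    return sum(amounts)

raw_rows = ["3.5", None, "4.0", "2.5"]
result = run_pipeline(raw_rows, [drop_missing, parse_amounts, total])
print(result)
```

Keeping each operation as a separate function makes it possible to test, replace or automate individual steps without touching the rest of the pipeline.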



Agile
• Iterative and incremental software development methodologies

• Team and its processes and tools are organized around the goal
of publishing releases to the users every few weeks

• A development cycle is called a sprint or an iteration

• Non-sequential product development where market requirements are quickly evolving

• Analogous to a data-analytics environment where each new analysis and report of the data inspires requests for additional queries

Source: The DataOps Cookbook by Christopher Bergh, Gil Benghiat, and Eran Strod
DevOps
• Merging of development and IT/Operations

• Seeks to reduce time to deployment, decrease time to market, minimize defects, and shorten the time required to fix problems

• Focuses on continuous delivery by leveraging on-demand IT resources and by automating test and deployment of code

• By borrowing methods from DevOps, DataOps brings these same improvements to data science.

Source: The DataOps Cookbook by Christopher Bergh, Gil Benghiat, and Eran Strod
DATAOPS



MLOPS

MLOps is an ML engineering culture and practice that aims at unifying ML system development (Dev) and ML system operation (Ops).

[Figure: MLOps at the intersection of Machine Learning, Data Engineering and DevOps]



MLOPS
• The real challenge isn’t building an ML model, but building an integrated ML system and continuously operating it in production.
• MLOps is an ML engineering culture and practice that aims at unifying ML system development (Dev) and ML system operation (Ops).
• The aim is to deploy and maintain ML systems in production reliably and efficiently.
• Automating continuous integration (CI), continuous delivery (CD), and continuous training (CT) for machine learning (ML) systems.
• Frameworks
  – Kubeflow and Cloud Build
  – Amazon AWS MLOps
  – Microsoft Azure MLOps

https://ml-ops.org/content/mlops-principles
Data science steps for ML
Steps can be completed manually or by an automatic pipeline.

• Data extraction: select and integrate the relevant data from various data sources for the ML task.
• Data analysis: perform exploratory data analysis.
• Data preparation: prepare data for the ML task; involves data cleaning, data annotation, data transformation, splitting and feature engineering.
• Model training: implement different algorithms with the prepared data to train various ML models.
• Model evaluation: evaluate on a holdout test set to assess the model quality.
• Model validation: confirm whether the model is adequate for deployment, i.e. its predictive performance is better than a certain baseline.
• Model serving: deploy to a target environment to serve predictions.
• Model monitoring: monitor the model’s predictive performance to potentially invoke a new iteration in the ML process.
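
The validation and monitoring steps amount to two simple checks, sketched below. The function names, the accuracy figures and the 0.05 tolerance are hypothetical; an actual pipeline would take these from its own evaluation and monitoring components.

```python
def is_adequate_for_deployment(holdout_score, baseline_score):
    """Model validation: deploy only if the model beats the baseline on held-out data."""
    return holdout_score > baseline_score

def needs_new_iteration(live_score, validated_score, tolerance=0.05):
    """Model monitoring: flag the model when live performance drops too far below the validated level."""
    return live_score < validated_score - tolerance

# Hypothetical accuracy figures.
deploy = is_adequate_for_deployment(holdout_score=0.82, baseline_score=0.75)
retrain = needs_new_iteration(live_score=0.70, validated_score=0.82)
print(deploy, retrain)
```

In an automated pipeline the first check gates model serving, and the second triggers a return to the data extraction or training steps.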

https://ml-ops.org/content/mlops-principles
MLOPS

https://builtin.com/machine-learning/mlops
Three phases

The MLOps maturity of a project falls into one of three general phases: MLOps level 0, 1 or 2.

• The levels measure the amount of automation used to push data transformations, model training and a final model to production.
• Level 0: a completely manual process based on scripts and notebooks, with rare (and technically complicated) deployments of models for predictions.
• Level 1: the pipeline for model training and deployment is automated.
• Level 2: the process of changing and experimenting with this pipeline is itself automated.
DATAOPS AND MLOPS



SELF READING



DS Process

The standard process involves
1. understanding the problem,
2. preparing the data (samples),
3. developing the model,
4. applying the model on a data set to see how the model may work in the real world, and
5. production deployment.
A popular data mining process framework is CRISP-DM (Cross Industry Standard Process for Data Mining). This framework was developed by a consortium of companies involved in data mining.


CRISP-DM
CRISP-DM Phases

• Cross Industry Standard Process for Data Mining
• Conceived around 1996
• 6 high-level phases
• Used in the IBM SPSS Modeler tool
• Iterative approach to the development of analytical models

CRISP-DM PHASES
• Business Understanding
  – Understand project objectives and requirements.
  – Data mining problem definition.
• Data Understanding
  – Initial data collection and familiarization.
  – Identify data quality issues.
  – Identify initial obvious results.
• Data Preparation
  – Record and attribute selection.
  – Data cleansing.

CRISP-DM PHASES (CONTD.)
• Modeling
  – Run the data mining tools.
• Evaluation
  – Determine if results meet business objectives.
  – Identify business issues that should have been addressed earlier.
• Deployment
  – Put the resulting models into practice.
  – Set up for continuous mining of the data.

CRISP-DM PHASES AND TASKS

A Case Study of Evaluating Job Readiness with Data
Mining Tools and CRISP-DM Methodology

Objectives
• Use of data mining techniques in evaluating job readiness of unemployed population in
Ireland
• whether to automate the classification system with regard to job readiness
Step 1 – Business Understanding
• Job readiness is one of the basic characteristics of a customer identified in the process of
registration.
• Existing system:
• evaluated by a case officer
• hundreds of officers making independent judgments

Ref: Wowczko, I. (2015). A case study of evaluating job readiness with data mining tools and CRISP-DM methodology. International Journal for Infonomics, 8(3), 1066-1070.
Step 1 – Business Understanding

a. Business Objectives:

• To analyze the registered unemployed population and examine the relationships between its various features captured in the database

• Job readiness will decide the type of support offered to a client – job matching or further training opportunities

b. Data Mining Objectives:

• Analysis of attributes, subsets creation

• Use of data mining tools in order to identify the underlying patterns (RapidMiner, Tableau, SQL Server)
Step 2 – Data Understanding
• Records of clients seeking guidance and support from the public employment services in
Ireland
• Sample was representative of all unemployed people registered with public employment
services in Ireland within a full year cycle

Exploratory Statistics (RapidMiner):


• A few numeric variables provide more insight into the demographics of the registered customers

• 48% women and 52% men

• The majority of customers had been interviewed (62%) [INERVIEW_DATE attribute]

• The average customer is about 35.5 years old, the youngest is 16 and the oldest is 101
Step 2 – Data Understanding

Data Visualisation (Tableau): relationship between the amount of information provided by a client and their job readiness
Step 2 – Data Understanding

Average customer age


Step 3 – Data Preparation
• 139 attributes and 60775 rows
• 69 attributes were excluded from further processing: obsolete attributes, attributes with an extremely high number of missing values, and attributes with no information value.
• The remaining 70 attributes (10 integer attributes, 56 nominal attributes and 4 text attributes) were pre-processed

Data Cleaning and Pre-Processing:


• A large number of missing values and incorrect values such as zeros, multiple zeros, ‘none’,
‘no’, spaces, multiple spaces, NULLs, etc.

• Among 70 attributes, only 8 did not require any data cleaning.

• Attributes of the wrong type, e.g. integer attributes regarded as nominal and dates stored as integers
Step 3 – Data Preparation
Data transformation:

• Missing and noisy values stand for ‘no’, ‘zero’, ‘did not happen’, etc.

• Some attributes were relevant in their raw form because they contain information that might be useful; these were converted into binomial attributes (Y, N). 7 attributes were cleaned and transformed with this method (GENERAL_COMMENTS, EMAIL_ADDRESS, MOBILE_NUMBER, PHONE_NUMBER, WORK_SKILLS, COMPUTER_PACKAGES and SPECIAL_NEEDS_REQS).

• A missing INERVIEW_DATE simply means that a client has not yet been interviewed. Filtering out those examples would have resulted in the loss of too much data; the attribute was therefore converted into a binomial attribute (Y, N).
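
The conversion to binomial attributes can be sketched as follows. The list of tokens treated as "missing" follows the incorrect values listed under Data Cleaning (zeros, 'none', 'no', spaces, NULLs); the exact rules used in the case study are not given, so this mapping is an assumption.

```python
# Tokens assumed to mean "no value provided" (after trimming and lower-casing).
MISSING_TOKENS = {"", "none", "no", "null"}

def to_binomial(value):
    """Map a raw attribute value to 'Y' (information present) or 'N' (missing or noise)."""
    if value is None:
        return "N"
    text = str(value).strip().lower()
    if text in MISSING_TOKENS:
        return "N"
    if set(text) == {"0"}:
        return "N"  # a single zero or a run of zeros
    return "Y"

# Hypothetical raw values for an attribute such as EMAIL_ADDRESS.
raw_values = ["client@example.ie", "   ", "000", None, "NULL", "none"]
binomial_values = [to_binomial(v) for v in raw_values]
print(binomial_values)
```

Converting instead of filtering keeps every row in the dataset, which is the reason given above for handling the missing interview date this way.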
Step 4 - Modelling

• Classification with Decision Tree achieved the accuracy of 81.97%

• One split on attribute EXP1 (experience declared by a client with relation to their main
profession - MANCO1)

• The EXP1 attribute was excluded from the dataset to enable the algorithm to find a new split point.
• Another two variables were identified with this method: EXP2 (experience declared by a client with relation to their secondary profession – MANCO2) and FULL_TIME (willingness to work full-time)

• They also tried Random Forest, SVM, KNN etc.


Step 5 – Evaluation and Conclusion
• Consistency in the examined sample, the top discriminators between the two class labels are
attributes such as EXP1, EXP2 and FULL_TIME

• There is a very straightforward relationship among the features of the customers – a person is job ready if they have some work experience and are willing to be employed full-time

• The observed pattern can be easily recognized and there was no previously unknown
information discovered in the process of mining the dataset

• It has been concluded that automating the system would not offer any advantage over the
existing classification based on simple heuristics

• Therefore, no step 6 - Deployment
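
The simple heuristic that the study found sufficient can be written in a few lines. The boolean encoding of EXP1, EXP2 and FULL_TIME below is an assumption made for illustration; the source does not describe how these attributes are stored.

```python
def is_job_ready(has_experience, willing_full_time):
    """Heuristic from the conclusion: some work experience and willingness to work full-time."""
    return has_experience and willing_full_time

# Hypothetical client records: (EXP1 declared, EXP2 declared, FULL_TIME willingness).
clients = [(True, False, True), (False, False, True), (False, True, False)]
job_ready = [is_job_ready(exp1 or exp2, full_time) for exp1, exp2, full_time in clients]
print(job_ready)
```

When a rule this short reproduces the classification, a trained model adds operational cost without adding insight, which matches the decision not to proceed to deployment.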


WHY CRISP-DM?

• The data mining process must be reliable and repeatable by people with little data mining skill.
• CRISP-DM provides a uniform framework for
  – guidelines.
  – experience documentation.
• CRISP-DM is flexible to account for differences.
  – Different business/agency problems.
  – Different data.

DATA SCIENCE TEAM BUILDING

• Get to know each other for better communication
• Foster team cohesion and teamwork
• Encourage collaboration to boost team productivity and performance

https://towardsdatascience.com/why-team-building-is-important-to-data-scientists-a8fa74dbc09b
ORGANISATION OF DATA SCIENCE TEAM
[1] Decentralized
[2] Functional
[3] Consulting
[4] Centralized
[5] Center of Excellence
[6] Federated



SOFTWARE ENGINEERING
In general,
Software engineering is an engineering discipline that is concerned with all
aspects of software production.
Software includes computer programs, all associated
documentation, and configuration data that are needed for
software to work correctly.
Waterfall model, Iterative models, Agile models



DATA SCIENCE PROCESS



DATA SCIENCE VS. SOFTWARE ENGINEERING

Data Science | Software Engineering
Data science involves analyzing huge amounts of data, with some aspects of programming and development. | Software engineering focuses on creating software that serves a specific purpose.
Uses a methodology involving various phases beginning from requirements specification through model deployment to better decision making. | Uses a methodology involving various phases beginning from requirements specification through software deployment into production.
DATA SCIENCE VS. SOFTWARE ENGINEERING

Data Science | Software Engineering
Involves collecting and analyzing data | Concerned with creating useful applications
Data scientists utilize the ETL (Extract, Transform, Load) process | Software engineers use the SDLC process
More process-oriented | Uses frameworks like Waterfall, Agile, and Spiral
Data scientists use tools like Amazon S3, MongoDB, Hadoop, and … | Software engineers use tools like Rails, Django, Flask, and Vue.js
DATA SCIENCE VS. BUSINESS INTELLIGENCE



DATA SCIENCE VS. BUSINESS INTELLIGENCE

              | Data Science | Business Intelligence
Perspective   | Looking forward | Looking backward
Analysis      | Predictive; Explorative | Descriptive; Comparative
Data          | Same data, new analysis; Listens to data; Distributed | New data, same analysis; Speaks for data; Warehoused
Scope         | Specific to business question | Unlimited
Expertise     | Data scientist | Business analyst
Deliverable   | Insight or story | Table or report
Applicability | Future, correction for … | Historic, confounding …
DATA SCIENTIST VS. BUSINESS ANALYST



DATA SCIENCE VS. STATISTICS
                   | Data Science | Statistics
Type of problem    | Semi-structured or unstructured | Well structured
Inference model    | Explicit inference | No inference
Analysis objective | Need not be well formed | Well-formed objective
Type of analysis   | Explorative | Confirmative
Data collection    | Data collection is not linked to the objective | Data collected based on the objective
Size of dataset    | Large; heterogeneous | Small; homogeneous
Paradigm           | Theory and heuristic | Theory based
REFERENCES

Introducing Data Science by Cielen, Meysman and Ali

The Art of Data Science by Roger D Peng and Elizabeth Matsui

https://data-flair.training/blogs/data-science-use-cases

https://www.northeastern.edu/graduate/blog/what-does-a-data-scientist-do/

https://www.visual-paradigm.com/guide/software-development-process/what-is-a-software-process-model/

Building an Analytics-Driven Organization, Accenture

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/

https://www.cio.com/article/230532/what-is-a-data-scientist-a-key-data-analytics-role-and-a-lucrative-career.html

https://atlan.com/what-is-dataops/

THANK YOU