Spark Python Course APPLY Project Solution Guide Hints

This document provides guidance for solving problems in the APPLY course project, including preparing the data through cleaning, augmentation, and analysis, then using the data to predict defaulters through machine learning algorithms and group the data with kmeans clustering. The guide recommends steps like data cleansing, adding new fields, querying the data, performing correlation analysis and machine learning, and clustering the data into groups. Students are encouraged to try their own approaches as well.

Uploaded by

Deepak

Spark + Python: DO Big Data Analytics & ML

Course Apply Project: Solution Hints

This document contains a proposed solution guide for the problems provided in the course APPLY
project. This is just a guide. You are free to try out and solve the problems in your own way. We recommend
that you follow these steps:

1. Data Cleansing and Augmentation

In this part, you will clean and prepare the data for further analysis.

1. Load the CSV file into a data frame.
2. Remove the header lines.
3. The CSV file has junk characters in some rows. Remove them.
4. The CSV file has double quotes around certain values. Remove them.
5. Write a conversion function that converts this text RDD into a Row RDD of transformed
data. Apply the following transformations to the data:
   a. Create a new age variable where the age is rounded off to the nearest 10, so the values
      are 10, 20, 30, etc. Required for PR#06.
   b. The Sex column contains both numeric (1, 2) and text (M, F) representations. Normalize
      them to 1 and 2.
   c. Compute the average billed amount (optional). Optional items are extras you can try out
      additionally.
   d. Compute the average pay amount (optional).
   e. Compute the average pay duration. Make sure the values are positive; the dataset has a lot
      of negative values. Required for PR#04.
   f. Compute the average percentage paid as (average paid amount / average billed amount).
      This pursues the hypothesis that this value might help predict defaulters: a low
      percentage paid "may" result in a high rate of defaulting. This is where you get creative
      with the solution. Feel free to try other derived values too.
6. Add a new column SEXNAME that contains Male and Female as values. Create a data frame with
the IDs and their values, then join it with the main data frame. Required for PR#02.
7. Add a new column ED_STR that contains an actual string for education. Create a data frame
with the IDs and their values, then join it with the main data frame. Required for PR#03.
8. Add a new column MARR_DESC that contains a description for marital status. Create a data
frame with the IDs and their values, then join it with the main data frame. Required for
PR#04.
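
The conversion function from step 5 can be sketched in plain Python before wiring it into Spark. This is only a sketch under stated assumptions: the column layout (CUSTID, SEX, AGE, then three pay durations, three billed amounts and three paid amounts), the M -> 1 / F -> 2 mapping, the whitelist used to drop junk characters, and all function names are hypothetical. Percentage paid is taken here as paid over billed. Adjust everything to the real file.

```python
import re

# Hypothetical column layout, assumed for illustration only:
# CUSTID, SEX, AGE, 3 pay durations, 3 billed amounts, 3 paid amounts.
N_MONTHS = 3


def clean_line(line):
    """Drop double quotes, then any character outside a small whitelist."""
    return re.sub(r"[^A-Za-z0-9,.\-]", "", line.replace('"', ""))


def mean(values):
    return sum(values) / len(values)


def transform(line):
    """Convert one raw CSV line into a dict of cleaned and derived fields."""
    f = clean_line(line).split(",")
    durations = [float(v) for v in f[3:3 + N_MONTHS]]
    billed = [float(v) for v in f[3 + N_MONTHS:3 + 2 * N_MONTHS]]
    paid = [float(v) for v in f[3 + 2 * N_MONTHS:3 + 3 * N_MONTHS]]
    avg_billed, avg_paid = mean(billed), mean(paid)
    return {
        "custid": int(f[0]),
        # Assumed mapping: M -> 1, F -> 2; numeric codes pass through.
        "sex": {"M": 1, "F": 2}.get(f[1].upper()) or int(f[1]),
        # Round half up to the nearest 10 (Python's round() rounds half to even).
        "age_group": int(float(f[2]) / 10 + 0.5) * 10,
        "avg_bill_amt": avg_billed,
        "avg_pay_amt": avg_paid,
        # One way to keep durations positive: average the absolute values.
        "avg_pay_duration": mean([abs(d) for d in durations]),
        # Percentage of the billed amount that was paid; 0 when nothing was billed.
        "pct_paid": 100.0 * avg_paid / avg_billed if avg_billed else 0.0,
    }


def parse_lines(lines):
    """Skip header lines (assumed: first field is not numeric), transform the rest."""
    return [transform(l) for l in lines if clean_line(l).split(",")[0].isdigit()]
```

In Spark, a function like `transform` would typically be passed to `map` on the text RDD, returning a `pyspark.sql.Row` instead of a dict so that the result can be turned into a data frame.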

2. Perform Analysis
1. Load the data frame as a temp table/view
2. Query the temp table to solve PR#02
3. Query the temp table to solve PR#03
4. Query the temp table to solve PR#04
5. Perform correlation analysis
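
To make the correlation step concrete, here is a minimal sketch of the Pearson correlation coefficient in plain Python, applied to each column against a target column. The function names and the sample column names are hypothetical; on a Spark data frame the same number is available through `DataFrame.stat.corr`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlations_with_target(rows, target):
    """Correlate every other column of a list of dicts with the target column."""
    target_values = [row[target] for row in rows]
    return {
        col: pearson([row[col] for row in rows], target_values)
        for col in rows[0]
        if col != target
    }
```

Columns whose coefficient is close to 0 carry little linear signal about the target, while values near +1 or -1 suggest attributes worth keeping for the prediction step.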

3. Predict Defaulters (PR#05)

1. Prepare the data in the standard manner for machine learning:
   a. Convert to labeled points.
   b. Add indexing.
   c. Split into training and test data sets.
2. Run classification using three algorithms: Decision Trees, Random Forests and Naïve Bayes.
Find out which one gives the highest accuracy on the test data set.
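
The evaluation loop around the three classifiers can be sketched without Spark. The sketch below only covers the split and the accuracy comparison; the classifiers themselves would come from Spark's machine learning library. The function names, the 70/30 split and the seed are assumptions for illustration.

```python
import random


def split_train_test(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of test rows whose predicted label equals the actual label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def best_model(accuracy_by_model):
    """Name of the model with the highest test accuracy."""
    return max(accuracy_by_model, key=accuracy_by_model.get)
```

Measure accuracy on the test set only; accuracy on the training set tends to flatter tree-based models. On a Spark data frame, `randomSplit` plays the role of `split_train_test`.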

4. Group Data based on Attributes (PR#06)

1. Create a filtered dataset with only the attributes required for grouping.
2. Perform centering and scaling on all the values
3. Use KMeans clustering to group the data into 4 clusters.
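
Centering and scaling matter because KMeans groups points by Euclidean distance, so an attribute with a large numeric range would otherwise dominate the clusters. A minimal plain-Python sketch of the idea follows; the function name is hypothetical, and using the population standard deviation is a choice made here. In Spark, `pyspark.ml.feature.StandardScaler` does this job before the data is passed to `pyspark.ml.clustering.KMeans`.

```python
import math


def center_and_scale(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n)
        for col, m in zip(columns, means)
    ]
    # A constant column has zero spread; map it to 0.0 instead of dividing by zero.
    return [
        [(v - m) / s if s else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]
```

After this step every attribute contributes on the same scale, so the four clusters reflect the combined pattern of the attributes and not just the one with the largest values.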

Compare your output with the solution provided. Your output does not need to match the provided
solution exactly. It's just a guide.
