Machine Learning 2 Working-pages-Deleted

The document discusses the challenges faced by US businesses in attracting qualified foreign talent and outlines the role of the Office of Foreign Labor Certification (OFLC) in processing visa applications. It emphasizes the need for a machine learning solution to streamline the visa approval process and presents a detailed analysis of the data, model building, and evaluation of various classification models. The final recommendation is to use the Gradient Boosting model with oversampled data due to its high accuracy and balanced performance metrics.

Uploaded by

murali.dhiviya96
Copyright
© All Rights Reserved

MACHINE LEARNING 2

MODEL BUILDING AND TUNING

OFLC EASYVISA

BUSINESS REPORT

PGPDSBA.O. AUG24.A

DHIVIYA MURALIDHARAN
Problem Statement
Business communities in the United States are facing high demand for human resources, but one of
the constant challenges is identifying and attracting the right talent, which is perhaps the most
important element in remaining competitive. Companies in the United States look for hard-working,
talented, and qualified individuals both locally and abroad.

The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United
States to work on either a temporary or permanent basis. The act also protects US workers against
adverse impacts on their wages or working conditions by ensuring US employers' compliance with
statutory requirements when they hire foreign workers to fill workforce shortages. The immigration
programs are administered by the Office of Foreign Labor Certification (OFLC).

OFLC processes job certification applications for employers seeking to bring foreign workers into the
United States and grants certifications in those cases where employers can demonstrate that there
are not sufficient US workers available to perform the work at wages that meet or exceed the wage
paid for the occupation in the area of intended employment.

Objective
In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary
and permanent labor certifications. This was a nine percent increase in the overall number of
processed applications from the previous year. The process of reviewing every case is becoming a
tedious task as the number of applicants is increasing every year.

The increasing number of applicants every year calls for a Machine Learning based solution that can
help in shortlisting the candidates having higher chances of VISA approval. OFLC has hired the firm
EasyVisa for data-driven solutions. You as a data scientist at EasyVisa have to analyze the data
provided and, with the help of a classification model:

1. Facilitate the process of visa approvals.

2. Recommend a suitable profile for the applicants for whom the visa should be certified or
denied based on the drivers that significantly influence the case status.

Data Description
The data contains the different attributes of the employee and the employer. The detailed data
dictionary is given below.

• case_id: ID of each visa application

• continent: Continent of the employee

• education_of_employee: Information of education of the employee

• has_job_experience: Does the employee have any job experience? Y = Yes; N = No

• requires_job_training: Does the employee require any job training? Y = Yes; N = No

• no_of_employees: Number of employees in the employer's company

• yr_of_estab: Year in which the employer's company was established


• region_of_employment: Information of foreign worker's intended region of employment in
the US.

• prevailing_wage: Average wage paid to similarly employed workers in a specific occupation
in the area of intended employment. The purpose of the prevailing wage is to ensure that
the foreign worker is not underpaid compared to other workers offering the same or similar
service in the same area of employment.

• unit_of_wage: Unit of prevailing wage. Values include Hourly, Weekly, Monthly, and Yearly.

• full_time_position: Is the position of work full-time? Y = Full-Time Position; N = Part-Time
Position

• case_status: Flag indicating if the Visa was certified or denied

Data Overview
Data Type

Statistical summary
Observations and Insights
• The data has 25840 rows and 12 columns, which give information about the employees and
employers

• It has no duplicate or missing values


• Categorical variables give information about employee’s education, job experience, whether
the employee requires training or not, continent, etc.

• Numerical variables give information about the number of employees in the company, year
of establishment of the company, prevailing wage, etc.

• The case_id column was removed, as it had only unique values

• no_of_employees had negative values, which were replaced with their absolute values, as a
company cannot have a negative number of employees

• The majority of employees are from Asia, are applying to the Northeast region, hold a
Bachelor’s degree, have prior work experience, do not need job training, and are paid annually.

• Employers' years of establishment range from 1800 to 2016, and company size (employee
count) also varies widely, from 11 to 602069.

• A few outliers were detected in all the numerical variables; they were not treated because
they were genuine

• Year of establishment and number of employees were binned to give a clearer picture.
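The cleaning steps listed above can be sketched with pandas as follows. This is a minimal illustration, not the report's actual notebook: the sample rows are made up, and the bin edges and the new column names (estab_period, company_size) are assumptions, since the report does not state the ones it used.

```python
import pandas as pd

# Tiny made-up sample using column names from the data dictionary.
df = pd.DataFrame({
    "case_id": ["EZYV01", "EZYV02", "EZYV03", "EZYV04"],
    "no_of_employees": [-25, 2412, 44444, 98],
    "yr_of_estab": [1897, 2002, 2008, 1985],
})

# case_id is unique per row, so it carries no predictive signal.
df = df.drop(columns=["case_id"])

# A company cannot have a negative headcount; take the absolute value.
df["no_of_employees"] = df["no_of_employees"].abs()

# Hypothetical bin edges (the report does not list the bins it used).
df["estab_period"] = pd.cut(
    df["yr_of_estab"],
    bins=[1799, 1900, 1950, 2000, 2016],
    labels=["1800-1900", "1901-1950", "1951-2000", "2001-2016"],
)
df["company_size"] = pd.cut(
    df["no_of_employees"],
    bins=[0, 100, 1000, 10000, float("inf")],
    labels=["<=100", "101-1000", "1001-10000", ">10000"],
)
print(df)
```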

UNIVARIATE ANALYSIS

Distribution of Continent and Education of employee


Distribution of Job experience and Job Training

Distribution of Year of establishment and Region of employment


Distribution of Unit of wage and Full-time position

BIVARIATE, CORRELATION AND PAIRPLOT ANALYSIS


Distribution of continent and education of employee by case status

Distribution of Job experience and training by case status

Distribution of region of employment, Unit of wages and Full-time position by case status

Distribution of Year of establishment with case status

Distribution of prevailing wage with case status

Distribution of no of employees with case status


Observations and Insights
• Majority of employees (>50%) are from Asia
• The share of cases certified is highest for Europe (80%), then Africa (72%), then Asia
(65%), and lowest for South America and North America (around 60%)
• Majority of employees have either a bachelor's (40%) or a master's (38%) degree; a minority
of applicants have either a doctorate (8%) or only a high school diploma (13%)
• The share of cases certified is highest for a doctorate (>86%), followed by a master's
degree (>76%), then a bachelor's (~62%) and high school (<35%)
• 58% of all applicants have prior job experience and 42% do not. The certification rate is
higher for applicants with prior job experience (75%) than for applicants without it (~56%)
• The majority of positions do not require additional job training and are full-time rather
than part-time. Neither attribute was found to affect case status: the share of cases
certified is about the same regardless of their values
• Majority of the applications are to Northeast (28.3%), then South (27.5%), then West
(25.8%), Midwest (16.9%) and least to Island (1.5%) regions
• The certification rate is highest for the Midwest (75%), then the South (70%), then the
Northeast, West, and Island regions (60%)
• Region of employment being Midwest is an important attribute contributing positively to a
case being certified
• Approximately, 67% of all cases are approved and 33% of all cases are denied
• The distribution of number of employees is right-skewed with several outliers. However,
roughly twice as many cases are certified (about 65%) as denied, both for employers with
fewer employees and for employers with more employees
• The median prevailing wage for certified applications is slightly higher compared to denied
applications
• Year of establishment does not appear to influence visa certification or denial.
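Certification rates per category, like the ones quoted above, can be computed with a row-normalised cross-tabulation. The sketch below uses a handful of made-up rows; only the column names come from the data dictionary, and the exact category labels are assumptions.

```python
import pandas as pd

# Made-up rows for illustration; the real data has thousands of rows.
df = pd.DataFrame({
    "education_of_employee": ["Doctorate", "Doctorate", "High School",
                              "High School", "High School", "Master's"],
    "case_status": ["Certified", "Certified", "Denied",
                    "Denied", "Certified", "Certified"],
})

# normalize="index" makes every row sum to 1, so the "Certified"
# column is the certification rate for that education level.
rates = pd.crosstab(
    df["education_of_employee"], df["case_status"], normalize="index"
)
print(rates.sort_values("Certified", ascending=False))
```

The same call works for continent, has_job_experience, region_of_employment, and the other categorical columns.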

DATA PROCESSING
There are several outliers, but they were not treated since they were found to be genuine. There
are no missing or duplicate values.

Data Preparation for Modelling


• Predict which visa application can be certified and denied
• One hot encoding the categorical variables
• Split the data into train, validation, and test sets so that the model we build can be
evaluated after hyperparameter tuning.
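A minimal sketch of the preparation steps above, using pandas and scikit-learn. The data is synthetic, and the 60/20/20 split proportions, the random seed, and the 1 = Certified encoding are assumptions; the report does not state the values it used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with a few of the report's columns: 100 rows, 67 Certified, 33 Denied.
n = 100
df = pd.DataFrame({
    "has_job_experience": ["Y", "N"] * (n // 2),
    "unit_of_wage": ["Yearly", "Hourly", "Yearly", "Weekly"] * (n // 4),
    "prevailing_wage": [50000.0 + 100 * i for i in range(n)],
    "case_status": ["Certified", "Certified", "Denied"] * (n // 3) + ["Certified"],
})

# Encode the target as 1 = Certified, 0 = Denied (assumed encoding).
y = (df["case_status"] == "Certified").astype(int)

# One-hot encode the categorical columns; drop_first removes one redundant level each.
X = pd.get_dummies(df.drop(columns=["case_status"]), drop_first=True)

# First hold out 20% as the test set, then split the rest 75/25 into train/validation.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=1
)
print(X_train.shape, X_val.shape, X_test.shape)
```

Stratifying on the target keeps the certified/denied ratio roughly the same in all three splits.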
MODEL BUILDING CRITERION
Types of Wrong Predictions:
False Negative (FN): Predicting an applicant should be denied when they should be approved.
False Positive (FP): Predicting an applicant should be approved when they should be denied.
Importance of Both Cases:
False Positive (FP):
Consequence: An unqualified employee gets a job that should have been filled by a US citizen.
Impact: Reduced quality of workforce, potential legal and ethical implications.
False Negative (FN):
Consequence: A qualified applicant is denied, and critical positions remain unfilled.
Impact: Reduced productivity and competitiveness of US companies, economic implications.
Reducing Losses:
Prioritize Review Process: Identify candidates predicted to be approved so agents can prioritize
these applications. This optimizes resource allocation and review efficiency.
Evaluation Metric - F1 Score:
Why F1 Score: It is the harmonic mean of precision and recall, making it a balanced metric to
minimize both False Positives and False Negatives.
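As a worked example of the metric, the sketch below computes precision, recall, and F1 from confusion-matrix counts. The counts are made up for illustration and are not taken from the report.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class."""
    precision = tp / (tp + fp)   # of the predicted approvals, how many were right
    recall = tp / (tp + fn)      # of the true approvals, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts: 90 true positives, 30 false positives, 10 false negatives.
precision, recall, f1 = precision_recall_f1(tp=90, fp=30, fn=10)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because the harmonic mean is pulled towards the smaller of the two values, a model cannot score a high F1 by trading away precision for recall or the reverse.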
Balanced Class Weights:
Purpose: Ensures the model does not favor one class over the other, focusing equally on both
approval and denial predictions.
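One common way to balance classes is to weight each class inversely to its frequency, which is the formula scikit-learn applies for class_weight="balanced". Whether the report used exactly this setting is an assumption. The sketch applies the formula to a 67/33 split like the one observed in the data.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Illustrative labels mirroring the roughly 67% certified / 33% denied split.
labels = ["Certified"] * 67 + ["Denied"] * 33
weights = balanced_class_weights(labels)
print(weights)
```

With these weights the minority class (Denied) counts about twice as much per sample, so both classes contribute equally to the training loss in total.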

MODEL BUILDING METHODS & STEPS


1. Model building on original data
2. Model building on Oversampled data
3. Model building on Under sampled data
4. Hyper tuning the model
5. Model Comparison and Final model selection
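The report does not name the resampling technique used in steps 2 and 3. As one simple possibility, the sketch below balances classes by random duplication (oversampling) or random removal (undersampling); the function name resample_to_balance and the seed are hypothetical.

```python
import random
from collections import Counter

def resample_to_balance(rows, labels, mode, seed=1):
    """Randomly over- or undersample so every class has the same count."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    sizes = [len(group) for group in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        if len(group) >= target:
            chosen = rng.sample(group, target)  # draw without replacement
        else:
            # keep every original row and add random duplicates
            chosen = group + rng.choices(group, k=target - len(group))
        out_rows.extend(chosen)
        out_labels.extend([label] * target)
    return out_rows, out_labels

rows = list(range(100))            # stand-ins for feature rows
labels = [1] * 67 + [0] * 33       # 1 = certified, 0 = denied
over_rows, over_labels = resample_to_balance(rows, labels, "over")
under_rows, under_labels = resample_to_balance(rows, labels, "under")
print(Counter(over_labels), Counter(under_labels))
```

Only the training split should be resampled; the validation and test sets keep the original class balance so that their scores reflect real-world proportions.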

Models built include


• Bagging
• Random Forest
• AdaBoost
• Gradient Boost
• Decision Tree
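Each model is built on the training set and evaluated on both the training and validation sets, for
the original, oversampled, and under sampled data (5 models x 3 datasets = 15 models).
From the 15 models built, we select 4 for hyper tuning and then run only the best of them
on the test data. Keeping the test set out of model selection avoids data leakage.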
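The comparison loop can be sketched with scikit-learn as below. The data is a synthetic stand-in generated with make_classification, and the default hyperparameters and random seed are assumptions; the report's own settings are not stated.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the visa data, with roughly a 33/67 class split.
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.33, 0.67], random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

models = {
    "Bagging": BaggingClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(random_state=1),
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "Gradient Boost": GradientBoostingClassifier(random_state=1),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)                      # fit on training data only
    train_f1 = f1_score(y_train, model.predict(X_train))
    val_f1 = f1_score(y_val, model.predict(X_val))
    results[name] = (train_f1, val_f1)
    # A large train/validation gap is the overfitting signal.
    print(f"{name:15s} train F1={train_f1:.3f} val F1={val_f1:.3f} "
          f"gap={train_f1 - val_f1:.3f}")
```

Repeating the loop with oversampled and under sampled training sets, while keeping the validation set unchanged, gives the 15 models compared in the report.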
1. Model building on original data

GBM and AdaBoost: These models show minimal differences between training and validation
performance, indicating they generalize well and are less likely to overfit.

Bagging and Random Forest: Both exhibit signs of overfitting. These ensemble methods often
perform well, but tuning hyperparameters like the number of trees, depth, and using regularization
techniques can help improve generalization.

Decision Tree: The model is highly overfitting, indicating a need for pruning or setting depth
limitations.

2. Model building on Over sampled data

GBM and AdaBoost: These models show minimal differences between training and validation
performance, indicating they generalize well and are less likely to overfit.

Bagging and Random Forest: Both exhibit signs of overfitting.

Decision Tree: The model is highly overfitting. Pruning or limiting the depth of the tree may help to
mitigate overfitting.

3. Model building on Under sampled data

GBM and AdaBoost: These models show minimal or negative differences between training and
validation performance, indicating they generalize well and are less likely to overfit.
Bagging and Random Forest: Both exhibit signs of overfitting. Consider tuning hyperparameters such
as the number of trees, maximum depth, or using regularization techniques to improve
generalization.

Decision Tree: The model is highly overfitting. Pruning or limiting the depth of the tree may help to
mitigate overfitting.

• After building 15 models, it was observed that both the GBM and AdaBoost models, trained
on the oversampled and under sampled datasets, exhibited strong performance on both the
training and validation datasets.

• Models can overfit after under sampling or oversampling, so it is better to
tune them to get a generalized performance

• We will tune these 4 models using the same data (under sampled or oversampled) on which
they were trained before
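Tuning of this kind is often done with a randomized search over a small hyperparameter space, scored by F1. The sketch below shows the idea for Gradient Boosting on synthetic data; the search method, the parameter values, and the number of iterations are all assumptions, since the report does not list them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the (resampled) training data.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.33, 0.67], random_state=1)

# Hypothetical search space; not the grid used in the report.
param_distributions = {
    "n_estimators": [50, 100, 150],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 0.9, 1.0],
    "max_depth": [2, 3],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=5,          # number of random combinations to try
    scoring="f1",      # same metric as the model building criterion
    cv=3,
    random_state=1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The tuned estimator is available as search.best_estimator_ and would then be scored on the untouched validation set.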

4. Hyper Tuning the model


a. Tuning AdaBoostClassifier model with Oversampled data

Confusion Matrix on Train set

Performance Metrics on train and validation set

b. Tuning AdaBoostClassifier model with Under sampled data

Confusion Matrix on Train set


Performance Metrics on train and validation set

c. Tuning Gradient Boosting model with Under sampled Data

Confusion matrix on Train set

Performance metrics on Train and validation set

d. Tuning Gradient Boosting model with Oversampled data

Confusion matrix on Train set

Performance metrics on Train and validation set


5. Model Comparison and Final Model Selection
Based on the evaluation results of the hyper tuned models for visa application prediction, it is
evident that all four models demonstrate notable enhancements in performance metrics
compared to their default counterparts. We can analyse this with the help of the performance
comparison data below:
On train set

On Validation set

Final Model Selection


1. Gradient Boosting with Oversampled Data:
o Highest validation accuracy (0.75) and F1 score (0.82).
o High recall (0.86) and precision (0.78).
o The training performance is also strong, indicating the model generalizes well
without overfitting.
2. AdaBoost with Oversampled Data:
o Validation accuracy (0.74) and F1 score (0.81) are slightly lower than Gradient
Boosting with Oversampled Data.
o However, it has a slightly higher recall (0.87), making it effective at identifying
positive cases.
3. Gradient Boosting with Under sampled Data and AdaBoost with Under sampled Data:
o Both models have lower validation accuracy and recall compared to their
oversampled counterparts.
o This indicates that under sampling may not be as effective for this dataset.
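The reported validation figures can be cross-checked against the F1 definition. The sketch below recomputes the Gradient Boosting F1 from its precision and recall, and derives the precision implied by the AdaBoost figures. Because the inputs are rounded to two decimals, the results are approximate.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def precision_from(f1, recall):
    """Precision implied by F1 and recall, rearranged from F1 = 2PR / (P + R)."""
    return f1 * recall / (2 * recall - f1)

# Gradient Boosting, oversampled: precision 0.78, recall 0.86 (from the report).
gb_f1 = f1_from(precision=0.78, recall=0.86)
print(round(gb_f1, 2))          # 0.82, matching the reported F1

# AdaBoost, oversampled: F1 0.81, recall 0.87 (from the report).
ada_precision = precision_from(f1=0.81, recall=0.87)
print(round(ada_precision, 2))  # 0.76
```

The implied AdaBoost precision of about 0.76 is below Gradient Boosting's 0.78, which is why its slightly higher recall does not translate into a higher F1.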
Recommendation:
Based on the comparison, Gradient Boosting with Oversampled Data is the best model to run on the
test data. It has the highest validation performance, indicating it is likely to generalize well to
unseen data, and its strong training performance suggests it has learned from the
oversampled data without overfitting. It also has the highest F1 scores on both the validation and
training sets. The high F1 score signifies that the model balances
precision and recall, making it a reliable choice for predicting both positive and negative cases
in the dataset.
This model should provide the most consistent and accurate results on test data and a robust
performance in real-world scenarios.

Final model evaluated on the test set

It is evident from the confusion matrix that this model scores about 81% on the test set while
keeping both false positives and false negatives in check, so that an ineligible applicant is
unlikely to be certified and an eligible applicant is unlikely to be denied.

The most important features utilized in identifying the target variable, i.e., case_status, are:
Recommendations and Insights
The profile of the applicants for whom the visa can be certified:
• Education level - At least has a Bachelor's degree - Master's and doctorate are
preferred.
• Job Experience - has job experience.
• Prevailing wage - has a high prevailing wage, most likely with a yearly unit of wage (the
median prevailing wage of the employees for whom the visa got certified is around 72k)
• Continent - it has been observed that applicants from Europe, Africa, and Asia have
higher chances of visa certification.
The profile of the applicants for whom the visa status can be denied:
• Education level - high school degree.
• Job Experience - Doesn't have any job experience.
• Prevailing wage and unit of wage - applicants with hourly units of wage (the median
prevailing wage of the employees for whom the visa got denied is around 65k)
• Continent - it has been observed that applicants from South America, North
America, and Oceania have higher chances of visa applications getting denied
Additional information about the applicant, such as gender, marital status, specialization of the
degree, and number of years of experience, could be collected.
With respect to the employer, the salary slab offered for the applicant's experience and the
sector in which the employer operates would also be useful.
