Machine Learning 2 Working-pages-Deleted
OFLC EASYVISA
BUSINESS REPORT
PGPDSBA.O. AUG24.A
DHIVIYA MURALIDHARAN
Problem Statement
Business communities in the United States are facing high demand for human resources, but one of
the constant challenges is identifying and attracting the right talent, which is perhaps the most
important element in remaining competitive. Companies in the United States look for hard-working,
talented, and qualified individuals both locally and abroad.
The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United
States to work on either a temporary or permanent basis. The act also protects US workers against
adverse impacts on their wages or working conditions by ensuring US employers' compliance with
statutory requirements when they hire foreign workers to fill workforce shortages. The immigration
programs are administered by the Office of Foreign Labor Certification (OFLC).
OFLC processes job certification applications for employers seeking to bring foreign workers into the
United States and grants certifications in those cases where employers can demonstrate that there
are not sufficient US workers available to perform the work at wages that meet or exceed the wage
paid for the occupation in the area of intended employment.
Objective
In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary
and permanent labor certifications. This was a nine percent increase in the overall number of
processed applications from the previous year. The process of reviewing every case is becoming a
tedious task as the number of applicants is increasing every year.
The increasing number of applicants every year calls for a Machine Learning based solution that can
help in shortlisting the candidates having higher chances of VISA approval. OFLC has hired the firm
EasyVisa for data-driven solutions. You as a data scientist at EasyVisa have to analyze the data
provided and, with the help of a classification model:
2. Recommend a suitable profile for the applicants for whom the visa should be certified or
denied based on the drivers that significantly influence the case status.
Data Description
The data contains the different attributes of the employee and the employer. The detailed data
dictionary is given below.
• unit_of_wage: Unit of prevailing wage. Values include Hourly, Weekly, Monthly, and Yearly.
Data Overview
Data Type
Statistical summary
Observations and Insights
• The data has 25840 rows and 12 columns describing the employees and their employers.
• The numerical variables give information about the number of employees in the company, the
year of establishment of the company, the prevailing wage, etc.
• No_of_employees contained negative values, which were replaced with their absolute values,
since a company cannot have a negative employee count.
• The majority of the employees belong to the Asia continent and the Northeast region, hold a
Bachelor's degree, have prior work experience, need job training, and are paid annually.
• Employers' years of establishment range from 1800 to 2016, and company size (employee count)
also varies widely, from 11 to 602069.
• A few outliers were detected in all the numerical variables; they were not treated because
they are genuine values.
• Year of establishment and number of employees were binned to give a clearer picture.
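The cleaning and binning steps described above can be sketched in pandas. This is only an illustration on toy rows, not the report's actual code: the column name yr_of_estab, the bin edges, and the bin labels are all assumptions made for the example.

```python
import pandas as pd

# Toy rows standing in for the real data (values are illustrative only).
# "yr_of_estab" is an assumed column name; the bin edges below are hypothetical.
df = pd.DataFrame({
    "no_of_employees": [-25, 11, 450, 602069, 3200],
    "yr_of_estab": [1800, 1995, 2016, 1950, 2005],
})

# A company cannot have a negative head count, so take the absolute value.
df["no_of_employees"] = df["no_of_employees"].abs()

# Bin company size into ordered categories (right-closed intervals).
size_bins = [0, 100, 1000, 10000, float("inf")]
size_labels = ["<=100", "101-1000", "1001-10000", ">10000"]
df["employees_bin"] = pd.cut(df["no_of_employees"], bins=size_bins, labels=size_labels)

# Bin year of establishment; 1799 is used as the left edge so that 1800 is included.
year_bins = [1799, 1900, 1950, 2000, 2016]
year_labels = ["1800-1900", "1901-1950", "1951-2000", "2001-2016"]
df["estab_bin"] = pd.cut(df["yr_of_estab"], bins=year_bins, labels=year_labels)

print(df)
```

Taking the absolute value assumes the negative counts are sign-entry errors rather than missing-value codes, which is the same assumption the report makes.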
UNIVARIATE ANALYSIS
DATA PROCESSING
There are several outliers, but they were not treated since they were found to be genuine. There
are no missing or duplicate values.
GBM and AdaBoost: These models show minimal differences between training and validation
performance, indicating they generalize well and are less likely to overfit.
Bagging and Random Forest: Both exhibit signs of overfitting. These ensemble methods often
perform well, but tuning hyperparameters such as the number of trees and the maximum depth, or
using regularization techniques, can help improve generalization.
Decision Tree: The model overfits heavily, indicating a need for pruning or a depth limit.
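The train-versus-validation comparison behind these observations can be sketched as follows. This is a minimal illustration on synthetic data from scikit-learn, not the report's pipeline: the F1 score is an assumed metric (the report does not name one here), and the helper score_gaps is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced data standing in for the visa dataset.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           weights=[0.35, 0.65], flip_y=0.1, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Bagging": BaggingClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=1),
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "GBM": GradientBoostingClassifier(random_state=1),
}

def score_gaps(models, X_train, y_train, X_val, y_val):
    """Fit each model and return (train F1, validation F1, gap) per model."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        train_f1 = f1_score(y_train, model.predict(X_train))
        val_f1 = f1_score(y_val, model.predict(X_val))
        results[name] = (train_f1, val_f1, train_f1 - val_f1)
    return results

gaps = score_gaps(models, X_train, y_train, X_val, y_val)
for name, (train_f1, val_f1, gap) in gaps.items():
    print(f"{name:14s} train={train_f1:.3f} val={val_f1:.3f} gap={gap:.3f}")
```

A large positive gap suggests overfitting, while a gap near zero suggests the model generalizes to the validation set.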
GBM and AdaBoost: These models show minimal differences between training and validation
performance, indicating they generalize well and are less likely to overfit.
Decision Tree: The model overfits heavily. Pruning or limiting the depth of the tree may help to
mitigate overfitting.
GBM and AdaBoost: These models show minimal or negative differences between training and
validation performance, indicating they generalize well and are less likely to overfit.
Bagging and Random Forest: Both exhibit signs of overfitting. Consider tuning hyperparameters such
as the number of trees or the maximum depth, or using regularization techniques, to improve
generalization.
Decision Tree: The model overfits heavily. Pruning or limiting the depth of the tree may help to
mitigate overfitting.
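The two remedies suggested for the decision tree, a depth limit and pruning, can be illustrated as below. This is a sketch on synthetic data; the values max_depth=3 and ccp_alpha=0.01 are arbitrary assumptions chosen for the example, not settings taken from the report.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data standing in for the visa dataset.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           flip_y=0.1, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# An unrestricted tree, a depth-limited tree, and a cost-complexity-pruned tree.
full_tree = DecisionTreeClassifier(random_state=1)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=1)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=1)

for name, tree in [("full", full_tree), ("max_depth=3", shallow_tree),
                   ("ccp_alpha=0.01", pruned_tree)]:
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    val_acc = tree.score(X_val, y_val)
    print(f"{name:15s} nodes={tree.tree_.node_count:4d} "
          f"train={train_acc:.3f} val={val_acc:.3f} gap={train_acc - val_acc:.3f}")
```

In practice the depth limit or pruning strength would be chosen by cross-validation rather than fixed by hand.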
• After building 15 models, it was observed that the GBM and AdaBoost models trained on the
oversampled and undersampled datasets performed strongly on both the training and
validation datasets.
• Models can sometimes overfit after undersampling or oversampling, so it is better to
tune the models to get a generalized performance.
• We will tune these 4 models using the same data (undersampled or oversampled) that they
were originally trained on.
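Tuning a model on resampled training data can be sketched as follows. The report does not say which resampling or search method was used, so this example substitutes simple random undersampling (the hypothetical helper random_undersample) and scikit-learn's RandomizedSearchCV; the parameter grid and the F1 scoring are likewise assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic imbalanced data standing in for the visa dataset.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           weights=[0.33, 0.67], random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

def random_undersample(X, y, seed=1):
    """Randomly drop rows until every class has as many rows as the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Resample the training split only; the validation split stays untouched.
X_un, y_un = random_undersample(X_train, y_train)

# Hypothetical search space for AdaBoost.
param_dist = {"n_estimators": [25, 50, 100],
              "learning_rate": [0.05, 0.1, 0.5, 1.0]}
search = RandomizedSearchCV(AdaBoostClassifier(random_state=1), param_dist,
                            n_iter=5, scoring="f1", cv=3, random_state=1)
search.fit(X_un, y_un)

print("best params:", search.best_params_)
print("validation accuracy:", search.best_estimator_.score(X_val, y_val))
```

Only the training split is resampled, so the validation score reflects the original class balance.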
On Validation set
The most important features utilized in identifying the target variable, i.e., case_status, are:
Recommendations and Insights
The profile of the applicants whose visa can be certified:
• Education level - has at least a Bachelor's degree; Master's and doctorate degrees are
preferred.
• Job experience - has job experience.
• Prevailing wage - has a high prevailing wage, most likely paid yearly. (The median prevailing
wage of the employees whose visa was certified is around 72k.)
• Continent - applicants from Europe, Africa, and Asia have been observed to have
higher chances of visa certification.
The profile of the applicants whose visa can be denied:
• Education level - has a high school degree.
• Job experience - doesn't have any job experience.
• Prevailing wage and unit of wage - applicants with an hourly unit of wage. (The median
prevailing wage of the employees whose visa was denied is around 65k.)
• Continent - applicants from South America, North America, and Oceania have been
observed to have higher chances of their visa applications being denied.
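Profile statistics of this kind, such as the median prevailing wage per case status, can be computed with a pandas groupby. The sketch below uses toy rows whose values are invented for illustration and are not the report's data; the column names prevailing_wage and education_of_employee are assumptions.

```python
import pandas as pd

# Toy rows for illustration only; these values are not taken from the report.
# "prevailing_wage" and "education_of_employee" are assumed column names.
df = pd.DataFrame({
    "case_status": ["Certified", "Certified", "Denied", "Denied", "Certified", "Denied"],
    "education_of_employee": ["Master's", "Bachelor's", "High School",
                              "High School", "Doctorate", "Bachelor's"],
    "prevailing_wage": [80000.0, 70000.0, 60000.0, 66000.0, 75000.0, 64000.0],
})

# Median prevailing wage for certified versus denied cases.
median_wage = df.groupby("case_status")["prevailing_wage"].median()

# Share of certified cases within each education level.
certified_rate = (df["case_status"].eq("Certified")
                  .groupby(df["education_of_employee"]).mean())

print(median_wage)
print(certified_rate)
```

The same grouping can be repeated for continent, job experience, or unit of wage to check each profile attribute.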
Additional information about the applicant, such as gender, marital status, specialization of the
degree, and number of years of experience, could be collected.
With respect to the employer, the salary slab offered according to experience and the sector in
which the employer operates would also be useful.