HR Data Analysis Using Python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#plot style
sns.set(style= 'whitegrid', palette= 'muted')
%matplotlib inline
In [3]: df.head()
Out[3]:
   Employee_Name       EmpID  Salary  PositionID  Position                  State  Zip   DOB         Sex  ...
0  Adinolfi, Wilson K  10026  62506   19          Production Technician I   MA     1960  1983-07-10  M    ...
2  Akinkuolie, Sarah   10196  64955   20          Production Technician II  MA     1810  1988-09-19  F    ...
3  Alagbe,Trina        10088  64991   19          Production Technician I   MA     1886  1988-09-27  F    ...
4  Anderson, Carol     10069  50825   19          Production Technician I   MA     2169  1989-09-08  F    ...
5 rows × 28 columns
file:///C:/Users/KRITIKA/OneDrive/HR_Analysis.html 1/20
6/16/25, 3:57 AM HR_Analysis
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 311 entries, 0 to 310
Data columns (total 28 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Employee_Name 311 non-null object
1 EmpID 311 non-null int64
2 Salary 311 non-null int64
3 PositionID 311 non-null int64
4 Position 311 non-null object
5 State 311 non-null object
6 Zip 311 non-null int64
7 DOB 311 non-null datetime64[ns]
8 Sex 311 non-null object
9 MaritalDesc 311 non-null object
10 CitizenDesc 311 non-null object
11 HispanicLatino 311 non-null object
12 RaceDesc 311 non-null object
13 DateofHire 311 non-null datetime64[ns]
14 DateofTermination 104 non-null datetime64[ns]
15 TermReason 311 non-null object
16 EmploymentStatus 311 non-null object
17 Department 311 non-null object
18 ManagerName 311 non-null object
19 ManagerID 303 non-null float64
20 RecruitmentSource 311 non-null object
21 PerformanceScore 311 non-null object
22 EngagementSurvey 311 non-null float64
23 EmpSatisfaction 311 non-null int64
24 SpecialProjectsCount 311 non-null int64
25 LastPerformanceReview_Date 311 non-null datetime64[ns]
26 DaysLateLast30 311 non-null int64
27 Absences 311 non-null int64
dtypes: datetime64[ns](4), float64(2), int64(8), object(14)
memory usage: 68.2+ KB
Columns such as DateofTermination and ManagerID have missing values.
In [5]: #statistics
df.describe()
      EmpID         Salary         PositionID  Zip           DOB                            ...
mean  10156.000000  69020.684887   16.845659   6555.482315   1979-02-06 09:48:02.315112544  ...
min   10001.000000  45046.000000   1.000000    1013.000000   1951-01-02 00:00:00            ...
25%   10078.500000  55501.500000   18.000000   1901.500000   1973-12-03 00:00:00            ...
50%   10156.000000  62810.000000   19.000000   2132.000000   1980-09-30 00:00:00            ...
75%   10233.500000  72036.000000   20.000000   2355.000000   1986-05-29 12:00:00            ...
max   10311.000000  250000.000000  30.000000   98052.000000  1992-08-17 00:00:00            ...
- Salary ranges from about 45K to 250K, with a median near 63K.
- Employees have up to 20 absences, and some have DaysLateLast30 = 0, indicating low tardiness.
- Engagement Survey scores range between 1 and 5.
Out[6]: Employee_Name 0
EmpID 0
Salary 0
PositionID 0
Position 0
State 0
Zip 0
DOB 0
Sex 0
MaritalDesc 0
CitizenDesc 0
HispanicLatino 0
RaceDesc 0
DateofHire 0
DateofTermination 207
TermReason 0
EmploymentStatus 0
Department 0
ManagerName 0
ManagerID 8
RecruitmentSource 0
PerformanceScore 0
EngagementSurvey 0
EmpSatisfaction 0
SpecialProjectsCount 0
LastPerformanceReview_Date 0
DaysLateLast30 0
Absences 0
dtype: int64
- DateofTermination: missing in 207 records → likely still employed
- ManagerID: missing in 8 records → may require exclusion
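The missing-value counts above can be reproduced with `isnull().sum()`. The sketch below runs on a tiny stand-in frame with made-up values, not the real HR file, and it assumes (as the note above does) that a missing termination date means the employee is still employed.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the HR data; the values are illustrative, not from the real file.
sample = pd.DataFrame({
    "EmpID": [10001, 10002, 10003, 10004],
    "DateofTermination": pd.to_datetime(["2016-06-16", None, None, "2018-09-24"]),
    "ManagerID": [22.0, np.nan, 4.0, 20.0],
})

# Count missing values per column and keep only the columns that have any.
missing = sample.isnull().sum()
missing = missing[missing > 0]
print(missing)

# Assumption: a missing termination date is read as "still employed".
still_employed = sample["DateofTermination"].isna()
print(int(still_employed.sum()))
```

On the full dataset the same two steps should give 207 for DateofTermination and 8 for ManagerID, matching Out[6].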
CLEANING THE DATASET
In [11]: df[['Age','Tenure','is_terminated']].head(10)
   Age  Tenure  is_terminated
0  42   13      0
1  50   10      1
2  37   13      1
3  37   17      0
4  36   13      1
5  48   13      0
6  46   10      0
7  42   11      0
8  55   15      0
9  37   10      0
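The cleaning cell that creates these three columns is not shown, so the following is only a sketch of one way they could be derived. The ages above are consistent with a plain year difference against 2025 (a 1983 birth year gives 42), so the sketch uses that; applying the same rule to Tenure is an assumption. The hire and termination dates are invented for illustration.

```python
import pandas as pd

# DOB values are taken from the df.head() preview; hire and termination dates are made up.
sample = pd.DataFrame({
    "DOB": pd.to_datetime(["1983-07-10", "1988-09-19", "1989-09-08"]),
    "DateofHire": pd.to_datetime(["2011-07-05", "2011-07-05", "2011-07-11"]),
    "DateofTermination": pd.to_datetime([None, "2012-09-24", "2016-09-06"]),
})

# Assumption: the analysis was run in 2025; fixing the year keeps the example reproducible.
ref_year = 2025

# Year-difference age and tenure (ignores month and day).
sample["Age"] = ref_year - sample["DOB"].dt.year
sample["Tenure"] = ref_year - sample["DateofHire"].dt.year

# 1 if a termination date is present, 0 otherwise.
sample["is_terminated"] = sample["DateofTermination"].notna().astype(int)

print(sample[["Age", "Tenure", "is_terminated"]])
```

Note that a tenure computed this way keeps growing for terminated employees; measuring up to the termination date instead may be more appropriate, depending on what the analysis intends.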
--TOTAL EMPLOYEES
Out[19]: np.int64(311)
--More than 200 employees are active.
--Roughly 75 to 90 employees were voluntarily terminated.
--Fewer than 25 were terminated for cause.
Out[12]: is_terminated
0 207
1 104
Name: count, dtype: int64
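The counts above can also be read as shares of the workforce. A minimal sketch, using only the two counts from Out[12] (the variable name `share` is introduced here for illustration):

```python
import pandas as pd

# Counts copied from the value_counts() output above (0 = active, 1 = terminated).
termination_count = pd.Series({0: 207, 1: 104}, name="count")

# Convert counts to percentages of all 311 employees.
share = (termination_count / termination_count.sum() * 100).round(1)
print(share)
```

This puts the terminated group at roughly one third of all employees.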
In [16]: #plotting
plt.figure(figsize=(7,5))
ax = sns.barplot(x = termination_count.index, y = termination_count.values, pale
plt.xticks([0,1],['Active','Terminated'])
plt.title('Employee Termination Status')
plt.xlabel('Status')
plt.ylabel('Number of Employees')
plt.show()
#plotting
plt.figure(figsize=(10,6))
ax = sns.barplot(x= dept_count.index, y= dept_count.values, palette = 'pastel')
# annotate each bar with its count
for bar in ax.patches:
    height = bar.get_height()
    plt.text(
        bar.get_x() + bar.get_width()/2,
        height + 1,
        f'{int(height)}',
        ha='center', va='bottom', fontsize=11, fontweight='bold'
    )
plt.title('Termination By Department')
plt.xlabel('Departments')
plt.ylabel('Number of Terminations')
plt.xticks(rotation=30)
plt.tight_layout()
plt.show()
#plotting
plt.figure(figsize=(3,3))
plt.pie(
gender_count,
labels=labels,
autopct= '%1.1f%%',
startangle = 90,
colors = colors,
textprops={'fontsize' : 12})
plt.show()
#plot
plt.figure(figsize=(6,4))
sns.boxplot(y= terminated['Age'], color='#8da0cb')
-Median age of terminated employees is around **45-47 years**
-Most terminations occur between **45 and 55 years**
-A few older employees were also terminated (visible as outliers)
--TENURE DISTRIBUTION OF TERMINATED EMPLOYEES
#plot
plt.figure(figsize=(6,4))
sns.boxplot(y= terminated['Tenure'], color='#fc8d62')
-Median tenure of terminated employees is approximately **13 - 13.5 years**
-Most terminations occur between **12 and 15 years** of service
-Some employees were terminated after even **16+ years**, indicating long-term
-Active employees show a higher median engagement score and a wider range of scores than terminated employees.
-CORRELATION HEATMAP FOR KEY NUMERIC VARIABLES
plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, fmt= '.2f', cmap='YlGnBu', linewidths=0.5)
plt.tight_layout()
plt.show()
--There is a weak positive correlation of 0.19 between employee engagement and employee satisfaction.
--There is a very weak positive correlation of 0.09 between age and tenure.
--Age and tenure both show minimal association with employee satisfaction.
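The cell that builds `corr_matrix` is not shown. A minimal sketch of how such a matrix could be computed with `DataFrame.corr()` follows, run on random stand-in data so the numbers will not match the heatmap. The column selection is an assumption, and the `strength` helper with its 0.3 and 0.7 cut-offs is a common rule of thumb introduced here for illustration, not something taken from the notebook.

```python
import numpy as np
import pandas as pd

# Random stand-in data; the real notebook would use the HR DataFrame instead.
rng = np.random.default_rng(0)
n = 200
sample = pd.DataFrame({
    "EngagementSurvey": rng.uniform(1, 5, n),
    "EmpSatisfaction": rng.integers(1, 6, n),
    "Age": rng.integers(30, 75, n),
    "Tenure": rng.integers(7, 20, n),
})

# Pearson correlation between every pair of numeric columns.
corr_matrix = sample.corr()
print(corr_matrix.round(2))

def strength(r):
    """Label a correlation coefficient using rule-of-thumb cut-offs (an assumption)."""
    a = abs(r)
    if a < 0.3:
        return "weak"
    if a < 0.7:
        return "moderate"
    return "strong"

# The two coefficients quoted in the notes above.
print(strength(0.19), strength(0.09))
```

Under these cut-offs both 0.19 and 0.09 fall in the weak band.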
-EMPLOYEE SATISFACTION BY DEPARTMENT
plt.figure(figsize=(8,6))
sns.barplot(y= dept_satisfaction.index, x= dept_satisfaction.values, palette='Bl
--The Executive Office department has the lowest average satisfaction score, while the other departments have higher scores ranging from 3.5 to more than 4.0.
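The definition of `dept_satisfaction` is not shown in the export. One plausible way to build it is a group-by mean, sketched below on made-up rows. Column names follow the `df.info()` listing and may differ if they were renamed during cleaning.

```python
import pandas as pd

# Made-up rows for illustration; not the real HR records.
sample = pd.DataFrame({
    "Department": ["Sales", "Sales", "IT/IS", "IT/IS", "Executive Office"],
    "EmpSatisfaction": [4, 5, 4, 3, 3],
})

# Average satisfaction per department, lowest first.
dept_satisfaction = (
    sample.groupby("Department")["EmpSatisfaction"]
    .mean()
    .sort_values()
)
print(dept_satisfaction)
```

The same pattern with `EngagementSurvey` in place of `EmpSatisfaction` would give a per-department engagement average.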
-ENGAGEMENT SCORE BY DEPARTMENT
plt.figure(figsize=(8,6))
sns.barplot(y= dept_engagement.index, x= dept_engagement.values, palette='Reds_d
- Departments like **Executive and Admin offices** have the **highest engagement**
- Departments such as **Sales** show lower engagement scores
-TERMINATION BY PERFORMANCE SCORE
perf_termination = terminated_df['performancescore'].value_counts().reindex(df['
plt.figure(figsize=(8,6))
sns.barplot(x=perf_termination.index , y= perf_termination.values, palette= 'coo
plt.title('Termination By Performance Score')
plt.xlabel('Performance Score')
plt.ylabel('Number of Terminated Employees')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
- Most terminations occurred among employees rated “Fully Meets”, followed by “Needs Improvement”
- Very few terminations came from the “Exceeds Expectations” or “PIP” categories
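Raw counts can partly reflect how many employees sit in each rating category, so a per-category termination rate may be a useful companion to the chart. The sketch below uses made-up rows and assumes the `PerformanceScore` and `is_terminated` column names; the `summary` frame is introduced here for illustration.

```python
import pandas as pd

# Made-up rows for illustration; not the real HR records.
sample = pd.DataFrame({
    "PerformanceScore": ["Fully Meets"] * 6 + ["Needs Improvement"] * 2 + ["Exceeds"] * 2,
    "is_terminated": [1, 1, 0, 0, 0, 0, 1, 0, 0, 1],
})

# Terminated count and group size per rating, then the share terminated.
summary = sample.groupby("PerformanceScore")["is_terminated"].agg(
    terminated="sum", total="count"
)
summary["rate"] = summary["terminated"] / summary["total"]
print(summary)
```

In this toy data "Fully Meets" has the most terminations but the lowest rate, which is the kind of difference the rate column is meant to surface.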
-ABSENCES BY TERMINATION STATUS
plt.figure(figsize=(8,6))
sns.boxplot(x= 'is_terminated' , y='absences' ,data=df, palette =['#8da0cb', '#f
plt.xticks([0,1],['Active','Terminated'])
plt.title('Absences By Employment Status')
plt.xlabel('Employment Status')
plt.ylabel('Number Of Absences')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
--There is only a slight difference in the median number of absences between active and terminated employees.
-TENURE VS TERMINATION STATUS
In [57]: plt.figure(figsize=(8,6))
sns.violinplot(x= 'is_terminated', y= 'Tenure', data=df, palette=['#8da0cb', '#f
plt.xticks([0,1],['Active', 'Terminated'])
plt.title('Tenure Distribution By Employment Status')
plt.xlabel('Employment Status')
plt.ylabel('Tenure in Years')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
--A large number of terminations occur after **12-14 years** of tenure.
--Active employees have a slightly wider distribution, extending up to 20 years.
This project focuses on analyzing employee data to uncover insights related to:
Termination trends
Employee engagement and satisfaction
Absenteeism patterns
Tenure and performance correlations
Key Insights
1. Total Employees
-Total female = 176
-Total male = 135
-Most of them are Single
-A large number of employees are US citizens.
3. Termination Trends
Departments like Executive and Admin offices have the highest engagement
Departments such as Sales show lower engagement scores
7. Absenteeism
Tools Used
Python Libraries: Pandas, Matplotlib, Seaborn
Jupyter Notebook: For data exploration and visualization
Next Step: This analysis will be replicated in Power BI for interactive dashboarding.