Resume Screening for Recruitment
Table of Contents
Executive Summary
Project Objective
Scope
Methodology
Artifacts Used
Technical Coverage
Results
Challenges and Resolutions
Conclusion
References
Project Coding
Executive Summary
This project focuses on automating the resume screening process using a Python-based
approach. With the increase in job applications, recruiters face a significant burden in
shortlisting qualified candidates. The proposed system extracts and analyzes candidate
qualifications using Natural Language Processing (NLP), enabling HR teams to efficiently
identify resumes that match predefined job requirements and streamline the recruitment
workflow.
Project Objective
The primary objective is to create an automated resume screening tool capable of
identifying relevant skills and educational qualifications from resumes. This system aims to
reduce manual intervention and accelerate the initial screening process in hiring workflows
by generating a relevance score for each candidate.
Scope
This project supports parsing and analyzing resumes in both PDF and DOCX formats. It
defines a set of required skills and educational levels, matches those against the resume
contents, and returns a structured report. This tool is especially useful for pre-screening in
large recruitment drives and can be extended to include experience levels and certifications.
Methodology
The methodology includes the following steps:
1. Text Extraction: Using PyPDF2 for PDF and docx2txt for DOCX.
2. Preprocessing: Converting text to lowercase and tokenizing using NLTK.
3. NLP Parsing: spaCy's English language model (en_core_web_sm) is loaded to support linguistic analysis of the extracted text.
4. Keyword Matching: Identifying matches against a predefined set of skills and education.
5. Scoring: A score is computed based on the number of matches.
This process ensures that only the most relevant resumes move forward in the recruitment
pipeline; the matching and scoring steps are sketched below.
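The following minimal sketch covers steps 2, 4, and 5, assuming the NLTK punkt tokenizer data has been downloaded; the sample sentence and the reduced skill set are illustrative only:
from nltk.tokenize import word_tokenize

# Illustrative subset of the full keyword set defined in the Project Coding section.
REQUIRED_SKILLS = {"python", "sql", "nlp"}

resume_text = "Experienced analyst skilled in Python, SQL and NLP."
tokens = set(word_tokenize(resume_text.lower()))   # step 2: normalize case and tokenize

matched = REQUIRED_SKILLS & tokens                 # step 4: keyword matching via set intersection
score = len(matched)                               # step 5: one point per matched keyword
print(matched, score)                              # e.g. {'python', 'sql', 'nlp'} 3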
Artifacts Used
- Python 3.10+
- PyPDF2: Library for PDF parsing
- docx2txt: Module for reading DOCX files
- spaCy: NLP processing
- NLTK: Tokenization
- Sample resume files
- Jupyter Notebook or Python CLI for testing (see the environment check below)
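The libraries above also need one-time data and model downloads. This check is a minimal sketch, assuming the packages from the list are already installed:
import nltk
import spacy

# Fetch the tokenizer data used by nltk.word_tokenize (no-op if already present).
nltk.download("punkt", quiet=True)
# Raises OSError if the en_core_web_sm model has not been downloaded.
nlp = spacy.load("en_core_web_sm")
print("Environment ready:", nlp.meta["name"])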
Technical Coverage
This solution leverages core Natural Language Processing techniques, including
tokenization and named entity recognition. Text extraction uses robust Python libraries
designed for various file types. Skill and education matching is efficiently implemented
using set operations. The result is a lightweight, extensible, and scalable resume filtering
engine that can be integrated into larger ATS systems.
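As an illustration, the loaded spaCy model exposes named entities that could supplement keyword matching; the sample sentence below is illustrative, not taken from an actual resume:
import spacy

# Sketch of spaCy's named entity output on a resume-style sentence.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Jane Doe earned a Master of Science at Example University in 2019.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. PERSON, ORG, DATE entities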
Results
The system was tested on resumes in both supported formats and consistently identified
matching skills and education qualifications. Each processed resume returned:
- List of matched skills
- Matched education qualifications
- Total score
The tool reduced the manual screening effort and produced consistent outputs across both
document formats; an illustrative output appears below.
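For illustration only, a result of the kind described above has the following shape (the values are hypothetical, not taken from a test run):
# Hypothetical output; the keys mirror the dictionary returned by screen_resume().
{
    "matched_skills": ["python", "sql"],
    "matched_education": ["bachelor"],
    "score": 3
}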
Challenges and Resolutions
Challenges:
- Inconsistent resume formats caused difficulties in text extraction.
- Some PDFs lacked clear structure for parsing.
- Tokenization mismatches due to case sensitivity.
Resolutions:
- Normalized all text inputs to lowercase (see the sketch after this list).
- Used reliable libraries like PyPDF2 and docx2txt.
- Defined keyword sets carefully to match common resume terms.
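The sketch below illustrates that normalization step; the normalize_text helper and the phrase check are assumptions added for illustration, not part of the original code:
import re

def normalize_text(raw_text):
    # Assumed helper: lowercase and collapse irregular whitespace left over from messy extractions.
    return re.sub(r"\s+", " ", raw_text.lower()).strip()

# Multi-word keywords such as "machine learning" are matched as phrases in the
# normalized text, which avoids case and tokenization mismatches.
text = normalize_text("Machine\nLearning   engineer with SQL experience")
print("machine learning" in text)   # True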
Conclusion
This project demonstrates a practical implementation of automated resume screening. It
simplifies the initial recruitment phase by identifying qualified candidates based on
required skill sets and education. With further enhancements such as experience detection
and machine learning models, the system can evolve into a comprehensive ATS component.
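One possible direction for the machine learning enhancement mentioned above is ranking resumes by TF-IDF similarity to a job description; the scikit-learn dependency, the sample texts, and the ranking logic are assumptions for this sketch, not part of the current system:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sketch of a possible future enhancement: rank resumes by TF-IDF cosine
# similarity to a job description.
job_description = "Seeking a data analyst with Python, SQL and NLP experience."
resumes = [
    "Python developer with SQL and machine learning background.",
    "Graphic designer experienced in branding and illustration.",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform([job_description] + resumes)
similarities = cosine_similarity(vectors[0:1], vectors[1:]).flatten()
for text, similarity in zip(resumes, similarities):
    print(f"{similarity:.2f}  {text}")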
References
- https://spacy.io
- https://www.nltk.org
- https://pypi.org/project/docx2txt/
- https://pythonhosted.org/PyPDF2/
Project Coding
The following code represents the core logic of the resume screening system:
import docx2txt
import PyPDF2
import nltk
import spacy
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)   # tokenizer data required by word_tokenize

# spaCy's English model is loaded for the NLP parsing step; the keyword
# matching below relies on NLTK tokens.
nlp = spacy.load("en_core_web_sm")

REQUIRED_SKILLS = {"python", "machine learning", "data analysis", "nlp", "sql"}
REQUIRED_EDUCATION = {"bachelor", "master", "phd"}

def extract_text_from_pdf(pdf_path):
    # Concatenate the text of every page; extract_text() may return None for image-only pages.
    text = ""
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += (page.extract_text() or "") + " "
    return text

def extract_text_from_docx(docx_path):
    # Return the plain text content of a DOCX file.
    return docx2txt.process(docx_path)

def screen_resume(resume_text):
    # Normalize case, tokenize, and match the required keyword sets.
    resume_text = resume_text.lower()
    tokens = set(word_tokenize(resume_text))
    matched_skills = set()
    for skill in REQUIRED_SKILLS:
        if " " in skill:
            # Multi-word skills such as "machine learning" never appear as single
            # tokens, so they are matched as phrases in the full text.
            if skill in resume_text:
                matched_skills.add(skill)
        elif skill in tokens:
            matched_skills.add(skill)
    matched_education = REQUIRED_EDUCATION.intersection(tokens)
    score = len(matched_skills) + len(matched_education)
    return {
        "matched_skills": sorted(matched_skills),
        "matched_education": sorted(matched_education),
        "score": score
    }

def process_resume(file_path):
    # Dispatch on file extension and screen the extracted text.
    if file_path.endswith(".pdf"):
        text = extract_text_from_pdf(file_path)
    elif file_path.endswith(".docx"):
        text = extract_text_from_docx(file_path)
    else:
        return "Unsupported file format"
    return screen_resume(text)

if __name__ == "__main__":
    resume_path = "ATS classic HR resume.docx"
    result = process_resume(resume_path)
    print(result)
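As a usage illustration, process_resume() could be applied to a folder of resumes and the candidates ranked by score; the folder name and the ranking logic below are assumptions for this sketch rather than part of the original code:
import os

# Hypothetical batch run: screen every supported file in a "resumes" folder
# and rank candidates by descending relevance score.
def rank_resumes(folder="resumes"):
    ranked = []
    for name in os.listdir(folder):
        if name.endswith((".pdf", ".docx")):
            report = process_resume(os.path.join(folder, name))
            ranked.append((report["score"], name, report))
    return sorted(ranked, key=lambda item: item[0], reverse=True)

for score, name, report in rank_resumes():
    print(f"{name}: score {score}, skills {report['matched_skills']}")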