This project demonstrates how to build a resume matching system to compare job descriptions with candidate resumes. We utilize Textract for extracting text from PDF resumes and the DistilBERT model to calculate similarity scores. The goal is to identify the most relevant resumes for a given job description.
For this task, I used the first 15 job descriptions from the HuggingFace dataset. (Link to dataset: https://huggingface.co/datasets/jacob-hugging-face/job-descriptions/viewer/default/train?row=0)
DistilBERT is a transformers model, smaller and faster than BERT, which was pretrained on the same corpus in a self-supervised fashion, using the BERT base model as a teacher. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts using the BERT base model. More precisely, it was pretrained with three objectives:
- Distillation loss: the model was trained to return the same probabilities as the BERT base model.
- Masked language modeling (MLM): this is part of the original training loss of the BERT base model. When taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
- Cosine embedding loss: the model was also trained to generate hidden states as close as possible to those of the BERT base model. This way, the model learns the same inner representation of the English language as its teacher model, while being faster for inference and downstream tasks.
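To make the "similarity scores" step concrete: one common way to turn DistilBERT's token-level hidden states into a single document vector is attention-mask-aware mean pooling. The sketch below shows that pooling step on dummy tensors; the model name and the pooling choice are illustrative assumptions, not necessarily what `main.py` does.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).float()          # (batch, seq, 1)
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

# Tiny demonstration with dummy hidden states: 4 tokens, last two padded.
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [9.0, 9.0], [9.0, 9.0]]])
mask = torch.tensor([[1, 1, 0, 0]])
print(mean_pool(hidden, mask))  # tensor([[2., 2.]])

# With real inputs (requires downloading the pretrained weights):
#   from transformers import DistilBertTokenizer, DistilBertModel
#   tok = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
#   model = DistilBertModel.from_pretrained("distilbert-base-uncased")
#   enc = tok("Accountant with 5 years of experience", return_tensors="pt")
#   with torch.no_grad():
#       vec = mean_pool(model(**enc).last_hidden_state, enc["attention_mask"])
```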
Textract is considered a useful tool for text extraction, particularly from PDFs and other document formats, due to several advantages:
- Support for Multiple Formats: Textract supports a wide range of document formats, including PDF, DOC, DOCX, XLSX, PPTX, and more. This versatility makes it a valuable choice for extracting text from various types of documents.
- Simplicity: Textract provides a simple, straightforward interface: you can typically extract text from a document with just a few lines of code, making it accessible to both beginners and experienced developers.
- Accuracy: Textract is designed to extract text accurately, preserving the formatting and structure of the original document. It can handle complex documents with tables, images, and multiple fonts.
- Platform Independence: Textract packages exist for several ecosystems, including Python and Node.js; this project uses the Python package.
- Open Source: Textract is open source and freely available, so you can use it without additional costs. Its open-source nature also encourages community contributions and improvements.
- Customization: Textract lets you customize the extraction process to some extent; you can specify options for handling document-specific features or configuring extraction behavior.
- Cross-Platform Compatibility: Textract runs on various operating systems, including Windows, macOS, and Linux, ensuring compatibility with different development environments.
- Integration with Other Libraries: Textract integrates easily with other libraries and tools commonly used in data processing and analysis pipelines, such as natural language processing (NLP) libraries or machine learning frameworks.
- Scalability: Textract can process a large number of documents efficiently, making it suitable for tasks involving large document collections or document management systems.
- Community and Support: Thanks to its popularity and open-source nature, Textract benefits from an active community of users and contributors; documentation, tutorials, and community support are available to assist with your projects.
Before you get started, ensure you have the following dependencies installed:
- Python (3.9)
- PyTorch (for the DistilBERT model)
- Transformers library (for DistilBERT)
- Textract (for PDF text extraction)
- scikit-learn (for cosine similarity)
```bash
pip install torch transformers textract numpy scikit-learn
```
Alternatively, install them all at once via `pip install -r Requirements.txt`.
- Clone the Repository
```bash
git clone https://github.com/vmukund36/Resume_matching.git
```
- Dataset Preparation
Download the dataset using the link: https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset. Change the dataset/folder path in the code as per your local path.
- Running the script
```bash
python3 main.py
```
This script will process the job descriptions, extract text from resumes, calculate similarity scores, and generate a list of top matching resumes for each job description.
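The ranking step described above can be sketched as follows: given one job-description vector and a matrix of resume vectors, score each resume by cosine similarity and sort. The variable names and toy vectors are illustrative, not the repo's actual code.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

job_vec = np.array([[0.2, 0.9, 0.1]])       # (1, dim) job-description embedding
resume_vecs = np.array([[0.1, 0.8, 0.2],    # (n_resumes, dim) resume embeddings
                        [0.9, 0.1, 0.0],
                        [0.3, 0.9, 0.1]])

scores = cosine_similarity(job_vec, resume_vecs)[0]   # one score per resume
top = np.argsort(scores)[::-1][:2]                    # indices of top-2 matches
print(top, scores[top])
```

In the real pipeline the top-k indices would map back to resume filenames; `main.py` uses the top 5 rather than the top 2 shown here.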
- Review Results
The results are displayed in the terminal, showing the top 5 matching resumes for each specified job description along with their similarity scores. In the Python notebook `Capital_placement_assignment.ipynb`, I chose 'Accountant' to test the CV matching; it produced good results and displayed the top 5 relevant resumes.