GROUP 4. This repository contains the implementation of a Transformer-based model for abstractive text summarization and a rule-based approach for extractive text summarization.

MohanKrishnaGR/Infosys_Springboard_Text-Summarization

Text Summarization

A project by AI/ML Interns (Group 4) @ Infosys Springboard, Summer 2024.

Mentor

Mr. Narendra Kumar

Problem Statement

  • Developing an automated text summarization system that can accurately and efficiently condense large bodies of text into concise summaries is essential for enhancing business operations.
  • This project aims to deploy NLP techniques to create a robust text summarization tool capable of handling various types of documents across different domains.
  • The system should deliver high-quality summaries that retain the core information and contextual meaning of the original text.

Project Statement

  • Text Summarization focuses on converting large bodies of text into a few sentences summing up the gist of the larger text.
  • There is a wide variety of applications for text summarization including News Summary, Customer Reviews, Research Papers, etc.
  • This project aims to understand the importance of text summarization and apply different techniques to fulfill the purpose.

Approach to Solution

  • Figure: Intended Plan

Background Research

  • Literature Review

Solution

  • Selected Deep Learning Architecture

Workflow

  • Workflow for Abstractive Text Summarizer:

  • Workflow for Extractive Text Summarizer:

Data Collection

  • Data preprocessing implemented in src/data_preprocessing.
  • Data collection from different sources:
    • CNN/DailyMail: news
    • BillSum: legal
    • arXiv: scientific
    • DialogSum: conversations
  • Data integration yields a robust, multi-domain dataset: news articles, legal documents (acts and judgements), scientific papers, and conversations.
  • Validated the data through data statistics and exploratory data analysis (EDA), using frequency plots for every data source.
  • Data cleansing optimized for NLP tasks: removal of null records, lowercasing, punctuation removal, stop-word removal, and lemmatization.
  • Data splitting using scikit-learn into training, testing, and validation sets, saved in CSV format.
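The cleaning and splitting steps above can be sketched as follows. This is an illustrative version, not the repository's actual src/data_preprocessing code: the column names, stop-word list, and regex-based cleaning are assumptions, and lemmatization (e.g. with NLTK's WordNet lemmatizer) is omitted for brevity.

```python
# Sketch of the cleaning + train/test split described above.
# Column names and the toy stop-word list are illustrative assumptions.
import re
import pandas as pd
from sklearn.model_selection import train_test_split

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}  # toy subset

def clean(text: str) -> str:
    text = text.lower()                            # lowercasing
    text = re.sub(r"[^\w\s]", " ", text)           # punctuation removal
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # stop words
    return " ".join(tokens)

df = pd.DataFrame({
    "text": ["The cat sat on the mat!", "Judges issued a ruling today."],
    "summary": ["Cat on mat.", "Ruling issued."],
}).dropna()                                        # remove null records
df["text"] = df["text"].map(clean)

# Split and save in CSV format, as described above.
train_df, test_df = train_test_split(df, test_size=0.5, random_state=42)
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
```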

Abstractive Text Summarization

Model Training & Evaluation

  • Training:
    • Selected a transformer architecture for abstractive summarization: fine-tuning a pre-trained model.
    • Chose Facebook's BART-large model for its performance metrics and efficient use of trainable parameters.
      • 406,291,456 trainable parameters.

  • Methods:
    • Native PyTorch Implementation
    • Trainer API Implementation

Method 1 - Native PyTorch

  • Trained the model using a manual training loop and evaluation loop in PyTorch. Implemented in: src/model.ipynb
  • Model evaluation: Source code: src/evaluation.ipynb
    • Obtained inconsistent results during inference.
    • ROUGE-1 (F-measure) = 0.018
    • A suspected tensor error during training with Method 1 may account for the inconsistency of the model's output.
    • Rejected for further deployment; an alternative approach was needed.
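The shape of such a manual PyTorch training/evaluation loop is sketched below. To keep it self-contained and runnable, a toy linear-regression model stands in for BART; the model, data, and hyperparameters are placeholders, not the notebook's actual values.

```python
# Sketch of a manual train/eval loop in the spirit of src/model.ipynb,
# using a toy model instead of BART so it runs anywhere.
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(64, 3)
y = X @ torch.tensor([[1.0], [-2.0], [0.5]]) + 0.1 * torch.randn(64, 1)

model = nn.Linear(3, 1)                       # placeholder for the summarizer
optimizer = torch.optim.AdamW(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

for epoch in range(100):                      # manual training loop
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

model.eval()                                  # manual evaluation loop
with torch.no_grad():
    final_loss = loss_fn(model(X), y).item()
print(final_loss)
```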

Method 2 – Trainer Class Implementation

  • Utilized Trainer API from Hugging Face for optimized transformer model training. Implemented in: src/bart.ipynb

    • The model was trained on the whole dataset for 10 epochs (125,420 steps), taking 26:24:22 (HH:MM:SS).
  • Evaluation: Performance metrics using ROUGE scores. Source code: src/rouge.ipynb

    • Method 2's results outperformed those of Method 1.
    • ROUGE-1 (F-measure) = 61.32 -> benchmark grade.
      • Significantly higher than typical scores reported for state-of-the-art models on common datasets.
    • For comparison, GPT-4's reported ROUGE-1 (F-measure) for text summarization is 63.22.
    • Selected for further deployment.
  • Comparative analysis showed significant improvement in performance after fine-tuning. Source code: src/compare.ipynb
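For reference, the metric behind the scores quoted above can be computed as follows. This is a minimal re-implementation of ROUGE-1 F-measure from its standard definition, purely for illustration; the notebooks presumably use a library implementation such as `rouge_score`.

```python
# Minimal ROUGE-1 F-measure: clipped unigram overlap between a
# reference summary and a candidate summary.
from collections import Counter

def rouge1_f(reference: str, candidate: str) -> float:
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())      # clipped unigram matches
    if not overlap:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f("the cat sat on the mat", "the cat on the mat"), 3))  # -> 0.909
```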


Extractive Text Summarization

  • Rather than a computationally intensive deep-learning model, a rule-based approach provides an efficient solution. We combined the matrix obtained from TF-IDF with KMeans clustering.
  • This can be viewed as topic modeling extended to multiple lower-level groups embedded in a single document; it operates at the individual-document and cluster level.
  • The sentence closest to each centroid (by Euclidean distance) is selected as the representative sentence for that cluster.
  • Implementation: Preprocess text, extract features using TF-IDF, and summarize by selecting representative sentences.
    • Source code for implementation & evaluation: src/Extractive_Summarization.ipynb
    • ROUGE-1 (F-measure) = 24.71
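The pipeline above can be sketched in a few lines of scikit-learn. This is a simplified illustration of the TF-IDF + KMeans idea, not the notebook's code: the naive sentence splitting on periods and the fixed number of clusters are assumptions.

```python
# Sketch of TF-IDF + KMeans extractive summarization: vectorize sentences,
# cluster them, then keep the sentence nearest each centroid (Euclidean).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances_argmin_min

def extractive_summary(text: str, n_clusters: int = 2) -> str:
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    tfidf = TfidfVectorizer().fit_transform(sentences)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(tfidf)
    # Index of the sentence closest to each cluster centroid.
    closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, tfidf)
    return ". ".join(sentences[i] for i in sorted(set(closest))) + "."

doc = ("Transformers dominate abstractive summarization. "
       "BART is a strong pretrained baseline. "
       "Extractive methods instead select source sentences. "
       "TF-IDF with KMeans picks one sentence per cluster.")
print(extractive_summary(doc))
```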

Testing

  • Implemented a text summarization application with a web-based interface using the Gradio library, to test the model's inference.
  • Source Code: src/interface.ipynb

Deployment


Application

  • File structure: summarizer/

API Endpoints

  • Developed using FastAPI framework for handling URLs, files, and direct text input.
    • Source Code: summarizer/app.py
  • Endpoints:
    • Root Endpoint
    • Summarize URL
    • Summarize File
    • Summarize Text

Extractor Modules

  • Extract text from various sources (URLs, PDF, DOCX) using BeautifulSoup and fitz.
  • Source Code: summarizer/extractors.py
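The HTML side of such an extractor might look like the sketch below. This is an assumed, simplified version of summarizer/extractors.py: PDF/DOCX extraction (via fitz/PyMuPDF) is omitted, and only local HTML parsing with BeautifulSoup is shown.

```python
# Illustrative HTML text extraction with BeautifulSoup: strip scripts and
# styles, then collapse whitespace in the remaining visible text.
from bs4 import BeautifulSoup

def extract_text_from_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()                        # drop non-content markup
    return " ".join(soup.get_text(separator=" ").split())

html = "<html><body><h1>Title</h1><p>Body text.</p><script>x()</script></body></html>"
print(extract_text_from_html(html))  # -> Title Body text.
```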

Extractive Summary Script

  • Implemented the extractive summarizer module, following the same approach as: src/Extractive_Summarization.ipynb
  • Source Code: summarizer/extractive_summary.py

User Interface

  • Developed a user-friendly interface using HTML, CSS, and JavaScript.
  • Source Code: summarizer/templates/index.html

Containerization

  • Developed a Dockerfile to build a Docker image for the FastAPI application.
  • Source Code: summarizer/Dockerfile
  • Image: Docker Image
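A Dockerfile along these lines would serve a FastAPI app; this is a hypothetical sketch (the base image, requirements file, and uvicorn entrypoint are assumptions), not the contents of summarizer/Dockerfile.

```dockerfile
# Hypothetical sketch; see summarizer/Dockerfile for the real build.
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```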

CI/CD Pipeline






End Note

Thank you for your interest in our project! We welcome any feedback. Feel free to reach out to us.
