LLM Fine-Tuning Lab for SynthoraAI

License: MIT Python 3.9+ PyTorch Transformers CI/CD Kubernetes Production Ready

A production-ready, enterprise-grade LLM fine-tuning laboratory designed to support the SynthoraAI AI-Gov-Content-Curator project. This lab provides comprehensive tools, scripts, and pipelines for fine-tuning, deploying, and monitoring large language models at scale.

⭐ Production-Ready Features

  • ✅ Comprehensive Testing - Unit, integration, and E2E tests with 70%+ coverage
  • ✅ Advanced Monitoring - Prometheus metrics, Grafana dashboards, and structured logging
  • ✅ Security First - Vulnerability scanning, secret management, and secure API authentication
  • ✅ CI/CD Pipeline - Automated testing, building, and deployment with GitHub Actions
  • ✅ Kubernetes Ready - Production deployment configs with auto-scaling and health checks
  • ✅ MLflow Integration - Experiment tracking and model registry
  • ✅ Performance Optimized - Model quantization, caching, and batch processing
  • ✅ Disaster Recovery - Automated backups and recovery procedures
  • ✅ A/B Testing - Built-in framework for model comparison and gradual rollouts
  • ✅ Distributed Training - Multi-GPU and multi-node support with DeepSpeed

🎯 Purpose

This repository houses a production-grade fine-tuning infrastructure to:

  • Fine-tune LLMs for article summarization optimized for government content
  • Train models for content classification, sentiment analysis, and bias detection
  • Evaluate performance on government-specific datasets with comprehensive metrics
  • Deploy models to production with Kubernetes, monitoring, and auto-scaling
  • Track experiments with MLflow and version control
  • Monitor performance with Prometheus, Grafana, and custom metrics
  • Ensure security with authentication, rate limiting, and vulnerability scanning
  • Optimize performance with quantization, caching, and distributed training
  • Provide documentation and production deployment guides

πŸ—οΈ Production Architecture

LLM-Finetuning-Lab/
├── src/
│   ├── api/              # Production FastAPI server
│   ├── core/             # Core utilities (config, error handling)
│   ├── training/         # Training scripts and distributed training
│   ├── evaluation/       # Evaluation and benchmarking
│   ├── data/             # Data processing and validation
│   ├── models/           # Model architectures and wrappers
│   ├── optimization/     # Model optimization (quantization, ONNX)
│   ├── mlops/            # MLflow, backup, A/B testing
│   └── utils/            # Monitoring, logging, utilities
├── k8s/                  # Kubernetes manifests
│   ├── deployment.yaml   # Application deployment
│   └── monitoring-stack.yaml  # Prometheus & Grafana
├── .github/
│   └── workflows/        # CI/CD pipelines
│       ├── ci.yml        # Main CI/CD pipeline
│       └── security.yml  # Security scanning
├── configs/              # Training configurations
├── tests/                # Comprehensive test suite
├── docs/                 # Production documentation
│   └── PRODUCTION_DEPLOYMENT.md
└── requirements.txt      # Production dependencies

🚀 Quick Start

Prerequisites

  • Python 3.9+
  • CUDA 11.8+ (for GPU training)
  • 16GB+ RAM (32GB+ recommended)
  • 50GB+ disk space

Installation

# Clone the repository
git clone https://github.com/SynthoraAI-AI-News-Content-Curator/LLM-Finetuning-Lab.git
cd LLM-Finetuning-Lab

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install in development mode
pip install -e .

Basic Usage

from src.training import FineTuner
from src.data import DataLoader

# Load your dataset
dataset = DataLoader.from_json("datasets/processed/gov_articles.json")

# Initialize fine-tuner
tuner = FineTuner(
    model_name="google/flan-t5-base",
    task="summarization",
    config="configs/summarization.yaml"
)

# Train the model
tuner.train(dataset, epochs=3)

# Evaluate
results = tuner.evaluate()
print(f"ROUGE-L: {results['rouge_l']}")

# Save the model
tuner.save("checkpoints/flan-t5-gov-summarizer")

📊 Supported Tasks

1. Article Summarization

  • Fine-tune models for concise, accurate article summaries
  • Optimized for government and news content
  • Supports extractive and abstractive approaches

2. Content Classification

  • Categorize articles into 15+ topic categories
  • Multi-label classification support
  • Hierarchical topic modeling

3. Sentiment Analysis

  • Analyze article tone and objectivity
  • Detect urgency and controversy levels
  • Political bias detection

4. Bias Detection

  • Identify potential bias in article content
  • Analyze writing style and word choice
  • Generate bias reports

5. Q&A Generation

  • Train models for article-based question answering
  • RAG (Retrieval-Augmented Generation) optimization
  • Context-aware response generation
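Decoding differs between these tasks: summarization produces text, while multi-label classification must turn per-label scores into a set of topics. A minimal sketch of the latter, using pure Python with hypothetical label names and a standard sigmoid-threshold decision rule (not the lab's actual decoding code):

```python
import math

# Hypothetical topic labels; the real label set has 15+ categories.
LABELS = ["politics", "economy", "health", "defense"]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def decode_multilabel(logits, threshold=0.5):
    """Return every topic whose sigmoid score clears the threshold.

    Multi-label: an article may receive zero, one, or several topics."""
    return [label for label, z in zip(LABELS, logits)
            if sigmoid(z) >= threshold]

# High logits on "politics" and "defense" select both topics at once.
print(decode_multilabel([2.1, -0.3, -1.5, 1.0]))
```

The threshold of 0.5 is a common default; tuning it per label on validation data usually improves F1 for imbalanced categories.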

🔧 Configuration

All training configurations are defined in YAML files under configs/:

# configs/summarization.yaml
model:
  name: "google/flan-t5-base"
  max_length: 512

training:
  batch_size: 8
  learning_rate: 5e-5
  epochs: 3
  warmup_steps: 500

data:
  max_source_length: 1024
  max_target_length: 200
  train_split: 0.8
  val_split: 0.1
  test_split: 0.1
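A loaded config is worth validating before a long training run. A minimal sketch that mirrors the data section of the YAML above into a dataclass and checks that the splits partition the dataset (the dataclass and its field defaults are an illustrative stand-in, not the lab's actual config loader):

```python
from dataclasses import dataclass

# Hypothetical mirror of the data section of configs/summarization.yaml;
# field names and defaults follow the YAML keys shown above.
@dataclass
class DataConfig:
    max_source_length: int = 1024
    max_target_length: int = 200
    train_split: float = 0.8
    val_split: float = 0.1
    test_split: float = 0.1

    def validate(self) -> None:
        # The three splits must cover the dataset exactly once.
        total = self.train_split + self.val_split + self.test_split
        if abs(total - 1.0) > 1e-6:
            raise ValueError(f"splits sum to {total}, expected 1.0")

cfg = DataConfig()
cfg.validate()  # passes for the defaults above
```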

🎓 Training Scripts

Train a Summarization Model

python scripts/train_summarizer.py \
  --config configs/summarization.yaml \
  --data datasets/processed/gov_articles.json \
  --output checkpoints/summarizer-v1

Train a Classification Model

python scripts/train_classifier.py \
  --config configs/classification.yaml \
  --data datasets/processed/labeled_articles.json \
  --output checkpoints/classifier-v1

Train for Bias Detection

python scripts/train_bias_detector.py \
  --config configs/bias_detection.yaml \
  --data datasets/processed/bias_annotated.json \
  --output checkpoints/bias-detector-v1

📈 Evaluation

Evaluate your fine-tuned models:

# Evaluate summarization model
python scripts/evaluate.py \
  --model checkpoints/summarizer-v1 \
  --task summarization \
  --test-data datasets/processed/test.json

# Generate evaluation report
python scripts/generate_report.py \
  --results outputs/eval_results.json \
  --output reports/model_performance.html
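The headline metric for summarization here is ROUGE-L, which scores the longest common subsequence of tokens between a candidate and a reference summary. As a reference point, a minimal pure-Python sketch of the ROUGE-L F-measure (simplified to whitespace tokenization; real evaluation would use a library implementation with proper tokenization and stemming):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """ROUGE-L F-measure over whitespace tokens (beta=1 weighs precision
    and recall equally)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```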

🔬 Benchmarks

Current model performance on government article dataset:

Model          Task            ROUGE-L  Accuracy  F1 Score
FLAN-T5-Base   Summarization   0.42     -         -
BERT-Base      Classification  -        0.89      0.87
RoBERTa-Large  Bias Detection  -        0.85      0.83
GPT-3.5-Turbo  Q&A             -        0.91      0.89

📚 Datasets

Available Datasets

  1. Gov Articles Dataset (datasets/processed/gov_articles.json)

    • 50,000+ government articles
    • Includes summaries, topics, and metadata
    • Sources: state.gov, whitehouse.gov, congress.gov
  2. News Classification Dataset (datasets/processed/news_classified.json)

    • 100,000+ labeled news articles
    • 15+ topic categories
    • Balanced distribution
  3. Bias Annotated Dataset (datasets/processed/bias_annotated.json)

    • 10,000+ articles with bias annotations
    • Expert-reviewed labels
    • Multiple bias dimensions

Data Format

{
  "id": "article-123",
  "title": "Article Title",
  "content": "Full article content...",
  "summary": "AI-generated summary...",
  "topics": ["politics", "economy"],
  "source": "state.gov",
  "bias_score": 0.23,
  "sentiment": "neutral",
  "metadata": {
    "date": "2025-01-15",
    "author": "John Doe"
  }
}
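Records in this format can be sanity-checked before training. A minimal sketch using only the standard library, with an illustrative (not exhaustive) set of required fields taken from the schema above:

```python
import json

# Fields the JSON records above are expected to carry; treat this set as
# illustrative rather than a complete contract.
REQUIRED_FIELDS = {"id", "title", "content", "summary", "topics", "source"}

def validate_record(record: dict) -> list:
    """Return the sorted names of required fields missing from one record."""
    return sorted(REQUIRED_FIELDS - record.keys())

raw = '{"id": "article-123", "title": "Article Title", "content": "...", "topics": []}'
print(validate_record(json.loads(raw)))  # this record lacks "source" and "summary"
```

Running a check like this over the whole dataset before a training job catches malformed records early instead of mid-epoch.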

🔗 Integration with SynthoraAI

This lab is designed to seamlessly integrate with the SynthoraAI backend:

# Export model for SynthoraAI
from src.utils import export_for_synthoraai

export_for_synthoraai(
    model_path="checkpoints/summarizer-v1",
    output_path="exports/synthoraai-summarizer",
    format="onnx",  # or "torchscript"
    quantize=True
)

API Integration Example

// In SynthoraAI backend (backend/utils/aiSummarizer.js)
// await is only valid inside an async function in a CommonJS module.
const { loadModel, generateSummary } = require('./finetuned-model');

async function summarizeArticle(articleContent) {
  const model = await loadModel('path/to/exported/model');
  const summary = await generateSummary(articleContent, {
    max_length: 200,
    min_length: 50,
    temperature: 0.7
  });
  return summary;
}

🛠️ Advanced Features

LoRA Fine-Tuning

Efficient fine-tuning with Low-Rank Adaptation:

from src.training import LoRAFineTuner

tuner = LoRAFineTuner(
    model_name="meta-llama/Llama-2-7b-hf",
    lora_r=8,
    lora_alpha=16,
    lora_dropout=0.05
)

tuner.train(dataset)

Quantization

Reduce model size for deployment:

from src.utils import quantize_model

quantized_model = quantize_model(
    model_path="checkpoints/summarizer-v1",
    bits=8,  # 8-bit or 4-bit
    output_path="checkpoints/summarizer-v1-quantized"
)

Distributed Training

Scale training across multiple GPUs:

torchrun --nproc_per_node=4 scripts/train_distributed.py \
  --config configs/distributed_training.yaml
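torchrun launches one process per GPU and hands each process its identity through environment variables (RANK, LOCAL_RANK, WORLD_SIZE). A minimal, stdlib-only sketch of how a training script picks them up, with the torch.distributed setup left as comments since it needs GPUs to run:

```python
import os

def get_dist_context() -> dict:
    """Read the per-process context torchrun exports: RANK is the global
    process index, LOCAL_RANK indexes GPUs on this node, and WORLD_SIZE
    is the total process count across all nodes."""
    ctx = {
        "rank": int(os.environ.get("RANK", "0")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
    }
    # A real script would then pin the process to its GPU and join the
    # process group, e.g.:
    #   torch.cuda.set_device(ctx["local_rank"])
    #   torch.distributed.init_process_group(backend="nccl")
    return ctx

print(get_dist_context())  # falls back to single-process defaults outside torchrun
```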

📊 Monitoring and Logging

Weights & Biases Integration

import wandb

wandb.init(
    project="synthoraai-finetuning",
    config=config
)

tuner.train(dataset, use_wandb=True)

TensorBoard Logging

tensorboard --logdir=runs/summarizer-experiment-1

🧪 Testing

Run unit tests:

# Run all tests
pytest tests/

# Run specific test suite
pytest tests/test_training.py

# Run with coverage
pytest --cov=src tests/

📖 Documentation

Detailed documentation is available in the docs/ directory, including the production deployment guide (docs/PRODUCTION_DEPLOYMENT.md).

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for details.

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🔗 Related Projects

  • SynthoraAI AI-Gov-Content-Curator - the backend project this lab supports
👥 Team

Maintained by the SynthoraAI team. For questions or support, contact the maintainers.

🙏 Acknowledgments

  • Google Generative AI team for Gemini API
  • Hugging Face for the Transformers library
  • OpenAI for GPT models
  • Meta for LLaMA models
  • The open-source ML community

📊 Roadmap

  • Initial setup and infrastructure
  • Summarization fine-tuning
  • Classification training
  • Multi-modal learning (text + images)
  • Reinforcement Learning from Human Feedback (RLHF)
  • Custom tokenizer for government terminology
  • Real-time fine-tuning pipeline
  • Automated hyperparameter optimization
  • Model distillation for edge deployment

Built with ❤️ by the SynthoraAI team
