A fully local AI-powered video generation system that automatically creates story and knowledge-based videos from text keywords. The system operates entirely offline without requiring external APIs, making it suitable for environments with restricted internet access.
This project automates the complete video production pipeline, transforming a simple keyword into a fully-produced video with:
- AI-generated scripts based on the input topic
- Scene-by-scene image generation using Stable Diffusion
- Text-to-speech audio narration
- Automatic subtitle generation
- Professional video composition and editing
All processing is performed locally using open-source AI models, ensuring privacy and eliminating dependency on cloud services.
- Fully Local Operation: No external API dependencies
- Automated Script Generation: Uses Ollama with Qwen 2.5 7B for intelligent story creation
- Local Image Generation: Stable Diffusion models (SD 1.5 / SDXL) for scene visualization
- Text-to-Speech Synthesis: Coqui TTS or Piper TTS for natural voice narration
- Automatic Subtitles: Built-in subtitle generation synchronized with audio
- Video Composition: FFmpeg-based automated video editing and assembly
- One-Command Execution: Single Python script orchestrates the entire pipeline
- Python 3.8+
- Ollama - Download and Install
- FFmpeg - Download and Install
- CUDA (Optional but highly recommended for GPU acceleration)
- GPU: NVIDIA GPU with 6GB+ VRAM (8GB+ recommended)
- RAM: 16GB+ (32GB recommended)
- Storage: At least 20GB free space (for model downloads)
```bash
cd AIStoryFarm
pip install -r requirements.txt
```

Important: If your system has an NVIDIA GPU, you must install the PyTorch CUDA build to enable GPU acceleration. Without CUDA, the system runs in CPU mode, which is extremely slow (40+ minutes per image).

Check whether CUDA is available:

```bash
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
```

If it displays `CUDA available: False`, install the CUDA build:

```bash
# CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Or CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

Then verify:

```bash
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"N/A\"}')"
```

This should display `CUDA available: True` and your GPU name.
Notes:
- Ensure your NVIDIA drivers are updated to the latest version
- CUDA 12.1 requires NVIDIA driver 525.60.13 or higher
- CUDA 11.8 requires NVIDIA driver 450.80.02 or higher
- Download Ollama Windows version
- Install Ollama (GUI application)
- Ensure Ollama service is running (check system tray for Ollama icon)
- Download the model using one of the following methods:
Method A: Using Command Line (Recommended)
If the `ollama` command is not available, add Ollama to your system PATH:

a. Locate the Ollama installation directory (typically `C:\Users\<username>\AppData\Local\Programs\Ollama`)

b. Add the `ollama.exe` directory to the system PATH:
   - Press `Win + R`, type `sysdm.cpl`, press Enter
   - Click "Advanced" tab → "Environment Variables"
   - Find `Path` in "System Variables", click "Edit"
   - Click "New", add the Ollama installation directory (e.g., `C:\Users\<username>\AppData\Local\Programs\Ollama`)
   - Click "OK" to save
   - Restart your command prompt (important!)

c. Verify installation:

```bash
ollama --version
```

d. Download the model:

```bash
ollama pull qwen2.5:7b
```

Method B: Using Ollama GUI
- Open Ollama GUI application
- Search for and download the `qwen2.5:7b` model in the interface
- Wait for the download to complete
Method C: Using Full Path (CMD or PowerShell)
In CMD:
"%LOCALAPPDATA%\Programs\Ollama\ollama.exe" pull qwen2.5:7bIn PowerShell:
& "$env:LOCALAPPDATA\Programs\Ollama\ollama.exe" pull qwen2.5:7bcurl https://ollama.ai/install.sh | sh
ollama pull qwen2.5:7b- Download FFmpeg Windows version
- Extract and add
bindirectory to system PATH
sudo apt-get install ffmpegbrew install ffmpegCoqui TTS is automatically installed via pip install TTS. The system automatically attempts to use the best available model:
- XTTS v2 (Priority) - Highest quality, natural voice, multilingual support
- Tacotron2 (Fallback) - Standard Chinese model
- FastSpeech2 (Fallback) - Fast generation
Models are automatically downloaded on first run (XTTS v2 ~1.7GB, Tacotron2 ~500MB).
- Download Piper TTS
- Download Chinese model:
```bash
# Create model directory
mkdir -p models/piper/zh_CN

# Download Chinese model (from Piper official)
# Place model files in models/piper/zh_CN/ directory
```
On first run, the program automatically downloads models:
- SD 1.5 (DreamShaper): ~4GB (lightweight, recommended, balanced model)
  - Uses `Lykon/DreamShaper-8`
  - Suitable for diverse subjects: people, animals, objects, scenes
  - Good performance for story illustrations with optimized prompts and style constraints
  - Fallback models: Realistic Vision V5.1 or original SD 1.5
- SDXL: ~7GB (high quality, requires more VRAM)
python main.py "story keyword"Run without arguments to access an interactive menu:
python main.pyThe interactive mode allows you to:
- Select from predefined story topics
- Input custom story text or load from file
- Choose image generation model
- Configure output settings
To test different prompts for image generation:
```bash
# Using Chinese prompt
python test_image_generation.py "一位古代中國老翁坐在傳統木屋內,牆上掛著精美的壁畫"

# Using English prompt (recommended, better model understanding)
python test_image_generation.py "an old Chinese man sitting in a traditional wooden room with beautiful wall paintings, bronze wine cups on the table, sunset light through window"

# Custom parameters
python test_image_generation.py "your prompt" --steps 40 --guidance 10 --style ancient

# View all options
python test_image_generation.py
```

Prompt Tips:
- English prompts typically yield better results
- Be specific: include characters, actions, environment, lighting
- Use `--guidance` to adjust strictness (7-12, default 9.0)
- Use `--steps` to adjust quality (20-50, default 30)
```bash
# Specify image style
python main.py "historical story" --style chinese_ink

# Specify TTS engine
python main.py "knowledge" --tts piper

# Use SDXL model (requires more VRAM)
python main.py "urban legend" --image-model sdxl

# Custom output filename
python main.py "story keyword" --output my_story
```

Generate multiple videos at once:

```bash
# Using predefined list
python batch_generate.py

# Custom keyword list
python batch_generate.py --keywords "idiom story: waiting for rabbit" "history: three visits" "trivia: why is sky blue"

# Specify uniform style
python batch_generate.py --keywords "keyword1" "keyword2" --style cinematic
```

Parameter reference:

- `keyword`: Topic keyword (required)
- `--style`: Image style
  - `cinematic` (default) - Cinematic style
  - `chinese_ink` - Chinese ink painting
  - `ancient` - Ancient scenes
  - `fantasy` - Fantasy style
  - `horror` - Horror style
  - `hand_drawn` - Hand-drawn style
- `--tts`: TTS engine (`coqui` or `piper`)
- `--image-model`: Image model (`sd15` or `sdxl`)
- `--output`: Output filename (without extension)
- `--lora`: Path to a LoRA weights file or folder (optional; see FINE_TUNING_GUIDE.md)
- `--lora-scale`: LoRA strength 0-1 (default 0.8)
- `--checkpoint`: Path to a local full model file, e.g. from CivitAI (see CIVITAI_IMPORT.md)
- Python 3.8+: Primary programming language
- PyTorch: Deep learning framework for image generation
- Diffusers: Hugging Face library for Stable Diffusion models
- Transformers: Model loading and inference
- FFmpeg: Video processing, encoding, and composition
- Ollama: Local LLM server providing REST API interface
- Qwen 2.5 7B: Large language model for story generation
- Generates structured scripts with paragraphs, scenes, and emotions
- Analyzes story context to recommend visual styles
- Outputs JSON-formatted script data
- Model size: ~4.4GB, runs locally via Ollama
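As a rough illustration, the script request against Ollama's local REST API could look like the sketch below. The endpoint and payload fields are standard Ollama API; the prompt wording and the expected JSON fields are assumptions, not the project's exact ones (see `scripts/generate_script.py` for the actual implementation).

```python
import json
import requests

def generate_script(keyword: str) -> dict:
    """Ask the local Ollama server for a JSON-structured story script."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen2.5:7b",
            "prompt": (
                f"Write a short story script about '{keyword}' as JSON with "
                "fields: title, paragraphs (each with text, scene, emotion)."
            ),
            "format": "json",   # ask Ollama to constrain output to valid JSON
            "stream": False,
        },
        timeout=300,
    )
    response.raise_for_status()
    return json.loads(response.json()["response"])
```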
- Stable Diffusion 1.5: Base diffusion model architecture
- DreamShaper-8: Fine-tuned model optimized for diverse subjects
- Realistic Vision V5.1: Alternative model for realistic scenes
- SDXL Turbo: Fast generation variant (1-4 steps)
- Model Features:
- Automatic prompt translation (Chinese to English)
- Emotional context analysis
- Style-aware generation
- Character consistency across scenes
- Negative prompt optimization to prevent artifacts
- LoRA support for custom style fine-tuning
- Coqui TTS: Primary TTS engine
- XTTS v2: Highest quality, natural voice synthesis with emotional reference audio support
- Tacotron2: Standard Chinese model
- FastSpeech2: Fast generation option
- Piper TTS: Alternative lightweight TTS engine
- FFmpeg: Video composition and editing
- Image-to-video conversion with effects
- Audio synchronization
- Subtitle overlay
- Aspect ratio management (letterboxing)
- Video concatenation
The system follows a modular pipeline architecture:
- Script Generation Module (`scripts/generate_script.py`)
  - Interfaces with the Ollama API
  - Constructs prompts for the LLM
  - Parses and validates JSON responses
  - Handles error recovery and JSON repair
- Image Generation Module (`scripts/generate_images.py`)
  - Loads Stable Diffusion models via Hugging Face Diffusers
  - Translates prompts using Google Translator API
  - Manages GPU/CPU device selection
  - Implements prompt engineering with style constraints
  - Supports LoRA and custom checkpoint loading
- Audio Generation Module (`scripts/generate_audio.py`)
  - Manages multiple TTS backends (Coqui, Piper)
  - Handles the model fallback chain
  - Supports emotional reference audio for XTTS v2
  - Implements audio normalization and cleanup
- Video Generation Module (`scripts/generate_video.py`)
  - Orchestrates FFmpeg operations
  - Synchronizes audio and video segments
  - Generates subtitle files (SRT format)
  - Applies video effects (zoom, pan, static)
  - Manages aspect ratio and letterboxing
- Main Pipeline (`main.py`)
  - Coordinates all modules
  - Manages file I/O and directory structure
  - Provides interactive and command-line interfaces
  - Handles error propagation and user feedback
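Conceptually, `main.py` chains these modules as in the sketch below. The entry-point function names are hypothetical; the actual names and signatures in `scripts/` may differ.

```python
# Hypothetical module entry points; the real names in scripts/ may differ.
from scripts.generate_script import generate_script
from scripts.generate_images import generate_images
from scripts.generate_audio import generate_audio
from scripts.generate_video import compose_video

def run_pipeline(keyword: str) -> str:
    script = generate_script(keyword)        # title, paragraphs, scenes, emotions
    images = generate_images(script)         # one image per scene
    audio_clips = generate_audio(script)     # one narration clip per paragraph
    return compose_video(script, images, audio_clips)  # path to the final MP4
```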
- Input Processing: User provides a keyword or topic
- LLM Analysis: Qwen 2.5 analyzes the topic and generates:
- Story title
- Multiple paragraphs with narrative flow
- Scene descriptions for each paragraph
- Emotional context analysis
- Recommended visual style with reasoning
- JSON Output: Structured script data for downstream processing (an illustrative shape is shown after this list)
- Validation: JSON structure validation and error recovery
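For illustration, the structured script might take a shape like the following. The field names are assumptions based on the steps above, not the exact schema:

```json
{
  "title": "...",
  "style": "chinese_ink",
  "style_reason": "...",
  "paragraphs": [
    {
      "text": "Narration text for this paragraph",
      "scene": "Visual description used for image generation",
      "emotion": "calm"
    }
  ]
}
```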
- Prompt Construction:
- Scene description from script
- Style keywords based on LLM recommendation
- Emotional vocabulary from context analysis
- Story title and text as context
- Character consistency prompts
- Translation: Chinese prompts translated to English for better model understanding
- Model Selection: Automatic fallback chain for model compatibility
- Generation: Stable Diffusion inference with optimized parameters
- Guidance scale: 7-12 (default 9.0)
- Inference steps: 20-50 (default 30)
- Negative prompts prevent artifacts
- Quality Control: Negative prompts prevent artifacts and maintain consistency
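In Diffusers terms, the generation step amounts to something like the sketch below, using the defaults described above. The model ID matches the one named earlier in this README; the exact prompt assembly lives in `scripts/generate_images.py`.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the primary model in half precision on the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "Lykon/DreamShaper-8", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="an old Chinese man in a traditional wooden room, cinematic lighting",
    negative_prompt="blurry, deformed hands, extra limbs, watermark, text",
    guidance_scale=9.0,        # default described above (range 7-12)
    num_inference_steps=30,    # default described above (range 20-50)
).images[0]
image.save("scene_01.png")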
- Text Extraction: Paragraph text from generated script
- TTS Selection: Automatic model selection (XTTS v2 → Tacotron2 → FastSpeech2)
- Synthesis: Voice generation with natural prosody
- Optional emotional reference audio for XTTS v2
- Audio normalization for consistent volume
- Duration Calculation: Audio length used for video synchronization
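A minimal sketch of the XTTS v2 path using Coqui TTS's public API, assuming the optional reference audio from `data/tts_reference.wav` and an illustrative output path:

```python
from TTS.api import TTS

# Load XTTS v2 (downloaded automatically on first use).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

tts.tts_to_file(
    text="很久很久以前……",                    # "Once upon a time..."
    speaker_wav="data/tts_reference.wav",     # optional emotional reference audio
    language="zh-cn",
    file_path="audio/paragraph_01.wav",
)
```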
- Segment Creation: Each image paired with corresponding audio
- Effect Application: Zoom, pan, or static effects
- Synchronization: Video duration matches audio exactly
- Aspect Ratio Management: Letterboxing to maintain 9:16 format
- Subtitle Overlay: SRT-based subtitles with styling
- Concatenation: All segments combined into final video
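As a sketch of how one segment could be built from Python: the FFmpeg flags below are standard, but the actual filter chain in `scripts/generate_video.py` may differ (zoom/pan effects and subtitle overlay are omitted here), and the file paths are illustrative.

```python
# One segment: still image + narration -> letterboxed 9:16 clip at 30 FPS.
import subprocess

subprocess.run([
    "ffmpeg", "-y",
    "-loop", "1", "-i", "images/scene_01.png",   # repeat the still image
    "-i", "audio/paragraph_01.wav",              # narration for this paragraph
    "-vf", ("scale=1080:1920:force_original_aspect_ratio=decrease,"
            "pad=1080:1920:(ow-iw)/2:(oh-ih)/2"),  # letterbox to 9:16
    "-r", "30",
    "-c:v", "libx264", "-c:a", "aac",
    "-shortest",                                  # clip length follows the audio
    "segments/segment_01.mp4",
], check=True)
```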
```
AIStoryFarm/
├── main.py                    # Main program entry point
├── requirements.txt           # Python dependencies
├── README.md                  # This file
├── batch_generate.py          # Batch processing script
├── test_image_generation.py   # Image generation testing tool
├── scripts/                   # Functional modules
│   ├── generate_script.py     # Script generation (LLM)
│   ├── generate_images.py     # Image generation (Stable Diffusion)
│   ├── generate_audio.py      # Audio generation (TTS)
│   └── generate_video.py      # Video generation (FFmpeg)
├── models/                    # Model files (auto-downloaded)
├── data/                      # Data files
│   ├── topics.json            # Predefined story topics
│   └── tts_reference.wav      # Optional emotional reference audio
├── output/                    # Output directory
│   └── {keyword}/
│       ├── script/            # Generated scripts
│       ├── images/            # Generated images
│       ├── audio/             # Generated audio
│       └── video/             # Final videos
├── images/                    # Temporary images (optional)
├── audio/                     # Temporary audio (optional)
└── video/                     # Temporary video (optional)
```
Error: 'ollama' is not recognized as an internal or external command
Solution:
- Confirm Ollama is installed and running:
  - Check the system tray for the Ollama icon
  - If not present, launch Ollama from the Start menu
- Add Ollama to PATH:
  - Locate the Ollama installation: `C:\Users\<your-username>\AppData\Local\Programs\Ollama`
  - Add this directory to the system PATH (see installation steps above)
  - Restart the command prompt
- Use the full path (temporary solution):

  In CMD:

  ```cmd
  "%LOCALAPPDATA%\Programs\Ollama\ollama.exe" pull qwen2.5:7b
  ```

  In PowerShell:

  ```powershell
  & "$env:LOCALAPPDATA\Programs\Ollama\ollama.exe" pull qwen2.5:7b
  ```

- Use the GUI to download the model:
  - Open the Ollama GUI and download the model directly in the interface
Error: Unable to connect to Ollama
Solution:
- Confirm Ollama is running (check system tray)
- Confirm the model is downloaded:

  ```bash
  ollama list
  ```

  Or using the full path:

  ```powershell
  & "$env:LOCALAPPDATA\Programs\Ollama\ollama.exe" list
  ```

- Check whether the Ollama service is running at `http://localhost:11434`
- If Ollama is not running, launch it from the Start menu
Error: FFmpeg not available
Solution:
- Confirm FFmpeg is installed:

  ```bash
  ffmpeg -version
  ```
- Confirm FFmpeg is in system PATH
Symptom: Image generation shows `device: cpu`, and each image takes 40+ minutes
Solution:
- Check PyTorch CUDA support:

  ```bash
  python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
  ```

- If it shows `False`, install the PyTorch CUDA build:

  ```bash
  # CUDA 12.1
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

  # Or CUDA 11.8
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  ```

- Confirm NVIDIA drivers are installed and up to date
- Re-run the program; it should display `device: cuda`
Error: CUDA out of memory
Solution:
- Use the lighter model: `python main.py "keyword" --image-model sd15`
- Reduce the batch size (modify `generate_images.py`)
- Close other programs using the GPU
- Use CPU mode (slower, not recommended)
Error: Coqui TTS not available or Piper TTS not available
Solution:
- Coqui TTS: Confirm installation with `pip install TTS`
- Piper TTS: Confirm installation and model path configuration
Symptom: Generated images don't match Chinese story content
Solution:
- Check if scene descriptions in script are accurate
- Try different style options (`--style`)
- If the problem persists, manually edit the prompt templates in `scripts/generate_images.py`
Symptom: Model downloads are slow or fail

Solution:
- Use a regional Hugging Face mirror (if available)
- Manually download models to the `~/.cache/huggingface/` directory
- Use a VPN or proxy
Edit the prompt template in `scripts/generate_script.py`. The prompt engineering determines the structure and quality of generated stories.

Edit the `style_prompts` dictionary in `scripts/generate_images.py`. Each style has associated keywords that influence the visual output, as sketched below.
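For example, a hypothetical entry might look like this (the actual keys and keyword lists in `scripts/generate_images.py` may differ):

```python
# Hypothetical shape of the style_prompts dictionary; styles match the
# --style options listed earlier in this README.
style_prompts = {
    "cinematic": "cinematic lighting, film grain, dramatic composition",
    "chinese_ink": "traditional Chinese ink painting, brush strokes, rice paper",
    "ancient": "ancient Chinese setting, historical costume, classical architecture",
}
```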
To train a LoRA for a specific visual style (e.g. Chinese ink, anime) and use it in story generation, see FINE_TUNING_GUIDE.md. After training, pass the LoRA path with `--lora path/to/lora.safetensors` or set the `LORA_PATH` environment variable.
Edit effect parameters in `scripts/generate_video.py`. Effects include zoom, pan, and static positioning for each video segment.
Generated videos are saved at:
`output/{keyword}/video/{keyword}_with_subtitles.mp4`
Video Specifications:
- Resolution: 1080x1920 (Shorts format)
- Frame Rate: 30 FPS
- Format: MP4 (H.264 + AAC)
- Aspect Ratio: 9:16 (vertical video)
- Input Keyword → User provides topic
- Generate Script → Ollama + Qwen generates story paragraphs
- Generate Images → Stable Diffusion generates background images for each scene
- Generate Audio → TTS synthesizes voice narration
- Compose Video → FFmpeg combines all elements
- Output Video → Final MP4 file
- First Run: Recommended to use `--image-model sd15` (lighter)
- GPU Acceleration: Ensure CUDA is correctly installed for optimal performance
- Batch Generation: You can write scripts that loop `main.py` for batch processing
- Customization: Modify module parameters and prompts as needed
The system implements a fallback chain for model loading:
- Primary: DreamShaper-8 (if available locally)
- Secondary: Realistic Vision V5.1
- Tertiary: Original Stable Diffusion 1.5
This ensures compatibility across different hardware configurations.
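A minimal sketch of such a fallback loop, assuming Hugging Face repo IDs for the three models (the Realistic Vision ID in particular is an assumption; the real logic lives in `scripts/generate_images.py`):

```python
from diffusers import StableDiffusionPipeline

FALLBACK_MODELS = [
    "Lykon/DreamShaper-8",                    # primary
    "SG161222/Realistic_Vision_V5.1_noVAE",   # assumed repo ID for Realistic Vision V5.1
    "runwayml/stable-diffusion-v1-5",         # original SD 1.5
]

def load_pipeline() -> StableDiffusionPipeline:
    for model_id in FALLBACK_MODELS:
        try:
            return StableDiffusionPipeline.from_pretrained(model_id)
        except Exception as exc:  # e.g. model not cached and no network access
            print(f"Could not load {model_id}: {exc}")
    raise RuntimeError("No Stable Diffusion model could be loaded")
```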
- CUDA memory allocation uses expandable segments to reduce fragmentation (see the sketch after this list)
- Models are loaded once and reused across multiple image generations
- Automatic device selection (CUDA > CPU) with fallback
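The expandable-segments behavior mentioned above corresponds to PyTorch's `PYTORCH_CUDA_ALLOC_CONF` setting, which must be in the environment before CUDA is initialized; a minimal sketch:

```python
# Must be set before torch initializes CUDA (i.e., before the first import
# of torch in the process) for the allocator setting to take effect.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported only after the allocator configuration is in place
```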
- JSON parsing includes repair mechanisms for incomplete LLM responses (see the sketch after this list)
- Model loading includes fallback chains
- TTS includes automatic model selection based on availability
- All modules include comprehensive error messages for debugging
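As an illustration of the kind of JSON repair involved, here is a minimal sketch that strips a surrounding markdown fence and trims to the outermost braces; the project's actual recovery logic may be more involved.

```python
import json

def repair_json(raw: str) -> dict:
    """Best-effort recovery of a JSON object from raw LLM output."""
    text = raw.strip()
    if text.startswith("```"):
        text = text.split("```")[1]      # drop the surrounding code fence
        if text.lower().startswith("json"):
            text = text[4:]              # drop the fence's language tag
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in LLM output")
    return json.loads(text[start:end + 1])  # keep only the outermost object
```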
This project is for educational and research purposes only.
Issues and Pull Requests are welcome!
For issues, please submit an Issue on GitHub.
Note: This system runs entirely locally and does not depend on any external APIs, making it suitable for environments with restricted internet access.