Splitting PDFs into chapters

A Python script that reads PDF documents and splits their content into chapters based on predefined regex patterns. The tool is designed for specific technical documents: it skips the first pages and uses regular expressions to determine where each chapter begins and ends.

Features

Skips a configurable number of initial pages
Uses document-specific regex patterns for chapter detection
Writes each chapter to its own text file
Supports multiple document types with fixed configurations
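
The chapter detection described above can be sketched as follows. This is a minimal illustration, not the code in pdfSplitter.py: the function name `split_into_chapters` is hypothetical, and it assumes that every line matching the pattern starts a new chapter and that any text before the first match is discarded.

```python
import re

def split_into_chapters(text, pattern, flags=re.MULTILINE):
    """Split text at every match of `pattern`; each match starts a new chapter."""
    matches = list(re.finditer(pattern, text, flags))
    chapters = []
    for i, match in enumerate(matches):
        # A chapter runs from its own heading up to the next heading (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters.append(text[match.start():end].strip())
    return chapters

sample = "Preface\nChapter 1: Intro\nHello\nChapter 2: Next\nWorld\n"
chapters = split_into_chapters(sample, r'^Chapter\s+\d+:\s*.+')
```

With `re.MULTILINE`, `^` anchors at the start of every line, so a heading is only recognised when it begins a line.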

Requirements

Python 3.6+
PyMuPDF (fitz): pip install pymupdf
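
For orientation, reading page text with PyMuPDF while skipping the first pages might look like the sketch below. The helper names `text_after_skipping` and `read_pdf_text` are hypothetical and not taken from the script; `page.get_text()` is the method name in recent PyMuPDF releases (older releases used `getText()`).

```python
def text_after_skipping(pages, skip_pages):
    """Join the text of all pages after the first `skip_pages` pages."""
    return "\n".join(page.get_text() for page in list(pages)[skip_pages:])

def read_pdf_text(pdf_path, skip_pages):
    """Open a PDF with PyMuPDF and return its text without the leading pages."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    doc = fitz.open(pdf_path)
    try:
        # Iterating over a PyMuPDF document yields its pages in order.
        return text_after_skipping(doc, skip_pages)
    finally:
        doc.close()
```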

Usage

python pdfSplitter.py <path-to-pdf> <instruction-key>

Output

Creates an Output/ directory with subfolders named after the input PDFs:

Output/
  ├── document1/
  │   ├── hoofdstuk_1.txt
  │   └── hoofdstuk_2.txt
  └── document2/
      ├── hoofdstuk_1.txt
      └── hoofdstuk_2.txt
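
A layout like this can be produced with a few lines of `pathlib`. The sketch below is an assumption about how the files could be written, not the script's actual code; `write_chapters` and its parameters are hypothetical names. The file name prefix `hoofdstuk` is Dutch for "chapter".

```python
from pathlib import Path

def write_chapters(chapters, pdf_path, output_root="Output"):
    """Write each chapter to <output_root>/<pdf name without extension>/hoofdstuk_<n>.txt."""
    target = Path(output_root) / Path(pdf_path).stem
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for number, chapter in enumerate(chapters, start=1):
        out_file = target / f"hoofdstuk_{number}.txt"
        out_file.write_text(chapter, encoding="utf-8")
        written.append(out_file)
    return written
```

Calling `write_chapters(chapters, "Pdfs/document1.pdf")` would create `Output/document1/` and number the files from 1.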

(Current) `instruction-key(s)`

CCSK - CCSK Study Guide
Zero Trust Planning - Zero Trust Planning Study Guide
Zero Trust Strategy - Zero Trust Strategy Study Guide
Zero Trust Implementation - Zero Trust Implementation Study Guide
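
Several of these keys contain spaces, so they have to be quoted on the command line to arrive as a single argument (for example "Zero Trust Planning"). The sketch below shows one way the two positional arguments could be validated with `argparse`; it is an illustration only, and the script may parse its arguments differently. `KNOWN_KEYS` and `parse_args` are hypothetical names.

```python
import argparse

# Keys copied from the list above.
KNOWN_KEYS = [
    "CCSK",
    "Zero Trust Planning",
    "Zero Trust Strategy",
    "Zero Trust Implementation",
]

def parse_args(argv):
    """Parse <path-to-pdf> and <instruction-key>, rejecting unknown keys."""
    parser = argparse.ArgumentParser(prog="pdfSplitter.py")
    parser.add_argument("pdf_path", help="path to the input PDF")
    parser.add_argument("instruction_key", choices=KNOWN_KEYS,
                        help="which document configuration to use")
    return parser.parse_args(argv)

# Equivalent to: python pdfSplitter.py Pdfs/guide.pdf "Zero Trust Planning"
args = parse_args(["Pdfs/guide.pdf", "Zero Trust Planning"])
```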

Adding a new `instruction-key`

Steps

Pick a new instruction-key to add to the dictionary (for example 'New Security Guide').
Determine how many initial pages need to be skipped.
Write a regex pattern that matches the chapter titles in your document.
Add the entry to the DOCUMENT_INSTRUCTIONS dictionary:

import re  # needed for re.MULTILINE below

DOCUMENT_INSTRUCTIONS = {
    # ... existing configurations ...

    'New Security Guide': {  # New instruction key
        'skip_pages': 5,     # Skip first 5 pages
        'regex_pattern': r'^Chapter\s+\d+:\s*.+',  # Pattern matching "Chapter X: Title"
        'flags': re.MULTILINE  # Let ^ match at the start of every line
    }
}
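
Before adding an entry, it can help to try the pattern against a few lines copied from the document. The snippet below is a sketch for that purpose: `find_headings` is a hypothetical helper, the sample text is invented, and only the dictionary entry mirrors the configuration shown above.

```python
import re

DOCUMENT_INSTRUCTIONS = {
    'New Security Guide': {
        'skip_pages': 5,
        'regex_pattern': r'^Chapter\s+\d+:\s*.+',
        'flags': re.MULTILINE,
    }
}

def find_headings(text, instruction_key):
    """Return every heading line the configured pattern matches in `text`."""
    config = DOCUMENT_INSTRUCTIONS[instruction_key]
    return re.findall(config['regex_pattern'], text, config['flags'])

# Invented sample text; replace it with lines from the real document.
sample = (
    "Table of contents\n"
    "Chapter 1: Cloud Concepts\n"
    "Some body text mentioning Chapter 2: later.\n"
    "Chapter 2: Governance\n"
)
headings = find_headings(sample, 'New Security Guide')
```

Because the pattern is anchored with `^`, a mention of "Chapter 2:" in the middle of a sentence is not treated as a heading.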
