Cacummaro - From Latin "cacumen" (peak, summit) - reaching the peak of document organization!
A comprehensive Java application for ingesting web pages, converting them to PDF, extracting metadata, classifying documents, and integrating with Obsidian for knowledge management.
- Web to PDF: Convert any web page to PDF using Playwright headless Chrome
- AI-Powered Classification: Machine learning + MCP integration with GPT-4, Claude, or custom AI models
- Hybrid Classification System: Combines TF-IDF ML, rule-based patterns, and external AI for maximum accuracy
- MCP Protocol Support: Integrate with any AI model via Model Context Protocol (future-proof architecture)
- Interactive Graph Visualization: Obsidian-style category graph with D3.js
- CouchDB Storage: Robust document storage with attachment support
- Obsidian Integration: Automatic note generation with frontmatter and metadata
- Full-text Search: Search across titles, descriptions, and metadata
- REST API: Comprehensive API for all operations
- Security: SSRF protection, input validation, and path traversal prevention
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│     Web UI      │     │    REST API     │     │    Services     │
│                 │◄───►│                 │◄───►│                 │
│ - Graph View    │     │ - /ingest       │     │ - PDF Gen       │
│ - Category List │     │ - /documents    │     │ - AI Classifier │
│ - Document View │     │ - /categories   │     │ - Obsidian      │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                 │                       │
                                 ▼                       ▼
┌──────────────────────────────────────────┐    ┌─────────────────┐
│         Classification Pipeline          │    │ Obsidian Vault  │
│  ┌────────────────────────────────────┐  │    │                 │
│  │ 1. MCP AI Models (Optional)        │  │    │ - Notes (.md)   │
│  │    • GPT-4, Claude, Custom LLMs    │  │    │ - Links         │
│  │    • JSON-RPC 2.0 Protocol         │──┼────┤ - Metadata      │
│  └────────────────────────────────────┘  │    └─────────────────┘
│  ┌────────────────────────────────────┐  │
│  │ 2. ML TF-IDF Classifier            │  │    ┌─────────────────┐
│  │    • PDF Text Extraction           │  │    │     CouchDB     │
│  │    • Cosine Similarity             │  │    │                 │
│  └────────────────────────────────────┘  │    │ - Documents     │
│  ┌────────────────────────────────────┐  │    │ - Categories    │
│  │ 3. Rule-Based Patterns             │  │    │ - PDF Binaries  │
│  │    • Keyword Matching              │──┼────┤ - ML Models     │
│  │    • Domain Detection              │  │    └─────────────────┘
│  └────────────────────────────────────┘  │
│  ┌────────────────────────────────────┐  │
│  │ 4. Merge & Deduplicate Results     │  │
│  │    • Highest Confidence Wins       │  │
│  └────────────────────────────────────┘  │
└──────────────────────────────────────────┘
- Java 11+ (LTS recommended)
- Docker (for CouchDB)
- Maven 3.6+
- curl (for testing)
# Linux/macOS
./start.sh
# Windows
start.bat

# Start CouchDB
docker-compose up -d
# Build application
mvn clean package
# Run application
java -jar target/cacummaro-1.0-SNAPSHOT.jar

# Health check
curl http://localhost:8082/actuator/health
# Expected response:
# {"status":"UP"}
# Access the web interface
# Open browser to http://localhost:8082/

Open your browser to http://localhost:8082/:
- Ingest Documents: Enter any URL and click "Convert" to create a PDF snapshot
- View Graph: Click "Smart Classification" to see the interactive category graph
- Download PDFs: Click on any category node, then download PDFs from the sidebar
curl -X POST http://localhost:8082/api/v1/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "options": {
      "createObsidianNote": true,
      "noteMetaTag": "data-note"
    }
  }'

Response:
{
  "id": "doc|550e8400-e29b-41d4-a716-446655440000",
  "status": "STORED",
  "pdfUrl": "/api/v1/documents/doc|550e8400-e29b-41d4-a716-446655440000/pdf"
}

curl "http://localhost:8082/api/v1/search?q=machine%20learning&size=5"

curl -o document.pdf \
  "http://localhost:8082/api/v1/documents/doc|550e8400-e29b-41d4-a716-446655440000/pdf"

curl "http://localhost:8082/api/v1/categories"

Edit src/main/resources/application.yml:
cacummaro:
  couchdb:
    host: localhost
    port: 5984
    database: cacummaro_docs
    username: admin
    password: password
  obsidian:
    vault-path: ./obsidian-vault
    enabled: true
  pdf:
    timeout: 30s
    max-page-size: 10MB
  # Classification System Configuration
  classification:
    confidence-threshold: 0.7
  # Machine Learning TF-IDF Classifier
  ml:
    enabled: true
    confidence-threshold: 0.6
    model-path: ./ml-model.json
    min-document-frequency: 2
    max-features: 1000
  # MCP AI Integration (GPT-4, Claude, Custom LLMs)
  mcp:
    enabled: false # Set to true to enable AI classification
    server-url: http://localhost:3000/mcp
    timeout-ms: 30000
    tool-name: classify_document
  security:
    allowed-domains: []
    blocked-internal-ips: true
  ingestion:
    max-concurrent: 5
    retry-attempts: 3
    retry-delay: 5s

src/main/java/org/cacummaro/
├── domain/                  # Core entities (Document, Category)
├── dto/                     # API contracts (IngestRequest, etc.)
├── repository/              # Data access (CouchDB implementations)
├── service/
│   ├── pdf/                 # PDF generation (Playwright)
│   ├── classification/      # Document classification
│   └── obsidian/            # Note generation
├── controller/              # REST endpoints
└── config/                  # Spring configuration
# Unit tests
mvn test
# Integration tests
mvn failsafe:integration-test
# All tests
mvn verify

- SSRF Protection: Blocks internal/private IP ranges
- Input Validation: Comprehensive URL and data validation
- Path Traversal Prevention: Secure Obsidian vault operations
- Sanitization: Clean filenames and content
- Rate Limiting: Configurable request limits
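The SSRF protection listed above can be illustrated with a minimal sketch. It assumes the check is done by resolving the URL's host and rejecting loopback, link-local, site-local, and wildcard addresses; the class name `SsrfGuard` and its methods are hypothetical and are not taken from the Cacummaro source.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;

/** Illustrative SSRF guard; class and method names are hypothetical. */
final class SsrfGuard {

    /** True when the address is one a server-side fetcher should refuse. */
    static boolean isInternal(InetAddress addr) {
        return addr.isAnyLocalAddress()    // 0.0.0.0 and ::
            || addr.isLoopbackAddress()    // 127.0.0.0/8 and ::1
            || addr.isLinkLocalAddress()   // 169.254.0.0/16 and fe80::/10
            || addr.isSiteLocalAddress();  // 10/8, 172.16/12, 192.168/16
    }

    /** Accepts only http(s) URLs whose host resolves exclusively to non-internal addresses. */
    static boolean isAllowed(String url) {
        try {
            URI uri = URI.create(url);
            String scheme = uri.getScheme();
            String host = uri.getHost();
            if (scheme == null || host == null) {
                return false;
            }
            if (!scheme.equalsIgnoreCase("http") && !scheme.equalsIgnoreCase("https")) {
                return false;
            }
            // A host may resolve to several addresses; every one of them must be public.
            for (InetAddress addr : InetAddress.getAllByName(host)) {
                if (isInternal(addr)) {
                    return false;
                }
            }
            return true;
        } catch (IllegalArgumentException | UnknownHostException e) {
            return false; // malformed URL or unresolvable host: reject
        }
    }
}
```

A check like this is only a first line of defense: the host can resolve differently when the page is actually fetched, and `isSiteLocalAddress()` does not cover IPv6 unique local addresses (fc00::/7), so a production guard would likely need additional rules.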
Cacummaro features a next-generation hybrid classification system that combines the best of traditional ML, modern AI, and rule-based approaches.
The Future of Document Classification
Integrates with cutting-edge AI models via the Model Context Protocol (MCP):
- GPT-4 / GPT-4 Turbo: Deep semantic understanding and contextual reasoning
- Claude (Anthropic): Nuanced categorization with reasoning explanations
- Local LLMs: Ollama, LLaMA, Mistral for privacy-focused deployments
- Custom Models: Your fine-tuned domain-specific AI models
Benefits:
- Semantic Understanding: Goes beyond keywords to understand context and meaning
- Continuous Learning: Leverage latest AI advancements without code changes
- Multi-language: Works across languages with advanced AI models
- Confidence Scores: Each category comes with a confidence value and the model's reasoning
- Plug & Play: Swap AI models by changing MCP server configuration
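Since MCP messages are JSON-RPC 2.0, a classification call might be built roughly as follows. This is a sketch under assumptions: `tools/call` with `name` and `arguments` is the MCP method for invoking a tool, but the argument names `title` and `text`, the class `McpRequestSketch`, and the single-request flow (no `initialize` handshake, no session handling) are illustrative and not confirmed by the Cacummaro source.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Illustrative builder for an MCP tool call; names are hypothetical. */
final class McpRequestSketch {

    /** Escapes the characters that JSON requires to be escaped inside a string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Builds a JSON-RPC 2.0 "tools/call" body; the argument names are assumptions. */
    static String toolCallBody(int id, String toolName, String title, String text) {
        return "{\"jsonrpc\":\"2.0\",\"id\":" + id
            + ",\"method\":\"tools/call\",\"params\":{\"name\":\"" + escape(toolName)
            + "\",\"arguments\":{\"title\":\"" + escape(title)
            + "\",\"text\":\"" + escape(text) + "\"}}}";
    }

    /** Wraps the body in an HTTP POST with a request timeout. */
    static HttpRequest build(String serverUrl, long timeoutMs, String body) {
        return HttpRequest.newBuilder(URI.create(serverUrl))
            .timeout(Duration.ofMillis(timeoutMs))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }
}
```

The request could then be sent with `java.net.http.HttpClient` (available since Java 11); a real project would more likely serialize the body with a JSON library than by string concatenation.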
Configuration:
cacummaro:
  mcp:
    enabled: true
    server-url: http://localhost:3000/mcp # Your MCP server
    timeout-ms: 30000
    tool-name: classify_document

Example MCP Response:
{
  "categories": [
    {
      "name": "artificial-intelligence",
      "confidence": 0.95,
      "reasoning": "Article discusses neural networks, deep learning, and ML algorithms"
    }
  ],
  "model": "gpt-4"
}

Content-Based Machine Learning
- Extracts full text from PDF documents using Apache PDFBox
- Uses TF-IDF vectorization (Term Frequency-Inverse Document Frequency)
- Cosine similarity for category matching
- Learns from your document corpus
- Always available as fallback when MCP is unavailable
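The TF-IDF and cosine-similarity steps above can be sketched in a few lines. The sketch assumes a simple alphanumeric tokenizer, length-normalized term frequency, and a smoothed IDF of ln((1 + N) / (1 + df)) + 1; the class `TfIdfSketch` is hypothetical, and the project's actual weighting and tokenization may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative TF-IDF vectorizer with cosine similarity; names are hypothetical. */
final class TfIdfSketch {

    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1. */
    static Map<String, Double> idf(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        int n = docs.size();
        Map<String, Double> weights = new HashMap<>();
        df.forEach((term, count) -> weights.put(term, Math.log((1.0 + n) / (1.0 + count)) + 1.0));
        return weights;
    }

    /** TF-IDF vector for one document; terms unknown to the corpus are ignored. */
    static Map<String, Double> vectorize(List<String> tokens, Map<String, Double> idf) {
        Map<String, Double> vec = new HashMap<>();
        for (String t : tokens) {
            if (idf.containsKey(t)) {
                vec.merge(t, 1.0, Double::sum); // raw term count
            }
        }
        vec.replaceAll((term, count) -> count / tokens.size() * idf.get(term));
        return vec;
    }

    /** Cosine similarity of two sparse vectors; 0.0 when either vector is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

To classify, a new document's vector would be compared against one vector per category (for example, built from that category's already-categorized documents), and the best-scoring categories kept.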
Training the Model:
# Train on existing categorized documents
curl -X POST "http://localhost:8082/api/v1/ml/train?maxDocuments=1000"
# Check ML status
curl http://localhost:8082/api/v1/ml/status

Benefits:
- Fast: Local execution, no API calls
- Private: No data leaves your infrastructure
- Self-Learning: Improves as you categorize more documents
- Reliable: Always available, no external dependencies
Traditional keyword and domain-based classification
Built-in categories with pattern matching:
- Technology: software, programming, ai, javascript, python, react, docker
- Finance: banking, investment, crypto, trading, stocks, portfolio
- Science: research, biology, physics, medical, study, experiment
- Business: startup, management, strategy, marketing, sales, leadership
- News: breaking, politics, government, journalism, election
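Keyword matching against category lists like these could work roughly as sketched below. The scoring rule (matched keywords divided by total keywords for the category) and the class `KeywordRules` are assumptions made for illustration; the source does not say how the rule-based classifier derives its confidence.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative keyword classifier; class name and scoring rule are hypothetical. */
final class KeywordRules {

    private final Map<String, List<String>> keywordsByCategory = new LinkedHashMap<>();

    /** Registers a category and its keywords; returns this for chaining. */
    KeywordRules add(String category, String... keywords) {
        keywordsByCategory.put(category, Arrays.asList(keywords));
        return this;
    }

    /** Returns category -> score for every category with at least one whole-word match. */
    Map<String, Double> classify(String text) {
        // Normalize to single-space-separated lowercase words, padded with spaces,
        // so that "ai" does not match inside "rain" and phrases like "data analysis" still match.
        String normalized = " "
            + String.join(" ", text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"))
            + " ";
        Map<String, Double> scores = new LinkedHashMap<>();
        keywordsByCategory.forEach((category, keywords) -> {
            long hits = keywords.stream()
                .filter(k -> normalized.contains(" " + k.toLowerCase(Locale.ROOT) + " "))
                .count();
            if (hits > 0) {
                scores.put(category, (double) hits / keywords.size());
            }
        });
        return scores;
    }
}
```

Matching on whole words matters for short keywords such as `ai`, which would otherwise fire on unrelated text.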
Add custom categories in application.yml:
cacummaro:
  classification:
    rule-based:
      category-keywords:
        data-science:
          - "pandas"
          - "numpy"
          - "data analysis"
          - "visualization"

All three classifiers work together:
- MCP AI runs first (if enabled) for deep semantic understanding
- ML TF-IDF analyzes PDF content for statistical patterns
- Rule-based checks for explicit keywords and domains
- Results are merged - highest confidence wins per category
- Source tracking - know which classifier found each category
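The merge step described above (highest confidence wins per category, with the winning classifier recorded) might look like the following sketch. The types `ResultMerger` and `Scored` are hypothetical, and any threshold filtering is left out because the source does not say where it is applied.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative merge of classifier outputs; type names are hypothetical. */
final class ResultMerger {

    /** One classifier's verdict for one category. */
    static final class Scored {
        final String name;
        final double confidence;
        final String classifier;

        Scored(String name, double confidence, String classifier) {
            this.name = name;
            this.confidence = confidence;
            this.classifier = classifier;
        }
    }

    /** Keeps the highest-confidence entry per category, sorted by confidence descending. */
    @SafeVarargs
    static List<Scored> merge(List<Scored>... sources) {
        Map<String, Scored> best = new HashMap<>();
        for (List<Scored> source : sources) {
            for (Scored s : source) {
                // On a duplicate category, keep whichever entry has the higher confidence.
                best.merge(s.name, s, (old, cur) -> cur.confidence > old.confidence ? cur : old);
            }
        }
        List<Scored> merged = new ArrayList<>(best.values());
        merged.sort(Comparator.comparingDouble((Scored s) -> s.confidence).reversed());
        return merged;
    }
}
```

Because each surviving entry keeps its `classifier` field, the merged list still shows which classifier produced each category.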
Example Merged Result:
{
  "categories": [
    {
      "name": "artificial-intelligence",
      "confidence": 0.95,
      "classifier": "mcp-gpt4"                  ← AI model
    },
    {
      "name": "technology",
      "confidence": 0.88,
      "classifier": "ml-tfidf-v1.0"             ← Local ML
    },
    {
      "name": "programming",
      "confidence": 0.75,
      "classifier": "enhanced-rule-based-v1.0"  ← Rules
    }
  ]
}

Generated notes include:
---
source_url: https://example.com
title: Article Title
pdf_id: doc|uuid
fetchedAt: 2025-01-15T10:30:00Z
tags: [technology, page-snapshot]
---
# Article Title
## Description
Article description from meta tags
## Notes
Content from configurable meta tag
## Resources
[Download PDF](../pdfs/doc|uuid.pdf)
## Categories
- technology (confidence: 0.85)
## Metadata
- **Fetched:** 2025-01-15T10:30:00Z
- **PDF Size:** 1.2 MB
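Writing notes like the one above into the vault is where filename sanitization and path traversal prevention apply. The sketch below assumes a note's file name is derived from a title or ID, that unsafe characters (including the `|` in IDs such as `doc|uuid`) are replaced with underscores, and that the resolved path must stay inside the vault root. The class `VaultPaths` is hypothetical and not taken from the Cacummaro source.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustrative vault path handling; class and method names are hypothetical. */
final class VaultPaths {

    /** Replaces path separators, reserved characters and control characters with '_'. */
    static String sanitize(String name) {
        String cleaned = name.replaceAll("[\\\\/:*?\"<>|\\p{Cntrl}]", "_").trim();
        cleaned = cleaned.replaceAll("^\\.+", ""); // no leading dots ("..", hidden files)
        return cleaned.isEmpty() ? "untitled" : cleaned;
    }

    /** Resolves a note path and rejects anything that would land outside the vault. */
    static Path resolveNote(Path vaultRoot, String title) {
        Path root = vaultRoot.toAbsolutePath().normalize();
        Path note = root.resolve(sanitize(title) + ".md").normalize();
        if (!note.startsWith(root)) {
            throw new IllegalArgumentException("Note path escapes the vault: " + title);
        }
        return note;
    }

    public static void main(String[] args) {
        Path vault = Paths.get("obsidian-vault");
        System.out.println(resolveNote(vault, "doc|550e8400"));
        System.out.println(resolveNote(vault, "../../etc/passwd"));
    }
}
```

The `startsWith` check is deliberately redundant with the sanitizer: if the sanitization rules are later loosened, the containment check still prevents writes outside the vault.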
- CouchDB Connection Failed
  docker-compose logs couchdb
  curl http://localhost:5984/
- PDF Generation Timeout
  cacummaro:
    pdf:
      timeout: 60s # Increase timeout
- Java Version Issues
  java -version # Must be 11+
- Permission Issues (Linux/macOS)
  chmod +x start.sh
Application logs are available in the console output. Key components:
- org.cacummaro.service.pdf: PDF generation
- org.cacummaro.service.classification: Document classification
- org.cacummaro.repository: Database operations
- Implement the Classifier interface:
  @Service
  public class MyCustomClassifier implements Classifier {
      // Implementation
  }
- Register as Spring Bean
- Configure in application.yml

- Implement the PdfGenerator interface
- Replace or add as alternative implementation
- Web Interface: http://localhost:8082/
- Graph Visualization: http://localhost:8082/graph
- Health Endpoint: http://localhost:8082/actuator/health
- CouchDB Admin: http://localhost:5984/_utils/
- Application Metrics: Available via Spring Actuator
# Build WAR file
mvn clean package -Pwar
# Deploy to Tomcat
cp target/cacummaro-1.0-SNAPSHOT.war $TOMCAT_HOME/webapps/

MIT License - see LICENSE file for details.
- Web Interface with Thymeleaf
- Interactive Graph Visualization (D3.js)
- ML-based Classification (TF-IDF with PDF text extraction)
- MCP Integration (AI models via Model Context Protocol)
- Hybrid Classification (ML + AI + Rules)
- Enhanced MCP Features
- Multi-model voting (query multiple AI models, consensus-based results)
- Streaming classification results
- Fine-tuning integration APIs
- Classification Analytics Dashboard
- Accuracy metrics per classifier
- Confidence distribution charts
- Category suggestion improvements
- Batch Processing Queue for large-scale ingestion
- Docker Image with embedded ML models
- Kubernetes Deployment manifests
- Advanced Search with Elasticsearch integration
- Multi-user Support with role-based access
- API Authentication (OAuth2, JWT)
- Active Learning: User feedback improves classification
- Category recommendations based on document content
To run the application in debug mode (waiting for a debugger to attach on port 5005), use the following command in cmd.exe:
mvn spring-boot:run "-Dspring-boot.run.jvmArguments=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
- The application will wait for a debugger to connect before starting.
- You can then attach your IDE (e.g., IntelliJ, Eclipse) to localhost:5005.
- Make sure to use double quotes as shown above when running in Windows cmd.exe.