# 🔍 osint.scylla

A high-performance data management system powered by ScyllaDB


## 🎯 Overview

osint.scylla is an advanced data management solution inspired by platforms like IntelX and Snusbase. Built to handle billions of records efficiently, it offers powerful search capabilities and exceptional performance at scale. It processes, stores, and queries large CSV and TXT files with optimized memory usage and concurrent processing.

## ✨ Features

| 🚀 Performance | 🔄 Processing | 💾 Storage |
| --- | --- | --- |
| Multi-threading | CSV/TXT parsing | ScyllaDB backend |
| Async I/O | Batch operations | Memory optimization |
| Parallel processing | Progress tracking | Efficient indexing |

## 🛠️ Installation

```bash
# Clone the repository
git clone https://github.com/ZeraTS/osint.scylla
cd osint.scylla

# Set up a virtual environment
python -m venv venv
.\venv\Scripts\activate   # Windows
source venv/bin/activate  # Linux/macOS

# Install dependencies
pip install -r requirements.txt
```

## 💭 Frequently Asked Questions

### 🔧 Setup & Installation

**How do I install ScyllaDB?**

1. Download ScyllaDB from the official website
2. Follow the OS-specific installation instructions
3. Verify the installation: `scylla --version`
4. Start the service: `sudo systemctl start scylla-server`

**What are the system requirements?**

- Python 3.8 or higher
- ScyllaDB 5.1+
- Minimum 4 GB RAM
- SSD storage recommended
- Windows/Linux/macOS supported

**How do I troubleshoot connection issues?**

1. Verify ScyllaDB is running: `nodetool status`
2. Check that the default CQL port (9042) is open
3. Ensure the host/port in your config are correct
4. Check firewall settings (the connectivity check sketched below can help isolate the failure)
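
If the steps above all pass but the tool still cannot connect, a minimal standalone check with the Python driver narrows the problem down to either the database or the application. The host and port below are the defaults and may differ from your config:

```python
from cassandra.cluster import Cluster

# Default contact point and CQL port; adjust to match your config.
cluster = Cluster(["127.0.0.1"], port=9042)
try:
    session = cluster.connect()
    row = session.execute("SELECT release_version FROM system.local").one()
    print(f"Connected (release {row.release_version})")
except Exception as exc:
    print(f"Connection failed: {exc}")  # service down, port blocked, or bad host
finally:
    cluster.shutdown()
```
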
### 📊 Data Management

**What file formats are supported?**

- CSV files (`*.csv`)
- Text files (`*.txt`)
- JSON-formatted text files
- Line-delimited data

**How large can my files be?**

- Recommended: under 1 GB per batch
- Maximum: unlimited (files are processed in chunks; see the sketch below)
- Memory usage is optimized
- Large files are automatically partitioned
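
As an illustration of chunked processing, the following streams an arbitrarily large CSV in fixed-size slices so memory stays bounded. pandas and the chunk size are assumptions for the sketch, not necessarily what this project uses internally:

```python
import pandas as pd

CHUNK_ROWS = 50_000  # illustrative; tune to available RAM


def ingest_csv(path: str, write_batch) -> None:
    """Read a CSV of any size in bounded chunks and hand each to a writer."""
    for chunk in pd.read_csv(path, chunksize=CHUNK_ROWS, dtype=str):
        write_batch(chunk.to_dict(orient="records"))
```
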
**How do I optimize import speed?**

1. Use SSD storage
2. Increase the batch size
3. Enable parallel processing (see the concurrent-write sketch below)
4. Pre-format your data
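
As a sketch of what batched, parallel writes look like with the Python driver, the snippet below prepares a statement once and fans inserts out across many in-flight requests. The keyspace, table, and column names are hypothetical:

```python
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

session = Cluster(["127.0.0.1"]).connect("osint")  # hypothetical keyspace

# Prepared once, reused for every row; assumes email is the partition key.
insert = session.prepare("INSERT INTO records (email, username) VALUES (?, ?)")

rows = [("a@example.com", "alice"), ("b@example.com", "bob")]  # sample data

# Keep up to 100 requests in flight instead of writing one row at a time.
execute_concurrent_with_args(session, insert, rows, concurrency=100)
```
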
### 🔍 Search Operations

**How do I perform searches?**

Use the `field:value` format.
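
For example (the values are made up; valid fields are listed in the next question):

```
email:jane.doe@example.com
username:jdoe
phone_number:5551234567
```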

**What fields can I search?**

Primary fields:

- `email`
- `username`
- `first_name`
- `last_name`
- `phone_number`
- `city`
- `state`

**Are searches case-sensitive?**

- Email: case-sensitive
- Username: case-insensitive
- Names: case-insensitive
- Other fields: case-insensitive

### ⚡ Performance

**How do I handle large datasets?**

1. Enable chunked processing
2. Use batch operations
3. Implement proper indexing
4. Monitor memory usage

**How do I improve search speed?**

- Create custom indexes (see the sketch below)
- Use specific field searches
- Optimize query patterns
- Configure consistency levels
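
A secondary index can be created with plain CQL through the driver. The keyspace, table, and column names here are illustrative:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# Index a frequently searched non-key column to avoid full scans.
session.execute(
    "CREATE INDEX IF NOT EXISTS records_username_idx "
    "ON osint.records (username)"
)
```
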
**What are best practices for scaling?**

1. Use SSD storage
2. Configure proper memory allocation
3. Enable compression (see the example below)
4. Perform regular maintenance
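
For instance, compression can be enabled per table with CQL; the keyspace and table names are illustrative:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# Trade a little CPU for smaller SSTables on disk.
session.execute(
    "ALTER TABLE osint.records WITH compression = "
    "{'sstable_compression': 'LZ4Compressor'}"
)
```
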
### 🛡️ Security & Backup

**How secure is the data?**

- Transport encryption (TLS; see the connection sketch below)
- Authentication required
- Role-based access
- Audit logging available
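
A sketch of connecting with TLS and password authentication via the Python driver; the CA path and credentials are placeholders:

```python
import ssl

from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

# Validate the server certificate against your CA (placeholder path).
ssl_context = ssl.create_default_context(cafile="/path/to/ca.pem")
auth = PlainTextAuthProvider(username="scylla_user", password="change-me")

cluster = Cluster(["127.0.0.1"], ssl_context=ssl_context, auth_provider=auth)
session = cluster.connect()
```
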
**How do I back up data?**

1. Use ScyllaDB snapshots (`nodetool snapshot`)
2. Configure regular backups
3. Export data periodically
4. Maintain a backup strategy

**How do I manage permissions?**

- Create user roles (see the sketch below)
- Set access levels
- Configure authentication
- Monitor access logs
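
Roles and grants use standard CQL. This sketch assumes authentication is enabled on the cluster; the role name, password, and keyspace are illustrative:

```python
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

# Connect with a superuser account, then provision a read-only role.
auth = PlainTextAuthProvider(username="cassandra", password="cassandra")
session = Cluster(["127.0.0.1"], auth_provider=auth).connect()

session.execute(
    "CREATE ROLE IF NOT EXISTS analyst "
    "WITH PASSWORD = 'change-me' AND LOGIN = true"
)
session.execute("GRANT SELECT ON KEYSPACE osint TO analyst")
```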
