textractor 🚜

A simple text extractor for various file types. It includes a core library for extracting text from files, a command-line interface, a RESTful API, and Python bindings. The project is a work in progress.

How to use

There are four main ways to use textractor:

Command-line interface
Python bindings
RESTful API
Core functionality

Command-line interface

Install the CLI with cargo:

cargo install --git https://github.com/nleroy917/textractor

Then run the CLI with:

textractor <file>

Python bindings

The Python bindings are not yet available on PyPI, but you can install them from source. First, clone this repository:

git clone https://github.com/nleroy917/textractor

The build requires the maturin package, so make sure it is installed first:

pip install maturin

Then install the Python bindings with:

cd textractor/textractor-py
make install

RESTful API

There is also a web server built with axum that can be run with:

cd textractor-web
cargo run --release

Core functionality

Finally, you can use the core functionality in your own Rust project. Add the following to your Cargo.toml:

[dependencies]
textractor = { git = "https://github.com/nleroy917/textractor" }

Then you can use the library in your project with:

// This example also needs the anyhow crate in [dependencies].
use std::io::Read;

use textractor::extraction::extract;

fn main() -> anyhow::Result<()> {
    let path = std::path::Path::new("path/to/file");
    let file = std::fs::File::open(path)?;
    let mut reader = std::io::BufReader::new(file);
    let mut data = Vec::new();

    // Read the whole file into memory.
    reader.read_to_end(&mut data)?;

    // extract returns None when the file type is not supported.
    let text = match extract(&data)? {
        Some(text) => text,
        None => return Err(anyhow::anyhow!("Unsupported file type")),
    };

    println!("{}", text);
    Ok(())
}
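
If you extract text from more than one file, the read-then-extract steps above can be wrapped in a small helper. The following is only a sketch: `extract_path`, `BoxError`, and `utf8_only` are hypothetical names that are not part of textractor, and the `Result<Option<String>, _>` return shape of the extractor is assumed from the example above. It uses only the standard library, so the extractor is passed in as a closure or function rather than calling textractor directly.

```rust
use std::error::Error;
use std::fs;
use std::path::Path;

/// Boxed error type so this sketch needs nothing beyond the standard library.
type BoxError = Box<dyn Error + Send + Sync>;

/// Reads `path` into memory and passes the bytes to `extractor`.
///
/// `extractor` is a stand-in for `textractor::extraction::extract`; its
/// `Result<Option<String>, _>` shape is an assumption, not a confirmed API.
pub fn extract_path<F>(path: &Path, extractor: F) -> Result<String, BoxError>
where
    F: Fn(&[u8]) -> Result<Option<String>, BoxError>,
{
    let data = fs::read(path)?;
    match extractor(&data)? {
        Some(text) => Ok(text),
        // `None` is treated as "unsupported file type", as in the example above.
        None => Err(format!("Unsupported file type: {}", path.display()).into()),
    }
}

/// Hypothetical extractor for illustration only: accepts valid UTF-8 input
/// and reports everything else as unsupported.
pub fn utf8_only(data: &[u8]) -> Result<Option<String>, BoxError> {
    Ok(std::str::from_utf8(data).ok().map(str::to_owned))
}
```

Calling `extract_path(path, utf8_only)` returns the file contents for plain UTF-8 text and an error otherwise. In a real project you would replace `utf8_only` with a closure that calls `extract` and converts its error into `BoxError`.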

I am prioritizing PPTX and XLSX support, as well as improving text extraction for PDFs.
