A simple text extractor for various file types. It includes core functionality for extracting text from files, a command-line interface, a RESTful API, and Python bindings. The project is a work in progress.
There are four main ways to use textractor:
- Command-line interface
- Python bindings
- RESTful API
- Core functionality
Install the CLI with cargo:
cargo install --git https://github.com/nleroy917/textractor
Then run the CLI with:
textractor <file>
The Python bindings are not yet available on PyPI, but you can install them from source. First, clone this repository:
git clone https://github.com/nleroy917/textractor
Then install the Python bindings with:
cd textractor/textractor-py
make install
The install step requires the maturin package. If you do not have it yet, install it first with:
pip install maturin
There is also a web server built with axum that can be run with:
cd textractor-web
cargo run --release
Finally, you can use the core functionality in your own Rust project. Add the following to your Cargo.toml:
[dependencies]
textractor = { git = "https://github.com/nleroy917/textractor" }
Then you can use the library in your project with:
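use std::io::Read; // brings read_to_end into scope

use textractor::extraction::extract;

// This example uses anyhow for error handling, so anyhow must also be listed
// in your dependencies. It assumes the error returned by extract converts
// into anyhow::Error.
fn main() -> anyhow::Result<()> {
    let path = std::path::Path::new("path/to/file");
    let file = std::fs::File::open(path)?;
    let mut reader = std::io::BufReader::new(file);

    // Read the whole file into memory
    let mut data = Vec::new();
    reader.read_to_end(&mut data)?;

    // extract yields None when the file type is not supported
    let text = match extract(&data)? {
        Some(text) => text,
        None => return Err(anyhow::anyhow!("Unsupported file type")),
    };

    println!("{}", text);
    Ok(())
}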
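If you read many files, the read-then-extract steps above can be wrapped in a small helper. The following is only a sketch using the standard library: extract_from_path and utf8_extractor are hypothetical names that are not part of textractor, and the extractor parameter stands in for textractor::extraction::extract, whose exact signature and error type are assumed here rather than confirmed.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: reads the file at `path` and passes its bytes to
/// `extractor`. The extractor returns None for unsupported file types,
/// which is turned into an InvalidData error.
fn extract_from_path<F>(path: &Path, extractor: F) -> io::Result<String>
where
    F: Fn(&[u8]) -> Option<String>,
{
    // fs::read opens the file and reads it fully into a Vec<u8>
    let data = fs::read(path)?;
    extractor(&data)
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "Unsupported file type"))
}

/// Hypothetical stand-in extractor for illustration only: accepts any
/// valid UTF-8 input as plain text and rejects everything else.
fn utf8_extractor(data: &[u8]) -> Option<String> {
    std::str::from_utf8(data).ok().map(str::to_owned)
}
```

In a real project you would pass a closure that calls extract and adapts its result, for example by mapping an extraction error to None or to your own error type.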
I am prioritizing PPTX and XLSX support, as well as improving text extraction for PDFs.
- Text (txt)
- Word (docx)
- PowerPoint (pptx)
- Excel (xlsx)
- Images (png, jpg, etc)