🐛 Describe the bug
When I run
python -m olmocr.pipeline s3://my-bucket/workspace --pdfs s3://my-bucket/inputs/*.pdf
PDF files with a comma in their name are not processed.
For example, if a file is named "test, ab.pdf", this warning appears in the logs:
WARNING:olmocr.s3_utils:Attempt 8 failed to get_s3_bytes for ab.pdf: s3_path must start with s3://, gs://, or weka://.
Note that the warning names only ab.pdf, the part of the filename after the comma.
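One possible explanation, which is an assumption and has not been checked against the olmocr source, is that the list of PDF paths is joined with commas at some point and later split on commas again. The sketch below is not olmocr code; `naive_split`, `has_valid_prefix`, `csv_join`, and `csv_split` are hypothetical helpers that only illustrate how such a round trip would produce a fragment like the one in the warning, and how quoting the fields would avoid it.

```python
import csv
import io

# Prefixes taken from the warning message in the logs.
VALID_PREFIXES = ("s3://", "gs://", "weka://")


def naive_split(line):
    """Hypothetical fragile parser: splits a comma-joined line of paths."""
    return line.split(",")


def has_valid_prefix(path):
    """Mimics the prefix check described in the warning (assumed logic)."""
    return path.startswith(VALID_PREFIXES)


def csv_join(paths):
    """Serialize paths as one CSV row; fields containing commas get quoted."""
    buf = io.StringIO()
    csv.writer(buf).writerow(paths)
    return buf.getvalue()


def csv_split(line):
    """Parse one CSV row back into the original list of paths."""
    return next(csv.reader(io.StringIO(line)))


paths = [
    "s3://my-bucket/inputs/test, ab.pdf",
    "s3://my-bucket/inputs/other.pdf",
]

# Fragile round trip: the comma inside the filename becomes a separator.
parts = naive_split(",".join(paths))
rejected = [p for p in parts if not has_valid_prefix(p)]
print(rejected)  # [' ab.pdf']

# Quoted round trip: the original paths come back unchanged.
restored = csv_split(csv_join(paths))
print(restored == paths)  # True
```

In the fragile round trip the first fragment, `s3://my-bucket/inputs/test`, still passes the prefix check but no longer points at a real object, so the file would be skipped either way.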
Versions
Python 3.11.11
aiohappyeyeballs==2.4.6
aiohttp==3.11.13
aiosignal==1.3.2
annotated-types==0.7.0
anthropic==0.47.2
anyio==4.8.0
asttokens==3.0.0
attrs==25.1.0
beaker-py==1.34.1
bleach==6.2.0
boto3==1.37.1
botocore==1.37.1
cached_path==1.6.7
cachetools==5.5.2
certifi==2025.1.31
cffi==1.17.1
charset-normalizer==3.4.1
click==8.1.8
cloudpickle==3.1.1
compressed-tensors==0.8.0
cryptography==44.0.1
cuda-bindings==12.8.0
cuda-python==12.8.0
datasets==3.3.2
decorator==5.2.1
decord==0.6.0
dill==0.3.8
diskcache==5.6.3
distro==1.9.0
docker==7.1.0
einops==0.8.1
executing==2.2.0
fastapi==0.115.8
filelock==3.17.0
flashinfer==0.1.6+cu124torch2.4
frozenlist==1.5.0
fsspec==2024.12.0
ftfy==6.3.1
fuzzysearch==0.7.3
gguf==0.10.0
google-api-core==2.24.1
google-auth==2.38.0
google-cloud-core==2.4.2
google-cloud-storage==2.19.0
google-crc32c==1.6.0
google-resumable-media==2.7.2
googleapis-common-protos==1.68.0
h11==0.14.0
hf_transfer==0.1.9
httpcore==1.0.7
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.27.1
idna==3.10
importlib_metadata==8.6.1
iniconfig==2.0.0
interegular==0.3.3
ipython==8.32.0
jedi==0.19.2
Jinja2==3.1.5
jiter==0.8.2
jmespath==1.0.1
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
lark==1.2.2
lingua-language-detector==2.0.2
litellm==1.61.16
llvmlite==0.44.0
lm-format-enforcer==0.10.10
markdown-it-py==3.0.0
markdown2==2.5.3
MarkupSafe==3.0.2
matplotlib-inline==0.1.7
mdurl==0.1.2
mistral_common==1.5.3
modelscope==1.23.1
mpmath==1.3.0
msgpack==1.1.0
msgspec==0.19.0
multidict==6.1.0
multiprocess==0.70.16
nest-asyncio==1.6.0
networkx==3.4.2
numba==0.61.0
numpy==1.26.4
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-cusparselt-cu12==0.6.2
nvidia-ml-py==12.570.86
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
-e git+https://github.com/allenai/olmocr.git@d4b902cea235bb64a252d1e3f53cad41e22eb6ea#egg=olmocr
openai==1.64.0
opencv-python-headless==4.11.0.86
orjson==3.10.15
outlines==0.0.46
packaging==24.2
pandas==2.2.3
parso==0.8.4
partial-json-parser==0.2.1.1.post5
pexpect==4.9.0
pillow==11.1.0
pluggy==1.5.0
prometheus-fastapi-instrumentator==7.0.2
prometheus_client==0.21.1
prompt_toolkit==3.0.50
propcache==0.3.0
proto-plus==1.26.0
protobuf==5.29.3
psutil==7.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
py-cpuinfo==9.0.0
pyairports==2.1.1
pyarrow==19.0.1
pyasn1==0.6.1
pyasn1_modules==0.4.1
pybind11==2.13.6
pycountry==24.6.1
pycparser==2.22
pydantic==2.10.6
pydantic_core==2.27.2
Pygments==2.19.1
pypdf==5.3.0
pypdfium2==4.30.1
pytest==8.3.4
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.20
pytz==2025.1
PyYAML==6.0.2
pyzmq==26.2.1
RapidFuzz==3.12.1
ray==2.42.1
referencing==0.36.2
regex==2024.11.6
requests==2.32.3
rich==13.9.4
rpds-py==0.23.1
rsa==4.9
s3transfer==0.11.2
safetensors==0.5.3
sentencepiece==0.2.0
setproctitle==1.3.5
sgl-kernel==0.0.3.post1
sglang==0.4.2
six==1.17.0
smart-open==7.1.0
sniffio==1.3.1
stack-data==0.6.3
starlette==0.45.3
sympy==1.13.1
tiktoken==0.9.0
tokenizers==0.21.0
torch==2.5.1
torchao==0.8.0
torchvision==0.20.1
tqdm==4.67.1
traitlets==5.14.3
transformers==4.49.0
triton==3.1.0
typing_extensions==4.12.2
tzdata==2025.1
urllib3==2.3.0
uvicorn==0.34.0
uvloop==0.21.0
vllm==0.6.4.post1
watchfiles==1.0.4
wcwidth==0.2.13
webencodings==0.5.1
websockets==15.0
wrapt==1.17.2
xformers==0.0.28.post3
xgrammar==0.1.13
xxhash==3.5.0
yarl==1.18.3
zipp==3.21.0
zstandard==0.23.0