isaacbrodsky/duckdb-zipfs

This is a DuckDB extension that adds support for reading files from within zip archives and other archive formats such as tar.

Get started

Load from the community extensions repository:

INSTALL zipfs FROM community;
LOAD zipfs;

To read a file:

SELECT * FROM 'zip://examples/a.zip/a.csv';

To read a file from Azure Blob Storage (or another file system):

SELECT * FROM 'zip://az://yourstorageaccount.blob.core.windows.net/yourcontainer/examples/a.zip/a.csv';

To read the table of contents of a zip file:

SELECT * FROM archive_contents('examples/a.zip');

File names

URL quick reference                     Description
zip://a.zip/*.csv                       Local zip file named a.zip, containing csv files.
zip://http://example.com/a.zip/*.csv    Web hosted zip file named a.zip, containing csv files.
archive://a.tar.gz!!*.csv               Local archive file named a.tar.gz, containing csv files.
compressed://a.jsonl.bz2                Local compressed ndjson file a.jsonl.bz2.

Function            Description
zip_contents        Read the table of contents of a zip file
archive_contents    Read the table of contents of an archive file

File names passed into the zip:// URL scheme are expected to end with .zip, which indicates the end of the zip file name. The path after that is taken to be the file path within the zip archive.

Globbing within the zip archive is supported, but see below for performance limitations. A glob query looks like:

SELECT * FROM 'zip://examples/a.zip/*.csv';

Globbing for multiple zip files:

SELECT * FROM 'zip://examples/*.zip/*.csv';

You can set the zipfs_split option to turn off splitting on the .zip extension and instead split on a string of your choice:

SET zipfs_split = "!!";

SELECT * FROM 'zip://examples/a.zip!!b.csv';
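
The two splitting rules can be illustrated with a small Python sketch. This is not the extension's implementation: the helper name split_zip_url is hypothetical, and its exact behavior (splitting at the first .zip that is followed by / or ends the string, or at the first occurrence of a custom separator) is an assumption made for illustration.

```python
def split_zip_url(url, split=None):
    """Hypothetical helper: return (archive_path, inner_path) for a zip:// URL."""
    prefix = "zip://"
    if not url.startswith(prefix):
        raise ValueError("expected a zip:// URL")
    rest = url[len(prefix):]

    if split is not None:
        # Custom separator mode (compare zipfs_split): split at its first occurrence.
        archive, sep, inner = rest.partition(split)
        if not sep:
            raise ValueError("separator not found in URL")
        return archive, inner

    # Default mode (assumed): the archive name ends at the first ".zip"
    # that is followed by "/" or by the end of the string.
    start = 0
    while True:
        idx = rest.find(".zip", start)
        if idx == -1:
            raise ValueError("no .zip extension found in URL")
        end = idx + len(".zip")
        if end == len(rest) or rest[end] == "/":
            return rest[:end], rest[end + 1:]
        start = end


default_parts = split_zip_url("zip://examples/a.zip/a.csv")
custom_parts = split_zip_url("zip://examples/a.zip!!b.csv", split="!!")
```

Here default_parts is ("examples/a.zip", "a.csv") and custom_parts is ("examples/a.zip", "b.csv"). A custom separator is useful when the archive name does not end in .zip, or when .zip appears elsewhere in the path.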

With zipfs_split set, you can also read other archive formats supported by libarchive. Note the different URL scheme (archive://), and that libarchive is not available on Windows:

SET zipfs_split = "!!";

SELECT * FROM 'archive://examples/a.tar.gz!!b.csv';

It is also possible to read from a variety of compressed file formats directly:

SELECT * FROM read_json('compressed://examples/a.jsonl.bz2');

Archive vs zip

This extension supports both zip files and archive files. Zip file support uses miniz; archive file support uses libarchive, which handles a wider range of compression algorithms and container formats. libarchive is not available on Windows, and using the archive features there will result in an error.

Performance considerations

This extension is intended more for convenience than high performance. It does not implement a file metadata cache as tarfs (on which this extension is based) does. As such, operations which require the central directory (index) of the zip file, such as globbing files, must reread the central directory multiple times: once for the glob and once for each file to open.
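
To see why globbing depends on the central directory, the sketch below uses Python's standard zipfile module rather than this extension. It builds a small archive in memory (the member names are made up), lists the central directory once to match a glob pattern, and then reopens the archive for each match, which mirrors the repeated directory reads described above. It is an illustration of the access pattern only, not of how the extension is written.

```python
import fnmatch
import io
import zipfile


def build_example_zip():
    """Create a small in-memory zip with hypothetical member names."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("a.csv", "id,name\n1,x\n")
        zf.writestr("b.csv", "id,name\n2,y\n")
        zf.writestr("notes.txt", "not a csv\n")
    return buf.getvalue()


def glob_members(zip_bytes, pattern):
    """Match member names against a glob; this reads the central directory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return sorted(n for n in zf.namelist() if fnmatch.fnmatchcase(n, pattern))


def read_member(zip_bytes, name):
    """Reopen the archive (reading the central directory again) and read one member."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.read(name)


data = build_example_zip()
matches = glob_members(data, "*.csv")                             # one directory read
contents = {name: read_member(data, name) for name in matches}    # one more per file
```

For an archive with many members, the per-file directory reads dominate, so naming a single file is cheaper than a broad glob.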

The selected file will be read entirely into memory, not streamed. Therefore it cannot be used to read files which are larger than memory when uncompressed.
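
Because a member is decompressed fully into memory, it can help to check its uncompressed size before querying it. A minimal sketch using Python's standard zipfile module (not this extension); the helper name members_over_budget and the byte budget are illustrative assumptions:

```python
import io
import zipfile


def members_over_budget(zip_file, budget_bytes):
    """Return (name, uncompressed_size) for members larger than budget_bytes."""
    with zipfile.ZipFile(zip_file) as zf:
        return [
            (info.filename, info.file_size)  # file_size is the uncompressed size
            for info in zf.infolist()
            if info.file_size > budget_bytes
        ]


# Self-contained demo with an in-memory archive and made-up member names.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("small.csv", "a\n1\n")
    zf.writestr("big.csv", "x" * 10_000)
buf.seek(0)

too_big = members_over_budget(buf, budget_bytes=1_000)
```

The zip_file argument can also be a path on disk. Members reported here would need to be extracted and read by other means if they exceed available memory.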

Development

First, clone and bootstrap vcpkg in a local vcpkg directory:

git clone https://github.com/Microsoft/vcpkg.git
./vcpkg/bootstrap-vcpkg.sh
export VCPKG_TOOLCHAIN_PATH=`pwd`/vcpkg/scripts/buildsystems/vcpkg.cmake

Then:

GEN=ninja make release
make test_release

License

duckdb-zipfs Copyright 2025 Isaac Brodsky. Licensed under the MIT License.

DuckDB Copyright 2018-2022 Stichting DuckDB Foundation (MIT License)

miniz Copyright 2013-2014 RAD Game Tools and Valve Software Copyright 2010-2014 Rich Geldreich and Tenacious Software LLC (MIT License)

DuckDB extension-template Copyright 2018-2022 DuckDB Labs BV (MIT License)

duckdb_tarfs (MIT license)

libarchive Copyright 2003-2018 Tim Kientzle (varying licenses, see repo)

