This is a DuckDB extension that adds support for reading files from within zip archives and other archive formats such as tar.

Get started

Install and load the extension from the community extensions repository:

INSTALL zipfs FROM community;
LOAD zipfs;

To read a file:

SELECT * FROM 'zip://examples/a.zip/a.csv';

To read a file from Azure Blob Storage (or another file system):

SELECT * FROM 'zip://az://yourstorageaccount.blob.core.windows.net/yourcontainer/examples/a.zip/a.csv';

To read the table of contents of a zip file:

SELECT * FROM archive_contents('examples/a.zip');

File names

URL quick reference	Description
`zip://a.zip/*.csv`	Local zip file named `a.zip`, containing csv files.
`zip://http://example.com/a.zip/*.csv`	Web hosted zip file named `a.zip`, containing csv files.
`archive://a.tar.gz!!*.csv`	Local archive file named `a.tar.gz`, containing csv files.
`compressed://a.jsonl.bz2`	Local compressed ndjson file `a.jsonl.bz2`.

Function	Description
`zip_contents`	Read the table of contents of a zip file
`archive_contents`	Read the table of contents of an archive file

Paths passed to the zip:// URL scheme are expected to contain a zip file name ending in .zip, which marks the end of the archive's own path. Everything after that is taken to be the file path within the zip archive.

Globbing within the zip archive is supported, but see below for performance limitations. A glob query looks like:

SELECT * FROM 'zip://examples/a.zip/*.csv';

Globbing for multiple zip files:

SELECT * FROM 'zip://examples/*.zip/*.csv';

You can turn off the .zip suffix detection and instead choose a string to split on by setting the zipfs_split option:

SET zipfs_split = "!!";

SELECT * FROM 'zip://examples/a.zip!!b.csv';
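
The two splitting rules described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the extension's actual code: the function name `split_zip_url` is hypothetical, and the default rule here assumes the archive name ends at the first `.zip` that is followed by `/` or by the end of the string. The extension may resolve ambiguous paths differently.

```python
def split_zip_url(url, split=None):
    """Split a zip:// URL into (archive path, path inside the archive).

    Assumption: with no separator set, the archive name ends at the first
    ".zip" that is followed by "/" or by the end of the string.
    """
    prefix = "zip://"
    if not url.startswith(prefix):
        raise ValueError("not a zip:// URL: " + url)
    rest = url[len(prefix):]

    if split is not None:
        # Custom separator: everything before it is the archive path.
        archive, sep, inner = rest.partition(split)
        if not sep:
            raise ValueError("separator not found in: " + url)
        return archive, inner

    # Default: look for the ".zip" suffix that ends the archive name.
    start = 0
    while True:
        idx = rest.find(".zip", start)
        if idx == -1:
            raise ValueError("no .zip file name in: " + url)
        end = idx + len(".zip")
        if end == len(rest) or rest[end] == "/":
            return rest[:end], rest[end + 1:]
        # ".zip" was only part of a longer name; keep searching.
        start = end
```

Under these assumptions, `split_zip_url('zip://examples/a.zip/a.csv')` yields the archive path `examples/a.zip` and the inner path `a.csv`, while passing `split='!!'` handles `zip://examples/a.zip!!b.csv` without relying on the `.zip` suffix at all.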

Using zipfs_split also lets you read other archive formats supported by libarchive. Note the different URL scheme (archive://), and that libarchive is not available on Windows:

SET zipfs_split = "!!";

SELECT * FROM 'archive://examples/a.tar.gz!!b.csv';

It is also possible to read from a variety of compressed file formats directly:

SELECT * FROM read_json('compressed://examples/a.jsonl.bz2');

Archive vs zip

This extension supports both zip files and archive files. Zip support uses miniz; archive support uses libarchive, which handles a wider range of compression algorithms and container formats. libarchive is not available on Windows, so using archive support there results in an error.

Performance considerations

This extension is intended more for convenience than for high performance. Unlike tarfs (on which this extension is based), it does not implement a file metadata cache. As a result, operations that need the central directory (index) of the zip file, such as globbing, must reread the central directory multiple times: once for the glob and once for each file that is opened.

The selected file will be read entirely into memory, not streamed. Therefore it cannot be used to read files which are larger than memory when uncompressed.
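
To make the cost model concrete, here is a rough sketch using Python's standard `zipfile` module rather than the extension's miniz-based code. The helper names and example data are hypothetical. It reopens the archive for the glob and again for every matched member, which mimics working without a metadata cache, and it reads each member fully into memory.

```python
import fnmatch
import io
import zipfile


def make_example_zip():
    """Build a small zip archive in memory (hypothetical example data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("a.csv", "id,name\n1,x\n")
        zf.writestr("b.csv", "id,name\n2,y\n")
        zf.writestr("notes.txt", "not a csv\n")
    return buf.getvalue()


def read_glob_without_cache(zip_bytes, pattern):
    """Read members matching pattern, reopening the archive for each step.

    Returns (contents by member name, number of central directory reads).
    """
    directory_reads = 0

    # Pass 1: opening the archive parses the central directory for the glob.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        directory_reads += 1
        matches = sorted(fnmatch.filter(zf.namelist(), pattern))

    contents = {}
    for name in matches:
        # Without a metadata cache, each file open parses the directory again.
        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
            directory_reads += 1
            # read() returns the whole decompressed member as one bytes object.
            contents[name] = zf.read(name)
    return contents, directory_reads
```

Calling `read_glob_without_cache(make_example_zip(), "*.csv")` matches two members and parses the central directory three times, once for the glob plus once per file. For archives with many members, or archives on remote storage, that repeated work is what dominates the cost.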

Development

First, clone and bootstrap vcpkg, then point VCPKG_TOOLCHAIN_PATH at its CMake toolchain file:

git clone https://github.com/Microsoft/vcpkg.git
./vcpkg/bootstrap-vcpkg.sh
export VCPKG_TOOLCHAIN_PATH=`pwd`/vcpkg/scripts/buildsystems/vcpkg.cmake

Then build the extension and run the tests:

GEN=ninja make release
make test_release

License

duckdb_tarfs (MIT license)
