DuckDB BigQuery Extension

This repository contains the DuckDB BigQuery Extension. This extension lets DuckDB integrate with Google BigQuery, allowing users to directly access, manage, and manipulate their BigQuery datasets/tables using standard SQL statements. Inspired by official DuckDB RDBMS extensions like MySQL, PostgreSQL, and SQLite, this extension offers a similar feel.

Please note that this extension is in its initial release. While it covers most of the essentials, you might run into some limitations, issues, or bugs. So far, this extension has only been tested on Linux systems.

This extension only supports the following builds: linux_amd64, linux_amd64_gcc4, osx_arm64, and windows_amd64.

Preliminaries

Authentication Option 1: Configure ADC with your Google Account

To authenticate using your Google Account, first install the Google Cloud CLI (gcloud). Download the latest version from the Google Cloud CLI installation page and follow the instructions to select and authenticate your Google Cloud project for using BigQuery.

After installation, run the following command to authenticate and follow the steps along:

gcloud auth application-default login

Authentication Option 2: Configure Service account keys

Alternatively, you can authenticate using a service account. First, create a service account in the Google Cloud Console, assign the necessary roles, and download the JSON key file. Next, set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the file path. For example:

# On Linux or macOS
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/my/service-account-credentials.json"

# On Windows
set GOOGLE_APPLICATION_CREDENTIALS="C:\path\to\my\service-account-credentials.json"

Windows gRPC Configuration

On Windows, gRPC requires an additional environment variable to configure the trust store for SSL certificates. Download and configure it using (see official documentation):

@powershell -NoProfile -ExecutionPolicy unrestricted -Command ^
    (new-object System.Net.WebClient).Downloadfile( ^
        'https://pki.google.com/roots.pem', 'roots.pem')
set GRPC_DEFAULT_SSL_ROOTS_FILE_PATH=%cd%\roots.pem

This downloads the roots.pem file and sets the GRPC_DEFAULT_SSL_ROOTS_FILE_PATH environment variable to its location.

Quickstart

The BigQuery extension can be installed from the official Community Extension Repository, eliminating the need to enable the unsigned mode. Just use the following command to install and load the extension:

-- Install and load the DuckDB BigQuery extension from the Community Repository
FORCE INSTALL 'bigquery' FROM community;
LOAD 'bigquery';

Note: Windows user require an additional step to configure the gRPC SSL certificates (see here).

After loading the extension, you can connect to your BigQuery project using the ATTACH statement. Replace my_gcp_project with the name of your actual Google Cloud Project. Here is an example:

-- Attach to your BigQuery Project
D ATTACH 'project=my_gcp_project' as bq (TYPE bigquery, READ_ONLY);

-- Show all tables in all datasets in the attached BigQuery project
D SHOW ALL TABLES;
┌──────────┬──────────────────┬──────────┬──────────────┬───────────────────┬───────────┐
│ database │      schema      │   name   │ column_names │   column_types    │ temporary │
│ varchar  │     varchar      │  varchar │  varchar[]   │     varchar[]     │  boolean  │
├──────────┼──────────────────┼──────────┼──────────────┼───────────────────┼───────────┤
│ bq       │ quacking_dataset │ duck_tbl │ [i, s]       │ [BIGINT, VARCHAR] │ false     │
| bq       | barking_dataset  | dog_tbl  | [i, s]       | [BIGINT, VARCHAR] │ false     |
└──────────┴──────────────────┴──────────┴──────────────┴───────────────────┴───────────┘

-- Select data from a specific table in BigQuery
D SELECT * FROM bq.quacking_dataset.duck_tbl;
┌───────┬────────────────┐
│   i   │       s        │
│ int32 │    varchar     │
├───────┼────────────────┤
│    12 │ quack 🦆       │
│    13 │ quack quack 🦆 │
└───────┴────────────────┘

Depending on the number of schemas and tables, initializing the BigQuery catalog may take some time. However, once initialized, the tables are cached. To speed up this process, you also focus the loading process on a particular dataset by specifying a dataset= parameter as follows.

-- Attach to your BigQuery Project
D ATTACH 'project=my_gcp_project dataset=quacking_dataset' as bq (TYPE bigquery);

-- Show all tables in all datasets in the attached BigQuery project
D SHOW ALL TABLES;
┌──────────┬──────────────────┬──────────┬──────────────┬───────────────────┬───────────┐
│ database │      schema      │   name   │ column_names │   column_types    │ temporary │
│ varchar  │     varchar      │  varchar │  varchar[]   │     varchar[]     │  boolean  │
├──────────┼──────────────────┼──────────┼──────────────┼───────────────────┼───────────┤
│ bq       │ quacking_dataset │ duck_tbl │ [i, s]       │ [BIGINT, VARCHAR] │ false     │
└──────────┴──────────────────┴──────────┴──────────────┴───────────────────┴───────────┘

Additional Operations and Settings

The following SQL statements provide a brief overview of supported functionalities and include examples for interacting with BigQuery:

ATTACH 'project=my_gcp_project' as bq (TYPE bigquery);

-- Create a BigQuery dataset
CREATE SCHEMA bq.some_dataset;

-- Create a BigQuery table in some dataset
CREATE TABLE bq.some_dataset.tbl(id INTEGER, some_string VARCHAR);

-- Insert values into the table 
INSERT INTO bq.some_dataset.tbl VALUES (42, "my quacking string");

-- Retrieves rows from the table
SELECT some_string FROM bq.some_dataset.tbl;

-- Drop a BigQury table in some dataset
DROP TABLE IF EXISTS bq.some_dataset.tbl;

-- Drop a BigQuery dataset
DROP SCHEMA IF EXISTS bq.some_dataset;

-- Altering tables - rename table
ALTER TABLE bq.some_dataset.tbl RENAME TO tbl_renamed;

-- Altering tables - rename column
ALTER TABLE bq.some_dataset.tbl RENAME COLUMN i TO i2;

-- Altering tables - add column
ALTER TABLE bq.some_dataset.tbl ADD COLUMN j INTEGER;

-- Altering tables - drop column
ALTER TABLE bq.some_dataset.tbl DROP COLUMN i;

-- Altering tables - change column type
ALTER TABLE bq.some_dataset.tbl ALTER COLUMN i TYPE DOUBLE;

-- Altering tables - drop not null condition
ALTER TABLE bq.some_dataset.tbl ALTER COLUMN i DROP NOT NULL;

`bigquery_scan` Function

If you would rather query just one table directly instead of attaching all tables, you can achieve this by directly using the bigquery_scan functions, such as:

D SELECT * FROM bigquery_scan('my_gcp_project.quacking_dataset.duck_tbl');
┌───────┬────────────────┐
│   i   │       s        │
│ int32 │    varchar     │
├───────┼────────────────┤
│    12 │ quack 🦆       │
│    13 │ quack quack 🦆 │
└───────┴────────────────┘

`bigquery_execute` Function

The bigquery_execute function lets you run arbitrary queries directly in BigQuery in native Google SQL. These queries are executed without interpretation by DuckDB:

D ATTACH 'project=my_gcp_project' as bq (TYPE bigquery);
D CALL bigquery_execute('bq', '
    CREATE SCHEMA deluxe_dataset
    OPTIONS(
        location="us",
        default_table_expiration_days=3.75,
        labels=[("label1","value1"),("label2","value2")]
    )
');

`bigquery_select` Function

The bigquery_select function is similar to bigquery_scan, but it allows you to specify a SQL query string to be executed directly in BigQuery. This function is especially useful to get around the limitations of the BigQuery Storage Read API, such as reading from views or external tables.

D SELECT * FROM bigquery_select('my_gcp_project', 'SELECT * FROM `my_gcp_project.quacking_dataset.duck_tbl`');

It executes the query without further analyzing it in BigQuery. BigQuery transparently creates a temporary result table, which is fetched in exactly the same way as with bigquery_scan. The result table is a temporary table which will be deleted shortly after the query is executed.

`bigquery_clear_cache` Function

DuckDB caches schema metadata, such as datasets and table structures, to avoid repeated fetches from BigQuery. If the schema changes externally, use bigquery_clear_cache to update the cache and retrieve the latest schema information:

D CALL bigquery_clear_cache();

Reading Public Datasets

Public datasets can be accessed by specifying your project as a billing_project. All queries will then be executed and billed on that project. This works for functions such as bigquery_scan, bigquery_execute, and the ATTACH command.

D SELECT * FROM bigquery_scan('bigquery-public-data.geo_us_boundaries.cnecta', billing_project='my_gcp_project');
┌─────────┬──────────────────┬──────────────────────┬───┬───────────────────┬───────────────┬───────────────┬──────────────────────┐
│ geo_id  │ cnecta_fips_code │         name         │ … │ area_water_meters │ int_point_lat │ int_point_lon │     cnecta_geom      │
│ varchar │     varchar      │       varchar        │   │       int64       │    double     │    double     │       varchar        │
├─────────┼──────────────────┼──────────────────────┼───┼───────────────────┼───────────────┼───────────────┼──────────────────────┤
│ 710     │ 710              │ Augusta-Waterville…  │ … │         183936850 │    44.4092939 │   -69.6901717 │ POLYGON((-69.79281…  │
│ 775     │ 775              │ Portland-Lewiston-…  │ … │        1537827560 │    43.8562034 │   -70.3192682 │ POLYGON((-70.48007…  │
│ 770     │ 770              │ Pittsfield-North A…  │ … │          24514153 │    42.5337519 │   -73.1678825 │ POLYGON((-73.30698…  │
│ 790     │ 790              │ Springfield-Hartfo…  │ … │         256922043 │    42.0359069 │   -72.6213616 │ POLYGON((-72.63682…  │
│ 715     │ 715              │ Boston-Worcester-P…  │ … │        3004814151 │    42.3307869 │   -71.3296644 │ MULTIPOLYGON(((-71…  │
│ 725     │ 725              │ Lebanon-Claremont,…  │ … │          58476158 │    43.6727226 │   -72.2484543 │ POLYGON((-72.39601…  │
│ 720     │ 720              │ Bridgeport-New Hav…  │ … │         374068423 │    41.3603421 │   -73.1284227 │ MULTIPOLYGON(((-72…  │
├─────────┴──────────────────┴──────────────────────┴───┴───────────────────┴───────────────┴───────────────┴──────────────────────┤
│ 7 rows                                                                                                      11 columns (7 shown) │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Additional Extension Settings

Settings	Description	Default
bq_debug_show_queries	[DEBUG] - whether to print all queries sent to BigQuery to stdout	`false`
bq_experimental_filter_pushdown	[EXPERIMENTAL] - Whether or not to use filter pushdown	`true`
bq_curl_ca_bundle_path	Path to the CA certificates used by cURL for SSL certificate verification

Limitations

There are some limitations that arise from the combination of DuckDB and BigQuery. These include:

Reading from Views: This DuckDB extension utilizes the BigQuery Storage Read API to optimize reading results. However, this approach has limitations (see here for more information). First, the Storage Read API does not support direct reading from logical or materialized views. Second, reading external tables is not supported.

To mitigate these limitations, you can use the bigquery_select function to execute the query directly in BigQuery.

Propagation Delay: After creating a table in BigQuery, there might be a brief propagation delay before the table becomes fully "visible". Therefore, be aware of potential delays when executing CREATE TABLE ... AS or CREATE OR REPLACE TABLE ... statements followed by immediate inserts. This delay is usually just a matter of seconds, but in rare cases, it can take up to a minute.
Primary Keys and Foreign Keys: While BigQuery recently introduced the concept of primary keys and foreign keys constraints, they differ from what you're accustomed to in DuckDB or other traditional RDBMS. Therefore, this extension does not support this concept.

Install Latest Updates from Custom Repository

Updates may not always be immediately available in the Community Extension repository. However, they can be obtained from a custom repository. To get the latest updates, start DuckDB with unsigned extensions setting enabled. Use the allow_unsigned_extensions flag for client connections, or start the CLI with -unsigned as follows:

# Example: CLI
duckdb -unsigned

# Example: Python
con = duckdb.connect(':memory:', config={'allow_unsigned_extensions' : 'true'})

Then set the custom repository and install the extension:

-- Set the custom repository, then install and load the DuckDB BigQuery extension
D SET custom_extension_repository = 'http://storage.googleapis.com/hafenkran';
D FORCE INSTALL 'bigquery';
D LOAD 'bigquery';

Building the Project

This extension uses VCPKG for dependency management. Enabling VCPKG is very simple: follow the installation instructions or just run the following:

git clone https://github.com/Microsoft/vcpkg.git
./vcpkg/bootstrap-vcpkg.sh
export VCPKG_TOOLCHAIN_PATH=`pwd`/vcpkg/scripts/buildsystems/vcpkg.cmake

Now to build the extension, run:

make

The main binaries that will be built are:

# the binary for the duckdb shell with the extension code automatically loaded.
./build/release/duckdb

# the test runner of duckdb. Again, the extension is already linked into the binary.
./build/release/test/unittest

# the loadable extension binary as it would be distributed.
./build/release/extension/bigquery/bigquery.duckdb_extension

After this step you can either simply start the shell with ./build/release/duckdb or by installing/loading the extension from inside your duckdb instance with:

INSTALL './build/release/extension/bigquery/bigquery.duckdb_extension'
LOAD 'bigquery'

Now you can use the features from this extension.

Using Docker

You can also build the project using Docker. This approach simplifies dependencies management and setup. You can build the docker image by using the provided make command as follows:

make docker-build

To run the Docker container, you'll need to authenticate with Google Cloud using a service account. Ensure you have a service account JSON file for Google Cloud authentication. The service account credentials must be set up as environment variables and mounted to the container. Use the following command to run the Docker container, replacing /path/to/my/service-account-credentials.json with the actual path to your service account JSON file:

docker run \
    -it \
    -v /path/to/my/service-account-credentials.json:/creds \
    -e GOOGLE_APPLICATION_CREDENTIALS=/creds/service-account-credentials.json \
    duckdb-bigquery:v1.0.0

Important Notes on Using Google BigQuery

When using this software with Google BigQuery, please ensure your usage complies with the Google API Terms of Service. Be mindful of the usage limits and quotas, and adhere to Google's Fair Use Policy.

Billing and Costs

Please be aware that using Google BigQuery through this software can incur costs. Google BigQuery is a paid service, and charges may apply based on the amount of data processed, stored, and the number of queries executed. Users are responsible for any costs associated with their use of Google BigQuery. For detailed information on BigQuery pricing, please refer to the Google BigQuery Pricing page. It is recommended to monitor your usage and set up budget alerts in the Google Cloud Console to avoid unexpected charges.

By using this software, you acknowledge and agree that you are solely responsible for any charges incurred from Google BigQuery.

Name		Name	Last commit message	Last commit date
Latest commit History 224 Commits
.github/workflows		.github/workflows
data		data
dev		dev
duckdb @ af39bd0		duckdb @ af39bd0
extension-ci-tools @ 00831df		extension-ci-tools @ 00831df
external		external
scripts		scripts
src		src
test		test
.clang-format		.clang-format
.clang-tidy		.clang-tidy
.cmake-format.yaml		.cmake-format.yaml
.dockerignore		.dockerignore
.editorconfig		.editorconfig
.gitignore		.gitignore
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
extension_config.cmake		extension_config.cmake
vcpkg.json		vcpkg.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

DuckDB BigQuery Extension

Preliminaries

Authentication Option 1: Configure ADC with your Google Account

Authentication Option 2: Configure Service account keys

Windows gRPC Configuration

Quickstart

Additional Operations and Settings

`bigquery_scan` Function

`bigquery_execute` Function

`bigquery_select` Function

`bigquery_clear_cache` Function

Reading Public Datasets

Additional Extension Settings

Limitations

Install Latest Updates from Custom Repository

Building the Project

Using Docker

Important Notes on Using Google BigQuery

Billing and Costs

About

Uh oh!

Releases

Packages

Languages

License

DataZooDE/duckdb-bigquery

Folders and files

Latest commit

History

Repository files navigation

DuckDB BigQuery Extension

Preliminaries

Authentication Option 1: Configure ADC with your Google Account

Authentication Option 2: Configure Service account keys

Windows gRPC Configuration

Quickstart

Additional Operations and Settings

bigquery_scan Function

bigquery_execute Function

bigquery_select Function

bigquery_clear_cache Function

Reading Public Datasets

Additional Extension Settings

Limitations

Install Latest Updates from Custom Repository

Building the Project

Using Docker

Important Notes on Using Google BigQuery

Billing and Costs

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

`bigquery_scan` Function

`bigquery_execute` Function

`bigquery_select` Function

`bigquery_clear_cache` Function

Packages