Tutorial: Accessing MyDuck Server with psycopg, pyarrow, and polars

0. Connecting to MyDuck Server using psycopg

psycopg (version 3) is a popular PostgreSQL adapter for Python. MyDuck Server speaks the PostgreSQL wire protocol, so you can connect to it the same way you would connect to a regular PostgreSQL server:

import psycopg

with psycopg.connect("dbname=postgres user=postgres host=127.0.0.1 port=5432", autocommit=True) as conn:
    with conn.cursor() as cur:
        ...
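The later examples write to and read from a table named test.tb1. If it does not exist yet, a minimal sketch for creating it is shown below (the column types are assumptions; adjust them to your actual schema):

import psycopg

with psycopg.connect("dbname=postgres user=postgres host=127.0.0.1 port=5432", autocommit=True) as conn:
    with conn.cursor() as cur:
        # Hypothetical schema for the examples in this tutorial
        cur.execute("CREATE SCHEMA IF NOT EXISTS test")
        cur.execute("CREATE TABLE IF NOT EXISTS test.tb1 (id INTEGER, num INTEGER, data TEXT)")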

1. Using COPY Operation for Direct Interaction

The COPY command in PostgreSQL is a powerful tool for bulk data transfer. Here is how you can use it with psycopg to interact directly with MyDuck Server. The snippets below assume an open cursor cur obtained from the connection above:

Writing Data Directly

with cur.copy("COPY test.tb1 (id, num, data) FROM STDIN") as copy:
    copy.write(b"1\t100\taaa\n")

Writing Data Row by Row

with cur.copy("COPY test.tb1 (id, num, data) FROM STDIN") as copy:
    copy.write_row((1, 100, "aaa"))
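write_row also works well for loading many rows from an iterable of tuples. A short sketch (the values are illustrative):

rows = [(2, 200, "bbb"), (3, 300, "ccc")]
with cur.copy("COPY test.tb1 (id, num, data) FROM STDIN") as copy:
    for row in rows:
        copy.write_row(row)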

Reading Data Directly

with cur.copy("COPY test.tb1 TO STDOUT") as copy:
    for block in copy:
        print(block)

Reading Data Row by Row

with cur.copy("COPY test.tb1 TO STDOUT") as copy:
    for row in copy.rows():
        print(row)
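By default, copy.rows() yields the column values as text. If you want them converted to Python types, psycopg lets you declare the column types on the Copy object. A sketch, assuming the id/num/data columns are int4/int4/text:

with cur.copy("COPY test.tb1 (id, num, data) TO STDOUT") as copy:
    copy.set_types(["int4", "int4", "text"])
    for row in copy.rows():
        print(row)  # e.g. (1, 100, 'aaa') with Python ints instead of strings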

2. Importing and Exporting Data in Arrow Format

The pyarrow package enables efficient data interchange between DataFrame libraries and MyDuck Server. Here is how to import and export data in the Arrow IPC stream format:

Creating a pandas DataFrame and Converting to Arrow Table

import pandas as pd
import pyarrow as pa

data = {
    'id': [1, 2, 3],
    'num': [100, 200, 300],
    'data': ['aaa', 'bbb', 'ccc']
}
df = pd.DataFrame(data)
table = pa.Table.from_pandas(df)

Writing Data to MyDuck Server in Arrow Format

import io

output_stream = io.BytesIO()
with pa.ipc.RecordBatchStreamWriter(output_stream, table.schema) as writer:
    writer.write_table(table)
with cur.copy("COPY test.tb1 FROM STDIN (FORMAT arrow)") as copy:
    copy.write(output_stream.getvalue())

Reading Data from MyDuck Server in Arrow Format

arrow_data = io.BytesIO()
with cur.copy("COPY test.tb1 TO STDOUT (FORMAT arrow)") as copy:
    for block in copy:
        arrow_data.write(block)

Deserializing Arrow Data to an Arrow Table

with pa.ipc.open_stream(arrow_data.getvalue()) as reader:
    arrow_table = reader.read_all()
    print(arrow_table)

Deserializing Arrow Data to a pandas DataFrame

with pa.ipc.open_stream(arrow_data.getvalue()) as reader:
    pandas_df = reader.read_pandas()
    print(pandas_df)

3. Using Polars to Process DataFrames

Polars is a fast DataFrame library that works natively with Arrow data. Here is how to build Polars DataFrames from the Arrow table or pandas DataFrame obtained above:

Converting an Arrow Table to a Polars DataFrame

import polars as pl

polars_df = pl.from_arrow(arrow_table)

Converting pandas DataFrame to Polars DataFrame

polars_df = pl.from_pandas(pandas_df)
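The conversion also works in the other direction: a Polars DataFrame can be serialized to the Arrow IPC stream format and loaded into MyDuck Server with the same COPY ... FROM STDIN (FORMAT arrow) command used above. A sketch, assuming test.tb1 has the three columns from the earlier examples:

import io
import polars as pl
import pyarrow as pa

new_rows = pl.DataFrame({"id": [4], "num": [400], "data": ["ddd"]})
arrow_out = new_rows.to_arrow()  # convert the Polars DataFrame to a pyarrow Table

buffer = io.BytesIO()
with pa.ipc.RecordBatchStreamWriter(buffer, arrow_out.schema) as writer:
    writer.write_table(arrow_out)
with cur.copy("COPY test.tb1 FROM STDIN (FORMAT arrow)") as copy:
    copy.write(buffer.getvalue())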

4. Retrieving Query Results as DataFrames

You can also retrieve query results from MyDuck Server as DataFrames by exporting them in the Arrow format. Here is an example:

# Copy query result to a Polars DataFrame
arrow_data = io.BytesIO()
with cur.copy("COPY (SELECT id, num * num AS num FROM test.tb1) TO STDOUT (FORMAT arrow)") as copy:
    for block in copy:
        arrow_data.write(block)

with pa.ipc.open_stream(arrow_data.getvalue()) as reader:
    arrow_table = reader.read_all()
    polars_df = pl.from_arrow(arrow_table)
    print(polars_df)
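The same round trip works for pandas: once the query result has been collected into the in-memory Arrow stream, read_pandas() turns it into a pandas DataFrame directly. A sketch reusing the pattern above:

# Copy query result to a pandas DataFrame
arrow_data = io.BytesIO()
with cur.copy("COPY (SELECT id, num, data FROM test.tb1) TO STDOUT (FORMAT arrow)") as copy:
    for block in copy:
        arrow_data.write(block)

with pa.ipc.open_stream(arrow_data.getvalue()) as reader:
    result_df = reader.read_pandas()
    print(result_df)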