Commit

Merge branch 'main' into 246-alter-subscription

TianyuZhang1214 authored Dec 4, 2024
2 parents e2ddbcd + b34532e commit 483fca9
Showing 22 changed files with 854 additions and 200 deletions.
6 changes: 5 additions & 1 deletion .github/workflows/clients-compatibility.yml
@@ -108,7 +108,7 @@ jobs:
# curl -L -o ./java/postgresql-42.7.4.jar https://jdbc.postgresql.org/download/postgresql-42.7.4.jar
npm install pg
sudo cpanm --notest DBD::Pg
pip3 install psycopg2
pip3 install "psycopg[binary]" pandas pyarrow polars
# sudo R -e "install.packages('RPostgres', repos='http://cran.r-project.org')"
sudo gem install pg
@@ -123,3 +123,7 @@ jobs:
- name: Run the Compatibility Test for PostgreSQL Client
run: |
bats ./compatibility/pg/test.bats
- name: Run the Compatibility Test for Python Data Tools
run: |
bats ./compatibility/pg-pytools/test.bats
58 changes: 43 additions & 15 deletions README.md
@@ -3,9 +3,33 @@
<span>MyDuck Server</span>
</h1>


**MyDuck Server** unlocks serious power for your MySQL & Postgres analytics. Imagine the simplicity of (MySQL|Postgres)’s familiar interface fused with the raw analytical speed of [DuckDB](https://duckdb.org/). Now you can supercharge your analytical queries with DuckDB’s lightning-fast OLAP engine, all while using the tools and dialect you know.

<h1 style="display: flex; align-items: center;">
<img alt="duck under dolphin" style="margin-right: 0.2em" src="logo/MyDuck.svg">
</h1>

## 📑 Table of Contents

- [Why MyDuck](#-why-myduck-)
- [Key Features](#-key-features)
- [Performance](#-performance)
- [Getting Started](#-getting-started)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
- [Replicating Data](#replicating-data)
- [Connecting to Cloud MySQL & Postgres](#connecting-to-cloud-mysql--postgres)
- [HTAP Setup](#htap-setup)
- [Query Parquet Files](#query-parquet-files)
- [Already Using DuckDB?](#already-using-duckdb)
- [LLM Integration](#llm-integration)
- [Access from Python](#access-from-python)
- [Roadmap](#-roadmap)
- [Contributing](#-contributing)
- [Acknowledgements](#-acknowledgements)
- [License](#-license)

## ❓ Why MyDuck ❓

While MySQL and Postgres are the most popular open-source databases for OLTP, their performance in analytics often falls short. DuckDB, on the other hand, is built for fast, embedded analytical processing. MyDuck Server lets you enjoy DuckDB's high-speed analytics without leaving the (MySQL|Postgres) ecosystem.
Expand All @@ -24,9 +48,6 @@ MyDuck Server isn't here to replace MySQL & Postgres — it's here to help MySQL

## ✨ Key Features

<h1 style="display: flex; align-items: center;">
<img alt="duck under dolphin" style="margin-right: 0.2em" src="logo/MyDuck.svg">
</h1>

- **Blazing Fast OLAP with DuckDB**: MyDuck stores data in DuckDB, an OLAP-optimized database known for lightning-fast analytical queries. DuckDB enables MyDuck to execute queries up to 1000x faster than traditional MySQL & Postgres setups, making previously unfeasible complex analytics practical.

@@ -56,16 +77,6 @@ MyDuck Server isn't here to replace MySQL & Postgres — it's here to help MySQL

Typical OLAP queries can run **up to 1000x faster** with MyDuck Server compared to MySQL & Postgres alone, especially on large datasets. Under the hood, it's just DuckDB doing what it does best: processing analytical queries at lightning speed. You are welcome to run your own benchmarks and prepare to be amazed! Alternatively, you can refer to well-known benchmarks like the [ClickBench](https://benchmark.clickhouse.com/) and [H2O.ai db-benchmark](https://duckdblabs.github.io/db-benchmark/) to see how DuckDB performs against other databases and data science tools. Also remember that DuckDB has robust support for transactions, JOINs, and [larger-than-memory query processing](https://duckdb.org/2024/07/09/memory-management.html), which are unavailable in many competing systems and tools.

## 🎯 Roadmap

We have big plans for MyDuck Server! Here are some of the features we’re working on:

- [x] Be compatible with MySQL proxy tools like [ProxySQL](https://proxysql.com/).
- [x] Replicate data from PostgreSQL.
- [ ] Authentication.
- [ ] ...and more! We’re always looking for ways to make MyDuck Server better. If you have a feature request, please let us know by [opening an issue](https://github.com/apecloud/myduckserver/issues/new).


## 🏃‍♂️ Getting Started

### Prerequisites
@@ -140,14 +151,31 @@ With MyDuck's powerful analytics capabilities, you can create a hybrid transact
* Provisioning a MySQL HTAP cluster based on [ProxySQL](docs/tutorial/mysql-htap-proxysql-setup.md) or [MariaDB MaxScale](docs/tutorial/mysql-htap-maxscale-setup.md).
* Provisioning a PostgreSQL HTAP cluster based on [PGPool-II](docs/tutorial/pg-htap-pgpool-setup.md)

### Query & Load Parquet Files
### Query Parquet Files

Looking to load Parquet files into MyDuck Server and start querying? Follow our [Parquet file loading guide](docs/tutorial/load-parquet-files.md) for easy setup.
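
As a rough illustration of what querying Parquet data can look like over the Postgres port, here is a minimal sketch. It assumes a local MyDuck Server with default settings, assumes DuckDB's `read_parquet` table function is reachable through this interface, and uses a hypothetical file path; see the guide above for the supported workflow.

```python
import psycopg

# Hedged sketch: query a Parquet file through MyDuck's Postgres-compatible port.
# Assumes a local server with default credentials and that DuckDB's read_parquet()
# is exposed over this interface; the file path below is hypothetical.
with psycopg.connect("dbname=postgres user=postgres host=127.0.0.1 port=5432", autocommit=True) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM read_parquet('/data/example.parquet')")
        print(cur.fetchone()[0])
```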

### Already Using DuckDB?

Already have a DuckDB file? You can seamlessly bootstrap MyDuck Server with it. See our [DuckDB file bootstrapping guide](docs/tutorial/bootstrap.md) for more details.

### LLM Integration

MyDuck Server can be integrated with LLM applications via the [Model Context Protocol (MCP)](https://modelcontextprotocol.io/introduction). Follow the [MCP integration guide](docs/tutorial/mcp.md) to set up MyDuck Server as an external data source for LLMs.

### Access from Python

MyDuck Server can be seamlessly accessed from the Python data science ecosystem. Follow the [Python integration guide](docs/tutorial/pg-python-data-tools.md) to connect to MyDuck Server from Python and export data to PyArrow, pandas, and Polars. Additionally, check out the [Ibis integration guide](docs/tutorial/connect-with-ibis-setup.md) for using the [Ibis](https://ibis-project.org/) dataframe API to query MyDuck Server directly.
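
For a quick plain-SQL round trip, a minimal sketch (not taken from the guides) looks like the following. It assumes a local MyDuck Server with default settings and a `test.tb1` table like the one created by the compatibility tests added in this commit.

```python
import pandas as pd
import polars as pl
import psycopg

# Minimal sketch: run an ordinary SQL query against MyDuck's Postgres port and hand
# the rows to pandas and Polars. The test.tb1 table is assumed to exist already.
with psycopg.connect("dbname=postgres user=postgres host=127.0.0.1 port=5432", autocommit=True) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT id, num, data FROM test.tb1 ORDER BY id")
        columns = [col.name for col in cur.description]
        rows = cur.fetchall()

pandas_df = pd.DataFrame(rows, columns=columns)  # rows as a pandas DataFrame
polars_df = pl.from_pandas(pandas_df)            # and as a Polars DataFrame
print(polars_df)
```

The Arrow-based `COPY ... (FORMAT arrow)` path used by the new compatibility tests avoids this per-row conversion and is a better fit for larger result sets.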

## 🎯 Roadmap

We have big plans for MyDuck Server! Here are some of the features we’re working on:

- [x] Be compatible with MySQL proxy tools like [ProxySQL](https://proxysql.com/).
- [x] Replicate data from PostgreSQL.
- [ ] Authentication.
- [ ] ...and more! We’re always looking for ways to make MyDuck Server better. If you have a feature request, please let us know by [opening an issue](https://github.com/apecloud/myduckserver/issues/new).

## 💡 Contributing

Let’s make MySQL & Postgres analytics fast and powerful — together!
4 changes: 4 additions & 0 deletions adapter/adapter.go
@@ -21,6 +21,10 @@ func GetConn(ctx *sql.Context) (*stdsql.Conn, error) {
    return ctx.Session.(ConnectionHolder).GetConn(ctx)
}

func GetCatalogConn(ctx *sql.Context) (*stdsql.Conn, error) {
    return ctx.Session.(ConnectionHolder).GetCatalogConn(ctx)
}

func CloseBackendConn(ctx *sql.Context) {
    ctx.Session.(ConnectionHolder).CloseBackendConn()
}
6 changes: 5 additions & 1 deletion binlogreplication/binlog_replica_applier.go
@@ -1245,14 +1245,18 @@ func (a *binlogReplicaApplier) appendRowFormatChanges(
}

func (a *binlogReplicaApplier) flushDeltaBuffer(ctx *sql.Context, reason delta.FlushReason) error {
    conn, err := adapter.GetCatalogConn(ctx)
    if err != nil {
        return err
    }
    tx, err := adapter.GetCatalogTxn(ctx, nil)
    if err != nil {
        return err
    }

    defer a.deltaBufSize.Store(0)

    if err = a.tableWriterProvider.FlushDeltaBuffer(ctx, tx, reason); err != nil {
    if err = a.tableWriterProvider.FlushDeltaBuffer(ctx, conn, tx, reason); err != nil {
        ctx.GetLogger().Errorf("Failed to flush changelog: %v", err.Error())
        MyBinlogReplicaController.setSqlError(sqlerror.ERUnknownError, err.Error())
    }
2 changes: 1 addition & 1 deletion binlogreplication/writer.go
@@ -51,7 +51,7 @@ type TableWriterProvider interface {
    ) (DeltaAppender, error)

    // FlushDeltaBuffer writes the accumulated changes to the database.
    FlushDeltaBuffer(ctx *sql.Context, tx *stdsql.Tx, reason delta.FlushReason) error
    FlushDeltaBuffer(ctx *sql.Context, conn *stdsql.Conn, tx *stdsql.Tx, reason delta.FlushReason) error

    // DiscardDeltaBuffer discards the accumulated changes.
    DiscardDeltaBuffer(ctx *sql.Context)
1 change: 0 additions & 1 deletion compatibility/mysql/python/mysql_test.py
@@ -58,7 +58,6 @@ def run_tests(self):
        for test in self.tests:
            cursor = None
            try:
                self.conn.autocommit = False
                cursor = self.conn.cursor()
                if not test.run(cursor):
                    return False
54 changes: 54 additions & 0 deletions compatibility/pg-pytools/polars_test.py
@@ -0,0 +1,54 @@
import io

import pandas as pd
import pyarrow as pa
import polars as pl
import psycopg

# Create a pandas DataFrame
data = {
    'id': [1, 2, 3],
    'num': [100, 200, 300],
    'data': ['aaa', 'bbb', 'ccc']
}
df = pd.DataFrame(data)

# Convert the DataFrame to an Arrow Table
table = pa.Table.from_pandas(df)

with psycopg.connect("dbname=postgres user=postgres host=127.0.0.1 port=5432", autocommit=True) as conn:
    with conn.cursor() as cur:
        cur.execute("DROP SCHEMA IF EXISTS test CASCADE")
        cur.execute("CREATE SCHEMA test")

        # Create a new table
        cur.execute("""
            CREATE TABLE test.tb1 (
                id integer PRIMARY KEY,
                num integer,
                data text)
            """)

        # Use psycopg to write the DataFrame to MyDuck Server
        output_stream = io.BytesIO()
        with pa.ipc.RecordBatchStreamWriter(output_stream, table.schema) as writer:
            writer.write_table(table)
        with cur.copy("COPY test.tb1 FROM STDIN (FORMAT arrow)") as copy:
            copy.write(output_stream.getvalue())

        # Copy the data from MyDuck Server back into a pandas DataFrame using Arrow format
        arrow_data = io.BytesIO()
        with cur.copy("COPY test.tb1 TO STDOUT (FORMAT arrow)") as copy:
            for block in copy:
                arrow_data.write(block)

        # Read the Arrow data into a Polars DataFrame
        with pa.ipc.open_stream(arrow_data.getvalue()) as reader:
            arrow_df = reader.read_all()
            polars_df = pl.from_arrow(arrow_df)

            # Convert the original pandas DataFrame to a Polars DataFrame for comparison
            polars_df_original = pl.from_pandas(df)

            # Compare the original Polars DataFrame with the DataFrame from PostgreSQL
            assert polars_df.equals(polars_df_original), "DataFrames are not equal"
53 changes: 53 additions & 0 deletions compatibility/pg-pytools/psycopg_test.py
@@ -0,0 +1,53 @@
from psycopg import sql
import psycopg

rows = [
    (1, 100, "aaa"),
    (2, 200, "bbb"),
    (3, 300, "ccc"),
    (4, 400, "ddd"),
    (5, 500, "eee"),
]

# Connect to an existing database
with psycopg.connect("dbname=postgres user=postgres host=127.0.0.1 port=5432", autocommit=True) as conn:
    # Open a cursor to perform database operations
    with conn.cursor() as cur:
        cur.execute("DROP SCHEMA IF EXISTS test CASCADE")
        cur.execute("CREATE SCHEMA test")

        cur.execute("""
            CREATE TABLE test.tb1 (
                id integer PRIMARY KEY,
                num integer,
                data text)
            """)

        # Pass data to fill the query placeholders and let psycopg perform the correct conversion
        cur.execute(
            "INSERT INTO test.tb1 (id, num, data) VALUES (%s, %s, %s)",
            rows[0])

        # Query the database and obtain data as Python objects
        cur.execute("SELECT * FROM test.tb1")
        row = cur.fetchone()
        assert row == rows[0], "Row is not equal"

        # Copy data from a file-like object to a table
        print("Copy data from a file-like object to a table")
        with cur.copy("COPY test.tb1 (id, num, data) FROM STDIN") as copy:
            for row in rows[1:3]:
                copy.write(f"{row[0]}\t{row[1]}\t{row[2]}\n".encode())
            for row in rows[3:]:
                copy.write_row(row)

        # Copy data from a table to a file-like object
        print("Copy data from a table to a file-like object")
        with cur.copy(
            "COPY (SELECT * FROM test.tb1 LIMIT %s) TO STDOUT",
            (4,)
        ) as copy:
            copy.set_types(["int4", "int4", "text"])
            for i, row in enumerate(copy.rows()):
                assert row == rows[i], f"Row {i} is not equal"
50 changes: 50 additions & 0 deletions compatibility/pg-pytools/pyarrow_test.py
@@ -0,0 +1,50 @@
import io

import pandas as pd
import pyarrow as pa
import psycopg

# Create a pandas DataFrame
data = {
    'id': [1, 2, 3],
    'num': [100, 200, 300],
    'data': ['aaa', 'bbb', 'ccc']
}
df = pd.DataFrame(data)

# Convert the DataFrame to an Arrow Table
table = pa.Table.from_pandas(df)

with psycopg.connect("dbname=postgres user=postgres host=127.0.0.1 port=5432", autocommit=True) as conn:
    with conn.cursor() as cur:
        cur.execute("DROP SCHEMA IF EXISTS test CASCADE")
        cur.execute("CREATE SCHEMA test")

        # Create a new table
        cur.execute("""
            CREATE TABLE test.tb1 (
                id integer PRIMARY KEY,
                num integer,
                data text)
            """)

        # Use psycopg to write the DataFrame to MyDuck Server
        output_stream = io.BytesIO()
        with pa.ipc.RecordBatchStreamWriter(output_stream, table.schema) as writer:
            writer.write_table(table)
        with cur.copy("COPY test.tb1 FROM STDIN (FORMAT arrow)") as copy:
            copy.write(output_stream.getvalue())

        # Copy the data from MyDuck Server back into a pandas DataFrame using Arrow format
        arrow_data = io.BytesIO()
        with cur.copy("COPY test.tb1 TO STDOUT (FORMAT arrow)") as copy:
            for block in copy:
                arrow_data.write(block)

        # Read the Arrow data into a pandas DataFrame
        with pa.ipc.open_stream(arrow_data.getvalue()) as reader:
            df_from_pg = reader.read_pandas()
            df = df.astype({'id': 'int64', 'num': 'int64'})
            df_from_pg = df_from_pg.astype({'id': 'int64', 'num': 'int64'})
            # Compare the original DataFrame with the DataFrame from PostgreSQL
            assert df.equals(df_from_pg), "DataFrames are not equal"
49 changes: 49 additions & 0 deletions compatibility/pg-pytools/test.bats
@@ -0,0 +1,49 @@
#!/usr/bin/env bats

setup() {
    psql -h 127.0.0.1 -p 5432 -U postgres -c "DROP SCHEMA IF EXISTS test CASCADE;"
    touch /tmp/test_pids
}

custom_teardown=""

set_custom_teardown() {
    custom_teardown="$1"
}

teardown() {
    if [ -n "$custom_teardown" ]; then
        eval "$custom_teardown"
        custom_teardown=""
    fi

    while read -r pid; do
        if kill -0 "$pid" 2>/dev/null; then
            kill "$pid"
            wait "$pid" 2>/dev/null
        fi
    done < /tmp/test_pids
    rm /tmp/test_pids
}

start_process() {
    run timeout 2m "$@"
    echo $! >> /tmp/test_pids
    if [ "$status" -ne 0 ]; then
        echo "$output"
        echo "$stderr"
    fi
    [ "$status" -eq 0 ]
}

@test "pg-psycopg" {
    start_process python3 $BATS_TEST_DIRNAME/psycopg_test.py
}

@test "pg-pyarrow" {
    start_process python3 $BATS_TEST_DIRNAME/pyarrow_test.py
}

@test "pg-polars" {
    start_process python3 $BATS_TEST_DIRNAME/polars_test.py
}