Commit

Merge branch 'main' into 246-alter-subscription

TianyuZhang1214 authored Dec 4, 2024
2 parents e2ddbcd + b34532e commit 483fca9
Showing 22 changed files with 854 additions and 200 deletions.
6 changes: 5 additions & 1 deletion .github/workflows/clients-compatibility.yml
@@ -108,7 +108,7 @@ jobs:
# curl -L -o ./java/postgresql-42.7.4.jar https://jdbc.postgresql.org/download/postgresql-42.7.4.jar
npm install pg
sudo cpanm --notest DBD::Pg
pip3 install psycopg2
pip3 install "psycopg[binary]" pandas pyarrow polars
# sudo R -e "install.packages('RPostgres', repos='http://cran.r-project.org')"
sudo gem install pg
@@ -123,3 +123,7 @@ jobs:
- name: Run the Compatibility Test for PostgreSQL Client
run: |
bats ./compatibility/pg/test.bats
- name: Run the Compatibility Test for Python Data Tools
run: |
bats ./compatibility/pg-pytools/test.bats
58 changes: 43 additions & 15 deletions README.md
@@ -3,9 +3,33 @@
<span>MyDuck Server</span>
</h1>


**MyDuck Server** unlocks serious power for your MySQL & Postgres analytics. Imagine the simplicity of (MySQL|Postgres)’s familiar interface fused with the raw analytical speed of [DuckDB](https://duckdb.org/). Now you can supercharge your analytical queries with DuckDB’s lightning-fast OLAP engine, all while using the tools and dialect you know.

<h1 style="display: flex; align-items: center;">
<img alt="duck under dolphin" style="margin-right: 0.2em" src="logo/MyDuck.svg">
</h1>

## 📑 Table of Contents

- [Why MyDuck](#-why-myduck-)
- [Key Features](#-key-features)
- [Performance](#-performance)
- [Getting Started](#-getting-started)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
- [Replicating Data](#replicating-data)
- [Connecting to Cloud MySQL & Postgres](#connecting-to-cloud-mysql--postgres)
- [HTAP Setup](#htap-setup)
- [Query Parquet Files](#query-parquet-files)
- [Already Using DuckDB?](#already-using-duckdb)
- [LLM Integration](#llm-integration)
- [Access from Python](#access-from-python)
- [Roadmap](#-roadmap)
- [Contributing](#-contributing)
- [Acknowledgements](#-acknowledgements)
- [License](#-license)

## ❓ Why MyDuck ❓

While MySQL and Postgres are the most popular open-source databases for OLTP, their performance in analytics often falls short. DuckDB, on the other hand, is built for fast, embedded analytical processing. MyDuck Server lets you enjoy DuckDB's high-speed analytics without leaving the (MySQL|Postgres) ecosystem.
Expand All @@ -24,9 +48,6 @@ MyDuck Server isn't here to replace MySQL & Postgres — it's here to help MySQL

## ✨ Key Features

<h1 style="display: flex; align-items: center;">
<img alt="duck under dolphin" style="margin-right: 0.2em" src="logo/MyDuck.svg">
</h1>

- **Blazing Fast OLAP with DuckDB**: MyDuck stores data in DuckDB, an OLAP-optimized database known for lightning-fast analytical queries. DuckDB enables MyDuck to execute queries up to 1000x faster than traditional MySQL & Postgres setups, making previously unfeasible complex analytics practical.

@@ -56,16 +77,6 @@ MyDuck Server isn't here to replace MySQL & Postgres — it's here to help MySQL

Typical OLAP queries can run **up to 1000x faster** with MyDuck Server compared to MySQL & Postgres alone, especially on large datasets. Under the hood, it's just DuckDB doing what it does best: processing analytical queries at lightning speed. You are welcome to run your own benchmarks and prepare to be amazed! Alternatively, you can refer to well-known benchmarks like the [ClickBench](https://benchmark.clickhouse.com/) and [H2O.ai db-benchmark](https://duckdblabs.github.io/db-benchmark/) to see how DuckDB performs against other databases and data science tools. Also remember that DuckDB has robust support for transactions, JOINs, and [larger-than-memory query processing](https://duckdb.org/2024/07/09/memory-management.html), which are unavailable in many competing systems and tools.

## 🎯 Roadmap

We have big plans for MyDuck Server! Here are some of the features we’re working on:

- [x] Be compatible with MySQL proxy tools like [ProxySQL](https://proxysql.com/).
- [x] Replicate data from PostgreSQL.
- [ ] Authentication.
- [ ] ...and more! We’re always looking for ways to make MyDuck Server better. If you have a feature request, please let us know by [opening an issue](https://github.com/apecloud/myduckserver/issues/new).


## 🏃‍♂️ Getting Started

### Prerequisites
@@ -140,14 +151,31 @@ With MyDuck's powerful analytics capabilities, you can create a hybrid transact
* Provisioning a MySQL HTAP cluster based on [ProxySQL](docs/tutorial/mysql-htap-proxysql-setup.md) or [MariaDB MaxScale](docs/tutorial/mysql-htap-maxscale-setup.md).
* Provisioning a PostgreSQL HTAP cluster based on [PGPool-II](docs/tutorial/pg-htap-pgpool-setup.md)

### Query & Load Parquet Files
### Query Parquet Files

Looking to load Parquet files into MyDuck Server and start querying? Follow our [Parquet file loading guide](docs/tutorial/load-parquet-files.md) for easy setup.
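
As a rough illustration of what querying Parquet data can look like over the Postgres port, here is a minimal sketch. It assumes a local MyDuck Server with default settings, assumes DuckDB's `read_parquet` table function is reachable through this interface, and uses a hypothetical file path; see the guide above for the supported workflow.

```python
import psycopg

# Hedged sketch: query a Parquet file through MyDuck's Postgres-compatible port.
# Assumes a local server with default credentials and that DuckDB's read_parquet()
# is exposed over this interface; the file path below is hypothetical.
with psycopg.connect("dbname=postgres user=postgres host=127.0.0.1 port=5432", autocommit=True) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM read_parquet('/data/example.parquet')")
        print(cur.fetchone()[0])
```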

### Already Using DuckDB?

Already have a DuckDB file? You can seamlessly bootstrap MyDuck Server with it. See our [DuckDB file bootstrapping guide](docs/tutorial/bootstrap.md) for more details.

### LLM Integration

MyDuck Server can be integrated with LLM applications via the [Model Context Protocol (MCP)](https://modelcontextprotocol.io/introduction). Follow the [MCP integration guide](docs/tutorial/mcp.md) to set up MyDuck Server as an external data source for LLMs.

### Access from Python

MyDuck Server can be seamlessly accessed from the Python data science ecosystem. Follow the [Python integration guide](docs/tutorial/pg-python-data-tools.md) to connect to MyDuck Server from Python and export data to PyArrow, pandas, and Polars. Additionally, check out the [Ibis integration guide](docs/tutorial/connect-with-ibis-setup.md) for using the [Ibis](https://ibis-project.org/) dataframe API to query MyDuck Server directly.
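
For a quick plain-SQL round trip, a minimal sketch (not taken from the guides) looks like the following. It assumes a local MyDuck Server with default settings and a `test.tb1` table like the one created by the compatibility tests added in this commit.

```python
import pandas as pd
import polars as pl
import psycopg

# Minimal sketch: run an ordinary SQL query against MyDuck's Postgres port and hand
# the rows to pandas and Polars. The test.tb1 table is assumed to exist already.
with psycopg.connect("dbname=postgres user=postgres host=127.0.0.1 port=5432", autocommit=True) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT id, num, data FROM test.tb1 ORDER BY id")
        columns = [col.name for col in cur.description]
        rows = cur.fetchall()

pandas_df = pd.DataFrame(rows, columns=columns)  # rows as a pandas DataFrame
polars_df = pl.from_pandas(pandas_df)            # and as a Polars DataFrame
print(polars_df)
```

The Arrow-based `COPY ... (FORMAT arrow)` path used by the new compatibility tests avoids this per-row conversion and is a better fit for larger result sets.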

## 🎯 Roadmap

We have big plans for MyDuck Server! Here are some of the features we’re working on:

- [x] Be compatible with MySQL proxy tools like [ProxySQL](https://proxysql.com/).
- [x] Replicate data from PostgreSQL.
- [ ] Authentication.
- [ ] ...and more! We’re always looking for ways to make MyDuck Server better. If you have a feature request, please let us know by [opening an issue](https://github.com/apecloud/myduckserver/issues/new).

## 💡 Contributing

Let’s make MySQL & Postgres analytics fast and powerful — together!
4 changes: 4 additions & 0 deletions adapter/adapter.go
@@ -21,6 +21,10 @@ func GetConn(ctx *sql.Context) (*stdsql.Conn, error) {
    return ctx.Session.(ConnectionHolder).GetConn(ctx)
}

func GetCatalogConn(ctx *sql.Context) (*stdsql.Conn, error) {
    return ctx.Session.(ConnectionHolder).GetCatalogConn(ctx)
}

func CloseBackendConn(ctx *sql.Context) {
    ctx.Session.(ConnectionHolder).CloseBackendConn()
}
6 changes: 5 additions & 1 deletion binlogreplication/binlog_replica_applier.go
@@ -1245,14 +1245,18 @@ func (a *binlogReplicaApplier) appendRowFormatChanges(
}

func (a *binlogReplicaApplier) flushDeltaBuffer(ctx *sql.Context, reason delta.FlushReason) error {
    conn, err := adapter.GetCatalogConn(ctx)
    if err != nil {
        return err
    }
    tx, err := adapter.GetCatalogTxn(ctx, nil)
    if err != nil {
        return err
    }

    defer a.deltaBufSize.Store(0)

    if err = a.tableWriterProvider.FlushDeltaBuffer(ctx, tx, reason); err != nil {
    if err = a.tableWriterProvider.FlushDeltaBuffer(ctx, conn, tx, reason); err != nil {
        ctx.GetLogger().Errorf("Failed to flush changelog: %v", err.Error())
        MyBinlogReplicaController.setSqlError(sqlerror.ERUnknownError, err.Error())
    }
2 changes: 1 addition & 1 deletion binlogreplication/writer.go
@@ -51,7 +51,7 @@ type TableWriterProvider interface {
    ) (DeltaAppender, error)

    // FlushDeltaBuffer writes the accumulated changes to the database.
    FlushDeltaBuffer(ctx *sql.Context, tx *stdsql.Tx, reason delta.FlushReason) error
    FlushDeltaBuffer(ctx *sql.Context, conn *stdsql.Conn, tx *stdsql.Tx, reason delta.FlushReason) error

    // DiscardDeltaBuffer discards the accumulated changes.
    DiscardDeltaBuffer(ctx *sql.Context)
1 change: 0 additions & 1 deletion compatibility/mysql/python/mysql_test.py
@@ -58,7 +58,6 @@ def run_tests(self):
        for test in self.tests:
            cursor = None
            try:
                self.conn.autocommit = False
                cursor = self.conn.cursor()
                if not test.run(cursor):
                    return False
54 changes: 54 additions & 0 deletions compatibility/pg-pytools/polars_test.py
@@ -0,0 +1,54 @@
import io

import pandas as pd
import pyarrow as pa
import polars as pl
import psycopg

# Create a pandas DataFrame
data = {
    'id': [1, 2, 3],
    'num': [100, 200, 300],
    'data': ['aaa', 'bbb', 'ccc']
}
df = pd.DataFrame(data)

# Convert the DataFrame to an Arrow Table
table = pa.Table.from_pandas(df)

with psycopg.connect("dbname=postgres user=postgres host=127.0.0.1 port=5432", autocommit=True) as conn:
    with conn.cursor() as cur:
        cur.execute("DROP SCHEMA IF EXISTS test CASCADE")
        cur.execute("CREATE SCHEMA test")

        # Create a new table
        cur.execute("""
            CREATE TABLE test.tb1 (
                id integer PRIMARY KEY,
                num integer,
                data text)
            """)

        # Use psycopg to write the DataFrame to MyDuck Server
        output_stream = io.BytesIO()
        with pa.ipc.RecordBatchStreamWriter(output_stream, table.schema) as writer:
            writer.write_table(table)
        with cur.copy("COPY test.tb1 FROM STDIN (FORMAT arrow)") as copy:
            copy.write(output_stream.getvalue())

        # Copy the data from MyDuck Server back into a pandas DataFrame using Arrow format
        arrow_data = io.BytesIO()
        with cur.copy("COPY test.tb1 TO STDOUT (FORMAT arrow)") as copy:
            for block in copy:
                arrow_data.write(block)

        # Read the Arrow data into a Polars DataFrame
        with pa.ipc.open_stream(arrow_data.getvalue()) as reader:
            arrow_df = reader.read_all()
            polars_df = pl.from_arrow(arrow_df)

            # Convert the original pandas DataFrame to a Polars DataFrame for comparison
            polars_df_original = pl.from_pandas(df)

            # Compare the original Polars DataFrame with the DataFrame from PostgreSQL
            assert polars_df.equals(polars_df_original), "DataFrames are not equal"
53 changes: 53 additions & 0 deletions compatibility/pg-pytools/psycopg_test.py
@@ -0,0 +1,53 @@
from psycopg import sql
import psycopg

rows = [
    (1, 100, "aaa"),
    (2, 200, "bbb"),
    (3, 300, "ccc"),
    (4, 400, "ddd"),
    (5, 500, "eee"),
]

# Connect to an existing database
with psycopg.connect("dbname=postgres user=postgres host=127.0.0.1 port=5432", autocommit=True) as conn:
    # Open a cursor to perform database operations
    with conn.cursor() as cur:
        cur.execute("DROP SCHEMA IF EXISTS test CASCADE")
        cur.execute("CREATE SCHEMA test")

        cur.execute("""
            CREATE TABLE test.tb1 (
                id integer PRIMARY KEY,
                num integer,
                data text)
            """)

        # Pass data to fill the query placeholders and let psycopg perform the correct conversion
        cur.execute(
            "INSERT INTO test.tb1 (id, num, data) VALUES (%s, %s, %s)",
            rows[0])

        # Query the database and obtain data as Python objects
        cur.execute("SELECT * FROM test.tb1")
        row = cur.fetchone()
        assert row == rows[0], "Row is not equal"

        # Copy data from a file-like object to a table
        print("Copy data from a file-like object to a table")
        with cur.copy("COPY test.tb1 (id, num, data) FROM STDIN") as copy:
            for row in rows[1:3]:
                copy.write(f"{row[0]}\t{row[1]}\t{row[2]}\n".encode())
            for row in rows[3:]:
                copy.write_row(row)

        # Copy data from a table to a file-like object
        print("Copy data from a table to a file-like object")
        with cur.copy(
            "COPY (SELECT * FROM test.tb1 LIMIT %s) TO STDOUT",
            (4,)
        ) as copy:
            copy.set_types(["int4", "int4", "text"])
            for i, row in enumerate(copy.rows()):
                assert row == rows[i], f"Row {i} is not equal"
50 changes: 50 additions & 0 deletions compatibility/pg-pytools/pyarrow_test.py
@@ -0,0 +1,50 @@
import io

import pandas as pd
import pyarrow as pa
import psycopg

# Create a pandas DataFrame
data = {
    'id': [1, 2, 3],
    'num': [100, 200, 300],
    'data': ['aaa', 'bbb', 'ccc']
}
df = pd.DataFrame(data)

# Convert the DataFrame to an Arrow Table
table = pa.Table.from_pandas(df)

with psycopg.connect("dbname=postgres user=postgres host=127.0.0.1 port=5432", autocommit=True) as conn:
    with conn.cursor() as cur:
        cur.execute("DROP SCHEMA IF EXISTS test CASCADE")
        cur.execute("CREATE SCHEMA test")

        # Create a new table
        cur.execute("""
            CREATE TABLE test.tb1 (
                id integer PRIMARY KEY,
                num integer,
                data text)
            """)

        # Use psycopg to write the DataFrame to MyDuck Server
        output_stream = io.BytesIO()
        with pa.ipc.RecordBatchStreamWriter(output_stream, table.schema) as writer:
            writer.write_table(table)
        with cur.copy("COPY test.tb1 FROM STDIN (FORMAT arrow)") as copy:
            copy.write(output_stream.getvalue())

        # Copy the data from MyDuck Server back into a pandas DataFrame using Arrow format
        arrow_data = io.BytesIO()
        with cur.copy("COPY test.tb1 TO STDOUT (FORMAT arrow)") as copy:
            for block in copy:
                arrow_data.write(block)

        # Read the Arrow data into a pandas DataFrame
        with pa.ipc.open_stream(arrow_data.getvalue()) as reader:
            df_from_pg = reader.read_pandas()
            df = df.astype({'id': 'int64', 'num': 'int64'})
            df_from_pg = df_from_pg.astype({'id': 'int64', 'num': 'int64'})
            # Compare the original DataFrame with the DataFrame from PostgreSQL
            assert df.equals(df_from_pg), "DataFrames are not equal"
49 changes: 49 additions & 0 deletions compatibility/pg-pytools/test.bats
@@ -0,0 +1,49 @@
#!/usr/bin/env bats

setup() {
    psql -h 127.0.0.1 -p 5432 -U postgres -c "DROP SCHEMA IF EXISTS test CASCADE;"
    touch /tmp/test_pids
}

custom_teardown=""

set_custom_teardown() {
    custom_teardown="$1"
}

teardown() {
    if [ -n "$custom_teardown" ]; then
        eval "$custom_teardown"
        custom_teardown=""
    fi

    while read -r pid; do
        if kill -0 "$pid" 2>/dev/null; then
            kill "$pid"
            wait "$pid" 2>/dev/null
        fi
    done < /tmp/test_pids
    rm /tmp/test_pids
}

start_process() {
    run timeout 2m "$@"
    echo $! >> /tmp/test_pids
    if [ "$status" -ne 0 ]; then
        echo "$output"
        echo "$stderr"
    fi
    [ "$status" -eq 0 ]
}

@test "pg-psycopg" {
    start_process python3 $BATS_TEST_DIRNAME/psycopg_test.py
}

@test "pg-pyarrow" {
    start_process python3 $BATS_TEST_DIRNAME/pyarrow_test.py
}

@test "pg-polars" {
    start_process python3 $BATS_TEST_DIRNAME/polars_test.py
}