I have recently upgraded Sling CLI to the latest version, but I've noticed some strange behaviour when trying to transfer data from an Azure MSSQL DB to an Azure Storage Account. The previous version I used (1.3.2) worked a treat for my use case: I was able to specify `file_max_rows`.
However, now that Sling is using DuckDB, I'm seeing that the replication task no longer supports this option. What I've noticed is that the first step is for Sling to seemingly ingest the entire table into DuckDB before creating the files and pushing them to the storage account. As you can probably imagine, this pretty quickly leads to an OOMKilled error. For context, I ran Sling CLI against my dataset and within 5 minutes it was already using 12GB of memory, having read less than 25% of the table I was working on (in this case the table was only 89GB).

Is this the intended behaviour? If so, how do users run Sling against large tables (in my case 300GB) without being OOMKilled? Or is there a way to stop DuckDB from ingesting the entire table and instead process smaller batches, deleting each batch from DuckDB before proceeding to the next? I know I can set …
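For reference, the replication I was running on 1.3.2 looked roughly like this; connection names, paths and row counts below are placeholders rather than my exact config:

```yaml
# replication.yaml -- sketch of the MSSQL -> Azure Storage replication (placeholder names)
source: AZURE_MSSQL          # Azure SQL Server connection defined in env.yaml
target: AZURE_STORAGE        # Azure Storage connection defined in env.yaml

defaults:
  object: 'exports/{stream_schema}/{stream_table}.parquet'
  target_options:
    format: parquet
    file_max_rows: 500000    # on 1.3.2 this kept each output file (and memory use) bounded

streams:
  dbo.my_large_table:
```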
-
@EValge-IT thanks for raising this. This was a concern for me as well.
Can you try with the latest dev build: 1.4.3.dev?
`file_max_rows` is added again, working with DuckDB.
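For a quick ad-hoc test on the dev build, something along these lines should work (if I'm remembering the `sling run` flag names right; the connection names and object path below are placeholders):

```bash
# one-off run on 1.4.3.dev -- placeholder connection names and target path
sling run \
  --src-conn AZURE_MSSQL \
  --src-stream 'dbo.my_large_table' \
  --tgt-conn AZURE_STORAGE \
  --tgt-object 'exports/dbo/my_large_table.parquet' \
  --tgt-options '{ "format": "parquet", "file_max_rows": 500000 }'
```

The same `file_max_rows` key also goes under `target_options` in a replication YAML.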