
Add bulk loading #104

Closed
mcamou opened this issue Nov 26, 2020 · 16 comments
Labels: help wanted (Extra attention is needed) · kind/feature (A request for a new feature.)

Comments

@mcamou commented Nov 26, 2020

Running INSERT INTO table VALUES(...) one row at a time is really slow, especially when inserting against e.g. an Azure server. You can batch inserts with INSERT INTO table VALUES ({},{}),({},{}), but that is limited to 1,000 rows per statement. You can also use INSERT INTO table SELECT a,b FROM (VALUES ({},{}),({},{})) sub(a,b) as mentioned in steffengy#93, but then you are limited to 2,100 parameters per query.
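
As a rough illustration of the batched form (a sketch only: the table, columns, and helper name are invented; it assumes tiberius's Client::execute with @PN parameters):

    use tiberius::{Client, ToSql};
    use tokio::net::TcpStream;
    use tokio_util::compat::Compat;

    /// Hypothetical helper: insert a slice of rows with one parameterized
    /// INSERT. Each row consumes 2 parameters and SQL Server caps a query
    /// at 2,100 parameters (and a VALUES constructor at 1,000 rows), so
    /// keep the slices small.
    async fn insert_batch(
        client: &mut Client<Compat<TcpStream>>,
        rows: &[(i32, &str)],
    ) -> Result<(), tiberius::error::Error> {
        let placeholders: Vec<String> = (0..rows.len())
            .map(|i| format!("(@P{},@P{})", i * 2 + 1, i * 2 + 2))
            .collect();
        let sql = format!(
            "INSERT INTO my_table (a, b) VALUES {}",
            placeholders.join(",")
        );

        // Flatten the rows into one parameter list matching the placeholders.
        let params: Vec<&dyn ToSql> = rows
            .iter()
            .flat_map(|(a, b)| [a as &dyn ToSql, b as &dyn ToSql])
            .collect();

        client.execute(sql, &params).await?;
        Ok(())
    }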

A solution to this would be to implement something similar to what Tedious (NodeJS library) implements: https://tediousjs.github.io/tedious/bulk-load.html

@esheppa (Contributor) commented Nov 29, 2020

Hi @mcamou, there is another alternative as well, which should offer decent performance:

  1. Submit your data from the Rust side as a JSON string
  2. In the database query, use something like insert into <mytable> select * from openjson(@P1)

There are some examples of this in a project I'm working on (a rough sketch also follows below):

  1. Batched inserting (Rust side)
  2. Inserting JSON string into table (SQL Server side)
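
A minimal sketch of that approach, assuming the serde and serde_json crates and a made-up table and column layout (my_table with columns a and b):

    use serde::Serialize;
    use tiberius::Client;
    use tokio::net::TcpStream;
    use tokio_util::compat::Compat;

    #[derive(Serialize)]
    struct Item {
        a: i32,
        b: String,
    }

    /// Hypothetical helper: ship all rows as one JSON string in a single
    /// parameter and let OPENJSON expand it server-side, sidestepping the
    /// 2,100-parameter limit entirely.
    async fn insert_json(
        client: &mut Client<Compat<TcpStream>>,
        items: &[Item],
    ) -> Result<(), tiberius::error::Error> {
        let json = serde_json::to_string(items).expect("items serialize to JSON");
        let sql = "INSERT INTO my_table (a, b) \
                   SELECT a, b \
                   FROM OPENJSON(@P1) WITH (a INT '$.a', b NVARCHAR(255) '$.b')";
        client.execute(sql, &[&json]).await?;
        Ok(())
    }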

@Mart-Bogdan

@esheppa nice hint.
Quite off-topic for tiberius, but regarding your project: you are using chunks of 100,000 rows. This is unfortunate, as SQL Server performs lock escalation when a single query affects more than 5,000 rows.

A safe bet is 4,500 rows per batch.

This won't make the insert faster, but it won't block other queries.
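
Applied to the hypothetical insert_json helper sketched above, the batching advice would look like this:

    // Keep each statement under SQL Server's ~5,000-row lock-escalation
    // threshold by chunking at 4,500 rows.
    for chunk in items.chunks(4500) {
        insert_json(&mut client, chunk).await?;
    }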

@pimeys (Contributor) commented Dec 14, 2020

Regarding the idea, I'd like to see something even better than what Tedious does: a streaming insert. Provide an async file stream with a mapping, save memory, and execute a fast bulk insert.

I'm fascinated by the idea, but I'm not sure whether it would be possible to get working. I'm also extremely busy for the next few months, and the next thing planned from me for Tiberius is to get TLS working on macOS. If somebody has time before spring and would like to try something out, I'd be interested to see a pull request!

@Mart-Bogdan

It's definitely possible. I've used it from Java JDBC via the SQLBulkCopy class.
So it's covered by the TDS protocol, but it could be a huge amount of work.

@pimeys (Contributor) commented Dec 15, 2020

I don't think it's a huge amount of work; we already have good implementations in JDBC and in Tedious. It's a good idea to read the TDS manual, but it might be easier to just look at what Tedious and the JDBC driver do.

I hope the facilities in Tiberius are good enough, and I can provide help in our Discord server if you want to try your luck implementing it. We should discuss beforehand how the implementation should look.

Actually, I'm quite excited to write this myself. I'm just quite busy now, and I know the next thing I have allocated for Tiberius from work is going to be the macOS TLS fix, so I doubt I can start this that soon...

@pimeys added the help wanted, kind/feedback (Issue for gathering feedback.), and kind/feature labels on Dec 15, 2020
@nickolay

I was curious about this, and it turned out to be not very hard to get the basics working (for a 1x2 table with hard-coded structure and content for now). See https://github.com/nickolay/tiberius/commits/wip/bulkload

According to https://winprotocoldoc.blob.core.windows.net/productionwindowsarchives/MS-TDS/%5BMS-TDS%5D.pdf the bulk-load exchange can be expressed in terms of the structs already existing in tiberius:

  • An "INSERT BULK table_name (col1 type1, col2 type2, ...)" SQL batch request (where the columns and types have to match a subset of the target table)
  • A BulkLoadBCP request, consisting of
    • COLMETADATA token (TokenColMetaData), containing a number of column definitions (MetaDataColumn -> BaseMetaDataColumn -> BitFlags<ColumnFlag>, TypeInfo)
    • A number of ROW tokens (TokenRow -> ColumnData)
    • A DONE token (TokenDone)

So the bulk of the changes needed to make this work is simply adding Encode impls for the existing types. Only ColumnData is special: it already has an Encode implementation, which conflates encoding the column data with encoding its type info. Since the bulk-load stream does not repeat the type info before each value, I've split the data encoding and the type-info encoding into separate methods (encode_type_info_for_rpc).

The new bits are: a new bulk_load_tinyint test calls the new Client::bulk_load method, passing in the table definition in a new BulkLoadMetadata struct. Under the hood, the new BulkLoadRequest struct encodes the BulkLoadBCP request.

I leave the interesting part of designing the streaming API and the tedious work of implementing and testing the bulk uploading of all the supported types to someone else.
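
To make the shape of the branch concrete, here is a very rough sketch of how the WIP surface described above might be exercised. BulkLoadMetadata and Client::bulk_load live on the linked branch, so everything below is a guess at the in-progress API, not its final form:

    // Hypothetical usage of the WIP bulk-load surface; types from tiberius.
    let meta = BulkLoadMetadata::new(); // column definitions matching the target table
    let mut load = client.bulk_load("bulk_test", meta).await?;
    load.send_row(ColumnData::U8(Some(1))).await?; // one ROW token per row
    load.finalize().await?; // writes the closing DONE token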

@pimeys (Contributor) commented Mar 25, 2021

Can we even do a streaming API for bulk loads? It would be nice to read data from e.g. a huge CSV file in batches into memory, write a batch to the database, then load the next one.

Does TDS actually allow this, or do you have to write all the data to the token before you can write it to the wire?

I'm not going to have any need for this feature, so I'm probably the wrong person to answer on how people would like to use this...
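
Purely for illustration, the streaming shape being asked about could look roughly like this, assuming the csv crate and the row-at-a-time send/finalize interface that appears later in this thread (the file name and table are made up):

    let mut rdr = csv::Reader::from_path("huge.csv")?;
    let mut req = client.bulk_insert("my_table").await?;

    // Each row becomes its own ROW token on the wire, so the whole file
    // never has to sit in memory at once.
    for record in rdr.records() {
        let record = record?;
        let row = (record[0].to_string(), record[1].to_string()).into_row();
        req.send(row).await?;
    }
    req.finalize().await?;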

@nickolay

I'm not sure I understand what you're saying... Note that each row is a separate token in TDS, so there shouldn't be any protocol-level problems; the question is designing how the Rust API should look.

@pimeys (Contributor) commented Mar 25, 2021

Mm, true. If there's anybody who'd actually need this feature, could you write down in this ticket how you'd like the API to look? We don't need to go fully async on the first try, but at least a good first PR would be nice to have for commenting...

@pimeys mentioned this issue on Jul 27, 2021
@pimeys (Contributor) commented Jul 29, 2021

Could people needing bulk inserts test and see if the WIP API works for you? I don't have that much time for it anymore, so if you want it to get finished faster, help with writing tests and fixing the issues would be nice...

@mcamou (Author) commented Jul 30, 2021

Hi Julius, thanks for the effort. I'll test it next week.

@rogerpengyu (Contributor)

What's the latest on this topic?

@pimeys (Contributor) commented Apr 8, 2022

The PR is there. The annoying thing is that you have to define the columns up front exactly right, or nothing happens. I think the API sucks, and we have no need for it ourselves, so I get no time to work on the subject.

Check the PR, and if you need the feature, some advice on how the API should look would be nice.

@janpio removed the kind/feedback (Issue for gathering feedback.) label on May 31, 2022
@rogerpengyu (Contributor)

I will spend some time on this if no one else is looking into it.

@rogerpengyu (Contributor)

Could someone take a look at #227?

This is the interface:

    let mut req = client.bulk_insert("bulk_test1").await?;

    info!("start loading data");
    for i in 0..1000 {
        // Alternate between a value and NULL on every other row.
        let int_column = [Some(32), None][i % 2];
        let float_column = [Some(34f32), None][i % 2];
        let string_column = [Some("aaa"), None][i % 2];

        let row = (int_column, float_column, string_column).into_row();

        req.send(row).await?;
    }

    // Finalize commits the stream by sending the closing DONE token.
    req.finalize().await?;

@pimeys (Contributor) commented Aug 23, 2022

This was merged already a while ago.

@pimeys closed this as completed on Aug 23, 2022