
Add bulk loading #104

Closed
mcamou opened this issue Nov 26, 2020 · 16 comments
Labels: help wanted (Extra attention is needed) · kind/feature (A request for a new feature.)

Comments

@mcamou commented Nov 26, 2020

Running INSERT INTO table VALUES(...) one row at a time is really slow, especially when inserting against e.g. an Azure server. You can batch inserts with INSERT INTO table VALUES ({},{}),({},{}), but that is limited to 1,000 rows per statement. You can also use INSERT INTO table SELECT a,b FROM (VALUES ({},{}),({},{})) sub(a,b) as mentioned in steffengy#93, but then you are limited to 2,100 parameters per query.
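
As a rough illustration of the batched form (a sketch only: the table, columns, and helper name are invented; it assumes tiberius's Client::execute with @PN parameters):

    use tiberius::{Client, ToSql};
    use tokio::net::TcpStream;
    use tokio_util::compat::Compat;

    /// Hypothetical helper: insert a slice of rows with one parameterized
    /// INSERT. Each row consumes 2 parameters and SQL Server caps a query
    /// at 2,100 parameters (and a VALUES constructor at 1,000 rows), so
    /// keep the slices small.
    async fn insert_batch(
        client: &mut Client<Compat<TcpStream>>,
        rows: &[(i32, &str)],
    ) -> Result<(), tiberius::error::Error> {
        let placeholders: Vec<String> = (0..rows.len())
            .map(|i| format!("(@P{},@P{})", i * 2 + 1, i * 2 + 2))
            .collect();
        let sql = format!(
            "INSERT INTO my_table (a, b) VALUES {}",
            placeholders.join(",")
        );

        // Flatten the rows into one parameter list matching the placeholders.
        let params: Vec<&dyn ToSql> = rows
            .iter()
            .flat_map(|(a, b)| [a as &dyn ToSql, b as &dyn ToSql])
            .collect();

        client.execute(sql, &params).await?;
        Ok(())
    }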

A solution to this would be to implement something similar to what Tedious (NodeJS library) implements: https://tediousjs.github.io/tedious/bulk-load.html

@esheppa (Contributor) commented Nov 29, 2020

Hi @mcamou, there is another alternative as well, which should offer decent performance:

  1. Submit your data from the Rust side as a JSON string
  2. In the database query, use something like insert into <mytable> select * from openjson(@P1)

There are some examples of this in a project I'm working on (a rough sketch also follows below):

  1. Batched inserting (Rust side)
  2. Inserting JSON string into table (SQL Server side)
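
A minimal sketch of that approach, assuming the serde and serde_json crates and a made-up table and column layout (my_table with columns a and b):

    use serde::Serialize;
    use tiberius::Client;
    use tokio::net::TcpStream;
    use tokio_util::compat::Compat;

    #[derive(Serialize)]
    struct Item {
        a: i32,
        b: String,
    }

    /// Hypothetical helper: ship all rows as one JSON string in a single
    /// parameter and let OPENJSON expand it server-side, sidestepping the
    /// 2,100-parameter limit entirely.
    async fn insert_json(
        client: &mut Client<Compat<TcpStream>>,
        items: &[Item],
    ) -> Result<(), tiberius::error::Error> {
        let json = serde_json::to_string(items).expect("items serialize to JSON");
        let sql = "INSERT INTO my_table (a, b) \
                   SELECT a, b \
                   FROM OPENJSON(@P1) WITH (a INT '$.a', b NVARCHAR(255) '$.b')";
        client.execute(sql, &[&json]).await?;
        Ok(())
    }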

@Mart-Bogdan

@esheppa nice hint.
Quite off-topic for tiberius, but regarding your project: you are using chunks of 100,000 rows. This is unfortunate, as SQL Server performs lock escalation when a single query affects more than 5,000 rows.

A safe bet is 4,500 rows per batch.

This won't make the insert faster, but it won't block other queries.
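
Applied to the hypothetical insert_json helper sketched above, the batching advice would look like this:

    // Keep each statement under SQL Server's ~5,000-row lock-escalation
    // threshold by chunking at 4,500 rows.
    for chunk in items.chunks(4500) {
        insert_json(&mut client, chunk).await?;
    }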

@pimeys (Contributor) commented Dec 14, 2020

Regarding the idea, I'd like to see something even better than what Tedious does: a streaming insert. Provide an async file stream with a mapping, save memory, and execute a fast bulk insert.

I'm fascinated by the idea, but I'm not sure whether it would be possible to get working. I'm also extremely busy for the next few months, and the next thing planned from me for Tiberius is to get TLS working on macOS. If somebody has time before spring and would like to try something out, I'd be interested to see a pull request!

@Mart-Bogdan

It's definitely possible. I've used it from Java JDBC via the SQLBulkCopy class.
So it's covered by the TDS protocol, but it could be a huge amount of work.

@pimeys (Contributor) commented Dec 15, 2020

I don't think it's a huge amount of work; we already have good implementations in JDBC and in Tedious. It's a good idea to read the TDS manual, but it might be easier to just look at what Tedious and the JDBC driver do.

I hope the facilities in Tiberius are good enough, and I can provide help in our Discord server if you want to try your luck implementing it. We should discuss beforehand how the implementation should look.

Actually, I'm quite excited to write this myself. I'm just quite busy now, and I know the next thing I have allocated for Tiberius from work is going to be the macOS TLS fix, so I doubt I can start this that soon...

@pimeys added the help wanted, kind/feedback (Issue for gathering feedback.), and kind/feature labels on Dec 15, 2020
@nickolay

I was curious about this, and it turned out to be not very hard to get the basics working (for a 1x2 table with hard-coded structure and content for now). See https://github.com/nickolay/tiberius/commits/wip/bulkload

According to https://winprotocoldoc.blob.core.windows.net/productionwindowsarchives/MS-TDS/%5BMS-TDS%5D.pdf the bulk-load exchange can be expressed in terms of the structs already existing in tiberius:

  • An "INSERT BULK table_name (col1 type1, col2 type2, ...)" SQL batch request (where the columns and types have to match a subset of the target table)
  • A BulkLoadBCP request, consisting of
    • COLMETADATA token (TokenColMetaData), containing a number of column definitions (MetaDataColumn -> BaseMetaDataColumn -> BitFlags<ColumnFlag>, TypeInfo)
    • A number of ROW tokens (TokenRow -> ColumnData)
    • A DONE token (TokenDone)

So the bulk of the changes needed to make this work is simply adding Encode impls for the existing types. Only ColumnData is special: it already has an Encode implementation, which conflates encoding the column data with encoding its type info. Since the bulk-load stream does not repeat the type info before each value, I've split the data encoding and the type-info encoding into separate methods (encode_type_info_for_rpc).

The new bits are: a new bulk_load_tinyint test calls the new Client::bulk_load method, passing in the table definition in a new BulkLoadMetadata struct. Under the hood, the new BulkLoadRequest struct encodes the BulkLoadBCP request.

I leave the interesting part of designing the streaming API and the tedious work of implementing and testing the bulk uploading of all the supported types to someone else.
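
To make the shape of the branch concrete, here is a very rough sketch of how the WIP surface described above might be exercised. BulkLoadMetadata and Client::bulk_load live on the linked branch, so everything below is a guess at the in-progress API, not its final form:

    // Hypothetical usage of the WIP bulk-load surface; types from tiberius.
    let meta = BulkLoadMetadata::new(); // column definitions matching the target table
    let mut load = client.bulk_load("bulk_test", meta).await?;
    load.send_row(ColumnData::U8(Some(1))).await?; // one ROW token per row
    load.finalize().await?; // writes the closing DONE token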

@pimeys (Contributor) commented Mar 25, 2021

Can we even do a streaming API for bulk loads? It would be nice to read data from e.g. a huge CSV file in batches into memory, write a batch to the database, then load the next one.

Does TDS actually allow this, or do you have to write all the data to the token before you can write it to the wire?

I'm not going to have any need for this feature, so I'm probably the wrong person to answer on how people would like to use this...
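
Purely for illustration, the streaming shape being asked about could look roughly like this, assuming the csv crate and the row-at-a-time send/finalize interface that appears later in this thread (the file name and table are made up):

    let mut rdr = csv::Reader::from_path("huge.csv")?;
    let mut req = client.bulk_insert("my_table").await?;

    // Each row becomes its own ROW token on the wire, so the whole file
    // never has to sit in memory at once.
    for record in rdr.records() {
        let record = record?;
        let row = (record[0].to_string(), record[1].to_string()).into_row();
        req.send(row).await?;
    }
    req.finalize().await?;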

@nickolay

I'm not sure I understand what you're saying... Note that each row is a separate token in TDS, so there shouldn't be any protocol-level problems; the question is designing how the Rust API should look.

@pimeys (Contributor) commented Mar 25, 2021

Mm, true. If there's anybody who'd actually need this feature, could you write down in this ticket how you'd like the API to look? We don't need to go fully async on the first try, but at least a good first PR would be nice to have for commenting...

@pimeys mentioned this issue on Jul 27, 2021
@pimeys (Contributor) commented Jul 29, 2021

Could people needing bulk inserts test and see if the WIP API works for you? I don't have that much time for it anymore, so if you want it to get finished faster, help with writing tests and fixing the issues would be nice...

@mcamou (Author) commented Jul 30, 2021

Hi Julius, thanks for the effort. I'll test it next week.

@rogerpengyu (Contributor)

What's the latest on this topic?

@pimeys (Contributor) commented Apr 8, 2022

The PR is there. The annoying thing is that you have to define the columns up front exactly right, or nothing happens. I think the API sucks, and we have no need for it ourselves, so I get no time to work on the subject.

Check the PR, and if you need the feature, some advice on how the API should look would be nice.

@janpio removed the kind/feedback (Issue for gathering feedback.) label on May 31, 2022
@rogerpengyu (Contributor)

I will spend some time on this if no one else is looking into it.

@rogerpengyu (Contributor)

Could someone take a look at #227?

This is the interface:

    let mut req = client.bulk_insert("bulk_test1").await?;

    info!("start loading data");
    for i in 0..1000 {
        // Alternate between a value and NULL on every other row.
        let int_column = [Some(32), None][i % 2];
        let float_column = [Some(34f32), None][i % 2];
        let string_column = [Some("aaa"), None][i % 2];

        let row = (int_column, float_column, string_column).into_row();

        req.send(row).await?;
    }

    // Finalize commits the stream by sending the closing DONE token.
    req.finalize().await?;

@pimeys (Contributor) commented Aug 23, 2022

This was merged already a while ago.

@pimeys closed this as completed on Aug 23, 2022