mysql update benchmark

TLDR

When performing bulk UPDATEs of your MySQL data (e.g. during backfills), use bulk INSERT ... ON DUPLICATE KEY UPDATE ... statements. Sort your data in primary key order. This can result in speed gains of 130x over the naive approach of performing one UPDATE statement per row.

Introduction

This repo contains a command line script update_benchmark which allows us to compare the performance of different methods for updating data in MySQL.

MySQL documentation gives several tips about optimizing INSERT performance, including a section specific to InnoDB. To my knowledge, there is no equivalent page of tips for optimizing UPDATE performance in the official MySQL documentation, and the INSERT tips do not estimate how much each one may improve performance. Some of the INSERT tips may nevertheless apply to UPDATE statements, and the benchmarks run here are intended to shed some light on these gaps.

The update_benchmark script is intended to be run from a client host that connects to a remote MySQL server over the network. It allows test data to be inserted into MySQL easily and quickly. All of the inserts and updates it performs are done in a single-threaded process that reuses the same database connection for every query.

Usage

% ./update_benchmark -h
mysql update benchmark

Usage: ./update_benchmark -H HOST -P PORT -u USER -p PASSWORD -T TABLE -t TEST_TYPE [-b BATCH_SIZE] [-n NUM_ROWS] [-l LIMIT] [-s] [-f] [-d]
    -h              Show this message
    -H  HOST        mysql host
    -P  PORT        mysql port
    -u  USER        mysql user
    -p  PASSWORD    mysql password
    -T  table       mysql table to use. This must be a fully qualified table name: `database.table`
    -t  TEST_TYPE   Test type. One of 'individual', 'transaction', 'iodku', or 'insert'
    -s              Sort. Insert / update rows in sorted PK order.
    -d              Disable mysql binary logging
    -b  BATCH_SIZE  Batch size. Applicable for 'transaction', 'iodku', and 'insert' test types. Default: 1000.
    -n  NUM_ROWS    Number of rows to create when test type is 'insert'. Default: 1000000.
    -l  LIMIT       Limit the number of rows to update. Default: unlimited.
    -f              Force. Use to overwrite existing data.
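
For example, to benchmark IODKU updates in sorted primary key order with a batch size of 10,000 against an already-populated table, an invocation might look like the following (illustrative only; substitute your own connection details):

./update_benchmark -H MYSQL_HOST -P 3306 -u MYSQL_USER -p MYSQL_PASSWORD -T dasl.dasl_test -t iodku -b 10000 -s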

Benchmark equipment and setup

We ran these benchmark tests against Percona Server for MySQL:

% mysql --version
mysql  Ver 14.14 Distrib 5.7.29-32, for Linux (x86_64) using  6.2

The MySQL server machine was hosted in Google Cloud Platform on an n2-highmem-4 instance (4 vCPUs, 32 GB memory) with the Intel Cascade Lake CPU platform. MySQL data was stored on a 38 GB SSD persistent disk. Queries were issued to the server from a remote client host. We tested against InnoDB tables. MySQL had a 15 GB buffer pool.

Methods used

MySQL documentation gives several tips about optimizing INSERT performance, some of which may be applicable to optimizing UPDATE performance:

The time required for inserting a row is determined by the following factors, where the numbers indicate approximate proportions:

  • Connecting: (3)

  • Sending query to server: (2)

  • Parsing query: (2)

  • Inserting row: (1 × size of row)

  • Inserting indexes: (1 × number of indexes)

  • Closing: (1)

MySQL documentation also provides several InnoDB specific tips.

The buffer pool is a RAM-based cache for InnoDB table data. Use of the buffer pool can prevent disk reads, which are generally much slower than reads from RAM. InnoDB documentation states that writing data in primary key order can give a performance boost in many cases, although the boost may be more apparent in tables that do not fit in the buffer pool. Thus, our performance numbers may change when run on larger tables:

When performing bulk inserts, it is faster to insert rows in PRIMARY KEY order. InnoDB tables use a clustered index, which makes it relatively fast to use data in the order of the PRIMARY KEY. Performing bulk inserts in PRIMARY KEY order is particularly important for tables that do not fit entirely within the buffer pool.

With this information in mind, we compared three different methods of updating data in our benchmark tests below.

Method 1: individual updates

The "individual update" test is the most "naive" method. It will essentially run 1,000,000 of the following UPDATE queries in autocommit mode:

UPDATE dasl.dasl_test SET user_id = :user_id WHERE email = :email

We tested executing the queries in primary key order (sorted by email) and random order.
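
A minimal sketch of this loop in Python using PyMySQL is shown below. This is only an illustration under assumed connection details, not the repo's actual implementation; the rows iterable is a hypothetical stand-in for however the (email, user_id) pairs are produced.

import pymysql

# Reuse one connection for every query, as the benchmark script does.
conn = pymysql.connect(host="MYSQL_HOST", port=3306, user="MYSQL_USER",
                       password="MYSQL_PASSWORD", autocommit=True)

def individual_updates(rows):
    # rows: iterable of (email, user_id) pairs, optionally pre-sorted by email (the PK)
    with conn.cursor() as cur:
        for email, user_id in rows:
            # One round trip and one implicit commit (log flush) per row.
            cur.execute(
                "UPDATE dasl.dasl_test SET user_id = %s WHERE email = %s",
                (user_id, email),
            )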

Method 2: individual updates within a transaction, varying transaction batch size

The "transaction" test improves upon Method 1 by wrapping batches of UPDATE statements in database transactions. Essentially, it will do something like:

START TRANSACTION;
UPDATE dasl.dasl_test SET user_id = :user_id WHERE email = :email
UPDATE dasl.dasl_test SET user_id = :user_id WHERE email = :email
...
UPDATE dasl.dasl_test SET user_id = :user_id WHERE email = :email
COMMIT;

START TRANSACTION;
UPDATE dasl.dasl_test SET user_id = :user_id WHERE email = :email
UPDATE dasl.dasl_test SET user_id = :user_id WHERE email = :email
...
UPDATE dasl.dasl_test SET user_id = :user_id WHERE email = :email
COMMIT;

# repeat until all rows are updated

The batch size governs how many UPDATE statements we run within each transaction. The intuition behind this approach is to perform fewer flushes to disk (docs):

When importing data into InnoDB, turn off autocommit mode, because it performs a log flush to disk for every insert.

We tested executing the queries in primary key order (sorted by email) and random order.
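
A sketch of this batching in Python follows, assuming the same PyMySQL connection as in the Method 1 sketch above (again a hypothetical illustration, not the repo's code):

def updates_in_transactions(conn, rows, batch_size=10000):
    conn.autocommit(False)  # turn autocommit off so we control when log flushes happen
    with conn.cursor() as cur:
        for i, (email, user_id) in enumerate(rows, start=1):
            cur.execute(
                "UPDATE dasl.dasl_test SET user_id = %s WHERE email = %s",
                (user_id, email),
            )
            if i % batch_size == 0:
                conn.commit()  # one flush per batch instead of one per row
        conn.commit()  # commit any trailing partial batch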

Method 3: INSERT ... ON DUPLICATE KEY UPDATE ... (IODKU) updates, varying IODKU batch size

MySQL's UPDATE statement does not allow us to perform bulk updates in which each row receives its own value. With an UPDATE statement, we could issue a query like:

UPDATE dasl.dasl_test SET user_id = :user_id
WHERE email IN ('user1@example.com', 'user2@example.com', ..., 'userN@example.com');

But this would update every row matching the WHERE clause with the same user_id. In our examples, we want to update each row with a different user_id.

MySQL provides a SQL construct INSERT ... ON DUPLICATE KEY UPDATE ... (IODKU) that may be used as a bulk update query:

INSERT INTO dasl.dasl_test (email, user_id) VALUES
('user1@example.com', :user_id1), ('user2@example.com', :user_id2), ..., ('userN@example.com', :user_idN)
ON DUPLICATE KEY UPDATE user_id = VALUES(user_id);

MySQL enforces uniqueness on a table's primary key, so inserting a row with the same primary key as an existing row generates a duplicate key error. If we construct the values we are inserting so that they are guaranteed to generate duplicate key errors (i.e. the emails we attempt to INSERT already exist in the table), then the ON DUPLICATE KEY UPDATE clause of the IODKU query is executed instead. In the above example, this results in each row being updated with the different user_id specified for it in the VALUES clause of the statement.

For example, given our schema:

CREATE TABLE `dasl_test` (
  `email` varchar(50) NOT NULL,
  `user_id` bigint(20) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`email`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8

If we have the following data in the table:

+-------------------+---------+
| email             | user_id |
+-------------------+---------+
| user1@example.com |       0 |
| user2@example.com |       0 |
| user3@example.com |       0 |
+-------------------+---------+

Then if we issue an IODKU statement:

INSERT INTO dasl.dasl_test (email, user_id) VALUES
('user1@example.com', 42798), ('user2@example.com', 29238), ('user3@example.com', 987238)
ON DUPLICATE KEY UPDATE user_id = VALUES(user_id);

We will then have the following data in the table:

+-------------------+---------+
| email             | user_id |
+-------------------+---------+
| user1@example.com |   42798 |
| user2@example.com |   29238 |
| user3@example.com |  987238 |
+-------------------+---------+

The "IODKU" test improves upon Method 2 by issuing fewer SQL statements. Given a batch size N, essentially, it will do something like:

INSERT INTO dasl.dasl_test (email, user_id) VALUES
(:email1, :user_id1), (:email2, :user_id2), ..., (:emailN, :user_idN)
ON DUPLICATE KEY UPDATE user_id = VALUES(user_id);

INSERT INTO dasl.dasl_test (email, user_id) VALUES
(:email1, :user_id1), (:email2, :user_id2), ..., (:emailN, :user_idN)
ON DUPLICATE KEY UPDATE user_id = VALUES(user_id);

# repeat until all rows are updated

The batch size governs how many rows we update with each SQL statement. The intuition behind this approach is to send fewer queries to the server and parse fewer queries on the server, which should reduce communication overhead between the client and server.
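
A sketch of how such batched IODKU statements might be built and issued with PyMySQL is shown below (a hypothetical helper, not the repo's actual code; the multi-row VALUES placeholders are expanded by hand):

def iodku_updates(conn, rows, batch_size=10000):
    with conn.cursor() as cur:
        def flush(batch):
            # Build one multi-row IODKU statement: (%s, %s), (%s, %s), ...
            placeholders = ", ".join(["(%s, %s)"] * len(batch))
            sql = ("INSERT INTO dasl.dasl_test (email, user_id) VALUES " + placeholders
                   + " ON DUPLICATE KEY UPDATE user_id = VALUES(user_id)")
            # Flatten [(email1, id1), (email2, id2), ...] into (email1, id1, email2, id2, ...)
            params = [value for pair in batch for value in pair]
            cur.execute(sql, params)
            conn.commit()  # commit after each batched statement

        batch = []
        for row in rows:
            batch.append(row)
            if len(batch) == batch_size:
                flush(batch)
                batch = []
        if batch:
            flush(batch)  # flush any trailing partial batch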

Updating data when table fits in buffer pool: benchmark results

In the tests below, we first inserted 1,000,000 rows into a test table with the following schema:

CREATE TABLE `dasl_test` (
  `email` varchar(50) NOT NULL,
  `user_id` bigint(20) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`email`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8

This table should be small enough that it fits in the buffer pool. Thus, sorting the data in primary key order may be less impactful than if the table were bigger than the buffer pool. We inserted this data using the update_benchmark script:

./update_benchmark -H MYSQL_HOST -P 3306 -u MYSQL_USER -p MYSQL_PASSWORD -T dasl.dasl_test -t insert -b 10000 -f -s -d

The update_benchmark script generates a random email for each of the 1,000,000 rows it inserts and sets each user_id to 0. We then measured how long it took to update the user_id column to a unique value for each row using the various methods. These benchmark tests were designed to resemble how one might backfill a new user_id column on a table.
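
The test rows might be generated along the lines of the following sketch (the script's exact email format and random generator are not documented here, so the details below are assumptions):

import random
import string

def random_email(length=12):
    # Random lowercase local part at an assumed placeholder domain.
    local = "".join(random.choices(string.ascii_lowercase, k=length))
    return local + "@example.com"

rows = [(random_email(), 0) for _ in range(1_000_000)]  # user_id starts at 0
rows.sort()  # with -s, rows are inserted in sorted primary key (email) order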

Method 1: individual updates

Updating the data in random vs sorted primary key order did not appear to affect performance much. In both cases, the median benchmark time was 46 minutes. Full results are in this spreadsheet under the heading updating 1,000,000 rows via individual updates.

Method 2: individual updates within a transaction, varying transaction batch size

Once again, we tested executing the queries in primary key order (sorted by email) and random order. Updating the data in sorted primary key order may have helped marginally, but not by much. Increasing the batch size helped up to a certain point: there appear to be diminishing returns beyond a batch size of 10,000, and increasing it beyond 100,000 may adversely impact performance. Thus we recommend a sweet spot of 10,000 for the batch size.

Using a batch size of 10,000, the median benchmark time for updating in random order was 3.7 minutes, and the median benchmark time for updating in sorted primary key order was 3.6 minutes. Full results are in this spreadsheet under the heading updating 1,000,000 rows via individual updates in transactions, varying transaction batch size. This is approximately a 12x speedup compared to Method 1.

Method 3: IODKU updates, varying IODKU batch size

Once again, we tested executing the queries in primary key order (sorted by email) and random order. Unlike in Methods 1 and 2, updating the data in sorted primary key order had a clearer performance benefit with this approach. As with Method 2, increasing the batch size helped up to a certain point: there appear to be diminishing returns beyond a batch size of 10,000, and increasing it beyond 100,000 may adversely impact performance. Thus we again recommend a sweet spot of 10,000 for the batch size.

Using a batch size of 10,000, the median benchmark time for updating in random order was 25.1 seconds, and the median benchmark time for updating in sorted primary key order was 21.0 seconds. Full results are in this spreadsheet under the heading updating 1,000,000 rows via iodku, varying iodku batch size. This is approximately a 10x speedup compared to Method 2 and a 130x speedup compared with Method 1.

Updating data when table is too big to fit in buffer pool: benchmark results

In the tests below, we first inserted 500,000,000 rows into a test table with the same schema as before:

CREATE TABLE `dasl_test` (
  `email` varchar(50) NOT NULL,
  `user_id` bigint(20) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`email`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8

This table should be large enough that it does not entirely fit in the buffer pool. Thus, sorting the data in primary key order may be more impactful than if the table were smaller than the buffer pool. We inserted this data using the update_benchmark script:

./update_benchmark -H MYSQL_HOST -P 3306 -u MYSQL_USER -p MYSQL_PASSWORD -T dasl.dasl_test -t insert -b 10000 -f -s -d -n 500000000

The update_benchmark script generates a random email for each of the 500,000,000 rows it inserts and sets each user_id to 0. We then measured how long it took to update the user_id column to a unique value for a subset of these rows (the first 125,000,000 rows) using the various methods. These benchmark tests were designed to resemble how one might backfill a new user_id column on a table. We ran a similar suite of tests as in the section Updating data when table fits in buffer pool: benchmark results above.

Method 1: individual updates

Updating the data in sorted primary key order was only marginally faster than updating in random order. In the random order case, the median benchmark time was 52.7 hours, whereas in the sorted case, the median benchmark time was 51.9 hours. Full results are in this spreadsheet under the heading updating 125,000,000 rows via individual updates.

Method 2: individual updates within a transaction, varying transaction batch size

Once again, we tested executing the queries in primary key order (sorted by email) and random order. Updating the data in sorted primary key order had a clear performance benefit. Increasing the batch size helped up to a certain point: there appear to be diminishing returns beyond a batch size of 10,000, and performance may even be adversely impacted past that point. Thus we recommend a sweet spot of 10,000 for the batch size.

Using a batch size of 10,000, the median benchmark time for updating in random order was 9.5 hours, and the median benchmark time for updating in sorted primary key order was 6.7 hours. Full results are in this spreadsheet under the heading updating 125,000,000 rows via individual updates in transactions, varying transaction batch size. This is approximately an 8x speedup compared to Method 1.

Method 3: IODKU updates, varying IODKU batch size

Once again, we tested executing the queries in primary key order (sorted by email) and random order. As with Method 2, updating the data in sorted primary key order had a clear performance benefit, and increasing the batch size helped up to a certain point. Although performance continued to improve at batch sizes larger than 10,000, the gains were relatively small, and I'd be wary of locking rows in the table for extended periods with larger batches, since individual statement execution times may increase (TODO: this could be investigated more thoroughly).

Using a batch size of 10,000, the median benchmark time for updating in random order was 6.1 hours, and the median benchmark time for updating in sorted primary key order was 50.6 minutes. Full results are in this spreadsheet under the heading updating 125,000,000 rows via iodku, varying iodku batch size. This is approximately an 8x speedup compared to Method 2 and a 60x speedup compared with Method 1.

Future work

  • Test the optimal strategy for parallelizing work. In these benchmark tests, all updates were done in a single thread.
