From 872aa77797d0f275fcd0ffa4ced516e3e42294ad Mon Sep 17 00:00:00 2001 From: TieweiFang Date: Fri, 22 Nov 2024 10:16:04 +0800 Subject: [PATCH 1/4] fix 1 --- .../data-operate/export/export-manual.md | 386 ++++++++++-------- .../current/data-operate/export/outfile.md | 233 +++++++---- .../Show-Statements/SHOW-EXPORT.md | 29 ++ 3 files changed, 386 insertions(+), 262 deletions(-) diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/export-manual.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/export-manual.md index 059da958f1b25..3fa284c376e2a 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/export-manual.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/export-manual.md @@ -26,132 +26,78 @@ under the License. 本文档将介绍如何使用`EXPORT`命令导出 Doris 中存储的数据。 -有关`EXPORT`命令的详细介绍,请参考:[EXPORT](../../sql-manual/sql-statements/Data-Manipulation-Statements/Manipulation/EXPORT.md) - -## 概述 - `Export` 是 Doris 提供的一种将数据异步导出的功能。该功能可以将用户指定的表或分区的数据,以指定的文件格式,导出到目标存储系统中,包括对象存储、HDFS 或本地文件系统。 `Export` 是一个异步执行的命令,命令执行成功后,立即返回结果,用户可以通过`Show Export` 命令查看该 Export 任务的详细信息。 -关于如何选择 `SELECT INTO OUTFILE` 和 `EXPORT`,请参阅 [导出综述](../../data-operate/export/export-overview.md)。 - -`EXPORT` 当前支持导出以下类型的表或视图 - -* Doris 内表 -* Doris 逻辑视图 -* Doris Catalog 表 - -`EXPORT` 目前支持以下导出格式 - -* Parquet -* ORC -* csv -* csv\_with\_names -* csv\_with\_names\_and\_types +有关`EXPORT`命令的详细介绍,请参考:[EXPORT](../../sql-manual/sql-statements/Data-Manipulation-Statements/Manipulation/EXPORT.md) -不支持压缩格式的导出。 +关于如何选择 `SELECT INTO OUTFILE` 和 `EXPORT`,请参阅 [导出综述](../../data-operate/export/export-overview.md)。 示例: ```sql mysql> EXPORT TABLE tpch1.lineitem TO "s3://my_bucket/path/to/exp_" - -> PROPERTIES( - -> "format" = "csv", - -> "max_file_size" = "2048MB" - -> ) - -> WITH s3 ( - -> "s3.endpoint" = "${endpoint}", - -> "s3.region" = "${region}", - -> "s3.secret_key"="${sk}", - -> "s3.access_key" = "${ak}" - -> ); + PROPERTIES( + "format" = "csv", + "max_file_size" = "2048MB" + ) + WITH s3 ( + "s3.endpoint" = "${endpoint}", + "s3.region" = "${region}", + "s3.secret_key"="${sk}", + "s3.access_key" = "${ak}" + ); ``` -提交作业后,可以通过 [SHOW EXPORT](../../sql-manual/sql-statements/Show-Statements/SHOW-EXPORT.md) 命令查询导出作业状态,结果举例如下: +--- -```sql -mysql> show export\G -*************************** 1. 
row *************************** - JobId: 143265 - Label: export_0aa6c944-5a09-4d0b-80e1-cb09ea223f65 - State: FINISHED - Progress: 100% - TaskInfo: {"partitions":[],"parallelism":5,"data_consistency":"partition","format":"csv","broker":"S3","column_separator":"\t","line_delimiter":"\n","max_file_size":"2048MB","delete_existing_files":"","with_bom":"false","db":"tpch1","tbl":"lineitem"} - Path: s3://ftw-datalake-test-1308700295/test_ycs_activeDefense_v10/test_csv/exp_ - CreateTime: 2024-06-11 18:01:18 - StartTime: 2024-06-11 18:01:18 - FinishTime: 2024-06-11 18:01:31 - Timeout: 7200 - ErrorMsg: NULL -OutfileInfo: [ - [ - { - "fileNumber": "1", - "totalRows": "6001215", - "fileSize": "747503989bytes", - "url": "s3://my_bucket/path/to/exp_6555cd33e7447c1-baa9568b5c4eb0ac_*" - } - ] -] -1 row in set (0.00 sec) -``` +## 基本原理 -`show export` 命令返回的结果各个列的含义如下: - -* JobId:作业的唯一 ID -* Label:该导出作业的标签,如果 Export 没有指定,则系统会默认生成一个。 -* State:作业状态: - * PENDING:作业待调度 - * EXPORTING:数据导出中 - * FINISHED:作业成功 - * CANCELLED:作业失败 -* Progress:作业进度。该进度以查询计划为单位。假设一共 10 个线程,当前已完成 3 个,则进度为 30%。 -* TaskInfo:以 Json 格式展示的作业信息: - * db:数据库名 - * tbl:表名 - * partitions:指定导出的分区。`空`列表 表示所有分区。 - * column\_separator:导出文件的列分隔符。 - * line\_delimiter:导出文件的行分隔符。 - * tablet num:涉及的总 Tablet 数量。 - * broker:使用的 broker 的名称。 - * coord num:查询计划的个数。 - * max\_file\_size:一个导出文件的最大大小。 - * delete\_existing\_files:是否删除导出目录下已存在的文件及目录。 - * columns:指定需要导出的列名,空值代表导出所有列。 - * format:导出的文件格式 -* Path:远端存储上的导出路径。 -* CreateTime/StartTime/FinishTime:作业的创建时间、开始调度时间和结束时间。 -* Timeout:作业超时时间。单位是秒。该时间从 CreateTime 开始计算。 -* ErrorMsg:如果作业出现错误,这里会显示错误原因。 -* OutfileInfo:如果作业导出成功,这里会显示具体的`SELECT INTO OUTFILE`结果信息。 +Export 任务的底层是执行`SELECT INTO OUTFILE` SQL 语句。用户发起一个 Export 任务后,Doris 会根据 Export 要导出的表构造出一个或多个 `SELECT INTO OUTFILE` 执行计划,随后将这些`SELECT INTO OUTFILE` 执行计划提交给 Doris 的 Job Schedule 任务调度器,Job Schedule 任务调度器会自动调度这些任务并执行。 -提交 Export 作业后,在 Export 任务成功或失败之前可以通过 [CANCEL EXPORT](../../sql-manual/sql-statements/Data-Manipulation-Statements/Manipulation/CANCEL-EXPORT.md) 命令取消导出作业。取消命令举例如下: +默认情况下,Export 任务是单线程执行的。为了提高导出的效率,Export 命令可以设置一个 `parallelism` 参数来并发导出数据。设置`parallelism` 大于 1 后,Export 任务会使用多个线程并发的去执行 `SELECT INTO OUTFILE` 查询计划。`parallelism`参数实际就是指定执行 EXPORT 作业的线程数量。 -```sql -CANCEL EXPORT FROM tpch1 WHERE LABEL like "%export_%"; -``` +## 使用场景 +`Export` 适用于以下场景: -## 导出文件列类型映射 +- 大数据量的单表导出、仅需简单的过滤条件。 +- 需要异步提交任务的场景。 -`Export`支持导出数据为 Parquet、ORC 文件格式。Parquet、ORC 文件格式拥有自己的数据类型,Doris 的导出功能能够自动将 Doris 的数据类型导出为 Parquet、ORC 文件格式的对应数据类型,具体映射关系请参阅[导出综述](../../data-operate/export/export-overview.md)文档的 "导出文件列类型映射" 部分。 +使用 `Export` 时需要注意以下限制: +1. 
当前不支持压缩格式的导出。 -## 示例 +## 快速上手 +### 建表与导入数据 +```sql +CREATE TABLE IF NOT EXISTS tbl ( + `c1` int(11) NULL, + `c2` string NULL, + `c3` bigint NULL +) +DISTRIBUTED BY HASH(c1) BUCKETS 20 +PROPERTIES("replication_num" = "1"); -### 导出到 HDFS -将 db1.tbl1 表的 p1 和 p2 分区中的`col1` 列和`col2` 列数据导出到 HDFS 上,设置导出作业的 label 为 `mylabel`。导出文件格式为 csv(默认格式),列分割符为`,`,导出作业单个文件大小限制为 512MB。 +insert into tbl values + (1, 'doris', 18), + (2, 'nereids', 20), + (3, 'pipelibe', 99999), + (4, 'Apache', 122123455), + (5, null, null); +``` + +### 创建导出作业 + +#### 导出到HDFS +将 tbl 表的所有数据导出到 HDFS 上,设置导出作业的文件格式为 csv(默认格式),并设置列分割符为`,`。 ```sql -EXPORT TABLE db1.tbl1 -PARTITION (p1,p2) +EXPORT TABLE tbl TO "hdfs://host/path/to/export/" PROPERTIES ( - "label" = "mylabel", - "column_separator"=",", - "max_file_size" = "512MB", - "columns" = "col1,col2" + "line_delimiter" = "," ) with HDFS ( "fs.defaultFS"="hdfs://hdfs_host:port", @@ -162,15 +108,11 @@ with HDFS ( 如果 HDFS 开启了高可用,则需要提供 HA 信息,如: ```sql -EXPORT TABLE db1.tbl1 -PARTITION (p1,p2) +EXPORT TABLE tbl TO "hdfs://HDFS8000871/path/to/export/" PROPERTIES ( - "label" = "mylabel", - "column_separator"=",", - "max_file_size" = "512MB", - "columns" = "col1,col2" + "line_delimiter" = "," ) with HDFS ( "fs.defaultFS" = "hdfs://HDFS8000871", @@ -186,15 +128,11 @@ with HDFS ( 如果 Hadoop 集群开启了高可用并且启用了 Kerberos 认证,可以参考如下 SQL 语句: ```sql -EXPORT TABLE db1.tbl1 -PARTITION (p1,p2) +EXPORT TABLE tbl TO "hdfs://HDFS8000871/path/to/export/" PROPERTIES ( - "label" = "mylabel", - "column_separator"=",", - "max_file_size" = "512MB", - "columns" = "col1,col2" + "line_delimiter" = "," ) with HDFS ( "fs.defaultFS"="hdfs://hacluster/", @@ -211,14 +149,14 @@ with HDFS ( ); ``` -### 导出到 S3 +#### 导出到对象存储 -将 s3_test 表中的所有数据导出到 s3 上,导出格式为 csv,以不可见字符 `\x07` 作为行分隔符。 +将 tbl 表中的所有数据导出到对象存储上,设置导出作业的文件格式为 csv(默认格式),并设置列分割符为`,`。 ```sql -EXPORT TABLE s3_test TO "s3://bucket/a/b/c" +EXPORT TABLE tbl TO "s3://bucket/a/b/c" PROPERTIES ( - "line_delimiter" = "\\x07" + "line_delimiter" = "," ) WITH s3 ( "s3.endpoint" = "xxxxx", "s3.region" = "xxxxx", @@ -227,39 +165,19 @@ PROPERTIES ( ) ``` -### 导出到本地文件系统 +#### 导出到本地文件系统 + > > export 数据导出到本地文件系统,需要在 fe.conf 中添加`enable_outfile_to_local=true`并且重启 FE。 -将 test 表中的所有数据导出到本地存储: +将 tbl 表中的所有数据导出到本地文件系统,设置导出作业的文件格式为 csv(默认格式),并设置列分割符为`,`。 ```sql --- parquet 格式 -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "columns" = "k1,k2", - "format" = "parquet" -); - --- orc 格式 -EXPORT TABLE test TO "file:///home/user/tmp/" +-- csv 格式 +EXPORT TABLE tbl TO "file:///home/user/tmp/" PROPERTIES ( - "columns" = "k1,k2", - "format" = "orc" -); - --- csv_with_names 格式,以‘AA’为列分割符,‘zz’为行分割符 -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "format" = "csv_with_names", - "column_separator"="AA", - "line_delimiter" = "zz" -); - --- csv_with_names_and_types 格式 -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "format" = "csv_with_names_and_types" + "format" = "csv", + "line_delimiter" = "," ); ``` @@ -267,7 +185,119 @@ PROPERTIES ( 导出到本地文件系统的功能不适用于公有云用户,仅适用于私有化部署的用户。并且默认用户对集群节点有完全的控制权限。Doris 对于用户填写的导出路径不会做合法性检查。如果 Doris 的进程用户对该路径无写权限,或路径不存在,则会报错。同时处于安全性考虑,如果该路径已存在同名的文件,则也会导出失败。 Doris 不会管理导出到本地的文件,也不会检查磁盘空间等。这些文件需要用户自行管理,如清理等。 -### 指定分区导出 + +### 查看导出作业 +提交作业后,可以通过 [SHOW EXPORT](../../sql-manual/sql-statements/Show-Statements/SHOW-EXPORT.md) 命令查询导出作业状态,结果举例如下: + +```sql +mysql> show export\G +*************************** 1. 
row *************************** + JobId: 143265 + Label: export_0aa6c944-5a09-4d0b-80e1-cb09ea223f65 + State: FINISHED + Progress: 100% + TaskInfo: {"partitions":[],"parallelism":5,"data_consistency":"partition","format":"csv","broker":"S3","column_separator":"\t","line_delimiter":"\n","max_file_size":"2048MB","delete_existing_files":"","with_bom":"false","db":"tpch1","tbl":"lineitem"} + Path: s3://ftw-datalake-test-1308700295/test_ycs_activeDefense_v10/test_csv/exp_ + CreateTime: 2024-06-11 18:01:18 + StartTime: 2024-06-11 18:01:18 + FinishTime: 2024-06-11 18:01:31 + Timeout: 7200 + ErrorMsg: NULL +OutfileInfo: [ + [ + { + "fileNumber": "1", + "totalRows": "6001215", + "fileSize": "747503989bytes", + "url": "s3://my_bucket/path/to/exp_6555cd33e7447c1-baa9568b5c4eb0ac_*" + } + ] +] +1 row in set (0.00 sec) +``` + +有关 `show export` 命令的详细用法及其返回结果的各个列的含义可以参看 [SHOW EXPORT](../../sql-manual/sql-statements/Show-Statements/SHOW-EXPORT.md): + +### 取消导出作业 + +提交 Export 作业后,在 Export 任务成功或失败之前可以通过 [CANCEL EXPORT](../../sql-manual/sql-statements/Data-Manipulation-Statements/Manipulation/CANCEL-EXPORT.md) 命令取消导出作业。取消命令举例如下: + +```sql +CANCEL EXPORT FROM dbName WHERE LABEL like "%export_%"; +``` + +## 导出说明 + +### 导出数据源 + +`EXPORT` 当前支持导出以下类型的表或视图 + +* Doris 内表 +* Doris 逻辑视图 +* Doris Catalog 表 + +### 导出数据存储位置 + +`Export` 目前支持导出到以下存储位置: + +- 对象存储:Amazon S3、COS、OSS、OBS、Google GCS +- HDFS +- 本地文件系统 + +### 导出文件类型 + +`EXPORT` 目前支持导出为以下文件格式: + +* Parquet +* ORC +* csv +* csv\_with\_names +* csv\_with\_names\_and\_types + +### 导出文件列类型映射 + +`Export` 支持导出为 Parquet、ORC 文件格式。Parquet、ORC 文件格式拥有自己的数据类型,Doris 的导出功能能够自动将 Doris 的数据类型导出为 Parquet、ORC 文件格式的对应数据类型。 + +以下是 Doris 数据类型和 Parquet、ORC 文件格式的数据类型映射关系表: +| Doris Type | Arrow Type | Orc Type | +| ---------- | ---------- | -------- | +| boolean | boolean | boolean | +| tinyint | int8 | tinyint | +| smallint | int16 | smallint | +| int | int32 | int | +| bigint | int64 | bigint | +| largeInt | utf8 | string | +| date | utf8 | string | +| datev2 | Date32Type | string | +| datetime | utf8 | string | +| datetimev2 | TimestampType | timestamp | +| float | float32 | float | +| double | float64 | double | +| char / varchar / string| utf8 | string | +| decimal | decimal128 | decimal | +| struct | struct | struct | +| map | map | map | +| array | list | array | +| json | utf8 | string | +| variant | utf8 | string | +| bitmap | binary | binary | +| quantile_state| binary | binary | +| hll | binary | binary | + +> 注意:Doris 导出到 Parquet 文件格式时,会先将 Doris 内存数据转换为 Arrow 内存数据格式,然后由 Arrow 写出到 Parquet 文件格式。 + +## 导出示例 + +- [指定分区导出](#指定分区导出) +- [导出时过滤数据](#导出时过滤数据) +- [导出外表数据](#导出外表数据) +- [调整导出数据一致性](#调整导出数据一致性) +- [调整导出作业并发度](#调整导出作业并发度) +- [导出前清空导出目录](#导出前清空导出目录) +- [调整导出文件的大小](#调整导出文件的大小) + + +**指定分区导出** 导出作业支持仅导出 Doris 内表的部分分区,如仅导出 test 表的 p1 和 p2 分区 @@ -280,7 +310,8 @@ PROPERTIES ( ); ``` -### 导出时过滤数据 + +**导出时过滤数据** 导出作业支持导出时根据谓词条件过滤数据,仅导出符合条件的数据,如仅导出满足 `k1 < 50` 条件的数据 @@ -294,7 +325,8 @@ PROPERTIES ( ); ``` -### 导出外表数据 + +**导出外表数据** 导出作业支持 Doris Catalog 外表数据: @@ -320,9 +352,8 @@ PROPERTIES( 当前 Export 导出 Catalog 外表数据不支持并发导出,即使指定 parallelism 大于 1,仍然是单线程导出。 ::: -## 最佳实践 - -### 导出一致性 + +**调整导出数据一致性** `Export`导出支持 partition / tablets 两种粒度。`data_consistency`参数用来指定以何种粒度切分希望导出的表,`none` 代表 Tablets 级别,`partition`代表 Partition 级别。 @@ -341,7 +372,8 @@ PROPERTIES ( 关于 Export 底层构造 `SELECT INTO OUTFILE` 的逻辑,可参阅附录部分。 -### 导出作业并发度 + +**调整导出作业并发度** Export 可以设置不同的并发度来并发导出数据。指定并发度为 5: @@ -356,7 +388,8 @@ PROPERTIES ( 关于 Export 并发导出的原理,可参阅附录部分。 -### 导出前清空导出目录 + +**导出前清空导出目录** ```sql EXPORT TABLE test TO 
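-- 注意:下面的目标路径仅为示意值,需替换为实际导出路径;
-- "delete_existing_files" = "true" 会先清空该导出目录,且需在 fe.conf 中设置 enable_delete_existing_files = true 并重启 FE 后才生效,属于危险操作,建议仅在测试环境使用。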
"file:///home/user/tmp" @@ -372,7 +405,8 @@ PROPERTIES ( > 注意: 若要使用 delete_existing_files 参数,还需要在 fe.conf 中添加配置`enable_delete_existing_files = true`并重启 fe,此时 delete_existing_files 才会生效。delete_existing_files = true 是一个危险的操作,建议只在测试环境中使用。 -### 设置导出文件的大小 + +**调整导出文件的大小** 导出作业支持设置导出文件的大小,如果单个文件大小超过设定值,则会按照指定大小分成多个文件导出。 @@ -427,48 +461,40 @@ PROPERTIES ( 导出操作完成后,建议验证导出的数据是否完整和正确,以确保数据的质量和完整性。 -## 附录 - -### 并发导出原理 - -Export 任务的底层是执行`SELECT INTO OUTFILE` SQL 语句。用户发起一个 Export 任务后,Doris 会根据 Export 要导出的表构造出一个或多个 `SELECT INTO OUTFILE` 执行计划,随后将这些`SELECT INTO OUTFILE` 执行计划提交给 Doris 的 Job Schedule 任务调度器,Job Schedule 任务调度器会自动调度这些任务并执行。 - -默认情况下,Export 任务是单线程执行的。为了提高导出的效率,Export 命令可以设置一个 `parallelism` 参数来并发导出数据。设置`parallelism` 大于 1 后,Export 任务会使用多个线程并发的去执行 `SELECT INTO OUTFILE` 查询计划。`parallelism`参数实际就是指定执行 EXPORT 作业的线程数量。 - -一个 Export 任务构造一个或多个 `SELECT INTO OUTFILE` 执行计划的具体逻辑是: +* 一个 Export 任务构造一个或多个 `SELECT INTO OUTFILE` 执行计划的具体逻辑是: -1. 选择导出的数据的一致性模型 + 1. 选择导出的数据的一致性模型 - 根据 `data_consistency` 参数来决定导出的一致性,这个只和语义有关,和并发度无关,用户要先根据自己的需求,选择一致性模型。 + 根据 `data_consistency` 参数来决定导出的一致性,这个只和语义有关,和并发度无关,用户要先根据自己的需求,选择一致性模型。 -2. 确定并发度 + 2. 确定并发度 - 根据 `parallelism` 参数确定由多少个线程来运行这些 `SELECT INTO OUTFILE` 执行计划。parallelism 决定了最大可能的线程数。 + 根据 `parallelism` 参数确定由多少个线程来运行这些 `SELECT INTO OUTFILE` 执行计划。parallelism 决定了最大可能的线程数。 - > 注意:即使 Export 命令设置了 `parallelism` 参数,该 Export 任务的实际并发线程数量还与 Job Schedule 有关。Export 任务设置多并发后,每一个并发线程都是 Job Schedule 提供的,所以如果此时 Doris 系统任务较繁忙,Job Schedule 的线程资源较紧张,那么有可能分给 Export 任务的实际线程数量达不到 `parallelism` 个数,影响 Export 的并发导出。此时可以通过减轻系统负载或调整 FE 配置 `async_task_consumer_thread_num` 增加 Job Schedule 的总线程数量来缓解这个问题。 + > 注意:即使 Export 命令设置了 `parallelism` 参数,该 Export 任务的实际并发线程数量还与 Job Schedule 有关。Export 任务设置多并发后,每一个并发线程都是 Job Schedule 提供的,所以如果此时 Doris 系统任务较繁忙,Job Schedule 的线程资源较紧张,那么有可能分给 Export 任务的实际线程数量达不到 `parallelism` 个数,影响 Export 的并发导出。此时可以通过减轻系统负载或调整 FE 配置 `async_task_consumer_thread_num` 增加 Job Schedule 的总线程数量来缓解这个问题。 -3. 确定每一个 outfile 语句的任务量 + 3. 确定每一个 outfile 语句的任务量 - 每一个线程会根据 `maximum_tablets_of_outfile_in_export` 以及数据实际的分区数 / buckets 数来决定要拆分成多少个 outfile。 + 每一个线程会根据 `maximum_tablets_of_outfile_in_export` 以及数据实际的分区数 / buckets 数来决定要拆分成多少个 outfile。 - > `maximum_tablets_of_outfile_in_export` 是 FE 的配置,默认值为 10。该参数用于指定 Export 任务切分出来的单个 OutFile 语句中允许的最大 partitions / buckets 数量。修改该配置需要重启 FE。 + > `maximum_tablets_of_outfile_in_export` 是 FE 的配置,默认值为 10。该参数用于指定 Export 任务切分出来的单个 OutFile 语句中允许的最大 partitions / buckets 数量。修改该配置需要重启 FE。 - 举例:假设一张表共有 20 个 partition,每个 partition 都有 5 个 buckets,那么该表一共有 100 个 buckets。设置`data_consistency = none` 以及 `maximum_tablets_of_outfile_in_export = 10`。 + 举例:假设一张表共有 20 个 partition,每个 partition 都有 5 个 buckets,那么该表一共有 100 个 buckets。设置`data_consistency = none` 以及 `maximum_tablets_of_outfile_in_export = 10`。 - 1. `parallelism = 5` 情况下 + 1. `parallelism = 5` 情况下 - Export 任务将把该表的 100 个 buckets 分成 5 份,每个线程负责 20 个 buckets。每个线程负责的 20 个 buckets 又将以 10 个为单位分成 2 组,每组 buckets 各由一个 outfile 查询计划负责。所以最终该 Export 任务有 5 个线程并发执行,每个线程负责 2 个 outfile 语句,每个线程负责的 outfile 语句串行的被执行。 + Export 任务将把该表的 100 个 buckets 分成 5 份,每个线程负责 20 个 buckets。每个线程负责的 20 个 buckets 又将以 10 个为单位分成 2 组,每组 buckets 各由一个 outfile 查询计划负责。所以最终该 Export 任务有 5 个线程并发执行,每个线程负责 2 个 outfile 语句,每个线程负责的 outfile 语句串行的被执行。 - 2. `parallelism = 3` 情况下 + 2. 
`parallelism = 3` 情况下 - Export 任务将把该表的 100 个 buckets 分成 3 份,3 个线程分别负责 34、33、33 个 buckets。每个线程负责的 buckets 又将以 10 个为单位分成 4 组(最后一组不足 10 个 buckets),每组 buckets 各由一个 outfile 查询计划负责。所以该 Export 任务最终有 3 个线程并发执行,每个线程负责 4 个 outfile 语句,每个线程负责的 outfile 语句串行的被执行。 + Export 任务将把该表的 100 个 buckets 分成 3 份,3 个线程分别负责 34、33、33 个 buckets。每个线程负责的 buckets 又将以 10 个为单位分成 4 组(最后一组不足 10 个 buckets),每组 buckets 各由一个 outfile 查询计划负责。所以该 Export 任务最终有 3 个线程并发执行,每个线程负责 4 个 outfile 语句,每个线程负责的 outfile 语句串行的被执行。 - 3. `parallelism = 120` 情况下 + 3. `parallelism = 120` 情况下 - 由于该表 buckets 只有 100 个,所以系统会将 `parallelism` 强制设为 100,并以 `parallelism = 100` 去执行。Export 任务将把该表的 100 个 buckets 分成 100 份,每个线程负责 1 个 buckets。每个线程负责的 1 个 buckets 又将以 10 个为单位分成 1 组(该组实际就只有 1 个 buckets),每组 buckets 由一个 outfile 查询计划负责。所以最终该 Export 任务有 100 个线程并发执行,每个线程负责 1 个 outfile 语句,每个 outfile 语句实际只导出 1 个 buckets。 + 由于该表 buckets 只有 100 个,所以系统会将 `parallelism` 强制设为 100,并以 `parallelism = 100` 去执行。Export 任务将把该表的 100 个 buckets 分成 100 份,每个线程负责 1 个 buckets。每个线程负责的 1 个 buckets 又将以 10 个为单位分成 1 组(该组实际就只有 1 个 buckets),每组 buckets 由一个 outfile 查询计划负责。所以最终该 Export 任务有 100 个线程并发执行,每个线程负责 1 个 outfile 语句,每个 outfile 语句实际只导出 1 个 buckets。 -当前版本若希望 Export 有一个较好的性能,建议设置以下参数: +* 当前版本若希望 Export 有一个较好的性能,建议设置以下参数: -1. 打开 session 变量 `enable_parallel_outfile`。 -2. 设置 Export 的 `parallelism` 参数为较大值,使得每一个线程只负责一个 `SELECT INTO OUTFILE` 查询计划。 -3. 设置 FE 配置 `maximum_tablets_of_outfile_in_export` 为较小值,使得每一个 `SELECT INTO OUTFILE` 查询计划导出的数据量较小。 + 1. 打开 session 变量 `enable_parallel_outfile`。 + 2. 设置 Export 的 `parallelism` 参数为较大值,使得每一个线程只负责一个 `SELECT INTO OUTFILE` 查询计划。 + 3. 设置 FE 配置 `maximum_tablets_of_outfile_in_export` 为较小值,使得每一个 `SELECT INTO OUTFILE` 查询计划导出的数据量较小。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/outfile.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/outfile.md index 6312a5591a8d4..b20bf47ab2cc8 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/outfile.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/outfile.md @@ -26,25 +26,12 @@ under the License. 本文档将介绍如何使用 `SELECT INTO OUTFILE` 命令进行查询结果的导出操作。 -有关`SELECT INTO OUTFILE`命令的详细介绍,请参考:[SELECT INTO OUTFILE](../../sql-manual/sql-statements/Data-Manipulation-Statements/OUTFILE.md) - -## 概述 - `SELECT INTO OUTFILE` 命令将 `SELECT` 部分的结果数据,以指定的文件格式导出到目标存储系统中,包括对象存储、HDFS 或本地文件系统。 `SELECT INTO OUTFILE` 是一个同步命令,命令返回即表示导出结束。若导出成功,会返回导出的文件数量、大小、路径等信息。若导出失败,会返回错误信息。 关于如何选择 `SELECT INTO OUTFILE` 和 `EXPORT`,请参阅 [导出综述](./export-overview.md)。 -`SELECT INTO OUTFILE` 目前支持以下导出格式 - -* Parquet -* ORC -* csv -* csv\_with\_names -* csv\_with\_names\_and\_types - -不支持压缩格式的导出。 示例: @@ -64,15 +51,54 @@ mysql> SELECT * FROM tbl1 LIMIT 10 INTO OUTFILE "file:///home/work/path/result_" * FileSize:导出文件总大小。单位字节。 * URL:导出的文件路径的前缀,多个文件会以后缀 `_0`,`_1` 依次编号。 -## 导出文件列类型映射 +有关`SELECT INTO OUTFILE`命令的详细介绍,请参考:[SELECT INTO OUTFILE](../../sql-manual/sql-statements/Data-Manipulation-Statements/OUTFILE.md) + +-------------- + +## 使用场景 + +`SELECT INTO OUTFILE` 适用于以下场景: + +1. 导出数据需要经过复杂计算逻辑的,如过滤、聚合、关联等。 +2. 适合执行同步任务的场景。 -`SELECT INTO OUTFILE` 支持导出为 Parquet、ORC 文件格式。Parquet、ORC 文件格式拥有自己的数据类型,Doris 的导出功能能够自动将 Doris 的数据类型导出为 Parquet、ORC 文件格式的对应数据类型,具体映射关系请参阅[导出综述](../../data-operate/export/export-overview.md)文档的 "导出文件列类型映射" 部分。 +在使用 `SELECT INTO OUTFILE` 时需要注意以下限制: -## 示例 +1. 不支持压缩格式的导出。 +2. 2.1 版本 pipeline 引擎不支持并发导出。 +3. 
若希望导出到本地文件系统,需要在 fe.conf 中添加配置 `enable_outfile_to_local=true` 并重启FE。 + + +## 基本原理 + +`SELECT INTO OUTFILE` 功能本质上是执行一个 SQL 查询命令,其原理基本同普通查询的原理一致,。唯一的不同是,普通查询将最后查询的结果集输出到 mysql 客户端,而 `SELECT INTO OUTFILE` 将最后的查询结果集输出到外部存储介质。 + +`SELECT INTO OUTFILE`并发导出的原理是将大规模数据集划分为小块,并在多个节点上并行处理。在可以并发导出的场景下,并行的在多个 BE 节点上导出,每个 BE 处理结果集的一部分。 + +## 快速上手 +### 建表与导入数据 + +```sql +CREATE TABLE IF NOT EXISTS tbl ( + `c1` int(11) NULL, + `c2` string NULL, + `c3` bigint NULL +) +DISTRIBUTED BY HASH(c1) BUCKETS 20 +PROPERTIES("replication_num" = "1"); + + +insert into tbl values + (1, 'doris', 18), + (2, 'nereids', 20), + (3, 'pipelibe', 99999), + (4, 'Apache', 122123455), + (5, null, null); +``` ### 导出到 HDFS -将查询结果导出到文件 `hdfs://path/to/` 目录下,指定导出格式为 PARQUET: +将查询结果导出到文件 `hdfs://path/to/` 目录下,指定导出格式为 Parquet : ```sql SELECT c1, c2, c3 FROM tbl @@ -125,7 +151,7 @@ PROPERTIES ); ``` -### 导出到 S3 +### 导出到对象存储 将查询结果导出到 s3 存储的 `s3://path/to/` 目录下,指定导出格式为 ORC,需要提供`sk` `ak`等信息 @@ -141,14 +167,13 @@ PROPERTIES( ); ``` -### 导出到本地 -> +### 导出到本地文件系统 > 如需导出到本地文件,需在 `fe.conf` 中添加 `enable_outfile_to_local=true`并重启 FE。 将查询结果导出到 BE 的`file:///path/to/` 目录下,指定导出格式为 CSV,指定列分割符为`,`。 ```sql -SELECT k1 FROM tbl1 UNION SELECT k2 FROM tbl1 +SELECT c1, c2 FROM tbl FROM tbl1 INTO OUTFILE "file:///path/to/result_" FORMAT AS CSV PROPERTIES( @@ -160,9 +185,70 @@ PROPERTIES( 导出到本地文件的功能不适用于公有云用户,仅适用于私有化部署的用户。并且默认用户对集群节点有完全的控制权限。Doris 对于用户填写的导出路径不会做合法性检查。如果 Doris 的进程用户对该路径无写权限,或路径不存在,则会报错。同时处于安全性考虑,如果该路径已存在同名的文件,则也会导出失败。 Doris 不会管理导出到本地的文件,也不会检查磁盘空间等。这些文件需要用户自行管理,如清理等。 -## 最佳实践 +### 更多用法 + +有关`SELECT INTO OUTFILE`命令的详细介绍,请参考:[SELECT INTO OUTFILE](../../sql-manual/sql-statements/Data-Manipulation-Statements/OUTFILE.md) + +## 导出说明 +### 导出数据存储位置 +`SELECT INTO OUTFILE` 目前支持导出到以下存储位置: -### 生成导出成功标识文件 +- 对象存储:Amazon S3、COS、OSS、OBS、Google GCS +- HDFS +- 本地文件系统 + +### 导出文件类型 +`SELECT INTO OUTFILE` 目前支持导出以下文件格式 + +* Parquet +* ORC +* csv +* csv\_with\_names +* csv\_with\_names\_and\_types + +### 导出文件列类型映射 + +`SELECT INTO OUTFILE` 支持导出为 Parquet、ORC 文件格式。Parquet、ORC 文件格式拥有自己的数据类型,Doris 的导出功能能够自动将 Doris 的数据类型导出为 Parquet、ORC 文件格式的对应数据类型。 + +以下是 Doris 数据类型和 Parquet、ORC 文件格式的数据类型映射关系表: +| Doris Type | Arrow Type | Orc Type | +| ---------- | ---------- | -------- | +| boolean | boolean | boolean | +| tinyint | int8 | tinyint | +| smallint | int16 | smallint | +| int | int32 | int | +| bigint | int64 | bigint | +| largeInt | utf8 | string | +| date | utf8 | string | +| datev2 | Date32Type | string | +| datetime | utf8 | string | +| datetimev2 | TimestampType | timestamp | +| float | float32 | float | +| double | float64 | double | +| char / varchar / string| utf8 | string | +| decimal | decimal128 | decimal | +| struct | struct | struct | +| map | map | map | +| array | list | array | +| json | utf8 | string | +| variant | utf8 | string | +| bitmap | binary | binary | +| quantile_state| binary | binary | +| hll | binary | binary | + +> 注意:Doris 导出到 Parquet 文件格式时,会先将 Doris 内存数据转换为 Arrow 内存数据格式,然后由 Arrow 写出到 Parquet 文件格式。 + + +## 导出示例 + +- [生成导出成功标识文件示例](#生成导出成功标识文件示例) +- [并发导出示例](#并发导出示例) +- [导出前清空导出目录示例](#导出前清空导出目录示例) +- [设置导出文件的大小示例](#设置导出文件的大小示例) + + + +**生成导出成功标识文件示例** `SELECT INTO OUTFILE`命令是一个同步命令,因此有可能在 SQL 执行过程中任务连接断开了,从而无法获悉导出的数据是否正常结束或是否完整。此时可以使用 `success_file_name` 参数要求导出成功后,在目录下生成一个文件标识。 @@ -188,10 +274,44 @@ PROPERTIES 在导出完成后,会多写出一个文件,该文件的文件名为 `SUCCESS`。 -### 并发导出 + +**并发导出示例** 默认情况下,`SELECT` 部分的查询结果会先汇聚到某一个 BE 节点,由该节点单线程导出数据。然而,在某些情况下,如没有 `ORDER BY` 子句的查询语句,则可以开启并发导出,多个 BE 节点同时导出数据,以提升导出性能。 +然而,并非所有的 SQL 查询语句都可以并发导出。一个查询语句是否可以并发导出可以通过以下步骤来判断: + +* 
确定会话变量已开启:`set enable_parallel_outfile = true;` +* 通过 `EXPLAIN` 查看执行计划 + +```sql +mysql> EXPLAIN SELECT ... INTO OUTFILE "s3://xxx" ...; ++-----------------------------------------------------------------------------+ +| Explain String | ++-----------------------------------------------------------------------------+ +| PLAN FRAGMENT 0 | +| OUTPUT EXPRS: | | | | +| PARTITION: UNPARTITIONED | +| | +| RESULT SINK | +| | +| 1:EXCHANGE | +| | +| PLAN FRAGMENT 1 | +| OUTPUT EXPRS:`k1` + `k2` | +| PARTITION: HASH_PARTITIONED: `default_cluster:test`.`multi_tablet`.`k1` | +| | +| RESULT FILE SINK | +| FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951_ | +| STORAGE TYPE: S3 | +| | +| 0:OlapScanNode | +| TABLE: multi_tablet | ++-----------------------------------------------------------------------------+ +``` + +`EXPLAIN` 命令会返回该语句的查询计划。观察该查询计划,如果发现 `RESULT FILE SINK` 出现在 `PLAN FRAGMENT 1` 中,就说明该查询语句可以并发导出。如果 `RESULT FILE SINK` 出现在 `PLAN FRAGMENT 0` 中,则说明当前查询不能进行并发导出。 + 下面我们通过一个示例演示如何正确开启并发导出功能: 1. 打开并发导出会话变量 @@ -237,9 +357,8 @@ mysql> SELECT * FROM demo.tbl ORDER BY id 可以看到,最终结果只有一行,并没有触发并发导出。 -关于更多并发导出的原理说明,可参阅附录部分。 - -### 导出前清空导出目录 + +**导出前清空导出目录示例** ```sql SELECT * FROM tbl1 @@ -259,11 +378,10 @@ PROPERTIES 如果设置了 `"delete_existing_files" = "true"`,导出作业会先将 `s3://my_bucket/export/`目录下所有文件及目录删除,然后导出数据到该目录下。 -> 注意: - -> 若要使用 delete_existing_files 参数,还需要在 fe.conf 中添加配置`enable_delete_existing_files = true`并重启 fe,此时 delete_existing_files 才会生效。delete_existing_files = true 是一个危险的操作,建议只在测试环境中使用。 +> 注意:若要使用 delete_existing_files 参数,还需要在 fe.conf 中添加配置`enable_delete_existing_files = true`并重启 fe,此时 delete_existing_files 才会生效。delete_existing_files = true 是一个危险的操作,建议只在测试环境中使用。 -### 设置导出文件的大小 + +**设置导出文件的大小示例** ```sql SELECT * FROM tbl @@ -284,7 +402,7 @@ PROPERTIES( - 导出数据量和导出效率 - `SELECT INTO OUTFILE`功能本质上是执行一个 SQL 查询命令。如果不开启并发导出,查询结果是由单个 BE 节点,单线程导出的,因此整个导出的耗时包括查询本身的耗时和最终结果集写出的耗时。开启并发导出可以降低导出的时间。 +`SELECT INTO OUTFILE`功能本质上是执行一个 SQL 查询命令。如果不开启并发导出,查询结果是由单个 BE 节点,单线程导出的,因此整个导出的耗时包括查询本身的耗时和最终结果集写出的耗时。开启并发导出可以降低导出的时间。 - 导出超时 @@ -306,53 +424,4 @@ PROPERTIES( - 非可见字符的函数 -   对于部分输出为非可见字符的函数,如 BITMAP、HLL 类型,CSV 输出为 `\N`,Parquet、ORC 输出为 NULL。 - -   目前部分地理信息函数,如 `ST_Point` 的输出类型为 VARCHAR,但实际输出值为经过编码的二进制字符。当前这些函数会输出乱码。对于地理函数,请使用 `ST_AsText` 进行输出。 - -## 附录 - -### 并发导出原理 - -- 原理介绍 - -   Doris 是典型的基于 MPP 架构的高性能、实时的分析型数据库。MPP 架构的一大特征是使用分布式架构,将大规模数据集划分为小块,并在多个节点上并行处理。 - -   `SELECT INTO OUTFILE`的并发导出就是基于上述 MPP 架构的并行处理能力,在可以并发导出的场景下(后面会详细说明哪些场景可以并发导出),并行的在多个 BE 节点上导出,每个 BE 处理结果集的一部分。 - -- 如何判断可以执行并发导出 - - * 确定会话变量已开启:`set enable_parallel_outfile = true;` - * 通过 `EXPLAIN` 查看执行计划 - - ```sql - mysql> EXPLAIN SELECT ... 
INTO OUTFILE "s3://xxx" ...; - +-----------------------------------------------------------------------------+ - | Explain String | - +-----------------------------------------------------------------------------+ - | PLAN FRAGMENT 0 | - | OUTPUT EXPRS: | | | | - | PARTITION: UNPARTITIONED | - | | - | RESULT SINK | - | | - | 1:EXCHANGE | - | | - | PLAN FRAGMENT 1 | - | OUTPUT EXPRS:`k1` + `k2` | - | PARTITION: HASH_PARTITIONED: `default_cluster:test`.`multi_tablet`.`k1` | - | | - | RESULT FILE SINK | - | FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951_ | - | STORAGE TYPE: S3 | - | | - | 0:OlapScanNode | - | TABLE: multi_tablet | - +-----------------------------------------------------------------------------+ - ``` - - `EXPLAIN` 命令会返回该语句的查询计划。观察该查询计划,如果发现 `RESULT FILE SINK` 出现在 `PLAN FRAGMENT 1` 中,就说明该查询语句可以并发导出。如果 `RESULT FILE SINK` 出现在 `PLAN FRAGMENT 0` 中,则说明当前查询不能进行并发导出。 - -- 导出并发度 - - 当满足并发导出的条件后,导出任务的并发度为:`BE 节点数 * parallel_fragment_exec_instance_num`。 +   对于部分输出为非可见字符的函数,如 BITMAP、HLL 类型,导出到 CSV 文件格式时输出为 `\N`。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/Show-Statements/SHOW-EXPORT.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/Show-Statements/SHOW-EXPORT.md index 3639b4bb052a1..7bb097fedd071 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/Show-Statements/SHOW-EXPORT.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/Show-Statements/SHOW-EXPORT.md @@ -54,6 +54,35 @@ SHOW EXPORT 3. 可以使用 ORDER BY 对任意列组合进行排序 4. 如果指定了 LIMIT,则显示 limit 条匹配记录。否则全部显示 +`show export` 命令返回的结果各个列的含义如下: + +* JobId:作业的唯一 ID +* Label:该导出作业的标签,如果 Export 没有指定,则系统会默认生成一个。 +* State:作业状态: + * PENDING:作业待调度 + * EXPORTING:数据导出中 + * FINISHED:作业成功 + * CANCELLED:作业失败 +* Progress:作业进度。该进度以查询计划为单位。假设一共 10 个线程,当前已完成 3 个,则进度为 30%。 +* TaskInfo:以 Json 格式展示的作业信息: + * db:数据库名 + * tbl:表名 + * partitions:指定导出的分区。`空`列表 表示所有分区。 + * column\_separator:导出文件的列分隔符。 + * line\_delimiter:导出文件的行分隔符。 + * tablet num:涉及的总 Tablet 数量。 + * broker:使用的 broker 的名称。 + * coord num:查询计划的个数。 + * max\_file\_size:一个导出文件的最大大小。 + * delete\_existing\_files:是否删除导出目录下已存在的文件及目录。 + * columns:指定需要导出的列名,空值代表导出所有列。 + * format:导出的文件格式 +* Path:远端存储上的导出路径。 +* CreateTime/StartTime/FinishTime:作业的创建时间、开始调度时间和结束时间。 +* Timeout:作业超时时间。单位是秒。该时间从 CreateTime 开始计算。 +* ErrorMsg:如果作业出现错误,这里会显示错误原因。 +* OutfileInfo:如果作业导出成功,这里会显示具体的`SELECT INTO OUTFILE`结果信息。 + ## 举例 1. 展示默认 db 的所有导出任务 From 53aafbe4f2ef6b563c49e0bc1d7a57eaf576ba40 Mon Sep 17 00:00:00 2001 From: TieweiFang Date: Thu, 28 Nov 2024 17:59:04 +0800 Subject: [PATCH 2/4] fix 2 --- .../data-operate/export/export-manual.md | 118 +++++++++--------- .../current/data-operate/export/outfile.md | 117 ++++++++--------- .../Data-Manipulation-Statements/OUTFILE.md | 8 ++ 3 files changed, 121 insertions(+), 122 deletions(-) diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/export-manual.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/export-manual.md index 3fa284c376e2a..8b8c2db71c6b5 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/export-manual.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/export-manual.md @@ -34,29 +34,13 @@ under the License. 
关于如何选择 `SELECT INTO OUTFILE` 和 `EXPORT`,请参阅 [导出综述](../../data-operate/export/export-overview.md)。 -示例: - -```sql -mysql> EXPORT TABLE tpch1.lineitem TO "s3://my_bucket/path/to/exp_" - PROPERTIES( - "format" = "csv", - "max_file_size" = "2048MB" - ) - WITH s3 ( - "s3.endpoint" = "${endpoint}", - "s3.region" = "${region}", - "s3.secret_key"="${sk}", - "s3.access_key" = "${ak}" - ); -``` - --- ## 基本原理 Export 任务的底层是执行`SELECT INTO OUTFILE` SQL 语句。用户发起一个 Export 任务后,Doris 会根据 Export 要导出的表构造出一个或多个 `SELECT INTO OUTFILE` 执行计划,随后将这些`SELECT INTO OUTFILE` 执行计划提交给 Doris 的 Job Schedule 任务调度器,Job Schedule 任务调度器会自动调度这些任务并执行。 -默认情况下,Export 任务是单线程执行的。为了提高导出的效率,Export 命令可以设置一个 `parallelism` 参数来并发导出数据。设置`parallelism` 大于 1 后,Export 任务会使用多个线程并发的去执行 `SELECT INTO OUTFILE` 查询计划。`parallelism`参数实际就是指定执行 EXPORT 作业的线程数量。 +默认情况下,Export 任务是单线程执行的。为了提高导出的效率,Export 命令可以设置 `parallelism` 参数来并发导出数据。设置`parallelism` 大于 1 后,Export 任务会使用多个线程并发的去执行 `SELECT INTO OUTFILE` 查询计划。`parallelism`参数实际就是指定执行 EXPORT 作业的线程数量。 ## 使用场景 `Export` 适用于以下场景: @@ -65,7 +49,9 @@ Export 任务的底层是执行`SELECT INTO OUTFILE` SQL 语句。用户发起 - 需要异步提交任务的场景。 使用 `Export` 时需要注意以下限制: -1. 当前不支持压缩格式的导出。 +- 当前不支持压缩格式的导出。 +- 不支持 Select 结果集导出。若需要导出 Select 结果集,请使用[OUTFILE导出](../../data-operate/export/outfile.md) +- 若希望导出到本地文件系统,需要在 fe.conf 中添加配置 `enable_outfile_to_local=true` 并重启FE。 ## 快速上手 ### 建表与导入数据 @@ -105,49 +91,9 @@ with HDFS ( ); ``` -如果 HDFS 开启了高可用,则需要提供 HA 信息,如: - -```sql -EXPORT TABLE tbl -TO "hdfs://HDFS8000871/path/to/export/" -PROPERTIES -( - "line_delimiter" = "," -) -with HDFS ( - "fs.defaultFS" = "hdfs://HDFS8000871", - "hadoop.username" = "hadoop", - "dfs.nameservices" = "your-nameservices", - "dfs.ha.namenodes.your-nameservices" = "nn1,nn2", - "dfs.namenode.rpc-address.HDFS8000871.nn1" = "ip:port", - "dfs.namenode.rpc-address.HDFS8000871.nn2" = "ip:port", - "dfs.client.failover.proxy.provider.HDFS8000871" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider" -); -``` - -如果 Hadoop 集群开启了高可用并且启用了 Kerberos 认证,可以参考如下 SQL 语句: +如果 HDFS 集群开启了高可用,则需要提供 HA 信息,参考案例:[导出到开启了高可用的 HDFS 集群](#高可用HDFS导出) -```sql -EXPORT TABLE tbl -TO "hdfs://HDFS8000871/path/to/export/" -PROPERTIES -( - "line_delimiter" = "," -) -with HDFS ( - "fs.defaultFS"="hdfs://hacluster/", - "hadoop.username" = "hadoop", - "dfs.nameservices"="hacluster", - "dfs.ha.namenodes.hacluster"="n1,n2", - "dfs.namenode.rpc-address.hacluster.n1"="192.168.0.1:8020", - "dfs.namenode.rpc-address.hacluster.n2"="192.168.0.2:8020", - "dfs.client.failover.proxy.provider.hacluster"="org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider", - "dfs.namenode.kerberos.principal"="hadoop/_HOST@REALM.COM" - "hadoop.security.authentication"="kerberos", - "hadoop.kerberos.principal"="doris_test@REALM.COM", - "hadoop.kerberos.keytab"="/path/to/doris_test.keytab" -); -``` +如果 HDFS 集群开启了高可用并且启用了 Kerberos 认证,需要提供 Kerberos 认证信息,参考案例:[导出到开启了高可用及kerberos认证的 HDFS 集群](#高可用及kerberos集群导出) #### 导出到对象存储 @@ -288,6 +234,8 @@ CANCEL EXPORT FROM dbName WHERE LABEL like "%export_%"; ## 导出示例 +- [导出到开启了高可用的 HDFS 集群](#高可用HDFS导出) +- [导出到开启了高可用及kerberos认证的 HDFS 集群](#高可用及kerberos集群导出) - [指定分区导出](#指定分区导出) - [导出时过滤数据](#导出时过滤数据) - [导出外表数据](#导出外表数据) @@ -296,6 +244,56 @@ CANCEL EXPORT FROM dbName WHERE LABEL like "%export_%"; - [导出前清空导出目录](#导出前清空导出目录) - [调整导出文件的大小](#调整导出文件的大小) + +**导出到开启了高可用的 HDFS 集群** + +如果 HDFS 开启了高可用,则需要提供 HA 信息,如: + +```sql +EXPORT TABLE tbl +TO "hdfs://HDFS8000871/path/to/export/" +PROPERTIES +( + "line_delimiter" = "," +) +with HDFS ( + "fs.defaultFS" = "hdfs://HDFS8000871", + "hadoop.username" = 
"hadoop", + "dfs.nameservices" = "your-nameservices", + "dfs.ha.namenodes.your-nameservices" = "nn1,nn2", + "dfs.namenode.rpc-address.HDFS8000871.nn1" = "ip:port", + "dfs.namenode.rpc-address.HDFS8000871.nn2" = "ip:port", + "dfs.client.failover.proxy.provider.HDFS8000871" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider" +); +``` + + +**导出到开启了高可用及kerberos认证的 HDFS 集群** + +如果 Hadoop 集群开启了高可用并且启用了 Kerberos 认证,可以参考如下 SQL 语句: + +```sql +EXPORT TABLE tbl +TO "hdfs://HDFS8000871/path/to/export/" +PROPERTIES +( + "line_delimiter" = "," +) +with HDFS ( + "fs.defaultFS"="hdfs://hacluster/", + "hadoop.username" = "hadoop", + "dfs.nameservices"="hacluster", + "dfs.ha.namenodes.hacluster"="n1,n2", + "dfs.namenode.rpc-address.hacluster.n1"="192.168.0.1:8020", + "dfs.namenode.rpc-address.hacluster.n2"="192.168.0.2:8020", + "dfs.client.failover.proxy.provider.hacluster"="org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider", + "dfs.namenode.kerberos.principal"="hadoop/_HOST@REALM.COM" + "hadoop.security.authentication"="kerberos", + "hadoop.kerberos.principal"="doris_test@REALM.COM", + "hadoop.kerberos.keytab"="/path/to/doris_test.keytab" +); +``` + **指定分区导出** diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/outfile.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/outfile.md index b20bf47ab2cc8..8a88f43f8ec47 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/outfile.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/export/outfile.md @@ -33,24 +33,6 @@ under the License. 关于如何选择 `SELECT INTO OUTFILE` 和 `EXPORT`,请参阅 [导出综述](./export-overview.md)。 -示例: - -```sql -mysql> SELECT * FROM tbl1 LIMIT 10 INTO OUTFILE "file:///home/work/path/result_"; -+------------+-----------+----------+--------------------------------------------------------------------+ -| FileNumber | TotalRows | FileSize | URL | -+------------+-----------+----------+--------------------------------------------------------------------+ -| 1 | 2 | 8 | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ | -+------------+-----------+----------+--------------------------------------------------------------------+ -``` - -返回结果说明: - -* FileNumber:最终生成的文件个数。 -* TotalRows:结果集行数。 -* FileSize:导出文件总大小。单位字节。 -* URL:导出的文件路径的前缀,多个文件会以后缀 `_0`,`_1` 依次编号。 - 有关`SELECT INTO OUTFILE`命令的详细介绍,请参考:[SELECT INTO OUTFILE](../../sql-manual/sql-statements/Data-Manipulation-Statements/OUTFILE.md) -------------- @@ -59,19 +41,19 @@ mysql> SELECT * FROM tbl1 LIMIT 10 INTO OUTFILE "file:///home/work/path/result_" `SELECT INTO OUTFILE` 适用于以下场景: -1. 导出数据需要经过复杂计算逻辑的,如过滤、聚合、关联等。 -2. 适合执行同步任务的场景。 +- 导出数据需要经过复杂计算逻辑的,如过滤、聚合、关联等。 +- 适合执行同步任务的场景。 在使用 `SELECT INTO OUTFILE` 时需要注意以下限制: -1. 不支持压缩格式的导出。 -2. 2.1 版本 pipeline 引擎不支持并发导出。 -3. 
若希望导出到本地文件系统,需要在 fe.conf 中添加配置 `enable_outfile_to_local=true` 并重启FE。 +- 不支持压缩格式的导出。 +- 2.1 版本 pipeline 引擎不支持并发导出。 +- 若希望导出到本地文件系统,需要在 fe.conf 中添加配置 `enable_outfile_to_local=true` 并重启FE。 ## 基本原理 -`SELECT INTO OUTFILE` 功能本质上是执行一个 SQL 查询命令,其原理基本同普通查询的原理一致,。唯一的不同是,普通查询将最后查询的结果集输出到 mysql 客户端,而 `SELECT INTO OUTFILE` 将最后的查询结果集输出到外部存储介质。 +`SELECT INTO OUTFILE` 功能本质上是执行一个 SQL 查询命令,其原理基本同普通查询的原理一致。唯一的不同是,普通查询将最后查询的结果集输出到 mysql 客户端,而 `SELECT INTO OUTFILE` 将最后的查询结果集输出到外部存储介质。 `SELECT INTO OUTFILE`并发导出的原理是将大规模数据集划分为小块,并在多个节点上并行处理。在可以并发导出的场景下,并行的在多个 BE 节点上导出,每个 BE 处理结果集的一部分。 @@ -111,45 +93,9 @@ PROPERTIES ); ``` -如果 HDFS 开启了高可用,则需要提供 HA 信息,如: - -```sql -SELECT c1, c2, c3 FROM tbl -INTO OUTFILE "hdfs://HDFS8000871/path/to/result_" -FORMAT AS PARQUET -PROPERTIES -( - "fs.defaultFS" = "hdfs://HDFS8000871", - "hadoop.username" = "hadoop", - "dfs.nameservices" = "your-nameservices", - "dfs.ha.namenodes.your-nameservices" = "nn1,nn2", - "dfs.namenode.rpc-address.HDFS8000871.nn1" = "ip:port", - "dfs.namenode.rpc-address.HDFS8000871.nn2" = "ip:port", - "dfs.client.failover.proxy.provider.HDFS8000871" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider" -); -``` +如果 HDFS 集群开启了高可用,则需要提供 HA 信息,参考案例:[导出到开启了高可用的 HDFS 集群](#高可用HDFS导出) -如果 Hadoop 集群开启了高可用并且启用了 Kerberos 认证,可以参考如下 SQL 语句: - -```sql -SELECT * FROM tbl -INTO OUTFILE "hdfs://path/to/result_" -FORMAT AS PARQUET -PROPERTIES -( - "fs.defaultFS"="hdfs://hacluster/", - "hadoop.username" = "hadoop", - "dfs.nameservices"="hacluster", - "dfs.ha.namenodes.hacluster"="n1,n2", - "dfs.namenode.rpc-address.hacluster.n1"="192.168.0.1:8020", - "dfs.namenode.rpc-address.hacluster.n2"="192.168.0.2:8020", - "dfs.client.failover.proxy.provider.hacluster"="org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider", - "dfs.namenode.kerberos.principal"="hadoop/_HOST@REALM.COM" - "hadoop.security.authentication"="kerberos", - "hadoop.kerberos.principal"="doris_test@REALM.COM", - "hadoop.kerberos.keytab"="/path/to/doris_test.keytab" -); -``` +如果 HDFS 集群开启了高可用并且启用了 Kerberos 认证,需要提供 Kerberos 认证信息,参考案例:[导出到开启了高可用及kerberos认证的 HDFS 集群](#高可用及kerberos集群导出) ### 导出到对象存储 @@ -241,11 +187,58 @@ PROPERTIES( ## 导出示例 +- [导出到开启了高可用的 HDFS 集群](#高可用HDFS导出) +- [导出到开启了高可用及kerberos认证的 HDFS 集群](#高可用及kerberos集群导出) - [生成导出成功标识文件示例](#生成导出成功标识文件示例) - [并发导出示例](#并发导出示例) - [导出前清空导出目录示例](#导出前清空导出目录示例) - [设置导出文件的大小示例](#设置导出文件的大小示例) + +**导出到开启了高可用的 HDFS 集群** + +如果 HDFS 开启了高可用,则需要提供 HA 信息,如: + +```sql +SELECT c1, c2, c3 FROM tbl +INTO OUTFILE "hdfs://HDFS8000871/path/to/result_" +FORMAT AS PARQUET +PROPERTIES +( + "fs.defaultFS" = "hdfs://HDFS8000871", + "hadoop.username" = "hadoop", + "dfs.nameservices" = "your-nameservices", + "dfs.ha.namenodes.your-nameservices" = "nn1,nn2", + "dfs.namenode.rpc-address.HDFS8000871.nn1" = "ip:port", + "dfs.namenode.rpc-address.HDFS8000871.nn2" = "ip:port", + "dfs.client.failover.proxy.provider.HDFS8000871" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider" +); +``` + + +**导出到开启了高可用及kerberos认证的 HDFS 集群** + +如果 Hdfs 集群开启了高可用并且启用了 Kerberos 认证,可以参考如下 SQL 语句: + +```sql +SELECT * FROM tbl +INTO OUTFILE "hdfs://path/to/result_" +FORMAT AS PARQUET +PROPERTIES +( + "fs.defaultFS"="hdfs://hacluster/", + "hadoop.username" = "hadoop", + "dfs.nameservices"="hacluster", + "dfs.ha.namenodes.hacluster"="n1,n2", + "dfs.namenode.rpc-address.hacluster.n1"="192.168.0.1:8020", + "dfs.namenode.rpc-address.hacluster.n2"="192.168.0.2:8020", + 
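+    -- 以下 failover proxy provider 及 Kerberos 认证相关参数(principal、keytab 等)均为示意值,
+    -- 需根据实际集群环境替换。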
"dfs.client.failover.proxy.provider.hacluster"="org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider", + "dfs.namenode.kerberos.principal"="hadoop/_HOST@REALM.COM" + "hadoop.security.authentication"="kerberos", + "hadoop.kerberos.principal"="doris_test@REALM.COM", + "hadoop.kerberos.keytab"="/path/to/doris_test.keytab" +); +``` **生成导出成功标识文件示例** diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/Data-Manipulation-Statements/OUTFILE.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/Data-Manipulation-Statements/OUTFILE.md index 5a0ccf42bc97e..3aef1b0ff4216 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/Data-Manipulation-Statements/OUTFILE.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/Data-Manipulation-Statements/OUTFILE.md @@ -134,6 +134,14 @@ INTO OUTFILE "file_path" INTO OUTFILE "file:///home/work/path/result_"; ``` +#### 返回结果说明 + +Outfile 语句返回的结果,各个列的含义如下: +* FileNumber:最终生成的文件个数。 +* TotalRows:结果集行数。 +* FileSize:导出文件总大小。单位字节。 +* URL:导出的文件路径的前缀,多个文件会以后缀 `_0`,`_1` 依次编号。 + #### 数据类型映射 Parquet、ORC 文件格式拥有自己的数据类型,Doris的导出功能能够自动将 Doris 的数据类型导出到 Parquet/ORC 文件格式的对应数据类型,以下是 Apache Doris 数据类型和 Parquet/ORC 文件格式的数据类型映射关系表: From dd5276d587450210460eb9c36f7195d1216d796b Mon Sep 17 00:00:00 2001 From: TieweiFang Date: Fri, 3 Jan 2025 15:14:43 +0800 Subject: [PATCH 3/4] fix 2.1/3.0 --- .../data-operate/export/export-manual.md | 391 +++++++++--------- .../data-operate/export/outfile.md | 316 ++++++++------ .../load-and-export/OUTFILE.md | 8 + .../load-and-export/SHOW-EXPORT.md | 29 ++ .../data-operate/export/export-manual.md | 388 +++++++++-------- .../data-operate/export/outfile.md | 316 ++++++++------ .../load-and-export/OUTFILE.md | 8 + .../load-and-export/SHOW-EXPORT.md | 29 ++ 8 files changed, 864 insertions(+), 621 deletions(-) diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/export/export-manual.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/export/export-manual.md index c0cd8b8471607..59bdb04b3ecb4 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/export/export-manual.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/export/export-manual.md @@ -26,49 +26,113 @@ under the License. 
本文档将介绍如何使用`EXPORT`命令导出 Doris 中存储的数据。 -有关`EXPORT`命令的详细介绍,请参考:[EXPORT](../../sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md) - -## 概述 - `Export` 是 Doris 提供的一种将数据异步导出的功能。该功能可以将用户指定的表或分区的数据,以指定的文件格式,导出到目标存储系统中,包括对象存储、HDFS 或本地文件系统。 `Export` 是一个异步执行的命令,命令执行成功后,立即返回结果,用户可以通过`Show Export` 命令查看该 Export 任务的详细信息。 +有关`EXPORT`命令的详细介绍,请参考:[EXPORT](../../sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md) + 关于如何选择 `SELECT INTO OUTFILE` 和 `EXPORT`,请参阅 [导出综述](../../data-operate/export/export-overview.md)。 -`EXPORT` 当前支持导出以下类型的表或视图 +--- -* Doris 内表 -* Doris 逻辑视图 -* Doris Catalog 表 +## 基本原理 -`EXPORT` 目前支持以下导出格式 +Export 任务的底层是执行`SELECT INTO OUTFILE` SQL 语句。用户发起一个 Export 任务后,Doris 会根据 Export 要导出的表构造出一个或多个 `SELECT INTO OUTFILE` 执行计划,随后将这些`SELECT INTO OUTFILE` 执行计划提交给 Doris 的 Job Schedule 任务调度器,Job Schedule 任务调度器会自动调度这些任务并执行。 -* Parquet -* ORC -* csv -* csv_with_names -* csv_with_names_and_types +默认情况下,Export 任务是单线程执行的。为了提高导出的效率,Export 命令可以设置 `parallelism` 参数来并发导出数据。设置`parallelism` 大于 1 后,Export 任务会使用多个线程并发的去执行 `SELECT INTO OUTFILE` 查询计划。`parallelism`参数实际就是指定执行 EXPORT 作业的线程数量。 + +## 使用场景 +`Export` 适用于以下场景: + +- 大数据量的单表导出、仅需简单的过滤条件。 +- 需要异步提交任务的场景。 + +使用 `Export` 时需要注意以下限制: +- 当前不支持压缩格式的导出。 +- 不支持 Select 结果集导出。若需要导出 Select 结果集,请使用[OUTFILE导出](../../data-operate/export/outfile.md) +- 若希望导出到本地文件系统,需要在 fe.conf 中添加配置 `enable_outfile_to_local=true` 并重启FE。 + +## 快速上手 +### 建表与导入数据 +```sql +CREATE TABLE IF NOT EXISTS tbl ( + `c1` int(11) NULL, + `c2` string NULL, + `c3` bigint NULL +) +DISTRIBUTED BY HASH(c1) BUCKETS 20 +PROPERTIES("replication_num" = "1"); -不支持压缩格式的导出。 +insert into tbl values + (1, 'doris', 18), + (2, 'nereids', 20), + (3, 'pipelibe', 99999), + (4, 'Apache', 122123455), + (5, null, null); +``` -示例: +### 创建导出作业 + +#### 导出到HDFS +将 tbl 表的所有数据导出到 HDFS 上,设置导出作业的文件格式为 csv(默认格式),并设置列分割符为`,`。 ```sql -mysql> EXPORT TABLE tpch1.lineitem TO "s3://my_bucket/path/to/exp_" - -> PROPERTIES( - -> "format" = "csv", - -> "max_file_size" = "2048MB" - -> ) - -> WITH s3 ( - -> "s3.endpoint" = "${endpoint}", - -> "s3.region" = "${region}", - -> "s3.secret_key"="${sk}", - -> "s3.access_key" = "${ak}" - -> ); +EXPORT TABLE tbl +TO "hdfs://host/path/to/export/" +PROPERTIES +( + "line_delimiter" = "," +) +with HDFS ( + "fs.defaultFS"="hdfs://hdfs_host:port", + "hadoop.username" = "hadoop" +); ``` +如果 HDFS 集群开启了高可用,则需要提供 HA 信息,参考案例:[导出到开启了高可用的 HDFS 集群](#高可用HDFS导出) + +如果 HDFS 集群开启了高可用并且启用了 Kerberos 认证,需要提供 Kerberos 认证信息,参考案例:[导出到开启了高可用及kerberos认证的 HDFS 集群](#高可用及kerberos集群导出) + +#### 导出到对象存储 + +将 tbl 表中的所有数据导出到对象存储上,设置导出作业的文件格式为 csv(默认格式),并设置列分割符为`,`。 + +```sql +EXPORT TABLE tbl TO "s3://bucket/a/b/c" +PROPERTIES ( + "line_delimiter" = "," +) WITH s3 ( + "s3.endpoint" = "xxxxx", + "s3.region" = "xxxxx", + "s3.secret_key"="xxxx", + "s3.access_key" = "xxxxx" +) +``` + +#### 导出到本地文件系统 + +> +> export 数据导出到本地文件系统,需要在 fe.conf 中添加`enable_outfile_to_local=true`并且重启 FE。 + +将 tbl 表中的所有数据导出到本地文件系统,设置导出作业的文件格式为 csv(默认格式),并设置列分割符为`,`。 + +```sql +-- csv 格式 +EXPORT TABLE tbl TO "file:///home/user/tmp/" +PROPERTIES ( + "format" = "csv", + "line_delimiter" = "," +); +``` + +> 注意: + 导出到本地文件系统的功能不适用于公有云用户,仅适用于私有化部署的用户。并且默认用户对集群节点有完全的控制权限。Doris 对于用户填写的导出路径不会做合法性检查。如果 Doris 的进程用户对该路径无写权限,或路径不存在,则会报错。同时处于安全性考虑,如果该路径已存在同名的文件,则也会导出失败。 + Doris 不会管理导出到本地的文件,也不会检查磁盘空间等。这些文件需要用户自行管理,如清理等。 + + +### 查看导出作业 提交作业后,可以通过 [SHOW EXPORT](../../sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md) 命令查询导出作业状态,结果举例如下: ```sql @@ -98,80 +162,99 @@ OutfileInfo: [ 1 row in set (0.00 sec) ``` -`show export` 
命令返回的结果各个列的含义如下: - -* JobId:作业的唯一 ID -* Label:该导出作业的标签,如果 Export 没有指定,则系统会默认生成一个。 -* State:作业状态: - * PENDING:作业待调度 - * EXPORTING:数据导出中 - * FINISHED:作业成功 - * CANCELLED:作业失败 -* Progress:作业进度。该进度以查询计划为单位。假设一共 10 个线程,当前已完成 3 个,则进度为 30%。 -* TaskInfo:以 JSON 格式展示的作业信息: - * db:数据库名 - * tbl:表名 - * partitions:指定导出的分区。`空`列表 表示所有分区。 - * column_separator:导出文件的列分隔符。 - * line_delimiter:导出文件的行分隔符。 - * tablet num:涉及的总 Tablet 数量。 - * broker:使用的 broker 的名称。 - * coord num:查询计划的个数。 - * max_file_size:一个导出文件的最大大小。 - * delete_existing_files:是否删除导出目录下已存在的文件及目录。 - * columns:指定需要导出的列名,空值代表导出所有列。 - * format:导出的文件格式 -* Path:远端存储上的导出路径。 -* CreateTime/StartTime/FinishTime:作业的创建时间、开始调度时间和结束时间。 -* Timeout:作业超时时间。单位是秒。该时间从 CreateTime 开始计算。 -* ErrorMsg:如果作业出现错误,这里会显示错误原因。 -* OutfileInfo:如果作业导出成功,这里会显示具体的`SELECT INTO OUTFILE`结果信息。 +有关 `show export` 命令的详细用法及其返回结果的各个列的含义可以参看 [SHOW EXPORT](../../sql-manual/sql-statements/Show-Statements/SHOW-EXPORT.md): + +### 取消导出作业 提交 Export 作业后,在 Export 任务成功或失败之前可以通过 [CANCEL EXPORT](../../sql-manual/sql-statements/data-modification/load-and-export/CANCEL-EXPORT.md) 命令取消导出作业。取消命令举例如下: ```sql -CANCEL EXPORT FROM tpch1 WHERE LABEL like "%export_%"; +CANCEL EXPORT FROM dbName WHERE LABEL like "%export_%"; ``` -## 导出文件列类型映射 +## 导出说明 -`Export`支持导出数据为 Parquet、ORC 文件格式。Parquet、ORC 文件格式拥有自己的数据类型,Doris 的导出功能能够自动将 Doris 的数据类型导出为 Parquet、ORC 文件格式的对应数据类型,具体映射关系请参阅[导出综述](../../data-operate/export/export-overview.md)文档的 "导出文件列类型映射" 部分。 +### 导出数据源 -## 示例 +`EXPORT` 当前支持导出以下类型的表或视图 -### 导出到 HDFS +* Doris 内表 +* Doris 逻辑视图 +* Doris Catalog 表 -将 db1.tbl1 表的 p1 和 p2 分区中的`col1` 列和`col2` 列数据导出到 HDFS 上,设置导出作业的 label 为 `mylabel`。导出文件格式为 CSV `,`,导出作业单个文件大小限制为 512MB。 +### 导出数据存储位置 -```sql -EXPORT TABLE db1.tbl1 -PARTITION (p1,p2) -TO "hdfs://host/path/to/export/" -PROPERTIES -( - "label" = "mylabel", - "column_separator"=",", - "max_file_size" = "512MB", - "columns" = "col1,col2" -) -with HDFS ( - "fs.defaultFS"="hdfs://hdfs_host:port", - "hadoop.username" = "hadoop" -); -``` +`Export` 目前支持导出到以下存储位置: + +- 对象存储:Amazon S3、COS、OSS、OBS、Google GCS +- HDFS +- 本地文件系统 + +### 导出文件类型 + +`EXPORT` 目前支持导出为以下文件格式: + +* Parquet +* ORC +* csv +* csv\_with\_names +* csv\_with\_names\_and\_types + +### 导出文件列类型映射 + +`Export` 支持导出为 Parquet、ORC 文件格式。Parquet、ORC 文件格式拥有自己的数据类型,Doris 的导出功能能够自动将 Doris 的数据类型导出为 Parquet、ORC 文件格式的对应数据类型。 + +以下是 Doris 数据类型和 Parquet、ORC 文件格式的数据类型映射关系表: +| Doris Type | Arrow Type | Orc Type | +| ---------- | ---------- | -------- | +| boolean | boolean | boolean | +| tinyint | int8 | tinyint | +| smallint | int16 | smallint | +| int | int32 | int | +| bigint | int64 | bigint | +| largeInt | utf8 | string | +| date | utf8 | string | +| datev2 | Date32Type | string | +| datetime | utf8 | string | +| datetimev2 | TimestampType | timestamp | +| float | float32 | float | +| double | float64 | double | +| char / varchar / string| utf8 | string | +| decimal | decimal128 | decimal | +| struct | struct | struct | +| map | map | map | +| array | list | array | +| json | utf8 | string | +| variant | utf8 | string | +| bitmap | binary | binary | +| quantile_state| binary | binary | +| hll | binary | binary | + +> 注意:Doris 导出到 Parquet 文件格式时,会先将 Doris 内存数据转换为 Arrow 内存数据格式,然后由 Arrow 写出到 Parquet 文件格式。 + +## 导出示例 + +- [导出到开启了高可用的 HDFS 集群](#高可用HDFS导出) +- [导出到开启了高可用及kerberos认证的 HDFS 集群](#高可用及kerberos集群导出) +- [指定分区导出](#指定分区导出) +- [导出时过滤数据](#导出时过滤数据) +- [导出外表数据](#导出外表数据) +- [调整导出数据一致性](#调整导出数据一致性) +- [调整导出作业并发度](#调整导出作业并发度) +- [导出前清空导出目录](#导出前清空导出目录) +- [调整导出文件的大小](#调整导出文件的大小) + + +**导出到开启了高可用的 HDFS 集群** 如果 HDFS 开启了高可用,则需要提供 HA 信息,如: 
```sql -EXPORT TABLE db1.tbl1 -PARTITION (p1,p2) +EXPORT TABLE tbl TO "hdfs://HDFS8000871/path/to/export/" PROPERTIES ( - "label" = "mylabel", - "column_separator"=",", - "max_file_size" = "512MB", - "columns" = "col1,col2" + "line_delimiter" = "," ) with HDFS ( "fs.defaultFS" = "hdfs://HDFS8000871", @@ -184,18 +267,17 @@ with HDFS ( ); ``` + +**导出到开启了高可用及kerberos认证的 HDFS 集群** + 如果 Hadoop 集群开启了高可用并且启用了 Kerberos 认证,可以参考如下 SQL 语句: ```sql -EXPORT TABLE db1.tbl1 -PARTITION (p1,p2) +EXPORT TABLE tbl TO "hdfs://HDFS8000871/path/to/export/" PROPERTIES ( - "label" = "mylabel", - "column_separator"=",", - "max_file_size" = "512MB", - "columns" = "col1,col2" + "line_delimiter" = "," ) with HDFS ( "fs.defaultFS"="hdfs://hacluster/", @@ -212,65 +294,8 @@ with HDFS ( ); ``` -### 导出到 S3 - - -将 s3_test 表中的所有数据导出到 s3 上,导出格式为 CSV,以不可见字符 `\x07` 作为行分隔符。 - - -```sql -EXPORT TABLE s3_test TO "s3://bucket/a/b/c" -PROPERTIES ( - "line_delimiter" = "\\x07" -) WITH s3 ( - "s3.endpoint" = "xxxxx", - "s3.region" = "xxxxx", - "s3.secret_key"="xxxx", - "s3.access_key" = "xxxxx" -) -``` - -### 导出到本地文件系统 -> -> export 数据导出到本地文件系统,需要在 fe.conf 中添加`enable_outfile_to_local=true`并且重启 FE。 - -将 test 表中的所有数据导出到本地存储: - -```sql --- parquet 格式 -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "columns" = "k1,k2", - "format" = "parquet" -); - --- orc 格式 -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "columns" = "k1,k2", - "format" = "orc" -); - --- csv_with_names 格式,以‘AA’为列分割符,‘zz’为行分割符 -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "format" = "csv_with_names", - "column_separator"="AA", - "line_delimiter" = "zz" -); - --- csv_with_names_and_types 格式 -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "format" = "csv_with_names_and_types" -); -``` - -> 注意: - 导出到本地文件系统的功能不适用于公有云用户,仅适用于私有化部署的用户。并且默认用户对集群节点有完全的控制权限。Doris 对于用户填写的导出路径不会做合法性检查。如果 Doris 的进程用户对该路径无写权限,或路径不存在,则会报错。同时处于安全性考虑,如果该路径已存在同名的文件,则也会导出失败。 - Doris 不会管理导出到本地的文件,也不会检查磁盘空间等。这些文件需要用户自行管理,如清理等。 - -### 指定分区导出 + +**指定分区导出** 导出作业支持仅导出 Doris 内表的部分分区,如仅导出 test 表的 p1 和 p2 分区 @@ -283,7 +308,8 @@ PROPERTIES ( ); ``` -### 导出时过滤数据 + +**导出时过滤数据** 导出作业支持导出时根据谓词条件过滤数据,仅导出符合条件的数据,如仅导出满足 `k1 < 50` 条件的数据 @@ -297,7 +323,8 @@ PROPERTIES ( ); ``` -### 导出外表数据 + +**导出外表数据** 导出作业支持 Doris Catalog 外表数据: @@ -323,11 +350,10 @@ PROPERTIES( 当前 Export 导出 Catalog 外表数据不支持并发导出,即使指定 parallelism 大于 1,仍然是单线程导出。 ::: -## 最佳实践 + +**调整导出数据一致性** -### 导出一致性 - -`Export`导出支持 `partition / tablets` 两种粒度。`data_consistency`参数用来指定以何种粒度切分希望导出的表,`none` 代表 Tablets 级别,`partition`代表 Partition 级别。 +`Export`导出支持 partition / tablets 两种粒度。`data_consistency`参数用来指定以何种粒度切分希望导出的表,`none` 代表 Tablets 级别,`partition`代表 Partition 级别。 ```sql EXPORT TABLE test TO "file:///home/user/tmp" @@ -344,7 +370,8 @@ PROPERTIES ( 关于 Export 底层构造 `SELECT INTO OUTFILE` 的逻辑,可参阅附录部分。 -### 导出作业并发度 + +**调整导出作业并发度** Export 可以设置不同的并发度来并发导出数据。指定并发度为 5: @@ -359,7 +386,8 @@ PROPERTIES ( 关于 Export 并发导出的原理,可参阅附录部分。 -### 导出前清空导出目录 + +**导出前清空导出目录** ```sql EXPORT TABLE test TO "file:///home/user/tmp" @@ -375,7 +403,8 @@ PROPERTIES ( > 注意: 若要使用 delete_existing_files 参数,还需要在 fe.conf 中添加配置`enable_delete_existing_files = true`并重启 fe,此时 delete_existing_files 才会生效。delete_existing_files = true 是一个危险的操作,建议只在测试环境中使用。 -### 设置导出文件的大小 + +**调整导出文件的大小** 导出作业支持设置导出文件的大小,如果单个文件大小超过设定值,则会按照指定大小分成多个文件导出。 @@ -430,48 +459,40 @@ PROPERTIES ( 导出操作完成后,建议验证导出的数据是否完整和正确,以确保数据的质量和完整性。 -## 附录 - -### 并发导出原理 - -Export 任务的底层是执行`SELECT INTO OUTFILE` SQL 语句。用户发起一个 Export 任务后,Doris 会根据 Export 要导出的表构造出一个或多个 `SELECT INTO OUTFILE` 执行计划,随后将这些`SELECT 
INTO OUTFILE` 执行计划提交给 Doris 的 Job Schedule 任务调度器,Job Schedule 任务调度器会自动调度这些任务并执行。 - -默认情况下,Export 任务是单线程执行的。为了提高导出的效率,Export 命令可以设置一个 `parallelism` 参数来并发导出数据。设置`parallelism` 大于 1 后,Export 任务会使用多个线程并发的去执行 `SELECT INTO OUTFILE` 查询计划。`parallelism`参数实际就是指定执行 EXPORT 作业的线程数量。 - -一个 Export 任务构造一个或多个 `SELECT INTO OUTFILE` 执行计划的具体逻辑是: +* 一个 Export 任务构造一个或多个 `SELECT INTO OUTFILE` 执行计划的具体逻辑是: -1. 选择导出的数据的一致性模型 + 1. 选择导出的数据的一致性模型 - 根据 `data_consistency` 参数来决定导出的一致性,这个只和语义有关,和并发度无关,用户要先根据自己的需求,选择一致性模型。 + 根据 `data_consistency` 参数来决定导出的一致性,这个只和语义有关,和并发度无关,用户要先根据自己的需求,选择一致性模型。 -2. 确定并发度 + 2. 确定并发度 - 根据 `parallelism` 参数确定由多少个线程来运行这些 `SELECT INTO OUTFILE` 执行计划。parallelism 决定了最大可能的线程数。 + 根据 `parallelism` 参数确定由多少个线程来运行这些 `SELECT INTO OUTFILE` 执行计划。parallelism 决定了最大可能的线程数。 - > 注意:即使 Export 命令设置了 `parallelism` 参数,该 Export 任务的实际并发线程数量还与 Job Schedule 有关。Export 任务设置多并发后,每一个并发线程都是 Job Schedule 提供的,所以如果此时 Doris 系统任务较繁忙,Job Schedule 的线程资源较紧张,那么有可能分给 Export 任务的实际线程数量达不到 `parallelism` 个数,影响 Export 的并发导出。此时可以通过减轻系统负载或调整 FE 配置 `async_task_consumer_thread_num` 增加 Job Schedule 的总线程数量来缓解这个问题。 + > 注意:即使 Export 命令设置了 `parallelism` 参数,该 Export 任务的实际并发线程数量还与 Job Schedule 有关。Export 任务设置多并发后,每一个并发线程都是 Job Schedule 提供的,所以如果此时 Doris 系统任务较繁忙,Job Schedule 的线程资源较紧张,那么有可能分给 Export 任务的实际线程数量达不到 `parallelism` 个数,影响 Export 的并发导出。此时可以通过减轻系统负载或调整 FE 配置 `async_task_consumer_thread_num` 增加 Job Schedule 的总线程数量来缓解这个问题。 -3. 确定每一个 outfile 语句的任务量 + 3. 确定每一个 outfile 语句的任务量 - 每一个线程会根据 `maximum_tablets_of_outfile_in_export` 以及数据实际的分区数 / buckets 数来决定要拆分成多少个 outfile。 + 每一个线程会根据 `maximum_tablets_of_outfile_in_export` 以及数据实际的分区数 / buckets 数来决定要拆分成多少个 outfile。 - > `maximum_tablets_of_outfile_in_export` 是 FE 的配置,默认值为 10。该参数用于指定 Export 任务切分出来的单个 OutFile 语句中允许的最大 partitions / buckets 数量。修改该配置需要重启 FE。 + > `maximum_tablets_of_outfile_in_export` 是 FE 的配置,默认值为 10。该参数用于指定 Export 任务切分出来的单个 OutFile 语句中允许的最大 partitions / buckets 数量。修改该配置需要重启 FE。 - 举例:假设一张表共有 20 个 partition,每个 partition 都有 5 个 buckets,那么该表一共有 100 个 buckets。设置`data_consistency = none` 以及 `maximum_tablets_of_outfile_in_export = 10`。 + 举例:假设一张表共有 20 个 partition,每个 partition 都有 5 个 buckets,那么该表一共有 100 个 buckets。设置`data_consistency = none` 以及 `maximum_tablets_of_outfile_in_export = 10`。 - 1. `parallelism = 5` 情况下 + 1. `parallelism = 5` 情况下 - Export 任务将把该表的 100 个 buckets 分成 5 份,每个线程负责 20 个 buckets。每个线程负责的 20 个 buckets 又将以 10 个为单位分成 2 组,每组 buckets 各由一个 outfile 查询计划负责。所以最终该 Export 任务有 5 个线程并发执行,每个线程负责 2 个 outfile 语句,每个线程负责的 outfile 语句串行的被执行。 + Export 任务将把该表的 100 个 buckets 分成 5 份,每个线程负责 20 个 buckets。每个线程负责的 20 个 buckets 又将以 10 个为单位分成 2 组,每组 buckets 各由一个 outfile 查询计划负责。所以最终该 Export 任务有 5 个线程并发执行,每个线程负责 2 个 outfile 语句,每个线程负责的 outfile 语句串行的被执行。 - 2. `parallelism = 3` 情况下 + 2. `parallelism = 3` 情况下 - Export 任务将把该表的 100 个 buckets 分成 3 份,3 个线程分别负责 34、33、33 个 buckets。每个线程负责的 buckets 又将以 10 个为单位分成 4 组(最后一组不足 10 个 buckets),每组 buckets 各由一个 outfile 查询计划负责。所以该 Export 任务最终有 3 个线程并发执行,每个线程负责 4 个 outfile 语句,每个线程负责的 outfile 语句串行的被执行。 + Export 任务将把该表的 100 个 buckets 分成 3 份,3 个线程分别负责 34、33、33 个 buckets。每个线程负责的 buckets 又将以 10 个为单位分成 4 组(最后一组不足 10 个 buckets),每组 buckets 各由一个 outfile 查询计划负责。所以该 Export 任务最终有 3 个线程并发执行,每个线程负责 4 个 outfile 语句,每个线程负责的 outfile 语句串行的被执行。 - 3. `parallelism = 120` 情况下 + 3. 
`parallelism = 120` 情况下 - 由于该表 buckets 只有 100 个,所以系统会将 `parallelism` 强制设为 100,并以 `parallelism = 100` 去执行。Export 任务将把该表的 100 个 buckets 分成 100 份,每个线程负责 1 个 buckets。每个线程负责的 1 个 buckets 又将以 10 个为单位分成 1 组(该组实际就只有 1 个 buckets),每组 buckets 由一个 outfile 查询计划负责。所以最终该 Export 任务有 100 个线程并发执行,每个线程负责 1 个 outfile 语句,每个 outfile 语句实际只导出 1 个 buckets。 + 由于该表 buckets 只有 100 个,所以系统会将 `parallelism` 强制设为 100,并以 `parallelism = 100` 去执行。Export 任务将把该表的 100 个 buckets 分成 100 份,每个线程负责 1 个 buckets。每个线程负责的 1 个 buckets 又将以 10 个为单位分成 1 组(该组实际就只有 1 个 buckets),每组 buckets 由一个 outfile 查询计划负责。所以最终该 Export 任务有 100 个线程并发执行,每个线程负责 1 个 outfile 语句,每个 outfile 语句实际只导出 1 个 buckets。 -当前版本若希望 Export 有一个较好的性能,建议设置以下参数: +* 当前版本若希望 Export 有一个较好的性能,建议设置以下参数: -1. 打开 session 变量 `enable_parallel_outfile`。 -2. 设置 Export 的 `parallelism` 参数为较大值,使得每一个线程只负责一个 `SELECT INTO OUTFILE` 查询计划。 -3. 设置 FE 配置 `maximum_tablets_of_outfile_in_export` 为较小值,使得每一个 `SELECT INTO OUTFILE` 查询计划导出的数据量较小。 + 1. 打开 session 变量 `enable_parallel_outfile`。 + 2. 设置 Export 的 `parallelism` 参数为较大值,使得每一个线程只负责一个 `SELECT INTO OUTFILE` 查询计划。 + 3. 设置 FE 配置 `maximum_tablets_of_outfile_in_export` 为较小值,使得每一个 `SELECT INTO OUTFILE` 查询计划导出的数据量较小。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/export/outfile.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/export/outfile.md index 5c9dd560218a8..abe09176fe20c 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/export/outfile.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/export/outfile.md @@ -26,53 +26,61 @@ under the License. 本文档将介绍如何使用 `SELECT INTO OUTFILE` 命令进行查询结果的导出操作。 -有关`SELECT INTO OUTFILE`命令的详细介绍,请参考:[SELECT INTO OUTFILE](../../sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md) - -## 概述 - `SELECT INTO OUTFILE` 命令将 `SELECT` 部分的结果数据,以指定的文件格式导出到目标存储系统中,包括对象存储、HDFS 或本地文件系统。 `SELECT INTO OUTFILE` 是一个同步命令,命令返回即表示导出结束。若导出成功,会返回导出的文件数量、大小、路径等信息。若导出失败,会返回错误信息。 关于如何选择 `SELECT INTO OUTFILE` 和 `EXPORT`,请参阅 [导出综述](./export-overview.md)。 -`SELECT INTO OUTFILE` 目前支持以下导出格式 -* Parquet -* ORC -* csv -* csv\_with\_names -* csv\_with\_names\_and\_types +有关`SELECT INTO OUTFILE`命令的详细介绍,请参考:[SELECT INTO OUTFILE](../../sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md) -不支持压缩格式的导出。 +-------------- -示例: +## 使用场景 + +`SELECT INTO OUTFILE` 适用于以下场景: + +- 导出数据需要经过复杂计算逻辑的,如过滤、聚合、关联等。 +- 适合执行同步任务的场景。 + +在使用 `SELECT INTO OUTFILE` 时需要注意以下限制: + +- 不支持压缩格式的导出。 +- 2.1 版本 pipeline 引擎不支持并发导出。 +- 若希望导出到本地文件系统,需要在 fe.conf 中添加配置 `enable_outfile_to_local=true` 并重启FE。 -```sql -mysql> SELECT * FROM tbl1 LIMIT 10 INTO OUTFILE "file:///home/work/path/result_"; -+------------+-----------+----------+--------------------------------------------------------------------+ -| FileNumber | TotalRows | FileSize | URL | -+------------+-----------+----------+--------------------------------------------------------------------+ -| 1 | 2 | 8 | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ | -+------------+-----------+----------+--------------------------------------------------------------------+ -``` -返回结果说明: +## 基本原理 -* FileNumber:最终生成的文件个数。 -* TotalRows:结果集行数。 -* FileSize:导出文件总大小。单位字节。 -* URL:导出的文件路径的前缀,多个文件会以后缀 `_0`,`_1` 依次编号。 +`SELECT INTO OUTFILE` 功能本质上是执行一个 SQL 查询命令,其原理基本同普通查询的原理一致。唯一的不同是,普通查询将最后查询的结果集输出到 mysql 客户端,而 `SELECT INTO OUTFILE` 将最后的查询结果集输出到外部存储介质。 -## 导出文件列类型映射 +`SELECT INTO OUTFILE`并发导出的原理是将大规模数据集划分为小块,并在多个节点上并行处理。在可以并发导出的场景下,并行的在多个 BE 节点上导出,每个 BE 处理结果集的一部分。 -`SELECT INTO OUTFILE` 支持导出为 Parquet、ORC 
文件格式。Parquet、ORC 文件格式拥有自己的数据类型,Doris 的导出功能能够自动将 Doris 的数据类型导出为 Parquet、ORC 文件格式的对应数据类型,具体映射关系请参阅[导出综述](../../data-operate/export/export-overview.md)文档的 "导出文件列类型映射" 部分。 +## 快速上手 +### 建表与导入数据 -## 示例 +```sql +CREATE TABLE IF NOT EXISTS tbl ( + `c1` int(11) NULL, + `c2` string NULL, + `c3` bigint NULL +) +DISTRIBUTED BY HASH(c1) BUCKETS 20 +PROPERTIES("replication_num" = "1"); + + +insert into tbl values + (1, 'doris', 18), + (2, 'nereids', 20), + (3, 'pipelibe', 99999), + (4, 'Apache', 122123455), + (5, null, null); +``` ### 导出到 HDFS -将查询结果导出到文件 `hdfs://path/to/` 目录下,指定导出格式为 PARQUET: +将查询结果导出到文件 `hdfs://path/to/` 目录下,指定导出格式为 Parquet : ```sql SELECT c1, c2, c3 FROM tbl @@ -85,6 +93,110 @@ PROPERTIES ); ``` +如果 HDFS 集群开启了高可用,则需要提供 HA 信息,参考案例:[导出到开启了高可用的 HDFS 集群](#高可用HDFS导出) + +如果 HDFS 集群开启了高可用并且启用了 Kerberos 认证,需要提供 Kerberos 认证信息,参考案例:[导出到开启了高可用及kerberos认证的 HDFS 集群](#高可用及kerberos集群导出) + +### 导出到对象存储 + +将查询结果导出到 s3 存储的 `s3://path/to/` 目录下,指定导出格式为 ORC,需要提供`sk` `ak`等信息 + +```sql +SELECT * FROM tbl +INTO OUTFILE "s3://path/to/result_" +FORMAT AS ORC +PROPERTIES( + "s3.endpoint" = "https://xxx", + "s3.region" = "ap-beijing", + "s3.access_key"= "your-ak", + "s3.secret_key" = "your-sk" +); +``` + +### 导出到本地文件系统 +> 如需导出到本地文件,需在 `fe.conf` 中添加 `enable_outfile_to_local=true`并重启 FE。 + +将查询结果导出到 BE 的`file:///path/to/` 目录下,指定导出格式为 CSV,指定列分割符为`,`。 + +```sql +SELECT c1, c2 FROM tbl FROM tbl1 +INTO OUTFILE "file:///path/to/result_" +FORMAT AS CSV +PROPERTIES( + "column_separator" = "," +); +``` + +> 注意: + 导出到本地文件的功能不适用于公有云用户,仅适用于私有化部署的用户。并且默认用户对集群节点有完全的控制权限。Doris 对于用户填写的导出路径不会做合法性检查。如果 Doris 的进程用户对该路径无写权限,或路径不存在,则会报错。同时处于安全性考虑,如果该路径已存在同名的文件,则也会导出失败。 + Doris 不会管理导出到本地的文件,也不会检查磁盘空间等。这些文件需要用户自行管理,如清理等。 + +### 更多用法 + +有关`SELECT INTO OUTFILE`命令的详细介绍,请参考:[SELECT INTO OUTFILE](../../sql-manual/sql-statements/Data-Manipulation-Statements/OUTFILE.md) + +## 导出说明 +### 导出数据存储位置 +`SELECT INTO OUTFILE` 目前支持导出到以下存储位置: + +- 对象存储:Amazon S3、COS、OSS、OBS、Google GCS +- HDFS +- 本地文件系统 + +### 导出文件类型 +`SELECT INTO OUTFILE` 目前支持导出以下文件格式 + +* Parquet +* ORC +* csv +* csv\_with\_names +* csv\_with\_names\_and\_types + +### 导出文件列类型映射 + +`SELECT INTO OUTFILE` 支持导出为 Parquet、ORC 文件格式。Parquet、ORC 文件格式拥有自己的数据类型,Doris 的导出功能能够自动将 Doris 的数据类型导出为 Parquet、ORC 文件格式的对应数据类型。 + +以下是 Doris 数据类型和 Parquet、ORC 文件格式的数据类型映射关系表: +| Doris Type | Arrow Type | Orc Type | +| ---------- | ---------- | -------- | +| boolean | boolean | boolean | +| tinyint | int8 | tinyint | +| smallint | int16 | smallint | +| int | int32 | int | +| bigint | int64 | bigint | +| largeInt | utf8 | string | +| date | utf8 | string | +| datev2 | Date32Type | string | +| datetime | utf8 | string | +| datetimev2 | TimestampType | timestamp | +| float | float32 | float | +| double | float64 | double | +| char / varchar / string| utf8 | string | +| decimal | decimal128 | decimal | +| struct | struct | struct | +| map | map | map | +| array | list | array | +| json | utf8 | string | +| variant | utf8 | string | +| bitmap | binary | binary | +| quantile_state| binary | binary | +| hll | binary | binary | + +> 注意:Doris 导出到 Parquet 文件格式时,会先将 Doris 内存数据转换为 Arrow 内存数据格式,然后由 Arrow 写出到 Parquet 文件格式。 + + +## 导出示例 + +- [导出到开启了高可用的 HDFS 集群](#高可用HDFS导出) +- [导出到开启了高可用及kerberos认证的 HDFS 集群](#高可用及kerberos集群导出) +- [生成导出成功标识文件示例](#生成导出成功标识文件示例) +- [并发导出示例](#并发导出示例) +- [导出前清空导出目录示例](#导出前清空导出目录示例) +- [设置导出文件的大小示例](#设置导出文件的大小示例) + + +**导出到开启了高可用的 HDFS 集群** + 如果 HDFS 开启了高可用,则需要提供 HA 信息,如: ```sql @@ -103,7 +215,10 @@ PROPERTIES ); ``` -如果 Hadoop 集群开启了高可用并且启用了 Kerberos 认证,可以参考如下 SQL 语句: + +**导出到开启了高可用及kerberos认证的 
HDFS 集群** + +如果 Hdfs 集群开启了高可用并且启用了 Kerberos 认证,可以参考如下 SQL 语句: ```sql SELECT * FROM tbl @@ -125,44 +240,8 @@ PROPERTIES ); ``` -### 导出到 S3 - -将查询结果导出到 s3 存储的 `s3://path/to/` 目录下,指定导出格式为 ORC,需要提供`sk` `ak`等信息 - -```sql -SELECT * FROM tbl -INTO OUTFILE "s3://path/to/result_" -FORMAT AS ORC -PROPERTIES( - "s3.endpoint" = "https://xxx", - "s3.region" = "ap-beijing", - "s3.access_key"= "your-ak", - "s3.secret_key" = "your-sk" -); -``` - -### 导出到本地 -> -> 如需导出到本地文件,需在 `fe.conf` 中添加 `enable_outfile_to_local=true`并重启 FE。 - -将查询结果导出到 BE 的`file:///path/to/` 目录下,指定导出格式为 CSV,指定列分割符为`,`。 - -```sql -SELECT k1 FROM tbl1 UNION SELECT k2 FROM tbl1 -INTO OUTFILE "file:///path/to/result_" -FORMAT AS CSV -PROPERTIES( - "column_separator" = "," -); -``` - -> 注意: - 导出到本地文件的功能不适用于公有云用户,仅适用于私有化部署的用户。并且默认用户对集群节点有完全的控制权限。Doris 对于用户填写的导出路径不会做合法性检查。如果 Doris 的进程用户对该路径无写权限,或路径不存在,则会报错。同时处于安全性考虑,如果该路径已存在同名的文件,则也会导出失败。 - Doris 不会管理导出到本地的文件,也不会检查磁盘空间等。这些文件需要用户自行管理,如清理等。 - -## 最佳实践 - -### 生成导出成功标识文件 + +**生成导出成功标识文件示例** `SELECT INTO OUTFILE`命令是一个同步命令,因此有可能在 SQL 执行过程中任务连接断开了,从而无法获悉导出的数据是否正常结束或是否完整。此时可以使用 `success_file_name` 参数要求导出成功后,在目录下生成一个文件标识。 @@ -188,10 +267,44 @@ PROPERTIES 在导出完成后,会多写出一个文件,该文件的文件名为 `SUCCESS`。 -### 并发导出 + +**并发导出示例** 默认情况下,`SELECT` 部分的查询结果会先汇聚到某一个 BE 节点,由该节点单线程导出数据。然而,在某些情况下,如没有 `ORDER BY` 子句的查询语句,则可以开启并发导出,多个 BE 节点同时导出数据,以提升导出性能。 +然而,并非所有的 SQL 查询语句都可以并发导出。一个查询语句是否可以并发导出可以通过以下步骤来判断: + +* 确定会话变量已开启:`set enable_parallel_outfile = true;` +* 通过 `EXPLAIN` 查看执行计划 + +```sql +mysql> EXPLAIN SELECT ... INTO OUTFILE "s3://xxx" ...; ++-----------------------------------------------------------------------------+ +| Explain String | ++-----------------------------------------------------------------------------+ +| PLAN FRAGMENT 0 | +| OUTPUT EXPRS: | | | | +| PARTITION: UNPARTITIONED | +| | +| RESULT SINK | +| | +| 1:EXCHANGE | +| | +| PLAN FRAGMENT 1 | +| OUTPUT EXPRS:`k1` + `k2` | +| PARTITION: HASH_PARTITIONED: `default_cluster:test`.`multi_tablet`.`k1` | +| | +| RESULT FILE SINK | +| FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951_ | +| STORAGE TYPE: S3 | +| | +| 0:OlapScanNode | +| TABLE: multi_tablet | ++-----------------------------------------------------------------------------+ +``` + +`EXPLAIN` 命令会返回该语句的查询计划。观察该查询计划,如果发现 `RESULT FILE SINK` 出现在 `PLAN FRAGMENT 1` 中,就说明该查询语句可以并发导出。如果 `RESULT FILE SINK` 出现在 `PLAN FRAGMENT 0` 中,则说明当前查询不能进行并发导出。 + 下面我们通过一个示例演示如何正确开启并发导出功能: 1. 
打开并发导出会话变量 @@ -237,9 +350,8 @@ mysql> SELECT * FROM demo.tbl ORDER BY id 可以看到,最终结果只有一行,并没有触发并发导出。 -关于更多并发导出的原理说明,可参阅附录部分。 - -### 导出前清空导出目录 + +**导出前清空导出目录示例** ```sql SELECT * FROM tbl1 @@ -259,11 +371,10 @@ PROPERTIES 如果设置了 `"delete_existing_files" = "true"`,导出作业会先将 `s3://my_bucket/export/`目录下所有文件及目录删除,然后导出数据到该目录下。 -> 注意: - -> 若要使用 delete_existing_files 参数,还需要在 fe.conf 中添加配置`enable_delete_existing_files = true`并重启 fe,此时 delete_existing_files 才会生效。delete_existing_files = true 是一个危险的操作,建议只在测试环境中使用。 +> 注意:若要使用 delete_existing_files 参数,还需要在 fe.conf 中添加配置`enable_delete_existing_files = true`并重启 fe,此时 delete_existing_files 才会生效。delete_existing_files = true 是一个危险的操作,建议只在测试环境中使用。 -### 设置导出文件的大小 + +**设置导出文件的大小示例** ```sql SELECT * FROM tbl @@ -284,7 +395,7 @@ PROPERTIES( - 导出数据量和导出效率 - `SELECT INTO OUTFILE`功能本质上是执行一个 SQL 查询命令。如果不开启并发导出,查询结果是由单个 BE 节点,单线程导出的,因此整个导出的耗时包括查询本身的耗时和最终结果集写出的耗时。开启并发导出可以降低导出的时间。 +`SELECT INTO OUTFILE`功能本质上是执行一个 SQL 查询命令。如果不开启并发导出,查询结果是由单个 BE 节点,单线程导出的,因此整个导出的耗时包括查询本身的耗时和最终结果集写出的耗时。开启并发导出可以降低导出的时间。 - 导出超时 @@ -306,53 +417,4 @@ PROPERTIES( - 非可见字符的函数 -   对于部分输出为非可见字符的函数,如 BITMAP、HLL 类型,CSV 输出为 `\N`,Parquet、ORC 输出为 NULL。 - -   目前部分地理信息函数,如 `ST_Point` 的输出类型为 VARCHAR,但实际输出值为经过编码的二进制字符。当前这些函数会输出乱码。对于地理函数,请使用 `ST_AsText` 进行输出。 - -## 附录 - -### 并发导出原理 - -- 原理介绍 - -   Doris 是典型的基于 MPP 架构的高性能、实时的分析型数据库。MPP 架构的一大特征是使用分布式架构,将大规模数据集划分为小块,并在多个节点上并行处理。 - -   `SELECT INTO OUTFILE`的并发导出就是基于上述 MPP 架构的并行处理能力,在可以并发导出的场景下(后面会详细说明哪些场景可以并发导出),并行的在多个 BE 节点上导出,每个 BE 处理结果集的一部分。 - -- 如何判断可以执行并发导出 - - * 确定会话变量已开启:`set enable_parallel_outfile = true;` - * 通过 `EXPLAIN` 查看执行计划 - - ```sql - mysql> EXPLAIN SELECT ... INTO OUTFILE "s3://xxx" ...; - +-----------------------------------------------------------------------------+ - | Explain String | - +-----------------------------------------------------------------------------+ - | PLAN FRAGMENT 0 | - | OUTPUT EXPRS: | | | | - | PARTITION: UNPARTITIONED | - | | - | RESULT SINK | - | | - | 1:EXCHANGE | - | | - | PLAN FRAGMENT 1 | - | OUTPUT EXPRS:`k1` + `k2` | - | PARTITION: HASH_PARTITIONED: `default_cluster:test`.`multi_tablet`.`k1` | - | | - | RESULT FILE SINK | - | FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951_ | - | STORAGE TYPE: S3 | - | | - | 0:OlapScanNode | - | TABLE: multi_tablet | - +-----------------------------------------------------------------------------+ - ``` - - `EXPLAIN` 命令会返回该语句的查询计划。观察该查询计划,如果发现 `RESULT FILE SINK` 出现在 `PLAN FRAGMENT 1` 中,就说明该查询语句可以并发导出。如果 `RESULT FILE SINK` 出现在 `PLAN FRAGMENT 0` 中,则说明当前查询不能进行并发导出。 - -- 导出并发度 - - 当满足并发导出的条件后,导出任务的并发度为:`BE 节点数 * parallel_fragment_exec_instance_num`。 +   对于部分输出为非可见字符的函数,如 BITMAP、HLL 类型,导出到 CSV 文件格式时输出为 `\N`。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md index e41c329c3e378..1dc2c42d07ab6 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md @@ -128,6 +128,14 @@ INTO OUTFILE "file_path" INTO OUTFILE "file:///home/work/path/result_"; ``` +#### 返回结果说明 + +Outfile 语句返回的结果,各个列的含义如下: +* FileNumber:最终生成的文件个数。 +* TotalRows:结果集行数。 +* FileSize:导出文件总大小。单位字节。 +* URL:导出的文件路径的前缀,多个文件会以后缀 `_0`,`_1` 依次编号。 + #### 数据类型映射 parquet、orc 文件格式拥有自己的数据类型,Doris 的导出功能能够自动将 Doris 的数据类型导出到 
parquet/orc 文件格式的对应数据类型,以下是 Doris 数据类型和 parquet/orc 文件格式的数据类型映射关系表: diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md index 8c24cb34a0f4b..6046e6d8c14eb 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md @@ -50,6 +50,35 @@ SHOW EXPORT 3. 可以使用 ORDER BY 对任意列组合进行排序 4. 如果指定了 LIMIT,则显示 limit 条匹配记录。否则全部显示 +`show export` 命令返回的结果各个列的含义如下: + +* JobId:作业的唯一 ID +* Label:该导出作业的标签,如果 Export 没有指定,则系统会默认生成一个。 +* State:作业状态: + * PENDING:作业待调度 + * EXPORTING:数据导出中 + * FINISHED:作业成功 + * CANCELLED:作业失败 +* Progress:作业进度。该进度以查询计划为单位。假设一共 10 个线程,当前已完成 3 个,则进度为 30%。 +* TaskInfo:以 Json 格式展示的作业信息: + * db:数据库名 + * tbl:表名 + * partitions:指定导出的分区。`空`列表 表示所有分区。 + * column\_separator:导出文件的列分隔符。 + * line\_delimiter:导出文件的行分隔符。 + * tablet num:涉及的总 Tablet 数量。 + * broker:使用的 broker 的名称。 + * coord num:查询计划的个数。 + * max\_file\_size:一个导出文件的最大大小。 + * delete\_existing\_files:是否删除导出目录下已存在的文件及目录。 + * columns:指定需要导出的列名,空值代表导出所有列。 + * format:导出的文件格式 +* Path:远端存储上的导出路径。 +* CreateTime/StartTime/FinishTime:作业的创建时间、开始调度时间和结束时间。 +* Timeout:作业超时时间。单位是秒。该时间从 CreateTime 开始计算。 +* ErrorMsg:如果作业出现错误,这里会显示错误原因。 +* OutfileInfo:如果作业导出成功,这里会显示具体的`SELECT INTO OUTFILE`结果信息。 + ## 示例 1. 展示默认 db 的所有导出任务 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/export/export-manual.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/export/export-manual.md index 8f088c371c372..59bdb04b3ecb4 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/export/export-manual.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/export/export-manual.md @@ -26,48 +26,113 @@ under the License. 
本文档将介绍如何使用`EXPORT`命令导出 Doris 中存储的数据。 -有关`EXPORT`命令的详细介绍,请参考:[EXPORT](../../sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md) - -## 概述 - `Export` 是 Doris 提供的一种将数据异步导出的功能。该功能可以将用户指定的表或分区的数据,以指定的文件格式,导出到目标存储系统中,包括对象存储、HDFS 或本地文件系统。 `Export` 是一个异步执行的命令,命令执行成功后,立即返回结果,用户可以通过`Show Export` 命令查看该 Export 任务的详细信息。 +有关`EXPORT`命令的详细介绍,请参考:[EXPORT](../../sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md) + 关于如何选择 `SELECT INTO OUTFILE` 和 `EXPORT`,请参阅 [导出综述](../../data-operate/export/export-overview.md)。 -`EXPORT` 当前支持导出以下类型的表或视图 +--- -* Doris 内表 -* Doris 逻辑视图 -* Doris Catalog 表 +## 基本原理 -`EXPORT` 目前支持以下导出格式 +Export 任务的底层是执行`SELECT INTO OUTFILE` SQL 语句。用户发起一个 Export 任务后,Doris 会根据 Export 要导出的表构造出一个或多个 `SELECT INTO OUTFILE` 执行计划,随后将这些`SELECT INTO OUTFILE` 执行计划提交给 Doris 的 Job Schedule 任务调度器,Job Schedule 任务调度器会自动调度这些任务并执行。 -* Parquet -* ORC -* csv -* csv\_with\_names -* csv\_with\_names\_and\_types +默认情况下,Export 任务是单线程执行的。为了提高导出的效率,Export 命令可以设置 `parallelism` 参数来并发导出数据。设置`parallelism` 大于 1 后,Export 任务会使用多个线程并发的去执行 `SELECT INTO OUTFILE` 查询计划。`parallelism`参数实际就是指定执行 EXPORT 作业的线程数量。 + +## 使用场景 +`Export` 适用于以下场景: -不支持压缩格式的导出。 +- 大数据量的单表导出、仅需简单的过滤条件。 +- 需要异步提交任务的场景。 -示例: +使用 `Export` 时需要注意以下限制: +- 当前不支持压缩格式的导出。 +- 不支持 Select 结果集导出。若需要导出 Select 结果集,请使用[OUTFILE导出](../../data-operate/export/outfile.md) +- 若希望导出到本地文件系统,需要在 fe.conf 中添加配置 `enable_outfile_to_local=true` 并重启FE。 +## 快速上手 +### 建表与导入数据 ```sql -mysql> EXPORT TABLE tpch1.lineitem TO "s3://my_bucket/path/to/exp_" - -> PROPERTIES( - -> "format" = "csv", - -> "max_file_size" = "2048MB" - -> ) - -> WITH s3 ( - -> "s3.endpoint" = "${endpoint}", - -> "s3.region" = "${region}", - -> "s3.secret_key"="${sk}", - -> "s3.access_key" = "${ak}" - -> ); +CREATE TABLE IF NOT EXISTS tbl ( + `c1` int(11) NULL, + `c2` string NULL, + `c3` bigint NULL +) +DISTRIBUTED BY HASH(c1) BUCKETS 20 +PROPERTIES("replication_num" = "1"); + + +insert into tbl values + (1, 'doris', 18), + (2, 'nereids', 20), + (3, 'pipelibe', 99999), + (4, 'Apache', 122123455), + (5, null, null); ``` +### 创建导出作业 + +#### 导出到HDFS +将 tbl 表的所有数据导出到 HDFS 上,设置导出作业的文件格式为 csv(默认格式),并设置列分割符为`,`。 + +```sql +EXPORT TABLE tbl +TO "hdfs://host/path/to/export/" +PROPERTIES +( + "line_delimiter" = "," +) +with HDFS ( + "fs.defaultFS"="hdfs://hdfs_host:port", + "hadoop.username" = "hadoop" +); +``` + +如果 HDFS 集群开启了高可用,则需要提供 HA 信息,参考案例:[导出到开启了高可用的 HDFS 集群](#高可用HDFS导出) + +如果 HDFS 集群开启了高可用并且启用了 Kerberos 认证,需要提供 Kerberos 认证信息,参考案例:[导出到开启了高可用及kerberos认证的 HDFS 集群](#高可用及kerberos集群导出) + +#### 导出到对象存储 + +将 tbl 表中的所有数据导出到对象存储上,设置导出作业的文件格式为 csv(默认格式),并设置列分割符为`,`。 + +```sql +EXPORT TABLE tbl TO "s3://bucket/a/b/c" +PROPERTIES ( + "line_delimiter" = "," +) WITH s3 ( + "s3.endpoint" = "xxxxx", + "s3.region" = "xxxxx", + "s3.secret_key"="xxxx", + "s3.access_key" = "xxxxx" +) +``` + +#### 导出到本地文件系统 + +> +> export 数据导出到本地文件系统,需要在 fe.conf 中添加`enable_outfile_to_local=true`并且重启 FE。 + +将 tbl 表中的所有数据导出到本地文件系统,设置导出作业的文件格式为 csv(默认格式),并设置列分割符为`,`。 + +```sql +-- csv 格式 +EXPORT TABLE tbl TO "file:///home/user/tmp/" +PROPERTIES ( + "format" = "csv", + "line_delimiter" = "," +); +``` + +> 注意: + 导出到本地文件系统的功能不适用于公有云用户,仅适用于私有化部署的用户。并且默认用户对集群节点有完全的控制权限。Doris 对于用户填写的导出路径不会做合法性检查。如果 Doris 的进程用户对该路径无写权限,或路径不存在,则会报错。同时处于安全性考虑,如果该路径已存在同名的文件,则也会导出失败。 + Doris 不会管理导出到本地的文件,也不会检查磁盘空间等。这些文件需要用户自行管理,如清理等。 + + +### 查看导出作业 提交作业后,可以通过 [SHOW EXPORT](../../sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md) 命令查询导出作业状态,结果举例如下: ```sql @@ -97,80 +162,99 @@ OutfileInfo: [ 1 row in set (0.00 sec) ``` -`show 
export` 命令返回的结果各个列的含义如下: - -* JobId:作业的唯一 ID -* Label:该导出作业的标签,如果 Export 没有指定,则系统会默认生成一个。 -* State:作业状态: - * PENDING:作业待调度 - * EXPORTING:数据导出中 - * FINISHED:作业成功 - * CANCELLED:作业失败 -* Progress:作业进度。该进度以查询计划为单位。假设一共 10 个线程,当前已完成 3 个,则进度为 30%。 -* TaskInfo:以 Json 格式展示的作业信息: - * db:数据库名 - * tbl:表名 - * partitions:指定导出的分区。`空`列表 表示所有分区。 - * column\_separator:导出文件的列分隔符。 - * line\_delimiter:导出文件的行分隔符。 - * tablet num:涉及的总 Tablet 数量。 - * broker:使用的 broker 的名称。 - * coord num:查询计划的个数。 - * max\_file\_size:一个导出文件的最大大小。 - * delete\_existing\_files:是否删除导出目录下已存在的文件及目录。 - * columns:指定需要导出的列名,空值代表导出所有列。 - * format:导出的文件格式 -* Path:远端存储上的导出路径。 -* CreateTime/StartTime/FinishTime:作业的创建时间、开始调度时间和结束时间。 -* Timeout:作业超时时间。单位是秒。该时间从 CreateTime 开始计算。 -* ErrorMsg:如果作业出现错误,这里会显示错误原因。 -* OutfileInfo:如果作业导出成功,这里会显示具体的`SELECT INTO OUTFILE`结果信息。 +有关 `show export` 命令的详细用法及其返回结果的各个列的含义可以参看 [SHOW EXPORT](../../sql-manual/sql-statements/Show-Statements/SHOW-EXPORT.md): + +### 取消导出作业 提交 Export 作业后,在 Export 任务成功或失败之前可以通过 [CANCEL EXPORT](../../sql-manual/sql-statements/data-modification/load-and-export/CANCEL-EXPORT.md) 命令取消导出作业。取消命令举例如下: ```sql -CANCEL EXPORT FROM tpch1 WHERE LABEL like "%export_%"; +CANCEL EXPORT FROM dbName WHERE LABEL like "%export_%"; ``` -## 导出文件列类型映射 +## 导出说明 -`Export`支持导出数据为 Parquet、ORC 文件格式。Parquet、ORC 文件格式拥有自己的数据类型,Doris 的导出功能能够自动将 Doris 的数据类型导出为 Parquet、ORC 文件格式的对应数据类型,具体映射关系请参阅[导出综述](../../data-operate/export/export-overview.md)文档的 "导出文件列类型映射" 部分。 +### 导出数据源 -## 示例 +`EXPORT` 当前支持导出以下类型的表或视图 -### 导出到 HDFS +* Doris 内表 +* Doris 逻辑视图 +* Doris Catalog 表 -将 db1.tbl1 表的 p1 和 p2 分区中的`col1` 列和`col2` 列数据导出到 HDFS 上,设置导出作业的 label 为 `mylabel`。导出文件格式为 csv(默认格式),列分割符为`,`,导出作业单个文件大小限制为 512MB。 +### 导出数据存储位置 -```sql -EXPORT TABLE db1.tbl1 -PARTITION (p1,p2) -TO "hdfs://host/path/to/export/" -PROPERTIES -( - "label" = "mylabel", - "column_separator"=",", - "max_file_size" = "512MB", - "columns" = "col1,col2" -) -with HDFS ( - "fs.defaultFS"="hdfs://hdfs_host:port", - "hadoop.username" = "hadoop" -); -``` +`Export` 目前支持导出到以下存储位置: + +- 对象存储:Amazon S3、COS、OSS、OBS、Google GCS +- HDFS +- 本地文件系统 + +### 导出文件类型 + +`EXPORT` 目前支持导出为以下文件格式: + +* Parquet +* ORC +* csv +* csv\_with\_names +* csv\_with\_names\_and\_types + +### 导出文件列类型映射 + +`Export` 支持导出为 Parquet、ORC 文件格式。Parquet、ORC 文件格式拥有自己的数据类型,Doris 的导出功能能够自动将 Doris 的数据类型导出为 Parquet、ORC 文件格式的对应数据类型。 + +以下是 Doris 数据类型和 Parquet、ORC 文件格式的数据类型映射关系表: +| Doris Type | Arrow Type | Orc Type | +| ---------- | ---------- | -------- | +| boolean | boolean | boolean | +| tinyint | int8 | tinyint | +| smallint | int16 | smallint | +| int | int32 | int | +| bigint | int64 | bigint | +| largeInt | utf8 | string | +| date | utf8 | string | +| datev2 | Date32Type | string | +| datetime | utf8 | string | +| datetimev2 | TimestampType | timestamp | +| float | float32 | float | +| double | float64 | double | +| char / varchar / string| utf8 | string | +| decimal | decimal128 | decimal | +| struct | struct | struct | +| map | map | map | +| array | list | array | +| json | utf8 | string | +| variant | utf8 | string | +| bitmap | binary | binary | +| quantile_state| binary | binary | +| hll | binary | binary | + +> 注意:Doris 导出到 Parquet 文件格式时,会先将 Doris 内存数据转换为 Arrow 内存数据格式,然后由 Arrow 写出到 Parquet 文件格式。 + +## 导出示例 + +- [导出到开启了高可用的 HDFS 集群](#高可用HDFS导出) +- [导出到开启了高可用及kerberos认证的 HDFS 集群](#高可用及kerberos集群导出) +- [指定分区导出](#指定分区导出) +- [导出时过滤数据](#导出时过滤数据) +- [导出外表数据](#导出外表数据) +- [调整导出数据一致性](#调整导出数据一致性) +- [调整导出作业并发度](#调整导出作业并发度) +- [导出前清空导出目录](#导出前清空导出目录) +- [调整导出文件的大小](#调整导出文件的大小) + + +**导出到开启了高可用的 HDFS 集群** 如果 HDFS 
开启了高可用,则需要提供 HA 信息,如: ```sql -EXPORT TABLE db1.tbl1 -PARTITION (p1,p2) +EXPORT TABLE tbl TO "hdfs://HDFS8000871/path/to/export/" PROPERTIES ( - "label" = "mylabel", - "column_separator"=",", - "max_file_size" = "512MB", - "columns" = "col1,col2" + "line_delimiter" = "," ) with HDFS ( "fs.defaultFS" = "hdfs://HDFS8000871", @@ -183,18 +267,17 @@ with HDFS ( ); ``` + +**导出到开启了高可用及kerberos认证的 HDFS 集群** + 如果 Hadoop 集群开启了高可用并且启用了 Kerberos 认证,可以参考如下 SQL 语句: ```sql -EXPORT TABLE db1.tbl1 -PARTITION (p1,p2) +EXPORT TABLE tbl TO "hdfs://HDFS8000871/path/to/export/" PROPERTIES ( - "label" = "mylabel", - "column_separator"=",", - "max_file_size" = "512MB", - "columns" = "col1,col2" + "line_delimiter" = "," ) with HDFS ( "fs.defaultFS"="hdfs://hacluster/", @@ -211,63 +294,8 @@ with HDFS ( ); ``` -### 导出到 S3 - -将 s3_test 表中的所有数据导出到 s3 上,导出格式为 csv,以不可见字符 `\x07` 作为行分隔符。 - -```sql -EXPORT TABLE s3_test TO "s3://bucket/a/b/c" -PROPERTIES ( - "line_delimiter" = "\\x07" -) WITH s3 ( - "s3.endpoint" = "xxxxx", - "s3.region" = "xxxxx", - "s3.secret_key"="xxxx", - "s3.access_key" = "xxxxx" -) -``` - -### 导出到本地文件系统 -> -> export 数据导出到本地文件系统,需要在 fe.conf 中添加`enable_outfile_to_local=true`并且重启 FE。 - -将 test 表中的所有数据导出到本地存储: - -```sql --- parquet 格式 -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "columns" = "k1,k2", - "format" = "parquet" -); - --- orc 格式 -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "columns" = "k1,k2", - "format" = "orc" -); - --- csv_with_names 格式,以‘AA’为列分割符,‘zz’为行分割符 -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "format" = "csv_with_names", - "column_separator"="AA", - "line_delimiter" = "zz" -); - --- csv_with_names_and_types 格式 -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "format" = "csv_with_names_and_types" -); -``` - -> 注意: - 导出到本地文件系统的功能不适用于公有云用户,仅适用于私有化部署的用户。并且默认用户对集群节点有完全的控制权限。Doris 对于用户填写的导出路径不会做合法性检查。如果 Doris 的进程用户对该路径无写权限,或路径不存在,则会报错。同时处于安全性考虑,如果该路径已存在同名的文件,则也会导出失败。 - Doris 不会管理导出到本地的文件,也不会检查磁盘空间等。这些文件需要用户自行管理,如清理等。 - -### 指定分区导出 + +**指定分区导出** 导出作业支持仅导出 Doris 内表的部分分区,如仅导出 test 表的 p1 和 p2 分区 @@ -280,7 +308,8 @@ PROPERTIES ( ); ``` -### 导出时过滤数据 + +**导出时过滤数据** 导出作业支持导出时根据谓词条件过滤数据,仅导出符合条件的数据,如仅导出满足 `k1 < 50` 条件的数据 @@ -294,7 +323,8 @@ PROPERTIES ( ); ``` -### 导出外表数据 + +**导出外表数据** 导出作业支持 Doris Catalog 外表数据: @@ -320,9 +350,8 @@ PROPERTIES( 当前 Export 导出 Catalog 外表数据不支持并发导出,即使指定 parallelism 大于 1,仍然是单线程导出。 ::: -## 最佳实践 - -### 导出一致性 + +**调整导出数据一致性** `Export`导出支持 partition / tablets 两种粒度。`data_consistency`参数用来指定以何种粒度切分希望导出的表,`none` 代表 Tablets 级别,`partition`代表 Partition 级别。 @@ -341,7 +370,8 @@ PROPERTIES ( 关于 Export 底层构造 `SELECT INTO OUTFILE` 的逻辑,可参阅附录部分。 -### 导出作业并发度 + +**调整导出作业并发度** Export 可以设置不同的并发度来并发导出数据。指定并发度为 5: @@ -356,7 +386,8 @@ PROPERTIES ( 关于 Export 并发导出的原理,可参阅附录部分。 -### 导出前清空导出目录 + +**导出前清空导出目录** ```sql EXPORT TABLE test TO "file:///home/user/tmp" @@ -372,7 +403,8 @@ PROPERTIES ( > 注意: 若要使用 delete_existing_files 参数,还需要在 fe.conf 中添加配置`enable_delete_existing_files = true`并重启 fe,此时 delete_existing_files 才会生效。delete_existing_files = true 是一个危险的操作,建议只在测试环境中使用。 -### 设置导出文件的大小 + +**调整导出文件的大小** 导出作业支持设置导出文件的大小,如果单个文件大小超过设定值,则会按照指定大小分成多个文件导出。 @@ -427,48 +459,40 @@ PROPERTIES ( 导出操作完成后,建议验证导出的数据是否完整和正确,以确保数据的质量和完整性。 -## 附录 - -### 并发导出原理 - -Export 任务的底层是执行`SELECT INTO OUTFILE` SQL 语句。用户发起一个 Export 任务后,Doris 会根据 Export 要导出的表构造出一个或多个 `SELECT INTO OUTFILE` 执行计划,随后将这些`SELECT INTO OUTFILE` 执行计划提交给 Doris 的 Job Schedule 任务调度器,Job Schedule 任务调度器会自动调度这些任务并执行。 - -默认情况下,Export 任务是单线程执行的。为了提高导出的效率,Export 命令可以设置一个 `parallelism` 
参数来并发导出数据。设置`parallelism` 大于 1 后,Export 任务会使用多个线程并发的去执行 `SELECT INTO OUTFILE` 查询计划。`parallelism`参数实际就是指定执行 EXPORT 作业的线程数量。 - -一个 Export 任务构造一个或多个 `SELECT INTO OUTFILE` 执行计划的具体逻辑是: +* 一个 Export 任务构造一个或多个 `SELECT INTO OUTFILE` 执行计划的具体逻辑是: -1. 选择导出的数据的一致性模型 + 1. 选择导出的数据的一致性模型 - 根据 `data_consistency` 参数来决定导出的一致性,这个只和语义有关,和并发度无关,用户要先根据自己的需求,选择一致性模型。 + 根据 `data_consistency` 参数来决定导出的一致性,这个只和语义有关,和并发度无关,用户要先根据自己的需求,选择一致性模型。 -2. 确定并发度 + 2. 确定并发度 - 根据 `parallelism` 参数确定由多少个线程来运行这些 `SELECT INTO OUTFILE` 执行计划。parallelism 决定了最大可能的线程数。 + 根据 `parallelism` 参数确定由多少个线程来运行这些 `SELECT INTO OUTFILE` 执行计划。parallelism 决定了最大可能的线程数。 - > 注意:即使 Export 命令设置了 `parallelism` 参数,该 Export 任务的实际并发线程数量还与 Job Schedule 有关。Export 任务设置多并发后,每一个并发线程都是 Job Schedule 提供的,所以如果此时 Doris 系统任务较繁忙,Job Schedule 的线程资源较紧张,那么有可能分给 Export 任务的实际线程数量达不到 `parallelism` 个数,影响 Export 的并发导出。此时可以通过减轻系统负载或调整 FE 配置 `async_task_consumer_thread_num` 增加 Job Schedule 的总线程数量来缓解这个问题。 + > 注意:即使 Export 命令设置了 `parallelism` 参数,该 Export 任务的实际并发线程数量还与 Job Schedule 有关。Export 任务设置多并发后,每一个并发线程都是 Job Schedule 提供的,所以如果此时 Doris 系统任务较繁忙,Job Schedule 的线程资源较紧张,那么有可能分给 Export 任务的实际线程数量达不到 `parallelism` 个数,影响 Export 的并发导出。此时可以通过减轻系统负载或调整 FE 配置 `async_task_consumer_thread_num` 增加 Job Schedule 的总线程数量来缓解这个问题。 -3. 确定每一个 outfile 语句的任务量 + 3. 确定每一个 outfile 语句的任务量 - 每一个线程会根据 `maximum_tablets_of_outfile_in_export` 以及数据实际的分区数 / buckets 数来决定要拆分成多少个 outfile。 + 每一个线程会根据 `maximum_tablets_of_outfile_in_export` 以及数据实际的分区数 / buckets 数来决定要拆分成多少个 outfile。 - > `maximum_tablets_of_outfile_in_export` 是 FE 的配置,默认值为 10。该参数用于指定 Export 任务切分出来的单个 OutFile 语句中允许的最大 partitions / buckets 数量。修改该配置需要重启 FE。 + > `maximum_tablets_of_outfile_in_export` 是 FE 的配置,默认值为 10。该参数用于指定 Export 任务切分出来的单个 OutFile 语句中允许的最大 partitions / buckets 数量。修改该配置需要重启 FE。 - 举例:假设一张表共有 20 个 partition,每个 partition 都有 5 个 buckets,那么该表一共有 100 个 buckets。设置`data_consistency = none` 以及 `maximum_tablets_of_outfile_in_export = 10`。 + 举例:假设一张表共有 20 个 partition,每个 partition 都有 5 个 buckets,那么该表一共有 100 个 buckets。设置`data_consistency = none` 以及 `maximum_tablets_of_outfile_in_export = 10`。 - 1. `parallelism = 5` 情况下 + 1. `parallelism = 5` 情况下 - Export 任务将把该表的 100 个 buckets 分成 5 份,每个线程负责 20 个 buckets。每个线程负责的 20 个 buckets 又将以 10 个为单位分成 2 组,每组 buckets 各由一个 outfile 查询计划负责。所以最终该 Export 任务有 5 个线程并发执行,每个线程负责 2 个 outfile 语句,每个线程负责的 outfile 语句串行的被执行。 + Export 任务将把该表的 100 个 buckets 分成 5 份,每个线程负责 20 个 buckets。每个线程负责的 20 个 buckets 又将以 10 个为单位分成 2 组,每组 buckets 各由一个 outfile 查询计划负责。所以最终该 Export 任务有 5 个线程并发执行,每个线程负责 2 个 outfile 语句,每个线程负责的 outfile 语句串行的被执行。 - 2. `parallelism = 3` 情况下 + 2. `parallelism = 3` 情况下 - Export 任务将把该表的 100 个 buckets 分成 3 份,3 个线程分别负责 34、33、33 个 buckets。每个线程负责的 buckets 又将以 10 个为单位分成 4 组(最后一组不足 10 个 buckets),每组 buckets 各由一个 outfile 查询计划负责。所以该 Export 任务最终有 3 个线程并发执行,每个线程负责 4 个 outfile 语句,每个线程负责的 outfile 语句串行的被执行。 + Export 任务将把该表的 100 个 buckets 分成 3 份,3 个线程分别负责 34、33、33 个 buckets。每个线程负责的 buckets 又将以 10 个为单位分成 4 组(最后一组不足 10 个 buckets),每组 buckets 各由一个 outfile 查询计划负责。所以该 Export 任务最终有 3 个线程并发执行,每个线程负责 4 个 outfile 语句,每个线程负责的 outfile 语句串行的被执行。 - 3. `parallelism = 120` 情况下 + 3. 
`parallelism = 120` 情况下 - 由于该表 buckets 只有 100 个,所以系统会将 `parallelism` 强制设为 100,并以 `parallelism = 100` 去执行。Export 任务将把该表的 100 个 buckets 分成 100 份,每个线程负责 1 个 buckets。每个线程负责的 1 个 buckets 又将以 10 个为单位分成 1 组(该组实际就只有 1 个 buckets),每组 buckets 由一个 outfile 查询计划负责。所以最终该 Export 任务有 100 个线程并发执行,每个线程负责 1 个 outfile 语句,每个 outfile 语句实际只导出 1 个 buckets。 + 由于该表 buckets 只有 100 个,所以系统会将 `parallelism` 强制设为 100,并以 `parallelism = 100` 去执行。Export 任务将把该表的 100 个 buckets 分成 100 份,每个线程负责 1 个 buckets。每个线程负责的 1 个 buckets 又将以 10 个为单位分成 1 组(该组实际就只有 1 个 buckets),每组 buckets 由一个 outfile 查询计划负责。所以最终该 Export 任务有 100 个线程并发执行,每个线程负责 1 个 outfile 语句,每个 outfile 语句实际只导出 1 个 buckets。 -当前版本若希望 Export 有一个较好的性能,建议设置以下参数: +* 当前版本若希望 Export 有一个较好的性能,建议设置以下参数: -1. 打开 session 变量 `enable_parallel_outfile`。 -2. 设置 Export 的 `parallelism` 参数为较大值,使得每一个线程只负责一个 `SELECT INTO OUTFILE` 查询计划。 -3. 设置 FE 配置 `maximum_tablets_of_outfile_in_export` 为较小值,使得每一个 `SELECT INTO OUTFILE` 查询计划导出的数据量较小。 + 1. 打开 session 变量 `enable_parallel_outfile`。 + 2. 设置 Export 的 `parallelism` 参数为较大值,使得每一个线程只负责一个 `SELECT INTO OUTFILE` 查询计划。 + 3. 设置 FE 配置 `maximum_tablets_of_outfile_in_export` 为较小值,使得每一个 `SELECT INTO OUTFILE` 查询计划导出的数据量较小。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/export/outfile.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/export/outfile.md index 5c9dd560218a8..abe09176fe20c 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/export/outfile.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/export/outfile.md @@ -26,53 +26,61 @@ under the License. 本文档将介绍如何使用 `SELECT INTO OUTFILE` 命令进行查询结果的导出操作。 -有关`SELECT INTO OUTFILE`命令的详细介绍,请参考:[SELECT INTO OUTFILE](../../sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md) - -## 概述 - `SELECT INTO OUTFILE` 命令将 `SELECT` 部分的结果数据,以指定的文件格式导出到目标存储系统中,包括对象存储、HDFS 或本地文件系统。 `SELECT INTO OUTFILE` 是一个同步命令,命令返回即表示导出结束。若导出成功,会返回导出的文件数量、大小、路径等信息。若导出失败,会返回错误信息。 关于如何选择 `SELECT INTO OUTFILE` 和 `EXPORT`,请参阅 [导出综述](./export-overview.md)。 -`SELECT INTO OUTFILE` 目前支持以下导出格式 -* Parquet -* ORC -* csv -* csv\_with\_names -* csv\_with\_names\_and\_types +有关`SELECT INTO OUTFILE`命令的详细介绍,请参考:[SELECT INTO OUTFILE](../../sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md) -不支持压缩格式的导出。 +-------------- -示例: +## 使用场景 + +`SELECT INTO OUTFILE` 适用于以下场景: + +- 导出数据需要经过复杂计算逻辑的,如过滤、聚合、关联等。 +- 适合执行同步任务的场景。 + +在使用 `SELECT INTO OUTFILE` 时需要注意以下限制: + +- 不支持压缩格式的导出。 +- 2.1 版本 pipeline 引擎不支持并发导出。 +- 若希望导出到本地文件系统,需要在 fe.conf 中添加配置 `enable_outfile_to_local=true` 并重启FE。 -```sql -mysql> SELECT * FROM tbl1 LIMIT 10 INTO OUTFILE "file:///home/work/path/result_"; -+------------+-----------+----------+--------------------------------------------------------------------+ -| FileNumber | TotalRows | FileSize | URL | -+------------+-----------+----------+--------------------------------------------------------------------+ -| 1 | 2 | 8 | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ | -+------------+-----------+----------+--------------------------------------------------------------------+ -``` -返回结果说明: +## 基本原理 -* FileNumber:最终生成的文件个数。 -* TotalRows:结果集行数。 -* FileSize:导出文件总大小。单位字节。 -* URL:导出的文件路径的前缀,多个文件会以后缀 `_0`,`_1` 依次编号。 +`SELECT INTO OUTFILE` 功能本质上是执行一个 SQL 查询命令,其原理基本同普通查询的原理一致。唯一的不同是,普通查询将最后查询的结果集输出到 mysql 客户端,而 `SELECT INTO OUTFILE` 将最后的查询结果集输出到外部存储介质。 -## 导出文件列类型映射 +`SELECT INTO OUTFILE`并发导出的原理是将大规模数据集划分为小块,并在多个节点上并行处理。在可以并发导出的场景下,并行的在多个 BE 节点上导出,每个 BE 处理结果集的一部分。 -`SELECT INTO OUTFILE` 支持导出为 Parquet、ORC 
文件格式。Parquet、ORC 文件格式拥有自己的数据类型,Doris 的导出功能能够自动将 Doris 的数据类型导出为 Parquet、ORC 文件格式的对应数据类型,具体映射关系请参阅[导出综述](../../data-operate/export/export-overview.md)文档的 "导出文件列类型映射" 部分。 +## 快速上手 +### 建表与导入数据 -## 示例 +```sql +CREATE TABLE IF NOT EXISTS tbl ( + `c1` int(11) NULL, + `c2` string NULL, + `c3` bigint NULL +) +DISTRIBUTED BY HASH(c1) BUCKETS 20 +PROPERTIES("replication_num" = "1"); + + +insert into tbl values + (1, 'doris', 18), + (2, 'nereids', 20), + (3, 'pipelibe', 99999), + (4, 'Apache', 122123455), + (5, null, null); +``` ### 导出到 HDFS -将查询结果导出到文件 `hdfs://path/to/` 目录下,指定导出格式为 PARQUET: +将查询结果导出到文件 `hdfs://path/to/` 目录下,指定导出格式为 Parquet : ```sql SELECT c1, c2, c3 FROM tbl @@ -85,6 +93,110 @@ PROPERTIES ); ``` +如果 HDFS 集群开启了高可用,则需要提供 HA 信息,参考案例:[导出到开启了高可用的 HDFS 集群](#高可用HDFS导出) + +如果 HDFS 集群开启了高可用并且启用了 Kerberos 认证,需要提供 Kerberos 认证信息,参考案例:[导出到开启了高可用及kerberos认证的 HDFS 集群](#高可用及kerberos集群导出) + +### 导出到对象存储 + +将查询结果导出到 s3 存储的 `s3://path/to/` 目录下,指定导出格式为 ORC,需要提供`sk` `ak`等信息 + +```sql +SELECT * FROM tbl +INTO OUTFILE "s3://path/to/result_" +FORMAT AS ORC +PROPERTIES( + "s3.endpoint" = "https://xxx", + "s3.region" = "ap-beijing", + "s3.access_key"= "your-ak", + "s3.secret_key" = "your-sk" +); +``` + +### 导出到本地文件系统 +> 如需导出到本地文件,需在 `fe.conf` 中添加 `enable_outfile_to_local=true`并重启 FE。 + +将查询结果导出到 BE 的`file:///path/to/` 目录下,指定导出格式为 CSV,指定列分割符为`,`。 + +```sql +SELECT c1, c2 FROM tbl FROM tbl1 +INTO OUTFILE "file:///path/to/result_" +FORMAT AS CSV +PROPERTIES( + "column_separator" = "," +); +``` + +> 注意: + 导出到本地文件的功能不适用于公有云用户,仅适用于私有化部署的用户。并且默认用户对集群节点有完全的控制权限。Doris 对于用户填写的导出路径不会做合法性检查。如果 Doris 的进程用户对该路径无写权限,或路径不存在,则会报错。同时处于安全性考虑,如果该路径已存在同名的文件,则也会导出失败。 + Doris 不会管理导出到本地的文件,也不会检查磁盘空间等。这些文件需要用户自行管理,如清理等。 + +### 更多用法 + +有关`SELECT INTO OUTFILE`命令的详细介绍,请参考:[SELECT INTO OUTFILE](../../sql-manual/sql-statements/Data-Manipulation-Statements/OUTFILE.md) + +## 导出说明 +### 导出数据存储位置 +`SELECT INTO OUTFILE` 目前支持导出到以下存储位置: + +- 对象存储:Amazon S3、COS、OSS、OBS、Google GCS +- HDFS +- 本地文件系统 + +### 导出文件类型 +`SELECT INTO OUTFILE` 目前支持导出以下文件格式 + +* Parquet +* ORC +* csv +* csv\_with\_names +* csv\_with\_names\_and\_types + +### 导出文件列类型映射 + +`SELECT INTO OUTFILE` 支持导出为 Parquet、ORC 文件格式。Parquet、ORC 文件格式拥有自己的数据类型,Doris 的导出功能能够自动将 Doris 的数据类型导出为 Parquet、ORC 文件格式的对应数据类型。 + +以下是 Doris 数据类型和 Parquet、ORC 文件格式的数据类型映射关系表: +| Doris Type | Arrow Type | Orc Type | +| ---------- | ---------- | -------- | +| boolean | boolean | boolean | +| tinyint | int8 | tinyint | +| smallint | int16 | smallint | +| int | int32 | int | +| bigint | int64 | bigint | +| largeInt | utf8 | string | +| date | utf8 | string | +| datev2 | Date32Type | string | +| datetime | utf8 | string | +| datetimev2 | TimestampType | timestamp | +| float | float32 | float | +| double | float64 | double | +| char / varchar / string| utf8 | string | +| decimal | decimal128 | decimal | +| struct | struct | struct | +| map | map | map | +| array | list | array | +| json | utf8 | string | +| variant | utf8 | string | +| bitmap | binary | binary | +| quantile_state| binary | binary | +| hll | binary | binary | + +> 注意:Doris 导出到 Parquet 文件格式时,会先将 Doris 内存数据转换为 Arrow 内存数据格式,然后由 Arrow 写出到 Parquet 文件格式。 + + +## 导出示例 + +- [导出到开启了高可用的 HDFS 集群](#高可用HDFS导出) +- [导出到开启了高可用及kerberos认证的 HDFS 集群](#高可用及kerberos集群导出) +- [生成导出成功标识文件示例](#生成导出成功标识文件示例) +- [并发导出示例](#并发导出示例) +- [导出前清空导出目录示例](#导出前清空导出目录示例) +- [设置导出文件的大小示例](#设置导出文件的大小示例) + + +**导出到开启了高可用的 HDFS 集群** + 如果 HDFS 开启了高可用,则需要提供 HA 信息,如: ```sql @@ -103,7 +215,10 @@ PROPERTIES ); ``` -如果 Hadoop 集群开启了高可用并且启用了 Kerberos 认证,可以参考如下 SQL 语句: + +**导出到开启了高可用及kerberos认证的 
HDFS 集群** + +如果 Hdfs 集群开启了高可用并且启用了 Kerberos 认证,可以参考如下 SQL 语句: ```sql SELECT * FROM tbl @@ -125,44 +240,8 @@ PROPERTIES ); ``` -### 导出到 S3 - -将查询结果导出到 s3 存储的 `s3://path/to/` 目录下,指定导出格式为 ORC,需要提供`sk` `ak`等信息 - -```sql -SELECT * FROM tbl -INTO OUTFILE "s3://path/to/result_" -FORMAT AS ORC -PROPERTIES( - "s3.endpoint" = "https://xxx", - "s3.region" = "ap-beijing", - "s3.access_key"= "your-ak", - "s3.secret_key" = "your-sk" -); -``` - -### 导出到本地 -> -> 如需导出到本地文件,需在 `fe.conf` 中添加 `enable_outfile_to_local=true`并重启 FE。 - -将查询结果导出到 BE 的`file:///path/to/` 目录下,指定导出格式为 CSV,指定列分割符为`,`。 - -```sql -SELECT k1 FROM tbl1 UNION SELECT k2 FROM tbl1 -INTO OUTFILE "file:///path/to/result_" -FORMAT AS CSV -PROPERTIES( - "column_separator" = "," -); -``` - -> 注意: - 导出到本地文件的功能不适用于公有云用户,仅适用于私有化部署的用户。并且默认用户对集群节点有完全的控制权限。Doris 对于用户填写的导出路径不会做合法性检查。如果 Doris 的进程用户对该路径无写权限,或路径不存在,则会报错。同时处于安全性考虑,如果该路径已存在同名的文件,则也会导出失败。 - Doris 不会管理导出到本地的文件,也不会检查磁盘空间等。这些文件需要用户自行管理,如清理等。 - -## 最佳实践 - -### 生成导出成功标识文件 + +**生成导出成功标识文件示例** `SELECT INTO OUTFILE`命令是一个同步命令,因此有可能在 SQL 执行过程中任务连接断开了,从而无法获悉导出的数据是否正常结束或是否完整。此时可以使用 `success_file_name` 参数要求导出成功后,在目录下生成一个文件标识。 @@ -188,10 +267,44 @@ PROPERTIES 在导出完成后,会多写出一个文件,该文件的文件名为 `SUCCESS`。 -### 并发导出 + +**并发导出示例** 默认情况下,`SELECT` 部分的查询结果会先汇聚到某一个 BE 节点,由该节点单线程导出数据。然而,在某些情况下,如没有 `ORDER BY` 子句的查询语句,则可以开启并发导出,多个 BE 节点同时导出数据,以提升导出性能。 +然而,并非所有的 SQL 查询语句都可以并发导出。一个查询语句是否可以并发导出可以通过以下步骤来判断: + +* 确定会话变量已开启:`set enable_parallel_outfile = true;` +* 通过 `EXPLAIN` 查看执行计划 + +```sql +mysql> EXPLAIN SELECT ... INTO OUTFILE "s3://xxx" ...; ++-----------------------------------------------------------------------------+ +| Explain String | ++-----------------------------------------------------------------------------+ +| PLAN FRAGMENT 0 | +| OUTPUT EXPRS: | | | | +| PARTITION: UNPARTITIONED | +| | +| RESULT SINK | +| | +| 1:EXCHANGE | +| | +| PLAN FRAGMENT 1 | +| OUTPUT EXPRS:`k1` + `k2` | +| PARTITION: HASH_PARTITIONED: `default_cluster:test`.`multi_tablet`.`k1` | +| | +| RESULT FILE SINK | +| FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951_ | +| STORAGE TYPE: S3 | +| | +| 0:OlapScanNode | +| TABLE: multi_tablet | ++-----------------------------------------------------------------------------+ +``` + +`EXPLAIN` 命令会返回该语句的查询计划。观察该查询计划,如果发现 `RESULT FILE SINK` 出现在 `PLAN FRAGMENT 1` 中,就说明该查询语句可以并发导出。如果 `RESULT FILE SINK` 出现在 `PLAN FRAGMENT 0` 中,则说明当前查询不能进行并发导出。 + 下面我们通过一个示例演示如何正确开启并发导出功能: 1. 
打开并发导出会话变量 @@ -237,9 +350,8 @@ mysql> SELECT * FROM demo.tbl ORDER BY id 可以看到,最终结果只有一行,并没有触发并发导出。 -关于更多并发导出的原理说明,可参阅附录部分。 - -### 导出前清空导出目录 + +**导出前清空导出目录示例** ```sql SELECT * FROM tbl1 @@ -259,11 +371,10 @@ PROPERTIES 如果设置了 `"delete_existing_files" = "true"`,导出作业会先将 `s3://my_bucket/export/`目录下所有文件及目录删除,然后导出数据到该目录下。 -> 注意: - -> 若要使用 delete_existing_files 参数,还需要在 fe.conf 中添加配置`enable_delete_existing_files = true`并重启 fe,此时 delete_existing_files 才会生效。delete_existing_files = true 是一个危险的操作,建议只在测试环境中使用。 +> 注意:若要使用 delete_existing_files 参数,还需要在 fe.conf 中添加配置`enable_delete_existing_files = true`并重启 fe,此时 delete_existing_files 才会生效。delete_existing_files = true 是一个危险的操作,建议只在测试环境中使用。 -### 设置导出文件的大小 + +**设置导出文件的大小示例** ```sql SELECT * FROM tbl @@ -284,7 +395,7 @@ PROPERTIES( - 导出数据量和导出效率 - `SELECT INTO OUTFILE`功能本质上是执行一个 SQL 查询命令。如果不开启并发导出,查询结果是由单个 BE 节点,单线程导出的,因此整个导出的耗时包括查询本身的耗时和最终结果集写出的耗时。开启并发导出可以降低导出的时间。 +`SELECT INTO OUTFILE`功能本质上是执行一个 SQL 查询命令。如果不开启并发导出,查询结果是由单个 BE 节点,单线程导出的,因此整个导出的耗时包括查询本身的耗时和最终结果集写出的耗时。开启并发导出可以降低导出的时间。 - 导出超时 @@ -306,53 +417,4 @@ PROPERTIES( - 非可见字符的函数 -   对于部分输出为非可见字符的函数,如 BITMAP、HLL 类型,CSV 输出为 `\N`,Parquet、ORC 输出为 NULL。 - -   目前部分地理信息函数,如 `ST_Point` 的输出类型为 VARCHAR,但实际输出值为经过编码的二进制字符。当前这些函数会输出乱码。对于地理函数,请使用 `ST_AsText` 进行输出。 - -## 附录 - -### 并发导出原理 - -- 原理介绍 - -   Doris 是典型的基于 MPP 架构的高性能、实时的分析型数据库。MPP 架构的一大特征是使用分布式架构,将大规模数据集划分为小块,并在多个节点上并行处理。 - -   `SELECT INTO OUTFILE`的并发导出就是基于上述 MPP 架构的并行处理能力,在可以并发导出的场景下(后面会详细说明哪些场景可以并发导出),并行的在多个 BE 节点上导出,每个 BE 处理结果集的一部分。 - -- 如何判断可以执行并发导出 - - * 确定会话变量已开启:`set enable_parallel_outfile = true;` - * 通过 `EXPLAIN` 查看执行计划 - - ```sql - mysql> EXPLAIN SELECT ... INTO OUTFILE "s3://xxx" ...; - +-----------------------------------------------------------------------------+ - | Explain String | - +-----------------------------------------------------------------------------+ - | PLAN FRAGMENT 0 | - | OUTPUT EXPRS: | | | | - | PARTITION: UNPARTITIONED | - | | - | RESULT SINK | - | | - | 1:EXCHANGE | - | | - | PLAN FRAGMENT 1 | - | OUTPUT EXPRS:`k1` + `k2` | - | PARTITION: HASH_PARTITIONED: `default_cluster:test`.`multi_tablet`.`k1` | - | | - | RESULT FILE SINK | - | FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951_ | - | STORAGE TYPE: S3 | - | | - | 0:OlapScanNode | - | TABLE: multi_tablet | - +-----------------------------------------------------------------------------+ - ``` - - `EXPLAIN` 命令会返回该语句的查询计划。观察该查询计划,如果发现 `RESULT FILE SINK` 出现在 `PLAN FRAGMENT 1` 中,就说明该查询语句可以并发导出。如果 `RESULT FILE SINK` 出现在 `PLAN FRAGMENT 0` 中,则说明当前查询不能进行并发导出。 - -- 导出并发度 - - 当满足并发导出的条件后,导出任务的并发度为:`BE 节点数 * parallel_fragment_exec_instance_num`。 +   对于部分输出为非可见字符的函数,如 BITMAP、HLL 类型,导出到 CSV 文件格式时输出为 `\N`。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md index 2901fd446e21e..5204794f4c546 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md @@ -130,6 +130,14 @@ INTO OUTFILE "file_path" INTO OUTFILE "file:///home/work/path/result_"; ``` +#### 返回结果说明 + +Outfile 语句返回的结果,各个列的含义如下: +* FileNumber:最终生成的文件个数。 +* TotalRows:结果集行数。 +* FileSize:导出文件总大小。单位字节。 +* URL:导出的文件路径的前缀,多个文件会以后缀 `_0`,`_1` 依次编号。 + #### 数据类型映射 Parquet、ORC 文件格式拥有自己的数据类型,Doris 的导出功能能够自动将 Doris 的数据类型导出到 
Parquet/ORC 文件格式的对应数据类型,以下是 Apache Doris 数据类型和 Parquet/ORC 文件格式的数据类型映射关系表: diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md index 49422fc6b39fc..a74e3c0bed78e 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md @@ -51,6 +51,35 @@ SHOW EXPORT 3. 可以使用 ORDER BY 对任意列组合进行排序 4. 如果指定了 LIMIT,则显示 limit 条匹配记录。否则全部显示 +`show export` 命令返回的结果各个列的含义如下: + +* JobId:作业的唯一 ID +* Label:该导出作业的标签,如果 Export 没有指定,则系统会默认生成一个。 +* State:作业状态: + * PENDING:作业待调度 + * EXPORTING:数据导出中 + * FINISHED:作业成功 + * CANCELLED:作业失败 +* Progress:作业进度。该进度以查询计划为单位。假设一共 10 个线程,当前已完成 3 个,则进度为 30%。 +* TaskInfo:以 Json 格式展示的作业信息: + * db:数据库名 + * tbl:表名 + * partitions:指定导出的分区。`空`列表 表示所有分区。 + * column\_separator:导出文件的列分隔符。 + * line\_delimiter:导出文件的行分隔符。 + * tablet num:涉及的总 Tablet 数量。 + * broker:使用的 broker 的名称。 + * coord num:查询计划的个数。 + * max\_file\_size:一个导出文件的最大大小。 + * delete\_existing\_files:是否删除导出目录下已存在的文件及目录。 + * columns:指定需要导出的列名,空值代表导出所有列。 + * format:导出的文件格式 +* Path:远端存储上的导出路径。 +* CreateTime/StartTime/FinishTime:作业的创建时间、开始调度时间和结束时间。 +* Timeout:作业超时时间。单位是秒。该时间从 CreateTime 开始计算。 +* ErrorMsg:如果作业出现错误,这里会显示错误原因。 +* OutfileInfo:如果作业导出成功,这里会显示具体的`SELECT INTO OUTFILE`结果信息。 + ## 示例 1. 展示默认 db 的所有导出任务 From e97c94a634544d1e75736bf52a4c3f5c3c7db250 Mon Sep 17 00:00:00 2001 From: TieweiFang Date: Fri, 3 Jan 2025 17:45:49 +0800 Subject: [PATCH 4/4] add english docs --- docs/data-operate/export/export-manual.md | 464 +++++++++--------- docs/data-operate/export/outfile.md | 368 ++++++++------ .../Data-Manipulation-Statements/OUTFILE.md | 7 + .../Show-Statements/SHOW-EXPORT.md | 29 ++ .../data-operate/export/export-manual.md | 464 +++++++++--------- .../data-operate/export/outfile.md | 368 ++++++++------ .../load-and-export/OUTFILE.md | 7 + .../load-and-export/SHOW-EXPORT.md | 31 +- .../data-operate/export/export-manual.md | 464 +++++++++--------- .../data-operate/export/outfile.md | 368 ++++++++------ .../load-and-export/OUTFILE.md | 7 + .../load-and-export/SHOW-EXPORT.md | 29 ++ 12 files changed, 1465 insertions(+), 1141 deletions(-) diff --git a/docs/data-operate/export/export-manual.md b/docs/data-operate/export/export-manual.md index c88d5119700d4..1f6bbcf56c37f 100644 --- a/docs/data-operate/export/export-manual.md +++ b/docs/data-operate/export/export-manual.md @@ -24,51 +24,113 @@ specific language governing permissions and limitations under the License. --> -This document will introduce how to use the `EXPORT` command to export data stored in Doris. +This document will introduce how to use the `EXPORT` command to export the data stored in Doris. -For a detailed description of the `EXPORT` command, please refer to: [EXPORT](../../sql-manual/sql-statements/Data-Manipulation-Statements/Manipulation/EXPORT.md) +`Export` is a function provided by Doris for asynchronous data export. This function can export the data of tables or partitions specified by the user in a specified file format to the target storage system, including object storage, HDFS, or the local file system. -## Overview +`Export` is an asynchronously executed command. 
After the command is executed successfully, it immediately returns a result, and the user can view the detailed information of the Export task through the `Show Export` command. -`Export` is a feature provided by Doris for asynchronously exporting data. This feature allows users to export data from specified tables or partitions in a specified file format to a target storage system, including object storage, HDFS, or the local file system. +For a detailed introduction of the `EXPORT` command, please refer to: [EXPORT](../../sql-manual/sql-statements/Data-Manipulation-Statements/Manipulation/EXPORT.md) -`Export` is an asynchronous command. After the command is successfully executed, it immediately returns the result. Users can use the `Show Export` command to view detailed information about the export task. +Regarding how to choose between `SELECT INTO OUTFILE` and `EXPORT`, please refer to [Export Overview](../../data-operate/export/export-overview.md). -For guidance on choosing between `SELECT INTO OUTFILE` and `EXPORT`, please see [Export Overview](../../data-operate/export/export-overview). +--- -`EXPORT` currently supports exporting the following types of tables or views: +## Basic Principles -- Doris internal tables -- Doris logical views -- Doris Catalog tables +The underlying layer of the Export task is to execute the `SELECT INTO OUTFILE` SQL statement. After a user initiates an Export task, Doris will construct one or more `SELECT INTO OUTFILE` execution plans based on the table to be exported by Export, and then submit these `SELECT INTO OUTFILE` execution plans to Doris's Job Schedule task scheduler. The Job Schedule task scheduler will automatically schedule and execute these tasks. -`EXPORT` currently supports the following export formats: +By default, the Export task is executed in a single thread. To improve the export efficiency, the Export command can set the `parallelism` parameter to concurrently export data. After setting `parallelism` to be greater than 1, the Export task will use multiple threads to concurrently execute the `SELECT INTO OUTFILE` query plans. The `parallelism` parameter actually specifies the number of threads that execute the EXPORT operation. -- Parquet -- ORC -- csv -- csv\_with\_names -- csv\_with\_names\_and\_types +## Usage Scenarios +`Export` is suitable for the following scenarios: +- Exporting a single table with a large amount of data and only requiring simple filtering conditions. +- Scenarios where tasks need to be submitted asynchronously. + +The following limitations should be noted when using `Export`: +- Currently, the export of compressed formats is not supported. +- Exporting the Select result set is not supported. If you need to export the Select result set, please use [OUTFILE Export](../../data-operate/export/outfile.md). +- If you want to export to the local file system, you need to add the configuration `enable_outfile_to_local = true` in `fe.conf` and restart the FE. + +## Quick Start +### Table Creation and Data Import + +```sql +CREATE TABLE IF NOT EXISTS tbl ( + `c1` int(11) NULL, + `c2` string NULL, + `c3` bigint NULL +) +DISTRIBUTED BY HASH(c1) BUCKETS 20 +PROPERTIES("replication_num" = "1"); -Exporting in compressed formats is not supported. -Example: +insert into tbl values + (1, 'doris', 18), + (2, 'nereids', 20), + (3, 'pipelibe', 99999), + (4, 'Apache', 122123455), + (5, null, null); +``` + +### Create an Export Job + +#### Export to HDFS +Export all data from the `tbl` table to HDFS. 
Set the file format of the export job to csv (the default format) and set the column delimiter to `,`. ```sql -mysql> EXPORT TABLE tpch1.lineitem TO "s3://my_bucket/path/to/exp_" - -> PROPERTIES( - -> "format" = "csv", - -> "max_file_size" = "2048MB" - -> ) - -> WITH s3 ( - -> "s3.endpoint" = "${endpoint}", - -> "s3.region" = "${region}", - -> "s3.secret_key"="${sk}", - -> "s3.access_key" = "${ak}" - -> ); +EXPORT TABLE tbl +TO "hdfs://host/path/to/export/" +PROPERTIES +( + "line_delimiter" = "," +) +with HDFS ( + "fs.defaultFS"="hdfs://hdfs_host:port", + "hadoop.username" = "hadoop" +); ``` -After submitting a job, you can query the export job status using the [SHOW EXPORT](../../sql-manual/sql-statements/Show-Statements/SHOW-EXPORT.md) command. An example result is as follows: +If the HDFS cluster has high availability enabled, HA information needs to be provided. Refer to the example: [Export to an HDFS Cluster with High Availability Enabled](#high-availability-hdfs-export). + +If the HDFS cluster has both high availability enabled and Kerberos authentication enabled, Kerberos authentication information needs to be provided. Refer to the example: [Export to an HDFS Cluster with High Availability and Kerberos Authentication Enabled](#high-availability-and-kerberos-cluster-export). + +#### Export to Object Storage + +Export the query results to the directory `s3://path/to/` in the S3 storage, specify the export format as ORC, and information such as `sk` (secret key) and `ak` (access key) needs to be provided. + +```sql +EXPORT TABLE tbl TO "s3://bucket/a/b/c" +PROPERTIES ( + "line_delimiter" = "," +) WITH s3 ( + "s3.endpoint" = "xxxxx", + "s3.region" = "xxxxx", + "s3.secret_key"="xxxx", + "s3.access_key" = "xxxxx" +) +``` + +#### Export to the Local File System +> If you need to export to a local file, you must add `enable_outfile_to_local = true` to `fe.conf` and restart the FE. + +Export the query results to the directory `file:///path/to/` on the BE, specify the export format as CSV, and specify the column separator as `,`. + +```sql +-- csv format +EXPORT TABLE tbl TO "file:///home/user/tmp/" +PROPERTIES ( + "format" = "csv", + "line_delimiter" = "," +); +``` + +> Note: +The function of exporting to local files is not applicable to public cloud users, but only to users with private deployments. And it is assumed by default that the user has full control rights over the cluster nodes. Doris does not perform legality checks on the export paths filled in by the user. If the process user of Doris does not have write permissions for the path, or the path does not exist, an error will be reported. Also, for security reasons, if there is a file with the same name already existing in the path, the export will fail. +Doris does not manage the files exported to the local system, nor does it check the disk space, etc. These files need to be managed by the user, such as cleaning them up. + +### View Export Jobs +After submitting a job, you can query the status of the export job via the [SHOW EXPORT](../../sql-manual/sql-statements/Show-Statements/SHOW-EXPORT.md) command. An example of the result is as follows: ```sql mysql> show export\G @@ -97,80 +159,93 @@ OutfileInfo: [ 1 row in set (0.00 sec) ``` -The meaning of each column in the result returned by the `show export` command is as follows: - -- JobId: The unique ID of the job -- Label: The label of the export job. If not specified in the export, the system will generate one by default. 
-- State: Job status: - - PENDING: Job pending scheduling - - EXPORTING: Data export in progress - - FINISHED: Job successful - - CANCELLED: Job failed -- Progress: Job progress. This progress is based on query plans. For example, if there are a total of 10 threads and 3 have been completed, the progress is 30%. -- TaskInfo: Job information displayed in JSON format: - - db: Database name - - tbl: Table name - - partitions: Specified partitions for export. An empty list indicates all partitions. - - column\_separator: Column separator for the export file. - - line\_delimiter: Line delimiter for the export file. - - tablet num: Total number of tablets involved. - - broker: Name of the broker used. - - coord num: Number of query plans. - - max\_file\_size: Maximum size of an export file. - - delete\_existing\_files: Whether to delete existing files and directories in the export directory. - - columns: Specified column names to export, empty value represents exporting all columns. - - format: File format for export -- Path: Export path on the remote storage. -- CreateTime/StartTime/FinishTime: Job creation time, scheduling start time, and end time. -- Timeout: Job timeout time in seconds. This time is calculated from CreateTime. -- ErrorMsg: If there is an error in the job, the error reason will be displayed here. -- OutfileInfo: If the job is successfully exported, specific `SELECT INTO OUTFILE` result information will be displayed here. - -After submitting the Export job, you can cancel the export job using the [CANCEL EXPORT](../../sql-manual/sql-statements/Data-Manipulation-Statements/Manipulation/CANCEL-EXPORT.md) command before the export task succeeds or fails. An example of the cancel command is as follows: - -```sql -CANCEL EXPORT FROM tpch1 WHERE LABEL like "%export_%"; -``` - -## Export File Column Type Mapping - -`Export` supports exporting data in Parquet and ORC file formats. Parquet and ORC file formats have their own data types. Doris's export function can automatically export Doris's data types to the corresponding data types of Parquet and ORC file formats. For specific mapping relationships, please refer to the "Export File Column Type Mapping" section of the [Export Overview](../../data-operate/export/export-overview.md) document. - -## Examples +For the detailed usage of the `show export` command and the meaning of each column in the returned results, please refer to [SHOW EXPORT](../../sql-manual/sql-statements/Show-Statements/SHOW-EXPORT.md). -### Export to HDFS - -Export data from the `col1` and `col2` columns in the `p1` and `p2` partitions of the db1.tbl1 table to HDFS, setting the label of the export job to `mylabel`. The export file format is csv (default format), the column delimiter is `,`, and the maximum size limit for a single export file is 512MB. +### Cancel Export Jobs +After submitting an Export job, the export job can be cancelled via the [CANCEL EXPORT](../../sql-manual/sql-statements/Data-Manipulation-Statements/Manipulation/CANCEL-EXPORT.md) command before the Export task succeeds or fails. 
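In practice, the label of the job to be cancelled is usually looked up first with `SHOW EXPORT`. A minimal sketch (the database name and the state filter here are illustrative):

```sql
-- List the jobs of the target database that are still running;
-- the Label column is what the CANCEL EXPORT filter below matches against.
SHOW EXPORT FROM dbName WHERE STATE = "exporting";
```
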
An example of the cancellation command is as follows: ```sql -EXPORT TABLE db1.tbl1 -PARTITION (p1,p2) -TO "hdfs://host/path/to/export/" -PROPERTIES -( - "label" = "mylabel", - "column_separator"=",", - "max_file_size" = "512MB", - "columns" = "col1,col2" -) -with HDFS ( - "fs.defaultFS"="hdfs://hdfs_host:port", - "hadoop.username" = "hadoop" -); +CANCEL EXPORT FROM dbName WHERE LABEL like "%export_%"; ``` -If HDFS is configured for high availability, HA information needs to be provided, as shown below: +## Export Instructions + +### Export Data Sources +`EXPORT` currently supports exporting the following types of tables or views: +- Internal tables in Doris +- Logical views in Doris +- Tables in Doris Catalog + +### Export Data Storage Locations +`Export` currently supports exporting to the following storage locations: +- Object storage: Amazon S3, COS, OSS, OBS, Google GCS +- HDFS +- Local file system + +### Export File Types +`EXPORT` currently supports exporting to the following file formats: +- Parquet +- ORC +- csv +- csv_with_names +- csv_with_names_and_types + +### Column Type Mapping for Exported Files +`Export` supports exporting to Parquet and ORC file formats. Parquet and ORC file formats have their own data types, and the export function of Doris can automatically convert the data types of Doris to the corresponding data types of Parquet and ORC file formats. + +The following is a mapping table of Doris data types to the data types of Parquet and ORC file formats: +| Doris Type | Arrow Type | Orc Type | +| ---------- | ---------- | -------- | +| boolean | boolean | boolean | +| tinyint | int8 | tinyint | +| smallint | int16 | smallint | +| int | int32 | int | +| bigint | int64 | bigint | +| largeInt | utf8 | string | +| date | utf8 | string | +| datev2 | Date32Type | string | +| datetime | utf8 | string | +| datetimev2 | TimestampType | timestamp | +| float | float32 | float | +| double | float64 | double | +| char / varchar / string| utf8 | string | +| decimal | decimal128 | decimal | +| struct | struct | struct | +| map | map | map | +| array | list | array | +| json | utf8 | string | +| variant | utf8 | string | +| bitmap | binary | binary | +| quantile_state| binary | binary | +| hll | binary | binary | + +> Note: When Doris exports data to the Parquet file format, it first converts the in-memory data of Doris into the Arrow in-memory data format, and then writes it out to the Parquet file format via Arrow. + +## Export Examples + +- [Export to an HDFS Cluster with High Availability Enabled](#high-availability-hdfs-export) +- [Export to an HDFS Cluster with High Availability and Kerberos Authentication Enabled](#high-availability-and-kerberos-cluster-export) +- [Specify Partition for Export](#specify-partition-for-export) +- [Filter Data During Export](#filter-data-during-export) +- [Export External Table Data](#export-external-table-data) +- [Adjust Export Data Consistency](#adjust-export-data-consistency) +- [Adjust Concurrency of Export Jobs](#adjust-concurrency-of-export-jobs) +- [Example of Clearing the Export Directory Before Exporting](#example-of-clearing-the-export-directory-before-exporting) +- [Example of Setting the Size of Exported Files](#example-of-setting-the-size-of-exported-files) + + + + +**Export to an HDFS Cluster with High Availability Enabled** + +If the HDFS has high availability enabled, HA information needs to be provided. 
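These `dfs.*` properties follow the client-side HDFS HA configuration (typically the values found in the cluster's `hdfs-site.xml`), so the nameservice name, NameNode IDs, and RPC addresses in the example below should be replaced with those of your own cluster.
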
For example: ```sql -EXPORT TABLE db1.tbl1 -PARTITION (p1,p2) +EXPORT TABLE tbl TO "hdfs://HDFS8000871/path/to/export/" PROPERTIES ( - "label" = "mylabel", - "column_separator"=",", - "max_file_size" = "512MB", - "columns" = "col1,col2" + "line_delimiter" = "," ) with HDFS ( "fs.defaultFS" = "hdfs://HDFS8000871", @@ -183,18 +258,17 @@ with HDFS ( ); ``` -If the Hadoop cluster is configured for high availability and Kerberos authentication is enabled, you can refer to the following SQL statement: + +**Export to an HDFS Cluster with High Availability and Kerberos Authentication Enabled** + +If the HDFS cluster has both high availability enabled and Kerberos authentication enabled, you can refer to the following SQL statements: ```sql -EXPORT TABLE db1.tbl1 -PARTITION (p1,p2) +EXPORT TABLE tbl TO "hdfs://HDFS8000871/path/to/export/" PROPERTIES ( - "label" = "mylabel", - "column_separator"=",", - "max_file_size" = "512MB", - "columns" = "col1,col2" + "line_delimiter" = "," ) with HDFS ( "fs.defaultFS"="hdfs://hacluster/", @@ -211,66 +285,10 @@ with HDFS ( ); ``` -### Export to S3 - -Export all data from the s3_test table to S3, with the export format as csv and using the invisible character `\x07` as the row delimiter. - -```sql -EXPORT TABLE s3_test TO "s3://bucket/a/b/c" -PROPERTIES ( - "line_delimiter" = "\\x07" -) WITH s3 ( - "s3.endpoint" = "xxxxx", - "s3.region" = "xxxxx", - "s3.secret_key"="xxxx", - "s3.access_key" = "xxxxx" -) -``` - -### Export to Local File System - -> -> To export data to the local file system, you need to add `enable_outfile_to_local=true` in fe.conf and restart FE. + +**Specify Partition for Export** -Export all data from the test table to local storage: - -```sql --- parquet format -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "columns" = "k1,k2", - "format" = "parquet" -); - --- orc format -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "columns" = "k1,k2", - "format" = "orc" -); - --- csv_with_names format, using 'AA' as the column separator and 'zz' as the line delimiter -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "format" = "csv_with_names", - "column_separator"="AA", - "line_delimiter" = "zz" -); - --- csv_with_names_and_types format -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "format" = "csv_with_names_and_types" -); -``` - -> Note: - The functionality of exporting to the local file system is not applicable to public cloud users, only for users of private deployments. Additionally, by default, users have full control permissions over the cluster nodes. Doris does not perform validity checks on the export path provided by the user. If the Doris process user does not have write permission to the path or the path does not exist, an error will occur. For security reasons, if a file with the same name already exists in the path, the export will also fail. - Doris does not manage the files exported to the local file system or check disk space, etc. Users need to manage these files themselves, including cleaning them up. - -### Export Specific Partitions - -Export jobs support exporting only specific partitions of Doris internal tables, such as exporting only the p1 and p2 partitions of the test table. +The export job supports exporting only some partitions of the internal tables in Doris. For example, only export partitions p1 and p2 of the `test` table. 
```sql EXPORT TABLE test @@ -281,9 +299,10 @@ PROPERTIES ( ); ``` -### Filtering Data during Export + +**Filter Data During Export** -Export jobs support filtering data based on predicate conditions during export, exporting only data that meets certain conditions, such as exporting data that satisfies the condition `k1 < 50`. +The export job supports filtering data according to predicate conditions during the export process, exporting only the data that meets the conditions. For example, only export the data that satisfies the condition `k1 < 50`. ```sql EXPORT TABLE test @@ -295,12 +314,13 @@ PROPERTIES ( ); ``` -### Export External Table Data + +**Export External Table Data** -Export jobs support Doris Catalog external table data: +The export job supports the data of external tables in Doris Catalog. ```sql --- Create a catalog +-- create a catalog CREATE CATALOG `tpch` PROPERTIES ( "type" = "trino-connector", "trino.connector.name" = "tpch", @@ -308,7 +328,7 @@ CREATE CATALOG `tpch` PROPERTIES ( "trino.tpch.splits-per-node" = "32" ); --- Export data from the Catalog external table +-- export Catalog data EXPORT TABLE tpch.sf1.lineitem TO "file:///path/to/exp_" PROPERTIES( "parallelism" = "5", @@ -318,14 +338,13 @@ PROPERTIES( ``` :::tip -Exporting Catalog external table data does not support concurrent exports. Even if a parallelism greater than 1 is specified, it will still be a single-threaded export. +Currently, when exporting data from external tables in the Catalog using Export, concurrent exports are not supported. Even if the parallelism is specified to be greater than 1, the export will still be performed in a single thread. ::: -## Best Practices - -### Export Consistency + +**Adjust Export Data Consistency** -`Export` supports two granularities for export: partition / tablets. The `data_consistency` parameter is used to specify the granularity at which the table to be exported is split. `none` represents Tablets level, and `partition` represents Partition level. +`Export` supports two granularities: partition and tablets. The `data_consistency` parameter is used to specify the granularity at which the table to be exported is sliced. `none` represents the Tablets level, and `partition` represents the Partition level. ```sql EXPORT TABLE test TO "file:///home/user/tmp" @@ -336,15 +355,16 @@ PROPERTIES ( ); ``` -If `"data_consistency" = "partition"` is set, the underlying Export task constructs multiple `SELECT INTO OUTFILE` statements to export different partitions. +If `"data_consistency" = "partition"` is set, multiple `SELECT INTO OUTFILE` statements constructed at the underlying layer of the Export task will export different partitions. -If `"data_consistency" = "none"` is set, the underlying Export task constructs multiple `SELECT INTO OUTFILE` statements to export different tablets. However, these different tablets may belong to the same partition. +If `"data_consistency" = "none"` is set, multiple `SELECT INTO OUTFILE` statements constructed at the underlying layer of the Export task will export different tablets. However, these different tablets may belong to the same partition. -For the logic behind Export's underlying construction of `SELECT INTO OUTFILE` statements, refer to the appendix. +For the logic of constructing `SELECT INTO OUTFILE` at the underlying layer of Export, please refer to the appendix section. -### Export Job Concurrency + +**Adjust Concurrency of Export Jobs** -Export allows setting different concurrency levels to export data concurrently. 
Specify a concurrency level of 5: +Export can set different degrees of concurrency to export data concurrently. Specify the concurrency degree as 5: ```sql EXPORT TABLE test TO "file:///home/user/tmp/" @@ -355,9 +375,10 @@ PROPERTIES ( ); ``` -For more information on the principles of concurrent export in Export, refer to the appendix section. +For the principle of concurrent export of Export, please refer to the appendix section. -### Clear Export Directory Before Exporting + +**Example of Clearing the Export Directory Before Exporting** ```sql EXPORT TABLE test TO "file:///home/user/tmp" @@ -368,14 +389,15 @@ PROPERTIES ( ); ``` -If `"delete_existing_files" = "true"` is set, the export job will first delete all files and directories under `/home/user/`, and then export data to that directory. +If `"delete_existing_files" = "true"` is set, the export job will first delete all files and directories under the `/home/user/` directory, and then export data to that directory. > Note: -To use the `delete_existing_files` parameter, you also need to add the configuration `enable_delete_existing_files = true` in fe.conf and restart the FE. Only then will `delete_existing_files` take effect. `delete_existing_files = true` is a risky operation and is recommended to be used only in a testing environment. +If you want to use the delete_existing_files parameter, you also need to add the configuration `enable_delete_existing_files = true` in fe.conf and restart fe, then delete_existing_files will take effect. delete_existing_files = true is a dangerous operation and it is recommended to use it only in the test environment. -### Set Export File Size + +**Example of Setting the Size of Exported Files** -Export jobs support setting the size of export files. If a single file exceeds the set value, it will be split into multiple files for export. +The export job supports setting the size of the export file. If the size of a single file exceeds the set value, it will be divided into multiple files for export according to the specified size. ```sql EXPORT TABLE test TO "file:///home/user/tmp/" @@ -385,91 +407,83 @@ PROPERTIES ( ); ``` -By setting `"max_file_size" = "512MB"`, the maximum size of a single export file will be 512MB. +By setting `"max_file_size" = "512MB"`, the maximum size of a single exported file is 512MB. -## Notes -* Memory Limit +## Precautions +* Memory Limitation - Typically, an Export job's query plan consists of only `scan-export` two parts, without involving complex calculation logic that requires a lot of memory. Therefore, the default memory limit of 2GB usually meets the requirements. + Usually, the query plan of an Export job only consists of two parts: scanning and exporting, and does not involve computational logic that requires too much memory. Therefore, the default memory limit of 2GB usually meets the requirements. - However, in some scenarios, such as when a query plan needs to scan too many tablets on the same BE, or when there are too many data versions of tablets, it may lead to insufficient memory. You can adjust the session variable `exec_mem_limit` to increase the memory usage limit. + However, in some scenarios, for example, when a query plan needs to scan too many Tablets on the same BE or there are too many data versions of Tablets, it may lead to insufficient memory. You can adjust the session variable `exec_mem_limit` to increase the memory usage limit. * Export Data Volume - It is not recommended to export a large amount of data at once. 
It is suggested that the maximum export data volume for an Export job should be within tens of gigabytes. Exporting excessively large data can result in more garbage files and higher retry costs. If the table data volume is too large, it is recommended to export by partition. + It is not recommended to export a large amount of data at one time. The recommended maximum export data volume for an Export job is several tens of gigabytes. Excessive exports will lead to more junk files and higher retry costs. If the table data volume is too large, it is recommended to export by partitions. - Additionally, Export jobs scan data, consuming IO resources, which may impact the system's query latency. + In addition, the Export job will scan data and occupy IO resources, which may affect the query latency of the system. * Export File Management - If an Export job fails during execution, the generated files will not be deleted automatically and will need to be manually deleted by the user. + If the Export job fails, the files that have already been generated will not be deleted, and users need to delete them manually. * Data Consistency - Currently, during export, only a simple check is performed on tablet versions for consistency. It is recommended not to import data into the table during the export process. + Currently, only a simple check is performed on whether the tablet versions are consistent during export. It is recommended not to perform data import operations on the table during the export process. * Export Timeout - If the exported data volume is very large and exceeds the export timeout, the Export task will fail. In such cases, you can specify the `timeout` parameter in the Export command to increase the timeout and retry the Export command. + If the amount of exported data is very large and exceeds the export timeout period, the Export task will fail. At this time, you can specify the `timeout` parameter in the Export command to increase the timeout period and retry the Export command. * Export Failure - If the FE restarts or switches masters during the execution of an Export job, the Export job will fail, and the user will need to resubmit it. You can check the status of Export tasks using the `show export` command. + During the operation of the Export job, if the FE restarts or switches the master, the Export job will fail, and the user needs to resubmit it. You can check the status of the Export task through the `show export` command. -* Number of Export Partitions +* Number of Exported Partitions - An Export Job allows a maximum of 2000 partitions to be exported. You can modify this setting by adding the parameter `maximum_number_of_export_partitions` in fe.conf and restarting the FE. + The maximum number of partitions allowed to be exported by an Export Job is 2000. You can add the parameter `maximum_number_of_export_partitions` in fe.conf and restart the FE to modify this setting. * Concurrent Export - When exporting concurrently, it is important to configure the thread count and parallelism properly to fully utilize system resources and avoid performance bottlenecks. During the export process, monitor progress and performance metrics in real-time to promptly identify issues and optimize adjustments. + During concurrent export, please configure the number of threads and parallelism reasonably to make full use of system resources and avoid performance bottlenecks. 
During the export process, you can monitor the progress and performance indicators in real time to discover problems in time and make optimization adjustments. * Data Integrity - After the export operation is completed, it is recommended to verify the exported data for completeness and correctness to ensure data quality and integrity. - -## Appendix - -### Principles of Concurrent Export - -The underlying operation of an Export task is to execute the `SELECT INTO OUTFILE` SQL statement. When a user initiates an Export task, Doris constructs one or more `SELECT INTO OUTFILE` execution plans based on the table to be exported, and then submits these `SELECT INTO OUTFILE` execution plans to Doris's Job Schedule task scheduler, which automatically schedules and executes these tasks. - -By default, Export tasks are executed single-threaded. To improve export efficiency, the Export command can set a `parallelism` parameter to export data concurrently. When `parallelism` is set to a value greater than 1, the Export task will use multiple threads to execute the `SELECT INTO OUTFILE` query plans concurrently. The `parallelism` parameter essentially specifies the number of threads to execute the EXPORT job. + After the export operation is completed, it is recommended to verify whether the exported data is complete and correct to ensure the quality and integrity of the data. -The specific logic of constructing one or more `SELECT INTO OUTFILE` execution plans for an Export task is as follows: +* The specific logic for an Export task to construct one or more `SELECT INTO OUTFILE` execution plans is as follows: -1. Select the consistency model for exporting data + 1. Select the Consistency Model of Exported Data - Based on the `data_consistency` parameter to determine the consistency of the export, which is only related to semantics and not concurrency. Users should first choose a consistency model based on their own requirements. + The consistency of export is determined according to the `data_consistency` parameter. This is only related to semantics and has nothing to do with the degree of concurrency. Users should first select a consistency model according to their own requirements. -2. Determine the Degree of Parallelism + 2. Determine the Degree of Concurrency - Determine how many threads will run the `SELECT INTO OUTFILE` execution plan based on the `parallelism` parameter. The `parallelism` parameter determines the maximum number of threads possible. + Determine the number of threads to run these `SELECT INTO OUTFILE` execution plans according to the `parallelism` parameter. The `parallelism` parameter determines the maximum possible number of threads. - > Note: Even if the Export command sets the `parallelism` parameter, the actual number of concurrent threads for the Export task depends on the Job Schedule. When an Export task sets a higher concurrency, each concurrent thread is provided by the Job Schedule. Therefore, if the Doris system tasks are busy and the Job Schedule's thread resources are tight, the actual number of threads assigned to the Export task may not reach the specified `parallelism` number, affecting the concurrent export of the Export task. To mitigate this issue, you can reduce system load or adjust the FE configuration `async_task_consumer_thread_num` to increase the total thread count of the Job Schedule. + > Note: Even if the Export command sets the `parallelism` parameter, the actual number of concurrent threads of the Export task is also related to Job Schedule. 
After an Export task is set to run with multiple concurrency, each concurrent thread is provided by the Job Schedule. Therefore, if the Doris system is busy and the thread resources of the Job Schedule are tight, the number of threads actually allocated to the Export task may not reach the specified `parallelism`, which will affect the concurrency of the export. In that case, you can alleviate the problem by reducing the system load or by adjusting the FE configuration `async_task_consumer_thread_num` to increase the total number of Job Schedule threads.
 
-3. Determine the Workload of Each `outfile` Statement
+    3. Determine the Workload of Each Outfile Statement
 
-    Each thread will determine how many `outfile` statements to split based on `maximum_tablets_of_outfile_in_export` and the actual number of partitions / buckets in the data.
+        Each thread decides how many outfile statements to split its share into, based on `maximum_tablets_of_outfile_in_export` and the actual number of partitions/buckets of the data.
 
-    > `maximum_tablets_of_outfile_in_export` is a configuration in the FE with a default value of 10. This parameter specifies the maximum number of partitions / buckets allowed in a single OutFile statement generated by an Export task. Modifying this configuration requires restarting the FE.
+        > `maximum_tablets_of_outfile_in_export` is an FE configuration with a default value of 10. It specifies the maximum number of partitions/buckets allowed in a single OutFile statement split out by the Export task. Modifying this configuration requires restarting the FE.
 
-    Example: Suppose a table has a total of 20 partitions, each partition has 5 buckets, resulting in a total of 100 buckets. Set `data_consistency = none` and `maximum_tablets_of_outfile_in_export = 10`.
+        Example: Suppose a table has 20 partitions and each partition has 5 buckets, so the table has 100 buckets in total. Set `data_consistency = none` and `maximum_tablets_of_outfile_in_export = 10`.
 
-    1. Scenario with `parallelism = 5`
+        1. In the case of `parallelism = 5`
 
-    The Export task will divide the 100 buckets of the table into 5 parts, with each thread responsible for 20 buckets. Each thread's 20 buckets will be further divided into 2 groups of 10 buckets each, with each group handled by an outfile query plan. Therefore, the Export task will have 5 threads executing concurrently, with each thread handling 2 outfile statements that are executed serially.
+            The Export task divides the 100 buckets of the table into 5 parts, and each thread is responsible for 20 buckets. The 20 buckets handled by each thread are further divided into 2 groups of 10 buckets, and each group is handled by one outfile query plan. So the Export task ends up with 5 threads executing concurrently, each thread is responsible for 2 outfile statements, and the outfile statements within one thread are executed serially.
 
-    2. Scenario with `parallelism = 3`
+        2. In the case of `parallelism = 3`
 
-    The Export task will divide the 100 buckets of the table into 3 parts, with 3 threads responsible for 34, 33, and 33 buckets respectively. Each thread's buckets will be further divided into 4 groups of 10 buckets each (the last group may have fewer than 10 buckets), with each group handled by an outfile query plan. Therefore, the Export task will have 3 threads executing concurrently, with each thread handling 4 outfile statements that are executed serially. 
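+
+To make the splitting rule above concrete, the following is a rough, illustrative sketch (not Doris code) of the arithmetic it describes: buckets are spread as evenly as possible across `parallelism` threads, and each thread's share is chopped into groups of at most `maximum_tablets_of_outfile_in_export` buckets, with one group per `SELECT INTO OUTFILE` statement. The helper name and logic below are assumptions made for illustration only.
+
+```python
+import math
+
+def plan_outfile_split(num_buckets, parallelism, max_tablets_per_outfile=10):
+    # Illustrative only: mirrors the description above, not Doris's actual implementation.
+    parallelism = min(parallelism, num_buckets)        # parallelism is capped at the bucket count
+    base, extra = divmod(num_buckets, parallelism)     # spread buckets as evenly as possible
+    per_thread = [base + (1 if i < extra else 0) for i in range(parallelism)]
+    outfiles_per_thread = [math.ceil(n / max_tablets_per_outfile) for n in per_thread]
+    return per_thread, outfiles_per_thread
+
+print(plan_outfile_split(100, 5))    # ([20, 20, 20, 20, 20], [2, 2, 2, 2, 2])
+print(plan_outfile_split(100, 3))    # ([34, 33, 33], [4, 4, 4])
+print(plan_outfile_split(100, 120))  # parallelism capped at 100: one bucket and one outfile per thread
+```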
+ The Export task will divide the 100 buckets of the table into 3 parts, and the 3 threads are responsible for 34, 33, and 33 buckets respectively. The buckets responsible by each thread will be divided into 4 groups of 10 buckets each (the last group has less than 10 buckets), and each group of buckets is responsible by one outfile query plan. So, finally, the Export task has 3 threads executing concurrently, each thread is responsible for 4 outfile statements, and the outfile statements responsible by each thread are executed serially. - 3. Scenario with `parallelism = 120` + 3. In the case of `parallelism = 120` - Since the table has only 100 buckets, the system will force `parallelism` to be set to 100 and execute with `parallelism = 100`. The Export task will divide the 100 buckets of the table into 100 parts, with each thread responsible for 1 bucket. Each thread's 1 bucket will be further divided into 1 group of 1 bucket, with each group handled by an outfile query plan. Therefore, the Export task will have 100 threads executing concurrently, with each thread handling 1 outfile statement, where each outfile statement actually exports only 1 bucket. + Since there are only 100 buckets in the table, the system will force `parallelism` to be set to 100 and execute with `parallelism = 100`. The Export task will divide the 100 buckets of the table into 100 parts, and each thread is responsible for 1 bucket. The 1 bucket responsible by each thread will be divided into 1 group of 10 buckets (this group actually has only 1 bucket), and each group of buckets is responsible by one outfile query plan. So, finally, the Export task has 100 threads executing concurrently, each thread is responsible for 1 outfile statement, and each outfile statement actually exports only 1 bucket. -For optimal performance in the current version of Export, it is recommended to set the following parameters: +* For a better performance of Export in the current version, it is recommended to set the following parameters: -1. Enable the session variable `enable_parallel_outfile`. -2. Set the `parallelism` parameter of Export to a large value so that each thread is responsible for only one `SELECT INTO OUTFILE` query plan. -3. Set the FE configuration `maximum_tablets_of_outfile_in_export` to a small value to export a smaller amount of data for each `SELECT INTO OUTFILE` query plan. + 1. Open the session variable `enable_parallel_outfile`. + 2. Set the `parallelism` parameter of Export to a larger value so that each thread is only responsible for one `SELECT INTO OUTFILE` query plan. + 3. Set the FE configuration `maximum_tablets_of_outfile_in_export` to a smaller value so that the amount of data exported by each `SELECT INTO OUTFILE` query plan is smaller. diff --git a/docs/data-operate/export/outfile.md b/docs/data-operate/export/outfile.md index 1ec2ce0946bec..1e5f32c5d8882 100644 --- a/docs/data-operate/export/outfile.md +++ b/docs/data-operate/export/outfile.md @@ -24,57 +24,59 @@ specific language governing permissions and limitations under the License. --> -This document introduces how to use the `SELECT INTO OUTFILE` command to export query results. +This document will introduce how to use the `SELECT INTO OUTFILE` command to export query results. -For a detailed introduction to the `SELECT INTO OUTFILE` command, refer to: [SELECT INTO OUTFILE](../../sql-manual/sql-statements/Data-Manipulation-Statements/OUTFILE.md). 
+The `SELECT INTO OUTFILE` command exports the result data of the `SELECT` part to the target storage system in the specified file format, including object storage, HDFS, or the local file system. -## Overview +The `SELECT INTO OUTFILE` is a synchronous command. When the command returns, it means that the export is completed. If the export is successful, information such as the number, size, and path of the exported files will be returned. If the export fails, an error message will be returned. -The `SELECT INTO OUTFILE` command exports the result data of the `SELECT` statement to a target storage system, such as object storage, HDFS, or the local file system, in a specified file format. +For information on how to choose between `SELECT INTO OUTFILE` and `EXPORT`, please refer to [Export Overview](./export-overview.md). -`SELECT INTO OUTFILE` is a synchronous command, meaning it completes when the command returns. If successful, it returns information about the number, size, and paths of the exported files. If it fails, it returns error information. +For a detailed introduction to the `SELECT INTO OUTFILE` command, please refer to: [SELECT INTO OUTFILE](../../sql-manual/sql-statements/Data-Manipulation-Statements/OUTFILE.md) -For guidance on choosing between `SELECT INTO OUTFILE` and `EXPORT`, see the [Export Overview](./export-overview.md). +-------------- -### Supported Export Formats +## Usage Scenarios -`SELECT INTO OUTFILE` currently supports the following export formats: +The `SELECT INTO OUTFILE` is applicable to the following scenarios: +- When the exported data needs to go through complex calculation logics, such as filtering, aggregation, and joining. +- For scenarios suitable for executing synchronous tasks. -- Parquet -- ORC -- CSV -- CSV with column names (`csv_with_names`) -- CSV with column names and types (`csv_with_names_and_types`) - -Compressed formats are not supported. +When using the `SELECT INTO OUTFILE`, the following limitations should be noted: +- It does not support exporting data in compressed formats. +- The pipeline engine in version 2.1 does not support concurrent exports. +- If you want to export data to the local file system, you need to add the configuration `enable_outfile_to_local = true` in the `fe.conf` file and then restart the FE. -### Example -```sql -mysql> SELECT * FROM tbl1 LIMIT 10 INTO OUTFILE "file:///home/work/path/result_"; -+------------+-----------+----------+--------------------------------------------------------------------+ -| FileNumber | TotalRows | FileSize | URL | -+------------+-----------+----------+--------------------------------------------------------------------+ -| 1 | 2 | 8 | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ | -+------------+-----------+----------+--------------------------------------------------------------------+ -``` +## Basic Principles +The `SELECT INTO OUTFILE` function essentially executes an SQL query command, and its principle is basically the same as that of an ordinary query. The only difference is that an ordinary query outputs the final query result set to the MySQL client, while the `SELECT INTO OUTFILE` outputs the final query result set to an external storage medium. -Explanation of the returned results: +The principle of concurrent export for `SELECT INTO OUTFILE` is to divide large-scale data sets into small pieces and process them in parallel on multiple nodes. 
In scenarios where concurrent export is possible, exports are carried out in parallel on multiple BE nodes, with each BE handling a part of the result set.
+
+## Quick Start
+### Create Tables and Import Data
+
+```sql
+CREATE TABLE IF NOT EXISTS tbl (
+    `c1` int(11) NULL,
+    `c2` string NULL,
+    `c3` bigint NULL
+)
+DISTRIBUTED BY HASH(c1) BUCKETS 20
+PROPERTIES("replication_num" = "1");
+
+insert into tbl values
+    (1, 'doris', 18),
+    (2, 'nereids', 20),
+    (3, 'pipeline', 99999),
+    (4, 'Apache', 122123455),
+    (5, null, null);
+```
 
 ### Export to HDFS
 
-Export query results to the `hdfs://path/to/` directory, specifying the export format as PARQUET:
+Export the query results to the directory `hdfs://path/to/` and specify the export format as Parquet:
 
 ```sql
 SELECT c1, c2, c3 FROM tbl
@@ -87,7 +89,106 @@ PROPERTIES
 );
 ```
 
-If HDFS is configured for high availability, provide HA information, such as:
+If the HDFS cluster has high availability enabled, HA information needs to be provided. Refer to the example: [Export to an HDFS Cluster with High Availability Enabled](#high-availability-hdfs-export).
+
+If the HDFS cluster has both high availability and Kerberos authentication enabled, Kerberos authentication information needs to be provided. Refer to the example: [Export to an HDFS Cluster with High Availability and Kerberos Authentication Enabled](#high-availability-and-kerberos-cluster-export).
+
+### Export to Object Storage
+
+Export the query results to the directory `s3://path/to/` in S3 storage, specifying the export format as ORC. Information such as the `sk` (secret key) and `ak` (access key) needs to be provided.
+
+```sql
+SELECT * FROM tbl
+INTO OUTFILE "s3://path/to/result_"
+FORMAT AS ORC
+PROPERTIES(
+    "s3.endpoint" = "https://xxx",
+    "s3.region" = "ap-beijing",
+    "s3.access_key"= "your-ak",
+    "s3.secret_key" = "your-sk"
+);
+```
+
+### Export to the Local File System
+> If you need to export to a local file, you must add `enable_outfile_to_local = true` to `fe.conf` and restart the FE.
+
+Export the query results to the directory `file:///path/to/` on the BE node, specifying the export format as CSV and the column separator as `,`.
+
+```sql
+SELECT c1, c2 FROM tbl
+INTO OUTFILE "file:///path/to/result_"
+FORMAT AS CSV
+PROPERTIES(
+    "column_separator" = ","
+);
+```
+
+> Note:
+The function of exporting to local files is not applicable to public cloud users; it is intended only for privately deployed clusters, and it is assumed by default that the user has full control over the cluster nodes. Doris does not check the validity of the export path filled in by the user. If the Doris process user does not have write permission for the path, or the path does not exist, an error will be reported. 
Also, for security reasons, if there is a file with the same name already existing in the path, the export will fail. +Doris does not manage the files exported to the local system, nor does it check the disk space, etc. These files need to be managed by the user, such as cleaning them up. + +### More Usage +For a detailed introduction to the `SELECT INTO OUTFILE` command, please refer to: [SELECT INTO OUTFILE](../../sql-manual/sql-statements/Data-Manipulation-Statements/OUTFILE.md) + +## Export Instructions +### Storage Locations for Exported Data +The `SELECT INTO OUTFILE` currently supports exporting data to the following storage locations: +- Object storage: Amazon S3, COS, OSS, OBS, Google GCS +- HDFS +- Local file system + +### Export File Types +The `SELECT INTO OUTFILE` currently supports exporting the following file formats: +- Parquet +- ORC +- csv +- csv_with_names +- csv_with_names_and_types + +### Column Type Mapping for Exported Files +The `SELECT INTO OUTFILE` supports exporting data in Parquet and ORC file formats. Parquet and ORC file formats have their own data types. The export function of Doris can automatically convert the data types in Doris to the corresponding data types in Parquet and ORC file formats. + +The following is a mapping table of Doris data types and data types in Parquet and ORC file formats: +| Doris Type | Arrow Type | Orc Type | +| ---------- | ---------- | -------- | +| boolean | boolean | boolean | +| tinyint | int8 | tinyint | +| smallint | int16 | smallint | +| int | int32 | int | +| bigint | int64 | bigint | +| largeInt | utf8 | string | +| date | utf8 | string | +| datev2 | Date32Type | string | +| datetime | utf8 | string | +| datetimev2 | TimestampType | timestamp | +| float | float32 | float | +| double | float64 | double | +| char / varchar / string| utf8 | string | +| decimal | decimal128 | decimal | +| struct | struct | struct | +| map | map | map | +| array | list | array | +| json | utf8 | string | +| variant | utf8 | string | +| bitmap | binary | binary | +| quantile_state| binary | binary | +| hll | binary | binary | + +> Note: When Doris exports data to the Parquet file format, it first converts the in-memory data of Doris into the Arrow in-memory data format, and then writes it out to the Parquet file format via Arrow. + +## Export Examples +- [Export to an HDFS Cluster with High Availability Enabled](#high-availability-hdfs-export) +- [Export to an HDFS Cluster with High Availability and Kerberos Authentication Enabled](#high-availability-and-kerberos-cluster-export) +- [Example of Generating a File to Mark a Successful Export](#example-of-generating-a-file-to-mark-a-successful-export) +- [Example of Concurrent Export](#example-of-concurrent-export) +- [Example of Clearing the Export Directory Before Exporting](#example-of-clearing-the-export-directory-before-exporting) +- [Example of Setting the Size of Exported Files](#example-of-setting-the-size-of-exported-files) + + + +**Export to an HDFS Cluster with High Availability Enabled** + +If the HDFS has high availability enabled, HA information needs to be provided. 
For example: ```sql SELECT c1, c2, c3 FROM tbl @@ -105,7 +206,10 @@ PROPERTIES ); ``` -If the Hadoop cluster is configured for high availability and Kerberos authentication is enabled, you can refer to the following SQL statement: + +**Export to an HDFS Cluster with High Availability and Kerberos Authentication Enabled** + +If the HDFS cluster has both high availability enabled and Kerberos authentication enabled, you can refer to the following SQL statements: ```sql SELECT * FROM tbl @@ -120,59 +224,24 @@ PROPERTIES "dfs.namenode.rpc-address.hacluster.n1"="192.168.0.1:8020", "dfs.namenode.rpc-address.hacluster.n2"="192.168.0.2:8020", "dfs.client.failover.proxy.provider.hacluster"="org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider", - "dfs.namenode.kerberos.principal"="hadoop/_HOST@REALM.COM", + "dfs.namenode.kerberos.principal"="hadoop/_HOST@REALM.COM" "hadoop.security.authentication"="kerberos", "hadoop.kerberos.principal"="doris_test@REALM.COM", "hadoop.kerberos.keytab"="/path/to/doris_test.keytab" ); ``` -### Export to S3 + +**Example of Generating a File to Mark a Successful Export** -Export query results to the S3 storage at `s3://path/to/` directory, specifying the export format as ORC. Provide `sk`, `ak`, and other necessary information: +The `SELECT INTO OUTFILE` command is a synchronous command. Therefore, it is possible that the task connection is disconnected during the execution of the SQL, making it impossible to know whether the exported data has ended normally or is complete. At this time, you can use the `success_file_name` parameter to require that a file marker be generated in the directory after a successful export. -```sql -SELECT * FROM tbl -INTO OUTFILE "s3://path/to/result_" -FORMAT AS ORC -PROPERTIES( - "s3.endpoint" = "https://xxx", - "s3.region" = "ap-beijing", - "s3.access_key"= "your-ak", - "s3.secret_key" = "your-sk" -); -``` - -### Export to Local File System -> -> To export to the local file system, add `enable_outfile_to_local=true` in `fe.conf` and restart FE. - -Export query results to the BE's `file:///path/to/` directory, specifying the export format as CSV, with a comma as the column separator: - -```sql -SELECT k1 FROM tbl1 UNION SELECT k2 FROM tbl1 -INTO OUTFILE "file:///path/to/result_" -FORMAT AS CSV -PROPERTIES( - "column_separator" = "," -); -``` - -> Note: -Exporting to local files is not suitable for public cloud users and is intended for private deployment users only. By default, users have full control over cluster nodes. Doris does not check the validity of the export path provided by the user. If the Doris process user does not have write permissions for the path, or the path does not exist, an error will be reported. Additionally, for security reasons, if a file with the same name already exists at the path, the export will fail. Doris does not manage exported local files or check disk space. Users need to manage these files themselves, including cleanup and other tasks. - -## Best Practices - -### Generate Export Success Indicator File +Similar to Hive, users can determine whether the export has ended normally and whether the files in the export directory are complete by checking whether there is a file specified by the `success_file_name` parameter in the export directory. -The `SELECT INTO OUTFILE` command is synchronous, meaning that the task connection could be interrupted during SQL execution, leaving uncertainty about whether the export completed successfully or whether the data is complete. 
You can use the `success_file_name` parameter to generate an indicator file upon successful export. - -Similar to Hive, users can determine whether the export completed successfully and whether the files in the export directory are complete by checking for the presence of the file specified by the `success_file_name` parameter. - -For example, exporting the results of a `SELECT` statement to Tencent Cloud COS `s3://${bucket_name}/path/my_file_`, specifying the export format as CSV, and setting the success indicator file name to `SUCCESS`: +For example: Export the query results of the `select` statement to Tencent Cloud COS: `s3://${bucket_name}/path/my_file_`. Specify the export format as `csv`. Specify the name of the file marking a successful export as `SUCCESS`. After the export is completed, a marker file will be generated. ```sql -SELECT k1, k2, v1 FROM tbl1 LIMIT 100000 +SELECT k1,k2,v1 FROM tbl1 LIMIT 100000 INTO OUTFILE "s3://my_bucket/path/my_file_" FORMAT AS CSV PROPERTIES @@ -187,21 +256,55 @@ PROPERTIES ) ``` -Upon completion, an additional file named `SUCCESS` will be generated. +After the export is completed, an additional file named `SUCCESS` will be written. + + +**Example of Concurrent Export** + +By default, the query results of the `SELECT` part will first be aggregated to a certain BE node, and this node will export the data in a single thread. However, in some cases, such as for query statements without an `ORDER BY` clause, concurrent exports can be enabled, allowing multiple BE nodes to export data simultaneously to improve export performance. + +However, not all SQL query statements can be exported concurrently. Whether a query statement can be exported concurrently can be determined through the following steps: -### Concurrent Export +* Make sure that the session variable is enabled: `set enable_parallel_outfile = true;` +* Check the execution plan via `EXPLAIN` -By default, the query results in the `SELECT` section are aggregated to a single BE node, which exports data single-threadedly. However, in some cases (e.g., queries without an `ORDER BY` clause), concurrent export can be enabled to have multiple BE nodes export data simultaneously, improving export performance. +```sql +mysql> EXPLAIN SELECT ... INTO OUTFILE "s3://xxx" ...; ++-----------------------------------------------------------------------------+ +| Explain String | ++-----------------------------------------------------------------------------+ +| PLAN FRAGMENT 0 | +| OUTPUT EXPRS: | | | | +| PARTITION: UNPARTITIONED | +| | +| RESULT SINK | +| | +| 1:EXCHANGE | +| | +| PLAN FRAGMENT 1 | +| OUTPUT EXPRS:`k1` + `k2` | +| PARTITION: HASH_PARTITIONED: `default_cluster:test`.`multi_tablet`.`k1` | +| | +| RESULT FILE SINK | +| FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951_ | +| STORAGE TYPE: S3 | +| | +| 0:OlapScanNode | +| TABLE: multi_tablet | ++-----------------------------------------------------------------------------+ +``` + +The `EXPLAIN` command will return the query plan of the statement. By observing the query plan, if `RESULT FILE SINK` appears in `PLAN FRAGMENT 1`, it indicates that the query statement can be exported concurrently. If `RESULT FILE SINK` appears in `PLAN FRAGMENT 0`, it means that the current query cannot be exported concurrently. -Here’s an example demonstrating how to enable concurrent export: +Next, we will demonstrate how to correctly enable the concurrent export function through an example: -1. Enable the concurrent export session variable: +1. 
Open the concurrent export session variable ```sql mysql> SET enable_parallel_outfile = true; ``` -2. Execute the export command: +2. Execute the export command ```sql mysql> SELECT * FROM demo.tbl @@ -221,9 +324,9 @@ mysql> SELECT * FROM demo.tbl +------------+-----------+----------+-------------------------------------------------------------------------------+ ``` -With concurrent export successfully enabled, the result may consist of multiple rows, indicating that multiple threads exported data concurrently. +It can be seen that after enabling and successfully triggering the concurrent export function, the returned result may consist of multiple lines, indicating that there are multiple threads exporting concurrently. -Adding an `ORDER BY` clause to the query prevents concurrent export, as the top-level sorting node necessitates single-threaded export: +If we modify the above statement, that is, add an `ORDER BY` clause to the query statement. Since the query statement has a top-level sorting node, even if the concurrent export function is enabled, this query cannot be exported concurrently: ```sql mysql> SELECT * FROM demo.tbl ORDER BY id @@ -236,11 +339,10 @@ mysql> SELECT * FROM demo.tbl ORDER BY id +------------+-----------+----------+-------------------------------------------------------------------------------+ ``` -Here, the result is a single row, indicating no concurrent export was triggered. +It can be seen that there is only one final result line, and concurrent export has not been triggered. -Refer to the appendix for more details on concurrent export principles. - -### Clear Export Directory Before Exporting + +**Example of Clearing the Export Directory Before Exporting** ```sql SELECT * FROM tbl1 @@ -258,12 +360,12 @@ PROPERTIES ) ``` -If `"delete_existing_files" = "true"` is set, the export job will first delete all files and directories under `s3://my_bucket/export/`, then export data to that directory. +If `"delete_existing_files" = "true"` is set, the export job will first delete all files and directories under the `s3://my_bucket/export/` directory, and then export data to this directory. -> Note: -To use the `delete_existing_files` parameter, add `enable_delete_existing_files = true` to `fe.conf` and restart FE. This parameter is potentially dangerous and should only be used in a testing environment. +> Note: To use the `delete_existing_files` parameter, you also need to add the configuration `enable_delete_existing_files = true` in `fe.conf` and restart the `fe`, then `delete_existing_files` will take effect. `delete_existing_files = true` is a dangerous operation and it is recommended to use it only in a test environment. -### Set Export File Size + +**Example of Setting the Size of Exported Files** ```sql SELECT * FROM tbl @@ -278,69 +380,25 @@ PROPERTIES( ); ``` -Specifying `"max_file_size" = "2048MB"` ensures that the final file size does not exceed 2GB. If the total size exceeds 2GB, multiple files will be generated. +Since `"max_file_size" = "2048MB"` is specified, if the final generated file is not larger than 2GB, there will be only one file. If it is larger than 2GB, there will be multiple files. -## Considerations +## Precautions + +- Export Data Volume and Export Efficiency +The `SELECT INTO OUTFILE` function essentially executes an SQL query command. If concurrent export is not enabled, the query result is exported by a single BE node in a single thread. 
Therefore, the total export time includes the time consumed by the query itself and the time required to write out the final result set. Enabling concurrent export can reduce the export time. -- Export Data Volume and Efficiency - The `SELECT INTO OUTFILE` function executes a SQL query. Without concurrent export, a single BE node and thread export the query results. The total export time includes both the query execution time and the result set write-out time. Enabling concurrent export can reduce the export time. - Export Timeout - The export command shares the same timeout as the query. If the data volume is large and causes the export to timeout, you can extend the query timeout by setting the session variable `query_timeout`. -- Export File Management - Doris does not manage exported files, whether successfully exported or remaining from failed exports. Users must handle these files themselves. - Additionally, `SELECT INTO OUTFILE` does not check for the existence of files or file paths. Whether `SELECT INTO OUTFILE` automatically creates paths or overwrites existing files depends entirely on the semantics of the remote storage system. -- Empty Result Sets - Exporting an empty result set still generates an empty file. +The timeout time of the export command is the same as that of the query. If the data volume is large and causes the export data to time out, you can set the session variable `query_timeout` to appropriately extend the query timeout. + +- Management of Exported Files +Doris does not manage the exported files. Whether they are successfully exported or residual files after a failed export, users need to handle them by themselves. +In addition, the `SELECT INTO OUTFILE` command does not check whether the file and file path exist. Whether `SELECT INTO OUTFILE` will automatically create a path or overwrite an existing file is completely determined by the semantics of the remote storage system. + +- If the Query Result Set Is Empty +For an export with an empty result set, an empty file will still be generated. + - File Splitting - File splitting ensures that a single row of data is stored completely in one file. Thus, the file size may not exactly equal `max_file_size`. -- Non-visible Character Functions - For functions outputting non-visible characters (e.g., BITMAP, HLL types), CSV output is `\N`, and Parquet/ORC output is NULL. - Currently, some geographic functions like `ST_Point` output VARCHAR but with encoded binary characters, causing garbled output. Use `ST_AsText` for geographic functions. - -## Appendix - -### Concurrent Export Principles - -- Principle Overview - - Doris is a high-performance, real-time analytical database based on the MPP (Massively Parallel Processing) architecture. MPP divides large datasets into small chunks and processes them in parallel across multiple nodes. - Concurrent export in `SELECT INTO OUTFILE` leverages this parallel processing capability, allowing multiple BE nodes to export parts of the result set simultaneously. - -- How to Determine Concurrent Export Eligibility - - - Ensure Session Variable is Enabled: `set enable_parallel_outfile = true;` - - Check Execution Plan with `EXPLAIN`: - - ```sql - mysql> EXPLAIN SELECT ... 
INTO OUTFILE "s3://xxx" ...; - +-----------------------------------------------------------------------------+ - | Explain String | - +-----------------------------------------------------------------------------+ - | PLAN FRAGMENT 0 | - | OUTPUT EXPRS: | | | | - | PARTITION: UNPARTITIONED | - | | - | RESULT SINK | - | | - | 1:EXCHANGE | - | | - | PLAN FRAGMENT 1 | - | OUTPUT EXPRS:`k1` - - + `k2` | - | PARTITION: HASH_PARTITIONED: `default_cluster:test`.`multi_tablet`.`k1` | - | | - | RESULT FILE SINK | - | FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951_ | - | STORAGE TYPE: S3 | - | | - | 0:OlapScanNode | - | TABLE: multi_tablet | - +-----------------------------------------------------------------------------+ - ``` - - The `EXPLAIN` command returns the query plan. If `RESULT FILE SINK` appears in `PLAN FRAGMENT 1`, the query can be exported concurrently. If it appears in `PLAN FRAGMENT 0`, concurrent export is not possible. - -- Export Concurrency - - When concurrent export conditions are met, the export task's concurrency is determined by: `BE nodes * parallel_fragment_exec_instance_num`. +File splitting ensures that a row of data is stored completely in a single file. Therefore, the file size is not strictly equal to `max_file_size`. + +- Functions for Non-Visible Characters +For some functions that output non-visible characters, such as BITMAP and HLL types, when exporting to the CSV file format, the output is `\N`. diff --git a/docs/sql-manual/sql-statements/Data-Manipulation-Statements/OUTFILE.md b/docs/sql-manual/sql-statements/Data-Manipulation-Statements/OUTFILE.md index cdee96f15cb6b..9d3ba368db606 100644 --- a/docs/sql-manual/sql-statements/Data-Manipulation-Statements/OUTFILE.md +++ b/docs/sql-manual/sql-statements/Data-Manipulation-Statements/OUTFILE.md @@ -130,6 +130,13 @@ INTO OUTFILE "file_path" INTO OUTFILE "file:///home/work/path/result_"; ``` +#### Explanation of the returned results: + +- **FileNumber**: The number of generated files. +- **TotalRows**: The number of rows in the result set. +- **FileSize**: The total size of the exported files in bytes. +- **URL**: The prefix of the exported file paths. Multiple files will be numbered sequentially with suffixes `_0`, `_1`, etc. + #### DataType Mapping Parquet and ORC file formats have their own data types. The export function of Doris can automatically export the Doris data types to the corresponding data types of the Parquet/ORC file format. The following are the data type mapping relationship of the Doris data types and the Parquet/ORC file format data types: diff --git a/docs/sql-manual/sql-statements/Show-Statements/SHOW-EXPORT.md b/docs/sql-manual/sql-statements/Show-Statements/SHOW-EXPORT.md index 3c35cecda2e65..ce7c917ca8d11 100644 --- a/docs/sql-manual/sql-statements/Show-Statements/SHOW-EXPORT.md +++ b/docs/sql-manual/sql-statements/Show-Statements/SHOW-EXPORT.md @@ -55,6 +55,35 @@ illustrate: 3. You can use ORDER BY to sort any combination of columns 4. If LIMIT is specified, limit matching records are displayed. Otherwise show all +The meaning of each column in the result returned by the `show export` command is as follows: + +- JobId: The unique ID of the job +- Label: The label of the export job. If not specified in the export, the system will generate one by default. +- State: Job status: + - PENDING: Job pending scheduling + - EXPORTING: Data export in progress + - FINISHED: Job successful + - CANCELLED: Job failed +- Progress: Job progress. This progress is based on query plans. 
For example, if there are a total of 10 threads and 3 have been completed, the progress is 30%. +- TaskInfo: Job information displayed in JSON format: + - db: Database name + - tbl: Table name + - partitions: Specified partitions for export. An empty list indicates all partitions. + - column\_separator: Column separator for the export file. + - line\_delimiter: Line delimiter for the export file. + - tablet num: Total number of tablets involved. + - broker: Name of the broker used. + - coord num: Number of query plans. + - max\_file\_size: Maximum size of an export file. + - delete\_existing\_files: Whether to delete existing files and directories in the export directory. + - columns: Specified column names to export, empty value represents exporting all columns. + - format: File format for export +- Path: Export path on the remote storage. +- `CreateTime/StartTime/FinishTime`: Job creation time, scheduling start time, and end time. +- Timeout: Job timeout time in seconds. This time is calculated from CreateTime. +- ErrorMsg: If there is an error in the job, the error reason will be displayed here. +- OutfileInfo: If the job is successfully exported, specific `SELECT INTO OUTFILE` result information will be displayed here. + ### Example 1. Show all export tasks of default db diff --git a/versioned_docs/version-2.1/data-operate/export/export-manual.md b/versioned_docs/version-2.1/data-operate/export/export-manual.md index 6533bf292bf11..8a3e0769cdc3f 100644 --- a/versioned_docs/version-2.1/data-operate/export/export-manual.md +++ b/versioned_docs/version-2.1/data-operate/export/export-manual.md @@ -24,51 +24,113 @@ specific language governing permissions and limitations under the License. --> -This document will introduce how to use the `EXPORT` command to export data stored in Doris. +This document will introduce how to use the `EXPORT` command to export the data stored in Doris. -For a detailed description of the `EXPORT` command, please refer to: [EXPORT](../../sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md) +`Export` is a function provided by Doris for asynchronous data export. This function can export the data of tables or partitions specified by the user in a specified file format to the target storage system, including object storage, HDFS, or the local file system. -## Overview +`Export` is an asynchronously executed command. After the command is executed successfully, it immediately returns a result, and the user can view the detailed information of the Export task through the `Show Export` command. -`Export` is a feature provided by Doris for asynchronously exporting data. This feature allows users to export data from specified tables or partitions in a specified file format to a target storage system, including object storage, HDFS, or the local file system. +For a detailed introduction of the `EXPORT` command, please refer to: [EXPORT](../../sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md) -`Export` is an asynchronous command. After the command is successfully executed, it immediately returns the result. Users can use the `Show Export` command to view detailed information about the export task. +Regarding how to choose between `SELECT INTO OUTFILE` and `EXPORT`, please refer to [Export Overview](../../data-operate/export/export-overview.md). -For guidance on choosing between `SELECT INTO OUTFILE` and `EXPORT`, please see [Export Overview](../../data-operate/export/export-overview.md). 
+--- -`EXPORT` currently supports exporting the following types of tables or views: +## Basic Principles -- Doris internal tables -- Doris logical views -- Doris Catalog tables +The underlying layer of the Export task is to execute the `SELECT INTO OUTFILE` SQL statement. After a user initiates an Export task, Doris will construct one or more `SELECT INTO OUTFILE` execution plans based on the table to be exported by Export, and then submit these `SELECT INTO OUTFILE` execution plans to Doris's Job Schedule task scheduler. The Job Schedule task scheduler will automatically schedule and execute these tasks. -`EXPORT` currently supports the following export formats: +By default, the Export task is executed in a single thread. To improve the export efficiency, the Export command can set the `parallelism` parameter to concurrently export data. After setting `parallelism` to be greater than 1, the Export task will use multiple threads to concurrently execute the `SELECT INTO OUTFILE` query plans. The `parallelism` parameter actually specifies the number of threads that execute the EXPORT operation. -- Parquet -- ORC -- csv -- csv\_with\_names -- csv\_with\_names\_and\_types +## Usage Scenarios +`Export` is suitable for the following scenarios: +- Exporting a single table with a large amount of data and only requiring simple filtering conditions. +- Scenarios where tasks need to be submitted asynchronously. + +The following limitations should be noted when using `Export`: +- Currently, the export of compressed formats is not supported. +- Exporting the Select result set is not supported. If you need to export the Select result set, please use [OUTFILE Export](../../data-operate/export/outfile.md). +- If you want to export to the local file system, you need to add the configuration `enable_outfile_to_local = true` in `fe.conf` and restart the FE. + +## Quick Start +### Table Creation and Data Import + +```sql +CREATE TABLE IF NOT EXISTS tbl ( + `c1` int(11) NULL, + `c2` string NULL, + `c3` bigint NULL +) +DISTRIBUTED BY HASH(c1) BUCKETS 20 +PROPERTIES("replication_num" = "1"); -Exporting in compressed formats is not supported. -Example: +insert into tbl values + (1, 'doris', 18), + (2, 'nereids', 20), + (3, 'pipelibe', 99999), + (4, 'Apache', 122123455), + (5, null, null); +``` + +### Create an Export Job + +#### Export to HDFS +Export all data from the `tbl` table to HDFS. Set the file format of the export job to csv (the default format) and set the column delimiter to `,`. ```sql -mysql> EXPORT TABLE tpch1.lineitem TO "s3://my_bucket/path/to/exp_" - -> PROPERTIES( - -> "format" = "csv", - -> "max_file_size" = "2048MB" - -> ) - -> WITH s3 ( - -> "s3.endpoint" = "${endpoint}", - -> "s3.region" = "${region}", - -> "s3.secret_key"="${sk}", - -> "s3.access_key" = "${ak}" - -> ); +EXPORT TABLE tbl +TO "hdfs://host/path/to/export/" +PROPERTIES +( + "line_delimiter" = "," +) +with HDFS ( + "fs.defaultFS"="hdfs://hdfs_host:port", + "hadoop.username" = "hadoop" +); ``` -After submitting a job, you can query the export job status using the [SHOW EXPORT](../../sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md) command. An example result is as follows: +If the HDFS cluster has high availability enabled, HA information needs to be provided. Refer to the example: [Export to an HDFS Cluster with High Availability Enabled](#high-availability-hdfs-export). 
+ +If the HDFS cluster has both high availability enabled and Kerberos authentication enabled, Kerberos authentication information needs to be provided. Refer to the example: [Export to an HDFS Cluster with High Availability and Kerberos Authentication Enabled](#high-availability-and-kerberos-cluster-export). + +#### Export to Object Storage + +Export the query results to the directory `s3://path/to/` in the S3 storage, specify the export format as ORC, and information such as `sk` (secret key) and `ak` (access key) needs to be provided. + +```sql +EXPORT TABLE tbl TO "s3://bucket/a/b/c" +PROPERTIES ( + "line_delimiter" = "," +) WITH s3 ( + "s3.endpoint" = "xxxxx", + "s3.region" = "xxxxx", + "s3.secret_key"="xxxx", + "s3.access_key" = "xxxxx" +) +``` + +#### Export to the Local File System +> If you need to export to a local file, you must add `enable_outfile_to_local = true` to `fe.conf` and restart the FE. + +Export the query results to the directory `file:///path/to/` on the BE, specify the export format as CSV, and specify the column separator as `,`. + +```sql +-- csv format +EXPORT TABLE tbl TO "file:///home/user/tmp/" +PROPERTIES ( + "format" = "csv", + "line_delimiter" = "," +); +``` + +> Note: +The function of exporting to local files is not applicable to public cloud users, but only to users with private deployments. And it is assumed by default that the user has full control rights over the cluster nodes. Doris does not perform legality checks on the export paths filled in by the user. If the process user of Doris does not have write permissions for the path, or the path does not exist, an error will be reported. Also, for security reasons, if there is a file with the same name already existing in the path, the export will fail. +Doris does not manage the files exported to the local system, nor does it check the disk space, etc. These files need to be managed by the user, such as cleaning them up. + +### View Export Jobs +After submitting a job, you can query the status of the export job via the [SHOW EXPORT](../../sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md) command. An example of the result is as follows: ```sql mysql> show export\G @@ -97,80 +159,93 @@ OutfileInfo: [ 1 row in set (0.00 sec) ``` -The meaning of each column in the result returned by the `show export` command is as follows: - -- JobId: The unique ID of the job -- Label: The label of the export job. If not specified in the export, the system will generate one by default. -- State: Job status: - - PENDING: Job pending scheduling - - EXPORTING: Data export in progress - - FINISHED: Job successful - - CANCELLED: Job failed -- Progress: Job progress. This progress is based on query plans. For example, if there are a total of 10 threads and 3 have been completed, the progress is 30%. -- TaskInfo: Job information displayed in JSON format: - - db: Database name - - tbl: Table name - - partitions: Specified partitions for export. An empty list indicates all partitions. - - column\_separator: Column separator for the export file. - - line\_delimiter: Line delimiter for the export file. - - tablet num: Total number of tablets involved. - - broker: Name of the broker used. - - coord num: Number of query plans. - - max\_file\_size: Maximum size of an export file. - - delete\_existing\_files: Whether to delete existing files and directories in the export directory. - - columns: Specified column names to export, empty value represents exporting all columns. 
- - format: File format for export -- Path: Export path on the remote storage. -- `CreateTime/StartTime/FinishTime`: Job creation time, scheduling start time, and end time. -- Timeout: Job timeout time in seconds. This time is calculated from CreateTime. -- ErrorMsg: If there is an error in the job, the error reason will be displayed here. -- OutfileInfo: If the job is successfully exported, specific `SELECT INTO OUTFILE` result information will be displayed here. - -After submitting the Export job, you can cancel the export job using the [CANCEL EXPORT](../../sql-manual/sql-statements/data-modification/load-and-export/CANCEL-EXPORT.md) command before the export task succeeds or fails. An example of the cancel command is as follows: - -```sql -CANCEL EXPORT FROM tpch1 WHERE LABEL like "%export_%"; -``` - -## Export File Column Type Mapping - -`Export` supports exporting data in Parquet and ORC file formats. Parquet and ORC file formats have their own data types. Doris's export function can automatically export Doris's data types to the corresponding data types of Parquet and ORC file formats. For specific mapping relationships, please refer to the "Export File Column Type Mapping" section of the [Export Overview](../../data-operate/export/export-overview.md) document. - -## Examples +For the detailed usage of the `show export` command and the meaning of each column in the returned results, please refer to [SHOW EXPORT](../../sql-manual/sql-statements/Show-Statements/SHOW-EXPORT.md). -### Export to HDFS - -Export data from the `col1` and `col2` columns in the `p1` and `p2` partitions of the db1.tbl1 table to HDFS, setting the label of the export job to `mylabel`. The export file format is csv (default format), the column delimiter is `,`, and the maximum size limit for a single export file is 512MB. +### Cancel Export Jobs +After submitting an Export job, the export job can be cancelled via the [CANCEL EXPORT](../../sql-manual/sql-statements/data-modification/load-and-export/CANCEL-EXPORT.md) command before the Export task succeeds or fails. An example of the cancellation command is as follows: ```sql -EXPORT TABLE db1.tbl1 -PARTITION (p1,p2) -TO "hdfs://host/path/to/export/" -PROPERTIES -( - "label" = "mylabel", - "column_separator"=",", - "max_file_size" = "512MB", - "columns" = "col1,col2" -) -with HDFS ( - "fs.defaultFS"="hdfs://hdfs_host:port", - "hadoop.username" = "hadoop" -); +CANCEL EXPORT FROM dbName WHERE LABEL like "%export_%"; ``` -If HDFS is configured for high availability, HA information needs to be provided, as shown below: +## Export Instructions + +### Export Data Sources +`EXPORT` currently supports exporting the following types of tables or views: +- Internal tables in Doris +- Logical views in Doris +- Tables in Doris Catalog + +### Export Data Storage Locations +`Export` currently supports exporting to the following storage locations: +- Object storage: Amazon S3, COS, OSS, OBS, Google GCS +- HDFS +- Local file system + +### Export File Types +`EXPORT` currently supports exporting to the following file formats: +- Parquet +- ORC +- csv +- csv_with_names +- csv_with_names_and_types + +### Column Type Mapping for Exported Files +`Export` supports exporting to Parquet and ORC file formats. Parquet and ORC file formats have their own data types, and the export function of Doris can automatically convert the data types of Doris to the corresponding data types of Parquet and ORC file formats. 
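For example, to have an Export job write Parquet files (so that the mapping below applies), only the `format` property needs to change. The following is a minimal sketch that reuses the `tbl` table and the HDFS connection details from the Quick Start section above:

```sql
EXPORT TABLE tbl
TO "hdfs://host/path/to/export/"
PROPERTIES
(
    "format" = "parquet"
)
with HDFS (
    "fs.defaultFS"="hdfs://hdfs_host:port",
    "hadoop.username" = "hadoop"
);
```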
+ +The following is a mapping table of Doris data types to the data types of Parquet and ORC file formats: +| Doris Type | Arrow Type | Orc Type | +| ---------- | ---------- | -------- | +| boolean | boolean | boolean | +| tinyint | int8 | tinyint | +| smallint | int16 | smallint | +| int | int32 | int | +| bigint | int64 | bigint | +| largeInt | utf8 | string | +| date | utf8 | string | +| datev2 | Date32Type | string | +| datetime | utf8 | string | +| datetimev2 | TimestampType | timestamp | +| float | float32 | float | +| double | float64 | double | +| char / varchar / string| utf8 | string | +| decimal | decimal128 | decimal | +| struct | struct | struct | +| map | map | map | +| array | list | array | +| json | utf8 | string | +| variant | utf8 | string | +| bitmap | binary | binary | +| quantile_state| binary | binary | +| hll | binary | binary | + +> Note: When Doris exports data to the Parquet file format, it first converts the in-memory data of Doris into the Arrow in-memory data format, and then writes it out to the Parquet file format via Arrow. + +## Export Examples + +- [Export to an HDFS Cluster with High Availability Enabled](#high-availability-hdfs-export) +- [Export to an HDFS Cluster with High Availability and Kerberos Authentication Enabled](#high-availability-and-kerberos-cluster-export) +- [Specify Partition for Export](#specify-partition-for-export) +- [Filter Data During Export](#filter-data-during-export) +- [Export External Table Data](#export-external-table-data) +- [Adjust Export Data Consistency](#adjust-export-data-consistency) +- [Adjust Concurrency of Export Jobs](#adjust-concurrency-of-export-jobs) +- [Example of Clearing the Export Directory Before Exporting](#example-of-clearing-the-export-directory-before-exporting) +- [Example of Setting the Size of Exported Files](#example-of-setting-the-size-of-exported-files) + + + + +**Export to an HDFS Cluster with High Availability Enabled** + +If the HDFS has high availability enabled, HA information needs to be provided. For example: ```sql -EXPORT TABLE db1.tbl1 -PARTITION (p1,p2) +EXPORT TABLE tbl TO "hdfs://HDFS8000871/path/to/export/" PROPERTIES ( - "label" = "mylabel", - "column_separator"=",", - "max_file_size" = "512MB", - "columns" = "col1,col2" + "line_delimiter" = "," ) with HDFS ( "fs.defaultFS" = "hdfs://HDFS8000871", @@ -183,18 +258,17 @@ with HDFS ( ); ``` -If the Hadoop cluster is configured for high availability and Kerberos authentication is enabled, you can refer to the following SQL statement: + +**Export to an HDFS Cluster with High Availability and Kerberos Authentication Enabled** + +If the HDFS cluster has both high availability enabled and Kerberos authentication enabled, you can refer to the following SQL statements: ```sql -EXPORT TABLE db1.tbl1 -PARTITION (p1,p2) +EXPORT TABLE tbl TO "hdfs://HDFS8000871/path/to/export/" PROPERTIES ( - "label" = "mylabel", - "column_separator"=",", - "max_file_size" = "512MB", - "columns" = "col1,col2" + "line_delimiter" = "," ) with HDFS ( "fs.defaultFS"="hdfs://hacluster/", @@ -211,66 +285,10 @@ with HDFS ( ); ``` -### Export to S3 - -Export all data from the s3_test table to S3, with the export format as csv and using the invisible character `\x07` as the row delimiter. 
- -```sql -EXPORT TABLE s3_test TO "s3://bucket/a/b/c" -PROPERTIES ( - "line_delimiter" = "\\x07" -) WITH s3 ( - "s3.endpoint" = "xxxxx", - "s3.region" = "xxxxx", - "s3.secret_key"="xxxx", - "s3.access_key" = "xxxxx" -) -``` - -### Export to Local File System - -> -> To export data to the local file system, you need to add `enable_outfile_to_local=true` in fe.conf and restart FE. + +**Specify Partition for Export** -Export all data from the test table to local storage: - -```sql --- parquet format -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "columns" = "k1,k2", - "format" = "parquet" -); - --- orc format -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "columns" = "k1,k2", - "format" = "orc" -); - --- csv_with_names format, using 'AA' as the column separator and 'zz' as the line delimiter -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "format" = "csv_with_names", - "column_separator"="AA", - "line_delimiter" = "zz" -); - --- csv_with_names_and_types format -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "format" = "csv_with_names_and_types" -); -``` - -> Note: - The functionality of exporting to the local file system is not applicable to public cloud users, only for users of private deployments. Additionally, by default, users have full control permissions over the cluster nodes. Doris does not perform validity checks on the export path provided by the user. If the Doris process user does not have write permission to the path or the path does not exist, an error will occur. For security reasons, if a file with the same name already exists in the path, the export will also fail. - Doris does not manage the files exported to the local file system or check disk space, etc. Users need to manage these files themselves, including cleaning them up. - -### Export Specific Partitions - -Export jobs support exporting only specific partitions of Doris internal tables, such as exporting only the p1 and p2 partitions of the test table. +The export job supports exporting only some partitions of the internal tables in Doris. For example, only export partitions p1 and p2 of the `test` table. ```sql EXPORT TABLE test @@ -281,9 +299,10 @@ PROPERTIES ( ); ``` -### Filtering Data during Export + +**Filter Data During Export** -Export jobs support filtering data based on predicate conditions during export, exporting only data that meets certain conditions, such as exporting data that satisfies the condition `k1 < 50`. +The export job supports filtering data according to predicate conditions during the export process, exporting only the data that meets the conditions. For example, only export the data that satisfies the condition `k1 < 50`. ```sql EXPORT TABLE test @@ -295,12 +314,13 @@ PROPERTIES ( ); ``` -### Export External Table Data + +**Export External Table Data** -Export jobs support Doris Catalog external table data: +The export job supports the data of external tables in Doris Catalog. ```sql --- Create a catalog +-- create a catalog CREATE CATALOG `tpch` PROPERTIES ( "type" = "trino-connector", "trino.connector.name" = "tpch", @@ -308,7 +328,7 @@ CREATE CATALOG `tpch` PROPERTIES ( "trino.tpch.splits-per-node" = "32" ); --- Export data from the Catalog external table +-- export Catalog data EXPORT TABLE tpch.sf1.lineitem TO "file:///path/to/exp_" PROPERTIES( "parallelism" = "5", @@ -318,14 +338,13 @@ PROPERTIES( ``` :::tip -Exporting Catalog external table data does not support concurrent exports. 
Even if a parallelism greater than 1 is specified, it will still be a single-threaded export. +Currently, when exporting data from external tables in the Catalog using Export, concurrent exports are not supported. Even if the parallelism is specified to be greater than 1, the export will still be performed in a single thread. ::: -## Best Practices - -### Export Consistency + +**Adjust Export Data Consistency** -`Export` supports two granularities for export: `partition / tablets`. The `data_consistency` parameter is used to specify the granularity at which the table to be exported is split. `none` represents Tablets level, and `partition` represents Partition level. +`Export` supports two granularities: partition and tablets. The `data_consistency` parameter is used to specify the granularity at which the table to be exported is sliced. `none` represents the Tablets level, and `partition` represents the Partition level. ```sql EXPORT TABLE test TO "file:///home/user/tmp" @@ -336,15 +355,16 @@ PROPERTIES ( ); ``` -If `"data_consistency" = "partition"` is set, the underlying Export task constructs multiple `SELECT INTO OUTFILE` statements to export different partitions. +If `"data_consistency" = "partition"` is set, multiple `SELECT INTO OUTFILE` statements constructed at the underlying layer of the Export task will export different partitions. -If `"data_consistency" = "none"` is set, the underlying Export task constructs multiple `SELECT INTO OUTFILE` statements to export different tablets. However, these different tablets may belong to the same partition. +If `"data_consistency" = "none"` is set, multiple `SELECT INTO OUTFILE` statements constructed at the underlying layer of the Export task will export different tablets. However, these different tablets may belong to the same partition. -For the logic behind Export's underlying construction of `SELECT INTO OUTFILE` statements, refer to the appendix. +For the logic of constructing `SELECT INTO OUTFILE` at the underlying layer of Export, please refer to the appendix section. -### Export Job Concurrency + +**Adjust Concurrency of Export Jobs** -Export allows setting different concurrency levels to export data concurrently. Specify a concurrency level of 5: +Export can set different degrees of concurrency to export data concurrently. Specify the concurrency degree as 5: ```sql EXPORT TABLE test TO "file:///home/user/tmp/" @@ -355,9 +375,10 @@ PROPERTIES ( ); ``` -For more information on the principles of concurrent export in Export, refer to the appendix section. +For the principle of concurrent export of Export, please refer to the appendix section. -### Clear Export Directory Before Exporting + +**Example of Clearing the Export Directory Before Exporting** ```sql EXPORT TABLE test TO "file:///home/user/tmp" @@ -368,14 +389,15 @@ PROPERTIES ( ); ``` -If `"delete_existing_files" = "true"` is set, the export job will first delete all files and directories under `/home/user/`, and then export data to that directory. +If `"delete_existing_files" = "true"` is set, the export job will first delete all files and directories under the `/home/user/` directory, and then export data to that directory. > Note: -To use the `delete_existing_files` parameter, you also need to add the configuration `enable_delete_existing_files = true` in fe.conf and restart the FE. Only then will `delete_existing_files` take effect. `delete_existing_files = true` is a risky operation and is recommended to be used only in a testing environment. 
+If you want to use the delete_existing_files parameter, you also need to add the configuration `enable_delete_existing_files = true` in fe.conf and restart fe, then delete_existing_files will take effect. delete_existing_files = true is a dangerous operation and it is recommended to use it only in the test environment. -### Set Export File Size + +**Example of Setting the Size of Exported Files** -Export jobs support setting the size of export files. If a single file exceeds the set value, it will be split into multiple files for export. +The export job supports setting the size of the export file. If the size of a single file exceeds the set value, it will be divided into multiple files for export according to the specified size. ```sql EXPORT TABLE test TO "file:///home/user/tmp/" @@ -385,91 +407,83 @@ PROPERTIES ( ); ``` -By setting `"max_file_size" = "512MB"`, the maximum size of a single export file will be 512MB. +By setting `"max_file_size" = "512MB"`, the maximum size of a single exported file is 512MB. -## Notes -* Memory Limit +## Precautions +* Memory Limitation - Typically, an Export job's query plan consists of only `scan-export` two parts, without involving complex calculation logic that requires a lot of memory. Therefore, the default memory limit of 2GB usually meets the requirements. + Usually, the query plan of an Export job only consists of two parts: scanning and exporting, and does not involve computational logic that requires too much memory. Therefore, the default memory limit of 2GB usually meets the requirements. - However, in some scenarios, such as when a query plan needs to scan too many tablets on the same BE, or when there are too many data versions of tablets, it may lead to insufficient memory. You can adjust the session variable `exec_mem_limit` to increase the memory usage limit. + However, in some scenarios, for example, when a query plan needs to scan too many Tablets on the same BE or there are too many data versions of Tablets, it may lead to insufficient memory. You can adjust the session variable `exec_mem_limit` to increase the memory usage limit. * Export Data Volume - It is not recommended to export a large amount of data at once. It is suggested that the maximum export data volume for an Export job should be within tens of gigabytes. Exporting excessively large data can result in more garbage files and higher retry costs. If the table data volume is too large, it is recommended to export by partition. + It is not recommended to export a large amount of data at one time. The recommended maximum export data volume for an Export job is several tens of gigabytes. Excessive exports will lead to more junk files and higher retry costs. If the table data volume is too large, it is recommended to export by partitions. - Additionally, Export jobs scan data, consuming IO resources, which may impact the system's query latency. + In addition, the Export job will scan data and occupy IO resources, which may affect the query latency of the system. * Export File Management - If an Export job fails during execution, the generated files will not be deleted automatically and will need to be manually deleted by the user. + If the Export job fails, the files that have already been generated will not be deleted, and users need to delete them manually. * Data Consistency - Currently, during export, only a simple check is performed on tablet versions for consistency. It is recommended not to import data into the table during the export process. 
+ Currently, only a simple check is performed on whether the tablet versions are consistent during export. It is recommended not to perform data import operations on the table during the export process. * Export Timeout - If the exported data volume is very large and exceeds the export timeout, the Export task will fail. In such cases, you can specify the `timeout` parameter in the Export command to increase the timeout and retry the Export command. + If the amount of exported data is very large and exceeds the export timeout period, the Export task will fail. At this time, you can specify the `timeout` parameter in the Export command to increase the timeout period and retry the Export command. * Export Failure - If the FE restarts or switches masters during the execution of an Export job, the Export job will fail, and the user will need to resubmit it. You can check the status of Export tasks using the `show export` command. + During the operation of the Export job, if the FE restarts or switches the master, the Export job will fail, and the user needs to resubmit it. You can check the status of the Export task through the `show export` command. -* Number of Export Partitions +* Number of Exported Partitions - An Export Job allows a maximum of 2000 partitions to be exported. You can modify this setting by adding the parameter `maximum_number_of_export_partitions` in fe.conf and restarting the FE. + The maximum number of partitions allowed to be exported by an Export Job is 2000. You can add the parameter `maximum_number_of_export_partitions` in fe.conf and restart the FE to modify this setting. * Concurrent Export - When exporting concurrently, it is important to configure the thread count and parallelism properly to fully utilize system resources and avoid performance bottlenecks. During the export process, monitor progress and performance metrics in real-time to promptly identify issues and optimize adjustments. + During concurrent export, please configure the number of threads and parallelism reasonably to make full use of system resources and avoid performance bottlenecks. During the export process, you can monitor the progress and performance indicators in real time to discover problems in time and make optimization adjustments. * Data Integrity - After the export operation is completed, it is recommended to verify the exported data for completeness and correctness to ensure data quality and integrity. - -## Appendix - -### Principles of Concurrent Export - -The underlying operation of an Export task is to execute the `SELECT INTO OUTFILE` SQL statement. When a user initiates an Export task, Doris constructs one or more `SELECT INTO OUTFILE` execution plans based on the table to be exported, and then submits these `SELECT INTO OUTFILE` execution plans to Doris's Job Schedule task scheduler, which automatically schedules and executes these tasks. - -By default, Export tasks are executed single-threaded. To improve export efficiency, the Export command can set a `parallelism` parameter to export data concurrently. When `parallelism` is set to a value greater than 1, the Export task will use multiple threads to execute the `SELECT INTO OUTFILE` query plans concurrently. The `parallelism` parameter essentially specifies the number of threads to execute the EXPORT job. + After the export operation is completed, it is recommended to verify whether the exported data is complete and correct to ensure the quality and integrity of the data. 
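For the memory-limit and timeout notes above, the corresponding adjustments are sketched below; the values are placeholders rather than recommendations, and the `tbl` table and HDFS connection details are reused from the Quick Start section:

```sql
-- Raise the session memory limit used by the export's query plans (illustrative value)
SET exec_mem_limit = 8589934592;

-- Allow a large export more time by raising the job-level timeout (in seconds)
EXPORT TABLE tbl
TO "hdfs://host/path/to/export/"
PROPERTIES
(
    "timeout" = "36000"
)
with HDFS (
    "fs.defaultFS"="hdfs://hdfs_host:port",
    "hadoop.username" = "hadoop"
);
```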
-The specific logic of constructing one or more `SELECT INTO OUTFILE` execution plans for an Export task is as follows: +* The specific logic for an Export task to construct one or more `SELECT INTO OUTFILE` execution plans is as follows: -1. Select the consistency model for exporting data + 1. Select the Consistency Model of Exported Data - Based on the `data_consistency` parameter to determine the consistency of the export, which is only related to semantics and not concurrency. Users should first choose a consistency model based on their own requirements. + The consistency of export is determined according to the `data_consistency` parameter. This is only related to semantics and has nothing to do with the degree of concurrency. Users should first select a consistency model according to their own requirements. -2. Determine the Degree of Parallelism + 2. Determine the Degree of Concurrency - Determine how many threads will run the `SELECT INTO OUTFILE` execution plan based on the `parallelism` parameter. The `parallelism` parameter determines the maximum number of threads possible. + Determine the number of threads to run these `SELECT INTO OUTFILE` execution plans according to the `parallelism` parameter. The `parallelism` parameter determines the maximum possible number of threads. - > Note: Even if the Export command sets the `parallelism` parameter, the actual number of concurrent threads for the Export task depends on the Job Schedule. When an Export task sets a higher concurrency, each concurrent thread is provided by the Job Schedule. Therefore, if the Doris system tasks are busy and the Job Schedule's thread resources are tight, the actual number of threads assigned to the Export task may not reach the specified `parallelism` number, affecting the concurrent export of the Export task. To mitigate this issue, you can reduce system load or adjust the FE configuration `async_task_consumer_thread_num` to increase the total thread count of the Job Schedule. + > Note: Even if the Export command sets the `parallelism` parameter, the actual number of concurrent threads of the Export task is also related to Job Schedule. After setting multiple concurrency for the Export task, each concurrent thread is provided by Job Schedule. Therefore, if the Doris system tasks are busy at this time and the thread resources of Job Schedule are tight, the actual number of threads allocated to the Export task may not reach the `parallelism` number, which will affect the concurrent export of Export. At this time, you can alleviate this problem by reducing the system load or adjusting the FE configuration `async_task_consumer_thread_num` to increase the total number of threads of Job Schedule. -3. Determine the Workload of Each `outfile` Statement + 3. Determine the Task Amount of Each Outfile Statement - Each thread will determine how many `outfile` statements to split based on `maximum_tablets_of_outfile_in_export` and the actual number of partitions / buckets in the data. + Each thread will decide how many outfiles to split into according to `maximum_tablets_of_outfile_in_export` and the actual number of partitions/buckets of the data. - > `maximum_tablets_of_outfile_in_export` is a configuration in the FE with a default value of 10. This parameter specifies the maximum number of partitions / buckets allowed in a single OutFile statement generated by an Export task. Modifying this configuration requires restarting the FE. 
+ > `maximum_tablets_of_outfile_in_export` is an FE configuration with a default value of 10. This parameter is used to specify the maximum number of partitions/buckets allowed in a single OutFile statement split by the Export task. You need to restart the FE to modify this configuration. - Example: Suppose a table has a total of 20 partitions, each partition has 5 buckets, resulting in a total of 100 buckets. Set `data_consistency = none` and `maximum_tablets_of_outfile_in_export = 10`. + Example: Suppose a table has a total of 20 partitions, and each partition has 5 buckets, then the table has a total of 100 buckets. Set `data_consistency = none` and `maximum_tablets_of_outfile_in_export = 10`. - 1. Scenario with `parallelism = 5` + 1. In the case of `parallelism = 5` - The Export task will divide the 100 buckets of the table into 5 parts, with each thread responsible for 20 buckets. Each thread's 20 buckets will be further divided into 2 groups of 10 buckets each, with each group handled by an outfile query plan. Therefore, the Export task will have 5 threads executing concurrently, with each thread handling 2 outfile statements that are executed serially. + The Export task will divide the 100 buckets of the table into 5 parts, and each thread is responsible for 20 buckets. The 20 buckets responsible by each thread will be divided into 2 groups of 10 buckets each, and each group of buckets is responsible by one outfile query plan. So, finally, the Export task has 5 threads executing concurrently, each thread is responsible for 2 outfile statements, and the outfile statements responsible by each thread are executed serially. - 2. Scenario with `parallelism = 3` + 2. In the case of `parallelism = 3` - The Export task will divide the 100 buckets of the table into 3 parts, with 3 threads responsible for 34, 33, and 33 buckets respectively. Each thread's buckets will be further divided into 4 groups of 10 buckets each (the last group may have fewer than 10 buckets), with each group handled by an outfile query plan. Therefore, the Export task will have 3 threads executing concurrently, with each thread handling 4 outfile statements that are executed serially. + The Export task will divide the 100 buckets of the table into 3 parts, and the 3 threads are responsible for 34, 33, and 33 buckets respectively. The buckets responsible by each thread will be divided into 4 groups of 10 buckets each (the last group has less than 10 buckets), and each group of buckets is responsible by one outfile query plan. So, finally, the Export task has 3 threads executing concurrently, each thread is responsible for 4 outfile statements, and the outfile statements responsible by each thread are executed serially. - 3. Scenario with `parallelism = 120` + 3. In the case of `parallelism = 120` - Since the table has only 100 buckets, the system will force `parallelism` to be set to 100 and execute with `parallelism = 100`. The Export task will divide the 100 buckets of the table into 100 parts, with each thread responsible for 1 bucket. Each thread's 1 bucket will be further divided into 1 group of 1 bucket, with each group handled by an outfile query plan. Therefore, the Export task will have 100 threads executing concurrently, with each thread handling 1 outfile statement, where each outfile statement actually exports only 1 bucket. + Since there are only 100 buckets in the table, the system will force `parallelism` to be set to 100 and execute with `parallelism = 100`. 
The Export task will divide the 100 buckets of the table into 100 parts, and each thread is responsible for 1 bucket. The 1 bucket responsible by each thread will be divided into 1 group of 10 buckets (this group actually has only 1 bucket), and each group of buckets is responsible by one outfile query plan. So, finally, the Export task has 100 threads executing concurrently, each thread is responsible for 1 outfile statement, and each outfile statement actually exports only 1 bucket. -For optimal performance in the current version of Export, it is recommended to set the following parameters: +* For a better performance of Export in the current version, it is recommended to set the following parameters: -1. Enable the session variable `enable_parallel_outfile`. -2. Set the `parallelism` parameter of Export to a large value so that each thread is responsible for only one `SELECT INTO OUTFILE` query plan. -3. Set the FE configuration `maximum_tablets_of_outfile_in_export` to a small value to export a smaller amount of data for each `SELECT INTO OUTFILE` query plan. + 1. Open the session variable `enable_parallel_outfile`. + 2. Set the `parallelism` parameter of Export to a larger value so that each thread is only responsible for one `SELECT INTO OUTFILE` query plan. + 3. Set the FE configuration `maximum_tablets_of_outfile_in_export` to a smaller value so that the amount of data exported by each `SELECT INTO OUTFILE` query plan is smaller. diff --git a/versioned_docs/version-2.1/data-operate/export/outfile.md b/versioned_docs/version-2.1/data-operate/export/outfile.md index cb0eaa14beb2d..170769639e5cc 100644 --- a/versioned_docs/version-2.1/data-operate/export/outfile.md +++ b/versioned_docs/version-2.1/data-operate/export/outfile.md @@ -24,57 +24,59 @@ specific language governing permissions and limitations under the License. --> -This document introduces how to use the `SELECT INTO OUTFILE` command to export query results. +This document will introduce how to use the `SELECT INTO OUTFILE` command to export query results. -For a detailed introduction to the `SELECT INTO OUTFILE` command, refer to: [SELECT INTO OUTFILE](../../sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md). +The `SELECT INTO OUTFILE` command exports the result data of the `SELECT` part to the target storage system in the specified file format, including object storage, HDFS, or the local file system. -## Overview +The `SELECT INTO OUTFILE` is a synchronous command. When the command returns, it means that the export is completed. If the export is successful, information such as the number, size, and path of the exported files will be returned. If the export fails, an error message will be returned. -The `SELECT INTO OUTFILE` command exports the result data of the `SELECT` statement to a target storage system, such as object storage, HDFS, or the local file system, in a specified file format. +For information on how to choose between `SELECT INTO OUTFILE` and `EXPORT`, please refer to [Export Overview](./export-overview.md). -`SELECT INTO OUTFILE` is a synchronous command, meaning it completes when the command returns. If successful, it returns information about the number, size, and paths of the exported files. If it fails, it returns error information. 
+For a detailed introduction to the `SELECT INTO OUTFILE` command, please refer to: [SELECT INTO OUTFILE](../../sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md) -For guidance on choosing between `SELECT INTO OUTFILE` and `EXPORT`, see the [Export Overview](./export-overview.md). +-------------- -### Supported Export Formats +## Usage Scenarios -`SELECT INTO OUTFILE` currently supports the following export formats: +The `SELECT INTO OUTFILE` is applicable to the following scenarios: +- When the exported data needs to go through complex calculation logics, such as filtering, aggregation, and joining. +- For scenarios suitable for executing synchronous tasks. -- Parquet -- ORC -- CSV -- CSV with column names (`csv_with_names`) -- CSV with column names and types (`csv_with_names_and_types`) - -Compressed formats are not supported. +When using the `SELECT INTO OUTFILE`, the following limitations should be noted: +- It does not support exporting data in compressed formats. +- The pipeline engine in version 2.1 does not support concurrent exports. +- If you want to export data to the local file system, you need to add the configuration `enable_outfile_to_local = true` in the `fe.conf` file and then restart the FE. -### Example -```sql -mysql> SELECT * FROM tbl1 LIMIT 10 INTO OUTFILE "file:///home/work/path/result_"; -+------------+-----------+----------+--------------------------------------------------------------------+ -| FileNumber | TotalRows | FileSize | URL | -+------------+-----------+----------+--------------------------------------------------------------------+ -| 1 | 2 | 8 | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ | -+------------+-----------+----------+--------------------------------------------------------------------+ -``` +## Basic Principles +The `SELECT INTO OUTFILE` function essentially executes an SQL query command, and its principle is basically the same as that of an ordinary query. The only difference is that an ordinary query outputs the final query result set to the MySQL client, while the `SELECT INTO OUTFILE` outputs the final query result set to an external storage medium. -Explanation of the returned results: +The principle of concurrent export for `SELECT INTO OUTFILE` is to divide large-scale data sets into small pieces and process them in parallel on multiple nodes. In scenarios where concurrent export is possible, exports are carried out in parallel on multiple BE nodes, with each BE handling a part of the result set. -- **FileNumber**: The number of generated files. -- **TotalRows**: The number of rows in the result set. -- **FileSize**: The total size of the exported files in bytes. -- **URL**: The prefix of the exported file paths. Multiple files will be numbered sequentially with suffixes `_0`, `_1`, etc. +## Quick Start +### Create Tables and Import Data -## Export File Column Type Mapping +```sql +CREATE TABLE IF NOT EXISTS tbl ( + `c1` int(11) NULL, + `c2` string NULL, + `c3` bigint NULL +) +DISTRIBUTED BY HASH(c1) BUCKETS 20 +PROPERTIES("replication_num" = "1"); -`SELECT INTO OUTFILE` supports exporting to Parquet and ORC file formats. Parquet and ORC have their own data types, and Doris can automatically map its data types to corresponding Parquet and ORC data types. Refer to the "Export File Column Type Mapping" section in the [Export Overview](./export-overview.md) document for the specific mapping relationships. 
-## Examples +insert into tbl values + (1, 'doris', 18), + (2, 'nereids', 20), + (3, 'pipelibe', 99999), + (4, 'Apache', 122123455), + (5, null, null); +``` ### Export to HDFS -Export query results to the `hdfs://path/to/` directory, specifying the export format as PARQUET: +Export the query results to the directory `hdfs://path/to/` and specify the export format as Parquet: ```sql SELECT c1, c2, c3 FROM tbl @@ -87,7 +89,106 @@ PROPERTIES ); ``` -If HDFS is configured for high availability, provide HA information, such as: +If the HDFS cluster has high availability enabled, HA information needs to be provided. Refer to the example: [Export to an HDFS Cluster with High Availability Enabled](#high-availability-hdfs-export). + +If the HDFS cluster has both high availability enabled and Kerberos authentication enabled, Kerberos authentication information needs to be provided. Refer to the example: [Export to an HDFS Cluster with High Availability and Kerberos Authentication Enabled](#high-availability-and-kerberos-cluster-export). + +### Export to Object Storage + +Export the query results to the directory `s3://path/to/` in the S3 storage, specify the export format as ORC, and information such as `sk` (secret key) and `ak` (access key) needs to be provided. + +```sql +SELECT * FROM tbl +INTO OUTFILE "s3://path/to/result_" +FORMAT AS ORC +PROPERTIES( + "s3.endpoint" = "https://xxx", + "s3.region" = "ap-beijing", + "s3.access_key"= "your-ak", + "s3.secret_key" = "your-sk" +); +``` + +### Export to the Local File System +> If you need to export to a local file, you must add `enable_outfile_to_local = true` to `fe.conf` and restart the FE. + +Export the query results to the directory `file:///path/to/` on the BE, specify the export format as CSV, and specify the column separator as `,`. + +```sql +SELECT c1, c2 FROM tbl FROM tbl1 +INTO OUTFILE "file:///path/to/result_" +FORMAT AS CSV +PROPERTIES( + "column_separator" = "," +); +``` + +> Note: +The function of exporting to local files is not applicable to public cloud users, but only to users with private deployments. And it is assumed by default that the user has full control rights over the cluster nodes. Doris does not perform legality checks on the export paths filled in by the user. If the process user of Doris does not have write permissions for the path, or the path does not exist, an error will be reported. Also, for security reasons, if there is a file with the same name already existing in the path, the export will fail. +Doris does not manage the files exported to the local system, nor does it check the disk space, etc. These files need to be managed by the user, such as cleaning them up. + +### More Usage +For a detailed introduction to the `SELECT INTO OUTFILE` command, please refer to: [SELECT INTO OUTFILE](../../sql-manual/sql-statements/Data-Manipulation-Statements/OUTFILE.md) + +## Export Instructions +### Storage Locations for Exported Data +The `SELECT INTO OUTFILE` currently supports exporting data to the following storage locations: +- Object storage: Amazon S3, COS, OSS, OBS, Google GCS +- HDFS +- Local file system + +### Export File Types +The `SELECT INTO OUTFILE` currently supports exporting the following file formats: +- Parquet +- ORC +- csv +- csv_with_names +- csv_with_names_and_types + +### Column Type Mapping for Exported Files +The `SELECT INTO OUTFILE` supports exporting data in Parquet and ORC file formats. Parquet and ORC file formats have their own data types. 
The export function of Doris can automatically convert the data types in Doris to the corresponding data types in Parquet and ORC file formats. + +The following is a mapping table of Doris data types and data types in Parquet and ORC file formats: +| Doris Type | Arrow Type | Orc Type | +| ---------- | ---------- | -------- | +| boolean | boolean | boolean | +| tinyint | int8 | tinyint | +| smallint | int16 | smallint | +| int | int32 | int | +| bigint | int64 | bigint | +| largeInt | utf8 | string | +| date | utf8 | string | +| datev2 | Date32Type | string | +| datetime | utf8 | string | +| datetimev2 | TimestampType | timestamp | +| float | float32 | float | +| double | float64 | double | +| char / varchar / string| utf8 | string | +| decimal | decimal128 | decimal | +| struct | struct | struct | +| map | map | map | +| array | list | array | +| json | utf8 | string | +| variant | utf8 | string | +| bitmap | binary | binary | +| quantile_state| binary | binary | +| hll | binary | binary | + +> Note: When Doris exports data to the Parquet file format, it first converts the in-memory data of Doris into the Arrow in-memory data format, and then writes it out to the Parquet file format via Arrow. + +## Export Examples +- [Export to an HDFS Cluster with High Availability Enabled](#high-availability-hdfs-export) +- [Export to an HDFS Cluster with High Availability and Kerberos Authentication Enabled](#high-availability-and-kerberos-cluster-export) +- [Example of Generating a File to Mark a Successful Export](#example-of-generating-a-file-to-mark-a-successful-export) +- [Example of Concurrent Export](#example-of-concurrent-export) +- [Example of Clearing the Export Directory Before Exporting](#example-of-clearing-the-export-directory-before-exporting) +- [Example of Setting the Size of Exported Files](#example-of-setting-the-size-of-exported-files) + + + +**Export to an HDFS Cluster with High Availability Enabled** + +If the HDFS has high availability enabled, HA information needs to be provided. For example: ```sql SELECT c1, c2, c3 FROM tbl @@ -105,7 +206,10 @@ PROPERTIES ); ``` -If the Hadoop cluster is configured for high availability and Kerberos authentication is enabled, you can refer to the following SQL statement: + +**Export to an HDFS Cluster with High Availability and Kerberos Authentication Enabled** + +If the HDFS cluster has both high availability enabled and Kerberos authentication enabled, you can refer to the following SQL statements: ```sql SELECT * FROM tbl @@ -120,59 +224,24 @@ PROPERTIES "dfs.namenode.rpc-address.hacluster.n1"="192.168.0.1:8020", "dfs.namenode.rpc-address.hacluster.n2"="192.168.0.2:8020", "dfs.client.failover.proxy.provider.hacluster"="org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider", - "dfs.namenode.kerberos.principal"="hadoop/_HOST@REALM.COM", + "dfs.namenode.kerberos.principal"="hadoop/_HOST@REALM.COM" "hadoop.security.authentication"="kerberos", "hadoop.kerberos.principal"="doris_test@REALM.COM", "hadoop.kerberos.keytab"="/path/to/doris_test.keytab" ); ``` -### Export to S3 + +**Example of Generating a File to Mark a Successful Export** -Export query results to the S3 storage at `s3://path/to/` directory, specifying the export format as ORC. Provide `sk`, `ak`, and other necessary information: +The `SELECT INTO OUTFILE` command is a synchronous command. 
Therefore, it is possible that the task connection is disconnected during the execution of the SQL, making it impossible to know whether the exported data has ended normally or is complete. At this time, you can use the `success_file_name` parameter to require that a file marker be generated in the directory after a successful export. -```sql -SELECT * FROM tbl -INTO OUTFILE "s3://path/to/result_" -FORMAT AS ORC -PROPERTIES( - "s3.endpoint" = "https://xxx", - "s3.region" = "ap-beijing", - "s3.access_key"= "your-ak", - "s3.secret_key" = "your-sk" -); -``` - -### Export to Local File System -> -> To export to the local file system, add `enable_outfile_to_local=true` in `fe.conf` and restart FE. - -Export query results to the BE's `file:///path/to/` directory, specifying the export format as CSV, with a comma as the column separator: - -```sql -SELECT k1 FROM tbl1 UNION SELECT k2 FROM tbl1 -INTO OUTFILE "file:///path/to/result_" -FORMAT AS CSV -PROPERTIES( - "column_separator" = "," -); -``` - -> Note: -Exporting to local files is not suitable for public cloud users and is intended for private deployment users only. By default, users have full control over cluster nodes. Doris does not check the validity of the export path provided by the user. If the Doris process user does not have write permissions for the path, or the path does not exist, an error will be reported. Additionally, for security reasons, if a file with the same name already exists at the path, the export will fail. Doris does not manage exported local files or check disk space. Users need to manage these files themselves, including cleanup and other tasks. - -## Best Practices - -### Generate Export Success Indicator File +Similar to Hive, users can determine whether the export has ended normally and whether the files in the export directory are complete by checking whether there is a file specified by the `success_file_name` parameter in the export directory. -The `SELECT INTO OUTFILE` command is synchronous, meaning that the task connection could be interrupted during SQL execution, leaving uncertainty about whether the export completed successfully or whether the data is complete. You can use the `success_file_name` parameter to generate an indicator file upon successful export. - -Similar to Hive, users can determine whether the export completed successfully and whether the files in the export directory are complete by checking for the presence of the file specified by the `success_file_name` parameter. - -For example, exporting the results of a `SELECT` statement to Tencent Cloud COS `s3://${bucket_name}/path/my_file_`, specifying the export format as CSV, and setting the success indicator file name to `SUCCESS`: +For example: Export the query results of the `select` statement to Tencent Cloud COS: `s3://${bucket_name}/path/my_file_`. Specify the export format as `csv`. Specify the name of the file marking a successful export as `SUCCESS`. After the export is completed, a marker file will be generated. ```sql -SELECT k1, k2, v1 FROM tbl1 LIMIT 100000 +SELECT k1,k2,v1 FROM tbl1 LIMIT 100000 INTO OUTFILE "s3://my_bucket/path/my_file_" FORMAT AS CSV PROPERTIES @@ -187,21 +256,55 @@ PROPERTIES ) ``` -Upon completion, an additional file named `SUCCESS` will be generated. +After the export is completed, an additional file named `SUCCESS` will be written. 
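After the job finishes, one way to sanity-check the export is to read a few rows back from the target path. The following is only a sketch: it assumes the S3 table-valued function available in your Doris version, and the URI glob and the endpoint/credential placeholders must be replaced with the same values used in the export statement above:

```sql
SELECT * FROM S3(
    "uri" = "s3://my_bucket/path/my_file_*",
    "format" = "csv",
    "s3.endpoint" = "${endpoint}",
    "s3.region" = "${region}",
    "s3.access_key" = "${ak}",
    "s3.secret_key" = "${sk}"
)
LIMIT 10;
```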
+ + +**Example of Concurrent Export** + +By default, the query results of the `SELECT` part will first be aggregated to a certain BE node, and this node will export the data in a single thread. However, in some cases, such as for query statements without an `ORDER BY` clause, concurrent exports can be enabled, allowing multiple BE nodes to export data simultaneously to improve export performance. + +However, not all SQL query statements can be exported concurrently. Whether a query statement can be exported concurrently can be determined through the following steps: -### Concurrent Export +* Make sure that the session variable is enabled: `set enable_parallel_outfile = true;` +* Check the execution plan via `EXPLAIN` -By default, the query results in the `SELECT` section are aggregated to a single BE node, which exports data single-threadedly. However, in some cases (e.g., queries without an `ORDER BY` clause), concurrent export can be enabled to have multiple BE nodes export data simultaneously, improving export performance. +```sql +mysql> EXPLAIN SELECT ... INTO OUTFILE "s3://xxx" ...; ++-----------------------------------------------------------------------------+ +| Explain String | ++-----------------------------------------------------------------------------+ +| PLAN FRAGMENT 0 | +| OUTPUT EXPRS: | | | | +| PARTITION: UNPARTITIONED | +| | +| RESULT SINK | +| | +| 1:EXCHANGE | +| | +| PLAN FRAGMENT 1 | +| OUTPUT EXPRS:`k1` + `k2` | +| PARTITION: HASH_PARTITIONED: `default_cluster:test`.`multi_tablet`.`k1` | +| | +| RESULT FILE SINK | +| FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951_ | +| STORAGE TYPE: S3 | +| | +| 0:OlapScanNode | +| TABLE: multi_tablet | ++-----------------------------------------------------------------------------+ +``` + +The `EXPLAIN` command will return the query plan of the statement. By observing the query plan, if `RESULT FILE SINK` appears in `PLAN FRAGMENT 1`, it indicates that the query statement can be exported concurrently. If `RESULT FILE SINK` appears in `PLAN FRAGMENT 0`, it means that the current query cannot be exported concurrently. -Here’s an example demonstrating how to enable concurrent export: +Next, we will demonstrate how to correctly enable the concurrent export function through an example: -1. Enable the concurrent export session variable: +1. Open the concurrent export session variable ```sql mysql> SET enable_parallel_outfile = true; ``` -2. Execute the export command: +2. Execute the export command ```sql mysql> SELECT * FROM demo.tbl @@ -221,9 +324,9 @@ mysql> SELECT * FROM demo.tbl +------------+-----------+----------+-------------------------------------------------------------------------------+ ``` -With concurrent export successfully enabled, the result may consist of multiple rows, indicating that multiple threads exported data concurrently. +It can be seen that after enabling and successfully triggering the concurrent export function, the returned result may consist of multiple lines, indicating that there are multiple threads exporting concurrently. -Adding an `ORDER BY` clause to the query prevents concurrent export, as the top-level sorting node necessitates single-threaded export: +If we modify the above statement, that is, add an `ORDER BY` clause to the query statement. 
Since the query statement has a top-level sorting node, even if the concurrent export function is enabled, this query cannot be exported concurrently: ```sql mysql> SELECT * FROM demo.tbl ORDER BY id @@ -236,11 +339,10 @@ mysql> SELECT * FROM demo.tbl ORDER BY id +------------+-----------+----------+-------------------------------------------------------------------------------+ ``` -Here, the result is a single row, indicating no concurrent export was triggered. +It can be seen that there is only one final result line, and concurrent export has not been triggered. -Refer to the appendix for more details on concurrent export principles. - -### Clear Export Directory Before Exporting + +**Example of Clearing the Export Directory Before Exporting** ```sql SELECT * FROM tbl1 @@ -258,12 +360,12 @@ PROPERTIES ) ``` -If `"delete_existing_files" = "true"` is set, the export job will first delete all files and directories under `s3://my_bucket/export/`, then export data to that directory. +If `"delete_existing_files" = "true"` is set, the export job will first delete all files and directories under the `s3://my_bucket/export/` directory, and then export data to this directory. -> Note: -To use the `delete_existing_files` parameter, add `enable_delete_existing_files = true` to `fe.conf` and restart FE. This parameter is potentially dangerous and should only be used in a testing environment. +> Note: To use the `delete_existing_files` parameter, you also need to add the configuration `enable_delete_existing_files = true` in `fe.conf` and restart the `fe`, then `delete_existing_files` will take effect. `delete_existing_files = true` is a dangerous operation and it is recommended to use it only in a test environment. -### Set Export File Size + +**Example of Setting the Size of Exported Files** ```sql SELECT * FROM tbl @@ -278,69 +380,25 @@ PROPERTIES( ); ``` -Specifying `"max_file_size" = "2048MB"` ensures that the final file size does not exceed 2GB. If the total size exceeds 2GB, multiple files will be generated. +Since `"max_file_size" = "2048MB"` is specified, if the final generated file is not larger than 2GB, there will be only one file. If it is larger than 2GB, there will be multiple files. -## Considerations +## Precautions + +- Export Data Volume and Export Efficiency +The `SELECT INTO OUTFILE` function essentially executes an SQL query command. If concurrent export is not enabled, the query result is exported by a single BE node in a single thread. Therefore, the total export time includes the time consumed by the query itself and the time required to write out the final result set. Enabling concurrent export can reduce the export time. -- Export Data Volume and Efficiency - The `SELECT INTO OUTFILE` function executes a SQL query. Without concurrent export, a single BE node and thread export the query results. The total export time includes both the query execution time and the result set write-out time. Enabling concurrent export can reduce the export time. - Export Timeout - The export command shares the same timeout as the query. If the data volume is large and causes the export to timeout, you can extend the query timeout by setting the session variable `query_timeout`. -- Export File Management - Doris does not manage exported files, whether successfully exported or remaining from failed exports. Users must handle these files themselves. - Additionally, `SELECT INTO OUTFILE` does not check for the existence of files or file paths. 
Whether `SELECT INTO OUTFILE` automatically creates paths or overwrites existing files depends entirely on the semantics of the remote storage system. -- Empty Result Sets - Exporting an empty result set still generates an empty file. +The timeout time of the export command is the same as that of the query. If the data volume is large and causes the export data to time out, you can set the session variable `query_timeout` to appropriately extend the query timeout. + +- Management of Exported Files +Doris does not manage the exported files. Whether they are successfully exported or residual files after a failed export, users need to handle them by themselves. +In addition, the `SELECT INTO OUTFILE` command does not check whether the file and file path exist. Whether `SELECT INTO OUTFILE` will automatically create a path or overwrite an existing file is completely determined by the semantics of the remote storage system. + +- If the Query Result Set Is Empty +For an export with an empty result set, an empty file will still be generated. + - File Splitting - File splitting ensures that a single row of data is stored completely in one file. Thus, the file size may not exactly equal `max_file_size`. -- Non-visible Character Functions - For functions outputting non-visible characters (e.g., BITMAP, HLL types), CSV output is `\N`, and Parquet/ORC output is NULL. - Currently, some geographic functions like `ST_Point` output VARCHAR but with encoded binary characters, causing garbled output. Use `ST_AsText` for geographic functions. - -## Appendix - -### Concurrent Export Principles - -- Principle Overview - - Doris is a high-performance, real-time analytical database based on the MPP (Massively Parallel Processing) architecture. MPP divides large datasets into small chunks and processes them in parallel across multiple nodes. - Concurrent export in `SELECT INTO OUTFILE` leverages this parallel processing capability, allowing multiple BE nodes to export parts of the result set simultaneously. - -- How to Determine Concurrent Export Eligibility - - - Ensure Session Variable is Enabled: `set enable_parallel_outfile = true;` - - Check Execution Plan with `EXPLAIN`: - - ```sql - mysql> EXPLAIN SELECT ... INTO OUTFILE "s3://xxx" ...; - +-----------------------------------------------------------------------------+ - | Explain String | - +-----------------------------------------------------------------------------+ - | PLAN FRAGMENT 0 | - | OUTPUT EXPRS: | | | | - | PARTITION: UNPARTITIONED | - | | - | RESULT SINK | - | | - | 1:EXCHANGE | - | | - | PLAN FRAGMENT 1 | - | OUTPUT EXPRS:`k1` - - + `k2` | - | PARTITION: HASH_PARTITIONED: `default_cluster:test`.`multi_tablet`.`k1` | - | | - | RESULT FILE SINK | - | FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951_ | - | STORAGE TYPE: S3 | - | | - | 0:OlapScanNode | - | TABLE: multi_tablet | - +-----------------------------------------------------------------------------+ - ``` - - The `EXPLAIN` command returns the query plan. If `RESULT FILE SINK` appears in `PLAN FRAGMENT 1`, the query can be exported concurrently. If it appears in `PLAN FRAGMENT 0`, concurrent export is not possible. - -- Export Concurrency - - When concurrent export conditions are met, the export task's concurrency is determined by: `BE nodes * parallel_fragment_exec_instance_num`. +File splitting ensures that a row of data is stored completely in a single file. Therefore, the file size is not strictly equal to `max_file_size`. 
+ +- Functions for Non-Visible Characters +For some functions that output non-visible characters, such as BITMAP and HLL types, when exporting to the CSV file format, the output is `\N`. diff --git a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md index b1a49ad48a036..bc818932ce676 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md +++ b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md @@ -125,6 +125,13 @@ INTO OUTFILE "file_path" INTO OUTFILE "file:///home/work/path/result_"; ``` +#### Explanation of the returned results: + +- **FileNumber**: The number of generated files. +- **TotalRows**: The number of rows in the result set. +- **FileSize**: The total size of the exported files in bytes. +- **URL**: The prefix of the exported file paths. Multiple files will be numbered sequentially with suffixes `_0`, `_1`, etc. + #### DataType Mapping Parquet and ORC file formats have their own data types. The export function of Doris can automatically export the Doris data types to the corresponding data types of the Parquet/ORC file format. The following are the data type mapping relationship of the Doris data types and the Parquet/ORC file format data types: diff --git a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md index c05de32fa8f38..fb264e966f75b 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md +++ b/versioned_docs/version-2.1/sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md @@ -51,7 +51,36 @@ illustrate: 3. You can use ORDER BY to sort any combination of columns 4. If LIMIT is specified, limit matching records are displayed. Otherwise show all -## Examples +The meaning of each column in the result returned by the `show export` command is as follows: + +- JobId: The unique ID of the job +- Label: The label of the export job. If not specified in the export, the system will generate one by default. +- State: Job status: + - PENDING: Job pending scheduling + - EXPORTING: Data export in progress + - FINISHED: Job successful + - CANCELLED: Job failed +- Progress: Job progress. This progress is based on query plans. For example, if there are a total of 10 threads and 3 have been completed, the progress is 30%. +- TaskInfo: Job information displayed in JSON format: + - db: Database name + - tbl: Table name + - partitions: Specified partitions for export. An empty list indicates all partitions. + - column\_separator: Column separator for the export file. + - line\_delimiter: Line delimiter for the export file. + - tablet num: Total number of tablets involved. + - broker: Name of the broker used. + - coord num: Number of query plans. + - max\_file\_size: Maximum size of an export file. + - delete\_existing\_files: Whether to delete existing files and directories in the export directory. + - columns: Specified column names to export, empty value represents exporting all columns. + - format: File format for export +- Path: Export path on the remote storage. +- `CreateTime/StartTime/FinishTime`: Job creation time, scheduling start time, and end time. +- Timeout: Job timeout time in seconds. 
This time is calculated from CreateTime. +- ErrorMsg: If there is an error in the job, the error reason will be displayed here. +- OutfileInfo: If the job is successfully exported, specific `SELECT INTO OUTFILE` result information will be displayed here. + +## Example 1. Show all export tasks of default db diff --git a/versioned_docs/version-3.0/data-operate/export/export-manual.md b/versioned_docs/version-3.0/data-operate/export/export-manual.md index c039c522d28d3..8a3e0769cdc3f 100644 --- a/versioned_docs/version-3.0/data-operate/export/export-manual.md +++ b/versioned_docs/version-3.0/data-operate/export/export-manual.md @@ -24,51 +24,113 @@ specific language governing permissions and limitations under the License. --> -This document will introduce how to use the `EXPORT` command to export data stored in Doris. +This document will introduce how to use the `EXPORT` command to export the data stored in Doris. -For a detailed description of the `EXPORT` command, please refer to: [EXPORT](../../sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md) +`Export` is a function provided by Doris for asynchronous data export. This function can export the data of tables or partitions specified by the user in a specified file format to the target storage system, including object storage, HDFS, or the local file system. -## Overview +`Export` is an asynchronously executed command. After the command is executed successfully, it immediately returns a result, and the user can view the detailed information of the Export task through the `Show Export` command. -`Export` is a feature provided by Doris for asynchronously exporting data. This feature allows users to export data from specified tables or partitions in a specified file format to a target storage system, including object storage, HDFS, or the local file system. +For a detailed introduction of the `EXPORT` command, please refer to: [EXPORT](../../sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md) -`Export` is an asynchronous command. After the command is successfully executed, it immediately returns the result. Users can use the `Show Export` command to view detailed information about the export task. +Regarding how to choose between `SELECT INTO OUTFILE` and `EXPORT`, please refer to [Export Overview](../../data-operate/export/export-overview.md). -For guidance on choosing between `SELECT INTO OUTFILE` and `EXPORT`, please see [Export Overview](../../data-operate/export/export-overview.md). +--- -`EXPORT` currently supports exporting the following types of tables or views: +## Basic Principles -- Doris internal tables -- Doris logical views -- Doris Catalog tables +The underlying layer of the Export task is to execute the `SELECT INTO OUTFILE` SQL statement. After a user initiates an Export task, Doris will construct one or more `SELECT INTO OUTFILE` execution plans based on the table to be exported by Export, and then submit these `SELECT INTO OUTFILE` execution plans to Doris's Job Schedule task scheduler. The Job Schedule task scheduler will automatically schedule and execute these tasks. -`EXPORT` currently supports the following export formats: +By default, the Export task is executed in a single thread. To improve the export efficiency, the Export command can set the `parallelism` parameter to concurrently export data. After setting `parallelism` to be greater than 1, the Export task will use multiple threads to concurrently execute the `SELECT INTO OUTFILE` query plans. 
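For instance, a minimal sketch of such a concurrent export (the table name and destination are illustrative) that asks for four `SELECT INTO OUTFILE` plans to run at the same time:

```sql
EXPORT TABLE db1.tbl1 TO "hdfs://host/path/to/export/"
PROPERTIES (
    "format" = "csv",
    "parallelism" = "4"
)
WITH HDFS (
    "fs.defaultFS" = "hdfs://hdfs_host:port",
    "hadoop.username" = "hadoop"
);
```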
The `parallelism` parameter actually specifies the number of threads that execute the EXPORT operation. -- Parquet -- ORC -- csv -- csv\_with\_names -- csv\_with\_names\_and\_types +## Usage Scenarios +`Export` is suitable for the following scenarios: +- Exporting a single table with a large amount of data and only requiring simple filtering conditions. +- Scenarios where tasks need to be submitted asynchronously. + +The following limitations should be noted when using `Export`: +- Currently, the export of compressed formats is not supported. +- Exporting the Select result set is not supported. If you need to export the Select result set, please use [OUTFILE Export](../../data-operate/export/outfile.md). +- If you want to export to the local file system, you need to add the configuration `enable_outfile_to_local = true` in `fe.conf` and restart the FE. + +## Quick Start +### Table Creation and Data Import + +```sql +CREATE TABLE IF NOT EXISTS tbl ( + `c1` int(11) NULL, + `c2` string NULL, + `c3` bigint NULL +) +DISTRIBUTED BY HASH(c1) BUCKETS 20 +PROPERTIES("replication_num" = "1"); -Exporting in compressed formats is not supported. -Example: +insert into tbl values + (1, 'doris', 18), + (2, 'nereids', 20), + (3, 'pipelibe', 99999), + (4, 'Apache', 122123455), + (5, null, null); +``` + +### Create an Export Job + +#### Export to HDFS +Export all data from the `tbl` table to HDFS. Set the file format of the export job to csv (the default format) and set the column delimiter to `,`. ```sql -mysql> EXPORT TABLE tpch1.lineitem TO "s3://my_bucket/path/to/exp_" - -> PROPERTIES( - -> "format" = "csv", - -> "max_file_size" = "2048MB" - -> ) - -> WITH s3 ( - -> "s3.endpoint" = "${endpoint}", - -> "s3.region" = "${region}", - -> "s3.secret_key"="${sk}", - -> "s3.access_key" = "${ak}" - -> ); +EXPORT TABLE tbl +TO "hdfs://host/path/to/export/" +PROPERTIES +( + "line_delimiter" = "," +) +with HDFS ( + "fs.defaultFS"="hdfs://hdfs_host:port", + "hadoop.username" = "hadoop" +); ``` -After submitting a job, you can query the export job status using the [SHOW EXPORT](../../sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md) command. An example result is as follows: +If the HDFS cluster has high availability enabled, HA information needs to be provided. Refer to the example: [Export to an HDFS Cluster with High Availability Enabled](#high-availability-hdfs-export). + +If the HDFS cluster has both high availability enabled and Kerberos authentication enabled, Kerberos authentication information needs to be provided. Refer to the example: [Export to an HDFS Cluster with High Availability and Kerberos Authentication Enabled](#high-availability-and-kerberos-cluster-export). + +#### Export to Object Storage + +Export the query results to the directory `s3://path/to/` in the S3 storage, specify the export format as ORC, and information such as `sk` (secret key) and `ak` (access key) needs to be provided. + +```sql +EXPORT TABLE tbl TO "s3://bucket/a/b/c" +PROPERTIES ( + "line_delimiter" = "," +) WITH s3 ( + "s3.endpoint" = "xxxxx", + "s3.region" = "xxxxx", + "s3.secret_key"="xxxx", + "s3.access_key" = "xxxxx" +) +``` + +#### Export to the Local File System +> If you need to export to a local file, you must add `enable_outfile_to_local = true` to `fe.conf` and restart the FE. + +Export the query results to the directory `file:///path/to/` on the BE, specify the export format as CSV, and specify the column separator as `,`. 
+ +```sql +-- csv format +EXPORT TABLE tbl TO "file:///home/user/tmp/" +PROPERTIES ( + "format" = "csv", + "line_delimiter" = "," +); +``` + +> Note: +The function of exporting to local files is not applicable to public cloud users, but only to users with private deployments. And it is assumed by default that the user has full control rights over the cluster nodes. Doris does not perform legality checks on the export paths filled in by the user. If the process user of Doris does not have write permissions for the path, or the path does not exist, an error will be reported. Also, for security reasons, if there is a file with the same name already existing in the path, the export will fail. +Doris does not manage the files exported to the local system, nor does it check the disk space, etc. These files need to be managed by the user, such as cleaning them up. + +### View Export Jobs +After submitting a job, you can query the status of the export job via the [SHOW EXPORT](../../sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md) command. An example of the result is as follows: ```sql mysql> show export\G @@ -97,80 +159,93 @@ OutfileInfo: [ 1 row in set (0.00 sec) ``` -The meaning of each column in the result returned by the `show export` command is as follows: - -- JobId: The unique ID of the job -- Label: The label of the export job. If not specified in the export, the system will generate one by default. -- State: Job status: - - PENDING: Job pending scheduling - - EXPORTING: Data export in progress - - FINISHED: Job successful - - CANCELLED: Job failed -- Progress: Job progress. This progress is based on query plans. For example, if there are a total of 10 threads and 3 have been completed, the progress is 30%. -- TaskInfo: Job information displayed in JSON format: - - db: Database name - - tbl: Table name - - partitions: Specified partitions for export. An empty list indicates all partitions. - - column\_separator: Column separator for the export file. - - line\_delimiter: Line delimiter for the export file. - - tablet num: Total number of tablets involved. - - broker: Name of the broker used. - - coord num: Number of query plans. - - max\_file\_size: Maximum size of an export file. - - delete\_existing\_files: Whether to delete existing files and directories in the export directory. - - columns: Specified column names to export, empty value represents exporting all columns. - - format: File format for export -- Path: Export path on the remote storage. -- CreateTime/StartTime/FinishTime: Job creation time, scheduling start time, and end time. -- Timeout: Job timeout time in seconds. This time is calculated from CreateTime. -- ErrorMsg: If there is an error in the job, the error reason will be displayed here. -- OutfileInfo: If the job is successfully exported, specific `SELECT INTO OUTFILE` result information will be displayed here. - -After submitting the Export job, you can cancel the export job using the [CANCEL EXPORT](../../sql-manual/sql-statements/data-modification/load-and-export/CANCEL-EXPORT.md) command before the export task succeeds or fails. An example of the cancel command is as follows: - -```sql -CANCEL EXPORT FROM tpch1 WHERE LABEL like "%export_%"; -``` - -## Export File Column Type Mapping - -`Export` supports exporting data in Parquet and ORC file formats. Parquet and ORC file formats have their own data types. Doris's export function can automatically export Doris's data types to the corresponding data types of Parquet and ORC file formats. 
For specific mapping relationships, please refer to the "Export File Column Type Mapping" section of the [Export Overview](../../data-operate/export/export-overview.md) document. - -## Examples +For the detailed usage of the `show export` command and the meaning of each column in the returned results, please refer to [SHOW EXPORT](../../sql-manual/sql-statements/Show-Statements/SHOW-EXPORT.md). -### Export to HDFS - -Export data from the `col1` and `col2` columns in the `p1` and `p2` partitions of the db1.tbl1 table to HDFS, setting the label of the export job to `mylabel`. The export file format is csv (default format), the column delimiter is `,`, and the maximum size limit for a single export file is 512MB. +### Cancel Export Jobs +After submitting an Export job, the export job can be cancelled via the [CANCEL EXPORT](../../sql-manual/sql-statements/data-modification/load-and-export/CANCEL-EXPORT.md) command before the Export task succeeds or fails. An example of the cancellation command is as follows: ```sql -EXPORT TABLE db1.tbl1 -PARTITION (p1,p2) -TO "hdfs://host/path/to/export/" -PROPERTIES -( - "label" = "mylabel", - "column_separator"=",", - "max_file_size" = "512MB", - "columns" = "col1,col2" -) -with HDFS ( - "fs.defaultFS"="hdfs://hdfs_host:port", - "hadoop.username" = "hadoop" -); +CANCEL EXPORT FROM dbName WHERE LABEL like "%export_%"; ``` -If HDFS is configured for high availability, HA information needs to be provided, as shown below: +## Export Instructions + +### Export Data Sources +`EXPORT` currently supports exporting the following types of tables or views: +- Internal tables in Doris +- Logical views in Doris +- Tables in Doris Catalog + +### Export Data Storage Locations +`Export` currently supports exporting to the following storage locations: +- Object storage: Amazon S3, COS, OSS, OBS, Google GCS +- HDFS +- Local file system + +### Export File Types +`EXPORT` currently supports exporting to the following file formats: +- Parquet +- ORC +- csv +- csv_with_names +- csv_with_names_and_types + +### Column Type Mapping for Exported Files +`Export` supports exporting to Parquet and ORC file formats. Parquet and ORC file formats have their own data types, and the export function of Doris can automatically convert the data types of Doris to the corresponding data types of Parquet and ORC file formats. + +The following is a mapping table of Doris data types to the data types of Parquet and ORC file formats: +| Doris Type | Arrow Type | Orc Type | +| ---------- | ---------- | -------- | +| boolean | boolean | boolean | +| tinyint | int8 | tinyint | +| smallint | int16 | smallint | +| int | int32 | int | +| bigint | int64 | bigint | +| largeInt | utf8 | string | +| date | utf8 | string | +| datev2 | Date32Type | string | +| datetime | utf8 | string | +| datetimev2 | TimestampType | timestamp | +| float | float32 | float | +| double | float64 | double | +| char / varchar / string| utf8 | string | +| decimal | decimal128 | decimal | +| struct | struct | struct | +| map | map | map | +| array | list | array | +| json | utf8 | string | +| variant | utf8 | string | +| bitmap | binary | binary | +| quantile_state| binary | binary | +| hll | binary | binary | + +> Note: When Doris exports data to the Parquet file format, it first converts the in-memory data of Doris into the Arrow in-memory data format, and then writes it out to the Parquet file format via Arrow. 
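If you want to spot-check the column types that were actually written out, one option is to read the exported files back with Doris's S3 table-valued function. The sketch below assumes that TVF and its parameter names as documented for recent Doris versions; the bucket, path pattern, and credentials are illustrative:

```sql
-- Inspect the schema of the exported Parquet files (illustrative path and credentials)
DESC FUNCTION s3(
    "uri" = "s3://my_bucket/path/to/exp_*",
    "format" = "parquet",
    "s3.endpoint" = "${endpoint}",
    "s3.region" = "${region}",
    "s3.access_key" = "${ak}",
    "s3.secret_key" = "${sk}"
);
```

A plain `SELECT ... FROM s3(...) LIMIT 10` over the same URI can be used instead if you prefer to inspect the data rather than the schema.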
+ +## Export Examples + +- [Export to an HDFS Cluster with High Availability Enabled](#high-availability-hdfs-export) +- [Export to an HDFS Cluster with High Availability and Kerberos Authentication Enabled](#high-availability-and-kerberos-cluster-export) +- [Specify Partition for Export](#specify-partition-for-export) +- [Filter Data During Export](#filter-data-during-export) +- [Export External Table Data](#export-external-table-data) +- [Adjust Export Data Consistency](#adjust-export-data-consistency) +- [Adjust Concurrency of Export Jobs](#adjust-concurrency-of-export-jobs) +- [Example of Clearing the Export Directory Before Exporting](#example-of-clearing-the-export-directory-before-exporting) +- [Example of Setting the Size of Exported Files](#example-of-setting-the-size-of-exported-files) + + + + +**Export to an HDFS Cluster with High Availability Enabled** + +If the HDFS has high availability enabled, HA information needs to be provided. For example: ```sql -EXPORT TABLE db1.tbl1 -PARTITION (p1,p2) +EXPORT TABLE tbl TO "hdfs://HDFS8000871/path/to/export/" PROPERTIES ( - "label" = "mylabel", - "column_separator"=",", - "max_file_size" = "512MB", - "columns" = "col1,col2" + "line_delimiter" = "," ) with HDFS ( "fs.defaultFS" = "hdfs://HDFS8000871", @@ -183,18 +258,17 @@ with HDFS ( ); ``` -If the Hadoop cluster is configured for high availability and Kerberos authentication is enabled, you can refer to the following SQL statement: + +**Export to an HDFS Cluster with High Availability and Kerberos Authentication Enabled** + +If the HDFS cluster has both high availability enabled and Kerberos authentication enabled, you can refer to the following SQL statements: ```sql -EXPORT TABLE db1.tbl1 -PARTITION (p1,p2) +EXPORT TABLE tbl TO "hdfs://HDFS8000871/path/to/export/" PROPERTIES ( - "label" = "mylabel", - "column_separator"=",", - "max_file_size" = "512MB", - "columns" = "col1,col2" + "line_delimiter" = "," ) with HDFS ( "fs.defaultFS"="hdfs://hacluster/", @@ -211,66 +285,10 @@ with HDFS ( ); ``` -### Export to S3 - -Export all data from the s3_test table to S3, with the export format as csv and using the invisible character `\x07` as the row delimiter. - -```sql -EXPORT TABLE s3_test TO "s3://bucket/a/b/c" -PROPERTIES ( - "line_delimiter" = "\\x07" -) WITH s3 ( - "s3.endpoint" = "xxxxx", - "s3.region" = "xxxxx", - "s3.secret_key"="xxxx", - "s3.access_key" = "xxxxx" -) -``` - -### Export to Local File System - -> -> To export data to the local file system, you need to add `enable_outfile_to_local=true` in fe.conf and restart FE. + +**Specify Partition for Export** -Export all data from the test table to local storage: - -```sql --- parquet format -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "columns" = "k1,k2", - "format" = "parquet" -); - --- orc format -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "columns" = "k1,k2", - "format" = "orc" -); - --- csv_with_names format, using 'AA' as the column separator and 'zz' as the line delimiter -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "format" = "csv_with_names", - "column_separator"="AA", - "line_delimiter" = "zz" -); - --- csv_with_names_and_types format -EXPORT TABLE test TO "file:///home/user/tmp/" -PROPERTIES ( - "format" = "csv_with_names_and_types" -); -``` - -> Note: - The functionality of exporting to the local file system is not applicable to public cloud users, only for users of private deployments. 
Additionally, by default, users have full control permissions over the cluster nodes. Doris does not perform validity checks on the export path provided by the user. If the Doris process user does not have write permission to the path or the path does not exist, an error will occur. For security reasons, if a file with the same name already exists in the path, the export will also fail. - Doris does not manage the files exported to the local file system or check disk space, etc. Users need to manage these files themselves, including cleaning them up. - -### Export Specific Partitions - -Export jobs support exporting only specific partitions of Doris internal tables, such as exporting only the p1 and p2 partitions of the test table. +The export job supports exporting only some partitions of the internal tables in Doris. For example, only export partitions p1 and p2 of the `test` table. ```sql EXPORT TABLE test @@ -281,9 +299,10 @@ PROPERTIES ( ); ``` -### Filtering Data during Export + +**Filter Data During Export** -Export jobs support filtering data based on predicate conditions during export, exporting only data that meets certain conditions, such as exporting data that satisfies the condition `k1 < 50`. +The export job supports filtering data according to predicate conditions during the export process, exporting only the data that meets the conditions. For example, only export the data that satisfies the condition `k1 < 50`. ```sql EXPORT TABLE test @@ -295,12 +314,13 @@ PROPERTIES ( ); ``` -### Export External Table Data + +**Export External Table Data** -Export jobs support Doris Catalog external table data: +The export job supports the data of external tables in Doris Catalog. ```sql --- Create a catalog +-- create a catalog CREATE CATALOG `tpch` PROPERTIES ( "type" = "trino-connector", "trino.connector.name" = "tpch", @@ -308,7 +328,7 @@ CREATE CATALOG `tpch` PROPERTIES ( "trino.tpch.splits-per-node" = "32" ); --- Export data from the Catalog external table +-- export Catalog data EXPORT TABLE tpch.sf1.lineitem TO "file:///path/to/exp_" PROPERTIES( "parallelism" = "5", @@ -318,14 +338,13 @@ PROPERTIES( ``` :::tip -Exporting Catalog external table data does not support concurrent exports. Even if a parallelism greater than 1 is specified, it will still be a single-threaded export. +Currently, when exporting data from external tables in the Catalog using Export, concurrent exports are not supported. Even if the parallelism is specified to be greater than 1, the export will still be performed in a single thread. ::: -## Best Practices - -### Export Consistency + +**Adjust Export Data Consistency** -`Export` supports two granularities for export: partition / tablets. The `data_consistency` parameter is used to specify the granularity at which the table to be exported is split. `none` represents Tablets level, and `partition` represents Partition level. +`Export` supports two granularities: partition and tablets. The `data_consistency` parameter is used to specify the granularity at which the table to be exported is sliced. `none` represents the Tablets level, and `partition` represents the Partition level. ```sql EXPORT TABLE test TO "file:///home/user/tmp" @@ -336,15 +355,16 @@ PROPERTIES ( ); ``` -If `"data_consistency" = "partition"` is set, the underlying Export task constructs multiple `SELECT INTO OUTFILE` statements to export different partitions. 
+If `"data_consistency" = "partition"` is set, multiple `SELECT INTO OUTFILE` statements constructed at the underlying layer of the Export task will export different partitions. -If `"data_consistency" = "none"` is set, the underlying Export task constructs multiple `SELECT INTO OUTFILE` statements to export different tablets. However, these different tablets may belong to the same partition. +If `"data_consistency" = "none"` is set, multiple `SELECT INTO OUTFILE` statements constructed at the underlying layer of the Export task will export different tablets. However, these different tablets may belong to the same partition. -For the logic behind Export's underlying construction of `SELECT INTO OUTFILE` statements, refer to the appendix. +For the logic of constructing `SELECT INTO OUTFILE` at the underlying layer of Export, please refer to the appendix section. -### Export Job Concurrency + +**Adjust Concurrency of Export Jobs** -Export allows setting different concurrency levels to export data concurrently. Specify a concurrency level of 5: +Export can set different degrees of concurrency to export data concurrently. Specify the concurrency degree as 5: ```sql EXPORT TABLE test TO "file:///home/user/tmp/" @@ -355,9 +375,10 @@ PROPERTIES ( ); ``` -For more information on the principles of concurrent export in Export, refer to the appendix section. +For the principle of concurrent export of Export, please refer to the appendix section. -### Clear Export Directory Before Exporting + +**Example of Clearing the Export Directory Before Exporting** ```sql EXPORT TABLE test TO "file:///home/user/tmp" @@ -368,14 +389,15 @@ PROPERTIES ( ); ``` -If `"delete_existing_files" = "true"` is set, the export job will first delete all files and directories under `/home/user/`, and then export data to that directory. +If `"delete_existing_files" = "true"` is set, the export job will first delete all files and directories under the `/home/user/` directory, and then export data to that directory. > Note: -To use the `delete_existing_files` parameter, you also need to add the configuration `enable_delete_existing_files = true` in fe.conf and restart the FE. Only then will `delete_existing_files` take effect. `delete_existing_files = true` is a risky operation and is recommended to be used only in a testing environment. +If you want to use the delete_existing_files parameter, you also need to add the configuration `enable_delete_existing_files = true` in fe.conf and restart fe, then delete_existing_files will take effect. delete_existing_files = true is a dangerous operation and it is recommended to use it only in the test environment. -### Set Export File Size + +**Example of Setting the Size of Exported Files** -Export jobs support setting the size of export files. If a single file exceeds the set value, it will be split into multiple files for export. +The export job supports setting the size of the export file. If the size of a single file exceeds the set value, it will be divided into multiple files for export according to the specified size. ```sql EXPORT TABLE test TO "file:///home/user/tmp/" @@ -385,91 +407,83 @@ PROPERTIES ( ); ``` -By setting `"max_file_size" = "512MB"`, the maximum size of a single export file will be 512MB. +By setting `"max_file_size" = "512MB"`, the maximum size of a single exported file is 512MB. 
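The clauses and properties shown in the examples above can be combined in a single job. The following is a rough sketch (the table, partitions, filter, and S3 connection values are all illustrative) that exports two partitions, filters rows, and splits the output into files of at most 512MB using five concurrent threads:

```sql
EXPORT TABLE test
PARTITION (p1, p2)
WHERE k1 < 50
TO "s3://my_bucket/path/to/exp_"
PROPERTIES (
    "format" = "csv",
    "column_separator" = ",",
    "parallelism" = "5",
    "max_file_size" = "512MB"
)
WITH s3 (
    "s3.endpoint" = "${endpoint}",
    "s3.region" = "${region}",
    "s3.secret_key" = "${sk}",
    "s3.access_key" = "${ak}"
);
```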
-## Notes -* Memory Limit +## Precautions +* Memory Limitation - Typically, an Export job's query plan consists of only `scan-export` two parts, without involving complex calculation logic that requires a lot of memory. Therefore, the default memory limit of 2GB usually meets the requirements. + Usually, the query plan of an Export job only consists of two parts: scanning and exporting, and does not involve computational logic that requires too much memory. Therefore, the default memory limit of 2GB usually meets the requirements. - However, in some scenarios, such as when a query plan needs to scan too many tablets on the same BE, or when there are too many data versions of tablets, it may lead to insufficient memory. You can adjust the session variable `exec_mem_limit` to increase the memory usage limit. + However, in some scenarios, for example, when a query plan needs to scan too many Tablets on the same BE or there are too many data versions of Tablets, it may lead to insufficient memory. You can adjust the session variable `exec_mem_limit` to increase the memory usage limit. * Export Data Volume - It is not recommended to export a large amount of data at once. It is suggested that the maximum export data volume for an Export job should be within tens of gigabytes. Exporting excessively large data can result in more garbage files and higher retry costs. If the table data volume is too large, it is recommended to export by partition. + It is not recommended to export a large amount of data at one time. The recommended maximum export data volume for an Export job is several tens of gigabytes. Excessive exports will lead to more junk files and higher retry costs. If the table data volume is too large, it is recommended to export by partitions. - Additionally, Export jobs scan data, consuming IO resources, which may impact the system's query latency. + In addition, the Export job will scan data and occupy IO resources, which may affect the query latency of the system. * Export File Management - If an Export job fails during execution, the generated files will not be deleted automatically and will need to be manually deleted by the user. + If the Export job fails, the files that have already been generated will not be deleted, and users need to delete them manually. * Data Consistency - Currently, during export, only a simple check is performed on tablet versions for consistency. It is recommended not to import data into the table during the export process. + Currently, only a simple check is performed on whether the tablet versions are consistent during export. It is recommended not to perform data import operations on the table during the export process. * Export Timeout - If the exported data volume is very large and exceeds the export timeout, the Export task will fail. In such cases, you can specify the `timeout` parameter in the Export command to increase the timeout and retry the Export command. + If the amount of exported data is very large and exceeds the export timeout period, the Export task will fail. At this time, you can specify the `timeout` parameter in the Export command to increase the timeout period and retry the Export command. * Export Failure - If the FE restarts or switches masters during the execution of an Export job, the Export job will fail, and the user will need to resubmit it. You can check the status of Export tasks using the `show export` command. 
+ During the operation of the Export job, if the FE restarts or switches the master, the Export job will fail, and the user needs to resubmit it. You can check the status of the Export task through the `show export` command. -* Number of Export Partitions +* Number of Exported Partitions - An Export Job allows a maximum of 2000 partitions to be exported. You can modify this setting by adding the parameter `maximum_number_of_export_partitions` in fe.conf and restarting the FE. + The maximum number of partitions allowed to be exported by an Export Job is 2000. You can add the parameter `maximum_number_of_export_partitions` in fe.conf and restart the FE to modify this setting. * Concurrent Export - When exporting concurrently, it is important to configure the thread count and parallelism properly to fully utilize system resources and avoid performance bottlenecks. During the export process, monitor progress and performance metrics in real-time to promptly identify issues and optimize adjustments. + During concurrent export, please configure the number of threads and parallelism reasonably to make full use of system resources and avoid performance bottlenecks. During the export process, you can monitor the progress and performance indicators in real time to discover problems in time and make optimization adjustments. * Data Integrity - After the export operation is completed, it is recommended to verify the exported data for completeness and correctness to ensure data quality and integrity. - -## Appendix - -### Principles of Concurrent Export - -The underlying operation of an Export task is to execute the `SELECT INTO OUTFILE` SQL statement. When a user initiates an Export task, Doris constructs one or more `SELECT INTO OUTFILE` execution plans based on the table to be exported, and then submits these `SELECT INTO OUTFILE` execution plans to Doris's Job Schedule task scheduler, which automatically schedules and executes these tasks. - -By default, Export tasks are executed single-threaded. To improve export efficiency, the Export command can set a `parallelism` parameter to export data concurrently. When `parallelism` is set to a value greater than 1, the Export task will use multiple threads to execute the `SELECT INTO OUTFILE` query plans concurrently. The `parallelism` parameter essentially specifies the number of threads to execute the EXPORT job. + After the export operation is completed, it is recommended to verify whether the exported data is complete and correct to ensure the quality and integrity of the data. -The specific logic of constructing one or more `SELECT INTO OUTFILE` execution plans for an Export task is as follows: +* The specific logic for an Export task to construct one or more `SELECT INTO OUTFILE` execution plans is as follows: -1. Select the consistency model for exporting data + 1. Select the Consistency Model of Exported Data - Based on the `data_consistency` parameter to determine the consistency of the export, which is only related to semantics and not concurrency. Users should first choose a consistency model based on their own requirements. + The consistency of export is determined according to the `data_consistency` parameter. This is only related to semantics and has nothing to do with the degree of concurrency. Users should first select a consistency model according to their own requirements. -2. Determine the Degree of Parallelism + 2. 
Determine the Degree of Concurrency - Determine how many threads will run the `SELECT INTO OUTFILE` execution plan based on the `parallelism` parameter. The `parallelism` parameter determines the maximum number of threads possible. + Determine the number of threads to run these `SELECT INTO OUTFILE` execution plans according to the `parallelism` parameter. The `parallelism` parameter determines the maximum possible number of threads. - > Note: Even if the Export command sets the `parallelism` parameter, the actual number of concurrent threads for the Export task depends on the Job Schedule. When an Export task sets a higher concurrency, each concurrent thread is provided by the Job Schedule. Therefore, if the Doris system tasks are busy and the Job Schedule's thread resources are tight, the actual number of threads assigned to the Export task may not reach the specified `parallelism` number, affecting the concurrent export of the Export task. To mitigate this issue, you can reduce system load or adjust the FE configuration `async_task_consumer_thread_num` to increase the total thread count of the Job Schedule. + > Note: Even if the Export command sets the `parallelism` parameter, the actual number of concurrent threads of the Export task is also related to Job Schedule. After setting multiple concurrency for the Export task, each concurrent thread is provided by Job Schedule. Therefore, if the Doris system tasks are busy at this time and the thread resources of Job Schedule are tight, the actual number of threads allocated to the Export task may not reach the `parallelism` number, which will affect the concurrent export of Export. At this time, you can alleviate this problem by reducing the system load or adjusting the FE configuration `async_task_consumer_thread_num` to increase the total number of threads of Job Schedule. -3. Determine the Workload of Each `outfile` Statement + 3. Determine the Task Amount of Each Outfile Statement - Each thread will determine how many `outfile` statements to split based on `maximum_tablets_of_outfile_in_export` and the actual number of partitions / buckets in the data. + Each thread will decide how many outfiles to split into according to `maximum_tablets_of_outfile_in_export` and the actual number of partitions/buckets of the data. - > `maximum_tablets_of_outfile_in_export` is a configuration in the FE with a default value of 10. This parameter specifies the maximum number of partitions / buckets allowed in a single OutFile statement generated by an Export task. Modifying this configuration requires restarting the FE. + > `maximum_tablets_of_outfile_in_export` is an FE configuration with a default value of 10. This parameter is used to specify the maximum number of partitions/buckets allowed in a single OutFile statement split by the Export task. You need to restart the FE to modify this configuration. - Example: Suppose a table has a total of 20 partitions, each partition has 5 buckets, resulting in a total of 100 buckets. Set `data_consistency = none` and `maximum_tablets_of_outfile_in_export = 10`. + Example: Suppose a table has a total of 20 partitions, and each partition has 5 buckets, then the table has a total of 100 buckets. Set `data_consistency = none` and `maximum_tablets_of_outfile_in_export = 10`. - 1. Scenario with `parallelism = 5` + 1. In the case of `parallelism = 5` - The Export task will divide the 100 buckets of the table into 5 parts, with each thread responsible for 20 buckets. 
Each thread's 20 buckets will be further divided into 2 groups of 10 buckets each, with each group handled by an outfile query plan. Therefore, the Export task will have 5 threads executing concurrently, with each thread handling 2 outfile statements that are executed serially. + The Export task will divide the 100 buckets of the table into 5 parts, and each thread is responsible for 20 buckets. The 20 buckets responsible by each thread will be divided into 2 groups of 10 buckets each, and each group of buckets is responsible by one outfile query plan. So, finally, the Export task has 5 threads executing concurrently, each thread is responsible for 2 outfile statements, and the outfile statements responsible by each thread are executed serially. - 2. Scenario with `parallelism = 3` + 2. In the case of `parallelism = 3` - The Export task will divide the 100 buckets of the table into 3 parts, with 3 threads responsible for 34, 33, and 33 buckets respectively. Each thread's buckets will be further divided into 4 groups of 10 buckets each (the last group may have fewer than 10 buckets), with each group handled by an outfile query plan. Therefore, the Export task will have 3 threads executing concurrently, with each thread handling 4 outfile statements that are executed serially. + The Export task will divide the 100 buckets of the table into 3 parts, and the 3 threads are responsible for 34, 33, and 33 buckets respectively. The buckets responsible by each thread will be divided into 4 groups of 10 buckets each (the last group has less than 10 buckets), and each group of buckets is responsible by one outfile query plan. So, finally, the Export task has 3 threads executing concurrently, each thread is responsible for 4 outfile statements, and the outfile statements responsible by each thread are executed serially. - 3. Scenario with `parallelism = 120` + 3. In the case of `parallelism = 120` - Since the table has only 100 buckets, the system will force `parallelism` to be set to 100 and execute with `parallelism = 100`. The Export task will divide the 100 buckets of the table into 100 parts, with each thread responsible for 1 bucket. Each thread's 1 bucket will be further divided into 1 group of 1 bucket, with each group handled by an outfile query plan. Therefore, the Export task will have 100 threads executing concurrently, with each thread handling 1 outfile statement, where each outfile statement actually exports only 1 bucket. + Since there are only 100 buckets in the table, the system will force `parallelism` to be set to 100 and execute with `parallelism = 100`. The Export task will divide the 100 buckets of the table into 100 parts, and each thread is responsible for 1 bucket. The 1 bucket responsible by each thread will be divided into 1 group of 10 buckets (this group actually has only 1 bucket), and each group of buckets is responsible by one outfile query plan. So, finally, the Export task has 100 threads executing concurrently, each thread is responsible for 1 outfile statement, and each outfile statement actually exports only 1 bucket. -For optimal performance in the current version of Export, it is recommended to set the following parameters: +* For a better performance of Export in the current version, it is recommended to set the following parameters: -1. Enable the session variable `enable_parallel_outfile`. -2. Set the `parallelism` parameter of Export to a large value so that each thread is responsible for only one `SELECT INTO OUTFILE` query plan. -3. 
Set the FE configuration `maximum_tablets_of_outfile_in_export` to a small value to export a smaller amount of data for each `SELECT INTO OUTFILE` query plan. + 1. Open the session variable `enable_parallel_outfile`. + 2. Set the `parallelism` parameter of Export to a larger value so that each thread is only responsible for one `SELECT INTO OUTFILE` query plan. + 3. Set the FE configuration `maximum_tablets_of_outfile_in_export` to a smaller value so that the amount of data exported by each `SELECT INTO OUTFILE` query plan is smaller. diff --git a/versioned_docs/version-3.0/data-operate/export/outfile.md b/versioned_docs/version-3.0/data-operate/export/outfile.md index cb0eaa14beb2d..170769639e5cc 100644 --- a/versioned_docs/version-3.0/data-operate/export/outfile.md +++ b/versioned_docs/version-3.0/data-operate/export/outfile.md @@ -24,57 +24,59 @@ specific language governing permissions and limitations under the License. --> -This document introduces how to use the `SELECT INTO OUTFILE` command to export query results. +This document will introduce how to use the `SELECT INTO OUTFILE` command to export query results. -For a detailed introduction to the `SELECT INTO OUTFILE` command, refer to: [SELECT INTO OUTFILE](../../sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md). +The `SELECT INTO OUTFILE` command exports the result data of the `SELECT` part to the target storage system in the specified file format, including object storage, HDFS, or the local file system. -## Overview +The `SELECT INTO OUTFILE` is a synchronous command. When the command returns, it means that the export is completed. If the export is successful, information such as the number, size, and path of the exported files will be returned. If the export fails, an error message will be returned. -The `SELECT INTO OUTFILE` command exports the result data of the `SELECT` statement to a target storage system, such as object storage, HDFS, or the local file system, in a specified file format. +For information on how to choose between `SELECT INTO OUTFILE` and `EXPORT`, please refer to [Export Overview](./export-overview.md). -`SELECT INTO OUTFILE` is a synchronous command, meaning it completes when the command returns. If successful, it returns information about the number, size, and paths of the exported files. If it fails, it returns error information. +For a detailed introduction to the `SELECT INTO OUTFILE` command, please refer to: [SELECT INTO OUTFILE](../../sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md) -For guidance on choosing between `SELECT INTO OUTFILE` and `EXPORT`, see the [Export Overview](./export-overview.md). +-------------- -### Supported Export Formats +## Usage Scenarios -`SELECT INTO OUTFILE` currently supports the following export formats: +The `SELECT INTO OUTFILE` is applicable to the following scenarios: +- When the exported data needs to go through complex calculation logics, such as filtering, aggregation, and joining. +- For scenarios suitable for executing synchronous tasks. -- Parquet -- ORC -- CSV -- CSV with column names (`csv_with_names`) -- CSV with column names and types (`csv_with_names_and_types`) - -Compressed formats are not supported. +When using the `SELECT INTO OUTFILE`, the following limitations should be noted: +- It does not support exporting data in compressed formats. +- The pipeline engine in version 2.1 does not support concurrent exports. 
+- If you want to export data to the local file system, you need to add the configuration `enable_outfile_to_local = true` in the `fe.conf` file and then restart the FE. -### Example -```sql -mysql> SELECT * FROM tbl1 LIMIT 10 INTO OUTFILE "file:///home/work/path/result_"; -+------------+-----------+----------+--------------------------------------------------------------------+ -| FileNumber | TotalRows | FileSize | URL | -+------------+-----------+----------+--------------------------------------------------------------------+ -| 1 | 2 | 8 | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ | -+------------+-----------+----------+--------------------------------------------------------------------+ -``` +## Basic Principles +The `SELECT INTO OUTFILE` function essentially executes an SQL query command, and its principle is basically the same as that of an ordinary query. The only difference is that an ordinary query outputs the final query result set to the MySQL client, while the `SELECT INTO OUTFILE` outputs the final query result set to an external storage medium. -Explanation of the returned results: +The principle of concurrent export for `SELECT INTO OUTFILE` is to divide large-scale data sets into small pieces and process them in parallel on multiple nodes. In scenarios where concurrent export is possible, exports are carried out in parallel on multiple BE nodes, with each BE handling a part of the result set. -- **FileNumber**: The number of generated files. -- **TotalRows**: The number of rows in the result set. -- **FileSize**: The total size of the exported files in bytes. -- **URL**: The prefix of the exported file paths. Multiple files will be numbered sequentially with suffixes `_0`, `_1`, etc. +## Quick Start +### Create Tables and Import Data -## Export File Column Type Mapping +```sql +CREATE TABLE IF NOT EXISTS tbl ( + `c1` int(11) NULL, + `c2` string NULL, + `c3` bigint NULL +) +DISTRIBUTED BY HASH(c1) BUCKETS 20 +PROPERTIES("replication_num" = "1"); -`SELECT INTO OUTFILE` supports exporting to Parquet and ORC file formats. Parquet and ORC have their own data types, and Doris can automatically map its data types to corresponding Parquet and ORC data types. Refer to the "Export File Column Type Mapping" section in the [Export Overview](./export-overview.md) document for the specific mapping relationships. -## Examples +insert into tbl values + (1, 'doris', 18), + (2, 'nereids', 20), + (3, 'pipelibe', 99999), + (4, 'Apache', 122123455), + (5, null, null); +``` ### Export to HDFS -Export query results to the `hdfs://path/to/` directory, specifying the export format as PARQUET: +Export the query results to the directory `hdfs://path/to/` and specify the export format as Parquet: ```sql SELECT c1, c2, c3 FROM tbl @@ -87,7 +89,106 @@ PROPERTIES ); ``` -If HDFS is configured for high availability, provide HA information, such as: +If the HDFS cluster has high availability enabled, HA information needs to be provided. Refer to the example: [Export to an HDFS Cluster with High Availability Enabled](#high-availability-hdfs-export). + +If the HDFS cluster has both high availability enabled and Kerberos authentication enabled, Kerberos authentication information needs to be provided. Refer to the example: [Export to an HDFS Cluster with High Availability and Kerberos Authentication Enabled](#high-availability-and-kerberos-cluster-export). 
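Since `SELECT INTO OUTFILE` runs synchronously, a statement like the HDFS export above returns a summary of the files it wrote as soon as it finishes. The result row below is purely illustrative; the file count, row count, size, and URL depend on the data and the target path:

```sql
mysql> SELECT c1, c2, c3 FROM tbl
    -> INTO OUTFILE "hdfs://path/to/result_"
    -> FORMAT AS PARQUET
    -> PROPERTIES("fs.defaultFS" = "hdfs://ip:port", "hadoop.username" = "hadoop");
+------------+-----------+----------+------------------------------------------------+
| FileNumber | TotalRows | FileSize | URL                                            |
+------------+-----------+----------+------------------------------------------------+
|          1 |         5 |     1404 | hdfs://path/to/result_{fragment_instance_id}_  |
+------------+-----------+----------+------------------------------------------------+
1 row in set (0.30 sec)
```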
+
+### Export to Object Storage
+
+Export the query results to the directory `s3://path/to/` in S3 storage, specifying the export format as ORC. The `sk` (secret key), `ak` (access key), and related connection information must be provided.
+
+```sql
+SELECT * FROM tbl
+INTO OUTFILE "s3://path/to/result_"
+FORMAT AS ORC
+PROPERTIES(
+    "s3.endpoint" = "https://xxx",
+    "s3.region" = "ap-beijing",
+    "s3.access_key" = "your-ak",
+    "s3.secret_key" = "your-sk"
+);
+```
+
+### Export to the Local File System
+> If you need to export to a local file, you must add `enable_outfile_to_local = true` to `fe.conf` and restart the FE.
+
+Export the query results to the directory `file:///path/to/` on the BE, specifying the export format as CSV and the column separator as `,`.
+
+```sql
+SELECT c1, c2 FROM tbl
+INTO OUTFILE "file:///path/to/result_"
+FORMAT AS CSV
+PROPERTIES(
+    "column_separator" = ","
+);
+```
+
+> Note:
+Exporting to local files is not applicable to public cloud users; it is intended only for private deployments, and it is assumed that the user has full control over the cluster nodes. Doris does not validate the export path filled in by the user. If the Doris process user does not have write permission for the path, or the path does not exist, an error will be reported. For security reasons, if a file with the same name already exists in the path, the export will also fail.
+Doris does not manage files exported to the local file system, nor does it check disk space. Users need to manage these files themselves, including cleaning them up.
+
+### More Usage
+For a detailed introduction to the `SELECT INTO OUTFILE` command, please refer to: [SELECT INTO OUTFILE](../../sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md)
+
+## Export Instructions
+### Storage Locations for Exported Data
+`SELECT INTO OUTFILE` currently supports exporting data to the following storage locations:
+- Object storage: Amazon S3, COS, OSS, OBS, Google GCS
+- HDFS
+- Local file system
+
+### Export File Types
+`SELECT INTO OUTFILE` currently supports the following export file formats:
+- Parquet
+- ORC
+- csv
+- csv_with_names
+- csv_with_names_and_types
+
+### Column Type Mapping for Exported Files
+`SELECT INTO OUTFILE` supports exporting to the Parquet and ORC file formats. Parquet and ORC have their own data types, and Doris's export function automatically converts Doris data types to the corresponding Parquet and ORC data types.
+ +The following is a mapping table of Doris data types and data types in Parquet and ORC file formats: +| Doris Type | Arrow Type | Orc Type | +| ---------- | ---------- | -------- | +| boolean | boolean | boolean | +| tinyint | int8 | tinyint | +| smallint | int16 | smallint | +| int | int32 | int | +| bigint | int64 | bigint | +| largeInt | utf8 | string | +| date | utf8 | string | +| datev2 | Date32Type | string | +| datetime | utf8 | string | +| datetimev2 | TimestampType | timestamp | +| float | float32 | float | +| double | float64 | double | +| char / varchar / string| utf8 | string | +| decimal | decimal128 | decimal | +| struct | struct | struct | +| map | map | map | +| array | list | array | +| json | utf8 | string | +| variant | utf8 | string | +| bitmap | binary | binary | +| quantile_state| binary | binary | +| hll | binary | binary | + +> Note: When Doris exports data to the Parquet file format, it first converts the in-memory data of Doris into the Arrow in-memory data format, and then writes it out to the Parquet file format via Arrow. + +## Export Examples +- [Export to an HDFS Cluster with High Availability Enabled](#high-availability-hdfs-export) +- [Export to an HDFS Cluster with High Availability and Kerberos Authentication Enabled](#high-availability-and-kerberos-cluster-export) +- [Example of Generating a File to Mark a Successful Export](#example-of-generating-a-file-to-mark-a-successful-export) +- [Example of Concurrent Export](#example-of-concurrent-export) +- [Example of Clearing the Export Directory Before Exporting](#example-of-clearing-the-export-directory-before-exporting) +- [Example of Setting the Size of Exported Files](#example-of-setting-the-size-of-exported-files) + + + +**Export to an HDFS Cluster with High Availability Enabled** + +If the HDFS has high availability enabled, HA information needs to be provided. For example: ```sql SELECT c1, c2, c3 FROM tbl @@ -105,7 +206,10 @@ PROPERTIES ); ``` -If the Hadoop cluster is configured for high availability and Kerberos authentication is enabled, you can refer to the following SQL statement: + +**Export to an HDFS Cluster with High Availability and Kerberos Authentication Enabled** + +If the HDFS cluster has both high availability enabled and Kerberos authentication enabled, you can refer to the following SQL statements: ```sql SELECT * FROM tbl @@ -120,59 +224,24 @@ PROPERTIES "dfs.namenode.rpc-address.hacluster.n1"="192.168.0.1:8020", "dfs.namenode.rpc-address.hacluster.n2"="192.168.0.2:8020", "dfs.client.failover.proxy.provider.hacluster"="org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider", - "dfs.namenode.kerberos.principal"="hadoop/_HOST@REALM.COM", + "dfs.namenode.kerberos.principal"="hadoop/_HOST@REALM.COM" "hadoop.security.authentication"="kerberos", "hadoop.kerberos.principal"="doris_test@REALM.COM", "hadoop.kerberos.keytab"="/path/to/doris_test.keytab" ); ``` -### Export to S3 + +**Example of Generating a File to Mark a Successful Export** -Export query results to the S3 storage at `s3://path/to/` directory, specifying the export format as ORC. Provide `sk`, `ak`, and other necessary information: +The `SELECT INTO OUTFILE` command is a synchronous command. Therefore, it is possible that the task connection is disconnected during the execution of the SQL, making it impossible to know whether the exported data has ended normally or is complete. 
At this time, you can use the `success_file_name` parameter to require that a file marker be generated in the directory after a successful export. -```sql -SELECT * FROM tbl -INTO OUTFILE "s3://path/to/result_" -FORMAT AS ORC -PROPERTIES( - "s3.endpoint" = "https://xxx", - "s3.region" = "ap-beijing", - "s3.access_key"= "your-ak", - "s3.secret_key" = "your-sk" -); -``` - -### Export to Local File System -> -> To export to the local file system, add `enable_outfile_to_local=true` in `fe.conf` and restart FE. - -Export query results to the BE's `file:///path/to/` directory, specifying the export format as CSV, with a comma as the column separator: - -```sql -SELECT k1 FROM tbl1 UNION SELECT k2 FROM tbl1 -INTO OUTFILE "file:///path/to/result_" -FORMAT AS CSV -PROPERTIES( - "column_separator" = "," -); -``` - -> Note: -Exporting to local files is not suitable for public cloud users and is intended for private deployment users only. By default, users have full control over cluster nodes. Doris does not check the validity of the export path provided by the user. If the Doris process user does not have write permissions for the path, or the path does not exist, an error will be reported. Additionally, for security reasons, if a file with the same name already exists at the path, the export will fail. Doris does not manage exported local files or check disk space. Users need to manage these files themselves, including cleanup and other tasks. - -## Best Practices - -### Generate Export Success Indicator File +Similar to Hive, users can determine whether the export has ended normally and whether the files in the export directory are complete by checking whether there is a file specified by the `success_file_name` parameter in the export directory. -The `SELECT INTO OUTFILE` command is synchronous, meaning that the task connection could be interrupted during SQL execution, leaving uncertainty about whether the export completed successfully or whether the data is complete. You can use the `success_file_name` parameter to generate an indicator file upon successful export. - -Similar to Hive, users can determine whether the export completed successfully and whether the files in the export directory are complete by checking for the presence of the file specified by the `success_file_name` parameter. - -For example, exporting the results of a `SELECT` statement to Tencent Cloud COS `s3://${bucket_name}/path/my_file_`, specifying the export format as CSV, and setting the success indicator file name to `SUCCESS`: +For example: Export the query results of the `select` statement to Tencent Cloud COS: `s3://${bucket_name}/path/my_file_`. Specify the export format as `csv`. Specify the name of the file marking a successful export as `SUCCESS`. After the export is completed, a marker file will be generated. ```sql -SELECT k1, k2, v1 FROM tbl1 LIMIT 100000 +SELECT k1,k2,v1 FROM tbl1 LIMIT 100000 INTO OUTFILE "s3://my_bucket/path/my_file_" FORMAT AS CSV PROPERTIES @@ -187,21 +256,55 @@ PROPERTIES ) ``` -Upon completion, an additional file named `SUCCESS` will be generated. +After the export is completed, an additional file named `SUCCESS` will be written. + + +**Example of Concurrent Export** + +By default, the query results of the `SELECT` part will first be aggregated to a certain BE node, and this node will export the data in a single thread. 
However, in some cases, such as for query statements without an `ORDER BY` clause, concurrent exports can be enabled, allowing multiple BE nodes to export data simultaneously to improve export performance. + +However, not all SQL query statements can be exported concurrently. Whether a query statement can be exported concurrently can be determined through the following steps: -### Concurrent Export +* Make sure that the session variable is enabled: `set enable_parallel_outfile = true;` +* Check the execution plan via `EXPLAIN` -By default, the query results in the `SELECT` section are aggregated to a single BE node, which exports data single-threadedly. However, in some cases (e.g., queries without an `ORDER BY` clause), concurrent export can be enabled to have multiple BE nodes export data simultaneously, improving export performance. +```sql +mysql> EXPLAIN SELECT ... INTO OUTFILE "s3://xxx" ...; ++-----------------------------------------------------------------------------+ +| Explain String | ++-----------------------------------------------------------------------------+ +| PLAN FRAGMENT 0 | +| OUTPUT EXPRS: | | | | +| PARTITION: UNPARTITIONED | +| | +| RESULT SINK | +| | +| 1:EXCHANGE | +| | +| PLAN FRAGMENT 1 | +| OUTPUT EXPRS:`k1` + `k2` | +| PARTITION: HASH_PARTITIONED: `default_cluster:test`.`multi_tablet`.`k1` | +| | +| RESULT FILE SINK | +| FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951_ | +| STORAGE TYPE: S3 | +| | +| 0:OlapScanNode | +| TABLE: multi_tablet | ++-----------------------------------------------------------------------------+ +``` + +The `EXPLAIN` command will return the query plan of the statement. By observing the query plan, if `RESULT FILE SINK` appears in `PLAN FRAGMENT 1`, it indicates that the query statement can be exported concurrently. If `RESULT FILE SINK` appears in `PLAN FRAGMENT 0`, it means that the current query cannot be exported concurrently. -Here’s an example demonstrating how to enable concurrent export: +Next, we will demonstrate how to correctly enable the concurrent export function through an example: -1. Enable the concurrent export session variable: +1. Open the concurrent export session variable ```sql mysql> SET enable_parallel_outfile = true; ``` -2. Execute the export command: +2. Execute the export command ```sql mysql> SELECT * FROM demo.tbl @@ -221,9 +324,9 @@ mysql> SELECT * FROM demo.tbl +------------+-----------+----------+-------------------------------------------------------------------------------+ ``` -With concurrent export successfully enabled, the result may consist of multiple rows, indicating that multiple threads exported data concurrently. +It can be seen that after enabling and successfully triggering the concurrent export function, the returned result may consist of multiple lines, indicating that there are multiple threads exporting concurrently. -Adding an `ORDER BY` clause to the query prevents concurrent export, as the top-level sorting node necessitates single-threaded export: +If we modify the above statement, that is, add an `ORDER BY` clause to the query statement. 
Since the query statement has a top-level sorting node, even if the concurrent export function is enabled, this query cannot be exported concurrently: ```sql mysql> SELECT * FROM demo.tbl ORDER BY id @@ -236,11 +339,10 @@ mysql> SELECT * FROM demo.tbl ORDER BY id +------------+-----------+----------+-------------------------------------------------------------------------------+ ``` -Here, the result is a single row, indicating no concurrent export was triggered. +It can be seen that there is only one final result line, and concurrent export has not been triggered. -Refer to the appendix for more details on concurrent export principles. - -### Clear Export Directory Before Exporting + +**Example of Clearing the Export Directory Before Exporting** ```sql SELECT * FROM tbl1 @@ -258,12 +360,12 @@ PROPERTIES ) ``` -If `"delete_existing_files" = "true"` is set, the export job will first delete all files and directories under `s3://my_bucket/export/`, then export data to that directory. +If `"delete_existing_files" = "true"` is set, the export job will first delete all files and directories under the `s3://my_bucket/export/` directory, and then export data to this directory. -> Note: -To use the `delete_existing_files` parameter, add `enable_delete_existing_files = true` to `fe.conf` and restart FE. This parameter is potentially dangerous and should only be used in a testing environment. +> Note: To use the `delete_existing_files` parameter, you also need to add the configuration `enable_delete_existing_files = true` in `fe.conf` and restart the `fe`, then `delete_existing_files` will take effect. `delete_existing_files = true` is a dangerous operation and it is recommended to use it only in a test environment. -### Set Export File Size + +**Example of Setting the Size of Exported Files** ```sql SELECT * FROM tbl @@ -278,69 +380,25 @@ PROPERTIES( ); ``` -Specifying `"max_file_size" = "2048MB"` ensures that the final file size does not exceed 2GB. If the total size exceeds 2GB, multiple files will be generated. +Since `"max_file_size" = "2048MB"` is specified, if the final generated file is not larger than 2GB, there will be only one file. If it is larger than 2GB, there will be multiple files. -## Considerations +## Precautions + +- Export Data Volume and Export Efficiency +The `SELECT INTO OUTFILE` function essentially executes an SQL query command. If concurrent export is not enabled, the query result is exported by a single BE node in a single thread. Therefore, the total export time includes the time consumed by the query itself and the time required to write out the final result set. Enabling concurrent export can reduce the export time. -- Export Data Volume and Efficiency - The `SELECT INTO OUTFILE` function executes a SQL query. Without concurrent export, a single BE node and thread export the query results. The total export time includes both the query execution time and the result set write-out time. Enabling concurrent export can reduce the export time. - Export Timeout - The export command shares the same timeout as the query. If the data volume is large and causes the export to timeout, you can extend the query timeout by setting the session variable `query_timeout`. -- Export File Management - Doris does not manage exported files, whether successfully exported or remaining from failed exports. Users must handle these files themselves. - Additionally, `SELECT INTO OUTFILE` does not check for the existence of files or file paths. 
Whether `SELECT INTO OUTFILE` automatically creates paths or overwrites existing files depends entirely on the semantics of the remote storage system. -- Empty Result Sets - Exporting an empty result set still generates an empty file. +The timeout time of the export command is the same as that of the query. If the data volume is large and causes the export data to time out, you can set the session variable `query_timeout` to appropriately extend the query timeout. + +- Management of Exported Files +Doris does not manage the exported files. Whether they are successfully exported or residual files after a failed export, users need to handle them by themselves. +In addition, the `SELECT INTO OUTFILE` command does not check whether the file and file path exist. Whether `SELECT INTO OUTFILE` will automatically create a path or overwrite an existing file is completely determined by the semantics of the remote storage system. + +- If the Query Result Set Is Empty +For an export with an empty result set, an empty file will still be generated. + - File Splitting - File splitting ensures that a single row of data is stored completely in one file. Thus, the file size may not exactly equal `max_file_size`. -- Non-visible Character Functions - For functions outputting non-visible characters (e.g., BITMAP, HLL types), CSV output is `\N`, and Parquet/ORC output is NULL. - Currently, some geographic functions like `ST_Point` output VARCHAR but with encoded binary characters, causing garbled output. Use `ST_AsText` for geographic functions. - -## Appendix - -### Concurrent Export Principles - -- Principle Overview - - Doris is a high-performance, real-time analytical database based on the MPP (Massively Parallel Processing) architecture. MPP divides large datasets into small chunks and processes them in parallel across multiple nodes. - Concurrent export in `SELECT INTO OUTFILE` leverages this parallel processing capability, allowing multiple BE nodes to export parts of the result set simultaneously. - -- How to Determine Concurrent Export Eligibility - - - Ensure Session Variable is Enabled: `set enable_parallel_outfile = true;` - - Check Execution Plan with `EXPLAIN`: - - ```sql - mysql> EXPLAIN SELECT ... INTO OUTFILE "s3://xxx" ...; - +-----------------------------------------------------------------------------+ - | Explain String | - +-----------------------------------------------------------------------------+ - | PLAN FRAGMENT 0 | - | OUTPUT EXPRS: | | | | - | PARTITION: UNPARTITIONED | - | | - | RESULT SINK | - | | - | 1:EXCHANGE | - | | - | PLAN FRAGMENT 1 | - | OUTPUT EXPRS:`k1` - - + `k2` | - | PARTITION: HASH_PARTITIONED: `default_cluster:test`.`multi_tablet`.`k1` | - | | - | RESULT FILE SINK | - | FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951_ | - | STORAGE TYPE: S3 | - | | - | 0:OlapScanNode | - | TABLE: multi_tablet | - +-----------------------------------------------------------------------------+ - ``` - - The `EXPLAIN` command returns the query plan. If `RESULT FILE SINK` appears in `PLAN FRAGMENT 1`, the query can be exported concurrently. If it appears in `PLAN FRAGMENT 0`, concurrent export is not possible. - -- Export Concurrency - - When concurrent export conditions are met, the export task's concurrency is determined by: `BE nodes * parallel_fragment_exec_instance_num`. +File splitting ensures that a row of data is stored completely in a single file. Therefore, the file size is not strictly equal to `max_file_size`. 
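As a rough illustration of the file-splitting behavior (the table name, bucket, and the 512 MB cap below are hypothetical, and the S3 properties are placeholders in the same style as the earlier examples), each generated file ends on a row boundary, so its size only approximates the cap:

```sql
-- Hypothetical size-capped export: if the result set exceeds max_file_size,
-- Doris writes several files; each one stops at a whole row, so the files
-- land near, but not exactly at, 512MB.
SELECT * FROM demo.tbl
INTO OUTFILE "s3://my_bucket/split_demo/result_"
FORMAT AS CSV
PROPERTIES(
    "max_file_size" = "512MB",
    "s3.endpoint" = "https://xxx",
    "s3.region" = "ap-beijing",
    "s3.access_key" = "your-ak",
    "s3.secret_key" = "your-sk"
);
```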
+ +- Functions for Non-Visible Characters +For some functions that output non-visible characters, such as BITMAP and HLL types, when exporting to the CSV file format, the output is `\N`. diff --git a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md index 6ec6f3402a348..c0476b43a8cfd 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md +++ b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md @@ -129,6 +129,13 @@ INTO OUTFILE "file_path" INTO OUTFILE "file:///home/work/path/result_"; ``` +#### Explanation of the returned results: + +- **FileNumber**: The number of generated files. +- **TotalRows**: The number of rows in the result set. +- **FileSize**: The total size of the exported files in bytes. +- **URL**: The prefix of the exported file paths. Multiple files will be numbered sequentially with suffixes `_0`, `_1`, etc. + #### DataType Mapping Parquet and ORC file formats have their own data types. The export function of Doris can automatically export the Doris data types to the corresponding data types of the Parquet/ORC file format. The following are the data type mapping relationship of the Doris data types and the Parquet/ORC file format data types: diff --git a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md index 48a80c369ecdf..fb264e966f75b 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md +++ b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/SHOW-EXPORT.md @@ -51,6 +51,35 @@ illustrate: 3. You can use ORDER BY to sort any combination of columns 4. If LIMIT is specified, limit matching records are displayed. Otherwise show all +The meaning of each column in the result returned by the `show export` command is as follows: + +- JobId: The unique ID of the job +- Label: The label of the export job. If not specified in the export, the system will generate one by default. +- State: Job status: + - PENDING: Job pending scheduling + - EXPORTING: Data export in progress + - FINISHED: Job successful + - CANCELLED: Job failed +- Progress: Job progress. This progress is based on query plans. For example, if there are a total of 10 threads and 3 have been completed, the progress is 30%. +- TaskInfo: Job information displayed in JSON format: + - db: Database name + - tbl: Table name + - partitions: Specified partitions for export. An empty list indicates all partitions. + - column\_separator: Column separator for the export file. + - line\_delimiter: Line delimiter for the export file. + - tablet num: Total number of tablets involved. + - broker: Name of the broker used. + - coord num: Number of query plans. + - max\_file\_size: Maximum size of an export file. + - delete\_existing\_files: Whether to delete existing files and directories in the export directory. + - columns: Specified column names to export, empty value represents exporting all columns. + - format: File format for export +- Path: Export path on the remote storage. +- `CreateTime/StartTime/FinishTime`: Job creation time, scheduling start time, and end time. +- Timeout: Job timeout time in seconds. 
This time is calculated from CreateTime. +- ErrorMsg: If there is an error in the job, the error reason will be displayed here. +- OutfileInfo: If the job is successfully exported, specific `SELECT INTO OUTFILE` result information will be displayed here. + ## Example 1. Show all export tasks of default db