diff --git a/230. Cloud/Azure/11. Data Tools/Azure Data Factory/README.md b/230. Cloud/Azure/11. Data Tools/Azure Data Factory/README.md
index 9257302e..023653de 100644
--- a/230. Cloud/Azure/11. Data Tools/Azure Data Factory/README.md
+++ b/230. Cloud/Azure/11. Data Tools/Azure Data Factory/README.md
@@ -1,29 +1,133 @@
-## Introduction
+## 1. Introduction
-Azure Data Factory (ADF) is a scale-out, serverless Azure service for data-related work.[^1]
+Azure Data Factory (ADF) is a scale-out, serverless data integration and migration service.[["]](https://docs.microsoft.com/en-us/azure/data-factory/) It covers three main areas:
-- Data integration: the ability to combine data from different sources.[^3]
-- Data transformation: the ability to convert data from one format to another.[^4]
-- SSIS (SQL Server Integration Services): copying or downloading files, loading data warehouses, cleansing and mining data, and managing SQL Server objects and data.[^2]
+- Data integration: the ability to combine data from different sources.[["]](https://en.wikipedia.org/wiki/Data_integration)
+- Data transformation: the ability to convert data from one format to another.[["]](https://en.wikipedia.org/wiki/Data_transformation)
+- SSIS (SQL Server Integration Services): copying or downloading files, loading data warehouses, cleansing and mining data, and managing SQL Server objects and data. [["]](https://docs.microsoft.com/zh-cn/sql/integration-services/sql-server-integration-services?view=sql-server-ver15)
-ADF has distinct versions, so when searching on StackOverflow, check whether the tag carries v2. The AWS counterpart of ADF is [AWS Data Pipeline](https://aws.amazon.com/cn/datapipeline).
+## 2. Features
+General:
+- **Execute Pipeline**: runs a pipeline. The monitor shows the pipeline's input parameters and lets you rerun it; keep this in mind when defining a pipeline.
+- Arrays (upper limit 100,000) [["]](https://learn.microsoft.com/en-us/azure/data-factory/control-flow-for-each-activity)
-## See also
+  - **[ForEach](https://learn.microsoft.com/en-us/azure/data-factory/control-flow-for-each-activity)**: a loop. Parallelism is capped at 50, and ForEach cannot be nested, but nesting can be worked around with a child pipeline.[["]](https://learn.microsoft.com/en-us/fabric/data-factory/data-factory-limitations#data-pipeline-resource-limits)
+  - ...
+- Get Metadata: retrieves a file's metadata.
+- Lookup: reads data through a dataset. Output is limited to 4 MB and 5,000 rows.
+  - Workaround: if the data source has an index, page through it in a loop or with a small utility (see the paging sketch appended at the end of this document). (💡The [official workarounds](https://docs.microsoft.com/en-us/azure/data-factory/control-flow-lookup-activity#limitations-and-workarounds) are too vague to be usable.)
+- Web: HTTP operations
+- Webhook
+
+Data operations:
+
+- Copy Activity: copies data.
+- Data Flow: copies and transforms data; more complex than Copy Activity.
+
+### 2.1. Copy Activity
+
+CosmosDB:
+
+- Prefer migrating DB => Storage => DB (direct DB => DB copies often fail).
+- Keep the write batch size at 1: since `Request Size = Single Document Size * Write Batch Size` [["]](https://learn.microsoft.com/en-us/answers/questions/69129/copy-from-cosmosdb-to-cosmosdb-error-34-request-si), a batch size set too high can hit the Cosmos DB 2 MB request-size limit (see the batch-size sketch appended at the end of this document).
+- A single Cosmos DB document is capped at 2 MB; Copy Activity only manages about 1.7 MB (reason unknown).
+- Data Flow can insert 2 MB documents, but then throws odd errors.
+- The lower the parallelism setting, the less throughput is consumed.[["]](https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance-features#parallel-copy)
+- DIU: compute power.[["]](https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance-features#data-integration-units)
+- Performance tuning: [Performance tuning steps](https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance#performance-tuning-steps)
+- [Data mapping](https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping) includes operations such as the flatten transformation.
+  - Flatten unrolls a list nested inside one record into multiple records.
+- `validateDataConsistency`, when enabled, verifies consistency.[["]](https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-data-consistency)
+
+### 2.2. Data flow
+
+Data flow is used for data transformation.
+
+1. Data flows are generally used to transform databases and large files; HTTP sources usually rate-limit requests per minute.
+2. 
Data flow 不是用于备份数据,从 Data flow 中导入后,数据可能会有损失(Boolean=>String,integer=>String) + +[官网](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-transformation-overview)提供了以下工具进行数据转换。工具以下概念相关 + +- stream +- MS SQL + +| Name | Category | Description | +| :----------------------------------------------------------- | :---------------------- | :----------------------------------------------------------- | +| [Aggregate](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-aggregate) | Schema modifier | Define different types of aggregations such as SUM, MIN, MAX, and COUNT grouped by existing or computed columns. | +| [Alter row](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-alter-row) | Row modifier | Set insert, delete, update, and upsert policies on rows. | +| [Conditional split](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-conditional-split) | Multiple inputs/outputs | Route rows of data to different streams based on matching conditions. | +| [Derived column](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-derived-column) | Schema modifier | generate new columns or modify existing fields using the data flow expression language. | +| [Exists](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-exists) | Multiple inputs/outputs | Check whether your data exists in another source or stream. | +| [Filter](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-filter) | Row modifier | Filter a row based upon a condition. | +| [Flatten](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-flatten) | Schema modifier | Take array values inside hierarchical structures such as JSON and unroll them into individual rows. | +| [Join](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-join) | Multiple inputs/outputs | Combine data from two sources or streams. | +| [Lookup](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-lookup) | Multiple inputs/outputs | Reference data from another source. | +| [New branch](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-new-branch) | Multiple inputs/outputs | Apply multiple sets of operations and transformations against the same data stream. | +| [Parse](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-new-branch) | Formatter | Parse text columns in your data stream that are strings of JSON, delimited text, or XML formatted text. | +| [Pivot](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-pivot) | Schema modifier | An aggregation where one or more grouping columns has its distinct row values transformed into individual columns. 
| +| [Rank](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-rank) | Schema modifier | Generate an ordered ranking based upon sort conditions | +| [Select](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-select) | Schema modifier | Alias columns and stream names, and drop or reorder columns | +| [Sink](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-sink) | - | A final destination for your data | +| [Sort](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-sort) | Row modifier | Sort incoming rows on the current data stream | +| [Source](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-source) | - | A data source for the data flow | +| [Surrogate key](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-surrogate-key) | Schema modifier | Add an incrementing non-business arbitrary key value | +| [Union](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-union) | Multiple inputs/outputs | Combine multiple data streams vertically | +| [Unpivot](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-unpivot) | Schema modifier | Pivot columns into row values | +| [Window](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-window) | Schema modifier | Define window-based aggregations of columns in your data streams. | +| [Parse](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-parse) | Schema modifier | Parse column data to Json or delimited text | + + + + + + + +## 3. 监控 + +方式包含: + +- alert 通知 +- 查看配置所配置日志路径 +- 查看 dashboard + + + + + + + + + + + +## 豆知識 + +- ADF 有版本区分,因此在 StackOverflow 上搜索,需要注意看标签是否带 v2。ADF 对标 AWS 的是 [AWS Data Pipeline](https://aws.amazon.com/cn/datapipeline)。 - [ADF 反馈网站](https://feedback.azure.com/d365community/forum/1219ec2d-6c26-ec11-b6e6-000d3a4f032c#) +- [ADF 模版文件](https://learn.microsoft.com/en-us/azure/data-factory/solution-templates-introduction) + + -[^1]: [Azure Data Factory documentation](https://docs.microsoft.com/en-us/azure/data-factory/) -[^2]: [SQL Server Integration Services](https://docs.microsoft.com/zh-cn/sql/integration-services/sql-server-integration-services?view=sql-server-ver15) -[^3]: [Data integration - Wikipedia](https://en.wikipedia.org/wiki/Data_integration) -[^4]:[Data transformation - Wikipedia](https://en.wikipedia.org/wiki/Data_transformation) diff --git "a/230. Cloud/Azure/11. Data Tools/Azure Data Factory/\345\212\237\350\203\275\344\270\200\346\240\217.md" "b/230. Cloud/Azure/11. Data Tools/Azure Data Factory/\345\212\237\350\203\275\344\270\200\346\240\217.md" deleted file mode 100644 index 8b4d3e0d..00000000 --- "a/230. Cloud/Azure/11. Data Tools/Azure Data Factory/\345\212\237\350\203\275\344\270\200\346\240\217.md" +++ /dev/null @@ -1,242 +0,0 @@ - - -## 数据工厂限制🚫 - -数据工厂是多租户服务 (multitenant service)[^9] ,因此具有上限。具体参考[官网](https://docs.microsoft.com/en-US/azure/azure-resource-manager/management/azure-subscription-service-limits#data-factory-limits)。下面举一些例子 - -- ForEach 并行数 ≤ 50 - -- linked service ≤ 3000 - - 当超过上限时,将会抛出以下类似的错误异常 - - >There are substantial concurrent copy activity executions which is causing failures due to throttling under subscription xxxx, region jpe and limitation 3000. Please reduce the concurrent executions. 
For limits, refer - - 经过实验,可以同时启动 1500 个 Copy Activity,*也许*是因为每一个 Copy Activity 有 2 个 Linked Service。 - -## Copy Activity - -### 概念 - -source: 数据源 - -[sink](https://en.wikipedia.org/wiki/Sink_(computing)): 接收器 (原意: 水槽,洗碗槽) - -Hierarchical 分层:JSON、XML、NoSQL - -tabular : 表格(excel、关系数据库) - -### 性能 - -概念📙 - -DIU (Data Integration Unit) [^1]这是 Azure云 特有的概念,介绍的文档比较少且模糊不清,笔者认为应解释为 "单位时间内,CPU、内存、网络资源分配等消耗的时间" - -策略♞ - -- [For Each ](https://docs.microsoft.com/en-us/azure/data-factory/control-flow-for-each-activity) 拆分需要拷贝的数据,并行执行。 -- Copy Activity 的性能 - ![监视复制活动运行详细信息](/assets/blog_res/azure/monitor-copy-activity-run-details.png) - 1. Azure 提供了[性能优化 (performance tuning) 提示](https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance-troubleshooting)功能 - - [并行数的调优](https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance-features#parallel-copy) - - [颗粒大小的调优](https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance-features#data-integration-units) - 2. **Duration** 的内容常为优化的对象。[^3] - 3. [暂存 (staging) 功能](https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance-features#staged-copy) (Specify whether to copy data via an interim staging store. Enable staging only for the beneficial scenarios, e.g. load data into Azure Synapse Analytics via PolyBase, load data to/from Snowflake, load data from Amazon Redshift via UNLOAD or from HDFS via DistCp, etc.[Learn more](https://go.microsoft.com/fwlink/?linkid=2159335)) - -测试步骤🧪[^7] - -1. 选择大数据 -2. 输入 [Data Integration Units (DIU)](https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance#data-integration-units) 和 [parallel copy](https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance#parallel-copy),并不断调试,最终获取最优数值 -3. 拆分需要拷贝的数据,并聚合结果。以下是官方的模板: - - [Copy files from multiple containers](https://docs.microsoft.com/en-us/azure/data-factory/solution-template-copy-files-multiple-containers) - - [Migrate data from Amazon S3 to ADLS Gen2](https://docs.microsoft.com/en-us/azure/data-factory/solution-template-migration-s3-azure) - - [Bulk copy with a control table](https://docs.microsoft.com/en-us/azure/data-factory/solution-template-bulk-copy-with-control-table) - - - -### Schema映射 Schema Mapping - -Copy Activity 有一系列默认的映射策略。而配置显式映射 (Explicit mapping) 时,需加注意,不同的 source-sink 组合配置的方式是不同。[^2] - -![从表格映射到表格](/assets/blog_res/azure/map-tabular-to-tabular.png) - -Mapping 支持 *Fatten* 操作,可以讲一个 array 扁平化。这方便 JSON 转换成 table - -![使用 UI 从分层映射到表格](/assets/blog_res/azure/map-hierarchical-to-tabular-ui.png) - - - - - - - -### 数据一致性验证 Data consistency verification - -Copy Activity 提供了数据
一致性验证
。通过 `validateDataConsistency` 启动该校验。[^5] - -校验的*对象*以及*策略*♘ - -- 二进制对象:file size, lastModifiedDate, MD5 checksum -- 表格数据(tabular data):` 读取的行数 = 复制的行数 + 跳过的行数` - -*什么时候发生?*📅[^4] - -- 主键重复 -- 作为 source 的二进制文件不能访问、被删除 - -当数据发生 *不一致性*⚠️时,可以通过 `dataInconsistency` 设置行为 - -- 中止 -- 跳跃 - -在设定 `logSettings` 和 `path` 可以记录 *不一致* 时候的日志。 - -### 监控·容错·测试 Monitor·Fault tolerance·Test - -💿数据不一致 - -当 *不允许数据不一致* 那么 Copy Activity 将重试或者中止。中止时,pipeline 将以失败的形式返回,此时可以 - -1. 发送邮件通知 -2. 定期查看 监控 (monitor) 情况 - -当 *允许数据不一致* 时,可以监控以下数据,并根据所得数据进行下一步策略下一步策略。[^4] - -- activity结果 (`@activity('Copy data').output`) [^6] -- 日志文件 - -📏测试 - -可通过来回复制进行数据校验进行实现,示例如下: - -1. 备份 数据库-1 至 Azure Blob Storage -2. Azure Blob Storage 将备份数据恢复至 数据库-2 -3. 数据库-1 和 数据库-2 的数据进行一一比较。 - -目的: 数据在传输中是否有不可预料损失和变形。 - -📝特殊需求 - -监控 Copy Activity 的运行时长,当时长过长时,发送监控信息至运维人员。[^6] - -### 其他 - -- 压缩功能 - - - -## Data flow - -Data flow 用于数据转换。 - -1. Data flow 一般用于对数据库、大文件进行转换,HTTP协议 一般会限制每分钟访问的速率。 -2. Data flow 不是用于备份数据,从 Data flow 中导入后,数据可能会有损失(Boolean=>String,integer=>String) - -[官网](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-transformation-overview)提供了以下工具进行数据转换。工具以下概念相关 - -- stream -- MS SQL - -| Name | Category | Description | -| :----------------------------------------------------------- | :---------------------- | :----------------------------------------------------------- | -| [Aggregate](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-aggregate) | Schema modifier | Define different types of aggregations such as SUM, MIN, MAX, and COUNT grouped by existing or computed columns. | -| [Alter row](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-alter-row) | Row modifier | Set insert, delete, update, and upsert policies on rows. | -| [Conditional split](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-conditional-split) | Multiple inputs/outputs | Route rows of data to different streams based on matching conditions. | -| [Derived column](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-derived-column) | Schema modifier | generate new columns or modify existing fields using the data flow expression language. | -| [Exists](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-exists) | Multiple inputs/outputs | Check whether your data exists in another source or stream. | -| [Filter](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-filter) | Row modifier | Filter a row based upon a condition. | -| [Flatten](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-flatten) | Schema modifier | Take array values inside hierarchical structures such as JSON and unroll them into individual rows. | -| [Join](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-join) | Multiple inputs/outputs | Combine data from two sources or streams. | -| [Lookup](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-lookup) | Multiple inputs/outputs | Reference data from another source. | -| [New branch](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-new-branch) | Multiple inputs/outputs | Apply multiple sets of operations and transformations against the same data stream. | -| [Parse](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-new-branch) | Formatter | Parse text columns in your data stream that are strings of JSON, delimited text, or XML formatted text. 
| -| [Pivot](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-pivot) | Schema modifier | An aggregation where one or more grouping columns has its distinct row values transformed into individual columns. | -| [Rank](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-rank) | Schema modifier | Generate an ordered ranking based upon sort conditions | -| [Select](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-select) | Schema modifier | Alias columns and stream names, and drop or reorder columns | -| [Sink](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-sink) | - | A final destination for your data | -| [Sort](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-sort) | Row modifier | Sort incoming rows on the current data stream | -| [Source](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-source) | - | A data source for the data flow | -| [Surrogate key](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-surrogate-key) | Schema modifier | Add an incrementing non-business arbitrary key value | -| [Union](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-union) | Multiple inputs/outputs | Combine multiple data streams vertically | -| [Unpivot](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-unpivot) | Schema modifier | Pivot columns into row values | -| [Window](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-window) | Schema modifier | Define window-based aggregations of columns in your data streams. | -| [Parse](https://docs.microsoft.com/en-us/azure/data-factory/data-flow-parse) | Schema modifier | Parse column data to Json or delimited text | - - - - - -## 控制流 Control Flow - -- **Execute Pipeline**: 执行管道。通过 monitor 可以看到 pipeline 的输入参数、重新执行 pipeline。在定义 pipeline 时,需要注意这点。 - -- 数组(上限 100,000) [^8] - - - **Append Variable**: 追加变量到数组里。 - - **Filter**: 过滤数组 - - **ForEach**: 循环数组。 - - 最大并行为 50,默认为 20,如需扩展则要多重 ForEach (Execute Pipeline + ForEach 的方式)。 - - 测试结果显示,设置最大并行数设置过高时,是按照最低数来执行。(💡那为何不全自动化呢?) 
- - ForEach 的限制很多。 - - **Until** - -- 输入 - - - **Get Metadata**: 获得文件的元数据。元数据不得超过 4 MB - - - **Lookup**: 通过 dataset 获得数据。 - - - 输出最大支持 4 MB,如果大小超过此限制,活动将失败。 - - - 最多可以返回 5000 行;如果结果集包含的记录超过此范围,将返回前 5000 行。 - - - 突破方式: 如果数据源有 index 的话,可以通过循环或者 util 的形式实现。 - - (💡[官方的 workarounds](https://docs.microsoft.com/en-us/azure/data-factory/control-flow-lookup-activity#limitations-and-workarounds) 太模糊,无法参考使用) - - - **Web**: - -- 输出 - - - **Web**: 可以发送各种数据。另外还可以将 datasets 和 linkedServices 发送出去。 - - **webhook** - -- 条件语句 - - - **If Condition**: if 语句 - - **Switch** - - **Validation**: 等待文件。当文件或文件夹存在时,才能继续下一步。 - - **wait**: 等待一段时间后再执行下一步。 - -- **Set Variable**: 设置变量 - -## Delete Activity - -Delete Activity 仅仅用于删除文件。如需定时删除文件,则要与 schedule trigger 一起使用。 - - - -## 外部服务 - -### Databricks - -Azure Databricks 基于 Apache Spark 的快速、简单、协作分析平台 - -### Azure Data Explorer - -数据分析 - - - - - -[^1]: [Data Integration Units](https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance-features#data-integration-units) -[^2]: [Schema and data type mapping in copy activity - Microsoft Docs](https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping) - -[^3]:[Troubleshoot copy activity on Azure IR](https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance-troubleshooting#troubleshoot-copy-activity-on-azure-ir) -[^4]: [Fault tolerance](https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-fault-tolerance) -[^5]: [Data consistency verification in copy activity - Azure](https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-data-consistency) -[^6]: [Monitor copy activity](https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-monitoring) -[^7]: [Performance tuning steps](https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance#performance-tuning-steps) -[^8]: [ForEach Activity](https://docs.microsoft.com/en-us/azure/data-factory/control-flow-for-each-activity) -[^9]: [Data Factory limits](https://docs.microsoft.com/en-US/azure/azure-resource-manager/management/azure-subscription-service-limits#data-factory-limits) \ No newline at end of file diff --git a/230. Cloud/Azure/11. Data Tools/README.md "b/230. Cloud/Azure/11. Data Tools/\346\225\260\346\215\256\345\244\207\344\273\275.md" similarity index 73% rename from 230. Cloud/Azure/11. Data Tools/README.md rename to "230. Cloud/Azure/11. Data Tools/\346\225\260\346\215\256\345\244\207\344\273\275.md" index 5f0bb6b4..cbf818f8 100644 --- a/230. Cloud/Azure/11. Data Tools/README.md +++ "b/230. Cloud/Azure/11. Data Tools/\346\225\260\346\215\256\345\244\207\344\273\275.md" @@ -8,19 +8,19 @@ 数据工具分为以下: -1. 托管备份工具。如 CosmosDB 可选持续备份或者定期备份,但该工具只能在该服务里使用。 +1. **服务自带的备份**。如 CosmosDB 可选持续备份或者定期备份,但该工具只能在该服务里使用。 这种托管工具的持续备份通常十分成熟,可以恢复到某一个具体的时刻(point-to-restore),我们只需要在上面点击一下备份保存时间即可。 -2. 备份平台,如:**Azure Recovery Services Vault(RSV)** 和 **Azure Backup Vault(BV)** +2. **备份平台提供的备份**。如:**Azure Recovery Services Vault(RSV)** 和 **Azure Backup Vault(BV)** 是(1)的补足。在这里,备份文件是托管的,可统一设置备份策略。但只能定期备份。 -3. 传统备份 +3. **传统备份** 如我们使用 virtualbox 的时,可以手动制作 snapshot 作为备份。Azure snapshot 也是类似的存在。 -4. 综合性数据工作: 如 Azure Data Factory +4. **其他工具**: 如 Azure Data Factory。参考:[数据复制](./数据复制.md) 用于补足上述场景无法实现的操作。例,CosmosDB 无法复制数据,有时候我们需要复制数据做测试,这时候就需要外部工具(data factory)了。 @@ -51,7 +51,7 @@ Recovery Services vault 是一个存储的仓库。而实际做备份和恢复 -## 3. 虚拟机的备份策略 +## 3. 虚拟机备份 虚拟机的备份与恢复有若干种策略: @@ -100,45 +100,13 @@ Recovery Services vault 是一个存储的仓库。而实际做备份和恢复 -## 4. 
数据工作 -![image-20240424132626001](https://raw.githubusercontent.com/caliburn1994/caliburn1994.github.io/dev/images/20240424132628.png) -数据同步有几种方式 -- [Data factory](Azure%20Data%20Factory): 综合性比较好,各种迁移工具都有。可重复执行。 -- AzCopy: 本地命令工具。**适用:复制 blob 和 file。** -- Azure Import/Export: 数据装进硬盘里,发送到 Azure 的数据中心或者发过来发送到客户手中。**适用:复制 blob 和 file。** - -### 4.1. Azure Import/Export -Azure Import/Export: -- Import: 数据装进硬盘里,发送到 Azure 的数据中心。(支持服务: Azure Blob storage、Azure Files) [[”]](https://learn.microsoft.com/en-us/azure/import-export/storage-import-export-service) - 所需配置文件: - - - a dataset CSV file: 文件信息 - - a driveset CSV file: 驱动信息 -- Export: 从数据中心将数据取出来,存到硬盘发送到客户手上。 - - - -### 4.2. Azcopy - -**AzCopy** 是命令行工具,可从数据源下载到本地,或者从本地上传。 [[”]](https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10#download-azcopy) - -- Azure Blob Storage、Azure File -- Azure Table 只支持旧版本 ll **[AzCopy version 7.3](https://aka.ms/downloadazcopynet)** ,新本不支持 - -| 命令行 | 说明 | -| ----------- | ---------------------------------- | -| azcopy make | Creates a container or file share. | - -**QA-1: AzCopy 连接 Azure Blob Storage、Azure File 通过什么方式验证?** - -A: Microsoft Entra ID 、a Shared Access Signature (SAS) token diff --git "a/230. Cloud/Azure/11. Data Tools/\346\225\260\346\215\256\345\244\215\345\210\266.md" "b/230. Cloud/Azure/11. Data Tools/\346\225\260\346\215\256\345\244\215\345\210\266.md" new file mode 100644 index 00000000..f4a413ef --- /dev/null +++ "b/230. Cloud/Azure/11. Data Tools/\346\225\260\346\215\256\345\244\215\345\210\266.md" @@ -0,0 +1,71 @@ +## 介绍 + +![image-20240703114404592](https://raw.githubusercontent.com/caliburn1994/caliburn1994.github.io/dev/images/20240703114408.png) + +数据复制工具有: + +- [Data factory](Azure%20Data%20Factory): 综合性数据复制、聚合工具。运行在云上。 +- [Azure Cosmos DB Desktop Data Migration Tool](https://github.com/AzureCosmosDB/data-migration-desktop-tool):综合性数据复制工具。运行在本地机器上。 +- AzCopy: 本地命令工具。**适用:复制 blob 和 file。** +- Azure Import/Export: 数据装进硬盘里,发送到 Azure 的数据中心或者发过来发送到客户手中。**适用:复制 blob 和 file。** + + + +## 2. 工具 + +### 2.1. Azure Import/Export + +Azure Import/Export: + +- Import: 数据装进硬盘里,发送到 Azure 的数据中心。(支持服务: Azure Blob storage、Azure Files) [["]](https://learn.microsoft.com/en-us/azure/import-export/storage-import-export-service) + + 所需配置文件: + + - a dataset CSV file: 文件信息 + - a driveset CSV file: 驱动信息 + +- Export: 从数据中心将数据取出来,存到硬盘发送到客户手上。 + + + +### 2.2. Azcopy + +**AzCopy** 是命令行工具,可从数据源下载到本地,或者从本地上传。 [["]](https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10#download-azcopy) + +- Azure Blob Storage、Azure File +- Azure Table 只支持旧版本 ll **[AzCopy version 7.3](https://aka.ms/downloadazcopynet)** ,新本不支持 + +| 命令行 | 说明 | +| ----------- | ---------------------------------- | +| azcopy make | Creates a container or file share. | + +**QA-1: AzCopy 连接 Azure Blob Storage、Azure File 通过什么方式验证?** + +A: Microsoft Entra ID 、a Shared Access Signature (SAS) token + + + + + +## 3. 校验 + +以 CosmosDB 为例,原数据应该包含以下内容: + +- 接近各种极限的数据,如:2M 大小的数据、包含很深层次的数据 +- 日期、小数点 + +校验手段 + +- 抽样校验 +- 数据源和数据目标的数据一一对应。 + + + + + + + + + + + diff --git "a/230. Cloud/Azure/11. Data Tools/\350\277\201\347\247\273\344\270\216\345\244\207\344\273\275.xmind" "b/230. Cloud/Azure/11. Data Tools/\350\277\201\347\247\273\344\270\216\345\244\207\344\273\275.xmind" index 3348c50b..4efb1cc1 100644 Binary files "a/230. Cloud/Azure/11. Data Tools/\350\277\201\347\247\273\344\270\216\345\244\207\344\273\275.xmind" and "b/230. Cloud/Azure/11. 
Data Tools/\350\277\201\347\247\273\344\270\216\345\244\207\344\273\275.xmind" differ diff --git a/images/20240424132628.png b/images/20240424132628.png deleted file mode 100644 index 7f1627c4..00000000 Binary files a/images/20240424132628.png and /dev/null differ
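Section 2.1 above records the rule `Request Size = Single Document Size * Write Batch Size` and warns that too high a batch size hits the Cosmos DB 2 MB request limit. The following is a minimal sketch of the arithmetic that follows from that note; the 2 MB ceiling comes from the notes above, while the function name, variable names, and the sample document size are illustrative only.

```python
# Sketch: pick a Copy Activity write batch size that keeps
# Request Size = Single Document Size * Write Batch Size under the 2 MB limit.
# The limit and the formula come from section 2.1; the helper name and the
# example document size are assumptions made for illustration.

COSMOS_REQUEST_LIMIT_BYTES = 2 * 1024 * 1024  # 2 MB request-size ceiling


def max_write_batch_size(largest_document_bytes: int) -> int:
    """Largest batch size whose total request size stays under the 2 MB limit."""
    if largest_document_bytes <= 0:
        raise ValueError("document size must be positive")
    return max(1, COSMOS_REQUEST_LIMIT_BYTES // largest_document_bytes)


if __name__ == "__main__":
    # Example: documents up to ~600 KB => at most 3 documents fit in one request.
    print(max_write_batch_size(600 * 1024))
```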
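Section 2 notes that the Lookup activity returns at most 5,000 rows / 4 MB and that, when the source has an index, the limit can be worked around by looping. Inside ADF that loop would be an Until/ForEach driving a paged Lookup; the sketch below only illustrates the paging idea itself, outside ADF, against Cosmos DB with the `azure-cosmos` Python SDK. The account URL, key, database, container, and page size are placeholders, not values from the original notes.

```python
# Sketch of the "loop over an indexed/paged query" workaround mentioned for the
# Lookup 5,000-row limit. Runs outside ADF (plain Python + azure-cosmos); in ADF
# the same paging would be driven by an Until loop feeding a Lookup activity.
# All connection values below are placeholders.
from azure.cosmos import CosmosClient

ACCOUNT_URL = "https://<account>.documents.azure.com:443/"
ACCOUNT_KEY = "<key>"
PAGE_SIZE = 5000  # stay within one Lookup-sized page per iteration

client = CosmosClient(ACCOUNT_URL, credential=ACCOUNT_KEY)
container = client.get_database_client("<db>").get_container_client("<container>")

offset = 0
while True:
    page = list(
        container.query_items(
            query="SELECT * FROM c ORDER BY c.id OFFSET @offset LIMIT @limit",
            parameters=[
                {"name": "@offset", "value": offset},
                {"name": "@limit", "value": PAGE_SIZE},
            ],
            enable_cross_partition_query=True,
        )
    )
    if not page:
        break
    # ... hand this page to the next processing step ...
    print(f"fetched {len(page)} items at offset {offset}")
    offset += PAGE_SIZE
```

For deep paging, filtering on an indexed column (for example `WHERE c.id > @last_id`) or using continuation tokens is usually cheaper in RUs than `OFFSET LIMIT`; the snippet uses `OFFSET LIMIT` only because it is the simplest illustration of the loop.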
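The new 数据复制.md closes with a 校验 (validation) section that calls for either sampling or a one-to-one comparison between source and target. Below is a minimal sketch of the one-to-one check; it assumes both sides have already been exported to JSON Lines files (for example via a Copy Activity to Blob Storage) and that every record carries a unique `id` field. The file names and the `id` key are assumptions, not part of the original notes.

```python
# Sketch of the one-to-one source/target comparison described in the 校验 section.
# Assumes both datasets were exported to JSON Lines files and share an "id" key;
# both assumptions are illustrative, adjust to the real export format.
import json
from pathlib import Path


def load_by_id(path: Path, id_key: str = "id") -> dict:
    """Index every exported record by its unique id."""
    records = {}
    with path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                doc = json.loads(line)
                records[doc[id_key]] = doc
    return records


source = load_by_id(Path("source_export.jsonl"))
target = load_by_id(Path("target_export.jsonl"))

missing = source.keys() - target.keys()
extra = target.keys() - source.keys()
# Compare whole documents, so silent type changes (e.g. Boolean => String)
# noted for Data flow also show up as differences.
changed = [k for k in source.keys() & target.keys() if source[k] != target[k]]

print(f"source={len(source)} target={len(target)}")
print(f"missing={len(missing)} extra={len(extra)} changed={len(changed)}")
```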