Skip to content

Latest commit

 

History

History
135 lines (70 loc) · 12.2 KB

File metadata and controls

135 lines (70 loc) · 12.2 KB

Workshop: SQL Server 2022 Workshop

A Microsoft workshop from the SQL Server team

SQL Server 2022 Data Virtualization and Object Storage

SQL Server 2022 provides data virtualization capabilities to access data stored in formats such as parquet and delta tables over sources such as Azure Blob Storage, Azure Data Lake Storage, and S3 compatible object storage providers. In addition, SQL Server 2022 enhances remote backup and restore capabilities by supporting S3 compatible object storage providers.

Read through the entire module to learn about all the new data virtualization capabilities or go directly to the following exercises in this module to see new features in action:

5.0 Using OPENROWSET with parquet
5.1 Using parquet files with S3 compatible object storage
5.2 Using delta table with S3 compatible object storage
5.3 Backup and Restore with S3 compatible object storage
5.4 Using S3 with minio for data virtualization and object storage

____________________________________________________________________________________________________________________________________________________

Data Virtualization in SQL Server

SQL Server has provided since SQL Server 2016 the concept of Polybase, which is accessing data outside of SQL Server using T-SQL statements like OPENROWSET and CREATE EXTERNAL TABLE. This concept is known as data virtualization since users can access data outside of SQL Server without moving the data into SQL Server. SQL Server 2019 added support for data virtualization using ODBC drivers. This data access uses software called Polybase Services. SQL Server 2022 adds support within the database engine using REST API interfaces to access data in Azure Blob Storage, Azure Data Lake Storage, and S3 compatible object storage providers.

The following figure shows the full extent of data virtualization support in SQL Server 2022.

SQL Server 2022 Data Virtualization

Data virtualization with REST API connectors

SQL Server 2022 data virtualization now includes access to new data source connectors using REST API interfaces built into the SQL Server engine. These new data source connectors are often used to support a large set of files to build a data lake. SQL Server can now be used as a data hub to query files in a data lake without moving the data.

REST API connectors

The following new connectors are supported using REST API:

  • Azure Blob Storage

Using the data source connector abs you can access files stored in Azure Blob Storage containers. Azure Blob Storage is scalable, highly-available, redundant, cost effective with tiered storage, secured, and optimized for data lakes. You can learn more about data virtualization with Azure Blob Storage at https://learn.microsoft.com/sql/relational-databases/polybase/virtualize-csv.

  • Azure Data Lake Storage

Using the data source connector adls you can access file stored in Azure Data Lake Storage (ADLS). ADLS is built on top of Azure Blob Storage and provides file system semantics, file-level security, and scale. All these capabilities are built on top of Azure Blob Storage providing low cost, tiered access, high security, high availability, and durability. You can learn more about data virtualization with Azure Data Lake Storage at https://learn.microsoft.com/sql/relational-databases/polybase/virtualize-delta.

  • S3 compatible object storage providers

Using the data source connector s3 you can access file stored in S3 compatible object storage. S3 has become a popular object storage protocol and is supported by many providers both in clouds and on-premises. S3 object storage offers scale, lower cost, geo-distribution, and a reliable data transport. You can read more about data virtualization with S3 compatible object storage providers at https://learn.microsoft.com/sql/relational-databases/polybase/polybase-configure-s3-compatible.

Using T-SQL for data virtualization

T-SQL provides the following statements to support data virtualization with REST API connectors:

  • OPENROWSET

The T-SQL statement OPENROWSET allows you to access any REST API data connector "ad-hoc" where you supply information about how to connect to the specific data source and file or folder information.

In this exercise you will learn how to use the T-SQL function OPENROWSET to access parquet files in a public dataset in Azure Blob Storage.

Follow the instructions in the readme.md file in the sql2022workshop\05_DataVirt\openrowset folder.

  • CREATE EXTERNAL TABLE

You can use the T-SQL statement CREATE EXTERNAL TABLE to create a "virtual" table that is a mapping to files in any REST API data connector. You can then access the external table just like a SQL Server table except the data is stored in the remote data source. The metadata of the external table is stored in SQL Server system tables.

  • CREATE EXTERNAL TABLE AS SELECT

A new capability in SQL Server 2022 is to create an external table based on a query from an existing SQL Server table (or another external table) also known as CREATE EXTERNAL TABLE AS SELECT (CETAS). The result of the query will be a new external table and files will be created to the remote data source based on the results of the SELECT statement. You can use this method to export SQL Server data to files in a remote data source.

New native file format support

SQL Server allows you to create a file format to define the structure of a file in a remote data source. You would use file formats to define the structure of a JSON file since JSON is not a native supported file format. SQL Server 2022 now recognizes native file formats parquet and delta tables.

  • Parquet

Parquet is an open-source binary file format with the schema built into the file. Parquet files use a columnar storage format making them efficient for compression and retrieval. You can see an example of how to use parquet files with an S3 compatible object storage provider at https://learn.microsoft.com/sql/relational-databases/polybase/polybase-virtualize-parquet-file.

In this exercise you will see how parquet files can be used with external tables for a S3 compatible object storage provider by reviewing the results of T-SQL notebook.

Follow the instructions in the readme.md file in the sql2022workshop\05_DataVirt\parquet folder.

  • Delta tables

Delta lake is an open-source storage architecture using delta tables. Delta tables are files based on parquet but also offer basic transaction and log support. You can see an example of how to use Delta tables with Azure Data Lake Storage at https://learn.microsoft.com/sql/relational-databases/polybase/virtualize-delta

In this exercise you will see how delta tables can be used with external tables for a S3 compatible object storage provider by reviewing the results of T-SQL notebook.

Follow the instructions in the readme.md file in the sql2022workshop\05_DataVirt\delta folder.

Backup and Restore with S3 compatible object storage providers

SQL Server 2022 extends the BACKUP TO URL and RESTORE FROM URL T-SQL statements to support S3 compatible object storage providers. Most backup and restore options are available using this new capability. You can learn more about how to use S3 compatible object storage providers for SQL Server backups at https://learn.microsoft.com/sql/relational-databases/backup-restore/sql-server-backup-to-url-s3-compatible-object-storage.

You can see a list of common S3 compatible object storage providers at https://learn.microsoft.com/sql/relational-databases/backup-restore/sql-server-backup-and-restore-with-s3-compatible-object-storage#providers-of-s3-compatible-object-storage.

In this exercise you will see how delta tables Can be used with external tables for a S3 compatible object storage provider by reviewing the results of T-SQL notebook.

Follow the instructions in the readme.md file in the sql2022workshop\05_DataVirt\s3objectstorage folder.

Using a S3 compatible object provider for data virtualization and object storage

Put your knowledge to test by setting up a S3 compatible object provider and using new data virtualization and object storage capabilities.

In this advanced exercise you will setup S3 object storage using a free version of minio server and walk through using parquet files, delta tables, and backup/restore.

Follow the instructions in the readme.md file in the sql2022workshop\05_DataVirt\minio folder.

Next Steps

Next, Continue to T-SQL enhancements in SQL Server 2022.