Commit
prepare dataset release & docs updates (#2126)
* remove standalone dataset from exports
* make pipeline dataset factory public
* rework transformation section
* fix some linting errors
* add row counts feature for readabledataset
* add dataset access example to getting started scripts
* add notes about row_counts special query to datasets docs
* fix internal docusaurus links
* Update docs/website/docs/intro.md
* Update docs/website/docs/tutorial/load-data-from-an-api.md
* Update docs/website/docs/tutorial/load-data-from-an-api.md
* Update docs/website/docs/tutorial/load-data-from-an-api.md
* Update docs/website/docs/general-usage/dataset-access/dataset.md
* Update docs/website/docs/general-usage/dataset-access/dataset.md
* Update docs/website/docs/dlt-ecosystem/transformations/index.md
* Update docs/website/docs/dlt-ecosystem/transformations/index.md
* Update docs/website/docs/dlt-ecosystem/transformations/index.md
* Update docs/website/docs/dlt-ecosystem/transformations/index.md
* Update docs/website/docs/dlt-ecosystem/destinations/duckdb.md
* Update docs/website/docs/dlt-ecosystem/transformations/index.md
* Update docs/website/docs/dlt-ecosystem/transformations/index.md
* Update docs/website/docs/dlt-ecosystem/transformations/python.md
* Update docs/website/docs/dlt-ecosystem/transformations/python.md
* Update docs/website/docs/dlt-ecosystem/transformations/python.md
* Update docs/website/docs/dlt-ecosystem/transformations/python.md
* Update docs/website/docs/dlt-ecosystem/transformations/python.md
* Update docs/website/docs/dlt-ecosystem/transformations/python.md
* Update docs/website/docs/dlt-ecosystem/transformations/python.md
* Update docs/website/docs/dlt-ecosystem/transformations/python.md
* Update docs/website/docs/dlt-ecosystem/transformations/sql.md
* Update docs/website/docs/dlt-ecosystem/transformations/sql.md
* Update docs/website/docs/dlt-ecosystem/transformations/sql.md
* Update docs/website/docs/dlt-ecosystem/transformations/sql.md
* Update docs/website/docs/dlt-ecosystem/transformations/sql.md
* Update docs/website/docs/general-usage/dataset-access/dataset.md

---------

Co-authored-by: Alena Astrakhantseva <[email protected]>
1 parent 95d6063 · commit b8bac75
Showing 29 changed files with 432 additions and 154 deletions.
@@ -262,20 +262,30 @@ In this example, the first pipeline loads the data using `pipedrive_source()`. T

#### [Using the `dlt` SQL client](dlt-ecosystem/transformations/sql.md)

-Another option is to leverage the `dlt` SQL client to query the loaded data and perform transformations using SQL statements. You can execute SQL statements that change the database schema or manipulate data within tables. Here's an example of inserting a row into the `customers` table using the `dlt` SQL client:
+Another option is to leverage the `dlt` SQL client to query the loaded data and perform transformations using SQL statements. You can execute SQL statements that change the database schema or manipulate data within tables. Here's an example of creating a new table with aggregated sales data in duckdb:

```py
-pipeline = dlt.pipeline(destination="bigquery", dataset_name="crm")
+pipeline = dlt.pipeline(destination="duckdb", dataset_name="crm")

with pipeline.sql_client() as client:
    client.execute_sql(
-        "INSERT INTO customers VALUES (%s, %s, %s)", 10, "Fred", "[email protected]"
-    )
+        """ CREATE TABLE aggregated_sales AS
+        SELECT
+            category,
+            region,
+            SUM(amount) AS total_sales,
+            AVG(amount) AS average_sales
+        FROM
+            sales
+        GROUP BY
+            category,
+            region;
+        """)
```
In this example, the `execute_sql` method of the SQL client allows you to execute SQL statements. The statement creates a new `aggregated_sales` table with total and average sales per category and region.
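For illustration, a minimal sketch of reading those results back with the same client (assuming the `aggregated_sales` table created above exists in the dataset) could look like this:

```py
with pipeline.sql_client() as client:
    # for SELECT statements, execute_sql returns the result rows as a list of tuples
    rows = client.execute_sql("SELECT category, region, total_sales FROM aggregated_sales")
    for category, region, total_sales in rows:
        print(f"{category} / {region}: {total_sales}")
```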
-#### [Using Pandas](dlt-ecosystem/transformations/pandas.md)
+#### [Using Pandas](dlt-ecosystem/transformations/python.md)

You can fetch query results as Pandas data frames and perform transformations using Pandas functionalities. Here's an example of reading data from the `issues` table in DuckDB and counting reaction types using Pandas:
@@ -287,11 +297,8 @@ pipeline = dlt.pipeline(
    dev_mode=True
)

-with pipeline.sql_client() as client:
-    with client.execute_query(
-        'SELECT "reactions__+1", "reactions__-1", reactions__laugh, reactions__hooray, reactions__rocket FROM issues'
-    ) as cursor:
-        reactions = cursor.df()
+# get a dataframe of all reactions from the dataset
+reactions = pipeline.dataset().issues.select("reactions__+1", "reactions__-1", "reactions__laugh", "reactions__hooray", "reactions__rocket").df()

counts = reactions.sum(0).sort_values(0, ascending=False)
```
@@ -0,0 +1,27 @@
---
title: Transforming your data
description: How to transform your data
keywords: [datasets, data, access, transformations]
---
import DocCardList from '@theme/DocCardList';

# Transforming data

If you'd like to transform your data after a pipeline load, you have 3 options available to you:

* [Using dbt](./dbt/dbt.md) - dlt provides a convenient dbt wrapper to make integration easier.
* [Using the `dlt` SQL client](./sql.md) - dlt exposes an SQL client to transform data on your destination directly using SQL.
* [Using Python with DataFrames or Arrow tables](./python.md) - you can also transform your data using Arrow tables and DataFrames in Python.

If you need to preprocess some of your data before it is loaded, you can learn about strategies to:

* [Rename columns.](../../general-usage/customising-pipelines/renaming_columns)
* [Pseudonymize columns.](../../general-usage/customising-pipelines/pseudonymizing_columns)
* [Remove columns.](../../general-usage/customising-pipelines/removing_columns)
This is particularly useful if you need to remove PII or other sensitive data, drop columns that are not needed for your use case, or work around a destination that does not support certain data types in your source data.
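For example, a minimal sketch of dropping a column before it is loaded, using `add_map` on a resource (the `users` resource and its columns are purely illustrative), might look like this:

```py
import dlt

@dlt.resource(table_name="users")
def users():
    # illustrative rows; in a real pipeline these would come from your source
    yield {"name": "Ada", "email": "ada@example.com", "country": "PT"}
    yield {"name": "Grace", "email": "grace@example.com", "country": "US"}

def drop_email(row):
    # remove the sensitive column before the row reaches the destination
    row.pop("email", None)
    return row

pipeline = dlt.pipeline(pipeline_name="users_pipeline", destination="duckdb", dataset_name="users_raw")
pipeline.run(users().add_map(drop_email))
```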
# Learn more
<DocCardList />
This file was deleted.
docs/website/docs/dlt-ecosystem/transformations/python.md · 109 changes: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
---
title: Transforming data in Python with Arrow tables or DataFrames
description: Transforming data loaded by a dlt pipeline with pandas dataframes or arrow tables
keywords: [transform, pandas]
---

# Transforming data in Python with Arrow tables or DataFrames

You can transform your data in Python using Pandas DataFrames or Arrow tables. To get started, please read the [dataset docs](../../general-usage/dataset-access/dataset).

## Interactively transforming your data in Python

Using the methods explained in the [dataset docs](../../general-usage/dataset-access/dataset), you can fetch data from your destination into a DataFrame or Arrow table in your local Python process and work with it interactively. This even works for filesystem destinations:

The example below reads GitHub reactions data from the `issues` table and counts the reaction types.

```py
pipeline = dlt.pipeline(
    pipeline_name="github_pipeline",
    destination="duckdb",
    dataset_name="github_reactions",
    dev_mode=True
)

# get a dataframe of all reactions from the dataset
reactions = pipeline.dataset().issues.select("reactions__+1", "reactions__-1", "reactions__laugh", "reactions__hooray", "reactions__rocket").df()

# calculate and print out the sum of all reactions
counts = reactions.sum(0).sort_values(0, ascending=False)
print(counts)

# alternatively, you can fetch the data as an arrow table
reactions = pipeline.dataset().issues.select("reactions__+1", "reactions__-1", "reactions__laugh", "reactions__hooray", "reactions__rocket").arrow()
# ... do transformations on the arrow table
```
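To sketch what the `# ... do transformations on the arrow table` placeholder might contain, here is a small, hedged example of an Arrow-side aggregation with `pyarrow.compute`, operating on the `reactions` Arrow table fetched above:

```py
import pyarrow.compute as pc

# sum a single reaction column directly on the arrow table, without converting to pandas
total_plus_one = pc.sum(reactions["reactions__+1"]).as_py()
print(f"total +1 reactions: {total_plus_one}")
```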
## Persisting your transformed data

Since dlt supports DataFrames and Arrow tables from resources directly, you can use the same pipeline to load the transformed data back into the destination.

### A simple example
The following simple example creates a new table from an existing users table, keeping only the columns that do not contain private information. Note that we use the `iter_arrow()` method on the relation to iterate over the Arrow table in chunks instead of fetching it all at once.
```py
pipeline = dlt.pipeline(
    pipeline_name="users_pipeline",
    destination="duckdb",
    dataset_name="users_raw",
    dev_mode=True
)

# get user relation with only a few columns selected, but omitting email and name
users = pipeline.dataset().users.select("age", "amount_spent", "country")

# load the data into a new table called users_clean in the same dataset
pipeline.run(users.iter_arrow(chunk_size=1000), table_name="users_clean")
```

### A more complex example
The example above could easily be done in SQL. Let's assume you'd like to do some Arrow transformations in Python instead. For this, we will create a resource that yields the modified Arrow tables. The same is possible with DataFrames.
```py
import pyarrow.compute as pc

pipeline = dlt.pipeline(
    pipeline_name="users_pipeline",
    destination="duckdb",
    dataset_name="users_raw",
    dev_mode=True
)

# NOTE: this resource will work like a regular resource and support write_disposition, primary_key, etc.
# NOTE: For selecting only users above 18, we could also use the filter method on the relation with ibis expressions
@dlt.resource(table_name="users_clean")
def users_clean():
    users = pipeline.dataset().users
    for arrow_table in users.iter_arrow(chunk_size=1000):

        # we want to filter out users under 18
        age_filter = pc.greater_equal(arrow_table["age"], 18)
        arrow_table = arrow_table.filter(age_filter)

        # we want to hash the email column
        arrow_table = arrow_table.append_column("email_hash", pc.sha256(arrow_table["email"]))

        # we want to remove the email column and name column
        arrow_table = arrow_table.drop(["email", "name"])

        # yield the transformed arrow table
        yield arrow_table


pipeline.run(users_clean())
```

## Other transforming tools

If you want to transform your data before loading, you can use Python. If you want to transform the data after loading, you can use Pandas or one of the following:

1. [dbt.](dbt/dbt.md) (recommended; a minimal sketch follows below)
2. [`dlt` SQL client.](sql.md)
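As a point of reference, here is a minimal, hedged sketch of the dbt wrapper mentioned in option 1. The package location `dbt_project/` is an assumption and should point at a dbt project with models built on this dataset:

```py
import dlt

pipeline = dlt.pipeline(
    pipeline_name="users_pipeline",
    destination="duckdb",
    dataset_name="users_raw"
)

# create a dbt runner bound to this pipeline's destination and dataset
# NOTE: "dbt_project/" is a placeholder path to your own dbt package
dbt = dlt.dbt.package(pipeline, "dbt_project/")

# run all models in the package and print what happened
models = dbt.run_all()
for m in models:
    print(f"{m.model_name}: {m.status} ({m.message})")
```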