19 changes: 15 additions & 4 deletions src/content/docs/r2/data-catalog/deleting-data.mdx
@@ -10,10 +10,17 @@ import { FileTree } from "~/components"
import { Tabs, TabItem } from "~/components"
import { InlineBadge } from "~/components";

## Deleting data in R2 Data Catalog

Deleting data from R2 Data Catalog, or any Apache Iceberg catalog, requires that operations be performed as transactions through the catalog itself. Deleting metadata or data files directly can corrupt the catalog.

## Automatic table maintenance
R2 Data Catalog can automatically manage table maintenance operations such as snapshot expiration and compaction. These continuous operations help keep latency and storage costs down.
- **Snapshot expiration**: Automatically removes old snapshots. This reduces metadata overhead. Data files are not removed until orphan file removal is run.
- **Compaction**: Merges small data files into larger ones. This optimizes read performance and reduces the number of files read during queries.

If you do not enable automatic maintenance, you need to run these operations manually.

Learn more in the [table maintenance](/r2/data-catalog/table-maintenance/) documentation.

## Examples of enabling automatic table maintenance in R2 Data Catalog
```bash
# Enable automatic snapshot expiration for entire catalog
@@ -25,9 +32,13 @@ npx wrangler r2 bucket catalog snapshot-expiration enable my-bucket \
npx wrangler r2 bucket catalog compaction enable my-bucket \
--target-size 256
```
More information can be found in the [table maintenance](/r2/data-catalog/table-maintenance/) and [manage catalogs](/r2/data-catalog/manage-catalogs/) documentation.
Refer to additional examples in the [manage catalogs](/r2/data-catalog/manage-catalogs/) documentation.

## Examples of deleting data from R2 Data Catalog using PySpark
## Manually deleting and removing data
You need to manually delete data when:
- Complying with data retention policies such as GDPR or CCPA.
- Performing selective deletes based on conditional logic.
- Removing stale or unreferenced files that R2 Data Catalog does not manage.

The following are basic examples using PySpark, but similar operations can be performed using other Iceberg-compatible engines. To configure PySpark, refer to our [example](/r2/data-catalog/config-examples/spark-python/) or the official [PySpark documentation](https://spark.apache.org/docs/latest/api/python/getting_started/index.html).
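
As a rough illustration of what "through the catalog" can look like, the sketch below builds the Spark SQL statements for a row-level delete followed by Iceberg's `expire_snapshots` and `remove_orphan_files` stored procedures. It is a minimal sketch, not the examples from this page: the catalog name `r2dc`, the table `default.events`, and the helper function names are all hypothetical, and it assumes a Spark session already configured with the Iceberg SQL extensions.

```python
from datetime import datetime


def delete_rows_sql(table: str, condition: str) -> str:
    """Build a row-level delete; Iceberg commits it as a new snapshot."""
    # NOTE: `condition` is interpolated as-is, so never pass untrusted input.
    return f"DELETE FROM {table} WHERE {condition}"


def expire_snapshots_sql(catalog: str, table: str, older_than: datetime) -> str:
    """Build a call to Iceberg's expire_snapshots Spark procedure."""
    ts = older_than.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{ts}')"
    )


def remove_orphan_files_sql(catalog: str, table: str) -> str:
    """Build a call to Iceberg's remove_orphan_files Spark procedure."""
    return f"CALL {catalog}.system.remove_orphan_files(table => '{table}')"


def run_statements(spark, statements):
    """Run each statement in order on an existing SparkSession."""
    for statement in statements:
        spark.sql(statement)


# Hypothetical names: catalog "r2dc", namespace "default", table "events".
statements = [
    delete_rows_sql("r2dc.default.events", "user_id = 'abc'"),
    expire_snapshots_sql("r2dc", "default.events", datetime(2025, 1, 1)),
    remove_orphan_files_sql("r2dc", "default.events"),
]
```

With a configured session, `run_statements(spark, statements)` would execute the three steps in order. The ordering reflects the note above that data files are not removed until orphan file removal is run.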

20 changes: 8 additions & 12 deletions src/content/docs/r2/data-catalog/manage-catalogs.mdx
@@ -82,7 +82,11 @@ npx wrangler r2 bucket catalog disable <BUCKET_NAME>
## Enable compaction

Compaction improves query performance by combining the many small files created during data ingestion into fewer, larger files according to the set `target file size`. For more information about compaction and why it's valuable, refer to [About compaction](/r2/data-catalog/table-maintenance/).
:::note[API token permission requirements]
Table maintenance operations such as compaction and snapshot expiration require a Cloudflare API token with both R2 storage and R2 Data Catalog read/write permissions to act as a service credential.

Refer to [Authenticate your Iceberg engine](#authenticate-your-iceberg-engine) for details on creating a token with the required permissions.
:::
<Tabs syncKey='CLIvDash'>
<TabItem label='Dashboard'>

@@ -120,12 +124,6 @@ npx wrangler r2 bucket catalog compaction enable <BUCKET_NAME> <NAMESPACE> <TABL
</TabItem>
</Tabs>

:::note[API token permission requirements]
Compaction requires a Cloudflare API token with both R2 storage and R2 Data Catalog read/write permissions to act as a service credential. The compaction process uses this token to read files, combine them, and update table metadata.

Refer to [Authenticate your Iceberg engine](#authenticate-your-iceberg-engine) for details on creating a token with the required permissions.
:::

Once enabled, compaction applies retroactively to all existing tables (for catalog-level compaction) or the specified table (for table-level compaction). During open beta, we currently compact up to 2 GB worth of files once per hour for each table.
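
To get a feel for what the open-beta limit means for an existing backlog, the following back-of-the-envelope sketch computes a lower bound on how long compaction of one table would take. It assumes the "up to 2 GB once per hour" figure above is the only constraint; the function name and constant are hypothetical, and actual throughput may be lower.

```python
import math

# Open-beta limit quoted above: up to 2 GB of files per table, once per hour.
COMPACTION_GB_PER_HOUR = 2


def min_hours_to_compact(backlog_gb: float) -> int:
    """Lower bound on hourly runs needed to compact a table's backlog."""
    if backlog_gb <= 0:
        return 0
    return math.ceil(backlog_gb / COMPACTION_GB_PER_HOUR)
```

For example, a table with 10 GB of small files would need at least five hourly runs before the whole backlog is compacted.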

## Disable compaction
@@ -165,6 +163,10 @@ npx wrangler r2 bucket catalog compaction disable <BUCKET_NAME> <NAMESPACE> <TAB

Snapshot expiration automatically removes old table snapshots to reduce metadata bloat and storage costs. For more information about snapshot expiration and why it is valuable, refer to [Table maintenance](/r2/data-catalog/table-maintenance/).

:::note
Snapshot expiration commands are available as of Wrangler version 4.56.0.
:::

To enable snapshot expiration on your catalog, run the [`r2 bucket catalog snapshot-expiration enable` command](/workers/wrangler/commands/#r2-bucket-catalog-snapshot-expiration-enable):

```bash
@@ -180,12 +182,6 @@ npx wrangler r2 bucket catalog snapshot-expiration enable <BUCKET_NAME> <NAMESPA
--retain-last 5
```

:::note[API token permission requirements]
Catalog-level snapshot expiration requires a Cloudflare API token with both R2 storage and R2 Data Catalog read/write permissions to act as a service credential. The snapshot expiration process uses this token to update table metadata and remove old snapshots.

Refer to [Authenticate your Iceberg engine](#authenticate-your-iceberg-engine) for details on creating a token with the required permissions.
:::

## Disable snapshot expiration

Disabling snapshot expiration prevents the process from running for all tables (catalog level) or a specific table (table level). You can re-enable snapshot expiration at any time.