@@ -2,6 +2,7 @@
title: "Building a RAG-based AI Recommender (2/2)"
author: Shiyan Xu
category: blog
subCategory: indexing
image: /assets/images/blog/2025-08-29-building-a-rag-based-ai-recommender-2.jpg
tags:
- blog
17 changes: 9 additions & 8 deletions website/blog/2025-09-17-hudi-auto-gen-keys.mdx
@@ -3,6 +3,7 @@ title: "Automatic Record Key Generation in Apache Hudi"
excerpt: ""
author: Shiyan Xu
category: blog
subCategory: security
image: /assets/images/blog/2025-09-17-hudi-auto-gen-keys/2025-09-17-hudi-auto-gen-keys.fig2.jpg
tags:
- hudi
@@ -19,8 +20,8 @@ By using a primary key that is stable across record movement, a system can effic

Apache Hudi was the first lakehouse storage project to introduce the notion of record keys. For mutable workloads, this addressed a significant architectural challenge. In a typical data lake table, updating records usually required rewriting entire partitions—a process that is slow and expensive. By supporting the record key as the stable identifier for every record, Hudi offered unique and advanced capabilities among lakehouse frameworks:

* Hudi supports [record-level indexing](https://hudi.apache.org/blog/2023/11/01/record-level-index/) for directly locating records in [file groups](https://hudi.apache.org/docs/storage_layouts) for highly efficient upserts and queries, and [secondary indexes](https://hudi.apache.org/blog/2025/04/02/secondary-index/) that enable performant lookups for predicates on non-record key fields.
* Hudi implements [merge modes](https://hudi.apache.org/blog/2025/03/03/record-mergers-in-hudi/), standardizing record-merging semantics to handle requirements such as unordered events, duplicate records, and custom merge logic.
* By materializing record keys along with other [record-level meta-fields](https://www.onehouse.ai/blog/hudi-metafields-demystified), Hudi unlocks features such as efficient [change data capture (CDC)](https://hudi.apache.org/blog/2024/07/30/data-lake-cdc/) that serves record-level change streams, near-infinite history for time-travel queries, and the [clustering table service](https://hudi.apache.org/docs/clustering) that can significantly optimize file sizes.

<figure>
@@ -61,10 +62,10 @@ In this example, you’re creating a Copy-on-Write table partitioned by `city`.

Designing a key generation mechanism that operates efficiently at petabyte scale requires careful thought. We established five core requirements for the auto-generated keys:

1. **Global Uniqueness:** Keys must be unique across the entire table to maintain the integrity of a primary key.
2. **Low Storage Footprint:** The keys should be highly compressible to add minimal storage overhead.
3. **Computational Efficiency:** The encoding and decoding process must be lightweight so as not to slow down the write process.
4. **Idempotency:** The generation process must be resilient to task retries, producing the same key for the same record every time.
5. **Engine Agnostic:** The logic must be reusable and implemented consistently across different execution engines like Spark and Flink.

These principles guided the technical design. To align with primary key semantics, global uniqueness was non-negotiable. To minimize storage footprint, the generated keys needed to be compact and highly compressible, especially for tables with billions of records. The computational cost was also critical; any expensive operation would be amplified by the number of records, creating a significant performance overhead. Furthermore, in distributed systems where task failures and retries are common, the key generation process had to be idempotent—ensuring the same input record always produces the exact same key. Finally, the solution needed to be engine-agnostic to provide consistent behavior, whether data is written via Spark, Flink, or another supported engine.
@@ -79,8 +80,8 @@ Based on the requirements mentioned previously, we eliminated several common ID

Each component serves a specific purpose:

* **Write Action Start Time:** The timestamp from the Hudi timeline that marks the beginning of a write transaction.
* **Workload Partition ID:** An internal identifier that execution engines use to track the specific data split being processed by a given distributed write task.
* **Record Sequence ID:** A counter that uniquely identifies each record within that data split.

Together, these three components—all readily accessible during the write process—form a record identifier that satisfies the requirements of global uniqueness, idempotency, and being engine-agnostic.
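The three-component scheme above can be sketched in a few lines of plain Python. This is an illustrative toy, not Hudi's actual encoding: the delimiter, field formats, and function name are assumptions chosen only to show how the components combine into globally unique, idempotent keys.

```python
def auto_gen_key(write_start_time: str, partition_id: int, seq_id: int) -> str:
    """Toy composite record key: concatenate the write action start
    time, the workload partition ID, and the record sequence ID.
    Because all three inputs are deterministic for a given record in
    a given write, retrying a failed task regenerates identical keys
    (idempotency), and the (partition_id, seq_id) pair makes every
    key in the write globally unique."""
    return f"{write_start_time}_{partition_id}_{seq_id}"

# All records written by one task share a long common prefix, so the
# keys compress well with prefix or dictionary encoding in columnar files.
keys = [auto_gen_key("20250917103045123", 7, i) for i in range(3)]
```

Note how the shared timestamp prefix keeps the storage footprint low while the trailing counter preserves uniqueness.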
@@ -3,6 +3,7 @@ title: "Real-Time Cloud Security Graphs with Apache Hudi and PuppyGraph"
excerpt: "Hudi tables support fast upserts and incremental processing. PuppyGraph queries relationships in place using openCypher or Gremlin. In this blog, we explore how to get started with real-time security graph analytics at scale using the data already stored in your Hudi lakehouse tables."
author: Jaz Samantha Ku, in collaboration with Shiyan Xu
category: blog
subCategory: use case
image: /assets/images/blog/2025-10-02-Real-Time-Cloud-Security-Graphs-Hudi+PuppyGraph/fig-4-Sample-Architecture-of-PuppyGraph-Hudi.png
tags:
- Apache Hudi
@@ -16,8 +17,8 @@ Security tools such as SIEM, CSPM, and cloud workload protection need relationsh

To keep up, the data pipeline must support:

* Continuous upserts with low lag so detections run on the latest state
* Incremental consumption so analytics read only “what changed since T”
* A rewindable timeline so responders can review state during investigations

With Apache Hudi and PuppyGraph, this becomes straightforward. Hudi tables support fast upserts and incremental processing. PuppyGraph queries relationships in place using openCypher or Gremlin. In this blog, we explore how to get started with real-time security graph analytics at scale using the data already stored in your Hudi lakehouse tables.
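The incremental-consumption requirement can be illustrated with a toy model in plain Python (this is a conceptual sketch, not Hudi's API): each record carries the commit time of the write that last touched it, and a reader asks only for what changed since some instant T.

```python
# Hypothetical security records; ids and states are made up for illustration.
records = [
    {"id": "sg-1", "state": "open-ingress", "commit_time": "20251002T100000"},
    {"id": "sg-2", "state": "locked-down",  "commit_time": "20251002T101500"},
    {"id": "ip-9", "state": "public",       "commit_time": "20251002T103000"},
]

def incremental_read(table, since):
    """Return only records committed strictly after `since`,
    mimicking an incremental query against a table's timeline so
    downstream detections process only the delta, not a full scan."""
    return [r for r in table if r["commit_time"] > since]

changed = incremental_read(records, "20251002T101500")
```

In a real deployment this filtering is done by the storage layer from commit metadata, so the reader never scans unchanged files.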
@@ -81,19 +82,19 @@ Getting started is straightforward. You will deploy the stack, load security dat

The components of this demo project include:

* Storage: MinIO/S3 – Object store for Hudi data
* Data Lakehouse: Apache Hudi – Brings database functionality to your data lakes
* Catalog: Hive Metastore – Backed by Postgres
* Compute engines:
  * Spark – Initial table writes
  * PuppyGraph – Graph query engine for complex, multi-hop graph queries

### Prerequisites

This tutorial assumes that you have the following:

1. **Docker** and **Docker Compose** (for setting up the Docker container)
2. **Python 3** (for managing dependencies)
3. [PuppyGraph-Hudi Demo Repository](https://github.com/puppygraph/puppygraph-getting-started/tree/main/integration-demos/hudi-demo)

#### Data Preparation
@@ -137,7 +138,7 @@ docker compose exec spark /opt/spark/bin/spark-sql -f /init.sql

#### Modeling the Graph

Now that our data is loaded in, we can log into the PuppyGraph Web UI at [http://localhost:8081](http://localhost:8081) with the default credentials (username: `puppygraph`, password: `puppygraph123`).

<figure>
![](/assets/images/blog/2025-10-02-Real-Time-Cloud-Security-Graphs-Hudi+PuppyGraph/fig-5-PuppyGraph-Login-Page.png)
@@ -162,9 +163,9 @@ Once you see your graph schema loaded in, you’re ready to start querying your

By modeling the network infrastructure as a graph, users can identify potential security risks, such as:

* Public IP addresses exposed to the internet
* Network interfaces not protected by any security group
* Roles granted excessive access permissions
* Security groups with overly permissive ingress rules
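To make the first two checks concrete, here is a minimal, self-contained sketch in plain Python (not PuppyGraph; in the demo the equivalent logic runs as openCypher or Gremlin queries). The node labels, properties, and edge relations (`IpAddress`, `NetworkInterface`, `PROTECTS`, and so on) are illustrative assumptions, not the demo repository's actual schema.

```python
# Toy infrastructure graph: nodes keyed by id, edges as (src, relation, dst).
nodes = {
    "ip-1": {"label": "IpAddress", "public": True},
    "ip-2": {"label": "IpAddress", "public": False},
    "ni-1": {"label": "NetworkInterface"},
    "ni-2": {"label": "NetworkInterface"},
    "sg-1": {"label": "SecurityGroup"},
}
edges = [
    ("ip-1", "ATTACHED_TO", "ni-1"),
    ("ip-2", "ATTACHED_TO", "ni-2"),
    ("sg-1", "PROTECTS", "ni-1"),
]

# Risk 1: public IP addresses exposed to the internet.
public_ips = [n for n, props in nodes.items()
              if props["label"] == "IpAddress" and props.get("public")]

# Risk 2: network interfaces not protected by any security group.
protected = {dst for src, rel, dst in edges
             if rel == "PROTECTS" and nodes[src]["label"] == "SecurityGroup"}
unprotected = [n for n, props in nodes.items()
               if props["label"] == "NetworkInterface" and n not in protected]
```

A graph engine expresses these as declarative pattern matches and can chain them into multi-hop traversals that would require several joins in SQL.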

Listed below are some sample queries you can try running to explore the data:
@@ -3,6 +3,7 @@ title: "Modernizing Upstox's Data Platform with Apache Hudi, dbt, and EMR Server
excerpt: ""
author: The Hudi Community
category: blog
subCategory: data lake
image: /assets/images/blog/2025-10-16-Modernizing-Upstox-Data-Platform-with-Apache-Hudi-DBT-and-EMR-Serverless/fig1.png
tags:
- hudi
@@ -3,6 +3,7 @@ title: "Partition Stats: Enhancing Column Stats in Hudi 1.0"
excerpt: ""
author: Aditya Goenka and Shiyan Xu
category: blog
subCategory: lakehouse
image: /assets/images/blog/2025-10-22-Partition_Stats_Enhancing_Column_Stats_in_Hudi_1.0/fig1.jpg
tags:
- hudi
@@ -3,6 +3,7 @@ title: "Deep Dive Into Hudi’s Indexing Subsystem (Part 1 of 2)"
excerpt: ""
author: Shiyan Xu
category: blog
subCategory: upserts
image: /assets/images/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2/fig1.png
tags:
- hudi
@@ -3,6 +3,7 @@ title: "How FreeWheel Uses Apache Hudi to Power Its Data Lakehouse"
excerpt: "How FreeWheel unified batch and streaming with an Apache Hudi–powered lakehouse to improve freshness, simplify operations, and scale analytics."
author: The Hudi Community
category: blog
subCategory: security
image: /assets/images/blog/2025-11-07-how-freewheel-uses-apache-hudi-to-power-its-data-lakehouse/image1.png
tags:
- hudi
@@ -3,6 +3,7 @@ title: Deep Dive Into Hudi's Indexing Subsystem (Part 2 of 2)
excerpt: 'Explore advanced indexing in Apache Hudi: record and secondary indexes for fast point lookups, expression indexes for transformed predicates, and async indexing for building indexes without blocking writes.'
author: Shiyan Xu
category: blog
subCategory: data lake
image: /assets/images/blog/2025-11-12-deep-dive-into-hudis-indexing-subsystem-part-2-of-2/fig1.png
tags:
- hudi
5 changes: 1 addition & 4 deletions website/learn/blog.md
@@ -1,11 +1,8 @@
---
title: "Blogs"
hide_title: true
---

import BlogList from '@site/src/components/BlogList';

# Blogs

Welcome to Apache Hudi blogs! Here you'll find the latest articles, tutorials, and updates from the Hudi community.

<BlogList />
3 changes: 3 additions & 0 deletions website/src/components/BlogList/Icon/search.svg