Release Airflow 3.0 #39593

Open
1 task
kaxil opened this issue May 13, 2024 · 5 comments
Labels
kind:meta High-level information important to the community
Milestone

Comments

@kaxil
Member

kaxil commented May 13, 2024

Hello all,

Creating a meta-issue to track all the projects related to Airflow 3 and pointers on how contributors can help in this effort.

The Home Page for Airflow 3 discussions is: https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+3.0

Workstreams: https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+3+Workstreams

How to participate & help?

Check this doc and find items without an owner; this is the workstream that needs someone in the community to lead. Comment & tag me if you are interested in any of the workstreams.

Example:

  • Airflow Standalone Improvements

@kaxil kaxil added the kind:meta High-level information important to the community label May 13, 2024
@kaxil kaxil added this to the Airflow 3.0.0 milestone May 13, 2024
@kaxil kaxil changed the title from "Release Airflow 3" to "[DRAFT] Release Airflow 3" May 13, 2024
@kaxil kaxil changed the title from "[DRAFT] Release Airflow 3" to "[WIP] Release Airflow 3" May 13, 2024
@jscheffl
Contributor

Why not add a "Project" --> https://github.com/apache/airflow/projects ?

@JossWhittle

Is there any planned follow-up to AIP-48 to expose a Dataset API to custom providers and give a mechanism for polling for Dataset changes using deferrable triggerers?

Since AIP-48's responsibilities were shrunk so it could be merged, there has not been any visible discussion about a follow-up AIP or any progress towards the remainder of its goals on the 3.x roadmap.

@kaxil
Member Author

kaxil commented Jun 18, 2024

> Why not add a "Project" --> https://github.com/apache/airflow/projects ?

Because we will have multiple "Projects".

@kaxil
Member Author

kaxil commented Jun 18, 2024

> Is there any planned follow-up to AIP-48 to expose a Dataset API to custom providers and give a mechanism for polling for Dataset changes using deferrable triggerers?
>
> Since AIP-48's responsibilities were shrunk so it could be merged, there has not been any visible discussion about a follow-up AIP or any progress towards the remainder of its goals on the 3.x roadmap.

Not yet, but Airflow 2.9 included support for Dataset event updates, which could act as a proxy for a "push-based" mechanism until we have a poll-based mechanism.

@JossWhittle

JossWhittle commented Jun 19, 2024

@kaxil I found the internal API call to create a DatasetEvent the other day, but I'm in two minds about whether I want to abuse it to solve my problem.

def register_dataset_change(

"/api/v1/datasets/events", json=event_payload, environ_overrides={"REMOTE_USER": "test"}

I want to be able to have a custom Dataset class listening to a message queue. Currently this is achieved using a separate, continuously scheduled DAG with a deferrable operator that consumes messages and triggers a DAG run of the actual processing DAG.

This could be changed to create a DatasetEvent that embeds the message(s) from the queue in the extra field, and have the processing DAG schedule on that Dataset. Would this be inherently dangerous to do?

Using an external DAG for polling in either case at least gets all the fault tolerance and deferrability of a DAG, and means status and history are shown in the UI.
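A minimal sketch of what the polling DAG could send, assuming the `POST /api/v1/datasets/events` endpoint from Airflow 2.9 accepts a JSON body with `dataset_uri` and `extra`. The helper name, base URL, dataset URI and message shape are hypothetical, and authentication is left out. The request is only built here, not sent.

```python
import json
import urllib.request


def build_dataset_event_request(base_url, dataset_uri, messages):
    """Build (but do not send) a request that would create a DatasetEvent.

    Assumption: the endpoint takes `dataset_uri` and a free-form `extra` dict.
    The `messages` key inside `extra` is a hypothetical convention.
    """
    payload = {"dataset_uri": dataset_uri, "extra": {"messages": messages}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/datasets/events",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values for illustration only.
request = build_dataset_event_request(
    "http://localhost:8080", "queue://orders", [{"id": 1}]
)
```

Sending it would be a matter of adding credentials and passing `request` to `urllib.request.urlopen`; the processing DAG would then schedule on the same dataset URI and read the messages from the event's extra field.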

A downside I am seeing though is that when my DAGs finish and I write to a message queue, this is entirely decoupled from being able to declare that outgoing queue as a Dataset outlet.

In fact, outlets can't really be used at all here, because we want to trigger on the message being pulled from the queue by another polling DAG, not on the current DAG simply succeeding, which won't have passed the message(s) into the extra field.

This means the graph of inter-DAG dependencies is always broken up, which is unfortunate.

Perhaps in the meantime Dataset could get a constructor argument to prevent triggering a DatasetEvent when used as an outlet. This would allow outlets to be used to mark up inter-DAG dependencies.


I think in a world where there is a polling mechanism, Dataset outlets on succeeding tasks should only hint to Airflow that pollers should poll, but shouldn't create a DatasetEvent directly. Pollers should be deferrable and fault tolerant, so an outlet Dataset firing really just means waking the poller immediately if it is deferred. Otherwise it will pick it up on its own.
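To make the proposal concrete, here is a rough sketch in plain `asyncio`, not Airflow code. All names (`poll_queue`, `fetch`, `wake`) are hypothetical. The outlet "hint" only sets an event that wakes the sleeping poller early; the poller alone decides whether an event record is produced.

```python
import asyncio


async def poll_queue(fetch, wake, interval, max_polls):
    """Sleep up to `interval` seconds (or until `wake` is set), then poll."""
    events = []
    for _ in range(max_polls):
        try:
            # An outlet hint sets `wake`, cutting the sleep short.
            await asyncio.wait_for(wake.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass  # No hint arrived; poll on the regular schedule.
        wake.clear()
        messages = fetch()
        if messages:
            # Only the poller creates the event, with the messages attached.
            events.append({"extra": {"messages": messages}})
    return events


async def demo():
    queue = []
    wake = asyncio.Event()

    def fetch():
        drained = list(queue)
        queue.clear()
        return drained

    poller = asyncio.create_task(
        poll_queue(fetch, wake, interval=60.0, max_polls=1)
    )
    await asyncio.sleep(0)  # Let the poller start and go to sleep.
    queue.append("msg-1")   # Upstream task writes to the queue...
    wake.set()              # ...and its outlet only hints the poller.
    return await asyncio.wait_for(poller, timeout=1.0)


result = asyncio.run(demo())
```

With a 60-second interval the poller still returns well inside the 1-second outer timeout, because the hint wakes it straight away. Without the hint, the same message would simply be picked up on the next scheduled poll.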

@kaxil kaxil changed the title from "[WIP] Release Airflow 3" to "Release Airflow 3.0" Jun 24, 2024
@kaxil kaxil pinned this issue Jun 24, 2024