
[rush] granular build event data #4741

Open
aramissennyeydd opened this issue May 28, 2024 · 4 comments
aramissennyeydd (Contributor) commented May 28, 2024

Summary

This is a follow-up from this Zulip thread.

Problem statement: Debugging Rush in remote environments (like CI Docker images) is difficult.

I propose a new realtime system for event generation and collection within Rush. It will be used alongside telemetry and logs to provide a complete view of Rush lifecycle data and events. The new system will allow core libraries and plugins alike to publish events to an event bus; the same event bus would be used by both the Heft and Rush libraries. A set of plugins on top of the event bus can read events and publish them to a secondary location.

I'd also like this design to serve as a foundation for a realtime build tracking web UI. While I won't touch on it in this design, a realtime event-based system could also provide webhook-like functionality, where events trigger workflows such as starting an Argo workflow, sending a message, or updating a status check on GitHub.

Example events:

operation.status.changed

interface OperationStatusChangeEvent extends IEvent {
  kind: 'operation.status.changed';
  spec: {
    from: OperationStatus;
    to: OperationStatus;
    timeElapsed: number;
  };
}

data.executionGraph

interface ExecutionGraphEvent extends IEvent {
  kind: 'data.executionGraph';
  spec: {
    operations: Operation[];
    executionGraph: Tree<Operation>;
  };
}

data.buildCache

interface BuildCacheEvent extends IEvent {
  kind: 'data.buildCache';
  spec: {
    operationToBuildCacheIdMap: Map<string, string>;
    buildCachePartsByProjectMap: Map<string, Map<string, string>>;
  };
}

Existing solutions

Telemetry

Telemetry should continue to be the place for key information about Rush developers' everyday activities. That information can be fed into a data pipeline or reporting system and used to track pain points, long-running jobs, frequent failures, etc. My view is that telemetry's strength is its aggregated format: timings are aggregated to important spans like operations, and success and failure are reported at that same level. Overloading this format to carry additional event types, with specific data beyond machine and Rush context metadata, detracts from the value proposition of Rush telemetry.

Logging

Logs should continue to be the place to output important debug information. When debugging, having a full view of the process, its preconditions, inputs, and outputs is important. However, the current logging experience is all-or-nothing: you get no logs, roughly 50% of the logs with --verbose, another 25% with --debug, and 100% with both --verbose and --debug. Investigating a specific precondition like build cache ID construction requires sifting through unrelated logs for cobuilds, Heft task execution, etc. While those logs are important, when I want to dive into a specific task (especially on a remote machine) they're difficult to parse through, and adding more information would make them even harder to parse.

Scope

In Scope

  • Extensible: not limited to the schema defined in the upstream solution. You should be able to define your own events.
  • Shared solution for Rush and Heft. You should be able to use this in both OSS Rush and in your local Rush and Heft plugins.
  • Usable by plugins outside of the OSS Rush repository. You shouldn't need special knowledge of the Rush repository to add events that you can consume.
  • Definition of events vs. telemetry, and an investigation into whether telemetry can be rewritten as an event consumer.

Out of Scope

  • Event publishing restrictions. We won't talk about how to only publish these events when a debug flag is enabled.
  • Event transformations. You should handle event transformations in your own custom plugins or change the shape after collecting the event server side.
  • Event collection store. We won't talk about how to store events after they've been consumed.
  • Cloud provider integrations. We won't talk about integrating this solution with Kafka, EventBridge, etc.

Proposal

This proposal has four parts:

  1. The API definition: what the interactions between the parts will look like.
  2. The usage of the new API.
  3. Optimizations for using the new API.
  4. Plugins that should ship with this new API.

API

Events

Modeled on the k8s/Backstage object format, this design is intended to be extensible and able to accommodate multiple schema variations via kind and spec.type. Adopters can define their own events and schemas and handle them separately.

interface IEvent {
  apiVersion: 'rushstack.io/v1' | 'rushstack.io/v2';
  kind: string;
  spec: {
    type: string;

    // Additional data to send to the server; schema determined by kind and spec.type.
    [key: string]: any;
  };
  metadata: {
    // Additional data; schema determined by kind and spec.type.
    [key: string]: any;
    timestamp: string;
    project?: { packageName: string };
    phase?: { name: string };
    cobuild?: {
      runnerId: string;
      contextId: string;
    };
    operation?: {
      name: string;
      shard?: number;
    };
  };
}
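For illustration, a concrete event under this shape might look like the following sketch. The field values are hypothetical, and the IEvent interface is repeated in simplified form so the snippet stands alone:

```typescript
// Simplified copy of the IEvent shape above so this snippet is self-contained.
interface IEvent {
  apiVersion: string;
  kind: string;
  spec: { type?: string; [key: string]: any };
  metadata: { timestamp: string; [key: string]: any };
}

// A hypothetical operation.status.changed event.
const event: IEvent = {
  apiVersion: 'rushstack.io/v1',
  kind: 'operation.status.changed',
  spec: {
    from: 'Queued',
    to: 'Executing',
    timeElapsed: 125
  },
  metadata: {
    timestamp: new Date().toISOString(),
    project: { packageName: '@my-company/my-app' },
    operation: { name: 'build' }
  }
};
```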

event.kind

For naming, I propose a loose standard of dot-separated parts to denote namespaces. For example, in build.event.status.changed, build is the umbrella topic that this event sits in, event is a bit more specific, and the final two parts denote a field change: status is the field we care about, and changed tells us that this event should carry a from state and a to state.

I propose a special data.* namespace that will hold all events that are used to share state/data. This can be useful for uploading a build plan or cluster map. It's not quite an event in the traditional sense.
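To make matching against a namespace like data.* concrete, here is a minimal sketch of wildcard kind matching. matchesKindFilter is a hypothetical helper for illustration, not an existing Rush API:

```typescript
// Hypothetical helper: match an event kind against a dot-separated filter,
// where a trailing '*' segment matches any remaining segments.
function matchesKindFilter(filter: string, kind: string): boolean {
  const filterParts = filter.split('.');
  const kindParts = kind.split('.');
  for (let i = 0; i < filterParts.length; i++) {
    if (filterParts[i] === '*') {
      return true; // wildcard matches the rest of the kind
    }
    if (filterParts[i] !== kindParts[i]) {
      return false;
    }
  }
  // An exact filter must consume the entire kind.
  return filterParts.length === kindParts.length;
}
```

Under this scheme, `matchesKindFilter('data.*', 'data.executionGraph')` matches, while `matchesKindFilter('data', 'data.executionGraph')` does not.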

Event Bus

In order to publish events, we'll use a simple event bus design. It should ideally be agnostic to event schema/type and just pass events to its subscribers. There should be one EventBus for Rush and one for Heft. This simplifies the model a little and makes the consumer model easier as well. I see the primary use case being forwarding these events to either a reporting system or a web UI, which can decide which events it cares about. For local consumers that want to be more judicious, each consumer of the event bus can listen to only the specific events that they care about, using the filters provided for kind and type.

interface IEventBus {
  name: string;

  // Publish an event; internally, we will route to the correct subscribers based on kind.
  publish(event: IEvent): void;

  // Subscribe to a given kind. If you don't pass a kind filter, you will subscribe to all events.
  // (The consumer comes first because optional parameters must follow required ones.)
  subscribe(consumer: IEventConsumer, kindFilter?: string, typeFilter?: string): void;
  unsubscribe(consumer: IEventConsumer, kindFilter?: string, typeFilter?: string): void;
}

interface IEventBusSubscriber {
  // For each event, the event bus will call this on every matching subscriber.
  onEvent(event: IEvent): Promise<void>;
}

interface IEventConsumer extends IEventBusSubscriber {
  onSubscribe(bus: IEventBus): void;
}
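A minimal in-memory implementation of this bus might look like the following sketch. The types are simplified, and the filtering logic (exact kind match plus a catch-all subscription) is an assumption, not the proposed implementation:

```typescript
interface IEvent {
  kind: string;
  spec: { [key: string]: any };
}

interface IEventBusSubscriber {
  onEvent(event: IEvent): Promise<void>;
}

// Simplified in-memory bus: subscribers are grouped under a kind filter,
// with '' acting as the catch-all subscription.
class SimpleEventBus {
  private readonly _subscribers: Map<string, IEventBusSubscriber[]> = new Map();

  public subscribe(consumer: IEventBusSubscriber, kindFilter: string = ''): void {
    const existing = this._subscribers.get(kindFilter) ?? [];
    existing.push(consumer);
    this._subscribers.set(kindFilter, existing);
  }

  public async publish(event: IEvent): Promise<void> {
    // Notify exact-kind subscribers and catch-all subscribers.
    const targets = [
      ...(this._subscribers.get(event.kind) ?? []),
      ...(this._subscribers.get('') ?? [])
    ];
    await Promise.all(targets.map((s) => s.onEvent(event)));
  }
}
```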

Usage

type IBaseEventConsumerJson = {
  filter?: {
    kind: string;
    type?: string;
  };
  name: string;
}

type ILocalEventConsumerJson = IBaseEventConsumerJson & {
  path: string;
  fileNameFormat: string;
}

type IRemoteEventConsumerJson = IBaseEventConsumerJson & {
  endpoint: string;
  batched?: boolean;
  batchSize?: number;
}

type IEventConsumerJson = ILocalEventConsumerJson | IRemoteEventConsumerJson;

type EventConsumerFactory = (eventConsumerJson: IEventConsumerJson) => IEventConsumer | Promise<IEventConsumer>;

interface TaskSession {
  registerEventConsumerFactory(eventConsumerName: string, factory: EventConsumerFactory): void;
  getEventConsumerFactory(eventConsumerName: string): EventConsumerFactory | undefined;
}

Similar to build cache extensions, we would also provide an interface to use this event bus.
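To illustrate how the JSON shapes above might be populated, here are a hypothetical local file consumer and remote HTTP consumer. The names, paths, and format string are invented for illustration; the type aliases are repeated in simplified form so the snippet stands alone:

```typescript
// Simplified copies of the JSON shapes above so this snippet is self-contained.
type IBaseEventConsumerJson = {
  filter?: { kind: string; type?: string };
  name: string;
};
type ILocalEventConsumerJson = IBaseEventConsumerJson & {
  path: string;
  fileNameFormat: string;
};
type IRemoteEventConsumerJson = IBaseEventConsumerJson & {
  endpoint: string;
  batched?: boolean;
  batchSize?: number;
};

// Hypothetical config: write build-cache data events to local JSON files.
const localConsumer: ILocalEventConsumerJson = {
  name: 'local-file',
  filter: { kind: 'data.buildCache' },
  path: './rush-events',
  fileNameFormat: '{timestamp}-{kind}.json'
};

// Hypothetical config: forward all events to an HTTP endpoint in batches.
const remoteConsumer: IRemoteEventConsumerJson = {
  name: 'http',
  endpoint: 'https://events.example.com/rush',
  batched: true,
  batchSize: 50
};
```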

Rush Plugins

export class CustomEventConsumer implements IEventConsumer {
  constructor(private config: IEventConsumerJson) {}

  onSubscribe(bus: IEventBus) {
    // Handle initialization and whatnot.
  }

  async onEvent(event: IEvent) {
    // Custom upload event using gRPC or GraphQL, or broadcast event to other agents.
  }
}

export class RushEventConsumerCustomPlugin implements IRushPlugin {
  public apply(rushSession: RushSession, rushConfig: RushConfiguration): void {
    rushSession.hooks.initialize.tap(this.pluginName, () => {
      rushSession.registerEventConsumerFactory('custom', async (eventConsumerConfig) => {
        return new CustomEventConsumer(eventConsumerConfig);
      });
    });
  }
}

Heft Plugins

export declare interface IHeftTaskSession {
  // Extended with a new property for sending events.
  eventBus?: IEventBus;
}

And then in the plugins,

public apply(
  taskSession: IHeftTaskSession,
  heftConfiguration: HeftConfiguration,
  pluginOptions?: IJestPluginOptions
): void {
  taskSession.eventBus?.publish({ kind: 'MyPluginEvent' });
}

Propagating Context

In the above Heft example, adding fields like the cobuild runner ID and machine information is manual and error-prone. To address this, I propose a new IEventFactory that prepopulates these fields on events, and a new EventContext class that propagates context values through the application.

interface IEventFactory {
  // Using the partial event passed in, create an IEvent using a combination of partialEvent and EventContext.
  // data like timestamp will be generated inside of this method.
  create(partialEvent: Partial<IEvent>): IEvent;
}

declare class EventContext {
  // Using a dot-separated path, set the context at that path to the given value.
  // Could also use JSON Path here.
  set(path: string, value: any): void;

  get(path: string): any;
  getAll(): object;
}

These would also be passed to Heft plugins through HeftTaskSession.
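As a sketch of how the dot-separated context paths and event prepopulation could fit together, here is a minimal assumption-laden implementation. The class names are illustrative, not the proposed API surface:

```typescript
// Minimal sketch of a dot-path context store and an event factory that
// merges context into each event's metadata. All names are illustrative.
class SimpleEventContext {
  private readonly _root: Record<string, any> = {};

  public set(path: string, value: any): void {
    const parts = path.split('.');
    let node = this._root;
    // Walk/create intermediate objects, then assign the leaf value.
    for (let i = 0; i < parts.length - 1; i++) {
      node = node[parts[i]] ?? (node[parts[i]] = {});
    }
    node[parts[parts.length - 1]] = value;
  }

  public get(path: string): any {
    return path.split('.').reduce((node: any, part) => node?.[part], this._root);
  }

  public getAll(): object {
    return this._root;
  }
}

class SimpleEventFactory {
  constructor(private readonly _context: SimpleEventContext) {}

  // Fill in the timestamp and context-derived metadata around the partial event.
  public create(partialEvent: { kind: string; spec?: object }): any {
    return {
      apiVersion: 'rushstack.io/v1',
      kind: partialEvent.kind,
      spec: partialEvent.spec ?? {},
      metadata: {
        timestamp: new Date().toISOString(),
        ...this._context.getAll()
      }
    };
  }
}
```

With this shape, a cobuild runner could call `context.set('cobuild.runnerId', id)` once at startup, and every event created by the factory would carry that metadata automatically.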

Default Setup

Out of the box, I propose a single plugin, rush-http-event-consumer-plugin, that publishes events to an endpoint given in config.

export class RushEventConsumerHttpProvider implements IEventConsumer {
  constructor(private config: IRemoteEventConsumerJson) {}

  onSubscribe(bus: IEventBus) {}

  async onEvent(event: IEvent) {
    // Forward the event to the configured endpoint.
    await fetch(this.config.endpoint, {
      method: 'POST',
      body: JSON.stringify(event)
    });
  }
}

export class RushEventConsumerHttpPlugin implements IRushPlugin {
  public apply(rushSession: RushSession, rushConfig: RushConfiguration): void {
    rushSession.hooks.initialize.tap(this.pluginName, () => {
      rushSession.registerEventConsumerFactory('http', async (eventConsumerConfig) => {
        return new RushEventConsumerHttpProvider(eventConsumerConfig);
      });
    });
  }
}

Standard questions

Please answer these questions to help us investigate your issue more quickly:

Question Answer
@microsoft/rush globally installed version?
rushVersion from rush.json?
useWorkspaces from rush.json?
Operating system?
Would you consider contributing a PR?
Node.js version (node -v)?
@iclanton (Member)
@dmichon-msft mentioned an API that you may find useful.

@dmichon-msft (Contributor)

The rush-serve-plugin flattens relevant events into WebSocket messages for live data:
https://github.com/microsoft/rushstack/blob/44b7912020ce97831b68e2f34affd64903ec9989/rush-plugins/rush-serve-plugin/src/api.types.ts#L70C1-L127C69
https://github.com/microsoft/rushstack/blob/main/rush-plugins/rush-serve-plugin/README.md#live-build-status-via-web-socket

Note that the @rushstack/operation-graph package was designed to be the converged backend for execution between both Rush and Heft, though cobuilds threw a bit of a wrench into that since they add a lot of complexity to the scheduler.

@aramissennyeydd (Contributor, Author)

@dmichon-msft Interesting, I (clearly) wasn't aware of the runPhasedCommand.for(...).tapPromise model to tie into the execution hooks for phased commands. This definitely seems like something I could run with internally, and wouldn't need any special adjustments, thanks! I'll update this thread on how far I get.

To your comment about @rushstack/operation-graph, is that going to make this effort more difficult/fragile to changes in the future? Would converging onto @rushstack/operation-graph be a useful prerequisite here?

@dmichon-msft (Contributor)

The convergence would allow us to do most work related to build graph in one place instead of two; the main motivation of creating @rushstack/operation-graph was to allow all the work that went into optimizing for Heft to also be reused in Rush, and to have a common model for IPC communication, etc.
