The Distributed Event Factory (DEF) is a simulation tool for Distributed Process Mining. It is designed for simulating real-world processes using Markov chains. It produces events consisting of the classical process mining attributes caseId, activity, and timestamp, combined with location data in the form of a node and a group. In contrast to traditional process mining, multiple independent event streams are generated instead of a single large one.
DEF mainly targets researchers who want to test and evaluate distributed process mining algorithms. It is suited for generating data at large scale, so it can also be leveraged for benchmarking and load testing.
The Distributed Event Factory was designed for testing and evaluating distributed process mining algorithms. A use case for the tool is the simulation of a smart factory with multiple production facilities, where each sensor writes its own event log on which process mining is then applied. We use such a factory as the running example for configuring DEF.
Three core concepts underlie DEF: DataSource, Simulation, and Sink. These entities are configured via YAML files.
DEF is based on a Markov chain. The vertices of the Markov chain represent the distributed data sources. The edges represent transitions between the data sources; they contain a probability for transitioning, the activity performed, and a function modelling the process duration.
A minimal running example YAML is presented here:
kind: datasource
name: "MaterialPreparation"
spec:
  name: "MaterialPreparation"
  group: "factory"
  selection: genericProbability
  distribution: [ 0.8, 0.2 ]
  eventData:
    - activity: "Waiting for Material"
      transition: "MaterialPreparation"
      duration: 2
    - activity: "MaterialPreparation - Finished"
      duration: 1
      transition: "AssemblyLineSetup"
Every resource has to specify a kind, in our case datasource. The name (here: MaterialPreparation) is used for referencing this datasource later in the code. The specification spec defines the properties of the datasource. Its name refers to the node name, which later appears in the node field of the generated event; the same holds for the field group. Moreover, the selection of the next emitted event can be specified, in this case genericProbability: it selects events from the event data according to the defined distribution. The event data corresponds to the edges in the Markov chain, including duration, transition, and activity.
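Note that the transition above references a second datasource, AssemblyLineSetup, which must be defined as well. A minimal sketch following the same schema (the activity name, duration, and the transition back to MaterialPreparation are illustrative, not part of the original example):
kind: datasource
name: "AssemblyLineSetup"
spec:
  name: "AssemblyLineSetup"
  group: "factory"
  selection: genericProbability
  distribution: [ 1.0 ]
  eventData:
    - activity: "AssemblyLineSetup - Finished"
      duration: 3
      transition: "MaterialPreparation"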
To view and use the data generated by the datasources, one or more sinks must be specified. A minimal sink YAML:
kind: sink
name: console-sink
spec:
  type: console
  id: "Sensor"
  dataSourceRefs:
    - "MaterialPreparation"
The type of the sink is here defined as console, which prints the events to the console. The id is a prefix used to identify the sink in the console output (especially useful if multiple console sinks are used). The dataSourceRefs field defines from which datasources the events are emitted. A datasource can also be linked to multiple sinks, as sketched below.
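For instance, a second console sink can reference the same datasource; a sketch reusing the schema above (the name and id values are illustrative):
kind: sink
name: monitor-sink
spec:
  type: console
  id: "Monitor"
  dataSourceRefs:
    - "MaterialPreparation"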
A simulation runs the process through our datasources and emits their events via the sinks. The type stream creates a data stream. The caseId field defines how the case ids are generated: in the increasing mode, the caseId is an incrementally increased number. The load defines the speed at which the simulation is run; in this case, constantly 10 events per second are emitted.
kind: simulation
name: streamSimulation
spec:
  type: stream
  caseId: increasing
  load:
    loadBehavior: constant
    load: 10
All configurations shown so far are suited for generating mock event streams. As an additional feature, DEF can also be used for real load testing of process mining applications. This, however, requires specific configurations: the type of the simulation has to be loadTest and the sinks have to be load-test sinks. The load-test sinks depend on a load-test backend, which can precisely emit the specified event load at large scale.
The load-test sink has two new properties to specify: timeframe and a load-backend url. A timeframe is a batch of events that is submitted together to the load-backend; the number specifies the duration of this timeframe in milliseconds.
kind: sink
name: http-sink
spec:
  type: http
  id: "Sensor"
  timeframe: 1000
  url: "http://localhost:8080/loadgenerator"
  dataSourceRefs:
    - "MaterialPreparation"
The simulation specifies a new property genTimeframesTilStart, which defines how many timeframes are buffered at the load-backend before the actual event emission is started.
kind: simulation
name: assembly-line-simulation
spec:
  type: loadTest
  caseId: increasing
  genTimeframesTilStart: 10
  load:
    loadBehavior: constant
    load: 1000
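As a rough calculation, genTimeframesTilStart: 10 together with the timeframe of 1000 ms from the sink above buffers about 10 × 1000 ms = 10 s of events before emission starts; at a constant load of 1000 events per second, that is roughly 10,000 events (assuming the timeframes are generated back-to-back).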
To use the load-backend, Docker needs to be installed on your machine. Currently, the load-backend only supports writing data to a Kafka queue. The load-backend is started via the script:
./run-load-backend <bootstrapServerUrl> <kafkaTopic>
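For example, assuming a local Kafka broker at localhost:9092 and an illustrative topic named def-events:
./run-load-backend localhost:9092 def-events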
There are two options to install DEF: via PyPI or by building it yourself.
Requirements:
- Python >= 3.10
Install dependencies:
pip install -r requirements.txt
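For the PyPI route, the package presumably matches the import name used below (the package name distributed-event-factory is an assumption; the original does not spell it out):
pip install distributed-event-factory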
Define the Distributed Event Factory and specify config files like the ones mentioned above:
from distributed_event_factory.event_factory import EventFactory

# Create the factory and register the configuration files
event_factory = EventFactory()
(event_factory
    .add_directory("config/datasource/assemblyline")  # all datasource YAMLs in this directory
    .add_file("config/simulation/stream.yaml")        # simulation definition
    .add_file("config/sink/sink.yaml")                # sink definition
).run()  # start the simulation and emit events
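When run with the console sink from above, this prints the generated events of the assembly line to the console.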