Process Discovery

Process Discovery algorithms aim to identify a suitable process model that represents the sequence of events or activities executed during a process. Below is an overview visualizing the advantages and disadvantages of various mining algorithms.

AlphaAlpha+HeuristicInductive
Cannot handle loops of length one or twoCan handle loops of length one and twoTakes frequency into accountCan handle invisible tasks
Invisible and duplicated tasks cannot be discoveredInvisible and duplicated tasks cannot be discoveredDetects short loopsModel is sound
Discovered model may not be soundDiscovered model may not be soundDoes not guarantee a sound modelMost widely used process mining algorithm
Weak against noiseWeak against noise

Alpha Miner

Alpha Miner is one of the most well-known Process Discovery algorithms. It generates:

  • A Petri net model where all transitions are visible, unique, and correspond to classified events (such as activities).
  • An initial marking that defines the Petri net’s state when execution begins.
  • A final marking that defines the Petri net’s state when execution ends.

Here's an example where a log is read, the Alpha algorithm is applied, and a Petri net along with the initial and final markings are discovered. The input log is running-example.xes.

First, import the log:

                
            

Next, apply the Alpha Miner algorithm:

                
            
Visualization of the object referenced in the example.

Inductive Miner

PM4Py provides implementations of the inductive miner (IM), inductive miner infrequent (IMf), and inductive miner directly-follows (IMd) algorithms. The respective papers are:

  • Inductive Miner: Discovering block-structured process models from event logs—a constructive approach (link)
  • Inductive Miner infrequent: Discovering block-structured process models from event logs containing infrequent behavior (link)
  • Inductive Miner directly-follows: Scalable process discovery with guarantees (link)

The core idea of Inductive Miner is to detect a 'cut' in the log (e.g., sequential cut, parallel cut, concurrent cut, and loop cut) and then apply recursion to sublogs until a base case is reached. The directly-follows variant avoids recursion and uses the Directly Follows graph instead.

Inductive Miner models often utilize hidden transitions, particularly to skip or loop through portions of the model. Additionally, each visible transition has a unique label (no duplicate labels for transitions).

Two types of process models can be derived: Petri Net and Process Tree.

To mine a Petri Net, follow this example. We read the log, apply the inductive miner, and obtain the Petri net along with the initial and final markings. The input log is running-example.xes.

First, read the log and then apply the inductive miner:

                
            
Visualization of the object referenced in the example.

To generate a process tree, use this code snippet. The last two lines visualize the process tree:

                
            
Visualization of the object referenced in the example.

It is also possible to convert a process tree into a Petri net:

                
            
Visualization of the object referenced in the example.

Heuristic Miner

Heuristic Miner works on the directly-follows graph, allowing it to handle noise and detect common constructs (such as dependencies and AND relationships between activities). The output is a Heuristic Net, which can then be converted into a Petri net. The paper can be found here: this link.

To apply Heuristic Miner and discover a Heuristic Net, first import the log, then generate the Heuristic Net. A variety of parameters can be explored by clicking the following button:

Inspect parameters

                
            
Visualization of the object referenced in the example.
Parameter NameMeaning
dependency_thresholdThreshold for activity dependency (default: 0.5)
and_thresholdThreshold for AND relationships (default: 0.65)
loop_two_thresholdThreshold for loops of length 2 (default: 0.5)

To visualize the Heuristic Net, use the code on the right:

                
            
Visualization of the object referenced in the example.

To convert the Heuristic Net into a Petri net and visualize it, use the code on the right:

                
            
Visualization of the object referenced in the example.

Directly Follows Graph

Process models using Petri nets have well-defined semantics: a process starts at the initial marking and ends at the final marking. Directly-Follows Graphs are another type of process model where nodes represent events/activities, and directed edges indicate that one event/activity is followed by another. You can easily add metrics like frequency or performance to these edges.

First, import the log. Then, extract and visualize the Directly-Follows Graph, decorated with activity frequencies.

                
            
Visualization of the object referenced in the example.

To decorate the graph with performance metrics, replace two parameters in the code:

                
            
Visualization of the object referenced in the example.

To save the DFG, for example in SVG format, use the following code:

                
            

Adding Information about Frequency and Performance

Similar to Directly-Follows Graphs, Petri nets can also be decorated with frequency or performance information. This is achieved by replaying the model and assigning frequency or performance to the paths. The variant parameter in the visualizer specifies the type of annotation to apply. The available options for the variant parameter are:

  • pn_visualizer.Variants.WO_DECORATION: Default, no decoration.
  • pn_visualizer.Variants.FREQUENCY: Decorates the model with frequency information from the replay.
  • pn_visualizer.Variants.PERFORMANCE: Decorates the model with performance information (mean time) from the replay.

To visualize the Petri net decorated with frequency, use the following code:

                
            

Correlation Miner

In Process Mining, event logs typically contain at least the following:

  • A case identifier,
  • An activity,
  • A timestamp.

The case identifier links an event occurring in a system to a specific execution of a process. This association enables the use of algorithms for process discovery, conformance checking, and other process mining tasks. However, in some systems—such as those collecting data from IoT systems—it may be difficult to assign a case identifier. In these cases, performing traditional process mining is not feasible.

Correlation mining addresses this challenge by enabling the extraction of process models from event logs that lack case identifiers. These logs typically contain only:

  • An activity column,
  • A timestamp column.

For this explanation, we assume that a total order exists for the events (i.e., no two events share the same timestamp). Situations without a total order are more complex.

The Correlation Miner is a technique introduced by:

Pourmirza, Shaya, Remco Dijkman, and Paul Grefen. "Correlation Miner: Mining business process models and event correlations without case identifiers." International Journal of Cooperative Information Systems 26.02 (2017): 1742002.

The approach resolves the problem by solving an (integer) linear problem based on two key matrices:

  • The P/S matrix: This matrix captures the order relationships between activities as recorded in the log.
  • The Duration matrix: This matrix represents the duration between two activities, obtained through an optimization process.

The solution to this problem provides a set of activity pairs that are, according to the approach, in a directly-follows relationship, along with the strength of these relationships. This is referred to as the “frequency” Directly-Follows Graph (DFG).

A “performance” DFG can be derived from the Duration matrix by retaining only the entries that appear in the solution (i.e., the pairs of activities that are part of the frequency DFG).

This can then be visualized using tools such as the PM4Py DFG visualization.

For a more realistic example, we can take an existing log, remove the case ID column, and attempt to reconstruct the DFG without the case ID.

Let’s walk through an example. First, we load a CSV file into a Pandas dataframe, retaining only the concept:name and time:timestamp columns:

    

Next, we apply the Correlation Miner approach:

    

To better visualize the DFG, we can calculate the frequency of activities:

    

Finally, we can visualize the DFG:

    

Upon visualizing the DFG, we can confirm that the Correlation Miner successfully identified a clear main path in the process.

The Correlation Miner offers several variants, each with its own characteristics:

VariantDescription
Variants.CLASSICCalculates the P/S matrix and the Duration matrix using the entire event list in a traditional manner.
Variants.TRACE_BASEDCalculates the P/S matrix and Duration matrix on a trace-by-trace basis using a classic event log and merges the results. This variant produces a more understandable model compared to the classic DFG.
Variants.CLASSIC_SPLITCalculates the P/S matrix and Duration matrix on the entire event list but splits it into chunks to speed up the computation. While this results in a less accurate model than the CLASSIC variant, the computation is faster. The default chunk size is 100,000 events.

Temporal Profile

PM4Py includes an implementation of the temporal profile model, which is described in:

Stertz, Florian, Jürgen Mangler, and Stefanie Rinderle-Ma. "Temporal Conformance Checking at Runtime Based on Time-infused Process Models." arXiv preprint arXiv:2008.07262 (2020).

The temporal profile measures, for each pair of activities in the log, the average time and standard deviation between events containing those activities. The time is measured between the completion of the first event and the start of the second event. This model assumes an interval log, where events have two timestamps.

The output of the temporal profile discovery is a dictionary where each activity pair (represented as a tuple) is associated with two values: the average time and the standard deviation of the time interval between the events.

An example of temporal profile discovery is provided below. First, we load an event log and apply the discovery algorithm.

    

Several parameters can be used to customize the execution of the temporal profile discovery:

Parameter KeyTypeDefaultDescription
Parameters.ACTIVITY_KEYstringconcept:nameThe attribute to use as the activity.
Parameters.START_TIMESTAMP_KEYstringstart_timestampThe attribute to use as the start timestamp.
Parameters.TIMESTAMP_KEYstringtime:timestampThe attribute to use as the timestamp.