In PM4Py, you can calculate various statistics on classic event logs and dataframes.
Given an event log, you can retrieve the list of all case durations (expressed in seconds).
The only parameter needed is the attribute containing the timestamp. The code snippet below demonstrates this.
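A minimal sketch (the XES path is illustrative; `pm4py.convert_to_event_log` is applied because recent PM4Py versions read logs as dataframes):

```python
import pm4py
from pm4py.statistics.traces.generic.log import case_statistics

# Read the log and ensure an EventLog object for the log-based statistics modules
log = pm4py.convert_to_event_log(pm4py.read_xes("tests/input_data/receipt.xes"))

# Retrieve the list of all case durations, expressed in seconds
all_case_durations = case_statistics.get_all_case_durations(
    log,
    parameters={case_statistics.Parameters.TIMESTAMP_KEY: "time:timestamp"})
print(all_case_durations)
```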
You can retrieve the case arrival ratio from an event log. This ratio represents the average time between the arrival of two consecutive cases in the log.
Additionally, you can calculate the case dispersion ratio, which represents the average time between the completion of two consecutive cases in the log.
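Both values can be computed as sketched below, continuing with the previously loaded log (the only parameter is, again, the timestamp attribute):

```python
from pm4py.statistics.traces.generic.log import case_arrival

# Average time (in seconds) between the arrival of two consecutive cases
case_arrival_ratio = case_arrival.get_case_arrival_avg(
    log, parameters={case_arrival.Parameters.TIMESTAMP_KEY: "time:timestamp"})

# Average time (in seconds) between the completion of two consecutive cases
case_dispersion_ratio = case_arrival.get_case_dispersion_avg(
    log, parameters={case_arrival.Parameters.TIMESTAMP_KEY: "time:timestamp"})
```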
The performance spectrum is a novel visualization of process performance, based on the time elapsed between different activities in the process executions. It was initially described in:
Denisov, Vadim, et al. "The Performance Spectrum Miner: Visual Analytics for Fine-Grained Performance Analysis of Processes." BPM (Dissertation/Demos/Industry). 2018.
The performance spectrum works with an event log and a list of activities considered for building the spectrum. In the following example, the performance spectrum is built from the receipt event log, which includes the "Confirmation of receipt", "T04 Determine confirmation of receipt", and "T10 Determine necessity to stop indication" activities. The event log is loaded, and the performance spectrum (containing timestamps for the different activities during process execution) is computed and visualized:
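A sketch using the simplified interface (the path is illustrative):

```python
import pm4py

log = pm4py.read_xes("tests/input_data/receipt.xes")

# Compute and visualize the performance spectrum over the three chosen activities
pm4py.view_performance_spectrum(
    log,
    ["Confirmation of receipt",
     "T04 Determine confirmation of receipt",
     "T10 Determine necessity to stop indication"],
    format="svg")
```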
In the example, three horizontal lines represent the activities included in the spectrum, and the oblique lines between them represent the times elapsed between two activities. The oblique lines are colored according to the elapsed time, making it easier to identify the points in time where the execution was bottlenecked, as well as potential patterns (e.g., FIFO, LIFO).
Two important KPIs for process executions are:

- The lead time: the overall time from the start to the end of the process execution, regardless of whether the instance was actively worked.
- The cycle time: the overall time from the start to the end of the process execution, counting only the periods in which the instance was actively worked.
For interval event logs (those that have both start and end timestamps), the lead time and cycle time can be calculated incrementally for each event. The lead time and cycle time reported on the last event of a case correspond to the entire process execution. This helps to identify which activities caused bottlenecks, for example, when the lead time increases more significantly than the cycle time.
The PM4Py algorithm begins by sorting each case by the start timestamp (ensuring activities that started earlier are reported first in the log), and it can calculate the lead and cycle times in all situations, including complex ones.
The following attributes are added to events in the log:
| Metric | Meaning |
| --- | --- |
| `@@approx_bh_partial_cycle_time` | Incremental cycle time associated with the event (the cycle time of the last event is the cycle time of the instance) |
| `@@approx_bh_partial_lead_time` | Incremental lead time associated with the event |
| `@@approx_bh_overall_wasted_time` | Difference between the partial lead time and the partial cycle time |
| `@@approx_bh_this_wasted_time` | Wasted time for the activity described by the 'interval' event |
| `@@approx_bh_ratio_cycle_lead_time` | Incremental flow rate (a value between 0 and 1) |
The method for calculating lead and cycle times can be applied with the following code:
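A sketch, assuming a lifecycle event log has already been loaded into `log`:

```python
from pm4py.objects.log.util import interval_lifecycle

# Enrich each event with the incremental lead/cycle time attributes listed above
enriched_log = interval_lifecycle.assign_lead_cycle_time(log)
```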
This results in an enriched log, where each event contains the corresponding attributes for lead and cycle time.
This statistic works only with interval event logs, i.e., logs where each event has both a start timestamp and a completion timestamp.
The average sojourn time statistic calculates the time spent on each activity by averaging the time between the start and completion timestamps for the activity's events. Here’s an example of how to use it. First, we import an interval event log.
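The path below is illustrative; the log is assumed to carry both a "start_timestamp" and a "time:timestamp" attribute on each event:

```python
import pm4py

# Read an interval event log and ensure an EventLog object
log = pm4py.convert_to_event_log(pm4py.read_xes("tests/input_data/interval_event_log.xes"))
```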
Then, we calculate the statistic, providing the attribute that represents the start timestamp and the attribute that represents the completion timestamp.
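A sketch, continuing from the snippet above:

```python
from pm4py.statistics.sojourn_time.log import get as soj_time_get

# Dictionary: activity -> average sojourn time (in seconds)
soj_time = soj_time_get.apply(
    log,
    parameters={soj_time_get.Parameters.TIMESTAMP_KEY: "time:timestamp",
                soj_time_get.Parameters.START_TIMESTAMP_KEY: "start_timestamp"})
print(soj_time)
```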
The same statistic can be applied seamlessly to Pandas dataframes by using the alternative module `pm4py.statistics.sojourn_time.pandas.get`.
This statistic works only with interval event logs, i.e., logs where each event has both a start timestamp and a completion timestamp.
In interval event logs, the definition of event order is more flexible. A pair of events in a case can intersect in the following ways:

- The second event starts after the completion of the first (the events follow each other).
- The second event starts before the completion of the first (the events overlap in time).
The second case represents event-based concurrency, where several events are actively executed at the same time.
You may want to retrieve the set of activities for which such concurrent execution occurs, along with the frequency of these occurrences. PM4Py can perform this calculation. Here’s an example. First, we import an interval event log.
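The same illustrative interval log as in the previous section is used here:

```python
import pm4py

log = pm4py.convert_to_event_log(pm4py.read_xes("tests/input_data/interval_event_log.xes"))
```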
Then, we calculate the statistic, providing the attribute that represents the start timestamp and the attribute that represents the completion timestamp.
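A sketch, continuing from the snippet above:

```python
from pm4py.statistics.concurrent_activities.log import get as conc_act_get

# Dictionary: pair of activities -> number of their concurrent occurrences
conc_act = conc_act_get.apply(
    log,
    parameters={conc_act_get.Parameters.TIMESTAMP_KEY: "time:timestamp",
                conc_act_get.Parameters.START_TIMESTAMP_KEY: "start_timestamp"})
print(conc_act)
```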
This statistic can also be applied seamlessly to Pandas dataframes by using the alternative module `pm4py.statistics.concurrent_activities.pandas.get`.
We provide an approach for calculating the Eventually-Follows Graph (EFG).
The EFG represents the partial order of events within the process executions of the log.
Our implementation works for both lifecycle logs (logs where each event has only one timestamp) and interval logs (logs where each event has both a start and a completion timestamp). In the case of interval logs, the start timestamp is used to define the EFG/partial order.
Specifically, the method assumes lifecycle logs when no start timestamp is passed in the parameters, and it assumes interval logs when a start timestamp is provided.
Here's an example. First, we import an interval event log.
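As before, the path of the interval log is illustrative:

```python
import pm4py

log = pm4py.convert_to_event_log(pm4py.read_xes("tests/input_data/interval_event_log.xes"))
```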
Then, we calculate the statistic, providing the attribute representing the completion timestamp and optionally the attribute representing the start timestamp.
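A sketch showing both modes (omitting the start timestamp triggers the lifecycle-log semantics; providing it triggers the interval-log semantics):

```python
from pm4py.statistics.eventually_follows.log import get as efg_get

# Lifecycle log: only the completion timestamp is considered
efg_graph = efg_get.apply(log)

# Interval log: the start timestamp defines the EFG/partial order
efg_graph = efg_get.apply(
    log,
    parameters={efg_get.Parameters.TIMESTAMP_KEY: "time:timestamp",
                efg_get.Parameters.START_TIMESTAMP_KEY: "start_timestamp"})
```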
Graphs help visualize various aspects of the current log, such as the distribution of numeric attributes, case duration, or events over time.
The following example demonstrates the distribution of case duration using two different graphs: a simple plot and a semi-logarithmic plot (on the X-axis). The semi-logarithmic plot is less sensitive to potential outliers.
First, the receipt log is loaded and the distribution of case durations is computed; then, both the simple and the semi-logarithmic plot are generated.
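A sketch (the path is illustrative):

```python
import pm4py
from pm4py.util import constants
from pm4py.statistics.traces.generic.log import case_statistics
from pm4py.visualization.graphs import visualizer as graphs_visualizer

log = pm4py.convert_to_event_log(pm4py.read_xes("tests/input_data/receipt.xes"))

# Estimate the density of the case duration distribution
x, y = case_statistics.get_kde_caseduration(
    log, parameters={constants.PARAMETER_CONSTANT_TIMESTAMP_KEY: "time:timestamp"})

# Simple plot
gviz = graphs_visualizer.apply_plot(x, y, variant=graphs_visualizer.Variants.CASES)
graphs_visualizer.view(gviz)

# Semi-logarithmic plot (on the X-axis), less sensitive to outliers
gviz = graphs_visualizer.apply_semilogx(x, y, variant=graphs_visualizer.Variants.CASES)
graphs_visualizer.view(gviz)
```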
The following example illustrates a graph representing the distribution of events over time. This is especially important as it shows the time intervals when the greatest number of events is recorded.
The distribution of events over time is calculated, and the graph is plotted accordingly.
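A sketch, reusing the receipt log loaded above:

```python
from pm4py.algo.filtering.log.attributes import attributes_filter
from pm4py.visualization.graphs import visualizer as graphs_visualizer

# Density of events over time, based on the "time:timestamp" date attribute
x, y = attributes_filter.get_kde_date_attribute(log, attribute="time:timestamp")

gviz = graphs_visualizer.apply_plot(x, y, variant=graphs_visualizer.Variants.DATES)
graphs_visualizer.view(gviz)
```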
The following example demonstrates two graphs related to the distribution of a numeric attribute: a standard plot and a semi-logarithmic plot (on the X-axis), which is less sensitive to outliers.
First, a filtered version of the Road Traffic log is loaded, followed by the distribution of the numeric attribute 'amount'. Both the standard plot and the semi-logarithmic plot can be generated.
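A sketch (the path is illustrative and points to a 100-trace sample of the Road Traffic log, standing in for the filtered version):

```python
import pm4py
from pm4py.algo.filtering.log.attributes import attributes_filter
from pm4py.visualization.graphs import visualizer as graphs_visualizer

log = pm4py.convert_to_event_log(pm4py.read_xes("tests/input_data/roadtraffic100traces.xes"))

# Density of the numeric attribute "amount"
x, y = attributes_filter.get_kde_numeric_attribute(log, "amount")

# Standard plot
gviz = graphs_visualizer.apply_plot(x, y, variant=graphs_visualizer.Variants.ATTRIBUTES)
graphs_visualizer.view(gviz)

# Semi-logarithmic plot (on the X-axis)
gviz = graphs_visualizer.apply_semilogx(x, y, variant=graphs_visualizer.Variants.ATTRIBUTES)
graphs_visualizer.view(gviz)
```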
The dotted chart is a classic visualization technique for displaying events within an event log across different dimensions. Each event in the log corresponds to a point on the chart. The dimensions are projected onto the graph as follows:

- X-axis: the values of the first dimension.
- Y-axis: the values of the second dimension.
- Color: the values of the third dimension, rendered as different colors for the points.
The dimensions can include string, numeric, or date values, which are appropriately managed by the dotted chart.
A common use case for the dotted chart is to visualize the distribution of cases and events over time using the following dimensions:

- X-axis: the timestamp of the event.
- Y-axis: the index of the case in the event log.
- Color: the activity of the event.
This setup enables the identification of patterns such as:

- Batches of cases arriving at the same time.
- Variations in the case arrival rate over time.
- Cases with an exceptionally long or short duration.
Below are examples of how to create and visualize the dotted chart based on both default and custom attribute selections.
To create the default dotted chart for the receipt event log, use the following code:
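A sketch using the simplified interface (the path is illustrative):

```python
import pm4py

log = pm4py.read_xes("tests/input_data/receipt.xes")

# Default dotted chart: X = event timestamp, Y = case index, color = activity
pm4py.view_dotted_chart(log, format="svg")
```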
To create a custom dotted chart for the receipt event log using "concept:name" (activity), "org:resource" (organizational resource), and "org:group" (organizational group) as dimensions, use the following code:
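Continuing with the log loaded above:

```python
# Custom dotted chart: X = activity, Y = resource, color = organizational group
pm4py.view_dotted_chart(
    log, format="svg",
    attributes=["concept:name", "org:resource", "org:group"])
```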
Observing the distribution of events over time can provide valuable insights into work shifts, busy periods, and patterns throughout the year.
The distribution of events over time can be visualized by loading an event log and calculating the distribution across various time periods, such as hours of the day, days of the week, days of the month, months, or years.
The parameter `distr_type` can take the following values, as shown in the sketch after this list:

- `hours`: distribution over the hours of the day.
- `days_week`: distribution over the days of the week.
- `days_month`: distribution over the days of the month.
- `months`: distribution over the months of the year.
- `years`: distribution over the years.
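A sketch (the path is illustrative):

```python
import pm4py

log = pm4py.read_xes("tests/input_data/receipt.xes")

# Plot the distribution of events over the days of the week
pm4py.view_events_distribution_graph(log, distr_type="days_week", format="svg")
```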
An activity is considered to be executed in batches by a given resource when the resource performs the same activity multiple times within a short period.
Identifying such activities can help uncover process areas that may benefit from automation, as repetitive activities may indicate inefficiencies.
Below is an example calculation using an event log to detect batches.
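A sketch (the path is illustrative):

```python
import pm4py
from pm4py.algo.discovery.batches import algorithm as batches_discovery

log = pm4py.convert_to_event_log(pm4py.read_xes("tests/input_data/receipt.xes"))

# Detect, for every (activity, resource) combination, the batches it works in
batches = batches_discovery.apply(log)
```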
The results can be displayed as follows:
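A sketch, assuming each entry of the result is a tuple of the (activity, resource) pair, the number of batches, and a dictionary of batch types:

```python
for act_res in batches:
    # (activity, resource) combination
    print("activity:", act_res[0][0], "| resource:", act_res[0][1])
    # Number of intervals matched to batches
    print("  number of batches:", act_res[1])
    # Breakdown over the detected batch types
    for batch_type in act_res[2]:
        print("  ", batch_type, "->", len(act_res[2][batch_type]))
```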
Our method can detect different types of batches, including:

- Simultaneous: all the events in the batch have identical start and end timestamps.
- Batching at start: all the events in the batch have an identical start timestamp.
- Batching at end: all the events in the batch have an identical end timestamp.
- Sequential batching: for all consecutive events, the end of the first is equal to the start of the second.
- Concurrent batching: consecutive events that overlap in time without being sequentially matched.
The rework statistic identifies activities that have been repeated during the same process execution, revealing inefficiencies in the process.
In our implementation, rework is calculated based on an event log or Pandas dataframe, and it returns a dictionary that associates each activity with the number of cases that contain rework for that activity.
Below is an example calculation on an event log.
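A sketch (the path is illustrative):

```python
import pm4py
from pm4py.statistics.rework.log import get as rework_get

log = pm4py.convert_to_event_log(pm4py.read_xes("tests/input_data/receipt.xes"))

# Dictionary: activity -> number of cases containing rework for that activity
rework = rework_get.apply(log)
print(rework)
```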
Rework at the case level refers to the number of events in a case that repeat an activity already performed earlier in the case.
For example, if a case contains the activities A, B, A, B, C, D, the rework count is 2 since the events in positions 3 and 4 correspond to activities that have already occurred.
The rework statistic helps identify cases where many activities are repeated, indicating potential inefficiencies.
Below is an example calculation on an event log. After the computation, the dictionary `dictio` contains entries for the six cases in the example log, as shown after the code:
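A sketch (the path is hypothetical; any log whose cases repeat activities works here):

```python
import pm4py
from pm4py.statistics.rework.cases.log import get as cases_rework_get

# Hypothetical path to the six-case example log
log = pm4py.convert_to_event_log(pm4py.read_xes("tests/input_data/rework_example.xes"))

# Dictionary: case identifier -> number of activities and rework count
dictio = cases_rework_get.apply(log)
print(dictio)
```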
{ "1": { "number_activities": 5, "rework": 0 }, "2": { "number_activities": 5, "rework": 0 }, "3": { "number_activities": 9, "rework": 2 }, "4": { "number_activities": 5, "rework": 0 }, "5": { "number_activities": 13, "rework": 7 }, "6": { "number_activities": 5, "rework": 0 } }
We provide a feature that enables querying information about the paths in the event log at a specific point in time or within a time interval using an interval tree data structure.
This is useful for quickly computing the workload of resources over a given time interval or for measuring the number of open cases at any given time.
To transform an event log into an interval tree, use the following code:
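A sketch (the path is illustrative):

```python
import pm4py
from pm4py.algo.transformation.log_to_interval_tree import algorithm as log_to_interval_tree

log = pm4py.convert_to_event_log(pm4py.read_xes("tests/input_data/receipt.xes"))

# Each directly-follows path between two events becomes an interval in the tree,
# delimited by the timestamps of the two events
it = log_to_interval_tree.apply(log)
```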
The following example computes the workload (number of events) for each resource within the specified interval.
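A sketch, continuing from the snippet above (the epoch boundaries and the `source_event` payload key are assumptions):

```python
from collections import Counter

# Query all paths open in a 30-day window (UNIX epoch timestamps, illustrative)
intersecting_items = it[1318333540:1318333540 + 30 * 86400]

# Workload: number of open items per resource of the source event
workload = Counter(x.data["source_event"]["org:resource"] for x in intersecting_items)
print(workload)
```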
The following example computes the number of open cases for each directly-follows path in the log.
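A sketch under the same assumptions as the workload example:

```python
from collections import Counter

intersecting_items = it[1318333540:1318333540 + 30 * 86400]

# Open cases per directly-follows path (source activity, target activity)
open_paths = Counter(
    (x.data["source_event"]["concept:name"], x.data["target_event"]["concept:name"])
    for x in intersecting_items)
print(open_paths)
```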