Statistics

In PM4Py, you can calculate various statistics on classic event logs and dataframes.

Throughput Time

Given an event log, you can retrieve the list of all case durations (expressed in seconds).

The only parameter needed is the timestamp. The code snippet below demonstrates this.


Case Arrival/Dispersion Ratio

You can retrieve the case arrival ratio from an event log. This ratio represents the average time between the arrival of two consecutive cases in the log.


Additionally, you can calculate the case dispersion ratio, which represents the average time between the completion of two consecutive cases in the log.


Performance Spectrum

The performance spectrum is a novel visualization of process performance, based on the time elapsed between different activities in the process executions. It was initially described in:

Denisov, Vadim, et al. "The Performance Spectrum Miner: Visual Analytics for Fine-Grained Performance Analysis of Processes." BPM (Dissertation/Demos/Industry). 2018.

The performance spectrum works with an event log and a list of activities considered for building the spectrum. In the following example, the performance spectrum is built from the receipt event log, which includes the "Confirmation of receipt", "T04 Determine confirmation of receipt", and "T10 Determine necessity to stop indication" activities. The event log is loaded, and the performance spectrum (containing timestamps for the different activities during process execution) is computed and visualized:


In the example, three horizontal lines represent the activities included in the spectrum, and the oblique lines between them represent the time elapsed between two activities. The oblique lines are colored according to the elapsed time, making it easier to identify the periods in which execution was most bottlenecked and to spot potential patterns (e.g., FIFO, LIFO).

Cycle Time and Waiting Time

Two important KPIs for process executions are:

  • The Lead Time: the overall time from the start to the completion of the process instance, including periods in which the instance was not actively worked on.
  • The Cycle Time: the total time in which the instance was actively worked on, from start to finish.

For interval event logs (those that have both start and end timestamps), the lead time and cycle time can be calculated incrementally for each event. The lead time and cycle time reported on the last event of a case correspond to the entire process execution. This helps to identify which activities caused bottlenecks, for example, when the lead time increases more significantly than the cycle time.

The PM4Py algorithm begins by sorting the events of each case by their start timestamp (ensuring that activities that started earlier are reported first in the log), and it can calculate the lead and cycle times in all situations, including complex ones. The following picture shows an example of this process:

The following attributes are added to events in the log:

Attributes

  • @@approx_bh_partial_cycle_time: the incremental cycle time associated with the event (the cycle time of the last event is the cycle time of the instance).
  • @@approx_bh_partial_lead_time: the incremental lead time associated with the event.
  • @@approx_bh_overall_wasted_time: the difference between the partial lead time and the partial cycle time.
  • @@approx_bh_this_wasted_time: the wasted time for the activity described by the 'interval' event.
  • @@approx_bh_ratio_cycle_lead_time: the incremental flow rate (between 0 and 1).

The method for calculating lead and cycle times can be applied with the following code:


This results in an enriched log, where each event contains the corresponding attributes for lead and cycle time.

Sojourn Time

This statistic works only with interval event logs, i.e., logs where each event has both a start timestamp and a completion timestamp.

The average sojourn time statistic calculates the time spent on each activity by averaging the time between the start and completion timestamps for the activity's events. Here’s an example of how to use it. First, we import an interval event log.


Then, we calculate the statistic, providing the attribute that represents the start timestamp and the attribute that represents the completion timestamp.


The same statistic can be applied seamlessly to Pandas dataframes, using the following alternative class:

pm4py.statistics.sojourn_time.pandas.get

Concurrent Activities

This statistic works only with interval event logs, i.e., logs where each event has both a start timestamp and a completion timestamp.

In interval event logs, the definition of event order is more flexible. A pair of events in a case can intersect in the following ways:

  • The start timestamp of one event is greater than or equal to the completion timestamp of the other event (the events do not overlap).
  • The start timestamp of one event is greater than or equal to the start timestamp of the other event but less than its completion timestamp (the events overlap).

The second case represents event-based concurrency, where several events are actively executed at the same time.

You may want to retrieve the set of activities for which such concurrent execution occurs, along with the frequency of these occurrences. PM4Py can perform this calculation. Here’s an example. First, we import an interval event log.


Then, we calculate the statistic, providing the attribute that represents the start timestamp and the attribute that represents the completion timestamp.


This statistic can also be applied seamlessly to Pandas dataframes using the following alternative class:

pm4py.statistics.concurrent_activities.pandas.get

Eventually Follows Graph

We provide an approach for calculating the Eventually-Follows Graph (EFG).

The EFG represents the partial order of events within the process executions of the log.

Our implementation works for both lifecycle logs (logs where each event has only one timestamp) and interval logs (logs where each event has both a start and a completion timestamp). In the case of interval logs, the start timestamp is used to define the EFG/partial order.

Specifically, the method assumes lifecycle logs when no start timestamp is passed in the parameters, and it assumes interval logs when a start timestamp is provided.

Here's an example. First, we import an interval event log.


Then, we calculate the statistic, providing the attribute representing the completion timestamp and optionally the attribute representing the start timestamp.


Displaying Graphs

Graphs help visualize various aspects of the current log, such as the distribution of numeric attributes, case duration, or events over time.

Distribution of Case Duration

The following example demonstrates the distribution of case duration using two different graphs: a simple plot and a semi-logarithmic plot (on the X-axis). The semi-logarithmic plot is less sensitive to potential outliers.

First, the receipt log is loaded, and then the distribution related to case duration is obtained. Both a simple plot and a semi-logarithmic plot (on the X-axis) can be generated.


Distribution of Events Over Time

The following example illustrates a graph representing the distribution of events over time. This is especially important as it shows the time intervals when the greatest number of events is recorded.

The distribution of events over time is calculated, and the graph is plotted accordingly.


Distribution of a Numeric Attribute

The following example demonstrates two graphs related to the distribution of a numeric attribute: a standard plot and a semi-logarithmic plot (on the X-axis), which is less sensitive to outliers.

First, a filtered version of the Road Traffic log is loaded, followed by the distribution of the numeric attribute 'amount'. Both the standard plot and the semi-logarithmic plot can be generated.


Dotted Chart

The dotted chart is a classic visualization technique for displaying events within an event log across different dimensions. Each event in the log corresponds to a point on the chart. The dimensions are projected onto the graph as follows:

  • X-axis: represents the values of the first dimension.
  • Y-axis: represents the values of the second dimension.
  • Color: represents the values of the third dimension, with points displayed in different colors.

The dimensions can include string, numeric, or date values, which are appropriately managed by the dotted chart.

A common use case for the dotted chart is to visualize the distribution of cases and events over time using the following dimensions:

  • X-axis: the timestamp of the event.
  • Y-axis: the index of the case in the event log.
  • Color: the activity of the event.

This setup enables the identification of patterns such as:

  • Batches of events.
  • Changes in the case arrival rate.
  • Changes in the case finishing rate.

Below are examples of how to create and visualize the dotted chart based on both default and custom attribute selections.

To create the default dotted chart for the receipt event log, use the following code:


To create a custom dotted chart for the receipt event log using "concept:name" (activity), "org:resource" (organizational resource), and "org:group" (organizational group) as dimensions, use the following code:


Event Distribution

Observing the distribution of events over time can provide valuable insights into work shifts, busy periods, and patterns throughout the year.

The distribution of events over time can be visualized by loading an event log and calculating the distribution across various time periods, such as hours of the day, days of the week, days of the month, months, or years.


The parameter distr_type can take the following values:

  • hours: plots the distribution over hours of the day.
  • days_week: plots the distribution over days of the week.
  • days_of_month: plots the distribution over days of the month.
  • months: plots the distribution over months of the year.
  • years: plots the distribution over different years in the log.

Detection of Batches

An activity is considered to be executed in batches by a given resource when the resource performs the same activity multiple times within a short period.

Identifying such activities can help uncover process areas that may benefit from automation, as repetitive activities may indicate inefficiencies.

Below is an example calculation using an event log to detect batches.


The results can be displayed as follows:


Our method can detect different types of batches, including:

  • Simultaneous: all events in the batch have identical start and end timestamps.
  • Batching at start: all events in the batch have identical start timestamps.
  • Batching at end: all events in the batch have identical end timestamps.
  • Sequential batching: for all consecutive events, the end of the first event is equal to the start of the second.
  • Concurrent batching: consecutive events that are not sequentially matched.

Rework Activities

The rework statistic identifies activities that have been repeated during the same process execution, revealing inefficiencies in the process.

In our implementation, rework is calculated based on an event log or Pandas dataframe, and it returns a dictionary that associates each activity with the number of cases that contain rework for that activity.

Below is an example calculation on an event log.


Rework Cases

Rework at the case level refers to the number of events in a case that repeat an activity already performed earlier in the case.

For example, if a case contains the activities A, B, A, B, C, D, the rework count is 2 since the events in positions 3 and 4 correspond to activities that have already occurred.

The rework statistic helps identify cases where many activities are repeated, indicating potential inefficiencies.

Below is an example calculation on an event log. After the computation, the resulting dictionary contains an entry for each of the six cases in the example log:

{ "1": { "number_activities": 5, "rework": 0 }, "2": { "number_activities": 5, "rework": 0 }, "3": { "number_activities": 9, "rework": 2 }, "4": { "number_activities": 5, "rework": 0 }, "5": { "number_activities": 13, "rework": 7 }, "6": { "number_activities": 5, "rework": 0 } }


Query Structure Paths Over Time

We provide a feature that enables querying information about the paths in the event log at a specific point in time or within a time interval using an interval tree data structure.

This is useful for quickly computing the workload of resources over a given time interval or for measuring the number of open cases at any given time.

To transform an event log into an interval tree, use the following code:


The following example computes the workload (number of events) for each resource within the specified interval.


The following example computes the number of open cases for each directly-follows path in the log.