PM4Py offers a variety of specific methods to filter event logs.
The following section presents various methods for filtering event logs based on time frames. For each method, both the log and Pandas DataFrame implementations are provided. You might be interested in retaining only the traces that fall within a specific time interval, such as from March 9, 2011, to January 18, 2012.
Additionally, it is possible to keep the traces that intersect with a time interval.
So far, only trace-based filtering techniques have been discussed. However, there is also a method to keep the events that fall within a specific timeframe.
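The difference between the three time-frame filters can be sketched in plain Python. This illustrates the semantics only and does not call PM4Py (recent releases expose these as modes of `pm4py.filter_time_range`). The toy log, the helper names, and the choice to drop cases left without events are assumptions made for the example.

```python
from datetime import datetime

# Toy log (assumed layout): case id -> ordered list of (activity, timestamp).
log = {
    "c1": [("register", datetime(2011, 3, 10)), ("pay", datetime(2011, 6, 1))],
    "c2": [("register", datetime(2011, 1, 5)), ("pay", datetime(2011, 4, 2))],
    "c3": [("register", datetime(2012, 2, 1)), ("pay", datetime(2012, 3, 1))],
}
start, end = datetime(2011, 3, 9), datetime(2012, 1, 18, 23, 59, 59)

def traces_contained(log, start, end):
    """Keep traces whose first and last events both lie in [start, end]."""
    return {c: t for c, t in log.items()
            if start <= t[0][1] and t[-1][1] <= end}

def traces_intersecting(log, start, end):
    """Keep traces whose time span overlaps [start, end]."""
    return {c: t for c, t in log.items()
            if t[0][1] <= end and t[-1][1] >= start}

def events_in_range(log, start, end):
    """Keep only events in [start, end]; cases left empty are dropped."""
    trimmed = {}
    for case_id, trace in log.items():
        kept = [event for event in trace if start <= event[1] <= end]
        if kept:
            trimmed[case_id] = kept
    return trimmed

contained = traces_contained(log, start, end)
intersecting = traces_intersecting(log, start, end)
trimmed = events_in_range(log, start, end)
```

Here `c2` starts before the interval, so it is excluded by the containment filter, kept whole by the intersection filter, and trimmed to its last event by the event-level filter.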
You can also filter on case duration: this filter keeps only the traces whose duration falls within a specified interval. In the examples, traces lasting between 1 and 10 days are kept. Note that the time parameters are given in seconds, so these bounds become 86,400 and 864,000.
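As a minimal sketch of the duration filter, the following plain-Python version measures each case from its first to its last event and compares the result, in seconds, against the bounds. It does not use PM4Py (where the corresponding call in recent releases is `pm4py.filter_case_performance`). The log layout and the name `filter_by_duration` are hypothetical.

```python
from datetime import datetime, timedelta

t0 = datetime(2011, 3, 9)
# Toy log (assumed layout): case id -> ordered list of (activity, timestamp).
log = {
    "short": [("register", t0), ("pay", t0 + timedelta(hours=2))],
    "medium": [("register", t0), ("pay", t0 + timedelta(days=3))],
    "long": [("register", t0), ("pay", t0 + timedelta(days=30))],
}
DAY = 86400  # seconds in one day

def filter_by_duration(log, min_seconds, max_seconds):
    """Keep cases whose first-to-last-event duration lies in the range."""
    kept = {}
    for case_id, trace in log.items():
        duration = (trace[-1][1] - trace[0][1]).total_seconds()
        if min_seconds <= duration <= max_seconds:
            kept[case_id] = trace
    return kept

# Keep cases lasting between 1 and 10 days.
filtered = filter_by_duration(log, 1 * DAY, 10 * DAY)
```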
PM4Py can filter a log or a DataFrame based on start activities, keeping only the traces that start with one of a specified set of activities. First, you may need to identify the start activities, for which code snippets are provided. Afterward, an example of filtering is shown. The first snippet works with a log object, while the second works with a DataFrame.
`log_start` is a dictionary where the key is the activity and the value is the number of occurrences.
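To make the shape of `log_start` concrete, here is a small plain-Python sketch that counts start activities and then filters on them. It mirrors what PM4Py's `pm4py.get_start_activities` and `pm4py.filter_start_activities` are described as doing, but does not call the library. The toy log and the helper `keep_cases_starting_with` are assumptions.

```python
from collections import Counter

# Toy log (assumed layout): case id -> ordered list of activity names.
log = {
    "c1": ["register", "check", "pay"],
    "c2": ["register", "pay"],
    "c3": ["check", "register", "pay"],
}

# Activity -> number of cases that start with it.
log_start = dict(Counter(trace[0] for trace in log.values() if trace))

def keep_cases_starting_with(log, allowed):
    """Keep cases whose first activity is in the allowed set."""
    allowed = set(allowed)
    return {c: t for c, t in log.items() if t and t[0] in allowed}

filtered = keep_cases_starting_with(log, ["register"])
```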
PM4Py also allows filtering by end activities. This filter keeps only traces that end with a specified set of activities. First, you may need to identify the end activities, for which a code snippet is provided.
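The end-activity case is symmetric and can be sketched the same way, looking at the last event of each trace instead of the first. This is plain Python on an assumed toy log, not PM4Py's own code (there, the matching calls in recent releases are `pm4py.get_end_activities` and `pm4py.filter_end_activities`).

```python
from collections import Counter

# Toy log (assumed layout): case id -> ordered list of activity names.
log = {
    "c1": ["register", "check", "pay"],
    "c2": ["register", "pay"],
    "c3": ["register", "check", "reject"],
}

# Activity -> number of cases that end with it.
log_end = dict(Counter(trace[-1] for trace in log.values() if trace))

def keep_cases_ending_with(log, allowed):
    """Keep cases whose last activity is in the allowed set."""
    allowed = set(allowed)
    return {c: t for c, t in log.items() if t and t[-1] in allowed}

filtered = keep_cases_ending_with(log, ["pay"])
```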
A variant refers to a set of cases that share the same control-flow perspective, meaning a set of cases that follow the same sequence of activities in the same order. In this section, we will first focus on log objects for all methods, and then we will cover DataFrame-based methods. To retrieve the variants from the log, you can use the following code snippet:
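The definition of a variant translates directly into code: group cases by their exact sequence of activities. The sketch below does this with a `Counter` over a toy log. The data layout is an assumption, and the result only approximates what `pm4py.get_variants` returns, since the exact return type has differed across PM4Py versions.

```python
from collections import Counter

# Toy log (assumed layout): case id -> ordered list of activity names.
log = {
    "c1": ["register", "check", "pay"],
    "c2": ["register", "check", "pay"],
    "c3": ["register", "pay"],
}

# Variant (tuple of activities, order preserved) -> number of cases.
variants = Counter(tuple(trace) for trace in log.values())

for variant, count in variants.most_common():
    print(" -> ".join(variant), count)
```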
To filter by a specific collection of variants, use the following code snippet:
Other variant-based filters are available. For example, filters on the top-k variants retain only the cases following one of the k most frequent variants:
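Both filtering by an explicit collection of variants and the top-k filter can be sketched in a few lines, with the second built on the first. This is a plain-Python illustration on assumed data. PM4Py's own calls (`pm4py.filter_variants` and `pm4py.filter_variants_top_k` in recent releases) are not used, and the tie-breaking shown here (first-seen order) is a property of this sketch, not a documented PM4Py behavior.

```python
from collections import Counter

# Toy log (assumed layout): case id -> ordered list of activity names.
log = {
    "c1": ["register", "check", "pay"],
    "c2": ["register", "check", "pay"],
    "c3": ["register", "pay"],
    "c4": ["register", "reject"],
    "c5": ["register", "pay"],
    "c6": ["register", "check", "pay"],
}

def filter_variants(log, variants):
    """Keep cases whose activity sequence is one of the given variants."""
    allowed = {tuple(v) for v in variants}
    return {c: t for c, t in log.items() if tuple(t) in allowed}

def filter_top_k_variants(log, k):
    """Keep cases following one of the k most frequent variants."""
    counts = Counter(tuple(t) for t in log.values())
    top = [variant for variant, _ in counts.most_common(k)]
    return filter_variants(log, top)

top2 = filter_top_k_variants(log, 2)
only_rejected = filter_variants(log, [("register", "reject")])
```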
The variant coverage filter retains only the traces whose variant covers at least a specified fraction of the cases in the log. For instance, with `min_coverage_percentage=0.4` and a log of 1000 cases, where 500 follow variant 1, 400 follow variant 2, and 100 follow variant 3, the filter keeps only the traces of variants 1 and 2: they cover 50% and 40% of the cases, while variant 3 covers only 10%.
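The 1000-case example can be reproduced with a short plain-Python sketch. The log construction, the activity names, and the function `filter_variants_by_coverage` are hypothetical; the inclusive comparison (`>=`) is chosen so that the result matches the example, in which the variant covering exactly 40% of the cases is kept.

```python
from collections import Counter

# Build the log from the example: 500 + 400 + 100 cases over three variants.
log = {}
for i in range(500):
    log[f"v1-{i}"] = ["register", "check", "pay"]
for i in range(400):
    log[f"v2-{i}"] = ["register", "pay"]
for i in range(100):
    log[f"v3-{i}"] = ["register", "reject"]

def filter_variants_by_coverage(log, min_coverage):
    """Keep cases whose variant covers at least min_coverage of all cases."""
    counts = Counter(tuple(t) for t in log.values())
    total = len(log)
    allowed = {v for v, n in counts.items() if n / total >= min_coverage}
    return {c: t for c, t in log.items() if tuple(t) in allowed}

kept = filter_variants_by_coverage(log, 0.4)
kept_variants = {tuple(t) for t in kept.values()}
```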
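Filtering by attribute values allows you to keep or remove the cases that contain at least one event with one of the given attribute values, or to keep only the events that have one of those values, trimming the cases accordingly.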
Examples of attributes include the resource (typically contained in the `org:resource` attribute) and the activity (usually found in the `concept:name` attribute). The first method can be applied to log objects, while the second applies to DataFrame objects. To get the list of resources and activities contained in the log, you can use the following code.
To filter traces containing or not containing a given list of resources, use the following code:
You can also keep only the events performed by a given list of resources, trimming the cases accordingly. Use the following code for this:
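The case-level and event-level behaviors described above can be contrasted in one small plain-Python sketch. The resource names, the log layout, and the helper `filter_attribute` with its `level` and `retain` arguments are assumptions for illustration. They loosely follow the shape of `pm4py.filter_event_attribute_values`, but the code below does not call PM4Py.

```python
from collections import Counter

# Toy log (assumed layout): case id -> ordered list of event dicts.
log = {
    "c1": [{"concept:name": "register", "org:resource": "Sara"},
           {"concept:name": "pay", "org:resource": "Mike"}],
    "c2": [{"concept:name": "register", "org:resource": "Pete"},
           {"concept:name": "pay", "org:resource": "Pete"}],
}

# Resource -> number of events performed.
resources = Counter(e["org:resource"] for t in log.values() for e in t)

def filter_attribute(log, key, values, level="case", retain=True):
    """level='case': keep (retain=True) or drop whole cases with a match.
    level='event': keep or drop matching events, trimming the cases."""
    values = set(values)
    result = {}
    for case_id, trace in log.items():
        if level == "case":
            if any(e.get(key) in values for e in trace) == retain:
                result[case_id] = trace
        else:
            kept = [e for e in trace if (e.get(key) in values) == retain]
            if kept:  # cases left without events are dropped
                result[case_id] = kept
    return result

with_sara = filter_attribute(log, "org:resource", ["Sara"])
without_sara = filter_attribute(log, "org:resource", ["Sara"], retain=False)
sara_events = filter_attribute(log, "org:resource", ["Sara"], level="event")
```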
Filtering by numeric attribute values offers options similar to filtering by string attributes. First, we import the log, then filter it to keep only the events whose numeric attribute value lies between 34 and 36. An additional filter can be applied to retain only the cases with at least one event within the specified range. If you're interested in cases where a specific activity, such as "Add penalty", has an amount between 34 and 500, the following code snippet can be used:
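The three numeric filters described here can be sketched in plain Python as follows. The attribute name `amount`, the toy values, and the helper names are assumptions; the sketch illustrates the semantics rather than PM4Py's own implementation.

```python
# Toy log (assumed layout): case id -> ordered list of event dicts.
log = {
    "c1": [{"concept:name": "Create Fine", "amount": 35.0},
           {"concept:name": "Add penalty", "amount": 71.5}],
    "c2": [{"concept:name": "Create Fine", "amount": 80.0},
           {"concept:name": "Add penalty", "amount": 600.0}],
    "c3": [{"concept:name": "Create Fine", "amount": 34.0}],
}

def in_range(event, key, low, high):
    """True if the event has the attribute and its value is in [low, high]."""
    value = event.get(key)
    return value is not None and low <= value <= high

def keep_events_in_range(log, key, low, high):
    """Event level: keep matching events, dropping cases left empty."""
    result = {}
    for case_id, trace in log.items():
        kept = [e for e in trace if in_range(e, key, low, high)]
        if kept:
            result[case_id] = kept
    return result

def keep_cases_in_range(log, key, low, high, activity=None):
    """Case level: keep whole cases with at least one matching event,
    optionally requiring that event to have the given activity."""
    def matches(event):
        if activity is not None and event.get("concept:name") != activity:
            return False
        return in_range(event, key, low, high)
    return {c: t for c, t in log.items() if any(matches(e) for e in t)}

events_34_36 = keep_events_in_range(log, "amount", 34, 36)
cases_34_36 = keep_cases_in_range(log, "amount", 34, 36)
penalty_cases = keep_cases_in_range(log, "amount", 34, 500, "Add penalty")
```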
The between filter transforms the event log by identifying subcases that span from a source activity to a target activity. This is useful for analyzing behavior between two activities, such as throughput time, activity inclusion, or conformance levels. The filter between two activities is applied as follows:
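One way to picture the subcase extraction is the sketch below, which scans a single trace and emits one subcase per source-to-target span. It is written in plain Python under stated assumptions: a subcase opens at the first source activity seen, closes at the next target activity, and an unclosed span is discarded. PM4Py's `pm4py.filter_between` may differ in such edge cases.

```python
def subcases_between(trace, source, target):
    """Split one trace into the subcases running from source to target."""
    subcases = []
    current = None  # the subcase being collected, if any
    for activity in trace:
        if current is None:
            if activity == source:
                current = [activity]
        else:
            current.append(activity)
            if activity == target:
                subcases.append(current)
                current = None
    return subcases  # a span that never reaches target is dropped

trace = ["register", "check", "fix", "approve", "check", "approve", "pay"]
spans = subcases_between(trace, "check", "approve")
```

A single case can therefore contribute several subcases to the transformed log, which is what makes the result usable for measuring throughput time between the two activities.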
The case size filter retains only the cases in the log that have a number of events within a user-specified range. This filter can be used to eliminate cases that are too short (possibly incomplete or outliers) or too long (indicating excessive rework). The case size filter can be applied as follows:
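A minimal sketch of the case size filter, in plain Python on an assumed toy log, is shown below. The bounds are treated as inclusive here, which is an assumption of the sketch. In PM4Py itself the corresponding call in recent releases is `pm4py.filter_case_size`.

```python
# Toy log (assumed layout): case id -> ordered list of activity names.
log = {
    "incomplete": ["register"],
    "normal": ["register", "check", "pay"],
    "rework": ["register"] + ["check", "fix"] * 6 + ["pay"],
}

def filter_case_size(log, min_size, max_size):
    """Keep cases whose number of events lies in [min_size, max_size]."""
    return {c: t for c, t in log.items() if min_size <= len(t) <= max_size}

filtered = filter_case_size(log, 2, 10)
```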
The rework filter identifies cases where a given activity has been repeated. For instance, it can be used to search for cases that contain at least two occurrences of the activity "reinitiate request".
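The rework condition amounts to counting how often the activity occurs in each case, as in this plain-Python sketch. The toy log and the helper name are assumptions. The PM4Py call for this filter in recent releases is `pm4py.filter_activities_rework`, which is not used here.

```python
# Toy log (assumed layout): case id -> ordered list of activity names.
log = {
    "c1": ["register", "reinitiate request", "check",
           "reinitiate request", "pay"],
    "c2": ["register", "reinitiate request", "pay"],
    "c3": ["register", "pay"],
}

def filter_rework(log, activity, min_occurrences=2):
    """Keep cases where the activity occurs at least min_occurrences times."""
    return {c: t for c, t in log.items()
            if t.count(activity) >= min_occurrences}

reworked = filter_rework(log, "reinitiate request", 2)
```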
The path performance filter identifies cases in which the duration of a given path between two activities falls within a specified range. This is useful for identifying cases where a significant amount of time has passed between two activities. The filter is applied as follows:
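As a sketch of the idea, the following plain-Python version keeps a case when some occurrence of the path takes between the minimum and maximum number of seconds. It assumes that a "path" means the second activity directly follows the first, and that one matching occurrence is enough to keep the case. Both are assumptions of this example, and the code does not call PM4Py's `pm4py.filter_paths_performance`.

```python
from datetime import datetime, timedelta

t0 = datetime(2011, 3, 9)
# Toy log (assumed layout): case id -> ordered list of (activity, timestamp).
log = {
    "fast": [("check", t0), ("pay", t0 + timedelta(hours=2))],
    "slow": [("check", t0), ("pay", t0 + timedelta(days=3))],
    "indirect": [("check", t0), ("fix", t0 + timedelta(days=1)),
                 ("pay", t0 + timedelta(days=3))],
}
DAY = 86400  # seconds in one day

def filter_path_duration(log, path, min_seconds, max_seconds):
    """Keep cases with a directly-follows occurrence of path whose
    duration in seconds lies in [min_seconds, max_seconds]."""
    kept = {}
    for case_id, trace in log.items():
        for (a1, ts1), (a2, ts2) in zip(trace, trace[1:]):
            elapsed = (ts2 - ts1).total_seconds()
            if (a1, a2) == path and min_seconds <= elapsed <= max_seconds:
                kept[case_id] = trace
                break
    return kept

delayed = filter_path_duration(log, ("check", "pay"), 1 * DAY, 10 * DAY)
```

In this toy log only `slow` is kept: `fast` takes two hours, and in `indirect` the activity "pay" does not directly follow "check".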