Feature Selection

Feature selection operations allow the event log to be represented in a tabular format. This is crucial for tasks such as prediction and anomaly detection.

Automatic Feature Selection

In PM4Py, we offer methods to perform automatic feature selection. As an example, let's import the receipt log and apply automatic feature selection. First, we import the receipt log:
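A minimal sketch (the file path is an assumption based on the PM4Py test data layout):

    import pm4py

    # assumption: the receipt log is available at this path
    log = pm4py.read_xes("tests/input_data/receipt.xes")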


Then, let's perform automatic feature selection:
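A possible sketch using the log_to_features transformation:

    from pm4py.algo.transformation.log_to_features import algorithm as log_to_features

    # returns the feature matrix and the list of feature names
    data, feature_names = log_to_features.apply(log)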


Printing the value of feature_names, we observe that the following attributes were selected:

  • The channel attribute at the trace level (with values: Desk, Intern, Internet, Post, e-mail).
  • The department attribute at the trace level (with values: Customer contact, Experts, General).
  • The group attribute at the event level (with values: EMPTY, Group 1, Group 12, Group 13, Group 14, Group 15, Group 2, Group 3, Group 4, Group 7).

No numeric attribute was selected. The printed feature_names are represented as:

[ trace:channel@Desk, trace:channel@Intern, trace:channel@Internet, trace:channel@Post, trace:channel@e-mail, trace:department@Customer contact, trace:department@Experts, trace:department@General, event:org:group@EMPTY, event:org:group@Group 1, event:org:group@Group 12, event:org:group@Group 13, event:org:group@Group 14, event:org:group@Group 15, event:org:group@Group 2, event:org:group@Group 3, event:org:group@Group 4, event:org:group@Group 7 ].

As shown, different features correspond to different attribute values. This technique is called one-hot encoding: a case is assigned a value of 0 if it does not contain an event with the given attribute value, and 1 if it contains at least one such event.

Representing the features as a dataframe:
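A possible sketch:

    import pandas as pd

    # one row per case, one column per one-hot-encoded feature
    features_df = pd.DataFrame(data, columns=feature_names)
    print(features_df)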


We can observe the features assigned to each individual case.

Manual Feature Selection

Manual feature selection allows users to specify which attributes should be included. These may include, for example:

  • Activities performed during process execution (usually stored in the event attribute concept:name).
  • Resources performing the process execution (usually stored in the event attribute org:resource).
  • Selected numeric attributes, at the user's discretion.

To perform manual feature selection, we use the method log_to_features.apply. The following types of features can be considered:

  • str_ev_attr: String attributes at the event level, one-hot encoded to assume values of 0 or 1.
  • str_tr_attr: String attributes at the trace level, one-hot encoded to assume values of 0 or 1.
  • num_ev_attr: Numeric attributes at the event level, encoded by taking the last value observed among the trace's events.
  • num_tr_attr: Numeric attributes at the trace level, encoded by including their numeric value.
  • str_evsucc_attr: Successions of string attribute values at the event level: for instance, given a trace [A, B, C], features will include not only A, B, and C individually, but also the directly-follows pairs (A, B) and (B, C).

For example, consider a feature selection where we are interested in:

  • Whether a process execution contains a specific activity.
  • Whether a process execution involves a specific resource.
  • Whether a process execution contains a specific directly-follows path between activities.
  • Whether a process execution contains a specific directly-follows path between resources.

In this case, the number of features becomes significantly larger.
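A possible sketch (concept:name and org:resource are the standard XES keys for activities and resources):

    from pm4py.algo.transformation.log_to_features import algorithm as log_to_features

    data, feature_names = log_to_features.apply(log, parameters={
        "str_ev_attr": ["concept:name", "org:resource"],
        "str_tr_attr": [],
        "num_ev_attr": [],
        "num_tr_attr": [],
        # directly-follows pairs of activities and of resources
        "str_evsucc_attr": ["concept:name", "org:resource"],
    })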


Calculating Useful Features

Other important features include the cycle time and the lead time associated with a case. In this context, we may assume one of the following:

  • A log with lifecycles, where each event is instantaneous,
  • Or an interval log, where events are associated with two timestamps (start and end).

Lead and cycle times can be calculated directly from interval logs. If we have a lifecycle log, we first need to convert it using:
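A possible sketch:

    from pm4py.objects.log.util import interval_lifecycle

    # collapses start/complete lifecycle events into single interval events
    log = interval_lifecycle.to_interval(log)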


After conversion, features such as lead and cycle times can be added using the following instructions:
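A possible sketch:

    from pm4py.objects.log.util import interval_lifecycle
    from pm4py.util import constants

    # enriches each event with the @@approx_bh_* attributes listed below
    log = interval_lifecycle.assign_lead_cycle_time(log, parameters={
        constants.PARAMETER_CONSTANT_START_TIMESTAMP_KEY: "start_timestamp",
        constants.PARAMETER_CONSTANT_TIMESTAMP_KEY: "time:timestamp",
    })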


Once the start timestamp attribute (e.g., start_timestamp) and the timestamp attribute (e.g., time:timestamp) are provided, the following features are returned:

  • @@approx_bh_partial_cycle_time: Incremental cycle time associated with the event (the final event's cycle time is the instance's total cycle time).
  • @@approx_bh_partial_lead_time: Incremental lead time associated with the event.
  • @@approx_bh_overall_wasted_time: Difference between the partial lead time and the partial cycle time.
  • @@approx_bh_this_wasted_time: Wasted time specifically related to the activity described by the 'interval' event.
  • @@approx_bh_ratio_cycle_lead_time: Measures the incremental flow rate (ranging from 0 to 1).

Since these are all numerical attributes, we can further refine the feature extraction by applying:
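A possible sketch that selects the enriched attributes as numeric event-level features:

    data, feature_names = log_to_features.apply(log, parameters={
        "str_ev_attr": [],
        "str_tr_attr": [],
        "num_tr_attr": [],
        "num_ev_attr": [
            "@@approx_bh_partial_cycle_time",
            "@@approx_bh_partial_lead_time",
            "@@approx_bh_overall_wasted_time",
            "@@approx_bh_this_wasted_time",
            "@@approx_bh_ratio_cycle_lead_time",
        ],
    })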


Additionally, we offer the calculation of further intra- and inter-case features, which can be enabled by setting boolean parameters in the log_to_features.apply method, including the following (a usage sketch follows the list):

  • ENABLE_CASE_DURATION: Adds case duration as an additional feature.
  • ENABLE_TIMES_FROM_FIRST_OCCURRENCE: Adds times measured from the first occurrence of an activity within the case.
  • ENABLE_TIMES_FROM_LAST_OCCURRENCE: Adds times measured from the last occurrence of an activity within the case.
  • ENABLE_DIRECT_PATHS_TIMES_LAST_OCC: Adds the duration of the last occurrence of a directed (i, i+1) path as a feature.
  • ENABLE_INDIRECT_PATHS_TIMES_LAST_OCC: Adds the duration of the last occurrence of an indirect (i, j) path as a feature.
  • ENABLE_WORK_IN_PROGRESS: Adds the number of concurrent cases as a feature (work in progress).
  • ENABLE_RESOURCE_WORKLOAD: Adds the workload of resources as a feature.
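For instance, some of these switches might be enabled as follows (a sketch; the lowercase string keys are an assumption about how the enum members are valued):

    from pm4py.algo.transformation.log_to_features import algorithm as log_to_features

    # assumption: the boolean switches map to lowercase string parameter keys
    data, feature_names = log_to_features.apply(log, parameters={
        "enable_case_duration": True,
        "enable_work_in_progress": True,
        "enable_resource_workload": True,
    })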

PCA - Reducing the Number of Features

Techniques such as clustering, prediction, and anomaly detection can suffer when the dataset has too many features. Therefore, dimensionality reduction techniques (like PCA) help manage the complexity of the data. Starting from a Pandas dataframe generated from the extracted features:
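A possible sketch:

    import pandas as pd

    df = pd.DataFrame(data, columns=feature_names)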


It is possible to reduce the number of features using PCA. For example, we can create a PCA model with 5 components and apply it to the dataframe:
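A possible sketch using scikit-learn:

    import pandas as pd
    from sklearn.decomposition import PCA

    # project the feature table onto its 5 main components
    pca = PCA(n_components=5)
    df2 = pd.DataFrame(pca.fit_transform(df))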


In this way, more than 400 columns are reduced to 5 principal components that capture most of the data variance.

Anomaly Detection

In this section, we focus on calculating an anomaly score for each case. This score is based on the extracted features and works best when combined with dimensionality reduction (such as PCA). We can apply the IsolationForest method (from scikit-learn) to the dataframe, which adds a column of scores: cases with a score ≤ 0 are considered anomalous, while those with a score > 0 are not.
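A possible sketch (the name of the score column is an arbitrary choice):

    from sklearn.ensemble import IsolationForest

    model = IsolationForest()
    model.fit(df2)
    # decision_function returns the anomaly score: <= 0 is anomalous
    df2["scores"] = model.decision_function(df2)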


To identify the most anomalous cases, we can sort the dataframe after inserting an index. The resulting output highlights the most anomalous cases:
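A possible sketch (the @@index column name is an arbitrary choice):

    # keep track of the original case position, then rank by ascending score
    df2["@@index"] = df2.index
    df2 = df2[["scores", "@@index"]].sort_values("scores")
    print(df2)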


Evolution of the Features

We might be interested in observing how features evolve over time to detect positions in the event log that show behavior different from the mainstream. PM4Py provides a method to graph feature evolution over time. Here is an example:
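A sketch, assuming the locally linear embedding utility shipped with the log_to_features package and the DATES variant of the graphs visualizer:

    from pm4py.algo.transformation.log_to_features.util import locally_linear_embedding
    from pm4py.visualization.graphs import visualizer

    # x: timestamps of the cases; y: one-dimensional embedding of their features
    x, y = locally_linear_embedding.apply(log)
    gviz = visualizer.apply(x, y, variant=visualizer.Variants.DATES)
    visualizer.view(gviz)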


Event-based Feature Extraction

Some machine learning methods (e.g., LSTM-based deep learning) require features at the event level, instead of aggregating features at the case level. In these methods, each event is represented as a numerical row containing features related to that event. We can perform a default event-based feature extraction as follows:
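A possible sketch using the event-based variant:

    from pm4py.algo.transformation.log_to_features import algorithm as log_to_features

    data, feature_names = log_to_features.apply(log, variant=log_to_features.Variants.EVENT_BASED)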


Alternatively, it is possible to manually specify the features to be extracted. The parameters str_ev_attr and num_ev_attr correspond to those described in previous sections:
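A possible sketch that one-hot encodes the activity of each event:

    data, feature_names = log_to_features.apply(log,
        variant=log_to_features.Variants.EVENT_BASED,
        parameters={"str_ev_attr": ["concept:name"], "num_ev_attr": []})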


Decision Tree About the Ending Activity of a Process

Decision trees are tools that help understand the conditions leading to a particular outcome. In this section, several examples related to the construction of decision trees are provided. The ideas behind building decision trees are discussed in the scientific paper: de Leoni, Massimiliano, Wil M. P. van der Aalst, and Marcus Dees. "A General Process Mining Framework for Correlating, Predicting, and Clustering Dynamic Behavior Based on Event Logs." Information Systems 56 (2016): 235-257.

The general procedure is as follows:

  • Obtain a representation of the log based on a given set of features (e.g., using one-hot encoding for string attributes and preserving numeric attributes as they are).
  • Construct a representation of the target classes.
  • Build the decision tree.
  • Visualize the decision tree.

A process instance may potentially finish with different activities, signaling different outcomes. A decision tree can help understand the reasons behind each outcome. First, a log is loaded, and then a feature-based representation of the log is created.
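A possible sketch (the log path and the selected attributes are assumptions; amount is a cost attribute of the Road Traffic sample log):

    import pm4py
    from pm4py.algo.transformation.log_to_features import algorithm as log_to_features

    log = pm4py.read_xes("tests/input_data/roadtraffic50traces.xes")
    data, feature_names = log_to_features.apply(log, parameters={
        "str_ev_attr": ["concept:name"], "str_tr_attr": [],
        "num_ev_attr": ["amount"], "num_tr_attr": []})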

Alternatively, an automatic feature representation (automatic attribute selection) can be obtained:
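A possible sketch:

    data, feature_names = log_to_features.apply(log)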

(Optional) The extracted features can be represented as a Pandas DataFrame:
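A possible sketch:

    import pandas as pd

    dataframe = pd.DataFrame(data, columns=feature_names)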

(Optional) The DataFrame can then be exported as a CSV file:
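A possible sketch (the output path is an arbitrary choice):

    dataframe.to_csv("features.csv", index=False)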

Next, the target classes are defined: each endpoint activity of the process instance is assigned to a different class.
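A possible sketch using the class-representation utility:

    from pm4py.objects.log.util import get_class_representation

    # derive target classes for the traces from the values of concept:name
    target, classes = get_class_representation.get_class_representation_by_str_ev_attr_value_value(log, "concept:name")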

The decision tree is then built and visualized:
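A possible sketch with scikit-learn and the PM4Py decision tree visualizer (the maximum depth is an arbitrary choice):

    from sklearn import tree
    from pm4py.visualization.decisiontree import visualizer as dectree_visualizer

    clf = tree.DecisionTreeClassifier(max_depth=7)
    clf.fit(data, target)
    gviz = dectree_visualizer.apply(clf, feature_names, classes)
    dectree_visualizer.view(gviz)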

Decision Tree About the Duration of a Case

A decision tree regarding the duration of a case helps understand the factors behind a high case duration (i.e., durations above a given threshold). First, a log is loaded, and a feature-based representation is created.
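A possible sketch (same assumptions as in the previous section):

    import pm4py
    from pm4py.algo.transformation.log_to_features import algorithm as log_to_features

    log = pm4py.read_xes("tests/input_data/roadtraffic50traces.xes")
    data, feature_names = log_to_features.apply(log, parameters={
        "str_ev_attr": ["concept:name"], "num_ev_attr": ["amount"]})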

Alternatively, an automatic feature representation can be generated:
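A possible sketch:

    data, feature_names = log_to_features.apply(log)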

Then, the target classes are formed:

  • Traces with a duration below a given threshold (e.g., 200 days, with time measured in seconds).
  • Traces with a duration above the threshold.
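A possible sketch:

    from pm4py.objects.log.util import get_class_representation

    # threshold of 200 days, expressed in seconds
    target, classes = get_class_representation.get_class_representation_by_trace_duration(log, 200 * 86400)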

The decision tree is then built and visualized:
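A possible sketch (identical in structure to the previous decision tree):

    from sklearn import tree
    from pm4py.visualization.decisiontree import visualizer as dectree_visualizer

    clf = tree.DecisionTreeClassifier(max_depth=7)
    clf.fit(data, target)
    gviz = dectree_visualizer.apply(clf, feature_names, classes)
    dectree_visualizer.view(gviz)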

Decision Mining

Decision mining takes as input:

  • An event log,
  • A process model (an accepting Petri net),
  • A decision point.

It retrieves the features of the cases that take the different paths at the decision point. This allows, for example, building a decision tree that explains the choices made.

First, import an XES log:
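A possible sketch (the path to the running-example log is an assumption):

    import pm4py

    log = pm4py.read_xes("tests/input_data/running-example.xes")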

Next, calculate a model using the Inductive Miner:
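A possible sketch:

    net, im, fm = pm4py.discover_petri_net_inductive(log)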

To visualize the model:
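A possible sketch (note that spotting internal place names such as p_10 may require the debug mode of the lower-level Petri net visualizer):

    pm4py.view_petri_net(net, im, fm)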

For this example, we select decision point p_10, where a choice is made between the activities examine casually and examine thoroughly. Once we have a log, a model, and a decision point, the decision mining algorithm can be executed:
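A possible sketch:

    from pm4py.algo.decision_mining import algorithm as decision_mining

    X, y, class_names = decision_mining.apply(log, net, im, fm, decision_point="p_10")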

The outputs of the apply method are:

  • X: A Pandas DataFrame containing the features associated with each case leading to a decision.
  • y: A Pandas Series containing the class (output) of each decision (e.g., 0 or 1).
  • class_names: The names of the possible decision outcomes (e.g., examine casually and examine thoroughly).

These outputs can be used with any classification or comparison technique. In particular, decision trees are a useful choice. We provide a function to automatically discover decision trees from decision mining results:
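A possible sketch:

    clf, feature_names, classes = decision_mining.get_decision_tree(log, net, im, fm, decision_point="p_10")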

To visualize the resulting decision tree:
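A possible sketch:

    from pm4py.visualization.decisiontree import visualizer as tree_visualizer

    gviz = tree_visualizer.apply(clf, feature_names, classes)
    tree_visualizer.view(gviz)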

Feature Extraction on DataFrames

While the feature extraction described above is generic, it might not be optimal (performance-wise) when working directly with Pandas DataFrames. We also offer the option to extract a feature table by providing:

  • The DataFrame,
  • A set of columns to use as features.

The output is another DataFrame containing:

  • The case identifier.
  • For each string attribute: a one-hot encoding counting the number of occurrences for each possible value.
  • For each numeric attribute: the last value observed within each case.

Here is an example that keeps concept:name (activity) and amount (cost) as features:
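A possible sketch (the CSV path refers to the Road Traffic sample; the keyword names follow the simplified PM4Py interface and should be checked against the installed version):

    import pandas as pd
    import pm4py

    dataframe = pd.read_csv("tests/input_data/roadtraffic100traces.csv")
    dataframe = pm4py.format_dataframe(dataframe)

    features_df = pm4py.extract_features_dataframe(dataframe,
        str_ev_attr=["concept:name"], num_ev_attr=["amount"])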

The resulting feature table will contain columns such as:

['case:concept:name', 'concept:name_CreateFine', 'concept:name_SendFine', 'concept:name_InsertFineNotification', 'concept:name_Addpenalty', 'concept:name_SendforCreditCollection', 'concept:name_Payment', 'concept:name_InsertDateAppealtoPrefecture', 'concept:name_SendAppealtoPrefecture', 'concept:name_ReceiveResultAppealfromPrefecture', 'concept:name_NotifyResultAppealtoOffender', 'amount']

Discovery of a Data Petri Net

Given a Petri net discovered by a classical process mining algorithm (e.g., Alpha Miner or Inductive Miner), we can enhance it into a Data Petri Net by applying decision mining at every decision point, and transforming the resulting decision trees into guards (boolean conditions).

An example:
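A possible sketch (the log path is an assumption):

    import pm4py
    from pm4py.algo.decision_mining import algorithm as decision_mining

    log = pm4py.read_xes("tests/input_data/running-example.xes")
    net, im, fm = pm4py.discover_petri_net_inductive(log)
    # applies decision mining at every decision point and attaches guards
    net, im, fm = decision_mining.create_data_petri_nets_with_decisions(log, net, im, fm)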

The guards discovered for each transition can be printed. They are expressed as boolean conditions and interpreted by the execution engine:
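A possible sketch:

    for t in net.transitions:
        if "guard" in t.properties:
            print(t)
            print(t.properties["guard"])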

Temporal Feature Extraction

The PM4Py library provides a method to extract temporal features from an event log, event stream, or Pandas DataFrame, as described in the paper by Pourbafrani et al. (2020). This method groups events by a specified time granularity (e.g., weekly) and computes aggregated metrics to represent process behavior over time.

To apply temporal feature extraction, you can use the apply function of the temporal variant. Here is an example:
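A possible sketch, assuming the temporal variant is exposed under the log_to_features variants package:

    import pm4py
    from pm4py.algo.transformation.log_to_features.variants import temporal

    log = pm4py.read_xes("tests/input_data/receipt.xes")
    # groups events weekly (the default) and aggregates the temporal metrics
    features_df = temporal.apply(log)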


The resulting features_df is a Pandas DataFrame containing temporal features grouped by the specified frequency (e.g., weekly).

Parameters

The algorithm accepts several parameters to customize the feature extraction process. These are defined in the Parameters enum and include:

  • GROUPER_FREQ: Time interval for grouping events (e.g., "W" for weekly, "D" for daily). Default: "W".
  • ARRIVAL_RATE: Column name for the arrival rate of cases. Default: "arrival_rate".
  • FINISH_RATE: Column name for the completion rate of cases. Default: "finish_rate".
  • SERVICE_TIME: Column name for the service time (time spent on activities). Default: "service_time".
  • WAITING_TIME: Column name for the waiting time (time spent idle). Default: "waiting_time".
  • SOJOURN_TIME: Column name for the sojourn time (total time from start to end). Default: "sojourn_time".
  • CASE_ID_COLUMN: Column name for case IDs. Default: "case:concept:name".
  • ACTIVITY_COLUMN: Column name for activities. Default: "concept:name".
  • TIMESTAMP_COLUMN: Column name for event timestamps. Default: "time:timestamp".
  • START_TIMESTAMP_COLUMN: Column name for start timestamps (if available). Defaults to the timestamp column.
  • RESOURCE_COLUMN: Column name for resources. Default: "org:resource".

These parameters allow users to specify the granularity and naming conventions for the extracted features, tailoring the output to their specific needs.

Process Overview

The temporal feature extraction process involves the following steps:

  1. Log Conversion: The input log (EventLog, EventStream, or DataFrame) is converted to a Pandas DataFrame for processing.
  2. Arrival and Finish Rates: The algorithm calculates the arrival rate (how often new cases start) and finish rate (how often cases complete) for each case, adding these as columns to the DataFrame.
  3. Service, Waiting, and Sojourn Times: For each case, the algorithm computes:
    • Service Time: Time spent actively processing activities.
    • Waiting Time: Time spent idle between activities.
    • Sojourn Time: Total time from the start to the end of a case.
  4. Grouping by Time: Events are grouped by the specified time frequency (e.g., weekly) based on the start timestamp.
  5. Feature Aggregation: For each time group, the algorithm computes:
    • Number of unique resources, cases, and activities.
    • Total number of events.
    • Average arrival and finish rates.
    • Average service, waiting, and sojourn times.
  6. Output: A DataFrame is returned with columns for the timestamp of each group and the computed features. Missing values are filled with 0.

The resulting DataFrame provides a tabular representation of temporal process characteristics, suitable for further analysis or visualization.

Output Structure

The concrete values depend on the event log. Each row of the output DataFrame corresponds to one time group (for example, one week) and contains the counts of unique resources, cases, and activities in that group, the total number of events, and the averages of the arrival rate, finish rate, and the service, waiting, and sojourn times.

Use Cases

Temporal feature extraction is particularly useful for:

  • Process Simulation: Generating system dynamics models for simulation, as described in the referenced paper.
  • Performance Analysis: Identifying bottlenecks by analyzing waiting and service times.
  • Trend Detection: Observing how process metrics evolve over time to detect anomalies or shifts in behavior.
  • Predictive Modeling: Using temporal features as inputs for machine learning models to predict process outcomes.

Advanced Usage

To customize the feature extraction further, you can modify the parameters. For example, to group by days and use custom column names:
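A possible sketch (same module assumption as above; the custom column names are arbitrary):

    from pm4py.algo.transformation.log_to_features.variants import temporal

    features_df = temporal.apply(log, parameters={
        temporal.Parameters.GROUPER_FREQ: "D",
        temporal.Parameters.SERVICE_TIME: "my_service_time",
        temporal.Parameters.WAITING_TIME: "my_waiting_time",
    })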


This allows the algorithm to adapt to different log formats and analysis requirements.

References

The approach is based on the following paper:

  • Pourbafrani, Mahsa, Sebastiaan J. van Zelst, and Wil M. P. van der Aalst. "Supporting Automatic System Dynamics Model Generation for Simulation in the Context of Process Mining." International Conference on Business Information Systems. Springer, Cham, 2020.