Supported/Described Version(s): pm4py 2.7.11.9
This documentation assumes that the reader has a basic understanding of process mining and Python concepts.
Handling Event Data
Importing IEEE XES files
IEEE XES is a standard format describing how event logs are stored.
For more information about the format, please study the IEEE XES Website.
A simple synthetic event log (running-example.xes) can be downloaded from here.
Note that several real event logs have been made available over the past few years. You can find them here.
The example code on the right shows how to import an event log, stored in the IEEE
XES format, given a file path to the log file.
The code fragment uses the standard importer (iterparse, described in a later
paragraph).
Note that IEEE XES Event Logs are imported into a Pandas dataframe object.
import pm4py
if __name__ == "__main__":
log = pm4py.read_xes('tests/input_data/running-example.xes')
Apart from the IEEE XES standard, a lot of event logs are actually stored in CSV files.
In general, there are two ways to deal with CSV files in pm4py:
Import the CSV into a pandas DataFrame;
In general, most existing algorithms in pm4py are coded to be flexible in terms
of their
input, i.e., if a certain event log object is provided that is not in the right
form, we
translate it to the appropriate form for you.
Hence, after importing a dataframe, most algorithms are directly able to work
with the
data frame.
Convert the CSV into an event log object (similar to the result of the IEEE XES
importer
presented in the previous section);
In this case, the first step is to import the CSV file using pandas (similar to
previous bullet) and subsequently converting it to the event log object.
In the remainder of this section, we briefly highlight how to convert a pandas
DataFrame
to an event log.
Note that most algorithms use the same type of conversion, in case a given
event data
object is not of the right type.
The example code on the right shows how to convert a CSV file into the pm4py
internal event data object types.
By default, the converter converts the dataframe to an Event Log object (i.e., not
an Event Stream).
import pandas as pd
import pm4py
if __name__ == "__main__":
    dataframe = pd.read_csv('tests/input_data/running-example.csv', sep=',')
    dataframe = pm4py.format_dataframe(dataframe, case_id='case:concept:name', activity_key='concept:name', timestamp_key='time:timestamp')
    event_log = pm4py.convert_to_event_log(dataframe)
Note that the example code above does not work directly in many cases. Let us consider a very simple example event log, and assume it is stored as a CSV file:
In this small example table, we observe four columns, i.e., CaseID,
Activity,
Timestamp and clientID.
Clearly, when importing the data and converting it to an Event Log object, we aim to
combine all rows (events) with the same value for the CaseID column
together.
Another interesting phenomenon in the example data is the fourth column, i.e.,
clientID.
In fact, the client ID is an attribute that will not change over the course of the execution of a process instance, i.e., it is a case-level attribute.
pm4py allows us to specify that a column actually describes a case-level attribute
(under the assumption that the attribute does not change during the execution of a
process).
The example code on the right shows how to convert the previously exemplified CSV data file.
After loading the csv file of the example table, we rename the clientID
column to case:clientID (this is a specific operation provided by
pandas!).
import pandas as pd
import pm4py
if __name__ == "__main__":
dataframe = pd.read_csv('tests/input_data/running-example-transformed.csv', sep=',')
dataframe = dataframe.rename(columns={'clientID': 'case:clientID'})
dataframe = pm4py.format_dataframe(dataframe, case_id='CaseID', activity_key='Activity', timestamp_key='Timestamp')
event_log = pm4py.convert_to_event_log(dataframe)
Converting Event Data
In this section, we describe how to convert event log objects from one object type to another object type.
There are three objects, which we are able to 'switch' between, i.e., Event Log,
Event Stream and Data Frame objects.
Please refer to the previous code snippet for an example of applying log conversion
(applied when importing a CSV object).
Finally, note that most algorithms internally use the converters, in order to be
able to handle an input event data object of any form.
In such a case, the default parameters are used.
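For completeness, a minimal sketch showing the three conversion functions of the simplified interface (reusing the CSV import from above):
import pandas as pd
import pm4py
if __name__ == "__main__":
    dataframe = pd.read_csv('tests/input_data/running-example.csv', sep=',')
    dataframe = pm4py.format_dataframe(dataframe, case_id='case:concept:name', activity_key='concept:name', timestamp_key='time:timestamp')
    # switch between the three supported representations
    event_log = pm4py.convert_to_event_log(dataframe)
    event_stream = pm4py.convert_to_event_stream(event_log)
    dataframe2 = pm4py.convert_to_dataframe(event_stream)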
Exporting IEEE XES files
Exporting an Event Log object to an IEEE XES file is fairly straightforward in pm4py.
Consider the example code fragment on the right, which depicts this
functionality.
import pm4py
if __name__ == "__main__":
pm4py.write_xes(log, 'exported.xes')
In the example, the log object is assumed to be an Event Log object.
The exporter also accepts an Event Stream or DataFrame object as an input.
However, the exporter will first convert the given input object into an Event Log.
Hence, in this case, standard parameters for the conversion are used.
Thus, if the user wants more control, it is advisable to apply the conversion to
Event Log, prior to exporting.
To export an event log to a csv-file, pm4py uses Pandas.
Hence, an event log is first converted to a Pandas Data Frame, after which it is
written to disk.
import pandas as pd
import pm4py
if __name__ == "__main__":
dataframe = pm4py.convert_to_dataframe(log)
dataframe.to_csv('exported.csv')
In case an event log object is provided that is not a dataframe, i.e., an Event Log
or Event Stream, the conversion is applied, using the default parameter values,
i.e., as presented in the Converting
Event Data section.
Note that exporting event data to a CSV file takes no parameters.
In case more control over the conversion is needed, please apply a conversion to
dataframe first, prior to exporting to csv.
At this moment, I/O of any format supported by Pandas (dataframes) is implicitly
supported.
As long as data can be loaded into a Pandas dataframe, pm4py is reasonably able to work
with such files.
Filtering on timeframe
In the following paragraphs, various methods for filtering on time frames are presented. The same simplified pm4py functions work on both log objects and Pandas dataframes.
One might be interested in keeping only the traces that are contained in a specific interval, e.g., between 09 March 2011 and 18 January 2012.
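A minimal sketch of this filter, using the simplified interface (the log path is an example; the mode traces_contained keeps only the traces that are fully contained in the interval):
import pm4py
if __name__ == "__main__":
    log = pm4py.read_xes("tests/input_data/receipt.xes")
    # keep only the traces fully contained between the two timestamps
    filtered_log = pm4py.filter_time_range(log, "2011-03-09 00:00:00", "2012-01-18 23:59:59", mode='traces_contained')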
This filter permits to keep only traces with duration that is inside a specified
interval. In the examples, traces between 1 and 10 days are kept.
Note that the time parameters are given in seconds.
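A minimal sketch, assuming pm4py.filter_case_performance takes the minimum and maximum case duration in seconds:
import pm4py
if __name__ == "__main__":
    log = pm4py.read_xes("tests/input_data/receipt.xes")
    # keep only the cases with a duration between 1 day (86400 s) and 10 days (864000 s)
    filtered_log = pm4py.filter_case_performance(log, 86400, 864000)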
First of all, it might be necessary to know the start activities. Therefore, a code snippet is provided. Subsequently, an example of filtering is provided. The same functions accept both log objects and Pandas dataframes.
log_start is a dictionary that contains the activities as keys and the number of their occurrences as values.
import pm4py
if __name__ == "__main__":
log_start = pm4py.get_start_activities(log)
filtered_log = pm4py.filter_start_activities(log, ["S1"]) #suppose "S1" is the start activity you want to filter on
Filter on end activities
In general, pm4py offers the possibility to filter a log or a dataframe on end activities.
This filter permits to keep only traces with an end activity among a set of specified
activities. First of all, it might be necessary to know the end activities.
Therefore, a code snippet is provided.
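A minimal sketch using the simplified interface (get_end_activities returns a dictionary from end activity to frequency; "pay compensation" is just a placeholder activity):
import pm4py
if __name__ == "__main__":
    log = pm4py.read_xes("tests/input_data/receipt.xes")
    end_activities = pm4py.get_end_activities(log)
    # keep only the traces ending with one of the specified activities
    filtered_log = pm4py.filter_end_activities(log, ["pay compensation"])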
A variant is a set of cases that share the same control-flow perspective, so a set of cases
that share the same classified events (activities) in the same order. In this section, we
will focus for all methods first on log objects, then we will continue with the
dataframe.
To retrieve the variants from the log, the code snippet can be used:
import pm4py
if __name__ == "__main__":
variants = pm4py.get_variants(log)
import pm4py
if __name__ == "__main__":
log = pm4py.read_xes("tests/input_data/receipt.xes")
k = 2
filtered_log = pm4py.filter_variants_top_k(log, k)
The filter on variants coverage keeps the cases following the top variants of the log, under the condition that each variant covers at least the specified percentage of cases in the log.
For example, if min_coverage_percentage=0.4 and we have a log with 1000 cases, of which 500 belong to variant 1, 400 to variant 2, and 100 to variant 3, the filter keeps only the traces of variants 1 and 2.
import pm4py
if __name__ == "__main__":
log = pm4py.read_xes("tests/input_data/receipt.xes")
perc = 0.1
filtered_log = pm4py.filter_variants_by_coverage_percentage(log, perc)
Filtering on attribute values permits alternatively to:
Keep cases that contain at least one event with one of the given attribute values
Remove cases that contain an event with one of the given attribute values
Keep events (trimming traces) that have one of the given attribute values
Remove events (trimming traces) that have one of the given attribute values
Examples of attributes are the resource (generally contained in the org:resource attribute) and the activity (generally contained in the concept:name attribute). As noted before, the first method can be applied to log objects, the second to dataframe objects.
To get the list of resources and activities contained in the log, the following code
could be used.
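A minimal sketch using the simplified interface (pm4py.get_event_attribute_values returns a dictionary from attribute value to frequency):
import pm4py
if __name__ == "__main__":
    log = pm4py.read_xes("tests/input_data/receipt.xes")
    # list the resources and the activities occurring in the log
    resources = pm4py.get_event_attribute_values(log, "org:resource")
    activities = pm4py.get_event_attribute_values(log, "concept:name")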
Filtering on numeric attribute values provides options that are similar to filtering on string attribute values (that we already considered).
First, we import the log. Subsequently, we want to keep only the events with an amount between 34 and 36. An additional filter aims to keep only cases with at least one event satisfying the specified amount. The filter on cases provides the option to specify up to two attributes that are checked on the events that shall satisfy the numeric range. For example, if we are interested in cases having an event with activity Add penalty that has an amount between 34 and 500, a code snippet is also provided.
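A possible sketch of these three filters, using the attributes filter at the algorithm level (the module path, the parameter names and the log path are assumptions based on the standard pm4py filtering packages; the amount is assumed to be stored in the "amount" event attribute):
import pm4py
from pm4py.algo.filtering.log.attributes import attributes_filter
if __name__ == "__main__":
    log = pm4py.read_xes("tests/input_data/roadtraffic100traces.xes")
    # keep only the events with an amount between 34 and 36
    filtered_log_events = attributes_filter.apply_numeric_events(log, 34, 36,
        parameters={attributes_filter.Parameters.ATTRIBUTE_KEY: "amount"})
    # keep only the cases with at least one event with an amount between 34 and 36
    filtered_log_cases = attributes_filter.apply_numeric(log, 34, 36,
        parameters={attributes_filter.Parameters.ATTRIBUTE_KEY: "amount"})
    # keep only the cases with an "Add penalty" event having an amount between 34 and 500
    filtered_log_cases_act = attributes_filter.apply_numeric(log, 34, 500,
        parameters={attributes_filter.Parameters.ATTRIBUTE_KEY: "amount",
                    attributes_filter.Parameters.STREAM_FILTER_KEY1: "concept:name",
                    attributes_filter.Parameters.STREAM_FILTER_VALUE1: "Add penalty"})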
The between filter transforms the event log by identifying, in the current set of cases, all the subcases going from a source activity to a target activity.
This is useful to analyse in detail the behavior in the log between such a pair of activities (e.g., the throughput time, which activities are included, the level of conformance).
The between filter between two activities is applied as follows.
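A minimal sketch, assuming the source and target activities "check ticket" and "decide" occur in the running-example log:
import pm4py
if __name__ == "__main__":
    log = pm4py.read_xes("tests/input_data/running-example.xes")
    # keep, for each case, the subtraces going from "check ticket" to "decide"
    filtered_log = pm4py.filter_between(log, "check ticket", "decide")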
The paths performance filter identifies the cases in which
a given path between two activities takes a duration that is included
in a range that is specified by the user.
This can be useful to identify the cases in which a large amount of time elapses between two activities.
The paths filter is applied as follows. In this case,
we are looking for cases containing at least one occurrence
of the path between decide and pay compensation
having a duration included between 2 days and 10 days (where each day
has a duration of 86400 seconds).
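A minimal sketch with the simplified interface (the path is provided as a tuple of activities and the bounds are expressed in seconds):
import pm4py
if __name__ == "__main__":
    log = pm4py.read_xes("tests/input_data/running-example.xes")
    # keep the cases with at least one occurrence of the path decide -> pay compensation
    # taking between 2 and 10 days (86400 seconds per day)
    filtered_log = pm4py.filter_paths_performance(log, ("decide", "pay compensation"), 2 * 86400, 10 * 86400)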
Object-Centric Event Logs
Traditional event logs, used by mainstream process mining techniques, require the events to be related to a case. A case is a set of events for a particular purpose. A case notion is a criterion to assign events to a case.
However, in real processes this leads to two problems:
If we consider the Order-to-Cash process, an order could be related to many different deliveries.
If we consider the delivery as case notion, the same event of Create Order needs to be
replicated in different cases (all the deliveries involving the order). This is called the
convergence problem.
If we consider the Order-to-Cash process, an order could contain different order items, each one with a different lifecycle. If we consider the order as case notion, several instances of the activities for the single items may be contained in the case, and this makes the frequency/performance annotation of the process problematic. This is called the divergence problem.
Object-centric event logs relax the assumption that an event is related to exactly
one case. Indeed, an event can be related to different objects of different object types.
Essentially, we can describe the different components of an object-centric event log as:
Events, having an identifier, an activity, a timestamp, a list of related objects and a
dictionary of other attributes.
Objects, having an identifier, a type and a dictionary of other attributes.
Attribute names, e.g., the possible keys for the attributes of the event/object attribute map.
Object types, e.g., the possible types for the objects.
Supported Formats
Several historical formats (OpenSLEX, XOC) have been proposed for the storage of object-centric
event logs. In particular, the OCEL standard proposes
lean and intercompatible formats for the storage of object-centric event logs. These include:
XML-OCEL: a storage format based on XML for object-centric event logs.
An example of XML-OCEL event log is reported here.
JSON-OCEL: a storage format based on JSON for object-centric event logs.
An example of JSON-OCEL event log is reported here.
Among the commonalities of these formats, the event/object identifier is ocel:id,
the activity identifier is ocel:activity, the timestamp of the event is ocel:timestamp,
the type of the object is ocel:type.
Moreover, the list of related objects for the events is identified by ocel:omap,
the attribute map for the events is identified by ocel:vmap, the attribute map for the
objects is identified by ocel:ovmap.
Ignoring the attributes at the object level, we can also represent the object-centric event log in a CSV format (an example is reported here). There, a row represents an event, where the event identifier is ocel:eid, and the related objects of a given type OTYPE are reported as a list under the column ocel:type:OTYPE.
Importing/Exporting OCELs
For all the supported formats, an OCEL event log can be read by doing:
import pm4py
if __name__ == "__main__":
path = "tests/input_data/ocel/example_log.jsonocel"
ocel = pm4py.read_ocel(path)
Object-Centric Event Log (number of events: 23, number of objects: 15, number of activities: 15, number of object types: 3, events-objects relationships: 39)
Activities occurrences: {'Create Order': 3, 'Create Delivery': 3, 'Delivery Successful': 3, 'Invoice Sent': 2, 'Payment Reminder': 2, 'Confirm Order': 1, 'Item out of Stock': 1, 'Item back in Stock': 1, 'Delivery Failed': 1, 'Retry Delivery': 1, 'Pay Order': 1, 'Remove Item': 1, 'Cancel Order': 1, 'Add Item to Order': 1, 'Send for Credit Collection': 1}
Object types occurrences: {'element': 9, 'order': 3, 'delivery': 3}
Please use ocel.get_extended_table() to get a dataframe representation of the events related to the objects.
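Conversely, an OCEL can be exported to any of the supported formats; a minimal sketch (the output path and format are just an example):
import pm4py
if __name__ == "__main__":
    ocel = pm4py.read_ocel("tests/input_data/ocel/example_log.jsonocel")
    # export the object-centric event log to the XML-OCEL format
    pm4py.write_ocel(ocel, "exported.xmlocel")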
The names of the attributes in the log can be retrieved as follows:
if __name__ == "__main__":
attribute_names = pm4py.ocel_get_attribute_names(ocel)
A dictionary containing the set of activities for each object type can be obtained using the command on the right. In this case, the key of the dictionary will be the object type, and the value the set of activities which appear for the object type.
if __name__ == "__main__":
object_type_activities = pm4py.ocel_object_type_activities(ocel)
It is possible to obtain for each event identifier and object type the number of related
objects to the event. The output will be a dictionary where the first key will be
the event identifier, the second key will be the object type and the value will
be the number of related objects per type.
if __name__ == "__main__":
ocel_objects_ot_count = pm4py.ocel_objects_ot_count(ocel)
It is possible to calculate the so-called temporal summary of the object-centric event log.
The temporal summary is a table (dataframe) in which the different timestamps occurring in the log are reported, along with the set of activities happening at a given point in time and the objects involved in them.
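A minimal sketch (assuming the OCEL has been read as above):
import pm4py
if __name__ == "__main__":
    ocel = pm4py.read_ocel("tests/input_data/ocel/example_log.jsonocel")
    temporal_summary = pm4py.ocel_temporal_summary(ocel)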
It is possible to calculate the so-called objects summary of the object-centric event log.
The objects summary is a table (dataframe) in which the different objects occurring in the log are reported
along with the list of activities of the events related to the object, the start/end timestamps
of the lifecycle, the duration of the lifecycle and the other objects related to the given object
in the interaction graph.
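A minimal sketch (assuming the OCEL has been read as above):
import pm4py
if __name__ == "__main__":
    ocel = pm4py.read_ocel("tests/input_data/ocel/example_log.jsonocel")
    objects_summary = pm4py.ocel_objects_summary(ocel)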
Internal Data Structure
In this section, we describe the data structure used in pm4py to store object-centric event logs.
We have in total three Pandas dataframes:
The events dataframe: this stores a row for each event. Each row contains
the event identifier (ocel:eid), the activity (ocel:activity),
the timestamp (ocel:timestamp), and the values for the other event attributes (one per column).
The objects dataframe: this stores a row for each object. Each row contains
the object identifier (ocel:oid), the type (ocel:type),
and the values for the object attributes (one per column).
The relations dataframe: this stores a row for every relation event->object.
Each row contains the event identifier (ocel:eid), the object identifier
(ocel:oid), the type of the related object (ocel:type).
These dataframes can be accessed as properties of the OCEL object (e.g.,
ocel.events, ocel.objects, ocel.relations), and be obviously used
for any purposes (filtering, discovery).
Filtering Object-Centric Event Logs
In this section, we describe some filtering operations available in pm4py and specific for
object-centric event logs. There are filters at three levels:
Filters at the event level (operating first at the ocel.events structure and then propagating
the result to the other parts of the object-centric log).
Filters at the object level (operating first at the ocel.objects structure and then propagating
the result to the other parts of the object-centric log).
Filters at the relations level (operating first at the ocel.relations structure and then propagating
the result to the other parts of the object-centric log).
Filter on Event Attributes
We can keep the events with a given attribute falling inside the specified list
of values by using pm4py.filter_ocel_event_attribute.
An example, filtering on the ocel:activity (the activity) attribute, is reported on the right. The positive boolean specifies whether to keep the events with an activity in the list (positive=True) or to keep the events with an activity NOT in the list (positive=False).
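A minimal sketch (the activity names are placeholders taken from the example log):
import pm4py
if __name__ == "__main__":
    ocel = pm4py.read_ocel("tests/input_data/ocel/example_log.jsonocel")
    # keep only the events whose activity is "Create Order" or "Create Delivery"
    filtered_ocel = pm4py.filter_ocel_event_attribute(ocel, "ocel:activity", ["Create Order", "Create Delivery"], positive=True)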
We can keep the objects with a given attribute falling inside the specified list
of values by using pm4py.filter_ocel_object_attribute.
An example, filtering on the ocel:type (the object type) attribute, is reported on the right. The positive boolean specifies whether to keep the objects with a type in the list (positive=True) or to keep the objects with a type NOT in the list (positive=False).
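A minimal sketch (the object type is a placeholder from the example log):
import pm4py
if __name__ == "__main__":
    ocel = pm4py.read_ocel("tests/input_data/ocel/example_log.jsonocel")
    # keep only the objects of type "order"
    filtered_ocel = pm4py.filter_ocel_object_attribute(ocel, "ocel:type", ["order"], positive=True)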
Sometimes, object-centric event logs include more relations between events and objects than is legitimate. This could lead back to the divergence problem.
We introduce a filter on the allowed activities per object type.
This helps in keeping, for each activity, only the meaningful object types, excluding the others.
An example application of the filter is reported on the right. In this case, we keep
for the order object type only the Create Order activity,
and for the item object type only the Create Order and
Create Delivery activities.
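A minimal sketch of this filter, following the object types and activities described above (the "item" type is taken from the description and may need adapting to the actual types of the log):
import pm4py
if __name__ == "__main__":
    ocel = pm4py.read_ocel("tests/input_data/ocel/example_log.jsonocel")
    filtered_ocel = pm4py.filter_ocel_object_types_allowed_activities(ocel, {"order": ["Create Order"], "item": ["Create Order", "Create Delivery"]})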
With this filter, we want to search for some patterns in the log (for example, the events related
to at least 1 order and 2 items). This helps in identifying exceptional patterns
(e.g., an exceptional number of related objects per event). An example is reported on the right.
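A minimal sketch, keeping the events related to at least 1 order and 2 items (the type names follow the description above):
import pm4py
if __name__ == "__main__":
    ocel = pm4py.read_ocel("tests/input_data/ocel/example_log.jsonocel")
    filtered_ocel = pm4py.filter_ocel_object_per_type_count(ocel, {"order": 1, "item": 2})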
In some contexts, we may want to identify the events in which an object of a given type starts/completes its lifecycle. This may pinpoint some incompleteness in the recordings. Examples are reported on the right.
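Minimal sketches of the two filters (the "order" type is a placeholder):
import pm4py
if __name__ == "__main__":
    ocel = pm4py.read_ocel("tests/input_data/ocel/example_log.jsonocel")
    # events in which an object of type "order" starts its lifecycle
    filtered_ocel_start = pm4py.filter_ocel_start_events_per_object_type(ocel, "order")
    # events in which an object of type "order" completes its lifecycle
    filtered_ocel_end = pm4py.filter_ocel_end_events_per_object_type(ocel, "order")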
A useful filter, to restrict the behavior of the object-centric event log to a specific time interval, is the timestamp filter (analogous to its traditional counterpart). An example is reported on the right.
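A minimal sketch (the time interval is just a placeholder):
import pm4py
if __name__ == "__main__":
    ocel = pm4py.read_ocel("tests/input_data/ocel/example_log.jsonocel")
    filtered_ocel = pm4py.filter_ocel_events_timestamp(ocel, "1981-01-01 00:00:00", "1982-01-01 00:00:00", timestamp_key="ocel:timestamp")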
In this filter, we want to keep a limited set of object types of the log
by manually specifying the object types to retain. Only the events related
to at least one object of a provided object type are kept.
if __name__ == "__main__":
filtered_ocel = pm4py.filter_ocel_object_types(ocel, ['order', 'element'])
In this filter, we want to keep the events related to the connected component of a provided object in the objects interaction graph. So a subset of the events of the original log, those connected to the given object through the interaction graph, is kept in the filtered log.
if __name__ == "__main__":
filtered_ocel = pm4py.filter_ocel_cc_object(ocel, 'o1')
In this filter, we want to keep a subset of the objects (identifiers) of the original
object-centric event log. Therefore, only the events related to at least one of these objects
are kept in the object-centric event log.
if __name__ == "__main__":
filtered_ocel = pm4py.filter_ocel_objects(ocel, ['o1', 'i1'])
It's also possible to iteratively expand the set of objects of the filter to the objects
that are interconnected to the given objects in the objects interaction graph.
This is done with the parameter level. An example is provided where the expansion
of the set of objects to the 'nearest' ones is done:
if __name__ == "__main__":
filtered_ocel = pm4py.filter_ocel_objects(ocel, ['o1'], level=2)
Flattening to a Traditional Log
Flattening permits to convert an object-centric event log to a traditional
event log with the specification of an object type. This allows for the application
of traditional process mining techniques to the flattened log.
An example in which an event log is imported, and a flattening operation
is applied on the order object type, is the following:
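A minimal sketch of the flattening on the order object type:
import pm4py
if __name__ == "__main__":
    ocel = pm4py.read_ocel("tests/input_data/ocel/example_log.jsonocel")
    # flatten the OCEL on the "order" object type, obtaining a traditional event log
    flattened_log = pm4py.ocel_flattening(ocel, "order")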
The situation in which an object-centric event log is produced directly at the extraction phase from the information systems is uncommon, as extractors for this setting are still quite rare nowadays.
More frequent is the situation where some event logs can be extracted from the system
and then their cases are related. So we can use the classical extractors to extract the
event logs, and additionally extract only the relationships between the cases.
This information can be used to mine the relationships between events. In particular,
the method of timestamp-based interleavings can be used. These consider the temporal
flow between the different processes, based on the provided case relations: you can go from the
left-process to the right-process, and from the right-process to the left-process.
In the following, we will assume the two event logs to be Pandas dataframes (with the classical pm4py naming convention, e.g. case:concept:name, concept:name and time:timestamp) and a case relations dataframe to be defined between them (with the related cases being expressed respectively as case:concept:name_LEFT and case:concept:name_RIGHT).
In this example, we load two event logs, and a dataframe containing the relationships
between them. Then, we apply the timestamp-based interleaved miner.
import pandas as pd
import pm4py
if __name__ == "__main__":
dataframe1 = pd.read_csv("tests/input_data/interleavings/receipt_even.csv")
dataframe1 = pm4py.format_dataframe(dataframe1)
dataframe2 = pd.read_csv("tests/input_data/interleavings/receipt_odd.csv")
dataframe2 = pm4py.format_dataframe(dataframe2)
case_relations = pd.read_csv("tests/input_data/interleavings/case_relations.csv")
from pm4py.algo.discovery.ocel.interleavings import algorithm as interleavings_discovery
interleavings = interleavings_discovery.apply(dataframe1, dataframe2, case_relations)
The resulting interleavings dataframe will contain several columns, including for each row (that is a couple of related events, the first belonging to the first dataframe, the second belonging to the second dataframe):
All the columns of the event (of the interleaving) of the first dataframe (with prefix LEFT).
All the columns of the event (of the interleaving) of the second dataframe (with prefix RIGHT).
The column @@direction indicating the direction of the interleaving (with LR we go left-to-right so
from the first dataframe to the second dataframe;
with RL we go right-to-left, so from the second dataframe to the first dataframe).
The columns @@source_activity and @@target_activity contain respectively the source and target activity of the interleaving.
The columns @@source_timestamp and @@target_timestamp contain respectively the source and target timestamp of the interleaving.
The column @@left_index contains the index of the event of the first of the two dataframes.
The column @@right_index contains the index of the event of the second of the two dataframes.
The column @@timestamp_diff contains the difference between the two timestamps (can be useful to aggregate on the time).
We provide a visualization of the interleavings between the two logs. The visualization considers
the DFG of the two logs and shows the interleavings between them (decorated by the frequency/performance
of the relationship).
An example of frequency-based interleavings visualization is reported on the right.
import pandas as pd
import pm4py
if __name__ == "__main__":
dataframe1 = pd.read_csv("tests/input_data/interleavings/receipt_even.csv")
dataframe1 = pm4py.format_dataframe(dataframe1)
dataframe2 = pd.read_csv("tests/input_data/interleavings/receipt_odd.csv")
dataframe2 = pm4py.format_dataframe(dataframe2)
case_relations = pd.read_csv("tests/input_data/interleavings/case_relations.csv")
from pm4py.algo.discovery.ocel.interleavings import algorithm as interleavings_discovery
interleavings = interleavings_discovery.apply(dataframe1, dataframe2, case_relations)
from pm4py.visualization.ocel.interleavings import visualizer as interleavings_visualizer
# visualizes the frequency of the interleavings
gviz_freq = interleavings_visualizer.apply(dataframe1, dataframe2, interleavings, parameters={"annotation": "frequency", "format": "svg"})
interleavings_visualizer.view(gviz_freq)
An example of performance-based interleavings visualization is reported on the right.
import pandas as pd
import pm4py
if __name__ == "__main__":
dataframe1 = pd.read_csv("tests/input_data/interleavings/receipt_even.csv")
dataframe1 = pm4py.format_dataframe(dataframe1)
dataframe2 = pd.read_csv("tests/input_data/interleavings/receipt_odd.csv")
dataframe2 = pm4py.format_dataframe(dataframe2)
case_relations = pd.read_csv("tests/input_data/interleavings/case_relations.csv")
from pm4py.algo.discovery.ocel.interleavings import algorithm as interleavings_discovery
interleavings = interleavings_discovery.apply(dataframe1, dataframe2, case_relations)
from pm4py.visualization.ocel.interleavings import visualizer as interleavings_visualizer
# visualizes the performance of the interleavings
gviz_perf = interleavings_visualizer.apply(dataframe1, dataframe2, interleavings, parameters={"annotation": "performance", "aggregation_measure": "median", "format": "svg"})
interleavings_visualizer.view(gviz_perf)
The parameters offered by the visualization of the interleavings are the following:
Parameters.FORMAT: the format of the visualization (svg, png).
Parameters.BGCOLOR: background color of the visualization (default: transparent).
Parameters.RANKDIR: the direction of visualization of the diagram (LR, TB).
Parameters.ANNOTATION: the annotation to be used (frequency, performance).
Parameters.AGGREGATION_MEASURE: the aggregation to be used (mean, median, min, max).
Parameters.ACTIVITY_PERCENTAGE: the percentage of activities that shall be included in the two DFGs and the interleavings visualization.
Parameters.PATHS_PERCENTAG: the percentage of paths that shall be included in the two DFGs and the interleavings visualization.
Parameters.DEPENDENCY_THRESHOLD: the dependency threshold that shall be used to filter the edges of the DFG.
Parameters.MIN_FACT_EDGES_INTERLEAVINGS: parameter that regulates the fraction of interleavings that is shown in the diagram.
Creating an OCEL out of the Interleavings
Given two logs having related cases, we saw how to calculate the interleavings between the logs.
In this section, we want to exploit the information contained in the two logs and in their
interleavings to create an object-centric event log (OCEL). This will contain the events of the
two event logs and the connections between them. The OCEL can be used with any object-centric
process mining technique.
An example is reported on the right.
import pandas as pd
import pm4py
if __name__ == "__main__":
dataframe1 = pd.read_csv("tests/input_data/interleavings/receipt_even.csv")
dataframe1 = pm4py.format_dataframe(dataframe1)
dataframe2 = pd.read_csv("tests/input_data/interleavings/receipt_odd.csv")
dataframe2 = pm4py.format_dataframe(dataframe2)
case_relations = pd.read_csv("tests/input_data/interleavings/case_relations.csv")
from pm4py.algo.discovery.ocel.interleavings import algorithm as interleavings_discovery
interleavings = interleavings_discovery.apply(dataframe1, dataframe2, case_relations)
from pm4py.objects.ocel.util import log_ocel
ocel = log_ocel.from_interleavings(dataframe1, dataframe2, interleavings)
Merging Related Logs (Case Relations)
If two event logs of two inter-related processes are considered, it may make sense for some analyses to merge them. The resulting log will contain cases which contain events of both the first and the second event log.
This happens when popular enterprise processes such as the P2P and the O2C are considered.
If a sales order is placed which requires a material that is not available, a purchase order can be issued to a supplier in order to get the material and fulfill the sales order.
For the merge operation, we will need to consider:
A reference event log (whose cases will be enriched by the events of the other event log).
An event log to be merged (its events end up in the cases of the reference event log).
A set of case relationships between them.
An example is reported on the right. The result is a traditional event log.
import pandas as pd
import pm4py
from pm4py.algo.merging.case_relations import algorithm as case_relations_merging
import os
if __name__ == "__main__":
dataframe1 = pd.read_csv(os.path.join("tests", "input_data", "interleavings", "receipt_even.csv"))
dataframe1 = pm4py.format_dataframe(dataframe1)
dataframe2 = pd.read_csv(os.path.join("tests", "input_data", "interleavings", "receipt_odd.csv"))
dataframe2 = pm4py.format_dataframe(dataframe2)
case_relations = pd.read_csv(os.path.join("tests", "input_data", "interleavings", "case_relations.csv"))
merged = case_relations_merging.apply(dataframe1, dataframe2, case_relations)
Network Analysis
The classical social network analysis methods (such as the ones described in this page at the later sections)
are based on the order of the events inside a case. For example, the Handover of Work metric considers
the directly-follows relationships between resources during the work of a case. An edge is added between two resources if such a relationship occurs.
Real-life scenarios may be more complicated. First, it is difficult to collect events inside the same case without having convergence/divergence issues (see the first section of the OCEL part). Second, the type of relationship may also be important. Consider for example the relationship between two resources: this may be more efficient if the activity that is executed is liked by the resources, rather than disliked.
The network analysis that we introduce in this section generalizes some existing social network analysis
metrics, becoming independent from the choice of a case notion and permitting to build a multi-graph
instead of a simple graph.
With this, we assume events to be linked by signals. An event emits a signal (that is contained as one
attribute of the event) that is assumed to be received by other events (also, this is an attribute of these events)
that follow the first event in the log. So, we assume there is an OUT attribute (of the event) that is identical to the IN attribute (of the other events).
When we collect this information, we can build the network analysis graph:
The source node of the relation is given by an aggregation over a node_column_source attribute.
The target node of the relation is given by an aggregation over a node_column_target attribute.
The type of edge is given by an aggregation over an edge_column attribute.
The network analysis graph can either be annotated with frequency or performance information.
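A possible sketch, using the simplified interface with the choice of parameters described below (the function name and its keyword arguments are assumptions based on the simplified pm4py API; the frequency variant of the visualization is shown):
import pm4py
if __name__ == "__main__":
    log = pm4py.read_xes("tests/input_data/receipt.xes")
    # succeeding events of the same case are linked (OUT/IN column = case identifier),
    # nodes are aggregated on org:group, edges are typed by the activity
    net_ana = pm4py.discover_network_analysis(log, out_column="case:concept:name", in_column="case:concept:name",
                                              node_column_source="org:group", node_column_target="org:group",
                                              edge_column="concept:name", performance=False)
    pm4py.view_network_analysis(net_ana, variant="frequency", format="svg")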
In the previous example, we have loaded one traditional event log (the receipt.xes event log) and performed the network analysis with the following choice of parameters:
The OUT-column is set to case:concept:name and the IN-column is also set to case:concept:name (that means, succeeding events of the same case are connected).
The node_column_source and node_column_target attributes are set to org:group (we want to see the network of relations between different organizational groups).
The edge_column attribute is set to concept:name (we want to see the frequency/performance of edges between groups, depending on the activity, so we can evaluate advantageous exchanges).
Note that in the previous case, we resorted to using the case identifier as OUT/IN column, but that's just a specific example (the OUT and IN columns can be different, and can differ from the case identifier).
On the right, an example of network analysis, producing a multigraph annotated with performance information, and performing a visualization of the same, is reported.
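A possible sketch of the performance variant (again, the parameter names are assumptions based on the simplified API):
import pm4py
if __name__ == "__main__":
    log = pm4py.read_xes("tests/input_data/receipt.xes")
    net_ana = pm4py.discover_network_analysis(log, out_column="case:concept:name", in_column="case:concept:name",
                                              node_column_source="org:group", node_column_target="org:group",
                                              edge_column="concept:name", performance=True)
    pm4py.view_network_analysis(net_ana, variant="performance", format="svg")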
The visualization supports the following parameters:
format: the format of the visualization (default: png).
bgcolor: the background color of the produced picture.
activity_threshold: the minimum number of occurrences for an activity to be included (default: 1).
edge_threshold: the minimum number of occurrences for an edge to be included (default: 1).
Link Analysis
While the goal of the network analysis is to provide an aggregated visualization of the links between
different events, the goal of link analysis is just the discovery of the links between the events,
to be able to reason about them.
In the examples that follow, we are going to consider the document flow table VBFA of SAP.
This table contains some properties and the connections between sales orders documents (e.g. the order document
itself, the delivery documents, the invoice documents). Reasoning on the properties of the links could help
to understand anomalous situations (e.g. the currency/price is changed during the order's lifecycle).
A link analysis starts from the production of a link analysis dataframe.
This contains the linked events according to the provided specification of the attributes.
First, we load a CSV containing the information from a VBFA table extracted
from an educational instance of SAP. Then, we do some pre-processing to ensure
the consistency of the data contained in the dataframe.
Then, we discover the link analysis dataframe.
At this point, several analyses could be performed.
For example, finding the interconnected documents for which the currency differs between the two documents can be done as follows.
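A possible sketch of the discovery of the link analysis dataframe. The module path follows the pm4py link-analysis package; the CSV path, the VBELV/VBELN/ERDAT/ERZET/WAERS column names and the _out/_in suffixes of the linked columns are assumptions for a typical VBFA extraction, so this is purely illustrative:
import pandas as pd
from pm4py.algo.discovery.ocel.link_analysis import algorithm as link_analysis
if __name__ == "__main__":
    # placeholder path for a CSV export of the VBFA table
    dataframe = pd.read_csv("vbfa.csv", dtype="str")
    # pre-processing (illustrative): build a sortable timestamp from the date/time columns
    dataframe["time:timestamp"] = pd.to_datetime(dataframe["ERDAT"] + " " + dataframe["ERZET"],
                                                 format="%Y%m%d %H%M%S", errors="coerce")
    # discover the link analysis dataframe: an event is linked forward to the events
    # that reference its document (VBELN) as their predecessor document (VBELV)
    linked = link_analysis.apply(dataframe, parameters={
        link_analysis.Parameters.OUT_COLUMN: "VBELN",
        link_analysis.Parameters.IN_COLUMN: "VBELV",
        link_analysis.Parameters.SORTING_COLUMN: "time:timestamp",
        link_analysis.Parameters.INDEX_COLUMN: "@@index",
        link_analysis.Parameters.LOOK_FORWARD: True,
        link_analysis.Parameters.KEEP_FIRST_OCCURRENCE: True,
        link_analysis.Parameters.PROPAGATE: True})
    # illustrative analysis: interconnected documents whose currency (WAERS) differs
    different_currency = linked[linked["WAERS_out"] != linked["WAERS_in"]]
    print(different_currency)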
The parameters of the link analysis algorithm are:
Parameters.OUT_COLUMN: the column of the dataframe that is used to link the source events to the target events.
Parameters.IN_COLUMN: the column of the dataframe that is used to link the target events to the source events.
Parameters.SORTING_COLUMN: the attribute which is used preliminarily to sort the dataframe.
Parameters.INDEX_COLUMN: the name of the column of the dataframe that should be used to store the incremental event index.
Parameters.LOOK_FORWARD: merge an event e1 with an event e2 (e1.OUT = e2.IN) only if the index in the dataframe
of e1 is lower than the index of the dataframe of e2.
Parameters.KEEP_FIRST_OCCURRENCE: if several events e21, e22 are such that e1.OUT = e21.IN = e22.IN, keep only the relationship between e1 and e21.
Parameters.PROPAGATE: propagate the discovered relationships. If e1, e2, e3 are such that e1.OUT = e2.IN
and e2.OUT = e3.IN, then consider e1 to be in relationship also with e3.
OC-DFG discovery
Object-centric directly-follows multigraphs are a composition of directly-follows
graphs for the single object type, which can be annotated with different metrics considering
the entities of an object-centric event log (i.e., events, unique objects, total objects).
We provide both the discovery of the OC-DFG (which produces a generic object allowing for many different choices of the metrics) and the visualization of the same.
An example, in which an object-centric event log is loaded,
an object-centric directly-follows multigraph is discovered,
and visualized with frequency annotation on the screen, is provided on the right.
import pm4py
import os
if __name__ == "__main__":
ocel = pm4py.read_ocel(os.path.join("tests", "input_data", "ocel", "example_log.jsonocel"))
ocdfg = pm4py.discover_ocdfg(ocel)
# views the model with the frequency annotation
pm4py.view_ocdfg(ocdfg, format="svg")
An example, in which an object-centric event log is loaded,
an object-centric directly-follows multigraph is discovered,
and visualized with performance annotation on the screen, is provided on the right.
import pm4py
import os
if __name__ == "__main__":
ocel = pm4py.read_ocel(os.path.join("tests", "input_data", "ocel", "example_log.jsonocel"))
ocdfg = pm4py.discover_ocdfg(ocel)
# views the model with the performance annotation
pm4py.view_ocdfg(ocdfg, format="svg", annotation="performance", performance_aggregation="median")
The visualization supports the following parameters:
annotation: The annotation to use for the visualization. Values: frequency (the frequency annotation), performance (the performance annotation).
act_metric: The metric to use for the activities. Available values: events (number of events), unique_objects (number of unique objects), total_objects (number of total objects).
edge_metric: The metric to use for the edges. Available values: event_couples (number of event couples), unique_objects (number of unique objects), total_objects (number of total objects).
act_threshold: The threshold to apply on the activities frequency (default: 0). Only activities having a frequency >= than this are kept in the graph.
edge_threshold: The threshold to apply on the edges frequency (default 0). Only edges having a frequency >= than this are kept in the graph.
performance_aggregation: The aggregation measure to use for the performance: mean, median, min, max, sum
format: The format of the output visualization (default: png)
OC-PN discovery
Object-centric Petri Nets (OC-PN) are formal models, discovered on top of the object-centric event logs,
using an underlying process discovery algorithm (such as the Inductive Miner). They have been described in the scientific
paper:
van der Aalst, Wil MP, and Alessandro Berti. "Discovering object-centric Petri nets." Fundamenta informaticae 175.1-4 (2020): 1-40.
In pm4py, we offer a basic implementation of object-centric Petri nets (without any additional decoration).
An example, in which an object-centric event log is loaded, the discovery algorithm is applied,
and the OC-PN is visualized, is reported on the right.
import pm4py
import os
if __name__ == "__main__":
ocel = pm4py.read_ocel(os.path.join("tests", "input_data", "ocel", "example_log.jsonocel"))
model = pm4py.discover_oc_petri_net(ocel)
pm4py.view_ocpn(model, format="svg")
Object Graphs on OCEL
It is possible to catch the interaction between the different objects of an OCEL
in different ways. In pm4py, we offer support for the computation of some object-based graphs:
The objects interaction graph connects two objects if they are related in some
event of the log.
The objects descendants graph connects an object, which is related to an event
but does not start its lifecycle with the given event, to all the objects that start their
lifecycle with the given event.
The objects inheritance graph connects an object, which terminates its
lifecycle with the given event, to all the objects that start their lifecycle with the
given event.
The objects cobirth graph connects objects which start their lifecycle within
the same event.
The objects codeath graph connects objects which complete their lifecycle
within the same event.
import pm4py
if __name__ == "__main__":
    ocel = pm4py.read_ocel("tests/input_data/ocel/example_log.jsonocel")
    # objects interaction graph
    from pm4py.algo.transformation.ocel.graphs import object_interaction_graph
    graph = object_interaction_graph.apply(ocel)
import pm4py
if __name__ == "__main__":
    ocel = pm4py.read_ocel("tests/input_data/ocel/example_log.jsonocel")
    # objects descendants graph
    from pm4py.algo.transformation.ocel.graphs import object_descendants_graph
    graph = object_descendants_graph.apply(ocel)
import pm4py
if __name__ == "__main__":
    ocel = pm4py.read_ocel("tests/input_data/ocel/example_log.jsonocel")
    # objects inheritance graph
    from pm4py.algo.transformation.ocel.graphs import object_inheritance_graph
    graph = object_inheritance_graph.apply(ocel)
import pm4py
if __name__ == "__main__":
    ocel = pm4py.read_ocel("tests/input_data/ocel/example_log.jsonocel")
    # objects cobirth graph
    from pm4py.algo.transformation.ocel.graphs import object_cobirth_graph
    graph = object_cobirth_graph.apply(ocel)
import pm4py
if __name__ == "__main__":
    ocel = pm4py.read_ocel("tests/input_data/ocel/example_log.jsonocel")
    # objects codeath graph
    from pm4py.algo.transformation.ocel.graphs import object_codeath_graph
    graph = object_codeath_graph.apply(ocel)
Feature Extraction on OCEL - Object-Based
For machine learning purposes, we might want to create a feature matrix, which
contains a row for every object of the object-centric event log.
The dimensions which can be considered for the computation of features are different:
The lifecycle of an object (sequence of events in the log which are related
to an object). From this dimension, several features, including the length of the lifecycle,
the duration of the lifecycle, can be computed. Moreover, the sequence of the activities
inside the lifecycle can be computed. For example, the one-hot encoding of the
activities can be considered (every activity is associated to a different column,
and the number of events of the lifecycle having the given activity is reported).
Features extracted from the graphs computed on the OCEL (objects interaction graph,
objects descendants graph, objects inheritance graph, objects cobirth/codeath graph).
For each of these, the number of objects connected to a given object is considered as a feature.
The number of objects having a lifecycle intersecting (on the time dimension)
with the current object.
The one-hot-encoding of a specified collection of string attributes.
The encoding of the values of a specified collection of numeric attributes.
To compute the object-based features, the following command can be used
(we would like to consider oattr1 as the only string attribute to one-hot-encode,
and oattr2 as the only numeric attribute to encode). If no string/numeric attributes
should be included, the parameters can be omitted.
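A possible sketch (the module path and the parameter names str_obj_attr/num_obj_attr are assumptions based on the pm4py OCEL feature-extraction package; oattr1 and oattr2 are the hypothetical attributes mentioned above):
import pm4py
from pm4py.algo.transformation.ocel.features.objects import algorithm as object_feature_extraction
if __name__ == "__main__":
    ocel = pm4py.read_ocel("tests/input_data/ocel/example_log.jsonocel")
    # data: one row per object; feature_names: the name of each column of the feature matrix
    data, feature_names = object_feature_extraction.apply(ocel, parameters={
        "str_obj_attr": ["oattr1"],
        "num_obj_attr": ["oattr2"]})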
For machine learning purposes, we might want to create a feature matrix, which
contains a row for every event of the object-centric event log.
The dimensions which can be considered for the computation of features are different:
The timestamp of the event. This can be encoded in different way (absolute timestamp,
hour of the day, day of the week, month).
The activity of the event. A one-hot encoding of the activity values can be performed.
The related objects to the event. Features such as the total number of related objects, the number of related objects per type, the number of objects which start their lifecycle with the current event, and the number of objects which complete their lifecycle with the current event can be considered.
The one-hot-encoding of a specified collection of string attributes.
The encoding of the values of a specified collection of numeric attributes.
To compute the event-based features, the following command can be used
(we would like to consider prova as the only string attribute to one-hot-encode,
and prova2 as the only numeric attribute to encode). If no string/numeric attributes
should be included, the parameters can be omitted.
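A possible sketch (the module path and the parameter names str_ev_attr/num_ev_attr are assumptions based on the pm4py OCEL feature-extraction package; prova and prova2 are the hypothetical attributes mentioned above):
import pm4py
from pm4py.algo.transformation.ocel.features.events import algorithm as event_feature_extraction
if __name__ == "__main__":
    ocel = pm4py.read_ocel("tests/input_data/ocel/example_log.jsonocel")
    # data: one row per event; feature_names: the name of each column of the feature matrix
    data, feature_names = event_feature_extraction.apply(ocel, parameters={
        "str_ev_attr": ["prova"],
        "num_ev_attr": ["prova2"]})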
The validation process permits recognising valid JSON-OCEL/XML-OCEL files before starting the parsing. This is done against a schema which contains the basic structure that should be followed by JSON-OCEL and XML-OCEL files.
The validation of a JSON-OCEL file is done as follows:
from pm4py.objects.ocel.validation import jsonocel
if __name__ == "__main__":
validation_result = jsonocel.apply("tests/input_data/ocel/example_log.jsonocel", "tests/input_data/ocel/schema.json")
print(validation_result)
The validation of an XML-OCEL file is done as follows:
from pm4py.objects.ocel.validation import xmlocel
if __name__ == "__main__":
    validation_result = xmlocel.apply("tests/input_data/ocel/example_log.xmlocel", "tests/input_data/ocel/schema.xml")
    print(validation_result)
Process Discovery
Process Discovery algorithms aim to find a suitable process model that describes the order of events/activities that are executed during a process execution.
In the following, we provide an overview of the advantages and disadvantages of the main mining algorithms.
Alpha: cannot handle loops of length one and length two; invisible and duplicated tasks cannot be discovered; the discovered model might not be sound; weak against noise.
Alpha+: can handle loops of length one and length two; invisible and duplicated tasks cannot be discovered; the discovered model might not be sound; weak against noise.
Heuristic: takes frequency into account; detects short loops; does not guarantee a sound model.
Inductive: can handle invisible tasks; the model is sound; the most used process mining algorithm.
Alpha Miner
The alpha miner is one of the best-known Process Discovery algorithms and is able to find:
A Petri net model where all the transitions are visible and unique and correspond to classified events (for example, to activities).
An initial marking that describes the status of the Petri net model when an execution starts.
A final marking that describes the status of the Petri net model when an execution ends.
We provide an example where a log is read, the Alpha algorithm is applied and the Petri net
along with the initial and the final marking are found. The log we take as input is the
running-example.xes.
First, the log has to be imported.
import os
import pm4py
if __name__ == "__main__":
log = pm4py.read_xes(os.path.join("tests","input_data","running-example.xes"))
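Subsequently, the Alpha Miner can be applied; a minimal sketch using the simplified interface:
if __name__ == "__main__":
    # apply the Alpha Miner, obtaining the Petri net with its initial and final marking
    net, initial_marking, final_marking = pm4py.discover_petri_net_alpha(log)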
Inductive Miner
In pm4py, we offer an implementation of the inductive miner (IM), of the inductive miner
infrequent (IMf),
and of the inductive miner directly-follows (IMd) algorithm. The papers describing the
approaches are
the following:
The basic idea of the Inductive Miner is detecting a 'cut' in the log (e.g. sequential cut, exclusive-choice cut, concurrent cut and loop cut) and then recurring on the sublogs obtained by applying the cut, until a base case is found. The Directly-Follows variant avoids the recursion on the sublogs and uses the Directly-Follows graph instead.
Inductive miner models usually make extensive use of hidden transitions, especially for
skipping/looping on a portion on the model. Furthermore, each visible transition has a
unique label (there are no transitions in the model that share the same label).
Two process models can be derived: Petri Net and Process Tree.
To mine a Petri Net, we provide an example. A log is read, the inductive miner is applied
and the
Petri net along with the initial and the final marking are found. The log we take as
input is the running-example.xes.
First, the log is read, then the inductive miner algorithm is applied.
import os
import pm4py
if __name__ == "__main__":
log = pm4py.read_xes(os.path.join("tests","input_data","running-example.xes"))
net, initial_marking, final_marking = pm4py.discover_petri_net_inductive(log)
Heuristic Miner
Heuristics Miner is an algorithm that acts on the Directly-Follows Graph, providing a way to handle noise and to find common constructs (dependency between two activities, AND). The output of the Heuristics Miner is a Heuristics Net, i.e. an object that contains the activities and the relationships between them. The Heuristics Net can then be converted into a Petri net.
It is possible to obtain both a Heuristics Net and a Petri Net.
To apply the Heuristics Miner and discover a Heuristics Net, it is necessary to import a log first. Then, a Heuristics Net can be found. There are also numerous possible parameters that can be tuned.
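A minimal sketch using the simplified interface (the dependency threshold value is just an example):
import pm4py
if __name__ == "__main__":
    log = pm4py.read_xes("tests/input_data/running-example.xes")
    # discover and view a Heuristics Net
    heu_net = pm4py.discover_heuristics_net(log, dependency_threshold=0.99)
    pm4py.view_heuristics_net(heu_net, format="svg")
    # alternatively, discover a Petri net directly
    net, im, fm = pm4py.discover_petri_net_heuristics(log, dependency_threshold=0.99)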
Since decomposed models are expected to have less concurrency, the components are aligned using
a Dijkstra approach. In the case of border disagreements, this can degrade the performance of
the algorithm.
It should be noted that this is not an approximation technique;
according to the authors, it should provide the same fitness
as the original alignments.
Since the alignment is recomposed, we can use the fitness evaluator to evaluate
the fitness (that is not related to the computation of fitness described in the paper).
from pm4py.algo.evaluation.replay_fitness import algorithm as rp_fitness_evaluator
if __name__ == "__main__":
fitness = rp_fitness_evaluator.evaluate(conf, variant=rp_fitness_evaluator.Variants.ALIGNMENT_BASED)
Footprints
Footprints are a very basic (but scalable) conformance checking technique to compare entities (such as event logs, DFGs, Petri nets, process trees, and any other kind of model).
Essentially, a relationship between any couple of activities of the log/model is inferred. This
can include:
Directly-Follows Relationships: in the log/model, it is possible that the activity A is
directly followed by B.
Directly-Before Relationships: in the log/model, it is possible that the activity B is
directly preceded by A.
Parallel behavior: it is possible that A is followed by B and that B is followed by A.
A footprints matrix can be calculated, that describes for each couple of activities the
footprint relationship.
It is possible to calculate that for different types of models and for the entire event log,
but also trace-by-trace (if the local behavior is important).
Let’s assume that the running-example.xes event log is loaded:
import pm4py
import os
if __name__ == "__main__":
log = pm4py.read_xes(os.path.join("tests", "input_data", "running-example.xes"))
from pm4py.algo.discovery.footprints import algorithm as footprints_discovery
if __name__ == "__main__":
fp_log = footprints_discovery.apply(log, variant=footprints_discovery.Variants.ENTIRE_EVENT_LOG)
{'sequence': {('examine casually', 'decide'), ('decide', 'pay compensation'), ('register request', 'examine thoroughly'), ('reinitiate request', 'examine casually'), ('check ticket', 'decide'), ('register request', 'examine casually'), ('reinitiate request', 'examine thoroughly'), ('decide', 'reject request'), ('examine thoroughly', 'decide'), ('reinitiate request', 'check ticket'), ('register request', 'check ticket'), ('decide', 'reinitiate request')}, 'parallel': {('examine casually', 'check ticket'), ('check ticket', 'examine casually'), ('check ticket', 'examine thoroughly'), ('examine thoroughly', 'check ticket')}, 'start_activities': {'register request'}, 'end_activities': {'pay compensation', 'reject request'}, 'activities': {'reject request', 'register request', 'check ticket', 'decide', 'pay compensation', 'examine thoroughly', 'examine casually', 'reinitiate request'}}
The data structure is a dictionary with, as keys, sequence (expressing directly-follows
relationships) and parallel (expressing the parallel behavior that can happen in either way).
The footprints of the log, trace-by-trace, can be calculated as follows, and are a list of
footprints for each trace:
from pm4py.algo.discovery.footprints import algorithm as footprints_discovery
if __name__ == "__main__":
fp_trace_by_trace = footprints_discovery.apply(log, variant=footprints_discovery.Variants.TRACE_BY_TRACE)
{'sequence': {('check ticket', 'decide'), ('reinitiate request', 'examine casually'), ('register request', 'examine thoroughly'), ('decide', 'reject request'), ('register request', 'check ticket'), ('register request', 'examine casually'), ('decide', 'reinitiate request'), ('reinitiate request', 'examine thoroughly'), ('decide', 'pay compensation'), ('reinitiate request', 'check ticket'), ('examine casually', 'decide'), ('examine thoroughly', 'decide')}, 'parallel': {('check ticket', 'examine thoroughly'), ('examine thoroughly', 'check ticket'), ('check ticket', 'examine casually'), ('examine casually', 'check ticket')}, 'activities': {'decide', 'examine casually', 'reinitiate request', 'check ticket', 'examine thoroughly', 'register request', 'reject request', 'pay compensation'}, 'start_activities': {'register request'}}
The data structure is a dictionary with, as keys, sequence (expressing directly-follows
relationships) and parallel (expressing the parallel behavior that can happen in either way).
It is possible to visualize a comparison between the footprints of the (entire) log and the
footprints of the (entire) model.
First of all, let’s see how to visualize a single footprints table, for example the one of
the model. The following code can be used:
from pm4py.visualization.footprints import visualizer as fp_visualizer
if __name__ == "__main__":
gviz = fp_visualizer.apply(fp_net, parameters={fp_visualizer.Variants.SINGLE.value.Parameters.FORMAT: "svg"})
fp_visualizer.view(gviz)
To compare the two footprints tables, the following code can be used. Please note that the visualization will look the same if no deviations are discovered. If deviations are found, they are colored red.
from pm4py.visualization.footprints import visualizer as fp_visualizer
if __name__ == "__main__":
gviz = fp_visualizer.apply(fp_log, fp_net, parameters={fp_visualizer.Variants.COMPARISON.value.Parameters.FORMAT: "svg"})
fp_visualizer.view(gviz)
import os
import pm4py
if __name__ == "__main__":
    log = pm4py.read_xes(os.path.join("tests", "input_data", "receipt.xes"))
    filtered_log = pm4py.filter_variants_top_k(log, 3)
    net, im, fm = pm4py.discover_petri_net_inductive(filtered_log)
With a conformance checking operation, we want instead to compare the behavior of the traces
of the log against the footprints of the model.
This can be done using the following code:
if __name__ == "__main__":
conf_fp = pm4py.conformance_diagnostics_footprints(fp_trace_by_trace, fp_net)
The result will contain, for each trace of the log, a set with the deviations. An extract of the list for some traces:
{('T06 Determine necessity of stop advice', 'T04 Determine confirmation of receipt'), ('T02 Check confirmation of receipt', 'T06 Determine necessity of stop advice')}
set()
{('T19 Determine report Y to stop indication', 'T20 Print report Y to stop indication'), ('T10 Determine necessity to stop indication', 'T16 Report reasons to hold request'), ('T16 Report reasons to hold request', 'T17 Check report Y to stop indication'), ('T17 Check report Y to stop indication', 'T19 Determine report Y to stop indication')}
set()
set()
{('T02 Check confirmation of receipt', 'T06 Determine necessity of stop advice'), ('T10 Determine necessity to stop indication', 'T04 Determine confirmation of receipt'), ('T04 Determine confirmation of receipt', 'T03 Adjust confirmation of receipt'), ('T03 Adjust confirmation of receipt', 'T02 Check confirmation of receipt')}
set()
We can see that for the first trace that contains deviations, there are two deviations, the
first related to T06 Determine necessity of stop advice being executed before T04 Determine
confirmation of receipt; the second related to T02 Check confirmation of receipt being followed
by T06 Determine necessity of stop advice.
The traces for which the conformance returns nothing are fit (at least according to the
footprints).
Footprints conformance checking is a way to identify obvious deviations: behavior of the log that is not allowed by the model.
On the log side, the computation of footprints scales very well. The calculation of footprints for a Petri net model may instead be more expensive.
If we change the underlying model from Petri nets to process trees, it is possible to exploit their bottom-up structure in order to calculate the footprints almost instantaneously.
Let’s open a log, calculate a process tree and then apply the discovery of the footprints.
We open the running-example log:
import pm4py
if __name__ == "__main__":
    log = pm4py.read_xes("tests/input_data/running-example.xes")
    # discover the process tree that is used below for the footprints of the model
    tree = pm4py.discover_process_tree_inductive(log)
Then, the footprints can be discovered. We discover the footprints on the entire log, we
discover the footprints trace-by-trace in the log, and then we discover the footprints on
the process tree:
from pm4py.algo.discovery.footprints import algorithm as fp_discovery
if __name__ == "__main__":
fp_log = fp_discovery.apply(log, variant=fp_discovery.Variants.ENTIRE_EVENT_LOG)
fp_trace_trace = fp_discovery.apply(log, variant=fp_discovery.Variants.TRACE_BY_TRACE)
fp_tree = fp_discovery.apply(tree, variant=fp_discovery.Variants.PROCESS_TREE)
Each one of these contains:
A list of sequential footprints contained in the log/allowed by the model
A list of parallel footprints contained in the log/allowed by the model
A list of activities contained in the log/allowed by the model
A list of start activities contained in the log/allowed by the model
A list of end activities contained in the log/allowed by the model
It is possible to execute an enhanced conformance checking between the footprints of the
(entire) log, and the footprints of the model, by doing:
from pm4py.algo.conformance.footprints import algorithm as fp_conformance
if __name__ == "__main__":
conf_result = fp_conformance.apply(fp_log, fp_tree, variant=fp_conformance.Variants.LOG_EXTENSIVE)
The result contains, for each item of the previous list, the violations.
Given the result of conformance checking, it is possible to calculate the footprints-based
fitness and precision of the process model, by doing:
from pm4py.algo.conformance.footprints.util import evaluation
if __name__ == "__main__":
fitness = evaluation.fp_fitness(fp_log, fp_tree, conf_result)
precision = evaluation.fp_precision(fp_log, fp_tree)
These values are both included in the interval [0,1].
Log Skeleton
The concept of log skeleton has been described in the contribution
Verbeek, H. M. W., and R. Medeiros de Carvalho. “Log skeletons: A classification approach to
process discovery.” arXiv preprint arXiv:1806.08247 (2018).
It is claimed to be the most accurate classification approach to decide whether a trace belongs to (the language of) a log or not.
For a log, an object containing a list of relations is calculated.