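Summary: Filtering event data
Notes
trace: a path through the process graph; one walk from start to end counts as one trace.
Variant: a class of traces; all traces that follow the same path (the same sequence of activities) form one variant.
case: one recorded execution of the process in the event log (its sequence of events corresponds to a trace).
activity: a step in the process, identified by its name.
event: the record of one occurrence of an activity, including the activity name, the timestamp, the location, and similar information.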
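To make the terms above concrete, here is a minimal sketch in plain Python. The toy events and all names in it are made up for illustration; it does not use pm4py, and it only shows how events group into cases, traces, and variants.

```python
from collections import defaultdict

# Hypothetical toy event log: each event is (case id, activity, timestamp).
events = [
    ("c1", "register", 1), ("c1", "check", 2), ("c1", "pay", 3),
    ("c2", "register", 1), ("c2", "pay", 4),
    ("c3", "register", 2), ("c3", "check", 5), ("c3", "pay", 6),
]

# A case groups the events that share a case id; ordered by time,
# the activities of a case form its trace.
traces = defaultdict(list)
for case_id, activity, _ in sorted(events, key=lambda e: (e[0], e[2])):
    traces[case_id].append(activity)

# A variant is a distinct activity sequence; several cases can share it.
variants = defaultdict(list)
for case_id, trace in traces.items():
    variants[tuple(trace)].append(case_id)
```

Here c1 and c3 follow the same path, so the three cases collapse into two variants.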
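(Not fully understood) Filtering on timeframe
(Unsure) Use this when you are only interested in the traces within a time range, that is, traces contained between a start time and an end time, for example 2011-03-09 to 2012-01-18. The first snippet works on a log object and the second on a dataframe object (all later examples follow the same order).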
from pm4py.algo.filtering.log.timestamp import timestamp_filter
filtered_log = timestamp_filter.filter_traces_contained(log, "2011-03-09 00:00:00", "2012-01-18 23:59:59")
from pm4py.algo.filtering.pandas.timestamp import timestamp_filter
df_timest_contained = timestamp_filter.filter_traces_contained(dataframe, "2011-03-09 00:00:00", "2012-01-18 23:59:59", parameters={timestamp_filter.Parameters.CASE_ID_KEY: "case:concept:name", timestamp_filter.Parameters.TIMESTAMP_KEY: "time:timestamp"})
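(Unsure) intersecting: my first guess was that it keeps events at exactly these two timestamps, but in that case the function would take a list of timestamps rather than two bounds. A more consistent reading is that it keeps every trace whose time span overlaps the interval at all, even if the trace starts before the interval or ends after it. The corresponding code: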
from pm4py.algo.filtering.log.timestamp import timestamp_filter
filtered_log = timestamp_filter.filter_traces_intersecting(log, "2011-03-09 00:00:00", "2012-01-18 23:59:59")
from pm4py.algo.filtering.pandas.timestamp import timestamp_filter
df_timest_intersecting = timestamp_filter.filter_traces_intersecting(dataframe, "2011-03-09 00:00:00", "2012-01-18 23:59:59",parameters={timestamp_filter.Parameters.CASE_ID_KEY: "case:concept:name",timestamp_filter.Parameters.TIMESTAMP_KEY: "time:timestamp"})
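The difference between "contained" and "intersecting" can be sketched without pm4py. The snippet below is my own illustration, not pm4py's implementation: each case is reduced to its first and last timestamp, and the helper names and sample cases are hypothetical. Whether the interval bounds are inclusive is an assumption.

```python
from datetime import datetime

def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")

def is_contained(start, end, lo, hi):
    # The whole case lies inside [lo, hi] (inclusive bounds assumed).
    return lo <= start and end <= hi

def is_intersecting(start, end, lo, hi):
    # The case overlaps [lo, hi] at least partially.
    return start <= hi and end >= lo

lo, hi = parse("2011-03-09 00:00:00"), parse("2012-01-18 23:59:59")

# Hypothetical cases, given as (first event time, last event time).
cases = {
    "inside": (parse("2011-05-01 08:00:00"), parse("2011-06-01 08:00:00")),
    "overlaps_start": (parse("2011-01-01 08:00:00"), parse("2011-04-01 08:00:00")),
    "outside": (parse("2010-01-01 08:00:00"), parse("2010-02-01 08:00:00")),
}

contained = [c for c, (s, e) in cases.items() if is_contained(s, e, lo, hi)]
intersecting = [c for c, (s, e) in cases.items() if is_intersecting(s, e, lo, hi)]
```

Under this reading, every contained case is also intersecting, but a case that starts before the interval and ends inside it is only intersecting.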
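Filter on start activities
First find out which start activities occur in the log, then filter on them.
log_start is a dictionary whose keys are activity names and whose values are occurrence counts.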
from pm4py.algo.filtering.log.start_activities import start_activities_filter
log_start = start_activities_filter.get_start_activities(log)
filtered_log = start_activities_filter.apply(log, ["S1"]) #suppose "S1" is the start activity you want to filter on
from pm4py.algo.filtering.pandas.start_activities import start_activities_filter
log_start = start_activities_filter.get_start_activities(dataframe)
df_start_activities = start_activities_filter.apply(dataframe, ["S1"],parameters={start_activities_filter.Parameters.CASE_ID_KEY: "case:concept:name",start_activities_filter.Parameters.ACTIVITY_KEY: "concept:name"}) #suppose "S1" is the start activity you want to filter on
Another option is to filter by how frequently each start activity occurs. DECREASING_FACTOR defaults to 0.6.
from pm4py.algo.filtering.log.start_activities import start_activities_filter
log_af_sa = start_activities_filter.apply_auto_filter(log, parameters={start_activities_filter.Parameters.DECREASING_FACTOR: 0.6})
from pm4py.algo.filtering.pandas.start_activities import start_activities_filter
df_auto_sa = start_activities_filter.apply_auto_filter(dataframe, parameters={start_activities_filter.Parameters.DECREASING_FACTOR: 0.6})
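The notes do not say how DECREASING_FACTOR is applied. The sketch below shows one plausible reading, stated as an assumption rather than as pm4py's actual rule: walk the counts from most to least frequent and stop at the first count that drops to the factor times the previous kept count or below. The function name and the sample counts are hypothetical.

```python
def auto_filter_threshold(counts, decreasing_factor=0.6):
    # Assumed rule: keep lowering the threshold while each next count
    # stays above decreasing_factor * the current threshold.
    ordered = sorted(counts.values(), reverse=True)
    threshold = ordered[0]
    for value in ordered[1:]:
        if value > threshold * decreasing_factor:
            threshold = value
        else:
            break
    return threshold

# Hypothetical start-activity counts, shaped like the log_start dictionary.
start_counts = {"S1": 100, "S2": 70, "S3": 20}
threshold = auto_filter_threshold(start_counts)
kept = [a for a, c in start_counts.items() if c >= threshold]
```

With these counts, S2 survives because 70 > 0.6 * 100, while S3 is dropped because 20 is not above 0.6 * 70.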
Filter on end activities
Again, first find out the names of the end activities.
from pm4py.algo.filtering.log.end_activities import end_activities_filter
end_activities = end_activities_filter.get_end_activities(log)
filtered_log = end_activities_filter.apply(log, ["pay compensation"])
from pm4py.algo.filtering.pandas.end_activities import end_activities_filter
end_activities = end_activities_filter.get_end_activities(df)
filtered_df = end_activities_filter.apply(df, ["pay compensation"],parameters={end_activities_filter.Parameters.CASE_ID_KEY: "case:concept:name",end_activities_filter.Parameters.ACTIVITY_KEY: "concept:name"})
Filter on variants
To get the variants contained in a given log. The result is a dictionary whose keys are variants and whose values are the lists of cases that share that variant.
from pm4py.algo.filtering.log.variants import variants_filter
variants = variants_filter.get_variants(log)
from pm4py.statistics.traces.generic.pandas import case_statistics
variants = case_statistics.get_variants_df(df,parameters={case_statistics.Parameters.CASE_ID_KEY: "case:concept:name",case_statistics.Parameters.ACTIVITY_KEY: "concept:name"})
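To get the number of occurrences of each variant, the following code returns a list of variants with their counts (each element is a dictionary holding a variant and its number of occurrences, which is why the list can be sorted by count).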
from pm4py.statistics.traces.generic.log import case_statistics
variants_count = case_statistics.get_variant_statistics(log)
variants_count = sorted(variants_count, key=lambda x: x['count'], reverse=True)
from pm4py.statistics.traces.generic.pandas import case_statistics
variants_count = case_statistics.get_variant_statistics(df,parameters={case_statistics.Parameters.CASE_ID_KEY: "case:concept:name",case_statistics.Parameters.ACTIVITY_KEY: "concept:name",case_statistics.Parameters.TIMESTAMP_KEY: "time:timestamp"})
variants_count = sorted(variants_count, key=lambda x: x['case:concept:name'], reverse=True)
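The shape of the sorted result can be illustrated in plain Python. This is a sketch on made-up traces, not pm4py output; the "variant" and "count" keys mirror the log-based example above, and the case ids are hypothetical.

```python
from collections import Counter

# Hypothetical traces, keyed by case id.
traces = {
    "c1": ("a", "b", "c"),
    "c2": ("a", "c"),
    "c3": ("a", "b", "c"),
    "c4": ("a", "b", "c"),
}

# Count how many cases share each variant, then sort by count, descending.
counts = Counter(traces.values())
variants_count = [{"variant": v, "count": n} for v, n in counts.items()]
variants_count = sorted(variants_count, key=lambda x: x["count"], reverse=True)
```

The most frequent variant ends up first in the list.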
To filter on variants, suppose variants is a list in which each element is a variant.
from pm4py.algo.filtering.log.variants import variants_filter
filtered_log1 = variants_filter.apply(log, variants)
from pm4py.algo.filtering.pandas.variants import variants_filter
filtered_df1 = variants_filter.apply(df, variants, parameters={variants_filter.Parameters.CASE_ID_KEY: "case:concept:name", variants_filter.Parameters.ACTIVITY_KEY: "concept:name"})
Conversely, to remove the given variants instead of keeping them, set POSITIVE to False. variants is again a list in which each element is a variant.
filtered_log2 = variants_filter.apply(log, variants, parameters={variants_filter.Parameters.POSITIVE: False})
filtered_df2 = variants_filter.apply(df, variants,parameters={variants_filter.Parameters.POSITIVE: False, variants_filter.Parameters.CASE_ID_KEY: "case:concept:name",variants_filter.Parameters.ACTIVITY_KEY: "concept:name"})
To automatically keep only the most common variants, use the apply_auto_filter method. It accepts a parameter called DECREASING_FACTOR, the same as in the start activities filter, which defaults to 0.6.
auto_filtered_log = variants_filter.apply_auto_filter(log)
auto_filtered_df = variants_filter.apply_auto_filter(df,parameters={variants_filter.Parameters.POSITIVE: False, variants_filter.Parameters.CASE_ID_KEY: "case:concept:name",variants_filter.Parameters.ACTIVITY_KEY: "concept:name"})
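For event log objects, a variants percentage filter is available. It takes a percentage parameter between 0 and 1 that controls which variants are kept: 0 keeps only the most frequent variant, and 1 keeps all variants.
from pm4py.algo.filtering.log.variants import variants_filter
filtered_log = variants_filter.filter_log_variants_percentage(log, percentage=0.5)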
To make this more concrete, see the explanation below.
Other variant-based filters:
Top-k filter: keeps only the k most frequent variants.
import pm4py
log = pm4py.read_xes("tests/input_data/receipt.xes")
k = 2
filtered_log = pm4py.filter_variants_top_k(log, k)
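What the top-k filter does can be sketched on a toy log in plain Python. The traces below are hypothetical and the code is not pm4py's implementation; how ties between equally frequent variants are broken is not covered here.

```python
from collections import Counter

# Hypothetical traces, keyed by case id: three variants with counts 3, 2, 1.
traces = {
    "c1": ("a", "b", "c"), "c2": ("a", "b", "c"), "c3": ("a", "b", "c"),
    "c4": ("a", "c"), "c5": ("a", "c"),
    "c6": ("a", "d"),
}

k = 2
# Pick the k most frequent variants, then keep only the cases that follow them.
top_variants = {v for v, _ in Counter(traces.values()).most_common(k)}
filtered = {cid: t for cid, t in traces.items() if t in top_variants}
```

With k = 2, the single case following the rarest variant is dropped.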
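Variant coverage filter: keeps the variants that cover at least a given share of the cases in the log.
For example, with min_coverage_percentage=0.4 and a log of 1000 cases, of which 500 follow variant1, 400 follow variant2, and 100 follow variant3, the filter keeps only variant1 and variant2.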
import pm4py
log = pm4py.read_xes("tests/input_data/receipt.xes")
perc = 0.1
filtered_log = pm4py.filter_variants_by_coverage_percentage(log, perc)
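The arithmetic behind the 1000-case example can be checked with a short sketch. It assumes the threshold is inclusive, which is consistent with variant2 being kept at exactly 40%; the dictionary below is a stand-in for real variant counts and is not produced by pm4py.

```python
# Variant counts from the example: 1000 cases in total.
variant_counts = {"variant1": 500, "variant2": 400, "variant3": 100}
total_cases = sum(variant_counts.values())
min_coverage_percentage = 0.4

# Share of cases covered by each variant.
coverage = {v: n / total_cases for v, n in variant_counts.items()}

# Keep variants covering at least the minimum share (inclusive, by assumption).
kept = [v for v, share in coverage.items() if share >= min_coverage_percentage]
```

variant1 covers 50% and variant2 covers 40%, so both pass; variant3 covers only 10% and is removed.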
Filter on attributes values
(Not understood yet.)
(Filter on numeric attribute values)
(Between Filter)
Identifies, within each case, all sub-cases that run from a source activity to a target activity, and returns them as a new event log.
import pm4py
log = pm4py.read_xes("tests/input_data/running-example.xes")
filtered_log = pm4py.filter_between(log, "check ticket", "decide")
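The idea of extracting sub-cases can be sketched on a single trace. This is an illustration under assumptions, not pm4py's implementation: the helper name is hypothetical, and restarting the sub-case whenever the source activity repeats is my own choice.

```python
def between(trace, source, target):
    """Return every sub-sequence of trace that runs from source to target."""
    subcases, current = [], None
    for activity in trace:
        if activity == source:
            current = [activity]          # (re)start at the latest source
        elif current is not None:
            current.append(activity)
            if activity == target:
                subcases.append(current)  # sub-case complete
                current = None
    return subcases

# Hypothetical trace using activity names from the running example.
trace = [
    "register request", "check ticket", "examine casually", "decide",
    "reinitiate request", "check ticket", "decide", "pay compensation",
]
subcases = between(trace, "check ticket", "decide")
```

One case can therefore yield several sub-cases, each of which becomes its own case in the filtered log.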
(Case Size Filter)
Keeps the cases whose number of events lies within the given range.
import pm4py
log = pm4py.read_xes("tests/input_data/running-example.xes")
filtered_log = pm4py.filter_case_size(log, 5, 10)
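A plain-Python sketch of the same idea, on made-up cases. Treating both bounds as inclusive is an assumption; the notes do not state it.

```python
# Hypothetical cases: case id -> list of events (here just integers).
cases = {
    "c1": list(range(3)),
    "c2": list(range(5)),
    "c3": list(range(10)),
    "c4": list(range(12)),
}

min_size, max_size = 5, 10
# Keep cases whose event count is within [min_size, max_size] (inclusive, by assumption).
kept = {cid: ev for cid, ev in cases.items() if min_size <= len(ev) <= max_size}
```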
(Rework Filter)
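Identifies the cases in which an activity is repeated.
In the example below, we look for all cases in which the reinitiate request activity occurs at least twice.
import pm4py
log = pm4py.read_xes("tests/input_data/running-example.xes")
filtered_log = pm4py.filter_activities_rework(log, "reinitiate request", 2)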
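The rework condition itself is just an occurrence count, as this sketch on hypothetical traces shows. The helper name is made up and the code does not use pm4py.

```python
def has_rework(trace, activity, min_occurrences=2):
    # A case shows rework if the activity occurs at least min_occurrences times.
    return trace.count(activity) >= min_occurrences

# Hypothetical traces, keyed by case id.
traces = {
    "c1": ["register request", "decide", "reinitiate request",
           "decide", "reinitiate request", "decide", "pay compensation"],
    "c2": ["register request", "decide", "reinitiate request",
           "decide", "pay compensation"],
    "c3": ["register request", "decide", "pay compensation"],
}
kept = [cid for cid, t in traces.items() if has_rework(t, "reinitiate request", 2)]
```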
(Paths Performance Filter)
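Identifies the cases in which the path between two given activities takes a duration within a given range.
In the example below, we look for the cases with at least one occurrence of the path from decide to pay compensation that lasts between 2 and 10 days (durations are given in seconds).
import pm4py
log = pm4py.read_xes("tests/input_data/running-example.xes")
filtered_log = pm4py.filter_paths_performance(log, ("decide", "pay compensation"), 2*86400, 10*86400)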
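A sketch of the check on one trace, under two assumptions of mine: a "path" means the target activity directly follows the source activity, and timestamps are plain seconds. The helper name and the sample traces are hypothetical.

```python
DAY = 86400  # seconds per day

def path_within(trace, path, min_seconds, max_seconds):
    """True if some directly-following (source, target) pair has a duration in range."""
    source, target = path
    for (a1, t1), (a2, t2) in zip(trace, trace[1:]):
        if a1 == source and a2 == target and min_seconds <= t2 - t1 <= max_seconds:
            return True
    return False

# Hypothetical traces as lists of (activity, timestamp in seconds).
slow_case = [("check ticket", 0), ("decide", 1 * DAY), ("pay compensation", 4 * DAY)]
fast_case = [("check ticket", 0), ("decide", 1 * DAY), ("pay compensation", 2 * DAY)]

slow_kept = path_within(slow_case, ("decide", "pay compensation"), 2 * DAY, 10 * DAY)
fast_kept = path_within(fast_case, ("decide", "pay compensation"), 2 * DAY, 10 * DAY)
```

The first case takes 3 days from decide to pay compensation and is kept; the second takes only 1 day and is not.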
(Generic Filtering on Event Log)
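If none of the filters above meets the need, there is a generic filter based on a boolean expression.
In the example below, we keep the cases that contain more than 6 events.
import pm4py
log = pm4py.read_xes("tests/input_data/running-example.xes")
filtered_log = pm4py.filter_log(lambda x: len(x) > 6, log)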
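The same pattern in plain Python: a predicate is evaluated once per case, and only the cases for which it returns True are kept. The cases below are made up, and the code only mirrors the idea, not pm4py's internals.

```python
# Hypothetical cases: case id -> list of events (here just integers).
cases = {"c1": list(range(7)), "c2": list(range(6)), "c3": list(range(9))}

# The predicate receives one case and returns True to keep it.
predicate = lambda x: len(x) > 6

kept = dict(filter(lambda item: predicate(item[1]), cases.items()))
```

Any condition that can be written as a function of a single case can be used in place of the length check.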
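*If you spot anything wrong or unclear in this post (for example, a mistake in the content or in the wording), please point it out so that we can learn and improve together.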