PM4PY - Filtering Event Data

2025/2/19 17:01:46

Summary: filtering event data

Notes

trace: a path through the process graph; walking it once from start to finish is one trace.
Variant: a class of traces; all traces that follow the same path form one variant.
case: one recorded execution in the event log (it corresponds to a trace).
activity: a single action in the process, identified by its name.
event: the record of one action, carrying the activity name, the timestamp, the resource, and similar attributes.
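
The vocabulary above can be made concrete with a tiny event table. The column names follow the XES convention used throughout this post; the data itself is made up for illustration:

```python
import pandas as pd

# A made-up event log: each row is an event; a case is one process execution.
df = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c2", "c2", "c3", "c3", "c3"],
    "concept:name":      ["A",  "B",  "A",  "B",  "A",  "C",  "B"],
    "time:timestamp": pd.to_datetime([
        "2011-03-09", "2011-03-10", "2011-04-01", "2011-04-02",
        "2011-05-05", "2011-05-06", "2011-05-07"]),
})

# The trace of a case is its sequence of activities in timestamp order;
# cases that share the same trace belong to the same variant.
ordered = df.sort_values("time:timestamp")
variants = ordered.groupby("case:concept:name")["concept:name"].agg(",".join)
print(variants.to_dict())  # c1 and c2 share variant "A,B"; c3 is "A,C,B"
```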

Filtering on timeframe

If we are only interested in the traces contained in a given time range, i.e. whose events all fall between the start and the end time, for example between 2011-03-09 and 2012-01-18, we can use filter_traces_contained. The first snippet works on a log object, the second on a dataframe (the later examples follow the same pattern).

from pm4py.algo.filtering.log.timestamp import timestamp_filter
filtered_log = timestamp_filter.filter_traces_contained(log, "2011-03-09 00:00:00", "2012-01-18 23:59:59")
from pm4py.algo.filtering.pandas.timestamp import timestamp_filter
df_timest_contained = timestamp_filter.filter_traces_contained(
    dataframe, "2011-03-09 00:00:00", "2012-01-18 23:59:59",
    parameters={timestamp_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                timestamp_filter.Parameters.TIMESTAMP_KEY: "time:timestamp"})

intersecting keeps the traces that merely overlap the time range: a trace is retained as soon as part of it falls between the two timestamps, even if it starts before the range or ends after it (whereas contained requires the whole trace to lie inside the range). The corresponding code:

from pm4py.algo.filtering.log.timestamp import timestamp_filter
filtered_log = timestamp_filter.filter_traces_intersecting(log, "2011-03-09 00:00:00", "2012-01-18 23:59:59")
from pm4py.algo.filtering.pandas.timestamp import timestamp_filter
df_timest_intersecting = timestamp_filter.filter_traces_intersecting(
    dataframe, "2011-03-09 00:00:00", "2012-01-18 23:59:59",
    parameters={timestamp_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                timestamp_filter.Parameters.TIMESTAMP_KEY: "time:timestamp"})
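
The difference between the two filters can be sketched with plain Python. The window below is the one from the snippets above; the trace spans are made up, and the two predicates are my reading of the contained/intersecting semantics, not pm4py's actual implementation:

```python
from datetime import datetime

win_start = datetime(2011, 3, 9)
win_end = datetime(2012, 1, 18)

# Made-up (first event, last event) spans for three hypothetical traces.
traces = {
    "inside":   (datetime(2011, 6, 1), datetime(2011, 7, 1)),
    "overlaps": (datetime(2011, 1, 1), datetime(2011, 4, 1)),
    "outside":  (datetime(2012, 5, 1), datetime(2012, 6, 1)),
}

def contained(first, last):
    # the whole trace lies inside the window
    return win_start <= first and last <= win_end

def intersecting(first, last):
    # the trace's span overlaps the window at all
    return first <= win_end and last >= win_start

result = {name: (contained(*span), intersecting(*span)) for name, span in traces.items()}
print(result)  # "overlaps" passes intersecting but not contained
```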

Filter on start activities

First find out which start activities the log contains, then filter on them.
log_start is a dictionary whose keys are activity names and whose values are occurrence counts.

from pm4py.algo.filtering.log.start_activities import start_activities_filter
log_start = start_activities_filter.get_start_activities(log)
filtered_log = start_activities_filter.apply(log, ["S1"]) #suppose "S1" is the start activity you want to filter on
from pm4py.algo.filtering.pandas.start_activities import start_activities_filter
log_start = start_activities_filter.get_start_activities(dataframe)
df_start_activities = start_activities_filter.apply(
    dataframe, ["S1"],  # suppose "S1" is the start activity you want to filter on
    parameters={start_activities_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                start_activities_filter.Parameters.ACTIVITY_KEY: "concept:name"})

There is also a filter that selects start activities by how frequently they occur, apply_auto_filter. Its DECREASING_FACTOR parameter defaults to 0.6.

from pm4py.algo.filtering.log.start_activities import start_activities_filter
log_af_sa = start_activities_filter.apply_auto_filter(log, parameters={start_activities_filter.Parameters.DECREASING_FACTOR: 0.6})
from pm4py.algo.filtering.pandas.start_activities import start_activities_filter
df_auto_sa = start_activities_filter.apply_auto_filter(dataframe, parameters={start_activities_filter.Parameters.DECREASING_FACTOR: 0.6})
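
I have not verified pm4py's internals, but my reading of DECREASING_FACTOR is: sort the start activities by frequency and keep them until one activity's count drops below factor × the previous activity's count. A toy sketch of that rule, with made-up counts:

```python
# Made-up start-activity counts; the rule below is my unverified reading
# of how a DECREASING_FACTOR-style cutoff works, not pm4py's actual code.
counts = {"register": 50, "S1": 40, "rare start": 3}
factor = 0.6

ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
kept, prev = [], None
for activity, count in ordered:
    if prev is not None and count < factor * prev:
        break  # frequency dropped too sharply: discard the remaining tail
    kept.append(activity)
    prev = count
print(kept)  # "S1" survives (40 >= 0.6*50); "rare start" does not (3 < 0.6*40)
```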

Filter on end activities

Again, first retrieve the end-activity names.

from pm4py.algo.filtering.log.end_activities import end_activities_filter
end_activities = end_activities_filter.get_end_activities(log)
filtered_log = end_activities_filter.apply(log, ["pay compensation"])
from pm4py.algo.filtering.pandas.end_activities import end_activities_filter
end_activities = end_activities_filter.get_end_activities(df)
filtered_df = end_activities_filter.apply(
    df, ["pay compensation"],
    parameters={end_activities_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                end_activities_filter.Parameters.ACTIVITY_KEY: "concept:name"})

Filter on variants

To obtain the list of variants contained in a log: the result is a dictionary whose keys are the variants and whose values are the lists of cases sharing each variant.

from pm4py.algo.filtering.log.variants import variants_filter
variants = variants_filter.get_variants(log)
from pm4py.statistics.traces.generic.pandas import case_statistics
variants = case_statistics.get_variants_df(
    df,
    parameters={case_statistics.Parameters.CASE_ID_KEY: "case:concept:name",
                case_statistics.Parameters.ACTIVITY_KEY: "concept:name"})

To get how often each variant occurs, the following code returns the variants with their counts (a list of dictionaries, each holding a variant and its number of occurrences).

from pm4py.statistics.traces.generic.log import case_statistics
variants_count = case_statistics.get_variant_statistics(log)
variants_count = sorted(variants_count, key=lambda x: x['count'], reverse=True)
from pm4py.statistics.traces.generic.pandas import case_statistics
variants_count = case_statistics.get_variant_statistics(
    df,
    parameters={case_statistics.Parameters.CASE_ID_KEY: "case:concept:name",
                case_statistics.Parameters.ACTIVITY_KEY: "concept:name",
                case_statistics.Parameters.TIMESTAMP_KEY: "time:timestamp"})
variants_count = sorted(variants_count, key=lambda x: x['case:concept:name'], reverse=True)

To filter on variants, assume variants is a list whose elements are the variants to keep.

from pm4py.algo.filtering.log.variants import variants_filter
filtered_log1 = variants_filter.apply(log, variants)
from pm4py.algo.filtering.pandas.variants import variants_filter
filtered_df1 = variants_filter.apply(
    df, variants,
    parameters={variants_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                variants_filter.Parameters.ACTIVITY_KEY: "concept:name"})

Conversely, to filter the given variants out (variants is still a list of variants), set the POSITIVE parameter to False.

filtered_log2 = variants_filter.apply(log, variants, parameters={variants_filter.Parameters.POSITIVE: False})
filtered_df2 = variants_filter.apply(
    df, variants,
    parameters={variants_filter.Parameters.POSITIVE: False,
                variants_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                variants_filter.Parameters.ACTIVITY_KEY: "concept:name"})

To automatically keep only the most common variants, use the apply_auto_filter method. It accepts a DECREASING_FACTOR parameter, the same one as in the start-activities filter, with a default of 0.6.

auto_filtered_log = variants_filter.apply_auto_filter(log)
auto_filtered_df = variants_filter.apply_auto_filter(
    df,
    parameters={variants_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                variants_filter.Parameters.ACTIVITY_KEY: "concept:name"})

For event log objects, a variants-percentage filter is also available. A percentage parameter between 0 and 1 states how many variants to keep: 0 keeps only the most frequent variant, 1 keeps them all.

from pm4py.algo.filtering.log.variants import variants_filter
filtered_log = variants_filter.filter_log_variants_percentage(log, percentage=0.5)

Other variant-based filters:
the top-k filter keeps only the k most common variants.

import pm4py
log = pm4py.read_xes("tests/input_data/receipt.xes")
k = 2
filtered_log = pm4py.filter_variants_top_k(log, k)

The variant coverage-percentage filter keeps the variants that cover at least the given fraction of cases.
For example, with min_coverage_percentage=0.4 on a log of 1000 cases where variant1 covers 500 cases, variant2 covers 400 and variant3 covers 100, the filter keeps only variant1 and variant2.

import pm4py
log = pm4py.read_xes("tests/input_data/receipt.xes")
perc = 0.1
filtered_log = pm4py.filter_variants_by_coverage_percentage(log, perc)
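
The coverage arithmetic from the example above can be checked directly (the counts and the 0.4 threshold are the made-up figures from the text):

```python
# Made-up case counts per variant, as in the worked example.
cases_per_variant = {"variant1": 500, "variant2": 400, "variant3": 100}
total = sum(cases_per_variant.values())  # 1000 cases
min_coverage_percentage = 0.4

# A variant survives if it covers at least the required fraction of cases.
kept = [v for v, n in cases_per_variant.items() if n / total >= min_coverage_percentage]
print(kept)  # variant3 covers only 100/1000 = 0.1 < 0.4, so it is dropped
```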

Filter on attribute values

This filter keeps (or, with the POSITIVE parameter set to False, removes) the cases or events whose chosen attribute, such as org:resource, takes one of the given values.
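
What attribute-value filtering amounts to can be shown with plain pandas (not the pm4py API itself; column names follow the XES convention, the data and the chosen values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c2", "c2"],
    "concept:name": ["register", "decide", "register", "pay compensation"],
    "org:resource": ["Sara", "Mike", "Sara", "Sara"],
})

# Event-level filtering: keep only the events with a matching attribute value.
events_kept = df[df["org:resource"].isin(["Sara"])]

# Case-level filtering: keep every event of each case that contains
# at least one event with a matching attribute value.
cases = df.loc[df["concept:name"] == "pay compensation", "case:concept:name"].unique()
cases_kept = df[df["case:concept:name"].isin(cases)]
print(len(events_kept), sorted(cases_kept["case:concept:name"].unique()))
```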

Filter on numeric attribute values

The same idea applied to numeric attributes: keep the cases or events whose attribute value falls within a given range.
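
A plain-pandas sketch of case-level numeric filtering (not the pm4py API; the "amount" attribute, the bounds, and the data are all made up):

```python
import pandas as pd

# Made-up log with a numeric event attribute "amount".
df = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c2", "c3"],
    "concept:name": ["register", "pay", "pay", "pay"],
    "amount": [0.0, 120.0, 40.0, 900.0],
})

low, high = 100.0, 500.0
# Keep the cases that contain at least one event whose amount is in [low, high].
hit = df.loc[df["amount"].between(low, high), "case:concept:name"].unique()
filtered = df[df["case:concept:name"].isin(hit)]
print(sorted(filtered["case:concept:name"].unique()))  # only c1 has an in-range amount
```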

Between Filter

This identifies, within the current cases, all the subcases that go from a source activity to a target activity, and turns them into an event log.

import pm4py
log = pm4py.read_xes("tests/input_data/running-example.xes")
filtered_log = pm4py.filter_between(log, "check ticket", "decide")

Case Size Filter

This keeps the cases whose number of events falls within the given range (here, between 5 and 10 events).

import pm4py
log = pm4py.read_xes("tests/input_data/running-example.xes")
filtered_log = pm4py.filter_case_size(log, 5, 10)

Rework Filter

This identifies the cases in which an activity is repeated.
In the example below, we look for all cases where the reinitiate request activity occurs at least twice.

import pm4py
log = pm4py.read_xes("tests/input_data/running-example.xes")
filtered_log = pm4py.filter_activities_rework(log, "reinitiate request", 2)

Paths Performance Filter

This identifies the cases in which the path between two given activities takes an amount of time within a given range.
In the example below, we look for cases containing at least one decide → pay compensation path that took between 2 and 10 days (the bounds are expressed in seconds).

import pm4py
log = pm4py.read_xes("tests/input_data/running-example.xes")
filtered_log = pm4py.filter_paths_performance(log, ("decide", "pay compensation"), 2*86400, 10*86400)

Generic Filtering on Event Log

If none of the filters above fits, a generic method based on a boolean expression is available.
In the example below, we keep the cases that contain more than 6 events.

import pm4py
log = pm4py.read_xes("tests/input_data/running-example.xes")
filtered_log = pm4py.filter_log(lambda x: len(x) > 6, log)

*If you spot an error in these notes (for example a wrong concept or a poorly worded explanation), please point it out, so we can learn and improve together.


http://www.ppmy.cn/news/514417.html
