# Python batch regex matching (multiple texts × multiple rules)

2023-08-27 · Notes, Tips, Experiments

## What is multi-rule regex matching

Multi-rule regex matching is a semantic-extraction technique based on hand-written rules. It is simple to implement and good enough in the right scenarios.

Here is what multi-rule regex matching produces:

> This paper proposes Computation Offloading using Reinforcement Learning (CORL) scheme to minimize latency and energy consumption.

- Sentence type: contribution statement (This paper)
- Scenario: computation offloading (Computation Offloading)
- Method: reinforcement learning (Reinforcement Learning)
- Optimization objectives: latency, energy consumption

Because a single meaning can be phrased in many different ways, each semantic label needs its own regular expression. For example:

- Contribution statement (本文贡献): `this (paper|article|work|study|manuscript)`
- Computation offloading (任务卸载): `(task|comput[a-z]*( resource[a-z]*)?|application|job|service)[- ]offloading|offload([a-z]* ){,4}(task|job)`
- Reinforcement learning (强化学习): `reinforcement learning|deep Q[– ]network|actor[– ]critic`
- Latency (时延): `delay|completion time|latency`
- Energy consumption (能耗): `energy|power consumption`

## How to implement multi-rule matching

To match the keywords *and* know which rule produced each match, there are two approaches: (1) loop over the rules and apply them one by one; (2) use named groups, `(?P<name>…)`, to merge all the rules into a single pattern and match in one pass.

```python
# Approach 1: apply the rules one by one
import re

text = "This paper proposes Computation Offloading using Reinforcement Learning (CORL) scheme to minimize latency and energy consumption."
rules = [
    ['本文贡献', r'this (paper|article|work|study|manuscript)'],  # contribution statement
    ['任务卸载', r'(task|comput[a-z]*( resource[a-z]*)?|application|job|service)[- ]offloading|offload([a-z]* ){,4}(task|job)'],  # computation offloading
    ['强化学习', r'reinforcement learning|deep Q[– ]network|actor[– ]critic'],  # reinforcement learning
    ['时延', r'delay|completion time|latency'],  # latency
    ['能耗', r'energy|power consumption'],  # energy consumption
]
for sub_key, pattern in rules:
    for ret in re.finditer(pattern, text, re.IGNORECASE):
        print(f"{sub_key}: {ret.group()}")
```

```python
# Approach 2: merge all rules into one pattern
import re

text = "This paper proposes Computation Offloading using Reinforcement Learning (CORL) scheme to minimize latency and energy consumption."
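# Editor's aside (illustrative, not from the original post): a merged pattern
# can still report which rule matched because Match.lastgroup holds the name
# of the last matched capturing group. In miniature, with two toy rules:
_m = re.search(r"(?P<word>[a-z]+)|(?P<num>[0-9]+)", "abc 42")
assert _m.lastgroup == "word"  # the <word> branch matched "abc"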
pattern = (
    r"(?P<本文贡献>this (paper|article|work|study|manuscript))|"
    r"(?P<任务卸载>(task|comput[a-z]*( resource[a-z]*)?|application|job|service)[- ]offloading|offload([a-z]* ){,4}(task|job))|"
    r"(?P<强化学习>reinforcement learning|deep Q[– ]network|actor[– ]critic)|"
    r"(?P<时延>delay|completion time|latency)|"
    r"(?P<能耗>energy|power consumption)"
)
for ret in re.finditer(pattern, text, re.IGNORECASE):
    print(f"{ret.lastgroup}: {ret.group()}")
```

## Multi-text, multi-rule regex matching

The above matches one sentence against n rules. When there are also n sentences, there are again two options: (1) loop over the sentences and match each one in turn; (2) join all the sentences into one long text, match it in a single pass, and then use each match's position to work out which sentence it came from.

Below is a small experiment: with n sentences and n rules, which is faster, matching one at a time or matching merged?

**Preparing the data:**

```
data = {
    "txts": [
        "A Cost-Driven Fuzzy Scheduling Strategy for Intelligent Workflow ...",
        "Scheduling in edge-cloud environments can address ...",
        ...
    ],
    "rules": [
        "本文|this (paper|article|work|study|manuscript)|we ",
        "基于|based|driven",
        ...
    ]
}
```

(In each rule, the part before the first `|` is the rule name.)

**Test program:**

```python
'''
@File        : test.py
@Description : regular-expression benchmark
@Date        : 2023/08/22 17:21:18
@Author      : pro1515151515
@Version     : 1.0
'''
import json
from itertools import accumulate
import bisect  # Python's built-in binary search
import re
import time

def 合并文本_合并规则(txts, rules):  # merged text × merged rules
    t0 = time.time()
    output = set()
    # each rule becomes a named group whose name is the rule name;
    # flags belong in compile(), not in Pattern.finditer() (whose second
    # positional argument is the start position, not a flag)
    pattern = re.compile('|'.join(f"(?P<{l.split('|')[0].replace(' ', '_')}>{l})" for l in rules),
                         re.IGNORECASE)
    text = '\n'.join(txts)
    # cumulative end offset of each sentence (+1 for the '\n' separator)
    pins = list(accumulate(len(txt) + 1 for txt in txts))
    for ret in pattern.finditer(text):
        # bisect_right: a match starting at a sentence's first character
        # belongs to that sentence, not the previous one
        tid = bisect.bisect_right(pins, ret.start())
        output.add(f"{ret.lastgroup}|{ret.group()}|{txts[tid]}")
    t1 = time.time()
    print(f"[合并文本_合并规则] matches: {len(output)}, time: {t1-t0} s")
    return output

def 逐条文本_合并规则(txts, rules):  # one text at a time × merged rules
    t0 = time.time()
    output = set()
    pattern = re.compile('|'.join(f"(?P<{l.split('|')[0].replace(' ', '_')}>{l})" for l in rules),
                         re.IGNORECASE)
    for text in txts:
        for ret in pattern.finditer(text):
            output.add(f"{ret.lastgroup}|{ret.group()}|{text}")
    t1 = time.time()
    print(f"[逐条文本_合并规则] matches: {len(output)}, time: {t1-t0} s")
    return output

def 合并文本_逐条规则(txts, rules):  # merged text × one rule at a time
    t0 = time.time()
    output = set()
    patterns = {}
    for rule in rules:
        patterns[rule.split('|')[0].replace(' ', '_')] = re.compile(f"({rule})", re.IGNORECASE)
    text = '\n'.join(txts)
    pins = list(accumulate(len(txt) + 1 for txt in txts))
    for sub_key, pattern in patterns.items():
        for ret in pattern.finditer(text):
            tid = bisect.bisect_right(pins, ret.start())
            output.add(f"{sub_key}|{ret.group()}|{txts[tid]}")
    t1 = time.time()
    print(f"[合并文本_逐条规则] matches: {len(output)}, time: {t1-t0} s")
    return output

def 逐条文本_逐条规则(txts, rules):  # one text at a time × one rule at a time
    t0 = time.time()
    output = set()
    patterns = {}
    for rule in rules:
        patterns[rule.split('|')[0].replace(' ', '_')] = re.compile(f"({rule})", re.IGNORECASE)
    for text in txts:
        for sub_key, pattern in patterns.items():
            for ret in pattern.finditer(text):
                output.add(f"{sub_key}|{ret.group()}|{text}")
    t1 = time.time()
    print(f"[逐条文本_逐条规则] matches: {len(output)}, time: {t1-t0} s")
    return output

def main():
    with open('data.json', 'r', encoding='utf-8') as f:
        data = json.load(f)
    txts = data['txts']
    rules = data['rules']
    print(f"{len(txts)} texts, {len(rules)} matching rules")
    ans1 = 合并文本_合并规则(txts, rules)
    ans2 = 逐条文本_合并规则(txts, rules)
    ans3 = 合并文本_逐条规则(txts, rules)
    ans4 = 逐条文本_逐条规则(txts, rules)
    print("done")

if __name__ == '__main__':
    main()
```

**Test results:**

```
5299 texts, 99 matching rules
[合并文本_合并规则] matches: 4175, time: 6.736292600631714 s
[逐条文本_合并规则] matches: 4163, time: 6.20894980430603 s
[合并文本_逐条规则] matches: 4182, time: 0.6981680393218994 s
[逐条文本_逐条规则] matches: 4170, time: 0.9920296669006348 s
done
```

**Conclusion:**

For many-to-many matching with n sentences and n rules, the recommendation is to merge the texts into one string but apply the rules one at a time (合并文本_逐条规则, the fastest variant above). Also note that the results may contain duplicates, so deduplicate them.

**Attachment:** [Python批量正则匹配(测试数据+代码).zip](https://www.proup.club/usr/uploads/2023/08/1940423003.zip)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
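The offset-mapping trick used by the merged-text variants (join the sentences with `\n`, then recover a match's sentence index from cumulative offsets via binary search) can be sketched standalone. The sample sentences and the `sentence_index` helper below are illustrative, not from the original code:

```python
from itertools import accumulate
import bisect

txts = ["first sentence", "second one", "third"]
text = "\n".join(txts)
# cumulative end offset of each sentence, +1 for each '\n' separator
pins = list(accumulate(len(t) + 1 for t in txts))  # [15, 26, 32]

def sentence_index(pos):
    """Index of the sentence that contains character position `pos` of `text`."""
    # bisect_right so that a match starting exactly at a sentence's first
    # character (e.g. position 15 here) maps to that sentence
    return bisect.bisect_right(pins, pos)

assert sentence_index(text.index("first")) == 0
assert sentence_index(text.index("second")) == 1   # position 15, a boundary
assert sentence_index(text.index("third")) == 2
```

With `bisect_left` instead, a match starting exactly on a sentence boundary would be attributed to the preceding sentence.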