编程 Python

浅析Python实现DFA算法

Posted in Python onJune 26, 2021

一、概述

计算机操作系统中的进程状态与切换可以作为 DFA 算法的一种近似理解。如下图所示，其中椭圆表示状态，状态之间的连线表示事件，进程的状态以及事件都是可确定的，且都可以穷举。

浅析Python实现DFA算法

DFA 算法具有多种应用，在此先介绍在匹配关键词领域的应用。

二、匹配关键词

我们可以将每个文本片段作为状态，例如“匹配关键词”可拆分为“匹”、“匹配”、“匹配关”、“匹配关键”和“匹配关键词”五个文本片段。

浅析Python实现DFA算法

【过程】：

初始状态为空，当触发事件“匹”时转换到状态“匹”；
触发事件“配”，转换到状态“匹配”；
依次类推，直到转换为最后一个状态“匹配关键词”。

再让我们考虑多个关键词的情况，例如“匹配算法”、“匹配关键词”以及“信息抽取”。

浅析Python实现DFA算法

可以看到上图的状态图类似树形结构，也正是因为这个结构，使得 DFA 算法在关键词匹配方面要快于关键词迭代方法（for 循环）。经常刷 LeetCode 的读者应该清楚树形结构的时间复杂度要小于 for 循环的时间复杂度。

for 循环：

keyword_list = []

for keyword in ["匹配算法", "匹配关键词", "信息抽取"]:
    if keyword in "DFA 算法匹配关键词":
        keyword_list.append(keyword)

for 循环需要遍历一遍关键词表，随着关键词表的扩充，所需的时间也会越来越长。

DFA 算法：找到“匹”时，只会按照事件走向特定的序列，例如“匹配关键词”，而不会走向“匹配算法”，因此遍历的次数要小于 for 循环。具体的实现放在下文中。

【问】：那么如何构建状态图所示的结构呢？

【答】：在 Python 中我们可以使用 dict 数据结构。

state_event_dict = {
    "匹": {
        "配": {
            "算": {
                "法": {
                    "is_end": True
                },
                "is_end": False
            },
            "关": {
                "键": {
                    "词": {
                        "is_end": True
                    },
                    "is_end": False
                },
                "is_end": False
            },
            "is_end": False
        },
        "is_end": False
    },
    "信": {
        "息": {
            "抽": {
                "取": {
                    "is_end": True
                },
                "is_end": False
            },
            "is_end": False
        },
        "is_end": False
    }
}

用嵌套字典来作为树形结构，key 作为事件，通过 is_end 字段来判断状态是否为最后一个状态，如果是最后一个状态，则停止状态转换，获取匹配的关键词。

【问】：如果关键词存在包含关系，例如“匹配关键词”和“匹配”，那么该如何处理呢？

【答】：我们仍然可以用 is_end 字段来表示关键词的结尾，同时添加一个新的字段，例如 is_continue 来表明仍可继续进行匹配。除此之外，也可以通过寻找除 is_end 字段外是否还有其他的字段来判断是否继续进行匹配。例如下面代码中的“配”，除了 is_end 字段外还有“关”，因此还需要继续进行匹配。

state_event_dict = {
    "匹": {
        "配": {
            "关": {
                "键": {
                    "词": {
                        "is_end": True
                    },
                    "is_end": False
                },
                "is_end": False
            },
            "is_end": True
        },
        "is_end": False
    }
}

接下来，我们来实现这个算法。

三、算法实现

使用 Python 3.6 版本实现，当然 Python 3.X 都能运行。

3.1、构建存储结构

def _generate_state_event_dict(keyword_list: list) -> dict:
    state_event_dict = {}

    # 遍历每一个关键词
    for keyword in keyword_list:
        current_dict = state_event_dict
        length = len(keyword)

        for index, char in enumerate(keyword):
            if char not in current_dict:
                next_dict = {"is_end": False}
                current_dict[char] = next_dict
                current_dict = next_dict
            else:
                next_dict = current_dict[char]
                current_dict = next_dict

            if index == length - 1:
                current_dict["is_end"] = True

    return state_event_dict

关于上述代码仍然有不少可迭代优化的地方，例如先对关键词列表按照字典序进行排序，这样可以让具有相同前缀的关键词集中在一块，从而在构建存储结构时能够减少遍历的次数。

3.2、匹配关键词

def match(state_event_dict: dict, content: str):
    match_list = []
    state_list = []
    temp_match_list = []

    for char_pos, char in enumerate(content):
        # 首先找到匹配项的起点
        if char in state_event_dict:
            state_list.append(state_event_dict)
            temp_match_list.append({
                "start": char_pos,
                "match": ""
            })

        # 可能会同时满足多个匹配项，因此遍历这些匹配项
        for index, state in enumerate(state_list):
            if char in state:
                state_list[index] = state[char]
                temp_match_list[index]["match"] += char

                # 如果抵达匹配项的结尾，表明匹配关键词完成
                if state[char]["is_end"]:
                    match_list.append(copy.deepcopy(temp_match_list[index]))

                    # 如果还能继续，则继续进行匹配
                    if len(state[char].keys()) == 1:
                        state_list.pop(index)
                        temp_match_list.pop(index)
            # 如果不满足匹配项的要求，则将其移除
            else:
                state_list.pop(index)
                temp_match_list.pop(index)

    return match_list

3.3、完整代码

import re
import copy


class DFA:

    def __init__(self, keyword_list: list):
        self.state_event_dict = self._generate_state_event_dict(keyword_list)

    def match(self, content: str):
        match_list = []
        state_list = []
        temp_match_list = []

        for char_pos, char in enumerate(content):
            if char in self.state_event_dict:
                state_list.append(self.state_event_dict)
                temp_match_list.append({
                    "start": char_pos,
                    "match": ""
                })

            for index, state in enumerate(state_list):
                if char in state:
                    state_list[index] = state[char]
                    temp_match_list[index]["match"] += char

                    if state[char]["is_end"]:
                        match_list.append(copy.deepcopy(temp_match_list[index]))

                        if len(state[char].keys()) == 1:
                            state_list.pop(index)
                            temp_match_list.pop(index)
                else:
                    state_list.pop(index)
                    temp_match_list.pop(index)

        return match_list

    @staticmethod
    def _generate_state_event_dict(keyword_list: list) -> dict:
        state_event_dict = {}

        for keyword in keyword_list:
            current_dict = state_event_dict
            length = len(keyword)

            for index, char in enumerate(keyword):
                if char not in current_dict:
                    next_dict = {"is_end": False}
                    current_dict[char] = next_dict
                    current_dict = next_dict
                else:
                    next_dict = current_dict[char]
                    current_dict = next_dict

                if index == length - 1:
                    current_dict["is_end"] = True

        return state_event_dict


if __name__ == "__main__":
    dfa = DFA(["匹配关键词", "匹配算法", "信息抽取", "匹配"])
    print(dfa.match("信息抽取之 DFA 算法匹配关键词，匹配算法"))

输出：

[

{

'start': 0,

'match': '信息抽取'

}, {

'start': 12,

'match': '匹配'

}, {

'start': 12,

'match': '匹配关键词'

}, {

'start': 18,

'match': '匹配'

}, {

'start': 18,

'match': '匹配算法'

}

]

四、其他用法

4.1、添加通配符

在敏感词识别时往往会遇到同一种类型的句式，例如“你这个傻X”，其中 X 可以有很多，难道我们需要一个个添加到关键词表中吗？最好能够通过类似正则表达式的方法去进行识别。一个简单的做法就是“*”，匹配任何内容。

添加通配符只需要对匹配关键词过程进行修改：

def match(self, content: str):
    match_list = []
    state_list = []
    temp_match_list = []

    for char_pos, char in enumerate(content):
        if char in self.state_event_dict:
            state_list.append(self.state_event_dict)
            temp_match_list.append({
                "start": char_pos,
                "match": ""
            })

        for index, state in enumerate(state_list):
            is_find = False
            state_char = None

            # 如果是 * 则匹配所有内容
            if "*" in state:
                state_list[index] = state["*"]
                state_char = state["*"]
                is_find = True

            if char in state:
                state_list[index] = state[char]
                state_char = state[char]
                is_find = True

            if is_find:
                temp_match_list[index]["match"] += char

                if state_char["is_end"]:
                    match_list.append(copy.deepcopy(temp_match_list[index]))

                    if len(state_char.keys()) == 1:
                        state_list.pop(index)
                        temp_match_list.pop(index)
            else:
                state_list.pop(index)
                temp_match_list.pop(index)

    return match_list

main() 函数。

if __name__ == "__main__":
    dfa = DFA(["匹配关键词", "匹配算法", "信息*取", "匹配"])
    print(dfa.match("信息抽取之 DFA 算法匹配关键词，匹配算法，信息抓取"))

输出：

[

{

'start': 0,

'match': '信息抽取'

}, {

'start': 12,

'match': '匹配'

}, {

'start': 12,

'match': '匹配关键词'

}, {

'start': 18,

'match': '匹配'

}, {

'start': 18,

'match': '匹配算法'

}, {

'start': 23,

'match': '信息抓取'

}

]

以上就是浅析Python实现DFA算法的详细内容，更多关于Python DFA算法的资料请关注三水点靠木其它相关文章！

浅析Python实现DFA算法

- Author -

clvsit

声明：登载此文出于传递更多信息之目的，并不意味着赞同其观点或证实其描述。

Python 相关文章推荐

python局域网ip扫描示例分享

Apr 03 Python

Python中的赋值、浅拷贝、深拷贝介绍

Mar 09 Python

以Python的Pyspider为例剖析搜索引擎的网络爬虫实现方法

Mar 30 Python

在GitHub Pages上使用Pelican搭建博客的教程

Apr 25 Python

简述Python中的进程、线程、协程

Mar 18 Python

Python随机数random模块使用指南

Sep 09 Python

django将数组传递给前台模板的方法

Aug 06 Python

python每天定时运行某程序代码

Aug 16 Python

Python3.7 pyodbc完美配置访问access数据库

Oct 03 Python

Tensorflow分批量读取数据教程

Feb 07 Python

Python环境配置实现pip加速过程解析

Nov 27 Python

pyqt5蒙版遮罩mask,setmask的使用

Jun 11 Python

解析目标检测之IoU

pycharm代码删除恢复的方法

Python max函数中key的用法及原理解析

Python访问Redis的详细操作

一文搞懂python异常处理、模块与包

Python实战之OpenCV实现猫脸检测

Python爬虫基础之简单说一下scrapy的框架结构