Filtering Sensitive Words in Python

Posted in Python on May 08, 2021

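Overview:

Sensitive-word filtering can be viewed as a kind of text anti-spam algorithm. For example:
Task: given a file of sensitive words, filtered_words.txt, replace any sensitive word the user types with asterisks (*). For instance, if 北京 ("Beijing") is in the word list and the user types 「北京是个好城市」 ("Beijing is a good city"), the output becomes 「**是个好城市」.
Code: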

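# coding=utf-8
def filterwords(path):
    # Load one sensitive word per line, skipping blank lines.
    with open(path, 'r', encoding='utf-8') as f:
        words = [line.strip() for line in f if line.strip()]
    print(words)
    userinput = input('myinput:')
    # Mask every listed word that occurs, not just the first one found.
    for word in words:
        if word in userinput:
            userinput = userinput.replace(word, '*' * len(word))
    return userinput

print(filterwords('filtered_words.txt'))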
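The loop above scans the input once per word. As an alternative, here is a minimal sketch that masks all words in a single pass with one compiled regular expression. The function name `mask_words` and its parameters are illustrative assumptions, not part of the original article, and the word list is passed in directly instead of being read from filtered_words.txt.

```python
import re

def mask_words(text, words, repl="*"):
    """Replace every occurrence of any word in `words` with `repl`
    repeated to the length of the matched word."""
    words = [w for w in words if w]
    if not words:
        return text
    # Try longer words first so a long word wins over its own prefix.
    ordered = sorted(words, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(w) for w in ordered))
    return pattern.sub(lambda m: repl * len(m.group(0)), text)

print(mask_words("北京是个好城市", ["北京", "上海"]))  # **是个好城市
```

This is fine for short word lists. For very large lists, the trie-based approaches later in the article are likely a better fit.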

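Another example, this time filtering adult content:

Write a sensitive-word filter that prompts the user for a comment. If the comment contains any word from the following list:
Sensitive-word list: li = ["苍老师","东京热","武藤兰","波多野结衣"]
replace each sensitive word in the comment with *** and append the result to a list. If the comment contains no sensitive word, append it to that list unchanged.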
content = input('Enter your comment: ')
li = ["苍老师","东京热","武藤兰","波多野结衣"]
comments = []  # collects every processed comment
# Replace each sensitive word in turn; a comment without any is left unchanged.
for word in li:
    content = content.replace(word, '***')
comments.append(content)
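The same task can be wrapped in a function so that several comments can be processed and checked without typing input. This is only a sketch. The name `moderate`, its parameters, and the placeholder words "spam" and "eggs" are assumptions made for illustration.

```python
def moderate(comment, words, store):
    """Mask each word in `words` with '***', append the result to `store`,
    and return the masked comment."""
    for w in words:
        comment = comment.replace(w, "***")
    store.append(comment)
    return comment

comments = []
moderate("hello spam world", ["spam", "eggs"], comments)
moderate("nothing to hide", ["spam", "eggs"], comments)
print(comments)  # ['hello *** world', 'nothing to hide']
```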

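Case study:

A BAT (Baidu, Alibaba, Tencent) interview question: what approaches are there for quickly replacing 50,000 sensitive words across one billion titles?
One billion titles are stored in a file, one title per line. 50,000 sensitive words are stored in another file. Write a program that filters every sensitive word out of every title and saves the result to a third file.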
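Whatever matching algorithm is chosen, a file of this size should not be loaded into memory at once. Here is a minimal sketch, assuming UTF-8 files with one title per line, that streams the input and writes each filtered line out immediately. `filter_file` and its `filter_line` parameter are hypothetical names introduced here. Any per-line filter can be plugged in, for example the `filter` method of the DFA class shown later.

```python
def filter_file(src_path, dst_path, filter_line):
    """Stream src_path line by line, write filter_line(line) to dst_path,
    and return the number of lines processed."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # only one line is held in memory at a time
            dst.write(filter_line(line))
            count += 1
    return count

# Hypothetical usage with a trivial filter:
# filter_file("titles.txt", "clean.txt", lambda s: s.replace("bad", "***"))
```

The filter receives each line with its trailing newline, so it should pass characters it does not recognise through unchanged.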

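1. DFA sensitive-word filtering algorithm

Among text-filtering algorithms, DFA is one of the better choices. DFA stands for Deterministic Finite Automaton.
The core of the algorithm is to build a set of trees from the sensitive words, one tree per distinct first character, so that words sharing a prefix share a path.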
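To make the tree idea concrete, here is a minimal sketch, assuming the trees are stored as nested dicts with an end-of-word marker. `build_trie`, `match_length_at` and `END` are illustrative names. The lookup returns the shortest word found at a position.

```python
END = "\x00"  # end-of-word marker stored inside a node

def build_trie(words):
    """Build a nested-dict trie: each character maps to a child dict."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = 0  # a word ends at this node
    return root

def match_length_at(trie, text, start):
    """Return the length of the shortest word starting at text[start], or 0."""
    node = trie
    for offset, ch in enumerate(text[start:], 1):
        if ch not in node:
            return 0
        node = node[ch]
        if END in node:
            return offset
    return 0

trie = build_trie(["ab", "abc", "bd"])
print(match_length_at(trie, "xabc", 1))  # 2
```

Because "ab" and "abc" share the path a, b, checking a position costs at most the length of the longest word, regardless of how many words are in the list.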
Python implementation of the DFA algorithm:

# -*- coding:utf-8 -*-

import time
time1=time.time()

# DFA algorithm
class DFAFilter():
    def __init__(self):
        self.keyword_chains = {}
        self.delimit = '\x00'

    def add(self, keyword):
        keyword = keyword.lower()
        chars = keyword.strip()
        if not chars:
            return
        level = self.keyword_chains
        for i in range(len(chars)):
            if chars[i] in level:
                level = level[chars[i]]
            else:
                if not isinstance(level, dict):
                    break
                for j in range(i, len(chars)):
                    level[chars[j]] = {}
                    last_level, last_char = level, chars[j]
                    level = level[chars[j]]
                last_level[last_char] = {self.delimit: 0}
                break
        if i == len(chars) - 1:
            level[self.delimit] = 0

    def parse(self, path):
        with open(path,encoding='utf-8') as f:
            for keyword in f:
                self.add(str(keyword).strip())

    def filter(self, message, repl="*"):
        message = message.lower()
        ret = []
        start = 0
        while start < len(message):
            level = self.keyword_chains
            step_ins = 0
            for char in message[start:]:
                if char in level:
                    step_ins += 1
                    if self.delimit not in level[char]:
                        level = level[char]
                    else:
                        ret.append(repl * step_ins)
                        start += step_ins - 1
                        break
                else:
                    ret.append(message[start])
                    break
            else:
                ret.append(message[start])
            start += 1

        return ''.join(ret)


if __name__ == "__main__":
    gfw = DFAFilter()
    path="F:/文本反垃圾算法/sensitive_words.txt"
    gfw.parse(path)
    text="新疆骚乱苹果新品发布会?八"
    result = gfw.filter(text)

    print(text)
    print(result)
    time2 = time.time()
    print('Total time: ' + str(time2 - time1) + 's')

Output:

新疆骚乱苹果新品发布会?八
****苹果新品发布会**
Total time: 0.0010344982147216797s

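2. AC automaton sensitive-word filtering algorithm

AC automaton (Aho-Corasick): a classic use case is, given n words and a text of m characters, to find how many of the words occur in the text.
Put simply, an AC automaton is a trie combined with KMP-style failure pointers, which tell the matcher where to resume after a mismatch instead of starting over.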
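The counting problem just described can be sketched compactly. This is an illustrative version, not the article's implementation. It assumes states are stored as list indices, and `build_automaton` and `find_words` are names chosen here.

```python
from collections import deque

def build_automaton(words):
    """Return (goto, fail, out): trie edges, failure links, and the
    words recognised at each state."""
    goto, fail, out = [{}], [0], [[]]
    for word in words:
        if not word:
            continue
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first: depth-1 states keep fail = 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit words that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out

def find_words(text, automaton):
    """Return the set of dictionary words that occur in text."""
    goto, fail, out = automaton
    state, found = 0, set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found.update(out[state])
    return found

auto = build_automaton(["he", "she", "his", "hers"])
print(sorted(find_words("ushers", auto)))  # ['he', 'hers', 'she']
```

The text is scanned once. On a mismatch the failure link jumps to the longest suffix that is still a prefix of some word, which is why "he" is found inside "she" without rescanning.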

# -*- coding:utf-8 -*-

import time
time1=time.time()

# AC automaton algorithm
class node(object):
    def __init__(self):
        self.next = {}
        self.fail = None
        self.isWord = False
        self.word = ""

class ac_automation(object):

    def __init__(self):
        self.root = node()

    # Add a sensitive word to the trie
    def addword(self, word):
        temp_root = self.root
        for char in word:
            if char not in temp_root.next:
                temp_root.next[char] = node()
            temp_root = temp_root.next[char]
        temp_root.isWord = True
        temp_root.word = word

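    # Build the failure pointers (breadth-first over the trie)
    def make_fail(self):
        temp_que = []
        temp_que.append(self.root)
        while len(temp_que) != 0:
            temp = temp_que.pop(0)
            p = None
            for key, value in temp.next.items():
                if temp == self.root:
                    # Children of the root always fall back to the root.
                    temp.next[key].fail = self.root
                else:
                    # Follow the parent's failure chain until a node has a `key` edge.
                    p = temp.fail
                    while p is not None:
                        if key in p.next:
                            temp.next[key].fail = p.next[key]
                            break
                        p = p.fail
                    if p is None:
                        temp.next[key].fail = self.root
                temp_que.append(temp.next[key])

    # Find the sensitive words that occur in content
    def search(self, content):
        p = self.root
        result = []
        currentposition = 0

        while currentposition < len(content):
            word = content[currentposition]
            # On a mismatch, fall back along the failure pointers.
            while word not in p.next and p != self.root:
                p = p.fail

            if word in p.next:
                p = p.next[word]
            else:
                p = self.root

            # Report every word ending here, including shorter ones
            # reachable through the failure chain.
            temp = p
            while temp is not None and temp != self.root:
                if temp.isWord:
                    result.append(temp.word)
                temp = temp.fail
            currentposition += 1
        return result

    # Load the sensitive-word list, then build the failure pointers
    def parse(self, path):
        with open(path, encoding='utf-8') as f:
            for keyword in f:
                keyword = keyword.strip()
                if keyword:
                    self.addword(keyword)
        # search() relies on the failure pointers, so build them once loading is done.
        self.make_fail()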

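    # Replace sensitive words in the text
    def words_replace(self, text):
        """
        :param text: input text
        :return: the text with sensitive words masked
        """
        # Mask longer words first so a shorter word cannot break up a longer match.
        result = sorted(set(self.search(text)), key=len, reverse=True)
        for x in result:
            text = text.replace(x, '*' * len(x))
        return text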





if __name__ == '__main__':

    ah = ac_automation()
    path='F:/文本反垃圾算法/sensitive_words.txt'
    ah.parse(path)
    text1="新疆骚乱苹果新品发布会?八"
    text2=ah.words_replace(text1)

    print(text1)
    print(text2)

    time2 = time.time()
    print('Total time: ' + str(time2 - time1) + 's')

Output:

新疆骚乱苹果新品发布会?八
****苹果新品发布会**
Total time: 0.0010304450988769531s

- Author -

学到老
