
Python Full-Text Search Engine Explained

Posted in Python on April 25, 2017

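I have recently been exploring how to implement Baidu-style keyword search in Python. Keyword search naturally brings regular expressions to mind: they are the basic tool for text matching, and Python ships the re module specifically for regex matching. Regular expressions alone, however, are not enough to build a good search feature.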
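To make that limitation concrete, here is a minimal sketch of keyword search done with re alone. The sample documents and the regex_search helper are made up for illustration and are not part of the original article. Every query rescans every document, and the result carries no ranking.

```python
import re

# Hypothetical sample documents, used only for this illustration.
DOCS = {
    "a.txt": "Python has a re module for regular expressions.",
    "b.txt": "Whoosh is a full-text search library.",
    "c.txt": "Regular expressions match patterns, not relevance.",
}


def regex_search(keyword, docs):
    """Return the names of documents whose text contains keyword (case-insensitive)."""
    # re.escape makes the keyword match literally, even if it contains regex syntax.
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [name for name, text in docs.items() if pattern.search(text)]


matches = regex_search("regular expressions", DOCS)
print(matches)
```

The scan is linear in the total size of the corpus for each query, and it only answers "does it match", not "how well does it match".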

Python has a package called whoosh that is built specifically for full-text search.
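The core idea behind a full-text search engine is an inverted index: a mapping from each term to the documents that contain it, built once and then consulted on every query. The toy version below is only meant to illustrate that idea; it is not how whoosh is implemented, and the build_index and lookup names are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,!?")].add(name)
    return index


def lookup(index, query):
    """Return the sorted names of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    # Intersect the posting sets; index.get avoids creating empty entries.
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Hypothetical sample documents, used only for this illustration.
DOCS = {
    "a.txt": "Whoosh is a pure Python search library.",
    "b.txt": "Sphinx is a search server.",
}
index = build_index(DOCS)
print(lookup(index, "search library"))
```

Splitting on whitespace works for English but not for Chinese, which has no spaces between words. That is why a Chinese word segmenter has to be plugged in as the tokenizer.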

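whoosh is not widely used in China, and its performance is not yet as mature as that of sphinx/coreseek. Unlike those two, however, it is a pure-Python library, which makes it more convenient for Python enthusiasts. The code is as follows.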

Installation

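On the command line, run pip install whoosh. The tokenizer below also calls jieba, which can be installed with pip install jieba.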

The following imports are needed:

import os

import jieba

from whoosh.index import create_in, open_dir

from whoosh.fields import *  # provides Schema, TEXT and ID

from whoosh.analysis import RegexAnalyzer

from whoosh.analysis import Tokenizer, Token

Chinese tokenizer

class ChineseTokenizer(Tokenizer):
  """
  Chinese tokenizer that uses jieba to segment the text.
  """
  def __call__(self, value, positions=False, chars=False,
         keeporiginal=True, removestops=True, start_pos=0, start_char=0,
         mode='', **kwargs):
    assert isinstance(value, str), "%r is not str" % value
    t = Token(positions, chars, removestops=removestops, mode=mode, **kwargs)
    # jieba.tokenize yields (word, start, end) in search mode, giving each
    # occurrence of a repeated word its own character offsets.
    for w, start, end in jieba.tokenize(value, mode='search'):
      t.original = t.text = w
      t.boost = 0.5
      if positions:
        t.pos = start_pos + start
      if chars:
        t.startchar = start_char + start
        t.endchar = start_char + end
      yield t


def chinese_analyzer():
  return ChineseTokenizer()
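Character offsets need some care when a word occurs more than once in the text, because str.find without a start argument always returns the first occurrence. The sketch below shows the bookkeeping with plain Python only. The token_offsets helper and the hand-written segmentation are assumptions for illustration, and the sketch assumes non-overlapping segments listed in document order, which is simpler than what search-mode segmentation produces.

```python
def token_offsets(text, words):
    """Yield (word, start, end) for words listed in the order they occur in text."""
    cursor = 0
    for word in words:
        # Search from the cursor, not from 0, so repeated words advance.
        start = text.find(word, cursor)
        if start == -1:
            raise ValueError(f"{word!r} not found after offset {cursor}")
        end = start + len(word)
        yield word, start, end
        cursor = end


text = "搜索引擎和搜索算法"
# Hand-written segmentation (an assumption; a real segmenter would supply this).
words = ["搜索", "引擎", "和", "搜索", "算法"]
offsets = list(token_offsets(text, words))
for item in offsets:
    print(item)
```

Here the second "搜索" is reported at offsets 5 to 7, whereas a plain text.find("搜索") would report 0 for both occurrences.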

Index-building function

def create_index(document_dir):
  analyzer = chinese_analyzer()
  schema = Schema(title=TEXT(stored=True, analyzer=analyzer), path=ID(stored=True),
          content=TEXT(stored=True, analyzer=analyzer))
  ix = create_in("./", schema)  # the index files are written to the current directory
  writer = ix.writer()
  for parent, dirnames, filenames in os.walk(document_dir):
    for filename in filenames:
      title = filename.replace(".txt", "")
      print(title)
      # Join with the directory being walked so files in subdirectories open correctly.
      file_path = os.path.join(parent, filename)
      with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
      writer.add_document(title=title, path=file_path, content=content)
  writer.commit()

Search function

def search(search_str):
  title_list = []
  ix = open_dir("./")
  with ix.searcher() as searcher:  # the searcher is closed when the block exits
    print(search_str, type(search_str))
    results = searcher.find("content", search_str)
    for hit in results:
      print(hit['title'])
      print(hit.score)
      print(hit.highlights("content", top=10))
      title_list.append(hit['title'])
  print(title_list)
  return title_list
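The highlights call in the search function returns an excerpt of the stored content with the matched terms marked up. To show what that kind of output involves, here is a rough stand-alone sketch of keyword highlighting using only the standard library. The highlight helper, its width and marker parameters, and the sample text are all hypothetical, and whoosh's own highlighter is more sophisticated than this.

```python
import re


def highlight(text, terms, width=10, marker="**"):
    """Return an excerpt around the first match, with every term wrapped in marker."""
    if not terms:
        return ""
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    first = pattern.search(text)
    if first is None:
        return ""
    # Keep `width` characters of context on each side of the first match.
    start = max(first.start() - width, 0)
    end = min(first.end() + width, len(text))
    excerpt = text[start:end]
    return pattern.sub(lambda m: f"{marker}{m.group(0)}{marker}", excerpt)


sample = "Whoosh is a pure Python library for full-text search."
print(highlight(sample, ["Python"], width=5))
```

The excerpt is cut by character count, so it can start or end in the middle of a word; a real highlighter would pick fragment boundaries more carefully.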

Thanks for reading. I hope this helps, and thank you all for supporting this site!

- Author -

lqh
