
Inserting Data into Elasticsearch with Python

Posted in Python on January 19, 2020

This article walks through how to insert data into Elasticsearch from Python, with example code throughout. It may serve as a reference for study or work.

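While building a crawler with Scrapy, I needed to store the scraped data in Elasticsearch (ES). I found two approaches online and got both working by following the examples, so I am recording them here: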

First I installed ES. The version is 5.6.1, a fairly old release.

Then use pip to install the Python package that matches the ES version:

pip install elasticsearch-dsl==5.1.0

Method 1:

The complete code of the pipelines.py module is as follows:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import chardet

class SinafinancespiderPipeline(object):
  def process_item(self, item, spider):
    return item


# Write to ES; enable the ExchangeratespiderESPipeline class in settings
# Requires pip install elasticsearch-dsl==5.1.0; the version must match the ES version
from elasticsearch_dsl import Date, Nested, Boolean, analyzer, Completion, Keyword, Text, Integer, DocType
from elasticsearch_dsl.connections import connections
# Register the default connection used by DocType.save()
connections.create_connection(hosts=['192.168.52.138'])

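class AticleType(DocType):
  page_from = Keyword()
  # the domain field raised an error
  domain = Keyword()
  cra_url = Keyword()
  spider = Keyword()
  cra_time = Keyword()
  page_release_time = Keyword()
  # the ik_max_word analyzer requires the IK analysis plugin on the ES server
  page_title = Text(analyzer="ik_max_word")
  page_content = Text(analyzer="ik_max_word")

  # Meta must be nested inside the document class to take effect
  class Meta:
    index = "scrapy"
    doc_type = "sinafinance"
    # The settings and mappings below had no effect; kept here for the record
    settings = {
      "number_of_shards": 3,
    }
    mappings = {
      '_id': {'path': 'cra_url'}
    }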


class ExchangeratespiderESPipeline(object):
  # A pipeline is a plain class, not a DocType. article.save() below uses the
  # default connection registered by connections.create_connection(), so no
  # separate Elasticsearch client is needed here.

  def process_item(self, item, spider):

    spider.logger.info("-----enter into insert ES")
    article = AticleType()

    article.page_from=item['page_from']
    article.domain=item['domain']
    article.cra_url =item['cra_url']
    article.spider =item['spider']
    article.cra_time =item['cra_time']
    article.page_release_time =item['page_release_time']
    article.page_title =item['page_title']
    article.page_content =item['page_content']

    article.save()
    return item

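The approach above does write data to ES, but crawling the same pages again inserts duplicate documents, because the primary key "_id" is generated by ES itself and I could not find where to set a custom _id. So I gave up on it.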
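
One way to avoid such duplicates is to give every document a deterministic id derived from the page URL, so that a re-crawl targets the same document. The sketch below is an illustration, not the original author's code: `make_doc_id` is a hypothetical helper, and hashing the URL with SHA-1 is my own assumption (it keeps the id short and fixed-length even for very long URLs).

```python
import hashlib


def make_doc_id(url):
    """Return a stable, fixed-length document id for a page URL (hypothetical helper)."""
    # Same URL -> same id, so indexing the page again targets the same document.
    return hashlib.sha1(url.strip().encode("utf-8")).hexdigest()


first = make_doc_id("https://example.com/news/1.html")
again = make_doc_id("https://example.com/news/1.html")
other = make_doc_id("https://example.com/news/2.html")
print(first == again, first == other)
```

As far as I know, elasticsearch-dsl documents also expose the id as `article.meta.id`, so assigning such a value before `article.save()` may be enough to make method 1 overwrite instead of duplicate. With the low-level client the value is passed as the `id` argument of `es.index`.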

Method 2: write with a custom primary key, so that re-inserting overwrites the existing document

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from elasticsearch5 import Elasticsearch

class SinafinancespiderPipeline(object):
  def process_item(self, item, spider):
    return item


# Write to ES; enable the SinafinancespiderESPipeline class in settings
# Uses the elasticsearch5 client package; its version must match the ES version
class SinafinancespiderESPipeline(object):
  def __init__(self):
    self.ES = ['192.168.52.138:9200']
    # Create the ES client
    self.es = Elasticsearch(
      self.ES,
      # Sniff the ES cluster nodes before starting
      sniff_on_start=True,
      # Refresh the node list when a connection to a node fails
      sniff_on_connection_fail=True,
      # Refresh the node list every 60 seconds
      sniffer_timeout=60
    )

  def process_item(self, item, spider):
    spider.logger.info("-----enter into insert ES")
    doc = {
      'page_from': item['page_from'],
      'domain': item['domain'],
      'spider': item['spider'],
      'page_release_time': item['page_release_time'],
      'page_title': item['page_title'],
      'page_content': item['page_content'],
      'cra_url': item['cra_url'],
      'cra_time': item['cra_time']
    }
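    # Using cra_url as the document id means a re-crawl overwrites the
    # existing document instead of adding a duplicate
    self.es.index(index='scrapy', doc_type='sinafinance', body=doc, id=item['cra_url'])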

    return item
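
Calling `es.index` once per item costs one HTTP request per document. If throughput becomes a concern, items can be buffered and sent in batches. The sketch below only builds the action dictionaries in the shape that the Python client's `helpers.bulk` accepts. `to_bulk_action` is a hypothetical helper, and the index and type names are taken from the pipeline above.

```python
def to_bulk_action(item, index="scrapy", doc_type="sinafinance"):
    """Turn one scraped item (a dict-like object) into a bulk 'index' action."""
    return {
        "_op_type": "index",      # index = create or overwrite
        "_index": index,
        "_type": doc_type,        # mapping types exist in ES 5.x
        "_id": item["cra_url"],   # same id on re-crawl, so the document is overwritten
        "_source": dict(item),
    }


items = [
    {"cra_url": "https://example.com/a", "page_title": "A"},
    {"cra_url": "https://example.com/b", "page_title": "B"},
]
actions = [to_bulk_action(item) for item in items]
print(actions[0]["_id"])

# Assumed usage with a client like self.es from the pipeline (not executed here):
#   from elasticsearch5 import helpers
#   helpers.bulk(self.es, actions)
```

Whether batching is worth the extra buffering logic depends on crawl volume. For a small spider the per-item call in method 2 is simpler.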

Searching the data:

# Set the body as a dictionary
# (es is an Elasticsearch client such as the one created in the pipeline above)
query = {
 'query': {
  'bool': {
   'must': [
    {'match': {'_all': 'python web'}}
   ],
   'filter': [
    {'term': {'status': 2}}
   ]
  }
 }
}

# Query the data
ret = es.search(index='articles', doc_type='article', body=query)
print(ret)
# Add
es.index(...)
# Update
es.update(...)
# Delete
es.delete(...)
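
`es.search` returns a nested dictionary, and the matching documents sit under `hits.hits`, each with its stored fields in `_source`. The sketch below pulls them out of a response. `sample_response` is hand-written to mimic the usual response shape and is not real output, and `extract_sources` is a hypothetical helper.

```python
def extract_sources(response):
    """Return (id, source) pairs for every hit in a search response dict."""
    hits = response.get("hits", {}).get("hits", [])
    return [(hit["_id"], hit["_source"]) for hit in hits]


# Hand-written stand-in for what es.search(...) returns
sample_response = {
    "took": 3,
    "hits": {
        "total": 1,
        "hits": [
            {"_index": "articles", "_type": "article", "_id": "1",
             "_score": 1.2, "_source": {"title": "python web", "status": 2}},
        ],
    },
}

for doc_id, source in extract_sources(sample_response):
    print(doc_id, source["title"])
```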

Once that is done, register the custom class in the settings.py module:

ITEM_PIPELINES = {
  # 'sinafinancespider.pipelines.SinafinancespiderPipeline': 300,
  'sinafinancespider.pipelines.SinafinancespiderESPipeline': 300,
}
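
The host list is hard-coded in the pipeline above. A common Scrapy pattern is to read such values from settings.py through the `from_crawler` class method. The sketch below assumes a setting named `ES_HOSTS`, which is my own invention, and a hypothetical class name. The client is created in `open_spider` so that importing the module does not require a reachable cluster.

```python
class ConfigurableESPipeline(object):
    """Sketch of a pipeline that takes its ES hosts from Scrapy settings."""

    def __init__(self, hosts):
        self.hosts = hosts
        self.es = None

    @classmethod
    def from_crawler(cls, crawler):
        # ES_HOSTS is a hypothetical setting name,
        # e.g. ES_HOSTS = ['192.168.52.138:9200'] in settings.py
        return cls(hosts=crawler.settings.getlist("ES_HOSTS", ["localhost:9200"]))

    def open_spider(self, spider):
        # Imported here so the class can be loaded without the client installed
        from elasticsearch5 import Elasticsearch
        self.es = Elasticsearch(self.hosts)

    def process_item(self, item, spider):
        # Same overwrite-by-id behaviour as method 2
        self.es.index(index="scrapy", doc_type="sinafinance",
                      body=dict(item), id=item["cra_url"])
        return item
```

It would be registered in ITEM_PIPELINES the same way as the classes above, under whatever module path it lives in.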


- Author -

cknds
