Distributed deployment of pyspider on CentOS 7


Posted in Python on May 03, 2017

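1. Setting up the environment:

System version: Linux centos-linux.shared 3.10.0-123.el7.x86_64 #1 SMP Mon Jun 30 12:09:22 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Python version: Python 3.5.1

1.1. Setting up the Python 3 environment:

After trying both approaches below, I settled on the Anaconda distribution.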

1.1.1. Compiling from source

# Install the build dependencies
yum install -y ncurses-devel openssl openssl-devel zlib-devel gcc make glibc-devel libffi-devel glibc-static glibc-utils sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel
# Download the Python source
wget https://www.python.org/ftp/python/3.5.1/Python-3.5.1.tgz
# Or use a mirror in China
wget http://mirrors.sohu.com/python/3.5.1/Python-3.5.1.tgz
mv Python-3.5.1.tgz /usr/local/src;cd /usr/local/src
# Unpack
tar -zxf Python-3.5.1.tgz;cd Python-3.5.1
# Configure, build and install
./configure --prefix=/usr/local/python3.5 --enable-shared
make && make install
# Create a symlink and register the shared library path
ln -s /usr/local/python3.5/bin/python3 /usr/bin/python3
echo "/usr/local/python3.5/lib" > /etc/ld.so.conf.d/python3.5.conf
ldconfig
# Verify python3
python3
# Python 3.5.1 (default, Oct 9 2016, 11:44:24)
# [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux
# Type "help", "copyright", "credits" or "license" for more information.
# >>>
# pip
/usr/local/python3.5/bin/pip3 install --upgrade pip
ln -s /usr/local/python3.5/bin/pip /usr/bin/pip
# I ran into a problem at this step and reinstalled pip
wget https://bootstrap.pypa.io/get-pip.py --no-check-certificate
python3 get-pip.py

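1.1.2. The Anaconda distribution

# The Anaconda distribution (recommended)
wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
# Run the installer
bash Anaconda3-4.2.0-Linux-x86_64.sh
# If it fails, the archive may not have unpacked; install bzip2 and retry
yum install bzip2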

1.2. Installing MariaDB

# Install
yum -y install mariadb mariadb-server
# Start
systemctl start mariadb
# Enable at boot
systemctl enable mariadb
# Set the root password (empty by default)
mysql_secure_installation
# Log in
mysql -u root -p
# Create a user; choose your own user name and password
CREATE USER 'user_name'@'localhost' IDENTIFIED BY 'user_pass';
GRANT ALL PRIVILEGES ON *.* TO 'user_name'@'localhost' WITH GRANT OPTION;
CREATE USER 'user_name'@'%' IDENTIFIED BY 'user_pass';
GRANT ALL PRIVILEGES ON *.* TO 'user_name'@'%' WITH GRANT OPTION;

1.3. Installing pyspider

I use Anaconda.

# Create a virtual environment named sbird with Python 3.*
conda create -n sbird python=3*
# Enter the environment
source activate sbird
# Install pyspider
pip install pyspider
# Error:
# it does not exist. The exported locale is "en_US.UTF-8" but it is not supported
# Run the following (it can also be added to .bashrc)
export LC_ALL=en_US.utf-8
export LANG=en_US.utf-8
# ImportError: pycurl: libcurl link-time version (7.29.0) is older than compile-time version (7.49.0)
conda install pycurl
# Leave the environment
source deactivate
# On a virtual machine, if localhost:5000 is unreachable, the firewall can be stopped
systemctl stop firewalld.service
# Alternatively, run directly from source
mkdir git;cd git
# Download
git clone https://github.com/binux/pyspider.git
# Run
/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py

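Alternative method

# Set up a virtual environment
pip install virtualenv
mkdir python;cd python
# Create the virtual environment pyenv3
virtualenv -p /usr/bin/python3 pyenv3
# Enter the directory and activate the environment
cd pyenv3/
source ./bin/activate
pip install pyspider
# If pycurl fails to build
yum install libcurl-devel
# Then retry
pip install pyspider
# Leave the environment
deactivate

I recommend the Anaconda route.

If pyspider raises errors while running, see the Anaconda section above. At this point, with pyspider running, visiting localhost:5000 should show the web UI.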
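
When the page at localhost:5000 does not load, it helps to know whether the port is reachable at all before suspecting pyspider itself. The following is a minimal sketch using only the Python standard library; the helper name `port_is_open` is my own and not part of pyspider.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection raises OSError on refusal, timeout or DNS failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_is_open("127.0.0.1", 5000)` run on the server tells you whether the web UI is listening, while running it from another machine against the server's IP also exercises the firewall.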

1.4. Installing Supervisor

# Install
yum install supervisor -y
# If the package cannot be found, add the Aliyun EPEL repository
vim /etc/yum.repos.d/epel.repo
# Add the following content
[epel]
name=Extra Packages for Enterprise Linux 7 - $basearch
baseurl=http://mirrors.aliyun.com/epel/7/$basearch
        http://mirrors.aliyuncs.com/epel/7/$basearch
failovermethod=priority
enabled=1
gpgcheck=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7

[epel-debuginfo]
name=Extra Packages for Enterprise Linux 7 - $basearch - Debug
baseurl=http://mirrors.aliyun.com/epel/7/$basearch/debug
        http://mirrors.aliyuncs.com/epel/7/$basearch/debug
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=0

[epel-source]
name=Extra Packages for Enterprise Linux 7 - $basearch - Source
baseurl=http://mirrors.aliyun.com/epel/7/SRPMS
        http://mirrors.aliyuncs.com/epel/7/SRPMS
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=0
# Install
yum install supervisor -y
# Check that the installation succeeded
echo_supervisord_conf

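1.4.1. Using Supervisor

supervisord    # start the Supervisor server (daemon)
supervisorctl  # open the Supervisor command-line client
# Create the log directory used in the configuration below; Supervisor does not create it
mkdir -p /pyspider/supervisor
# Suppose we create a process named pyspider01
vim /etc/supervisord.d/pyspider01.ini
# Write the following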
[program:pyspider01]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/pyspider01.log
# Reload
supervisorctl reload
# Start
supervisorctl start pyspider01
# The daemon can also be started like this
supervisord -c /etc/supervisord.conf
# Check status
supervisorctl status
# output
pyspider01            RUNNING  pid 4026, uptime 0:02:40
# Shut down supervisord and the processes it manages
supervisorctl shutdown

1.5. Installing Redis

# Redis serves as the message queue
mkdir download;cd download
wget http://download.redis.io/releases/redis-3.2.4.tar.gz
tar xzf redis-3.2.4.tar.gz
cd redis-3.2.4
make
# Or install directly with yum
yum -y install redis
# Start
systemctl start redis.service
# Restart
systemctl restart redis.service
# Stop
systemctl stop redis.service
# Check status
systemctl status redis.service
# Edit /etc/redis.conf
vim /etc/redis.conf
# Changes to make
daemonize no  ->  daemonize yes
bind 127.0.0.1  ->  bind 10.211.55.22 (this server's IP)
# Restart Redis
systemctl restart redis.service
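
After changing the bind address, the other servers must be able to reach Redis over the network. The sketch below checks that without installing any client library: it sends Redis's inline `PING` command over a plain socket and looks for the `+PONG` reply. The function name `redis_ping` is hypothetical, and the sketch assumes Redis has no password set (a server that requires authentication answers with an error, which this helper reports as False).

```python
import socket


def redis_ping(host, port=6379, timeout=2.0):
    """Send an inline PING and return True if the server answers +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            reply = conn.recv(64)
    except OSError:
        # Connection refused, timed out, or reset
        return False
    return reply.startswith(b"+PONG")
```

Running `redis_ping("10.211.55.22")` from one of the other servers confirms both the bind change and that no firewall is in the way.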

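1.6. Starting services at boot

# Enable Supervisor at boot
systemctl enable supervisord.service
# Enable Redis at boot
systemctl enable redis.service
# Keep the firewall from starting at boot
systemctl disable firewalld.service

At this point the single-server pyspider environment is set up and deployed; open localhost:5000 to reach the web UI.

You can also write scripts and run them, and check the run status in /pyspider/supervisor/pyspider01.log.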

2. Distributed deployment

Call the server configured above centos01, then set up two more servers, centos02 and centos03, in the same way.

The layout is as follows:

Server name   IP             Roles

centos01      10.211.55.22   redis, mariaDB, scheduler
centos02      10.211.55.23   fetcher, processor, result_worker, phantomjs
centos03      10.211.55.24   fetcher, processor, result_worker, webui
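
The layout above can also be written down as data and sanity-checked before any server is touched. This is only an illustrative sketch: the names `LAYOUT`, `hosts_for` and `check_layout` are mine, and the one rule encoded beyond "every component runs somewhere" is that pyspider expects a single scheduler.

```python
# Host layout copied from the table above.
LAYOUT = {
    "centos01": {"ip": "10.211.55.22",
                 "roles": ["redis", "mariadb", "scheduler"]},
    "centos02": {"ip": "10.211.55.23",
                 "roles": ["fetcher", "processor", "result_worker", "phantomjs"]},
    "centos03": {"ip": "10.211.55.24",
                 "roles": ["fetcher", "processor", "result_worker", "webui"]},
}

# pyspider components that must run on at least one host.
REQUIRED = ("scheduler", "fetcher", "processor", "result_worker", "webui")


def hosts_for(layout, role):
    """Return the sorted names of the hosts that run the given role."""
    return sorted(name for name, info in layout.items() if role in info["roles"])


def check_layout(layout):
    """Return a list of human-readable problems; empty means the layout is fine."""
    problems = []
    for role in REQUIRED:
        count = len(hosts_for(layout, role))
        if count == 0:
            problems.append("no host runs %s" % role)
        elif role == "scheduler" and count > 1:
            problems.append(
                "scheduler runs on %d hosts, expected exactly one" % count)
    return problems
```

Fetcher, processor and result_worker may appear on several hosts, which is exactly how centos02 and centos03 share the crawling load.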

2.1. centos01

Log in to centos01. After step 1 the basic environment is ready. Start by editing the configuration file /pyspider/config.json:

{
 "taskdb": "mysql+taskdb://user_name:user_pass@10.211.55.22:3306/taskdb",
 "projectdb": "mysql+projectdb://user_name:user_pass@10.211.55.22:3306/projectdb",
 "resultdb": "mysql+resultdb://user_name:user_pass@10.211.55.22:3306/resultdb",
 "message_queue": "redis://10.211.55.22:6379/db",
 "logging-config": "/pyspider/logging.conf",
 "phantomjs-proxy":"10.211.55.23:25555",
 "webui": {
  "username": "",
  "password": "",
  "need-auth": false,
  "host":"10.211.55.24",
  "port":"5000",
  "scheduler-rpc":"http://10.211.55.22:5002",
  "fetcher-rpc":"http://10.211.55.23:5001"
 },
 "fetcher": {
  "xmlrpc":true,
  "xmlrpc-host": "0.0.0.0",
  "xmlrpc-port": "5001"
 },
 "scheduler": {
  "xmlrpc":true,
  "xmlrpc-host": "0.0.0.0",
  "xmlrpc-port": "5002"
 }
}
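
Because the same file is used on all three servers and every address in it is typed by hand, a small generator plus a URL check can catch slips such as a stray space inside `http:// host`. The sketch below rebuilds the structure shown above from a few parameters; `build_config` and `find_bad_urls` are hypothetical helper names, and the assumption that MariaDB and Redis share the scheduler's host simply mirrors this article's layout.

```python
import json
from urllib.parse import urlsplit


def build_config(db_user, db_pass, scheduler_ip, fetcher_ip, webui_ip):
    """Build the config.json structure from this article as a Python dict."""
    def db_url(name):
        # MariaDB is assumed to live on the scheduler host, as in this article.
        return "mysql+%s://%s:%s@%s:3306/%s" % (
            name, db_user, db_pass, scheduler_ip, name)

    return {
        "taskdb": db_url("taskdb"),
        "projectdb": db_url("projectdb"),
        "resultdb": db_url("resultdb"),
        "message_queue": "redis://%s:6379/db" % scheduler_ip,
        "logging-config": "/pyspider/logging.conf",
        "phantomjs-proxy": "%s:25555" % fetcher_ip,
        "webui": {
            "username": "",
            "password": "",
            "need-auth": False,
            "host": webui_ip,
            "port": "5000",
            "scheduler-rpc": "http://%s:5002" % scheduler_ip,
            "fetcher-rpc": "http://%s:5001" % fetcher_ip,
        },
        "fetcher": {"xmlrpc": True, "xmlrpc-host": "0.0.0.0", "xmlrpc-port": "5001"},
        "scheduler": {"xmlrpc": True, "xmlrpc-host": "0.0.0.0", "xmlrpc-port": "5002"},
    }


def find_bad_urls(config):
    """Return dotted keys whose URL value contains whitespace or has no host."""
    bad = []

    def visit(path, value):
        if isinstance(value, dict):
            for key, child in value.items():
                visit(path + [key], child)
        elif isinstance(value, str) and "://" in value:
            has_space = any(ch.isspace() for ch in value)
            if has_space or not urlsplit(value).hostname:
                bad.append(".".join(path))

    visit([], config)
    return bad
```

Writing the result out is then a matter of `json.dump(build_config(...), open("/pyspider/config.json", "w"), indent=1)`, after confirming that `find_bad_urls` returns an empty list.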

Try running it:

/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
# Error
ImportError: No module named 'mysql'
# Download mysql-connector-python
cd ~/git/
git clone https://github.com/mysql/mysql-connector-python.git
# Install
source activate sbird
cd mysql-connector-python
python setup.py install
# Install the redis client library for Python
pip install redis
source deactivate
# Run
/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
# Output: OK
[I 161010 15:57:25 scheduler:644] scheduler starting...
[I 161010 15:57:25 scheduler:779] scheduler.xmlrpc listening on 0.0.0.0:5002
[I 161010 15:57:25 scheduler:583] in 5m: new:0,success:0,retry:0,failed:0

Once it runs successfully, change /etc/supervisord.d/pyspider01.ini as follows:

[program:pyspider01]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/pyspider01.log
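# Reload
supervisorctl reload
# Check status
supervisorctl status

centos01 is now deployed.

2.2. centos02

centos02 runs result_worker, processor, phantomjs and fetcher. The commands below expect the /pyspider/config.json from 2.1 to be present on this server as well.

Create one file for each process:

/etc/supervisord.d/result_worker.ini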

[program:result_worker]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json result_worker
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/result_worker.log

/etc/supervisord.d/processor.ini

[program:processor]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json processor
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/processor.log

/etc/supervisord.d/phantomjs.ini

[program:phantomjs]

command   = /pyspider/phantomjs --config=/pyspider/pjsconfig.json /pyspider/phantomjs_fetcher.js 25555
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/phantomjs.log

/etc/supervisord.d/fetcher.ini

[program:fetcher]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json fetcher
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/fetcher.log
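
The result_worker, processor and fetcher files differ only in the component name, so they can be generated rather than typed. The following is a rough sketch under that assumption: the paths are the ones used in this article, while `render_program` and `write_programs` are hypothetical helpers. The phantomjs entry has a different command line and is left to be written by hand.

```python
import os

# Paths as used throughout this article.
PYTHON = "/root/anaconda3/envs/sbird/bin/python"
RUN_PY = "/root/git/pyspider/run.py"
CONFIG = "/pyspider/config.json"
LOG_DIR = "/pyspider/supervisor"

TEMPLATE = """[program:{name}]
command = {python} {run_py} -c {config} {name}
directory = /root/git/pyspider
user = root
process_name = %(program_name)s
autostart = true
autorestart = true
startsecs = 3
redirect_stderr = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile = {log_dir}/{name}.log
"""


def render_program(name):
    """Return the Supervisor ini text for one pyspider component."""
    return TEMPLATE.format(name=name, python=PYTHON, run_py=RUN_PY,
                           config=CONFIG, log_dir=LOG_DIR)


def write_programs(names, target_dir):
    """Write <name>.ini for every component and return the file paths."""
    paths = []
    for name in names:
        path = os.path.join(target_dir, name + ".ini")
        with open(path, "w") as handle:
            handle.write(render_program(name))
        paths.append(path)
    return paths
```

On centos02 this would be called as `write_programs(["result_worker", "processor", "fetcher"], "/etc/supervisord.d")`, followed by the usual `supervisorctl reload`.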

Create pjsconfig.json in the /pyspider directory:

{
 /* Same as: --ignore-ssl-errors=true */
 "ignoreSslErrors": true,

 /* Same as: --ssl-protocol=any */
 "sslProtocol": "any",

 /* Same as: --output-encoding=utf8 */
 "outputEncoding": "utf8",

 /* Persistent cookies */
 /* "cookiesFile": "e:/phontjscookies.txt", */
 "cookiesFile": "pyspider/phontjscookies.txt",

 /* Image loading (disabled) */
 "autoLoadImages": false
}

Download phantomjs into the /pyspider/ directory, and copy git/pyspider/pyspider/fetcher/phantomjs_fetcher.js to /pyspider/phantomjs_fetcher.js.

# Reload
supervisorctl reload
# Check status
supervisorctl status
# output
fetcher             RUNNING  pid 3446, uptime 0:00:07
phantomjs            RUNNING  pid 3448, uptime 0:00:07
processor            RUNNING  pid 3447, uptime 0:00:07
result_worker          RUNNING  pid 3445, uptime 0:00:07

centos02 is now deployed.

2.3. centos03

The fetcher, processor and result_worker processes are deployed exactly as on centos02; on top of that, this server runs webui.

Create the file:

/etc/supervisord.d/webui.ini

[program:webui]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json webui
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/webui.log
# Reload
supervisorctl reload
# Check status
supervisorctl status
# output
fetcher             RUNNING  pid 2724, uptime 0:00:07
processor            RUNNING  pid 2725, uptime 0:00:07
result_worker          RUNNING  pid 2723, uptime 0:00:07
webui              RUNNING  pid 2726, uptime 0:00:07
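
With three servers to watch, reading the status listing by eye gets tedious. As a small sketch, the text printed by `supervisorctl status` can be scanned for any program that is not in the RUNNING state; the function name `not_running` is my own, and the parsing assumes the two-column "name, state" layout shown in the output above.

```python
def not_running(status_text):
    """Return the names of programs whose state is anything but RUNNING."""
    bad = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, state = fields[0], fields[1]
        if state != "RUNNING":
            bad.append(name)
    return bad


SAMPLE = """\
fetcher             RUNNING  pid 2724, uptime 0:00:07
processor            RUNNING  pid 2725, uptime 0:00:07
result_worker          RUNNING  pid 2723, uptime 0:00:07
webui              RUNNING  pid 2726, uptime 0:00:07
"""

print(not_running(SAMPLE))
```

In practice the text would come from running `supervisorctl status` through Python's `subprocess` module on each server; any name in the returned list (for example one stuck in FATAL or BACKOFF) points to the log file to inspect under /pyspider/supervisor/.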

3. Summary

Visit http://10.211.55.24:5000 and crawl to your heart's content.
