Processing Python Crawler Data (Detailed Guide)

Posted in Python on June 10, 2017

1. First, understand the following functions

Setting variables, length(), char_length(), replace(), max()

1.1. Setting a variable: set @variable_name = value

set @address='中国-山东省-聊城市-莘县';
select @address

1.2. The difference between length() and char_length()

select length('a')
,char_length('a')
,length('中')
,char_length('中')
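
The query above returns different numbers for '中' because length() counts bytes while char_length() counts characters. As a minimal sketch of the same distinction in Python, assuming the MySQL column uses a UTF-8 character set (the helper names byte_length and char_length are illustrative, not part of the original article):

```python
def byte_length(text: str, encoding: str = "utf-8") -> int:
    """Mirror MySQL length(): the number of bytes in the encoded string."""
    return len(text.encode(encoding))


def char_length(text: str) -> int:
    """Mirror MySQL char_length(): the number of characters."""
    return len(text)


# 'a' is 1 byte and 1 character; '中' is 3 bytes in UTF-8 but 1 character.
for sample in ("a", "中"):
    print(sample, byte_length(sample), char_length(sample))
```

With a different character set (for example GBK) the byte count for '中' would differ, which is why the encoding is an explicit parameter here.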

1.3. Combining replace() and length()

set @address='中国-山东省-聊城市-莘县';
select @address
,replace(@address,'-','') as address_1
,length(@address) as len_add1
,length(replace(@address,'-','')) as len_add2
,length(@address)-length(replace(@address,'-','')) as _count

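When an ETL job cleans a field that has an obvious delimiter, how do you decide how many split fields the new table needs?

Compute the largest number of - characters found in com_industry. That maximum plus 1 is the number of fields the value can be split into. For this table the maximum is 3, so the field can be split into 4 industry fields, that is, 4 industry levels.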

select max(length(com_industry)-length(replace(com_industry,'-',''))) as _max_count
from etl1_socom_data
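
The same "count the delimiters, take the maximum, add one" rule can be sketched in Python. This is only an illustration: the function name max_split_fields and the sample values are made up, and None stands in for SQL NULL.

```python
def max_split_fields(values, delimiter="-"):
    """Return the largest delimiter count plus one, skipping None (SQL NULL).

    Returns 0 when there is nothing to measure; MySQL's max() would return
    NULL in that case, so this is a simplification.
    """
    counts = [value.count(delimiter) for value in values if value is not None]
    return max(counts) + 1 if counts else 0


# Hypothetical sample values; the real data lives in etl1_socom_data.com_industry.
samples = ["level1-level2-level3-level4", "level1-level2", "", None]
print(max_split_fields(samples))  # 3 delimiters at most, so 4 fields
```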

1.4. Using the substring_index() string extraction function

set @address='中国-山东省-聊城市-莘县';
select 
substring_index(@address,'-',1) as china,
substring_index(substring_index(@address,'-',2),'-',-1) as province,
substring_index(substring_index(@address,'-',3),'-',-1) as city,
substring_index(@address,'-',-1) as district
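
To make the nested calls above easier to follow, here is a rough Python equivalent of substring_index(). It is a sketch that covers the positive and negative counts used in this article; an empty delimiter is not handled (str.split raises ValueError for it).

```python
def substring_index(text: str, delimiter: str, count: int) -> str:
    """Approximate MySQL substring_index(text, delimiter, count).

    count > 0: everything to the left of the count-th delimiter from the left.
    count < 0: everything to the right of the |count|-th delimiter from the right.
    If there are fewer delimiters than |count|, the whole string is returned.
    """
    if count == 0:
        return ""
    parts = text.split(delimiter)
    if count > 0:
        return delimiter.join(parts[:count])
    return delimiter.join(parts[count:])


address = "中国-山东省-聊城市-莘县"
china = substring_index(address, "-", 1)
province = substring_index(substring_index(address, "-", 2), "-", -1)
city = substring_index(substring_index(address, "-", 3), "-", -1)
district = substring_index(address, "-", -1)
print(china, province, city, district)
```

The inner call keeps the first N segments and the outer call with -1 keeps only the last of them, which is how the middle segments are picked out.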

1.5. The conditional expression case when

case when condition then value when condition then value else value end as field_name

select case when 89>101 then 'greater than' else 'less than' end as betl1_socom_data

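2. Kettle transformation: etl1 cleaning

First create the table; the steps are shown in the video.

The field indexes were created without specifying an index algorithm. BTREE is recommended to improve query efficiency.

2.1. Kettle file name: trans_etl1_socom_data

2.2. Steps used: Table input >>> Table output

2.3. Data flow: s_socom_data >>>> etl1_socom_data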

Figure: screenshot of Kettle transformation 1

2.4. Table input: SQL script for the initial cleaning of the com_district and com_industry fields

select a.*,
case when com_district like '%业' or com_district like '%织' or com_district like '%育' then null else com_district end as com_district1
,case when com_district like '%业' or com_district like '%织' or com_district like '%育' then concat(com_district,'-',com_industry) else com_industry end as com_industry_total
,replace(com_addr,'地 址：','') as com_addr1
,replace(com_phone,'电 话：','') as com_phone1
,replace(com_fax,'传 真：','') as com_fax1
,replace(com_mobile,'手机：','') as com_mobile1
,replace(com_url,'网址：','') as com_url1
,replace(com_email,'邮箱：','') as com_email1
,replace(com_contactor,'联系人：','') as com_contactor1
,replace(com_emploies_nums,'公司人数：','') as com_emploies_nums1
,replace(com_reg_capital,'注册资金：万','') as com_reg_capital1
,replace(com_type,'经济类型：','') as com_type1
,replace(com_product,'公司产品：','') as com_product1
,replace(com_desc,'公司简介：','') as com_desc1
from s_socom_data as a
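
The two case expressions in this script handle rows where an industry name ended up in com_district: a value ending in 业, 织 or 育 is treated as an industry, so the district becomes null and the value is prepended to com_industry. The following Python sketch mirrors that rule and the label stripping done by replace(). The function names and the sample values are hypothetical; None stands in for SQL NULL.

```python
INDUSTRY_SUFFIXES = ("业", "织", "育")


def fix_district_industry(district, industry):
    """Return (com_district1, com_industry_total) following the SQL case logic."""
    if district is not None and district.endswith(INDUSTRY_SUFFIXES):
        # concat() returns NULL in MySQL when any argument is NULL.
        merged = None if industry is None else f"{district}-{industry}"
        return None, merged
    return district, industry


def strip_label(value, label):
    """Mirror replace(column, label, ''); None (SQL NULL) stays None."""
    return None if value is None else value.replace(label, "")


# Made-up rows for illustration only.
print(fix_district_industry("制造业", "纺织业"))   # district was really an industry
print(fix_district_industry("山东省", "制造业"))   # district is left untouched
print(strip_label("电 话：0000-0000000", "电 话："))
```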

2.5. Table output

Figure: table output settings to watch out for

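Notes:

① For incremental crawler loads, do not check the Truncate table option.

② Database connection: select the database that contains the table used by the table output step.

③ Field mapping: make sure the fields in the data stream match the fields of the physical table, both in number and in correspondence.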

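3. Kettle transformation: etl2 cleaning

First create the table, which adds 4 fields; the steps are demonstrated in the video.

The field indexes were created without specifying an index algorithm. BTREE is recommended to improve query efficiency.

This transformation mainly splits and cleans the new com_industry field produced by etl1.

3.1. Kettle file name: trans_etl2_socom_data

3.2. Steps used: Table input >>> Table output

3.3. Data flow: etl1_socom_data >>>> etl2_socom_data

Notes:

① For incremental crawler loads, do not check the Truncate table option.

② Database connection: select the database that contains the table used by the table output step.

③ Field mapping: make sure the fields in the data stream match the fields of the physical table, both in number and in correspondence.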

Figure: screenshot of Kettle transformation 2

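3.4. SQL script that splits com_industry and completes the cleaning of all fields. Because of time constraints the registered capital field was not broken down in detail; adjust the code as needed.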

select a.*,
case 
# an industry value of '' is set to null
when length(com_industry)=0 then null
# otherwise take the part before the first - delimiter
else substring_index(com_industry,'-',1) end as com_industry1,
case 
when length(com_industry)-length(replace(com_industry,'-',''))=0 then null
# for values such as '交通运输、仓储和邮政业-' (trailing delimiter), industry 2 is also set to null
when length(com_industry)-length(replace(com_industry,'-',''))=1 and length(substring_index(com_industry,'-',-1))=0 then null
when length(com_industry)-length(replace(com_industry,'-',''))=1 then substring_index(com_industry,'-',-1)
else substring_index(substring_index(com_industry,'-',2),'-',-1)
end as com_industry2,
case 
when length(com_industry)-length(replace(com_industry,'-',''))<=1 then null
when length(com_industry)-length(replace(com_industry,'-',''))=2 then substring_index(com_industry,'-',-1)
else substring_index(substring_index(com_industry,'-',3),'-',-1)
end as com_industry3,
case 
when length(com_industry)-length(replace(com_industry,'-',''))<=2 then null
else substring_index(com_industry,'-',-1)
end as com_industry4
from etl1_socom_data as a
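
The four case expressions above can be read as one function that maps a com_industry string to four levels. The Python sketch below follows the same branches, including the trailing-delimiter case; it is an illustration with a hypothetical function name and placeholder values, and None stands in for SQL NULL.

```python
def split_industry(value):
    """Return (com_industry1, com_industry2, com_industry3, com_industry4)."""
    if value is None:
        return (None, None, None, None)

    parts = value.split("-")
    dash_count = len(parts) - 1  # same as length(x) - length(replace(x, '-', ''))

    level1 = None if value == "" else parts[0]

    if dash_count == 0:
        level2 = None
    elif dash_count == 1:
        # A value like 'name-' has an empty last segment, which becomes null.
        level2 = parts[-1] if parts[-1] != "" else None
    else:
        level2 = parts[1]

    level3 = None if dash_count <= 1 else parts[2]

    # With more than 3 delimiters the SQL keeps only the last segment here.
    level4 = None if dash_count <= 2 else parts[-1]

    return (level1, level2, level3, level4)


# Placeholder values for illustration.
print(split_industry("a-b-c-d"))
print(split_industry("a-"))
print(split_industry(""))
```

For a value with exactly two delimiters, parts[2] is also the last segment, so it matches the substring_index(com_industry,'-',-1) branch in the SQL.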

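4. Quality check of the cleaning results

4.1. Check whether the crawler source data matches the website data

If you handle both crawling and data processing yourself, this was already verified during crawling and the step can be skipped. If you receive data from an upstream crawler colleague, check this first; otherwise the cleaning is wasted effort. It is common to ask the crawler colleague to store the requested URL so that data quality can be checked later during processing.

4.2. Count the rows in the crawler source table and in each ETL cleaning table

Note: the SQL scripts do no aggregation or filtering, so the 3 tables should have the same row count.

4.2.1. SQL query. In my case the tables below are in the same database; if they are not, prefix each table name after from with the name of its database.

Not recommended when the data volume is large.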

select count(1) from s_socom_data
union all
select count(1) from etl1_socom_data
union all
select count(1) from etl2_socom_data
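
The comparison can also be automated. The sketch below runs a labelled version of the query and checks that all counts agree. To stay self-contained it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the MySQL server, and the demo tables contain only a placeholder id column; both are assumptions made for illustration.

```python
import sqlite3

TABLES = ("s_socom_data", "etl1_socom_data", "etl2_socom_data")


def row_counts(conn, tables=TABLES):
    """Return {table_name: row_count} using one union all query."""
    # Table names come from the trusted constant above, not from user input.
    query = " union all ".join(
        f"select '{name}' as table_name, count(1) as row_count from {name}"
        for name in tables
    )
    return dict(conn.execute(query).fetchall())


def build_demo_db(rows=3):
    """Create an in-memory database with the three tables and equal row counts."""
    conn = sqlite3.connect(":memory:")
    for name in TABLES:
        conn.execute(f"create table {name} (id integer primary key)")
        conn.executemany(
            f"insert into {name} (id) values (?)", [(i,) for i in range(rows)]
        )
    return conn


counts = row_counts(build_demo_db())
print(counts)
print("counts match:", len(set(counts.values())) == 1)
```

Labelling each row with its table name avoids relying on the order in which union all returns the three counts.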

4.2.2. Compare the table output totals reported after the Kettle transformations finish

Figure: total row count reported by the Kettle table output

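4.3. Review the ETL cleaning quality

Once the two steps above are confirmed correct, start the self-check of the ETL cleaning that data processing is responsible for: write scripts that check the fields cleaned from the data source. For the socom website, the district and industry fields were cleaned, and the other fields only had redundant label text replaced, so script-based checks are used.

Find the page_url and verify the row against the website data.

Writing the where clause this way makes it easy to inspect how a particular field was cleaned.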

select * 
from etl2_socom_data 
where com_district is null and length(com_industry)-length(replace(com_industry,'-',''))=3

Compare the data on the page http://www.socom.cn/company/7320798.html with the final cleaned data in the etl2_socom_data table.

Figure: website page data

Figure: etl2_socom_data table data

The cleaning work is complete.


- Author -

jingxian
