Converting JSON data to a DataFrame in Python with pandas


Posted in Python on June 22, 2018

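This article walks through an example of converting JSON-style data to a DataFrame in Python with pandas. It is shared here for reference; the details are as follows: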

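#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

# json_normalize moved to the top-level pandas namespace in pandas 1.0;
# fall back to the old location on earlier versions.
try:
  from pandas import json_normalize
except ImportError:
  from pandas.io.json import json_normalize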
class image_structs():
  """Container for one product image record."""
  def __init__(self):
    self.picture_url = {
      "image_id": '',
      "picture_url": ''
    }

class data_structs():
  """Container for one product; info["picture_url"] holds a list of image records."""
  def __init__(self):
    # Keys: title, item_url, id, picture_url, std_desc, description, information, fitment
    self.info = {
      "title": '',
      "item_url": '',
      "id": 0,
      "picture_url": [],
      "std_desc": '',
      "description": '',
      "information": '',
      "fitment": ''
    }
# "https://waldoch.com/store/catalogsearch/result/index/?cat=0&limit=200&p=1&q=nerf+bar"
# https://waldoch.com/store/new-oem-ford-f-150-f150-5-running-boards-nerf-bar-crew-cab-2015-w-brackets-fl34-16451-ge5fm6.html
def get_item_list(outfile):
  """Collect product titles and URLs from the search result pages and save them to Excel."""
  result = []
  for page in range(1, 7):  # result pages 1 to 6
    print(page)
    url = "https://waldoch.com/store/catalogsearch/result/index/?cat=0&limit=200&p="+str(page)+"&q=nerf+bar"
    web = requests.get(url)
    soup = BeautifulSoup(web.text,"html.parser")
    alink = soup.find_all("a",class_="product-image")
    for a in alink:
      title = a["title"]
      item_url = a["href"]
      result.append([title,item_url])
  df = pd.DataFrame(result,columns=["title","item_url"])
  df = df.drop_duplicates()
  df["id"] = df.index  # the row index doubles as the item id
  df.to_excel(outfile,index=False)
DEFAULT_FALSE = ""  # placeholder for sections missing from a page

def section_html(soup, heading):
  """Return the HTML of the element following the <h2> titled `heading`."""
  h2 = soup.find("h2", string=heading)  # string= replaces the deprecated text=
  block = h2.find_next() if h2 is not None else None
  return str(block) if block is not None else DEFAULT_FALSE

def get_item_info(file,outfile):
  """Scrape every product page listed in `file` and flatten each one into a DataFrame."""
  df = pd.read_excel(file)
  frames = []
  for i in df.index:
    item_id = int(df.loc[i,"id"])  # avoid shadowing the built-in id()
    item_file = str(item_id)+".xlsx"
    if os.path.exists(item_file):
      # already scraped on an earlier run: reuse the saved result
      frames.append(pd.read_excel(item_file))
      continue
    item_url = df.loc[i,"item_url"]
    web = requests.get(item_url)
    soup = BeautifulSoup(web.text, "html.parser")
    data = data_structs()
    data.info["title"] = df.loc[i,"title"]
    data.info["id"] = item_id
    data.info["item_url"] = item_url
    # images
    imglink = soup.find_all("img", class_=re.compile("^gallery-image"))
    for a in imglink:
      image = image_structs()
      image.picture_url["image_id"] = a.get("id", "")
      image.picture_url["picture_url"] = a.get("src", "")
      data.info["picture_url"].append(image.picture_url)
    # std_desc: all text inside the itemprop="description" block
    std_desc = soup.find("div", itemprop="description")
    if std_desc is not None:
      data.info["std_desc"] = "\n".join(std_desc.stripped_strings)
    else:
      data.info["std_desc"] = DEFAULT_FALSE
    # description / information / fitment: the element after each <h2> heading
    data.info["description"] = section_html(soup, "Description")
    data.info["information"] = section_html(soup, "Information")
    data.info["fitment"] = section_html(soup, "Fitment")
    # One row per image; the remaining fields are repeated on every row.
    singledf = json_normalize(data.info,"picture_url",['title', 'item_url', 'id', 'std_desc', 'description', 'information', 'fitment'])
    singledf.to_excel(item_file,index=False)
    frames.append(singledf)
  if frames:
    pd.concat(frames, ignore_index=True).to_excel(outfile,index=False)

if __name__ == "__main__":
  # get_item_list("item_urls.xlsx")
  get_item_info("item_urls.xlsx","item_urls_info.xlsx")
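
The key step in the script is the json_normalize call, which is easier to see in isolation. The following is a minimal sketch, not part of the original script: the sample JSON text, its values, and the example.com URLs are made up for illustration, and only the dict layout mirrors data.info above.

```python
import json

try:
    from pandas import json_normalize          # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas

# Hypothetical sample data, shaped like data.info in the script above.
raw = '''
{
  "title": "Sample nerf bar",
  "item_url": "https://example.com/item-1.html",
  "id": 1,
  "picture_url": [
    {"image_id": "img-1", "picture_url": "https://example.com/1.jpg"},
    {"image_id": "img-2", "picture_url": "https://example.com/2.jpg"}
  ]
}
'''

record = json.loads(raw)  # JSON text -> nested dict

flat = json_normalize(
    record,
    record_path="picture_url",         # one output row per element of this list
    meta=["title", "item_url", "id"],  # top-level fields copied onto every row
)
print(flat)
```

Each element of the nested picture_url list becomes one row, and the fields listed in meta are repeated on every row, which is why a product with several images ends up as several rows in the Excel output.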

The third-party Python modules used here can all be installed with the pip install command, for example:

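pip install pandas requests
pip install beautifulsoup4
pip install xlrd
pip install openpyxl

Recent pandas versions read and write .xlsx files through openpyxl; xlrd is only needed for this on older versions.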
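
Before running the script, it can help to confirm that these modules are importable. The following is a small sketch using only the standard library; the helper name missing_modules is hypothetical, and the list of import names is an assumption based on what the script above imports.

```python
import importlib.util

def missing_modules(names):
    """Return the import names from `names` that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Import names can differ from pip package names: beautifulsoup4 installs bs4.
required = ["pandas", "requests", "bs4", "openpyxl"]
print(missing_modules(required))
```

An empty list means everything is installed; any name that is printed still needs a pip install.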