Converting JSON to a DataFrame in Python with pandas


Posted in Python on June 22, 2018

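This article shows, by example, how to convert JSON-format data into a DataFrame in Python using pandas. It is shared here for reference; the details are as follows: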

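#!/usr/bin/env python3
# -*- coding:utf-8 -*-
import re
import json
from bs4 import BeautifulSoup
import pandas as pd
import requests
import os
try:
  from pandas import json_normalize  # pandas 1.0 and later
except ImportError:
  from pandas.io.json import json_normalize  # older pandas releases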
class image_structs():
  def __init__(self):
    self.picture_url = {
      "image_id": '',
      "picture_url": ''
    }
class data_structs():
  def __init__(self):
    # Keys match the output columns: title, item_url, id, picture_url,
    # std_desc, description, information, fitment
    self.info={
      "title":'',
      "item_url":'',
      "id":0,
      "picture_url":[],
      "std_desc":'',
      "description":'',
      "information":'',
      "fitment":''
    }
# "https://waldoch.com/store/catalogsearch/result/index/?cat=0&limit=200&p=1&q=nerf+bar"
# https://waldoch.com/store/new-oem-ford-f-150-f150-5-running-boards-nerf-bar-crew-cab-2015-w-brackets-fl34-16451-ge5fm6.html
def get_item_list(outfile):
  """Collect product titles and URLs from search result pages 1-6 and save them to Excel."""
  result = []
  for page in range(1, 7):
    print(page)
    url = "https://waldoch.com/store/catalogsearch/result/index/?cat=0&limit=200&p="+str(page)+"&q=nerf+bar"
    web = requests.get(url)
    soup = BeautifulSoup(web.text,"html.parser")
    # Each product thumbnail link carries the title and the product URL
    alink = soup.find_all("a",class_="product-image")
    for a in alink:
      title = a["title"]
      item_url = a["href"]
      result.append([title,item_url])
  df = pd.DataFrame(result,columns=["title","item_url"])
  df = df.drop_duplicates()
  # Use the row index left after de-duplication as the item id
  df["id"] =df.index
  df.to_excel(outfile,index=False)
def get_item_info(file,outfile):
  """Fetch each product page listed in file and flatten its data with json_normalize."""
  DEFAULT_FALSE = ""
  df = pd.read_excel(file)
  for i in df.index:
    item_id = df.loc[i,"id"]
    # Skip items for which a per-item workbook already exists
    if os.path.exists(str(int(item_id))+".xlsx"):
      continue
    item_url = df.loc[i,"item_url"]
    web = requests.get(item_url)
    soup = BeautifulSoup(web.text, "html.parser")
    # Images
    imglink = soup.find_all("img", class_=re.compile("^gallery-image"))
    data = data_structs()
    data.info["title"] = df.loc[i,"title"]
    data.info["id"] = item_id
    data.info["item_url"] = item_url
    for a in imglink:
      # One dict per image; these become the rows of the final DataFrame
      image = image_structs()
      image.picture_url["image_id"] = a["id"]
      image.picture_url["picture_url"]=a["src"]
      print(image.picture_url)
      data.info["picture_url"].append(image.picture_url)
    print(data.info)
    # std_desc: visible text of the description block, one string per line
    std_desc = soup.find("div", itemprop="description")
    try:
      strings_desc = "\n".join(std_desc.stripped_strings)
    except AttributeError:
      # find() returned None, so the page has no such block
      strings_desc=DEFAULT_FALSE
    # description: the element that follows the "Description" heading
    try:
      description = soup.find('h2', string="Description").find_next()
    except AttributeError:
      description=DEFAULT_FALSE
    # information: the element that follows the "Information" heading
    try:
      information = soup.find("h2", string='Information').find_next()
    except AttributeError:
      information=DEFAULT_FALSE
    # fitment: the element that follows the "Fitment" heading
    try:
      fitment = soup.find('h2', string='Fitment').find_next()
    except AttributeError:
      fitment=DEFAULT_FALSE
    data.info["std_desc"] = strings_desc
    data.info["description"] = str(description)
    data.info["information"] = str(information)
    data.info["fitment"] = str(fitment)
    print(data.info.keys())
    # Expand the picture_url list into one row per image and repeat
    # the remaining fields on every row
    singledf = json_normalize(data.info,"picture_url",['title', 'item_url', 'id', 'std_desc', 'description', 'information', 'fitment'])
    singledf.to_excel("test.xlsx",index=False)
    # Stop after the first item: this example only demonstrates the conversion
    exit()
    # print(df.loc[i])
  df.to_excel(outfile,index=False)
# Run this first to build the list of item URLs:
# get_item_list("item_urls.xlsx")
get_item_info("item_urls.xlsx","item_urls_info.xlsx")
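
The conversion itself comes down to the single json_normalize call near the end of the script. The following is a minimal sketch of that step on its own, with no network access. The record below is made-up sample data shaped like data.info, and the call is written as pd.json_normalize, which assumes pandas 1.0 or later.

```python
import pandas as pd

# Hypothetical record shaped like data.info in the script above:
# scalar fields plus a list of per-image dicts.
record = {
    "title": "Example nerf bar",
    "item_url": "https://example.com/item-1.html",
    "id": 1,
    "picture_url": [
        {"image_id": "img-0", "picture_url": "https://example.com/0.jpg"},
        {"image_id": "img-1", "picture_url": "https://example.com/1.jpg"},
    ],
}

# record_path names the list to expand into rows; meta lists the
# top-level fields to repeat on every one of those rows.
flat = pd.json_normalize(
    record,
    record_path="picture_url",
    meta=["title", "item_url", "id"],
)

print(flat.shape)
print(list(flat.columns))
```

Each dict in the picture_url list becomes one row, and the fields named in meta are copied onto every row. Since the rows come from the list, it is worth checking how an item with an empty picture_url list behaves in your pandas version before relying on the output.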

The Python modules used here can all be installed with the pip install command, for example:

pip install BeautifulSoup4
pip install xlrd
pip install openpyxl
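
Before running the script, it may help to check which of these modules are already present. The sketch below uses only the standard library. The REQUIRED list and the missing_modules helper are illustrative names, not part of the original script, and the list holds import names, which can differ from pip package names (bs4 is installed as BeautifulSoup4).

```python
import importlib.util

# Import names of the third-party modules the script relies on
# (assumed list; adjust it to match your own environment).
REQUIRED = ["pandas", "requests", "bs4", "openpyxl"]

def missing_modules(names):
    """Return the names that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# An empty list means every module in REQUIRED can be imported.
print(missing_modules(REQUIRED))
```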