
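Converting JSON to a DataFrame with pandas in Python

Posted in Python on June 22, 2018

This article shows, by example, how to convert JSON-formatted data into a DataFrame with pandas in Python. It is shared here for reference; the details are as follows: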

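#!python3
# -*- coding:utf-8 -*-
import re
import json
from bs4 import BeautifulSoup
import pandas as pd
import requests
import os
try:
  # pandas >= 1.0 exposes json_normalize at the top level
  from pandas import json_normalize
except ImportError:
  # older pandas releases only provide it under pandas.io.json
  from pandas.io.json import json_normalize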
class image_structs():
  def __init__(self):
    self.picture_url = {
      "image_id": '',
      "picture_url": ''
    }
class data_structs():
  def __init__(self):
    # columns=['title', 'item_url', 'id','picture_url','std_desc','description','information','fitment'])
    self.info={
      "title":'',
      "item_url":'',
      "id":0,
      "picture_url":[],
      "std_desc":'',
      "description":'',
      "information":'',
      "fitment":''
    }
# "https://waldoch.com/store/catalogsearch/result/index/?cat=0&limit=200&p=1&q=nerf+bar"
# https://waldoch.com/store/new-oem-ford-f-150-f150-5-running-boards-nerf-bar-crew-cab-2015-w-brackets-fl34-16451-ge5fm6.html
def get_item_list(outfile):
  result = []
  for i in range(6):
    print(i)
    i = str(i+1)
    url = "https://waldoch.com/store/catalogsearch/result/index/?cat=0&limit=200&p="+i+"&q=nerf+bar"
    web = requests.get(url)
    soup = BeautifulSoup(web.text,"html.parser")
    alink = soup.find_all("a",class_="product-image")
    for a in alink:
      title = a["title"]
      item_url = a["href"]
      result.append([title,item_url])
  df = pd.DataFrame(result,columns=["title","item_url"])
  df = df.drop_duplicates()
  df["id"] =df.index
  df.to_excel(outfile,index=False)
def get_item_info(file,outfile):
  DEFAULT_FALSE = ""  # fallback when a section is missing from the page
  df = pd.read_excel(file)
  for i in df.index:
    item_id = df.loc[i,"id"]
    # skip items for which a file named <id>.xlsx already exists
    if os.path.exists(str(int(item_id))+".xlsx"):
      continue
    item_url = df.loc[i,"item_url"]
    web = requests.get(item_url)
    soup = BeautifulSoup(web.text, "html.parser")
    # images: every <img> with a class starting with "gallery-image"
    imglink = soup.find_all("img", class_=re.compile("^gallery-image"))
    data = data_structs()
    data.info["title"] = df.loc[i,"title"]
    data.info["id"] = item_id
    data.info["item_url"] = item_url
    for a in imglink:
      image = image_structs()
      # .get avoids a KeyError when an <img> lacks the attribute
      image.picture_url["image_id"] = a.get("id", "")
      image.picture_url["picture_url"] = a.get("src", "")
      print(image.picture_url)
      data.info["picture_url"].append(image.picture_url)
    print(data.info)
    # std_desc: plain text of the block marked itemprop="description"
    std_desc = soup.find("div", itemprop="description")
    try:
      strings_desc = "\n".join(std_desc.stripped_strings)
    except AttributeError:
      # soup.find returned None, so the page has no such block
      strings_desc = DEFAULT_FALSE
    # description, information, fitment: the element that follows each <h2> heading
    # (string= is the current name of the older text= argument in BeautifulSoup)
    try:
      description = soup.find("h2", string="Description").find_next()
    except AttributeError:
      description = DEFAULT_FALSE
    try:
      information = soup.find("h2", string="Information").find_next()
    except AttributeError:
      information = DEFAULT_FALSE
    try:
      fitment = soup.find("h2", string="Fitment").find_next()
    except AttributeError:
      fitment = DEFAULT_FALSE
    data.info["std_desc"] = strings_desc
    data.info["description"] = str(description)
    data.info["information"] = str(information)
    data.info["fitment"] = str(fitment)
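    print(data.info.keys())
    # Flatten the nested record: one row per entry in "picture_url",
    # with the remaining fields repeated on every row.
    singledf = json_normalize(data.info,"picture_url",['title', 'item_url', 'id', 'std_desc', 'description', 'information', 'fitment'])
    singledf.to_excel("test.xlsx",index=False)
    # stop after the first processed item; test.xlsx holds the converted DataFrame
    exit()
  df.to_excel(outfile,index=False)
# get_item_list("item_urls.xlsx")  # run this first to create item_urls.xlsx
get_item_info("item_urls.xlsx","item_urls_info.xlsx")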
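The core conversion step, json_normalize with a record path and meta fields, can be tried on its own without any scraping. The following is a minimal sketch: the record contents and the example.com URLs are made up for illustration, and it assumes pandas 1.0 or later, where the function is available as pd.json_normalize.

```python
import pandas as pd

# Hypothetical record shaped like data.info in the script above:
# scalar fields plus a list of per-image dicts under "picture_url".
record = {
    "title": "Sample nerf bar",
    "item_url": "https://example.com/item-1.html",
    "id": 1,
    "picture_url": [
        {"image_id": "img-1", "picture_url": "https://example.com/1.jpg"},
        {"image_id": "img-2", "picture_url": "https://example.com/2.jpg"},
    ],
}

# record_path names the nested list whose entries become the rows;
# meta names the top-level fields that are repeated on every row.
flat = pd.json_normalize(
    record,
    record_path="picture_url",
    meta=["title", "item_url", "id"],
)
print(flat)
```

Each image dict becomes one row, and title, item_url, and id are copied onto every row, which is the shape the script writes to test.xlsx.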

The Python modules used here can all be installed with the pip install command, for example:

pip install BeautifulSoup4

pip install xlrd

pip install openpyxl
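Before running the script, it may help to check which of these packages are already present. The sketch below uses only the standard library; the REQUIRED list and the missing_modules helper are illustrative names, not part of the original script. Note that the BeautifulSoup4 package is imported under the name bs4.

```python
import importlib.util

# Import names of the third-party packages the script relies on.
# The pip package BeautifulSoup4 is imported as "bs4".
REQUIRED = ["bs4", "pandas", "requests", "xlrd", "openpyxl"]

def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Any name printed here still needs a pip install.
print(missing_modules(REQUIRED))
```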


- Author -

zn505119020
