Converting JSON to a DataFrame in Python with pandas


Posted in Python on June 22, 2018

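This article works through an example of converting JSON-style data into a DataFrame with pandas in Python. It is shared here for reference; the full script follows: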

#!python3
# -*- coding:utf-8 -*-
import re
import json
from bs4 import BeautifulSoup
import pandas as pd
import requests
import os
# pandas >= 1.0 exposes json_normalize at the top level; the pandas.io.json path is deprecated
from pandas import json_normalize
class image_structs():
  def __init__(self):
    self.picture_url = {
      "image_id": '',
      "picture_url": ''
    }
class data_structs():
  def __init__(self):
    # columns=['title', 'item_url', 'id','picture_url','std_desc','description','information','fitment'])
    self.info={
      "title":'',
      "item_url":'',
      "id":0,
      "picture_url":[],
      "std_desc":'',
      "description":'',
      "information":'',
      "fitment":''
    }
# "https://waldoch.com/store/catalogsearch/result/index/?cat=0&limit=200&p=1&q=nerf+bar"
# https://waldoch.com/store/new-oem-ford-f-150-f150-5-running-boards-nerf-bar-crew-cab-2015-w-brackets-fl34-16451-ge5fm6.html
def get_item_list(outfile):
  result = []
  for page in range(1, 7):  # result pages 1 through 6
    print(page)
    url = "https://waldoch.com/store/catalogsearch/result/index/?cat=0&limit=200&p="+str(page)+"&q=nerf+bar"
    web = requests.get(url)
    soup = BeautifulSoup(web.text,"html.parser")
    alink = soup.find_all("a",class_="product-image")
    for a in alink:
      title = a["title"]
      item_url = a["href"]
      result.append([title,item_url])
  df = pd.DataFrame(result,columns=["title","item_url"])
  df = df.drop_duplicates()
  df["id"] =df.index
  df.to_excel(outfile,index=False)
def get_item_info(file,outfile):
  DEFAULT_FALSE = ""
  df = pd.read_excel(file)
  for i in df.index:
    id = df.loc[i,"id"]
    if os.path.exists(str(int(id))+".xlsx"):
      continue
    item_url = df.loc[i,"item_url"]
    url = item_url
    web = requests.get(url)
    soup = BeautifulSoup(web.text, "html.parser")
    # images
    imglink = soup.find_all("img", class_=re.compile("^gallery-image"))
    data = data_structs()
    data.info["title"] = df.loc[i,"title"]
    data.info["id"] = id
    data.info["item_url"] = item_url
    for a in imglink:
      image = image_structs()
      image.picture_url["image_id"] = a["id"]
      image.picture_url["picture_url"]=a["src"]
      print(image.picture_url)
      data.info["picture_url"].append(image.picture_url)
    print(data.info)
    # std_desc: plain text of the description block, one string per line
    std_desc = soup.find("div", itemprop="description")
    try:
      strings_desc = "\n".join(std_desc.stripped_strings)
    except AttributeError:  # find() returned None
      strings_desc = DEFAULT_FALSE
    # description: the element that follows the "Description" heading
    try:
      description = soup.find("h2", string="Description").find_next()
    except AttributeError:  # heading not found
      description = DEFAULT_FALSE
    # information: the element that follows the "Information" heading
    try:
      information = soup.find("h2", string="Information").find_next()
    except AttributeError:
      information = DEFAULT_FALSE
    # fitment: the element that follows the "Fitment" heading
    try:
      fitment = soup.find("h2", string="Fitment").find_next()
    except AttributeError:
      fitment = DEFAULT_FALSE
    data.info["std_desc"] = strings_desc
    data.info["description"] = str(description)
    data.info["information"] = str(information)
    data.info["fitment"] = str(fitment)
    print(data.info.keys())
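    # Flatten: one row per picture, with the item-level fields repeated on every row
    singledf = json_normalize(data.info,"picture_url",['title', 'item_url', 'id', 'std_desc', 'description', 'information', 'fitment'])
    # Write one file per item so the os.path.exists() check above can skip finished items
    singledf.to_excel(str(int(id))+".xlsx",index=False)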
  df.to_excel(outfile,index=False)
# get_item_list("item_urls.xlsx")
get_item_info("item_urls.xlsx","item_urls_info.xlsx")
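
The step that actually turns the nested dictionary into a table is the json_normalize call. As a minimal sketch of that step in isolation: the record below is made up for illustration (its titles and URLs are hypothetical, not taken from the site), and it assumes pandas 1.0 or later, where the function is available as pd.json_normalize. The second argument names the nested list to expand into rows, and meta lists the top-level fields to copy onto each row.

```python
import pandas as pd

# Hypothetical record with the same shape as data.info in the script above:
# scalar item-level fields plus a list of per-picture dictionaries.
item = {
    "title": "Example nerf bar",
    "item_url": "https://example.com/item-1.html",
    "id": 1,
    "picture_url": [
        {"image_id": "img-1", "picture_url": "https://example.com/1.jpg"},
        {"image_id": "img-2", "picture_url": "https://example.com/2.jpg"},
    ],
    "std_desc": "short description",
}

# record_path: the nested list whose elements become rows.
# meta: top-level keys repeated on every row.
flat = pd.json_normalize(
    item,
    record_path="picture_url",
    meta=["title", "item_url", "id", "std_desc"],
)
print(flat)
```

Each picture becomes its own row, and the item-level columns (title, item_url, id, std_desc) are repeated alongside image_id and picture_url. That is why the script collects the pictures in a list of dictionaries before normalizing.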

The third-party Python modules used here can all be installed with pip install, for example:

pip install BeautifulSoup4
pip install xlrd
pip install openpyxl