Downloading Files from URLs in Python and Saving Them to the Corresponding Directories

Posted in Python on November 15, 2020

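Introduction

Image datasets and similar collections are often distributed as txt files that store the data as URLs rather than as the files themselves. To make later analysis easier, the files need to be downloaded and stored in folders by category. This article uses the image classification dataset published on GitHub by Alexander Kim as an example: it downloads the image samples listed there and saves them by category.

Environment: Python 3.6.5, Anaconda, VSCode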

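1. Download the dataset files

Create a project folder, download the raw_data folder from the GitHub project mentioned above, and save it in the project directory.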

2. Locate the sample files

Write get_doc_path.py, which takes a root directory and collects every dataset file in that directory and its subdirectories.

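import os


def get_file(root_path, all_files=None):
  '''
  Recursively walk root_path and its subdirectories and collect the path
  of every file. Returns a dict mapping file name to file path.
  '''
  if all_files is None:  # a {} default would be shared across separate calls
    all_files = {}
  files = os.listdir(root_path)
  for file in files:
    full_path = root_path + '/' + file
    if not os.path.isdir(full_path):  # not a dir
      all_files[file] = full_path
    else:  # is a dir
      get_file(full_path, all_files)
  return all_files


if __name__ == '__main__':
  path = './raw_data'
  print(get_file(path))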
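
The same traversal can also be sketched with os.walk from the standard library, which avoids the manual recursion. This is an illustrative alternative, not the code used in the rest of this article, and the name collect_files is hypothetical. Like get_file, it keys the result by file name, so two files with the same name in different folders would overwrite each other.

```python
import os


def collect_files(root_path):
    """Return {file name: file path} for every file under root_path.

    Hypothetical helper for illustration; not part of get_doc_path.py.
    """
    all_files = {}
    for dirpath, _dirnames, filenames in os.walk(root_path):
        for name in filenames:
            # Same keying as get_file: a later file with the same name
            # replaces an earlier one.
            all_files[name] = os.path.join(dirpath, name)
    return all_files
```

Because the dict is created inside the function, repeated calls do not share state.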

3. Download the files

3.1 Read the URL list

# `paths` is the dict returned by get_file() in get_doc_path.py
for filename, path in paths.items():
  print('reading file: {}'.format(filename))
  with open(path, 'r') as f:
    lines = f.readlines()
    url_list = []
    for line in lines:
      url_list.append(line.strip('\n'))  # drop the trailing newline
    print(url_list)
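
The reading step can be wrapped in a small function that also drops blank lines, which would otherwise end up in the list as empty URLs. This is a minimal sketch: the name read_url_list is hypothetical, and the UTF-8 encoding is an assumption about the list files.

```python
def read_url_list(path):
    """Return the non-empty lines of a text file, stripped of whitespace.

    Hypothetical helper; assumes the list file is UTF-8 encoded.
    """
    with open(path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]
```

Stripping all surrounding whitespace rather than only '\n' also handles files with Windows line endings.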

3.2 Create the folder

import os

# `filename` is the name of a URL list file, i.e. a key of `paths` from 3.1
folder_path = "./picture_get_by_url/pic_download/{}".format(filename.split('.')[0])
if not os.path.exists(folder_path):
  print("Selected folder does not exist, trying to create it.")
  os.makedirs(folder_path)
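
The folder step can be sketched as a function built on os.path.splitext and os.makedirs(..., exist_ok=True), which removes the need for a separate existence check. The name make_target_folder and its parameters are hypothetical. Note one difference from the snippet above: for a name with several dots, such as urls.v2.txt, split('.')[0] yields urls while splitext yields urls.v2.

```python
import os


def make_target_folder(base_dir, list_filename):
    """Create base_dir/<list_filename without extension> and return its path.

    Hypothetical helper for illustration.
    """
    stem = os.path.splitext(list_filename)[0]
    folder_path = os.path.join(base_dir, stem)
    os.makedirs(folder_path, exist_ok=True)  # no error if it already exists
    return folder_path
```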

3.3 Download the images

import os
import urllib.request


def get_pic_by_url(folder_path, lists):
  '''Download every URL in lists into folder_path, skipping files that already exist.'''
  if not os.path.exists(folder_path):
    print("Selected folder does not exist, trying to create it.")
    os.makedirs(folder_path)
  for url in lists:
    print("Try downloading file: {}".format(url))
    filename = url.split('/')[-1]  # the last URL segment is used as the file name
    filepath = folder_path + '/' + filename
    if os.path.exists(filepath):
      print("File already exists, skipping.")
    else:
      try:
        urllib.request.urlretrieve(url, filename=filepath)
      except Exception as e:  # report the failure and continue with the next URL
        print("Error occurred when downloading file, error message:")
        print(e)
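
Taking url.split('/')[-1] as the file name works for plain image URLs, but it keeps any query string and returns an empty name for URLs that end in a slash. A more defensive derivation can be sketched with urllib.parse. The function name filename_from_url and the fallback name download.bin are hypothetical choices for illustration.

```python
import posixpath
from urllib.parse import urlsplit, unquote


def filename_from_url(url, default='download.bin'):
    """Derive a local file name from the path component of a URL.

    Hypothetical helper; `default` is used when the path has no last segment.
    """
    path = urlsplit(url).path           # drops the query string and fragment
    name = posixpath.basename(unquote(path))
    return name or default
```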

4. Complete source code

4.1 get_doc_path.py

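import os


def get_file(root_path, all_files=None):
  '''
  Recursively walk root_path and its subdirectories and collect the path
  of every file. Returns a dict mapping file name to file path.
  '''
  if all_files is None:  # a {} default would be shared across separate calls
    all_files = {}
  files = os.listdir(root_path)
  for file in files:
    full_path = root_path + '/' + file
    if not os.path.isdir(full_path):  # not a dir
      all_files[file] = full_path
    else:  # is a dir
      get_file(full_path, all_files)
  return all_files


if __name__ == '__main__':
  path = './raw_data'
  print(get_file(path))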

4.2 get_pic.py

import get_doc_path
import os
import urllib.request


def get_pic_by_url(folder_path, lists):
  '''Download every URL in lists into folder_path, skipping files that already exist.'''
  if not os.path.exists(folder_path):
    print("Selected folder does not exist, trying to create it.")
    os.makedirs(folder_path)
  for url in lists:
    print("Try downloading file: {}".format(url))
    filename = url.split('/')[-1]  # the last URL segment is used as the file name
    filepath = folder_path + '/' + filename
    if os.path.exists(filepath):
      print("File already exists, skipping.")
    else:
      try:
        urllib.request.urlretrieve(url, filename=filepath)
      except Exception as e:  # report the failure and continue with the next URL
        print("Error occurred when downloading file, error message:")
        print(e)


if __name__ == "__main__":
  root_path = './picture_get_by_url/raw_data'
  paths = get_doc_path.get_file(root_path)
  print(paths)
  for filename, path in paths.items():
    print('reading file: {}'.format(filename))
    with open(path, 'r') as f:
      lines = f.readlines()
    # the list file is closed here; it does not need to stay open while downloading
    url_list = []
    for line in lines:
      url_list.append(line.strip('\n'))
    foldername = "./picture_get_by_url/pic_download/{}".format(filename.split('.')[0])
    get_pic_by_url(foldername, url_list)

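4.3 Output

Run get_pic.py.
If the program stops unexpectedly or is run again, it skips files that already exist in the target folder and continues with the ones not yet downloaded.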

{'urls_drawings.txt': './picture_get_by_url/raw_data/drawings/urls_drawings.txt', 'urls_hentai.txt': './picture_get_by_url/raw_data/hentai/urls_hentai.txt', 'urls_neutral.txt': './picture_get_by_url/raw_data/neutral/urls_neutral.txt', 'urls_porn.txt': './picture_get_by_url/raw_data/porn/urls_porn.txt', 'urls_sexy.txt': './picture_get_by_url/raw_data/sexy/urls_sexy.txt'}
reading file: urls_drawings.txt
Try downloading file: http://41.media.tumblr.com/xxxxxx.jpg
Try downloading file: http://41.media.tumblr.com/xxxxxx.jpg
Try downloading file: http://ak1.polyvoreimg.com/cgi/img-thing/size/l/tid/xxxxxx.jpg
Error occurred when downloading file, error message:
HTTP Error 502: No data received from server or forwarder
Try downloading file: http://akicocotte.weblike.jp/gaugau/xxxxxx.jpg
Try downloading file: http://animewriter.files.wordpress.com/2009/01/nagisa-xxxxxx-xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
Try downloading file: http://cdn.awwni.me/xxxxxx.jpg
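
One caveat about skipping existing files on a rerun: the check only tests whether the target path exists, so a file that was partially written when the program was interrupted may be skipped as well. A possible mitigation, sketched here with standard library calls only, is to download to a temporary name and rename it once the transfer finishes. The function name download_atomic, the .part suffix, and the 30 second timeout are hypothetical choices, not part of get_pic.py.

```python
import os
import shutil
import urllib.request


def download_atomic(url, filepath, timeout=30):
    """Download url to filepath; return True if downloaded, False if skipped.

    Hypothetical helper: data goes to filepath + '.part' first, so filepath
    only ever exists once the download has completed.
    """
    if os.path.exists(filepath):
        return False
    tmp_path = filepath + '.part'
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp, \
                open(tmp_path, 'wb') as out:
            shutil.copyfileobj(resp, out)
        os.replace(tmp_path, filepath)  # rename only after a complete download
    except BaseException:
        # Clean up the partial file, then let the caller see the error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

Unlike get_pic_by_url, this sketch re-raises download errors, so a caller that wants to continue with the next URL would wrap the call in its own try/except.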

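Note: because of the nature of the sample dataset, the addresses above are replaced with xxxxx, and the example project is no longer available, but the method still applies.

Update 20.9.23: dataset address: https://github.com/ZQ-Qi/nsfw_data_scrapper. Readers who only want to study and practice the code in this article can download this dataset to try it.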

- Author -

妈哒好气哦
