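A Practical Python Tool for Batch-Downloading Images from a Website

This article is for readers who are comfortable with Python and fond of high-resolution pictures on the web. By the end you will know how to use Python libraries to fetch web pages and download image resources in batches and concurrently. As long as you can install Python libraries and run a Python program, you can use the code given here to download the pictures you want in bulk.

While surfing the web, there are always a few "little waves" that make you smile. Yes, the little waves are beautiful pictures. Downloading while browsing is fine, but good things do not stay around forever, and saving pictures one at a time with "Save as" gets tedious. Can we download them in bulk?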

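Goal

Pconline Photography (太平洋摄影网) is a nice photography site. If you like natural scenery, it is well worth a long look. After a while you may want to take some of it home with you. That is not hard, so let us follow the trail step by step (I was too lazy to take screenshots; open the site and enjoy it yourself).

First, open http://dp.pconline.com.cn/list/all_t145.html ; a page full of lovely thumbnails appears.

Click any of the links and you land on the page of the first picture in a series: http://dp.pconline.com.cn/photo/3687487.html . Click again to reach the page of the second picture: http://dp.pconline.com.cn/photo/3687487_2.html . Clicking "查看原图" (view original) below the picture jumps to http://dp.pconline.com.cn/public/photo/source_photo.jsp?id=19706865&photoId=3687487 , which shows the high-resolution picture. Right-click and choose "Save as" to save it locally.

Perhaps you are already itching to try: wouldn't it be great if a single command could collect all of these pictures?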

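Approach

Where do we start? To automate a task with a program, we first have to find the pattern behind it.

If you have done any web development, you know that the browser's developer tools show a page's HTML, and that the HTML either contains the images directly or links to another HTML page that contains them. In our case, http://dp.pconline.com.cn/list/all_t145.html is the entry page of a broad theme (for example, nature is t145 and architecture is t292); call it EntryHtml. The entry page contains many links to child HTML pages, each of which is a series of pictures shot by a different photographer in a different style under that theme; call each of them SerialHtml. A SerialHtml in turn leads to the HTML page of every picture in the series; call it picHtml. Each picHtml contains a "查看原图" (view original) link that points to the high-resolution picture page, for example http://dp.pconline.com.cn/public/photo/source_photo.jsp?id=19706865&photoId=3687487 ; call it picOriginLink. Finally, the img element inside picOriginLink gives the real address of the high-resolution picture, picOrigin. That is a little dizzying, so let us summarize:

EntryHtml (theme entry page) -> SerialHtml (series entry page) -> picHtml (picture browsing page within a series) -> picOriginLink (high-resolution picture page) -> picOrigin (real address of the high-resolution picture)

Now we need to work out how these five levels are linked.

Inspecting the HTML elements shows that: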

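(1) SerialHtml: the a elements with class="picLink" in the EntryHtml page;

(2) picHtml: the SerialHtml address with an index appended. For example, if SerialHtml is http://dp.pconline.com.cn/photo/3687487.html and the series has 8 pictures, then picHtml = http://dp.pconline.com.cn/photo/3687487_[1-8].html . Note that http://dp.pconline.com.cn/photo/3687487.html and http://dp.pconline.com.cn/photo/3687487_1.html are equivalent, which simplifies the code.

(3) "查看原图" (view original) is a link to the page xxx.jsp that shows the high-resolution picture: it is the a element with class="aView aViewHD" in the picHtml page;

(4) Finally, in the xxx.jsp page, find the img element whose src ends with an image file extension.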
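Rule (1) can be tried out without touching the network. The following is only a sketch: it is written for Python 3, it uses the standard library's html.parser instead of the BeautifulSoup calls used later in this article (to keep the example dependency-free), and the sample HTML and the name PicLinkCollector are made up for illustration. The real page markup may differ.

```python
from html.parser import HTMLParser


class PicLinkCollector(HTMLParser):
    """Collect href values of <a> tags whose class list contains 'picLink'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # class may hold several space-separated names
        classes = (attr_map.get("class") or "").split()
        if "picLink" in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


# Made-up stand-in for an entry page (assumption, not the real markup).
SAMPLE_ENTRY_HTML = """
<ul>
  <li><a class="picLink" href="http://dp.pconline.com.cn/photo/3687487.html">A</a></li>
  <li><a class="other" href="http://example.com/ignored.html">B</a></li>
  <li><a class="picLink big" href="http://dp.pconline.com.cn/photo/3687736.html">C</a></li>
</ul>
"""

collector = PicLinkCollector()
collector.feed(SAMPLE_ENTRY_HTML)
serial_html_list = collector.hrefs
print(serial_html_list)
```

Only the two links carrying the picLink class are collected; the link with class="other" is skipped.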

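So the overall plan is:

STEP1: Fetch the content of EntryHtml, entryContent;

STEP2: Parse entryContent and find the list of a elements with class="picLink", SerialHtmlList;

STEP3: For each page SerialHtml_i in SerialHtmlList:

(1) Fetch the page of its first picture and work out the total number of pictures, total;

(2) From total, generate the total picture page links, picHtmlList;

a. For each page in picHtmlList, find the a element with class="aView aViewHD", hdLink;

b. Fetch the page that hdLink points to and find the img element to get the real picture address picOrigin;

c. Download picOrigin.

Note that a theme spans several pages: the first page is EntryHtml, http://dp.pconline.com.cn/list/all_t145.html , and the second is http://dp.pconline.com.cn/list/all_t145_p2.html . The first page is equivalent to http://dp.pconline.com.cn/list/all_t145_p1.html , which again simplifies the code. To download the series on several pages of one theme, just add one more loop at the outermost level. That is the flow of the serial version.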
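Because both the theme pages and the picture pages follow a regular numbering scheme, every URL in the two outer loops can be generated up front. A minimal sketch, assuming the URL patterns described above; the helper names entry_page_urls and pic_page_urls are made up for this example:

```python
ENTRY_URL_TEMPLATE = "http://dp.pconline.com.cn/list/all_t%d_p%d.html"


def entry_page_urls(theme_id, first_page, last_page):
    """Entry pages of one theme, e.g. theme 145 pages 1..2."""
    return [ENTRY_URL_TEMPLATE % (theme_id, page)
            for page in range(first_page, last_page + 1)]


def pic_page_urls(serial_href, total):
    """Picture pages of one series: <id>_1.html ... <id>_<total>.html."""
    base = serial_href.rsplit(".", 1)[0]
    return ["%s_%d.html" % (base, index) for index in range(1, total + 1)]


for url in entry_page_urls(145, 1, 2):
    print(url)
for url in pic_page_urls("http://dp.pconline.com.cn/photo/3687487.html", 3):
    print(url)
```

Starting the index at 1 for both loops relies on the equivalences noted above (all_t145.html is the same as all_t145_p1.html, and 3687487.html is the same as 3687487_1.html), so the first page needs no special case.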

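Serial version

Approach

Main libraries:

(1) requests: fetches page content;

(2) BeautifulSoup: walks the HTML document tree to get the nodes we need;

(3) multiprocessing.dummy: a thread-based counterpart of Python's multiprocessing library. It exposes the multiprocessing API, so the same Pool interface runs on threads instead of processes.

A few tips:

(1) Use a decorator to catch exceptions in one place and print error information for troubleshooting;

(2) Split the logic into fine-grained functions, which makes it easier to reuse, extend and optimize;

(3) Use asynchronous functions to improve performance, and use map to keep the code concise.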
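As a small illustration of tip (3), here is a sketch of the asynchronous variant of map on a thread pool from multiprocessing.dummy. The function fake_download is a stand-in that only sleeps; it is an assumption made so the example runs without network access.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool


def fake_download(name):
    """Stand-in for a real download: sleep briefly, then report success."""
    time.sleep(0.05)
    return "saved:" + name


pool = ThreadPool(3)
# map_async returns an AsyncResult immediately instead of blocking
async_result = pool.map_async(fake_download, ["a.jpg", "b.jpg", "c.jpg"])
# ... the main thread is free to do other work here ...
pool.close()   # no more tasks will be submitted
pool.join()    # wait until the queued tasks have finished
saved = async_result.get()
print(saved)
```

The important detail is the close() and join() pair (or a call to get()): the pool's worker threads are daemon threads, so a program that exits without waiting may lose tasks that are still queued.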

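Runtime environment: Python 2.7. Use easy_install or pip to install the two third-party libraries, requests and BeautifulSoup. The code imports bs4 (the beautifulsoup4 package) and asks for the "lxml" parser, so lxml has to be installed as well.

Implementation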

 #!/usr/bin/python
 # -*- coding: utf-8 -*-

 import os
 import re
 import sys
 import requests
 from bs4 import BeautifulSoup

 saveDir = os.environ['HOME'] + '/joy/pic/pconline/nature'

 def createDir(dirName):
     if not os.path.exists(dirName):
         os.makedirs(dirName)

 def catchExc(func):
     def _deco(*args, **kwargs):
         try:
             return func(*args, **kwargs)
         except Exception as e:
             # format args/kwargs directly: str(*args) and str(**kwargs) would themselves raise TypeError
             print "error catch exception for %s (%s, %s)." % (func.__name__, args, kwargs)
             print e
             return None
     return _deco

 @catchExc
 def getSoup(url):
     '''
        get the html content of url and transform it into a soup object
        in order to parse what I want later
     '''
     result = requests.get(url)
     status = result.status_code
     if status != 200:
         return None
     resp = result.text
     soup = BeautifulSoup(resp, "lxml")
     return soup

 @catchExc
 def parseTotal(href):
     '''
       total number of pics is obtained from a data request, not from static html.
     '''
     photoId = href.rsplit('/',1)[1].split('.')[0]
     url = "http://dp.pconline.com.cn/public/photo/include/2016/pic_photo/intf/loadPicAmount.jsp?photoId=%s" % photoId
     soup = getSoup(url)
     totalNode = soup.find('p')
     total = int(totalNode.text)
     return total

 @catchExc
 def buildSubUrl(href, ind):
     '''
     if href is http://dp.pconline.com.cn/photo/3687736.html, total is 10
     then suburl is
         http://dp.pconline.com.cn/photo/3687736_[1-10].html
     which contains the original href of the picture
     '''
     return href.rsplit('.', 1)[0] + "_" + str(ind) + '.html'

 @catchExc
 def download(piclink):
     '''
        download pic from pic href such as
             http://img.pconline.com.cn/images/upload/upc/tx/photoblog/1610/21/c9/28691979_1477032141707.jpg
     '''
     picsrc = piclink.attrs['src']
     picname = picsrc.rsplit('/',1)[1]
     saveFile = saveDir + '/' + picname
     # stream=True: read the body in chunks instead of loading the whole picture into memory
     picr = requests.get(picsrc, stream=True)
     # the with block closes the file, so no explicit f.close() is needed
     with open(saveFile, 'wb') as f:
         for chunk in picr.iter_content(chunk_size=1024):
             if chunk:
                 f.write(chunk)
                 f.flush()

 @catchExc
 def downloadForASerial(serialHref):
     '''
        download a serial of pics
     '''
     href = serialHref
     total = parseTotal(href)
     print 'href: %s *** total: %s' % (href, total)
     for ind in range(1, total+1):
         suburl = buildSubUrl(href, ind)
         print "suburl: ", suburl
         subsoup = getSoup(suburl)
         hdlink = subsoup.find('a', class_='aView aViewHD')
         picurl = hdlink.attrs['ourl']
         picsoup = getSoup(picurl)
         # escape the dot so that only a literal ".jpg" matches
         piclink = picsoup.find('img', src=re.compile(r"\.jpg"))
         download(piclink)

 @catchExc
 def downloadAllForAPage(entryurl):
     '''
        download serial pics in a page
     '''
     soup = getSoup(entryurl)
     if soup is None:
         return
     #print soup.prettify()
     picLinks = soup.find_all('a', class_='picLink')
     if len(picLinks) == 0:
         return
     hrefs = map(lambda link: link.attrs['href'], picLinks)
     print 'serials in a page: ', len(hrefs)
     for serialHref in hrefs:
         downloadForASerial(serialHref)

 def downloadEntryUrl(serial_num, index):
     entryUrl = 'http://dp.pconline.com.cn/list/all_t%d_p%d.html' % (serial_num, index)
     print "entryUrl: ", entryUrl
     downloadAllForAPage(entryUrl)
     return 0

 def downloadAll(serial_num):
     start = 1
     end = 2
     return [downloadEntryUrl(serial_num, index) for index in range(start, end+1)]

 serial_num = 145

 if __name__ == '__main__':
     createDir(saveDir)
     downloadAll(serial_num)

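Concurrent version

Approach

Clearly the serial version is slow: the CPU spends most of its time waiting on network connections and I/O. The usual measures for improving performance are:

(1) Group the tasks. The groups can be turned into task-parallel computation when needed, and they can also be used to cap the concurrency on a weak machine so that the program keeps running steadily;

(2) Use multiple threads to isolate the I/O-bound operations so that the CPU does not sit waiting;

(3) Replace single operations in a loop with batch operations, which makes better use of concurrency;

(4) Use multiple processes for CPU-bound operations or for task distribution, to make fuller use of multiple cores.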
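Measures (1) to (3) can be combined in a few lines. The sketch below splits a list of URLs into fixed-size groups and hands each group to a thread pool as one batch. The names chunked, fake_fetch and fetch_in_batches are made up for this example, and fake_fetch only sleeps instead of issuing an HTTP request.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool


def chunked(items, size):
    """Split items into consecutive groups of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def fake_fetch(url):
    """Stand-in for an I/O-bound request: sleep, then return a dummy result."""
    time.sleep(0.02)
    return len(url)


def fetch_in_batches(urls, batch_size, workers):
    """Process urls group by group; each group is one batch call on the pool."""
    results = []
    pool = ThreadPool(workers)
    try:
        for batch in chunked(urls, batch_size):
            # pool.map blocks until the whole batch is done and keeps input order
            results.extend(pool.map(fake_fetch, batch))
    finally:
        pool.close()
        pool.join()
    return results


print(chunked([1, 2, 3, 4, 5], 2))
print(fetch_in_batches(["a", "bb", "ccc"], batch_size=2, workers=2))
```

The batch size and the number of workers are the two knobs: a smaller batch or fewer workers lowers the load on both the local machine and the remote server.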

Implementation

Directory layout:

pystudy
    common
        common.py
        net.py
        multitasks.py
    tools
        dwloadpics_multi.py

common.py

 import os

 def createDir(dirName):
     if not os.path.exists(dirName):
         os.makedirs(dirName)

 def catchExc(func):
     def _deco(*args, **kwargs):
         try:
             return func(*args, **kwargs)
         except Exception as e:
             # format args/kwargs directly: str(*args) and str(**kwargs) would themselves raise TypeError
             print "error catch exception for %s (%s, %s): %s" % (func.__name__, args, kwargs, e)
             return None
     return _deco

net.py

 import time
 import requests
 from bs4 import BeautifulSoup
 from common import catchExc

 delayForHttpReq = 0.5 # 500ms

 @catchExc
 def getSoup(url):
     '''
        get the html content of url and transform it into a soup object
        in order to parse what I want later
     '''
     # throttle requests so the server does not answer with 503
     time.sleep(delayForHttpReq)
     result = requests.get(url)
     status = result.status_code
     # print 'url: %s , status: %s' % (url, status)
     if status != 200:
         return None
     resp = result.text
     soup = BeautifulSoup(resp, "lxml")
     return soup

 @catchExc
 def batchGetSoups(pool, urls):
     '''
        fetch all urls with the given pool and return their soup objects
        in the same order as urls (an entry is None if its request failed)
     '''
     urlnum = len(urls)
     if urlnum == 0:
         return []
     return pool.map(getSoup, urls)

 @catchExc
 def download(piclink, saveDir):
     '''
        download pic from pic href such as
             http://img.pconline.com.cn/images/upload/upc/tx/photoblog/1610/21/c9/28691979_1477032141707.jpg
     '''
     picsrc = piclink.attrs['src']
     picname = picsrc.rsplit('/',1)[1]
     saveFile = saveDir + '/' + picname
     # stream=True: read the body in chunks instead of loading the whole picture into memory
     picr = requests.get(picsrc, stream=True)
     # the with block closes the file, so no explicit f.close() is needed
     with open(saveFile, 'wb') as f:
         for chunk in picr.iter_content(chunk_size=1024):
             if chunk:
                 f.write(chunk)
                 f.flush()

 @catchExc
 def downloadForSinleParam(paramTuple):
     # pool.map passes a single argument, so (piclink, saveDir) arrives packed in a tuple
     download(paramTuple[0], paramTuple[1])

multitasks.py

 from multiprocessing import (cpu_count, Pool)
 from multiprocessing.dummy import Pool as ThreadPool

 ncpus = cpu_count()

 def divideNParts(total, N):
     '''
        divide [0, total) into N parts:
         return [(0, total/N), (total/N, 2*total/N), ..., ((N-1)*total/N, total)]
     '''
     # integer division; the last part absorbs the remainder
     each = total // N
     parts = []
     for index in range(N):
         begin = index*each
         if index == N-1:
             end = total
         else:
             end = begin + each
         parts.append((begin, end))
     return parts
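To see what divideNParts produces, here is a small self-contained sketch. It re-implements the same logic under the name divide_n_parts so the example runs on its own, and page_ranges is a made-up helper that applies the same start = begin + 1 shift used in the main program below to turn half-open ranges into 1-based page ranges.

```python
def divide_n_parts(total, n):
    """Divide [0, total) into n half-open ranges; the last one takes the remainder."""
    each = total // n
    parts = []
    for index in range(n):
        begin = index * each
        end = total if index == n - 1 else begin + each
        parts.append((begin, end))
    return parts


def page_ranges(total_pages, n):
    """Convert the half-open ranges into inclusive 1-based (start, end) page ranges."""
    return [(begin + 1, end) for begin, end in divide_n_parts(total_pages, n)]


print(divide_n_parts(4, 2))   # [(0, 2), (2, 4)]
print(page_ranges(4, 2))      # [(1, 2), (3, 4)]
print(divide_n_parts(10, 3))  # [(0, 3), (3, 6), (6, 10)]
```

With total = 4 and two parts, the theme's pages are processed as pages 1 to 2 and then pages 3 to 4.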

dwloadpics_multi.py

 #!/usr/bin/python
 # -*- coding: utf-8 -*-

 import os
 import re
 import sys
 from common import createDir, catchExc
 from net import getSoup, batchGetSoups, download, downloadForSinleParam
 from multitasks import *

 saveDir = os.environ['HOME'] + '/joy/pic/pconline'

 # thread pool for downloading pictures
 dwpicPool = ThreadPool(5)
 # thread pool for fetching pages
 getUrlPool = ThreadPool(2)

 @catchExc
 def parseTotal(href):
     '''
       total number of pics is obtained from a data request, not from static html.
     '''
     photoId = href.rsplit('/',1)[1].split('.')[0]
     url = "http://dp.pconline.com.cn/public/photo/include/2016/pic_photo/intf/loadPicAmount.jsp?photoId=%s" % photoId
     soup = getSoup(url)
     totalNode = soup.find('p')
     total = int(totalNode.text)
     return total

 @catchExc
 def buildSubUrl(href, ind):
     '''
     if href is http://dp.pconline.com.cn/photo/3687736.html, total is 10
     then suburl is
         http://dp.pconline.com.cn/photo/3687736_[1-10].html
     which contains the original href of the picture
     '''
     return href.rsplit('.', 1)[0] + "_" + str(ind) + '.html'

 def getOriginPicLink(subsoup):
     hdlink = subsoup.find('a', class_='aView aViewHD')
     return hdlink.attrs['ourl']

 def findPicLink(picsoup):
     # escape the dot so that only a literal ".jpg" matches
     return picsoup.find('img', src=re.compile(r"\.jpg"))

 def downloadForASerial(serialHref):
     '''
        download a serial of pics
     '''
     href = serialHref
     total = getUrlPool.map(parseTotal, [href])[0]
     print 'href: %s *** total: %s' % (href, total)
     if not total:
         # parseTotal returns None when the request or the parsing failed
         return
     suburls = [buildSubUrl(href, ind) for ind in range(1, total+1)]
     subsoups = batchGetSoups(getUrlPool, suburls)
     picUrls = map(getOriginPicLink, subsoups)
     picSoups = batchGetSoups(getUrlPool,picUrls)
     piclinks = map(findPicLink, picSoups)
     downloadParams = map(lambda picLink: (picLink, saveDir), piclinks)
     # map_async returns immediately; the downloads run in dwpicPool's threads
     dwpicPool.map_async(downloadForSinleParam, downloadParams)

 def downloadAllForAPage(entryurl):
     '''
        download serial pics in a page
     '''
     print 'entryurl: ', entryurl
     soups = batchGetSoups(getUrlPool,[entryurl])
     # batchGetSoups may return None, and a failed request yields a None entry
     if not soups or soups[0] is None:
         return
     soup = soups[0]
     #print soup.prettify()
     picLinks = soup.find_all('a', class_='picLink')
     if len(picLinks) == 0:
         return
     hrefs = map(lambda link: link.attrs['href'], picLinks)
     map(downloadForASerial, hrefs)

 def downloadAll(serial_num, start, end, taskPool=None):
     entryUrl = 'http://dp.pconline.com.cn/list/all_t%d_p%d.html'
     entryUrls = [ (entryUrl % (serial_num, ind)) for ind in range(start, end+1)]
     execDownloadTask(entryUrls, taskPool)

 def execDownloadTask(entryUrls, taskPool=None):
     if taskPool:
         print 'using pool to download ...'
         taskPool.map(downloadAllForAPage, entryUrls)
     else:
         map(downloadAllForAPage, entryUrls)

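 if __name__ == '__main__':
     createDir(saveDir)
     # the process pool is created here, but downloadAll below is called with taskPool=None,
     # so all pages are handled in the current process
     taskPool = Pool(processes=ncpus)
     serial_num = 145
     total = 4
     nparts = divideNParts(total, 2)
     for part in nparts:
         start = part[0]+1
         end = part[1]
         downloadAll(serial_num, start, end, taskPool=None)
     taskPool.close()
     taskPool.join()
     # wait for the downloads queued with map_async before the program exits
     getUrlPool.close()
     getUrlPool.join()
     dwpicPool.close()
     dwpicPool.join()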

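Key points

Decorators

The catchExc function implements a simple exception catcher: it catches the exceptions raised by the function it wraps and prints the details for troubleshooting. _deco(*args, **kwargs) is a Python function with a generic signature that accepts any arguments, and the decorator returns a function object rather than a concrete value.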
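A slightly more defensive variant of this decorator is sketched below. It is an illustration rather than the code used in this article: the names catch_exc and parse_int are made up, and functools.wraps is added so that the wrapped function keeps its own name and docstring.

```python
import functools


def catch_exc(func):
    """Return a wrapper that logs any exception from func and returns None instead."""
    @functools.wraps(func)  # keep func.__name__ and func.__doc__ on the wrapper
    def _deco(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            print("error in %s(args=%r, kwargs=%r): %s"
                  % (func.__name__, args, kwargs, e))
            return None
    return _deco


@catch_exc
def parse_int(text):
    """Convert text to int."""
    return int(text)


print(parse_int("8"))         # normal call: the result passes through
print(parse_int("not a number"))  # the ValueError is logged and None is returned
print(parse_int.__name__)
```

Returning None on failure keeps a long batch job running, at the price that every caller has to be ready for a None result.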

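Fetching dynamic data

Take the series page http://dp.pconline.com.cn/photo/4846936.html : the number of pictures in the series is loaded dynamically by JavaScript (you can see the request with the network capture tool in Chrome). So we have to issue the corresponding request ourselves to get the data instead of parsing the static page. This does tie the tool to the specific requests of one particular website, which is obviously not flexible.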

 function loadPicAmount(){
     var photoId=4846936;
     var url="/public/photo/include/2016/pic_photo/intf/loadPicAmount.jsp?photoId="+photoId;
     $.get(url,function(data){
         var picAmount=data;
         $("#picAmount").append(picAmount);
     });
 }
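The Python side of this request boils down to extracting the photoId from the series address and filling it into the same path that the JavaScript uses. A minimal sketch: the helper names are made up, and stripping a trailing "_2"-style index is an extra that the article's parseTotal does not do.

```python
PIC_AMOUNT_URL = ("http://dp.pconline.com.cn/public/photo/include/2016/"
                  "pic_photo/intf/loadPicAmount.jsp?photoId=%s")


def photo_id_from_href(href):
    """'http://.../photo/4846936.html' -> '4846936'."""
    filename = href.rsplit("/", 1)[1]      # '4846936.html' or '4846936_2.html'
    stem = filename.split(".")[0]          # drop the '.html' extension
    return stem.split("_")[0]              # drop a picture index, if any (extra step)


def pic_amount_url(href):
    """Address of the data request that returns the picture count of a series."""
    return PIC_AMOUNT_URL % photo_id_from_href(href)


print(photo_id_from_href("http://dp.pconline.com.cn/photo/4846936.html"))
print(pic_amount_url("http://dp.pconline.com.cn/photo/4846936_2.html"))
```

Fetching that address and reading the number out of the response is then an ordinary page request, as parseTotal does with getSoup.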

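Using Soup

BeautifulSoup really is a handy tool for picking elements out of a page in a jQuery-like way. It also shows that borrowing an idiom people already know makes it easier for users to accept something new.

(1) Get an element by id: find(id="")

(2) Get an element by class: hdlink = subsoup.find('a', class_='aView aViewHD')

(3) Get an element by HTML tag: picsoup.find('img', src=re.compile(r"\.jpg")) ; totalNode = soup.find('p')

(4) Get all matching elements: soup.find_all('a', class_='picLink')

(5) Get the text of an element: totalNode.text

(6) Get an attribute of an element: hdlink.attrs['ourl']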

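Batch processing

The concurrent version makes heavy use of map(func, list), lambda expressions and list comprehensions, which keeps the batch-processing code concise and clear.

In addition, each of these map calls can be swapped for a concurrent one when appropriate.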
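The swap is nearly mechanical, with one catch: pool.map passes exactly one argument per item, so a function that needs several parameters has to take them packed in a tuple, which is what downloadForSinleParam does. A small sketch with made-up names, using a pure function so that it runs without network access:

```python
from multiprocessing.dummy import Pool as ThreadPool


def build_sub_url(href_and_index):
    """Single-argument wrapper: unpack (href, index) and build the picture page URL."""
    href, index = href_and_index
    return href.rsplit(".", 1)[0] + "_" + str(index) + ".html"


href = "http://dp.pconline.com.cn/photo/3687736.html"
params = [(href, index) for index in range(1, 4)]

with_map = list(map(build_sub_url, params))              # built-in map
with_comprehension = [build_sub_url(p) for p in params]  # list comprehension

pool = ThreadPool(2)
with_pool = pool.map(build_sub_url, params)              # concurrent, same order
pool.close()
pool.join()

print(with_pool)
```

All three forms return the same list in the same order, so the concurrent call can replace the serial one without changing the code that consumes the result.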

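Modularization

Note that the concurrent version is split into several Python files. The general-purpose functions are separated out and grouped so that they can be reused later.

For this to work, the PYTHONPATH search path has to include the directory that holds the shared files:

export PYTHONPATH=$PYTHONPATH:~/Workspace/python/pystudy/pystudy/common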

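Problems encountered

Multithreading problem

One problem was that fetching the picture totals and the page data was unreliable: sometimes it worked and sometimes it did not. After printing the HTTP requests, I found that they succeed at first and that 503 Service Unavailable responses then arrive in intermittent bursts. My guess is that the server has some protection in place. To get the data reliably, I lowered the request rate by waiting 500 ms before sending each request; see time.sleep(delayForHttpReq), with delayForHttpReq = 0.5, in the getSoup function of net.py. After all, we are not trying to attack the server; we only want a convenient, automated way to fetch the pictures.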
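A fixed delay is the simplest remedy. Another option, not used in this article, is to retry only when a 503 comes back and to wait a little longer each time. The sketch below is written under that assumption: fetch_with_retry is a made-up helper, and fetch is any callable that returns a (status_code, body) pair, so the example can be exercised with a fake instead of a real HTTP call.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=0.5, sleep=time.sleep):
    """Call fetch(url); on 503 wait delay, 2*delay, 4*delay, ... and try again.

    Returns the body on status 200, or None on any other failure.
    """
    for attempt in range(retries + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if status != 503 or attempt == retries:
            return None          # not worth retrying, or out of attempts
        sleep(delay * (2 ** attempt))
    return None


# Fake fetch for demonstration: two 503 responses, then success.
calls = []

def flaky_fetch(url):
    calls.append(url)
    return (503, None) if len(calls) < 3 else (200, "ok")

waits = []
result = fetch_with_retry(flaky_fetch, "http://example.com/page", sleep=waits.append)
print(result, waits)
```

Passing the sleep function in as a parameter is only there to make the behaviour easy to check; in real use the default time.sleep applies.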

Process pool map problem

 from multiprocessing import Pool

 taskPool = Pool(2)

 def g(x):
     return x+1

 def h():
     return taskPool.map(g, [1,2,3,4])

 if __name__ == '__main__':
     print h()
     taskPool.close()
     taskPool.join()

This fails with the following error:

AttributeError: 'module' object has no attribute 'g'

The fix is to move the definition of taskPool into the scope of if __name__ == '__main__': .

 if __name__ == '__main__':
     taskPool = Pool(2)
     print h()
     taskPool.close()
     taskPool.join()

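The reason is given at https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers (16.6.1.5. Using a pool of workers):

Functionality within this package requires that the __main__ module be importable by the children.

Hmm... honestly, I did not fully understand what that means.

https://stackoverflow.com/questions/20222534/python-multiprocessing-on-windows-if-name-main is another reference. The gist is that processes must not be created while the module is being imported. In the example above, Pool(2) at module level starts its worker processes before g has been defined, so the workers cannot find g when a task arrives.

PS: I searched the web for ages and finally found the answer in a passage I had carelessly skipped. The lesson: read the official documentation more, and rely less on shortcut answers.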

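To be continued

The article at http://www.cnblogs.com/lovesqcc/p/8830526.html implements a more general version of this batch image download tool.

This is an original article; please credit the source when reposting. Thanks! 🙂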
