Python Web Scraping: Parsing Books from Douban's "New Book Express" (新书速递)

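1- Problem description

　　Scrape the book information (title, author, summary, URL) listed on Douban's "New Book Express" (新书速递)^[1] page and write the results to a txt file.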

2- Approach^[2]

　　Step1 Fetch the HTML of the page

　　Step2 Traverse the elements and attributes with XPath
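
The two steps can be tried offline before touching the real site. The sketch below is only an illustration: the markup in SAMPLE is hypothetical, modelled on the class names used by the XPath queries in this post, and the example.com links are placeholders rather than real Douban URLs.

```python
from lxml import html

# Hypothetical markup: it only mimics the class names queried in this post
# and is not a copy of the real Douban page.
SAMPLE = """
<ul class="cover-col-4 clearfix">
  <li>
    <a href="https://example.com/book/1"></a>
    <div class="detail-frame">
      <h2>First Book</h2>
      <p class="color-gray">Author A</p>
      <p>Summary of the first book.</p>
    </div>
  </li>
  <li>
    <a href="https://example.com/book/2"></a>
    <div class="detail-frame">
      <h2>Second Book</h2>
      <p class="color-gray">Author B</p>
      <p>Summary of the second book.</p>
    </div>
  </li>
</ul>
"""

# Step 1: turn the HTML text into an element tree
tree = html.fromstring(SAMPLE)

# Step 2: select text nodes and attribute values with XPath
titles = [t.strip() for t in tree.xpath('//div[@class="detail-frame"]/h2/text()')]
links = tree.xpath('//ul[@class="cover-col-4 clearfix"]/li/a/@href')

print(titles)
print(links)
```

A path ending in text() yields the text nodes of the matched elements, while a path ending in @href yields the attribute values directly, so no further unpacking is needed.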

3- Tools

　　Python, the lxml module, the requests module

4- Implementation

 # -*- coding: utf-8 -*-
 # Runs on Python 3
 from lxml import html
 import requests

 page = requests.get('http://book.douban.com/latest?icn=index-latestbook-all')
 tree = html.fromstring(page.text)

 # If the HTML file was saved locally, use the following instead
 # page = open('/home/freyr/codeHouse/python/512.htm', 'r').read()
 # tree = html.fromstring(page)

 # Extract the book information
 bookname = tree.xpath('//div[@class="detail-frame"]/h2/text()')    # title
 author = tree.xpath('//div[@class="detail-frame"]/p[@class="color-gray"]/text()')    # author
 info = tree.xpath('//div[@class="detail-frame"]/p[2]/text()')    # summary
 # Select the href value itself rather than whichever attribute comes first on <a>
 url = tree.xpath('//ul[@class="cover-col-4 clearfix"]/li/a/@href')    # URL

 # Strip surrounding whitespace from every extracted string
 booknames = [x.strip() for x in bookname]
 authors = [x.strip() for x in author]
 infos = [x.strip() for x in info]
 urls = [x.strip() for x in url]

 # Write one record per book; encoding='utf-8' keeps the Chinese text intact
 with open('/home/freyr/codeHouse/python/dbBook.txt', 'w', encoding='utf-8') as f:
     # Loop names differ from the list names above so nothing is overwritten
     for book, writer, summary, link in zip(booknames, authors, infos, urls):
         f.write('%s\n\n%s\n\n%s' % (book, writer, summary))
         f.write('\n\n%s\n' % link)
         f.write('\n\n-----------------------------------------\n\n\n')
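
One caveat about the listing above: it collects titles, authors, summaries and URLs in four independent queries and pairs them with zip, so a single book that lacks one field shifts every later record. A possible alternative, sketched here under assumptions, walks one li at a time and uses relative XPaths. The helper names first_text and parse_books are hypothetical, the sample markup is invented, and the sketch assumes each detail-frame div sits inside the same li as its cover link, which this post does not confirm.

```python
from lxml import html

# Hypothetical markup; the second book deliberately has no summary paragraph.
SAMPLE = """
<ul class="cover-col-4 clearfix">
  <li>
    <a href="https://example.com/book/1"></a>
    <div class="detail-frame">
      <h2>First Book</h2>
      <p class="color-gray">Author A</p>
      <p>Summary of the first book.</p>
    </div>
  </li>
  <li>
    <a href="https://example.com/book/2"></a>
    <div class="detail-frame">
      <h2>Second Book</h2>
      <p class="color-gray">Author B</p>
    </div>
  </li>
</ul>
"""


def first_text(node, path):
    """Return the first non-empty stripped result of a relative XPath, or ''."""
    for value in node.xpath(path):
        value = value.strip()
        if value:
            return value
    return ''


def parse_books(tree):
    """Build one dict per <li>, so a missing field never shifts other records."""
    frame = './/div[@class="detail-frame"]'
    books = []
    for li in tree.xpath('//ul[@class="cover-col-4 clearfix"]/li'):
        books.append({
            'title': first_text(li, frame + '/h2/text()'),
            'author': first_text(li, frame + '/p[@class="color-gray"]/text()'),
            'info': first_text(li, frame + '/p[2]/text()'),
            'url': first_text(li, './a/@href'),
        })
    return books


books = parse_books(html.fromstring(SAMPLE))
for book in books:
    print(book)
```

With this layout the second record simply carries an empty summary instead of borrowing the summary of the next book.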

PS: 　　1. I have not started studying web scraping in earnest yet; this is just a quick note.

　　　　2. The program involves encoding issues^[3].
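
As an illustration of the kind of encoding issue meant here, the sketch below reproduces one common cause of garbled Chinese text: UTF-8 bytes decoded with the wrong charset. It is an assumption that this is the relevant case, since the post does not say which problem was actually encountered. The example uses only the standard library.

```python
# Bytes as they would arrive over the network for a UTF-8 page
raw = '新书速递'.encode('utf-8')

# Wrong charset: each of the 12 bytes becomes one unrelated character
garbled = raw.decode('iso-8859-1')

# Right charset: the four original characters come back
correct = raw.decode('utf-8')

# ISO-8859-1 maps every byte to a character, so the damage is reversible here
repaired = garbled.encode('iso-8859-1').decode('utf-8')

print(len(raw), len(garbled), len(correct))
print(repaired == correct)
```

When a scraped page looks like the garbled variant, checking which charset was used to decode the response is usually the first thing to try.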

[1] Douban, New Book Express (豆瓣-新书速递)

[2] lxml and Requests

[3] Garbled Chinese text with lxml (lxml 中文乱码)
