Python Web Scraping: Basic Usage of bs4 for Data Parsing

- 1. Basic bs4 syntax
  - 1.1 Loading an HTML page
  - 1.2 Finding tags
  - 1.3 Getting a tag's text
  - 1.4 Getting a tag's attributes
- 2. Example

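Disclaimer: From the date of publication, this article is for reference only and may not be reposted or copied. Readers are solely responsible for any consequences of using it in violation of national laws and regulations; the blog author bears no responsibility. Likewise, any disputes or consequences arising from a reader reposting or copying this article in violation of national laws and regulations are the sole responsibility of that reader, not of the blog author.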

import requests
from bs4 import BeautifulSoup

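1. Basic bs4 syntax

1.1 Loading an HTML page

Loading a local HTML file

# Open the local HTML file; the with-block closes it once parsing is done
with open("./data/base/taobao.html", "r", encoding="UTF-8") as fp:
    # Load the file contents into a BeautifulSoup object
    html = BeautifulSoup(fp, "lxml")
print(html)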
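As a self-contained variation on loading a local file, the sketch below writes a small sample page to a temporary file and parses it. The sample markup and the temporary path are assumptions made for illustration, and the built-in "html.parser" is used instead of "lxml" only so that nothing extra needs to be installed.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical sample page, written to a temporary file so the sketch runs anywhere.
sample = "<html><head><title>demo</title></head><body><a href='/x'>link</a></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sample.html"
    path.write_text(sample, encoding="utf-8")

    # The with-block closes the file handle once parsing is done.
    with open(path, "r", encoding="utf-8") as fp:
        local_soup = BeautifulSoup(fp, "html.parser")

# The parsed tree stays usable after the file is closed.
print(local_soup.title.string)  # demo
```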

Fetching an HTML page from a website

# Fetch the page
response_text = requests.get(url="https://s.taobao.com/").text
# Load the response text into a BeautifulSoup object
html = BeautifulSoup(response_text, "lxml")
print(html)
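The .text attribute is decoded with whatever encoding requests infers, and that guess can be wrong for some pages. One possible workaround, sketched here without any network access, is to hand the raw bytes (what response.content would hold) to BeautifulSoup and name the encoding through from_encoding. The GBK sample bytes are an assumption standing in for a real response, and "html.parser" is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Stand-in for response.content: the bytes of a GBK-encoded page (hypothetical sample).
raw_bytes = "<html><head><title>示例页面</title></head><body></body></html>".encode("gbk")

# from_encoding tells Beautiful Soup how to decode the bytes before parsing.
page = BeautifulSoup(raw_bytes, "html.parser", from_encoding="gbk")

decoded_title = page.title.string
print(decoded_title == "示例页面")  # True
```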

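1.2 Finding tags

soup.<tagName>
Returns the first matching tag, or None if the document contains no such tag. (In the snippets below, the BeautifulSoup object is named html.)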

print(html.a)
print(html.img)
print(html.input)

soup.find(<tagName>)
Equivalent to soup.<tagName>: returns the first matching tag.

print(html.find("a"))
print(html.find("img"))
print(html.find("input"))
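To make the "first match or None" behaviour concrete, here is a small sketch on an inline document. The markup is made up for illustration, and "html.parser" replaces "lxml" only so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two links and no image.
doc = BeautifulSoup(
    "<p><a href='/one'>first</a><a href='/two'>second</a></p>",
    "html.parser",
)

first_link = doc.a          # attribute access returns the first <a>
same_link = doc.find("a")   # find() returns the same first match
missing = doc.img           # no <img> in the document, so this is None

print(first_link["href"])   # /one
print(missing)              # None

# Guard against None before touching attributes of a tag that may be absent.
image_src = missing["src"] if missing is not None else None
```

Checking for None first avoids the TypeError that subscripting a missing tag would raise.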

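soup.find(<tagName>, class_=<className>)
Locates a tag by attribute. A tag matches when its class attribute includes <className>, even if the tag also carries other classes.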

print(html.find("div", class_="site-nav"))
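The class matching rule can be checked on a small sample. In this sketch (hypothetical markup, "html.parser" chosen only to avoid installing lxml), a tag with two classes is found by either class name, but not by a fragment of a class name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first div carries two CSS classes.
doc = BeautifulSoup(
    '<div class="site-nav site-nav-dark">nav</div><div class="footer">foot</div>',
    "html.parser",
)

# Matches because "site-nav" is one of the tag's classes.
nav = doc.find("div", class_="site-nav")
print(nav["class"])  # ['site-nav', 'site-nav-dark']

# A substring of a class name is not a match; whole class names are compared.
partial = doc.find("div", class_="nav")
print(partial)  # None
```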

soup.find_all(<tagName>)
Returns all matching tags as a list; the list is empty if nothing matches.

print(html.find_all("a"))
print(html.find_all("input"))
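Because find_all returns a list, the usual next step is to loop over it. The sketch below uses made-up markup and "html.parser" (an assumption, to stay dependency-free) and also shows two optional filters that find_all accepts: href=True keeps only tags that have that attribute, and limit caps the number of results.

```python
from bs4 import BeautifulSoup

# Hypothetical list of links; the last one has no href attribute.
doc = BeautifulSoup(
    "<ul><li><a href='/a'>A</a></li><li><a href='/b'>B</a></li><li><a>no href</a></li></ul>",
    "html.parser",
)

links = doc.find_all("a")                                   # every <a> tag
hrefs = [a["href"] for a in doc.find_all("a", href=True)]   # only tags that have href
first_two = doc.find_all("a", limit=2)                      # stop after two matches
none_found = doc.find_all("img")                            # empty list, not None

print(len(links))   # 3
print(hrefs)        # ['/a', '/b']
print(none_found)   # []
```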

soup.select(<selector>)
Finds tags with a CSS selector (class, id, descendant, and so on). Returns all matching tags as a list.

print(html.select(".bang"))
print(html.select("#J_SearchForm .search-button"))
print(html.select(".copyright"))
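A few selector forms side by side may help here. This is a sketch on invented markup that borrows the class and id names used above; the real page structure is not assumed. select_one, also part of bs4, returns the first match or None instead of a list. "html.parser" is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing the selector names from the snippets above.
doc = BeautifulSoup(
    """
    <form id="J_SearchForm">
      <div class="search-button"><button>Search</button></div>
    </form>
    <div class="copyright">demo footer</div>
    """,
    "html.parser",
)

by_class = doc.select(".copyright")                     # class selector
nested = doc.select("#J_SearchForm .search-button")     # id, then any descendant
children = doc.select("form > div > button")            # direct children only
first = doc.select_one(".search-button button")         # first match or None
missing = doc.select_one(".bang")                       # nothing matches here

print(by_class[0].text)   # demo footer
print(first.text)         # Search
print(missing)            # None
```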

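1.3 Getting a tag's text

text / get_text(): returns all text inside the tag, including the text of nested tags.
string: returns the text only when the tag has exactly one child; if the tag has several children it returns None.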

print(html.find("div", class_="search-button").text)
print(html.find("div", class_="search-button").string)
print(html.find("div", class_="search-button").get_text())
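The difference between text and string shows up as soon as a tag has more than one child. The following sketch uses hypothetical markup and "html.parser" (an assumption, to avoid installing lxml) and compares a simple tag with a nested one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one div with plain text, one with several children.
doc = BeautifulSoup(
    '<div class="simple">Search</div>'
    '<div class="nested"><span>Hello</span> <b>world</b></div>',
    "html.parser",
)

simple = doc.find("div", class_="simple")
nested = doc.find("div", class_="nested")

print(simple.string)   # Search
print(simple.text)     # Search
print(nested.text)     # Hello world
print(nested.string)   # None, because the div has more than one child

# get_text() accepts a separator and can strip whitespace around each piece.
joined = nested.get_text("|", strip=True)
print(joined)          # Hello|world
```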

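1.4 Getting a tag's attributes

tag[<attribute>]
Looks up an attribute like a dictionary key. The class attribute comes back as a list, and a missing attribute raises KeyError.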

print(html.a["href"])
print(html.find("div")["class"])
print(html.find_all("div")[5]["class"])
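Attribute access behaves like a dictionary lookup, so a missing attribute raises KeyError. The sketch below, on invented markup with "html.parser" as an assumed parser choice, shows the list returned for class and the safer get() lookup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a link with two classes and a div with no class at all.
doc = BeautifulSoup(
    '<a href="/home" class="btn btn-primary">Home</a><div>no class</div>',
    "html.parser",
)

link = doc.a
print(link["href"])    # /home
print(link["class"])   # ['btn', 'btn-primary'], class is multi-valued so it is a list
print(link.attrs)      # all attributes as a dict

# get() returns None (or a supplied default) instead of raising KeyError.
div_classes = doc.div.get("class", [])
print(div_classes)     # []

try:
    doc.div["class"]
    raised = False
except KeyError:
    raised = True
print(raised)          # True
```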

2. Example

Scraping the chapter text of Romance of the Three Kingdoms (三国演义)

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import requests
from bs4 import BeautifulSoup


if __name__ == '__main__':

    # URL and User-Agent header
    url = "https://www.shicimingju.com/book/sanguoyanyi.html"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0"
    }
    # Fetch the table-of-contents page
    html = requests.get(url=url, headers=headers, timeout=5).text
    # Load the page into a BeautifulSoup object
    soup = BeautifulSoup(html, "lxml")
    # Select the chapter links
    content = soup.select("div.book-mulu > ul > li > a")

    # Write the results to a tab-separated file; the with-block closes it even if a request fails
    with open("./data/sgyy/sgyy.txt", "w", encoding="utf-8") as fp:
        fp.write("Chapter\tLink\tContent\n")
        for c in content:
            # Build the chapter URL once, then fetch the chapter page
            chapter_url = "https://www.shicimingju.com" + c["href"]
            href_text = requests.get(url=chapter_url, headers=headers, timeout=5).text
            # Parse the chapter page and extract its body text
            href_soup = BeautifulSoup(href_text, "lxml")
            href_text = href_soup.find("div", class_="chapter_content").text
            # Write the chapter title, link, and text
            fp.write(f'{c.text}\t{chapter_url}\t{href_text}\n')
            print(c.text + " added")
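The script builds each chapter URL by concatenating the site root with the href. A possible alternative is urljoin from the standard library, which also copes with hrefs that are already absolute. The sketch below runs offline on a stand-in table of contents; the chapter hrefs and titles are hypothetical and the real page markup may differ. "html.parser" is used only to avoid an extra dependency.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.shicimingju.com/book/sanguoyanyi.html"

# Hypothetical stand-in for the table-of-contents markup; the real page may differ.
toc_html = """
<div class="book-mulu"><ul>
  <li><a href="/book/sanguoyanyi/1.html">Chapter 1</a></li>
  <li><a href="/book/sanguoyanyi/2.html">Chapter 2</a></li>
</ul></div>
"""

toc = BeautifulSoup(toc_html, "html.parser")

# Pair each chapter title with an absolute URL resolved against the page URL.
chapters = [
    (a.get_text(strip=True), urljoin(BASE_URL, a["href"]))
    for a in toc.select("div.book-mulu > ul > li > a")
]

for title, link in chapters:
    print(title, link)
```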