Scraping od links from 忧郁的loli

2022-10-22


  • Notes
  • Approach
  • Code implementation
  • Further ideas

忧郁的loli is probably too niche: searching the web for a crawler for it turns up almost nothing. I did find one on GitHub that uses Selenium, but the site is small and its server is slow, and Selenium slows browsing down even further, so crawling was painfully slow. I tried making Selenium skip image loading, but the speed was still dismal, so I decided to write my own.

I am a complete beginner (and this is my first blog post), so if you spot anything silly in the code, please point it out. Thanks.

Notes

The code below was written for the "获取国际链接" (get international link) page, on the premise that the request which returns the download link cannot be found through the browser's Network tab. I discovered that this link is assembled as "https://od.hhgal.com/" + game title + "/" + file name, where the file name is either "title + .rar" or "title + .part%d.rar".
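The assembly rule above can be sketched as a small helper. `build_od_url` is my own name for illustration, not part of the site or the script below:

```python
import urllib.parse

def build_od_url(title, part=0):
    """Assemble the od.hhgal.com URL for a game archive.

    part == 0 means a single archive named "<title>.rar";
    otherwise the file is named "<title>.part<N>.rar".
    """
    base = 'https://od.hhgal.com/' + urllib.parse.quote(title + '/' + title)
    if part == 0:
        return base + urllib.parse.quote('.rar')
    return base + urllib.parse.quote('.part%d.rar' % part)
```

`urllib.parse.quote` percent-encodes non-ASCII game titles while leaving the path separator alone, which is exactly what the site expects.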

Update, April 19: cookies no longer need to be changed by hand; they are now fetched automatically.

Update, April 24: unlike Baidu Netdisk links, od links rotate automatically, so the code in this article has no practical use and is for reference only.

Approach

1. Use XPath to locate elements and scrape each game's page link and title from the catalog pages.
2. Visit each game page and scrape the file description to get the number of .rar parts, in preparation for assembling the links.
3. Send a request to the link that returns the download URL. This request redirects to the actual download link, so reading the Location header of the recorded redirect response yields the download URL.
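Step 3 relies on the fact that after `requests` follows a redirect, the original 30x response is kept in `.history`. A minimal sketch of reading the Location header from it (`extract_location` is a name I made up; it works on any object exposing a `.history` attribute the way `requests.Response` does):

```python
def extract_location(response):
    """Return the target of the first recorded redirect, or '' if none.

    When requests follows a redirect, the original 30x response is kept
    in response.history, and its Location header holds the real URL.
    """
    if response.history:
        return response.history[0].headers.get('location', '')
    return ''
```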

Code implementation

import requests
import urllib.parse
from lxml import etree

def get_cookies():
    url1 = 'https://www.hhgal.com/'
    url2 = 'https://www.hhgal.com/?security_verify_data=313638302c31303530'
    headers1 = {"Host": "www.hhgal.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1"
        }
    headers2 = {"Host": "www.hhgal.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Referer": "https://www.hhgal.com/",
    "Cookie": "security_session_verify={}; srcurl=68747470733a2f2f7777772e686867616c2e636f6d2f",
    "Upgrade-Insecure-Requests": "1"
        }
    response = requests.get(url1, headers = headers1)
    # Set-Cookie looks like "security_session_verify=<value>; ...";
    # slice off the fixed-length "security_session_verify=" prefix.
    security_session_verify = response.headers['Set-Cookie'].split(';')[0][24:]

    headers2['Cookie'] = headers2['Cookie'].format(security_session_verify)
    response2 = requests.get(url2, headers = headers2)
    # Same trick for the "security_session_mid_verify=" prefix.
    security_session_mid_verify = response2.headers['Set-Cookie'].split(';')[0][28:]
    return [security_session_verify, security_session_mid_verify]

def judge_part(game_url, headers2):
    # Parse the file description box on a game page and return how many
    # .rar parts the archive is split into (0 means a single .rar file).
    game_page = requests.get(game_url, headers = headers2)
    html = etree.HTML(game_page.content.decode())
    files = html.xpath('//div[@class = "alert alert-info"]/span')[0]
    files_text = ''.join(files.itertext())
    if 'part' in files_text:
        # The last part's name appears right before the final MD5 entry,
        # e.g. "...part3.rar MD5:..."; take the number after "part".
        part = files_text.split('MD5')[-2]
        part = int(part.split('part')[-1])
        return part
    else:
        return 0

def get_download_url(title, part, headers3):
    # Assemble the od link(s) for a game, request each one, and record
    # the redirect target (the real download URL) in record.txt.
    url2 = 'https://od.hhgal.com/' + urllib.parse.quote(title + '/' + title)
    headers3["Referer"] = 'https://od.hhgal.com/' + urllib.parse.quote(title)
    success = 1
    with open('record.txt', 'a', encoding = 'utf-8') as f:
        if part == 0:
            each_url = url2 + urllib.parse.quote('.rar')
            f.write(title + '\n')
            try:
                res = requests.get(each_url, headers = headers3)
                # requests keeps the original 30x response in res.history;
                # its Location header is the actual download link.
                location = res.history[0].headers['location']
                f.write(title + '.rar:' + location + '\n\n')
            except (requests.RequestException, IndexError, KeyError):
                f.write(title + '.rar:' + 'GET_FAILED\n\n')
                success = 0
        else:
            f.write(title + '\n')
            for i in range(1, part + 1):
                each_url = url2 + urllib.parse.quote('.part%d.rar' % i)
                try:
                    res = requests.get(each_url, headers = headers3)
                    location = res.history[0].headers['location']
                    f.write(title + '.part%d.rar:' % i + location + '\n')
                except (requests.RequestException, IndexError, KeyError):
                    f.write(title + '.part%d.rar:' % i + 'GET_FAILED\n')
                    success = 0
            f.write('\n')
        return success


url = 'https://www.hhgal.com/page/{}/'
verify = get_cookies()
# Headers for the main catalog pages
headers1 = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "zh-CN,zh;q=0.9",
"Cache-Control": "max-age=0",
"Connection": "keep-alive",
"Cookie": "simplefavorites=%5B%7B%22site_id%22%3A1%2C%22posts%22%3A%5B28759%5D%2C%22groups%22%3A%5B%7B%22group_id%22%3A1%2C%22site_id%22%3A1%2C%22group_name%22%3A%22Default+List%22%2C%22posts%22%3A%5B28759%5D%7D%5D%7D%5D; security_session_verify={}; security_session_mid_verify={}; wpfront-notification-bar-landingpage=1; wordpress_test_cookie=WP+Cookie+check; PHPSESSID=7jrmhukq2dp2mgqjqde4q2o0hi".format(verify[0], verify[1]),
"Host": "www.hhgal.com",
"Referer": "https://www.hhgal.com/",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "same-origin",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"
    }
# Headers for individual game pages
headers2 = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "zh-CN,zh;q=0.9",
"Cache-Control": "max-age=0",
"Connection": "keep-alive",
"Cookie": "simplefavorites=%5B%7B%22site_id%22%3A1%2C%22posts%22%3A%5B28759%5D%2C%22groups%22%3A%5B%7B%22group_id%22%3A1%2C%22site_id%22%3A1%2C%22group_name%22%3A%22Default+List%22%2C%22posts%22%3A%5B28759%5D%7D%5D%7D%5D; security_session_verify={}; security_session_mid_verify={}; wpfront-notification-bar-landingpage=1; wordpress_test_cookie=WP+Cookie+check; PHPSESSID=5oi82kfhndapa93l1e72m9pat8".format(verify[0], verify[1]),
"Host": "www.hhgal.com",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"
           }
# Headers for requesting the download links on od.hhgal.com
headers3 = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "zh-CN,zh;q=0.9",
"Cache-Control": "max-age=0",
"Connection": "keep-alive",
"Host": "od.hhgal.com",
"Referer": "",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "same-site",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"
    }

open('record.txt', 'w').close()  # truncate the record file from any previous run
page = 1
while page <= 90:
    print('Go to page %d'%page)
    # Tested: only the first page does not follow the /page/{}/ pattern.
    if page == 1:
        response = requests.get('https://www.hhgal.com/', headers = headers1)
    else:
        response = requests.get(url.format(page), headers = headers1)
    text = response.content.decode()
    tree = etree.HTML(text)
    # Game titles
    titles = tree.xpath('//div[@class = "article well clearfix mybody3"]//h1/a/span[@class="animated_h1"]')
    titles = [i.text for i in titles]
    # Game page links
    game_urls = tree.xpath('//div[@class = "article well clearfix mybody3"]//h1/a')
    game_urls = [i.attrib['href'] for i in game_urls]
    print('Catalog page scraped')
    for title, game_url in zip(titles, game_urls):
        if title == '详细更新日志':
            continue  # skip the site's changelog post
        part = judge_part(game_url, headers2)
        print('Part count fetched')
        if get_download_url(title, part, headers3):
            print(title, ': SUCCEED')
        else:
            print(title, ': FAILED')
    page += 1

Further ideas

  1. Multi-threaded crawling
  2. Resume from a breakpoint so an interrupted run can continue
  3. Import the scraped resources into a database
  4. Optimize crawl speed against this small, slow site
  5. Scrape the high-speed links (they contain a 4-digit number I could not find a pattern for; if a pattern exists, great, otherwise brute force could work, but only on top of the optimization in 4, because that redirect link is really slow)
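Idea 1 can be sketched with the standard library's thread pool. `fetch_all` and its parameters are my own illustration, not part of the script above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_all(urls, fetch, max_workers=8):
    """Run fetch(url) concurrently and return {url: result}.

    fetch is whatever single-URL routine you already have; exceptions
    are stored per URL instead of raised, so one bad link does not
    abort the whole run.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception as exc:
                results[url] = exc
    return results
```

Since the redirect requests spend most of their time waiting on the slow server, threads (rather than processes) are a good fit here.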

Because the redirect link is slow, please be patient after starting the crawler.
Cases where fetching fails:
1. A small number of link pages do not follow the naming rule
2. Some files are named part07 rather than part7
3. The file name itself breaks the rule
4. Some games simply have no od link
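Failure case 2 (part07 vs part7) could be worked around by trying both spellings before giving up. A sketch, where `part_name_candidates` is a hypothetical helper of my own:

```python
def part_name_candidates(title, i):
    """Return archive names to try for part i, unpadded spelling first.

    Some files on the site are named e.g. part07 rather than part7,
    so both spellings are generated; for i >= 10 they coincide and
    the duplicate is dropped.
    """
    names = ['%s.part%d.rar' % (title, i), '%s.part%02d.rar' % (title, i)]
    return list(dict.fromkeys(names))
```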
