用python做youtube自动化下载器 代码

2022-12-07,,,,

目录
项目地址
思路
流程
1. post
i. 先把post中的headers格式化
ii.然后把参数也格式化
iii. 最后再执行requests库的post请求
iv. 封装成一个函数
2. 调用解密函数
i. 分析
ii. 先取出js部分
iii. 取第一个解密函数作为我们用的解密函数
iv. 用execjs执行
1. this也就是window变量不存在
2. alert不存在
v. 整合代码
3. 分析解密结果
i. 取关键json
ii. 格式化json
iii. 取下载地址
3. 全部代码

根据 savefrom条例

本实例及教程只用于学习交流用,权利归savefrom.net所有

最后代码+注释大概100行左右,具体代码以github代码为主(可以会在上面修复bug),本文只做具体讲解

项目地址

github仓库

思路

pythonyoutube自动化下载器 思路

流程

1. post

根据思路里的第一步,我们首先需要用post方式取到加密后的js字段,笔者使用了requests第三方库来执行,关于爬虫可以参考我之前的文章

i. 先把post中的headers格式化

# set the headers or the website will not return information
# the cookies in here you may need to change
headers = {
"cache-Control": "no-cache",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
"*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"accept-encoding": "gzip, deflate, br",
"accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
"content-type": "application/x-www-form-urlencoded",
"cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
"clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
"helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
"_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
"PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
"PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
"origin": "https://en.savefrom.net",
"pragma": "no-cache",
"referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
"sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
"sec-ch-ua-mobile": "?0",
"sec-fetch-dest": "iframe",
"sec-fetch-mode": "navigate",
"sec-fetch-site": "same-origin",
"sec-fetch-user": "?1",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/87.0.4280.88 Safari/537.36"}

其中cookie部分可能要改,然后最好以你们浏览器上的为主,具体每个参数的含义不是本文范围,可以自行去搜索引擎搜

ii.然后把参数也格式化

# set the parameter, we can get from chrome
kv = {"sf_url": url,
"sf_submit": "",
"new": "1",
"lang": "en",
"app": "",
"country": "cn",
"os": "Windows",
"browser": "Chrome"}

其中sf_url字段是我们要下载的youtube视频的url,其他参数都不变

iii. 最后再执行requests库的post请求

# do the POST request
r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
data=kv)
r.raise_for_status()

注意是data=kv

iv. 封装成一个函数

import requests

def gethtml(url):
# set the headers or the website will not return information
# the cookies in here you may need to change
headers = {
"cache-Control": "no-cache",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
"*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"accept-encoding": "gzip, deflate, br",
"accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
"content-type": "application/x-www-form-urlencoded",
"cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
"clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
"helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
"_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
"PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
"PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
"origin": "https://en.savefrom.net",
"pragma": "no-cache",
"referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
"sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
"sec-ch-ua-mobile": "?0",
"sec-fetch-dest": "iframe",
"sec-fetch-mode": "navigate",
"sec-fetch-site": "same-origin",
"sec-fetch-user": "?1",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/87.0.4280.88 Safari/537.36"}
# set the parameter, we can get from chrome
kv = {"sf_url": url,
"sf_submit": "",
"new": "1",
"lang": "en",
"app": "",
"country": "cn",
"os": "Windows",
"browser": "Chrome"}
# do the POST request
r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
data=kv)
r.raise_for_status()
# get the result
return r.text

2. 调用解密函数

i. 分析

这其中的难点在于在python里执行javascript代码,而晚上的解决方法有PyV8等,本文选用execjs。在思路部分我们可以发现js部分的最后几行是解密函数,所以我们只需要在execjs中先执行一遍全部,然后再单独执行解密函数就好了

ii. 先取出js部分

# target(youtube address) url
url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
# get the target text
reo = gethtml(url)
# Remove the code from the head and tail (we need the javascript part, information store with encryption in js part)
reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]

这里其实可以用正则,不过由于笔者正则表达式还不太熟练就直接用split

iii. 取第一个解密函数作为我们用的解密函数

当你多取几次不同视频的结果,你就会发现每次的解密函数都不一样,不过位置都是还是在固定行数

# split each line(help us find the decrypt function in last few line)
reA = reo.split("\n")
# get the depcrypt function
name = reA[len(reA) - 3].split(";")[0] + ";"

所以name就是我们的解密函数了(变量名没取太好hhh)

iv. 用execjs执行

# use execjs to execute the js code, and the cwd is the result of `npm root -g`(the path of npm in your computer)
ct = execjs.compile(reo)
# do the decryption
text = ct.eval(name.split("=")[1].replace(";", ""))

其中只取=后面的和去掉分号是指指执行这个函数而不用赋值,当先执行赋值+解密然后取值也不是不可以

但是我们可以发现马上就报错了(要是有这么简单就好了)

1. this也就是window变量不存在

如果没记错是报错this或者$b,笔者尝试把全部this去掉或者把全部框在一个class里面(这样子this就变成那个class了)不过都没有成功,然后发现在npm下有个jsdom可以在execjs里模拟window变量(其实应该有更好方法的),所以我们需要下载npm和里面的jsdom,然后改写以上代码

    addition = """
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
window = dom.window;
document = window.document;
XMLHttpRequest = window.XMLHttpRequest;
"""
# use execjs to execute the js code, and the cwd is the result of `npm root -g`(the path of npm in your computer)
ct = execjs.compile(addition + reo, cwd=r'C:\Users\xxx\AppData\Roaming\npm\node_modules')

其中

cwd字段是npm root -g的结果,也就是npm的modules路径
addition是用来模拟window

但是我们又可以发现下一个错误

2. alert不存在

这个错误是因为在execjs下执行alert函数是没有意义的,因为我们没有浏览器让他弹窗,且原本alert函数的定义是来源window而我们自定义了window,所以我们要在代码前重写覆盖alert函数(相当于定义一个alert)

# override the alert function, because in the code there has one place using
# and we cannot do the alerting in execjs(it is meaningless) however, if we donnot override, the code will raise a error
reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")

v. 整合代码

# target(youtube address) url
url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
# get the target text
reo = gethtml(url)
# Remove the code from the head and tail (we need the javascript part, information store with encryption in js part)
reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]
# override the alert function, because in the code there has one place using
# and we cannot do the alerting in execjs(it is meaningless) however, if we donnot override, the code will raise a error
reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
# split each line(help us find the decrypt function in last few line)
reA = reo.split("\n")
# get the depcrypt function
name = reA[len(reA) - 3].split(";")[0] + ";"
# add jsdom into the execjs because the code will use(maybe there is a solution without jsdom, but i have no idea)
addition = """
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
window = dom.window;
document = window.document;
XMLHttpRequest = window.XMLHttpRequest;
"""
# use execjs to execute the js code, and the cwd is the result of `npm root -g`(the path of npm in your computer)
ct = execjs.compile(addition + reo, cwd=r'C:\Users\19308\AppData\Roaming\npm\node_modules')
# do the decryption
text = ct.eval(name.split("=")[1].replace(";", ""))

3. 分析解密结果

i. 取关键json

运行完上面的部分,解密结果就存在text里了,而我们在思路中可以发现,真正对我们重要的就是存在window.parent.sf.videoResult.show()里的json,所以用正则表达式取这一部分的json

# get the result in json
result = re.search('show\((.*?)\);;', text, re.I | re.M).group(0).replace("show(", "").replace(");;", "")

ii. 格式化json

python可以格式化json的库有很多,这里笔者用了json库(记得import)

# use `json` to load json
j = json.loads(result)

iii. 取下载地址

接下来就到了最后一步,根据思路里和json格式化工具我们可以发现j["url"][num]["url"]就是下载链接,而num是我们要的视频格式(不同分辨率和类型)

# the selection of video(in this case, num=1 mean the video is
# - 360p known from j["url"][num]["quality"]
# - MP4 known from j["url"][num]["type"]
# - audio known from j["url"][num]["audio"]
num = 1
downurl = j["url"][num]["url"]
# do some download
# thanks :)
# - EOF -

3. 全部代码

# -*- coding: utf-8 -*-
# @Time: 2021/1/10
# @Author: Eritque arcus
# @File: Youtube.py
# @License: MIT
# @Environment:
# - windows 10
# - python 3.6.2
# @Dependence:
# - jsdom in npm(windows also can use)
# - requests, execjs, re, json in python
import requests
import execjs
import re
import json def gethtml(url):
# set the headers or the website will not return information
# the cookies in here you may need to change
headers = {
"cache-Control": "no-cache",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
"*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"accept-encoding": "gzip, deflate, br",
"accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
"content-type": "application/x-www-form-urlencoded",
"cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
"clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
"helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
"_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
"PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
"PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
"origin": "https://en.savefrom.net",
"pragma": "no-cache",
"referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
"sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
"sec-ch-ua-mobile": "?0",
"sec-fetch-dest": "iframe",
"sec-fetch-mode": "navigate",
"sec-fetch-site": "same-origin",
"sec-fetch-user": "?1",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/87.0.4280.88 Safari/537.36"}
# set the parameter, we can get from chrome
kv = {"sf_url": url,
"sf_submit": "",
"new": "1",
"lang": "en",
"app": "",
"country": "cn",
"os": "Windows",
"browser": "Chrome"}
# do the POST request
r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
data=kv)
r.raise_for_status()
# get the result
return r.text if __name__ == '__main__':
# target(youtube address) url
url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
# get the target text
reo = gethtml(url)
# Remove the code from the head and tail (we need the javascript part, information store with encryption in js part)
reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]
# override the alert function, because in the code there has one place using
# and we cannot do the alerting in execjs(it is meaningless) however, if we donnot override, the code will raise a error
reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
# split each line(help us find the decrypt function in last few line)
reA = reo.split("\n")
# get the depcrypt function
name = reA[len(reA) - 3].split(";")[0] + ";"
# add jsdom into the execjs because the code will use(maybe there is a solution without jsdom, but i have no idea)
addition = """
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
window = dom.window;
document = window.document;
XMLHttpRequest = window.XMLHttpRequest;
"""
# use execjs to execute the js code, and the cwd is the result of `npm root -g`(the path of npm in your computer)
ct = execjs.compile(addition + reo, cwd=r'C:\Users\19308\AppData\Roaming\npm\node_modules')
# do the decryption
text = ct.eval(name.split("=")[1].replace(";", ""))
# get the result in json
result = re.search('show\((.*?)\);;', text, re.I | re.M).group(0).replace("show(", "").replace(");;", "")
# use `json` to load json
j = json.loads(result)
# the selection of video(in this case, num=1 mean the video is
# - 360p known from j["url"][num]["quality"]
# - MP4 known from j["url"][num]["type"]
# - audio known from j["url"][num]["audio"]
num = 1
downurl = j["url"][num]["url"]
# do some download
# thanks :)
# - EOF -

总计102行
开发环境

# @Environment:
# - windows 10
# - python 3.6.2

依赖

# @Dependence:
# - jsdom in npm(windows also can use)
# - requests, execjs, re, json in python

-end-

For 爬虫

版权声明:本文为博主原创文章,遵循CC 4.0 BY-SA版权协议,转载请附上原文出处链接和本声明。

本文作者: https://www.cnblogs.com/Eritque-arcus/ 或https://blog.csdn.net/qq_40832960

用python做youtube自动化下载器 代码的相关教程结束。

《用python做youtube自动化下载器 代码.doc》

下载本文的Word格式文档,以方便收藏与打印。