Crawler: saving web page content as PDF with Python (URL to PDF), for example downloading every article in a column

2024/9/20 3:54:01 / Tags: crawler, python, pdf

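I came across a good tutorial and wanted to download all of its articles to read offline when I have time. I asked the author, but there is no electronic copy, so I had to find my own way.
PS: What am I even doing with my days! Sigh…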

Requirement: crawl all the articles linked from a web page and save each one as a PDF.

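Reference links:
[Solved] Saving web page content as PDF with Python (URL to PDF)
Python regular expressions explained (very detailed, you will get it after reading!)
Crawler: downloading HTML with Python and saving it as PDF, using all articles of a Zhihu column as the example
Python crawler: extracting all http links in a page with regular expressions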

Environment
Windows 10
VS Code
conda, Python 3.8
Steps
1) First install pdfkit:

conda install pdfkit
# or
pip install pdfkit

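2) Then install wkhtmltopdf as well.
Go to the official site: https://wkhtmltopdf.org/
Download the exe installer and install it on Windows.

You also need to set up an environment variable: add the absolute path of the bin folder under the wkhtmltopdf install directory to PATH. Then open a new cmd window and run echo %PATH% to check that the new entry shows up. The command only prints the current value; terminals that were already open keep the old PATH.

If VS Code was open at that point, restart it too, so that the new environment variable is visible in VS Code's terminal and run environment (I closed and reopened it; Reload Window may also work).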
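
To confirm that the PATH change is visible to the Python process itself, and not only to a cmd window, a small check along these lines may help. This is a minimal sketch using only the standard library; the helper name `find_wkhtmltopdf` and its `extra_dirs` parameter are hypothetical names chosen for illustration.

```python
import os
import shutil


def find_wkhtmltopdf(extra_dirs=()):
    """Return the full path of the wkhtmltopdf executable, or None if not found.

    extra_dirs is a hypothetical parameter: directories to search in addition
    to the ones already on PATH (for installs that are not on PATH yet).
    """
    search_path = os.pathsep.join([os.environ.get("PATH", ""), *extra_dirs])
    return shutil.which("wkhtmltopdf", path=search_path)


if __name__ == "__main__":
    exe = find_wkhtmltopdf()
    if exe:
        print("wkhtmltopdf found at:", exe)
    else:
        print("wkhtmltopdf is not visible to this Python process")
```

If the function returns a path, that path can also be handed to pdfkit explicitly with `pdfkit.configuration(wkhtmltopdf=exe)` and the `configuration=` argument of `pdfkit.from_url`, which sidesteps PATH entirely.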

3) In VS Code, write the following code:

import os

import pdfkit

# Directory that contains this script
cur_file_dir = os.path.dirname(os.path.abspath(__file__))
# Fill in your own URL
url = "https://xxx"
output_path = os.path.join(cur_file_dir, 'csdn.pdf')
pdfkit.from_url(url, output_path)

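Run it, and the page is printed to PDF without any trouble!

In my own test, converting a single page to PDF this way is very convenient, and the result matches what you get from printing in Chrome and saving as PDF. It works really well.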
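
pdfkit is a wrapper around the wkhtmltopdf executable, so the same single-page conversion can also be done by calling the executable directly. The following is only a sketch under that assumption; `build_command` and `url_to_pdf` are hypothetical helper names, and it uses the basic `wkhtmltopdf <url> <output file>` form of the command line with no extra options.

```python
import shutil
import subprocess


def build_command(url, output_path, exe="wkhtmltopdf"):
    """Build the argument list for: wkhtmltopdf <url> <output file>."""
    return [exe, url, output_path]


def url_to_pdf(url, output_path):
    """Convert one page by running the wkhtmltopdf executable; return its exit code."""
    exe = shutil.which("wkhtmltopdf")
    if exe is None:
        raise FileNotFoundError("wkhtmltopdf is not on PATH")
    # Passing a list (no shell) avoids quoting problems with '&' or '?' in URLs.
    completed = subprocess.run(build_command(url, output_path, exe), check=False)
    return completed.returncode
```

Calling `url_to_pdf("https://xxx", "csdn.pdf")` would then play the role of `pdfkit.from_url` above; a non-zero return value means wkhtmltopdf reported a problem.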

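So how do you download every article listed on a web page and save each one as a PDF? Take C++自学精简实践目录(必读) ("C++ self-study concise practice index (must read)") as the example.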

# -*- coding: utf-8 -*-
import requests
import re
import os
import json
import pdfkit
from collections import deque

HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36',
    # Other request headers can be added here as further dictionary entries
}
# If the configured environment variable does not take effect right away
# (a restart is probably needed), this line adds it for the current process
os.environ["PATH"] += os.pathsep + r'D:\wkhtmltox\bin'


def getUrls(zhuanlan):
    '''
    :param zhuanlan: for https://zhuanlan.zhihu.com/reinforcementlearning, pass the last part, reinforcementlearning
    :return: the URLs of all articles in the column
    '''
    urls = []
    # p_titles = []
    offset = 0
    while True:
        url = 'https://zhuanlan.zhihu.com/api/columns/{}/articles?include=data&limit=100&offset={}'.format(zhuanlan, offset)
        html_string = requests.get(url, headers=HEADERS).text
        content = json.loads(html_string)   # the response body can be loaded as JSON
        urls.extend([item['url'] for item in content['data']])  # so every URL can be indexed the JSON way
        # p_titles.extend([item['title'] for item in content['data']])  # get the titles
        if len(content['data']) < 100:  # this was the last page
            break
        else:
            offset += 100
    return urls


def getUrls2(zhuanlan):
    '''
    :param zhuanlan: for https://zhuanlan.zhihu.com/reinforcementlearning, pass the last part, reinforcementlearning
    :return: the URLs of all articles in the column
    '''
    urlindex = 'https://zhuanlan.zhihu.com/{}'.format(zhuanlan)
    print('urlindex:', urlindex)
    resindex = requests.get(urlindex, headers=HEADERS)
    # print('resindex.text:', resindex.text)
    matchac = re.search(r'"articlesCount":(\d+),', resindex.text)   # get the total article count with a regex
    articlesCount = int(matchac.group(1))
    upper = articlesCount//100+1  # 100 articles per page is set below; this is the number of pages
    urls = []
    for i in range(upper):
        # the maximum value of limit is 100
        urlpage = 'https://zhuanlan.zhihu.com/api/columns/{}/articles?include=data&limit={}&offset={}'.format(zhuanlan, 100, 100*i)
        respage = requests.get(urlpage, headers=HEADERS)
        respage.encoding = 'unicode_escape'
        matchurl = re.findall(r'"title":\s"[^"]+?",\s"url":\s"([^"]+?)",', respage.text)    # match the URLs with a regex
        if len(matchurl) != 0:
            urls += matchurl
        else:
            html_string = requests.get(urlpage, headers=HEADERS).text
            content = json.loads(html_string)  # the response body can be loaded as JSON
            urls.extend([item['url'] for item in content['data']])  # so every URL can be indexed the JSON way
    return urls


def get_html(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36',
    }
    res = requests.get(url, headers=headers)
    return res.text


def extract_all_urls(html):
    pattern = re.compile(r'https://zhuanlan\.zhihu\.com/p/\d+')
    # pattern = re.compile(r'https://www.lz13.cn/[^\s]+.html')
    url_lst = pattern.findall(html)
    return url_lst


def get_urls_from_url(url):
    html = get_html(url)
    url_lst = extract_all_urls(html)
    return url_lst


def get_all_urls(web_site):
    url_access_set = set()  # URLs that have already been visited
    queue_url_set = set()
    url_lst = get_urls_from_url(web_site)
    url_access_set.add(web_site)
    queue = deque()
    for url in url_lst:
        queue.append(url)
        queue_url_set.add(url)
    # while len(queue) != 0:
    #     print(len(queue))
    #     url = queue.popleft()
    #     if url in url_access_set:
    #         continue
    #
    #     url_access_set.add(url)
    #     url_lst = get_urls_from_url(url)
    #     for url in url_lst:
    #         if url not in queue_url_set:
    #             queue.append(url)
    #             queue_url_set.add(url)
    return list(queue_url_set)


def saveArticlesPdf(urls, target_path):
    os.makedirs(target_path, exist_ok=True)
    for i, url in enumerate(urls):
        print('[ {} / {} ] processing'.format(str(i+1).zfill(3), len(urls)))
        content = requests.get(url, headers=HEADERS).text
        # print('content:', content)
        try:
            title = re.search(r'<h1\sclass="Post-Title">(.+)</h1>', content).group(1)
        except AttributeError:
            # No title found: skip this page instead of reusing a stale or undefined title
            print('no title found, skipping:', url)
            continue
        content = content.replace('<noscript>', '')     # works around images failing to download; the image paths are relative
        content = content.replace('</noscript>', '')
        rstr = r"[\/\\\:\*\?\"\<\>\|]"  # '/ \ : * ? " < > |'
        title = re.sub(rstr, " ", title)
        title = re.sub(r'\s+', ' ', title).strip()  # collapse runs of whitespace
        print('title:', title)
        try:
            # Option 1: call the wkhtmltopdf command directly
            # os.system('wkhtmltopdf {} {}'.format(content, target_path+'/{}.pdf'.format(title)))
            # Option 2: go through the pdfkit package
            pdfkit.from_string(content, os.path.join(target_path, '{}.pdf'.format(title)))
        except (ValueError, OSError) as e:
            print(title, e)


if __name__ == '__main__':
    zhuanlan = 'reinforcementlearning'
    # urls = getUrls(zhuanlan)
    # urls = getUrls2(zhuanlan)
    urls = get_all_urls('https://zhuanlan.zhihu.com/p/657345052')
    saveArticlesPdf(urls, r'E:\save\{}'.format(zhuanlan))
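
The two regex steps in the script, pulling article links out of the page HTML and turning an article title into a file name Windows accepts, can be tried offline before running the full crawl. The snippet below is a minimal sketch of those two steps only: the HTML fragment is made up, and `extract_article_urls` and `safe_filename` are hypothetical names rather than functions from the script above.

```python
import re

ARTICLE_URL = re.compile(r'https://zhuanlan\.zhihu\.com/p/\d+')
ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')  # characters Windows forbids in file names


def extract_article_urls(html):
    """Return article URLs in order of first appearance, without duplicates."""
    return list(dict.fromkeys(ARTICLE_URL.findall(html)))


def safe_filename(title):
    """Replace forbidden characters with spaces, then collapse repeated whitespace."""
    cleaned = ILLEGAL_CHARS.sub(' ', title)
    return re.sub(r'\s+', ' ', cleaned).strip()


# Made-up HTML fragment, for illustration only.
sample = (
    '<a href="https://zhuanlan.zhihu.com/p/111">one</a>'
    '<a href="https://zhuanlan.zhihu.com/p/222">two</a>'
    '<a href="https://zhuanlan.zhihu.com/p/111">one again</a>'
)
print(extract_article_urls(sample))
# ['https://zhuanlan.zhihu.com/p/111', 'https://zhuanlan.zhihu.com/p/222']
print(safe_filename('C++: what is a "pointer"?'))
# C++ what is a pointer
```

Using `dict.fromkeys` instead of a `set` removes duplicates while keeping the order in which links appear on the page, so the PDFs would be produced in the same order as the index lists them.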