How do you use a Python crawler to fetch articles from a large news site and export them as PDF or Word documents?

Here we use the fairly well-known site 商业新知 (Shangye Xinzhi, https://www.shangyexinzhi.com/) as the example, and walk through the approach behind the code.

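Part 1: Analyzing the site structure

Analyzing a site's structure with Python usually takes three steps:

Fetch the HTML: use the requests library to download the page's HTML source.
Parse the HTML: use the BeautifulSoup library to parse it and extract structural information such as the navigation bar, links, and headings.
Analyze the structure: use the extracted HTML elements to work out the page's layout and organization.

The following example shows how to analyze the structure of the Shangye Xinzhi site with Python: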

Python code example

import sys

import requests
from bs4 import BeautifulSoup

# Target site URL
url = "https://www.shangyexinzhi.com/"

# Send the HTTP request and fetch the HTML
response = requests.get(url, timeout=10)
response.encoding = "utf-8"  # decode the body as UTF-8

# Stop early if the request failed
if response.status_code != 200:
    print("Failed to retrieve the webpage")
    sys.exit(1)
html_content = response.text

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

# Page title (the <title> tag may be missing)
title_tag = soup.find("title")
title = title_tag.text.strip() if title_tag else ""
print(f"Website Title: {title}")

# Navigation links
nav_links = soup.find_all("a", class_="nav-link")  # assumes the nav links carry this class
print("\nNavigation Links:")
for link in nav_links:
    print(f"{link.text.strip()} -> {link.get('href')}")

# All first-level headings (H1)
h1_tags = soup.find_all("h1")
print("\nH1 Tags:")
for h1 in h1_tags:
    print(h1.text.strip())

# All second-level headings (H2)
h2_tags = soup.find_all("h2")
print("\nH2 Tags:")
for h2 in h2_tags:
    print(h2.text.strip())

# All links that have both an href and visible text
all_links = soup.find_all("a")
print("\nAll Links:")
for link in all_links:
    href = link.get("href")
    text = link.text.strip()
    if href and text:
        print(f"{text} -> {href}")

# Footer content
footer = soup.find("footer")  # assumes the page has a <footer> tag
if footer:
    print("\nFooter Content:")
    print(footer.text.strip())

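Analysis results

Running the code above gives you the following information:

Site title: the content of the <title> tag.
Navigation links: every link in the navigation bar, with its text.
First-level (H1) and second-level (H2) headings: the content of every <h1> and <h2> tag on the page.
All links: the href attribute and text of every <a> tag on the page.
Footer: the content of the <footer> tag.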

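Notes

Dynamic content: if the site loads its content with JavaScript, requests and BeautifulSoup alone may not see all of it. In that case, Selenium can drive a real browser instead.
Changing structure: the site's HTML can change at any time, so the code may need adjusting to match the live pages.
Respect robots.txt: when crawling, follow the rules in the site's robots.txt file and do not violate the site's terms of use.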

Part 2: Crawling the site's articles

To crawl articles from the Shangye Xinzhi site, follow the steps below.

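1. Follow the robots protocol

Before crawling, check the target site's robots.txt file to see which pages may be crawled. The file is at:

https://www.shangyexinzhi.com/robots.txt

Make sure your crawler does not visit any disallowed page.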
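
As a minimal sketch of this check, Python's standard-library urllib.robotparser can evaluate robots.txt rules. The rules in SAMPLE_RULES and the user-agent name my-crawler are made up for illustration and are not the site's real rules. Against the live site you would instead call set_url() with the robots.txt address and then read() on the parser.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules that are already in memory
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only; the real file may differ.
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /admin/",
]

if __name__ == "__main__":
    base = "https://www.shangyexinzhi.com"
    for path in ("/article/123", "/admin/login"):
        print(path, is_allowed(SAMPLE_RULES, "my-crawler", base + path))
```

Calling a helper like this before every request keeps the decision in one place, so the crawler can skip a disallowed URL instead of fetching it.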

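2. Analyze the site structure

Open the Shangye Xinzhi site and inspect the HTML of its article pages with the browser's developer tools (F12). For example:

The article list may sit inside a particular <div> or <ul> tag.
Each article's title and link may sit inside an <a> tag.
The article body may sit inside an <article> or <div> tag.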
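
One way to support this manual inspection, sketched here under stated assumptions, is to tally which class names occur on <a> tags in a saved copy of the page: a class that repeats many times is a likely candidate for the article-link selector. SAMPLE_HTML is hypothetical markup, not taken from the real site, and the sketch uses the standard-library html.parser so it runs without extra packages.

```python
from collections import Counter
from html.parser import HTMLParser


class LinkClassCounter(HTMLParser):
    """Count the class names that appear on <a> tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        class_value = dict(attrs).get("class") or ""
        for name in class_value.split():
            self.counts[name] += 1


def count_link_classes(html):
    """Return a Counter mapping class name -> number of <a> tags using it."""
    parser = LinkClassCounter()
    parser.feed(html)
    return parser.counts


# Hypothetical markup for illustration; not taken from the real site.
SAMPLE_HTML = """
<ul>
  <li><a class="article-link" href="/article/1">First</a></li>
  <li><a class="article-link" href="/article/2">Second</a></li>
</ul>
<a class="nav-link" href="/about">About</a>
"""

if __name__ == "__main__":
    for name, count in count_link_classes(SAMPLE_HTML).most_common():
        print(name, count)
```

The tally only narrows down candidates; the chosen class still has to be confirmed in the developer tools.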

3. Crawl articles with Python and BeautifulSoup

Below is a simple crawler built on Python and BeautifulSoup that collects article titles and links:

import requests
from bs4 import BeautifulSoup

# Target site URL
base_url = "https://www.shangyexinzhi.com"

# Send the HTTP request and fetch the home page
response = requests.get(base_url, timeout=10)
response.encoding = "utf-8"

# Check whether the request succeeded
if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")

    # Find the article links (example selector; adjust to the real HTML)
    articles = soup.find_all("a", class_="article-link")

    # Extract each article's title and link
    for article in articles:
        title = article.get_text(strip=True)
        link = article.get("href")
        if not link:
            continue  # skip anchors without an href
        full_link = base_url + link if link.startswith("/") else link
        print(f"Title: {title}\nLink: {full_link}\n")
else:
    print("Failed to retrieve the webpage")
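
The href values collected by a crawler like the one above can be relative, absolute, duplicated, or not web links at all (for example javascript: pseudo-links). The sketch below shows one possible way to clean them up with urllib.parse. The function name normalize_links and the rule of keeping only same-host links are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlsplit


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, keep same-host http(s) links, drop duplicates."""
    base_host = urlsplit(base_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # skip None and empty strings
        absolute = urljoin(base_url, href.strip())
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https") or parts.netloc != base_host:
            continue  # skip javascript:, mailto:, and links to other hosts
        absolute = absolute.split("#", 1)[0]  # drop the fragment
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


if __name__ == "__main__":
    sample = ["/article/1", "article/2", "/article/1#top", "javascript:void(0)"]
    for link in normalize_links("https://www.shangyexinzhi.com/", sample):
        print(link)
```

Unlike plain string concatenation, urljoin gives the same result whether or not the base URL ends with a slash, and it leaves absolute links untouched.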

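4. Handle dynamic content (optional)

If the article content is loaded by JavaScript, Selenium can drive a real browser. Below is a Selenium example:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

base_url = "https://www.shangyexinzhi.com"

# Set up the Selenium WebDriver
service = Service(executable_path='/path/to/chromedriver')  # replace with your chromedriver path
driver = webdriver.Chrome(service=service)

try:
    # Open the target page
    driver.get(base_url)
    # Wait up to 10 seconds for at least one link to appear in the rendered page
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "a"))
    )

    # Grab the rendered page source
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")

    # Extract article info (example selector; adjust to the real HTML)
    articles = soup.find_all("a", class_="article-link")
    for article in articles:
        title = article.get_text(strip=True)
        link = article.get("href")
        if not link:
            continue  # skip anchors without an href
        full_link = base_url + link if link.startswith("/") else link
        print(f"Title: {title}\nLink: {full_link}\n")
finally:
    driver.quit()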

5. Save the article content

Using the extracted links, request each article's full content and save it locally or to a database.
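
For the database option, a minimal sketch with the standard-library sqlite3 module might look like the following. The table layout and the function names open_store and store_article are assumptions made for illustration, and the sample URL and text are placeholders.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the articles table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, title TEXT NOT NULL, content TEXT NOT NULL)"
    )
    return conn


def store_article(conn, url, title, content):
    """Insert the article, or overwrite the row if this URL was stored before."""
    conn.execute(
        "INSERT OR REPLACE INTO articles (url, title, content) VALUES (?, ?, ?)",
        (url, title, content),
    )
    conn.commit()


if __name__ == "__main__":
    conn = open_store()  # pass a file name such as "articles.db" to persist the data
    store_article(conn, "https://www.shangyexinzhi.com/article/123",
                  "Sample title", "Sample body")
    print(conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0])
    conn.close()
```

Using the URL as the primary key means a repeated crawl updates existing rows instead of piling up duplicates.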

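Notes

Follow the robots protocol: make sure the pages you crawl are not disallowed by robots.txt.
Pace your requests: leave a reasonable interval between requests so the server is not overloaded.
Dynamic content: if the page loads its content dynamically, consider Selenium first.
Respect the site owner's wishes: if the site explicitly forbids crawling, stop.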
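
The note about pacing requests can be sketched as a small loop that pauses between fetches. The function name fetch_politely, the one-second default, and the choice to record failures instead of aborting are assumptions for illustration. The actual download is passed in as a callable so the loop itself has no third-party dependency.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    A failure on one URL is recorded in the result and does not stop the others.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        try:
            results[url] = fetch(url)
        except Exception as exc:  # broad on purpose: keep the crawl going
            results[url] = exc
    return results


if __name__ == "__main__":
    # Stand-in fetch function; a real one might wrap requests.get(url, timeout=10).text
    pages = fetch_politely(["a", "b"], lambda url: f"<html>{url}</html>", delay_seconds=0.1)
    print(pages)
```

With requests installed, the fetch argument could be something like lambda url: requests.get(url, timeout=10).text; a suitable delay depends on the site and is not something this sketch can prescribe.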

Part 3: Crawling all of the site's articles and exporting them as Word or PDF

To crawl the articles on the Shangye Xinzhi site and export them as Word or PDF, follow these steps:

Step 1: Crawl the article content

Analyze the site structure: use the browser's developer tools (F12) to inspect the HTML of an article page and identify the tags that hold the title, body, and other fields.
Write the crawler: use Python's requests and BeautifulSoup libraries to fetch the article content. If the page loads its content dynamically, use Selenium.

Example code:

import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Create the output directory
os.makedirs("articles", exist_ok=True)


# Fetch the article list
def get_article_list(url):
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, "html.parser")
    # Adjust the selector to the real HTML structure
    articles = soup.find_all("a", class_="article-link", href=True)
    return [(a.text.strip(), a["href"]) for a in articles]


# Fetch a single article
def get_article_content(url):
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, "html.parser")
    title = soup.find("h1").text.strip()  # article title
    content = soup.find("div", class_="article-content").text.strip()  # article body
    return title, content


# Save an article as a text file
def save_article(title, content):
    filename = f"articles/{title}.txt"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(content)
    print(f"Saved: {filename}")


# Main function
def main():
    base_url = "https://www.shangyexinzhi.com/"
    articles = get_article_list(base_url)
    for title, link in articles:
        full_url = urljoin(base_url, link)  # handles relative and absolute hrefs
        article_title, article_content = get_article_content(full_url)
        save_article(article_title, article_content)


if __name__ == "__main__":
    main()

Step 2: Export the article content as Word or PDF

Export to Word: use the python-docx library to save the article as a Word document.
Export to PDF: use the pdfkit library to convert HTML content to PDF.

Example code (export to Word):

from docx import Document


def save_as_word(title, content):
    doc = Document()
    doc.add_heading(title, level=1)
    doc.add_paragraph(content)
    filename = f"articles/{title}.docx"
    doc.save(filename)
    print(f"Saved as Word: {filename}")

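Example code (export to PDF):

import html

import pdfkit


def save_as_pdf(title, content):
    # Escape the text and declare UTF-8 so non-ASCII characters are not garbled
    html_content = (
        '<meta charset="utf-8">'
        f"<h1>{html.escape(title)}</h1><p>{html.escape(content)}</p>"
    )
    filename = f"articles/{title}.pdf"
    pdfkit.from_string(html_content, filename)
    print(f"Saved as PDF: {filename}")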

Step 3: Put the code together

Combine the code above in the main function so that it crawls the articles and exports them as Word or PDF.

from urllib.parse import urljoin


def main():
    base_url = "https://www.shangyexinzhi.com/"
    articles = get_article_list(base_url)
    for title, link in articles:
        full_url = urljoin(base_url, link)  # handles relative and absolute hrefs
        article_title, article_content = get_article_content(full_url)
        save_as_word(article_title, article_content)  # save as Word
        save_as_pdf(article_title, article_content)   # save as PDF

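Notes

Follow the robots protocol: make sure the pages you crawl are not disallowed by robots.txt.
Dynamic content: if the article content loads dynamically, use Selenium first.
File names: make sure file names contain no special characters, or saving may fail.

With the steps above, you can crawl the site's articles and export them as Word or PDF. Note that the sample code only follows the article links found on the start page. If you run into problems, relevant technical blog posts can help.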

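Part 4: The complete code for running the Python crawler

The code in the previous parts cannot be run as is: it is written against an assumed HTML structure, which may differ from the real site's, and it is missing some required configuration and dependencies. The following adjustments and additions are needed to make it run: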

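1. Adjust the HTML selectors

The code uses assumed HTML selectors (such as class_="article-link" and class_="article-content"). Adjust them to match the real site's HTML as follows:

Check the HTML structure

Open the Shangye Xinzhi site.
Use the browser's developer tools (F12) to inspect the actual HTML of the article list and the article body.
Find the HTML tags and class names that hold the article title, link, and body.

Example adjustment

Suppose the article list has the following HTML structure: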

<div class="article-list">
    <a href="/article/123" class="article-title">Article title</a>
    ...
</div>

and the article body has the following HTML structure:

<article>
    <h1>Article title</h1>
    <div class="content">Article body</div>
</article>

Then change the selectors in the code to:

articles = soup.find_all("a", class_="article-title")  # article list
title = soup.find("h1").text.strip()  # article title
content = soup.find("div", class_="content").text.strip()  # article body
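
Before pointing adjusted selectors at the live site, it can help to try them on a saved snippet. The sketch below does that with markup modeled on the assumed structures above; LIST_HTML and ARTICLE_HTML are illustrative samples, not the site's real HTML, and the helper names parse_list and parse_article are chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the assumed structures; not the live site's HTML.
LIST_HTML = """
<div class="article-list">
    <a href="/article/123" class="article-title">Article title</a>
</div>
"""
ARTICLE_HTML = """
<article>
    <h1>Article title</h1>
    <div class="content">Article body</div>
</article>
"""


def parse_list(html):
    """Return (title, href) pairs for every link matching the list selector."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", class_="article-title", href=True)
    return [(a.get_text(strip=True), a["href"]) for a in links]


def parse_article(html):
    """Return (title, body), or None when the selectors do not match."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1")
    body = soup.find("div", class_="content")
    if title is None or body is None:
        return None
    return title.get_text(strip=True), body.get_text(strip=True)


if __name__ == "__main__":
    print(parse_list(LIST_HTML))
    print(parse_article(ARTICLE_HTML))
```

Returning None when a selector misses makes a structure change on the site show up as an explicit case to handle, instead of an AttributeError in the middle of a crawl.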

2. Install the required Python libraries

The code uses several Python libraries (requests, BeautifulSoup, pdfkit, python-docx, and so on). Make sure they are installed; the following command installs them:

pip install requests beautifulsoup4 pdfkit python-docx

3. Configure pdfkit

pdfkit relies on a backend tool, wkhtmltopdf, to convert HTML to PDF, so install wkhtmltopdf first:

Windows: download and install wkhtmltopdf.

macOS/Linux: install it with a package manager:

brew install wkhtmltopdf  # macOS
sudo apt-get install wkhtmltopdf  # Ubuntu

After installation, make sure the directory containing the wkhtmltopdf executable is on the system PATH.
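
If changing PATH is inconvenient, pdfkit can also be pointed at the executable explicitly. The following is a sketch under that assumption; the helper names find_wkhtmltopdf and make_pdfkit_config are made up for illustration, while shutil.which and pdfkit.configuration are existing calls.

```python
import shutil


def find_wkhtmltopdf():
    """Return the full path of the wkhtmltopdf executable on PATH, or None."""
    return shutil.which("wkhtmltopdf")


def make_pdfkit_config(path=None):
    """Build a pdfkit configuration that points at a specific wkhtmltopdf binary."""
    import pdfkit  # imported here so the PATH check works without pdfkit installed

    path = path or find_wkhtmltopdf()
    if path is None:
        raise FileNotFoundError("wkhtmltopdf was not found on PATH")
    return pdfkit.configuration(wkhtmltopdf=path)


if __name__ == "__main__":
    print(find_wkhtmltopdf() or "wkhtmltopdf is not on PATH")
```

The returned object can then be passed as the configuration argument, for example pdfkit.from_string(html_content, filename, configuration=make_pdfkit_config()).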

4. Handle special characters in file names

Titles may contain characters that are not valid in file names (such as /, \, and :), which makes saving fail. Clean the file name first:

import re


def clean_filename(filename):
    return re.sub(r'[\\/*?:"<>|]', "", filename)

Then use clean_filename on the file name when saving:

filename = f"articles/{clean_filename(title)}.docx"
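
Removing invalid characters still leaves two edge cases: a title made up only of such characters becomes an empty name, and two articles with the same title overwrite each other. One possible way to cover both, sketched here with hypothetical helpers sanitize and unique_path and an arbitrary 100-character limit, is:

```python
import os
import re


def sanitize(name):
    """Remove characters that are invalid in file names, then trim spaces and dots."""
    return re.sub(r'[\\/*?:"<>|]', "", name).strip().rstrip(".")


def unique_path(directory, title, extension, max_length=100):
    """Build a path that is safe to write and does not overwrite an existing file."""
    stem = sanitize(title)[:max_length].strip() or "untitled"
    candidate = os.path.join(directory, f"{stem}{extension}")
    counter = 1
    while os.path.exists(candidate):
        # Same title seen before: append a numeric suffix
        candidate = os.path.join(directory, f"{stem}_{counter}{extension}")
        counter += 1
    return candidate


if __name__ == "__main__":
    print(unique_path("articles", 'What is "AI"?', ".docx"))
```

The save functions could then call unique_path("articles", title, ".docx") in place of building the file name by hand.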

5. Complete code example

Below is the complete adjusted code:

import html
import os
import re
from urllib.parse import urljoin

import pdfkit
import requests
from bs4 import BeautifulSoup
from docx import Document

# Create the output directory
os.makedirs("articles", exist_ok=True)


# Strip characters that are not allowed in file names
def clean_filename(filename):
    return re.sub(r'[\\/*?:"<>|]', "", filename)


# Fetch the article list
def get_article_list(url):
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, "html.parser")
    # Adjust the selector to the real HTML structure
    articles = soup.find_all("a", class_="article-title", href=True)
    return [(a.text.strip(), a["href"]) for a in articles]


# Fetch a single article
def get_article_content(url):
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, "html.parser")
    title = soup.find("h1").text.strip()  # article title
    content = soup.find("div", class_="content").text.strip()  # article body
    return title, content


# Save an article as a Word document
def save_as_word(title, content):
    doc = Document()
    doc.add_heading(title, level=1)
    doc.add_paragraph(content)
    filename = f"articles/{clean_filename(title)}.docx"
    doc.save(filename)
    print(f"Saved as Word: {filename}")


# Save an article as a PDF document
def save_as_pdf(title, content):
    # Escape the text and declare UTF-8 so non-ASCII characters are not garbled
    html_content = (
        '<meta charset="utf-8">'
        f"<h1>{html.escape(title)}</h1><p>{html.escape(content)}</p>"
    )
    filename = f"articles/{clean_filename(title)}.pdf"
    pdfkit.from_string(html_content, filename)
    print(f"Saved as PDF: {filename}")


# Main function
def main():
    base_url = "https://www.shangyexinzhi.com/"
    articles = get_article_list(base_url)
    for title, link in articles:
        full_url = urljoin(base_url, link)  # handles relative and absolute hrefs
        article_title, article_content = get_article_content(full_url)
        save_as_word(article_title, article_content)  # save as Word
        save_as_pdf(article_title, article_content)   # save as PDF


if __name__ == "__main__":
    main()

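6. Run the code

Before running the code, make sure that:

All required Python libraries are installed.
wkhtmltopdf is installed and on the PATH.
The selectors have been adjusted to the real site's HTML structure.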
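
The first two items on this checklist can be verified automatically. The sketch below is one way to do it with the standard library only; the function name missing_requirements is hypothetical, and the module names listed are the import names of the packages installed earlier (beautifulsoup4 imports as bs4, python-docx as docx).

```python
import importlib.util
import shutil

# Import names of the libraries the crawler uses
REQUIRED_MODULES = ("requests", "bs4", "docx", "pdfkit")
REQUIRED_EXECUTABLES = ("wkhtmltopdf",)


def missing_requirements(modules=REQUIRED_MODULES, executables=REQUIRED_EXECUTABLES):
    """Return the names of modules and executables that cannot be found."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    missing += [e for e in executables if shutil.which(e) is None]
    return missing


if __name__ == "__main__":
    problems = missing_requirements()
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("All requirements found")
```

The third item, matching the selectors to the real HTML, still has to be checked by hand in the browser.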

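After the code runs, the crawled articles are saved in Word and PDF format in the articles folder. If something goes wrong, use the error message to adjust the code, or check whether the site's HTML structure has changed.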