【Paper Tips】Note 4: Quickly Collecting Structured Data from Web Pages

A running collection of skills and tips that I find useful while writing papers.

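Contents

1. Problem to Solve
- 1.1 Problem description
- 1.2 Approach
2. Method in Detail
- 2.1 Preliminary notes
  - (1) Compliance of web crawling
- 2.2 Steps
  - 2.2.1 Download the HTML of the result pages (optional)
  - 2.2.2 Collect the entry links
  - 2.2.3 Extract the relevant content
- 2.3 Results
3. Open Questions
4. Summary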

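1. Problem to Solve

1.1 Problem description

I needed to collect and organize the search results from a website. The total volume of data is small, but there are many individual entries.

1.2 Approach

This was my first time using a web crawler. The approach has three steps:
(1) Download the HTML of the result pages (optional).
(2) Collect the entry links.
(3) Extract the relevant content.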

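2. Method in Detail

2.1 Preliminary notes

(1) Compliance of web crawling

Crawling must stay within what is permitted:

Comply with laws and regulations: crawling must comply with China's Cybersecurity Law, Personal Information Protection Law, Data Security Law, and other applicable laws and regulations. Make sure the crawled data does not infringe on anyone's privacy or trade secrets, and does not violate other legal provisions.
Respect the website's rules: follow the target site's robots protocol and do not crawl anything it disallows. Many sites declare which pages crawlers may or may not access in a robots.txt file under the root of their domain.
Use data reasonably: crawled data may only be used within what the law allows, and never for illegal purposes such as commercial espionage or malicious competition.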
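As a rough illustration of the robots.txt point, the sketch below uses Python's standard `urllib.robotparser` to test whether a URL may be fetched. The robots.txt content and the example.com URLs are made up so that the example runs offline; they are not the rules of any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "*") -> bool:
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
allowed = is_allowed(parser, "https://example.com/standard/1.html")
blocked = is_allowed(parser, "https://example.com/private/data.html")
print(allowed, blocked)
```

Against a live site, the same parser can load the real file with `parser.set_url(...)` followed by `parser.read()` instead of `parse()`.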

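Do not crawl sensitive data without authorization. The following practices call for particular caution:

Crawling non-public data: obtaining non-public data by breaking encryption, such as data on a company's internal servers or from an e-commerce site's encrypted APIs, is illegal.
Infringing on personal privacy: crawling personal data such as names, ID numbers, contact details, or home addresses and using it for unlawful purposes is illegal.
Disrupting or damaging a website: a crawler that interferes with a site's normal operation or damages it, for example by bringing down its servers, is illegal.
Unauthorized commercial use: using crawled data for commercial purposes without the site owner's permission may constitute infringement.
Violating a site's terms of use: many sites explicitly prohibit unauthorized crawling in their terms of service, and violating those terms may lead to legal liability.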
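One simple way to avoid putting load on a site is to enforce a minimum pause between requests. The sketch below is only an illustration: the `Throttle` class and the 2-second interval are my own assumptions, not something the target site prescribes.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        """Sleep just long enough to respect min_interval; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay


# Assumed interval of 2 seconds; pick a value appropriate for the site.
throttle = Throttle(min_interval=2.0)
```

Calling `throttle.wait()` right before each page request keeps consecutive requests at least `min_interval` seconds apart.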

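2.2 Steps

2.2.1 Download the HTML of the result pages (optional)

⚠️ Note before starting: if the URL of your search results can be accessed directly, skip this step and change the code to fetch the page HTML directly instead.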
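To make that switch painless, the HTML source can be hidden behind one helper that accepts either a URL or a saved file. This is a minimal sketch using only the standard library; the `load_html` name and the User-Agent string are placeholders I made up.

```python
from pathlib import Path
from urllib.request import Request, urlopen


def load_html(source: str, timeout: float = 10.0) -> str:
    """Return HTML from an http(s) URL or from a locally saved file."""
    if source.startswith(("http://", "https://")):
        # Placeholder User-Agent; replace it with one that identifies your crawler.
        request = Request(source, headers={"User-Agent": "paper-tips-demo/0.1"})
        with urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    # Anything else is treated as a path to a saved txt/html file.
    return Path(source).read_text(encoding="utf-8")
```

With this in place, the rest of the script can call `load_html(...)` without caring whether the page was downloaded by hand or fetched on the fly.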

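This walkthrough uses "collecting the number, title, and abstract of simulation-related ISO standards" as its example. First open the search site and enter the keyword simulation to get the list of results.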
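Press F12 to open the developer tools and switch to the Elements panel to see the HTML. Right-click a node (the top-level <html> node if you want the whole page), choose Edit as HTML, and copy the page's code into a local txt file.

The search returns 170 entries spread over 9 result pages, so the HTML of each of the 9 pages has to be saved to its own txt file.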


2.2.2 Collect the entry links

Open the HTML of any result page. The link of each entry sits in a fixed position, for example:

<a href="/standard/61645.html" title="Numerical welding simulation — Execution and documentation">ISO/TS 18166:2016</a>

Here /standard/61645.html is the entry link we need.
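To see how a regular expression picks this apart, the sketch below runs one against the sample line above. The named groups (`link`, `title`, `number`) are my own labels, and the pattern assumes every entry follows exactly this attribute order.

```python
import re

# The sample anchor from the saved search-result HTML.
sample = ('<a href="/standard/61645.html" '
          'title="Numerical welding simulation — Execution and documentation">'
          'ISO/TS 18166:2016</a>')

# Capture the relative link, the title attribute, and the visible standard number.
ENTRY_PATTERN = re.compile(
    r'<a href="(?P<link>/standard/\d+\.html)" '
    r'title="(?P<title>[^"]*)">(?P<number>[^<]*)</a>'
)

match = ENTRY_PATTERN.search(sample)
entry = match.groupdict() if match else {}
print(entry)
```

Only the `link` group is needed for this step; the other two groups show that the number and title could already be read from the result page.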

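✅ A handy trick: press F12 to open the developer tools and click "Select an element in the page to inspect it". Clicking an element on the page then jumps straight to its HTML, which makes it much easier to work out what to crawl.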

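With this key piece of information we can write the code. It does the following:

① Iterate over all txt files in a folder.
② Match the target elements with a regular expression.
③ Extract the entry link from each match.
④ Save all links to a single txt file.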

```python
import os
import re

# Folder and output file paths
folder_path = 'iso_simulation'  # folder holding the saved txt files
output_file = 'iso_links_' + folder_path + '.txt'  # file that receives the extracted links

# Regular expression that matches the entry links (the dot before "html" is escaped)
pattern = re.compile(r'<a href="(/standard/\d+\.html)" title="')
url_prefix = 'https://www.iso.org'

# Open the output file for writing
with open(output_file, 'w', encoding='utf-8') as outfile:
    # Iterate over every txt file in the folder
    for filename in os.listdir(folder_path):
        if filename.endswith('.txt'):
            file_path = os.path.join(folder_path, filename)
            # Read the saved HTML of one result page
            with open(file_path, 'r', encoding='utf-8') as infile:
                content = infile.read()
            # Find all matching links
            matches = pattern.findall(content)
            # Build the full URL for each match and write it out
            for match in matches:
                outfile.write(f'{url_prefix}{match}\n')

print(f"Extraction finished. Results saved to {output_file}")
```
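Before moving on, it can be worth checking that the collected links contain no repeats and that their number matches the 170 entries reported by the search. The helpers below are a small sketch of that check; their names are my own, and the second link in the demo list uses a made-up ID.

```python
def unique_in_order(links):
    """Drop repeated links while keeping the first-seen order."""
    return list(dict.fromkeys(links))


def report_count(links, expected):
    """Describe how the number of collected links compares with the expected total."""
    found = len(links)
    if found == expected:
        return f"OK: {found} links"
    return f"Mismatch: found {found} links, expected {expected}"


collected = [
    "https://www.iso.org/standard/61645.html",
    "https://www.iso.org/standard/61645.html",  # repeated on purpose
    "https://www.iso.org/standard/12345.html",  # hypothetical ID for the demo
]
unique_links = unique_in_order(collected)
summary = report_count(unique_links, expected=170)
print(summary)
```

A mismatch usually means that one of the result pages was not saved or was saved twice.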

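2.2.3 Extract the relevant content

With all entry links collected, first check on a single entry page which HTML elements we need to extract. Using the method from Section 2.2.2, the elements holding the number, title, and abstract turn out to be: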

<span class="d-block mb-3 ">ISO/TS 18166:2016</span>
<span class="lead d-block mb-3">Numerical welding simulation — Execution and documentation</span>
<div itemprop="description"><p></p><p>ISO/TS 18166:2016 provides a workflow for the execution, validation, verification and documentation of a numerical welding simulation within the field of computational welding mechanics (CWM). As such, it primarily addresses thermal and mechanical finite element analysis (FEA) of the fusion welding (see ISO/TR 25901:2007, 2.165) of metal parts and fabrications.</p>
<p>CWM is a broad and growing area of engineering analysis.</p>
<p>ISO/TS 18166:2016 covers the following aspects and results of CWM, excluding simulation of the process itself:</p>
<p>-      heat flow during the analysis of one or more passes;</p>
<p>-      thermal expansion as a result of the heat flow;</p>
<p>-      thermal stresses;</p>
<p>-      development of inelastic strains;</p>
<p>-      effect of temperature on material properties;</p>
<p>-      predictions of residual stress distributions;</p>
<p>-      predictions of welding distortion.</p>
<p>ISO/TS 18166:2016 refers to the following physical effects, but these are not covered in depth:</p>
<p>-      physics of the heat source (e.g. laser or welding arc);</p>
<p>-      physics of the melt pool (and key hole for power beam welds);</p>
<p>-      creation and retention of non-equilibrium solid phases;</p>
<p>-      solution and precipitation of second phase particles;</p>
<p>-      effect of microstructure on material properties.</p>
<p>The guidance given by this Technical Specification has not been prepared for use in a specific industry. CWM can be beneficial in design and assessment of a wide range of components. It is anticipated that it will enable industrial bodies or companies to define required levels of CWM for specific applications.</p>
<p>This Technical Specification is independent of the software and implementation, and therefore is not restricted to FEA, or to any particular industry.</p>
<p>It provides a consistent framework for-primary aspects of the commonly adopted methods and goals of CWM (including validation and verification to allow an objective judgment of simulation results).</p>
<p>Through presentation and description of the minimal required aspects of a complete numerical welding simulation, an introduction to computational welding mechanics (CWM) is also provided. (Examples are provided to illustrate the application of this Technical Specification, which can further aid those interested in developing CWM competency).</p>
<p>Clause 4 of this Technical Specification provides more detailed information relating to the generally valid simulation structure and to the corresponding application. Clause 5 refers to corresponding parts of this Technical Specification in which the structure for the respective application cases is put in concrete terms and examples are given. Annex A presents a documentation template to promote the consistency of the reported simulation results.</p><p></p></div>

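Once the elements are located, they are turned into matching rules (regular expressions or, as in the script below, BeautifulSoup lookups). The code does the following:

① Read the txt file to get all entry links.
② Match the target elements on each entry page.
③ Extract the content and save it to a Word document.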

```python
import sys

import requests
from bs4 import BeautifulSoup
from docx import Document


def extract_iso_info_from_web(urls, doc_file):
    # Create a Word document
    doc = Document()
    for num, url in enumerate(urls, start=1):
        # Send an HTTP request to fetch the page (the timeout avoids hanging forever)
        print(f"Reading entry {num}...")
        response = requests.get(url, timeout=30)
        if response.status_code != 200:
            print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
            continue
        html_content = response.text

        # Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(html_content, 'html.parser')

        # The page <title> holds the ISO number and name, separated by ' - '
        title_tag = soup.find('title')
        if title_tag:
            title_content = title_tag.get_text()
            iso_info = title_content.split(' - ')
            iso_number = iso_info[0].strip()
            iso_name = ' - '.join(iso_info[1:]).strip()
        else:
            print("ISO number and name not found")
            iso_number = ""
            iso_name = ""

        # Find the div that contains the abstract
        abstract_div = soup.find('div', {'itemprop': 'description'})
        if abstract_div:
            # Remove empty lines and surrounding whitespace
            abstract_content = abstract_div.get_text(separator='\n', strip=True)
            abstract_content = '\n'.join(
                line for line in abstract_content.split('\n') if line.strip())
        else:
            print("Abstract not found")
            abstract_content = ""

        # Add the ISO number, name, and abstract to the document
        doc.add_heading(iso_number, level=1)
        doc.add_heading('Title', level=2)
        doc.add_paragraph(iso_name)
        doc.add_heading('Abstract', level=2)
        doc.add_paragraph(abstract_content)
        print(f"Entry {num} done. Number: {iso_number}")

    # Save the Word document
    doc.save(doc_file)
    print(f"All ISO numbers, names, and abstracts saved to {doc_file}")


link_file = 'iso_links_iso_simulation.txt'

# Take the word between the last '_' and '.txt' as the keyword
last_underscore_index = link_file.rfind('_')
txt_index = link_file.find('.txt')
if last_underscore_index != -1 and txt_index != -1 and last_underscore_index < txt_index:
    keyword = link_file[last_underscore_index + 1:txt_index]
else:
    # Stop here: without a keyword the output file name cannot be built
    sys.exit("The file name does not match the expected format")

# Read the links from the txt file and count them
with open(link_file, 'r', encoding='utf-8') as f:
    urls = [line.strip() for line in f if line.strip()]
total_lines = len(urls)
doc_file = 'iso_keyword_' + keyword + '_num_' + str(total_lines) + '.docx'

# Run the extraction
extract_iso_info_from_web(urls, doc_file)
```
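The script relies on BeautifulSoup lookups. For comparison, a regex-only variant of the same extraction might look like the sketch below. It is written against the three elements shown earlier and tested on a shortened copy of them; the pattern names and `extract_fields` are my own, and the description pattern assumes there is no nested `<div>` inside the abstract.

```python
import html
import re

NUMBER_PATTERN = re.compile(r'<span class="d-block mb-3\s*">(?P<number>[^<]*)</span>')
NAME_PATTERN = re.compile(r'<span class="lead d-block mb-3\s*">(?P<name>[^<]*)</span>')
DESCRIPTION_PATTERN = re.compile(
    r'<div itemprop="description">(?P<body>.*?)</div>', re.DOTALL)
TAG_PATTERN = re.compile(r'<[^>]+>')


def extract_fields(page_html):
    """Return the number, name, and abstract found in page_html (empty if missing)."""
    number = NUMBER_PATTERN.search(page_html)
    name = NAME_PATTERN.search(page_html)
    description = DESCRIPTION_PATTERN.search(page_html)
    abstract = ""
    if description:
        # End of paragraph becomes a newline, other tags are dropped, entities decoded.
        text = description.group("body").replace("</p>", "\n")
        text = html.unescape(TAG_PATTERN.sub("", text))
        abstract = "\n".join(line.strip() for line in text.splitlines() if line.strip())
    return {
        "number": number.group("number").strip() if number else "",
        "name": name.group("name").strip() if name else "",
        "abstract": abstract,
    }


# Shortened copy of the elements shown above.
sample_page = (
    '<span class="d-block mb-3 ">ISO/TS 18166:2016</span>\n'
    '<span class="lead d-block mb-3">Numerical welding simulation — '
    'Execution and documentation</span>\n'
    '<div itemprop="description"><p></p>'
    '<p>CWM is a broad and growing area of engineering analysis.</p>\n'
    '<p>-      thermal stresses;</p><p></p></div>'
)
fields = extract_fields(sample_page)
print(fields["number"], "|", fields["name"])
```

Regular expressions like these break as soon as the page markup changes slightly, which is one reason an HTML parser such as BeautifulSoup tends to be the more robust choice for this step.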