爬虫笔记0

问题梳理:

<dl>：Definition List（定义列表）
<dt>：Definition Term（一般放标题）
<dd>：Definition Description（定义列表项，数据所在）
<ul>：Unordered List（无序列表）
<li>：List Item（列表项，数据所在）

*爬虫流程

发送请求（requests）（看url，header，*参数）
获取响应
解析数据（re、xpath、css选择器），parsel集成三种方式
1. re：re模块
2. xpath：lxml模块
3. css选择器：bs4模块
保存数据

* headers怎么找

抓包看请求该接口的参数，对照填写

需要传递的

User-Agent：模拟浏览器身份标识
Host：域名
Referer：访问来源
Cookie：验证身份（user= 100）,保持登录状态

*翻页的用什么函数

抓包找接口，看数据来自哪个接口，模拟浏览器发送相同的请求

*后台返回的数据格式

1、后台传递样式 + 数据

css选择器

<!doctype html>
<html>
<head>
<meta charset="UTF-8">
<title>段落文本</title>
</head><body><h2 align="center">小说标题:鸿蒙主宰</h2><p align="left">混沌初开，乾坤始奠，武道起始。诸雄争霸！强者如云！世人只知源始一族，为天下共尊，却不知其为鹰犬。后世秦羽，封印绕身，一朝化龙.......</p>
</body>
</html>

![外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传]https://img-home.csdnimg.cn/images/20230724024159.png?origin_url=images%2Fimage-20240527205556956.png&pos_id=img-8WsJ6HCJ-1719992708245))

2、后台只传递数据，样式由前端编写，统一组装

后端传递的

{"title":"鸿蒙主宰","content":"混沌初开，乾坤始奠，武道起始。诸雄争霸！强者如云！世人只知源始一族，为天下共尊，却不知其为鹰犬。后世秦羽，封印绕身，一朝化龙......."
}

前端编写的

<!doctype html>
<html>
<head>
<meta charset="UTF-8">
<title>段落文本</title>
</head><body><h2 align="center">小说标题:{{title}}</h2><p align="left">{{content}}</p>
</body>
</html>

如何下载后合并成一个TXT

import os
import redef merge_txt_files(directory, output_file):# 获取指定目录中的所有文件files = os.listdir(directory)# 使用正则表达式匹配文件名格式 1-xxx.txt, 2-xxx.txt, ...pattern = re.compile(r'^(\d+)-.+\.txt$')# 创建一个列表来存储匹配的文件和它们的序号matched_files = []for file in files:match = pattern.match(file)if match:# 获取文件的序号seq_num = int(match.group(1))matched_files.append((seq_num, file))# 按序号排序matched_files.sort()# 打开输出文件with open(output_file, 'w', encoding='utf-8') as outfile:for seq_num, filename in matched_files:file_path = os.path.join(directory, filename)with open(file_path, 'r', encoding='utf-8') as infile:content = infile.read()outfile.write(content)outfile.write('\n')  # 每个文件内容后添加换行符# 示例用法
# directory_path = 'path/to/your/directory'
directory_path = 'test'
# output_file_path = 'path/to/your/output_file.txt'
output_file_path = 'out.txt'
merge_txt_files(directory_path, output_file_path)
print('执行完成')

selector.css函数怎么使用？ && Parsel怎么使用？

略

https://spa3.scrape.center/

iles(directory_path, output_file_path)
print(‘执行完成’)

## selector.css函数怎么使用？ && Parsel怎么使用？略https://spa3.scrape.center/

爬虫笔记0

问题梳理:

*爬虫流程

* headers怎么找

*翻页的用什么函数

*后台返回的数据格式

如何下载后合并成一个TXT

selector.css函数怎么使用？ && Parsel怎么使用？

相关文章

Kafka如何防止消息丢失

CTF常用sql注入（三）无列名注入

VSCode使用Makefile管理工程

unix高级编程系列之文件I/O

【HICE】dns正向解析

MinIO:开源对象存储解决方案的领先者

firewalld 高级配置

【优化论】约束优化算法