The requests module
A third-party Python module for sending network requests; its job is to simulate a browser issuing a request.
Workflow: specify the URL → send the request → get the response data → persist it.
pro1: scrape the page data of the Sogou homepage
A basic crawler:
python">import requests
if __name__ == '__main__':url='https://www.sogou.com'res=requests.get(url)page_data=res.textwith open('pro01.html','w',encoding='utf-8') as file:file.write(page_data)print('结束')
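The snippet above writes the body no matter what came back. A minimal hardening sketch of the same fetch: fail on bad status codes and let requests guess the charset before persisting.

```python
import requests

# Hardening sketch for the same Sogou fetch.
url = 'https://www.sogou.com'
res = requests.get(url, timeout=10)   # fail fast instead of hanging forever
res.raise_for_status()                # raise on 4xx/5xx instead of saving an error page
res.encoding = res.apparent_encoding  # guess the charset from the body, not only the headers
with open('pro01.html', 'w', encoding='utf-8') as file:
    file.write(res.text)
```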
UA spoofing
python">import requests#UA检测User_Agent--请求载体的身份标识
#门户网站的服务器会检测对应请求的载体身份标识,如果检测到请求的载体身份标识为某一款浏览器,说明该请求是一个正常的请求
#如果请求的载体身份标识,不基于某一款浏览器, 则表示该请求为不正常的请求,则服务器端就会拒绝该次请求
#UA伪装:让爬虫对应的请求载体身份标识伪装成某一款浏览器
if __name__ == '__main__':url='https://www.sogou.com/web'#处理url携带的参数:封装到字典中#UA伪装headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}kw=input('enter a word')param={'query':kw}#对指定的url发起的请求是携带参数的,并且请求过程中处理了参数resp=requests.get(url=url,params=param,headers=headers)page_text=resp.textfile_name=kw+'.html'with open(file_name,'w',encoding='utf-8') as file:file.write(page_text)print(file_name,'save')
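To confirm that the spoofed header and the params dict really go out on the wire, one option is to hit an echo service such as httpbin.org (assuming network access; the endpoint simply reflects the request back as JSON):

```python
import requests

# httpbin.org/get echoes the request, so we can inspect what the server received.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
resp = requests.get('https://httpbin.org/get', params={'query': 'test'}, headers=headers)
echo = resp.json()
print(resp.url)                       # params were encoded into the query string
print(echo['headers']['User-Agent'])  # the spoofed UA the server saw
```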
POST requests and persisting a JSON response
python">import requests
import json
if __name__ == '__main__':post_url='https://fanyi.baidu.com/sug'#post请求参数处理kw = input('enter a word')data={'kw':kw}#UA伪装headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36'}resp=requests.post(url=post_url,data=data,headers=headers)#JSON方法返回的一个obj(前提是响应数据是JSON类型的)dic_obj=resp.json()# print(dic_obj)fp=open('pro03.json','w',encoding='utf-8')json.dump(dic_obj,fp=fp,ensure_ascii=False)
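json.load is the inverse of json.dump, so the saved file can be read back as the same object. A small usage sketch (the 'data'/'k'/'v' keys reflect the shape the Baidu sug endpoint is assumed to return; guard with .get in case it differs):

```python
import json

# Read the response persisted above back into a Python object.
with open('pro03.json', 'r', encoding='utf-8') as fp:
    dic_obj = json.load(fp)
# Assumed response shape: {'errno': 0, 'data': [{'k': ..., 'v': ...}, ...]}
for item in dic_obj.get('data', []):
    print(item.get('k'), '->', item.get('v'))
```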
Data parsing
Parsing approaches: regular expressions (a minimal sketch follows this list), bs4, xpath.
The local text content to be parsed is stored either between tags or in a tag's attributes, so parsing is always two steps:
- locate the target tags
- extract and parse the data stored in the tags or in their attributes
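bs4 and xpath each get a worked example below; regular expressions do not, so here is a minimal sketch (the HTML fragment and pattern are made up for illustration):

```python
import re

# Extract every img src attribute from an HTML fragment with a regex.
html = '<div class="pic"><img src="/a.jpg" alt="x"><img src="/b.jpg" alt="y"></div>'
srcs = re.findall(r'<img src="(.*?)"', html)  # non-greedy group between the quotes
print(srcs)  # ['/a.jpg', '/b.jpg']
```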
python">#获取图片
import requestsif __name__ == '__main__':url='https://img3.doubanio.com/view/photo/m_ratio_poster/public/p2917594343.jpg'#content返回二进制形式的图片数据img_resp=requests.get(url).contentwith open('pro01.jpg','wb') as file:file.write(img_resp)
python">#bs4进行数据解析
#实例化一个BS对象,并将页面源码数据加载到该对象中
#通过调用BS对象中相关的属性或者方法及逆行标签定位和数据提取
#对象的实例化:#将本地的html文件加载到对象中#将互联网的html文件加载到对象中
----------------------------------------------
```python
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # load a local HTML file into the BeautifulSoup object
    fp = open('test.html', 'r', encoding='utf-8')
    soup = BeautifulSoup(fp, 'lxml')
    print(soup.find(class_='highlight'))
```
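The attributes and methods mentioned above, demonstrated on an inline document (a sketch; the tag names and classes are invented for illustration):

```python
from bs4 import BeautifulSoup

# Common BeautifulSoup accessors on a small inline document.
doc = '<div class="song"><a href="/x" title="t">hello</a><p>world</p></div>'
soup = BeautifulSoup(doc, 'lxml')
print(soup.a)                           # first <a> tag in the document
print(soup.find('div', class_='song'))  # first tag matching name/attributes
print(soup.find_all('p'))               # list of every matching tag
print(soup.select('.song > a'))         # CSS selector, returns a list
print(soup.a.text, soup.a['href'])      # text content and attribute access
```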
----------------------------------------------
```python
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36'
    }
    page_text = requests.get(url=url, headers=headers)
    page_text.encoding = 'utf-8'
    # parse the chapter titles and the detail-page urls out of the index page
    soup = BeautifulSoup(page_text.text, 'lxml')
    # print(soup)
    lst = soup.select('.tabli')
    # print(lst[0].text)
    fp = open('bs402.txt', 'w', encoding='utf-8')
    for i in lst:
        title = i.text
        detail_url = 'https://www.shicimingju.com' + i.get('href')
        fp.write(title + ':' + detail_url + '\n')
    fp.close()
```
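The loop above only records the detail-page URLs. Fetching an actual chapter needs one more request per URL; a sketch, assuming the chapter body sits in a div with class chapter_content (both the URL and the class name are guesses about the site's markup):

```python
import requests
from bs4 import BeautifulSoup

# Follow one detail URL and extract the chapter text.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
detail_url = 'https://www.shicimingju.com/book/sanguoyanyi/1.html'  # hypothetical chapter URL
resp = requests.get(url=detail_url, headers=headers)
resp.encoding = 'utf-8'
soup = BeautifulSoup(resp.text, 'lxml')
div = soup.find('div', class_='chapter_content')  # assumed container class
if div is not None:
    print(div.text)
```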
python">#xpath解析:最常用且最搞笑的一种解析方式
#原理:#实例化一个etree对象,且需要将被解析的页面源码数据加载到该对象中#调用etree对象中的xpath方法结合xpath表达式实现标签的定位和内容的捕获
#实例化对象:#将本地的html文档中的源码加载到etree对象中#将互联网上获取的源码对象数据加载到该对象中
-----------------------------------------------------------------------
```python
import requests
from lxml import etree

if __name__ == '__main__':
    url = 'https://cn.58.com/ershoufang'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36'
    }
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # absolute path down to each listing's <h3>, capturing its title attribute
    titles = tree.xpath('/html/body/div[1]/div/div/section/section[3]/section[1]/section[2]/div/a/div[2]/div[1]/div[1]/h3/@title')
    # save the data
    with open('xpath02.txt', 'w', encoding='utf-8') as fp:
        for title in titles:
            fp.write(title + '\n')
```
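The absolute path above breaks as soon as the page layout shifts; relative expressions anchored on attributes are usually more robust. A sketch of the common forms on an inline fragment (the fragment is invented), plus the local-file variant of instantiation:

```python
from lxml import etree

# Common xpath expressions on an inline HTML fragment.
html = '<ul class="list"><li><a href="/a" title="first">A</a></li><li><a href="/b" title="second">B</a></li></ul>'
tree = etree.HTML(html)
print(tree.xpath('//li/a/@title'))                  # attribute capture: ['first', 'second']
print(tree.xpath('//ul[@class="list"]//a/text()'))  # text capture: ['A', 'B']
print(tree.xpath('//li[2]/a/@href'))                # indexing is 1-based: ['/b']
# local-file instantiation uses a tolerant HTML parser:
# tree = etree.parse('test.html', etree.HTMLParser())
```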