会员购 Project Interview Question Analysis: Efficient Data Scraping and Exception Handling


The 会员购 Project

Highlights

  • Logs key information while scraping
  • Uses coroutines to fetch data asynchronously, greatly improving scraping speed
  • Catches exceptions and adds a retry mechanism (a minimal standalone sketch of this pattern follows this list)
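The second and third highlights combine into one pattern: a semaphore caps the number of requests in flight while each request retries with exponential backoff on failure. Below is a minimal standalone sketch of that pattern; the names fetch_json and demo and the placeholder URLs are illustrative only and not part of the project code, which appears in full in the Source Code section.

import asyncio
import logging

import aiohttp

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s : %(message)s')


async def fetch_json(session, url, semaphore, retry_limit=3):
    # Cap concurrency with the shared semaphore; retry with exponential backoff on failure
    async with semaphore:
        for attempt in range(retry_limit):
            try:
                async with session.get(url) as resp:
                    return await resp.json()
            except aiohttp.ClientError as e:  # covers disconnects, bad content types, etc.
                logging.error("attempt %d failed for %s: %s", attempt + 1, url, e)
                if attempt < retry_limit - 1:
                    await asyncio.sleep(2 ** attempt)  # back off 1s, then 2s, ...
        return None  # all attempts failed


async def demo():
    semaphore = asyncio.Semaphore(4)  # at most 4 requests in flight
    async with aiohttp.ClientSession() as session:
        urls = ["https://httpbin.org/json"] * 3  # placeholder URLs
        results = await asyncio.gather(*(fetch_json(session, url, semaphore) for url in urls))
        logging.info("got %d non-empty results", sum(1 for r in results if r))


if __name__ == '__main__':
    asyncio.run(demo())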

Source Code

import logging
import time
import requests
import asyncio
import aiohttp
from aiohttp import ContentTypeError
import csv

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s : %(message)s')


# Parse the result data
def parse_data(data):
    if data:
        for meeting in data:
            project_id = meeting['project_id']
            project_name = meeting['project_name']
            start_time = meeting['start_time']
            venue_name = meeting['venue_name']
            price_low = meeting['price_low'] / 100
            price_high = meeting['price_high'] / 100
            yield {
                'project_id': project_id,
                'project_name': project_name,
                'start_time': start_time,
                'venue_name': venue_name,
                'price_low': price_low,
                'price_high': price_high
            }


# Save a record to a CSV file
def save_file(city_info, city_id):
    if city_info:
        with open(f'{city_id}.csv', 'a+', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow([
                f'{city_info["project_id"]}', f'{city_info["project_name"]}', f'{city_info["start_time"]}',
                f'{city_info["venue_name"]}', f'{city_info["price_low"]}', f'{city_info["price_high"]}'
            ])


class Myspider(object):
    # p_type values accepted by the API: performances, exhibitions, local life
    types_list = ['演出', '展览', '本地生活']
    cities_id_list = []
    failed_urls = []
    CONCURRENTCY = 4
    RETRY_LIMIT = 3

    def __init__(self):
        self.session = None
        self.semaphore = asyncio.Semaphore(Myspider.CONCURRENTCY)

    # Fetch the city IDs and store them in the class attribute
    @staticmethod
    def set_cities_id():
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0'}
        cities_data = requests.get("https://show.bilibili.com/api/ticket/city/list?channel=4",
                                   headers=headers).json()['data']
        developed_cities_id = [city['id'] for city in cities_data['list']]
        developing_cities_id = [city['id'] for part in cities_data['more'] for city in part['list']]
        Myspider.cities_id_list = developed_cities_id + developing_cities_id
        return None

    # Handle a single task: scrape one page of listings
    async def get_every_page_info(self, url):
        async with self.semaphore:
            logging.info(f"scraping {url}")
            for attempt in range(Myspider.RETRY_LIMIT):
                try:
                    async with self.session.get(url) as response:
                        data = await response.json()
                        return data["data"]["result"]
                except ContentTypeError:
                    logging.info(f"error occurred when scraping {url}", exc_info=True)
                except aiohttp.ServerDisconnectedError:
                    # Catch the more specific disconnect error before the generic ClientError
                    logging.error(f"Server disconnected: {url}", exc_info=True)
                    if attempt < Myspider.RETRY_LIMIT - 1:
                        await asyncio.sleep(2 ** attempt)
                        continue
                except aiohttp.ClientError as e:
                    logging.error(f"ClientError on {url}: {e}", exc_info=True)
                    if attempt < Myspider.RETRY_LIMIT - 1:
                        await asyncio.sleep(2 ** attempt)  # Exponential backoff
                        continue
            Myspider.failed_urls.append(url)
            return None  # Return None if all retry attempts fail

    # Get the maximum page count for this category in this city
    def get_max_page(self, url):
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0'}
        response = requests.get(url, headers=headers)
        data = response.json()
        return data["data"]["numPages"]

    # Main method: build the task list and scrape with 4 concurrent coroutines
    async def main(self):
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0'}
        # Initialize the session (headers go here; proxies, cookies, etc. could be added as well)
        async with aiohttp.ClientSession(headers=headers) as session:
            self.session = session
            for type in Myspider.types_list:
                for city_id in Myspider.cities_id_list:
                    begin_url = "https://show.bilibili.com/api/ticket/project/listV2?version=134&page=1&pagesize=16&area={}&filter=&platform=web&p_type={}".format(
                        city_id, type)
                    max_page = self.get_max_page(begin_url)
                    # Build the task list
                    scrapy_tasks = [self.get_every_page_info(
                        "https://show.bilibili.com/api/ticket/project/listV2?version=134&page={}&pagesize=16&area={}&filter=&platform=web&p_type={}".format(
                            page, city_id, type)) for page in range(1, max_page + 1)]
                    # Run the tasks concurrently and collect the results
                    scrapy_results = await asyncio.gather(*scrapy_tasks)
                    # Parse the result data
                    for result in scrapy_results:
                        data = parse_data(result)
                        for city_info in data:
                            print(city_info)
                            save_file(city_info, city_id)
            # Close the session (redundant inside the async with block, but harmless)
            await self.session.close()


if __name__ == '__main__':
    # Start time
    start_time = time.time()
    # Fetch the city IDs and populate the class attribute cities_id_list
    Myspider.set_cities_id()
    # Initialize Myspider
    spider = Myspider()
    # Create the event loop
    loop = asyncio.get_event_loop()
    # Run the main coroutine to completion
    loop.run_until_complete(spider.main())
    # End time
    end_time = time.time()
    logging.info(f"total_time: {end_time - start_time}")
    # print(spider.get_max_page('https://show.bilibili.com/api/ticket/project/listV2?version=134&page=1&pagesize=16&area=110100&filter=&platform=web&p_type=%E5%85%A8%E9%83%A8%E7%B1%BB%E5%9E%8B'))
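One note on the entry point: asyncio.run() (available since Python 3.7) is the recommended way to drive a top-level coroutine, and newer interpreters warn about fetching an event loop manually when none is running. A minimal alternative __main__ block, reusing the imports and the Myspider class from the listing above, could look like this; it also reports the URLs collected in failed_urls after all retries were exhausted.

if __name__ == '__main__':
    start_time = time.time()
    Myspider.set_cities_id()            # populate the class-level city ID list
    spider = Myspider()
    asyncio.run(spider.main())          # create, run, and close the event loop in one call
    if Myspider.failed_urls:            # URLs that exhausted all retry attempts
        logging.warning("failed urls: %s", Myspider.failed_urls)
    logging.info(f"total_time: {time.time() - start_time}")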

More curated content: [CodeRealm]



