Using Python to download high-resolution cover images of Nature-family journals


As one of the most prestigious journals in science, Nature also maintains consistently strong cover design, blending scientific and artistic appeal.

To make collecting Nature-family covers quick and convenient, the scripts here use the Python requests module to automate the HTTP requests and the BeautifulSoup module to parse the returned HTML.
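The trick both scripts rely on is simple: each volume page on nature.com embeds a small (w200-wide) thumbnail of every issue's cover, and replacing w200 with w1000 in the thumbnail URL returns a high-resolution version of the same image. A minimal sketch of that idea, assuming the volume pages keep the markup the scripts below rely on, and using volume 500 purely as an example:

import requests
from bs4 import BeautifulSoup

kv = {'User-Agent': 'Mozilla/5.0'}
resp = requests.get('https://www.nature.com/nature/volumes/500', headers=kv, timeout=60)
soup = BeautifulSoup(resp.text, 'html.parser')
for img in soup.find_all('img'):
    src = img.get('src', '')
    if 'w200' in src:  # cover thumbnails carry a width token in their URL
        print('https:' + src.replace('w200', 'w1000'))  # high-resolution cover URL

The complete script below wraps the same logic in a loop over volumes and writes each cover to disk.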

import requests
from bs4 import BeautifulSoup
import os

path = 'C:\\Users\\User\\Desktop\\nature covers\\nature'
# path = os.getcwd()
if not os.path.exists(path):
    os.makedirs(path)
    print("Created folder: " + path)

# Set which volumes to download here.
# Downloads run backwards, so start_volume should be >= end_volume.
start_volume = 501
end_volume = 500
# nature_url = 'https://www.nature.com/ng/volumes/'     # Nature Genetics
nature_url = 'https://www.nature.com/nature/volumes/'    # Nature (flagship journal)
kv = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}

while start_volume >= end_volume:
    try:
        volume_url = nature_url + str(start_volume)
        volume_response = requests.get(url=volume_url, headers=kv, timeout=120)
    except Exception:
        print(str(start_volume) + " request failed")
        with open(path + "\\errors.txt", 'at') as txt:
            txt.write(str(start_volume) + " request failed\n")
        start_volume -= 1  # move on to the next volume instead of retrying forever
        continue
    volume_response.encoding = 'utf-8'
    volume_soup = BeautifulSoup(volume_response.text, 'html.parser')
    # the <ul> that holds one <img> thumbnail per issue of this volume
    ul_tag = volume_soup.find_all(
        'ul',
        class_='ma0 clean-list grid-auto-fill grid-auto-fill-w220 very-small-column medium-row-gap')
    img_list = ul_tag[0].find_all("img")
    issue_number = 0
    for img_tag in img_list:
        issue_number += 1
        filename = path + '\\' + str(start_volume) + '_' + str(issue_number) + '.png'
        if os.path.exists(filename):
            print(filename + " already exists")
            continue
        print("Loading...........................")
        # swap the w200 thumbnail for the w1000 high-resolution version
        img_url = 'https:' + img_tag.get("src").replace("w200", "w1000")
        try:
            img_response = requests.get(img_url, timeout=240, headers=kv)
        except Exception:
            print(start_volume, issue_number, 'request failed')
            with open(path + "\\errors.txt", 'at') as txt:
                txt.write(str(start_volume) + '_' + str(issue_number) + " request failed\n")
            continue
        with open(filename, 'wb') as imgfile:
            imgfile.write(img_response.content)
        print("Downloaded cover: " + str(start_volume) + '_' + str(issue_number))
    start_volume -= 1

Run result:
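To point the same script at Nature Genetics instead, only the settings at the top need to change, switching to the commented-out URL (the volume numbers and path below are placeholders, not a tested range):

# nature_url = 'https://www.nature.com/nature/volumes/'
nature_url = 'https://www.nature.com/ng/volumes/'   # Nature Genetics
path = 'C:\\Users\\User\\Desktop\\nature covers\\nature genetics'
start_volume = 54
end_volume = 50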

The code above can automatically download the covers of Nature and Nature Genetics; these two journals' volume pages are structured slightly differently from the other sub-journals. The other sub-journals can be scraped with the following code:

import requests
from bs4 import BeautifulSoup
import os

other_journals = {
    'nature biomedical engineering': 'natbiomedeng',
    'nature methods': 'nmeth',
    'nature astronomy': 'natastron',
    'nature medicine': 'nm',
    'nature protocols': 'nprot',
    'nature microbiology': 'nmicrobiol',
    'nature cell biology': 'ncb',
    'nature nanotechnology': 'nnano',
    'nature immunology': 'ni',
    'nature energy': 'nenergy',
    'nature materials': 'nmat',
    'nature cancer': 'natcancer',
    'nature neuroscience': 'neuro',
    'nature machine intelligence': 'natmachintell',
    'nature metabolism': 'natmetab',
    'nature food': 'natfood',
    'nature ecology & evolution': 'natecolevol',
    'nature structural & molecular biology': 'nsmb',
    'nature physics': 'nphys',
    'nature human behaviour': 'nathumbehav',
    'nature chemical biology': 'nchembio'
}
nature_journal = {
    # journals to download go here
    'nature plants': 'nplants',
    'nature biotechnology': 'nbt'
}
folder_Name = "nature covers"
kv = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}


def makefile(path):
    # create the folder if it does not exist; raise if it is already there
    if not os.path.exists(path):
        os.makedirs(path)
        print("Make folder -- " + path + " -- successfully!")
    else:
        raise AssertionError


def getCover(url, journal, year, filepath, startyear=2022, endyear=2022):
    # endyear should be <= startyear: volumes are walked from the newest backwards,
    # and only issues whose year falls inside [endyear, startyear] are downloaded
    if not (endyear <= year <= startyear):
        return
    try:
        issue_response = requests.get("https://www.nature.com" + url, timeout=120, headers=kv)
    except Exception:
        print(journal + "  " + str(year) + "  Error")
        return
    issue_response.encoding = 'utf-8'
    if 'Page not found' in issue_response.text:
        print(journal + "  Page not found")
        return
    issue_soup = BeautifulSoup(issue_response.text, 'html.parser')
    cover_image = issue_soup.find_all("img", class_='image-constraint pt10')
    for image in cover_image:
        image_url = image.get("src")
        print("Start loading img.............................")
        # swap the w200 thumbnail for the w1000 high-resolution version
        image_url = image_url.replace("w200", "w1000")
        # the last path segment of the cover URL encodes the issue's month
        if image_url[-2] == '/':
            month = "0" + image_url[-1]
        else:
            month = image_url[-2:]
        image_name = nature_journal[journal] + "_" + str(year) + "_" + month + ".png"
        if os.path.exists(filepath + journal + "\\" + image_name):
            print(image_url + " already exists")
            continue
        print(image_url)
        try:
            image_response = requests.get("http:" + image_url, timeout=240, headers=kv)
        except Exception:
            print("Failed to fetch image: " + image_name)
            continue
        with open(filepath + journal + "\\" + image_name, 'wb') as downloaded_img:
            downloaded_img.write(image_response.content)


def main():
    path = os.getcwd() + '\\'
    try:
        makefile(path + folder_Name)
    except AssertionError:
        print("Folder -- " + folder_Name + " -- already exists")
    path = path + folder_Name + "\\"
    for journal in nature_journal:
        try:
            makefile(path + journal)
        except AssertionError:
            print("Folder -- " + path + journal + " -- already exists!")
        try:
            volume_response = requests.get(
                "https://www.nature.com/" + nature_journal[journal] + "/volumes",
                timeout=120, headers=kv)
        except Exception:
            print(journal + " request failed")
            continue
        volume_response.encoding = 'utf-8'
        volume_soup = BeautifulSoup(volume_response.text, 'html.parser')
        volume_list = volume_soup.find_all(
            'ul',
            class_='clean-list ma0 clean-list grid-auto-fill medium-row-gap background-white')
        number_of_volume = 0
        for volume_child in volume_list[0].children:
            if volume_child == '\n':
                continue
            issue_url = volume_child.find_all("a")[0].get("href")
            print(issue_url)
            # the newest volume is assumed to be from 2020 here; keep this base year
            # consistent with startyear/endyear in getCover, or nothing will be saved
            print(2020 - number_of_volume)
            getCover(issue_url, journal, year=(2020 - number_of_volume),
                     filepath=path, startyear=2022, endyear=2022)
            number_of_volume += 1


if __name__ == "__main__":
    main()
    print("Finish Everything!")

Run result:
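To scrape additional sub-journals, the corresponding entries just need to be copied from other_journals into nature_journal before running the script; the slugs in other_journals are the path segments nature.com uses for each journal. A hypothetical selection, for example:

nature_journal = {
    'nature methods': 'nmeth',    # slugs copied from other_journals
    'nature physics': 'nphys',
}

Keep in mind that getCover only saves issues whose year falls between endyear and startyear, and that the year passed in main() is derived from the hard-coded base year, so those values need to be adjusted together.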

