Learning Data Mining with Python, Chapter 9: Downloading the Book Data from Project Gutenberg


Because of Python version changes and updates to the website itself, the code in the book no longer works, so I modified it myself.

Below is the main problem I ran into while updating it.

Problem: the site's URLs have changed

# The code from the book
url_base = "http://www.gutenberg.myebook.bg/"
fixes[1044] = url_base + "1/0/4/1044/1044-0.txt"
fixes[5148] = url_base + "5/1/4/5148/5148-0.txt"
fixes[4657] = "https://archive.org/stream/personalnarrativ04657gut/pnpa110.txt"

Because the site has been updated, the following URLs should be used instead:

url_base = 'http://www.gutenberg.org/files/'
url_format = '{url_base}{id}/{id}-0.txt'
# Fallback URL for books that 404 under /files/
url_fix_format = 'http://www.gutenberg.org/cache/epub/{id}/pg{id}.txt'
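As a quick sanity check, the two format strings expand like this (using book ID 1044 from the lists below):

```python
url_base = 'http://www.gutenberg.org/files/'
url_format = '{url_base}{id}/{id}-0.txt'
url_fix_format = 'http://www.gutenberg.org/cache/epub/{id}/pg{id}.txt'

# Primary URL tried first, fallback URL used if the first one 404s
print(url_format.format(url_base=url_base, id=1044))
# http://www.gutenberg.org/files/1044/1044-0.txt
print(url_fix_format.format(id=1044))
# http://www.gutenberg.org/cache/epub/1044/pg1044.txt
```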

We also need to record which books need the fallback URL, using the code below:

from collections import defaultdict
fixes = defaultdict(list)
# This lets us record, per author, the IDs of the books
# whose URLs need to be re-requested, e.g.:
fixes['twain'].append(bookid)
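A quick sketch of why `defaultdict(list)` is convenient here: appending under a missing key creates the list automatically, so no author key has to be initialized up front.

```python
from collections import defaultdict

fixes = defaultdict(list)
fixes['twain'].append(1044)   # key 'twain' is created on first access
fixes['twain'].append(3176)
fixes['doyle'].append(5148)

print(dict(fixes))
# {'twain': [1044, 3176], 'doyle': [5148]}
```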

The full code is below (tested and working):

import requests
import os
import time
from collections import defaultdict

titles = {}
titles['burton'] = [4657, 2400, 5760, 6036, 7111, 8821,
                    18506, 4658, 5761, 6886, 7113]
titles['dickens'] = [24022, 1392, 1414, 1467, 2324, 580,
                     786, 888, 963, 27924, 1394, 1415, 15618,
                     25985, 588, 807, 914, 967, 30127, 1400,
                     1421, 16023, 28198, 644, 809, 917, 968, 1023,
                     1406, 1422, 17879, 30368, 675, 810, 924, 98,
                     1289, 1413, 1423, 17880, 32241, 699, 821, 927]
titles['doyle'] = [2349, 11656, 1644, 22357, 2347, 290, 34627, 5148,
                   8394, 26153, 12555, 1661, 23059, 2348, 294, 355,
                   5260, 8727, 10446, 126, 17398, 2343, 2350, 3070,
                   356, 5317, 903, 10581, 13152, 2038, 2344, 244, 32536,
                   423, 537, 108, 139, 2097, 2345, 24951, 32777, 4295,
                   7964, 11413, 1638, 21768, 2346, 2845, 3289, 439, 834]
titles['gaboriau'] = [1748, 1651, 2736, 3336, 4604, 4002, 2451,
                      305, 3802, 547]
titles['nesbit'] = [34219, 23661, 28804, 4378, 778, 20404, 28725,
                    33028, 4513, 794]
titles['tarkington'] = [1098, 15855, 1983, 297, 402, 5798,
                        8740, 980, 1158, 1611, 2326, 30092,
                        483, 5949, 8867, 13275, 18259, 2595,
                        3428, 5756, 6401, 9659]
titles['twain'] = [1044, 1213, 245, 30092, 3176, 3179, 3183, 3189, 74,
                   86, 1086, 142, 2572, 3173, 3177, 3180, 3186, 3192,
                   76, 91, 119, 1837, 2895, 3174, 3178, 3181, 3187, 3432,
                   8525]

assert len(titles) == 7
assert len(titles['tarkington']) == 22
assert len(titles['dickens']) == 44
assert len(titles['nesbit']) == 10
assert len(titles['doyle']) == 51
assert len(titles['twain']) == 29
assert len(titles['burton']) == 11
assert len(titles['gaboriau']) == 10

url_base = 'http://www.gutenberg.org/files/'
url_format = '{url_base}{id}/{id}-0.txt'
# Fallback URL for books that 404 under /files/
url_fix_format = 'http://www.gutenberg.org/cache/epub/{id}/pg{id}.txt'
fixes = defaultdict(list)
# The book's original approach was a hard-coded dict instead:
# fixes = {}
# fixes[4657] = 'http://www.gutenberg.org/cache/epub/4657/pg4657.txt'

# Make the parent folder if it does not exist
# data_folder = os.path.join(os.path.expanduser('~'), 'Data', 'books')  # store under the user's home directory
data_folder = os.path.join(os.path.abspath('.'), 'Data', 'books')  # store under the current directory instead
if not os.path.exists(data_folder):
    os.makedirs(data_folder)
print(data_folder)

for author in titles:
    print('Downloading titles from', author)
    # Make the author's folder if it does not exist
    author_folder = os.path.join(data_folder, author)
    if not os.path.exists(author_folder):
        os.makedirs(author_folder)
    # Download each title to this folder
    for bookid in titles[author]:
        # if bookid in fixes:
        #     print(' - Applying fix to book with id', bookid)
        #     url = fixes[bookid]
        # else:
        #     print(' - Getting book with id', bookid)
        #     url = url_format.format(url_base=url_base, id=bookid)
        url = url_format.format(url_base=url_base, id=bookid)
        print(' - ', url)
        filename = os.path.join(author_folder, '%s.txt' % bookid)
        if os.path.exists(filename):
            print(' - File already exists, skipping')
        else:
            r = requests.get(url)
            if r.status_code == 404:
                print('url 404:', author, bookid, 'add to fixes list')
                fixes[author].append(bookid)
            else:
                txt = r.text
                with open(filename, 'w', encoding='utf-8') as f:
                    f.write(txt)
            time.sleep(1)
print('Download complete')

print('Starting on the fix list')
for author in fixes:
    print('Downloading the works of <%s>' % author)
    author_folder = os.path.join(data_folder, author)
    if not os.path.exists(author_folder):
        os.makedirs(author_folder)
    for bookid in fixes[author]:
        filename = os.path.join(author_folder, '%s.txt' % bookid)
        if os.path.exists(filename):
            print('File already downloaded, skipping')
        else:
            url_fix = url_fix_format.format(id=bookid)
            print(' - ', url_fix)
            r = requests.get(url_fix)
            if r.status_code == 404:
                print('Failed again!', author, bookid)
            else:
                with open(filename, 'w', encoding='utf-8') as f:
                    f.write(r.text)
            time.sleep(1)
print('Fix list downloads finished')
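After both passes finish, a small helper like the one below (hypothetical, not part of the original script) can check how many files actually landed in each author's folder, assuming the `Data/books` layout the script creates:

```python
import os

def count_books(data_folder):
    # Count the downloaded .txt files in each author subfolder
    counts = {}
    for author in sorted(os.listdir(data_folder)):
        author_folder = os.path.join(data_folder, author)
        if os.path.isdir(author_folder):
            counts[author] = sum(1 for f in os.listdir(author_folder)
                                 if f.endswith('.txt'))
    return counts
```

Comparing the counts against the `titles` lists (e.g. 29 expected for twain) shows which books still failed after the fix pass.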
