Installing Scrapy
See the CSDN blog post "Python实现一个简单的爬虫程序(爬取图片)" (A simple crawler program in Python: scraping images).
Creating the crawler project
Create the crawler project:
scrapy startproject test_spider
Create the spider file:
>cd test_spider\test_spider\spiders
>scrapy genspider doubanSpider movie.douban.com
Writing the spider
Analyze the URL:
https://movie.douban.com/top250?start=25&filter=
Here start=25 is the pagination parameter. There are 10 pages with 25 movie records each, so start takes the values 0, 25, 50, ..., 225.
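for i in range(0, 250, 25):  # start = 0, 25, ..., 225 (10 pages)
    req = "https://movie.douban.com/top250?start={}&filter=".format(i)
    yield scrapy.Request(url=req, meta={'url': req}, headers=self.headers, callback=self.parse)
    time.sleep(2)  # wait 2 seconds before generating the next request (this blocks Scrapy's event loop)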
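The paging arithmetic can be checked on its own, without Scrapy. The following is a minimal sketch; the names BASE_URL and page_urls are introduced here for illustration and are not part of the spider.

```python
# Hypothetical helper: builds the list-page URLs described above.
BASE_URL = "https://movie.douban.com/top250?start={}&filter="


def page_urls(pages=10, page_size=25):
    """Return one URL per page; `start` is the offset of the first movie on that page."""
    return [BASE_URL.format(page * page_size) for page in range(pages)]


urls = page_urls()
print(len(urls))   # 10
print(urls[0])     # first page, start=0
print(urls[-1])    # last page, start=225

# The stop value of range() must be greater than 225 to reach the last page;
# a stop value of 24 yields only the first offset.
print(list(range(0, 24, 25)))   # [0]
```

Generating the offsets from a page count and page size keeps the two numbers from the URL analysis visible in the code.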
Extract the movie URL, Chinese title, and foreign title:
html_data = response.body
sp = BeautifulSoup(html_data, 'html.parser')
items = sp.find(class_='grid_view').find_all('li')  # renamed from `list` to avoid shadowing the builtin
for one in items:
    link = one.find(class_='info').find(class_='hd').find('a')['href']
    print(link)
    titles = one.find_all(class_='title')
    # Commas are replaced with spaces so they do not break the CSV columns
    title_zh = titles[0].text.strip().replace(',', ' ')
    title_en = ''
    if len(titles) > 1:
        # Drop the leading '/' separator and any whitespace left after it
        title_en = titles[1].text.strip().replace(',', ' ').lstrip('/').strip()
    print(title_zh, title_en)
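The title clean-up can be tried separately from the page. The sketch below assumes that the foreign-title text starts with a "/" separator surrounded by non-breaking spaces; the helper name clean_title and the sample strings are hypothetical.

```python
def clean_title(raw):
    """Trim whitespace (including non-breaking spaces), drop a leading '/', replace commas."""
    text = raw.replace("\xa0", " ").strip()
    text = text.lstrip("/").strip()
    # Commas become spaces so the value stays in a single CSV column
    return text.replace(",", " ")


print(clean_title("  Some Title  "))
print(clean_title("\xa0/\xa0The Shawshank Redemption"))
```

Keeping the clean-up in one function makes it easy to apply the same rules to both titles.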
Extract the director and cast information:
bd = one.find(class_='info').find(class_='bd')
p1 = bd.find_all('p')[0].text.strip().replace('\n', '').replace('\r', '').replace(',', ' ')
print(p1)
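Removing only the line breaks can leave long runs of spaces in the extracted paragraph. As a possible alternative, the sketch below collapses every run of whitespace into a single space; the helper name normalize_whitespace and the sample text are made up for illustration.

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace (spaces, newlines, non-breaking spaces) into single spaces."""
    # str.split() with no arguments splits on any Unicode whitespace
    return " ".join(text.split())


# Hypothetical sample shaped like a multi-line paragraph with indentation
sample = "Director: A. Example\xa0\xa0\xa0Cast: B. Sample\n            1994\xa0/\xa0US\xa0/\xa0Drama"
print(normalize_whitespace(sample).replace(",", " "))
```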
Extract the rating information:
spans = bd.find(class_='star').find_all('span')
score = spans[1].text
num = spans[3].text.replace('人评价', '')  # strip the "people rated" suffix, leaving the count
print(score, num)
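The score and the rating count are still strings at this point. If numeric values are wanted, a conversion along the following lines could be used; the helper name parse_votes and the sample values are hypothetical.

```python
import re


def parse_votes(text):
    """Return the first run of digits in `text` as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


# Sample values for illustration only
print(parse_votes("12345人评价"))   # 12345
print(float("9.5"))                 # 9.5
```

Searching for digits instead of removing a fixed suffix keeps the conversion working if the surrounding text changes slightly.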
Write to the CSV file:
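# Write the header row (columns: URL, Chinese title, foreign title, director, score, number of ratings)
with open('movies.csv', 'a+', encoding='utf-8') as f:
    f.write('网址,中文名,外文名,导演,评分,评价人数\n')

# Append one row per movie
with open('movies.csv', 'a+', encoding='utf-8') as f:
    f.write('{},{},{},{},{},{}\n'.format(link, title_zh, title_en, p1, score, num))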
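As an alternative to replacing commas by hand, Python's standard csv module quotes fields that contain commas. The sketch below writes to an in-memory buffer so it can be run anywhere; the English column names, the helper name write_rows, and the sample row are assumptions for illustration.

```python
import csv
import io

# Hypothetical English column names standing in for the header used above
HEADER = ["url", "title_zh", "title_en", "credits", "score", "votes"]


def write_rows(stream, rows, write_header=True):
    """Write an optional header and the given rows to `stream` as CSV."""
    writer = csv.writer(stream)
    if write_header:
        writer.writerow(HEADER)
    writer.writerows(rows)


buffer = io.StringIO()
sample_row = ["https://example.com/1", "Title, with comma", "", "Director: X", "9.0", "100"]
write_rows(buffer, [sample_row])
print(buffer.getvalue())

# Reading the data back returns the comma-containing field unchanged
rows_back = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows_back[1][1])
```

When writing to a real file, the csv documentation recommends opening it with newline='' so that line endings are not translated twice.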
Complete code:
import scrapy
from bs4 import BeautifulSoup
import time


class DoubanspiderSpider(scrapy.Spider):
    name = "doubanSpider"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["https://movie.douban.com"]
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
        'Cookie': ''
    }

    def start_requests(self):
        # Write the header row (URL, Chinese title, foreign title, director, score, number of ratings)
        with open('movies.csv', 'a+', encoding='utf-8') as f:
            f.write('网址,中文名,外文名,导演,评分,评价人数\n')
        for i in range(0, 250, 25):  # start = 0, 25, ..., 225 (10 pages)
            req = "https://movie.douban.com/top250?start={}&filter=".format(i)
            yield scrapy.Request(url=req, meta={'url': req}, headers=self.headers, callback=self.parse)
            time.sleep(2)  # wait 2 seconds before generating the next request

    def parse(self, response):
        print("========= parse ==============")
        html_data = response.body
        sp = BeautifulSoup(html_data, 'html.parser')
        items = sp.find(class_='grid_view').find_all('li')
        for one in items:
            # Movie URL and titles
            link = one.find(class_='info').find(class_='hd').find('a')['href']
            print(link)
            titles = one.find_all(class_='title')
            title_zh = titles[0].text.strip().replace(',', ' ')
            title_en = ''
            if len(titles) > 1:
                title_en = titles[1].text.strip().replace(',', ' ').lstrip('/').strip()
            print(title_zh, title_en)
            # Director and cast
            bd = one.find(class_='info').find(class_='bd')
            p1 = bd.find_all('p')[0].text.strip().replace('\n', '').replace('\r', '').replace(',', ' ')
            print(p1)
            # Score and number of ratings
            spans = bd.find(class_='star').find_all('span')
            score = spans[1].text
            num = spans[3].text.replace('人评价', '')
            print(score, num)
            # Append one row per movie
            with open('movies.csv', 'a+', encoding='utf-8') as f:
                f.write('{},{},{},{},{},{}\n'.format(link, title_zh, title_en, p1, score, num))
Run the spider:
scrapy crawl doubanSpider
The generated CSV file looks like this: