Since I rarely check group messages, I have only a small collection of saved stickers, so I always end up on the losing side whenever a sticker battle breaks out in a group chat. I recently took Professor 嵩天's "Python Web Crawling and Information Extraction" course on 中国大学MOOC, so I decided to write a crawler to scrape sticker packs from the web. A quick search showed that 站长素材 (sc.chinaz.com) has a very rich collection: 446 pages with about 10 packs per page, which comes to more than 4,000 packs and nearly ten thousand individual images. Let's see who dares to start a sticker battle with me now.
Technical approach
requests + BeautifulSoup (bs4)
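As a minimal sketch of how the two fit together (assuming requests and beautifulsoup4 are installed), the basic pattern is to fetch the HTML with requests and hand it to BeautifulSoup for parsing:

import requests
from bs4 import BeautifulSoup

# Fetch the first list page and parse it; html.parser is the parser built into Python
r = requests.get("http://sc.chinaz.com/biaoqing/index.html", timeout=30)
r.raise_for_status()
r.encoding = r.apparent_encoding   # guess the real encoding instead of trusting the header
soup = BeautifulSoup(r.text, "html.parser")
print(soup.title.string if soup.title else "no <title> found")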
Page analysis
The first page of sticker packs on 站长素材 looks like this:
The URL of the first page is: http://sc.chinaz.com/biaoqing/index.html
Clicking the page buttons at the bottom shows that the second page's URL is: http://sc.chinaz.com/biaoqing/index_2.html
So we can infer that page 446's URL is: http://sc.chinaz.com/biaoqing/index_446.html
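Based on this pattern, the full list of page URLs can be generated up front; a small sketch (446 is the page count observed at the time of writing and may change):

root = "http://sc.chinaz.com/biaoqing/"
# Page 1 is index.html; pages 2 through 446 follow the index_<n>.html pattern
page_urls = [root + "index.html"] + [root + "index_{}.html".format(n) for n in range(2, 447)]
print(page_urls[0], page_urls[1], page_urls[-1])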
Next, analyze the source of the pack list on each page:
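(The screenshot of the page source isn't reproduced here.) Going by the selectors the crawler below uses, each pack entry is a div with class up, and the pack's title and detail-page link sit on an <a> inside a nested div with class num_1. A small sketch against a made-up snippet of that shape:

from bs4 import BeautifulSoup

# Hypothetical snippet in the shape the crawler below expects
sample = '''
<div class="up">
  <div class="num_1"><a href="http://sc.chinaz.com/biaoqing/170318012345.htm" title="sample-pack">sample-pack</a></div>
</div>
'''
soup = BeautifulSoup(sample, "html.parser")
for div in soup.find_all("div", attrs={"class": "up"}):
    a = div.find("div", attrs={"class": "num_1"}).find("a")
    print(a.attrs["title"], a.attrs["href"])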
Then analyze the page that holds all the images of a single pack:
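Likewise (again inferred from the code below rather than shown from the live page), each pack's detail page has a div with class img_text, and the sticker <img> tags live in the sibling div that follows it:

from bs4 import BeautifulSoup

# Hypothetical snippet: div.img_text followed by a sibling div that holds the <img> tags
sample = '''
<div class="img_text">sample-pack</div>
<div>
  <img src="http://example.com/a.gif"><img src="http://example.com/b.gif">
</div>
'''
soup = BeautifulSoup(sample, "html.parser")
div = soup.find("div", attrs={"class": "img_text"})
img_div = div.next_sibling.next_sibling   # skip the newline text node between the two divs
for img in img_div.find_all("img"):
    print(img.attrs["src"])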
Steps
1. Get the link and title of every pack shown on a list page
2. Get the links of all the images in each pack
3. Download each image from those links, putting each pack's images in its own folder named after the pack's title attribute
Code
#-*-coding:utf-8-*-
'''
Created on March 18, 2017
@author: lavi
'''
import bs4
from bs4 import BeautifulSoup
import re
import requests
import os
import traceback
'''
Fetch the page content (HTML text)
'''
def getHtmlText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
'''
Fetch binary content (image bytes)
'''
def getImgContent(url):
    head = {"user-agent": "Mozilla/5.0"}
    try:
        r = requests.get(url, headers=head, timeout=30)
        print("status_code:" + str(r.status_code))
        r.raise_for_status()
        return r.content
    except:
        return None

'''
Get the title and link of every pack shown on a list page
'''
def getTypeUrlList(html, typeUrlList):
    soup = BeautifulSoup(html, 'html.parser')
    divs = soup.find_all("div", attrs={"class": "up"})
    for div in divs:
        a = div.find("div", attrs={"class": "num_1"}).find("a")
        title = a.attrs["title"]
        typeUrl = a.attrs["href"]
        typeUrlList.append((title, typeUrl))
def getImgUrlList(typeUrlList, imgUrlDict):
    for title, url in typeUrlList:
        title_imgUrlList = []
        html = getHtmlText(url)
        soup = BeautifulSoup(html, "html.parser")
        # print(soup.prettify())
        div = soup.find("div", attrs={"class": "img_text"})
        # print(type(div))
        imgDiv = div.next_sibling.next_sibling
        # print(type(imgDiv))
        imgs = imgDiv.find_all("img")
        for img in imgs:
            src = img.attrs["src"]
            title_imgUrlList.append(src)
        imgUrlDict[title] = title_imgUrlList
def getImage(imgUrlDict, file_path):
    head = {"user-agent": "Mozilla/5.0"}
    countdir = 0
    for title, imgUrlList in imgUrlDict.items():
        # print(title + ":" + str(imgUrlList))
        try:
            dir = file_path + title
            if not os.path.exists(dir):
                os.mkdir(dir)
            countfile = 0
            for imgUrl in imgUrlList:
                path = dir + "/" + imgUrl.split("/")[-1]
                # print(path)
                # print(imgUrl)
                if not os.path.exists(path):
                    r = requests.get(imgUrl, headers=head, timeout=30)
                    r.raise_for_status()
                    with open(path, "wb") as f:
                        f.write(r.content)
                countfile = countfile + 1
                print("progress within current pack: {:.2f}%".format(countfile * 100 / len(imgUrlList)))
            countdir = countdir + 1
            print("overall pack progress: {:.2f}%".format(countdir * 100 / len(imgUrlDict)))
        except:
            traceback.print_exc()
            # print("getImage: download failed")

def main():
    # To avoid filling up the disk, only crawl 30 pages (roughly 300 packs) instead of all 446
    pages = 30
    root = "http://sc.chinaz.com/biaoqing/"
    url = "http://sc.chinaz.com/biaoqing/index.html"
    file_path = "e://biaoqing/"
    imgUrlDict = {}
    typeUrlList = []
    html = getHtmlText(url)
    getTypeUrlList(html, typeUrlList)
    getImgUrlList(typeUrlList, imgUrlDict)
    getImage(imgUrlDict, file_path)
    # Pages 2 onward follow the index_<n>.html pattern (page 1 is index.html, handled above)
    for page in range(2, pages + 1):
        url = root + "index_" + str(page) + ".html"
        imgUrlDict = {}
        typeUrlList = []
        html = getHtmlText(url)
        getTypeUrlList(html, typeUrlList)
        getImgUrlList(typeUrlList, imgUrlDict)
        getImage(imgUrlDict, file_path)

main()
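A note on running it: file_path = "e://biaoqing/" assumes a Windows machine with an E: drive, and that the e://biaoqing folder already exists, since os.mkdir only creates the per-pack subfolders. Change the path (and create the folder) to match your own setup.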
Results
If you've been taking a beating in group-chat sticker battles, just run the program above... no need to thank me, March is Lei Feng month. Haha, now come and have a sticker battle with me.