[Python crawler] Scraping stickers and wallpapers from the 权律二 WeChat official account

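The Sogou search engine can search WeChat official accounts. I had not written a crawler in a long time, but I recently bought Cui's book 《python网络爬虫开发实战》 and it took me back to a year ago, when I was first learning to write crawlers and was full of enthusiasm. As a small warm-up, the script below uses a few basic tools (requests-html, XPath, urllib.request, and regular expressions) to grab some stickers and wallpapers.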

Let's first see what the result looks like.

Here is the source code. With a few small changes it can scrape other content as well.

import os
import urllib.request
import re
import ssl
from requests_html import HTMLSession
import time
from lxml import etree

# Skip HTTPS certificate verification for urllib requests.
ssl._create_default_https_context = ssl._create_unverified_context


def getData(url):
    # Pretend to be a browser.
    headers = ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0")
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    # Install the opener globally so that urlopen and urlretrieve use it.
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(url).read().decode("utf-8")
    return data


def getcontent(url):
    data = getData(url)
    # Regular expression that extracts the sticker (image) URLs.
    stickerpat = '<img.*?data-src="(.*?)"'
    stickerlist = re.compile(stickerpat, re.S).findall(data)
    # Regular expression that extracts the article title.
    titlepat = '<h2.*?>(.*?)</h2>'
    title = re.compile(titlepat, re.S).findall(data)
    # The title doubles as the directory name, so drop newlines and '|'.
    title = title[0].replace('\n', '').replace('|', '').strip()
    return stickerlist, title


def download(stickerlist, title):
    path = title
    number = 1
    for sticker in stickerlist:
        # Only URLs that end in 'gif' or 'jpeg' are downloaded.
        if sticker.endswith('gif'):
            filename = os.path.join(path, str(number) + '.gif')
            print("Downloading:", filename)
            urllib.request.urlretrieve(sticker, filename=filename)
            time.sleep(1)
            number += 1
        if sticker.endswith('jpeg'):
            filename = os.path.join(path, str(number) + '.jpeg')
            print("Downloading:", filename)
            urllib.request.urlretrieve(sticker, filename=filename)
            time.sleep(1)
            number += 1


def creatDir(title):
    # Return True only when the directory is newly created, so that an
    # article that was already downloaded is skipped.
    isExists = os.path.exists(title)
    if not isExists:
        os.makedirs(title)
        print(title + ' created')
        return True
    return False


def getUrlList():
    session = HTMLSession()
    # Walk through the first 10 pages of search results.
    for page in range(1, 11):
        url = 'http://weixin.sogou.com/weixin?query=%E6%9D%83%E5%BE%8B%E4%BA%8C&_sug_type_=&sut=4989&lkt=1%2C1530759390068%2C1530759390068&s_from=input&_sug_=y&type=2&sst0=1530759390170&page=' + str(page) + '&ie=utf8&w=01019900&dr=1'
        time.sleep(5)
        r = session.get(url)
        dom = r.html
        print(dom)
        # Each result page lists up to 10 articles, with ids numbered 0-9.
        for i in range(10):
            try:
                result = dom.xpath('//*[@id="sogou_vr_11002601_title_' + str(i) + '"]//@href')
                time.sleep(5)
                print(i, result)
                stickerlist, title = getcontent(result[0])
                if creatDir(title):
                    download(stickerlist, title)
            except Exception:
                # Skip a result that is missing or fails to parse or download.
                continue


getUrlList()
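
Two pieces of the script can be tried without touching the network, which helps when adapting it to other content. The sketch below is not part of the original script: build_search_url, extract, and SAMPLE_HTML are hypothetical names, the example.com URLs are made up, and whether Sogou accepts the reduced parameter set (query, type, page, ie) is an assumption. It shows that the query string in the search URL is just the percent-encoded UTF-8 keyword, and it runs the two extraction patterns from getcontent on an inline HTML snippet.

```python
import re
from urllib.parse import urlencode


def build_search_url(keyword, page):
    """Build a Sogou WeChat article-search URL for a keyword (hypothetical helper).

    Assumption: only query, type, page and ie are sent; the original URL
    carries several more parameters that are left out here.
    """
    params = {"query": keyword, "type": 2, "page": page, "ie": "utf8"}
    return "http://weixin.sogou.com/weixin?" + urlencode(params)


# Hypothetical article HTML, standing in for a page returned by getData().
SAMPLE_HTML = '''
<h2 class="rich_media_title">
  Sample|title
</h2>
<img class="a" data-src="https://example.com/img/1?wx_fmt=gif" />
<img data-src="https://example.com/img/2?wx_fmt=jpeg" />
<img src="https://example.com/static/logo.png" />
'''


def extract(data):
    """Apply the same two patterns as getcontent to an HTML string."""
    stickers = re.compile('<img.*?data-src="(.*?)"', re.S).findall(data)
    titles = re.compile('<h2.*?>(.*?)</h2>', re.S).findall(data)
    title = titles[0].replace('\n', '').replace('|', '').strip()
    return stickers, title


if __name__ == "__main__":
    print(build_search_url("权律二", 1))
    print(extract(SAMPLE_HTML))
```

The image tag without a data-src attribute contributes nothing to the list, and the title comes back with its newlines and the '|' character removed, so it is safe to use as a directory name.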


While I'm at it, here is a quick review of the basics. I'll polish this properly over the summer break.

Regular expression basics

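Basics 1:
Global matching function usage	re.compile(pattern).findall(source string)
Ordinary characters	match themselves
\n			matches a newline
\t 			matches a tab
\w 			matches a letter, digit, or underscore
\W 			matches any character other than a letter, digit, or underscore
\d 			matches a decimal digit
\D 			matches any character other than a decimal digit
\s 			matches a whitespace character
\S 			matches any non-whitespace character
[ab89x]		atom table (character set): matches any one of a, b, 8, 9, x
[^ab89x]		atom table: matches any one character other than a, b, 8, 9, x

Example 1: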
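Source string: "aliyunedu"
Pattern: "yu"
What does it match?	yu

Source string: '''aliyun
edu'''
Pattern: "yun\n"
What does it match?	yun\n

Source string: "aliyu89787nedu"
Pattern: "\w\d\w\d\d\w"
What does it match?	u89787

Source string: "aliyu89787nedu"
Pattern: "\w\d[nedu]\w"
What does it match?	87ne

Basics 2: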
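.	matches any single character except a newline
^	matches the start position
$	matches the end position
*	the preceding character occurs zero, one, or more times
?	the preceding character occurs zero times or once
+	the preceding character occurs one or more times
{n}	the preceding character occurs exactly n times
{n,}	the preceding character occurs at least n times
{n,m}	the preceding character occurs at least n and at most m times
|	alternation (or)
()	pattern unit (group); put simply, wrap whatever you want to extract in parentheses inside the pattern

Example 2: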
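Source string: '''aliyunnnnji87362387aoyubaidu'''

Pattern: "ali..."
What does it match?	aliyun

Pattern: "^li..."
What does it match?	None (no match: the string does not start with "li")

Pattern: "^ali..."
What does it match?	aliyun

Pattern: "bai..$"
What does it match?	baidu

Pattern: "ali.*"
What does it match?	aliyunnnnji87362387aoyubaidu
Tip: matching is greedy by default, i.e. it matches as much as possible.

Pattern: "aliyun+"
What does it match?	aliyunnnn

Pattern: "aliyun?"
What does it match?	aliyun

Pattern: "yun{1,2}"
What does it match?	yunn

Pattern: "^al(i..)."
What does it match?	iyu

Basics 3: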
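Greedy mode: match as much as possible
Lazy mode: match as little as possible (precise matching)
The default is greedy mode.
Either of the following combinations switches to lazy mode:
*?
+?

Example 3: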
Source string: "poytphonyhjskjsa"
Pattern: "p.*y"
What does it match?	poytphony
Why?	Greedy mode is the default

Source string: "poytphonyhjskjsa"
Pattern: "p.*?y"
What does it match?	['poy', 'phony']
Why?	Lazy mode, precise matching

Basics 4:
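Pattern modifiers (flags): change the match result without changing the regular expression itself.
re.S		lets . match newlines too, so a match can span several lines
re.I		ignores case when matching

Example 4: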
Source string: "Python"
Pattern: "pyt"
How to match: re.compile("pyt").findall("Python")
Result: []

Source string: string="Python"
Pattern: "pyt"
How to match: re.compile("pyt",re.I).findall("Python")
Result: ['Pyt']

Source string: string="""我是阿里云大学
欢迎来学习
Python网络爬虫课程
"""
Pattern: pat="阿里.*?Python"
How to match: re.compile(pat).findall(string)
Result: []

Source string: string="""我是阿里云大学
欢迎来学习
Python网络爬虫课程
"""
Pattern: pat="阿里.*?Python"
How to match: re.compile(pat,re.S).findall(string)
Result: ['阿里云大学\n欢迎来学习\nPython']
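
A few of the worked examples above can be checked directly in Python. This is a minimal sketch: the matches helper is a hypothetical wrapper around the re.compile(...).findall(...) call used throughout this section, and the strings and patterns are the ones from the examples.

```python
import re


def matches(pattern, text, flags=0):
    """Return all non-overlapping matches, like re.compile(pattern, flags).findall(text)."""
    return re.compile(pattern, flags).findall(text)


# Example 1: character classes and an atom table.
atoms = matches(r"\w\d[nedu]\w", "aliyu89787nedu")

# Example 2: findall returns only the parenthesised group.
group = matches(r"^al(i..).", "aliyunnnnji87362387aoyubaidu")

# Example 3: greedy versus lazy quantifiers on the same string.
greedy = matches(r"p.*y", "poytphonyhjskjsa")
lazy = matches(r"p.*?y", "poytphonyhjskjsa")

# Example 4: the re.I and re.S flags.
string = "我是阿里云大学\n欢迎来学习\nPython网络爬虫课程\n"
case_sensitive = matches("pyt", "Python")
ignore_case = matches("pyt", "Python", re.I)
single_line = matches("阿里.*?Python", string)
dot_all = matches("阿里.*?Python", string, re.S)

if __name__ == "__main__":
    for name in ("atoms", "group", "greedy", "lazy", "case_sensitive",
                 "ignore_case", "single_line", "dot_all"):
        print(name, globals()[name])
```

Note that findall always returns a list, so a pattern that does not match gives an empty list rather than None.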

XPath basics

/	extracts level by level
text()	extracts the text under a tag
//tagname	extracts every tag named tagname
//tagname[@attr='value']	extracts the tags whose attribute attr equals value
@attr	takes the value of an attribute

<html>
<head>
<title>
主页
</title>
</head>
<body>
<p>abc</p>
<p>bbbvb</p>
<a href="//qd.alibaba.com/go/v/pcdetail" target="_top">安全推荐</a>
<a href="//qd.alibaba.com/go/v/pcdetail" target="_top">安全推荐2</a>
<div class="J_AsyncDC" data-type="dr"><div id="official-remind">明月几时有
</div>
</div>
</body>
</html>

Work out what each of the following XPath expressions extracts:
/html/head/title/text()
//p/text()
//a
//div[@id='official-remind']/text()
//a/@href

Examples:
Extract the title: /html/head/title/text()
Extract all div tags: //div
Extract the content of the <div class="tools"> tag: //div[@class='tools']/text()
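
The exercise above can be tried with Python's standard library. This is only a sketch under stated assumptions: it uses xml.etree.ElementTree rather than lxml, and ElementTree supports just a subset of XPath (no text() or @attr steps), so the text and attribute values are read from the matched elements in Python. The sample page is closed with </html> so that it parses as well-formed XML, and PAGE and the result variable names are chosen for illustration. The comment next to each line gives the full XPath expression it stands in for.

```python
import xml.etree.ElementTree as ET

# The sample page from this section, with a closing </html> tag.
PAGE = """<html>
<head>
<title>
主页
</title>
</head>
<body>
<p>abc</p>
<p>bbbvb</p>
<a href="//qd.alibaba.com/go/v/pcdetail" target="_top">安全推荐</a>
<a href="//qd.alibaba.com/go/v/pcdetail" target="_top">安全推荐2</a>
<div class="J_AsyncDC" data-type="dr"><div id="official-remind">明月几时有
</div>
</div>
</body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# /html/head/title/text()
title = root.find("./head/title").text.strip()

# //p/text()
paragraphs = [p.text for p in root.findall(".//p")]

# //a
links = root.findall(".//a")

# //div[@id='official-remind']/text()
remind = root.find(".//div[@id='official-remind']").text.strip()

# //a/@href
hrefs = [a.get("href") for a in links]

if __name__ == "__main__":
    print(title, paragraphs, len(links), remind, hrefs)
```

With lxml, as imported in the crawler script, the full expressions including text() and @href can be passed to xpath() directly. Real-world HTML is rarely well-formed XML, which is one reason to prefer an HTML-aware parser there.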

Requests-HTML basics

Make a GET request to 'python.org', using Requests:
>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://python.org/')
Grab a list of all links on the page, as-is (anchors excluded):
>>> r.html.links
{'//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/dev/peps/', '/about/legal/', '//docs.python.org/3/tutorial/introduction.html#lists', '/download/alternatives', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', '/download/other/', '/downloads/windows/', 'https://mail.python.org/mailman/listinfo/python-dev', '/doc/av', 'https://devguide.python.org/', '/about/success/#engineering', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', '/about/gettingstarted/', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', '/success-stories/industrial-light-magic-runs-python/', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', '/', 'http://pyfound.blogspot.com/', '/events/python-events/past/', '/downloads/release/python-2714/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://status.python.org/', '/community/workshops/', '/community/lists/', 'http://buildbot.net/', '/community/awards', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', '/psf/donations/', 'http://wiki.python.org/moin/Languages', '/dev/', '/events/python-user-group/', 'https://wiki.qt.io/PySide', '/community/sigs/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'http://planetpython.org/', '/events/python-events', '/about/help/', '/events/python-user-group/past/', '/about/success/', '/psf-landing/', '/about/apps', '/about/', 'http://www.wxpython.org/', '/events/python-user-group/665/', 'https://www.python.org/psf/codeofconduct/', '/dev/peps/peps.rss', '/downloads/source/', '/psf/sponsorship/sponsors/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 
'https://bugs.python.org/', '/community/merchandise/', 'http://tornadoweb.org', '/events/python-user-group/650/', 'http://flask.pocoo.org/', '/downloads/release/python-364/', '/events/python-user-group/660/', '/events/python-user-group/638/', '/psf/', '/doc/', 'http://blog.python.org', '/events/python-events/604/', '/about/success/#government', 'http://python.org/dev/peps/', 'https://docs.python.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', '/users/membership/', '/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', '/downloads/', '/jobs/', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', '/privacy/', 'https://pypi.python.org/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'http://www.scipy.org', '/community/forums/', '/about/success/#scientific', '/about/success/#software-development', '/shell/', '/accounts/signup/', 'http://www.facebook.com/pythonlang?fref=ts', '/community/', 'https://kivy.org/', '/about/quotes/', 'http://www.web2py.com/', '/community/logos/', '/community/diversity/', '/events/calendars/', 'https://wiki.python.org/moin/BeginnersGuide', '/success-stories/', '/doc/essays/', '/dev/core-mentorship/', 'http://ipython.org', '/events/', '//docs.python.org/3/tutorial/controlflow.html', '/about/success/#education', '/blogs/', '/community/irc/', 'http://pycon.blogspot.com/', '//jobs.python.org', 'http://www.pylonsproject.org/', 'http://www.djangoproject.com/', '/downloads/mac-osx/', '/about/success/#business', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://docs.python.org/faq/', '//docs.python.org/3/tutorial/controlflow.html#defining-functions'}
Grab a list of all links on the page, in absolute form (anchors excluded):
>>> r.html.absolute_links
{'https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', 'https://www.python.org/about/success/', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', 'https://www.python.org/dev/peps/', 'https://mail.python.org/mailman/listinfo/python-dev', 'https://www.python.org/doc/', 'https://www.python.org/', 'https://www.python.org/about/', 'https://www.python.org/events/python-events/past/', 'https://devguide.python.org/', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', 'https://docs.python.org/3/tutorial/introduction.html#lists', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', 'http://pyfound.blogspot.com/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://www.python.org/events/python-events', 'https://status.python.org/', 'https://www.python.org/about/apps', 'https://www.python.org/downloads/release/python-2714/', 'https://www.python.org/psf/donations/', 'http://buildbot.net/', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', 'http://wiki.python.org/moin/Languages', 'https://docs.python.org/faq/', 'https://jobs.python.org', 'https://www.python.org/about/success/#software-development', 'https://www.python.org/about/success/#education', 'https://www.python.org/community/logos/', 'https://www.python.org/doc/av', 'https://wiki.qt.io/PySide', 'https://www.python.org/events/python-user-group/660/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'https://www.python.org/dev/peps/peps.rss', 'http://planetpython.org/', 'https://www.python.org/events/python-user-group/past/', 'https://docs.python.org/3/tutorial/controlflow.html#defining-functions', 
'https://www.python.org/community/diversity/', 'https://docs.python.org/3/tutorial/controlflow.html', 'https://www.python.org/community/awards', 'https://www.python.org/events/python-user-group/638/', 'https://www.python.org/about/legal/', 'https://www.python.org/dev/', 'https://www.python.org/download/alternatives', 'https://www.python.org/downloads/', 'https://www.python.org/community/lists/', 'http://www.wxpython.org/', 'https://www.python.org/about/success/#government', 'https://www.python.org/psf/', 'https://www.python.org/psf/codeofconduct/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 'https://www.python.org/downloads/source/', 'https://bugs.python.org/', 'https://www.python.org/downloads/mac-osx/', 'https://www.python.org/about/help/', 'http://tornadoweb.org', 'http://flask.pocoo.org/', 'https://www.python.org/users/membership/', 'http://blog.python.org', 'https://www.python.org/privacy/', 'https://www.python.org/about/gettingstarted/', 'http://python.org/dev/peps/', 'https://www.python.org/about/apps/', 'https://docs.python.org', 'https://www.python.org/success-stories/', 'https://www.python.org/community/forums/', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', 'https://www.python.org/community/merchandise/', 'https://www.python.org/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', 'https://pypi.python.org/', 'https://www.python.org/events/python-user-group/650/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'https://www.python.org/about/quotes/', 'https://www.python.org/downloads/windows/', 'https://www.python.org/events/calendars/', 'http://www.scipy.org', 'https://www.python.org/community/workshops/', 'https://www.python.org/blogs/', 
'https://www.python.org/accounts/signup/', 'https://www.python.org/events/', 'https://kivy.org/', 'http://www.facebook.com/pythonlang?fref=ts', 'http://www.web2py.com/', 'https://www.python.org/psf/sponsorship/sponsors/', 'https://www.python.org/community/', 'https://www.python.org/download/other/', 'https://www.python.org/psf-landing/', 'https://www.python.org/events/python-user-group/665/', 'https://wiki.python.org/moin/BeginnersGuide', 'https://www.python.org/accounts/login/', 'https://www.python.org/downloads/release/python-364/', 'https://www.python.org/dev/core-mentorship/', 'https://www.python.org/about/success/#business', 'https://www.python.org/community/sigs/', 'https://www.python.org/events/python-user-group/', 'http://ipython.org', 'https://www.python.org/shell/', 'https://www.python.org/community/irc/', 'https://www.python.org/about/success/#engineering', 'http://www.pylonsproject.org/', 'http://pycon.blogspot.com/', 'https://www.python.org/about/success/#scientific', 'https://www.python.org/doc/essays/', 'http://www.djangoproject.com/', 'https://www.python.org/success-stories/industrial-light-magic-runs-python/', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://www.python.org/jobs/', 'https://www.python.org/events/python-events/604/'}
Select an element with a CSS Selector:
>>> about = r.html.find('#about', first=True)
Grab an element's text contents:
>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure
Introspect an Element's attributes:
>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}
Render out an Element's HTML:
>>> about.html
'<li aria-haspopup="true" class="tier-1 element-1 " id="about">\n<a class="" href="/about/" title="">About</a>\n<ul aria-hidden="true" class="subnav menu" role="menu">\n<li class="tier-2 element-1" role="treeitem"><a href="/about/apps/" title="">Applications</a></li>\n<li class="tier-2 element-2" role="treeitem"><a href="/about/quotes/" title="">Quotes</a></li>\n<li class="tier-2 element-3" role="treeitem"><a href="/about/gettingstarted/" title="">Getting Started</a></li>\n<li class="tier-2 element-4" role="treeitem"><a href="/about/help/" title="">Help</a></li>\n<li class="tier-2 element-5" role="treeitem"><a href="http://brochure.getpython.info/" title="">Python Brochure</a></li>\n</ul>\n</li>'
Select Elements within Elements:
>>> about.find('a')
[<Element 'a' href='/about/' title='' class=''>, <Element 'a' href='/about/apps/' title=''>, <Element 'a' href='/about/quotes/' title=''>, <Element 'a' href='/about/gettingstarted/' title=''>, <Element 'a' href='/about/help/' title=''>, <Element 'a' href='http://brochure.getpython.info/' title=''>]
Search for links within an element:
>>> about.absolute_links
{'http://brochure.getpython.info/', 'https://www.python.org/about/gettingstarted/', 'https://www.python.org/about/', 'https://www.python.org/about/quotes/', 'https://www.python.org/about/help/', 'https://www.python.org/about/apps/'}
Search for text on the page:
>>> r.html.search('Python is a {} language')[0]
programming
More complex CSS Selector example (copied from Chrome dev tools):
>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'
>>> print(r.html.find(sel, first=True).text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.
XPath is also supported:
>>> r.html.xpath('/html/body/div[1]/a')
[<Element 'a' class=('px-2', 'py-4', 'show-on-focus', 'js-skip-to-content') href='#start-of-content' tabindex='1'>]

JavaScript Support
Let's grab some text that's rendered by JavaScript:
>>> r = session.get('http://python-requests.org')
>>> r.html.render()
>>> r.html.search('Python 2 will retire in only {months} months!')['months']
'<time>25</time>'
Note, the first time you ever run the render() method, it will download Chromium into your home directory (e.g. ~/.pyppeteer/). This only happens once.

Using without Requests
You can also use this library without Requests:
>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""
>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}
