[Python crawler] Scraping 权律二 stickers and wallpapers from WeChat official accounts

news/2024/9/23 9:35:54/

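The Sogou search engine can search WeChat official accounts. I had not written a crawler for a long time, but I recently bought Cui's book 《python网络爬虫开发实战》 and it took me back to a year ago, when I was first learning to scrape and was full of enthusiasm. As a small warm-up, this post uses a few basic tools (requests-html, XPath, urllib.request, and regular expressions) to grab some stickers and wallpapers.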

Let's look at the result first.

Here is the source code. With a few changes it can scrape other content as well.

import os
import re
import ssl
import time
import urllib.request

from requests_html import HTMLSession

# Skip HTTPS certificate verification (insecure, but it avoids SSL errors here)
ssl._create_default_https_context = ssl._create_unverified_context


def getData(url):
    # Pretend to be a browser
    headers = ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0")
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    # Install the opener globally
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(url).read().decode("utf-8")
    return data


def getcontent(url):
    data = getData(url)
    # Regular expression that extracts the sticker image URLs
    stickerpat = '<img.*?data-src="(.*?)"'
    stickerlist = re.compile(stickerpat, re.S).findall(data)
    # Regular expression that extracts the article title
    titlepat = '<h2.*?>(.*?)</h2>'
    title = re.compile(titlepat, re.S).findall(data)
    title = title[0].replace('\n', '').replace('|', '').strip()
    return stickerlist, title


def download(stickerlist, title):
    path = title
    number = 1
    for sticker in stickerlist:
        # Only URLs ending in gif or jpeg are saved; the suffix becomes the file extension
        for ext in ('gif', 'jpeg'):
            if sticker.endswith(ext):
                filename = os.path.join(path, str(number) + '.' + ext)
                print("Downloading:", filename)
                urllib.request.urlretrieve(sticker, filename=filename)
                time.sleep(1)
                number += 1
                break


def creatDir(title):
    # Return True only when the directory was newly created,
    # so an article that was already downloaded is skipped
    isExists = os.path.exists(title)
    if not isExists:
        os.makedirs(title)
        print(title + ' created')
        return True
    return False


def getUrlList():
    session = HTMLSession()
    for page in range(1, 11):
        url = ('http://weixin.sogou.com/weixin?query=%E6%9D%83%E5%BE%8B%E4%BA%8C'
               '&_sug_type_=&sut=4989&lkt=1%2C1530759390068%2C1530759390068'
               '&s_from=input&_sug_=y&type=2&sst0=1530759390170'
               '&page=' + str(page) + '&ie=utf8&w=01019900&dr=1')
        time.sleep(5)
        r = session.get(url)
        dom = r.html
        print(dom)
        # Each result page lists up to 10 articles, with title ids numbered 0 to 9
        for i in range(10):
            try:
                result = dom.xpath('//*[@id="sogou_vr_11002601_title_' + str(i) + '"]//@href')
                time.sleep(5)
                print(i, result)
                stickerlist, title = getcontent(result[0])
                if creatDir(title):
                    download(stickerlist, title)
            except Exception:
                # Skip results that are missing or fail to download
                continue


getUrlList()
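To see what the two regular expressions in getcontent pick out without touching the network, the sketch below runs them on a small made-up HTML fragment. The fragment, its example.com URLs, and the helper name extract_stickers_and_title are assumptions for illustration only; a real article page is far larger and may be structured differently.

```python
import re

# Hypothetical fragment imitating the parts of an article page the crawler reads.
SAMPLE_HTML = '''
<h2 class="rich_media_title">
    Yuri | wallpapers
</h2>
<img class="a" data-src="https://example.com/pic/1?wx_fmt=gif" />
<img class="b" data-src="https://example.com/pic/2?wx_fmt=jpeg" />
<img class="c" data-src="https://example.com/pic/3?wx_fmt=png" />
'''


def extract_stickers_and_title(html):
    """Apply the same two patterns as getcontent() to an HTML string."""
    # Lazy .*? stops at the first data-src attribute and at its closing quote.
    stickers = re.compile('<img.*?data-src="(.*?)"', re.S).findall(html)
    # re.S lets the title span several lines inside the <h2> tag.
    titles = re.compile('<h2.*?>(.*?)</h2>', re.S).findall(html)
    title = titles[0].replace('\n', '').replace('|', '').strip()
    return stickers, title


stickers, title = extract_stickers_and_title(SAMPLE_HTML)
print(title)
for url in stickers:
    print(url)
```

All three image URLs are extracted, but download() would save only the first two, because it keeps just the URLs that end in gif or jpeg.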


While I'm at it, here is a quick review of the basics. I'll polish this properly over the summer break.

Regular expression basics

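Basics 1:
Global match function usage: re.compile(pattern).findall(source string)

ordinary character    matches itself
\n          matches a newline
\t          matches a tab
\w          matches a letter, digit, or underscore
\W          matches any character other than a letter, digit, or underscore
\d          matches a decimal digit
\D          matches any character other than a decimal digit
\s          matches a whitespace character
\S          matches any non-whitespace character
[ab89x]     character set, matches any one of a, b, 8, 9, x
[^ab89x]    negated character set, matches any one character other than a, b, 8, 9, x

Example 1:
Source string: "aliyunedu"
Pattern: "yu"
What does it match?    yu

Source string: '''aliyun
edu'''
Pattern: "yun\n"
What does it match?    yun\n

Source string: "aliyu89787nedu"
Pattern: "\w\d\w\d\d\w"
What does it match?    u89787

Source string: "aliyu89787nedu"
Pattern: "\w\d[nedu]\w"
What does it match?    87ne

Basics 2: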
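.       matches any single character except a newline
^       matches the start position
$       matches the end position
*       the previous character occurs 0, 1, or more times
?       the previous character occurs 0 or 1 times
+       the previous character occurs 1 or more times
{n}     the previous character occurs exactly n times
{n,}    the previous character occurs at least n times
{n,m}   the previous character occurs at least n and at most m times
|       alternation (or)
()      group: put parentheses around the part of the pattern you want to extract

Example 2:
Source string: '''aliyunnnnji87362387aoyubaidu'''

Pattern: "ali..."
What does it match?    aliyun

Pattern: "^li..."
What does it match?    nothing (findall returns an empty list)

Pattern: "^ali..."
What does it match?    aliyun

Pattern: "bai..$"
What does it match?    baidu

Pattern: "ali.*"
What does it match?    aliyunnnnji87362387aoyubaidu
Tip: matching is greedy by default, that is, it matches as much as possible.

Pattern: "aliyun+"
What does it match?    aliyunnnn

Pattern: "aliyun?"
What does it match?    aliyun

Pattern: "yun{1,2}"
What does it match?    yunn

Pattern: "^al(i..)."
What does it match?    iyu (only the group in parentheses is returned)

Basics 3: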
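Greedy mode: match as much as possible
Lazy mode: match as little as possible (precise matching)
Matching is greedy by default. The following combinations switch to lazy mode:
*?
+?

Example 3:
Source string: "poytphonyhjskjsa"
Pattern: "p.*y"
What does it match?    poytphony
Why?    Greedy mode is the default.

Source string: "poytphonyhjskjsa"
Pattern: "p.*?y"
What does it match?    ['poy', 'phony']
Why?    Lazy mode stops at the first y that completes a match.

Basics 4: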
Mode modifiers (flags): change the match result without changing the regular expression itself.
re.S    lets . match newlines too, so a pattern can span several lines
re.I    ignores case when matching

Example 4:
Source string: "Python"
Pattern: "pyt"
Call: re.compile("pyt").findall("Python")
Result: []

Source string: "Python"
Pattern: "pyt"
Call: re.compile("pyt",re.I).findall("Python")
Result: ['Pyt']

Source string: string="""我是阿里云大学
欢迎来学习
Python网络爬虫课程
"""
Pattern: pat="阿里.*?Python"
Call: re.compile(pat).findall(string)
Result: []

Source string: string="""我是阿里云大学
欢迎来学习
Python网络爬虫课程
"""
Pattern: pat="阿里.*?Python"
Call: re.compile(pat,re.S).findall(string)
Result: ['阿里云大学\n欢迎来学习\nPython']
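The examples above can be checked directly in Python. The short script below is a sketch that reuses the source strings and patterns from the notes; only the variable names are made up here.

```python
import re

# Greedy vs. lazy quantifiers (Example 3).
s = "poytphonyhjskjsa"
greedy = re.compile("p.*y").findall(s)
lazy = re.compile("p.*?y").findall(s)
print(greedy)  # ['poytphony']
print(lazy)    # ['poy', 'phony']

# With a capturing group, findall returns the group rather than the whole match.
source = "aliyunnnnji87362387aoyubaidu"
group_only = re.compile("^al(i..).").findall(source)
print(group_only)  # ['iyu']

# When nothing matches, findall returns an empty list, not None.
no_match = re.compile("^li...").findall(source)
print(no_match)  # []

# re.I ignores case (Example 4).
ignore_case = re.compile("pyt", re.I).findall("Python")
print(ignore_case)  # ['Pyt']

# Without re.S the dot stops at a line break; with re.S it crosses lines.
text = "我是阿里云大学\n欢迎来学习\nPython网络爬虫课程\n"
pat = "阿里.*?Python"
single_line = re.compile(pat).findall(text)
dot_all = re.compile(pat, re.S).findall(text)
print(single_line)  # []
print(dot_all)      # ['阿里云大学\n欢迎来学习\nPython']
```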

XPath basics

/  steps down one level at a time
text()  extracts the text under a tag
//tagname  selects all tags named tagname
//tagname[@attr='value']  selects the tags whose attribute attr equals value
@attr  takes the value of an attribute

<html>
<head>
<title>
主页
</title>
</head>
<body>
<p>abc</p>
<p>bbbvb</p>
<a href="//qd.alibaba.com/go/v/pcdetail" target="_top">安全推荐</a>
<a href="//qd.alibaba.com/go/v/pcdetail" target="_top">安全推荐2</a>
<div class="J_AsyncDC" data-type="dr"><div id="official-remind">明月几时有
</div>
</div>
</body>
</html>

Work out what each of the following XPath expressions extracts:
/html/head/title/text()
//p/text()
//a
//div[@id='official-remind']/text()
//a/@href

Examples:
Extract the title: /html/head/title/text()
Extract all div tags: //div
Extract the text of the <div class="tools"> tag: //div[@class='tools']/text()
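As a rough check of the expressions above, the sketch below parses the sample page with Python's standard-library xml.etree.ElementTree. This is a swapped-in tool, not the one used in the crawler: ElementTree implements only a subset of XPath, so the text() and @href steps are done in Python through .text and .get(). With lxml, the full expressions can be passed to an element's xpath() method instead. The sample is closed with </html> here so that it parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

# The sample page, closed with </html> so that it is well-formed XML.
PAGE = """<html>
<head>
<title>
主页
</title>
</head>
<body>
<p>abc</p>
<p>bbbvb</p>
<a href="//qd.alibaba.com/go/v/pcdetail" target="_top">安全推荐</a>
<a href="//qd.alibaba.com/go/v/pcdetail" target="_top">安全推荐2</a>
<div class="J_AsyncDC" data-type="dr"><div id="official-remind">明月几时有
</div>
</div>
</body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Counterpart of /html/head/title/text()
title = root.find("head/title").text.strip()

# Counterpart of //p/text()
paragraphs = [p.text for p in root.findall(".//p")]

# Counterpart of //a/@href
hrefs = [a.get("href") for a in root.findall(".//a")]

# Counterpart of //div[@id='official-remind']/text()
remind = root.find(".//div[@id='official-remind']").text.strip()

print(title)
print(paragraphs)
print(hrefs)
print(remind)
```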

 

Requests-HTML basics

Make a GET request to 'python.org', using Requests:
>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://python.org/')
Grab a list of all links on the page, as-is (anchors excluded):
>>> r.html.links
{'//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/dev/peps/', '/about/legal/', '//docs.python.org/3/tutorial/introduction.html#lists', '/download/alternatives', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', '/download/other/', '/downloads/windows/', 'https://mail.python.org/mailman/listinfo/python-dev', '/doc/av', 'https://devguide.python.org/', '/about/success/#engineering', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', '/about/gettingstarted/', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', '/success-stories/industrial-light-magic-runs-python/', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', '/', 'http://pyfound.blogspot.com/', '/events/python-events/past/', '/downloads/release/python-2714/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://status.python.org/', '/community/workshops/', '/community/lists/', 'http://buildbot.net/', '/community/awards', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', '/psf/donations/', 'http://wiki.python.org/moin/Languages', '/dev/', '/events/python-user-group/', 'https://wiki.qt.io/PySide', '/community/sigs/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'http://planetpython.org/', '/events/python-events', '/about/help/', '/events/python-user-group/past/', '/about/success/', '/psf-landing/', '/about/apps', '/about/', 'http://www.wxpython.org/', '/events/python-user-group/665/', 'https://www.python.org/psf/codeofconduct/', '/dev/peps/peps.rss', '/downloads/source/', '/psf/sponsorship/sponsors/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 
'https://bugs.python.org/', '/community/merchandise/', 'http://tornadoweb.org', '/events/python-user-group/650/', 'http://flask.pocoo.org/', '/downloads/release/python-364/', '/events/python-user-group/660/', '/events/python-user-group/638/', '/psf/', '/doc/', 'http://blog.python.org', '/events/python-events/604/', '/about/success/#government', 'http://python.org/dev/peps/', 'https://docs.python.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', '/users/membership/', '/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', '/downloads/', '/jobs/', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', '/privacy/', 'https://pypi.python.org/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'http://www.scipy.org', '/community/forums/', '/about/success/#scientific', '/about/success/#software-development', '/shell/', '/accounts/signup/', 'http://www.facebook.com/pythonlang?fref=ts', '/community/', 'https://kivy.org/', '/about/quotes/', 'http://www.web2py.com/', '/community/logos/', '/community/diversity/', '/events/calendars/', 'https://wiki.python.org/moin/BeginnersGuide', '/success-stories/', '/doc/essays/', '/dev/core-mentorship/', 'http://ipython.org', '/events/', '//docs.python.org/3/tutorial/controlflow.html', '/about/success/#education', '/blogs/', '/community/irc/', 'http://pycon.blogspot.com/', '//jobs.python.org', 'http://www.pylonsproject.org/', 'http://www.djangoproject.com/', '/downloads/mac-osx/', '/about/success/#business', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://docs.python.org/faq/', '//docs.python.org/3/tutorial/controlflow.html#defining-functions'}
Grab a list of all links on the page, in absolute form (anchors excluded):
>>> r.html.absolute_links
{'https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', 'https://www.python.org/about/success/', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', 'https://www.python.org/dev/peps/', 'https://mail.python.org/mailman/listinfo/python-dev', 'https://www.python.org/doc/', 'https://www.python.org/', 'https://www.python.org/about/', 'https://www.python.org/events/python-events/past/', 'https://devguide.python.org/', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', 'https://docs.python.org/3/tutorial/introduction.html#lists', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', 'http://pyfound.blogspot.com/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://www.python.org/events/python-events', 'https://status.python.org/', 'https://www.python.org/about/apps', 'https://www.python.org/downloads/release/python-2714/', 'https://www.python.org/psf/donations/', 'http://buildbot.net/', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', 'http://wiki.python.org/moin/Languages', 'https://docs.python.org/faq/', 'https://jobs.python.org', 'https://www.python.org/about/success/#software-development', 'https://www.python.org/about/success/#education', 'https://www.python.org/community/logos/', 'https://www.python.org/doc/av', 'https://wiki.qt.io/PySide', 'https://www.python.org/events/python-user-group/660/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'https://www.python.org/dev/peps/peps.rss', 'http://planetpython.org/', 'https://www.python.org/events/python-user-group/past/', 'https://docs.python.org/3/tutorial/controlflow.html#defining-functions', 
'https://www.python.org/community/diversity/', 'https://docs.python.org/3/tutorial/controlflow.html', 'https://www.python.org/community/awards', 'https://www.python.org/events/python-user-group/638/', 'https://www.python.org/about/legal/', 'https://www.python.org/dev/', 'https://www.python.org/download/alternatives', 'https://www.python.org/downloads/', 'https://www.python.org/community/lists/', 'http://www.wxpython.org/', 'https://www.python.org/about/success/#government', 'https://www.python.org/psf/', 'https://www.python.org/psf/codeofconduct/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 'https://www.python.org/downloads/source/', 'https://bugs.python.org/', 'https://www.python.org/downloads/mac-osx/', 'https://www.python.org/about/help/', 'http://tornadoweb.org', 'http://flask.pocoo.org/', 'https://www.python.org/users/membership/', 'http://blog.python.org', 'https://www.python.org/privacy/', 'https://www.python.org/about/gettingstarted/', 'http://python.org/dev/peps/', 'https://www.python.org/about/apps/', 'https://docs.python.org', 'https://www.python.org/success-stories/', 'https://www.python.org/community/forums/', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', 'https://www.python.org/community/merchandise/', 'https://www.python.org/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', 'https://pypi.python.org/', 'https://www.python.org/events/python-user-group/650/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'https://www.python.org/about/quotes/', 'https://www.python.org/downloads/windows/', 'https://www.python.org/events/calendars/', 'http://www.scipy.org', 'https://www.python.org/community/workshops/', 'https://www.python.org/blogs/', 
'https://www.python.org/accounts/signup/', 'https://www.python.org/events/', 'https://kivy.org/', 'http://www.facebook.com/pythonlang?fref=ts', 'http://www.web2py.com/', 'https://www.python.org/psf/sponsorship/sponsors/', 'https://www.python.org/community/', 'https://www.python.org/download/other/', 'https://www.python.org/psf-landing/', 'https://www.python.org/events/python-user-group/665/', 'https://wiki.python.org/moin/BeginnersGuide', 'https://www.python.org/accounts/login/', 'https://www.python.org/downloads/release/python-364/', 'https://www.python.org/dev/core-mentorship/', 'https://www.python.org/about/success/#business', 'https://www.python.org/community/sigs/', 'https://www.python.org/events/python-user-group/', 'http://ipython.org', 'https://www.python.org/shell/', 'https://www.python.org/community/irc/', 'https://www.python.org/about/success/#engineering', 'http://www.pylonsproject.org/', 'http://pycon.blogspot.com/', 'https://www.python.org/about/success/#scientific', 'https://www.python.org/doc/essays/', 'http://www.djangoproject.com/', 'https://www.python.org/success-stories/industrial-light-magic-runs-python/', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://www.python.org/jobs/', 'https://www.python.org/events/python-events/604/'}
Select an element with a CSS Selector:
>>> about = r.html.find('#about', first=True)
Grab an element's text contents:
>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure
Introspect an Element's attributes:
>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}
Render out an Element's HTML:
>>> about.html
'<li aria-haspopup="true" class="tier-1 element-1 " id="about">\n<a class="" href="/about/" title="">About</a>\n<ul aria-hidden="true" class="subnav menu" role="menu">\n<li class="tier-2 element-1" role="treeitem"><a href="/about/apps/" title="">Applications</a></li>\n<li class="tier-2 element-2" role="treeitem"><a href="/about/quotes/" title="">Quotes</a></li>\n<li class="tier-2 element-3" role="treeitem"><a href="/about/gettingstarted/" title="">Getting Started</a></li>\n<li class="tier-2 element-4" role="treeitem"><a href="/about/help/" title="">Help</a></li>\n<li class="tier-2 element-5" role="treeitem"><a href="http://brochure.getpython.info/" title="">Python Brochure</a></li>\n</ul>\n</li>'
Select Elements within Elements:
>>> about.find('a')
[<Element 'a' href='/about/' title='' class=''>, <Element 'a' href='/about/apps/' title=''>, <Element 'a' href='/about/quotes/' title=''>, <Element 'a' href='/about/gettingstarted/' title=''>, <Element 'a' href='/about/help/' title=''>, <Element 'a' href='http://brochure.getpython.info/' title=''>]
Search for links within an element:
>>> about.absolute_links
{'http://brochure.getpython.info/', 'https://www.python.org/about/gettingstarted/', 'https://www.python.org/about/', 'https://www.python.org/about/quotes/', 'https://www.python.org/about/help/', 'https://www.python.org/about/apps/'}
Search for text on the page:
>>> r.html.search('Python is a {} language')[0]
programming
More complex CSS Selector example (copied from Chrome dev tools):
>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'
>>> print(r.html.find(sel, first=True).text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.
XPath is also supported:
>>> r.html.xpath('/html/body/div[1]/a')
[<Element 'a' class=('px-2', 'py-4', 'show-on-focus', 'js-skip-to-content') href='#start-of-content' tabindex='1'>]
JavaScript Support
Let's grab some text that's rendered by JavaScript:
>>> r = session.get('http://python-requests.org')
>>> r.html.render()
>>> r.html.search('Python 2 will retire in only {months} months!')['months']
'<time>25</time>'
Note, the first time you ever run the render() method, it will download Chromium into your home directory (e.g. ~/.pyppeteer/). This only happens once.
Using without Requests
You can also use this library without Requests:
>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""
>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}

 


