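Scraping Meituan Images
- 1 Background
- 2 Data preparation
- 2.1 Loading the data
- 2.2 A look at the first five Meituan merchant URLs
- 3 Overall approach
- 3.1 Defining the regex functions that match image URLs
- 3.1.1 Matching the extra-large image
- 3.1.2 Matching the large images
- 3.1.3 Matching the recommended dishes
- 3.2 Testing the functions above
- 3.2.1 Extra-large image
- 3.2.2 Large images
- 3.2.3 Recommended dishes
- 3.3 Running the full scrape
- 4 Finding empty folders, recording their names, and building URLs to scrape again
- 4.1 Finding the empty folder names
- 4.2 Building the URLs and scraping again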
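1 Background
In an earlier blog post, "Crawler | Dynamically scraping Meituan merchant images with selenium", I shared a case of scraping Meituan merchant images. That run fetched the images with the pixel-size suffix attached, and it was also incomplete: it only scraped the images visible on the page, while many more appear when you swipe left and right. So I recently did a complete scrape. The strategy this time:
- Dump the page source, page_source, directly
- Use regular expressions to match the image URLs in it
- Use a function to download the images from those URLs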
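The first two steps above can be tried without a browser by running a regex over a hard-coded string. This is only a sketch: SAMPLE_SOURCE is a shortened stand-in for driver.page_source built from URLs that appear later in this post, extract_image_urls is a name made up here, and the pattern is one possible choice rather than the one used in the functions below.

```python
import re

# Shortened stand-in for driver.page_source (assumption: real pages are far longer).
SAMPLE_SOURCE = (
    '"photos":{"frontImgUrl":"https://img.meituan.net/msmerchant/'
    '956fec11f0d2760e141a4269dfcf74bf207363.jpg@600w_600h_1l",'
    '"albumImgUrls":["https://img.meituan.net/msmerchant/'
    'f17ebb4360b1cd7b3069979ffc732dfc386058.jpg@600w_600h_1l"]}'
)


def extract_image_urls(page_source):
    """Return every .jpg/.png URL in the text, without the @... size suffix."""
    # [^"\s@]*? cannot cross a quote, whitespace or '@', so each URL is matched on its own.
    return re.findall(r'https?://[^"\s@]*?\.(?:jpg|png)', page_source)


urls = extract_image_urls(SAMPLE_SOURCE)
for url in urls:
    print(url)
```

The third step, downloading, is left out here because it needs network access.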
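2 Data preparation
2.1 Loading the data
- df2
- For privacy reasons the data is not shown here. The only field that matters is each merchant's home page URL, so nothing important is lost.
2.2 A look at the first five Meituan merchant URLs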
df2['美团商户首页网址'].head()
0 https://www.meituan.com/meishi/42998199/
1 https://www.meituan.com/meishi/159397975/
2 https://www.meituan.com/meishi/179937478/
3 https://www.meituan.com/meishi/119822157/
4 https://www.meituan.com/meishi/162756463/
Name: 美团商户首页网址, dtype: object
df2.index
RangeIndex(start=0, stop=702, step=1)
3 Overall approach
Approach:
- Loop over the URLs
- For each URL, create a new folder named after the Meituan merchant id
- Download that merchant's images into the folder
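The folder-per-merchant idea can be sketched as follows. Note the assumptions: the full script later in this post reads the id from the 美团商户id column of df2, whereas this sketch pulls it out of the URL, and merchant_id_from_url and make_merchant_folder are names invented for the example.

```python
import os
import re
import tempfile


def merchant_id_from_url(url):
    """Pull the numeric merchant id out of a .../meishi/<id>/ URL."""
    match = re.search(r'/meishi/(\d+)/?', url)
    if match is None:
        raise ValueError('no merchant id in %r' % url)
    return match.group(1)


def make_merchant_folder(base_dir, url):
    """Create base_dir/<merchant id> if it does not exist and return its path."""
    folder = os.path.join(base_dir, merchant_id_from_url(url))
    os.makedirs(folder, exist_ok=True)
    return folder


base = tempfile.mkdtemp()  # throwaway directory, only for this demonstration
folder = make_merchant_folder(base, 'https://www.meituan.com/meishi/42998199/')
print(os.path.basename(folder))
```

Using os.makedirs with exist_ok=True means the script can be rerun without failing on folders that already exist.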
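3.1 Defining the regex functions that match image URLs
3.1.1 Matching the extra-large image
Taking the merchant 瑶茶坊 as an example, the extra-large image is the merchant's cover photo, stored as frontImgUrl under photos in the page source: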
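import re

def RegHttp_bbig(driver):
    """Return the URL of the merchant's extra-large (cover) image."""
    try:
        a = driver.page_source
        # First narrow the search to a small range
        split = re.compile('photos.*albumImgUrls')
        result = split.findall(a)
        # print(result)
        # Then extract the image URL.
        # Note: the alternation applies to the whole pattern, so this matches
        # either "http...jpg" or the bare string "png".
        split2 = re.compile(r'http\S*jpg|png')
        result2 = split2.findall(result[0])
        result_bigg_s = result2[0]
    except Exception as e:
        print('Error during regex matching: ', e)
        result_bigg_s = []
    return result_bigg_s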
result_bigg_final = RegHttp_bbig(driver)
result_bigg_final
'https://img.meituan.net/msmerchant/956fec11f0d2760e141a4269dfcf74bf207363.jpg'
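One thing worth checking about the pattern http\S*jpg|png: the | splits the whole pattern in two, so for a .png cover image it returns just the string png. That would be consistent with the "unknown url type: 'png'" error that shows up in the run log later in this post. The sketch below compares it with a tightened pattern; the tightened pattern is my own suggestion, not what the functions in this post use.

```python
import re

loose = re.compile(r'http\S*jpg|png')         # the pattern used in this post
tight = re.compile(r'http\S*?\.(?:jpg|png)')  # assumed alternative for this sketch

# A .png URL taken from the output shown in this post, wrapped like the page source.
png_text = (
    '"frontImgUrl":"https://p0.meituan.net/poiskudish/'
    'fd705a6a21520609c040707dd382b9a2570300.png@600w_600h_1l"'
)

loose_result = loose.findall(png_text)
tight_result = tight.findall(png_text)
print(loose_result)  # only the bare extension
print(tight_result)  # the full URL, without the @... suffix
```

The non-greedy \S*? also stops at the first image extension instead of running on to the last one in the string.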
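3.1.2 Matching the large images
Taking the merchant 瑶茶坊 as an example, the large images are the album photos, stored as albumImgUrls in the page source: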
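import re

def MatchChar(x):
    # x is the string to match against
    # Note: [.jpg | .png] is a character class, so the match simply ends at the
    # last character that is one of ". j p g n |" or a space.
    split3 = re.compile('http.*[.jpg | .png]')
    result3 = split3.findall(x)
    return result3

def RegHttp_big(driver):
    # Get the page source
    try:
        a = driver.page_source
        # print(a)
        # First narrow the search to a small range
        split = re.compile('albumImgUrls.*},"recommended')
        result = split.findall(a)
        # print(result)
        # Then extract the image URLs
        split2 = re.compile(r'http\S*[@]')
        result2 = split2.findall(result[0])
        result_big = result2[0].split(',')
        result_big_s = list(map(MatchChar, result_big))
    except Exception as e:
        print('Error during regex matching: ', e)
        result_big_s = []
    return result_big_s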
result_big_final = RegHttp_big(driver)
result_big_final
[['https://img.meituan.net/msmerchant/9e127093b40299a87d10998196aa083747135.jpg'],['https://img.meituan.net/msmerchant/9f36ccc83c00db9ef15e66b99627af1e50131.jpg'],['https://img.meituan.net/msmerchant/4ed98687cebc0471c2610fd2c33c488649544.jpg'],......['https://p0.meituan.net/poiskudish/fd705a6a21520609c040707dd382b9a2570300.png']]
result_big_final[0][0]
'https://img.meituan.net/msmerchant/9e127093b40299a87d10998196aa083747135.jpg'
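Since the album list in the page source looks like a JSON array, another option is to cut out just that array and hand it to the json module instead of splitting on commas. This is a sketch under that assumption; album_urls is a hypothetical helper, and it returns a flat list of strings rather than the nested lists that RegHttp_big returns.

```python
import json
import re

# Shortened stand-in for driver.page_source, using two URLs from the output above.
SAMPLE = (
    '"photos":{"frontImgUrl":"https://img.meituan.net/msmerchant/'
    '9e127093b40299a87d10998196aa083747135.jpg@600w_600h_1l",'
    '"albumImgUrls":["https://img.meituan.net/msmerchant/'
    '9e127093b40299a87d10998196aa083747135.jpg@600w_600h_1l",'
    '"https://p0.meituan.net/poiskudish/'
    'fd705a6a21520609c040707dd382b9a2570300.png@600w_600h_1l"]},"recommended":[]'
)


def album_urls(page_source):
    """Return the album image URLs with the @... size suffix removed."""
    # Assumes the array contains no ']' before its closing bracket.
    match = re.search(r'"albumImgUrls":(\[.*?\])', page_source)
    if match is None:
        return []
    raw_urls = json.loads(match.group(1))
    return [u.split('@', 1)[0] for u in raw_urls]


print(album_urls(SAMPLE))
```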
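3.1.3 Matching the recommended dishes
Taking the merchant 瑶茶坊 as an example, the recommended dishes are the entries of recommended in the page source, each with its own frontImgUrl: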
import re

def RegHttp_rec(driver):
    # Get the page source
    try:
        a = driver.page_source
        # print(a)
        # First narrow the search to a small range
        split = re.compile('recommended.*],"crumbNav')
        result = split.findall(a)
        # print(result)
        # Then extract the image URLs (same alternation caveat as in RegHttp_bbig)
        split2 = re.compile(r'http\S*jpg|png')
        result2 = split2.findall(result[0])
        result_rec = result2[0].split('},{')
        # MatchChar is the helper defined in section 3.1.2
        result_rec_s = list(map(MatchChar, result_rec))
    except Exception as e:
        print('Error during regex matching: ', e)
        result_rec_s = []
    return result_rec_s
result_rec_s = RegHttp_rec(driver)
result_rec_s
[['http://p1.meituan.net/poirichness/menu_18583543_691441813.jpg'],['http://p1.meituan.net/dealwatera/6591fbeef4e76cd00aa1a32a77b83e1e72222.jpg'],['http://p1.meituan.net/poirichness/menu_18583544_691441814.jpg'],['http://p0.meituan.net/poirichness/menu_17917613_683523760.jpg'],['http://qcloud.dpfile.com/pc/u68eK5ImOJ1kCypuryrNI2ZhBwLgY9YrMSDGiZYJwqoEgns8-GSrWUXr4-guZPzNmXKqvF8xz-Pgbz9r8ffpSA.jpg'],['http://qcloud.dpfile.com/pc/4tLk_E_Y1X14VHoiovKNqVBNSyxHdXYaGkM18s1RNFBF_15S8aSMm533JjerbNbwmXKqvF8xz-Pgbz9r8ffpSA.jpg']]
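The recommended list also carries each dish's name, and many dishes have an empty frontImgUrl. If you want to keep the name next to each image, a sketch like the one below may help. It assumes the recommended array is valid JSON and is followed by "crumbNav", as in the page source shown in section 3.2.3; recommended_images is a made-up helper name.

```python
import json
import re

# Shortened stand-in for driver.page_source with one dish that has an image and one without.
SAMPLE = (
    '"recommended":[{"id":"14496153","name":"紫苏叶牛蛙","price":0,'
    '"frontImgUrl":"http://p1.meituan.net/poirichness/'
    'menu_18583543_691441813.jpg@600w_600h_1l"},'
    '{"id":"14496156","name":"烤鸡翅","price":0,"frontImgUrl":""}],"crumbNav":[]'
)


def recommended_images(page_source):
    """Return (dish name, image URL) pairs, skipping dishes without an image."""
    match = re.search(r'"recommended":(\[.*?\]),"crumbNav', page_source)
    if match is None:
        return []
    dishes = json.loads(match.group(1))
    return [
        (dish['name'], dish['frontImgUrl'].split('@', 1)[0])
        for dish in dishes
        if dish['frontImgUrl']
    ]


print(recommended_images(SAMPLE))
```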
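3.2 Testing the functions above
3.2.1 Extra-large image
import re
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://bj.meituan.com/')
# Log in manually
time.sleep(3)
# Now visit the merchant URL
driver.get('https://www.meituan.com/meishi/1801010/')
a = driver.page_source
print(a)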
<html><head><meta charset="utf-8"><meta name="for" content="meituan.com"><title>芙蓉江渔村豆花鱼头火锅(置信路店)_电话地址_营业时间-北京美团网</title><meta name="description" content="北京美团网提供2019最新芙蓉江渔村豆花鱼头火锅(置信路店)电话,营业时间,地址,以及芙蓉江渔村豆花鱼头火锅(置信路店)人均价格,优惠菜单,招牌菜怎么样.查看芙蓉江渔村豆花鱼头火锅(置信路店)团购信息,了解店铺环境/服务怎么样."><meta name="keyword" content="芙蓉江渔村豆花鱼头火锅(置信路店),电话,地址,营业时间">
......
extraInfos":[{"iconUrl":"http://p0.meituan.net/codeman/551290739062eda37e52999e2315f50c1887.png","text":"提供wifi"},{"iconUrl":"http://p1.meituan.net/codeman/4b1c5696fe5bf2c4d23fb01659b3e68b1960.png","text":"停车位"}],"hasFoodSafeInfo":true,"longitude":104.016005,"latitude":30.647733,"avgPrice":67,"brandId":0,"brandName":"","showStatus":1,"isMeishi":true},"photos":{"frontImgUrl":"https://img.meituan.net/msmerchant/956fec11f0d2760e141a4269dfcf74bf207363.jpg@600w_600h_1l","albumImgUrls":["https://img.meituan.net/msmerchant/f17ebb4360b1cd7b3069979ffc732dfc386058.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/6b8442a1b416fc280693ac518ad41935586412.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/02cd329ca9abdfc08f62c39a46f898ee394165.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/6a2e338f317248c5fd01cfc72c483f01358268.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/b34536eace1526969cad40a3087f20ef473478.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/eb61165e637f1d76a2f370e84488b8b4339830.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/436a0acfdd49da16e009285bbaa48755371783.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/0000a1290b10bd01c7ce84e9db355426321403.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/5b17409e47c1b9991f59cb2f5df0514a436195.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/43454e1afdd4af1ad328ae1ee4f21004207872.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/b6ebe8f753ac49f7615a39ee06d9daa273876.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/f68c4c0f83a9a8ebf548e52db45c041b73140.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/a8d03dad7410214c586335bdb352a7d072514.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/0e7b7aa4329f92bfab4031d209ff447b70202.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/858a1859369f78d258a09091f84e310982429.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/f3dfe5354e3047ea0d0e0080af6d77ea213010.jpg@600w_600h_1l","https://img.meituan.net/msmerc
hant/43454e1afdd4af1ad328ae1ee4f21004207872.jpg@600w_600h_1l","http://p1.meituan.net/deal/__47185133__7213327.jpg@600w_600h_1l","http://p1.meituan.net/shaitu/98ff13cdad6c7058905080ac4d55463587680.jpg@600w_600h_1l","http://p1.meituan.net/deal/__48840575__4575359.jpg@600w_600h_1l","http://p0.meituan.net/deal/__47185133__8041792.jpg@600w_600h_1l","http://p1.meituan.net/deal/__48838606__8431233.jpg@600w_600h_1l","http://p0.meituan.net/shaitu/b0e8611f14f117d0fe2b94185193f83e56313.jpg@600w_600h_1l","http://p0.meituan.net/deal/__47185133__3763278.jpg@600w_600h_1l","http://p0.meituan.net/deal/__47179908__5612408.jpg@600w_600h_1l","http://p0.meituan.net/shaitu/5387320f950b9685630545d736d818b0284873.jpg@600w_600h_1l","http://p1.meituan.net/deal/__47185132__9743594.jpg@600w_600h_1l","http://p1.meituan.net/deal/__48838606__7400628.jpg@600w_600h_1l","http://p1.meituan.net/deal/__47185133__6502411.jpg@600w_600h_1l","http://p1.meituan.net/deal/__47185132__7962378.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/51a0f60124cb7bdfe7a0125eb385486b62971.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/8a99b5e4481b273a471b22f210a1f36362341.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/a2fed7151069b4aadf59fcd02305328865447.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/956fec11f0d2760e141a4269dfcf74bf207363.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/588412841cef1a1a1a70d47e8bcf4b2b63514.jpg@600w_600h_1l","http://p1.meituan.net/deal/__47179647__6451784.jpg@600w_600h_1l","http://p1.meituan.net/deal/__47185132__8515815.jpg@600w_600h_1l","http://p1.meituan.net/dealwatera/bb0bd9de97d76ce53d2baf55579ee418167415.jpg@600w_600h_1l","http://p1.meituan.net/dealwatera/46c9afe9e201982589e752f486d3a49c102919.jpg@600w_600h_1l","http://p1.meituan.net/dealwatera/1ba3bef593c68aaa213d89a4d24c880a166243.jpg@600w_600h_1l","http://p0.meituan.net/deal/__47179647__2657511.jpg@600w_600h_1l","http://p1.meituan.net/dealwatera/e8d5fbf0926ebe045ad2d804b14ad4ce124823.jpg@600w_600h_
1l","http://p0.meituan.net/deal/__47179647__6693059.jpg@600w_600h_1l","http://p0.meituan.net/deal/__47179647__7248161.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/1f53e7505f5a2f9126cd106d4c65787180733.jpg@600w_600h_1l"]},"recommended":
# First narrow the search to a small range
split = re.compile('photos.*albumImgUrls')
result = split.findall(a)
print(result)
['photos":{"frontImgUrl":"https://img.meituan.net/msmerchant/956fec11f0d2760e141a4269dfcf74bf207363.jpg@600w_600h_1l","albumImgUrls']
# Then extract the image URL
split2 = re.compile(r'http\S*jpg|png')
result2 = split2.findall(result[0])
result2[0]
'https://img.meituan.net/msmerchant/956fec11f0d2760e141a4269dfcf74bf207363.jpg'
3.2.2 Large images
# Import the required modules
import json
import time
import requests
from bs4 import BeautifulSoup
import pandas as pd
from lxml import etree
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import random
import re

driver = webdriver.Chrome()
driver.get('https://bj.meituan.com/')
# Log in manually
time.sleep(3)
# Now visit the merchant URL
driver.get('https://www.meituan.com/meishi/119822157/')
a = driver.page_source
print(a)
# First narrow the search to a small range
split=re.compile('albumImgUrls.*},"recommended')
result=split.findall(a)
print(result)
['albumImgUrls":["https://img.meituan.net/msmerchant/9e127093b40299a87d10998196aa083747135.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/9f36ccc83c00db9ef15e66b99627af1e50131.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/4ed98687cebc0471c2610fd2c33c488649544.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/be2e186b049b867754caa5d9e0bea6d349263.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/6c4511c530ae967e6840ffb34373c50782962.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/79fe851ff96b87e2307eda359dd5153c683045.png@600w_600h_1l","https://img.meituan.net/msmerchant/6e26e237a84989979dbaf17233312975156044.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/54ceeb7d88845726443d72261deef84c194642.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/6e26e237a84989979dbaf17233312975156044.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/d4e2ef6facb2b8bfcb14d5af59b1f45d99902.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/ecf88d7800040c8a88d3322ace5299f388058.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/178d2076df5b8a513f457fac5825454a187213.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/54ceeb7d88845726443d72261deef84c194642.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/5e1b9bb1376307aa3d8cd024d3698fed94964.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/6dfd67bfd37a2e167ee5dfcb3dff25bc8657860.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/56ef7d97a0b50cae0f9515131dd9315088356.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/aa8339726b0263cff3c14011e601b4b0248709.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/2a27c8929b2534d375c3065ccf4872311457764.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/f4293d0a4bbf100950278df2df181bef2024140.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/6b8ef2dc1c3adae0742307993e8055cb832156.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/42281ddd18eaee41bee89d9d95cdc7cc143557.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/
cc33e7a27d49bbee76c2bd390885179b188091.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/725895b6cb524f75c92a2c4e4261a409164430.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/617f16b6392a774de971be69c564992e228911.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/537fa9f90dc1f1b82035533ccbf8283d168035.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/aa8e4f615a7ee1c6ae9b588d15e6b929217460.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/9f49b48a7811374ff8798c104a1fa3f7200230.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/37a715a5c73087bc8f44573ad9223c22157969.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/68866bb6fea3b9187352ff4cceebdb80139204.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/2e07b318f292d01be95a0ffc940b347a225802.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/f9e31ed96fcc3637d4471c6e1144e1d895098.jpg@600w_600h_1l","http://p1.meituan.net/deal/97df898454dd3bda2c3e4d1b6811ec2095228.jpg@600w_600h_1l","http://p0.meituan.net/deal/590111dd397b91987b07ca348d421d3097108.jpg@600w_600h_1l","http://p0.meituan.net/deal/c9b4bdef0dd47bd1ff2b16c35e25528663365.jpg@600w_600h_1l","http://p1.meituan.net/deal/bc8122ffd4adbc73d43d4c8fcd71c8bd68309.jpg@600w_600h_1l","http://p1.meituan.net/deal/d5261a0a3932e61f354ead96934f023864652.jpg@600w_600h_1l","http://p0.meituan.net/deal/b74c2a8f32d2c0118fa28e0d8515389d44777.jpg@600w_600h_1l","https://p0.meituan.net/poiskudish/fd705a6a21520609c040707dd382b9a2570300.png@600w_600h_1l"]},"recommended']
# Then extract the image URLs
split2 = re.compile(r'http\S*[@]')
result2 = split2.findall(result[0])
result2
['https://img.meituan.net/msmerchant/9e127093b40299a87d10998196aa083747135.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/9f36ccc83c00db9ef15e66b99627af1e50131.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/4ed98687cebc0471c2610fd2c33c488649544.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/be2e186b049b867754caa5d9e0bea6d349263.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/6c4511c530ae967e6840ffb34373c50782962.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/79fe851ff96b87e2307eda359dd5153c683045.png@600w_600h_1l","https://img.meituan.net/msmerchant/6e26e237a84989979dbaf17233312975156044.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/54ceeb7d88845726443d72261deef84c194642.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/6e26e237a84989979dbaf17233312975156044.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/d4e2ef6facb2b8bfcb14d5af59b1f45d99902.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/ecf88d7800040c8a88d3322ace5299f388058.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/178d2076df5b8a513f457fac5825454a187213.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/54ceeb7d88845726443d72261deef84c194642.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/5e1b9bb1376307aa3d8cd024d3698fed94964.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/6dfd67bfd37a2e167ee5dfcb3dff25bc8657860.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/56ef7d97a0b50cae0f9515131dd9315088356.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/aa8339726b0263cff3c14011e601b4b0248709.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/2a27c8929b2534d375c3065ccf4872311457764.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/f4293d0a4bbf100950278df2df181bef2024140.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/6b8ef2dc1c3adae0742307993e8055cb832156.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/42281ddd18eaee41bee89d9d95cdc7cc143557.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/cc33e7a27d49bbee
76c2bd390885179b188091.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/725895b6cb524f75c92a2c4e4261a409164430.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/617f16b6392a774de971be69c564992e228911.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/537fa9f90dc1f1b82035533ccbf8283d168035.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/aa8e4f615a7ee1c6ae9b588d15e6b929217460.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/9f49b48a7811374ff8798c104a1fa3f7200230.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/37a715a5c73087bc8f44573ad9223c22157969.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/68866bb6fea3b9187352ff4cceebdb80139204.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/2e07b318f292d01be95a0ffc940b347a225802.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/f9e31ed96fcc3637d4471c6e1144e1d895098.jpg@600w_600h_1l","http://p1.meituan.net/deal/97df898454dd3bda2c3e4d1b6811ec2095228.jpg@600w_600h_1l","http://p0.meituan.net/deal/590111dd397b91987b07ca348d421d3097108.jpg@600w_600h_1l","http://p0.meituan.net/deal/c9b4bdef0dd47bd1ff2b16c35e25528663365.jpg@600w_600h_1l","http://p1.meituan.net/deal/bc8122ffd4adbc73d43d4c8fcd71c8bd68309.jpg@600w_600h_1l","http://p1.meituan.net/deal/d5261a0a3932e61f354ead96934f023864652.jpg@600w_600h_1l","http://p0.meituan.net/deal/b74c2a8f32d2c0118fa28e0d8515389d44777.jpg@600w_600h_1l","https://p0.meituan.net/poiskudish/fd705a6a21520609c040707dd382b9a2570300.png@']
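The next step is the key part: split on commas, then run one more regex match on each piece, using map to apply the match to the whole list.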
result_big = result2[0].split(',')
result_big
['https://img.meituan.net/msmerchant/9e127093b40299a87d10998196aa083747135.jpg@600w_600h_1l"','"https://img.meituan.net/msmerchant/9f36ccc83c00db9ef15e66b99627af1e50131.jpg@600w_600h_1l"',
......'"http://p0.meituan.net/deal/b74c2a8f32d2c0118fa28e0d8515389d44777.jpg@600w_600h_1l"','"https://p0.meituan.net/poiskudish/fd705a6a21520609c040707dd382b9a2570300.png@']
Define a function that keeps only the part of each piece before the @:
import re

def MatchChar(x):
    # x is the string to match against
    split3 = re.compile('http.*[.jpg | .png]')
    result3 = split3.findall(x)
    return result3
result_big_s = list(map(MatchChar, result_big))
result_big_s[0][0]
'https://img.meituan.net/msmerchant/9e127093b40299a87d10998196aa083747135.jpg'
len(result_big_s)
38
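A note on why MatchChar works here: the square brackets in 'http.*[.jpg | .png]' form a character class, so the match ends at the last character that is a dot, j, p, g, n, | or a space. With the @600w_600h_1l suffix that happens to be the g of .jpg. The sketch below shows the same pattern on an invented .gif URL, next to a simpler alternative that cuts at the @ directly. Both the .gif input and strip_size_suffix are assumptions made for this example.

```python
import re

char_class = re.compile('http.*[.jpg | .png]')  # the pattern used in MatchChar


def strip_size_suffix(piece):
    """Drop surrounding quotes and everything from '@' onwards."""
    return piece.strip('"').partition('@')[0]


real_piece = ('"https://img.meituan.net/msmerchant/'
              '9f36ccc83c00db9ef15e66b99627af1e50131.jpg@600w_600h_1l"')
made_up_piece = 'http://a/b.gif@1l'  # hypothetical input, not from Meituan

print(char_class.findall(real_piece))     # ends at ...jpg, as in the output above
print(char_class.findall(made_up_piece))  # ends at the 'g' of 'gif'
print(strip_size_suffix(real_piece))
print(strip_size_suffix(made_up_piece))
```

Cutting at the @ does not depend on which letters happen to appear in the suffix.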
Test by downloading all the images:
import urllib.request

for i in range(len(result_big_s)):
    img_url = result_big_s[i][0]
    filename = 'big_' + str(i) + '.jpg'
    urllib.request.urlretrieve(img_url, filename=filename)
This works without problems.
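The download loop above saves every file with a .jpg extension, although the URL list also contains a .png image. If that matters for your use, the file name can be derived from the URL. The helper below is a sketch; local_filename is a hypothetical name, and falling back to .jpg for anything unrecognized is my own choice.

```python
import os
from urllib.parse import urlparse


def local_filename(prefix, index, url):
    """Build a file name such as big_3.png from a prefix, an index and the image URL."""
    clean_url = url.split('@', 1)[0]  # drop a size suffix such as @600w_600h_1l
    ext = os.path.splitext(urlparse(clean_url).path)[1].lower()
    if ext not in ('.jpg', '.jpeg', '.png'):
        ext = '.jpg'  # assumed fallback, matching what the loop above always uses
    return '%s_%d%s' % (prefix, index, ext)


print(local_filename(
    'big', 37,
    'https://p0.meituan.net/poiskudish/fd705a6a21520609c040707dd382b9a2570300.png',
))
```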
3.2.3 Recommended dishes
a = driver.page_source
print(a)
# First narrow the search to a small range
split=re.compile('recommended.*],"crumbNav')
result=split.findall(a)
print(result)
['recommended":[{"id":"14496153","name":"紫苏叶牛蛙","price":0,"frontImgUrl":"http://p1.meituan.net/poirichness/menu_18583543_691441813.jpg@600w_600h_1l"},{"id":"157931980","name":"牛蛙爱上基围虾平锅","price":118,"frontImgUrl":"http://p1.meituan.net/dealwatera/6591fbeef4e76cd00aa1a32a77b83e1e72222.jpg@600w_600h_1l"},{"id":"15031996","name":"冰皮麻薯","price":0,"frontImgUrl":"http://p1.meituan.net/poirichness/menu_18583544_691441814.jpg@600w_600h_1l"},{"id":"14910675","name":"土豆","price":8,"frontImgUrl":"http://p0.meituan.net/poirichness/menu_17917613_683523760.jpg@600w_600h_1l"},{"id":"14496155","name":"炭烤猪蹄","price":0,"frontImgUrl":"http://qcloud.dpfile.com/pc/u68eK5ImOJ1kCypuryrNI2ZhBwLgY9YrMSDGiZYJwqoEgns8-GSrWUXr4-guZPzNmXKqvF8xz-Pgbz9r8ffpSA.jpg"},{"id":"14496157","name":"鸡脚","price":0,"frontImgUrl":"http://qcloud.dpfile.com/pc/4tLk_E_Y1X14VHoiovKNqVBNSyxHdXYaGkM18s1RNFBF_15S8aSMm533JjerbNbwmXKqvF8xz-Pgbz9r8ffpSA.jpg"},{"id":"14496156","name":"烤鸡翅","price":0,"frontImgUrl":""},{"id":"144134197","name":"紫苏牛蛙锅","price":138,"frontImgUrl":""},{"id":"200346482","name":"红糖糍粑","price":0,"frontImgUrl":""},{"id":"14496154","name":"桂圆红枣汁","price":0,"frontImgUrl":""},{"id":"157931981","name":"紫苏味平锅","price":108,"frontImgUrl":""},{"id":"144138375","name":"经典牛蛙锅","price":138,"frontImgUrl":""},{"id":"102831228","name":"麻辣藕片","price":0,"frontImgUrl":""},{"id":"133832305","name":"草莓酸奶苏打","price":0,"frontImgUrl":""},{"id":"15031993","name":"皮蛋","price":0,"frontImgUrl":""},{"id":"198259728","name":"德国香肠","price":5,"frontImgUrl":""},{"id":"157931982","name":"皇家海陆空平锅","price":138,"frontImgUrl":""},{"id":"171800525","name":"豆皮","price":0,"frontImgUrl":""},{"id":"15031995","name":"核桃包","price":0,"frontImgUrl":""},{"id":"144134285","name":"螃蟹牛蛙锅","price":168,"frontImgUrl":""},{"id":"157931984","name":"炭烤猪手","price":15,"frontImgUrl":""},{"id":"211913507","name":"黄瓜","price":0,"frontImgUrl":""},{"id":"194909762","name":"糯米糍","price":0,"frontImgUrl":""},{"id":"236161890","name":"排骨","price":10,"frontImgUr
l":""},{"id":"241883728","name":"可乐鸡翅","price":0,"frontImgUrl":""},{"id":"230162051","name":"4","price":0,"frontImgUrl":""},{"id":"218095625","name":"香辣味泡锅","price":0,"frontImgUrl":""},{"id":"241883957","name":"红薯粉","price":0,"frontImgUrl":""},{"id":"201507362","name":"娃娃菜","price":0,"frontImgUrl":""},{"id":"200811400","name":"烤鱼","price":0,"frontImgUrl":""},{"id":"200114454","name":"龙虾","price":0,"frontImgUrl":""},{"id":"198259727","name":"冰淇淋","price":0,"frontImgUrl":""},{"id":"198008716","name":"烤蛙","price":0,"frontImgUrl":""},{"id":"197723225","name":"油条","price":0,"frontImgUrl":""},{"id":"197223569","name":"炸鸡柳","price":0,"frontImgUrl":""},{"id":"196067268","name":"腐竹","price":0,"frontImgUrl":""}],"crumbNav']
# Then extract the image URLs
# split2=re.compile('http:.*[.jpg | .png]')
split2 = re.compile(r'http\S*jpg|png')
result2 = split2.findall(result[0])
result2
['http://p1.meituan.net/poirichness/menu_18583543_691441813.jpg@600w_600h_1l"},{"id":"157931980","name":"牛蛙爱上基围虾平锅","price":118,"frontImgUrl":"http://p1.meituan.net/dealwatera/6591fbeef4e76cd00aa1a32a77b83e1e72222.jpg@600w_600h_1l"},{"id":"15031996","name":"冰皮麻薯","price":0,"frontImgUrl":"http://p1.meituan.net/poirichness/menu_18583544_691441814.jpg@600w_600h_1l"},{"id":"14910675","name":"土豆","price":8,"frontImgUrl":"http://p0.meituan.net/poirichness/menu_17917613_683523760.jpg@600w_600h_1l"},{"id":"14496155","name":"炭烤猪蹄","price":0,"frontImgUrl":"http://qcloud.dpfile.com/pc/u68eK5ImOJ1kCypuryrNI2ZhBwLgY9YrMSDGiZYJwqoEgns8-GSrWUXr4-guZPzNmXKqvF8xz-Pgbz9r8ffpSA.jpg"},{"id":"14496157","name":"鸡脚","price":0,"frontImgUrl":"http://qcloud.dpfile.com/pc/4tLk_E_Y1X14VHoiovKNqVBNSyxHdXYaGkM18s1RNFBF_15S8aSMm533JjerbNbwmXKqvF8xz-Pgbz9r8ffpSA.jpg']
result_rec = result2[0].split('},{')
result_rec
['http://p1.meituan.net/poirichness/menu_18583543_691441813.jpg@600w_600h_1l"','"id":"157931980","name":"牛蛙爱上基围虾平锅","price":118,"frontImgUrl":"http://p1.meituan.net/dealwatera/6591fbeef4e76cd00aa1a32a77b83e1e72222.jpg@600w_600h_1l"','"id":"15031996","name":"冰皮麻薯","price":0,"frontImgUrl":"http://p1.meituan.net/poirichness/menu_18583544_691441814.jpg@600w_600h_1l"','"id":"14910675","name":"土豆","price":8,"frontImgUrl":"http://p0.meituan.net/poirichness/menu_17917613_683523760.jpg@600w_600h_1l"','"id":"14496155","name":"炭烤猪蹄","price":0,"frontImgUrl":"http://qcloud.dpfile.com/pc/u68eK5ImOJ1kCypuryrNI2ZhBwLgY9YrMSDGiZYJwqoEgns8-GSrWUXr4-guZPzNmXKqvF8xz-Pgbz9r8ffpSA.jpg"','"id":"14496157","name":"鸡脚","price":0,"frontImgUrl":"http://qcloud.dpfile.com/pc/4tLk_E_Y1X14VHoiovKNqVBNSyxHdXYaGkM18s1RNFBF_15S8aSMm533JjerbNbwmXKqvF8xz-Pgbz9r8ffpSA.jpg']
result_rec_s = list(map(MatchChar, result_rec))
result_rec_s
[['http://p1.meituan.net/poirichness/menu_18583543_691441813.jpg'],['http://p1.meituan.net/dealwatera/6591fbeef4e76cd00aa1a32a77b83e1e72222.jpg'],['http://p1.meituan.net/poirichness/menu_18583544_691441814.jpg'],['http://p0.meituan.net/poirichness/menu_17917613_683523760.jpg'],['http://qcloud.dpfile.com/pc/u68eK5ImOJ1kCypuryrNI2ZhBwLgY9YrMSDGiZYJwqoEgns8-GSrWUXr4-guZPzNmXKqvF8xz-Pgbz9r8ffpSA.jpg'],['http://qcloud.dpfile.com/pc/4tLk_E_Y1X14VHoiovKNqVBNSyxHdXYaGkM18s1RNFBF_15S8aSMm533JjerbNbwmXKqvF8xz-Pgbz9r8ffpSA.jpg']]
3.3 Running the full scrape
# Import the required modules
import json
import time
from bs4 import BeautifulSoup
import pandas as pd
from lxml import etree
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import random
import os
import re
import urllib.request

# Global timer
t0_all = time.time()

# 1 Start the browser
driver = webdriver.Chrome()
driver.get('https://www.meituan.com')
# Maximize the window
driver.maximize_window()
# Log in manually
time.sleep(3)

# 2 Loop over every URL
# for i in range(0, len(df2)):
for i in range(0, 5):
    # Start timing
    t0 = time.time()
    print('Start downloading images for merchant #%s' % (i+1))
    # 2.1 Create the merchant folder
    path1 = str(df2['美团商户id'][i])
    os.mkdir(path1)
    # 2.2 Visit the merchant URL
    driver.get(df2['美团商户首页网址'][i])
    # Give the page time to load
    time.sleep(5)
    # Scroll down to the recommended dishes, by a fixed pixel offset
    jsCode = "var q=document.documentElement.scrollTop=550"
    driver.execute_script(jsCode)
    # 2.3 Scrape the extra-large image
    print('Start downloading the extra-large image')
    # 2.3.1 Prepare the link and download
    photo_url = RegHttp_bbig(driver)
    filename = path1 + '/' + 'photo.jpg'
    try:
        urllib.request.urlretrieve(photo_url, filename=filename)
    except Exception as e:
        print('Error message: ', e)
    time.sleep(1)
    print('Start downloading large images')
    # 2.3.2 Prepare the links and download
    all_url = RegHttp_big(driver)
    # j (not i) is used in the image loops: reusing i would overwrite the merchant
    # index, and the final print would then report the wrong merchant, which is
    # what the recorded log below shows.
    for j in range(len(all_url)):
        filename = path1 + '/' + 'big_' + str(j) + '.jpg'
        try:
            urllib.request.urlretrieve(all_url[j][0], filename=filename)
        except Exception as e:
            print('Error message: ', e)
    time.sleep(1)
    # 2.4 Scrape the recommended-dish images
    print('Start downloading recommended-dish images')
    # 2.4.1 Prepare the links and download
    all_url_rec = RegHttp_rec(driver)
    for j in range(len(all_url_rec)):
        filename_rec = path1 + '/' + 'rec_' + str(j) + '.jpg'
        try:
            urllib.request.urlretrieve(all_url_rec[j][0], filename=filename_rec)
        except Exception as e:
            print('Error message: ', e)
    time.sleep(1)
    t1 = time.time()
    print('Images for merchant %s (id: %s) downloaded, time taken %.2f' % (df2['美团商户名称'][i], df2['美团商户id'][i], t1-t0))

t1_all = time.time()
print(' %s merchants scraped, time taken %.2f s' % (len(df2), t1_all - t0_all))
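Start downloading images for merchant #1
Start downloading the extra-large image
Start downloading large images
Start downloading recommended-dish images
Images for merchant 新海府幸福大酒楼 (id: 162756463) downloaded, time taken 19.34
Start downloading images for merchant #2
Start downloading the extra-large image
Start downloading large images
Error during regex matching: list index out of range
Start downloading recommended-dish images
Images for merchant 新海府幸福大酒楼 (id: 162756463) downloaded, time taken 13.93
Start downloading images for merchant #3
Start downloading the extra-large image
Start downloading large images
Start downloading recommended-dish images
Images for merchant 麦巴克 (id: 160715653) downloaded, time taken 38.58
Start downloading images for merchant #4
Start downloading the extra-large image
Error message: unknown url type: 'png'
Start downloading large images
Start downloading recommended-dish images
Images for merchant 麦巴克 (id: 160715653) downloaded, time taken 74.74
Start downloading images for merchant #5
Start downloading the extra-large image
Start downloading large images
Error during regex matching: list index out of range
Start downloading recommended-dish images
Images for merchant 麦巴克 (id: 160715653) downloaded, time taken 15.26
702 merchants scraped, time taken 168.52 s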
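The log above reports the same merchant name and id for several different merchants. One possible explanation, assuming the image loops in the run that produced it reused the merchant loop's index variable i, is that each inner loop overwrote i before the summary line was printed. The toy example below reproduces that effect with placeholder data; the function names and the merchant list are invented for illustration.

```python
def report_with_shared_index(names, image_counts):
    """Inner loop reuses i, so the name looked up afterwards can be the wrong one."""
    reported = []
    for i in range(len(names)):
        for i in range(image_counts[i]):  # overwrites the merchant index
            pass
        reported.append(names[i])
    return reported


def report_with_separate_index(names, image_counts):
    """Inner loop uses j, so i still points at the current merchant."""
    reported = []
    for i in range(len(names)):
        for j in range(image_counts[i]):
            pass
        reported.append(names[i])
    return reported


names = ['A', 'B', 'C']      # placeholder merchant names
image_counts = [2, 1, 3]     # placeholder number of images per merchant

print(report_with_shared_index(names, image_counts))
print(report_with_separate_index(names, image_counts))
```

In the shared-index version the name that gets reported depends on how many images the last inner loop handled, not on which merchant was being processed.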
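- Scrape results:
Taking the merchant 瑶茶坊 as an example, the home page shows only 5 images,
but far more images than that were actually downloaded.
- This is the pitfall I missed last time: clicking any of those 5 images reveals many more. Reading the page source directly is therefore more reliable, since it contains the links to all of the images, so the page_source plus regex approach works well.
The following step guards against merchants whose images were not downloaded earlier: find the folders that are still empty, then scrape those merchants again.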
4 Finding empty folders, recording their names, and building URLs to scrape again
4.1 Finding the empty folder names
file_names = os.listdir('.')
file_names = [x for x in file_names if '.' not in x]
file_names
['150569726','163443323','177474079',......'179449164','179162834','188997974']
bad_file = []
for i in range(len(file_names)):
    if len(os.listdir(file_names[i])) == 0:
        bad_file.append(file_names[i])
print(len(bad_file))
187
bad_file
['177474079','178086221',......'179449164','179162834','188997974']
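The same check can be written with pathlib, which tests whether an entry really is a directory instead of relying on the absence of a dot in its name. This is a sketch run against a temporary directory; empty_subdirs is a made-up helper name and the folder names are just sample ids from the lists above.

```python
import tempfile
from pathlib import Path


def empty_subdirs(root):
    """Return the names of the sub-directories of root that contain nothing."""
    root = Path(root)
    return sorted(
        p.name for p in root.iterdir()
        if p.is_dir() and not any(p.iterdir())
    )


# Build a small throwaway layout: one filled folder, one empty folder, one plain file.
root = Path(tempfile.mkdtemp())
(root / '150569726').mkdir()
(root / '150569726' / 'photo.jpg').write_bytes(b'')
(root / '177474079').mkdir()
(root / 'notes.txt').write_text('not a folder')

print(empty_subdirs(root))
```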
4.2 Building the URLs and scraping again
What changes:
- No need to create the folders again
- Change the URL: just build it from the folder name
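Building the URL from a folder name is plain string formatting. A minimal sketch, where merchant_url is a hypothetical helper and the two ids are taken from the bad_file output above:

```python
def merchant_url(merchant_id):
    """Build the merchant page URL from a folder name (the merchant id)."""
    return 'https://www.meituan.com/meishi/%s/' % merchant_id


sample_ids = ['177474079', '178086221']  # two ids from the empty-folder list
urls = [merchant_url(merchant_id) for merchant_id in sample_ids]
for url in urls:
    print(url)
```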
# Import the required modules
import json
import time
from bs4 import BeautifulSoup
import pandas as pd
from lxml import etree
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import random
import os
import re
import urllib.request

# Global timer
t0_all = time.time()

# 1 Start the browser
driver = webdriver.Chrome()
driver.get('https://cq.meituan.com/meishi/165869167/')
# Log in manually
time.sleep(30)

# 2 Loop over every URL
# for i in range(0, len(df2)):
for i in range(170, len(bad_file)):
    # Start timing
    t0 = time.time()
    print('Start downloading images for merchant #%s' % (i+1))
    # 2.1 The merchant folder already exists, so it is not created again
    path1 = bad_file[i]
    # os.mkdir(path1)
    url = 'https://www.meituan.com/meishi/' + bad_file[i] + '/'
    # 2.2 Visit the merchant URL
    driver.get(url)
    # 2.3 Scrape the extra-large image
    print('Start downloading the extra-large image')
    # 2.3.1 Prepare the link and download
    photo_url = RegHttp_bbig(driver)
    filename = path1 + '/' + 'photo.jpg'
    try:
        urllib.request.urlretrieve(photo_url, filename=filename)
    except Exception as e:
        print('Error message: ', e)
    time.sleep(1)
    print('Start downloading large images')
    # 2.3.2 Prepare the links and download (j keeps the merchant index i intact)
    all_url = RegHttp_big(driver)
    for j in range(len(all_url)):
        filename = path1 + '/' + 'big_' + str(j) + '.jpg'
        try:
            urllib.request.urlretrieve(all_url[j][0], filename=filename)
        except Exception as e:
            print('Error message: ', e)
    time.sleep(1)
    # 2.4 Scrape the recommended-dish images
    print('Start downloading recommended-dish images')
    # 2.4.1 Prepare the links and download
    all_url_rec = RegHttp_rec(driver)
    for j in range(len(all_url_rec)):
        filename_rec = path1 + '/' + 'rec_' + str(j) + '.jpg'
        try:
            urllib.request.urlretrieve(all_url_rec[j][0], filename=filename_rec)
        except Exception as e:
            print('Error message: ', e)
    time.sleep(1)
    t1 = time.time()
    # i indexes bad_file here, not df2, so the folder name is reported as the id.
    # Looking up df2 with this i, as the recorded log below did, names an unrelated merchant.
    print('Images for merchant id: %s downloaded, time taken %.2f' % (path1, t1-t0))

t1_all = time.time()
print(' %s merchants scraped, time taken %.2f s' % (len(bad_file) - 170, t1_all - t0_all))
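Start downloading images for merchant #171
Start downloading the extra-large image
Error during regex matching: list index out of range
Error message: expected string or bytes-like object
Start downloading large images
Error during regex matching: list index out of range
Start downloading recommended-dish images
Error during regex matching: list index out of range
Images for merchant 知味鲜风味餐厅 (id: 152769787) downloaded, time taken 2.23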
......