Web Scraping | Dumping page_source + Regex Matching

news/2024/11/24 13:37:41/

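Scraping Meituan Images

  • 1 Background
  • 2 Data preparation
    • 2.1 Loading the data
    • 2.2 A look at the first five Meituan merchant URLs
  • 3 Putting it together
    • 3.1 Defining the regex URL-matching functions
      • 3.1.1 Matching the extra-large image
      • 3.1.2 Matching the large images
      • 3.1.3 Matching the recommended dishes
    • 3.2 Testing the functions
      • 3.2.1 Extra-large image
      • 3.2.2 Large images
      • 3.2.3 Recommended dishes
    • 3.3 Running the full crawl
  • 4 Finding empty folders, recording their names, then rebuilding the URLs and crawling again
    • 4.1 Finding the empty folder names
    • 4.2 Rebuilding the URLs and re-crawling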

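1 Background

In an earlier post, "爬虫 | selenium动态爬取美团商家图片" (scraping Meituan merchant images dynamically with selenium), I shared a case study on scraping Meituan merchant images. That crawl had two shortcomings: the image URLs still carried the pixel-size suffix, and the crawl was incomplete, because it only picked up the images visible on the page while many more appear when you swipe left and right. So I recently ran a complete crawl. The strategy this time:

  • Dump the page's source via page_source
  • Use regular expressions to pull the image URLs straight out of the source
  • Use a function to download each image URL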
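The first two steps above can be sketched on a small, made-up string instead of a live page. The sample markup, the `extract_image_urls` helper and the `example.com` URLs are assumptions for illustration only; with Selenium, `driver.page_source` would take the place of `sample_source`. The download step is left out so that the sketch runs without network access.

```python
import re

# Stand-in for driver.page_source; the structure is invented for illustration.
sample_source = (
    '<script>{"photos":{"albumImgUrls":['
    '"https://example.com/a/111.jpg@600w_600h_1l",'
    '"https://example.com/b/222.png@600w_600h_1l"]}}</script>'
)

# Match from "http" up to the first ".jpg" or ".png"; excluding quotes, "@"
# and whitespace keeps each match inside a single URL.
IMAGE_URL = re.compile(r'http[^"@\s]*?\.(?:jpg|png)')


def extract_image_urls(source):
    """Return every image URL found in the page source, in order."""
    return IMAGE_URL.findall(source)


urls = extract_image_urls(sample_source)
print(urls)
```

Because the pattern stops at the file extension, the `@600w_600h_1l` size suffix is dropped automatically.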

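2 Data preparation

2.1 Loading the data

  • df2
  • For privacy reasons the data is not shown. The only field that matters is the merchant home-page URL, so nothing important is lost.

2.2 A look at the first five Meituan merchant URLs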

df2['美团商户首页网址'].head()
0     https://www.meituan.com/meishi/42998199/
1    https://www.meituan.com/meishi/159397975/
2    https://www.meituan.com/meishi/179937478/
3    https://www.meituan.com/meishi/119822157/
4    https://www.meituan.com/meishi/162756463/
Name: 美团商户首页网址, dtype: object
df2.index
RangeIndex(start=0, stop=702, step=1)

3 Putting it together

Approach:

  • Loop over the URLs
  • For each URL, create a new folder named after the Meituan merchant id
  • Download that merchant's images into the folder
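The folder-per-merchant step can be sketched as follows. The article reads the id from a `美团商户id` column; parsing it out of the URL, as done here, is an assumption for illustration, and `merchant_id_from_url` and `folder_for` are hypothetical helper names.

```python
import os
import re
import tempfile


def merchant_id_from_url(url):
    """Return the numeric id at the end of a /meishi/<id>/ URL, or None."""
    match = re.search(r'/meishi/(\d+)/?$', url)
    return match.group(1) if match else None


def folder_for(url, root):
    """Create (if needed) and return the folder for the merchant behind url."""
    merchant_id = merchant_id_from_url(url)
    if merchant_id is None:
        raise ValueError('no merchant id in %r' % url)
    path = os.path.join(root, merchant_id)
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path


root = tempfile.mkdtemp()  # throwaway directory for the demonstration
folder = folder_for('https://www.meituan.com/meishi/42998199/', root)
print(os.path.basename(folder))
```

Using `os.makedirs(..., exist_ok=True)` means an interrupted crawl can be restarted without failing on folders that already exist.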

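3.1 Defining the regex URL-matching functions

3.1.1 Matching the extra-large image

Taking the merchant 瑶茶坊 as an example, the extra-large image is:
[image]

import re

def RegHttp_bbig(driver):
    try:
        a = driver.page_source
        # First narrow the search to a bounded range
        split = re.compile('photos.*albumImgUrls')
        result = split.findall(a)
        # print(result)
        # Then extract the image URL. The original pattern 'http\S*jpg|png'
        # parses as (http\S*jpg)|(png) and can return the bare string 'png',
        # so the extensions are grouped here.
        split2 = re.compile(r'http\S*?\.(?:jpg|png)')
        result2 = split2.findall(result[0])
        result_bigg_s = result2[0]
    except Exception as e:
        print('Regex matching failed with: ', e)
        result_bigg_s = []
    return result_bigg_s

result_bigg_final = RegHttp_bbig(driver)
result_bigg_final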
'https://img.meituan.net/msmerchant/956fec11f0d2760e141a4269dfcf74bf207363.jpg'

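3.1.2 Matching the large images

Taking the merchant 瑶茶坊 as an example, the large images are:
[image]

import re

def MatchChar(x):
    # x is the string to search. The original pattern 'http.*[.jpg | .png]'
    # used a character class rather than an alternation; the extensions are
    # grouped here so the match ends at ".jpg" or ".png".
    split3 = re.compile(r'http\S*?\.(?:jpg|png)')
    result3 = split3.findall(x)
    return result3

def RegHttp_big(driver):
    # Get the page source
    try:
        a = driver.page_source
        # print(a)
        # First narrow the search to a bounded range
        split = re.compile('albumImgUrls.*},"recommended')
        result = split.findall(a)
        # print(result)
        # Then grab everything from the first "http" to the last "@"
        split2 = re.compile(r'http\S*[@]')
        result2 = split2.findall(result[0])
        result_big = result2[0].split(',')
        result_big_s = list(map(MatchChar, result_big))
    except Exception as e:
        print('Regex matching failed with: ', e)
        result_big_s = []
    return result_big_s

result_big_final = RegHttp_big(driver)
result_big_final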
[['https://img.meituan.net/msmerchant/9e127093b40299a87d10998196aa083747135.jpg'],['https://img.meituan.net/msmerchant/9f36ccc83c00db9ef15e66b99627af1e50131.jpg'],['https://img.meituan.net/msmerchant/4ed98687cebc0471c2610fd2c33c488649544.jpg'],......['https://p0.meituan.net/poiskudish/fd705a6a21520609c040707dd382b9a2570300.png']]
result_big_final[0][0]
'https://img.meituan.net/msmerchant/9e127093b40299a87d10998196aa083747135.jpg'

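3.1.3 Matching the recommended dishes

Taking the merchant 瑶茶坊 as an example, the recommended dishes are:
[image]

import re

def RegHttp_rec(driver):
    # Get the page source
    try:
        a = driver.page_source
        # print(a)
        # First narrow the search to a bounded range
        split = re.compile('recommended.*],"crumbNav')
        result = split.findall(a)
        # print(result)
        # Then grab everything from the first "http" to the last ".jpg"/".png"
        # (the original 'http\S*jpg|png' could also match a bare 'png')
        split2 = re.compile(r'http\S*\.(?:jpg|png)')
        result2 = split2.findall(result[0])
        result_rec = result2[0].split('},{')
        result_rec_s = list(map(MatchChar, result_rec))
    except Exception as e:
        print('Regex matching failed with: ', e)
        result_rec_s = []
    return result_rec_s

result_rec_s = RegHttp_rec(driver)
result_rec_s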
[['http://p1.meituan.net/poirichness/menu_18583543_691441813.jpg'],['http://p1.meituan.net/dealwatera/6591fbeef4e76cd00aa1a32a77b83e1e72222.jpg'],['http://p1.meituan.net/poirichness/menu_18583544_691441814.jpg'],['http://p0.meituan.net/poirichness/menu_17917613_683523760.jpg'],['http://qcloud.dpfile.com/pc/u68eK5ImOJ1kCypuryrNI2ZhBwLgY9YrMSDGiZYJwqoEgns8-GSrWUXr4-guZPzNmXKqvF8xz-Pgbz9r8ffpSA.jpg'],['http://qcloud.dpfile.com/pc/4tLk_E_Y1X14VHoiovKNqVBNSyxHdXYaGkM18s1RNFBF_15S8aSMm533JjerbNbwmXKqvF8xz-Pgbz9r8ffpSA.jpg']]

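3.2 Testing the functions

3.2.1 Extra-large image

import re
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://bj.meituan.com/')
# Log in manually
time.sleep(3)
# Now visit the merchant URL
driver.get('https://www.meituan.com/meishi/1801010/')
a = driver.page_source
print(a)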
<html><head><meta charset="utf-8"><meta name="for" content="meituan.com"><title>芙蓉江渔村豆花鱼头火锅(置信路店)_电话地址_营业时间-北京美团网</title><meta name="description" content="北京美团网提供2019最新芙蓉江渔村豆花鱼头火锅(置信路店)电话,营业时间,地址,以及芙蓉江渔村豆花鱼头火锅(置信路店)人均价格,优惠菜单,招牌菜怎么样.查看芙蓉江渔村豆花鱼头火锅(置信路店)团购信息,了解店铺环境/服务怎么样."><meta name="keyword" content="芙蓉江渔村豆花鱼头火锅(置信路店),电话,地址,营业时间">
......
extraInfos":[{"iconUrl":"http://p0.meituan.net/codeman/551290739062eda37e52999e2315f50c1887.png","text":"提供wifi"},{"iconUrl":"http://p1.meituan.net/codeman/4b1c5696fe5bf2c4d23fb01659b3e68b1960.png","text":"停车位"}],"hasFoodSafeInfo":true,"longitude":104.016005,"latitude":30.647733,"avgPrice":67,"brandId":0,"brandName":"","showStatus":1,"isMeishi":true},"photos":{"frontImgUrl":"https://img.meituan.net/msmerchant/956fec11f0d2760e141a4269dfcf74bf207363.jpg@600w_600h_1l","albumImgUrls":["https://img.meituan.net/msmerchant/f17ebb4360b1cd7b3069979ffc732dfc386058.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/6b8442a1b416fc280693ac518ad41935586412.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/02cd329ca9abdfc08f62c39a46f898ee394165.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/6a2e338f317248c5fd01cfc72c483f01358268.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/b34536eace1526969cad40a3087f20ef473478.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/eb61165e637f1d76a2f370e84488b8b4339830.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/436a0acfdd49da16e009285bbaa48755371783.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/0000a1290b10bd01c7ce84e9db355426321403.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/5b17409e47c1b9991f59cb2f5df0514a436195.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/43454e1afdd4af1ad328ae1ee4f21004207872.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/b6ebe8f753ac49f7615a39ee06d9daa273876.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/f68c4c0f83a9a8ebf548e52db45c041b73140.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/a8d03dad7410214c586335bdb352a7d072514.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/0e7b7aa4329f92bfab4031d209ff447b70202.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/858a1859369f78d258a09091f84e310982429.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/f3dfe5354e3047ea0d0e0080af6d77ea213010.jpg@600w_600h_1l","https://img.meituan.net/msmerc
hant/43454e1afdd4af1ad328ae1ee4f21004207872.jpg@600w_600h_1l","http://p1.meituan.net/deal/__47185133__7213327.jpg@600w_600h_1l","http://p1.meituan.net/shaitu/98ff13cdad6c7058905080ac4d55463587680.jpg@600w_600h_1l","http://p1.meituan.net/deal/__48840575__4575359.jpg@600w_600h_1l","http://p0.meituan.net/deal/__47185133__8041792.jpg@600w_600h_1l","http://p1.meituan.net/deal/__48838606__8431233.jpg@600w_600h_1l","http://p0.meituan.net/shaitu/b0e8611f14f117d0fe2b94185193f83e56313.jpg@600w_600h_1l","http://p0.meituan.net/deal/__47185133__3763278.jpg@600w_600h_1l","http://p0.meituan.net/deal/__47179908__5612408.jpg@600w_600h_1l","http://p0.meituan.net/shaitu/5387320f950b9685630545d736d818b0284873.jpg@600w_600h_1l","http://p1.meituan.net/deal/__47185132__9743594.jpg@600w_600h_1l","http://p1.meituan.net/deal/__48838606__7400628.jpg@600w_600h_1l","http://p1.meituan.net/deal/__47185133__6502411.jpg@600w_600h_1l","http://p1.meituan.net/deal/__47185132__7962378.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/51a0f60124cb7bdfe7a0125eb385486b62971.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/8a99b5e4481b273a471b22f210a1f36362341.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/a2fed7151069b4aadf59fcd02305328865447.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/956fec11f0d2760e141a4269dfcf74bf207363.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/588412841cef1a1a1a70d47e8bcf4b2b63514.jpg@600w_600h_1l","http://p1.meituan.net/deal/__47179647__6451784.jpg@600w_600h_1l","http://p1.meituan.net/deal/__47185132__8515815.jpg@600w_600h_1l","http://p1.meituan.net/dealwatera/bb0bd9de97d76ce53d2baf55579ee418167415.jpg@600w_600h_1l","http://p1.meituan.net/dealwatera/46c9afe9e201982589e752f486d3a49c102919.jpg@600w_600h_1l","http://p1.meituan.net/dealwatera/1ba3bef593c68aaa213d89a4d24c880a166243.jpg@600w_600h_1l","http://p0.meituan.net/deal/__47179647__2657511.jpg@600w_600h_1l","http://p1.meituan.net/dealwatera/e8d5fbf0926ebe045ad2d804b14ad4ce124823.jpg@600w_600h_
1l","http://p0.meituan.net/deal/__47179647__6693059.jpg@600w_600h_1l","http://p0.meituan.net/deal/__47179647__7248161.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/1f53e7505f5a2f9126cd106d4c65787180733.jpg@600w_600h_1l"]},"recommended":
# First narrow the search to a bounded range
split = re.compile('photos.*albumImgUrls')
result = split.findall(a)
print(result)
['photos":{"frontImgUrl":"https://img.meituan.net/msmerchant/956fec11f0d2760e141a4269dfcf74bf207363.jpg@600w_600h_1l","albumImgUrls']
# Then extract the image URL (extensions grouped so that a bare 'png' cannot match)
split2 = re.compile(r'http\S*?\.(?:jpg|png)')
result2 = split2.findall(result[0])
result2[0]
'https://img.meituan.net/msmerchant/956fec11f0d2760e141a4269dfcf74bf207363.jpg'
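One detail worth checking in patterns of the form `http\S*jpg|png`: the `|` operator has the lowest precedence in a regular expression, so the pattern means "`http\S*jpg`, or the literal `png`". The sketch below uses a made-up URL to show the difference; it is consistent with the `unknown url type: 'png'` message that appears in the crawl log later in this article.

```python
import re

text = '"frontImgUrl":"https://example.com/shop/abc.png@600w_600h_1l"'

# "|" has the lowest precedence, so this reads as (http\S*jpg) OR (png).
loose = re.compile(r'http\S*jpg|png')
# Grouping the extensions keeps "http..." attached to both alternatives.
grouped = re.compile(r'http\S*?\.(?:jpg|png)')

print(loose.findall(text))
print(grouped.findall(text))
```

With the loose pattern, a page whose cover image is a PNG yields the three-letter string `png` instead of a URL.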

3.2.2 Large images

# Import the required modules
import json
import time
import requests
from bs4 import BeautifulSoup
import pandas as pd
from lxml import etree
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import random
import re

driver = webdriver.Chrome()
driver.get('https://bj.meituan.com/')
# Log in manually
time.sleep(3)
# Now visit the merchant URL
driver.get('https://www.meituan.com/meishi/119822157/')
a = driver.page_source
print(a)
# First narrow the search to a bounded range
split = re.compile('albumImgUrls.*},"recommended')
result = split.findall(a)
print(result)
['albumImgUrls":["https://img.meituan.net/msmerchant/9e127093b40299a87d10998196aa083747135.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/9f36ccc83c00db9ef15e66b99627af1e50131.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/4ed98687cebc0471c2610fd2c33c488649544.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/be2e186b049b867754caa5d9e0bea6d349263.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/6c4511c530ae967e6840ffb34373c50782962.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/79fe851ff96b87e2307eda359dd5153c683045.png@600w_600h_1l","https://img.meituan.net/msmerchant/6e26e237a84989979dbaf17233312975156044.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/54ceeb7d88845726443d72261deef84c194642.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/6e26e237a84989979dbaf17233312975156044.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/d4e2ef6facb2b8bfcb14d5af59b1f45d99902.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/ecf88d7800040c8a88d3322ace5299f388058.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/178d2076df5b8a513f457fac5825454a187213.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/54ceeb7d88845726443d72261deef84c194642.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/5e1b9bb1376307aa3d8cd024d3698fed94964.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/6dfd67bfd37a2e167ee5dfcb3dff25bc8657860.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/56ef7d97a0b50cae0f9515131dd9315088356.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/aa8339726b0263cff3c14011e601b4b0248709.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/2a27c8929b2534d375c3065ccf4872311457764.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/f4293d0a4bbf100950278df2df181bef2024140.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/6b8ef2dc1c3adae0742307993e8055cb832156.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/42281ddd18eaee41bee89d9d95cdc7cc143557.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/
cc33e7a27d49bbee76c2bd390885179b188091.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/725895b6cb524f75c92a2c4e4261a409164430.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/617f16b6392a774de971be69c564992e228911.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/537fa9f90dc1f1b82035533ccbf8283d168035.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/aa8e4f615a7ee1c6ae9b588d15e6b929217460.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/9f49b48a7811374ff8798c104a1fa3f7200230.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/37a715a5c73087bc8f44573ad9223c22157969.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/68866bb6fea3b9187352ff4cceebdb80139204.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/2e07b318f292d01be95a0ffc940b347a225802.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/f9e31ed96fcc3637d4471c6e1144e1d895098.jpg@600w_600h_1l","http://p1.meituan.net/deal/97df898454dd3bda2c3e4d1b6811ec2095228.jpg@600w_600h_1l","http://p0.meituan.net/deal/590111dd397b91987b07ca348d421d3097108.jpg@600w_600h_1l","http://p0.meituan.net/deal/c9b4bdef0dd47bd1ff2b16c35e25528663365.jpg@600w_600h_1l","http://p1.meituan.net/deal/bc8122ffd4adbc73d43d4c8fcd71c8bd68309.jpg@600w_600h_1l","http://p1.meituan.net/deal/d5261a0a3932e61f354ead96934f023864652.jpg@600w_600h_1l","http://p0.meituan.net/deal/b74c2a8f32d2c0118fa28e0d8515389d44777.jpg@600w_600h_1l","https://p0.meituan.net/poiskudish/fd705a6a21520609c040707dd382b9a2570300.png@600w_600h_1l"]},"recommended']
# Then extract the image URLs
split2 = re.compile(r'http\S*[@]')
result2 = split2.findall(result[0])
result2
['https://img.meituan.net/msmerchant/9e127093b40299a87d10998196aa083747135.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/9f36ccc83c00db9ef15e66b99627af1e50131.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/4ed98687cebc0471c2610fd2c33c488649544.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/be2e186b049b867754caa5d9e0bea6d349263.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/6c4511c530ae967e6840ffb34373c50782962.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/79fe851ff96b87e2307eda359dd5153c683045.png@600w_600h_1l","https://img.meituan.net/msmerchant/6e26e237a84989979dbaf17233312975156044.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/54ceeb7d88845726443d72261deef84c194642.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/6e26e237a84989979dbaf17233312975156044.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/d4e2ef6facb2b8bfcb14d5af59b1f45d99902.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/ecf88d7800040c8a88d3322ace5299f388058.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/178d2076df5b8a513f457fac5825454a187213.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/54ceeb7d88845726443d72261deef84c194642.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/5e1b9bb1376307aa3d8cd024d3698fed94964.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/6dfd67bfd37a2e167ee5dfcb3dff25bc8657860.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/56ef7d97a0b50cae0f9515131dd9315088356.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/aa8339726b0263cff3c14011e601b4b0248709.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/2a27c8929b2534d375c3065ccf4872311457764.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/f4293d0a4bbf100950278df2df181bef2024140.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/6b8ef2dc1c3adae0742307993e8055cb832156.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/42281ddd18eaee41bee89d9d95cdc7cc143557.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/cc33e7a27d49bbee
76c2bd390885179b188091.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/725895b6cb524f75c92a2c4e4261a409164430.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/617f16b6392a774de971be69c564992e228911.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/537fa9f90dc1f1b82035533ccbf8283d168035.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/aa8e4f615a7ee1c6ae9b588d15e6b929217460.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/9f49b48a7811374ff8798c104a1fa3f7200230.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/37a715a5c73087bc8f44573ad9223c22157969.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/68866bb6fea3b9187352ff4cceebdb80139204.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/2e07b318f292d01be95a0ffc940b347a225802.jpg@600w_600h_1l","https://img.meituan.net/msmerchant/f9e31ed96fcc3637d4471c6e1144e1d895098.jpg@600w_600h_1l","http://p1.meituan.net/deal/97df898454dd3bda2c3e4d1b6811ec2095228.jpg@600w_600h_1l","http://p0.meituan.net/deal/590111dd397b91987b07ca348d421d3097108.jpg@600w_600h_1l","http://p0.meituan.net/deal/c9b4bdef0dd47bd1ff2b16c35e25528663365.jpg@600w_600h_1l","http://p1.meituan.net/deal/bc8122ffd4adbc73d43d4c8fcd71c8bd68309.jpg@600w_600h_1l","http://p1.meituan.net/deal/d5261a0a3932e61f354ead96934f023864652.jpg@600w_600h_1l","http://p0.meituan.net/deal/b74c2a8f32d2c0118fa28e0d8515389d44777.jpg@600w_600h_1l","https://p0.meituan.net/poiskudish/fd705a6a21520609c040707dd382b9a2570300.png@']

The next step is the highlight: split on commas, then run a regex over each piece, with map applying the match to every piece.

result_big = result2[0].split(',')
result_big
['https://img.meituan.net/msmerchant/9e127093b40299a87d10998196aa083747135.jpg@600w_600h_1l"','"https://img.meituan.net/msmerchant/9f36ccc83c00db9ef15e66b99627af1e50131.jpg@600w_600h_1l"',
......'"http://p0.meituan.net/deal/b74c2a8f32d2c0118fa28e0d8515389d44777.jpg@600w_600h_1l"','"https://p0.meituan.net/poiskudish/fd705a6a21520609c040707dd382b9a2570300.png@']

Define a helper that keeps only the part of each item before the @:

def MatchChar(x):
    # x is the string to search
    split3 = re.compile(r'http\S*?\.(?:jpg|png)')
    result3 = split3.findall(x)
    return result3
result_big_s = list(map(MatchChar, result_big))
result_big_s[0][0]
'https://img.meituan.net/msmerchant/9e127093b40299a87d10998196aa083747135.jpg'
len(result_big_s)
38
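As an alternative to splitting on commas and matching each piece, the URLs can be collected with a single `findall`. This is only a sketch on a shortened, made-up fragment shaped like the `albumImgUrls` output above; note that it returns a flat list of strings rather than the list of one-element lists that `MatchChar` produces.

```python
import re

# Shortened stand-in for the "albumImgUrls" fragment (URLs are made up).
fragment = (
    'albumImgUrls":["https://example.com/m/aaa.jpg@600w_600h_1l",'
    '"https://example.com/m/bbb.png@600w_600h_1l",'
    '"http://example.com/deal/ccc.jpg@600w_600h_1l"]},"recommended'
)

# [^"@]*? cannot cross a quote or the "@" size suffix, so each match stays
# inside one URL and stops at the first .jpg/.png.
url_pattern = re.compile(r'http[^"@]*?\.(?:jpg|png)')

flat_urls = url_pattern.findall(fragment)
print(len(flat_urls))
print(flat_urls[1])
```

Code that indexes the result as `urls[i][0]` would need to change to `urls[i]` if this variant were used.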

Test downloading all of the images:

import urllib.request

for i in range(len(result_big_s)):
    img_url = result_big_s[i][0]
    filename = 'big_' + str(i) + '.jpg'
    urllib.request.urlretrieve(img_url, filename=filename)

No problems.
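The download loop names every file `.jpg`, although the album list also contains `.png` entries. A minimal sketch of taking the extension from the URL instead; `filename_for` is a hypothetical helper and the URLs are made up.

```python
import os
from urllib.parse import urlparse


def filename_for(prefix, index, url, default_ext='.jpg'):
    """Build e.g. 'big_3.png', taking the extension from the URL path."""
    path = urlparse(url).path.split('@')[0]  # drop any "@600w_600h_1l" suffix
    ext = os.path.splitext(path)[1].lower()
    if ext not in ('.jpg', '.png'):
        ext = default_ext  # fall back when the URL has no usable extension
    return '%s_%d%s' % (prefix, index, ext)


print(filename_for('big', 0, 'https://example.com/m/aaa.jpg'))
print(filename_for('big', 5, 'https://example.com/m/bbb.png'))
```

The returned name can be passed as the `filename` argument of `urllib.request.urlretrieve`.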

3.2.3 Recommended dishes

a = driver.page_source
print(a)
# First narrow the search to a bounded range
split = re.compile('recommended.*],"crumbNav')
result = split.findall(a)
print(result)
['recommended":[{"id":"14496153","name":"紫苏叶牛蛙","price":0,"frontImgUrl":"http://p1.meituan.net/poirichness/menu_18583543_691441813.jpg@600w_600h_1l"},{"id":"157931980","name":"牛蛙爱上基围虾平锅","price":118,"frontImgUrl":"http://p1.meituan.net/dealwatera/6591fbeef4e76cd00aa1a32a77b83e1e72222.jpg@600w_600h_1l"},{"id":"15031996","name":"冰皮麻薯","price":0,"frontImgUrl":"http://p1.meituan.net/poirichness/menu_18583544_691441814.jpg@600w_600h_1l"},{"id":"14910675","name":"土豆","price":8,"frontImgUrl":"http://p0.meituan.net/poirichness/menu_17917613_683523760.jpg@600w_600h_1l"},{"id":"14496155","name":"炭烤猪蹄","price":0,"frontImgUrl":"http://qcloud.dpfile.com/pc/u68eK5ImOJ1kCypuryrNI2ZhBwLgY9YrMSDGiZYJwqoEgns8-GSrWUXr4-guZPzNmXKqvF8xz-Pgbz9r8ffpSA.jpg"},{"id":"14496157","name":"鸡脚","price":0,"frontImgUrl":"http://qcloud.dpfile.com/pc/4tLk_E_Y1X14VHoiovKNqVBNSyxHdXYaGkM18s1RNFBF_15S8aSMm533JjerbNbwmXKqvF8xz-Pgbz9r8ffpSA.jpg"},{"id":"14496156","name":"烤鸡翅","price":0,"frontImgUrl":""},{"id":"144134197","name":"紫苏牛蛙锅","price":138,"frontImgUrl":""},{"id":"200346482","name":"红糖糍粑","price":0,"frontImgUrl":""},{"id":"14496154","name":"桂圆红枣汁","price":0,"frontImgUrl":""},{"id":"157931981","name":"紫苏味平锅","price":108,"frontImgUrl":""},{"id":"144138375","name":"经典牛蛙锅","price":138,"frontImgUrl":""},{"id":"102831228","name":"麻辣藕片","price":0,"frontImgUrl":""},{"id":"133832305","name":"草莓酸奶苏打","price":0,"frontImgUrl":""},{"id":"15031993","name":"皮蛋","price":0,"frontImgUrl":""},{"id":"198259728","name":"德国香肠","price":5,"frontImgUrl":""},{"id":"157931982","name":"皇家海陆空平锅","price":138,"frontImgUrl":""},{"id":"171800525","name":"豆皮","price":0,"frontImgUrl":""},{"id":"15031995","name":"核桃包","price":0,"frontImgUrl":""},{"id":"144134285","name":"螃蟹牛蛙锅","price":168,"frontImgUrl":""},{"id":"157931984","name":"炭烤猪手","price":15,"frontImgUrl":""},{"id":"211913507","name":"黄瓜","price":0,"frontImgUrl":""},{"id":"194909762","name":"糯米糍","price":0,"frontImgUrl":""},{"id":"236161890","name":"排骨","price":10,"frontImgUr
l":""},{"id":"241883728","name":"可乐鸡翅","price":0,"frontImgUrl":""},{"id":"230162051","name":"4","price":0,"frontImgUrl":""},{"id":"218095625","name":"香辣味泡锅","price":0,"frontImgUrl":""},{"id":"241883957","name":"红薯粉","price":0,"frontImgUrl":""},{"id":"201507362","name":"娃娃菜","price":0,"frontImgUrl":""},{"id":"200811400","name":"烤鱼","price":0,"frontImgUrl":""},{"id":"200114454","name":"龙虾","price":0,"frontImgUrl":""},{"id":"198259727","name":"冰淇淋","price":0,"frontImgUrl":""},{"id":"198008716","name":"烤蛙","price":0,"frontImgUrl":""},{"id":"197723225","name":"油条","price":0,"frontImgUrl":""},{"id":"197223569","name":"炸鸡柳","price":0,"frontImgUrl":""},{"id":"196067268","name":"腐竹","price":0,"frontImgUrl":""}],"crumbNav']
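# Then extract the image URLs
# split2=re.compile('http:.*[.jpg | .png]')
# Grouping the extensions avoids the bare 'png' match that 'http\S*jpg|png' allows
split2 = re.compile(r'http\S*\.(?:jpg|png)')
result2 = split2.findall(result[0])
result2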
['http://p1.meituan.net/poirichness/menu_18583543_691441813.jpg@600w_600h_1l"},{"id":"157931980","name":"牛蛙爱上基围虾平锅","price":118,"frontImgUrl":"http://p1.meituan.net/dealwatera/6591fbeef4e76cd00aa1a32a77b83e1e72222.jpg@600w_600h_1l"},{"id":"15031996","name":"冰皮麻薯","price":0,"frontImgUrl":"http://p1.meituan.net/poirichness/menu_18583544_691441814.jpg@600w_600h_1l"},{"id":"14910675","name":"土豆","price":8,"frontImgUrl":"http://p0.meituan.net/poirichness/menu_17917613_683523760.jpg@600w_600h_1l"},{"id":"14496155","name":"炭烤猪蹄","price":0,"frontImgUrl":"http://qcloud.dpfile.com/pc/u68eK5ImOJ1kCypuryrNI2ZhBwLgY9YrMSDGiZYJwqoEgns8-GSrWUXr4-guZPzNmXKqvF8xz-Pgbz9r8ffpSA.jpg"},{"id":"14496157","name":"鸡脚","price":0,"frontImgUrl":"http://qcloud.dpfile.com/pc/4tLk_E_Y1X14VHoiovKNqVBNSyxHdXYaGkM18s1RNFBF_15S8aSMm533JjerbNbwmXKqvF8xz-Pgbz9r8ffpSA.jpg']
result_rec = result2[0].split('},{')
result_rec
['http://p1.meituan.net/poirichness/menu_18583543_691441813.jpg@600w_600h_1l"','"id":"157931980","name":"牛蛙爱上基围虾平锅","price":118,"frontImgUrl":"http://p1.meituan.net/dealwatera/6591fbeef4e76cd00aa1a32a77b83e1e72222.jpg@600w_600h_1l"','"id":"15031996","name":"冰皮麻薯","price":0,"frontImgUrl":"http://p1.meituan.net/poirichness/menu_18583544_691441814.jpg@600w_600h_1l"','"id":"14910675","name":"土豆","price":8,"frontImgUrl":"http://p0.meituan.net/poirichness/menu_17917613_683523760.jpg@600w_600h_1l"','"id":"14496155","name":"炭烤猪蹄","price":0,"frontImgUrl":"http://qcloud.dpfile.com/pc/u68eK5ImOJ1kCypuryrNI2ZhBwLgY9YrMSDGiZYJwqoEgns8-GSrWUXr4-guZPzNmXKqvF8xz-Pgbz9r8ffpSA.jpg"','"id":"14496157","name":"鸡脚","price":0,"frontImgUrl":"http://qcloud.dpfile.com/pc/4tLk_E_Y1X14VHoiovKNqVBNSyxHdXYaGkM18s1RNFBF_15S8aSMm533JjerbNbwmXKqvF8xz-Pgbz9r8ffpSA.jpg']
result_rec_s = list(map(MatchChar, result_rec))
result_rec_s
[['http://p1.meituan.net/poirichness/menu_18583543_691441813.jpg'],['http://p1.meituan.net/dealwatera/6591fbeef4e76cd00aa1a32a77b83e1e72222.jpg'],['http://p1.meituan.net/poirichness/menu_18583544_691441814.jpg'],['http://p0.meituan.net/poirichness/menu_17917613_683523760.jpg'],['http://qcloud.dpfile.com/pc/u68eK5ImOJ1kCypuryrNI2ZhBwLgY9YrMSDGiZYJwqoEgns8-GSrWUXr4-guZPzNmXKqvF8xz-Pgbz9r8ffpSA.jpg'],['http://qcloud.dpfile.com/pc/4tLk_E_Y1X14VHoiovKNqVBNSyxHdXYaGkM18s1RNFBF_15S8aSMm533JjerbNbwmXKqvF8xz-Pgbz9r8ffpSA.jpg']]
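The `recommended` fragment printed above looks like a JSON array of dish records, so another option is to parse it rather than split it. The sketch below assumes the fragment is valid JSON, which this article does not verify; the sample source, dish names and URLs are made up.

```python
import json
import re

# Shortened stand-in for the page source; dish names and URLs are made up.
source = (
    '..."recommended":[{"id":"1","name":"dish A","price":0,'
    '"frontImgUrl":"http://example.com/menu/a.jpg@600w_600h_1l"},'
    '{"id":"2","name":"dish B","price":8,"frontImgUrl":""}],"crumbNav":[]...'
)

# Capture the array between "recommended": and ,"crumbNav" (same markers as above).
match = re.search(r'"recommended":(\[.*?\]),"crumbNav"', source, re.S)
dishes = json.loads(match.group(1)) if match else []

# Keep only dishes that have an image, and drop the "@..." size suffix.
dish_images = [
    (dish['name'], dish['frontImgUrl'].split('@')[0])
    for dish in dishes
    if dish.get('frontImgUrl')
]
print(dish_images)
```

Parsing keeps each dish name next to its image URL and skips the many entries whose `frontImgUrl` is empty.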

3.3 Running the full crawl

# Import the required modules
import json
import time
from bs4 import BeautifulSoup
import pandas as pd
from lxml import etree
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import random
import os
import re
import urllib.request

# Global timer
t0_all = time.time()

# 1 Start the browser
driver = webdriver.Chrome()
driver.get('https://www.meituan.com')
# Maximize the window
driver.maximize_window()
# Log in manually
time.sleep(3)

# 2 Loop over every URL
# for i in range(0, len(df2)):
for i in range(0, 5):
    # Start the per-merchant timer
    t0 = time.time()
    print('Downloading images for merchant #%s' % (i + 1))

    # 2.1 Create the merchant folder
    path1 = str(df2['美团商户id'][i])
    os.makedirs(path1, exist_ok=True)

    # 2.2 Visit the merchant URL
    driver.get(df2['美团商户首页网址'][i])
    # Give the page time to load
    time.sleep(5)
    # Then scroll down to the recommended dishes (fixed pixel offset)
    jsCode = "var q=document.documentElement.scrollTop=550"
    driver.execute_script(jsCode)

    # 2.3 Download the extra-large image
    print('Downloading the extra-large image')
    photo_url = RegHttp_bbig(driver)
    filename = path1 + '/' + 'photo.jpg'
    try:
        urllib.request.urlretrieve(photo_url, filename=filename)
    except Exception as e:
        print('Error: ', e)
    time.sleep(1)

    # 2.4 Download the large images. The inner loops use j rather than i,
    # so that i still identifies the merchant in the final print.
    print('Downloading the large images')
    all_url = RegHttp_big(driver)
    for j in range(len(all_url)):
        filename = path1 + '/' + 'big_' + str(j) + '.jpg'
        try:
            urllib.request.urlretrieve(all_url[j][0], filename=filename)
        except Exception as e:
            print('Error: ', e)
        # Pause between downloads
        time.sleep(1)

    # 2.5 Download the recommended-dish images
    print('Downloading the recommended-dish images')
    all_url_rec = RegHttp_rec(driver)
    for j in range(len(all_url_rec)):
        filename_rec = path1 + '/' + 'rec_' + str(j) + '.jpg'
        try:
            urllib.request.urlretrieve(all_url_rec[j][0], filename=filename_rec)
        except Exception as e:
            print('Error: ', e)
        # Pause between downloads
        time.sleep(1)

    t1 = time.time()
    print('Merchant %s id: %s: images downloaded in %.2f' % (df2['美团商户名称'][i], df2['美团商户id'][i], t1 - t0))

t1_all = time.time()
print(' %s merchants crawled in %.2f s' % (len(df2), t1_all - t0_all))
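Downloading images for merchant #1
Downloading the extra-large image
Downloading the large images
Downloading the recommended-dish images
Merchant 新海府幸福大酒楼 id: 162756463: images downloaded in 19.34
Downloading images for merchant #2
Downloading the extra-large image
Downloading the large images
Regex matching failed with:  list index out of range
Downloading the recommended-dish images
Merchant 新海府幸福大酒楼 id: 162756463: images downloaded in 13.93
Downloading images for merchant #3
Downloading the extra-large image
Downloading the large images
Downloading the recommended-dish images
Merchant 麦巴克 id: 160715653: images downloaded in 38.58
Downloading images for merchant #4
Downloading the extra-large image
Error:  unknown url type: 'png'
Downloading the large images
Downloading the recommended-dish images
Merchant 麦巴克 id: 160715653: images downloaded in 74.74
Downloading images for merchant #5
Downloading the extra-large image
Downloading the large images
Regex matching failed with:  list index out of range
Downloading the recommended-dish images
Merchant 麦巴克 id: 160715653: images downloaded in 15.26
 702 merchants crawled in 168.52 s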
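The log above names the same merchant on several "finished" lines. That is the kind of output produced when an inner loop reuses the outer loop's variable, so the final lookup uses the last image index instead of the merchant index. A small sketch with made-up names:

```python
merchants = ['shop A', 'shop B', 'shop C']
images_per_merchant = [2, 3, 1]

reported = []
for i in range(len(merchants)):
    for i in range(images_per_merchant[i]):  # inner loop overwrites i
        pass  # a download would happen here
    reported.append(merchants[i])  # i is now the last image index

fixed = []
for i in range(len(merchants)):
    for j in range(images_per_merchant[i]):  # separate name for the inner loop
        pass
    fixed.append(merchants[i])

print(reported)
print(fixed)
```

Giving the inner loop its own variable keeps the merchant index intact for the summary line.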
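  • Crawl results:
    [image]
    Taking the merchant 瑶茶坊 as an example, its home page shows only 5 images:
    [image]
    But far more images were actually downloaded, as the next figure shows:
    [image]
  • This is the pitfall I missed last time: clicking any one of those 5 images reveals many more. Reading the source directly is therefore more reliable, because it stores the links to all of the images, which is why page_source plus regex matching works well here.

The following steps guard against merchants whose images failed to download earlier: find the folders that are still empty, then crawl those merchants again.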

4 Finding empty folders, recording their names, then rebuilding the URLs and crawling again

4.1 Finding the empty folder names

file_names = os.listdir('.')
file_names = [x for x in file_names if '.' not in x]
file_names
['150569726','163443323','177474079',......'179449164','179162834','188997974']
bad_file = []
for i in range(len(file_names)):
    if len(os.listdir(file_names[i])) == 0:
        bad_file.append(file_names[i])
print(len(bad_file))
187
bad_file
['177474079','178086221',......'179449164','179162834','188997974']
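The `'.' not in x` filter assumes that every file name contains a dot and no folder name does. A sketch of the same search that checks for directories explicitly; `empty_subfolders` is a hypothetical helper, and the demonstration runs in a temporary directory with made-up names.

```python
import os
import tempfile


def empty_subfolders(root):
    """Return the names of direct subfolders of root that contain no entries."""
    empty = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir() and not os.listdir(entry.path):
                empty.append(entry.name)
    return sorted(empty)


# Demonstration in a throwaway directory with made-up folder names.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, '111'))
os.mkdir(os.path.join(root, '222'))
with open(os.path.join(root, '222', 'photo.jpg'), 'wb') as handle:
    handle.write(b'')  # any file makes the folder non-empty
with open(os.path.join(root, 'notes.txt'), 'w') as handle:
    handle.write('not a folder')

print(empty_subfolders(root))
```

Calling `empty_subfolders('.')` in the crawl directory would give the equivalent of `bad_file`.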

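4.2 Rebuilding the URLs and re-crawling

What changes:

  • There is no need to create the folders again
  • The URL is rebuilt by concatenating the base address and the merchant id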
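Rebuilding the URLs can be sketched as below. `url_for` is a hypothetical helper; the digit check is an added assumption that guards against stray folder names ending up in the retry list.

```python
BASE_URL = 'https://www.meituan.com/meishi/'


def url_for(merchant_id):
    """Rebuild a merchant home-page URL from its id (taken from a folder name)."""
    merchant_id = str(merchant_id).strip()
    if not merchant_id.isdigit():
        raise ValueError('unexpected folder name: %r' % merchant_id)
    return BASE_URL + merchant_id + '/'


retry_ids = ['177474079', '178086221']  # first two ids from bad_file above
retry_urls = [url_for(merchant_id) for merchant_id in retry_ids]
print(retry_urls[0])
```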
# Import the required modules
import json
import time
from bs4 import BeautifulSoup
import pandas as pd
from lxml import etree
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import random
import os
import re
import urllib.request

# Global timer
t0_all = time.time()

# 1 Start the browser
driver = webdriver.Chrome()
driver.get('https://cq.meituan.com/meishi/165869167/')
# Log in manually
time.sleep(30)

# 2 Loop over the merchants whose folders are still empty
# for i in range(0, len(df2)):
for i in range(170, len(bad_file)):
    # Start the per-merchant timer
    t0 = time.time()
    print('Downloading images for merchant #%s' % (i + 1))

    # 2.1 Reuse the existing merchant folder
    path1 = bad_file[i]
    # os.mkdir(path1)
    url = 'https://www.meituan.com/meishi/' + bad_file[i] + '/'

    # 2.2 Visit the merchant URL
    driver.get(url)

    # 2.3 Download the extra-large image
    print('Downloading the extra-large image')
    photo_url = RegHttp_bbig(driver)
    filename = path1 + '/' + 'photo.jpg'
    try:
        urllib.request.urlretrieve(photo_url, filename=filename)
    except Exception as e:
        print('Error: ', e)
    time.sleep(1)

    # 2.4 Download the large images (j, not i, so the outer index survives)
    print('Downloading the large images')
    all_url = RegHttp_big(driver)
    for j in range(len(all_url)):
        filename = path1 + '/' + 'big_' + str(j) + '.jpg'
        try:
            urllib.request.urlretrieve(all_url[j][0], filename=filename)
        except Exception as e:
            print('Error: ', e)
        # Pause between downloads
        time.sleep(1)

    # 2.5 Download the recommended-dish images
    print('Downloading the recommended-dish images')
    all_url_rec = RegHttp_rec(driver)
    for j in range(len(all_url_rec)):
        filename_rec = path1 + '/' + 'rec_' + str(j) + '.jpg'
        try:
            urllib.request.urlretrieve(all_url_rec[j][0], filename=filename_rec)
        except Exception as e:
            print('Error: ', e)
        # Pause between downloads
        time.sleep(1)

    t1 = time.time()
    # bad_file[i] is the merchant id; df2 is not in bad_file order, so the
    # original df2[...][i] lookups are replaced by the id itself.
    print('Merchant id: %s: images downloaded in %.2f' % (bad_file[i], t1 - t0))

t1_all = time.time()
print(' %s merchants crawled in %.2f s' % (len(bad_file) - 170, t1_all - t0_all))
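Downloading images for merchant #171
Downloading the extra-large image
Regex matching failed with:  list index out of range
Error:  expected string or bytes-like object
Downloading the large images
Regex matching failed with:  list index out of range
Downloading the recommended-dish images
Regex matching failed with:  list index out of range
Merchant 知味鲜风味餐厅 id: 152769787: images downloaded in 2.23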
......
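Each `list index out of range` message means a `findall` call returned an empty list, that is, the expected marker text was not in `page_source` when it was read. One possible cause (an assumption, not something this article confirms) is that the page had not finished loading or showed a login or verification screen. A sketch of polling for a marker instead of reading the source once; `wait_for_marker` is a hypothetical helper, and the fake page below stands in for a browser.

```python
import time


def wait_for_marker(get_source, marker, timeout=10.0, poll=0.5):
    """Poll get_source() until marker appears; return the source or None."""
    deadline = time.monotonic() + timeout
    while True:
        source = get_source()
        if marker in source:
            return source
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


# Fake page that only contains the marker on the third read.
reads = iter(['<html></html>', '<html>loading</html>', '{"albumImgUrls":[]}'])
source = wait_for_marker(lambda: next(reads), 'albumImgUrls', timeout=5, poll=0.01)
print(source)
```

With Selenium this would be called as `wait_for_marker(lambda: driver.page_source, 'albumImgUrls')`; a `None` result can be logged and the merchant left for a later retry.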
