A couple of days ago I received a written test question from a company:
Extract the shareholder information from the results page, e.g. names like 许晨晔 shown at http://www.tianyancha.com/company/9519792.
Oh, I didn't actually know how to write a crawler yet. In a mild panic I went digging through Zhihu and found an example that more or less works as a template, then skimmed the BeautifulSoup docs (after all, once the page is crawled it still has to be parsed). So by the time I went to bed that night, I could finally crawl a page (runs away).
Then I applied my newly learned routine to the test question, and something went wrong: where on earth is the data in the HTML I crawled???? Are you kidding? I took another look... ah, I see, the data is filled in from JSON by JavaScript after the page loads.
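For context, the template I had just learned was basically the classic requests + BeautifulSoup pattern. Here is a minimal sketch of it, assuming that is roughly what the example did; on this page it only fetches the static HTML shell, before any JavaScript has filled in the shareholder names:
import requests
from bs4 import BeautifulSoup

# A plain HTTP GET only returns the static HTML skeleton;
# the shareholder table is rendered later in the browser by JavaScript,
# so the names never show up in this response.
resp = requests.get('http://www.tianyancha.com/company/9519792',
                    headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(resp.text, 'lxml')
print(soup.find_all(attrs={"ng-if": "dataItemCount.holderCount>0"}))  # comes back empty on the unrendered page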
After a round of searching, I ended up with the version below:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from bs4 import BeautifulSoup
import time

# Pretend to be a regular Firefox so the site doesn't treat us as a bot
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0")
driver = webdriver.PhantomJS(executable_path='D:/Anaconda3/phantomjs.exe', desired_capabilities=dcap)

driver.get('http://www.tianyancha.com/company/9519792')
time.sleep(5)  # give the page's JavaScript time to render the shareholder table

# Grab the fully rendered page source, then shut the browser down
content = driver.page_source.encode('utf-8')
driver.close()

# Parse the rendered HTML and locate the shareholder table via its ng-if attribute
data = BeautifulSoup(content, 'lxml')
use_data = data.find_all(attrs={"ng-if": "dataItemCount.holderCount>0"})
list_td = use_data[0].find_all('td')

# Collect the text of every <a> inside the table cells: these are the shareholder names
name = []
for line in list_td:
    l = line.find('a')
    if l is not None:
        name.append(l.string)
Result:
['马化腾', '张志东', '陈一丹', '许晨晔']
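One thing I would tighten up: the time.sleep(5) above just hopes the page finishes rendering within five seconds. A sketch of replacing that line with Selenium's explicit wait (assuming the same ng-if attribute still marks the shareholder table):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait (up to 15 seconds) until the shareholder container actually appears in the DOM,
# then read the page source as before
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '[ng-if="dataItemCount.holderCount>0"]')))
content = driver.page_source.encode('utf-8')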