Python 3: Cannot get the full content of a web page using requests


The page is rendered with JavaScript, which issues additional requests to fetch the rest of the data. You can use Selenium to get the whole page.

from bs4 import BeautifulSoup
from selenium import webdriver

# Start Chrome (a matching ChromeDriver must be available) and load the page
driver = webdriver.Chrome()
url = "https://shop.nordstrom.com/c/womens-dresses-shop?origin=topnav&cm_sp=Top%20Navigation-_-Women-_-Dresses&offset=11&page=3&top=72"
driver.get(url)
# page_source is the browser's current DOM, serialized after scripts have run
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
print(soup.prettify())
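To illustrate why a plain HTTP client falls short here, the sketch below compares two hypothetical HTML snippets: what a server might send before any script runs, and what the DOM might look like after rendering. The markup, the id "product-results", the script path, and the helper names are all made up for illustration. Only the standard library is used.

```python
from html.parser import HTMLParser

# Hypothetical response body as an HTTP client would receive it:
# the container is empty and a script is expected to fill it later.
STATIC_HTML = """
<html><body>
  <div id="product-results"></div>
  <script src="/static/app.js"></script>
</body></html>
"""

# Hypothetical DOM after a browser has executed the script.
RENDERED_HTML = """
<html><body>
  <div id="product-results">
    <article>Wrap dress</article>
    <article>Maxi dress</article>
  </div>
  <script src="/static/app.js"></script>
</body></html>
"""


class TextInElement(HTMLParser):
    """Collect the text found inside the element with a given id.

    Simplification: assumes every tag inside the target is explicitly
    closed (no bare void tags such as <br>).
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # nesting depth inside the target, 0 = outside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def text_in(html, element_id):
    """Return the whitespace-joined text inside the element with element_id."""
    parser = TextInElement(element_id)
    parser.feed(html)
    return " ".join(parser.chunks)


static_text = text_in(STATIC_HTML, "product-results")
rendered_text = text_in(RENDERED_HTML, "product-results")
print(repr(static_text))    # empty: nothing to scrape before rendering
print(repr(rendered_text))  # the items exist only in the rendered DOM
```

The container element is present in both documents, so a scraper that works on the raw response finds the tag but none of the data inside it.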

For other solutions, see my answer to "Scraping Google Finance (BeautifulSoup)":

The page is rendered with JavaScript. There are several ways to render and scrape it.

I can scrape it with Selenium. First install Selenium:

sudo pip3 install selenium

Then download the driver from https://sites.google.com/a/chromium.org/chromedriver/downloads

import bs4 as bs
from selenium import webdriver

browser = webdriver.Chrome()
url = "https://www.google.com/finance?q=tsla"
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(html_source, "lxml")
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())
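Because the page fetches data after the initial load, reading page_source straight after get() can, in some cases, run before the target element exists. A possible variation, sketched here under assumptions, waits explicitly for an element id before reading the source. The function name, the wait_for_id and timeout parameters, and the choice to run Chrome headless are illustrative and not part of the original answer. The Selenium imports sit inside the function so the definition loads even where Selenium is not installed.

```python
def fetch_rendered_html(url, wait_for_id, timeout=10):
    """Load url in headless Chrome and return the page source once an
    element with id wait_for_id is present in the DOM.

    Raises selenium.common.exceptions.TimeoutException if the element
    does not appear within timeout seconds.
    """
    # Lazy imports: Selenium is only needed when the function is called.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # assumption: no visible window needed
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Poll the DOM until the element exists or the timeout expires.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.ID, wait_for_id))
        )
        return driver.page_source
    finally:
        # Always release the browser, even if the wait times out.
        driver.quit()


# Example call (needs Chrome and a matching ChromeDriver):
# html_source = fetch_rendered_html("https://www.google.com/finance?q=tsla", "cc-table")
```

The returned string can be passed to BeautifulSoup in the same way as html_source in the snippet above.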

Or with PyQt5:

import sys
import bs4 as bs
from PyQt5.QtCore import QUrl
from PyQt5.QtWebKitWidgets import QWebPage
from PyQt5.QtWidgets import QApplication

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        # Blocks until _loadFinished stops the event loop
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url = "https://www.google.com/finance?q=tsla"
r = Render(url)
result = r.frame.toHtml()
soup = bs.BeautifulSoup(result, 'lxml')
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())

Alternatively, with Dryscrape:

import bs4 as bs
import dryscrape

url = "https://www.google.com/finance?q=tsla"
session = dryscrape.Session()
session.visit(url)
html_source = session.body()
soup = bs.BeautifulSoup(html_source, 'lxml')
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())

All three produce the same output:

Valuation▲▼Company name▲▼Price▲▼Change▲▼Chg %▲▼d | m | y▲▼Mkt Cap▲▼
TSLA   Tesla Inc             328.40  -1.52  -0.46%  53.69B
DDAIF  Daimler AG            72.94   -1.50  -2.01%  76.29B
F      Ford Motor Company    11.53   -0.17  -1.45%  45.25B
GM     General Motors Co...  36.07   -0.34  -0.93%  53.93B
RNSDF  RENAULT SA EUR3.81    97.00   0.00   0.00%   28.69B
HMC    Honda Motor Co Lt...  27.52   -0.18  -0.65%  49.47B
AUDVF  AUDI AG NPV           840.40  0.00   0.00%   36.14B
TM     Toyota Motor Corp...  109.31  -0.53  -0.48%  177.79B
BAMXF  BAYER MOTOREN WER...  94.57   -2.41  -2.48%  56.93B
NSANY  Nissan Motor Co L...  20.40   0.00   0.00%   42.85B
MMTOF  MITSUBISHI MOTOR ...  6.86    +0.09  1.26%   10.22B

Edit

QtWebKit was deprecated upstream in Qt 5.5 and removed in 5.6.

You can switch to PyQt5.QtWebEngineWidgets instead.
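A rough sketch of what that switch could look like, assuming PyQt5 with the QtWebEngine bindings installed (in recent PyQt5 releases these ship as the separate PyQtWebEngine package). The function name render_with_webengine is hypothetical. Unlike QWebFrame.toHtml(), QWebEnginePage.toHtml() is asynchronous and hands the HTML to a callback, so the event loop is stopped from inside that callback.

```python
import sys


def render_with_webengine(url):
    """Return the HTML of url as seen by QtWebEngine once loading finishes.

    Returns an empty string if no HTML was delivered.
    """
    # Lazy imports: Qt is only required when the function is called.
    # QtWebEngineWidgets should be imported before the QApplication exists.
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWebEngineWidgets import QWebEnginePage
    from PyQt5.QtWidgets import QApplication

    app = QApplication.instance() or QApplication(sys.argv)
    page = QWebEnginePage()
    result = {}

    def store_html(html):
        # Called asynchronously by toHtml() with the serialized document.
        result["html"] = html
        app.quit()

    def on_load_finished(ok):
        # Request the HTML only after the page reports that loading ended.
        page.toHtml(store_html)

    page.loadFinished.connect(on_load_finished)
    page.load(QUrl(url))
    app.exec_()  # blocks until store_html() stops the event loop
    return result.get("html", "")


# Example call (opens no window, but needs a working Qt installation):
# html_source = render_with_webengine("https://www.google.com/finance?q=tsla")
```

Data that scripts fetch after loadFinished fires may still be missing from the result, so a short delay or a check for the expected element may be needed in practice.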



