Try using BeautifulSoup. In Python 2 with BeautifulSoup 3:
from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page)
# Print the href of every <a> tag (None if the tag has no href)
for link in soup.findAll('a'):
    print link.get('href')
If you only want links that start with http://, use the following instead (the re import is needed for this filter):
soup.findAll('a', attrs={'href': re.compile("^http://")})
In Python 3 with BS4, it should be:
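from bs4 import BeautifulSoup
import urllib.request

html_page = urllib.request.urlopen("http://www.yourwebsite.com")
# Name the parser explicitly so bs4 does not warn about guessing one
soup = BeautifulSoup(html_page, "html.parser")
# find_all is the bs4 spelling; findAll still works as an older alias
for link in soup.find_all('a'):
    print(link.get('href'))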
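
For reference, here is a minimal sketch of the http:// filter carried over to Python 3 and bs4. It parses an inline HTML string rather than fetching a page, so it runs without network access. The names SAMPLE_HTML and extract_links are made up for this illustration and are not part of BeautifulSoup. Note that "^http://" does not match https:// links; "^https?://" is one way to accept both.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for the result of urlopen().
SAMPLE_HTML = """
<a href="http://example.com/a">absolute http</a>
<a href="https://example.com/b">absolute https</a>
<a href="/relative/c">relative</a>
<a name="no-href">anchor without href</a>
"""


def extract_links(html, pattern=None):
    """Return the href values of <a> tags, optionally filtered by a regex."""
    soup = BeautifulSoup(html, "html.parser")
    if pattern is None:
        # href=True keeps only <a> tags that actually have an href attribute
        anchors = soup.find_all("a", href=True)
    else:
        # A compiled regex is matched against each tag's href value
        anchors = soup.find_all("a", href=re.compile(pattern))
    return [a["href"] for a in anchors]


all_links = extract_links(SAMPLE_HTML)
http_only = extract_links(SAMPLE_HTML, r"^http://")
http_or_https = extract_links(SAMPLE_HTML, r"^https?://")

print(all_links)
print(http_only)
print(http_or_https)
```

Passing href=True in the unfiltered case avoids the None values that link.get('href') returns for anchors without an href.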