以上所有答案确实可以帮助我构建答案,因此,我对其他用户提出的所有答案投了赞成票:但是我最终对自己正在处理的确切问题汇总了自己的答案:
正如明确定义的问题一样,我必须以dom结构访问某些兄弟姐妹及其子代:此解决方案将迭代dom结构中的图像,并使用产品标题构造图像名称,并将图像保存到本地目录。
import urlparse from urllib2 import urlopen from urllib import urlretrieve from BeautifulSoup import BeautifulSoup as bs import requests def getImages(url): #Download the images r = requests.get(url) html = r.text soup = bs(html) output_folder = '~/amazon' #extracting the images that in div(s) for div in soup.findAll('div', attrs={'class':'image'}): modified_file_name = None try: #getting the data div using findNext nextDiv = div.findNext('div', attrs={'class':'data'}) #use findNext again on previous object to get to the anchor tag fileName = nextDiv.findNext('a').text modified_file_name = fileName.replace(' ','-') + '.jpg' except TypeError: print 'skip' imageUrl = div.find('img')['src'] outputPath = os.path.join(output_folder, modified_file_name) urlretrieve(imageUrl, outputPath) if __name__=='__main__': url = r'http://www.amazon.com/s/ref=sr_pg_1?rh=n%3A172282%2Ck%3Adigital+camera&keywords=digital+camera&ie=UTF8&qid=1343600585' getImages(url)
欢迎分享,转载请注明来源:内存溢出
评论列表(0条)