python - BeautifulSoup: strip specified attributes, but preserve the tag and its contents

I'm trying to 'defrontpagify' the HTML of an MS FrontPage-generated website, and I'm writing a BeautifulSoup script to do it.

However, I'm stuck on the part where I try to strip a particular attribute (or list of attributes) from every tag in the document that might contain them. The code snippet:

    REMOVE_ATTRIBUTES = [
        'lang','language','onmouseover','onmouseout','script','style','font',
        'dir','face','size','color','class','width','height','hspace',
        'border','valign','align','background','bgcolor','text','link','vlink',
        'alink','cellpadding','cellspacing']

    # remove all attributes in REMOVE_ATTRIBUTES from all tags,
    # but preserve the tag and its content.
    for attribute in REMOVE_ATTRIBUTES:
        for tag in soup.findAll(attribute=True):
            del(tag[attribute])

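It runs without error, but it doesn't actually strip any of the attributes. When I run it without the outer loop and simply hard-code a single attribute (soup.findAll(style=True)), it works.

Does anyone know what the problem is here?

P.S. I don't much like the nested loops either. If anyone knows a more functional, map/filter-ish style, I'd love to see it.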

Solution

The line

    for tag in soup.findAll(attribute=True):

does not find any tags. There might be a way to use findAll; I'm not sure. However, this works:

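    import BeautifulSoup

    # same list as in the question
    REMOVE_ATTRIBUTES = [
        'lang','language','onmouseover','onmouseout','script','style','font',
        'dir','face','size','color','class','width','height','hspace',
        'border','valign','align','background','bgcolor','text','link','vlink',
        'alink','cellpadding','cellspacing']

    doc = '''<html><head><title>Page Title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
    soup = BeautifulSoup.BeautifulSoup(doc)

    # walk every node and rebuild its attribute list without the unwanted keys
    for tag in soup.recursiveChildGenerator():
        try:
            tag.attrs = [(key,value) for key,value in tag.attrs
                         if key not in REMOVE_ATTRIBUTES]
        except AttributeError:
            # 'NavigableString' object has no attribute 'attrs'
            pass

    print(soup.prettify())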
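
The original loop matched nothing because of how Python handles keyword arguments: in findAll(attribute=True) the keyword is the literal name "attribute", not the value of the loop variable, so the search is for tags that carry an attribute called "attribute". The sketch below shows this in isolation. `record_filters` is a hypothetical stand-in that only reports the keyword names it receives, and `strip_attributes` is a hypothetical helper (not part of the original answer) that assumes the newer bs4 package, where `find_all` accepts an `attrs` dict and `tag.attrs` is a dict.

```python
def record_filters(**kwargs):
    """Hypothetical stand-in for findAll: report the keyword names it was given."""
    return sorted(kwargs)


attribute = "style"

# The keyword here is the literal name "attribute"; the variable is ignored.
literal_call = record_filters(attribute=True)

# Unpacking a dict uses the variable's value as the keyword name.
dynamic_call = record_filters(**{attribute: True})

print(literal_call)   # ['attribute']
print(dynamic_call)   # ['style']


def strip_attributes(soup, names):
    """Delete each attribute in `names` from every tag that carries it.

    Assumes `soup` is a bs4 BeautifulSoup object. The filter is passed as an
    `attrs` dict, so the attribute name is taken from the loop variable.
    """
    for attribute in names:
        # find_all returns a list, so deleting while looping over it is safe.
        for tag in soup.find_all(attrs={attribute: True}):
            del tag[attribute]
    return soup
```

With bs4 installed, this could be called as strip_attributes(BeautifulSoup(doc, 'html.parser'), REMOVE_ATTRIBUTES). Passing the filter through attrs rather than **{attribute: True} avoids a clash with find_all's own parameters, such as name.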
