How to remove hashtags, @user mentions, and links from tweets using regular expressions


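The following is only an approximation. Unfortunately, there is no fully correct way to do this with a regular expression alone. The regex below strips URLs (any scheme, not just http), @usernames, punctuation, and any other non-alphanumeric character, and it collapses the remaining words so that they are separated by single spaces. If you want to parse tweets the way you intend, the system needs more intelligence, for example some kind of self-learning algorithm, because there is no standard tweet feed format.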

Here is my suggestion.

' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())

Here are the results on your examples.

>>> import re
>>> x = "@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"
>>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
'I really love that shirt at Macy'
>>> x = "@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bit.ly/tuN2wx"
>>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
'Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve'
>>> x = "I am at Starbucks http://4sq.com/samqUI (7419 3rd ave, at 75th, Brooklyn) "
>>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
'I am at Starbucks 7419 3rd ave at 75th Brooklyn'
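For use in a script rather than at the interactive prompt, the same expression can be wrapped in a small function. This is a minimal sketch, not a definitive implementation: the names `TWEET_NOISE` and `clean_tweet` are hypothetical, and the pattern is written as a raw string so that `\t`, `\w`, and `\S` reach the regex engine intact.

```python
import re

# Three alternatives, tried left to right at each position:
#   1. an @username made of ASCII letters and digits
#   2. any single character that is not an ASCII letter, digit, space, or tab
#   3. a URL with any scheme (http://, https://, ftp://, ...)
TWEET_NOISE = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)")


def clean_tweet(text: str) -> str:
    """Replace every match with a space, then collapse runs of whitespace."""
    return " ".join(TWEET_NOISE.sub(" ", text).split())


if __name__ == "__main__":
    print(clean_tweet("@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"))
```

Compiling the pattern once avoids repeating the long literal at every call site; the behavior is otherwise the same as the one-liner.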

Here are some examples where the result is less than perfect.

>>> x = "I c RT @iamFink: @SamanthaSpice that's my excited face and my regular face. The expression never changes."
>>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
'I c RT that s my excited face and my regular face The expression never changes'
>>> x = "RT @AstrologyForYou: #Gemini recharges through regular contact with people of like mind, and social involvement that allows expression of their ideas"
>>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
'RT Gemini recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> # Though after you add # to the regex expression filter, results become a bit better
>>> ' '.join(re.sub(r"([@#][A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
'RT recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> x = "New comment by diego.bosca: Re: Re: wrong regular expression? http://t.co/4KOb94ua"
>>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
'New comment by diego bosca Re Re wrong regular expression'
>>> # See how miserably it performed?
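Some of these rough edges can be softened by cleaning in separate passes instead of one alternation. The sketch below is one possible variation, not part of the original suggestion: it removes URLs first, then @mentions and #hashtags, and finally other punctuation while keeping apostrophes so that "that's" survives. The names `URL`, `MENTION_OR_HASHTAG`, `OTHER`, and `clean_tweet_stepwise` are hypothetical, and treating `\w+` as the body of a mention or hashtag is an assumption.

```python
import re

URL = re.compile(r"\w+://\S+")                # any scheme, up to the next whitespace
MENTION_OR_HASHTAG = re.compile(r"[@#]\w+")   # assumption: tags consist of word characters
OTHER = re.compile(r"[^0-9A-Za-z' \t]")       # everything else except apostrophes


def clean_tweet_stepwise(text: str) -> str:
    """Strip URLs, then mentions/hashtags, then remaining punctuation."""
    text = URL.sub(" ", text)
    text = MENTION_OR_HASHTAG.sub(" ", text)
    text = OTHER.sub(" ", text)
    return " ".join(text.split())


if __name__ == "__main__":
    print(clean_tweet_stepwise(
        "I c RT @iamFink: @SamanthaSpice that's my excited face and my regular face."
    ))
```

This still cannot tell that "diego.bosca" is a single user name or that "RT" is retweet markup, so it remains an approximation.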


Feel free to share; when reposting, please credit the source: 内存溢出

Original URL: http://outofmemory.cn/zaji/5649402.html
