python-3.x – Removing trailing whitespace from elements in a list


I have a Spark dataframe where a given column contains some text. I am trying to clean the text and split it on commas, which outputs a new column containing a list of words.

The problem I am running into is that some of the elements in that list contain trailing whitespace that I would like to remove.

Code:

# Libraries
# Standard Libraries
from typing import Dict, List, Tuple

# Third Party Libraries
import pyspark
from pyspark.ml.feature import Tokenizer
from pyspark.sql import SparkSession
import pyspark.sql.functions as s_function


def tokenize(sdf, input_col="text", output_col="tokens"):
    # Remove emails
    sdf_temp = sdf.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col),
                                      r"[\w\.-]+@[\w\.-]+\.\w+", ""))
    # Remove digits
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), r"\d", ""))
    # Remove any character that is not a word character, except for
    # commas (,), since we still want to split on commas (,)
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col),
                                      "[^a-zA-Z0-9,]+", " "))
    # Split the affiliation string based on a comma
    sdf_temp = sdf_temp.withColumn(
        colName=output_col,
        col=s_function.split(sdf_temp[input_col], ","))
    return sdf_temp


if __name__ == "__main__":
    # Sample data
    a_1 = "Department of Bone and Joint Surgery, Ehime University Graduate"\
        " School of Medicine, Shitsukawa, Toon 791-0295, Ehime, Japan."\
        " shinyama@m.ehime-u.ac.jp."
    a_2 = "stroke Pharmacogenomics and genetics, Fundació Docència i Recerca"\
        " Mútua Terrassa, Hospital Mútua de Terrassa, 08221 Terrassa, Spain."
    a_3 = "Neurovascular Research Laboratory, Vall d'Hebron Institute of Research,"\
        " Hospital Vall d'Hebron, 08035 barcelona, Spain; catycarrerav@gmail.com"\
        " (C.C.). catycarrerav@gmail.com."
    data = [(1, a_1), (2, a_2), (3, a_3)]

    spark = SparkSession\
        .builder\
        .master("local[*]")\
        .appName("My_test")\
        .config("spark.ui.port", "37822")\
        .getOrCreate()
    sc = spark.sparkContext
    sc.setLogLevel("WARN")

    af_data = spark.createDataFrame(data, ["index", "text"])
    sdf_tokens = tokenize(af_data)
    sdf_tokens.select("tokens").show(truncate=False)

Output:

|[Department of Bone and Joint Surgery,Ehime University Graduate School of Medicine,Toon,Japan ]|
|[stroke Pharmacogenomics and genetics,Fundaci Doc ncia i Recerca M tua Terrassa,Hospital M tua de Terrassa,Terrassa,Spain ]|
|[Neurovascular Research Laboratory,Vall d Hebron Institute of Research,Hospital Vall d Hebron,barcelona,Spain C C ]|

Desired output:

|[Department of Bone and Joint Surgery,Japan]|
|[stroke Pharmacogenomics and genetics,Spain]|
|[Neurovascular Research Laboratory,Spain C C]|

So, in:

> Row 1: 'Toon ' -> 'Toon', 'Japan ' -> 'Japan'.
> Row 2: 'Spain ' -> 'Spain'
> Row 3: 'Spain C C ' -> 'Spain C C'

Note

The trailing whitespace does not only appear in the last element of the list; it can appear in any element.
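To make the problem concrete, here is a minimal plain-Python sketch of the same cleaning steps, using the re module instead of Spark and a shortened, made-up version of the sample text. After splitting on commas, several tokens keep stray leading or trailing spaces:

```python
import re

s = "Shitsukawa, Toon 791-0295, Ehime, Japan. x@y.jp."
s = re.sub(r"[\w\.-]+@[\w\.-]+\.\w+", "", s)  # remove emails
s = re.sub(r"\d", "", s)                      # remove digits
s = re.sub(r"[^a-zA-Z0-9,]+", " ", s)         # keep only letters, digits, commas
tokens = s.split(",")
print(tokens)  # ['Shitsukawa', ' Toon ', ' Ehime', ' Japan ']
```

The stray spaces come from the third substitution, which collapses the removed punctuation and digits into single spaces sitting right next to the commas.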

Solution

Update

The original solution does not work, because trim only operates on the beginning and end of the whole string, whereas you need it to apply to each token.
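This difference is easy to check in plain Python, with str.strip standing in for trim (the sample value is hypothetical, for illustration only):

```python
s = " Toon , Ehime, Japan "

# Trimming the whole string only removes the outer spaces;
# the whitespace around the inner tokens survives the split.
print(s.strip().split(","))               # ['Toon ', ' Ehime', ' Japan']

# Trimming each token after the split is what we actually want.
print([t.strip() for t in s.split(",")])  # ['Toon', 'Ehime', 'Japan']
```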

@PatrickArtner's solution works, but an alternative approach is to use RegexTokenizer.

Here is an example of how the tokenize() function could be modified:

from pyspark.ml.feature import RegexTokenizer


def tokenize(sdf, input_col="text", output_col="tokens"):
    # Same cleaning steps as before: remove emails, digits, and any
    # character that is not a word character or a comma
    sdf_temp = sdf.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col),
                                      r"[\w\.-]+@[\w\.-]+\.\w+", ""))
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), r"\d", ""))
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col),
                                      "[^a-zA-Z0-9,]+", " "))
    # Call trim to remove any trailing (or leading) spaces
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.trim(sdf_temp[input_col]))
    # Use RegexTokenizer to split on commas optionally surrounded by whitespace
    myTokenizer = RegexTokenizer(
        inputCol=input_col,
        outputCol=output_col,
        pattern="( +)?,( +)?")
    sdf_temp = myTokenizer.transform(sdf_temp)
    return sdf_temp

Basically, call trim on the string first to handle any leading or trailing spaces, and then use RegexTokenizer to split with the pattern "( +)?,( +)?":

> ( +)?: optionally match a run of one or more spaces (i.e. between zero and unlimited spaces) before the comma
> ,: match the comma exactly
> ( +)?: optionally match a run of spaces after the comma
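The splitting behaviour of this pattern can be sanity-checked with Python's re.split as a stand-in for RegexTokenizer's gap-based splitting. (The groups are made non-capturing here, because unlike Java's split, re.split returns captured separator groups as extra list elements.)

```python
import re

# Non-capturing equivalent of the "( +)?,( +)?" gap pattern
pattern = r"(?: +)?,(?: +)?"

print(re.split(pattern, "Toon , Ehime, Japan"))  # ['Toon', 'Ehime', 'Japan']
```

Note that a trailing space at the very end of the string would still survive this split, which is why the string is trimmed before tokenizing.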

Here is the output:

sdf_tokens.select("tokens", s_function.size("tokens").alias("size")).show(truncate=False)

You can see that the length of each array (the number of tokens) is correct, but all the tokens are lowercase (because that is what Tokenizer and RegexTokenizer do).

+------------------------------------------------------------------------------------------------------------------------------+----+
|tokens                                                                                                                        |size|
+------------------------------------------------------------------------------------------------------------------------------+----+
|[department of bone and joint surgery,ehime university graduate school of medicine,shitsukawa,toon,ehime,japan]               |6   |
|[stroke pharmacogenomics and genetics,fundaci doc ncia i recerca m tua terrassa,hospital m tua de terrassa,terrassa,spain]    |5   |
|[neurovascular research laboratory,vall d hebron institute of research,hospital vall d hebron,barcelona,spain c c]            |5   |
+------------------------------------------------------------------------------------------------------------------------------+----+

Original answer

As long as you are using Spark version 1.5 or later, you can use pyspark.sql.functions.trim(), which will:

Trim the spaces from both ends for the specified string column.

So one approach would be to add:

sdf_temp = sdf_temp.withColumn(
    colName=input_col,
    col=s_function.trim(sdf_temp[input_col]))

at the end of the tokenize() function.

But you may want to look at pyspark.ml.feature.Tokenizer and pyspark.ml.feature.RegexTokenizer. One idea could be to use your function to clean the strings, and then use Tokenizer to produce the tokens. (I see that you have already imported it, but it does not seem to be used.)
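For reference, the whole pipeline — clean, trim, then split on commas with optional surrounding whitespace — can be sketched outside Spark with plain re (the sample string below is made up):

```python
import re

def clean_and_tokenize(text):
    # The same three cleaning substitutions as tokenize()
    text = re.sub(r"[\w\.-]+@[\w\.-]+\.\w+", "", text)  # emails
    text = re.sub(r"\d", "", text)                      # digits
    text = re.sub(r"[^a-zA-Z0-9,]+", " ", text)         # other punctuation
    # Trim the whole string, then split on commas, eating surrounding spaces
    return re.split(r"(?: +)?,(?: +)?", text.strip())

print(clean_and_tokenize("Shitsukawa, Toon 791-0295, Ehime, Japan."))
# ['Shitsukawa', 'Toon', 'Ehime', 'Japan']
```

No token keeps leading or trailing whitespace, matching the desired output above.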
