The problem I am running into is that some of the elements in this list contain trailing whitespace that I would like to remove.
Code:
# Libraries
# Standard libraries
from typing import Dict, List, Tuple

# Third-party libraries
import pyspark
from pyspark.ml.feature import Tokenizer
from pyspark.sql import SparkSession
import pyspark.sql.functions as s_function


def tokenize(sdf, input_col="text", output_col="tokens"):
    # Remove email addresses
    sdf_temp = sdf.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(
            s_function.col(input_col), r"[\w\.-]+@[\w\.-]+\.\w+", ""))
    # Remove digits
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), r"\d", ""))
    # Remove any character that is not a word character, except for
    # commas (,), since we still want to split on commas (,)
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(
            s_function.col(input_col), "[^a-zA-Z0-9,]+", " "))
    # Split the affiliation string on commas
    sdf_temp = sdf_temp.withColumn(
        colName=output_col,
        col=s_function.split(sdf_temp[input_col], ","))
    return sdf_temp


if __name__ == "__main__":
    # Sample data
    a_1 = "Department of Bone and Joint Surgery,Ehime University Graduate"\
          " School of Medicine,Shitsukawa,Toon 791-0295,Ehime,Japan."\
          " shinyama@m.ehime-u.ac.jp."
    a_2 = "stroke Pharmacogenomics and genetics,Fundació Docència i Recerca"\
          " Mútua Terrassa,Hospital Mútua de Terrassa,08221 Terrassa,Spain."
    a_3 = "Neurovascular Research Laboratory,Vall d'Hebron Institute of Research,"\
          " Hospital Vall d'Hebron,08035 barcelona,Spain;catycarrerav@gmail.com"\
          " (C.C.). catycarrerav@gmail.com."
    data = [(1, a_1), (2, a_2), (3, a_3)]

    spark = SparkSession\
        .builder\
        .master("local[*]")\
        .appName("My_test")\
        .config("spark.ui.port", "37822")\
        .getOrCreate()
    sc = spark.sparkContext
    sc.setLogLevel("WARN")

    af_data = spark.createDataFrame(data, ["index", "text"])
    sdf_tokens = tokenize(af_data)
    sdf_tokens.select("tokens").show(truncate=False)
Output:
|[Department of Bone and Joint Surgery,Ehime University Graduate School of Medicine,Toon,Japan ]|
|[stroke Pharmacogenomics and genetics,Fundaci Doc ncia i Recerca M tua Terrassa,Hospital M tua de Terrassa,Terrassa,Spain ]|
|[Neurovascular Research Laboratory,Vall d Hebron Institute of Research,Hospital Vall d Hebron,barcelona,Spain C C ]|
Desired output:
|[Department of Bone and Joint Surgery,Japan]|
|[stroke Pharmacogenomics and genetics,Spain]|
|[Neurovascular Research Laboratory,Spain C C]|
So, in:
> row 1: 'Toon ' -> 'Toon', 'Japan ' -> 'Japan'.
> row 2: 'Spain ' -> 'Spain'
> row 3: 'Spain C C ' -> 'Spain C C'
Note
The trailing whitespace does not only appear in the last element of the list; it can appear in any element.
Solution

Update: the original solution (kept below) does not work here, because trim only operates on the beginning and end of the whole string, while you need it to operate on each individual token.
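To make that failure mode concrete, here is a minimal self-contained sketch (the sample string and column names are made up for illustration, not taken from the question) showing that a whole-string trim leaves the space before a comma untouched:

from pyspark.sql import SparkSession
import pyspark.sql.functions as s_function

spark = SparkSession.builder.master("local[*]").getOrCreate()

# One row whose first token carries a trailing space before the comma.
df = spark.createDataFrame([(1, "Toon ,Japan ")], ["index", "text"])

# trim() only removes leading/trailing spaces of the *whole* string,
# so the space before the comma survives and ends up inside the first token.
df = df.withColumn("trimmed", s_function.trim(s_function.col("text")))
df = df.withColumn("tokens", s_function.split(s_function.col("trimmed"), ","))
df.select("tokens").show(truncate=False)
# -> [Toon , Japan]   (the first element still ends with a space)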
@PatrickArtner's solution works, but another approach is to use RegexTokenizer.
Here is an example of how the tokenize() function could be modified:
from pyspark.ml.feature import RegexTokenizer


def tokenize(sdf, input_col="text", output_col="tokens"):
    # Remove email addresses
    sdf_temp = sdf.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(
            s_function.col(input_col), r"[\w\.-]+@[\w\.-]+\.\w+", ""))
    # Remove digits
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), r"\d", ""))
    # Remove any character that is not a word character, except for commas
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(
            s_function.col(input_col), "[^a-zA-Z0-9,]+", " "))
    # Call trim to remove any trailing (or leading) spaces
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.trim(sdf_temp[input_col]))
    # Use RegexTokenizer to split on commas optionally surrounded by whitespace
    myTokenizer = RegexTokenizer(
        inputCol=input_col, outputCol=output_col, pattern="( +)?,( +)?")
    sdf_temp = myTokenizer.transform(sdf_temp)
    return sdf_temp
Basically, trim is called on the string to take care of any leading or trailing whitespace, and then RegexTokenizer splits it using the pattern "( +)?,( +)?" (a small standalone illustration of this pattern follows right after the list below), where:
> ( +)?: matches between zero and unlimited spaces
> ,: matches a comma exactly
> ( +)?: matches optional spaces after the comma
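To see what that pattern does before handing it to RegexTokenizer, here is a small standalone sketch using Python's re module on a made-up string (RegexTokenizer, whose gaps parameter defaults to True, splits on matches of its pattern in the same spirit):

import re

# The same pattern as in the RegexTokenizer above, written with
# non-capturing groups because re.split() would otherwise also return
# the text captured by the groups.
pattern = r"(?: +)?,(?: +)?"

sample = "Department of Bone and Joint Surgery ,Ehime University ,Japan "
print(re.split(pattern, sample.strip()))
# ['Department of Bone and Joint Surgery', 'Ehime University', 'Japan']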
Here is the output:
sdf_tokens.select('tokens', s_function.size('tokens').alias('size')).show(truncate=False)
You can see that the length of each array (the number of tokens) is correct, but all of the tokens are lowercase, because that is what Tokenizer and RegexTokenizer do (a sketch of how to keep the original case follows the output below).
+------------------------------------------------------------------------------------------------------------------------------+----+
|tokens                                                                                                                          |size|
+------------------------------------------------------------------------------------------------------------------------------+----+
|[department of bone and joint surgery,ehime university graduate school of medicine,shitsukawa,toon,ehime,japan]                |6   |
|[stroke pharmacogenomics and genetics,fundaci doc ncia i recerca m tua terrassa,hospital m tua de terrassa,terrassa,spain]     |5   |
|[neurovascular research laboratory,vall d hebron institute of research,hospital vall d hebron,barcelona,spain c c]             |5   |
+------------------------------------------------------------------------------------------------------------------------------+----+
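If you need to keep the original casing, RegexTokenizer also has a toLowercase parameter that can be turned off. A minimal sketch of that option (the sample data here is invented, and this is not part of the answer above):

from pyspark.ml.feature import RegexTokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1, "Toon ,Ehime ,Japan")], ["index", "text"])

# Same kind of tokenizer as above, but told not to lowercase the tokens.
# toLowercase defaults to True.
keep_case = RegexTokenizer(
    inputCol="text", outputCol="tokens",
    pattern="( +)?,( +)?", toLowercase=False)

keep_case.transform(df).select("tokens").show(truncate=False)
# -> [Toon, Ehime, Japan]   (case preserved)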
Original answer
As long as you are using Spark 1.5 or later, you can use pyspark.sql.functions.trim(), which will:
Trim the spaces from both ends for the specified string column.
So one approach would be to add:
sdf_temp = sdf_temp.withColumn(
    colName=input_col,
    col=s_function.trim(sdf_temp[input_col]))
at the end of your tokenize() function.
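For completeness, if you are on Spark 2.4 or later, another way to deal with the per-token spaces (not something either answer above uses, just a hedged sketch that reuses the tokenize() and af_data defined in the question) is to trim every element of the array after the split with the transform higher-order SQL function:

import pyspark.sql.functions as s_function

# Sketch: after tokenize() has produced the "tokens" array column,
# trim each element of that array. transform(...) needs Spark 2.4+.
sdf_tokens = tokenize(af_data)
sdf_tokens = sdf_tokens.withColumn(
    "tokens", s_function.expr("transform(tokens, x -> trim(x))"))
sdf_tokens.select("tokens").show(truncate=False)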
However, you may want to look at pyspark.ml.feature.Tokenizer or pyspark.ml.feature.RegexTokenizer. One idea could be to use your function to clean up the strings and then use the Tokenizer to produce the tokens. (I see that you have already imported it, but it does not appear to be used.)
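As a rough, self-contained sketch of that last idea (the sample string is invented, and note that Tokenizer lowercases the text and splits on whitespace rather than on commas):

from pyspark.ml.feature import Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Pretend this is the output of your cleaning regexes: one string per row.
cleaned = spark.createDataFrame(
    [(1, "Department of Bone and Joint Surgery Ehime Japan")],
    ["index", "text"])

# Tokenizer lowercases the text and splits it on whitespace,
# so it only makes sense after the cleaning has already run.
whitespace_tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
whitespace_tokenizer.transform(cleaned).select("tokens").show(truncate=False)
# -> [department, of, bone, and, joint, surgery, ehime, japan]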