In brief:
regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4)
This expression extracts the employee name from any position in the text column (col('Notes')) where it follows the word "by" and one or more spaces.
In detail:
Create a sample dataframe
data = [('2345', 'Checked by John'),
        ('2398', 'Verified by Stacy'),
        ('2328', 'Verified by Srinivas than some random text'),
        ('3983', 'Double Checked on 2/23/17 by Marsha')]
df = sc.parallelize(data).toDF(['ID', 'Notes'])
df.show()
+----+--------------------+
|  ID|               Notes|
+----+--------------------+
|2345|     Checked by John|
|2398|   Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+
Do the required imports
from pyspark.sql.functions import regexp_extract, col
Extract the Employee name from the Notes column of df using regexp_extract(column_name, regex, group_number).
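Here the regex ('(.)(by)(\s+)(\w+)') means:
- (.) - any single character (except a newline)
- (by) - the literal word "by" in the text
- (\s+) - one or more whitespace characters
- (\w+) - one or more alphanumeric or underscore characters
and group_number is 4 because the group (\w+) is in the 4th position in the expression.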
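The group breakdown above can be checked outside Spark. The following is a minimal sketch that uses Python's re module as a stand-in for the Java regex engine Spark uses; for this particular pattern the two should behave the same. The helper name extract_groups is hypothetical and only for illustration. The pattern is written as a raw string with \s and \w so the backslashes reach the regex engine intact.

```python
import re

# The four-group pattern described above, as a raw string.
PATTERN = re.compile(r'(.)(by)(\s+)(\w+)')

# The Notes values from the sample dataframe.
notes = [
    'Checked by John',
    'Verified by Stacy',
    'Verified by Srinivas than some random text',
    'Double Checked on 2/23/17 by Marsha',
]

def extract_groups(text):
    """Return the captured groups of the first match, or None if nothing matches."""
    match = PATTERN.search(text)
    return match.groups() if match else None

for note in notes:
    groups = extract_groups(note)
    # groups[3] is capture group 4, the (\w+) part that holds the name.
    print(note, '->', groups)
```

Each tuple has four entries, one per parenthesised group, and the name is always the last one, which is why group_number is 4.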
result = df.withColumn('Employee', regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4))
result.show()
+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|     Checked by John|    John|
|2398|   Verified by Stacy|   Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...|  Marsha|
+----+--------------------+--------+
Databricks notebook
Note: regexp_extract(col('Notes'), r'.by\s+(\w+)', 1) seems to be a much cleaner version, and the regex in use can be checked here.
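To see that the shorter pattern picks out the same names, here is a small sketch in plain Python. The function regexp_extract_py is a hypothetical single-string stand-in for regexp_extract, not part of PySpark; it mirrors Spark's documented behaviour of returning an empty string when the regex does not match.

```python
import re

def regexp_extract_py(text, pattern, group):
    """Hypothetical pure-Python stand-in for regexp_extract on one string."""
    match = re.search(pattern, text)
    # Return '' when there is no match or the group did not participate.
    return (match.group(group) or '') if match else ''

VERBOSE = r'(.)(by)(\s+)(\w+)'   # name is capture group 4
COMPACT = r'.by\s+(\w+)'         # name is capture group 1

notes = [
    'Checked by John',
    'Verified by Stacy',
    'Verified by Srinivas than some random text',
    'Double Checked on 2/23/17 by Marsha',
]

for note in notes:
    name_verbose = regexp_extract_py(note, VERBOSE, 4)
    name_compact = regexp_extract_py(note, COMPACT, 1)
    # Both patterns should agree on every sample row.
    assert name_verbose == name_compact
    print(note, '->', name_compact)
```

Dropping the unneeded parentheses leaves a single capture group, so the group number becomes 1 and the expression is easier to read.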