08-UDFs


User-Defined Functions
  1. Define a function

  2. Create and apply UDF

  3. Register UDF to use in SQL

  4. Use Decorator Syntax (Python Only)

  5. Use Vectorized UDF (Python Only)

Methods
  • UDF Registration (spark.udf): register

  • Built-In Functions: udf

  • Python UDF Decorator: @udf

  • Pandas UDF Decorator: @pandas_udf

Define a function

Define a function in local Python/Scala to get the first letter of a string from the email field.

def firstLetterFunction(email):
  return email[0]

As a plain Python function, firstLetterFunction cannot be applied directly to a Spark DataFrame column.

from pyspark.sql.functions import col
display(salesDF.select(firstLetterFunction(col("email"))))

Once the function is wrapped with udf, it becomes a UDF that can be applied to DataFrame columns.

from pyspark.sql.functions import udf
firstLetterUDF = udf(firstLetterFunction)
display(salesDF.select(firstLetterUDF(col("email"))))
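
Spark passes a SQL NULL to a Python UDF as None, so email[0] would raise an error for rows with a missing email, and an empty string has no first character either. A minimal null-safe sketch follows, assuming such rows should yield NULL; the name first_letter_or_none is hypothetical and not part of the original notebook.

```python
from typing import Optional


def first_letter_or_none(email: Optional[str]) -> Optional[str]:
    """Return the first character of email, or None when it is missing or empty."""
    # None (SQL NULL) and "" are both falsy; returning None yields NULL in Spark.
    if not email:
        return None
    return email[0]


# Wrapping it for Spark works the same way as for firstLetterFunction:
#   safeFirstLetterUDF = udf(first_letter_or_none)
```

Because the function is plain Python, it can be checked locally with ordinary strings before it is wrapped with udf.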

Register UDF to use in SQL

Register the function with spark.udf.register to make it available as a UDF in the SQL namespace.

salesDF.createOrReplaceTempView("sales")

spark.udf.register("sql_udf", firstLetterFunction)

-- Run as SQL, e.g. in a %sql notebook cell
SELECT email, sql_udf(email) AS firstLetter FROM sales
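
spark.udf.register also returns the registered UDF, so one registration can serve both SQL and the DataFrame API. The sketch below assumes a local Spark installation; the names first_letter, demo, sales_demo and sql_udf_demo, and the sample emails, are made up for illustration.

```python
def first_letter(email):
    # Same logic as firstLetterFunction above.
    return email[0]


def demo():
    # Imported here so the helper above stays usable without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.master("local[1]").appName("udf-demo").getOrCreate()
    df = spark.createDataFrame(
        [("alice@example.com",), ("bob@example.com",)], ["email"]
    )
    df.createOrReplaceTempView("sales_demo")

    # register() makes the name visible to SQL and returns a UDF for DataFrame use.
    registered = spark.udf.register("sql_udf_demo", first_letter)

    from_sql = spark.sql(
        "SELECT sql_udf_demo(email) AS firstLetter FROM sales_demo"
    ).collect()
    from_df = df.select(registered(col("email")).alias("firstLetter")).collect()
    return [r.firstLetter for r in from_sql], [r.firstLetter for r in from_df]
```

Calling demo() on a machine with Spark available should return the same list of first letters from both code paths.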

Use Decorator Syntax (Python Only)

Alternatively, define a UDF with Python decorator syntax, passing the data type the function returns.

from pyspark.sql.functions import col, udf

# Our input/output is a string
@udf("string")
def decoratorUDF(email: str) -> str:
  return email[0]

salesDF = spark.read.parquet("/mnt/dbswarehouse/raw/sales.parquet")
display(salesDF.select(decoratorUDF(col("email"))))

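Use Vectorized UDF (Python Only)

A vectorized (pandas) UDF receives each batch of rows as a pandas Series instead of one value at a time, which usually makes it faster than a row-at-a-time Python UDF.
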
import pandas as pd
from pyspark.sql.functions import pandas_udf

# We have a string input/output
@pandas_udf("string")
def vectorizedUDF(email: pd.Series) -> pd.Series:
  return email.str[0]

# Alternatively
vectorizedUDF = pandas_udf(lambda s: s.str[0], "string")
display(salesDF.select(vectorizedUDF(col("email"))))

We can also register these Vectorized UDFs to the SQL namespace.

spark.udf.register("sql_vectorized_udf", vectorizedUDF)
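
Since the body of a vectorized UDF is ordinary pandas code, it can be tried on a local Series before it is wrapped with pandas_udf. The sketch below assumes only that pandas is installed; first_letters and the sample emails are hypothetical.

```python
import pandas as pd


def first_letters(emails: pd.Series) -> pd.Series:
    # Same body as vectorizedUDF: .str[0] takes the first character
    # of every element in the Series at once.
    return emails.str[0]


sample = pd.Series(["alice@example.com", "bob@example.com", None])
result = first_letters(sample)
print(result.tolist())
```

The missing entry stays missing in the result instead of raising, which is one practical difference from the row-at-a-time email[0] version.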

Original article: http://outofmemory.cn/langs/718606.html
