08-UDFs (Python)

User-Defined Functions

Define a function
Create and apply UDF
Register UDF to use in SQL
Use Decorator Syntax (Python Only)
Use Vectorized UDF (Python Only)

Methods

UDF Registration (spark.udf): register
Built-In Functions : udf
Python UDF Decorator : @udf
Pandas UDF Decorator : @pandas_udf

Define a function

Define a function in local Python/Scala to get the first letter of a string from the email field.

def firstLetterFunction(email):
  return email[0]
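Spark passes SQL NULL values into Python UDFs as None, and email[0] raises an error on None or on an empty string. A minimal null-safe sketch of the same function follows; the name first_letter_safe is hypothetical and not part of the original notebook.

```python
from typing import Optional

def first_letter_safe(email: Optional[str]) -> Optional[str]:
    """Return the first character of email, or None when there is none."""
    # None (a SQL NULL) and "" are both falsy, so neither one is indexed
    if not email:
        return None
    return email[0]

print(first_letter_safe("alice@example.com"))  # a
print(first_letter_safe(None))                 # None
```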

A plain Python function like this cannot be used directly on a Spark DataFrame; the call below does not work.

from pyspark.sql.functions import col
display(salesDF.select(firstLetterFunction(col("email"))))

Create and apply UDF

Wrapping the function with udf turns it into a UDF that can be applied to DataFrame columns.

from pyspark.sql.functions import udf
firstLetterUDF = udf(firstLetterFunction)
display(salesDF.select(firstLetterUDF(col("email"))))
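When udf is called without a return type, the result column is typed as a string. Stating the type explicitly makes the intent visible. The sketch below assumes pyspark is installed wherever make_first_letter_udf is called; both function names are hypothetical.

```python
def first_letter(email):
    # Same logic as firstLetterFunction, plus a guard for None or empty input
    return email[0] if email else None

def make_first_letter_udf():
    """Wrap first_letter as a Spark UDF with an explicit return type."""
    # Imported here so the plain function stays usable without Spark installed
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    return udf(first_letter, StringType())
```

With a DataFrame such as salesDF, the result would be applied as salesDF.select(make_first_letter_udf()(col("email"))). Keeping first_letter undecorated also lets it be unit tested as ordinary Python.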

Register UDF to use in SQL

salesDF.createOrReplaceTempView("sales")

spark.udf.register("sql_udf", firstLetterFunction)

SELECT email,sql_udf(email) AS firstLetter FROM sales
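The SELECT statement above is written for a SQL cell. From Python, the same query can be issued through spark.sql. The sketch below assumes an active SparkSession and a DataFrame with an email column are passed in; query_first_letters is a hypothetical helper name.

```python
def first_letter(email):
    # Guard against NULL or empty emails before indexing
    return email[0] if email else None

def query_first_letters(spark, sales_df):
    """Register the UDF under a SQL name and run the query from Python."""
    sales_df.createOrReplaceTempView("sales")
    # register() makes the function callable by name inside SQL statements
    spark.udf.register("sql_udf", first_letter)
    return spark.sql("SELECT email, sql_udf(email) AS firstLetter FROM sales")
```

The returned value is a regular DataFrame, so it can be passed to display or chained with further transformations.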

Use Decorator Syntax (Python Only)

Alternatively, define UDF using decorator syntax in Python with the datatype the function returns.

# Our input/output is a string
@udf("string")
def decoratorUDF(email: str) -> str:
  return email[0]

from pyspark.sql.functions import col
salesDF = spark.read.parquet("/mnt/dbswarehouse/raw/sales.parquet")
display(salesDF.select(decoratorUDF(col("email"))))

Use Vectorized UDF (Python Only)

import pandas as pd
from pyspark.sql.functions import pandas_udf

# We have a string input/output
@pandas_udf("string")
def vectorizedUDF(email: pd.Series) -> pd.Series:
  return email.str[0]

# Alternatively
vectorizedUDF = pandas_udf(lambda s: s.str[0], "string")

display(salesDF.select(vectorizedUDF(col("email"))))
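A vectorized UDF receives each batch of rows as a pandas Series, so its logic can be checked with plain pandas before any Spark job is run. The sketch below assumes pandas is installed, and that pyspark and pyarrow are available wherever make_vectorized_udf is called; both function names are hypothetical.

```python
import pandas as pd

def first_letters(emails: pd.Series) -> pd.Series:
    # .str[0] takes the first character of every element in one vectorized call
    return emails.str[0]

def make_vectorized_udf():
    """Wrap first_letters as a pandas UDF that returns strings."""
    # Imported lazily so the pandas logic can be tested without Spark
    from pyspark.sql.functions import pandas_udf
    return pandas_udf(first_letters, "string")

batch = pd.Series(["alice@example.com", "bob@example.com"])
print(first_letters(batch).tolist())  # ['a', 'b']
```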

We can also register these Vectorized UDFs to the SQL namespace.

spark.udf.register("sql_vectorized_udf", vectorizedUDF)
