Here is one possible way to drop every column that contains a NULL value, building on the code that counts NULL values per column:
import pandas as pd
import pyspark.sql.functions as F

# Sample data
df = pd.DataFrame({'x1': ['a', '1', '2'],
                   'x2': ['b', None, '2'],
                   'x3': ['c', '0', '3']})
df = sqlContext.createDataFrame(df)
df.show()

def drop_null_columns(df):
    """
    This function drops all columns which contain null values.
    :param df: A PySpark DataFrame
    """
    null_counts = df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c)
                             for c in df.columns]).collect()[0].asDict()
    to_drop = [k for k, v in null_counts.items() if v > 0]
    df = df.drop(*to_drop)
    return df

# Drops column x2, because it contains null values
drop_null_columns(df).show()
Before:
+---+----+---+
| x1|  x2| x3|
+---+----+---+
|  a|   b|  c|
|  1|null|  0|
|  2|   2|  3|
+---+----+---+
After:
+---+---+
| x1| x3|
+---+---+
|  a|  c|
|  1|  0|
|  2|  3|
+---+---+
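For comparison, the same idea in plain pandas (no Spark required) is a one-liner: `dropna(axis=1)` removes every column that contains at least one null. This is only a minimal sketch of the equivalent operation, not part of the original answer.

import pandas as pd

# Same sample data as above
df = pd.DataFrame({'x1': ['a', '1', '2'],
                   'x2': ['b', None, '2'],
                   'x3': ['c', '0', '3']})

# Drop every column that contains at least one null value
result = df.dropna(axis=1)
print(list(result.columns))  # ['x1', 'x3']

Note that unlike the PySpark version, this runs eagerly on a local DataFrame, so it is only suitable for data that fits in memory.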
Hope this helps!