Since Spark 1.5 you can use a number of built-in date processing functions:
pyspark.sql.functions.year
pyspark.sql.functions.month
pyspark.sql.functions.dayofmonth
pyspark.sql.functions.dayofweek (added in Spark 2.3)
pyspark.sql.functions.dayofyear
pyspark.sql.functions.weekofyear
import datetime
from pyspark.sql.functions import year, month, dayofmonth

elevDF = sc.parallelize([
    (datetime.datetime(1984, 1, 1, 0, 0), 1, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 2, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 3, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 4, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 5, 638.55)
]).toDF(["date", "hour", "value"])

elevDF.select(
    year("date").alias("year"),
    month("date").alias("month"),
    dayofmonth("date").alias("day")
).show()

+----+-----+---+
|year|month|day|
+----+-----+---+
|1984|    1|  1|
|1984|    1|  1|
|1984|    1|  1|
|1984|    1|  1|
|1984|    1|  1|
+----+-----+---+
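The other functions listed above are used the same way. A minimal sketch, reusing the elevDF DataFrame built above (dayofweek assumes Spark 2.3 or later):

from pyspark.sql.functions import dayofweek, dayofyear, weekofyear

# Extract additional calendar fields from the same timestamp column.
# dayofweek requires Spark 2.3+; dayofyear and weekofyear exist since 1.5.
elevDF.select(
    dayofweek("date").alias("dow"),    # 1 = Sunday ... 7 = Saturday
    dayofyear("date").alias("doy"),    # 1 .. 366
    weekofyear("date").alias("week")   # ISO week number
).show()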
Alternatively, you can use a simple map, as with any other RDD:
from pyspark.sql import Row

elevDF = sqlContext.createDataFrame(sc.parallelize([
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)
]))

# Map over the underlying RDD; tuple-unpacking lambdas were removed in
# Python 3, so unpack the Row fields explicitly.
(elevDF
    .rdd
    .map(lambda row: (row.date.year, row.date.month, row.date.day))
    .collect())
The result is:
[(1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1)]
As a side note: datetime.datetime stores the hour anyway, so keeping it in a separate column looks like a waste of memory.
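If the separate column is dropped, the hour can still be read from the timestamp with pyspark.sql.functions.hour. A minimal sketch against the same elevDF (note that in the sample data above every timestamp has hour 0, so this would return 0; the point is that a properly populated timestamp already carries the hour):

from pyspark.sql.functions import hour

# The hour is already encoded in the timestamp column.
elevDF.select(
    hour("date").alias("hour_from_date")
).show()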