You can use the .map method, just like in Pandas:
In [1]: import dask.dataframe as dd

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'x': [1, 2, 3]})

In [4]: ddf = dd.from_pandas(df, npartitions=2)

In [5]: df.x.map(lambda x: x + 1)
Out[5]:
0    2
1    3
2    4
Name: x, dtype: int64

In [6]: ddf.x.map(lambda x: x + 1).compute()
Out[6]:
0    2
1    3
2    4
Name: x, dtype: int64

Metadata
You may be asked to provide a meta= keyword. This lets dask.dataframe know the output name and type of your function. Copying the docstring from map_partitions here:
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
    An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
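To illustrate the alternative forms the docstring mentions, here is a minimal sketch using map_partitions: when the function returns a whole DataFrame, meta can be given as a dict of {column name: dtype}. The transform that adds a float column 'y' is a made-up example of mine, not something from the answer above.

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
ddf = dd.from_pandas(df, npartitions=2)

# The function returns a DataFrame, so meta describes every output column.
# Here meta is a dict of {column name: dtype}; an empty DataFrame with
# matching columns and dtypes would work just as well.
result = ddf.map_partitions(
    lambda part: part.assign(y=part.x * 2.0),   # hypothetical transform
    meta={'x': 'int64', 'y': 'float64'},
)
print(result.compute())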
So in the example above, my output will be a series with name 'x' and dtype int, and I can do either of the following to be more explicit:
>>> ddf.x.map(lambda x: x + 1, meta=('x', int))
Or
>>> ddf.x.map(lambda x: x + 1, meta=pd.Series([], dtype=int, name='x'))
This tells dask.dataframe what to expect from our function. If no meta is given, dask.dataframe will try running your function on a small piece of data. If that fails, it raises an error asking for help.
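As a sketch of that inference step (reusing the same toy DataFrame as above): when meta is omitted, dask calls the function on a tiny sample to guess the output dtype and usually emits a warning suggesting an explicit meta; passing meta yourself gives the same result without the guess.

import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3]}), npartitions=2)

# Without meta, dask infers the output dtype from a small sample
# and typically warns you to provide meta explicitly.
inferred = ddf.x.map(lambda x: x + 0.5)

# With meta, no inference is needed and the warning goes away.
explicit = ddf.x.map(lambda x: x + 0.5, meta=('x', 'float64'))

print(inferred.dtype, explicit.dtype)   # both float64
print(explicit.compute())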