You can use the .map method, just like in Pandas:
In [1]: import dask.dataframe as dd

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'x': [1, 2, 3]})

In [4]: ddf = dd.from_pandas(df, npartitions=2)

In [5]: df.x.map(lambda x: x + 1)
Out[5]:
0    2
1    3
2    4
Name: x, dtype: int64

In [6]: ddf.x.map(lambda x: x + 1).compute()
Out[6]:
0    2
1    3
2    4
Name: x, dtype: int64

Metadata
You may be asked to provide a meta= keyword. This lets dask.dataframe know the output name and type of your function. Copying the docstring from map_partitions here:
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
    An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
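To illustrate the alternative forms the docstring mentions, here is a minimal sketch using map_partitions: when the function returns a whole DataFrame, meta can be given as a dict of {column name: dtype}. The transform that adds a float column 'y' is a made-up example of mine, not something from the answer above.

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
ddf = dd.from_pandas(df, npartitions=2)

# The function returns a DataFrame, so meta describes every output column.
# Here meta is a dict of {column name: dtype}; an empty DataFrame with
# matching columns and dtypes would work just as well.
result = ddf.map_partitions(
    lambda part: part.assign(y=part.x * 2.0),   # hypothetical transform
    meta={'x': 'int64', 'y': 'float64'},
)
print(result.compute())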
So in the example above, my output will be a series with name 'x' and dtype int, and I can do either of the following to be more explicit:
>>> ddf.x.map(lambda x: x + 1, meta=('x', int))
Or
>>> ddf.x.map(lambda x: x + 1, meta=pd.Series([], dtype=int, name='x'))
This tells dask.dataframe what to expect from our function. If no meta is given, dask.dataframe will try running your function on a small piece of data. If that fails, it raises an error asking for help.
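As a sketch of that inference step (reusing the same toy DataFrame as above): when meta is omitted, dask calls the function on a tiny sample to guess the output dtype and usually emits a warning suggesting an explicit meta; passing meta yourself gives the same result without the guess.

import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3]}), npartitions=2)

# Without meta, dask infers the output dtype from a small sample
# and typically warns you to provide meta explicitly.
inferred = ddf.x.map(lambda x: x + 0.5)

# With meta, no inference is needed and the warning goes away.
explicit = ddf.x.map(lambda x: x + 0.5, meta=('x', 'float64'))

print(inferred.dtype, explicit.dtype)   # both float64
print(explicit.compute())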