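Start with the official documentation: Dask
Search for related questions: Stack Overflow with the #dask tag
Anaconda ships with Dask by default, so there is no need to install it separately. It works on both Linux and Windows.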
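Introduction
Image source: https://docs.dask.org/en/latest/index.html
One-sentence summary: Dask is a flexible library for parallel computing in Python.
- Dynamic scheduling of resources for parallel computation (speed-up)
- Parallel data collections that extend the interfaces of NumPy, pandas, or Python iterators (familiar interfaces)
- Task graphs are very clear, so developers and users can freely build complex algorithms and handle the messy situations that the map/filter/groupby paradigm common to most data engineering frameworks struggles with (aids understanding)
- Scales from a personal computer to a cluster (wide range of uses)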
A quick comparison
Dask DataFrame mimics Pandas
    # pandas                                # Dask
    import pandas as pd                     import dask.dataframe as dd
    df = pd.read_csv('2015-01-01.csv')      df = dd.read_csv('2015-*-*.csv')
    df.groupby(df.user_id).value.mean()     df.groupby(df.user_id).value.mean().compute()
Dask Array mimics NumPy
    # NumPy                             # Dask
    import h5py                         import h5py
    import numpy as np                  import dask.array as da
    f = h5py.File('myfile.hdf5')        f = h5py.File('myfile.hdf5')
    x = np.array(f['/small-data'])      x = da.from_array(f['/big-data'],
                                                          chunks=(1000, 1000))
    x - x.mean(axis=1)                  x - x.mean(axis=1).compute()
Dask Bag mimics iterators, Toolz, and PySpark
    import json
    import dask.bag as db
    b = db.read_text('2015-*-*.json.gz').map(json.loads)
    b.pluck('name').frequencies().topk(10, lambda pair: pair[1]).compute()
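The snippets above depend on files that are not part of these notes. As a rough, self-contained sketch of the same idea, the example below replaces the HDF5 dataset with a small in-memory NumPy array (an assumption made purely for illustration) and shows that a Dask array expression stays lazy until compute() is called.

```python
import numpy as np
import dask.array as da

# Stand-in for a large on-disk dataset: a small in-memory array (illustrative only).
data = np.arange(24, dtype=float).reshape(4, 6)

# Wrap it as a Dask array split into blocks of 2 rows x 3 columns.
x = da.from_array(data, chunks=(2, 3))

# Building the expression is lazy: this only records tasks, nothing runs yet.
# keepdims=True keeps the row means as a column so the subtraction broadcasts.
centered = x - x.mean(axis=1, keepdims=True)

# compute() executes the task graph and returns a plain NumPy array.
result = centered.compute()

# The same computation in plain NumPy, for comparison.
expected = data - data.mean(axis=1, keepdims=True)
print(np.allclose(result, expected))
```

The chunk shape here is arbitrary; for real data it would normally be chosen so that each block fits comfortably in memory.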
A simple case study, NCAR: Hydrological Modeling
- We use Xarray to provide a higher level (and familiar) interface around Numpy arrays and Dask arrays
- We use NetCDF and HDF files for data storage
- I mostly work on HPC systems and have been helping develop the dask-jobqueue package for deploying Dask on job queueing systems
- In the Pangeo project, we’re exploring Dask applications using Kubernetes and Jupyter notebooks
Summary:
Dask gives us two main capabilities:
- Parallel computation: array, dataframe, function…
- Process graphs or task graphs: Diagnostics, Dask schedulers, task graphs
For now (2020) I mainly use the single-machine installation.
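On a single machine, the main choice is which local scheduler runs the task graph. The sketch below is a minimal illustration, not a recommendation: the word list is made-up data, and it only shows two ways of selecting a scheduler that Dask provides, the scheduler= keyword on compute() and dask.config.set.

```python
import dask
import dask.bag as db

# A tiny made-up dataset split into two partitions.
words = db.from_sequence(["a", "b", "a", "c", "a", "b"], npartitions=2)

# frequencies() is lazy; it describes a count of each distinct item.
counts = words.frequencies()

# Option 1: pick the scheduler for a single call (thread pool).
threaded = dict(counts.compute(scheduler="threads"))

# Option 2: set the scheduler for a block of code
# (single-threaded, which is convenient for debugging).
with dask.config.set(scheduler="synchronous"):
    serial = dict(counts.compute())

print(threaded == serial)
```

Both calls run the same graph; only the executor differs, which is why the results match.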
High Level:
- Arrays: Parallel NumPy
- Bags: Parallel lists
- DataFrames: Parallel Pandas
- Machine Learning : Parallel Scikit-Learn
- Others from external projects, like XArray
Low Level:
- Delayed: Parallel function evaluation
- Futures: Real-time parallel function evaluation
- Dask Task Stream
- Dask progress
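To make the Delayed and progress items more concrete, here is a minimal sketch. The functions inc and add are hypothetical helpers invented for this example; delayed and ProgressBar are part of Dask. Futures are left out because they need the separate distributed scheduler.

```python
from dask import delayed
from dask.diagnostics import ProgressBar


@delayed
def inc(x):
    # Hypothetical helper: add one to its input.
    return x + 1


@delayed
def add(a, b):
    # Hypothetical helper: sum two values.
    return a + b


# Calling delayed functions only records tasks; nothing is executed here.
# The two inc calls are independent, so a scheduler may run them in parallel.
total = add(inc(1), inc(2))

# ProgressBar reports progress for local schedulers while the graph runs.
with ProgressBar():
    result = total.compute(scheduler="threads")

print(result)
```

If graphviz is installed, total.visualize() should render this small task graph as an image, which is a handy way to check what will actually run.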
Not yet mastered (2020.4.23)