Using Dask and Xarray through JupyterLab for scalable EO Data Analytics

Demonstration Platform Documentation


Two Python libraries are key to delivering scalable analysis of satellite EO datasets over large portions of the Canadian landmass: Xarray and Dask.

To conduct analytics that scale:

  • Use Zarr as your on-disk data storage format
  • Use Xarray as your in-memory data interface
  • Use Dask to execute your code in parallel, with Kubernetes providing on-demand scalable compute
  • Use lazy loading/execution throughout. Lazy loading and execution are the default with both Xarray and Dask.

Xarray

Xarray is an open source project and Python package that provides a toolkit for working with labeled N-dimensional arrays of data. Xarray adopts the Common Data Model for self-describing scientific data in widespread use in the Earth sciences: xarray.Dataset is an in-memory representation of a netCDF file. Xarray provides the basic data structures used for EO analysis, as well as powerful tools for computation and visualization. Xarray distinguishes itself from many netCDF data tools in that its in-memory data structures both utilize and preserve dimensional labels. Users only need to do the tedious work of adding metadata once, not every time a file is saved.
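A brief sketch of what "labeled" buys you. The variable names, coordinate values, and dimensions below are invented for illustration:

```python
import numpy as np
import xarray as xr

# A Dataset mirrors the netCDF Common Data Model: named variables that
# share named dimensions and coordinate labels.
ds = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), np.random.rand(3, 2, 4))},
    coords={
        "time": np.array(
            ["2020-01-01", "2020-01-02", "2020-01-03"], dtype="datetime64[ns]"
        ),
        "lat": [45.0, 46.0],
        "lon": [-76.0, -75.0, -74.0, -73.0],
    },
)

# Operations refer to dimensions by name, not by axis position,
# and the coordinate labels survive the computation.
spatial_mean = ds["temperature"].mean(dim=("lat", "lon"))
one_day = ds.sel(time="2020-01-02")
```

Selecting by label (`sel`) and reducing by dimension name removes a whole class of axis-ordering bugs, and the metadata attached here is carried through to any file later written from the Dataset.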

Dask

Dask is a flexible parallel computing library for analytics. Dask is the key to the scalability of the platform; its data structures can represent extremely large datasets without actually loading them in memory, and its distributed schedulers permit supercomputers and cloud computing clusters to efficiently parallelize computations across many nodes. Dask works well with Python’s existing scientific software ecosystem, including libraries like Xarray, NumPy, Pandas, and Scikit-Learn. In many cases, it offers users the ability to take existing workflows and quickly scale them to much larger applications with parallel execution across many compute nodes. In this way, applications can first be tested on a minimal set of resources and then scaled using Dask to hundreds of nodes using the platform’s scalable IaaS features.
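The "represent without loading" idea can be seen directly with a Dask array. The array here would occupy roughly 800 MB as a dense NumPy array, but only chunk metadata exists until `compute()` is called; the sizes are arbitrary:

```python
import dask.array as da

# A 10000 x 10000 array represented lazily as 100 chunks of 1000 x 1000.
x = da.random.random((10000, 10000), chunks=(1000, 1000))

# Building the expression creates a task graph; nothing runs yet.
total = (x + x.T).mean()

# compute() executes the graph, processing chunks in parallel. On the
# platform the same call would fan out across many worker nodes.
value = total.compute()
```

The same script runs unchanged on a laptop with the default threaded scheduler or on a multi-node cluster, which is what allows workflows to be tested small and then scaled up.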

Dask has child libraries that allow distributed parallel computation to be run on public cloud infrastructure that implements Kubernetes (such as Amazon Web Services, Google Cloud and Microsoft Azure), and on HPC systems (with job queuing systems such as PBS, Slurm, MOAB, SGE, and LSF, or that can launch MPI applications).
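Connecting to any of these back ends follows the same pattern: construct a cluster object, hand it to a `Client`, and all subsequent Dask computations are scheduled on it. The sketch below uses a local in-process cluster so it runs anywhere; on the platform, a cluster object from the dask-kubernetes or dask-jobqueue child libraries (class names not taken from this document and assumed here) would take its place:

```python
import dask.array as da
from dask.distributed import Client

# Local stand-in for a Kubernetes or HPC cluster. On those systems this
# line would instead construct e.g. a KubeCluster or a Slurm/PBS cluster
# object and pass it to Client.
client = Client(processes=False, n_workers=2, threads_per_worker=1)

# Any Dask work submitted now is scheduled on the cluster's workers.
x = da.random.random((4000, 4000), chunks=(1000, 1000))
total = x.sum().compute()

client.close()
```

Because only the cluster-construction line changes between environments, the analysis code itself is portable from laptop to cloud to HPC.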
Through the proof-of-concept system, access to data through Xarray’s n-dimensional array interfaces has been provided, along with demonstrations of using Dask to scale analytics over large areas and deep timeseries.
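A deep-timeseries analysis of the kind demonstrated might look like the following sketch, where a year of daily observations over one tile is reduced to monthly composites; the variable name, tile size, and chunking are illustrative:

```python
import numpy as np
import pandas as pd
import xarray as xr

# One year of daily observations over a 50 x 50 tile, chunked along time
# so Dask can reduce the months in parallel.
times = pd.date_range("2019-01-01", periods=365, freq="D")
cube = xr.DataArray(
    np.random.rand(365, 50, 50),
    dims=("time", "y", "x"),
    coords={"time": times},
    name="reflectance",
).chunk({"time": 30})

# Monthly mean composites: lazy until compute() walks the task graph.
monthly = cube.resample(time="MS").mean().compute()
```

Scaling this to a large area is a matter of pointing the same code at a bigger (Zarr-backed) array with more chunks; the task graph simply grows and Dask distributes it over more workers.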

Dask and Xarray are currently Python-only

The JupyterLab environment is currently constrained to applications written in Python, because Dask and Xarray are currently only available to Python applications. Where non-Python analytics or pre-processing is required (such as running Sen2Cor or SNAP tools), the GEO Analytics Canada proof-of-concept demonstration system’s KubeFlow component can be used.
