The PyAOS Stack

A group of programs that works in tandem to produce a result or achieve a common goal is often referred to as a software stack. This page gives an overview of the PyAOS stack.

[Diagram: the PyAOS stack]

Core libraries

The thick solid box in the diagram above represents the core of the PyAOS stack, so let’s start our tour there. The default library for dealing with numerical arrays in Python is NumPy. It has some built-in functions for calculating very simple statistics (e.g. maximum, mean, standard deviation), but for more complex analysis (e.g. interpolation, integration, linear algebra) the SciPy library is the default. If you’re dealing with particularly large arrays, Dask works with the existing Python ecosystem (i.e. NumPy, SciPy, etc) to scale your analysis to multi-core machines and/or distributed clusters (i.e. parallel processing).
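As a rough illustration of that division of labour, the sketch below uses NumPy for the simple statistics and SciPy for interpolation and integration. The temperature profile is made-up data and the variable names are hypothetical.

```python
import numpy as np
from scipy import integrate, interpolate

# Hypothetical data: a temperature profile sampled at five heights (values are made up)
height_km = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
temperature_c = np.array([15.0, 8.5, 2.0, -4.5, -11.0])

# NumPy covers the very simple statistics
mean_temp = temperature_c.mean()
max_temp = temperature_c.max()
std_temp = temperature_c.std()

# SciPy handles the more involved analysis, such as interpolation...
profile = interpolate.interp1d(height_km, temperature_c)  # linear by default
temp_at_1500m = float(profile(1.5))

# ...and integration (trapezoidal rule over the sampled heights)
layer_integral = integrate.trapezoid(temperature_c, x=height_km)

print(mean_temp, max_temp, std_temp, temp_at_1500m, layer_integral)
```

Dask is not shown here; for arrays this small it would add overhead rather than remove it.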

If you want to visualise your NumPy data arrays, the default library is matplotlib. As you can see at the matplotlib gallery, this library is great for any simple (e.g. bar charts, contour plots, line graphs), static (e.g. .png, .eps, .pdf) plots. The cartopy library provides additional plotting functionality for common map projections.
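A minimal sketch of a simple, static matplotlib plot might look like the following. The field being contoured is synthetic, the output filename is a placeholder, and the Agg backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Synthetic two-dimensional field (made up for illustration)
x = np.linspace(0, 2 * np.pi, 50)
y = np.linspace(0, np.pi, 25)
field = np.sin(x)[np.newaxis, :] * np.cos(y)[:, np.newaxis]  # shape (25, 50)

# Filled contour plot with a colour bar and axis labels
fig, ax = plt.subplots()
contours = ax.contourf(x, y, field)
fig.colorbar(contours, ax=ax, label="amplitude")
ax.set_xlabel("x")
ax.set_ylabel("y")

# Write a static image; the filename is a placeholder
fig.savefig("contour_example.png")
```

Every element of the plot (colour bar, labels, output format) is requested explicitly, which is the flexibility, and the verbosity, that the higher-level libraries build on.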

While pretty much all data analysis and visualisation tasks could be achieved with a combination of these core libraries, their flexible, all-purpose nature means relatively common/simple tasks can often require quite a bit of work (i.e. many lines of code). To make things more efficient for data scientists, the scientific Python community has therefore built a number of libraries on top of the core stack. These high-level libraries aren’t as flexible (they can’t do everything like the core stack can), but they can do common tasks with far less effort.

High-level libraries (generic)

Let’s first consider the generic high-level libraries. That is, the ones that can be used in essentially all fields of data science. The most popular of these libraries is undoubtedly pandas, which has been a real game-changer for the Python data science community. The key advance offered by pandas is the concept of labelled arrays. Rather than referring to the individual elements of a data array using a numeric index (as is required with NumPy), the actual row and column headings can be used. That means information from the cardiac ward on 3 July 2005 could be obtained from a medical dataset by asking for data['cardiac'].loc['2005-07-03'], rather than having to remember the numeric index corresponding to that ward and date. This labelled array feature, combined with a bunch of other features that simplify common statistical and plotting tasks traditionally performed with SciPy and matplotlib, greatly simplifies the code development process (read: fewer lines of code).
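The cardiac ward lookup can be sketched as follows. The dataset is entirely made up (ward names, dates and patient counts are hypothetical), and it assumes the dates form the row index and the wards form the columns.

```python
import pandas as pd

# Hypothetical dataset: daily patient counts per ward (values are made up)
dates = pd.date_range("2005-07-01", periods=5, freq="D")
data = pd.DataFrame(
    {"cardiac": [12, 15, 11, 14, 13], "maternity": [7, 9, 8, 6, 10]},
    index=dates,
)

# Label-based lookup: ward by column name, day by date
by_label = data["cardiac"].loc["2005-07-03"]

# Equivalent position-based lookup on the underlying NumPy array,
# which requires knowing that the date is row 2 and the ward is column 0
by_position = data.to_numpy()[2, 0]

print(by_label, by_position)
```

Both lookups return the same value, but only the first one survives a reordering of the rows or columns.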

One of the limitations of pandas is that it’s only able to handle one- or two-dimensional (i.e. tabular) data arrays. The xarray library was therefore created to extend the labelled array concept to N-dimensional arrays. Not all of the pandas functionality is available (which is a trade-off associated with being able to handle multi-dimensional arrays), but the ability to refer to array elements by their actual latitude (e.g. 20 South), longitude (e.g. 50 East), vertical level (e.g. 500 hPa) and time (e.g. 2015-04-27), for example, makes the xarray data array far easier to deal with than the NumPy array. As an added bonus, xarray also has built-in functionality for reading/writing specific AOS file formats (e.g. netCDF, GRIB), which NumPy and pandas don’t have.
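To make the labelled, multi-dimensional idea concrete, here is a small sketch that builds a four-dimensional xarray DataArray and selects a single point by its coordinate values. The dimension names, coordinate values and data are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical coordinates (southern latitudes are negative)
times = pd.date_range("2015-04-26", periods=3, freq="D")
levels = [850, 500, 250]      # pressure levels in hPa
lats = [-30.0, -20.0, -10.0]  # degrees north
lons = [40.0, 50.0, 60.0]     # degrees east

# Made-up data: each element is simply its flattened position in the array
values = np.arange(3 * 3 * 3 * 3, dtype=float).reshape(3, 3, 3, 3)

temperature = xr.DataArray(
    values,
    dims=("time", "level", "lat", "lon"),
    coords={"time": times, "level": levels, "lat": lats, "lon": lons},
    name="temperature",
    attrs={"units": "K"},
)

# Select by label rather than by numeric index
point = temperature.sel(time="2015-04-27", level=500, lat=-20.0, lon=50.0)

# The same element via positional indexing on the raw NumPy array
same_point = values[1, 1, 1, 1]

print(point.item(), same_point)
```

The label-based selection reads like the description of the data point itself, whereas the positional version only makes sense alongside the coordinate lists.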

High-level libraries (AOS-specific)

While the xarray library is a good option for those working in the atmosphere and ocean sciences (especially those dealing with large multi-dimensional arrays from model simulations), the SciTools project led by the Met Office has taken a different approach to building on top of the core stack. Rather than striving to make their software generic (xarray is designed to handle any multi-dimensional data), they explicitly assume that users of their Iris library are dealing with weather/ocean/climate data. Doing this allows them to make common weather/climate tasks super quick and easy, and it also means they have added functionality specific to atmosphere and ocean science. In addition to Iris, you may also come across other weather/climate/ocean-oriented high-level libraries such as cf-python or PyGeode. Those libraries may have particular functionality that makes them useful for your work, but in general they have a much smaller developer and user base than xarray and Iris.

How to choose

In terms of choosing between xarray and Iris, some people like the slightly more AOS-centric experience offered by Iris, while others don’t like the restrictions that places on their work (e.g. to use Iris your input data files have to be CF compliant or close to it) and prefer the generic xarray experience. Either way, they are both a vast improvement on the NumPy/matplotlib experience.

Simplifying data exploration

While the plotting functionality associated with xarray and Iris speeds up the process of visually exploring data (as compared to matplotlib), there’s still a fair bit of messing around involved in tweaking the various aspects of a plot (e.g. colour schemes, plot size, labels, map projections, etc). This tweaking burden is an issue across all data science fields and programming languages, so developers of the latest generation of visualisation tools are moving towards something called declarative visualisation. The basic concept is that the user simply has to describe the characteristics of their data, and then the software figures out the optimal way to visualise it (i.e. it makes all the tweaking decisions for you).

The two major Python libraries in the declarative visualisation space are HoloViews and Altair. The former (which has been around much longer) uses matplotlib or Bokeh (for interactive plots where you can zoom and scroll) under the hood, which means it allows for the generation of static or interactive plots. Since HoloViews doesn’t have support for geographic plots, GeoViews and hvPlot have been created on top of it and offer geographic plotting functionality by leveraging many elements of the PyAOS stack (i.e. cartopy, xarray, Dask, etc).

Sub-discipline-specific libraries

So far we’ve considered libraries that do general, broad-scale tasks like data input/output, common statistics, visualisation, etc. Given their large user base, these libraries are usually written and supported by large companies (e.g. Anaconda supports Bokeh and HoloViews/GeoViews), large institutions (e.g. the Met Office supports Iris, cartopy and GeoViews) or the wider PyData community (e.g. pandas, xarray). Within each sub-discipline of atmosphere and ocean science, individuals and research groups take these libraries and apply them to their very specific data analysis tasks. Increasingly, these individuals and groups are formally packaging and releasing their code for use within their community. For instance, Andrew Dawson (an atmospheric scientist at Oxford) does a lot of EOF analysis and manipulation of wind data, so he has released his eofs and windspharm libraries. Similarly, a group at the Atmospheric Radiation Measurement (ARM) Climate Research Facility have released their Python ARM Radar Toolkit (Py-ART) for analysing weather radar data, and a similar story is true for MetPy.

The diagram above shows some of the more widely used sub-discipline-specific libraries for meteorology, oceanography, climate, statistics, working with gridded data, etc. Many are built on top of xarray and/or Iris (i.e. the core data construct is an xarray DataArray or Iris cube), as indicated by the dashed circles. Check out the Package Index for a more complete listing of the sub-discipline-specific libraries in your particular area of AOS research and the results of the 2021 PyAOS Census for more information on the wide range of Python libraries used by the AOS community.

Summary

Most Python users in the atmosphere and ocean sciences base their data analysis around the xarray or Iris libraries. The appeal of these high-level libraries is that they are built on top of (and thus hide the complexity of) core data science libraries like NumPy and matplotlib. You will occasionally find yourself needing to use a core library directly (e.g. you might create a plot with xarray and then call a specific matplotlib function to customise a label on that plot), but to avoid re-inventing the wheel your first impulse should always be to check whether a high-level library has the functionality you need. Nothing would be more heartbreaking than spending hours writing your own function using the netCDF4 library for extracting the metadata contained within a netCDF file, for instance, only to find that xarray and Iris automatically keep this information upon reading a netCDF file. In this way, a solid working knowledge of the PyAOS stack can save you a lot of time and effort.
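The workflow of plotting with a high-level library and then dropping down to a core library for one tweak might look like the sketch below. The sea surface temperature field is random synthetic data, the names and attribute values are hypothetical, and the Agg backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

# Synthetic field with metadata attached (all values are made up)
rng = np.random.default_rng(0)
sst = xr.DataArray(
    rng.normal(loc=20.0, scale=2.0, size=(4, 5)),
    dims=("lat", "lon"),
    coords={"lat": [-30.0, -20.0, -10.0, 0.0],
            "lon": [40.0, 50.0, 60.0, 70.0, 80.0]},
    name="sst",
    attrs={"units": "degC", "long_name": "sea surface temperature"},
)

# High-level step: xarray chooses the plot type and labels from the data
fig, ax = plt.subplots()
sst.plot(ax=ax)

# Core-library step: customise one label and the title directly with matplotlib
ax.set_xlabel("Longitude (degrees east)")
ax.set_title("Synthetic SST field")

# The metadata travels with the array, so there is no need to track it separately
print(sst.attrs)
```

Only the two customisation calls touch matplotlib directly; everything else, including the metadata handling, is left to xarray.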