Home Artificial Intelligence Beyond Numpy and Pandas: Unlocking the Potential of Lesser-Known Python Libraries 1. Dask 2. SymPy 3. Xarray Conclusions

Beyond Numpy and Pandas: Unlocking the Potential of Lesser-Known Python Libraries 1. Dask 2. SymPy 3. Xarray Conclusions

0
Beyond Numpy and Pandas: Unlocking the Potential of Lesser-Known Python Libraries
1. Dask
2. SymPy
3. Xarray
Conclusions

Introducing Xarray

Xarray is a Python library that extends the features and functionalities of NumPy, giving us the likelihood to work with labeled arrays and datasets.

As they are saying on their website, the truth is:

Xarray makes working with labeled multi-dimensional arrays in Python easy, efficient, and fun!

And likewise:

Xarray introduces labels in the shape of dimensions, coordinates and attributes on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and fewer error-prone developer experience.

In other words, it extends the functionality of NumPy arrays by adding labels or coordinates to the array dimensions. These labels provide metadata and enable more advanced evaluation and manipulation of multi-dimensional data.

For instance, in NumPy, arrays are accessed using integer-based indexing.

In Xarray, as a substitute, each dimension can have a label related to it, making it easier to know and manipulate the info based on meaningful names.

For instance, as a substitute of accessing data with arr[0, 1, 2], we will use arr.sel(x=0, y=1, z=2) in Xarray, where x, y, and z are dimension labels.

This makes the code way more readable!

So, let’s see some features of Xarray.

Some features of Xarray in motion

As usual, to put in it:

$ pip install xarray

FEATURE ONE: WORKING WITH LABELED COORDINATES

Suppose we would like to create some data related to temperature and we would like to label these with coordinates like latitude and longitude. We are able to do it like so:

import xarray as xr
import numpy as np

# Create temperature data
temperature = np.random.rand(100, 100) * 20 + 10

# Create coordinate arrays for latitude and longitude
latitudes = np.linspace(-90, 90, 100)
longitudes = np.linspace(-180, 180, 100)

# Create an Xarray data array with labeled coordinates
da = xr.DataArray(
temperature,
dims=['latitude', 'longitude'],
coords={'latitude': latitudes, 'longitude': longitudes}
)

# Access data using labeled coordinates
subset = da.sel(latitude=slice(-45, 45), longitude=slice(-90, 0))

And if we print them we get:

# Print data
print(subset)

>>>

array([[13.45064786, 29.15218061, 14.77363206, ..., 12.00262833,
16.42712411, 15.61353963],
[23.47498117, 20.25554247, 14.44056286, ..., 19.04096482,
15.60398491, 24.69535367],
[25.48971105, 20.64944534, 21.2263141 , ..., 25.80933737,
16.72629302, 29.48307134],
...,
[10.19615833, 17.106716 , 10.79594252, ..., 29.6897709 ,
20.68549602, 29.4015482 ],
[26.54253304, 14.21939699, 11.085207 , ..., 15.56702191,
19.64285595, 18.03809074],
[26.50676351, 15.21217526, 23.63645069, ..., 17.22512125,
13.96942377, 13.93766583]])
Coordinates:
* latitude (latitude) float64 -44.55 -42.73 -40.91 ... 40.91 42.73 44.55
* longitude (longitude) float64 -89.09 -85.45 -81.82 ... -9.091 -5.455 -1.818

So, let’s see the method step-by-step:

  1. We’ve created the temperature values as a NumPy array.
  2. We’ve defined the latitudes and longitueas values as NumPy arrays.
  3. We’ve stored all the info in an Xarray array with the strategy DataArray().
  4. We’ve chosen a subset of the latitudes and longitudes with the strategy sel() that selects the values we would like for our subset.

The result can also be easily readable, so labeling is de facto helpful in plenty of cases.

FEATURE TWO: HANDLING MISSING DATA

Suppose we’re collecting data related to temperatures throughout the 12 months. We would like to know if we’ve got some null values in our array. Here’s how we will accomplish that:

import xarray as xr
import numpy as np
import pandas as pd

# Create temperature data with missing values
temperature = np.random.rand(365, 50, 50) * 20 + 10
temperature[0:10, :, :] = np.nan # Set the primary 10 days as missing values

# Create time, latitude, and longitude coordinate arrays
times = pd.date_range('2023-01-01', periods=365, freq='D')
latitudes = np.linspace(-90, 90, 50)
longitudes = np.linspace(-180, 180, 50)

# Create an Xarray data array with missing values
da = xr.DataArray(
temperature,
dims=['time', 'latitude', 'longitude'],
coords={'time': times, 'latitude': latitudes, 'longitude': longitudes}
)

# Count the variety of missing values along the time dimension
missing_count = da.isnull().sum(dim='time')

# Print missing values
print(missing_count)

>>>


array([[10, 10, 10, ..., 10, 10, 10],
[10, 10, 10, ..., 10, 10, 10],
[10, 10, 10, ..., 10, 10, 10],
...,
[10, 10, 10, ..., 10, 10, 10],
[10, 10, 10, ..., 10, 10, 10],
[10, 10, 10, ..., 10, 10, 10]])
Coordinates:
* latitude (latitude) float64 -90.0 -86.33 -82.65 ... 82.65 86.33 90.0
* longitude (longitude) float64 -180.0 -172.7 -165.3 ... 165.3 172.7 180.0

And so we obtain that we’ve got 10 null values.

Also, if we have a look closely on the code, we will see that we will apply Pandas’ methods to an Xarray like isnull.sum(), as on this case, that counts the whole variety of missing values.

FEATURE ONE: HANDLING AND ANALYZING MULTI-DIMENSIONAL DATA

The temptation to handle and analyze multi-dimensional data is high when we’ve got the likelihood to label our arrays. So, why not try it?

For instance, suppose we’re still collecting data related to temperatures at certain latitudes and longitudes.

We will probably want to calculate the mean, the max, and the median temperatures. We are able to do it like so:

import xarray as xr
import numpy as np
import pandas as pd

# Create synthetic temperature data
temperature = np.random.rand(365, 50, 50) * 20 + 10

# Create time, latitude, and longitude coordinate arrays
times = pd.date_range('2023-01-01', periods=365, freq='D')
latitudes = np.linspace(-90, 90, 50)
longitudes = np.linspace(-180, 180, 50)

# Create an Xarray dataset
ds = xr.Dataset(
{
'temperature': (['time', 'latitude', 'longitude'], temperature),
},
coords={
'time': times,
'latitude': latitudes,
'longitude': longitudes,
}
)

# Perform statistical evaluation on the temperature data
mean_temperature = ds['temperature'].mean(dim='time')
max_temperature = ds['temperature'].max(dim='time')
min_temperature = ds['temperature'].min(dim='time')

# Print values
print(f"mean temperature:n {mean_temperature}n")
print(f"max temperature:n {max_temperature}n")
print(f"min temperature:n {min_temperature}n")

>>>

mean temperature:

array([[19.99931701, 20.36395016, 20.04110699, ..., 19.98811842,
20.08895803, 19.86064693],
[19.84016491, 19.87077812, 20.27445405, ..., 19.8071972 ,
19.62665953, 19.58231185],
[19.63911165, 19.62051976, 19.61247548, ..., 19.85043831,
20.13086891, 19.80267099],
...,
[20.18590514, 20.05931149, 20.17133483, ..., 20.52858247,
19.83882433, 20.66808513],
[19.56455575, 19.90091128, 20.32566232, ..., 19.88689221,
19.78811145, 19.91205212],
[19.82268297, 20.14242279, 19.60842148, ..., 19.68290006,
20.00327294, 19.68955107]])
Coordinates:
* latitude (latitude) float64 -90.0 -86.33 -82.65 ... 82.65 86.33 90.0
* longitude (longitude) float64 -180.0 -172.7 -165.3 ... 165.3 172.7 180.0

max temperature:

array([[29.98465531, 29.97609171, 29.96821276, ..., 29.86639343,
29.95069558, 29.98807808],
[29.91802049, 29.92870312, 29.87625447, ..., 29.92519055,
29.9964299 , 29.99792388],
[29.96647016, 29.7934891 , 29.89731136, ..., 29.99174546,
29.97267052, 29.96058079],
...,
[29.91699117, 29.98920555, 29.83798369, ..., 29.90271746,
29.93747041, 29.97244906],
[29.99171911, 29.99051943, 29.92706773, ..., 29.90578739,
29.99433847, 29.94506567],
[29.99438621, 29.98798699, 29.97664488, ..., 29.98669576,
29.91296382, 29.93100249]])
Coordinates:
* latitude (latitude) float64 -90.0 -86.33 -82.65 ... 82.65 86.33 90.0
* longitude (longitude) float64 -180.0 -172.7 -165.3 ... 165.3 172.7 180.0

min temperature:

array([[10.0326431 , 10.07666029, 10.02795524, ..., 10.17215336,
10.00264909, 10.05387097],
[10.00355858, 10.00610942, 10.02567816, ..., 10.29100316,
10.00861792, 10.16955806],
[10.01636216, 10.02856619, 10.00389027, ..., 10.0929342 ,
10.01504103, 10.06219179],
...,
[10.00477003, 10.0303088 , 10.04494723, ..., 10.05720692,
10.122994 , 10.04947012],
[10.00422182, 10.0211205 , 10.00183528, ..., 10.03818058,
10.02632697, 10.06722953],
[10.10994581, 10.12445222, 10.03002468, ..., 10.06937041,
10.04924046, 10.00645499]])
Coordinates:
* latitude (latitude) float64 -90.0 -86.33 -82.65 ... 82.65 86.33 90.0
* longitude (longitude) float64 -180.0 -172.7 -165.3 ... 165.3 172.7 180.0

And we obtained what we wanted, also in a clearly readable way.

And again, as before, to calculate the max, min, and mean values of temperatures we’ve used Pandas’ functions applied to an array.

LEAVE A REPLY

Please enter your comment!
Please enter your name here