Raster Sources Using Recipes
Dataset.create_raster_recipe(key, fp, dtype, channel_count, channels_schema=None, sr=None, compute_array=None, merge_arrays=<function concat_arrays>, queue_data_per_primitive=mappingproxy({}), convert_footprint_per_primitive=None, computation_pool='cpu', merge_pool='cpu', resample_pool='cpu', computation_tiles=None, max_computation_size=None, max_resampling_size=None, automatic_remapping=True, debug_observers=())[source]

Warning
This method is not yet implemented. It exists for documentation purposes.
Create a raster recipe and register it under key within this Dataset.
A raster recipe implements the same interfaces as all other rasters, but internally it computes data on the fly by calling a callback. The main goal of raster recipes is to provide a boilerplate-free interface that automates these cumbersome tasks:
- tiling,
- parallelism,
- caching,
- file reads,
- resampling,
- lazy evaluation,
- backpressure prevention, and
- optimised task scheduling.
If you are familiar with create_cached_raster_recipe, two parameters are new here: automatic_remapping and max_computation_size.
Parameters
- key:
see Dataset.create_raster() method
- fp:
see Dataset.create_raster() method
- dtype:
see Dataset.create_raster() method
- channel_count:
see Dataset.create_raster() method
- channels_schema:
see Dataset.create_raster() method
- sr:
see Dataset.create_raster() method
- compute_array: callable
see Computation Function below
- merge_arrays: callable
see Merge Function below
- queue_data_per_primitive: dict of hashable (like a string) to a queue_data method pointer
see Primitives below
- convert_footprint_per_primitive: None or dict of hashable (like a string) to a callable
see Primitives below
- computation_pool:
see Pools below
- merge_pool:
see Pools below
- resample_pool:
see Pools below
- computation_tiles: None or (int, int) or numpy.ndarray of Footprint
see Computation Tiling below
- max_computation_size: None or int or (int, int)
see Computation Tiling below
- max_resampling_size: None or int or (int, int)
Optionally define a maximum resampling size. If a larger resampling has to be performed, it will be performed tile by tile in parallel.
- automatic_remapping: bool
see Automatic Remapping below
- debug_observers: sequence of object
Entry points that observe what is happening with this raster in the Dataset’s scheduler.
Returns
- source: NocacheRasterRecipe
Computation Function
The function that will map a Footprint to a numpy.ndarray. If queue_data_per_primitive is not empty, it will map a Footprint and primitive arrays to a numpy.ndarray.
It will be called in parallel according to the computation_pool parameter provided at construction.
The function will be called with the following positional parameters:
- fp: Footprint of shape (Y, X)
The location at which the pixels should be computed
- primitive_fps: dict of hashable to Footprint
For each primitive defined through the queue_data_per_primitive parameter, the input Footprint.
- primitive_arrays: dict of hashable to numpy.ndarray
For each primitive defined through the queue_data_per_primitive parameter, the input numpy.ndarray that was automatically computed.
- raster: CachedRasterRecipe or None
The Raster object of the ongoing computation.
It should return either:
- a single ndarray of shape (Y, X) if only one channel was computed
- a single ndarray of shape (Y, X, C) if one or more channels were computed
If computation_pool points to a process pool, the compute_array function must be picklable and the raster parameter will be None.
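As a concrete illustration, here is a minimal sketch of such a callback with no primitives. The `_Fp` class below is only a stand-in for a real buzzard Footprint (just its `shape` attribute is used); in a real recipe the scheduler passes an actual Footprint.

```python
import numpy as np

class _Fp:
    """Stand-in for a buzzard Footprint in this sketch: only the
    (Y, X) `shape` attribute is used."""
    def __init__(self, shape):
        self.shape = shape

def compute_array(fp, primitive_fps, primitive_arrays, raster):
    """Sketch of a compute_array callback: fill the requested
    Footprint with a simple i + j gradient, one channel."""
    y, x = fp.shape
    # Return shape (Y, X) because a single channel is computed
    return np.fromfunction(lambda i, j: i + j, (y, x), dtype=int).astype('float32')

arr = compute_array(_Fp((4, 5)), {}, {}, None)
```

The empty dicts stand for the `primitive_fps` and `primitive_arrays` arguments, which would be populated by the scheduler when queue_data_per_primitive is not empty.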
Computation Tiling
You may sometimes want control over the Footprints that are passed to the compute_array function, for example:
- If the pixels computed by compute_array are slow to compute, you want to tile to increase parallelism.
- If the compute_array function scales badly in terms of memory or time, you want to tile to reduce complexity.
- If compute_array can only work on certain Footprints, you want a hard constraint on the set of Footprints that can be queried from compute_array. (This may happen with convolutional neural networks.)
To do so, use either the computation_tiles or the max_computation_size parameter (not both).
If max_computation_size is provided, each Footprint to be computed will be tiled according to this parameter.
If computation_tiles is a numpy.ndarray of Footprint, it should be a tiling of the fp parameter. Only the Footprints contained in this tiling will be passed to the compute_array function. If computation_tiles is (int, int), a tiling will be constructed by calling Footprint.tile with those two ints.
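To make the effect of max_computation_size concrete, the following sketch (an illustration, not buzzard's actual implementation) splits a requested window into tiles no larger than a given size:

```python
def tile_request(height, width, max_size):
    """Split a `height` x `width` pixel request into (row, col, h, w)
    tiles no larger than `max_size` in either dimension, the way a
    max_computation_size constraint would."""
    tiles = []
    for r in range(0, height, max_size):
        for c in range(0, width, max_size):
            tiles.append((r, c, min(max_size, height - r), min(max_size, width - c)))
    return tiles

# A 5 x 7 request with max_size 3 yields 6 tiles covering all 35 pixels
tiles = tile_request(5, 7, 3)
```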
Merge Function
The function that will map several pairs of Footprint/numpy.ndarray to a single numpy.ndarray. If computation_tiles is None, it will never be called.
It will be called in parallel according to the merge_pool parameter provided at construction.
The function will be called with the following positional parameters:
- fp: Footprint of shape (Y, X)
The location at which the pixels should be computed.
- array_per_fp: dict of Footprint to numpy.ndarray
The Footprint/numpy.ndarray pairs of the arrays that were computed by compute_array and that overlap with fp.
- raster: CachedRasterRecipe or None
The Raster object of the ongoing computation.
It should return either:
- a single ndarray of shape (Y, X) if only one channel was computed
- a single ndarray of shape (Y, X, C) if one or more channels were computed
If merge_pool points to a process pool, the merge_arrays function must be picklable and the raster parameter will be None.
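A minimal sketch of such a merge callback is shown below. The `_Tile` class and its `origin_in` helper are hypothetical stand-ins for buzzard Footprints; a real implementation would derive each tile's pixel offset from the actual Footprint geometry.

```python
import numpy as np

class _Tile:
    """Stand-in for a buzzard Footprint; `origin_in` is a hypothetical
    helper returning this tile's (row, col) pixel offset inside the
    merged footprint."""
    def __init__(self, origin, shape):
        self._origin = origin
        self.shape = shape
    def origin_in(self, fp):
        return self._origin

def merge_arrays(fp, array_per_fp, raster):
    """Sketch of a merge callback: paste each computed tile into a
    single output array covering fp."""
    out = np.zeros(fp.shape, dtype='float32')
    for tile_fp, arr in array_per_fp.items():
        r, c = tile_fp.origin_in(fp)
        out[r:r + arr.shape[0], c:c + arr.shape[1]] = arr
    return out

# Merge a 2x4 footprint from two adjacent 2x2 tiles
dst = _Tile((0, 0), (2, 4))
left, right = _Tile((0, 0), (2, 2)), _Tile((0, 2), (2, 2))
merged = merge_arrays(dst, {left: np.ones((2, 2)), right: np.full((2, 2), 2.0)}, None)
```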
Automatic Remapping
When creating a recipe you give a Footprint through the fp parameter. When calling your compute_array function, the scheduler will only ask for slices of fp. This means that the scheduler takes care of those boilerplate steps:
- If you request a Footprint on a different grid in a get_data() call, the scheduler takes care of resampling the outputs of your compute_array function.
- If you request a Footprint partially or fully outside of the raster's extent, the scheduler will call your compute_array function to get the interior pixels and then pad the output with nodata.
This system is flexible and can be deactivated by passing automatic_remapping=False to the constructor of a NocacheRasterRecipe. In this case the scheduler will call your compute_array function for any kind of Footprint; your function must therefore be able to comply with any request.
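The padding step can be pictured with plain numpy. In this illustration (not buzzard code), a request extends 2 pixels past the raster's right edge: the scheduler computes the 3x4 interior and pads the rest of the requested window with the nodata value.

```python
import numpy as np

# Interior pixels actually computed by compute_array (3 rows, 4 cols)
interior = np.arange(12, dtype='float32').reshape(3, 4)
nodata = -1.0

# Pad 2 extra columns on the right with nodata to cover the full request
padded = np.pad(interior, ((0, 0), (0, 2)), constant_values=nodata)
```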
Primitives
The queue_data_per_primitive and convert_footprint_per_primitive parameters can be used to create dependencies between dependee async rasters and the raster recipe being created. The dependee/dependent relation is called primitive/derived throughout buzzard. A derived recipe can itself be the primitive of another raster. Pipelines of any depth and width can be instantiated that way.
In queue_data_per_primitive you declare a dependee by giving it a key of your choice and the pointer to the queue_data method of the dependee raster. You can parameterize the connection by currying the channels, dst_nodata, interpolation and max_queue_size parameters using functools.partial.
The convert_footprint_per_primitive dict should contain the same keys as queue_data_per_primitive. A value in the dict should be a function that maps a Footprint to another Footprint. It can be used for example to request larger rectangles of primitives data to compute a derived array.
e.g. if the primitive raster is an RGB image and the derived raster only needs the green channel, but with a context of 10 additional pixels on all 4 sides:
>>> derived = ds.create_raster_recipe(
...     # <other parameters>
...     queue_data_per_primitive={'green': functools.partial(primitive.queue_data, channels=1)},
...     convert_footprint_per_primitive={'green': lambda fp: fp.dilate(10)},
... )
Pools
The *_pool parameters can be used to select where certain computations occur. These parameters can be of the following types:
- A multiprocessing.pool.ThreadPool; this should be the default choice.
- A multiprocessing.pool.Pool, a process pool. Useful for computations that require the GIL or that leak memory.
- None, to request the scheduler thread to perform the tasks itself. Should be used when the computation is very light.
- A hashable (like a string), that will map to a pool registered in the Dataset. If that key is missing from the Dataset, a ThreadPool with multiprocessing.cpu_count() workers will be automatically instantiated. When the Dataset is closed, the pools instantiated that way will be joined.
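A pool object can be built with the standard library and handed to the recipe directly; the sketch below shows a plain ThreadPool, with the recipe call itself left as a hypothetical comment since it needs a live Dataset.

```python
from multiprocessing.pool import ThreadPool

# A ThreadPool is the usual default choice for the *_pool parameters;
# it behaves like any multiprocessing pool.
pool = ThreadPool(2)
squares = pool.map(lambda x: x * x, [1, 2, 3])

# Hypothetical recipe call (ds and other parameters not defined here):
# ds.create_raster_recipe(..., computation_pool=pool, merge_pool='cpu',
#                         resample_pool=None)

pool.close()
pool.join()
```

Note that lambdas are fine with a ThreadPool but not with a process Pool, whose tasks must be picklable.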
See Also
Dataset.acreate_raster_recipe(): To skip the key assignment
Dataset.create_cached_raster_recipe(): For results caching
Dataset.acreate_cached_raster_recipe(): To skip the key assignment
Dataset.create_cached_raster_recipe(key, fp, dtype, channel_count, channels_schema=None, sr=None, compute_array=None, merge_arrays=<function concat_arrays>, cache_dir=None, ow=False, queue_data_per_primitive=mappingproxy({}), convert_footprint_per_primitive=None, computation_pool='cpu', merge_pool='cpu', io_pool='io', resample_pool='cpu', cache_tiles=(512, 512), computation_tiles=None, max_resampling_size=None, debug_observers=())[source]

Create a cached raster recipe and register it under key within this Dataset.
Compared to a NocacheRasterRecipe, in a CachedRasterRecipe the pixels are never computed twice. Cache files are used to store and reuse pixels from computations. The cache can even be reused between Python sessions.
If you are familiar with create_raster_recipe, four parameters are new here: io_pool, cache_tiles, cache_dir and ow. They are all related to file system operations.
See the create_raster_recipe method, since this method shares most of its features:
>>> help(CachedRasterRecipe)
Parameters
- key:
see Dataset.create_raster() method
- fp:
see Dataset.create_raster() method
- dtype:
see Dataset.create_raster() method
- channel_count:
see Dataset.create_raster() method
- channels_schema:
see Dataset.create_raster() method
- sr:
see Dataset.create_raster() method
- compute_array:
see Dataset.create_raster_recipe() method
- merge_arrays:
see Dataset.create_raster_recipe() method
- cache_dir: str or pathlib.Path
Path to the directory that holds the cache files associated with this raster. If cache files are present, they will be reused (or erased if corrupted). If a cache file is needed and missing, it will be computed.
- ow: bool
Overwrite. Whether or not to erase the old cache files contained in cache_dir.
Warning
Not only the tiles needed (hence computed) but all buzzard cache files in cache_dir will be deleted.
- queue_data_per_primitive:
see Dataset.create_raster_recipe() method
- convert_footprint_per_primitive:
see Dataset.create_raster_recipe() method
- computation_pool:
see Dataset.create_raster_recipe() method
- merge_pool:
see Dataset.create_raster_recipe() method
- io_pool:
see Dataset.create_raster_recipe() method
- resample_pool:
see Dataset.create_raster_recipe() method
- cache_tiles: (int, int) or numpy.ndarray of Footprint
A tiling of the fp parameter. Each tile will correspond to one cache file. If (int, int): construct the tiling by calling Footprint.tile with this parameter.
- computation_tiles:
If None: use the same tiling as cache_tiles. Else: see the create_raster_recipe method.
- max_resampling_size: None or int or (int, int)
see Dataset.create_raster_recipe() method
- debug_observers: sequence of object
see Dataset.create_raster_recipe() method
Returns
- source: CachedRasterRecipe
See Also
Dataset.create_raster_recipe(): To skip the caching
Dataset.acreate_cached_raster_recipe(): To skip the key assignment
Dataset.acreate_cached_raster_recipe(fp, dtype, channel_count, channels_schema=None, sr=None, compute_array=None, merge_arrays=<function concat_arrays>, cache_dir=None, ow=False, queue_data_per_primitive=mappingproxy({}), convert_footprint_per_primitive=None, computation_pool='cpu', merge_pool='cpu', io_pool='io', resample_pool='cpu', cache_tiles=(512, 512), computation_tiles=None, max_resampling_size=None, debug_observers=())[source]

Create a cached raster recipe anonymously within this Dataset.
See Dataset.create_cached_raster_recipe
See Also
Dataset.create_raster_recipe(): To skip the caching
Dataset.create_cached_raster_recipe(): To assign a key to this source within the Dataset