Raster Sources Using Recipes
Dataset.create_raster_recipe(key, fp, dtype, channel_count, channels_schema=None, sr=None, compute_array=None, merge_arrays=<function concat_arrays>, queue_data_per_primitive=mappingproxy({}), convert_footprint_per_primitive=None, computation_pool='cpu', merge_pool='cpu', resample_pool='cpu', computation_tiles=None, max_computation_size=None, max_resampling_size=None, automatic_remapping=True, debug_observers=())[source]

Warning
This method is not yet implemented. It exists for documentation purposes.
Create a raster recipe and register it under key within this Dataset.
A raster recipe implements the same interfaces as all other rasters, but internally it computes data on the fly by calling a callback. The main goal of raster recipes is to provide a boilerplate-free interface that automates these cumbersome tasks:
- tiling,
- parallelism,
- caching,
- file reads,
- resampling,
- lazy evaluation,
- backpressure prevention, and
- optimised task scheduling.
If you are familiar with create_cached_raster_recipe, two parameters are new here: automatic_remapping and max_computation_size.
Parameters
- key:
see Dataset.create_raster() method
- fp:
see Dataset.create_raster() method
- dtype:
see Dataset.create_raster() method
- channel_count:
see Dataset.create_raster() method
- channels_schema:
see Dataset.create_raster() method
- sr:
see Dataset.create_raster() method
- compute_array: callable
see Computation Function below
- merge_arrays: callable
see Merge Function below
- queue_data_per_primitive: dict of hashable (like a string) to a queue_data method pointer
see Primitives below
- convert_footprint_per_primitive: None or dict of hashable (like a string) to a callable
see Primitives below
- computation_pool:
see Pools below
- merge_pool:
see Pools below
- resample_pool:
see Pools below
- computation_tiles: None or (int, int) or numpy.ndarray of Footprint
see Computation Tiling below
- max_computation_size: None or int or (int, int)
see Computation Tiling below
- max_resampling_size: None or int or (int, int)
Optionally define a maximum resampling size. If a larger resampling has to be performed, it will be performed tile by tile in parallel.
- automatic_remapping: bool
see Automatic Remapping below
- debug_observers: sequence of object
Entry points that observe what is happening with this raster in the Dataset’s scheduler.
Returns
- source: NocacheRasterRecipe
Computation Function
The function that will map a Footprint to a numpy.ndarray. If queue_data_per_primitive is not empty, it will map a Footprint and primitive arrays to a numpy.ndarray.
It will be called in parallel according to the computation_pool parameter provided at construction.
The function will be called with the following positional parameters:
- fp: Footprint of shape (Y, X)
The location at which the pixels should be computed
- primitive_fps: dict of hashable to Footprint
For each primitive defined through the queue_data_per_primitive parameter, the input Footprint.
- primitive_arrays: dict of hashable to numpy.ndarray
For each primitive defined through the queue_data_per_primitive parameter, the input numpy.ndarray that was automatically computed.
- raster: CachedRasterRecipe or None
The Raster object of the ongoing computation.
It should return either:
- a single ndarray of shape (Y, X) if only one channel was computed
- a single ndarray of shape (Y, X, C) if one or more channels were computed
If computation_pool points to a process pool, the compute_array function must be picklable and the raster parameter will be None.
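As a concrete illustration, here is a minimal sketch of such a callback with no primitives. The `_Fp` class below is only a stand-in for a real buzzard Footprint (just its `shape` attribute is used); in a real recipe the scheduler passes an actual Footprint.

```python
import numpy as np

class _Fp:
    """Stand-in for a buzzard Footprint in this sketch: only the
    (Y, X) `shape` attribute is used."""
    def __init__(self, shape):
        self.shape = shape

def compute_array(fp, primitive_fps, primitive_arrays, raster):
    """Sketch of a compute_array callback: fill the requested
    Footprint with a simple i + j gradient, one channel."""
    y, x = fp.shape
    # Return shape (Y, X) because a single channel is computed
    return np.fromfunction(lambda i, j: i + j, (y, x), dtype=int).astype('float32')

arr = compute_array(_Fp((4, 5)), {}, {}, None)
```

The empty dicts stand for the `primitive_fps` and `primitive_arrays` arguments, which would be populated by the scheduler when queue_data_per_primitive is not empty.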
Computation Tiling
You may sometimes want control over the Footprints that are passed to the compute_array function, for example:
- If the pixels computed by compute_array are slow to compute, you want to tile to increase parallelism.
- If the compute_array function scales badly in terms of memory or time, you want to tile to reduce complexity.
- If compute_array can only work on certain Footprints, you want a hard constraint on the set of Footprints that can be queried from compute_array. (This may happen with convolutional neural networks.)
To do so, use either the computation_tiles or the max_computation_size parameter (not both).
If max_computation_size is provided, each Footprint to be computed will be tiled according to this parameter.
If computation_tiles is a numpy.ndarray of Footprint, it should be a tiling of the fp parameter. Only the Footprints contained in this tiling will be passed to the compute_array function. If computation_tiles is (int, int), a tiling will be constructed by calling Footprint.tile with those two ints.
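To make the effect of max_computation_size concrete, the following sketch (an illustration, not buzzard's actual implementation) splits a requested window into tiles no larger than a given size:

```python
def tile_request(height, width, max_size):
    """Split a `height` x `width` pixel request into (row, col, h, w)
    tiles no larger than `max_size` in either dimension, the way a
    max_computation_size constraint would."""
    tiles = []
    for r in range(0, height, max_size):
        for c in range(0, width, max_size):
            tiles.append((r, c, min(max_size, height - r), min(max_size, width - c)))
    return tiles

# A 5 x 7 request with max_size 3 yields 6 tiles covering all 35 pixels
tiles = tile_request(5, 7, 3)
```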
Merge Function
The function that will map several pairs of Footprint/numpy.ndarray to a single numpy.ndarray. If computation_tiles is None, it will never be called.
It will be called in parallel according to the merge_pool parameter provided at construction.
The function will be called with the following positional parameters:
- fp: Footprint of shape (Y, X)
The location at which the pixels should be computed.
- array_per_fp: dict of Footprint to numpy.ndarray
The Footprint/numpy.ndarray pairs of the arrays that were computed by compute_array and that overlap with fp.
- raster: CachedRasterRecipe or None
The Raster object of the ongoing computation.
It should return either:
- a single ndarray of shape (Y, X) if only one channel was computed
- a single ndarray of shape (Y, X, C) if one or more channels were computed
If merge_pool points to a process pool, the merge_arrays function must be picklable and the raster parameter will be None.
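A minimal sketch of such a merge callback is shown below. The `_Tile` class and its `origin_in` helper are hypothetical stand-ins for buzzard Footprints; a real implementation would derive each tile's pixel offset from the actual Footprint geometry.

```python
import numpy as np

class _Tile:
    """Stand-in for a buzzard Footprint; `origin_in` is a hypothetical
    helper returning this tile's (row, col) pixel offset inside the
    merged footprint."""
    def __init__(self, origin, shape):
        self._origin = origin
        self.shape = shape
    def origin_in(self, fp):
        return self._origin

def merge_arrays(fp, array_per_fp, raster):
    """Sketch of a merge callback: paste each computed tile into a
    single output array covering fp."""
    out = np.zeros(fp.shape, dtype='float32')
    for tile_fp, arr in array_per_fp.items():
        r, c = tile_fp.origin_in(fp)
        out[r:r + arr.shape[0], c:c + arr.shape[1]] = arr
    return out

# Merge a 2x4 footprint from two adjacent 2x2 tiles
dst = _Tile((0, 0), (2, 4))
left, right = _Tile((0, 0), (2, 2)), _Tile((0, 2), (2, 2))
merged = merge_arrays(dst, {left: np.ones((2, 2)), right: np.full((2, 2), 2.0)}, None)
```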
Automatic Remapping
When creating a recipe you give a Footprint through the fp parameter. When calling your compute_array function, the scheduler will only ask for slices of fp. This means that the scheduler takes care of those boilerplate steps:
- If you request a Footprint on a different grid in a get_data() call, the scheduler takes care of resampling the outputs of your compute_array function.
- If you request a Footprint partially or fully outside of the raster's extent, the scheduler will call your compute_array function to get the interior pixels and then pad the output with nodata.
This system is flexible and can be deactivated by passing automatic_remapping=False to the constructor of a NocacheRasterRecipe. In this case the scheduler will call your compute_array function for any kind of Footprint; your function must therefore be able to comply with any request.
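The padding step can be pictured with plain numpy. In this illustration (not buzzard code), a request extends 2 pixels past the raster's right edge: the scheduler computes the 3x4 interior and pads the rest of the requested window with the nodata value.

```python
import numpy as np

# Interior pixels actually computed by compute_array (3 rows, 4 cols)
interior = np.arange(12, dtype='float32').reshape(3, 4)
nodata = -1.0

# Pad 2 extra columns on the right with nodata to cover the full request
padded = np.pad(interior, ((0, 0), (0, 2)), constant_values=nodata)
```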
Primitives
The queue_data_per_primitive and convert_footprint_per_primitive parameters can be used to create dependencies between dependee async rasters and the raster recipe being created. The dependee/dependent relation is called primitive/derived throughout buzzard. A derived recipe can itself be the primitive of another raster. Pipelines of any depth and width can be instantiated that way.
In queue_data_per_primitive you declare a dependee by giving it a key of your choice and the pointer to the queue_data method of the dependee raster. You can parameterize the connection by currying the channels, dst_nodata, interpolation and max_queue_size parameters using functools.partial.
The convert_footprint_per_primitive dict should contain the same keys as queue_data_per_primitive. A value in the dict should be a function that maps a Footprint to another Footprint. It can be used for example to request larger rectangles of primitives data to compute a derived array.
e.g. if the primitive raster is an RGB image and the derived raster only needs the green channel, but with a context of 10 additional pixels on all 4 sides:
>>> derived = ds.create_raster_recipe(
...     # <other parameters>
...     queue_data_per_primitive={'green': functools.partial(primitive.queue_data, channels=1)},
...     convert_footprint_per_primitive={'green': lambda fp: fp.dilate(10)},
... )
Pools
The *_pool parameters can be used to select where certain computations occur. These parameters can be of the following types:
- A multiprocessing.pool.ThreadPool; this should be the default choice.
- A multiprocessing.pool.Pool, a process pool. Useful for computations that require the GIL or that leak memory.
- None, to request the scheduler thread to perform the tasks itself. Should be used when the computation is very light.
- A hashable (like a string), that will map to a pool registered in the Dataset. If that key is missing from the Dataset, a ThreadPool with multiprocessing.cpu_count() workers will be automatically instantiated. When the Dataset is closed, the pools instantiated that way will be joined.
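A pool object can be built with the standard library and handed to the recipe directly; the sketch below shows a plain ThreadPool, with the recipe call itself left as a hypothetical comment since it needs a live Dataset.

```python
from multiprocessing.pool import ThreadPool

# A ThreadPool is the usual default choice for the *_pool parameters;
# it behaves like any multiprocessing pool.
pool = ThreadPool(2)
squares = pool.map(lambda x: x * x, [1, 2, 3])

# Hypothetical recipe call (ds and other parameters not defined here):
# ds.create_raster_recipe(..., computation_pool=pool, merge_pool='cpu',
#                         resample_pool=None)

pool.close()
pool.join()
```

Note that lambdas are fine with a ThreadPool but not with a process Pool, whose tasks must be picklable.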
See Also
Dataset.acreate_raster_recipe(): To skip the key assignment
Dataset.create_cached_raster_recipe(): For results caching
Dataset.acreate_cached_raster_recipe(): To skip the key assignment
Dataset.create_cached_raster_recipe(key, fp, dtype, channel_count, channels_schema=None, sr=None, compute_array=None, merge_arrays=<function concat_arrays>, cache_dir=None, ow=False, queue_data_per_primitive=mappingproxy({}), convert_footprint_per_primitive=None, computation_pool='cpu', merge_pool='cpu', io_pool='io', resample_pool='cpu', cache_tiles=(512, 512), computation_tiles=None, max_resampling_size=None, debug_observers=())[source]

Create a cached raster recipe and register it under key within this Dataset.
Compared to a NocacheRasterRecipe, in a CachedRasterRecipe the pixels are never computed twice. Cache files are used to store and reuse pixels from computations. The cache can even be reused between Python sessions.
If you are familiar with create_raster_recipe, four parameters are new here: io_pool, cache_tiles, cache_dir and ow. They are all related to file system operations.
See the create_raster_recipe method, since this method shares most of its features:
>>> help(CachedRasterRecipe)
Parameters
- key:
see Dataset.create_raster() method
- fp:
see Dataset.create_raster() method
- dtype:
see Dataset.create_raster() method
- channel_count:
see Dataset.create_raster() method
- channels_schema:
see Dataset.create_raster() method
- sr:
see Dataset.create_raster() method
- compute_array:
see Dataset.create_raster_recipe() method
- merge_arrays:
see Dataset.create_raster_recipe() method
- cache_dir: str or pathlib.Path
Path to the directory that holds the cache files associated with this raster. If cache files are present, they will be reused (or erased if corrupted). If a cache file is needed and missing, it will be computed.
- ow: bool
Overwrite. Whether or not to erase the old cache files contained in cache_dir.
Warning
Not only the tiles needed (hence computed) but all buzzard cache files in cache_dir will be deleted.
- queue_data_per_primitive:
see Dataset.create_raster_recipe() method
- convert_footprint_per_primitive:
see Dataset.create_raster_recipe() method
- computation_pool:
see Dataset.create_raster_recipe() method
- merge_pool:
see Dataset.create_raster_recipe() method
- io_pool:
see Dataset.create_raster_recipe() method
- resample_pool:
see Dataset.create_raster_recipe() method
- cache_tiles: (int, int) or numpy.ndarray of Footprint
A tiling of the fp parameter. Each tile will correspond to one cache file. If (int, int): construct the tiling by calling Footprint.tile with this parameter.
- computation_tiles:
If None: use the same tiling as cache_tiles. Else: see the create_raster_recipe method.
- max_resampling_size: None or int or (int, int)
see Dataset.create_raster_recipe() method
- debug_observers: sequence of object
see Dataset.create_raster_recipe() method
Returns
- source: CachedRasterRecipe
See Also
Dataset.create_raster_recipe(): To skip the caching
Dataset.acreate_cached_raster_recipe(): To skip the key assignment
Dataset.acreate_cached_raster_recipe(fp, dtype, channel_count, channels_schema=None, sr=None, compute_array=None, merge_arrays=<function concat_arrays>, cache_dir=None, ow=False, queue_data_per_primitive=mappingproxy({}), convert_footprint_per_primitive=None, computation_pool='cpu', merge_pool='cpu', io_pool='io', resample_pool='cpu', cache_tiles=(512, 512), computation_tiles=None, max_resampling_size=None, debug_observers=())[source]

Create a cached raster recipe anonymously within this Dataset.
See Dataset.create_cached_raster_recipe
See Also
Dataset.create_raster_recipe(): To skip the caching
Dataset.create_cached_raster_recipe(): To assign a key to this source within the Dataset