Producing GeoPackages with massivegeopackage

From Maria GDK Wiki
Jump to navigation Jump to search

Massivegeopackage is a Python package for building a GeoPackage from very large raster or elevation datasets.

The core GDAL tools gdalwarp, gdal_translate and gdaladdo seem to struggle with mosaicing very large raster datasets. When the size of the source dataset is larger than 15-20 GB, processing time seems to increase drastically.

This Python package will convert each source file to individual GeoPackages with a common tile matrix. This is done in a user configurable number of parallel processes. Completed files will immediately be queued to be merged into a base GeoPackage. The merging is very efficient because it consists of data copy - no geoprocessing.

Dependencies

  • numpy
  • Pillow
  • GDAL >= 3.1.4

The easiest way to get a Python environment where these dependencies are met, is to install QGIS 3.16 or newer. Included is the OSGeo4W shell, where you can install and use the package.

Installation

The package is distributed as a Python wheel-file. Install using pip:

python -m pip install c:\path\massivegeopackage-3.0.0-py3-none-any.whl

Usage

From version 1.4.0, input arguments can be supplied on the command line. Run the module with --help to list all arguments. The first argument should be raster or elevation depending on your source data.

Examples:

python -m massivegeopackage --help

python -m massivegeopackage raster --srcfolder c:\data\geotiff --targetfolder c:\data\gpkg_output

python -m massivegeopackage elevation --srcfolder c:\data\50_dtm --areapath c:\norway\counties --targetfolder c:\data\gpkg_output --targetdatatype 32

For version information, use this command:

python -m pip show massivegeopackage

Common arguments

The following arguments are common for both raster and elevation modes.

Argument Description
{raster,elevation} Mode for output data. Use raster for 8 bit output. Use elevation for 16 or 32 bit output.
-s, --srcfolder Full path to a folder containing source raster files.
-a, --areapath Full path to a vector dataset. Each layer in the dataset should contain a single polygon feature. One clipped GeoPackge will be created for each layer, unless a single layer is specified with --arealayer
-al --arealayer Name of a single layer from the vector area dataset.
-s_epsg, --source_epsg EPSG code of source if this information is not embedded in the input dataset.
-t, --targetfolder Empty folder for temporary files, logs and the completed GeoPackage.
-ft, --input_file_type File extension on input files. Default is tif
-resamp, --resampling Resampling method to use when creating overviews
-co, --creation_options Comma-separated list of creation options for the GDAL GeoPackage driver. Example: tile_format=jpeg,quality=50
-p, --num_processes Number of parallel processes to use. Most common storage devices will become a bottleneck with 20 or more parallel processes.
-r, --recursive Search for source files in subfolders
--name Name of the output GeoPackage file.
--safe Use SQLite WAL mode when merging files. When processing data on file systems which are prone to interruptions (e.g. network file systems), the script will roll back and retry a failed merge. This will increase the total processing time.
--debug Log debug messages in addition to info
--nocleanup Do not clean up temporary GeoPackage files

Raster mode

Input files should be homogenous (same projection, dimensions, bands, pixel size).

Supported input:

Format Band configuration
Any raster format supported by GDAL 1 band color index (8 bit)
1 band grayscale (8 bit)
3 band RGB (24 bit)
4 band RGBA (32 bit)

Raster specific arguments:

Argument Description
-nd, --nodata Pixel value to make transparent in the completed GeoPackage. Example: 0 0 0
-scl, --scalefactor Factor to use when it is neccessary to scale down to 8 bit. The range of values to use from each source band is calculated as (mean ± stddev × scalefactor).
--defrag Increase the page size of the base GeoPackage to avoid fragmentation issues when the file grows very large (several hundred gigabytes). When processing is done, the page size will be changed back to default and VACUUM is run to apply the change and to compact the database.

Elevation mode

Input files should be homogenous (same projection, dimensions, bands, pixel size).

If the input files are Float32 and target is set to 16 bit, a scale and offset will be computed for each tile. These are applied to each pixel and then rounded to the nearest integer. This stretches the tile's value range to utilize the full range of a 16 bit unsigned integer (0-65534). The scale and offset is reversed when an application reads from the output file. This way, an effective precision of around 0.01 - 0.001 meters is achieved (less varied source data results in higher precision).

Supported input and corresponding output:

Format Input configuration Output
Any raster format supported by GDAL 1 band (Int16) 1 band (UInt16)
1 band (UInt16) 1 band (UInt16)
1 band (Float32) 1 band (UInt16 or Float32)

Elevation specific arguments:

Arguments Description
-dt,--targetdatatype {16,32} Datatype for output GeoPackage - 32 or 16 bits. If source is Float32 and target is set to 16 bit, the value range in each tile will be streched to utilize the entire space that a 16 bit unsigned integer provides (0-65535). A scale factor and offset will be computed, which will be applied when reading from the file output file. This way, an effective precision of around 0.01 - 0.001 meters is achieved (less varied source data results in higher precision).

Known issues

Disk I/O error

Massivegeopackage works by building a large GeoPackage from smaller ones, which are deleted as soon as they are merged into the base file. As the base geopackage grows, it becomes increasingly fragmented on the hard drive. With a large enough file, fragmentation can reach a limit, causing "Disk I/O error" and corrupting the file. The file size limit where this happens this can vary, but chances increase when the file grows beyond 150-200 GB.

To avoid this issue, parameter --defrag can be used when the output is expected to be very large. The parameter will increase the page size of the base GeoPackage, decreasing fragmentation but also increasing file size. When processing is done, the page size will be changed back to default and VACUUM is run to apply the change and to compact the database. This takes some extra processing time, but should be negligable when building a GeoPackage of several hundred gigabytes. The final VACUUM statement will also slightly decrease the size of the output GeoPackage compared to not using the --defrag parameter

Network file systems

Using network folders as input and output is supported, but processing data on a (preferably fast) local disk is recommended. From the web page of the SQLite project:

SQLite depends on the underlying filesystem to do locking as the documentation says it will. But some filesystems contain bugs in their locking logic such that the locks do not always behave as advertised. This is especially true of network filesystems and NFS in particular. If SQLite is used on a filesystem where the locking primitives contain bugs, and if two or more threads or processes try to access the same database at the same time, then database corruption might result.

Additionally, long-running processes will be vulnerable to small outages in the network connectivity. This can be mitigated by using the --safe argument.

Script freezes

When running the script in a command window and execution appears to stop for no apparent reason, this can be caused by the command window "Quick Edit" feature. If text in the command window is selected with the left mouse button, the execution will pause. Right-clicking inside the command window will resume the script.