Producing GeoPackages with massivegeopackage: Difference between revisions
(→) |
(→) |
||
(37 intermediate revisions by the same user not shown) | |||
Line 2: | Line 2: | ||
'''Massivegeopackage''' is a Python package for building a GeoPackage from very large raster or elevation datasets. | '''Massivegeopackage''' is a Python package for building a GeoPackage from very large raster or elevation datasets. | ||
The core GDAL tools | The core GDAL tools '''''gdalwarp''''', '''''gdal_translate''''' and '''''gdaladdo''''' seem to struggle with mosaicing very large raster datasets. When the size of the source dataset is larger than 15-20 GB, processing time seems to increase drastically. | ||
This Python package will convert each source file to individual GeoPackages | This Python package will convert each source file to individual GeoPackages with a common tile matrix. This is done in a user configurable number of parallel processes. Completed files will immediately be queued to be merged into a base GeoPackage. The merging is very efficient because it consists of data copy - no geoprocessing. | ||
== Dependencies == | == Dependencies == | ||
Line 14: | Line 14: | ||
== Installation == | == Installation == | ||
The package is distributed as a Python wheel-file. Install | The package is distributed as a Python wheel-file. Install using pip: | ||
<pre>python -m pip install c:\path\massivegeopackage- | <pre>python -m pip install c:\path\massivegeopackage-3.0.0-py3-none-any.whl</pre> | ||
== Usage == | == Usage == | ||
The | From version 1.4.0, input arguments can be supplied on the command line. Run the module with <code>--help</code> to list all arguments. The first argument should be <code>raster</code> or <code>elevation</code> depending on your source data. | ||
Examples: | |||
<pre>python -m massivegeopackage | <pre> | ||
python -m massivegeopackage --help | |||
python -m massivegeopackage raster --srcfolder c:\data\geotiff --targetfolder c:\data\gpkg_output | |||
python -m massivegeopackage elevation --srcfolder c:\data\50_dtm --areapath c:\norway\counties --targetfolder c:\data\gpkg_output --targetdatatype 32 | |||
</pre> | |||
For version information, use this command: | |||
<pre>python -m pip show massivegeopackage</pre> | |||
=== | === Common arguments === | ||
The following arguments are common for both raster and elevation modes. | |||
{| class="wikitable" | |||
! style="width: 200px" | Argument | |||
!Description | |||
|- | |||
|<code>{raster,elevation}</code> | |||
|Mode for output data. Use <code>raster</code> for 8 bit output. Use <code>elevation</code> for 16 or 32 bit output. | |||
|- | |||
| <code>-s, --srcfolder</code>|| Full path to a folder containing source raster files. | |||
|- | |||
| <code>-a, --areapath</code>|| Full path to a vector dataset. Each layer in the dataset should contain a single polygon feature. One clipped GeoPackge will be created for each layer, unless a single layer is specified with <code>--arealayer</code> | |||
|- | |||
| <code>-al --arealayer</code> | |||
|Name of a single layer from the vector area dataset. | |||
|- | |||
| <code>-s_epsg, --source_epsg</code> | |||
|EPSG code of source if this information is not embedded in the input dataset. | |||
|- | |||
| <code>-t, --targetfolder</code>|| Empty folder for temporary files, logs and the completed GeoPackage. | |||
|- | |||
| <code>-ft, --input_file_type</code>|| File extension on input files. Default is <code>tif</code> | |||
|- | |||
| <code>-resamp, --resampling</code> | |||
|Resampling method to use when creating overviews | |||
|- | |||
|<code>-co, --creation_options</code> | |||
|Comma-separated list of creation options for the GDAL GeoPackage driver. Example: <code>tile_format=jpeg,quality=50</code> | |||
|- | |||
| <code>-p, --num_processes</code>|| Number of parallel processes to use. Most common storage devices will become a bottleneck with 20 or more parallel processes. | |||
|- | |||
| <code>-r, --recursive</code>|| Search for source files in subfolders | |||
|- | |||
| <code>--name</code> | |||
|Name of the output GeoPackage file. | |||
|- | |||
| <code>--safe</code> | |||
|Use SQLite WAL mode when merging files. When processing data on file systems which are prone to interruptions (e.g. network file systems), the script will roll back and retry a failed merge. This will increase the total processing time. | |||
|- | |||
| <code>--debug</code>|| Log debug messages in addition to info | |||
|- | |||
| <code>--nocleanup</code>|| Do not clean up temporary GeoPackage files | |||
|} | |||
=== Raster mode === | |||
Input files should be homogenous (same projection, dimensions, bands, pixel size). | Input files should be homogenous (same projection, dimensions, bands, pixel size). | ||
Line 50: | Line 96: | ||
|} | |} | ||
Raster specific arguments: | |||
{| class="wikitable" | {| class="wikitable" | ||
! | ! style="width: 200px" | Argument | ||
!Description | !Description | ||
|- | |- | ||
| | | <code>-nd, --nodata</code> | ||
|Pixel value to make transparent in the completed GeoPackage. Example: <code>0 0 0</code> | |||
|- | |- | ||
| | | <code>-scl, --scalefactor</code> | ||
|Factor to use when it is neccessary to scale down to 8 bit. The range of values to use from each source band is calculated as (mean ± stddev × scalefactor). | |||
|- | |- | ||
| | |<code>--defrag</code> | ||
|Increase the page size of the base GeoPackage to avoid fragmentation issues when the file grows very large (several hundred gigabytes). When processing is done, the page size will be changed back to default and VACUUM is run to apply the change and to compact the database. | |||
|} | |} | ||
=== | === Elevation mode === | ||
Input files should be homogenous (same projection, dimensions, bands, pixel size). | Input files should be homogenous (same projection, dimensions, bands, pixel size). | ||
Supported input: | If the input files are Float32 and target is set to 16 bit, a scale and offset will be computed for each tile. These are applied to each pixel and then rounded to the nearest integer. This stretches the tile's value range to utilize the full range of a 16 bit unsigned integer (0-65534). The scale and offset is reversed when an application reads from the output file. This way, an effective precision of around 0.01 - 0.001 meters is achieved (less varied source data results in higher precision). | ||
Supported input and corresponding output: | |||
{| class="wikitable" | {| class="wikitable" | ||
!Format | !Format | ||
!Input configuration | !Input configuration | ||
! | !Output | ||
|- | |- | ||
| rowspan="3" | [https://gdal.org/drivers/raster/index.html Any raster format supported by GDAL] | | rowspan="3" | [https://gdal.org/drivers/raster/index.html Any raster format supported by GDAL] | ||
Line 87: | Line 130: | ||
|} | |} | ||
Elevation specific arguments: | |||
{| class="wikitable" | {| class="wikitable" | ||
! | ! style="width: 250px" | Arguments | ||
!Description | !Description | ||
|- | |- | ||
| | | <code>-dt,</code><code>--targetdatatype {16,32}</code>|| Datatype for output GeoPackage - 32 or 16 bits. If source is Float32 and target is set to 16 bit, the value range in each tile will be streched to utilize the entire space that a 16 bit unsigned integer provides (0-65535). A scale factor and offset will be computed, which will be applied when reading from the file output file. This way, an effective precision of around 0.01 - 0.001 meters is achieved (less varied source data results in higher precision). | ||
|} | |} | ||
== Known issues == | |||
=== Disk I/O error === | |||
Massivegeopackage works by building a large GeoPackage from smaller ones, which are deleted as soon as they are merged into the base file. As the base geopackage grows, it becomes increasingly fragmented on the hard drive. With a large enough file, fragmentation can reach a limit, causing "Disk I/O error" and corrupting the file. The file size limit where this happens this can vary, but chances increase when the file grows beyond 150-200 GB. | |||
To avoid this issue, parameter <code>--defrag</code> can be used when the output is expected to be very large. The parameter will increase the page size of the base GeoPackage, decreasing fragmentation but also increasing file size. When processing is done, the page size will be changed back to default and <code>VACUUM</code> is run to apply the change and to compact the database. This takes some extra processing time, but should be negligable when building a GeoPackage of several hundred gigabytes. The final <code>VACUUM</code> statement will also slightly decrease the size of the output GeoPackage compared to not using the <code>--defrag</code> parameter | |||
=== Network file systems === | |||
Using network folders as input and output is supported, but processing data on a (preferably fast) local disk is recommended. From the web page of the [https://sqlite.org/howtocorrupt.html SQLite project]: | |||
<blockquote> | |||
''SQLite depends on the underlying filesystem to do locking as the documentation says it will. But some filesystems contain bugs in their locking logic such that the locks do not always behave as advertised. This is especially true of network filesystems and NFS in particular. If SQLite is used on a filesystem where the locking primitives contain bugs, and if two or more threads or processes try to access the same database at the same time, then database corruption might result.'' | |||
</blockquote>Additionally, long-running processes will be vulnerable to small outages in the network connectivity. This can be mitigated by using the <code>--safe</code> argument. | |||
=== Script freezes === | |||
When running the script in a command window and execution appears to stop for no apparent reason, this can be caused by the command window "Quick Edit" feature. If text in the command window is selected with the left mouse button, the execution will pause. Right-clicking inside the command window will resume the script.<!-- massive --> |
Latest revision as of 12:27, 19 October 2023
Massivegeopackage is a Python package for building a GeoPackage from very large raster or elevation datasets.
The core GDAL tools gdalwarp, gdal_translate and gdaladdo seem to struggle with mosaicing very large raster datasets. When the size of the source dataset is larger than 15-20 GB, processing time seems to increase drastically.
This Python package will convert each source file to individual GeoPackages with a common tile matrix. This is done in a user configurable number of parallel processes. Completed files will immediately be queued to be merged into a base GeoPackage. The merging is very efficient because it consists of data copy - no geoprocessing.
Dependencies
- numpy
- Pillow
- GDAL >= 3.1.4
The easiest way to get a Python environment where these dependencies are met, is to install QGIS 3.16 or newer. Included is the OSGeo4W shell, where you can install and use the package.
Installation
The package is distributed as a Python wheel-file. Install using pip:
python -m pip install c:\path\massivegeopackage-3.0.0-py3-none-any.whl
Usage
From version 1.4.0, input arguments can be supplied on the command line. Run the module with --help
to list all arguments. The first argument should be raster
or elevation
depending on your source data.
Examples:
python -m massivegeopackage --help python -m massivegeopackage raster --srcfolder c:\data\geotiff --targetfolder c:\data\gpkg_output python -m massivegeopackage elevation --srcfolder c:\data\50_dtm --areapath c:\norway\counties --targetfolder c:\data\gpkg_output --targetdatatype 32
For version information, use this command:
python -m pip show massivegeopackage
Common arguments
The following arguments are common for both raster and elevation modes.
Argument | Description |
---|---|
{raster,elevation}
|
Mode for output data. Use raster for 8 bit output. Use elevation for 16 or 32 bit output.
|
-s, --srcfolder |
Full path to a folder containing source raster files. |
-a, --areapath |
Full path to a vector dataset. Each layer in the dataset should contain a single polygon feature. One clipped GeoPackge will be created for each layer, unless a single layer is specified with --arealayer
|
-al --arealayer
|
Name of a single layer from the vector area dataset. |
-s_epsg, --source_epsg
|
EPSG code of source if this information is not embedded in the input dataset. |
-t, --targetfolder |
Empty folder for temporary files, logs and the completed GeoPackage. |
-ft, --input_file_type |
File extension on input files. Default is tif
|
-resamp, --resampling
|
Resampling method to use when creating overviews |
-co, --creation_options
|
Comma-separated list of creation options for the GDAL GeoPackage driver. Example: tile_format=jpeg,quality=50
|
-p, --num_processes |
Number of parallel processes to use. Most common storage devices will become a bottleneck with 20 or more parallel processes. |
-r, --recursive |
Search for source files in subfolders |
--name
|
Name of the output GeoPackage file. |
--safe
|
Use SQLite WAL mode when merging files. When processing data on file systems which are prone to interruptions (e.g. network file systems), the script will roll back and retry a failed merge. This will increase the total processing time. |
--debug |
Log debug messages in addition to info |
--nocleanup |
Do not clean up temporary GeoPackage files |
Raster mode
Input files should be homogenous (same projection, dimensions, bands, pixel size).
Supported input:
Format | Band configuration |
---|---|
Any raster format supported by GDAL | 1 band color index (8 bit) |
1 band grayscale (8 bit) | |
3 band RGB (24 bit) | |
4 band RGBA (32 bit) |
Raster specific arguments:
Argument | Description |
---|---|
-nd, --nodata
|
Pixel value to make transparent in the completed GeoPackage. Example: 0 0 0
|
-scl, --scalefactor
|
Factor to use when it is neccessary to scale down to 8 bit. The range of values to use from each source band is calculated as (mean ± stddev × scalefactor). |
--defrag
|
Increase the page size of the base GeoPackage to avoid fragmentation issues when the file grows very large (several hundred gigabytes). When processing is done, the page size will be changed back to default and VACUUM is run to apply the change and to compact the database. |
Elevation mode
Input files should be homogenous (same projection, dimensions, bands, pixel size).
If the input files are Float32 and target is set to 16 bit, a scale and offset will be computed for each tile. These are applied to each pixel and then rounded to the nearest integer. This stretches the tile's value range to utilize the full range of a 16 bit unsigned integer (0-65534). The scale and offset is reversed when an application reads from the output file. This way, an effective precision of around 0.01 - 0.001 meters is achieved (less varied source data results in higher precision).
Supported input and corresponding output:
Format | Input configuration | Output |
---|---|---|
Any raster format supported by GDAL | 1 band (Int16) | 1 band (UInt16) |
1 band (UInt16) | 1 band (UInt16) | |
1 band (Float32) | 1 band (UInt16 or Float32) |
Elevation specific arguments:
Arguments | Description |
---|---|
-dt, --targetdatatype {16,32} |
Datatype for output GeoPackage - 32 or 16 bits. If source is Float32 and target is set to 16 bit, the value range in each tile will be streched to utilize the entire space that a 16 bit unsigned integer provides (0-65535). A scale factor and offset will be computed, which will be applied when reading from the file output file. This way, an effective precision of around 0.01 - 0.001 meters is achieved (less varied source data results in higher precision). |
Known issues
Disk I/O error
Massivegeopackage works by building a large GeoPackage from smaller ones, which are deleted as soon as they are merged into the base file. As the base geopackage grows, it becomes increasingly fragmented on the hard drive. With a large enough file, fragmentation can reach a limit, causing "Disk I/O error" and corrupting the file. The file size limit where this happens this can vary, but chances increase when the file grows beyond 150-200 GB.
To avoid this issue, parameter --defrag
can be used when the output is expected to be very large. The parameter will increase the page size of the base GeoPackage, decreasing fragmentation but also increasing file size. When processing is done, the page size will be changed back to default and VACUUM
is run to apply the change and to compact the database. This takes some extra processing time, but should be negligable when building a GeoPackage of several hundred gigabytes. The final VACUUM
statement will also slightly decrease the size of the output GeoPackage compared to not using the --defrag
parameter
Network file systems
Using network folders as input and output is supported, but processing data on a (preferably fast) local disk is recommended. From the web page of the SQLite project:
SQLite depends on the underlying filesystem to do locking as the documentation says it will. But some filesystems contain bugs in their locking logic such that the locks do not always behave as advertised. This is especially true of network filesystems and NFS in particular. If SQLite is used on a filesystem where the locking primitives contain bugs, and if two or more threads or processes try to access the same database at the same time, then database corruption might result.
Additionally, long-running processes will be vulnerable to small outages in the network connectivity. This can be mitigated by using the --safe
argument.
Script freezes
When running the script in a command window and execution appears to stop for no apparent reason, this can be caused by the command window "Quick Edit" feature. If text in the command window is selected with the left mouse button, the execution will pause. Right-clicking inside the command window will resume the script.