# Copyright Iris contributors
#
# This file is part of Iris and is released under the BSD license.
# See LICENSE in the root of the repository for full licensing details.
"""Module providing access to netcdf datasets with automatic character encoding.

The requirement is to convert numpy fixed-width unicode arrays, on writing, to a
variable which is declared as a byte (character) array with a fixed-length string
dimension.

Numpy unicode string arrays are ones with dtypes of the form "U<character-width>".
Numpy character variables have the dtype "S1", and map to a fixed-length "string
dimension".

In principle, netCDF4 already performs these translations, but in practice current
releases are not functional for anything other than "ascii" encoding -- including
UTF-8, which is the most obvious and desirable "general" solution.

There is also the question of whether we should implement UTF-8 as our default.
Current discussions on this are inconclusive, and neither the CF conventions nor the
NetCDF User Guide are definite on what the possible values of "_Encoding" are, or
what the effective default is, even though both mention the "_Encoding" attribute as
a potential way to handle the issue.

Because of this, we interpret as follows:

* in the absence of an "_Encoding" attribute, we will attempt to decode bytes as
  UTF-8
* when writing string data, in the absence of an "_Encoding" attribute (on the Iris
  cube or coord object), we will attempt to encode data with "ascii": if this
  succeeds, we will save as-is (with no "_Encoding" attribute), but if it fails we
  will encode as UTF-8 **and** add an "_Encoding='UTF-8'" attribute.

Where an "_Encoding" attribute is provided to Iris, we will honour it where possible,
identifying it with "codecs.lookup": this means we support the encodings in the
Python Standard Library, and the name aliases which it recognises.

See:

* known problems: https://github.com/Unidata/netcdf4-python/issues/1440
* suggestions for how this "ought" to work, discussed in the netcdf-c library:
  * https://github.com/Unidata/netcdf-c/issues/402

"""
from iris.fileformats.netcdf._thread_safe_nc import DatasetWrapper


class EncodedDataset(DatasetWrapper):
    """A dataset wrapper that translates variable data according to byte encodings."""

    pass
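The encode/decode rules described in the docstring can be sketched roughly as
follows. This is an illustration only, not the module's actual implementation: the
helper names `encode_strings_for_save` and `decode_strings_on_load` are invented
here, and the real `EncodedDataset` would apply equivalent translations inside its
variable read/write operations.

```python
import codecs


def encode_strings_for_save(strings, encoding=None):
    """
    Encode Python strings as bytes for writing to a netcdf character variable.

    Returns (encoded_bytes, encoding_attribute), where encoding_attribute is the
    value to record in an "_Encoding" attribute, or None when no attribute is
    needed.  NOTE: a hypothetical helper, for illustration only.
    """
    if encoding is not None:
        # Honour a user-supplied "_Encoding", normalised via codecs.lookup so that
        # standard-library name aliases (e.g. "utf8") are accepted.
        codec = codecs.lookup(encoding)
        return [s.encode(codec.name) for s in strings], codec.name
    try:
        # No encoding specified: attempt plain ascii first ...
        return [s.encode("ascii") for s in strings], None
    except UnicodeEncodeError:
        # ... falling back to UTF-8, and recording it in an "_Encoding" attribute.
        return [s.encode("utf-8") for s in strings], "utf-8"


def decode_strings_on_load(byte_strings, encoding=None):
    """Decode byte strings read from a character variable back to unicode."""
    # In the absence of an "_Encoding" attribute, attempt to decode as UTF-8.
    codec = codecs.lookup(encoding or "utf-8")
    return [b.decode(codec.name) for b in byte_strings]
```

Note that routing the supplied encoding through `codecs.lookup` both validates it
early (an unknown name raises `LookupError`) and canonicalises aliases to a single
recorded name.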