1 The Problem¶
Since about 2013, a handful of poor/expedient design decisions in lsst.afw
have been slowly compounding.
A full history of the situation can be found in a Community post.
To summarize,
- We use the
lsst.afw.image.Exposure
class throughout the code base to aggregate instances of all of the classes that characterize an image - PSFs, WCSs, aperture corrections, photometric calibrations, etc. - Because
Exposure
holds those components via direct C++ composition, they need to live either inlsst.afw.image
or in a package it depends on. - Because these components are written to and read from disk via the (C++)
lsst.afw.table.io
persistence framework, they are doubly-required to be implemented in C++, and they also need to live in eitherlsst.afw.table
or a package that depends on it.
The net result is that a large fraction of our classes must:
- live in
afw
itself, leading to churn and bloat at the bottom of our dependency tree, and hence frequent, slow, rebuilds; - be implemented in C++, a language much more complex and much less familiar to most of our developers than Python.
Furthermore, the circular dependency between the lsst.afw.image
and lsst.afw.table
subpackages makes the build and particular the Python import order for afw
fragile, leading to hard-to-debug and even harder-to-fix bugs whenever a new Exposure
-component class is added.
A related problem is that lsst.afw.table.io
was never intended to be a long-term or general persistence solution.[1]
It lacks any kind of built-in support for versioning, produces opaque and sometimes inefficient files, involves a lot of boilerplate to use, and is built on top of a bespoke C++ library (lsst.afw.table
itself) that presents a maintenance challenge and has I/O performance problems.
None of these limitations represents a particularly acute problem (we have managed to work around all of them for a while now), but as a persistence framework more or less specifies the on-disk format, significant (i.e., backwards incompatible) changes will become progressively harder to make as we approach operations.
[1] | lsst.afw.table.io was written to enable writing CoaddPsf objects into FITS files on an HSC data release deadline, with the expectation that it would later be replaced by the Data Access Team.
That “Data Access” has since been reinterpreted to mean only science user data access (i.e., not pipeline data access or data access middleware), and the hence the responsibility for persistence frameworks has fallen through the cracks. |
2 Overview¶
We’ve taken the following actions to mitigate the design problems imposed by Exposure
and to prepare for future work:
- Add a new heterogeneous-type mapping interface and implementation in C++,
GenericMap
. This is an alternative to rather than a replacement of our existingPropertySet
andPropertyList
to avoid their behavioral idiosyncrasies (especially with respect to vector-valued entries) and to provide natural interfaces in both C++ and Python (e.g., thecollections.abc.Mapping
API). We also recognize that attempting to change the behavior of the existing classes at this stage would be extremely disruptive. - Define an interface,
Storable
, for objects that can be held byGenericMap
. This includes basic functionality like stringification, comparisons, and cloning as well as persistence (currently still vialsst.afw.table.io
). - Rewrite
lsst.afw.image.ExposureInfo
(which holds the non-image components ofExposure
) to hold its components as a map (originally aGenericMap
, later replaced with a more specialized type), allowing anyStorable
to be attached to anExposure
. This allowsExposure
components to be defined in any package downstream ofafw
(but still need to be implemented in C++ and uselsst.afw.table.io
).
To go further, and allow Python-implemented Exposure
components, we need to replace lsst.afw.table.io
as the way Storable
s are persisted, preferably using a third-party persistence library as much as possible.
We finish with a brief review of some candidate libraries.
3 The Design of GenericMap¶
While the current implementation of ExposureInfo
no longer uses GenericMap
, developing this class was a significant part of our ExposureInfo
-related work.
In addition, we believe that LSST developers can learn from both the solutions we found and the mistakes we made in developing GenericMap
.
3.1 Rationale¶
We wished to implement ExposureInfo
using a map (string keys to object values) to decouple it from the details of what information is associated with an exposure.
Doing so allows ExposureInfo
to store objects of classes unknown to afw
, and allows ExposureInfo
to be extended by pipeline or third-party packages without changing its code.
This concept requires some standardization of keys, but no more so than, for example, FITS header keywords.
Our original plan involved a heterogeneous map (i.e., one that stores values of multiple formal types), in order to support appending completely arbitrary metadata to an Exposure
, possibly including primitive types.
The LSST stack previously had at least three heterogeneous map types in C++: lsst.daf.base.PropertySet
, lsst.daf.base.PropertyList
, and lsst.pex.policy.Policy
(deprecated).
However, all these types are specialized for particular roles (e.g., PropertyList
is designed to represent FITS headers), and mix heterogeneous mapping with other functions.
As a result, these classes are difficult to adapt to new use cases.
In addition, the lack of a common code base makes these classes difficult to maintain, and the limited type safety makes these classes easy to use incorrectly from both C++ and Python.
GenericMap
attempted to not only serve as a suitable back-end for ExposureInfo
, but also address the design problems that prevented the use of PropertySet
in the first place.
Currently, lsst.afw.typehandling.GenericMap
is a component of afw
because it depends, indirectly, on the lsst.table.io
framework.
However, lsst.afw.typehandling
has no other dependencies on afw
.
Once object persistence is decoupled from lsst.table.io
, the entire subpackage can be moved to a lower level in the Stack and GenericMap
can be treated as a fundamental LSST type.
3.2 Design Goals¶
We designed and implemented GenericMap
while striving to obey the following principles:
- provide both a C++ and a Python API for
GenericMap
, although it is unlikely that many Python users will useGenericMap
directly, rather through classes such asExposureInfo
. - support storage of primitive types, as well as LSST classes written in either C++ or Python.
- provide C++ and Python APIs that are as idiomatic as possible in their respective languages, making full use of pybind11’s ability to serve as an API adapter.
- do not provide any features beyond a heterogeneous mapping with simple set-get behavior.
The single responsibility makes
GenericMap
suitable as a back-end for a variety of applications, including more ornate mapping types likePropertySet
. - provide separate interface (
GenericMap
andMutableGenericMap
) and implementation classes. An abstract interface makes it easy to create mappings that require specific properties (asPropertySet
andPropertyList
do) without breaking code for other users, following the open-closed principle. Most code based onGenericMap
can be agnostic to the implementation class. - rely on compile-time type safety as much as possible, with safeguards preventing invalid element retrieval in C++ and spurious type errors in Python.
- allow element retrieval without knowing the exact type as which it was stored, so long as the conversion is valid (e.g., superclass vs. subclass, or different sizes of floating-point number). Supporting inexact types is not only user-friendly, it avoids unnecessary coupling between code changes at the points of storage and retrieval (which may be in different packages).
3.3 The GenericMap and MutableGenericMap APIs¶
The design of GenericMap
is based on some preliminary design work by J. Bosch and on a similar class that K. Findeisen wrote in Java.
The design ended up a hybrid of the two concepts, with a number of compromises to work in a mixture of C++, Python, and pybind11.
While GenericMap
was originally conceived as a class that could store values of any type, this proved incompatible with the need for a pybind11 wrapper and the desire for an idiomatic Python API.
In particular, unless __getitem__
takes type information as part of its input, it needs a finite set of types it can test in C++ (the current implementation does so implicitly through a pybind11 wrapper for boost::variant
).
The final GenericMap
supports values of built-in types as well as Storable
, an interface that can be added as a mixin to any LSST class.
Inspired by the Python distinction between Mapping
and MutableMapping
, we provide separate interfaces for reading (GenericMap
) and writing (MutableGenericMap
).
3.3.1 C++¶
The C++ API is loosely based on the standard library mapping interface (taken as the intersection of the C++14 APIs for std::map
and std::unordered_map
).
GenericMap
omits standard methods that would have paradoxical or surprising behavior when generalized to heterogeneous values, such as the new element created by operator[]
when a key does not match.
The first level of type-safety is provided by a lsst::afw::typehandling::Key
class template, which combines a nominal key (e.g., a string) with the required type of the value.
Key
objects are lightweight values, and can be passed around or created on the fly as easily as the underlying key type.
The original design for GenericMap
called for the map to expose public method templates (e.g., void set(Key<K, T> const &, T const &)
) that would provide compile-time type safety.
Since method templates cannot be overridden in C++, these public templates would be implemented in terms of protected methods (e.g., void _set(Key<K, Storable> const &, Storable const &)
), which subclasses could use to define how the key-value pairs are stored and managed.
Prototyping revealed that this design had several major issues:
- The process of delegating calls to a protected non-template method would strip away any information about which subclass of
Storable
was being stored. A careless implementation would make it legal to store aSkyWcs
object and then ask for it as aPsf
, or vice versa. - An implementation of
__getitem__
would need to explicitly enumerate and test all possible value types, particularly all subclasses ofStorable
. This would, at the very least, introduce elaborate pybind11 code that would need to be kept in sync with the class definition, but would not have an obvious failure mode if the two files diverged. - The protected API would require multiple methods per supported type, including integers, floating point numbers, strings, and
Storable
(andconst
/non-const
and value/smart pointer variants thereof). This would impose an enormous writing and maintenance burden on subclass authors.
We tried several solutions that would still let us use a heterogeneous map.
One of the simplest was to drop the use of separate interface and implementation classes, and with it the need to delegate to specialized protected methods.
However, we did not pursue this approach because it did not solve the problem of how to implement __getitem__
, and the eventual solution to that problem made an interface-oriented design acceptable again.
One option to implement __getitem__
without hard-coding a list of value types was to design Key
objects to allow retrieval by superclasses of the desired type.
However, we could not find a satisfactory implementation of this approach in C++.
While it is possible to use templates to express questions like “Does a Key<Storable>
match a Key<SkyWcs>
?”, storing mixed Key
types would require removing the compile-time information that enables such comparisons.
Adding run-time type information to Key
would solve the information loss problem, but C++’s support for RTTI is very limited, and the standard API only allows tests for exact type equality.
The solution we adopted is to no longer store type information – as a Key
object or in any other form – in a GenericMap
.
All queries internally pass the stored value through dynamic casting, which accesses RTTI in an implementation-dependent way that does account for subclasses.
This approach solves all the original design issues, at the cost of making GenericMap
no longer strictly type-safe (and much slower):
- Queries for an object of the wrong type are blocked at the casting step.
__getitem__
can retrieve anyStorable
by considering onlyKey<Storable>
andKey<shared_ptr<Storable>>
.- The protected API is greatly simplified because it no longer needs to accommodate a variety of
Key
classes.
In practice, the desire for type-unsafe storage was handled by making all protected methods work in terms of untyped keys and boost::variant
values.
Passing by variant
provides a convenient way to express, in code, which values are supported by GenericMap
without committing subclasses to any particular storage mechanism.
Unfortunately, boost::variant
handles implicitly convertible types very poorly (in particular, it gets confused when expected to convert numeric types, and cannot distinguish between const
and non-const
versions of the same type).
This internal detail has consequences for the public API: GenericMap
does not currently support const
values for any type except shared_ptr<Storable>
, and operations involving primitive types must match them exactly.
We hope to lift these restrictions once we can migrate to std::variant
in C++17.
After the nature of GenericMap
’s type handling, the largest remaining problem was how to handle shared pointers, which are used extensively by ExposureInfo
.
In keeping with C++ idioms, and in order to correctly handle polymorphism of Storable
, GenericMap
returns most values by reference.
However, because the implementation holds pointers as shared_ptr<Storable>
yet must return them as pointers of the correct type, its accessors create a new pointer of the desired type, which cannot be returned by reference.
We chose to return smart pointers alone by value, though the inconsistency with other value types makes it much harder to write type-agnostic code against GenericMap
.
The alternatives were to return everything by value, which would make it impossible to support non-smart-pointer storage of Storable
, or to always return shared pointers as shared_ptr<Storable>
, which would force users to perform unsafe casts in their code.
It is not practical to design an idiomatic C++ API for iterating over a GenericMap
.
Instead, we developed a system similar to the visitors used by boost::variant
and std::variant
, where the user represents the body of the loop by a callable object that accepts values of any supported type.
In practice the callables are usually private classes with templates or overloaded methods, but in rare cases a generic lambda can be used as well.
This approach involves considerable boilerplate, but is more natural to users than an API written in terms of variant
or some kind of iterator-like proxy.
3.3.2 Python¶
In Python, GenericMap
follows the Mapping
API almost exactly, aside from the need for homogeneous keys and the specific set of value types imposed by the C++ implementation.
Operations that would violate these constraints raise TypeError
.
As in C++, LSST-specific types can only be stored in a GenericMap
if they inherit from Storable
.
However, these types need not be implemented in C++; Storable
is designed to be subclassed by Python types as well (see 4 The Design of Storable).
In C++, GenericMap
is a class template parametrized by the key type.
Each specialization has its own pybind11 wrapper, but these wrappers are hidden by an lsst.utils.TemplateMeta
facade.
Attempts to construct a GenericMap
in Python will infer the key type from any input data, so most users need not specify a key type explicitly.
Since the compile-time type safety provided by lsst::afw::typehandling::Key
is irrelevant in Python, Key
does not have a pybind11 wrapper.
Instead, all GenericMap
methods take the underlying key type (e.g., a string instead of Key<string>
), and the pybind11 code expresses the operations in terms of Key
-based equivalents.
4 The Design of Storable¶
4.1 Rationale¶
As noted in 3 The Design of GenericMap, we were unable to develop a design for lsst.afw.typehandling.GenericMap
that could accept objects of any type.
We introduced the lsst.afw.typehandling.Storable
interface to let GenericMap
interact with LSST-specific types.
Any user-defined class must inherit from Storable
to be stored in a GenericMap
or ExposureInfo
.
To make it easier to work with Storable
objects in C++, the interface declares several optional methods that provide common operations like equality comparison.
These methods add some clutter to implementation classes that don’t define them, but make it possible to persist Storable
objects and perform generic object manipulation without the need for unsafe casting in user code.
4.2 Design Goals¶
We designed Storable
to meet the following goals:
- support subclasses written in either C++ or Python
- support
lsst.afw.table.io
persistence - support the smallest reasonable subset of generic operations, chosen to be equality comparison, hashing, copying, and string representation
- do not conflict with existing APIs of classes that may be retrofitted with
Storable
4.3 The Storable API¶
lsst.afw.typehandling.Storable
is a subclass of lsst::afw::table::io::Persistable
, though it does not require that persistence be implemented.
This relationship ensures that lsst.afw.image.ExposureInfo
can persist Storable
objects using the same mechanism as most of its original members.
lsst.afw.typehandling.Storable
provides methods equals
, hash_value
, cloneStorable
, and toString
to allow comparisons, hashing, copying, and printing in C++.
equals
defaults to object identity comparisons, while the others throw an exception by default.
The method names, including the underscore in hash_value
, were chosen to avoid collisions with existing APIs (e.g., a clone
method that returns a smart pointer to a more specific type than Storable
).
We preferred this approach over a more elaborate delegation system, such as that used in the AST library and many table::io
classes, because the latter approach requires that authors remember to write a new method for each subclass, no matter how remotely descended from the base.
Storable
can be inherited from by Python classes, which should override the appropriate Python methods (e.g., __repr__
instead of toString
).
The inheritance is handled using the pybind11 API for Python inheritance, including a “trampoline” helper class.
While the helper class has hooks for all of Storable
’s C++ methods, Storable
’s pybind11 wrapper does not include them to keep the Python API from being cluttered by default implementations.
In practice, C++ classes that implement these operations declare them in their own Python wrappers anyway.
5 The Design of ExposureInfo¶
5.1 Rationale¶
lsst.afw.image.ExposureInfo
is one of the most fundamental classes in the LSST Science Pipelines, and any breaking changes to it can have far-reaching effects.
We will almost certainly need to break ExposureInfo
when adopting a new persistence framework, as the lsst.afw.table.io
framework is built into both the API and the persisted form.
Therefore, we avoided introducing breaking changes in the conversion to a map-based implementation, to keep users from having to change their code or data twice.
5.2 Design Goals¶
We implemented our changes to ExposureInfo
based on the following goals:
- do not change the behavior of any existing method on
ExposureInfo
, particularly component retrieval methods likegetSkyWcs()
. - keep the new code compatible with the previous
Exposure
file format - keep the
Exposure
file format readable by old science pipelines code. In practice, this means that components stored in “archives” can be rearranged and new header keywords can be added, but no other changes are possible. - minimize the number of new API elements added to allow operations on unknown components
5.3 ExposureInfo Code Changes¶
lsst.afw.image.ExposureInfo
now contains a map that stores the ExposureInfo
components.
This map is not exposed to client code.
However, the C++ API for inserting and removing components does take a lsst::afw::typehandling::Key
, as there is no better way to make these methods type-safe.
In Python, as for lsst.afw.typehandling.GenericMap
, the key arguments are simple strings.
In the original conversion, ExposureInfo
stored its components in a MutableGenericMap<string>
.
The need for backwards-compatibility in ExposureInfo
, and the intersection of type requirements imposed by lsst.table.io
and GenericMap
, meant that all generic components had to be stored as shared pointer to const
Storable
(see 5.3.1 Constraints Imposed by the Use of GenericMap).
When it became clear that GenericMap
will be a brittle class until at least C++17, we replaced the MutableGenericMap
with a private class, StorableMap
, that only holds shared pointers to Storable
but still checks for the appropriate subclass.
This change makes ExposureInfo
much easier to maintain, at the cost of being unable to incorporate a broader range of types in the future.
All but three of ExposureInfo
’s traditional components have been migrated to map-based storage.
The three exceptions are:
- the image metadata are stored in a
lsst.daf.base.PropertySet
, which cannot inherit fromStorable
because it’s in a dependency oflsst.afw
. - the visit info is stored inside the metadata rather than as a separate component, so it must continue to be written there for backward compatibility.
We chose not to duplicate it (storing both as metadata and as a
Storable
) for simplicity. - the filter is stored inside the metadata, like the visit info.
In addition,
ExposureInfo
’s filter-related methods pass and return a filter object, not a shared pointer, so we cannot migrate it without either changing the existing API to use shared pointers or introducing inconsistencies between, for example, the return types ofgetFilter()
andgetComponent()
.
5.3.1 Constraints Imposed by the Use of GenericMap¶
While GenericMap
can store both Storable
s and shared pointers to Storable
, it cannot preserve this distinction after persistence – the lsst.table.io
framework always depersists objects as shared pointers to Persistable
, so the information on whether an element was originally retrievable by reference or by shared pointer is lost.
To avoid inconsistencies in saving and restoring Exposure
s, we required that generic components in ExposureInfo
be pointers to Storable
until we could change to a more flexible persistence framework.
The original access methods for the migrated components were very inconsistent in their use of const
, with the majority using shared_ptr<T const>
, some using shared_ptr<T>
, and some using const
inconsistently between input and output.
Since:
GenericMap
cannot support both shared pointers toconst
and shared pointers to non-const
before C++17,- keeping the existing mixture would require lots of (potentially unsafe) casts both in
ExposureInfo
and in client code, - changing the inputs to non-
const
would likely break client code, and - changing the outputs to
const
would be relatively safe,
we chose to standardize both inputs and outputs to const
, and to modify GenericMap
to hold shared pointers to const
instead of pointers to non-const
.
Standardizing output to const
was a breaking change to the C++ interfaces for getPsf()
and getCoaddInputs()
.
However, as there was no C++ code in Science Pipelines that stored the results as pointer to non-const
, the change caused no problems to our knowledge.
5.4 ExposureInfo Persistence Changes¶
The lsst.afw.image.Exposure
persistence format stores ExposureInfo
components in binary “archives” in FITS extensions, with the extension number stored in the header using keys like SKYWCS_ID
.
Generic components generalize this format by creating a header key from the component’s key string (e.g., MYCOMPONENT_ID
).
The extensions containing generic components are not guaranteed to be arranged in any particular order, but neither the original nor the generalized formats depend on ordering.
The above conventions, combined with the need for backwards compatibility, imply that components that have been migrated to generic storage must have a component key that matches the original header key (e.g., SKYWCS
for WCS information).
The awkward names add some inconvenience to ExposureInfo
clients, but an alternative would need to both generate an independent FITS header key in a backward-compatible way and store the component key inside the archive in a backward-compatible way.
Basing the FITS header key on the component key was deemed a less error-prone solution.
While the FITS header keys listing the extensions could previously be hard-coded into ExposureInfo
’s depersistence code, the depersistence code for generic components must search the header for keys of the expected format.
Queries for [arbitrary string]_ID
are vulnerable to false positives from header keys like VISIT_ID
or CCD_ID
; if associated values are small integers, then there is no way to distinguish such keys from real archive IDs.
We therefore added a second convention for archive IDs, of the form ARCHIVE_ID_[component]
.
Old components are depersisted using the *_ID
syntax, to retain compatibility with old files, while new ones are depersisted using ARCHIVE_ID_*
.
The new code strips both sets of header keywords when they are detected; new files read using old code may have leftover ARCHIVE_ID_
keywords.
6 Alternative Persistence Frameworks¶
6.1 Rationale¶
Despite recent improvements to lsst.afw.image.ExposureInfo
, our ability to attach data to an Exposure
is limited by the lsst.afw.table.io
persistence framework.
We still have circular dependencies within afw
, we still need persistable objects to be written in C++, and we still need to work around the non-optimal nature of lsst.afw.table.io
itself.
To go further, we need to replace lsst.afw.table.io
with a well-maintained package that, unlike afw.table
, was designed for persistence.
For this technote, a “persistence framework” is any library that defines at least a data persistence format and an API for reading and writing to it.
It does not need to provide support for serializing objects to a data representation; for example, lsst.afw.table.io
relies on class-specific code to do this.
6.2 Adoption Constraints¶
We are looking for a persistence framework that meets many of the following criteria:
- it should have a GPL3-compatible license.
- it should allow persistence of both C++ and Python objects.
- it should allow persistence to multiple formats, including FITS, JSON, and YAML. Compatibility with JSON and YAML requires a framework that can represent objects as key-value pairs.
- it should allow versioning of persistence formats.
- it should allow depersistence of old files written with the
lsst.afw.table.io
framework. - it should allow efficient storage of arrays, particularly images, but not necessarily in formats other than FITS.
- it should correctly depersist polymorphic types that are stored in C++ by their base class (e.g.,
lsst.afw.detection.Psf
), reproducing their exact type (e.g.,lsst.meas.algorithms.ImagePsf
). - it should be able to read in part of a persisted object, such as only the WCS from a persisted
Exposure
. - it should be able to store relationships between objects that refer to each other.
This would allow us to separately store composite objects and their components, such as the individual PSFs used to create a
lsst.meas.algorithms.CoaddPsf
, simplifying provenance tracking. - it should persist files in a human-readable form, where practical, as a debugging aid.
We do not have any expectation that we should be able to easily change persistence frameworks in the future.
6.3 Option: Arrow¶
While a memory management library rather than a persistence framework, Apache Arrow provides serialization tools through its Python interface, pyarrow
.
The tools behave much like pickle
, with a default serialized form that can be overridden by custom functions.
The serialized form is a byte stream, which is encoded and decoded into Python objects.
While optimized for NumPy arrays, pyarrow
serialization is advertised as faster than pickle
in general.
This option should not be confused with Parquet.
Although it uses Arrow as its default reader in C++ and Python, Parquet is a different format from that used by pyarrow
’s object serialization code.
Arrow meets only a few of our criteria:
- it uses the Apache 2.0 license.
- it supports fast file I/O of suitable byte streams in both C++ and Python, but offers a generic way to serialize objects to a byte stream only in Python.
- it uses a proprietary format exclusively.
- it versions the general byte stream format, but does not support versioning of any particular type’s serialized form.
- it does not interoperate with external formats like
lsst.afw.table.io
- it is designed for efficient array storage.
- the documentation does not mention how polymorphic types are handled by default, though we could emulate it by storing an explicit class name (much like
lsst.afw.table.io
does). - it supports partial deserialization (only of NumPy arrays?), but requires that the object be read into memory first.
- it is not clear how it stores object references (although its ability to delegate subobjects’ serialization to
pickle
provides a clue). - its persisted form is not human-readable.
6.4 Option: Avro¶
Avro is a table-like persistence library provided by Apache.
It defines persistence formats in terms of schemas, which may be composed (e.g., the user can declare that a lsst.geom.Box2I
is stored as a pair of lsst.geom.Point2I
, as long as there is a persisted form for Point2I
).
In C++, the schemas are typically compiled into proxy classes, whose data are then accessed using object member syntax.
In Python, no code generation is required, and data are represented using dict
-like objects.
Avro uses a schema definition language based on JSON, but as the notation is similar in spirit to lsst.pipe.base.PipelineTaskConnections
, it should not be a major conceptual burden on LSST developers.
Because of its record-based approach, Avro is best-suited for bulk data: one could store a Catalog
very naturally using Avro, but a free compound object like an Exposure
would have some overhead (essentially, its persisted form would be a collection of one-row tables).
Avro meets many, but not all, of our criteria:
- it uses the Apache 2.0 license.
- it has built-in support for both C++ and Python, although the Python API is much better.
- it has built-in support for JSON, as well as a proprietary format. We can, in principle, support additional formats using stream I/O, but the interfaces do not appear to have been designed for third-party implementations.
- it supports schema versioning.
- it may be possible to depersist
lsst.afw.table.io
files by treating it as a custom format. - it natively supports arrays of arbitrary type, including numeric primitives.
- it does not natively handle polymorphism, though we could emulate it by storing an explicit class name (much like
lsst.afw.table.io
does). Avro makes heavy use of dynamic typing on depersistence, so differences in the persisted form among related classes are not a problem. - its tables are row-oriented, so it does not have any special ability to partially read objects.
- it does not natively store object references, though we could emulate them by introducing a unique object ID data type.
- in Python, the persisted form is a
dict
from field names to field values, which is reasonably human-readable.
6.5 Option: Bond¶
Bond is a struct-like persistence library provided by Microsoft.
It defines persistence formats in terms of schemas, which may be composed (e.g., the user can declare that a lsst.geom.Box2I
is stored as a pair of lsst.geom.Point2I
, as long as there is a persisted form for Point2I
).
The schemas are compiled into C++ proxy classes, whose data are then accessed (in C++ or Python) using object member syntax.
Bond uses a custom schema definition language with a C-like syntax.
Because the persisted form of each persistable type is represented by a different class, it may be difficult to write generic code against persistables. However, Bond provides a reflection API for working with generic persisted forms.
Bond has significant external dependencies, namely the Haskell Tool Stack and RapidJSON.
Bond meets most of our criteria, though with caveats:
- it uses the MIT license.
- it has built-in support for both C++ and Python. However, Python support depends on module files compiled with Boost Python, which may have surprising interactions with our Pybind11-based system.
- it has built-in support for JSON and several proprietary formats with different speed-size tradeoffs. It also has extensive support for different kinds of custom data formats.
- it does not support schema versioning.
- it can depersist
lsst.afw.table.io
files by treating it as a custom format, though it would need to be tied to the Bond schema. - it natively supports arrays of arbitrary type, including numeric primitives.
- it does not natively handle polymorphism, but it does provide a persistence format for its own schemas (either as part of a data file, or separately). If there’s a one-to-one mapping between persistable classes and schema structs, then the distinction will be preserved in the file.
- it provides an API for lazy deserialization, although whether this allows partial reads in practice depends on the file format (specifically, on whether the data block for a subobject can be identified without parsing it).
- it does not natively store object references, though we could emulate them by introducing a unique object ID data type.
- the persisted form is not human-readable, though it can be converted to, e.g., a JSON representation
6.6 Option: Cap’n Proto¶
Cap’n Proto is a struct-like persistence library developed by the Sandstorm web platform.
It defines persistence formats in terms of schemas, which may be composed (e.g., the user can declare that a lsst.geom.Box2I
is stored as a pair of lsst.geom.Point2I
, as long as there is a persisted form for Point2I
).
In C++, the schemas are typically compiled into proxy classes, whose data are then accessed using object member syntax.
In Python, no code generation is required.
Cap’n Proto uses a custom schema definition language that is easy to use correctly but also easy to use incorrectly.
The library is designed for its persisted forms to be usable as in-memory objects, but that’s not an option if we’re using it as a general-purpose persistence framework. Serialization and deserialization may be slower than for other frameworks because it wasn’t a design priority.
Cap’n Proto meets only a few of our criteria:
- it uses the MIT license
- it has built-in support for C++, and there is a wrapper for Python.
- it uses a proprietary format exclusively.
- it does not support schema versioning. New fields can be added safely, but must absolutely never be removed.
- it cannot depersist external files.
- it natively supports lists of arbitrary type, including numeric primitives and other lists.
- it does not natively handle polymorphism, though we could emulate it by storing an explicit class name (much like
lsst.afw.table.io
does). However, this approach would probably make it impossible to make changes to the base class’s schema because all fields need an continuously incremented ID number. - it supports several techniques for partial reads of large objects.
- it has native support for object relationships, but only within the same file.
- its persisted form is highly efficient, and therefore not human-readable.
6.7 Option: cereal¶
cereal is a C++ persistence library provided by the University of Southern California.
It’s similar in style to Boost Serialization, but more lightweight.
References to objects are passed to a cereal::InputArchive
or cereal::OutputArchive
object, which calls user-defined methods or functions to handle non-built-in types (these methods usually call the archive recursively).
cereal was not designed for long-term storage, and makes no guarantee that a later version of the library will correctly load a file written by an earlier version. We can work around this limitation by locking the Stack’s version of cereal.
cereal meets many, but not all, of our criteria:
- it uses the 3-clause BSD license.
- it does not support languages other than C++.
It probably cannot handle pure-Python classes (it needs C++ functions or methods to define the persisted form), but it might be able to handle Python subclasses of a C++ interface like
lsst.afw.typehandling.Storable
. - it explicitly supports subclassing the archive classes, including specializing them for new output formats.
- it supports class versioning, using a system similar to that currently used for
ExposureInfo
. - it is possible to depersist
lsst.afw.table.io
files by treating it as a custom format. - it supports raw binary output for individual fields, which can be used to efficiently store arrays.
- it supports polymorphism, but requires explicit registration of all subclasses. The registration code for a particular class must be aware of all possible archive implementations, making it difficult to add new file formats.
- it supports “out of order loading” of sub-objects in specific archive implementations, including the built-in JSON archive.
- it can self-consistently track object relationships, but requires that all objects (such as a
lsst.meas.algorithms.CoaddPsf
and its constituents) be in the same archive. It cannot be used to store such objects in separate files, though we could emulate such functionality by introducing a unique object ID to the persisted form. - it supports human-readable key-value pairs, as well as a “simplified” output format for human-readability.
6.8 Option: FlatBuffer¶
FlatBuffer is a struct-like persistence library provided by Google.
It defines persistence formats in terms of schemas, which may be composed (e.g., the user can declare that a lsst.geom.Box2I
is stored as a pair of lsst.geom.Point2I
, as long as there is a persisted form for Point2I
).
The schemas are compiled into proxy classes (in both C++ and Python), whose data are then accessed using object member syntax.
FlatBuffer uses a custom schema definition language, but as the notation is similar in spirit to lsst.pipe.base.PipelineTaskConnections
, it should not be a major conceptual burden on LSST developers.
Because the persisted form of each persistable type is represented by a different class, it may be difficult to write generic code against persistables. However, FlatBuffer does provide a reflection API for working with generic persisted forms.
FlatBuffer meets some of our criteria:
- it uses the Apache 2.0 license.
- it has built-in support for both C++ and Python, although Python features are very limited. The C++ API is also cleaner, particularly with vector types.
- it only natively supports a proprietary file format (though it allows import from JSON). In C++, it may be possible to use reflection to support other formats.
- it supports schema versioning.
- it probably cannot depersist old files, as FlatBuffer makes strict demands on persistence formats in the interest of efficiency.
- it natively supports arrays of primitive type.
- it does not natively handle polymorphism, though we could emulate it by storing an explicit class name (much like
lsst.afw.table.io
does) and using the FlexBuffers feature to avoid knowing the schema a priori. - it supports streaming input of large files, at least in C++, but does not support reading only a predetermined part of an object. However, it does depersist at close to disk-limited speed, so partial reads may be less performance-critical.
- it has native support for object relationships, even among different persisted files.
- its persisted form is highly efficient, and therefore not human-readable.
6.9 Option: HDF5¶
While it’s often associated with large numerical datasets, the filesystem-like HDF5 file format can store arbitrary objects (compound datatypes, in its terminology).
It defines compound datatypes in terms of other HDF5 datatypes, which need to be explicitly composed in C code and expressed as numpy.dtype
objects in Python code.
Because it’s not specifically designed for persistence of single objects, if we used HDF5 as a persistence framework, we would likely need to create an interface explicitly geared toward object persistence rather than forcing users to use HDF5’s own APIs.
HDF5 meets only a few of our criteria:
- HDF5 for Python uses the 3-clause BSD license. The HDF5 library itself uses a modified version of BSD 3-clause; I’m not sure whether it’s still GPL-compatible.
- it supports Python and C. The Python API is in general cleaner, but does not support nested compound datatypes.
- it uses a proprietary format exclusively.
- it does not have any support for datatype versioning.
- it cannot depersist old files.
- it natively supports arrays of arbitrary type, with explicit support for image data.
- it does not natively handle polymorphism, though we could emulate it by storing an explicit class name (much like
lsst.afw.table.io
does). - it supports partial reads only for subsets of array datasets, not for elements of a compound dataset.
- it has native support for dataset relationships, but only between datasets in the same HDF5 file
- its persisted form is not human-readable.
6.10 Option: MsgPack¶
MsgPack is a tree-like persistence format by independent developer Sadayuki Furuhashi.
It’s sometimes described as a binary version of JSON.
The C++ implementation provides several interfaces for persisting objects, including custom msgpack::adaptor::pack
and msgpack::adaptor::convert
callables, and macros that resemble Python’s use of tuples for pickling.
The Python implementation provides an API similar to Python’s built-in JSON library.
Because the MsgPack format handles object types in a very client-specific way, extra care may be needed to get persisted forms that interoperate between C++ and Python.
MsgPack meets some of our criteria:
- the C++ implementation uses the Boost 1.0 license, while the Python implementation uses the Apache 2.0 license.
- it supports Python and C++.
- it uses a proprietary format exclusively.
- it does not have any support for datatype versioning; support would need to be included in the custom depersistence code. The Python implementation has tools to manually parse the serialized form; I could not find a C++ equivalent.
- it cannot depersist external files.
- it natively supports arrays of arbitrary type, including numeric primitives.
- it does not natively handle polymorphism, though we could emulate it by storing an explicit class name (much like
lsst.afw.table.io
does) or through custom depersistence logic. - it does not support partial reads.
- it does not natively store object references, though we could emulate them by introducing a unique object ID data type.
- its persisted form is highly efficient, and therefore not human-readable.
6.11 Option: Parquet¶
Parquet is a table-like persistence format provided by Apache.
The format requires exactly one schema for each file, but multiple files (and therefore multiple schemas) can be combined into a logical dataset.
The standard Parquet reader in both C++ and Python is Apache Arrow, which represents the persisted form as a Table
object.
Since Parquet was not designed specifically for object persistence, client code is fully responsible for serializing and deserializing an object from a Table
.
Arrow’s Table
API is similar in spirit to lsst.afw.table
, encoding such concepts as non-contiguous tables (“chunked” types) and compound keys (“struct arrays”).
However, this means that it is not necessarily better-suited for object persistence than lsst.afw.table.io
.
Parquet meets only a few of our criteria:
- it uses the Apache 2.0 license.
- it supports both Python and C++, although the C++ interface is not fully documented.
- it is a proprietary format, with no built-in interoperability with JSON, YAML, etc.
- it does not support versioning of persistence formats.
- it does not allow depersistence of files written with the
lsst.afw.table.io
framework. - it natively supports arrays of arbitrary type, including numeric primitives.
- it does not natively handle polymorphism, though we could emulate it by storing an explicit class name (much like
lsst.afw.table.io
does). - its tables are column-oriented, so partial reads are fast in the limit of large numbers of rows. For small tables, the Parquet format does not necessarily provide a speed improvement because most columns may be on the same disk page.
- it does not natively store object references, though we could emulate them by introducing a unique object ID data type.
- its file format is heavily optimized and compressed, and therefore not human-readable.
The intermediate persisted form (a
Table
object) is moderately human-redable in Python.
6.12 Option: Pyrobuf¶
Pyrobuf is an AppNexus emulation of Google’s Protobuf, built for speed.
Like Google’s related library, FlatBuffer, Protobuf and Pyrobuf define persistence formats in terms of schemas, which may be composed (e.g., the user can declare that a lsst.geom.Box2I
is stored as a pair of lsst.geom.Point2I
, as long as there is a persisted form for Point2I
).
The schemas are compiled into proxy classes, whose data are accessed using object member syntax.
Unlike in FlatBuffer, the proxy classes must be explicitly serialized to and from byte streams.
Pyrobuf meets some of our criteria:
- it uses the Apache 2.0 license.
- it works only in Python, but the need for multiple translation layers means Pybind11-wrapped C++ classes can be easily accommodated, as long as their full state is visible from Python.
- it natively supports the Google Protobuf format and JSON. It’s possible to write proxy classes to other formats, although writing the code to do so takes away much of the savings of using a third-party library.
- it does not support versioning of persistence formats.
- it can create proxy classes from
lsst.afw.table.io
with a custom translator. - it natively supports arrays of arbitrary type.
- it does not natively handle polymorphism, though we could emulate it by storing an explicit class name (much like
lsst.afw.table.io
does). Pyrobuf is fail-soft for addition or removal of fields, so we can add extra fields in subclasses. - it does not support partial reads of objects.
- it does not natively store object references, though we could emulate them by introducing a unique object ID data type.
- the serialized byte streams are not human-readable, although the proxy classes are
6.13 Recommendations¶
We do not yet recommend any of these frameworks as a replacement for lsst.afw.table.io
.