Developer Guide¶
This short guide contains information about the development process of brain-indexer.
Design¶
brain-indexer lets users create indexes and perform volumetric queries with efficiency in mind. It was developed on top of the Boost Geometry.Index libraries, renowned for their excellent performance.
brain-indexer specializes in several core geometries, namely spheres and cylinders, and their derivatives: indexed spheres/segments and “unions” of them (variant). These objects are stored directly in the tree structure and their bounding boxes are created on the fly. Some CPU cycles were thus traded for substantial memory savings, aiming at nearly a billion indexed geometries per cluster node.
Public API: C++¶
The core library is implemented as a C++17 Header-Only library and offers a wide range of possibilities.
By including brain_indexer/index.hpp and brain_indexer/util.hpp the user has
access to the whole API which is highly flexible due to templates. For instance
the IndexTree class can be templated to any of the defined structures or new
user defined geometries, providing the user with a boost index rtree object
which automatically supports serialization and higher level API functions.
Please refer to the unit tests (under tests/cpp/unit_tests.cpp) for examples
of how to create several kinds of IndexTrees.
Public API: Python¶
The Python API gives access to the most common instances of IndexTree, while adding a higher-level API and I/O helpers. The user may instantiate a spatial index of Morphology Pieces (both Somas and Segments) or, for maximum efficiency, Spheres alone.
If a spatial index of spherical geometries is sufficient, e.g. for soma
placement, the user should prefer brain_indexer.SphereIndex. Where spatial
indexing of cylindrical or mixed geometries is required, e.g. full neuron
geometries, the user should use brain_indexer.MorphIndex. The latter also
provides higher-level methods that help with loading entire neuron morphologies.
I/O Policy¶
- The rules are
brain-indexer will create intermediate directories if required.
brain-indexer will raise an error with a meaningful message when trying to open a missing file. Please use brain_indexer::util::open_ifstream to open an std::ifstream with the appropriate checks.
There’s also brain_indexer::util::open_ofstream to open output file streams.
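The two rules above can be illustrated with a small Python analogue. This is only a sketch of the policy, not the library's implementation: `open_for_writing` and `open_for_reading` are hypothetical names introduced here, and the C++ core should keep using the helpers named above.

```python
from pathlib import Path


def open_for_writing(path, mode="w"):
    """Open `path` for writing, creating missing parent directories first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path.open(mode)


def open_for_reading(path, mode="r"):
    """Open `path` for reading, failing with a descriptive message if absent."""
    path = Path(path)
    if not path.is_file():
        # Name the offending path so the user can act on the message.
        raise FileNotFoundError(f"Cannot open '{path}': file does not exist.")
    return path.open(mode)
```

Centralizing both checks in two small helpers means every call site gets the same behavior and the same error wording.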
Logging Policy¶
- There are three levels of logging
info This level contains information that is meaningful and of interest to the user. Typically its purpose is to help the user understand, on a coarse level, what the library is doing. The log must stay short enough to be easily read by a human.
warn This level contains information that might indicate a problem. For example, an invalid setting that falls back to a default.
error Additional information about the error not present in the exception.
Note, logging is distinct from error handling, i.e. calling log_error will not
throw an exception automatically.
The Python side should use
import brain_indexer
brain_indexer.logger.info("Hi, let's get started.")
brain_indexer.logger.warning("This is how to log warnings.")
brain_indexer.logger.error(
    "For demonstration purposes we will raise an exception now."
)
raise ValueError("Mysterious error.")
Please don’t write to the root logger via logging.info or similar.
The C++ core allows registering a callback to which all logging is forwarded.
When used as a Python library brain-indexer will register (a wrapper around) the
Python logger brain_indexer.logger as the callback. Therefore, all log
messages from the C++ side are handled by the Python logging library. A different
logger can be registered by
brain_indexer.register_logger(some_other_logger)
With reasonable effort, this makes it possible to send logs to a specific file (one per MPI rank), etc.
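As a sketch of the one-file-per-rank idea, the helper below builds such a logger using only the standard `logging` module. The function name `make_rank_logger`, the logger name, the file naming scheme and the message format are all assumptions made for illustration. The rank is passed in explicitly because how it is obtained depends on the MPI library in use.

```python
import logging
from pathlib import Path


def make_rank_logger(rank, log_dir="logs"):
    """Return a logger that writes to `<log_dir>/rank_<rank>.log`."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)

    # The logger name is arbitrary; disabling propagation keeps the
    # messages out of the root logger.
    logger = logging.getLogger(f"bi_rank_{rank}")
    logger.setLevel(logging.INFO)
    logger.propagate = False

    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(log_dir / f"rank_{rank}.log")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


# Assumed usage, with `rank` provided by the MPI library:
#   import brain_indexer
#   brain_indexer.register_logger(make_rank_logger(rank))
```

Since each rank writes to its own file, messages from different ranks are never interleaved.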
On the C++ side please use:
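#include <stdexcept>

#include <boost/format.hpp>
#include <brain_indexer/logging.hpp>

namespace brain_indexer {

// Hypothetical function; it only gives the calls below a valid scope.
void log_demo() {
    log_info("Hello!");
    log_info(boost::format("Hello %s!") % "Alice");
    log_warn("This might not be as intended.");
    log_error("oops.");
    // Logging an error does not throw; throw explicitly.
    throw std::runtime_error("tja.");
}

}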
While nobody admits using printf debugging, here’s a trick:
#include <brain_indexer/logging.hpp>
SI_LOG_DEBUG("bla....");
SI_LOG_DEBUG_IF(
    i == 42,
    boost::format("%d: %e %e") % i % x % y
);
This is useful because the output can be split up by MPI rank, which gives a clean stream of messages from each rank.
Environment Variables¶
The following environment variables are used by brain-indexer:
SI_LOG_SEVERITY: controls the minimum severity that log messages need to have. Valid values are INFO, WARN, ERROR, DEBUG. Note that DEBUG requires that SI was built with SI_ENABLE_LOG_DEBUG. The default is INFO.
SI_REPORT_USAGE_STATS: if activated by setting it to On or 1, the multi-index cache usage statistics report is saved to disk. By default it is deactivated.
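The documented values and defaults can be summarized in a short sketch. The helpers `read_log_severity` and `usage_stats_enabled` are hypothetical and not part of brain-indexer. Two choices here are assumptions, because the text above does not say how the library itself handles these cases: invalid values raise an error, and the values are compared case-sensitively.

```python
import os

# The values listed in the documentation above.
VALID_SEVERITIES = ("INFO", "WARN", "ERROR", "DEBUG")


def read_log_severity(environ=os.environ):
    """Return the requested severity, falling back to the documented default."""
    value = environ.get("SI_LOG_SEVERITY", "INFO")
    if value not in VALID_SEVERITIES:
        # Assumption: reject unknown values loudly instead of guessing.
        raise ValueError(
            f"SI_LOG_SEVERITY={value!r} is invalid; "
            f"expected one of {VALID_SEVERITIES}."
        )
    return value


def usage_stats_enabled(environ=os.environ):
    """Return True only for the two documented activating values."""
    return environ.get("SI_REPORT_USAGE_STATS") in ("On", "1")
```

Passing the environment as an argument keeps the helpers easy to test without touching the real process environment.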
Boost Serialization & Struct Versioning¶
brain-indexer uses boost::serialization to write indexes to disk. To be able
to at least detect old indexes, and ideally also open them, we need to
version each serialized struct.
There are a few things to respect:
Boost versions individual classes.
The base class must be serialized through boost::serialization::base_object<Base>(*this); and must not use static_cast since this will silently work but fails to serialize important type information.
It (effectively) requires that every class has a private serialize method, even if it only serializes its base class. In toy examples it was easy to modify an existing serialize method. However, adding one to a derived class would never work properly. In particular the difficulty was opening classes serialized through the old protocol, i.e. without a serialize method in the base.
brain-indexer defines a constant SPATIAL_INDEX_STRUCT_VERSION which sets the
version of the structs that are serialized. This is the global version that every
struct will use. Therefore, a new class that needs to be serialized, e.g.
because it is part of something that is being serialized, must set its version
to SPATIAL_INDEX_STRUCT_VERSION and assert that the version is not 0. (This
last part is only to check that you haven’t forgotten to set the version.)
Include and Inline Policy¶
Every header should include everything it needs to be used by itself. Headers must not rely on a particular include order, and no transitive assumptions about what’s already included are permitted.
This also affects the inline policy, i.e. everything needs to either be a
template or be inline to avoid violations of the ODR.
In order to check that all headers <brain_indexer/*.hpp> adhere to these rules
we’ve added one compilation unit per header in tests/cpp/check_headers/*.cpp.
These simply include the header so that the compiler can check whether the rules
are adhered to. In order to generate the required dummy code, we have a script:
bin/update_header_checks.py
(It must be run in the project root.)
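To illustrate the idea, here is a minimal sketch of what such a generator could look like. It is not the actual bin/update_header_checks.py. The function name, its parameters and the assumed directory layout (headers under `<include_dir>/brain_indexer`) are made up for this example.

```python
from pathlib import Path


def update_header_checks(include_dir, output_dir, package="brain_indexer"):
    """Write one `.cpp` file per header found in `<include_dir>/<package>`."""
    include_dir, output_dir = Path(include_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for header in sorted((include_dir / package).glob("*.hpp")):
        source = output_dir / f"{header.stem}.cpp"
        # Including only this header forces it to compile in isolation.
        source.write_text(f"#include <{package}/{header.name}>\n")
        written.append(source)
    return written
```

Each generated file is a separate compilation unit, so a header with a missing include fails to compile on its own. Linking the units together is what would expose ODR violations from non-inline definitions.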
History¶
This section contains summaries of things that have been attempted in the past, but have been removed without leaving any (obvious) traces behind.
Memory mapping¶
Memory mapping refers to mapping parts of a file into the virtual address space. Hence the “memory” can be allocated from the file.
This was used to create an R-Tree which was backed by memory stored on disk. Thereby one could create “in-memory” indexes even if this would have required a few TB of RAM.
The approach was reasonably efficient when the backing disk was an NVMe disk. However, GPFS was far too slow for this to work.
The Git history contains two commits in which this functionality was removed. In the first step only the Python bindings were removed. In a second step all memory mapping code was removed. At the time of removal this functionality was complete (but might have bit-rotted slightly).