Build specifications and artifacts

The build spec (build.json)

The heart of HashDist are build specs, JSON documents that describes everything that goes into a build: the source code to use, the build environment, the build commands.

A build artifact is the result of executing the instruction in the build spec. The build store is the collection of locally available build artifacts.

The way the build store is queried is by using build specs as keys to look up software. E.g., take the following build.json:

    "name": "zlib",
    "sources": [
        {"key": "tar.gz:7kojzbry564mxdxv4toviu7ekv2r4hct", "target": ".", "strip": 1}
    "build": {
        "import": [
            {"ref": "UNIX", "id": "virtual:unix"},
            {"ref": "GCC", "id": gcc/gxuzlqihu4ok5obtxm5xt6pvi6a3gp5b"},
        "commands": [
            {"cmd": ["./configure", "--prefix=$ARTIFACT"]},
            {"cmd": ["make", "install"]}

The artifact ID derived from this build specification is:

$ hit hash build.json

Let us check if it is already in the build store:

$ hit resolve build.json
(not built)
$ hit resolve -h zlib/d4jwf2sb2g6glprsdqfdpcracwpzujwq
(not built)

In the future, we hope to make it possible to automatically download build artifacts that are not already built. For now, building from source is the only option, so let’s do that:

$ hit build -v build.json

The last line is the location of the corresponding build artifact, which can from this point of be looked up by the either the build spec or the hash:

$ hit resolve build.json
$ hit resolve -h zlib/d4jwf2sb2g6glprsdqfdpcracwpzujwq


Note that the intention is not that end-users write build.json files themselves; they are simply the API for the build artifact store. It is the responsibility of distributions such as python-hpcmp to properly generate build specifications for HashDist in a user-friendly manner.

Build artifacts

A build artifact contains the result of a build; usually as a prefix-style directory containing one library/program only, i.e., typical subdirectories are bin, include, lib and so on. The build artifact should ideally be in a relocatable form that can be moved around or packed and distributed to another computer (see Guidelines for packaging/building), although HashDist does not enforce this.

A HashDist artifact ID has the form name/hash, e.g., zlib/4niostz3iktlg67najtxuwwgss5vl6k4. For the artifact paths on disk, a shortened form (4-char hash) is used to make things more friendly to the human user. If there is a collision, the length is simply increased for the one that comes later (see also Discussion).

Some information is present in every build artifact:

The log from performing the build.
The input build spec.
Metadata about the build artifact itself. Some of this is simply copied over from build.json; however, this is a separate file because parts of it could be generated as part of the build process. This is important during Profile generation (docs TBD).

Build spec spec

The build spec document has the following top-level fields:

Used as the prefix in the artifact ID. Should match [a-zA-Z0-9-_+]+.
(Currently not used, but will become important for virtual build artifacts). Should match [a-zA-Z0-9-_+]*.
A job to run to perform the build. See hashdist.core.run_job for the documentation of this sub-document.
Sources listed are unpacked to build directory; documentation for now in ‘hit unpack-sources’
Copied to $ARTIFACT/artifact.json before the build.
Copied to $ARTIFACT/artifact.json before the build.

The build environment

See hashdist.core.execute_job for information about how the build job is executed. In addition, the following environment variables are set:

Set to the build directory. This is also the starting cwd of each build command. This directory may be removed after the build.
The location of the final artifact. Usually this is the “install location” and should, e.g., be passed as the --prefix to ./configure-style scripts.

The build specification is available under $BUILD/build.json, and stdout and stderr are redirected to $BUILD/build.log. These two files will also be present in $ARTIFACT after the build.


Safety of the shortened IDs

HashDist will never use these to resolve build artifacts, so collision problems come in two forms:

First, automatically finding the list of run-time dependencies from the build dependencies. In this case one scans the artifact directory only for the build dependencies (less than hundred). It then makes sense to consider the chance of finding one exact string aaa/0.0/ZXa3 in a random stream of 8-bit bytes, which helps collision strength a lot, with chance “per byte” of collision on the order \(2^{-(8 \cdot 12)}=2^{-96}\) for this minimal example.

If this is deemed a problem (the above is too optimistice), one can also scan for “duplicates” (other artifacts where longer hashes were chosen, since we know these).

The other problem can be future support for binary distribution of build artifacts, where you get pre-built artifacts which have links to other artifacts embedded, and artifacts from multiple sources may collide. In this case it makes sense to increase the hash lengths a bit since the birthday effect comes into play and since one only has 6 bits per byte. However, the downloaded builds presumably will contain the full IDs, and so on can check if there is a conflict and give an explicit error.