hashdist.core.source_cache — Source cache

The source cache makes sure that one doesn’t have to re-download source code from the net every time one wants to rebuild. For consistency/simplicity, the software builder also requires that local sources are first “uploaded” to the cache.

The source cache currently has explicit support for tarballs, git, and storing files as-is without metadata. A “source item” (tarball, git commit, or set of files) is identified by a secure hash. The generic API in SourceCache.fetch() and SourceCache.unpack() works by using such hashes as keys. The retrieval and unpacking methods are determined by the key prefix:

sc.fetch('http://python.org/ftp/python/2.7.3/Python-2.7.3.tar.bz2',
         'tar.bz2:ttjyphyfwphjdc563imtvhnn4x4pluh5')
sc.unpack('tar.bz2:ttjyphyfwphjdc563imtvhnn4x4pluh5', '/your/location/here')

sc.fetch('https://github.com/numpy/numpy.git',
         'git:35dc14b0a59cf16be8ebdac04f7269ac455d5e43',
         'numpy')

For cases where one doesn’t know the key up front one uses the key-retrieving API. This is typically done in interactive settings to aid distribution/package developers:

key1 = sc.fetch_git('https://github.com/numpy/numpy.git', 'master', 'numpy')
key2 = sc.fetch_archive('http://python.org/ftp/python/2.7.3/Python-2.7.3.tar.bz2')

Features

  • Re-downloading all the sources on each build gets old quickly...
  • Native support for multiple retrieval mechanisms. This is important as one wants to use tarballs for slowly-changing stable code, but VCS for quickly-changing code.
  • Isolates dealing with various source code retrieval mechanisms from upper layers, which can simply pass along two strings regardless of method.
  • Safety: Hashes are re-checked on the fly while unpacking, to protect against corruption or tainting of the source cache.
  • Should be safe for multiple users to share a source cache directory on a shared file-system as long as all have write access, though this may need some work with permissions.

Source keys

The keys for a given source item can be determined a priori. The rules are as follows:

Tarballs/archives:
SHA-256, encoded in base64 using format_digest(). The prefix is currently either tar.gz or tar.bz2.
Git commits:
Identified by their SHA-1 commit hashes, prefixed with git:.
Individual files or directories (“hit-pack”):

A tarball hash is not deterministic from the file contents alone (there’s metadata, compression, etc.). In order to hash build scripts etc. with hashes based on the contents alone, we use a custom “archive format” as the basis of the hash stream. The format starts with the 8-byte magic string “HDSTPCK1”, followed by each file, sorted by filename (potentially containing “/”). Each file is stored as

little-endian uint32_t length of filename
little-endian uint32_t length of contents
filename (no terminating null)
contents

This stream is then hashed the same way as archives (SHA-256, encoded in base64), and the result prefixed with files: to get the key.
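As a concrete illustration, the layout above can be reproduced in a few lines of Python. This is a sketch, not the library's implementation; in particular the exact digest encoding of format_digest() is assumed here to be URL-safe base64 with padding stripped:

```python
import hashlib
import struct
from base64 import urlsafe_b64encode

MAGIC = b"HDSTPCK1"  # 8-byte magic string from the format description above

def pack_stream(files):
    """Serialize (filename, contents) pairs in the hit-pack layout."""
    chunks = [MAGIC]
    for filename, contents in sorted(files):
        name = filename.encode("utf-8")
        chunks.append(struct.pack("<I", len(name)))      # little-endian uint32: filename length
        chunks.append(struct.pack("<I", len(contents)))  # little-endian uint32: contents length
        chunks.append(name)                              # filename, no terminating null
        chunks.append(contents)
    return b"".join(chunks)

def hit_pack_key(files):
    """Hash the stream and build the key. The digest encoding here
    (URL-safe base64, padding stripped) is an assumption standing in
    for format_digest()."""
    digest = hashlib.sha256(pack_stream(files)).digest()
    return "files:" + urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

Sorting by filename makes the key independent of the order in which files are supplied.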

Module reference

class hashdist.core.source_cache.ProgressSpinner

Replacement for ProgressBar when we don’t know the file length.


class hashdist.core.source_cache.SourceCache(cache_path, logger, mirrors=(), create_dirs=False)

Methods

static create_from_config(config, logger, create_dirs=False)

Creates a SourceCache from the settings in the configuration

fetch(*args, **kwargs)

Fetch sources whose key is known.

This is the method to use in automated settings. If the sources globally identified by key are already present in the cache, the method returns immediately, otherwise it attempts to download the sources from url. How to interpret the URL is determined by the prefix of key.

Parameters:

url : str or None

Location to download sources from. Exact meaning depends on prefix of key. If None is passed, an exception is raised if the source object is not present.

key : str

Globally unique key for the source object.

repo_name : str or None

A unique ID for the source code repo; required for git and ignored otherwise. This must be present because a git “project” is distributed and cannot be deduced from the URL (and pulling everything into the same repo was way too slow). Hopefully this can be mended in the future.

fetch_archive(url, type=None)

Fetches a tarball without knowing the key up-front.

In automated settings, fetch() should be used instead.

Parameters:

url : str

Where to download archive from. Local files can be specified by prepending "file:" to the path.

type : str (optional)

Type of archive, such as "tar.gz" or "tar.bz2". For use when this cannot be determined from the suffix of the url.
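The type argument is thus only needed when the archive type cannot be read off the URL. Such suffix-based inference can be sketched as follows (the helper name is hypothetical, not part of the API):

```python
def guess_archive_type(url):
    # Hypothetical helper: infer the archive type from the URL suffix.
    # Only the archive types documented for the source cache are tried.
    for archive_type in ("tar.gz", "tar.bz2"):
        if url.endswith("." + archive_type):
            return archive_type
    return None  # undeterminable; the caller must pass type= explicitly
```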

fetch_git(repository, rev, repo_name)

Fetches source code from git repository

With this method one does not need to know a specific commit, but can use a generic git rev such as master or refs/heads/master. In automated settings or if the commit hash is known exactly, fetch() should be used instead.

Parameters:

repository : str

The repository URL (forwarded to git)

rev : str

The rev to download (forwarded to git)

repo_name : str

A unique name to use for the repository, e.g., numpy. This is currently required because git doesn’t seem to allow getting a unique ID for a remote repo; and cloning all repos into the same git repo has scalability issues.

Returns:

key : str

The globally unique key; this is the git commit SHA-1 hash prepended by git:.

put(files)

Put in-memory contents into the source cache.

Parameters:

files : dict or list of (filename, contents)

The contents of the archive. filename may contain forward slashes / as path separators. contents is a bytes object which is stored as-is.

Returns:

key : str

The resulting key, it has the files: prefix.

unpack(key, target_path)

Unpacks the sources identified by key to target_path

The sources are verified against their secure hash to guard against corruption/security problems. CorruptSourceCacheError will be raised in this case. In normal circumstances this should never happen.

The archive will be loaded into memory, checked against the hash, and then extracted from the memory copy, so that attacks through tampering with on-disk archives should not be possible.

Parameters:

key : str

The source item key/secure hash

target_path : str

Path to extract in
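The verify-before-extract pattern described above can be sketched as follows. The digest encoding is an assumption (URL-safe base64 of the SHA-256 digest with padding stripped, standing in for format_digest()), and ValueError stands in for CorruptSourceCacheError:

```python
import hashlib
import io
import tarfile
from base64 import urlsafe_b64encode

def verified_extract(archive_bytes, expected_digest, target_path):
    """Check the archive against its secure hash entirely in memory,
    then extract from that same memory copy."""
    digest = urlsafe_b64encode(
        hashlib.sha256(archive_bytes).digest()
    ).rstrip(b"=").decode("ascii")
    if digest != expected_digest:
        # Stand-in for CorruptSourceCacheError
        raise ValueError("hash mismatch; refusing to unpack")
    # Extract from the verified in-memory copy, not from disk
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tf:
        tf.extractall(target_path)
```

Because the hash is checked on the in-memory copy that is then extracted, later tampering with the on-disk archive cannot affect the extraction.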

class hashdist.core.source_cache.TarSubprocessHandler(logger)

Call external tar

This handler should only be used as a fallback; it lacks some features and/or depends on the vagaries of the host tar.


hashdist.core.source_cache.hit_pack(files, stream=None)

Packs the given files in the “hit-pack” format documented above, and returns the resulting key. This is useful to hash a set of files solely by their contents, not metadata, except the filename.

Parameters:

files : list of (filename, contents)

The contents of the archive. filename may contain forward slashes / as path separators. contents is a pure bytes object which will be dumped directly to stream.

stream : file-like (optional)

Where to write the resulting pack, or None if one only wishes to know the hash.

Returns:

key : str

The key of the resulting pack (e.g., files:cmRX4RyxU63D9Ciq8ZAfxWGjdMMOXn2mdCwHQqM4Zjw).

hashdist.core.source_cache.hit_unpack(stream, key)

Unpacks the files in the “hit-pack” format documented above, verifies that it matches the given key, and returns the contents (in memory).

Parameters:

stream : file-like

Stream to read the pack from

key : str

Result from hit_pack().

Returns:

list of (filename, contents)
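For illustration, the reverse of the packing step can be sketched by walking the hit-pack layout documented earlier. This is not the library's implementation, and key verification is omitted:

```python
import struct

def read_hit_pack(stream):
    """Walk the hit-pack layout: 8-byte magic, then per file two
    little-endian uint32 lengths followed by filename and contents."""
    if stream.read(8) != b"HDSTPCK1":
        raise ValueError("not a hit-pack stream")
    files = []
    while True:
        header = stream.read(8)
        if not header:
            break  # end of stream
        name_len, contents_len = struct.unpack("<II", header)
        filename = stream.read(name_len).decode("utf-8")
        files.append((filename, stream.read(contents_len)))
    return files
```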

hashdist.core.source_cache.scatter_files(files, target_dir)

Given a list of filenames and their contents, write them to the file system.

Will not overwrite files (raises an OSError(errno.EEXIST)).

This is typically used together with hit_unpack().

Parameters:

files : list of (filename, contents)

target_dir : str

Filesystem location to emit the files to
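The no-overwrite behaviour can be sketched with O_EXCL, which makes the create-or-fail check atomic. This is a hypothetical re-implementation for illustration only:

```python
import os

def scatter_files_sketch(files, target_dir):
    """Write each (filename, contents) pair under target_dir,
    refusing to overwrite existing files."""
    for filename, contents in files:
        path = os.path.join(target_dir, filename)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        # O_EXCL makes os.open fail with OSError(errno.EEXIST) if path exists
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        with os.fdopen(fd, "wb") as f:
            f.write(contents)
```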