hashdist.core.source_cache: Source cache
The source cache makes sure that one doesn’t have to re-download source code from the net every time one wants to rebuild. For consistency/simplicity, the software builder also requires that local sources are first “uploaded” to the cache.
The source cache currently has explicit support for tarballs, git, and storing files as-is without metadata. A “source item” (tarball, git commit, or set of files) is identified by a secure hash. The generic API in SourceCache.fetch() and SourceCache.unpack() works by using such hashes as keys. The retrieval and unpacking methods are determined by the key prefix:
    sc.fetch('http://python.org/ftp/python/2.7.3/Python-2.7.3.tar.bz2',
             'tar.bz2:ttjyphyfwphjdc563imtvhnn4x4pluh5')
    sc.unpack('tar.bz2:ttjyphyfwphjdc563imtvhnn4x4pluh5', '/your/location/here')
    sc.fetch('https://github.com/numpy/numpy.git',
             'git:35dc14b0a59cf16be8ebdac04f7269ac455d5e43')
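To illustrate how a key prefix can select the retrieval method, here is a minimal sketch. It is not hashdist's implementation: the names split_key, HANDLERS and handler_for are hypothetical, and the only facts taken from this page are the prefixes tar.gz, tar.bz2, git and files.

```python
def split_key(key):
    """Split a source key of the form 'prefix:digest' into its two parts."""
    prefix, sep, digest = key.partition(':')
    if not sep or not prefix or not digest:
        raise ValueError('not a valid source key: %r' % (key,))
    return prefix, digest


# Hypothetical table mapping each documented prefix to a handler name.
HANDLERS = {
    'tar.gz': 'archive',
    'tar.bz2': 'archive',
    'git': 'git',
    'files': 'hit-pack',
}


def handler_for(key):
    """Return the (hypothetical) handler name responsible for this key."""
    prefix, _ = split_key(key)
    try:
        return HANDLERS[prefix]
    except KeyError:
        raise ValueError('unknown source key prefix: %r' % (prefix,))
```

Because the prefix is part of the key, callers only pass two strings (a URL and a key) and never need to name the retrieval mechanism explicitly.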
For cases where one doesn’t know the key up front, one uses the key-retrieving API. This is typically done in interactive settings to aid distribution/package developers:
    key1 = sc.fetch_git('https://github.com/numpy/numpy.git', 'master', 'numpy')
    key2 = sc.fetch_archive('http://python.org/ftp/python/2.7.3/Python-2.7.3.tar.bz2')
Features
- Avoids re-downloading all the sources on each build, which gets old quickly.
- Native support for multiple retrieval mechanisms. This is important as one wants to use tarballs for slowly-changing stable code, but VCS for quickly-changing code.
- Isolates upper layers from the details of the various source code retrieval mechanisms; they can simply pass along two strings regardless of method.
- Safety: Hashes are re-checked on the fly while unpacking, to protect against corruption or tainting of the source cache.
- Should be safe for multiple users to share a source cache directory on a shared file-system as long as all have write access, though this may need some work with permissions.
Source keys
The keys for a given source item can be determined a priori. The rules are as follows:
- Tarballs/archives: SHA-256, encoded in base64 using format_digest(). The prefix is currently either tar.gz or tar.bz2.
- Git commits: Identified by their commit (SHA-1) hashes, prefixed with git:.
- Individual files or directories (“hit-pack”): A tarball hash is not deterministic from the file contents alone (there’s metadata, compression, etc.). In order to hash build scripts etc. with hashes based on the contents alone, we use a custom “archive format” as the basis of the hash stream. The format starts with the 8-byte magic string “HDSTPCK1”, followed by each file, sorted by filename (which may contain “/”). Each file is stored as:
  - little-endian uint32_t: length of filename
  - little-endian uint32_t: length of contents
  - filename (no terminating null)
  - contents
  This stream is then hashed and encoded like archives (SHA-256 in base64), and prefixed with files: to get the key.
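The hit-pack layout described above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: pack_files and encode_digest are hypothetical names, filenames are assumed to be encoded as UTF-8, and the digest encoding (URL-safe base64 with padding stripped) is a guess standing in for whatever format_digest() really does, so keys produced here may not match real hashdist keys.

```python
import base64
import hashlib
import io
import struct

PACK_MAGIC = b'HDSTPCK1'  # 8-byte magic string from the format description


def encode_digest(digest):
    # ASSUMPTION: stands in for format_digest(); the real encoding may differ.
    return base64.urlsafe_b64encode(digest).rstrip(b'=').decode('ascii')


def pack_files(files, stream=None):
    """Hash (and optionally write) files in the hit-pack layout.

    files is a list of (filename, contents) with contents as bytes.
    """
    hasher = hashlib.sha256()

    def emit(data):
        hasher.update(data)
        if stream is not None:
            stream.write(data)

    emit(PACK_MAGIC)
    # Sorting by filename makes the result independent of input order.
    for filename, contents in sorted(files, key=lambda item: item[0]):
        name = filename.encode('utf-8')  # ASSUMPTION: UTF-8 filenames
        emit(struct.pack('<II', len(name), len(contents)))  # two LE uint32
        emit(name)       # no terminating null
        emit(contents)
    return 'files:' + encode_digest(hasher.digest())
```

Since only filenames and contents enter the stream, two file sets with identical contents get the same key regardless of timestamps, permissions, or the order in which they are listed.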
Module reference
class hashdist.core.source_cache.ProgressSpinner
Replacement for ProgressBar when we don’t know the file length.
class hashdist.core.source_cache.SourceCache(cache_path, logger, mirrors=(), create_dirs=False)
Methods
static create_from_config(config, logger, create_dirs=False)
Creates a SourceCache from the settings in the configuration.
fetch(*args, **kwargs)
Fetch sources whose key is known.
This is the method to use in automated settings. If the sources globally identified by key are already present in the cache, the method returns immediately; otherwise it attempts to download the sources from url. How to interpret the URL is determined by the prefix of key.
Parameters:
url : str or None
Location to download sources from. Exact meaning depends on prefix of key. If None is passed, an exception is raised if the source object is not present.
key : str
Globally unique key for the source object.
repo_name : str or None
A unique ID for the source code repo; required for git and ignored otherwise. This must be present because a git “project” is distributed and cannot be deduced from URL (and pulling everything into the same repo was way too slow). Hopefully this can be mended in the future.
fetch_archive(url, type=None)
Fetches a tarball without knowing the key up-front.
In automated settings, fetch() should be used instead.
Parameters:
url : str
Where to download archive from. Local files can be specified by prepending "file:" to the path.
type : str (optional)
Type of archive, such as "tar.gz", "tar.bz2". For use when this cannot be determined from the suffix of the url.
fetch_git(repository, rev, repo_name)
Fetches source code from a git repository.
With this method one does not need to know a specific commit, but can use a generic git rev such as master or refs/heads/master. In automated settings or if the commit hash is known exactly, fetch() should be used instead.
Parameters:
repository : str
The repository URL (forwarded to git)
rev : str
The rev to download (forwarded to git)
repo_name : str
A unique name to use for the repository, e.g., numpy. This is currently required because git doesn’t seem to allow getting a unique ID for a remote repo; and cloning all repos into the same git repo has scalability issues.
Returns:
key : str
The globally unique key; this is the git commit SHA-1 hash prefixed with git:.
put(files)
Put in-memory contents into the source cache.
Parameters:
files : dict or list of (filename, contents)
The contents of the archive. filename may contain forward slashes / as path separators. contents is a bytes object which will be dumped directly to the stream.
Returns:
key : str
The resulting key; it has the files: prefix.
unpack(key, target_path)
Unpacks the sources identified by key to target_path.
The sources are verified against their secure hash to guard against corruption/security problems. CorruptSourceCacheError is raised if the verification fails. In normal circumstances this should never happen.
The archive will be loaded into memory, checked against the hash, and then extracted from the memory copy, so that attacks through tampering with on-disk archives should not be possible.
Parameters:
key : str
The source item key/secure hash
target_path : str
Path to extract in
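The load, verify, then extract-from-memory sequence described for unpack() can be sketched as follows. This is a sketch under assumptions rather than the actual method: verified_extract and CorruptArchiveError are hypothetical names, and the expected hash is passed as a plain SHA-256 hex digest instead of a hashdist key, because the exact key encoding is defined by format_digest().

```python
import hashlib
import io
import tarfile


class CorruptArchiveError(Exception):
    """Hypothetical stand-in for CorruptSourceCacheError."""


def verified_extract(archive_path, expected_sha256_hex, target_path):
    """Extract a tarball only after its in-memory copy passes the hash check."""
    # Read the whole archive once; later steps never touch the disk copy again.
    with open(archive_path, 'rb') as f:
        data = f.read()

    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256_hex:
        raise CorruptArchiveError(
            'hash mismatch for %s: expected %s, got %s'
            % (archive_path, expected_sha256_hex, actual))

    # Extract from the verified bytes, not from the file on disk, so a file
    # swapped between the check and the extraction cannot be unpacked.
    with tarfile.open(fileobj=io.BytesIO(data), mode='r:*') as tar:
        tar.extractall(target_path)
```

Reading the file once and extracting from that same buffer closes the window between checking and using the archive. The cost is that the whole archive must fit in memory.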
class hashdist.core.source_cache.TarSubprocessHandler(logger)
Call external tar.
This handler should only be used as a fallback; it lacks some features and/or depends on the vagaries of the host tar.
hashdist.core.source_cache.hit_pack(files, stream=None)
Packs the given files in the “hit-pack” format documented above, and returns the resulting key. This is useful to hash a set of files solely by their contents, not metadata, except the filename.
Parameters:
files : list of (filename, contents)
The contents of the archive. filename may contain forward slashes / as path separators. contents is a bytes object which will be dumped directly to the stream.
stream : file-like (optional)
Result of the packing, or None if one only wishes to know the hash.
Returns:
key : str
The key of the resulting pack (e.g., files:cmRX4RyxU63D9Ciq8ZAfxWGjdMMOXn2mdCwHQqM4Zjw).
hashdist.core.source_cache.hit_unpack(stream, key)
Unpacks the files in the “hit-pack” format documented above, verifies that it matches the given key, and returns the contents (in memory).
Parameters:
stream : file-like
Stream to read the pack from
key : str
Result from hit_pack().
Returns:
files : list of (filename, contents)
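A reader for the hit-pack layout might look like the sketch below. It is illustrative only: unpack_files and PackVerificationError are hypothetical names, filenames are assumed to be UTF-8, and the key is checked using URL-safe base64 without padding, which is an assumption standing in for format_digest().

```python
import base64
import hashlib
import io
import struct

PACK_MAGIC = b'HDSTPCK1'


class PackVerificationError(Exception):
    """Hypothetical error for a pack that is malformed or fails the key check."""


def _read_exact(buf, n):
    data = buf.read(n)
    if len(data) != n:
        raise PackVerificationError('truncated pack')
    return data


def unpack_files(stream, key):
    """Read a hit-pack stream, check it against key, return (filename, contents) pairs."""
    raw = stream.read()
    # ASSUMPTION: digest encoding; the real one is defined by format_digest().
    digest = base64.urlsafe_b64encode(hashlib.sha256(raw).digest())
    if key != 'files:' + digest.rstrip(b'=').decode('ascii'):
        raise PackVerificationError('pack does not match key %s' % key)

    buf = io.BytesIO(raw)
    if _read_exact(buf, len(PACK_MAGIC)) != PACK_MAGIC:
        raise PackVerificationError('bad magic string')

    files = []
    while True:
        header = buf.read(8)
        if not header:
            break  # clean end of stream
        if len(header) != 8:
            raise PackVerificationError('truncated pack')
        name_len, contents_len = struct.unpack('<II', header)
        filename = _read_exact(buf, name_len).decode('utf-8')
        contents = _read_exact(buf, contents_len)
        files.append((filename, contents))
    return files
```

The hash is checked on the complete byte stream before any record is parsed, so malformed length fields from a tampered pack are never interpreted.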
hashdist.core.source_cache.scatter_files(files, target_dir)
Given a list of filenames and their contents, write them to the file system.
Will not overwrite files (raises an OSError(errno.EEXIST)).
This is typically used together with hit_unpack().
Parameters:
files : list of (filename, contents)
target_dir : str
Filesystem location to emit the files to
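The no-overwrite behaviour can be reproduced with Python's exclusive-creation file mode. The function below is a hypothetical sketch named write_files, not the library's scatter_files; it assumes Python 3 and does not guard against filenames containing "..".

```python
import os


def write_files(files, target_dir):
    """Write (filename, contents) pairs below target_dir without overwriting."""
    for filename, contents in files:
        # Filenames use '/' as separator regardless of platform.
        path = os.path.join(target_dir, *filename.split('/'))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        # Mode 'x' creates the file exclusively; if it already exists this
        # raises FileExistsError, an OSError whose errno is errno.EEXIST.
        with open(path, 'xb') as f:
            f.write(contents)
```

Exclusive creation makes the existence check and the file creation a single step, so two processes writing to the same directory cannot silently clobber each other's files.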