hashdist.core.hasher – Utilities for hashing
-
class hashdist.core.hasher.DocumentSerializer(wrapped)
Stable non-Python-specific serialization of nested objects/documents. The primary use case is hashing (see Hasher), and specifically hashing of JSON documents, so no deserialization is implemented. The idea is simply that by hashing a proper serialization format we ensure that we don’t weaken the hash function.
The API used is that of hashlib (i.e. an update method).
A core goal is that it should be completely stable and easy to reimplement in other languages. Thus we stay away from Python-specific pickling mechanisms etc.
Supported types: basic scalars (ints, floats, True, False, None); bytes, unicode, and buffers; lists/tuples; and dicts.
Additionally, when encountering user-defined objects with a get_secure_hash method, that method is called and the result is used as the “serialization”. The method should return a tuple (type_id, secure_hash); the former should be a string representing the “type” of the object (often the fully qualified class name), in order to avoid conflicts with the hashes of other objects, and the latter a hash of the contents.
The serialization is “type-safe”, so "3", 3, and 3.0 serialize differently. Lists and tuples are treated as the same ((1,) and [1] are the same), and buffers, strings, and Unicode objects (in their UTF-8 encoding) are also treated as the same.
Note
Currently only string keys are supported for dicts, and the items are serialized in the order of the keys. This is because all Python objects implement comparison, and comparing by arbitrary Python objects could lead to easy misuse (hashes that are not stable across processes).
One could instead sort the keys by their hash (getting rid of comparison), but that would make the hash stream (and thus the unit tests) much more complicated, and the idea is that this should be reproducible in other languages. However, that is a possibility for further extension, as long as string keys are treated as they are today.
In order to prevent somebody from constructing colliding documents, each object is hashed with an envelope specifying the type and the length (in number of items in the case of a container, or number of bytes in the case of str/unicode/buffer).
In general, see unit tests for format examples/details.
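The envelope idea described above can be sketched as follows. This is only an illustration under stated assumptions: the single-letter type tags, the separators, and the names serialize and digest_of are invented here and are not HashDist's actual byte stream (the unit tests define that). The get_secure_hash hook is left out for brevity.

```python
import hashlib


def serialize(obj, emit):
    """Feed a type-tagged, length-prefixed encoding of obj to emit().

    Hypothetical format for illustration only; not HashDist's real stream.
    """
    if obj is None:
        emit(b"N")
    elif isinstance(obj, bool):  # must precede int: bool is a subclass of int
        emit(b"T" if obj else b"F")
    elif isinstance(obj, int):
        emit(b"I" + str(obj).encode("ascii") + b";")
    elif isinstance(obj, float):
        emit(b"D" + repr(obj).encode("ascii") + b";")
    elif isinstance(obj, (bytes, str)):
        # Text is encoded as UTF-8, so b"abc" and "abc" serialize the same.
        data = obj.encode("utf-8") if isinstance(obj, str) else obj
        emit(b"B" + str(len(data)).encode("ascii") + b":" + data)
    elif isinstance(obj, (list, tuple)):
        # Lists and tuples share one tag; the envelope carries the item count.
        emit(b"L" + str(len(obj)).encode("ascii") + b":")
        for item in obj:
            serialize(item, emit)
    elif isinstance(obj, dict):
        if not all(isinstance(key, str) for key in obj):
            raise TypeError("only string keys are supported")
        emit(b"M" + str(len(obj)).encode("ascii") + b":")
        for key in sorted(obj):  # items are emitted in key order
            serialize(key, emit)
            serialize(obj[key], emit)
    else:
        raise TypeError("unsupported type: %r" % type(obj))


def digest_of(obj):
    """Hash obj by streaming its serialization into a hashlib hasher."""
    hasher = hashlib.sha256()
    serialize(obj, hasher.update)
    return hasher.hexdigest()
```

Because every value carries its type and length, ["ab"] and ["a", "b"] produce different streams, which is the property the envelope is meant to guarantee.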
Parameters: wrapped : object
wrapped.update is called with strings or buffers to emit the resulting stream (the API of the hashlib hashers).
Methods
-
class hashdist.core.hasher.Hasher(x=None)
Cryptographically hashes buffers or nested objects (“JSON-like” object structures). See DocumentSerializer for more details.
This is the standard hashing method of HashDist.
Methods
-
format_digest()
The HashDist standard digest.
-
-
class hashdist.core.hasher.HashingReadStream(hasher, stream)
Utility for reading from a stream and hashing at the same time.
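The read-and-hash pattern might look roughly like the sketch below. The class name ReadAndHash and its read signature are assumptions made for illustration; they are not taken from HashDist's implementation.

```python
import hashlib
import io


class ReadAndHash:
    """Hypothetical stand-in showing the pattern, not HashDist's class."""

    def __init__(self, hasher, stream):
        self.hasher = hasher
        self.stream = stream

    def read(self, size=-1):
        data = self.stream.read(size)
        self.hasher.update(data)  # hash exactly the bytes handed to the caller
        return data


hasher = hashlib.sha256()
reader = ReadAndHash(hasher, io.BytesIO(b"hello world"))
while reader.read(4):  # consume the stream in small chunks
    pass
```

Once the stream is exhausted, the hasher has seen every byte that was read, so no second pass over the data is needed.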
Methods
-
class hashdist.core.hasher.HashingWriteStream(hasher, stream)
Utility for hashing and writing to a stream at the same time. The stream may be None for convenience.
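A minimal sketch of the write-side pattern, including the case where the stream is None. The class name HashAndWrite and the write method shown here are illustrative assumptions rather than HashDist's actual code.

```python
import hashlib
import io


class HashAndWrite:
    """Hypothetical stand-in showing the pattern, not HashDist's class."""

    def __init__(self, hasher, stream=None):
        self.hasher = hasher
        self.stream = stream

    def write(self, data):
        self.hasher.update(data)
        if self.stream is not None:  # with stream=None the data is only hashed
            self.stream.write(data)


sink = io.BytesIO()
tee_hasher = hashlib.sha256()
writer = HashAndWrite(tee_hasher, sink)
writer.write(b"hello ")
writer.write(b"world")

only_hasher = hashlib.sha256()
HashAndWrite(only_hasher, None).write(b"hello world")
```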
Methods
-
hashdist.core.hasher.check_no_floating_point(doc)
Verifies that the document doc does not contain floating-point numbers.
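Such a check amounts to a recursive walk over the document. The sketch below uses a hypothetical predicate, contains_float; how the real function reports a violation (for example, which exception it raises) is not stated here, so that part is deliberately left out.

```python
def contains_float(doc):
    """Return True if a float appears anywhere in a nested document.

    Hypothetical helper for illustration; not part of HashDist.
    """
    if isinstance(doc, float):
        return True
    if isinstance(doc, dict):
        return any(contains_float(key) or contains_float(value)
                   for key, value in doc.items())
    if isinstance(doc, (list, tuple)):
        return any(contains_float(item) for item in doc)
    return False
```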
-
hashdist.core.hasher.format_digest(hasher)
The HashDist standard format for encoding hash digests.
This is one of the cases where it is prudent to just repeat the implementation in the docstring:
base64.b32encode(hasher.digest()[:20]).lower()
Parameters: hasher : hasher object
An object with a digest method (a Hasher or an object from the hashlib module).
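As a quick illustration, the one-line implementation quoted above can be applied to a plain hashlib hasher. The input bytes are arbitrary, and under Python 3 the result is a bytes object rather than a str.

```python
import base64
import hashlib

hasher = hashlib.sha256(b"example data")

# Keep the first 20 bytes of the digest and encode them as lowercase base32.
digest = base64.b32encode(hasher.digest()[:20]).lower()
```

Twenty bytes are 160 bits, which is exactly 32 base32 characters, so the encoded digest carries no "=" padding.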
-
hashdist.core.hasher.hash_document(doctype, doc)
Computes a hash from a document. This is done by serializing to as compact JSON as possible with sorted keys, then computing the SHA-256 hash. The string {doctype}| is prepended to the hashed string and serves to make sure different kinds of documents yield different hashes even if they are identical.
Some Unicode characters have multiple possible code-point representations, so this definition is not fully unambiguous; however, this should be considered an extreme corner case. In general it should be very unusual for hashes that are publicly shared or moved beyond one computer to contain anything but ASCII. However, we do not enforce this, in case one wishes to encode references in the local filesystem.
Floating-point numbers are not supported (these have multiple representations).
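The recipe described above can be approximated as in the sketch below. The function name hash_document_sketch, the exact json.dumps options, and the hex encoding of the result are all assumptions; the real function may differ in these details (for instance, it may encode the digest with format_digest).

```python
import hashlib
import json


def hash_document_sketch(doctype, doc):
    """Hypothetical approximation of the described recipe."""
    # Assumed options: sorted keys and no whitespace between tokens.
    body = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    hasher = hashlib.sha256()
    hasher.update(("%s|" % doctype).encode("utf-8"))  # the "{doctype}|" prefix
    hasher.update(body.encode("utf-8"))
    return hasher.hexdigest()
```

Sorting the keys makes the result independent of dict ordering, and the prefix keeps two documents with identical contents but different doctypes from sharing a hash.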
-
hashdist.core.hasher.prune_nohash(doc)
Returns a copy of the document with every key/value pair whose key starts with 'nohash_' removed.
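A minimal sketch of such pruning, assuming it applies recursively to nested dicts and lists (the description says "every" pair, but the recursion and the name prune_nohash_sketch are assumptions). Note that this version turns tuples into lists, which the real function may not do.

```python
def prune_nohash_sketch(doc):
    """Return a copy of doc without keys that start with 'nohash_'.

    Hypothetical helper for illustration; not HashDist's implementation.
    """
    if isinstance(doc, dict):
        return {
            key: prune_nohash_sketch(value)
            for key, value in doc.items()
            if not (isinstance(key, str) and key.startswith("nohash_"))
        }
    if isinstance(doc, (list, tuple)):
        return [prune_nohash_sketch(item) for item in doc]
    return doc
```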