hashdist.core.hasher – Utilities for hashing

class hashdist.core.hasher.DocumentSerializer(wrapped)

Stable, non-Python-specific serialization of nested objects/documents. The primary use case is hashing (see Hasher), specifically hashing of JSON documents, so no de-serialization is implemented. The idea is simply that by hashing a proper serialization format we ensure that we do not weaken the hash function.

The API used is that of hashlib (i.e. an update method).

A core goal is that it should be completely stable, and easy to reimplement in other languages. Thus we stay away from Python-specific pickling mechanisms etc.

Supported types: basic scalars (ints, floats, True, False, None), bytes/unicode strings/buffers, lists/tuples, and dicts.

Additionally, when encountering user-defined objects with the get_secure_hash method, that method is called and the result used as the “serialization”. The method should return a tuple (type_id, secure_hash); the former should be a string representing the “type” of the object (often the fully qualified class name), in order to avoid conflicts with the hashes of other objects, and the latter a hash of the contents.
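
For illustration, a user-defined object participating in this protocol could look like the sketch below; the class name and the way the content hash is computed are hypothetical, only the (type_id, secure_hash) return convention comes from the description above.

    import hashlib

    class SourceItem(object):
        """Hypothetical user-defined object to be embedded in a document."""

        def __init__(self, payload):
            self.payload = payload  # assumed to be a bytes payload

        def get_secure_hash(self):
            # type_id: identifies the "type" of the object (here the fully
            # qualified class name) to avoid clashes with other objects' hashes
            type_id = 'mypackage.SourceItem'
            # secure_hash: a hash of the contents; how it is computed is up
            # to the object itself
            secure_hash = hashlib.sha256(self.payload).hexdigest()
            return (type_id, secure_hash)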

The serialization is “type-safe”, so that "3", 3, and 3.0 all serialize differently. Lists and tuples are treated as the same type ((1,) and [1] serialize identically), as are buffers, byte strings, and Unicode strings (in their UTF-8 encoding).

Note

Currently only string keys are supported for dicts, and the items are serialized in the order of the keys. This is because all Python objects implement comparison, so sorting by arbitrary keys would silently work but could easily lead to misuse (hashes that are not stable across processes).

One could instead sort the keys by their hash (getting rid of the need for comparison), but that would make the hash-stream (and thus the unit tests) much more complicated, and the idea is that this should be reproducible in other languages. However, that remains a possibility for future extension, as long as string keys are treated as they are today.

In order to prevent somebody from constructing colliding documents, each object is hashed with an envelope specifying the type and the length (in number of items in the case of a container, or number of bytes in the case of str/unicode/buffer).

In general, see unit tests for format examples/details.

Parameters:

wrapped : object

wrapped.update is called with strings or buffers to emit the resulting stream (the API of the hashlib hashers)
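
A minimal usage sketch, assuming the serializer is fed the document through the hashlib-style update method mentioned above and that any hashlib hasher can play the role of wrapped:

    import hashlib
    from hashdist.core.hasher import DocumentSerializer

    digest = hashlib.sha256()
    serializer = DocumentSerializer(digest)  # digest.update receives the byte stream
    serializer.update({'name': 'numpy', 'version': 3, 'tags': ['lib', None]})
    print(digest.hexdigest())

    # Type-safety: feeding 3, 3.0 and "3" produces three different streams,
    # and therefore three different digests.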


class hashdist.core.hasher.Hasher(x=None)

Cryptographically hashes buffers or nested objects (“JSON-like” object structures). See DocumentSerializer for more details.

This is the standard hashing method of HashDist.

Methods

format_digest()

The HashDist standard digest.
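
A usage sketch, assuming the optional constructor argument x is simply passed on to update and that format_digest returns the encoding described under format_digest() below:

    from hashdist.core.hasher import Hasher

    h = Hasher()
    h.update({'package': 'zlib', 'deps': ['gcc', 'make']})
    print(h.format_digest())  # presumably a lower-case base32 string, see below

    # Assumed to be equivalent:
    print(Hasher({'package': 'zlib', 'deps': ['gcc', 'make']}).format_digest())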

class hashdist.core.hasher.HashingReadStream(hasher, stream)

Utility for reading from a stream and hashing at the same time.
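
A sketch of the intended usage; the filename is a placeholder, and the read method delegating to the wrapped stream while feeding the hasher is an assumption based on the description:

    import hashlib
    from hashdist.core.hasher import HashingReadStream

    hasher = hashlib.sha256()
    with open('sources.tar.gz', 'rb') as f:
        stream = HashingReadStream(hasher, f)
        while True:
            chunk = stream.read(4096)  # bytes are hashed as they are read
            if not chunk:
                break
            # ... process chunk ...
    print(hasher.hexdigest())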


class hashdist.core.hasher.HashingWriteStream(hasher, stream)

Utility for hashing and writing to a stream at the same time. The stream may be None for convenience.
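
A corresponding sketch for the write side; the write method, and hashing-only behaviour when stream is None, are assumptions based on the description (the filename is a placeholder):

    import hashlib
    from hashdist.core.hasher import HashingWriteStream

    hasher = hashlib.sha256()
    with open('build.json', 'wb') as f:
        out = HashingWriteStream(hasher, f)
        out.write(b'{"name": "zlib"}')  # hashed and written to the file
    print(hasher.hexdigest())

    # With stream=None only the hash is updated; nothing is written anywhere.
    HashingWriteStream(hashlib.sha256(), None).write(b'some bytes')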


hashdist.core.hasher.check_no_floating_point(doc)

Verifies that the document doc does not contain floating-point numbers.
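
A sketch of what such a check could look like; the recursive walk and the exception type are assumptions, only the "no floats anywhere in the document" contract comes from the description:

    def check_no_floating_point_sketch(doc):
        # Hypothetical re-implementation for illustration only.
        if isinstance(doc, float):
            raise TypeError('floating-point value %r not allowed in document' % doc)
        elif isinstance(doc, dict):
            for value in doc.values():
                check_no_floating_point_sketch(value)
        elif isinstance(doc, (list, tuple)):
            for item in doc:
                check_no_floating_point_sketch(item)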

hashdist.core.hasher.format_digest(hasher)

The HashDist standard format for encoding hash digests.

This is one of the cases where it is prudent to just repeat the implementation in the docstring:

base64.b32encode(hasher.digest()[:20]).lower()
Parameters:

hasher : hasher object

An object with a digest method (a Hasher or an object from the hashlib module)
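
Since the implementation is quoted above, usage is straightforward; it works with a Hasher or any hashlib object:

    import hashlib
    from hashdist.core.hasher import format_digest

    h = hashlib.sha256()
    h.update(b'some stream of bytes')
    digest = format_digest(h)  # 32-character lower-case base32 encoding of the first 20 bytes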

hashdist.core.hasher.hash_document(doctype, doc)

Computes a hash from a document. This is done by serializing to as compact JSON as possible with sorted keys, then computing the SHA-256 of the result. The string {doctype}| is prepended to the hashed string and serves to make sure different kinds of documents yield different hashes even if they are otherwise identical.

Some Unicode characters have multiple possible code-point representations, so this definition is not entirely unique; however, this should be considered an extreme corner case. In general it should be very unusual for hashes that are publicly shared/moved beyond one computer to contain anything but ASCII. However, we do not enforce this, in case one wishes to encode references in the local filesystem.

Floating-point numbers are not supported (these have multiple representations).
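
Based on the description, the computation is roughly equivalent to the sketch below; the exact byte-level details (JSON separators, encoding, and whether the result is run through format_digest) are assumptions, and the real implementation is authoritative:

    import base64, hashlib, json

    def hash_document_sketch(doctype, doc):
        # Compact JSON with sorted keys, prefixed by '{doctype}|'
        serialized = json.dumps(doc, sort_keys=True, separators=(',', ':'))
        h = hashlib.sha256()
        h.update(('%s|%s' % (doctype, serialized)).encode('utf-8'))
        # Assumed to be encoded with the standard format_digest encoding
        return base64.b32encode(h.digest()[:20]).lower()

    hash_document_sketch('build-spec', {'name': 'zlib', 'version': '1.2.8'})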

hashdist.core.hasher.prune_nohash(doc)

Returns a copy of the document with every key/value-pair whose key starts with 'nohash_' removed.
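
A sketch of the pruning; whether the removal is applied recursively to nested documents is an assumption, only the 'nohash_' prefix rule comes from the description:

    def prune_nohash_sketch(doc):
        if isinstance(doc, dict):
            return dict((key, prune_nohash_sketch(value))
                        for key, value in doc.items()
                        if not key.startswith('nohash_'))
        elif isinstance(doc, (list, tuple)):
            return [prune_nohash_sketch(item) for item in doc]
        else:
            return doc

    prune_nohash_sketch({'name': 'zlib', 'nohash_log': '...'})  # -> {'name': 'zlib'}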