405 lines
20 KiB
Markdown
405 lines
20 KiB
Markdown
Strong (well-distributed and unpredictable) hashes:
|
|
|
|
* Portable implementation of
|
|
[SipHash](https://www.131002.net/siphash/siphash.pdf)
|
|
* HighwayHash, a 5x faster SIMD hash with [security
|
|
claims](https://arxiv.org/abs/1612.06257)
|
|
|
|
## Quick Start
|
|
|
|
To build on a Linux or Mac platform, simply run `make`. For Windows, we provide
|
|
a Visual Studio 2015 project in the `msvc` subdirectory.
|
|
|
|
Run `benchmark` for speed measurements. `sip_hash_test` and `highwayhash_test`
|
|
ensure the implementations return known-good values for a given set of inputs.
|
|
|
|
64-bit SipHash for any CPU:
|
|
|
|
```
|
|
#include "highwayhash/sip_hash.h"
|
|
using namespace highwayhash;
|
|
HH_ALIGNAS(16) const HH_U64 key2[2] = {1234, 5678};
|
|
char in[8] = {1};
|
|
return SipHash(key2, in, 8);
|
|
```
|
|
|
|
64, 128 or 256 bit HighwayHash for the CPU determined by compiler flags:
|
|
|
|
```
|
|
#include "highwayhash/highwayhash.h"
|
|
using namespace highwayhash;
|
|
HH_ALIGNAS(32) const HHKey key = {1, 2, 3, 4};
|
|
char in[8] = {1};
|
|
HHResult64 result; // or HHResult128 or HHResult256
|
|
HHStateT<HH_TARGET> state(key);
|
|
HighwayHashT(&state, in, 8, &result);
|
|
```
|
|
|
|
64, 128 or 256 bit HighwayHash for the CPU on which we're currently running:
|
|
|
|
```
|
|
#include "highwayhash/highwayhash_target.h"
|
|
#include "highwayhash/instruction_sets.h"
|
|
using namespace highwayhash;
|
|
HH_ALIGNAS(32) const HHKey key = {1, 2, 3, 4};
|
|
char in[8] = {1};
|
|
HHResult64 result; // or HHResult128 or HHResult256
|
|
InstructionSets::Run<HighwayHash>(key, in, 8, &result);
|
|
```
|
|
|
|
C-callable 64-bit HighwayHash for the CPU on which we're currently running:
|
|
|
|
#include "highwayhash/c_bindings.h"
|
|
const uint64_t key[4] = {1, 2, 3, 4};
|
|
char in[8] = {1};
|
|
return HighwayHash64(key, in, 8);
|
|
|
|
Printing a 256-bit result in a hexadecimal format similar to sha1sum:
|
|
|
|
HHResult256 result;
|
|
printf("%016"PRIx64"%016"PRIx64"%016"PRIx64"%016"PRIx64"\n",
|
|
result[3], result[2], result[1], result[0]);
|
|
|
|
## Introduction
|
|
|
|
Hash functions are widely used, so it is desirable to increase their speed and
|
|
security. This package provides two 'strong' (well-distributed and
|
|
unpredictable) hash functions: a faster version of SipHash, and an even faster
|
|
algorithm we call HighwayHash.
|
|
|
|
SipHash is a fast but 'cryptographically strong' pseudo-random function by
|
|
Aumasson and Bernstein [https://www.131002.net/siphash/siphash.pdf].
|
|
|
|
HighwayHash is a new way of mixing inputs which may inspire new
|
|
cryptographically strong hashes. Large inputs are processed at a rate of 0.24
|
|
cycles per byte, and latency remains low even for small inputs. HighwayHash is
|
|
faster than SipHash for all input sizes, with 5 times higher throughput at 1
|
|
KiB. We discuss design choices and provide statistical analysis and preliminary
|
|
cryptanalysis in https://arxiv.org/abs/1612.06257.
|
|
|
|
## Applications
|
|
|
|
Unlike prior strong hashes, these functions are fast enough to be recommended
|
|
as safer replacements for weak hashes in many applications. The additional CPU
|
|
cost appears affordable, based on profiling data indicating C++ hash functions
|
|
account for less than 0.25% of CPU usage.
|
|
|
|
Hash-based selection of random subsets is useful for A/B experiments and similar
|
|
applications. Such random generators are idempotent (repeatable and
|
|
deterministic), which is helpful for parallel algorithms and testing. To avoid
|
|
bias, it is important that the hash function be unpredictable and
|
|
indistinguishable from a uniform random generator. We have verified the bit
|
|
distribution and avalanche properties of SipHash and HighwayHash.
|
|
|
|
64-bit hashes are also useful for authenticating short-lived messages such as
|
|
network/RPC packets. This requires that the hash function withstand
|
|
differential, length extension and other attacks. We have published a formal
|
|
security analysis for HighwayHash. New cryptanalysis tools may still need to be
|
|
developed for further analysis.
|
|
|
|
Strong hashes are also important parts of methods for protecting hash tables
|
|
against unacceptable worst-case behavior and denial of service attacks
|
|
(see "hash flooding" below).
|
|
|
|
128 and 256-bit hashes can be useful for verifying data integrity (checksums).
|
|
|
|
## SipHash
|
|
|
|
Our SipHash implementation is a fast and portable drop-in replacement for
|
|
the reference C code. Outputs are identical for the given test cases (messages
|
|
between 0 and 63 bytes).
|
|
|
|
Interestingly, it is about twice as fast as a SIMD implementation using SSE4.1
|
|
(https://goo.gl/80GBSD). This is presumably due to the lack of SIMD bit rotate
|
|
instructions prior to AVX-512.
|
|
|
|
SipHash13 is a faster but weaker variant with one mixing round per update and
|
|
three during finalization.
|
|
|
|
We also provide a data-parallel 'tree hash' variant that enables efficient SIMD
|
|
while retaining safety guarantees. This is about twice as fast as SipHash, but
|
|
does not return the same results.
|
|
|
|
## HighwayHash
|
|
|
|
We have devised a new way of mixing inputs with SIMD multiply and permute
|
|
instructions. The multiplications are 32x32 -> 64 bits and therefore infeasible
|
|
to reverse. Permuting equalizes the distribution of the resulting bytes.
|
|
|
|
The internal state is quite large (1024 bits) but fits within SIMD registers.
|
|
Due to limitations of the AVX2 instruction set, the registers are partitioned
|
|
into two 512-bit halves that remain independent until the reduce phase. The
|
|
algorithm outputs 64 bit digests or up to 256 bits at no extra cost.
|
|
|
|
In addition to high throughput, the algorithm is designed for low finalization
|
|
cost. The result is more than twice as fast as SipTreeHash.
|
|
|
|
We also provide an SSE4.1 version (80% as fast for large inputs and 95% as fast
|
|
for short inputs), an implementation for VSX on POWER and a portable version
|
|
(10% as fast). A third-party ARM implementation is referenced below.
|
|
|
|
Statistical analyses and preliminary cryptanalysis are given in
|
|
https://arxiv.org/abs/1612.06257.
|
|
|
|
## Versioning and stability
|
|
|
|
Now that 21 months have elapsed since their initial release, we have declared
|
|
all (64/128/256 bit) variants of HighwayHash frozen, i.e. unchanging forever.
|
|
|
|
SipHash and HighwayHash are 'fingerprint functions' whose input -> hash
|
|
mapping will not change. This is important for applications that write hashes to
|
|
persistent storage.
|
|
|
|
## Speed measurements
|
|
|
|
To measure the CPU cost of a hash function, we can either create an artificial
|
|
'microbenchmark' (easier to control, but probably not representative of the
|
|
actual runtime), or insert instrumentation directly into an application (risks
|
|
influencing the results through observer overhead). We provide novel variants of
|
|
both approaches that mitigate their respective disadvantages.
|
|
|
|
profiler.h uses software write-combining to stream program traces to memory
|
|
with minimal overhead. These can be analyzed offline, or when memory is full,
|
|
to learn how much time was spent in each (possibly nested) zone.
|
|
|
|
nanobenchmark.h enables cycle-accurate measurements of very short functions.
|
|
It uses CPU fences and robust statistics to minimize variability, and also
|
|
avoids unrealistic branch prediction effects.
|
|
|
|
We compile the 64-bit C++ implementations with a patched GCC 4.9 and run on a
|
|
single idle core of a Xeon E5-2690 v3 clocked at 2.6 GHz. CPU cost is measured
|
|
as cycles per byte for various input sizes:
|
|
|
|
Algorithm | 8 | 31 | 32 | 63 | 64 | 1024
|
|
---------------- | ----- | ---- | ---- | ---- | ---- | ----
|
|
HighwayHashAVX2 | 7.34 | 1.81 | 1.71 | 1.04 | 0.95 | 0.24
|
|
HighwayHashSSE41 | 8.00 | 2.11 | 1.75 | 1.13 | 0.96 | 0.30
|
|
SipTreeHash | 16.51 | 4.57 | 4.09 | 2.22 | 2.29 | 0.57
|
|
SipTreeHash13 | 12.33 | 3.47 | 3.06 | 1.68 | 1.63 | 0.33
|
|
SipHash | 8.13 | 2.58 | 2.73 | 1.87 | 1.93 | 1.26
|
|
SipHash13 | 6.96 | 2.09 | 2.12 | 1.32 | 1.33 | 0.68
|
|
|
|
SipTreeHash is slower than SipHash for small inputs because it processes blocks
|
|
of 32 bytes. AVX2 and SSE4.1 HighwayHash are faster than SipHash for all input
|
|
sizes due to their highly optimized handling of partial vectors.
|
|
|
|
Note that previous measurements included the initialization of their input,
|
|
which dramatically increased timings especially for small inputs.
|
|
|
|
## CPU requirements
|
|
|
|
SipTreeHash(13) requires an AVX2-capable CPU (e.g. Haswell). HighwayHash
|
|
includes a dispatcher that chooses the implementation (AVX2, SSE4.1, VSX or
|
|
portable) at runtime, as well as a directly callable function template that can
|
|
only run on the CPU for which it was built. SipHash(13) and
|
|
ScalarSipTreeHash(13) have no particular CPU requirements.
|
|
|
|
### AVX2 vs SSE4
|
|
|
|
When both AVX2 and SSE4 are available, the decision whether to use AVX2 is
|
|
non-obvious. AVX2 vectors are twice as wide, but require a higher power license
|
|
(integer multiplications count as 'heavy' instructions) and can thus reduce the
|
|
clock frequency of the core or entire socket(!) on Haswell systems. This
|
|
partially explains the observed 1.25x (not 2x) speedup over SSE4. Moreover, it
|
|
is inadvisable to only sporadically use AVX2 instructions because there is also
|
|
a ~56K cycle warmup period during which AVX2 operations are slower, and Haswell
|
|
can even stall during this period. Thus, we recommend avoiding AVX2 for
|
|
infrequent hashing if the rest of the application is also not using AVX2. For
|
|
any input larger than 1 MiB, it is probably worthwhile to enable AVX2.
|
|
|
|
### SIMD implementations
|
|
|
|
Our x86 implementations use custom vector classes with overloaded operators
|
|
(e.g. `const V4x64U a = b + c`) for type-safety and improved readability vs.
|
|
compiler intrinsics (e.g. `const __m256i a = _mm256_add_epi64(b, c)`).
|
|
The VSX implementation uses built-in vector types alongside Altivec intrinsics.
|
|
A high-performance third-party ARM implementation is mentioned below.
|
|
|
|
### Dispatch
|
|
|
|
Our instruction_sets dispatcher avoids running newer instructions on older CPUs
|
|
that do not support them. However, intrinsics, and therefore also any vector
|
|
classes that use them, require (on GCC < 4.9 or Clang < 3.9) a compiler flag
|
|
that also allows the compiler to generate code for that CPU. This means the
|
|
intrinsics must be placed in separate translation units that are compiled with
|
|
the required flags. It is important that these source files and their headers
|
|
not define any inline functions, because that might break the one definition
|
|
rule and cause crashes.
|
|
|
|
To minimize dispatch overhead when hashes are computed often (e.g. in a loop),
|
|
we can inline the hash function into its caller using templates. The dispatch
|
|
overhead will only be paid once (e.g. before the loop). The template mechanism
|
|
also avoids duplicating code in each CPU-specific implementation.
|
|
|
|
## Defending against hash flooding
|
|
|
|
To mitigate hash flooding attacks, we need to take both the hash function and
|
|
the data structure into account.
|
|
|
|
We wish to defend (web) services that utilize hash sets/maps against
|
|
denial-of-service attacks. Such data structures assign attacker-controlled
|
|
input messages `m` to a hash table bin `b` by computing the hash `H(s, m)`
|
|
using a hash function `H` seeded by `s`, and mapping it to a bin with some
|
|
narrowing function `b = R(h)`, discussed below.
|
|
|
|
Attackers may attempt to trigger 'flooding' (excessive work in insertions or
|
|
lookups) by finding multiple `m` that map to the same bin. If the attacker has
|
|
local access, they can do far worse, so we assume the attacker can only issue
|
|
remote requests. If the attacker is able to send large numbers of requests,
|
|
they can already deny service, so we need only ensure the attacker's cost is
|
|
sufficiently large compared to the service's provisioning.
|
|
|
|
If the hash function is 'weak', attackers can easily generate 'hash collisions'
|
|
(inputs mapping to the same hash values) that are independent of the seed. In
|
|
other words, certain input messages will cause collisions regardless of the seed
|
|
value. The author of SipHash has published C++ programs to generate such
|
|
'universal (key-independent) multicollisions' for CityHash and Murmur. Similar
|
|
'differential' attacks are likely possible for any hash function consisting only
|
|
of reversible operations (e.g. addition/multiplication/rotation) with a constant
|
|
operand. `n` requests with such inputs cause `n^2` work for an unprotected hash
|
|
table, which is unacceptable.
|
|
|
|
By contrast, 'strong' hashes such as SipHash or HighwayHash require infeasible
|
|
attacker effort to find a hash collision (an expected 2^32 guesses of `m` per
|
|
the birthday paradox) or recover the seed (2^63 requests). These security claims
|
|
assume the seed is secret. It is reasonable to suppose `s` is initially unknown
|
|
to attackers, e.g. generated on startup or even per-connection. A timing attack
|
|
by Wool/Bar-Yosef recovers 13-bit seeds by testing all 8K possibilities using
|
|
millions of requests, which takes several days (even assuming unrealistic 150 us
|
|
round-trip times). It appears infeasible to recover 64-bit seeds in this way.
|
|
|
|
However, attackers are only looking for multiple `m` mapping to the same bin
|
|
rather than identical hash values. We assume they know or are able to discover
|
|
the hash table size `p`. It is common to choose `p = 2^i` to enable an efficient
|
|
`R(h) := h & (p - 1)`, which simply retains the lower hash bits. It may be
|
|
easier for attackers to compute partial collisions where only the lower `i` bits
|
|
match. This can be prevented by choosing a prime `p` so that `R(h) := h % p`
|
|
incorporates all hash bits. The costly modulo operation can be avoided by
|
|
multiplying with the inverse (https://goo.gl/l7ASm8). An interesting alternative
|
|
suggested by Kyoung Jae Seo chooses a random subset of the `h` bits. Such an `R`
|
|
function can be computed in just 3 cycles using PEXT from the BMI2 instruction
|
|
set. This is expected to defend against SAT-solver attacks on the hash bits at a
|
|
slightly lower cost than the multiplicative inverse method, and still allows
|
|
power-of-two table sizes.
|
|
|
|
Summary thus far: given a strong hash function and secret seed, it appears
|
|
infeasible for attackers to generate hash collisions because `s` and/or `R` are
|
|
unknown. However, they can still observe the timings of data structure
|
|
operations for various `m`. With typical table sizes of 2^10 to 2^17 entries,
|
|
attackers can detect some 'bin collisions' (inputs mapping to the same bin).
|
|
Although this will be costly for the attacker, they can then send many instances
|
|
of such inputs, so we need to limit the resulting work for our data structure.
|
|
|
|
Hash tables with separate chaining typically store bin entries in a linked list,
|
|
so worst-case inputs lead to unacceptable linear-time lookup cost. We instead
|
|
seek optimal asymptotic worst-case complexity for each operation (insertion,
|
|
deletion and lookups), which is a constant factor times the logarithm of the
|
|
data structure size. This naturally leads to a tree-like data structure for each
|
|
bin. The Java8 HashMap only replaces its linked list with trees when needed.
|
|
This leads to additional cost and complexity for deciding whether a bin is a
|
|
list or tree.
|
|
|
|
Our first proposal (suggested by Github user funny-falcon) avoids this overhead
|
|
by always storing one tree per bin. It may also be worthwhile to store the first
|
|
entry directly in the bin, which avoids allocating any tree nodes in the common
|
|
case where bins are sparsely populated. What kind of tree should be used?
|
|
|
|
Given SipHash and HighwayHash provide high quality randomness, depending on
|
|
expecting attack surface simple non-balancing binary search tree could perform
|
|
reasonably well. [Wikipedia says](https://en.wikipedia.org/wiki/Binary_search_tree#Definition)
|
|
> After a long intermixed sequence of random insertion and deletion, the
|
|
> expected height of the tree approaches square root of the number of keys, √n,
|
|
> which grows much faster than log n.
|
|
|
|
While `O(√n)` is much larger than `O(log n)`, it is still much smaller than `O(n)`.
|
|
And it will certainly complicate the timing attack, since the time of operation
|
|
on collisioned bin will grow slower.
|
|
|
|
If stronger safety guarantees are needed, then a balanced tree should be used.
|
|
Scapegoat and splay trees only offer amortized complexity guarantees, whereas
|
|
treaps require an entropy source and have higher constant factors in practice.
|
|
Self-balancing structures such as 2-3 or red-black trees require additional
|
|
bookkeeping information. We can hope to reduce rebalancing cost by realizing
|
|
that the output bits of strong `H` functions are uniformly distributed. When
|
|
using them as keys instead of the original message `m`, recent relaxed balancing
|
|
schemes such as left-leaning red-black or weak AVL trees may require fewer tree
|
|
rotations to maintain their invariants. Note that `H` already determines the
|
|
bin, so we should only use the remaining bits. 64-bit hashes are likely
|
|
sufficient for this purpose, and HighwayHash generates up to 256 bits. It seems
|
|
unlikely that attackers can craft inputs resulting in worst cases for both the
|
|
bin index and tree key without being able to generate hash collisions, which
|
|
would contradict the security claims of strong hashes. Even if they succeed, the
|
|
relaxed tree balancing still guarantees an upper bound on height and therefore
|
|
the worst-case operation cost. For the AVL variant, the constant factors are
|
|
slightly lower than for red-black trees.
|
|
|
|
The second proposed approach uses augmented/de-amortized cuckoo hash tables
|
|
(https://goo.gl/PFwwkx). These guarantee worst-case `log n` bounds for all
|
|
operations, but only if the hash function is 'indistinguishable from random'
|
|
(uniformly distributed regardless of the input distribution), which is claimed
|
|
for SipHash and HighwayHash but certainly not for weak hashes.
|
|
|
|
Both alternatives retain good average case performance and defend against
|
|
flooding by limiting the amount of extra work an attacker can cause. The first
|
|
approach guarantees an upper bound of `log n` additional work even if the hash
|
|
function is compromised.
|
|
|
|
In summary, a strong hash function is not, by itself, sufficient to protect a
|
|
chained hash table from flooding attacks. However, strong hash functions are
|
|
important parts of two schemes for preventing denial of service. Using weak hash
|
|
functions can slightly accelerate the best-case and average-case performance of
|
|
a service, but at the risk of greatly reduced attack costs and worst-case
|
|
performance.
|
|
|
|
## Third-party implementations / bindings
|
|
|
|
Thanks to Damian Gryski and Frank Wessels for making us aware of these
|
|
third-party implementations or bindings. Please feel free to get in touch or
|
|
raise an issue and we'll add yours as well.
|
|
|
|
By | Language | URL
|
|
--- | --- | ---
|
|
Damian Gryski | Go and x64 assembly | https://github.com/dgryski/go-highway/
|
|
Simon Abdullah | NPM package | https://www.npmjs.com/package/highwayhash-nodejs
|
|
Lovell Fuller | node.js bindings | https://github.com/lovell/highwayhash
|
|
Andreas Sonnleitner | [WebAssembly](https://github.com/asonnleitner/highwayhash-wasm) and NPM package | https://www.npmjs.com/package/highwayhash-wasm
|
|
Nick Babcock | Rust port | https://github.com/nickbabcock/highway-rs
|
|
Caleb Zulawski | Rust portable SIMD | https://github.com/calebzulawski/autobahn-hash
|
|
Vinzent Steinberg | Rust bindings | https://github.com/vks/highwayhash-rs
|
|
Frank Wessels & Andreas Auernhammer | Go and ARM assembly | https://github.com/minio/highwayhash
|
|
Phil Demetriou | Python 3 bindings | https://github.com/kpdemetriou/highwayhash-cffi
|
|
Jonathan Beard | C++20 constexpr | https://gist.github.com/jonathan-beard/632017faa1d9d1936eb5948ac9186657
|
|
James Cook | Ruby bindings | https://github.com/jamescook/highwayhash
|
|
|
|
## Modules
|
|
|
|
### Hashes
|
|
|
|
* c_bindings.h declares C-callable versions of SipHash/HighwayHash.
|
|
* sip_hash.cc is the compatible implementation of SipHash, and also provides
|
|
the final reduction for sip_tree_hash.
|
|
* sip_tree_hash.cc is the faster but incompatible SIMD j-lanes tree hash.
|
|
* scalar_sip_tree_hash.cc is a non-SIMD version.
|
|
* state_helpers.h simplifies the implementation of the SipHash variants.
|
|
* highwayhash.h is our new, fast hash function.
|
|
* hh_{avx2,sse41,vsx,portable}.h are its various implementations.
|
|
* highwayhash_target.h chooses the best available implementation at runtime.
|
|
|
|
### Infrastructure
|
|
|
|
* arch_specific.h offers byte swapping and CPUID detection.
|
|
* compiler_specific.h defines some compiler-dependent language extensions.
|
|
* data_parallel.h provides a C++11 ThreadPool and PerThread (similar to
|
|
OpenMP).
|
|
* instruction_sets.h and targets.h enable efficient CPU-specific dispatching.
|
|
* nanobenchmark.h measures elapsed times with < 1 cycle variability.
|
|
* os_specific.h sets thread affinity and priority for benchmarking.
|
|
* profiler.h is a low-overhead, deterministic hierarchical profiler.
|
|
* tsc_timer.h obtains high-resolution timestamps without CPU reordering.
|
|
* vector256.h and vector128.h contain wrapper classes for AVX2 and SSE4.1.
|
|
|
|
By Jan Wassenberg <jan.wassenberg@gmail.com> and Jyrki Alakuijala
|
|
<jyrki.alakuijala@gmail.com>, updated 2023-03-29
|
|
|
|
This is not an official Google product.
|