Cryptographic Hashes
A conversational, jargon-free, and mostly self-contained introduction
A ubiquitous and fundamental building block
As our lives become more digital, Cryptography will inevitably become more and more relevant as well.
It is through Cryptography that digital security, digital privacy, digital identities, and digital assets are made possible.
Hashes (or hash functions) are a fundamental building block of Cryptography.
They are pretty much everywhere:
The “act” of mining, which you may have heard of in the context of Bitcoin, is deeply related to hashes
The unforgeable links that tie everything together in the so-called blockchain are “created” via hashes
The ability to “easily” check the contents of any blockchain (and therefore trust them) is also due to hashes
The digital signatures used to “prove” a user really is who she claims to be are related to hashes, too
The way we ensure that data (files, messages, etc) arrives at its destination in its original (unbreached, intact) form depends on hashes
The storage of our passwords in any online service we use nowadays involves “hashing”
And the list goes on.
On top of their ubiquity, hashes are also a fascinating topic. And if we do our job well, by the end of this reading you should agree with us.
In the next several minutes, we will gently walk you through all the most important concepts related to hashes using an informal and conversational tone. We’ll assume no previous knowledge of the topic, and any sufficiently interested high school student should be able to have a good time. We’ll do our best to explain things in the clearest manner possible, avoiding jargon as much as possible but also not shying away from more nuanced stuff. In our sidenotes, we will refer to additional links for readers who may want to go deeper.
Let’s start by getting an intuitive sense of what hashes are through a toy example.
A fairy tale example
Loosely speaking, a hash is any function that maps data of arbitrary size (input) to data of fixed size (output).
Fear not. Here’s a toy (actually, a fairy tale) example.
Assume a set of 7 dwarf names: Dopey, Sneezy, Bashful, Sleepy, Happy, Grumpy, and Doc. Obviously, these names have different lengths, from 3 characters (Doc) to 7 (Bashful).
If we create a mapping from each name into a number between 0 and 6, we will have then created a hash.
Dopey -> 0
Sneezy -> 1
Bashful -> 2
Sleepy -> 3
Happy -> 4
Grumpy -> 5
Doc -> 6
Voilà! Impressive, right?
Well, in order to really grasp some of the utilities of hash functions, we should first remember that computers are not humans. They work differently.
For example, computers only “understand” zeros and ones (in the binary base). So, even the characters in dwarf names must be converted to sequences of 1s and 0s. There are a few different ways of converting everyday alphabet characters, depending on the computer and its setup.
Here’s an example in Unicode of how a computer might “see” those names:
Dopey = 0100010001101111011100000110010101111001
Sneezy = 010100110110111001100101011001010111101001111001
Bashful = 01000010011000010111001101101000011001100111010101101100
Sleepy = 010100110110110001100101011001010111000001111001
Happy = 0100100001100001011100000111000001111001
Grumpy = 010001110111001001110101011011010111000001111001
Doc = 010001000110111101100011
Let’s also convert the decimal numbers of our hash mapping to the binary base as well:
Dopey -> 0 = 000
Sneezy -> 1 = 001
Bashful -> 2 = 010
Sleepy -> 3 = 011
Happy -> 4 = 100
Grumpy -> 5 = 101
Doc -> 6 = 110
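If you’d like to see these representations for yourself, here’s a small Python sketch (assuming UTF-8 encoding, which coincides with ASCII for these names) that reproduces both the bit strings and the 3-bit toy hash above:

```python
# Reproduce the binary view of the names and the 3-bit toy hash above.
# Assumption: names are encoded as UTF-8 (same as ASCII for these characters).
names = ["Dopey", "Sneezy", "Bashful", "Sleepy", "Happy", "Grumpy", "Doc"]

def to_bits(name):
    # each character becomes one byte, printed as 8 bits
    return "".join(format(byte, "08b") for byte in name.encode("utf-8"))

toy_hash = {name: format(i, "03b") for i, name in enumerate(names)}  # 3-bit outputs

for name in names:
    print(f"{name:<8} {to_bits(name)} -> {toy_hash[name]}")
```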
We’ll now be able to make a fairer comparison. Suppose you are a computer and you want to store this set in your memory – which is made of “spaces” for zeros and ones. Our hash clearly uses much less memory space than the standard Unicode representation, right?
What if you need to distinguish between these names? Using our hash to do so makes it faster, too.
Why is it faster? In our hash, we know by definition that each element’s length in binary is 3 (called 3 bits) and that’s all we need to differentiate them.
On the other hand, the names in Unicode all start with the same sequence of 3 bits (010) and are therefore indistinguishable within these 3 bits.
Also, you will only be able to distinguish between Dopey (0100010001101111011100000110010101111001) and Doc (010001000110111101100011) after going through their first 19 bits (0100010001101111011). How inefficient!
Based on this fairy tale example, we can tell that hashes seem like neat substitutes used by computers. The potential gains in space and in speed we just saw hint at some of the advantages of using hashes. There are even more advantages to them – but, as is often the case in life, there are also trade-offs.
Let’s dive in and learn about them (and more)!
What properties make a great cryptographic hash function?
First, please note: hashes have several important applications in Computer Science beyond Cryptography (e.g. databases, file comparison, etc) and, depending on where and how hashes are used, different properties (sometimes opposite ones!) are needed.
In this text, we will be presenting and discussing only their uses and their desirable properties in Cryptography. So, please beware.
What are we looking for in a cryptographic hash?
Here’s a list of properties:
- Deterministic
- Fixed output range of bits
- As few collisions as possible
- Easy to compute but (at the same time) as hard to invert as possible
Why these and not others? Well, explaining what we mean by them should help you understand where we’re coming from:
Deterministic
For a given input value (e.g. a dwarf name) the hash function must always generate the same hash (output) value. That is to say, we’d like our hashes to be “strictly predictable”. Actually, any mathematical function satisfies this property. (So, as you might guess, this is not what makes hashes particularly special, but it still is an essential requirement).
Fixed output range of bits
For instance, in our previous fairy tale example, we had a fixed output range of 3 bits. Why? This is due to “practical” reasons: computations can be much faster if computers know a priori what to expect (in this case, the size of output) and “prepare” themselves accordingly.
As few collisions as possible (a.k.a. collision resistance)
It should be very rare to find two inputs generating the same hash output value. When that happens, it is called a collision. In other words, in mathematical terms, we’d like our hash functions to behave like injections, in which each valid input maps to a different output value.
Easy to compute but (at the same time) as hard to invert as possible
We are looking for functions that are very easy to calculate from input to hash output, but very hard to reconstruct (“invert”) an input given its hash output. (The hard-to-invert feature is also known as preimage attack resistance.) They’re called one-way functions.
The last two properties clearly are the most intriguing ones.
Why collision resistance (#3)? Well, if we have two pieces of data producing the same hash output, then we have no way of telling one from another (e.g. imagine two different dwarf names having the same hash number). Even worse, imagine some bad guy interchanges the pieces of input data and shows us only the hash? We’d be easily fooled!
As for one-way functions (#4), this property is mostly related to the idea of irreversibility. If going from output to input is found to be easy, then the function is not irreversible. Hashes can be much more useful in Cryptography if it is “safe” to assume that “there’s no going back” – i.e. that once the hash is done, it has “integrity” and it’s “impossible” to reveal what input is “behind” it.
Summarizing what we’ve just learned: hashes are substitutes that may save our computers time and space and that are believed to be non-interchangeable and irreversible.
On collisions, rarity, and counterintuitiveness
As you went through the list of properties above, you might have asked yourself: what do they mean by very rare in property (#3)? And by very easy and very hard in property (#4)? Why the expressions believed to be and as possible?
Why are we using such conservative, almost self-doubting terms?
Let’s start with the goal of very rare collisions (#3).
First, we must recognize that, in practice, cryptographic hash functions are actually expected to have collisions.
Wait a minute. Expected? What? Aren’t collisions a bad thing?
Well. Recall that a hash is any function that maps data of arbitrary size (input) to data of fixed size (output).
Briefly investigating the sizes of inputs and outputs, let:
- \(\footnotesize m\) be the total number of possible inputs within all the possible (arbitrary) input sizes
- \(\footnotesize n\) be the total number of possible hash outputs given its fixed output size
If \(\footnotesize m > n\), then, from the pigeonhole principle (a famous counting argument exemplified in real life by truisms like “in any group of three gloves there must be at least two left gloves or two right gloves” – check Wikipedia for more), at least \(\footnotesize (m - n) + 1\) inputs will inevitably have duplicated hash outputs – i.e., there will be collisions.
That is easily checkable if the hash output is fixed at \(\footnotesize 1\) bit. The only values a \(\footnotesize 1\)-bit output hash can assume are 1 and 0, a total of \(\footnotesize 2\) (or, put alternatively, \(\footnotesize n = 2^1 = 2\)). Indeed, if we have \(\footnotesize m = 3\) possible inputs hashed into a 1-bit output, then \(\footnotesize (3 - 2) + 1 = 2\), i.e., there will be at least 2 inputs with the same hash output – at least one collision.
In our fairy tale example, the hash output was fixed at \(\footnotesize 3\) bits, so the total number of possible outputs is given by \(\footnotesize n = 2*2*2 = 2^3 = 8\). Since our \(\footnotesize m = 7 < n = 8\), it was possible to avoid collisions – something we did indeed avoid.
But what if two new dwarves arrive? Say, Doak and Dolan (making a total of \(\footnotesize 7 + 2 = 9\)). Since our hash output has been fixed at \(\footnotesize 3\) bits, we’d only have \(\footnotesize n = 2^3 = 8\) “slots” for \(\footnotesize m = 9\) names. We will inevitably have two dwarf names with the same hash. There will be at least one collision!
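Here’s a tiny Python sketch of that inevitability. The 3-bit hash used below is a hypothetical stand-in (not our original mapping), but the point holds for any 3-bit hash: nine names cannot fit into eight slots without sharing.

```python
# Pigeonhole in action: 9 names, only 2^3 = 8 possible 3-bit outputs.
from collections import Counter

names = ["Dopey", "Sneezy", "Bashful", "Sleepy", "Happy", "Grumpy", "Doc",
         "Doak", "Dolan"]

def toy_hash(name):
    # hypothetical stand-in hash: squeeze the name length into 8 possible slots
    return format(len(name) % 8, "03b")

outputs = Counter(toy_hash(name) for name in names)
collisions = {h: count for h, count in outputs.items() if count > 1}
print(collisions)  # never empty: at least two names must share a 3-bit output
```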
And that is also the case in real-life hashes.
In (almost?) all practical cases \(\footnotesize m\) is much bigger than \(\footnotesize n\).
Why? This is by design. We want to be able to hash a huge number of things of many different sizes into fixed- and finite-sized outputs so that computations take less space and are much faster. There just cannot be too many of these outputs available.
Also note that even if \(\footnotesize m \leq n\), it may happen that a hash has collisions. It really depends on how “carefully” the mapping of the hash function is conducted.
Finally, we should consider another important observation, one that provides us with practical limits to collision resistance and that is also counterintuitive for most people.
We, as humans, are known for a lack of good intuition on non-linear phenomena. And finding collisions in hashes is one of these circumstances.
In everyday life, you have probably already observed that, despite the year having \(\footnotesize 365\) days, it is pretty common for individuals to share the same birthday (even in small groups of, say, \(\footnotesize 30\) people). At first glance, one would expect this to be much rarer. Probably because our intuition would guess something like a \(\footnotesize \frac{30}{365} \approx 8\%\) chance.
Which, by the way, is completely wrong!
Mathematicians have studied this specific problem before and have even come up with a name for it: the birthday problem (or birthday paradox). (Another way to state it is: “how many people must there be in a room before there is a 50% chance that two of them were born on the same day of the year?”)
How does this relate to hashes? It turns out finding collisions in a hash function and sharing the same birthday are basically the same thing. Consider that:
- the birthdays (more precisely, the \(\footnotesize 365\) days of a year) can be seen as the \(\footnotesize n\) possible hash outputs
- the people in a group can be thought of as the hash inputs
So, every time different people share the same birthday, it is analogous to different hash inputs having the same hash output. Both are collisions!
By digging deeper into the birthday problem, we’re able to quantify it (if you are familiar with Combinatorics principles and with Taylor series, you should definitely read the Appendix below, in which we uncover the actual numbers behind the birthday problem – it is fun!):
Suppose a hash function with \(\footnotesize N\) bits of fixed output (and, therefore, with \(\footnotesize n = 2^N\) possible outputs). There is a sharp increase in the chances of randomly finding at least two matching hash outputs (one collision) after computing only roughly \(\footnotesize 2^{N/2} = n^{1/2} = \sqrt{n}\) hashes.
Numerically, there’s a \(\footnotesize \thicksim10\%\) chance of finding a collision after randomly computing \(\footnotesize \thicksim0.5\sqrt{n}\) hashes. The chances increase to \(\footnotesize \thicksim50\%\) after a total of \(\footnotesize \thicksim1.2\sqrt{n}\) random hashes. In other words, in this regime, after roughly doubling (\(\footnotesize 2x\)) the number of hashes computed, the chances of finding a collision roughly quadruple (\(\footnotesize 4x\))!
Then, after \(\footnotesize \thicksim2.1\sqrt{n}\) hashes, the chances of collision already reach \(\footnotesize \thicksim90\%\).
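As a quick sanity check, here’s a small Python sketch of these figures using the approximation \(\footnotesize P(col, x) \approx 1 - e^{-\frac{x^2}{2n}}\) derived in the Appendix (the 32-bit output size below is just a hypothetical choice):

```python
# Collision probability after x random hashes, via the birthday approximation.
import math

N = 32            # hypothetical fixed output size in bits
n = 2 ** N        # number of possible hash outputs

for k in (0.5, 1.2, 2.1):                 # multiples of sqrt(n) quoted above
    x = k * math.sqrt(n)
    p = 1 - math.exp(-x * x / (2 * n))
    print(f"~{k} * sqrt(n) hashes -> ~{p:.0%} chance of at least one collision")
```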
So, we do not want any collision to occur in real life – one wants to reduce the risks of being fooled, after all. Unfortunately, we now know that:
- collisions will inevitably occur
- even worse, given the birthday problem, collisions are known to occur earlier (in terms of hashes needed) than our intuition would make us expect
What can be done about it then? Unfortunately, not much.
The most relevant “protection” against collisions (i.e. the best way to make them “rarer”) is using as large an \(\footnotesize N\)-bit fixed output length as possible and, therefore, also having a large \(\footnotesize n\).
And, as we will see, that’s something everyone tries to do.
What’s considered to be difficult for computers?
Having gotten a sense of what the “very rare” collisions in property (#3) mean, it’s time to unpack the meaning of “very easy” and “very hard” in our definition of one-way functions (#4).
As we’ve put it, one-way functions are easy to compute but (at the same time) as hard to invert as possible.
Then, we must again examine how computers work.
Our goal in this section will be to build a self-contained and intuitive understanding of what is known as computational complexity.
This will take us a few intermediary steps as we journey through some of the core concepts of Computer Science – from what is an algorithm to discussions about time-space, from attacks of brute force to classes of computational problems, and so on.
Wikipedia is a helpful starting point:
In computer science, the computational complexity (or simply, complexity) of an algorithm is the amount of resources required for running it.
To be able to parse and internalize the essence of the definition above, let’s make sure we get a grasp of three key concepts.
Problems, Algorithms, and Inputs – the elemental concepts
What is a problem in our current context? It’s a question that: (i) is “suitable” for computation; and (ii) describes desired outputs. Also, there must be no room for subjectivity: the desired outputs must be unambiguous (“To be or not to be?” does not qualify as a problem).
What’s an algorithm? It’s a list of unambiguous steps to be conducted by a computer that acts as one possible manual (a “blueprint”) for finding the desired outputs of a problem. (You may not have noticed, but algorithms are pretty much everywhere. The chips inside all electronic devices are designed using algorithms. How devices communicate with each other also relies on algorithms for routing. The list is very long and goes on and on.)
And what is considered an input? It’s data in a format “acceptable” by a computer that, after going through the steps of an algorithm, results in an output that can be checked against the desired one.
A simple example should make things clearer:
Problem: is x odd?
Algorithm:
- take input x and divide it by 2
- if the division remainder is 1, declare x is odd
- but if it is 0, declare x is not odd
Input: x = 312
Output: 312 is not odd
Input: x = 817231
Output: 817231 is odd
Very importantly, several different algorithms that “solve” a problem can coexist (yes, “there’s more than one way to skin a cat”). For instance, here’s another algorithm that also solves the problem above (we’ll sketch both in code right after it):
Another algorithm:
- take input x and multiply it by 3
- take the multiplication result and divide it by 2
- if the division remainder is 1, declare x is odd
- but if it is 0, declare x is not odd
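Here is how both toy algorithms might look in Python – same problem, slightly different steps:

```python
# Two ways of answering "is x odd?"
def is_odd_v1(x):
    return x % 2 == 1            # divide by 2 and check the remainder

def is_odd_v2(x):
    return (x * 3) % 2 == 1      # multiply by 3 first, then check the remainder

print(is_odd_v1(312), is_odd_v2(312))          # False False -> 312 is not odd
print(is_odd_v1(817231), is_odd_v2(817231))    # True True   -> 817231 is odd
```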
OK, so we now know that complexity is related to the resources needed for a computer to do something that can be unambiguously evaluated.
Time and Space – the resources computers need
But what resources does a computer need to do it (or anything for that matter)?
Time and space. (We are making the super reasonable assumptions that computers cannot compute infinitely fast nor have infinite space in their memories.)
Yeah, really. Talk about needs more fundamental than these, huh?
Time is commonly measured by the number of elementary operations conducted when computing an algorithm – so it’s computation time, really.
Some examples of what we will consider elementary operations (we will be using here what is known as the RAM, or random-access machine, model) are: basic arithmetic (addition, subtraction, multiplication, division, remainder, floor, ceiling), data movement (load, store, copy), and control stuff (comparisons, if-then-else branching, etc).
Because there are different ways of modeling and building computers, both in theory and in practice, describing what exactly counts as an elementary operation can be very nuanced. (For example, the computer you are reading this on is not exactly a random-access machine – it most likely does computation concurrently and has a hierarchy of memories. An example of something even more distinct would be a quantum computer, whose way of computing is far from anything we use daily nowadays.)
Fortunately, the main idea is pretty straightforward: elementary operations should be the simplest ones, doable in a roughly constant and tiny unit of time. Practically, we will consider the time of computing any elementary operation to be constant indeed. (We will be using the uniform cost model. A more precise way of accounting for the length \(\footnotesize L\) of an integer \(\footnotesize n\) in computing times is the logarithmic cost model, named after the fact that \(\footnotesize 2^{(L-1)} \leq n \leq (2^{L} - 1)\) and thus \(\footnotesize (L-1) \leq \log_2{(n)} \leq \log_2(2^{L} - 1) < L\).) Hence, adding \(\footnotesize 1+1\) will take the same time as something huge like \(\footnotesize 10^{100}+\pi^{99}\).
Additionally, for our purposes, we will have operations happening only sequentially (i.e., not concurrently). We will also assume that the computation time depends only on the algorithm itself and on the input. (This means computation time will be independent of using a supermodern Mac Pro or the 25-year-old Pentium in grandpa’s basement. Interestingly, it’s been demonstrated that this kind of dependence only results in a “modest loss” when compared to the variation due to algorithms and inputs – search for the Extended Church-Turing Thesis to learn more.)
Now, into space!
Space can be thought of as the size of the memory used in computations (i.e., where all the 0s and 1s live in between operations).
Here’s a nice simple analogy: imagine that we have a pan of a certain given size and that the problem we solve is boiling pigs. The pan can be seen as the memory of our computer and the steps for boiling different pigs as algorithms.
This analogy also serves us well for highlighting a notable link between the resources computers need for computations.
Let’s say we need to boil a huge pig. At first, we could consider boiling the whole creature at once using just an ordinary, small pan. But, well, that would not be possible at all! (Crazy, unexpected things can happen when we try to compute, at once, things too big for any computer’s memory space. This is called an overflow.) We must chop it into smaller pieces and do the boiling sequentially, right?
That is to say: time and space are deeply linked in Computer Science (see Space–time tradeoff on Wikipedia). Things can take longer if your space is too limited, and vice-versa.
One quick remark before moving on to actually quantifying complexity: frequently, when computational complexity is measured, people are implicitly talking about time complexity only. We will do the same here in our exploration, but one must always be careful in real life – for example, in Cryptography it is not uncommon for space complexity to be super relevant.
Quantifying the complexity of problems – here comes the loop
How does one go about comparing the complexities of algorithms?
The trick is to quantify them. Summarizing what we’ve already written about the model we will use for complexity:
- time complexity only (remember that we will leave space considerations out)
- it will depend on the algorithm and on the input
- each elementary operation will take a uniform amount of time given any acceptable input
Now, onto an example (remember that, according to the fairy tale hash function we’ve created, Bashful -> 2 = 010):
Problem: by using the fairy tale hash function, find the input whose hash is 010
Algorithm:
- take hash_output = 010
- for each input_element in the input_list:
- calculate the hash of input_element
- compare the result to hash_output and decide between:
- if they're equal, declare input_element found
- if not, back to line 2 for the next input_element
Input: input_list = [Dopey, Sneezy, Bashful, Sleepy, Happy, Grumpy, Doc]
Output: Bashful found
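In Python, a minimal sketch of this search could look like this:

```python
# Walk through the input list, hash each element with our toy hash,
# and stop at the first element whose hash matches the target output.
toy_hash = {"Dopey": "000", "Sneezy": "001", "Bashful": "010", "Sleepy": "011",
            "Happy": "100", "Grumpy": "101", "Doc": "110"}

def find_preimage(hash_output, input_list):
    for input_element in input_list:
        if toy_hash[input_element] == hash_output:
            return f"{input_element} found"
    return "no matching input found"

input_list = ["Dopey", "Sneezy", "Bashful", "Sleepy", "Happy", "Grumpy", "Doc"]
print(find_preimage("010", input_list))   # Bashful found
```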
Informally, the execution that happened above was:
- our algorithm first tests the hash of Dopey, which is the first element in the input list
- since it is not the hash value (010) it is looking for, it tries again with Sneezy
- since, again, it is not, it then tries Bashful
- the execution succeeds and the algorithm ends
In order to estimate how much time this example would take to compute, let’s tabulate useful data about the execution of each line:
| Line | Time Cost | Times Executed |
|---|---|---|
| `1. take hash_output = 010` | \(\footnotesize c_1\) | 1 |
| `2. for each input_element in the input_list:` | \(\footnotesize c_2\) | 3 |
| `3. calculate the hash of input_element` | \(\footnotesize c_3\) | 3 |
| `4. compare the result to hash_output` | \(\footnotesize c_4\) | 3 |
| `5. if they're equal, declare input_element found` | \(\footnotesize c_5\) | 1 |
| `6. if not, back to line 2 for the next input_element` | \(\footnotesize c_6\) | 2 |
It should be fair to suppose that each of these lines contains one elementary operation. All of them, with the exception of line 3, fit very well into our previous definition: they’re either data movement or basic control stuff.
Line 3 does involve a potentially “complicated” (far from elementary) calculation. And this is where our point about a “very easy to compute” function kind of comes full circle.
If all lines can be assumed to be elementary, then all time costs are the same – let’s call it \(\footnotesize c\). So, \(\footnotesize c_1 = c_2 = c_3 = c_4 = c_5 = c_6 = c\).
The Times Executed column, as its name hints, tells us about the number of times each line of the algorithm is executed given the input we’ve submitted.
Why is that important? Well, since we’ve modelled most things to take the same time cost, the values in this column actually are the most decisive part of our calculation of time complexity (or the total computation time) for our example:
\[\footnotesize complexity = 1*c_1 + 3*c_2 + 3*c_3 + 3*c_4 + 1*c_5 + 2*c_6 =\] \[\footnotesize = (1 + 3 + 3 + 3 + 1 + 2) * c = 13*c\]
In any computer we use nowadays, \(\footnotesize c\) is a tiny fraction of a second. And our algorithm will be executed very quickly (like in a blink) for this input.
As we’ve written, in our model, computation time depends on the algorithm and on the input. There are very intricate algorithms that contain a lot of operations. But typically what matters the most is how many times some operations are repeatedly executed. (The most common type of repetition is called a loop. Almost all relevant algorithms in Computer Science involve several of them.) And that is mostly due to input size and to how an input is “organized”.
What do we mean by input size? (Input size is one of these nuanced concepts – it really depends on the context. For a given situation, the chief aspect of input sizing could be how large the numbers are. In others, it could be how many items there are. And so on.) Having a dwarf called Montgomery, a name certainly longer than Doc, is one way to vary the size. Another one (one that would really move the needle in this case) would be having to go through a list with, say, millions of dwarves and not just seven.
And what is input “organization”?
Well, suppose our `input_list` was `[Bashful, Dopey, Sneezy, Sleepy, Happy, Grumpy, Doc]` – what would be its time complexity? Since Bashful is the first one in the list, each line would be executed at most once!
\[\footnotesize complexity = 1*c_1 + 1*c_2 + 1*c_3 + 1*c_4 + 1*c_5 + 0*c_6 = 5c\]
That clearly is the best case possible. Notice that \(\footnotesize 5c\) is less than half of the \(\footnotesize 13c\) we’ve found before.
And here’s an example of the worst case scenario for an input list with seven items: `input_list = [Dopey, Sneezy, Sleepy, Happy, Grumpy, Doc, Bashful]`.
In effect:
\[\footnotesize complexity = 1*c_1 + 7*c_2 + 7*c_3 + 7*c_4 + 1*c_5 + 6*c_6 = 29c\]
These examples show us how wildly different the computing times can be for the exact same input size if we simply vary how the input is “organized”.
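To make this concrete, here’s a small instrumented sketch (under our uniform-cost assumption, charging one operation per executed line of the algorithm) that reproduces the \(\footnotesize 5c\) and \(\footnotesize 29c\) counts above:

```python
# Count executed "lines" of the search for differently ordered input lists.
toy_hash = {"Dopey": "000", "Sneezy": "001", "Bashful": "010", "Sleepy": "011",
            "Happy": "100", "Grumpy": "101", "Doc": "110"}

def count_operations(hash_output, input_list):
    ops = 1                                        # line 1: take hash_output
    for input_element in input_list:
        ops += 1                                   # line 2: loop step
        ops += 1                                   # line 3: calculate the hash
        ops += 1                                   # line 4: compare the result
        if toy_hash[input_element] == hash_output:
            ops += 1                               # line 5: declare found
            return ops
        ops += 1                                   # line 6: back to the loop
    return ops

best  = ["Bashful", "Dopey", "Sneezy", "Sleepy", "Happy", "Grumpy", "Doc"]
worst = ["Dopey", "Sneezy", "Sleepy", "Happy", "Grumpy", "Doc", "Bashful"]
print(count_operations("010", best))    # 5  -> the 5c best case
print(count_operations("010", worst))   # 29 -> the 29c worst case
```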
When studying computation time, what people are typically most interested in is how time costs grow when input sizes become huge.
For instance, what would be the worst case of our example for an \(\footnotesize n\)-length input list?
For \(\footnotesize n = 7\), we saw that \(\footnotesize complexity = c_1 + 7c_2 + 7c_3 + 7c_4 + c_5 + 6c_6\).
We then argue by analogy that:
\[\footnotesize complexity = c_1 + nc_2 + nc_3 + nc_4 + c_5 + (n-1)c_6 =\] \[\footnotesize = (n + n + n + (n - 1) + 2) * c = (4n + 1) * c\]
Given the above, it is said that the asymptotic rate of growth of the complexity of our algorithm follows a polynomial over the input size \(\footnotesize n\). More specifically, in this case (the worst one), it follows a linear (i.e. degree 1) polynomial. (The best case, as we’ve just seen, is independent of the input size \(\footnotesize n\) and executes in \(\footnotesize 5c\) time.)
Since we’re more interested in its behavior for large \(\footnotesize n\), the lower-order terms will be relatively insignificant. So, it is fine to consider only the leading term \(\footnotesize 4n\). That’s why it is called an asymptotic rate of growth. (If you are not familiar with asymptotes, intuitively take “asymptotic” to roughly mean “when it goes to infinity”.)
As one could expect, there is a variety of asymptotic rates of growth of time complexities:
- constant, which is independent of input size \(\footnotesize n\)
- exponential, in which \(\footnotesize n\) appears in an exponent, like \(\footnotesize 2^n\)
- logarithmic, whose higher order term is of the \(\footnotesize \log_2n\) form
- everything in between, and more
If we calculate the asymptotic rates of growth of different algorithms, then we can easily tell which one is harder in the worst case just by comparing their leading terms. (Of course, “harder” only given the model of computation we’ve chosen, only for large values of the input size, etc.)
We are now finally able to appreciate what is difficult for computers!
A way to define a “hard” problem for computers would be: a problem that, for very large inputs and in most cases, has a big asymptotic rate of growth.
What asymptotic rates of growth are big? Constant and logarithmic cannot make the cut. They simply grow too slowly, well below a linear polynomial. Really hard stuff is super-polynomial! Like exponential or, worse, involving factorials – e.g. \(\footnotesize n!\).
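Just to build a bit of numerical intuition (this is only a rough illustration, not a formal comparison), here’s a small sketch printing a few of these growth rates side by side:

```python
# Compare growth rates: logarithmic, linear, quadratic, exponential, factorial.
import math

print(f"{'n':>3} {'log2(n)':>8} {'n':>4} {'n^2':>6} {'2^n':>14} {'n!':>10}")
for n in (10, 20, 30, 40):
    print(f"{n:>3} {math.log2(n):>8.1f} {n:>4} {n**2:>6} {2**n:>14} "
          f"{math.factorial(n):>10.1e}")
```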
This chart from Wikipedia helps to make the point more visually:
Note that this graph displays only the rates of growth for small \(\footnotesize x\). That’s only the “beginning”. Axes that go up to at most \(\footnotesize 100\) do not represent anything really asymptotic.
Time to revisit our definition of one-way functions (#4): easy to compute but (at the same time) as hard to invert as possible. We now know what “hard” looks like, but we are still missing any link between “hard” and “inverting”.
That turns out to be a very interesting link. And philosophically deep as well.
Keep reading to learn why!
No shortcuts! Brute forcing should be the only way through
Mesdames et messieurs, have you realized that we’ve actually inverted our hash function in the examples above?
Yes, we went from hash output to input there. And that really is an inversion.
Putting it in slightly more technical words: given our hash function (that we used to calculate hashes of inputs) and a specific hash output, we exhaustively searched to find what possible input would map to the output we were interested in.
The inversion we’ve made is a super simple example of a successful preimage attack and, as you may recall, it violates the (#4) property of great hash functions. Sadly, our fairy tale hash function is simply too vulnerable to be considered a great one.
We conducted what is known as a brute-force attack, i.e. we’ve tested every possible solution candidate one by one.
Instead of only seven dwarves as the total amount of possible inputs, there should be a lot more in real life. So, enumerating every possible candidate should not be feasible in practical cases due to time and space limits. (In principle, pretty much anything that can be made digestible by a computer can be considered a potential hash input, from your full name to the entire Wikipedia. The amount of potential hash inputs really depends on the context at hand.)
Examining an example closer to real-life should be illuminating.
Say we want to hash a password with a maximum length of 25 characters that could contain letters, digits, punctuation marks, spaces, and some miscellaneous symbols. (We are considering the printable characters in the ASCII table. There are 95 of these in total.)
How many password possibilities (i.e. input candidates) would an attacker need to try in the worst case?
For the \(\footnotesize 1\)st character, there are \(\footnotesize 95\) possible choices. Then:
- \(\footnotesize 2\) characters: \(\footnotesize 95*95 = 9\,025\) choices
- \(\footnotesize 3\) chars: \(\footnotesize 95*95*95 = {95}^{3} = 857\,375\) choices
- \(\footnotesize 4\) chars: \(\footnotesize {95}^{4} = 81\,450\,625\) choices
- \(\footnotesize ...\)
- \(\footnotesize 25\) chars: \(\footnotesize {95}^{25} = {2\,773\,895\,731} * {1\underbrace{0\,000\,000\,...\,000}_{\text{40 zeros}}}\) choices
That is a huge number indeed. Much, much larger than seven. And we can see that the input possibilities grow exponentially over the length of input.
But is an amount of input candidates like that big enough?
Hypothetically, how long would it take to check all the possible choices on a normal computer like ours?
Well, a precise answer would depend on many details – e.g. on the hash function being used, on how optimized our software is for our hardware, etc. (Here is some software used to crack hashed passwords: John the Ripper, Hashcat, ElcomSoft.)
As an approximation, consider we are using a PC with Nvidia’s GTX 1080 Ti GPU (a graphics processing unit is a specialized electronic circuit inside most computers; GPUs are very useful for tasks that are both highly “parallelizable” and “arithmetic-heavy”, like graphics and brute forcing against a hash – read this question at Stack Exchange for more) and are brute forcing against the SHA256 hash function (SHA256 is part of the SHA-2 family of hash functions; it is widely used, including on Bitcoin, and its hash output is 256 bits long). Then we would be able to make 4 billion guesses per second.
Therefore:
\[\footnotesize time = \frac{{95}^{25} \: choices}{{4*{10}^{9}} \: guesses \: per \: second} \approx 6.9 * {10}^{39} \: seconds\] \[\footnotesize time \approx 2.1 * {10}^{29} \: millenia\]
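Here’s a quick back-of-the-envelope sketch of that estimate (assuming the ~4 billion guesses per second figure above):

```python
# Worst-case brute force time: 95^25 candidates at ~4e9 SHA256 guesses/second.
candidates = 95 ** 25
guesses_per_second = 4e9

seconds = candidates / guesses_per_second
millennia = seconds / (60 * 60 * 24 * 365 * 1000)
print(f"~{seconds:.1e} seconds, i.e. ~{millennia:.1e} millennia")
# ~6.9e+39 seconds, i.e. ~2.2e+29 millennia
```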
Yeah, that definitely seems safe against one attacker using one common computer. (For the sake of comparison, the “most hardcore password cracker money can buy” costs $21,100 and can guess 23 billion SHA256 hashes per second. And a single specialized Bitcoin miner device, the AntMiner S9, currently costs $1,000 and has a hash rate of 14 trillion per second.)
When thinking about this, since computers will probably continue to improve consistently over the next years, it is always prudent to demand some margin of safety.
For instance, if the maximum length of the passwords were only \(\footnotesize 9\) characters, then there would be \(\footnotesize {95}^{9} = 6.3 * {10}^{17}\) possible input choices and it could be brute forced in \(\footnotesize 5\) years by our common computer. That does not sound like enough margin of safety.
Additionally, there are much more powerful computers in the world. (As of May 2018, the hashrate of the Bitcoin network – which roughly means how many SHA256 hashes per second all computers connected to the Bitcoin network can collectively calculate – is around \(\footnotesize 30\) million trillion (\(\footnotesize 30 * {10}^{18}\)) hashes per second.) Especially if we are hashing something very valuable, one cannot rule out being attacked by supercomputers or by huge networks of computers. Their computing power is obviously several orders of magnitude larger than an ordinary personal computer’s. On the bright side, since doing anything over the internet is much slower than doing it locally (connection bandwidth is the major bottleneck), such attackers would need to have access to the hash output. (See this Stack Exchange answer to understand the difference between online and offline attacks.)
OK. We are aware now that a huge amount of possible inputs can make brute force attacks much less feasible, even by super powerful attackers that keep getting better.
Here’s a plot twist, though.
All these big numbers are only of any meaning if there are no shortcuts to inversion. All the calculations we’ve done above assumed we were completely ignorant about inputs and had to rely on brute forcing.
That is key.
In order for those numbers to hold, no strategy should be better than picking candidates randomly. There should be no hints, no heuristics, no shortcuts on how to choose which hash inputs should be tested first.
And that, unfortunately, can be a big ask regarding inputs.
Depending on the situation, the input that is being hashed follows some kind of pattern that makes guessing it much faster than guessing something truly random. (A paper published in April 2015 showed that people’s choices of password structure often follow several known patterns. As a result, passwords may be much more easily cracked than their mathematical probabilities would otherwise indicate. Passwords containing one digit, for example, disproportionately include it at the end of the password.) For example, when coming up with passwords, we, human beings, tend to follow patterns that are easily exploitable. (Rainbow tables (a.k.a. lookup tables) and dictionary attacks are well-known techniques for exploiting knowledge about inputs in order to invert hash functions. To counter them, there are random salts and key-stretching algorithms. A good practical overview can be found in this Stack Exchange answer.)
Life ain’t easy, friends.
This is the major link between inversion and hardness that we were looking for. If brute force is the only way through, then the inversion of a hash function can really be seen as hard.
But getting to that is not trivial at all.
As you probably can tell by now, the world out there really is a place with its sizable share of evil-minded computers deliberately wanting and trying to “break” our hashes.
For any attacker, the worst-case hardness of inversion that we’ve talked about before is not what really matters. Aggressors are often looking for any vulnerable “surface” they could take advantage of. (That should remind us that “a chain is only as strong as its weakest link.”)
For the sake of argument, let’s say the worst-case hardness happens in 1% of possible cases and that in the other 99% of possibilities it is easy (i.e. not hard) to succeed in inverting a given hash function. That’s still game over for our hash function!
Ideally, hash security should require hardness in the vast majority of cases. If not, at least hardness in the average case.
This may look like a mere formality, but, believe us, it is not.
In Computer Science, it’s been notoriously difficult to prove whether some problems are indeed hard in their worst cases – this is known as the \(\footnotesize \mathcal{P} \stackrel{?}{=} \mathcal{NP}\) problem. (We are talking about \(\footnotesize \mathcal{NP}\)-complete problems. In general terms, they are a kind of decision problem for which we can easily check that a correct solution is indeed correct, but for which it is hard to find correct solutions in the first place.)
And all this difficulty just for the worst cases. To require hardness on average case is to ask for even more!
So, having to worry about patterns in inputs is just part of the challenge with one-way functions (and with great cryptographic hash functions in general).
OK.
Time has come.
You are now about to read the most stunning phrase contained in this text.
We’ll directly quote two professors in their book because we certainly couldn’t do any better:
Although one-way functions are widely believed to exist, and there are several conjectured candidate one-way functions which are widely used, we currently do not know how to mathematically prove that they actually exist.
How about that?
The existence of one-way functions is, as of now, an important open, unsolved problem in Computer Science.
We simply don’t know if nearly everything we’ve discussed so far about hash functions and its one-way property (#4) is really possible. It may be. It may not be.
Computer scientists and cryptographers have evidence that it is very plausible to assume one-way functions do exist, but nobody can be certain about it. (See this Stack Exchange thread, especially this comment, which mentions the origins of SHA-3, a.k.a. Keccak.)
Since hashes are a fundamental building block in Cryptography but depend on one-way functions, it also follows that:
We argue that the existence of one-way functions is a necessary condition to the existence of most known cryptographic primitives including secure encryption and digital signatures.
And it gets deeper.
The foremost unsolved question of Computer Science, the \(\footnotesize \mathcal{P} \stackrel{?}{=} \mathcal{NP}\) problem we’ve talked about above, is also deeply linked to one-way functions.
If one-way functions are proved to exist (advanced readers will probably appreciate reading a demonstration of this deep link here at Stack Exchange), then that would also prove that \(\footnotesize \mathcal{P} \neq \mathcal{NP}\) and, thus, finally solve the famously important open problem.
As late professor Ingo Wegener put it:
Problems that are viewed as difficult have not actually been proved to be difficult, but it has been shown that thousands of problems are essentially equally difficult. An efficient solution to any one of these thousands of problems implies an efficient solution to all the others.
Or stated another way: a proof that any one of these problems is not efficiently solvable implies that none of them is.
Thousands of secrets have joined together to form one great mystery, the unmasking of which reveals all the secrets.
In this sense, each of these secrets is just as central as every other and just as important as the great mystery.
Our modern, digital world and its security rely on fundamental stuff that is still unproven. (If you are familiar with the concept of complexity classes but haven’t heard about Impagliazzo’s Worlds, then you must stop now and read the answers to this question at Stack Exchange.)
Let that sink in.
And try sleeping well tonight.
Tying everything together, plus a glimpse to recent developments
What a ride!
To wrap things up, let’s briefly recap the key points that we’ve come across.
We’ve seen hashes are substitutes that may save our computers time and space and that are believed to be non-interchangeable and irreversible.
More specifically, great cryptographic hash functions should have the following properties:
- Deterministic
- Fixed output range of bits
- As few collisions as possible (a.k.a. collision resistance)
- Easy to compute but as hard to invert as possible (a.k.a. one-way function)
The real meat is in the collision resistance (#3) and in the one-way function (#4) properties.
The shared principle among them is that there should be no feasible shortcuts for finding collisions or for inverting hashes. Brute force, after all, should be the optimal method to attack them.
We should be able to list a few consequences that, if true, make hashes even more useful (e.g. in cryptoassets):
- Every hash value in the output range should happen with the same frequency
- Knowing any input-output pair shouldn’t help in figuring out any other pairs
- Joining two known inputs together and then hashing them should result in a hash unfeasible to guess in advance
These last two points would be satisfied if any minor change made to a given input results in a substantial change to the hash output. And that does happen – it is known as the avalanche effect.
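You can observe the avalanche effect with a real hash function using Python’s standard hashlib module – a small sketch (the inputs below are arbitrary):

```python
# Avalanche effect: change one character, and roughly half of SHA256's
# 256 output bits flip.
import hashlib

def sha256_bits(data: bytes) -> str:
    digest = hashlib.sha256(data).digest()
    return format(int.from_bytes(digest, "big"), "0256b")

bits_a = sha256_bits(b"Bashful")
bits_b = sha256_bits(b"Bashfur")   # a single character changed

differing = sum(a != b for a, b in zip(bits_a, bits_b))
print(f"{differing} of 256 bits differ")   # typically close to 128
```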
As we’ve seen, nobody has been able to actually prove that one-way functions really do exist. Thus, it should pay to be vigilant on hash function critical weaknesses.
Namely, successful attacks that are more efficient than brute force for inversions (preimage attacks), and collisions found faster than theoretically predicted by the birthday problem.
Collision attacks against hashes that are still pretty commonly used have been found and are easily accessible to anyone on the internet.
For instance, MD5, a 128-bit hash, suffers a collision attack within seconds on a \(\footnotesize 2.6\) GHz Pentium 4 PC after just \(\footnotesize {2}^{18}\) hashes, much faster than the theoretical range of the birthday problem of \(\footnotesize {2}^{\frac{128}{2}} = {2}^{64}\).
SHA-1, an extremely popular 160-bit hash function, has recently suffered from a successful public collision attack too. In total, the computational effort spent was \(\footnotesize {2}^{63.1}\) hashes (also below the theoretical range of the birthday problem) and used approximately 6,500 CPU years and 100 GPU years.
Finally, another recent troublesome anecdote happened with IOTA, a cryptoasset. In August 2017, Ethan Heilman, Tadge Dryja, Madars Virza, and Neha Narula found very serious collision vulnerabilities in Curl, the hash function created and used by the project:
Once we developed our attack, we could find collisions using commodity hardware within just a few minutes, and forge signatures on IOTA payments. We informed the IOTA developers, they patched their system, and we wrote a vulnerability report.
Where does this go from here?
Well, if we were to continue, we would talk more about real-world examples of hash functions:
- How exactly are they constructed?
- Were any of them cracked by techniques more efficient than brute force? (Spoiler: yes.) How did it happen?
- The cat-and-mouse game in depth: attacks vs countermeasures
- The newest cryptographic hash functions
We cannot guarantee we will have the time to learn and then write about all this in the near future, though.
Anyway, thanks a lot for your time. Hope you’ve had a good one.
Appendix: diving into the birthday problem
Suppose a hash function with \(\footnotesize N\) bits of fixed output (and, therefore, with \(\footnotesize n = 2^N\) possible hash outputs).
There is a sharp increase in the chances of randomly finding at least two matching hash outputs after computing only roughly \(\footnotesize 2^{N/2} = n^{1/2} = \sqrt{n}\) hashes.
Let’s derive this. (This is an informal derivation, not a rigorous proof. We will just walk you through the key ideas. That should be sufficiently interesting anyway.)
First, we must highlight two assumptions needed for our arguments: (i) each hash computation is independent, and (ii) all hash outputs are equally probable (in other words, uniformly distributed).
If \(\footnotesize P(col, x)\) is the probability of at least one collision occurring in the \(\footnotesize x\) hashes we compute and \(\footnotesize P(\overline{col}, x)\) is the probability of no collision occurring in these \(\footnotesize x\) hashes, then \(\footnotesize P(col, x) + P(\overline{col}, x) = 1\).
It is easier to calculate \(\footnotesize P(\overline{col}, x)\), the probability of no collision occurring in these \(\footnotesize x\) hashes. So, we will focus on that and then find \(\footnotesize P(col, x)\) by simply doing \(\footnotesize 1 - P(\overline{col}, x)\).
For \(\footnotesize x = 1\) hash computation, obviously \(\footnotesize P(\overline{col}, x) = 1\), since it is impossible to have a collision if we’ve computed just one hash.
For \(\footnotesize x = 2\) hashes, \(\footnotesize P(\overline{col}, x) = \frac{(n-1)}{n}\)
Why? You can think of it as: of all \(\footnotesize n\) possible hashes (the denominator) we need to “avoid” (in other words, “remove”) only the one that is the same as the first hash (therefore, \(\footnotesize n - 1\) in the numerator).
For \(\footnotesize x = 3\), \(\footnotesize P(\overline{col}, x) = \frac{(n-1)}{n} * \frac{(n-2)}{n}\)
Now, we have computed three hashes. The second cannot be the same as the first (as in \(\footnotesize x = 2\)). At the same time, the third cannot be the same as either the first or the second, hence the term \(\footnotesize \frac{(n-2)}{n}\).
And, since the hash computations are independent and equally probable, per counting principles in Combinatorics, we can multiply these terms to get the probability we’re looking for.
Similarly, for \(\footnotesize x = 4\), \(\footnotesize P(\overline{col}, x) = \frac{(n-1)}{n} * \frac{(n-2)}{n} * \frac{(n-3)}{n}\)
By the same arguments, it follows that:
\[\footnotesize P(col, x) = 1 - \frac{(n-1)}{n} * \frac{(n-2)}{n} * \frac{(n-3)}{n} * ... * \frac{(n-x+1)}{n}\]
And rearranging it:
\[\footnotesize P(col, x) = 1 - {\bigg\lgroup 1 - \frac{1}{n}\bigg\rgroup} * {\bigg\lgroup 1 - \frac{2}{n}\bigg\rgroup} * {\bigg\lgroup 1 - \frac{3}{n}\bigg\rgroup} * ... * {\bigg\lgroup 1 - \frac{x-1}{n}\bigg\rgroup}\]
From Taylor Series, we know that:
\[\footnotesize e^{-k} = 1 - k + \frac{k^2}{2!} - \frac{k^3}{3!} + ...\]
And that for sufficiently small \(\footnotesize k\):
\[\footnotesize e^{-k} \approx 1 - k\]
Assuming \(\footnotesize n\) is large and therefore \(\footnotesize \frac{1}{n}\) is small, we can then approximate:
\[\footnotesize P(col, x) \approx 1 - e^{-\frac{1}{n}} * e^{-\frac{2}{n}} * e^{-\frac{3}{n}} * ... * e^{-\frac{(x-1)}{n}} = 1 - e^{-\frac{1+2+3+...+(x-1)}{n}}\]
From the sum of the arithmetic progression \(\footnotesize 1+2+3+...+(x-1)\) (the sum of a finite arithmetic progression is called an arithmetic series – see Wikipedia for visual derivations of the result we are using):
\[\footnotesize P(col, x) \approx 1 - e^{-\lgroup\frac{x*(x-1)}{2}\frac{1}{n}\rgroup} = 1 - e^{-\frac{x*(x-1)}{2n}} = 1 - e^{-\frac{x^2-x}{2n}}\]
Since \(\footnotesize x\) is a large positive integer, \(\footnotesize x^2 \gg x\). Then, approximating \(\footnotesize x^2-x \approx x^2\):
\[\footnotesize P(col, x) \approx 1 - e^{-\frac{x^2}{2n}}\]
Finally, solving this equation with respect to \(\footnotesize x\), we find (approximately) the smallest number of hashes we need to compute in order to have probability \(\footnotesize P(col, x) = p\) of finding a collision:
\[\footnotesize x \approx {\bigg\lbrack n * 2 * \log{\bigg\lgroup\frac{1}{1-p}\bigg\rgroup}\bigg\rbrack}^{1/2}\]
So, if \(\footnotesize P(col, x) = p = 0.5 = 50\%\), then:
\[\footnotesize x \approx {\bigg\lbrack n * 2 * \log{\bigg\lgroup\frac{1}{1-0.5}\bigg\rgroup}\bigg\rbrack}^{1/2} \approx {\bigg\lbrack n * 2 * 0.693 \bigg\rbrack}^{1/2} \approx 1.18 * \sqrt{n}\]
In order to get a grasp of what is happening, let’s calculate the needed number of hash computations \(\small x\) for different probabilities \(\footnotesize P(col, x) = p\):
\(\footnotesize p = 10\%\), \(\footnotesize x \approx {[n * 2 * \log{1.11}]}^{1/2} \approx \sqrt{n * 0.208} \approx 0.456 * \sqrt{n}\)
\(\footnotesize p = 25\%\), \(\footnotesize x \approx {[n * 2 * \log{1.33}]}^{1/2} \approx \sqrt{n * 0.570} \approx 0.755 * \sqrt{n}\)
\(\footnotesize p = 75\%\), \(\footnotesize x \approx {[n * 2 * \log{4}]}^{1/2} \approx \sqrt{n * 2.772} \approx 1.664 * \sqrt{n}\)
\(\footnotesize p = 90\%\), \(\footnotesize x \approx {[n * 2 * \log{10}]}^{1/2} \approx \sqrt{n * 4.605} \approx 2.145 * \sqrt{n}\)
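If you want to double-check these multipliers, here’s a tiny Python sketch (small differences from the rounded logarithms above are expected):

```python
# Multiplier of sqrt(n) needed to reach collision probability p.
import math

for p in (0.10, 0.25, 0.50, 0.75, 0.90):
    factor = math.sqrt(2 * math.log(1 / (1 - p)))
    print(f"p = {p:.0%}: x ~ {factor:.3f} * sqrt(n)")
```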
So, we do observe that there is a sharp increase in the chances of randomly finding at least two matching hash outputs (i.e., one collision) after computing only roughly \(\footnotesize 2^{N/2} = n^{1/2} = \sqrt{n}\) hashes.
Why? A priori, it may seem unexpected, but let’s look closer at the approximation we’ve found for \(\footnotesize x\):
\[\footnotesize x \approx {\bigg\lbrack n \ast 2 \ast \log{\bigg\lgroup\frac{1}{1-p}\bigg\rgroup}\bigg\rbrack}^{1/2}\]
It may be rearranged as:
\[\footnotesize x \approx {\bigg\lbrack 2 \ast \log{\bigg\lgroup\frac{1}{1-p}\bigg\rgroup}\bigg\rbrack}^{1/2} \ast n^{1/2} = \sqrt{2 \log{\bigg\lgroup\frac{1}{1-p}\bigg\rgroup}} \ast \sqrt{n}\]
Now, it should be clearer that the number of hash computations \(\footnotesize x\) needed to find at least one collision for a hash with a fixed output size of \(\footnotesize N\) bits depends on \(\small\sqrt{n}\) and on a factor \(\footnotesize \sqrt{2\log{(\frac{1}{1-p})}}\) that is independent of both \(\footnotesize N\) and \(\footnotesize n\).
Making things more visual, let’s plot the factor \(\footnotesize \sqrt{2\log{(\frac{1}{1-p})}}\) for \(\footnotesize p \in (0,1)\):
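Here is one possible way to produce such a plot, using numpy and matplotlib:

```python
# Plot the factor sqrt(2*log(1/(1-p))) for p in (0, 1).
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)
factor = np.sqrt(2 * np.log(1 / (1 - p)))

plt.plot(p, factor)
plt.xlabel("probability p of at least one collision")
plt.ylabel("multiplier of sqrt(n)")
plt.grid(True)
plt.show()
```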
We can see that the probability \(\footnotesize p\) of finding at least one collision (on the x-axis) grows, for example, from 20% to 80% while the factor \(\footnotesize \sqrt{2\log{(\frac{1}{1-p})}}\) (on the y-axis) increases only from roughly 0.7 to 1.8.
Since that is the only factor that multiplies \(\footnotesize \sqrt{n}\), it indeed explains why the probability \(\footnotesize p\) behaves this way around \(\footnotesize \sqrt{n}\).
As a final remark, if it’s the case that the hash outputs are known not to be equally probable, then it’s even faster to find collisions. Deriving this result is not super hard, but it does require a “Math toolkit” that feels beyond the scope of this text.
If you’re interested in going deeper on unequal probabilities, read Approximations to the Birthday Problem with Unequal Occurrence Probabilities and their Application to the Surname Problem in Japan by Shigeru Mase.
For a broader survey of the problem and its variations, see The matching, birthday and the strong birthday problem: a contemporary review by Anirban DasGupta.