
What is 'hashing' and what is it used for?

May 29th 2020 / Programming

Yet another magical word - 'hash'.




This article grew out of a forum discussion in which it became more than clear that everyone has heard of the term, but few of the newbie programmers actually know what it means or how these techniques can be applied in programming.
The article's main goal is to explain to beginners, in a nutshell, what 'hashing' is. From there it is easy for them to gather additional information about the subject and its implementations with a simple Google search.

"Hash, hashing" - in the dictionary we find meanings of the word such as 'to chop' or 'to mince'. Sometimes the term "message digest" is used, which conveys the idea of 'message processing'. One can also come across the term "digital fingerprint".

What does all that mean when it comes to computers and IT?


As a whole, a hashing algorithm takes as input a piece of data of variable length - most often a string of symbols, but it can also be a number or some other type of data - and produces as output a number or a fixed-length string that characterizes the input message as 'unique'. I put 'unique' in quotes because, as we are about to see a bit later on, this is not exactly true.

To make the matter easier to comprehend, from now on in the article we will assume that the input data is a text string and the output value is a number.



For a hashing algorithm to be good, it has to meet the following criteria:


1. To be irreversible - there should be no algorithmic way to restore the input string from the hash value (I stress 'algorithmic', since there is always the way of 'brute force' - listing all possible combinations of input data and hashing each one of them).

2. To come as close as possible to an even distribution of the values over the allowable range for typical input strings. The common rule here is that small differences in the input strings should result in big differences in the hash value. In this respect, hash functions are quite similar to random number generators.

3. The hash function should be fast to compute. This is not mandatory, but extra speed never hurts.
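Criterion 2 is easy to see in practice. Below is a minimal Python sketch of the FNV-1a algorithm (which appears later in this article); changing a single character of the input changes bits throughout the 32-bit result:

```python
def fnv1a_32(data: bytes) -> int:
    # FNV-1a: start from the offset basis, then for each byte
    # XOR it in and multiply by the FNV prime, modulo 2^32.
    h = 2166136261
    for b in data:
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF
    return h

# Inputs differing in a single character give widely different hashes.
a, b = fnv1a_32(b"hello"), fnv1a_32(b"hellp")
assert a != b
print(f"{a:08x} vs {b:08x}")
```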

Is hashing encryption?


As a whole, by 'encryption' we assume that, given the necessary keys, the reverse operation is possible - in other words, by applying the keys to the encrypted message we can recover the original one.
With hash functions, that's not the case.
Some use the term 'one-way encryption' which, if you ask me, is not quite right - the term 'encryption' always carries the idea of bidirectionality, so calling it one-way is a contradiction.
I consider the most appropriate definitions of hashing to be 'one-way transformation' or 'irreversible transformation'.

Is the hash value unique?



Don't be misled. A hash value is not unique to an input string!
Depending on the hash value length and the hashing algorithm, there is a greater or smaller probability of so-called hash collisions - they occur when two different strings have the same hash value.



With some algorithms this probability is smaller than with others, but it is never zero. As an example, take the simple (but still used here and there) algorithm where the hash value is obtained by summing up all the symbols of the string one after another. Obviously, by merely reordering the symbols of the string we get a different string with the same hash.
With more complex hash functions, finding a second string with the same hash value is not a trivial task, but this doesn't mean the hash value will never have a match. Since the hash value is always shorter than the string, the pigeonhole principle guarantees hash matches for some input values.
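The summing algorithm from the paragraph above fits in a few lines of Python; any permutation of the string collides, since addition ignores order:

```python
def sum_hash(s: str) -> int:
    # Toy hash: add up the character codes, kept to one byte here.
    return sum(ord(c) for c in s) & 0xFF

# "abc" and "cab" contain the same symbols, so the sums are equal:
assert sum_hash("abc") == sum_hash("cab")
```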



The rule is this - the longer the hash, the rarer the collisions. If the hash length is greater than the length of the longest string, a function can be found that never produces collisions, but of course, using such long hash values is acceptable only for some really specific tasks.

What is hashing used for? 


Hashing is used in several main fields of application. As they are quite different from one another, we will look at each of them separately:

1. Searching for strings in big arrays. Hash tables.

Let's assume we have a big table (array) of random, unique strings, and a new string that we want to check for in the table. The trivial solution is to scan the array element by element, comparing our string with each string in the array.
This approach, however, is highly inefficient. Comparing strings is an expensive operation, internally implemented as a loop that compares the two strings symbol by symbol. Comparing numbers, on the other hand, is a cheap operation that takes 1 or 2 processor instructions.
How can we replace string comparison with number comparison?
Let's assume that when we fill the array, along with each string we also compute and store its hash value.

 
Then, when searching, we compute the hash of the string in question and scan the array, comparing not the strings directly but their hash values. Of course, we must account for collisions - if the hashes match, we have to perform a direct comparison of the strings themselves, just to make sure the right string has been found - but now we face only one string comparison instead of thousands.
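A sketch of this hash-first scan in Python (Python's built-in hash stands in for whatever string hash the array was built with):

```python
def find_string(table, target):
    """table is a list of (hash, string) pairs built when the array
    was filled. Compare cheap integer hashes first; fall back to a
    full string comparison only when the hashes match."""
    th = hash(target)
    for h, s in table:
        if h == th and s == target:   # string compared only on a hash match
            return True
    return False

words = ["alpha", "beta", "gamma"]
table = [(hash(w), w) for w in words]
assert find_string(table, "beta")
assert not find_string(table, "delta")
```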


This approach alone noticeably accelerates the search. However, we still have to scan the whole array, even though we are comparing numbers and not strings.
The algorithm's speed is O(n) - the time grows linearly with the size of the array.
Can we get rid of that scanning too, and make the speed independent of the array size - constant instead?


Yes, we can, and that, in a nutshell, is hashing's biggest advantage.
Let's say we have a hash function which yields a one-byte number, meaning for every input string we get a value in 0..255.
We create an array of 256 elements, each containing a pointer to a string or 0 - this will be our 'hash table'; let's denote it HashTable[0..255].
When filling the array we'll later search in, we compute the hash value of each string and store a pointer to the string in the corresponding element of the hash table:

Code:
h := Hash(NewString);
HashTable[h] := ptr(NewString);


After that, if we want to search for a string in the table, it is enough to compute its hash value and check whether the corresponding position holds a 0 or a pointer. It looks like this:

Code:
h := Hash(SearchString);
if HashTable[h] = 0 then "The string is not in the array"
else "The string is found."
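A Python sketch of the same scheme (the one-byte hash here is FNV-1a folded to 8 bits, an arbitrary choice of mine; collisions are ignored for now, exactly as the pseudocode above ignores them):

```python
TABLE_SIZE = 256

def hash8(s: str) -> int:
    # One-byte hash: 32-bit FNV-1a folded down to 8 bits.
    h = 2166136261
    for c in s.encode():
        h = ((h ^ c) * 16777619) & 0xFFFFFFFF
    return (h ^ (h >> 8) ^ (h >> 16) ^ (h >> 24)) & 0xFF

# None plays the role of the 0 pointer in HashTable[0..255].
table = [None] * TABLE_SIZE

def insert(s: str) -> None:
    table[hash8(s)] = s            # one store, no scanning

def lookup(s: str) -> bool:
    return table[hash8(s)] == s    # one probe, no scanning

insert("needle")
assert lookup("needle")
```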



We must never forget about collisions! That is more than clear from the example above. If we try to add more than 256 strings, the table is guaranteed to overflow and at least 2 different strings will have matching hash values.

 
Usually, collisions occur much earlier. The rule of thumb is not to fill the table beyond half its size. In other words, if we want to work with a maximum of 1000 strings, we should choose a hash table with at least 2000 elements.

This does not mean that such a table will have no collisions! It will. All it means is that a sufficient table size keeps collisions within reasonable limits that do not affect the algorithm's speed.


Managing collisions.



Managing collisions has two aspects:
1. Decreasing the probability of collisions arising.
2. Handling collisions after they have arisen.

For part 1 - we choose a bigger hash, and a suitable hash function that gives fewer collisions for the typical input data.

For part 2 - there are several ways of handling collisions. Each has pros and cons, and overall we have to choose wisely for the task at hand. Here we will discuss some of them:

a. Closed hashing



The idea here is to use only the existing table, without taking additional memory for collision handling.


This reduces the memory used, but obviously a table of this sort can hold only a limited number of strings - exactly as many as the table size.


When, while adding a string to the hash table, we get a collision (meaning that at the position for the string in question there is already another string with the same hash), we simply put the new string at the next free position.

 
Searching works the same way. If the position in question holds a string, but not the one we are looking for, we check the next element. If it's not our string either, we move to the next one, and so on, until we find either our string or a 0 - which would mean this particular string is missing from the array.
Have you noticed how, in case of collisions, this algorithm degenerates into scanning? This can drastically decrease the speed of the whole algorithm. That's why collisions are so harmful and every measure should be taken to avoid causing them.
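A minimal linear-probing sketch of closed hashing in Python (the table must never fill up completely, or the insert loop below would never find a free slot - hence the half-full rule from earlier):

```python
SIZE = 8
table = [None] * SIZE

def probe_insert(s: str) -> None:
    i = hash(s) % SIZE
    while table[i] is not None:    # collision: step to the next slot
        i = (i + 1) % SIZE
    table[i] = s

def probe_lookup(s: str) -> bool:
    i = hash(s) % SIZE
    while table[i] is not None:    # stop at our string or at an empty slot
        if table[i] == s:
            return True
        i = (i + 1) % SIZE
    return False

for w in ["one", "two", "three"]:
    probe_insert(w)
assert probe_lookup("two")
```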

b. Open hashing



The idea here is this - when a collision occurs, at the particular position in the table we put not a pointer to a string but a pointer to a list of strings, where we keep the various strings sharing the same hash.
This allows an unlimited number of strings in the table. Of course, if while searching we stumble upon a collision - that is, a pointer to a list - this list also has to be scanned element by element (or in some other way, say binary search) to check whether our string is there or not.
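A sketch of open hashing in Python, with a plain list standing in for the list of colliding strings at each position:

```python
SIZE = 8
buckets = [[] for _ in range(SIZE)]   # each slot points to a list of strings

def chain_insert(s: str) -> None:
    buckets[hash(s) % SIZE].append(s)     # collisions just grow the list

def chain_lookup(s: str) -> bool:
    return s in buckets[hash(s) % SIZE]   # scan only the short collision list

for w in ["red", "green", "blue", "cyan"]:
    chain_insert(w)
assert chain_lookup("blue")
```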


c. Double hashing



The idea here is this - when we hit a collision while filling the table, we compute a second hash value with another algorithm and hash the string into a second table. When searching, if the string in the first table doesn't match, we compute the second hash of the searched string and check the second table.
Of course, collisions are possible in the second table as well, and they must be handled with some other method. But collisions of this sort occur much more rarely, so they affect the algorithm's speed less.
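A sketch of double hashing in Python, with two deliberately simple (hypothetical) hash functions chosen so that a collision is easy to trigger; collisions in the second table are left unhandled, as the text notes they would need yet another strategy:

```python
def h1(s: str) -> int:
    # First hash: plain byte sum (a toy function, chosen for the demo).
    return sum(s.encode()) % 16

def h2(s: str) -> int:
    # Second hash: position-weighted sum, so it disagrees with h1.
    return sum(i * b for i, b in enumerate(s.encode(), 1)) % 16

primary   = [None] * 16
secondary = [None] * 16   # its own collisions would need another strategy

def dh_insert(s: str) -> None:
    if primary[h1(s)] is None:
        primary[h1(s)] = s
    else:                          # collision: rehash into the second table
        secondary[h2(s)] = s

def dh_lookup(s: str) -> bool:
    return primary[h1(s)] == s or secondary[h2(s)] == s

dh_insert("ab")
dh_insert("ba")                    # "ab" and "ba" collide under h1
assert dh_lookup("ab") and dh_lookup("ba")
```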

d. Dynamic building of a hash table

As the common rule states (as you may have noticed), we always try to use the biggest table size possible in order to reduce the risk of collisions and of speed degradation.
On the other hand, we are limited by the available memory; besides, since the enormous hash table is mostly filled with zeros and holds little data, we are left with a constant feeling of wasted memory.
We can try a slightly different approach: we choose a big hash size (for instance 32 or 64 bits), but we don't create the hash table in memory up front (and how could we - just try computing the memory needed for a 32-bit hash table where each element is a pointer); instead, we build it in the process of adding strings, storing only the currently filled part of the table in memory.
Here is how it works: we walk the string's hash byte by byte, and each byte serves as a branch index at the current level of a tree. At the last level of the tree we have string pointers. Nodes are added to the tree only when needed - that way, memory is spent only on the part of the table actually filled with data.

This algorithm also ensures constant search time, independent of the number of strings in the array, because traversing the tree always takes the same number of steps, depending only on the length of the hash value - the depth of the tree. Of course, this takes longer than a direct table lookup, but the greater hash length - and hence the small number of collisions - repays this delay with interest when the number of strings in the table is very large.
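A sketch of such an incrementally built tree in Python, with a dict standing in for the tree nodes (this is my loose rendition of the idea described above, not FASM's actual code):

```python
def hash32(s: str) -> int:
    # 32-bit FNV-1a, standing in for any wide hash function.
    h = 2166136261
    for c in s.encode():
        h = ((h ^ c) * 16777619) & 0xFFFFFFFF
    return h

def trie_insert(root: dict, s: str) -> None:
    node = root
    for b in hash32(s).to_bytes(4, "big"):   # each hash byte picks a branch
        node = node.setdefault(b, {})        # nodes are created only on demand
    node.setdefault("strings", []).append(s) # leaves keep a collision list

def trie_lookup(root: dict, s: str) -> bool:
    node = root
    for b in hash32(s).to_bytes(4, "big"):
        node = node.get(b)
        if node is None:          # missing branch: the string was never added
            return False
    return s in node.get("strings", [])

root = {}
for w in ["mov", "add", "jmp"]:
    trie_insert(root, w)
assert trie_lookup(root, "add")
```

The lookup always walks exactly 4 levels (the 4 bytes of the hash), which is where the constant search time comes from.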



Hashing algorithm



Picking the 'right' hashing algorithm is a really responsible task - it plays a huge role in a program's efficiency.

Case study: the parser of FASM (flat assembler) uses a 32-bit hash function for fast searching in the label list during compilation. In the first implementation (up to version 1.51, if I remember correctly), a direct scan of the label list was used, comparing hash values.
When compiling big sources with lots of labels, the speed started to drop, and the compilation time grew with the square of the number of lines in the program.



When these effects showed up (when big enough programs written in FASM appeared), Tomasz Grysztar (the author of FASM) replaced the search algorithm with the hash tree algorithm described above.
This increased the speed of the parser 4 to 5 times.



The next step was really extraordinary: after replacing the previous hash function with a new one (the FNV-1a algorithm), the parser speed increased up to 25 times in some tests.
From that point on, the parser became so fast that in all practical cases it takes only a small fraction of the entire compilation time.
You can check out the entire discussion regarding this story here: Discussion about the changes in FASM 1.51 (en)

Here is the source of the FNV-1a algorithm as used in FASM (flat assembler) and also in the Fresh project.

Code:
;-------------------------------------------------
; function StrHash
;   Computes 32 bit hash value from the string.
;   The function is compatible with FASM hash
;   function if OrMask = 0.
; Algorithm: FNV-1a
;
; Arguments:
;   hString - handle/pointer of the string.
;   OrMask  - every byte from the string will be ORed
;             with this value (byte)
; Return:
;   eax - 32bit hash value.
;-------------------------------------------------
proc StrHash, hString, OrMask
begin
push    esi edi ebx ecx edx

stdcall StrPtr, [hString]
mov     esi, eax

xor     ebx, ebx
mov     eax,2166136261
xor     ecx, ecx
mov     edi, 16777619
.hashloop:
mov     cl, [esi+ebx]
jecxz   .endstring
inc     bl
or      cl, byte [OrMask]
xor     al,cl
mul     edi
jmp     .hashloop

.endstring:
mov     edx,eax
and     eax,0FFFFFFh
shr     edx,24
xor     eax,edx
shl     ebx,24
or      eax,ebx
pop     edx ecx ebx edi esi
return
endp



The code comes without a line-by-line explanation because, in reality, hashing algorithms don't lend themselves to intuitive explanation. The code simply scans the string and, using the code of each symbol, performs complex arithmetic and logical operations so that the output number depends, as far as possible, on all the input codes. Rotations and multiplications by prime numbers are often used. Every algorithm stands on its own.
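For readers who don't read assembly, here is my reading of the StrHash routine above, transcribed to Python as an illustration: FNV-1a over the bytes (each byte ORed with the mask first), after which the 32-bit result is folded to 24 bits and the string length is packed into the top byte:

```python
def str_hash(data: bytes, or_mask: int = 0) -> int:
    # 32-bit FNV-1a over the bytes, each byte ORed with or_mask first.
    h = 2166136261
    for b in data:
        h = ((h ^ (b | or_mask)) * 16777619) & 0xFFFFFFFF
    # Fold the result to 24 bits and pack the string length (mod 256)
    # into the top byte, as the assembly does at .endstring.
    folded = (h & 0xFFFFFF) ^ (h >> 24)
    return ((len(data) & 0xFF) << 24) | folded

assert str_hash(b"hello") >> 24 == 5   # top byte carries the length
```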

2. Passwords, identification



The other big use of hash algorithms is user identification and password verification. Here the irreversibility of the hash is what matters: there is no algorithmic way to recover the initial string from the hash value.
When hashing is used in this area, collisions don't really matter, because we can use hashes of arbitrary length - even longer than the input string. Here we are after security, not speed. And as we already know, if the hash length is greater than the length of the input string, a function can be found which produces no collisions over the whole range of input strings.

How does user verification with a password work?



The simplest way is to store the passwords in an array along with the user names and compare them directly with what the user provides.
This method is highly insecure, however. Anyone with access to the password file could see the passwords and use them.
Sending those passwords over the internet makes things even worse, since anyone could intercept them.
Encrypting the password file is not a solution either, since whatever can be encrypted can be decrypted - the key has to be stored somewhere.
The right solution is actually simple: we do not store the users' passwords at all; we store only the hash values of those passwords.


When access to a resource is requested, the user provides some kind of password, which is hashed, and the resulting value is compared with the one stored in the database for that user. If the hash values match, the user is approved.
Even someone with access to the database cannot learn the users' passwords, because the hash function cannot be inverted.
As we already stated, collisions don't matter here, because hashes of large length are used (MD5, for instance, is 16 bytes - 128 bits - long).
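A minimal sketch of this scheme in Python, using SHA-256 from the standard library as the hash (the article mentions MD5, but any strong hash fits the scheme; real systems also add a per-user salt, which is beyond this sketch):

```python
import hashlib

# username -> hex digest of the password; the password itself is never stored
users = {}

def register(user: str, password: str) -> None:
    users[user] = hashlib.sha256(password.encode()).hexdigest()

def login(user: str, password: str) -> bool:
    # Hash what the user typed and compare digests, never plain text.
    return users.get(user) == hashlib.sha256(password.encode()).hexdigest()

register("alice", "s3cret")
assert login("alice", "s3cret")
```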



All in all, using hash functions in this area is trivial and fairly simple, provided that a quality hash algorithm is chosen.