date posted: 2020-01-19
They were invented to overcome deficiency that Direct Access Tables were facing.
Direct access table is a table where each data has associated index therefore desired data could
be directly reached with O(1) time complexity.
For example, for each of your friend if you enter their name on your directory
all their informations show up, this is direct access table where name is an index.
In mathematical terms its an injective function or one-on-one mapping.
You can easily see that since each index is used to get their data,
operations like inserting, deleting, searching all
have time complexity of O(1).
Then seems to be the problem?
Even though all operations have exceptional small time complexity, it suffers largely from space complexity. Since for each friend there must be an extra memory space for every index that points to an information of a friend. As list of friends become larger there will be more memory space that aren't often used.
Say there are 100 students in a school where on average 50% of students study in the library, if you want to create a library of 100 seats for each students, half of the seats will not be occupied. On average less than half will not be used and in break times none of them will be used. We could avoid this problem by using concept of Hash Table.
Assign 2 student per desk therefore you only need 50 desks instead of 100. From student_1 to student_100, we use a function called hash function to efficiently assign two students to one desk. In this case we will use modular arithmetics (hash function may not be modular arithmetics) and create hash function denoted h(x) where x represents student number.
h(x) = x mod 50, which returns what is left after dividing x by 50.
So by using this hash function we are able reduce cost of 50 desks
The reason it is called a hash table is because original key is transformed via hash function which is now a key(index) of new table, hash table.
Hash table's time complexity depends on its hash function therefore main goal of using hash table is to use hash function that distribute original keys equally into hash table. If hash function map multiple key into same position then when calling that index multiple values will be returned and from there you must search in linear fashion to find your index. This phenomenon is referred to as collision. When this happens there is two ways of handling them:
There are two types of Open addressing: Linear probing and Quadratic probing.
Linear Probing: basically what I've described above. I will give an example for clarity.
Until I buy digital drawing pad, lets suffer together...
Looking at image below: We are passing keys: 3, 4, 7, 10 through hash function h(x) to
hash table.
As you can see if all keys maps to index 1. Time complexity of inserting becomes linear which is
never good. This is main problem with linear probing and this phenomenon is known as
Quadratic Probing:
For simplicity, lets use quadratic probing on example from linear probing.
Double Hashing:
Similar to quadratic probing however here, replace quadratic with
another hash function.
In summary, to avoid increase of space complexity we use hash functions but time complexity increases if collision happens so we must create hash table with hash function that minimizes collision.