Data Structures Hashing And Strings

Posted by VitoDH Blog on November 19, 2019

Chapter 3 - Data Structures - 5 - Hashing and Strings

Source: Algorithm Design Manual(Skiena)

Def

  • Hash table: very practical way to maintain a dictionary. We use the value of our hash function as an index into an array, and store our item at that position
  • Hash function: map each key to a big integer
    • : size of the alphabet
    • char(c): A function that maps each symbol of the alphabet to a unique integer from 0 to
    • maps each string to a unique integer by creating the characters of the strings as digits in a base- number system
    • To reduce the size of the table(), we need to take

1. Collision Resolution

Two distinct keys will occasionally hash to the same value.

  • Chaining: Represent the hash table as an array of m linked lists. The th list will contain all the items that hash to the value of . Then search, insertion and deletion reduce to the corresponding problem in linked list. But this method devotes a considerable amount of memory to pointers, which originally can be used to make the table larger
    • Traversing all the elements takes O(n+m) time because we have to scan all m buckets looking for elements

Using chaining with doubly-linked list to resolve collisions in an m-element hash table, the dictionary for n items can be implemented in the following time:

  Hash table(expected) Hash table(worst case)
Search(L,k) O(n/m) O(n)
Insert(L,x) O(1) O(1)
Delete(L,x) O(1) O(1)
Successor(L,x) O(n+m) O(n+m)
Predecessor(L,x) O(n+m) O(n+m)
Minimum(L) O(n+m) O(n+m)
Maximum(L) O(n+m) O(n+m)
  • Open addressing: The hash table is maintained as an array of elements(not buckets), each initialized to null.
    • Insertion: check whether the desired position is empty. If not, find the next open spot in the table and insert it(sequential probing)
    • Searching: Going to the appropriate hash values and check to see if the item is the one we want. If not, we need to check through the length of the run
    • Deletion: Reinsert all the items in the run following the new hole
    • Traversal: O(m) since n must be at most m
  • Chaining and open addressing both require O(m) to initialize an m-element hash table to null elements prior to the first insertion

  • Pragmatically, a hash table is often the best data structure to maintain a dictionary

2. Efficient String Matching via Hashing

  • Data structure: array of characters

Operation

  • Substring search

Problem: Substring pattern matching

Input: A text string and a pattern string

Output: Does contain the pattern as a substring, and if so where?

Simple Algorithm

  • Search for the presence of pattern string in text overlays the pattern string at every position in the text, and checks whether the pattern matches the text
  • Runtime: ,

Rabin-Karp Algorithm

  • Run time: linear expected-time
  • Idea: Compute a given hash function on both the pattern string and the -character substring starting from the th position of
  • If the two strings are identical, their hash values must be the same. If not, their hash values are almost different. If somehow they are the same, we just need time to compare the strings.
  • So we have hash values computations( () for text and 1 for ), each computation takes time, plus a small number of verification with time, the runtime should be

But looking at the hash function starting from the th position of string:

Then,

This can be done in constant time. So the time to compute the hash values is actually $O(n+m)$ as long as we don’t get frequent collision(remember the verifications takes time ).

If the values of hash function ranges from 0 to, then the probability of a false collision should be .

3. Duplicate Detection Via Hashing

  • Is a given document different from all the rest in a large corpus?
    • Hash the new document into an integer, compare it to the hash codes of the rest of the corpus. Only when there is a collision is this a possible collision
  • Is part of this document plagiarized from a document in a large corpus?
    • Adding, deleting and changing even one character of the document will change the hash code
    • Build a hash table of all overlapping windows(substrings) of length in all the documents in the corpus. Whenever their is match of hash codes, there is likely a common substrings of length between these two documents
    • Downside: size of table becomes as large as the documents. But we can retain a small but well-chosen subset of these hash codes for each document
  • How can I convince you that a file isn’t change?
    • Check the hash codes of the documents to see whether it changes or not

4. Specialized Data Structures

Details are in Chapter 12.

  • String data structures - Suffix trees/arrays
  • Geometric data structures - kd trees
  • Graph data structures - Adjacency matrix and adjacency list
  • Set data structures - bit vectors, union-find data structures