Suffix Arrays

Created: 2019-11-06 Wed 15:37

Objectives

Your Objectives:

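  • Explain how to find an occurrence of any substring using \(O(\log n)\) suffix comparisons, i.e. \(O(m \log n)\) time for a pattern of length \(m\) in a text of length \(n\).
  • Find the longest repeated substring (the longest string that occurs at least twice) in a given text.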

Example

  • Suppose you have a string of 100,000,000 characters. You want to know if a certain string is in there. How can we do this quickly?
  • E.g., genetic sequences such as GATTACAGATTACAGATTACA

The idea

  • There is a data structure called a "suffix tree".
  • It uses a lot of memory, so we are not going to use it.
  • We will use a suffix array instead.

Example

  • Suppose we have the string "this is his". List every suffix next to its starting index:
| 0  | this is his
| 1  | his is his
| 2  | is is his
| 3  | s is his
| 4  |  is his
| 5  | is his
| 6  | s his
| 7  |  his
| 8  | his
| 9  | is
| 10 | s
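
The table above can be reproduced with a few lines of Python. This is only an illustrative sketch; the function name `suffixes` is an assumption made for this example and does not come from the slides.

```python
def suffixes(text):
    """Return (start_index, suffix) pairs for every suffix of text."""
    return [(i, text[i:]) for i in range(len(text))]


# Print the same index/suffix listing as the table.
# repr() makes the leading spaces of suffixes 4 and 7 visible.
for start, suffix in suffixes("this is his"):
    print(start, repr(suffix))
```

Note that materializing every suffix as its own string takes \(O(n^2)\) memory, which is why the suffix array keeps only the indices.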

Example, ctd.

  • Next, sort the suffixes lexicographically. The space character sorts before every letter:
| 7  |  his
| 4  |  is his
| 8  | his
| 1  | his is his
| 9  | is
| 5  | is his
| 2  | is is his
| 10 | s
| 6  | s his
| 3  | s is his
| 0  | this is his
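
A minimal sketch of this sorting step, assuming the simplest possible approach: sort the starting indices, using the suffix that begins at each index as the sort key. The name `build_suffix_array` is hypothetical, and this is the naive method, not one of the faster construction algorithms.

```python
def build_suffix_array(text):
    """Naive construction: sort start indices by the suffix each one begins.

    Python compares strings by code point, so ' ' (space) sorts before letters.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


sa = build_suffix_array("this is his")
print(sa)  # [7, 4, 8, 1, 9, 5, 2, 10, 6, 3, 0]
```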

Details

+--------------------------------------------+
| 7 | 4 | 8 | 1 | 9 | 5 | 2 | 10 | 6 | 3 | 0 |
+--------------------------------------------+
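  • The suffix array holds just the starting indices, in sorted order. The suffixes themselves are read from the original string.
  • Building it by sorting the suffixes directly takes \(O(n^2 \log n)\) time in the worst case: \(O(n \log n)\) comparisons, each examining up to \(n\) characters.
    • There are \(O(n \log n)\) and \(O(n)\) algorithms too!
  • How to use it…
    • To search for the string "his": binary search the sorted suffixes for one that begins with "his".
    • To find the length of the longest repeated substring?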
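
The search use case can be sketched as a binary search over the suffix array. This is an illustrative version under stated assumptions: the names `build_suffix_array` and `find_occurrence` are hypothetical, and the array is built with the naive sort. Each probe compares only the first \(m\) characters of a suffix against the pattern, so a lookup costs \(O(m \log n)\).

```python
def build_suffix_array(text):
    """Naive construction by sorting start indices (assumed helper)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrence(text, sa, pattern):
    """Return a start index of pattern in text, or -1 if it does not occur."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    # Find the first suffix whose m-character prefix is >= pattern.
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # That suffix starts with pattern exactly when the pattern occurs.
    if lo < len(sa) and text[sa[lo]:sa[lo] + m] == pattern:
        return sa[lo]
    return -1


text = "this is his"
sa = build_suffix_array(text)
print(find_occurrence(text, sa, "his"))  # 8
print(find_occurrence(text, sa, "xyz"))  # -1
```

The search returns whichever matching suffix sorts first, so for "his" it reports index 8 rather than index 1. All other occurrences sit in the entries immediately after it in the suffix array.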

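Related structure: LCP (longest common prefix) array

  • For each sorted suffix, record how many leading characters it shares with the suffix just before it.
  • The largest entry (3 here, for "his" and for "is ") is the length of the longest repeated substring.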
| 7  |  his         |   |
| 4  |  is his      | 1 |
| 8  | his          | 0 |
| 1  | his is his   | 3 |
| 9  | is           | 0 |
| 5  | is his       | 2 |
| 2  | is is his    | 3 |
| 10 | s            | 0 |
| 6  | s his        | 1 |
| 3  | s is his     | 2 |
| 0  | this is his  | 0 |
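
The third column can be computed by comparing each sorted suffix with its predecessor. The sketch below is a direct, unoptimized version. All function names are assumptions made for this example, and the first entry, which is blank in the table, is stored as 0.

```python
def build_suffix_array(text):
    """Naive construction by sorting start indices (assumed helper)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lcp_array(text, sa):
    """lcp[k] = shared prefix length of suffixes sa[k-1] and sa[k]; lcp[0] = 0."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        lcp[k] = common_prefix_len(text[sa[k - 1]:], text[sa[k]:])
    return lcp


def longest_repeated_substring(text):
    """Return a longest substring that occurs at least twice ('' if none)."""
    sa = build_suffix_array(text)
    lcp = lcp_array(text, sa)
    if not lcp:
        return ""
    k = max(range(len(lcp)), key=lcp.__getitem__)  # first position of the maximum
    return text[sa[k]:sa[k] + lcp[k]]


text = "this is his"
print(lcp_array(text, build_suffix_array(text)))  # [0, 1, 0, 3, 0, 2, 3, 0, 1, 2, 0]
print(repr(longest_repeated_substring(text)))     # 'his'
```

Two entries tie at 3 here ("his" and "is "). This sketch reports the one that appears first in sorted order.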