As requested by @everule1
Quick info:
- |s| denotes the length of a string s
- I will use 0-indexing for strings, always
- A great article on this algorithm, from which I take my inspiration, is here
- Code is again in C++. An implementation warning: the
substr
function in C++ takes the starting position as the first argument, and the desired length as the second one. This is different from other languages, such as Java. - A “valid prefix/suffix length,” which I will use as shorthand, is a length such that the prefix of that length of a string is equal to the string’s suffix of that length
- I know you don’t have to initialize values in a
vector
to 0, but I do so for clarity - When a and b are strings, ab represents the concatenation of a and b, that is, a + b
This is a specific algorithm I’m going to talk about. So before I begin the explanation, let me describe what we’re actually trying to do here.
The motivation (pattern matching)
You have a text t and a pattern p, both strings, where ideally |p| < |t|. Your goal is to find all of the (possibly overlapping) occurrences of p in t. Example time: take p as abca
and t as abdabcabca
. Then p occurs twice in t: once beginning at index 3 and once at index 6. Specifically:
0 1 2 3 4 5 6 7 8 9 a b d a b c a b c a a b c a (index 3) a b c a (index 6)
Our intended complexity is O(|p| + |t|).
Prerequisites
None! Just sit back and enjoy the ride.
Prefix function
The crucial part of the algorithm is this cool array called the “prefix function.” I’ll get into why it’s so important later, but for now, let’s just figure out what it is and how to get the values for it. We’re going to compute this function for the pattern p.
The prefix function, for a string s, is defined as the length of the longest proper prefix of s that is also a suffix of s. A proper (I’ll stop italicizing it now) prefix is any prefix of s that isn’t s itself, that is, any prefix of s. with length < |s|. Example time #2: let’s just actually compute this for the string s = abcabca
.
Length 0: not interesting.
Length 1: valid, the prefix a
is equal to the suffix a
Length 2: invalid, the prefix ab
is not equal to the suffix ca
Length 3: invalid, abc
is not equal to bca
Length 4: valid, abca
= abca
Length 5: invalid, abcab
\neq cabca
Length 6: invalid, abcabc
\neq bcabca
Length 7: invalid, because a proper prefix can’t be the whole string
So 4 is the longest length l in which the prefix of length l is equal to the suffix of length l. Great. So that seems naively O(n^2) (with n = |s|) because we have O(n) lengths to check, and O(n) time for each string comparison (yes, there are many test cases that can make the complexity actually O(n^2)). But wait, it’s worse. Because to make this work, we need to compute the prefix function for all prefixes of the pattern p. So that’s O(n^3), which is not great. Most string problems have lengths of 10^5! But we can do so much better.
Even before we start optimizing, let’s just get the notation out of the way. For some reason, they decided to use pi (\pi) as the symbol for the prefix function, which makes my life harder because I have to type it out (or copy/paste it) in math format each time. This means \pi[i] is the value of the prefix function for the substring s[0 \dots i] (remember, 0-indexing). It should be clear that \pi[0] = 0 because a string of length 1 has no proper prefixes.
Just to be absolutely clear, I’ll give one more example of the \pi-values for the string abcabca
(for all prefixes this time). The values of \pi are [0, 0, 0, 1, 2, 3, 4] for that string.
Naive algorithm - O(n^3)
We’ve basically already been over this. For all O(n) lengths from 1 to n, try all possible O(n) lengths for the prefix/suffix, with each comparison taking O(n) time.
Code (for understanding)
vector<int> compute_pi(string pat) {
int n = pat.length();
vector<int> pi(n);
for (int i = 0; i < n; i++) {
pi[i] = 0;
for (int j = 0; j < i; j++) { /* trying length j + 1 */
if (pat.substr(0, j + 1) == pat.substr(i - j, j + 1)) {
pi[i] = j + 1;
}
}
}
return pi;
}
This sucks. Let’s make some observations.
Improved algorithm - O(n^2)
The first observation we make is that \pi[i + 1] \leq \pi[i] + 1. This means that if we know \pi[i], the \pi-value for the string that’s longer by 1 is at most 1 greater than \pi[i] (why did I write that out, what am I doing).
Stop and think about what that means. At each step, either we increase the number of candidate lengths by 1, or we actually check some of those candidate lengths, but in doing so we remove them from being candidate lengths. What I mean is, let’s say \pi[i] = 573 for some i, and we want to find out i + 1 now. Let’s say we check 574, and it doesn’t work. So we check 573, 572, etc. until finally length 123 works. So we may have checked O(n) lengths this iteration, but now we have that many fewer lengths to check next time (in the example, the highest possible value we’d check next is 124). In total, that comes out to O(n) string comparisons in total (around 2n at worst). That means we’ve gotten it down to O(n^2)!
Okay, let’s not get ahead of ourselves. We should prove it first, and that’s what I’ll do right now.
Proof
It’s actually easier to show the opposite, yet equivalent, equation: \pi[i] \geq \pi[i + 1] - 1. Let me give a general string with any \pi[i + 1] > 1 (if \pi[i + 1] \leq 1, obviously \pi[i] \geq 0):
s_0s_1s_2s_3s_4
where s_0 = s_3, s_1 = s_4, s_0s_1 = s_3s_4, s_ 2 is anything, |s_0| + |s_1| = |s_3| + |s_4| = \pi[i + 1], and |s_1| = |s_4| = 1. That’s kind of a lot to digest at once. Basically what that all means is that s_0s_1 and s_3s_4 are equal and form the prefix/suffix of length \pi[i + 1]. What’s also important is that s_1 and s_4 are single characters. It’s very possible that there is overlap between the prefix/suffix and s_2 doesn’t exist at all, but it doesn’t change anything. If we now move back to the prefix of length i, the string we have is:
s_0s_1s_2s_3
where s_0 = s_3 as we’ve already said. It’s clear that |s_0| is a valid equal prefix/suffix for the i-th prefix (\pi[i] may be larger than |s_0| though), so \pi[i] \geq |s_0| = (|s_0| + 1) - 1= (|s_0| + |s_1|) - 1 = \pi[i + 1] - 1, and by transitivity, \pi[i] \geq \pi[i + 1] - 1. So the proof is done!
Code
vector<int> compute_pi(string pat) {
int n = pat.length();
vector<int> pi(n);
pi[0] = 0;
for (int i = 1; i < n; i++) {
pi[i] = 0;
for (int j = pi[i - 1]; j >= 0; j--) { /* trying length j + 1 */
if (pat.substr(0, j + 1) == pat.substr(i - j, j + 1)) {
pi[i] = j + 1;
break; /* we're going backwards, so we can't do better than this */
}
}
}
return pi;
}
I know I said there were no prerequisites. But any experienced user who knows hashing can recognize that we can actually stop right here - we have O(n) string comparisons, and hashing will make them O(1) each, so we’ve already done it in O(n). But hashing is either random (with a chance of failing) or quite slow (using multiple hashes, which still has a chance of failing). So we’ll take this one step further, for a fully deterministic (and clean) solution.
Final algorithm - O(n)
Okay. I think this part will be the hardest to understand. I’m going to try my best, but please complain in the comments if something’s unclear. But also, please re-read it at least a couple of times and try to let everything sink in.
There’s still room to improve - the string comparisons are O(n). What if we strategically check prefixes that allow us to compare in O(1)?
Let’s assume we’re going to “extend” some suffix of i to get \pi[i + 1], because if we don’t, the length is at most 1 so it’s trivial. We first try extending the suffix of length j = \pi[i] by checking if s[j] = s[i + 1]. If that works, cool, we’re done already for that length! Otherwise, we need to find the next-longest possible j that still works as an equal prefix/suffix for i, that is, the next-longest suffix length that we could “extend” from i. And if that j doesn’t work, we find the next-shortest j again, and so on. So now let’s treat j as a generic length, but one that works as a valid prefix/suffix length for i. I’ll make another string “diagram” (again, if there’s overlap between the prefix and suffix, it doesn’t matter):
s_0s_1s_2s_3s_4
We’re looking for the largest k < j such that k is also a valid prefix/suffix length. Let s_0s_1 = s_3s_4, |s_0| + |s_1| = |s_3| + |s_4| = j (that means s_3s_4 is the suffix we’re working with). Furthermore, without loss of generality, let |s_0| = |s_4| = k, and s_0 = s_4. Then, because s_0s_1 = s_3s_4, we can write the string s_0s_1 as s_0s_5s_4 where, if there isn’t overlap, s_1 = s_5s_4 (but again, overlap doesn’t matter). Now if we take the \pi-value for the prefix of length |s_0| + |s_1|, it must be equal to k because k is, by definition, the longest length such that s_0 = s_4,|s_0| = |s_4| = k (which is also, conveniently, the definition of the prefix function). So the k to try next is the \pi-value of the prefix length |s_0| + |s_1| = j, which is \pi[j - 1].
This means the algorithm actually comes out to be really simple: set j to \pi[i], then while j > 0 and the current j doesn’t work, keep setting j to \pi[j - 1]. Notice that we still have at most O(n) string comparisons, but each one only compares two characters, so it’s now O(1) per comparison. That means we’ve done it - we can compute the whole prefix function in O(n)!
Code
vector<int> compute_pi(string pat) {
int n = pat.length();
vector<int> pi(n);
pi[0] = 0;
for (int i = 1; i < n; i++) {
pi[i] = 0;
int j = pi[i - 1]; /* trying length j + 1 */
while (j > 0 && pat[j] != pat[i]) {
j = pi[j - 1];
}
if (pat[j] == pat[i]) {
pi[i] = j + 1;
}
}
return pi;
}
Actual pattern matching
Now that we’ve gone through the hell of figuring out how the prefix function works, this part is actually extremely simple. Let’s choose some arbitrary separator that isn’t in either of the strings. I will choose the dollar sign, \$, to symbolize my pain of typing this entire thing out in math mode. Again, let the pattern be p and the text be t. We’ll make the string p\$t and compute the prefix function for it. Notice how the value of the prefix function will never exceed |p|, because if so, some character would have to equal the separator.
Furthermore, if \pi[i] = |p| for some i, then we’ve actually found a match that ends at i! After all that you’ve been through with the prefix function, this part should be obvious - because the prefix of our new string of length |p| is exactly p, if \pi[i] = |p| that means p is equal to the suffix of length |p| ending at i. So we just take all positions where \pi[i] = |p|.
In total, this is O(|p| + |t|) in both time and memory. However, you can optimize it to O(|p|) memory by not explicitly storing the prefix values after the separator as it isn’t necessary. The implementation below does not do this.
Code
vector<int> find_matches(string text, string pat) {
int n = pat.length(), m = text.length();
string s = pat + "$" + text;
vector<int> pi = compute_pi(s), ans;
for (int i = n + 1; i <= n + m; i++) { /* n + 1 is where the text starts */
if (pi[i] == n) {
ans.push_back(i - 2 * n); /* i - (n - 1) - (n + 1) */
}
}
return ans;
}
This code (the O(|p| + |t|) version only) has been tested on this problem, as well as this problem. The more naive versions might not even compile, they’re just there for understanding and you shouldn’t use them.
So, for the people that actually read this. It might be clear to see that I’m struggling with how to title this “series” - do you guys have any suggestions? It may also allow me to use a custom tag so all my tutorials can be in one place. I could also just go with the boring (yet intuitive) “galencolin” or “galencolin-tutorial”.