STRQUERY - Editorial

anton_lunyov · April 20, 2013, 9:39pm

PROBLEM LINK:

Practice
Contest

Author: Tom Chen
Tester: Hiroto Sekido
Editorialist: Anton Lunyov

DIFFICULTY:

HARD

PREREQUISITES:

Suffix array, Treap, Knuth–Morris–Pratt algorithm

PROBLEM:

You are given a string S of length 10. Its characters are numbered from 0.
The middle character is the character with index |S| div 2.
You need efficiently perform the following types of queries:

Count number of occurrences of the given query string q.
Delete the first character.
Insert given character before the first character.
Delete the last character.
Insert given character after the last character.
Delete the middle character.
Insert given character before the middle character.

There could be up to 150,000 queries and the overall length of all strings in count queries could be up to 1,500,000.

QUICK EXPLANATION:

The solution that will be described is a customized version of winger’s solution which I was lucky to understand completely (partially reinventing some parts independently of his code).

At first we develop the data structure that supports only modifications at the beginning of the string. For this we will maintain self-balancing binary search tree (BST) where each node corresponds to some suffix of the string and nodes are ordered according to lexicographical ordering of the corresponding suffixes. We maintain two arrays str and nodes. The array str contains the string itself in reversed order, the array nodes contains nodes of the tree. The key thing is that the node nodes[id] corresponds to the suffix that starts at str[id]. Each node should contain links to its sons and parent and also the size of its subtree. Let’s call it dynamic suffix array. Jump here for further representation details.

Then to perform count query we do some kind of binary search over this tree. Namely at each step we compare q with the current suffix. If q is not its prefix, we move either to the right subtree of this suffix or to the left one. Otherwise we found an occurrence and we should try both subtrees. To handle this efficiently we should pass two boolean parameters mathLeft and mathRight. The complexity is O(|q| * log|S|). Jump here for detailed explanation.

Delete query is very simple. We need to delete last character of str and last node of nodes. So we already know where this node is in the tree (we have access to its sons and parent). Hence we can run available delete query of the particular BST data structure we are using. The complexity is O(log|S|). Jump here for detailed explanation.

Insert query is the most tricky because we need to insert a new suffix cS to the tree. For this we need to compare this suffix with old suffixes. To compare cS with id-th suffix we at first compare their first characters c and str[id]. If they are different we should compare their remaining parts S and (id-1)-th suffix. But S is already in the tree. To compare two correct suffixes we calculate their order statistics using size of the subtrees. So compare routine is ready and then insert can be performed by available insert routine of BST. The final complexity is O(log²|S|). Jump here for detailed explanation.

Next we develop the data structure that supports modifications at both ends of the string. For this we maintain two dynamic suffix arrays left and right such that left contains some prefix of S and right contains the remaining suffix. Let’s call this data structure string deque. Then it is clear how to perform modifications. The only tricky case is when we want to delete the last character but right is empty (or the first character but left is empty). In this case using |S| insert queries we divide left into two equal parts and put first half in left and reversed second half in right. Then we can delete last character by calling delete query for right. To perform count query we count occurrences of q in left and right and then count occurrences that intersects with both left and right by constructing border string of length at most 2 * (|q| − 1) and using KMP. Jump here for detailed explanation.

To solve the actual problem we will maintain two string deques left and right. The first one will contain first |S| div 2 characters of S, while the second one will contain the remaining characters of S. So left and right should satisfy the inequality |left| ≤ |right| ≤ |left| + 1. It is clear how to perform modification queries − each such query will be a modification query for one of the string deques. Of course some modifications ruin the inequality |left| ≤ |right| ≤ |left| + 1. Hence we equalize left and right by additional movements of characters from one of them to another. The count query could be performed in the same way as in the string deque data structure, dividing all occurrences of q into three types and using auxiliary string border. This part is quite simple and similar to the previous part so refer to the editorialist’s solution for any implementation details.

EXPLANATION:

At first note that naive algorithms that work in O(|S| + |q|) time for each count query will not work fast enough in this problem. Though initial string is short, the following scenario is possible:

The first 75,000 queries are insert queries. So the string will have length 75,010 after them.
They are followed by 75,000 count queries where each query has length 20.

Clearly in this case naive processing of each count query in O(|S| + |q|) time will lead to about 75,000 * (75,010 + 20) ≅ 5 * 10¹⁰ elementary operations, while modern computers allow to perform about 10⁸ such operations in a second. Thus, we need to invent something clever here.

When there are no modification queries, the fast and simple data structure that allows to answer each count query in O(|q| * log|S|) time with O(|S| * log|S|) preprocessing is suffix array. Namely, we sort all non-empty suffixes of the string in lexicographical order and save the permutation of positions of their first characters into an array which is called the suffix array of the string. Then each count query could be performed by simple binary search over this array.

Of course, we can’t just take all suffixes and sort them − there are different efficient techniques exist to perform this (some of them are even linear in time). But we should deal with modification operations so actually well-known techniques will not work here and we should invent a new data structure.

Dynamic suffix array (back to QUICK EXPLANATION)

At first we develop the data structure that allows to perform efficiently first three types of queries. Namely, only modifications at the beginning of the string are allowed. We will achieve the following performance:

O(|S|) memory.
O(|q| * log|S|) time for each count query.
O(log|S|) time for each delete query.
O(log²|S|) time for each insert query.

For this we will maintain self-balancing binary search tree (BST) where each node corresponds to some suffix of the string and nodes are ordered according to lexicographical ordering of the corresponding suffixes.

Below for each suffix of the string we should have access to the corresponding node of the tree. Hence we will maintain two arrays:

The first array str will contain the actual string in reversed order.
The second array nodes will contain corresponding nodes of the tree.

Also we should maintain the root of the tree which can be stored as the index of the actual root in the array nodes and can be denoted as root.

Arrays str and nodes should allow random access to their elements and also deleting or adding elements in the end. In C++ the perfect fit is vector container. Alternatively we can define static arrays of large enough size and maintain the variable size that will contain the current length of the string. But there are some issues with stack overflow for such implementation (at least for me). In what follows we will use implementation with vectors in code snippets.

Because of such representation of our data structure and also for some future needs it will be convenient to store in each node the following information:

left − the index of its left son in nodes.
right − the index of its right son in nodes.
parent − the index of its parent in nodes. It will be needed for finding order statistics of suffixes and also for remove query.
size − the size of the subtree rooted at this node. It will be needed for order statistics as well, and also for count query.

Of course, some additional information, needed to keep tree self-balancing, should be stored as well. For example, in the case of a treap we need to store additional field y such that our tree is a heap considering y values as keys. Namely, the root of each subtree should have the largest y among all nodes of its subtree.

In what follows we will provide code snippets assuming that we are using treap. We will use syntax close to C++ in code snippets but indicate blocks by indentation instead of braces and also will not use semicolons at the end of lines. Also to distinguish variable or methods in comments we surround them by grave accents.

Next, we need empty node that will play a role of non-existing sons or parents. To handle this, it is convenient to have 0-th element of str to be zero character and 0-th element of nodes to be node with all zero fields (including size). So such initialization should be a part of the constructor of our data structure:

DynamicSuffixArray()
    str.push_back(0)
    nodes.push_back(Node())
    nodes[0].size = 0
    root = 0

Here Node is the data structure that maintain nodes in the tree. Its constructor in the case of a treap in view of definition of empty node has the form:

Node()
    left = right = parent = 0
    size = 1
    y = rand()

The node constructed in this way will be called a default node. Calling rand() here is important to keep treap balanced.

This is all about actual representation of our data structure and its purpose. Now we can describe how to perform queries. At first we define several simple routines.

size() − returns the length of the current string:

int size()
// “-1” because of the dummy zero character
return str.size() - 1
setParent(id, parent_id) − sets parent of the node id to parent_id. Has effect only for non-zero nodes:

setParent(int id, int parent_id)
if (id)
nodes[id].parent = parent_id
fixSize(id) − fixes size field of the node id after some possible changes in its subtree. It simply means to set its size field to the sum of sizes of its children plus one. Note that in view of setting size of zero node to zero we can do this without additional checks for its children:

fixSize(int id)
if (id)
nodes[id].size = 1 + nodes[nodes[id].left].size + nodes[nodes[id].right].size

The most tricky in this data structure is insert query since it requires to compare two suffixes when we know only their ids in the array nodes. But count and delete queries are quite standard and we start with them.

Count query (back to QUICK EXPLANATION)

This can be done similarly to the case of usual suffix array by some kind of binary search. Namely, the routine match(tree, q) will count the number of times, q is the prefix of some suffix in the subtree, rooted at the node tree. For this we should match q and the suffix of the node tree character-by-character: at i-th step we compare q[i] and str[tree - i] (recall that the string S is stored in reverse order).

If for the first i where they differ we have q[i] > str[tree - i] then q is lexicographically greater than this suffix (this happens in particular when tree = i since in this case str[0] = 0, this is the case when the current suffix is a proper prefix of q). So all occurrences of q are contained in the right subtree and the answer is match(nodes[tree].right, q). On the other hand if q[i] < str[tree - i] then all occurrences of q are in the left subtree and the answer is match(nodes[tree].left, q).

But if q appears to be a prefix of this suffix then we find the occurrence, and also some other occurrences could be in both subtrees. If we simply run this routine on both subtrees this will be too slow in the case when there are many occurrences of q in the string.

To handle this issue we need to pass two additional boolean parameters matchLeft and matchRight to match routine. The first one is true if we already know that the rightmost suffix, left to this subtree, starts with q. mathRight has similar meaning. Then in the case when root of the subtree has q as a prefix, the answer will be composed of the found occurrence, occurrences in the left subtree with matchRight = true (and the same matchLeft as passed to this subtree) and occurrences in the right subtree with matchLeft = true (and the same matchRight as passed to this subtree). Now if both matchLeft and matchRight are true we definitely know that all suffixes in this subtree start with q and we can simply add its size to the answer. This simplification is the main point why updated match routine becomes efficient. Now we present the code snippet:

int match(int tree, const string& q, bool matchLeft, bool matchRight)
    if (!tree)
        return 0
    if (matchLeft && matchRight)
        return nodes[tree].size
    for (int i = 0; i < q.size(); ++i)
        if (q[i] > str[tree - i])
            return match(nodes[tree].right, q, false, matchRight)
        else if (q[i] < str[tree - i])
            return match(nodes[tree].left, q, matchLeft, false)
    return 1 +
        match(nodes[tree].left, q, matchLeft, true) +
        match(nodes[tree].right, q, true, matchRight)

To find the number of occurrences of q in the whole string S we should call match(root, q, false, false). It can be shown using some amortized analysis (which is usual for segment tree) that the overall number of recursive calls of match routine is O(H), where H is the height of the tree. Since we are using self-balancing BST, which is characterized by H = O(log|S|), this leads to the mentioned O(|q| * log|S|) complexity for this query.

Delete query (back to QUICK EXPLANATION)

We need to delete first character of the string which has index id := size() in the array str and also corresponding suffix (which is the whole string S) from the tree. The node of this suffix, of course, has the same index id in the array nodes so we know where it is in this tree (we can access its sons and parent). Hence we can easily delete it from the tree by available delete operation of the particular BST data structure we are using (note that we don’t need to compare any nodes for this). In particular, in the case of a treap we should merge sons of the node id, replace it by their merge and after that decrease by 1 field size of each node on the path from the parent of the node id to the root. We call public method for delete query as pop. It will also return the deleted character which will be useful later. In turn, pop will call private routine remove that do the actual deleting of node from the tree. The corresponding code snippet for pop has the form:

// Returns the deleted character.
char pop()
    // Removes last node from the tree and updates the root.
    remove(size())
    nodes.pop_back()
    // Removes last character of the string and returns its value.
    char c = str.back()
    str.pop_back()
    return c

This is standard remove routine in the treap if we already know the node we should delete:

// Removes node `id` from the tree and update the root.
// Vector `nodes` does not change.
void remove(int id)
    int par = nodes[id].parent
    int tmp = merge(nodes[id].left, nodes[id].right)
    // `tmp` is id of node that contains merge of children of the node `id`.
    // `par` is the parent of `tmp` since `tmp` is replacement of `id`.
    setParent(tmp, par)
    if (par)
        // If `id` has parent then root remains the same,
        // but we need to replace `id` by `tmp` in sons of the node `par`.
        if(nodes[par].left == id)
            nodes[par].left = tmp
        else
            nodes[par].right = tmp
        while (par)
            // Since we delete one node in subtree of `id`,
            // we should decrease the field `size` of each its ancestor by 1.
            nodes[par].size--
            par = nodes[par].parent
    else
        // Otherwise `id` was the root and the new root is `tmp`.
        root = tmp

This is the standard merge routine in the treap:

// Merges subtrees rooted at `L` and `R` and save result in one of the them.
// Each node in the subtree rooted at `L` should be less than
// each node in the subtree rooted at `R`.
int merge(int L, int R)
    // We return L when R is 0 and R when L is 0.
    if (!L || !R)
        return L + R
    // We should preserve the heap property by field `y`.
    if (nodes[L].y < nodes[R].y)
        // In this case `R` should be a root and we merge `L` and left son of `R`,
        // and result is saved in `tmp`.
        int tmp = merge(L, nodes[R].left)
        // `tmp` will be the new left son of `R`.
        setParent(tmp, R)
        nodes[R].left = tmp
        fixSize(R)
        return R
    else
        // In this case `L` will be a root.
        int tmp = merge(nodes[L].right, R)
        setParent(tmp, L)
        nodes[L].right = tmp
        fixSize(L)
        return L

Since merge has complexity O(H) = O(log|S|) and in remove we have one call to merge and one simple cycle with at most H iterations the complexity of delete query is O(log|S|).

Insert query (back to QUICK EXPLANATION)

We need to insert character c to the beginning of the string S which leads to inserting suffix cS to the dynamic suffix array. Hence we add c to the array str and also add default node to the array nodes which will correspond to this suffix. Now the hardest part is to actually insert this node to the tree preserving ordering of suffixes. For this we do some kind of binary search and need to compare this new suffix with the old suffixes.

To perform comparison of the string cS with the id-th suffix (the suffix that starts with the character str[id]) we at first compare their first characters c = str[size()] and str[id] and if they are different we are done. Otherwise we should compare S, which is the (size() - 1)-th suffix of the old tree, with the (id - 1)-th suffix.

For this we use routine getIndex(id) that returns the actual position (1-based) of the id-th suffix in the usual suffix array of the current string S (0-th suffix, which is an empty string, has position 0). This is a standard routine for BST where each node has field size. At first we add to the answer one more the size of the left subtree of the node id (since all nodes in the left subtree are less than node id). Then for each ancestor par of the node id we add one more the size of the left subtree of the node par in the case if id is contained in the right subtree of par (in this case par is less than id as well as the whole its left subtree). Clearly all other nodes are greater than id. Having this routine coded we can simple compare getIndex(size() - 1) and getIndex(id - 1) in the above compare routine. We will follow common practice to return signed integer in compare routine such that its sign contains the answer.

Summarizing we have the following code snippets for getIndex and compare routines:

int getIndex(int id)
    if (!id)
        return 0
    int index = nodes[nodes[id].left].size + 1
    while (nodes[id].parent)
        int par = nodes[id].parent  // Ancestor of the initial node `id`
        if (id == nodes[par].right)
            index += nodes[nodes[par].left].size + 1
        id = par
    return index

int compare(int id1, int id2)
    int cmp = str[id1] - str[id2]
    if (cmp == 0)
        cmp = getIndex(id1 - 1) - getIndex(id2 - 1)
    return cmp

Now we are ready to insert the new suffix to the tree. Having compare routine ready, this is a standard routine for particular BST data structure we are using. Corresponding code snippet in the case of a treap looks as follows:

// Inserts node `id` to the subtree rooted at `tree`
// and returns the new root of this subtree (it may change).
int insert(int tree, int id)
    // In the case of empty subtree the node `id` will be a new root.
    if (!tree)
        return id
    // The treap should be a heap by `y` fields.
    // Hence if `id` has larger `y` than `tree`, `id` should be a new root.
    if (nodes[tree].y < nodes[id].y)
        // In this case we split subtree rooted at `tree` into two parts:
        // 1) with suffixes < the `id`-th suffix (the root is left son of `id`);
        // 2) with suffixes > the `id`-th suffix (the root is right son of `id`).
        split(tree, id, nodes[id].left, nodes[id].right)
        // We set parents of both left and right sons if `id` to be `id`.
        setParent(nodes[id].left, id)
        setParent(nodes[id].right, id)
        fixSize(id)
        return id  // This case is terminal for insert query.
    // Otherwise we should move `id` to one of the subtrees of the node `tree`,
    // selecting the correct subtree by comparing suffixes at `tree` and `id`.
    int cmp = compare(id, tree)
    if (cmp < 0)
        // If `id` is less than `tree` than we insert it to the left subtree.
        // `tmp` is the root of this subtree after inserting.
        int tmp = insert(nodes[tree].left, id)
        // And since root of this subtree could change,
        // we fix relation between `tree` and `tmp`
        setParent(tmp, tree)
        nodes[tree].left = tmp
    else
        // Now cmp > 0 since all suffixes are different
        // and we insert `id` to the right subtree.
        int tmp = insert(nodes[tree].right, id)
        setParent(tmp, tree)
        nodes[tree].right = tmp
    // We should fix the field `size` of the node `tree`.
    // We can simply increase it by 1 since we add only one node to its subtree.
    nodes[tree].size++
    return tree  // `tree` remains the root of its subtree

The split routine used in insert looks as follows:

// Splits the subtree rooted at `tree` by the node `id` into two parts:
// 1) with suffixes < the `id`-th suffix (the root is `L`);
// 2) with suffixes > the `id`-th suffix (the root is `R`).
// `id`-th node does not have to be a correct node of the tree.
// But node `id - 1` should be.
void split(int tree, int id, int& L, int& R)
    // In the case of empty subtree there is nothing to split,
    // and both `L` and `R` should be empty trees as well.
    if (!tree)
        L = R = 0
        return
    // Otherwise we compare the root of the subtree with the node `id`.
    int cmp = compare(id, tree)
    if (cmp < 0)
        // If `id` < `tree` then we should split left subtree of `tree` by `id`
        // and place the second subtree to the root of left subtree itself,
        split(nodes[tree].left, id, L, nodes[tree].left)
        setParent(nodes[tree].left, tree)
        R = tree
    else
        // Otherwise cmp > 0 and we split right subtree by the node `id`.
        split(nodes[tree].right, id, nodes[tree].right, R)
        setParent(nodes[tree].right, tree)
        L = tree
    fixSize(tree)

We see that split and insert calls for compare O(log|S|) times while each compare calls getIndex which has complexity O(log|S|). Hence overall complexity of push routine is O(log²|S|) as announced above. BTW, it looks as follows:

// Inserts character `c` to the beginning of the string.
void push(char c)
    str.push_back(c)         // We add `c` to the array `str`.
    nodes.push_back(Node())  // We add default node to the array `nodes`...
    // ...and then actually insert this node to the tree and update the root.
    root = insert(root, size())

There is a place for improvement in the implementation of insert query. We calculate getIndex(size() - 1) O(log|S|) times during insert query. If we calculate it only once, save it and reuse saved value, it should give double acceleration. I did not describe this in detail here to keep code simple and elegant. But it is implemented in optimized editorialist’s solution. On official test data this optimization gives acceleration in 1.5 times: the maximal timing decreases from 0.80 to 0.53.

Now dynamic suffix array data structure is completely described.

String deque (back to QUICK EXPLANATION)

The next step is to cover two more types of queries. So we now allow modifications at both ends of the string. It is quite clear how to implement this using two dynamic suffix arrays. Namely, we maintain two dynamic suffix arrays left and right. At each moment left corresponds to some prefix A of the string S, while right corresponds to the reverse of the remaining suffix B of S. Thus, S = A + B = left + reverse(right). Then queries could be performed as follows:

To insert the character c at the end of the string just call right.push(c).
To insert the character c at the beginning of the string just call left.push(c).
To delete the last character just call right.pop() if right is non-empty. But in the case of empty right we should delete the last character in left, which is not supported. Hence using |S| insert queries we save first (|S| + 1) div 2 characters of left in right and save the last |S| div 2 characters in left. Then we can call right.pop() safely.
To delete the first character just call left.pop() if left is non-empty. In the case of empty left we do similar reallocation as in the previous query.
To perform count query we count occurrences of q in A using available match routine in left, then count occurrences of reverse of q in B calling match routine in right (reverse, since right contains reversed version of B). Finally we need to count occurrences of q that intersects with both A and B. For this we form string border as concatenation of min(|q| − 1, |A|) last characters of A and min(|q| − 1, |B|) first characters of B and using some standard algorithms for string matching (e.g. Knuth-Morris-Pratt algorithm) count occurrences of q in border in O(|q| + |border|) = O(|q|) time.

It should be noted that though each reallocation in delete query has complexity O(|S| * log²|S|), the overall complexity for all delete queries is O(Q * log² Q), where Q is the total number of queries. This is because if some reallocation happens the next one will happen after at least |S|/2 queries.

Thus, string deque is a data structure that already supports first 5 types of queries but now complexity of each delete query is O(log²|S|) in average, while other types of queries have the same complexity as in dynamic suffix array. Since this data structure is quite simple, refer to the editorialist’s solution for any implementation details.

It is described in the QUICK EXPLANATION section how to solve the actual problem then (jump here).

ALTERNATIVE SOLUTIONS:

Author’s and tester’s solutions are based on similar idea. But they use long double tag to compare suffixes and probably their solutions have better complexity for insert query. But I can’t say for sure since they recalculate these tags for all nodes when precision issues are possible. Namely, when the new node is inserted between nodes node1 and node2 its tag is set to be (node1.tag + node2.tag)/2. This will ensure property that larger suffixes has larger tags. But if at some moment we have two consecutive nodes with indistinguishable tags under double precision we should recalculate all tags in O(|S|) time using some tree traversal to avoid precision issues. At least this is how I understand their approach. But maybe I am wrong since for me it seems possible to create a test where the number of tags recalculations will be at least Q/40, which leads to quadratic in worst case solution, though with very small constant.

Dynamic Extended Suffix Array should do the job too. Thanks to @cyberax for noting this. But this article is 29 pages long and data structure invented there solves much more complex problem where modification of each character is allowed. So this probably will be an overkill here.

Moreover, most of the contestants answer queries offline saving all the strings of count queries in a trie.
It would be interesting to know the details of such an approach.

AUTHOR’S, TESTER’S AND EDITORIALIST’S SOLUTIONS:

Author’s solution can be found here. It’s maximal timing is 0.64.

Tester’s solution can be found here. It’s maximal timing is 0.55.

Editorialist’s solution can be found here. It’s maximal timing is 0.80.
It is the best viewed at mono-space editor. For example here.

Optimized editorialist’s solution can be found here. It’s maximal timing is 0.48.
This is achieved by the following two optimizations taken from winger’s solution:

Save order statistic of suffix S in insert query and use it in compare routine. Gives boost from 0.80 to 0.53.
Calculate number of occurrences of q in border on the fly without constructing the string border itself.