You are given a string S of length 10. Its characters are numbered from 0.
The middle character is the character with index |S| div 2.
You need efficiently perform the following types of queries:
- Count number of occurrences of the given query string q.
- Delete the first character.
- Insert given character before the first character.
- Delete the last character.
- Insert given character after the last character.
- Delete the middle character.
- Insert given character before the middle character.
There could be up to 150,000 queries and the overall length of all strings in count queries could be up to 1,500,000.
The solution that will be described is a customized version of winger’s solution which I was lucky to understand completely (partially reinventing some parts independently of his code).
At first we develop the data structure that supports only modifications at the beginning of the string. For this we will maintain self-balancing binary search tree (BST) where each node corresponds to some suffix of the string and nodes are ordered according to lexicographical ordering of the corresponding suffixes. We maintain two arrays
nodes. The array
str contains the string itself in reversed order, the array
nodes contains nodes of the tree. The key thing is that the node
nodes[id] corresponds to the suffix that starts at
str[id]. Each node should contain links to its sons and parent and also the size of its subtree. Let’s call it dynamic suffix array. Jump here for further representation details.
Then to perform count query we do some kind of binary search over this tree. Namely at each step we compare
q with the current suffix. If
q is not its prefix, we move either to the right subtree of this suffix or to the left one. Otherwise we found an occurrence and we should try both subtrees. To handle this efficiently we should pass two boolean parameters
mathRight. The complexity is O(|q| * log|S|). Jump here for detailed explanation.
Delete query is very simple. We need to delete last character of
str and last node of
nodes. So we already know where this node is in the tree (we have access to its sons and parent). Hence we can run available delete query of the particular BST data structure we are using. The complexity is O(log|S|). Jump here for detailed explanation.
Insert query is the most tricky because we need to insert a new suffix cS to the tree. For this we need to compare this suffix with old suffixes. To compare cS with
id-th suffix we at first compare their first characters c and
str[id]. If they are different we should compare their remaining parts S and
(id-1)-th suffix. But S is already in the tree. To compare two correct suffixes we calculate their order statistics using size of the subtrees. So compare routine is ready and then insert can be performed by available insert routine of BST. The final complexity is O(log2|S|). Jump here for detailed explanation.
Next we develop the data structure that supports modifications at both ends of the string. For this we maintain two dynamic suffix arrays
right such that
left contains some prefix of S and
right contains the remaining suffix. Let’s call this data structure string deque. Then it is clear how to perform modifications. The only tricky case is when we want to delete the last character but
right is empty (or the first character but
left is empty). In this case using |S| insert queries we divide
left into two equal parts and put first half in
left and reversed second half in
right. Then we can delete last character by calling delete query for
right. To perform count query we count occurrences of
right and then count occurrences that intersects with both
right by constructing border string of length at most 2 * (|q| − 1) and using KMP. Jump here for detailed explanation.
To solve the actual problem we will maintain two string deques
right. The first one will contain first |S| div 2 characters of S, while the second one will contain the remaining characters of S. So
right should satisfy the inequality |left| ≤ |right| ≤ |left| + 1. It is clear how to perform modification queries − each such query will be a modification query for one of the string deques. Of course some modifications ruin the inequality |left| ≤ |right| ≤ |left| + 1. Hence we equalize
right by additional movements of characters from one of them to another. The count query could be performed in the same way as in the string deque data structure, dividing all occurrences of
q into three types and using auxiliary string
border. This part is quite simple and similar to the previous part so refer to the editorialist’s solution for any implementation details.
At first note that naive algorithms that work in O(|S| + |q|) time for each count query will not work fast enough in this problem. Though initial string is short, the following scenario is possible:
- The first 75,000 queries are insert queries. So the string will have length 75,010 after them.
- They are followed by 75,000 count queries where each query has length 20.
Clearly in this case naive processing of each count query in O(|S| + |q|) time will lead to about 75,000 * (75,010 + 20) ≅ 5 * 1010 elementary operations, while modern computers allow to perform about 108 such operations in a second. Thus, we need to invent something clever here.
When there are no modification queries, the fast and simple data structure that allows to answer each count query in O(|q| * log|S|) time with O(|S| * log|S|) preprocessing is suffix array. Namely, we sort all non-empty suffixes of the string in lexicographical order and save the permutation of positions of their first characters into an array which is called the suffix array of the string. Then each count query could be performed by simple binary search over this array.
Of course, we can’t just take all suffixes and sort them − there are different efficient techniques exist to perform this (some of them are even linear in time). But we should deal with modification operations so actually well-known techniques will not work here and we should invent a new data structure.
##Dynamic suffix array (back to QUICK EXPLANATION)
At first we develop the data structure that allows to perform efficiently first three types of queries. Namely, only modifications at the beginning of the string are allowed. We will achieve the following performance:
- O(|S|) memory.
- O(|q| * log|S|) time for each count query.
- O(log|S|) time for each delete query.
- O(log2|S|) time for each insert query.
For this we will maintain self-balancing binary search tree (BST) where each node corresponds to some suffix of the string and nodes are ordered according to lexicographical ordering of the corresponding suffixes.
Below for each suffix of the string we should have access to the corresponding node of the tree. Hence we will maintain two arrays:
- The first array
strwill contain the actual string in reversed order.
- The second array
nodeswill contain corresponding nodes of the tree.
Also we should maintain the root of the tree which can be stored as the index of the actual root in the array
nodes and can be denoted as
nodes should allow random access to their elements and also deleting or adding elements in the end. In C++ the perfect fit is
vector container. Alternatively we can define static arrays of large enough size and maintain the variable
size that will contain the current length of the string. But there are some issues with stack overflow for such implementation (at least for me). In what follows we will use implementation with vectors in code snippets.
Because of such representation of our data structure and also for some future needs it will be convenient to store in each node the following information:
left− the index of its left son in
right− the index of its right son in
parent− the index of its parent in
nodes. It will be needed for finding order statistics of suffixes and also for remove query.
size− the size of the subtree rooted at this node. It will be needed for order statistics as well, and also for count query.
Of course, some additional information, needed to keep tree self-balancing, should be stored as well. For example, in the case of a treap we need to store additional field
y such that our tree is a heap considering
y values as keys. Namely, the root of each subtree should have the largest
y among all nodes of its subtree.
In what follows we will provide code snippets assuming that we are using treap. We will use syntax close to C++ in code snippets but indicate blocks by indentation instead of braces and also will not use semicolons at the end of lines. Also to distinguish variable or methods in comments we surround them by grave accents.
Next, we need empty node that will play a role of non-existing sons or parents. To handle this, it is convenient to have 0-th element of
str to be zero character and 0-th element of
nodes to be node with all zero fields (including
size). So such initialization should be a part of the constructor of our data structure:
DynamicSuffixArray() str.push_back(0) nodes.push_back(Node()) nodes.size = 0 root = 0
Node is the data structure that maintain nodes in the tree. Its constructor in the case of a treap in view of definition of empty node has the form:
Node() left = right = parent = 0 size = 1 y = rand()
The node constructed in this way will be called a default node. Calling
rand() here is important to keep treap balanced.
This is all about actual representation of our data structure and its purpose. Now we can describe how to perform queries. At first we define several simple routines.
size()− returns the length of the current string:
// “-1” because of the dummy zero character
return str.size() - 1
setParent(id, parent_id)− sets parent of the node
parent_id. Has effect only for non-zero nodes:
setParent(int id, int parent_id)
nodes[id].parent = parent_id
sizefield of the node
idafter some possible changes in its subtree. It simply means to set its
sizefield to the sum of sizes of its children plus one. Note that in view of setting size of zero node to zero we can do this without additional checks for its children:
nodes[id].size = 1 + nodes[nodes[id].left].size + nodes[nodes[id].right].size
The most tricky in this data structure is insert query since it requires to compare two suffixes when we know only their ids in the array
nodes. But count and delete queries are quite standard and we start with them.
Count query (back to QUICK EXPLANATION)
This can be done similarly to the case of usual suffix array by some kind of binary search. Namely, the routine
match(tree, q) will count the number of times,
q is the prefix of some suffix in the subtree, rooted at the node
tree. For this we should match
q and the suffix of the node
tree character-by-character: at
i-th step we compare
str[tree - i] (recall that the string S is stored in reverse order).
If for the first
i where they differ we have
q* > str[tree - i] then
q is lexicographically greater than this suffix (this happens in particular when
tree = i since in this case
str = 0, this is the case when the current suffix is a proper prefix of
q). So all occurrences of
q are contained in the right subtree and the answer is
match(nodes[tree].right, q). On the other hand if
q* < str[tree - i] then all occurrences of
q are in the left subtree and the answer is
q appears to be a prefix of this suffix then we find the occurrence, and also some other occurrences could be in both subtrees. If we simply run this routine on both subtrees this will be too slow in the case when there are many occurrences of
q in the string.
To handle this issue we need to pass two additional boolean parameters
match routine. The first one is
true if we already know that the rightmost suffix, left to this subtree, starts with
mathRight has similar meaning. Then in the case when root of the subtree has
q as a prefix, the answer will be composed of the found occurrence, occurrences in the left subtree with
matchRight = true (and the same
matchLeft as passed to this subtree) and occurrences in the right subtree with
matchLeft = true (and the same
matchRight as passed to this subtree). Now if both
true we definitely know that all suffixes in this subtree start with
q and we can simply add its size to the answer. This simplification is the main point why updated
match routine becomes efficient. Now we present the code snippet:
int match(int tree, const string& q, bool matchLeft, bool matchRight) if (!tree) return 0 if (matchLeft && matchRight) return nodes[tree].size for (int i = 0; i < q.size(); ++i) if (q* > str[tree - i]) return match(nodes[tree].right, q, false, matchRight) else if (q* < str[tree - i]) return match(nodes[tree].left, q, matchLeft, false) return 1 + match(nodes[tree].left, q, matchLeft, true) + match(nodes[tree].right, q, true, matchRight)
To find the number of occurrences of
q in the whole string S we should call
match(root, q, false, false). It can be shown using some amortized analysis (which is usual for segment tree) that the overall number of recursive calls of
match routine is O(H), where H is the height of the tree. Since we are using self-balancing BST, which is characterized by H = O(log|S|), this leads to the mentioned O(|q| * log|S|) complexity for this query.
Delete query (back to QUICK EXPLANATION)
We need to delete first character of the string which has index
id := size() in the array
str and also corresponding suffix (which is the whole string S) from the tree. The node of this suffix, of course, has the same index
id in the array
nodes so we know where it is in this tree (we can access its sons and parent). Hence we can easily delete it from the tree by available delete operation of the particular BST data structure we are using (note that we don’t need to compare any nodes for this). In particular, in the case of a treap we should merge sons of the node
id, replace it by their merge and after that decrease by 1 field
size of each node on the path from the parent of the node
id to the root. We call public method for delete query as
pop. It will also return the deleted character which will be useful later. In turn,
pop will call private routine
remove that do the actual deleting of node from the tree. The corresponding code snippet for
pop has the form:
// Returns the deleted character. char pop() // Removes last node from the tree and updates the root. remove(size()) nodes.pop_back() // Removes last character of the string and returns its value. char c = str.back() str.pop_back() return c
This is standard remove routine in the treap if we already know the node we should delete:
// Removes node `id` from the tree and update the root. // Vector `nodes` does not change. void remove(int id) int par = nodes[id].parent int tmp = merge(nodes[id].left, nodes[id].right) // `tmp` is id of node that contains merge of children of the node `id`. // `par` is the parent of `tmp` since `tmp` is replacement of `id`. setParent(tmp, par) if (par) // If `id` has parent then root remains the same, // but we need to replace `id` by `tmp` in sons of the node `par`. if(nodes[par].left == id) nodes[par].left = tmp else nodes[par].right = tmp while (par) // Since we delete one node in subtree of `id`, // we should decrease the field `size` of each its ancestor by 1. nodes[par].size-- par = nodes[par].parent else // Otherwise `id` was the root and the new root is `tmp`. root = tmp
This is the standard merge routine in the treap:
// Merges subtrees rooted at `L` and `R` and save result in one of the them. // Each node in the subtree rooted at `L` should be less than // each node in the subtree rooted at `R`. int merge(int L, int R) // We return L when R is 0 and R when L is 0. if (!L || !R) return L + R // We should preserve the heap property by field `y`. if (nodes[L].y < nodes[R].y) // In this case `R` should be a root and we merge `L` and left son of `R`, // and result is saved in `tmp`. int tmp = merge(L, nodes[R].left) // `tmp` will be the new left son of `R`. setParent(tmp, R) nodes[R].left = tmp fixSize(R) return R else // In this case `L` will be a root. int tmp = merge(nodes[L].right, R) setParent(tmp, L) nodes[L].right = tmp fixSize(L) return L
merge has complexity O(H) = O(log|S|) and in
remove we have one call to
merge and one simple cycle with at most H iterations the complexity of delete query is O(log|S|).
Insert query (back to QUICK EXPLANATION)
We need to insert character
c to the beginning of the string S which leads to inserting suffix cS to the dynamic suffix array. Hence we add
c to the array
str and also add default node to the array
nodes which will correspond to this suffix. Now the hardest part is to actually insert this node to the tree preserving ordering of suffixes. For this we do some kind of binary search and need to compare this new suffix with the old suffixes.
To perform comparison of the string cS with the
id-th suffix (the suffix that starts with the character
str[id]) we at first compare their first characters
c = str[size()] and
str[id] and if they are different we are done. Otherwise we should compare S, which is the
(size() - 1)-th suffix of the old tree, with the
(id - 1)-th suffix.
For this we use routine
getIndex(id) that returns the actual position (1-based) of the
id-th suffix in the usual suffix array of the current string S (0-th suffix, which is an empty string, has position 0). This is a standard routine for BST where each node has field
size. At first we add to the answer one more the size of the left subtree of the node
id (since all nodes in the left subtree are less than node
id). Then for each ancestor
par of the node
id we add one more the size of the left subtree of the node
par in the case if
id is contained in the right subtree of
par (in this case
par is less than
id as well as the whole its left subtree). Clearly all other nodes are greater than
id. Having this routine coded we can simple compare
getIndex(size() - 1) and
getIndex(id - 1) in the above compare routine. We will follow common practice to return signed integer in
compare routine such that its sign contains the answer.
Summarizing we have the following code snippets for
int getIndex(int id) if (!id) return 0 int index = nodes[nodes[id].left].size + 1 while (nodes[id].parent) int par = nodes[id].parent // Ancestor of the initial node `id` if (id == nodes[par].right) index += nodes[nodes[par].left].size + 1 id = par return index int compare(int id1, int id2) int cmp = str[id1] - str[id2] if (cmp == 0) cmp = getIndex(id1 - 1) - getIndex(id2 - 1) return cmp
Now we are ready to insert the new suffix to the tree. Having
compare routine ready, this is a standard routine for particular BST data structure we are using. Corresponding code snippet in the case of a treap looks as follows:
// Inserts node `id` to the subtree rooted at `tree` // and returns the new root of this subtree (it may change). int insert(int tree, int id) // In the case of empty subtree the node `id` will be a new root. if (!tree) return id // The treap should be a heap by `y` fields. // Hence if `id` has larger `y` than `tree`, `id` should be a new root. if (nodes[tree].y < nodes[id].y) // In this case we split subtree rooted at `tree` into two parts: // 1) with suffixes < the `id`-th suffix (the root is left son of `id`); // 2) with suffixes > the `id`-th suffix (the root is right son of `id`). split(tree, id, nodes[id].left, nodes[id].right) // We set parents of both left and right sons if `id` to be `id`. setParent(nodes[id].left, id) setParent(nodes[id].right, id) fixSize(id) return id // This case is terminal for insert query. // Otherwise we should move `id` to one of the subtrees of the node `tree`, // selecting the correct subtree by comparing suffixes at `tree` and `id`. int cmp = compare(id, tree) if (cmp < 0) // If `id` is less than `tree` than we insert it to the left subtree. // `tmp` is the root of this subtree after inserting. int tmp = insert(nodes[tree].left, id) // And since root of this subtree could change, // we fix relation between `tree` and `tmp` setParent(tmp, tree) nodes[tree].left = tmp else // Now cmp > 0 since all suffixes are different // and we insert `id` to the right subtree. int tmp = insert(nodes[tree].right, id) setParent(tmp, tree) nodes[tree].right = tmp // We should fix the field `size` of the node `tree`. // We can simply increase it by 1 since we add only one node to its subtree. nodes[tree].size++ return tree // `tree` remains the root of its subtree
split routine used in
insert looks as follows:
// Splits the subtree rooted at `tree` by the node `id` into two parts: // 1) with suffixes < the `id`-th suffix (the root is `L`); // 2) with suffixes > the `id`-th suffix (the root is `R`). // `id`-th node does not have to be a correct node of the tree. // But node `id - 1` should be. void split(int tree, int id, int& L, int& R) // In the case of empty subtree there is nothing to split, // and both `L` and `R` should be empty trees as well. if (!tree) L = R = 0 return // Otherwise we compare the root of the subtree with the node `id`. int cmp = compare(id, tree) if (cmp < 0) // If `id` < `tree` then we should split left subtree of `tree` by `id` // and place the second subtree to the root of left subtree itself, split(nodes[tree].left, id, L, nodes[tree].left) setParent(nodes[tree].left, tree) R = tree else // Otherwise cmp > 0 and we split right subtree by the node `id`. split(nodes[tree].right, id, nodes[tree].right, R) setParent(nodes[tree].right, tree) L = tree fixSize(tree)
We see that
insert calls for
compare O(log|S|) times while each
getIndex which has complexity O(log|S|). Hence overall complexity of
push routine is O(log2|S|) as announced above. BTW, it looks as follows:
// Inserts character `c` to the beginning of the string. void push(char c) str.push_back(c) // We add `c` to the array `str`. nodes.push_back(Node()) // We add default node to the array `nodes`... // ...and then actually insert this node to the tree and update the root. root = insert(root, size())
There is a place for improvement in the implementation of insert query. We calculate
getIndex(size() - 1) O(log|S|) times during insert query. If we calculate it only once, save it and reuse saved value, it should give double acceleration. I did not describe this in detail here to keep code simple and elegant. But it is implemented in optimized editorialist’s solution. On official test data this optimization gives acceleration in 1.5 times: the maximal timing decreases from 0.80 to 0.53.
Now dynamic suffix array data structure is completely described.
##String deque (back to QUICK EXPLANATION)
The next step is to cover two more types of queries. So we now allow modifications at both ends of the string. It is quite clear how to implement this using two dynamic suffix arrays. Namely, we maintain two dynamic suffix arrays
right. At each moment
left corresponds to some prefix A of the string S, while
right corresponds to the reverse of the remaining suffix B of S. Thus, S = A + B = left + reverse(right). Then queries could be performed as follows:
- To insert the character
cat the end of the string just call
- To insert the character
cat the beginning of the string just call
- To delete the last character just call
rightis non-empty. But in the case of empty
rightwe should delete the last character in
left, which is not supported. Hence using |S| insert queries we save first (|S| + 1) div 2 characters of
rightand save the last |S| div 2 characters in
left. Then we can call
- To delete the first character just call
leftis non-empty. In the case of empty
leftwe do similar reallocation as in the previous query.
- To perform count query we count occurrences of
qin A using available
left, then count occurrences of reverse of
qin B calling
rightcontains reversed version of B). Finally we need to count occurrences of
qthat intersects with both A and B. For this we form string
borderas concatenation of min(|q| − 1, |A|) last characters of A and min(|q| − 1, |B|) first characters of B and using some standard algorithms for string matching (e.g. Knuth-Morris-Pratt algorithm) count occurrences of
borderin O(|q| + |border|) = O(|q|) time.
It should be noted that though each reallocation in delete query has complexity O(|S| * log2|S|), the overall complexity for all delete queries is O(Q * log2 Q), where Q is the total number of queries. This is because if some reallocation happens the next one will happen after at least |S|/2 queries.
Thus, string deque is a data structure that already supports first 5 types of queries but now complexity of each delete query is O(log2|S|) in average, while other types of queries have the same complexity as in dynamic suffix array. Since this data structure is quite simple, refer to the editorialist’s solution for any implementation details.
It is described in the QUICK EXPLANATION section how to solve the actual problem then (jump here).
Author’s and tester’s solutions are based on similar idea. But they use long double tag to compare suffixes and probably their solutions have better complexity for insert query. But I can’t say for sure since they recalculate these tags for all nodes when precision issues are possible. Namely, when the new node is inserted between nodes
node2 its tag is set to be
(node1.tag + node2.tag)/2. This will ensure property that larger suffixes has larger tags. But if at some moment we have two consecutive nodes with indistinguishable tags under double precision we should recalculate all tags in O(|S|) time using some tree traversal to avoid precision issues. At least this is how I understand their approach. But maybe I am wrong since for me it seems possible to create a test where the number of tags recalculations will be at least Q/40, which leads to quadratic in worst case solution, though with very small constant.
Dynamic Extended Suffix Array should do the job too. Thanks to @cyberax for noting this. But this article is 29 pages long and data structure invented there solves much more complex problem where modification of each character is allowed. So this probably will be an overkill here.
Moreover, most of the contestants answer queries offline saving all the strings of count queries in a trie.
It would be interesting to know the details of such an approach.
AUTHOR’S, TESTER’S AND EDITORIALIST’S SOLUTIONS:
Author’s solution can be found here. It’s maximal timing is 0.64.
Tester’s solution can be found here. It’s maximal timing is 0.55.
- Save order statistic of suffix S in insert query and use it in
compareroutine. Gives boost from 0.80 to 0.53.
- Calculate number of occurrences of
borderon the fly without constructing the string