PROBLEM LINK:
Practice
Contest
Author: Tom Chen
Tester: Hiroto Sekido
Editorialist: Anton Lunyov
DIFFICULTY:
HARD
PREREQUISITES:
Suffix array, Treap, Knuth–Morris–Pratt algorithm
PROBLEM:
You are given a string S of length 10. Its characters are numbered from 0.
The middle character is the character with index |S| div 2.
You need efficiently perform the following types of queries:
- Count number of occurrences of the given query string q.
- Delete the first character.
- Insert given character before the first character.
- Delete the last character.
- Insert given character after the last character.
- Delete the middle character.
- Insert given character before the middle character.
There could be up to 150,000 queries and the overall length of all strings in count queries could be up to 1,500,000.
QUICK EXPLANATION:
The solution that will be described is a customized version of winger's solution which I was lucky to understand completely (partially reinventing some parts independently of his code).
At first we develop the data structure that supports only modifications at the beginning of the string. For this we will maintain self-balancing binary search tree (BST) where each node corresponds to some suffix of the string and nodes are ordered according to lexicographical ordering of the corresponding suffixes. We maintain two arrays str
and nodes
. The array str
contains the string itself in reversed order, the array nodes
contains nodes of the tree. The key thing is that the node nodes[id]
corresponds to the suffix that starts at str[id]
. Each node should contain links to its sons and parent and also the size of its subtree. Let's call it dynamic suffix array. Jump here for further representation details.
Then to perform count query we do some kind of binary search over this tree. Namely at each step we compare q
with the current suffix. If q
is not its prefix, we move either to the right subtree of this suffix or to the left one. Otherwise we found an occurrence and we should try both subtrees. To handle this efficiently we should pass two boolean parameters mathLeft
and mathRight
. The complexity is O(|q| * log|S|). Jump here for detailed explanation.
Delete query is very simple. We need to delete last character of str
and last node of nodes
. So we already know where this node is in the tree (we have access to its sons and parent). Hence we can run available delete query of the particular BST data structure we are using. The complexity is O(log|S|). Jump here for detailed explanation.
Insert query is the most tricky because we need to insert a new suffix cS to the tree. For this we need to compare this suffix with old suffixes. To compare cS with id
-th suffix we at first compare their first characters c and str[id]
. If they are different we should compare their remaining parts S and (id-1)
-th suffix. But S is already in the tree. To compare two correct suffixes we calculate their order statistics using size of the subtrees. So compare routine is ready and then insert can be performed by available insert routine of BST. The final complexity is O(log^{2}|S|). Jump here for detailed explanation.
Next we develop the data structure that supports modifications at both ends of the string. For this we maintain two dynamic suffix arrays left
and right
such that left
contains some prefix of S and right
contains the remaining suffix. Let's call this data structure string deque. Then it is clear how to perform modifications. The only tricky case is when we want to delete the last character but right
is empty (or the first character but left
is empty). In this case using |S| insert queries we divide left
into two equal parts and put first half in left
and reversed second half in right
. Then we can delete last character by calling delete query for right
. To perform count query we count occurrences of q
in left
and right
and then count occurrences that intersects with both left
and right
by constructing border string of length at most 2 * (|q| − 1) and using KMP. Jump here for detailed explanation.
To solve the actual problem we will maintain two string deques left
and right
. The first one will contain first |S| div 2 characters of S, while the second one will contain the remaining characters of S. So left
and right
should satisfy the inequality |left| ≤ |right| ≤ |left| + 1. It is clear how to perform modification queries − each such query will be a modification query for one of the string deques. Of course some modifications ruin the inequality |left| ≤ |right| ≤ |left| + 1. Hence we equalize left
and right
by additional movements of characters from one of them to another. The count query could be performed in the same way as in the string deque data structure, dividing all occurrences of q
into three types and using auxiliary string border
. This part is quite simple and similar to the previous part so refer to the editorialist's solution for any implementation details.
EXPLANATION:
At first note that naive algorithms that work in O(|S| + |q|) time for each count query will not work fast enough in this problem. Though initial string is short, the following scenario is possible:
- The first 75,000 queries are insert queries. So the string will have length 75,010 after them.
- They are followed by 75,000 count queries where each query has length 20.
Clearly in this case naive processing of each count query in O(|S| + |q|) time will lead to about 75,000 * (75,010 + 20) ≅ 5 * 10^{10} elementary operations, while modern computers allow to perform about 10^{8} such operations in a second. Thus, we need to invent something clever here.
When there are no modification queries, the fast and simple data structure that allows to answer each count query in O(|q| * log|S|) time with O(|S| * log|S|) preprocessing is suffix array. Namely, we sort all non-empty suffixes of the string in lexicographical order and save the permutation of positions of their first characters into an array which is called the suffix array of the string. Then each count query could be performed by simple binary search over this array.
Of course, we can't just take all suffixes and sort them − there are different efficient techniques exist to perform this (some of them are even linear in time). But we should deal with modification operations so actually well-known techniques will not work here and we should invent a new data structure.
At first we develop the data structure that allows to perform efficiently first three types of queries. Namely, only modifications at the beginning of the string are allowed. We will achieve the following performance:
- O(|S|) memory.
- O(|q| * log|S|) time for each count query.
- O(log|S|) time for each delete query.
- O(log^{2}|S|) time for each insert query.
For this we will maintain self-balancing binary search tree (BST) where each node corresponds to some suffix of the string and nodes are ordered according to lexicographical ordering of the corresponding suffixes.
Below for each suffix of the string we should have access to the corresponding node of the tree. Hence we will maintain two arrays:
- The first array
str
will contain the actual string in reversed order.
- The second array
nodes
will contain corresponding nodes of the tree.
Also we should maintain the root of the tree which can be stored as the index of the actual root in the array nodes
and can be denoted as root
.
Arrays str
and nodes
should allow random access to their elements and also deleting or adding elements in the end. In C++ the perfect fit is vector
container. Alternatively we can define static arrays of large enough size and maintain the variable size
that will contain the current length of the string. But there are some issues with stack overflow for such implementation (at least for me). In what follows we will use implementation with vectors in code snippets.
Because of such representation of our data structure and also for some future needs it will be convenient to store in each node the following information:
left
− the index of its left son in nodes
.
right
− the index of its right son in nodes
.
parent
− the index of its parent in nodes
. It will be needed for finding order statistics of suffixes and also for remove query.
size
− the size of the subtree rooted at this node. It will be needed for order statistics as well, and also for count query.
Of course, some additional information, needed to keep tree self-balancing, should be stored as well. For example, in the case of a treap we need to store additional field y
such that our tree is a heap considering y
values as keys. Namely, the root of each subtree should have the largest y
among all nodes of its subtree.
In what follows we will provide code snippets assuming that we are using treap. We will use syntax close to C++ in code snippets but indicate blocks by indentation instead of braces and also will not use semicolons at the end of lines. Also to distinguish variable or methods in comments we surround them by grave accents.
Next, we need empty node that will play a role of non-existing sons or parents. To handle this, it is convenient to have 0-th element of str
to be zero character and 0-th element of nodes
to be node with all zero fields (including size
). So such initialization should be a part of the constructor of our data structure:
DynamicSuffixArray()
str.push_back(0)
nodes.push_back(Node())
nodes[0].size = 0
root = 0
Here Node
is the data structure that maintain nodes in the tree. Its constructor in the case of a treap in view of definition of empty node has the form:
Node()
left = right = parent = 0
size = 1
y = rand()
The node constructed in this way will be called a default node. Calling rand()
here is important to keep treap balanced.
This is all about actual representation of our data structure and its purpose. Now we can describe how to perform queries. At first we define several simple routines.
1) size()
− returns the length of the current string:
int size()
// "-1" because of the dummy zero character
return str.size() - 1
2) setParent(id, parent_id)
− sets parent of the node id
to parent_id
. Has effect only for non-zero nodes:
setParent(int id, int parent_id)
if (id)
nodes[id].parent = parent_id
3) fixSize(id)
− fixes size
field of the node id
after some possible changes in its subtree. It simply means to set its size
field to the sum of sizes of its children plus one. Note that in view of setting size of zero node to zero we can do this without additional checks for its children:
fixSize(int id)
if (id)
nodes[id].size = 1 + nodes[nodes[id].left].size + nodes[nodes[id].right].size
The most tricky in this data structure is insert query since it requires to compare two suffixes when we know only their ids in the array nodes
. But count and delete queries are quite standard and we start with them.
This can be done similarly to the case of usual suffix array by some kind of binary search. Namely, the routine match(tree, q)
will count the number of times, q
is the prefix of some suffix in the subtree, rooted at the node tree
. For this we should match q
and the suffix of the node tree
character-by-character: at i
-th step we compare q[i]
and str[tree - i]
(recall that the string S is stored in reverse order).
If for the first i
where they differ we have q[i] > str[tree - i]
then q
is lexicographically greater than this suffix (this happens in particular when tree = i
since in this case str[0] = 0
, this is the case when the current suffix is a proper prefix of q
). So all occurrences of q
are contained in the right subtree and the answer is match(nodes[tree].right, q)
. On the other hand if q[i] < str[tree - i]
then all occurrences of q
are in the left subtree and the answer is match(nodes[tree].left, q)
.
But if q
appears to be a prefix of this suffix then we find the occurrence, and also some other occurrences could be in both subtrees. If we simply run this routine on both subtrees this will be too slow in the case when there are many occurrences of q
in the string.
To handle this issue we need to pass two additional boolean parameters matchLeft
and matchRight
to match
routine. The first one is true
if we already know that the rightmost suffix, left to this subtree, starts with q
. mathRight
has similar meaning. Then in the case when root of the subtree has q
as a prefix, the answer will be composed of the found occurrence, occurrences in the left subtree with matchRight = true
(and the same matchLeft
as passed to this subtree) and occurrences in the right subtree with matchLeft = true
(and the same matchRight
as passed to this subtree). Now if both matchLeft
and matchRight
are true
we definitely know that all suffixes in this subtree start with q
and we can simply add its size to the answer. This simplification is the main point why updated match
routine becomes efficient. Now we present the code snippet:
int match(int tree, const string& q, bool matchLeft, bool matchRight)
if (!tree)
return 0
if (matchLeft && matchRight)
return nodes[tree].size
for (int i = 0; i < q.size(); ++i)
if (q[i] > str[tree - i])
return match(nodes[tree].right, q, false, matchRight)
else if (q[i] < str[tree - i])
return match(nodes[tree].left, q, matchLeft, false)
return 1 +
match(nodes[tree].left, q, matchLeft, true) +
match(nodes[tree].right, q, true, matchRight)
To find the number of occurrences of q
in the whole string S we should call match(root, q, false, false)
. It can be shown using some amortized analysis (which is usual for segment tree) that the overall number of recursive calls of match
routine is O(H), where H is the height of the tree. Since we are using self-balancing BST, which is characterized by H = O(log|S|), this leads to the mentioned O(|q| * log|S|) complexity for this query.
We need to delete first character of the string which has index id := size()
in the array str
and also corresponding suffix (which is the whole string S) from the tree. The node of this suffix, of course, has the same index id
in the array nodes
so we know where it is in this tree (we can access its sons and parent). Hence we can easily delete it from the tree by available delete operation of the particular BST data structure we are using (note that we don't need to compare any nodes for this). In particular, in the case of a treap we should merge sons of the node id
, replace it by their merge and after that decrease by 1 field size
of each node on the path from the parent of the node id
to the root. We call public method for delete query as pop
. It will also return the deleted character which will be useful later. In turn, pop
will call private routine remove
that do the actual deleting of node from the tree. The corresponding code snippet for pop
has the form:
// Returns the deleted character.
char pop()
// Removes last node from the tree and updates the root.
remove(size())
nodes.pop_back()
// Removes last character of the string and returns its value.
char c = str.back()
str.pop_back()
return c
This is standard remove routine in the treap if we already know the node we should delete:
// Removes node `id` from the tree and update the root.
// Vector `nodes` does not change.
void remove(int id)
int par = nodes[id].parent
int tmp = merge(nodes[id].left, nodes[id].right)
// `tmp` is id of node that contains merge of children of the node `id`.
// `par` is the parent of `tmp` since `tmp` is replacement of `id`.
setParent(tmp, par)
if (par)
// If `id` has parent then root remains the same,
// but we need to replace `id` by `tmp` in sons of the node `par`.
if(nodes[par].left == id)
nodes[par].left = tmp
else
nodes[par].right = tmp
while (par)
// Since we delete one node in subtree of `id`,
// we should decrease the field `size` of each its ancestor by 1.
nodes[par].size--
par = nodes[par].parent
else
// Otherwise `id` was the root and the new root is `tmp`.
root = tmp
This is the standard merge routine in the treap:
// Merges subtrees rooted at `L` and `R` and save result in one of the them.
// Each node in the subtree rooted at `L` should be less than
// each node in the subtree rooted at `R`.
int merge(int L, int R)
// We return L when R is 0 and R when L is 0.
if (!L || !R)
return L + R
// We should preserve the heap property by field `y`.
if (nodes[L].y < nodes[R].y)
// In this case `R` should be a root and we merge `L` and left son of `R`,
// and result is saved in `tmp`.
int tmp = merge(L, nodes[R].left)
// `tmp` will be the new left son of `R`.
setParent(tmp, R)
nodes[R].left = tmp
fixSize(R)
return R
else
// In this case `L` will be a root.
int tmp = merge(nodes[L].right, R)
setParent(tmp, L)
nodes[L].right = tmp
fixSize(L)
return L
Since merge
has complexity O(H) = O(log|S|) and in remove
we have one call to merge
and one simple cycle with at most H iterations the complexity of delete query is O(log|S|).
We need to insert character c
to the beginning of the string S which leads to inserting suffix cS to the dynamic suffix array. Hence we add c
to the array str
and also add default node to the array nodes
which will correspond to this suffix. Now the hardest part is to actually insert this node to the tree preserving ordering of suffixes. For this we do some kind of binary search and need to compare this new suffix with the old suffixes.
To perform comparison of the string cS with the id
-th suffix (the suffix that starts with the character str[id]
) we at first compare their first characters c = str[size()]
and str[id]
and if they are different we are done. Otherwise we should compare S, which is the (size() - 1)
-th suffix of the old tree, with the (id - 1)
-th suffix.
For this we use routine getIndex(id)
that returns the actual position (1-based) of the id
-th suffix in the usual suffix array of the current string S (0-th suffix, which is an empty string, has position 0). This is a standard routine for BST where each node has field size
. At first we add to the answer one more the size of the left subtree of the node id
(since all nodes in the left subtree are less than node id
). Then for each ancestor par
of the node id
we add one more the size of the left subtree of the node par
in the case if id
is contained in the right subtree of par
(in this case par
is less than id
as well as the whole its left subtree). Clearly all other nodes are greater than id
. Having this routine coded we can simple compare getIndex(size() - 1)
and getIndex(id - 1)
in the above compare routine. We will follow common practice to return signed integer in compare
routine such that its sign contains the answer.
Summarizing we have the following code snippets for getIndex
and compare
routines:
int getIndex(int id)
if (!id)
return 0
int index = nodes[nodes[id].left].size + 1
while (nodes[id].parent)
int par = nodes[id].parent // Ancestor of the initial node `id`
if (id == nodes[par].right)
index += nodes[nodes[par].left].size + 1
id = par
return index
int compare(int id1, int id2)
int cmp = str[id1] - str[id2]
if (cmp == 0)
cmp = getIndex(id1 - 1) - getIndex(id2 - 1)
return cmp
Now we are ready to insert the new suffix to the tree. Having compare
routine ready, this is a standard routine for particular BST data structure we are using. Corresponding code snippet in the case of a treap looks as follows:
// Inserts node `id` to the subtree rooted at `tree`
// and returns the new root of this subtree (it may change).
int insert(int tree, int id)
// In the case of empty subtree the node `id` will be a new root.
if (!tree)
return id
// The treap should be a heap by `y` fields.
// Hence if `id` has larger `y` than `tree`, `id` should be a new root.
if (nodes[tree].y < nodes[id].y)
// In this case we split subtree rooted at `tree` into two parts:
// 1) with suffixes < the `id`-th suffix (the root is left son of `id`);
// 2) with suffixes > the `id`-th suffix (the root is right son of `id`).
split(tree, id, nodes[id].left, nodes[id].right)
// We set parents of both left and right sons if `id` to be `id`.
setParent(nodes[id].left, id)
setParent(nodes[id].right, id)
fixSize(id)
return id // This case is terminal for insert query.
// Otherwise we should move `id` to one of the subtrees of the node `tree`,
// selecting the correct subtree by comparing suffixes at `tree` and `id`.
int cmp = compare(id, tree)
if (cmp < 0)
// If `id` is less than `tree` than we insert it to the left subtree.
// `tmp` is the root of this subtree after inserting.
int tmp = insert(nodes[tree].left, id)
// And since root of this subtree could change,
// we fix relation between `tree` and `tmp`
setParent(tmp, tree)
nodes[tree].left = tmp
else
// Now cmp > 0 since all suffixes are different
// and we insert `id` to the right subtree.
int tmp = insert(nodes[tree].right, id)
setParent(tmp, tree)
nodes[tree].right = tmp
// We should fix the field `size` of the node `tree`.
// We can simply increase it by 1 since we add only one node to its subtree.
nodes[tree].size++
return tree // `tree` remains the root of its subtree
The split
routine used in insert
looks as follows:
// Splits the subtree rooted at `tree` by the node `id` into two parts:
// 1) with suffixes < the `id`-th suffix (the root is `L`);
// 2) with suffixes > the `id`-th suffix (the root is `R`).
// `id`-th node does not have to be a correct node of the tree.
// But node `id - 1` should be.
void split(int tree, int id, int& L, int& R)
// In the case of empty subtree there is nothing to split,
// and both `L` and `R` should be empty trees as well.
if (!tree)
L = R = 0
return
// Otherwise we compare the root of the subtree with the node `id`.
int cmp = compare(id, tree)
if (cmp < 0)
// If `id` < `tree` then we should split left subtree of `tree` by `id`
// and place the second subtree to the root of left subtree itself,
split(nodes[tree].left, id, L, nodes[tree].left)
setParent(nodes[tree].left, tree)
R = tree
else
// Otherwise cmp > 0 and we split right subtree by the node `id`.
split(nodes[tree].right, id, nodes[tree].right, R)
setParent(nodes[tree].right, tree)
L = tree
fixSize(tree)
We see that split
and insert
calls for compare
O(log|S|) times while each compare
calls getIndex
which has complexity O(log|S|). Hence overall complexity of push
routine is O(log^{2}|S|) as announced above. BTW, it looks as follows:
// Inserts character `c` to the beginning of the string.
void push(char c)
str.push_back(c) // We add `c` to the array `str`.
nodes.push_back(Node()) // We add default node to the array `nodes`...
// ...and then actually insert this node to the tree and update the root.
root = insert(root, size())
There is a place for improvement in the implementation of insert query. We calculate getIndex(size() - 1)
O(log|S|) times during insert query. If we calculate it only once, save it and reuse saved value, it should give double acceleration. I did not describe this in detail here to keep code simple and elegant. But it is implemented in optimized editorialist's solution. On official test data this optimization gives acceleration in 1.5 times: the maximal timing decreases from 0.80 to 0.53.
Now dynamic suffix array data structure is completely described.
The next step is to cover two more types of queries. So we now allow modifications at both ends of the string. It is quite clear how to implement this using two dynamic suffix arrays. Namely, we maintain two dynamic suffix arrays left
and right
. At each moment left
corresponds to some prefix A of the string S, while right
corresponds to the reverse of the remaining suffix B of S. Thus, S = A + B = left + reverse(right). Then queries could be performed as follows:
- To insert the character
c
at the end of the string just call right.push(c)
.
- To insert the character
c
at the beginning of the string just call left.push(c)
.
- To delete the last character just call
right.pop()
if right
is non-empty. But in the case of empty right
we should delete the last character in left
, which is not supported. Hence using |S| insert queries we save first (|S| + 1) div 2 characters of left
in right
and save the last |S| div 2 characters in left
. Then we can call right.pop()
safely.
- To delete the first character just call
left.pop()
if left
is non-empty. In the case of empty left
we do similar reallocation as in the previous query.
- To perform count query we count occurrences of
q
in A using available match
routine in left
, then count occurrences of reverse of q
in B calling match
routine in right
(reverse, since right
contains reversed version of B). Finally we need to count occurrences of q
that intersects with both A and B. For this we form string border
as concatenation of min(|q| − 1, |A|) last characters of A and min(|q| − 1, |B|) first characters of B and using some standard algorithms for string matching (e.g. Knuth-Morris-Pratt algorithm) count occurrences of q
in border
in O(|q| + |border|) = O(|q|) time.
It should be noted that though each reallocation in delete query has complexity O(|S| * log^{2}|S|), the overall complexity for all delete queries is O(Q * log^{2} Q), where Q is the total number of queries. This is because if some reallocation happens the next one will happen after at least |S|/2 queries.
Thus, string deque is a data structure that already supports first 5 types of queries but now complexity of each delete query is O(log^{2}|S|) in average, while other types of queries have the same complexity as in dynamic suffix array. Since this data structure is quite simple, refer to the editorialist's solution for any implementation details.
It is described in the QUICK EXPLANATION section how to solve the actual problem then (jump here).
ALTERNATIVE SOLUTIONS:
Author's and tester's solutions are based on similar idea. But they use long double tag to compare suffixes and probably their solutions have better complexity for insert query. But I can't say for sure since they recalculate these tags for all nodes when precision issues are possible. Namely, when the new node is inserted between nodes node1
and node2
its tag is set to be (node1.tag + node2.tag)/2
. This will ensure property that larger suffixes has larger tags. But if at some moment we have two consecutive nodes with indistinguishable tags under double precision we should recalculate all tags in O(|S|) time using some tree traversal to avoid precision issues. At least this is how I understand their approach. But maybe I am wrong since for me it seems possible to create a test where the number of tags recalculations will be at least Q/40, which leads to quadratic in worst case solution, though with very small constant.
Dynamic Extended Suffix Array should do the job too. Thanks to @cyberax for noting this. But this article is 29 pages long and data structure invented there solves much more complex problem where modification of each character is allowed. So this probably will be an overkill here.
Moreover, most of the contestants answer queries offline saving all the strings of count queries in a trie.
It would be interesting to know the details of such an approach.
AUTHOR'S, TESTER'S AND EDITORIALIST'S SOLUTIONS:
Author's solution can be found here. It's maximal timing is 0.64.
Tester's solution can be found here. It's maximal timing is 0.55.
Editorialist's solution can be found here. It's maximal timing is 0.80.
It is the best viewed at mono-space editor. For example here.
Optimized editorialist's solution can be found here. It's maximal timing is 0.48.
This is achieved by the following two optimizations taken from winger's solution:
- Save order statistic of suffix S in insert query and use it in
compare
routine. Gives boost from 0.80 to 0.53.
- Calculate number of occurrences of
q
in border
on the fly without constructing the string border
itself.
RELATED PROBLEMS:
SPOJ - STRLCP - 3109. Longest Common Prefix