ENGLISH Editorial

vijju123 · January 22, 2020, 8:22pm

PROBLEM LINK:

Div1
Div2
Practice

Setter- Vlad Zavodnik
Tester- Encho Mishinev
Editorialist- Abhishek Pandey

DIFFICULTY:

Medium

PRE-REQUISITES:

Observations, Tries OR Hashing [Let me know if this link doesn’t help]

PROBLEM:

For a pair of strings, let L_p be length of common prefix and L_s be length of common suffix. The beauty of this pair is defined is min(L_p,L_s)^2.

Given an array of N words, find the maximum sum of beauty possible by pairing the words.

QUICK-EXPLANATION:

Key to AC- The question hints for trie. Alternatively, if you practiced similar questions, you will know how to proceed with hashing as alternate solution!

Realize that we need to check for both prefixes and suffixes - and require the minimum length out of the 2. Hence, if we are able to somehow check for both prefix and suffix while traversing the trie, the problem will be essentially solved!

Lets invent a new alphabet system made by pairing every alphabet with every other alphabet, i,e, a system comprising of 26^2=676 alphabets pair. Eg, my new alphabet system will start from (a,a) and end at (z,z).

Now transform the given string S=S_1S_2...S_N (where S_i represent the i’th character of the string) to the new alphabet system by pairing i'th character with (N+1-i) character (1-based indexing), i.e. make S=(S_1,S_N)(S_2,S_{N-1})...(S_N,S_1) .

Because of above, now our beauty is nothing but square of common prefix. The answer can be now obtained by doing DFS on the trie.

EXPLANATION:

So this question took the 4’th spot in my poll for question whose editorial you all want. So thanks for bearing with the delay, I have tried to make this editorial comprehensive.

We will discuss both - setter’s trie solution, and as a bonus, also the tester’s hashing solution. The editorial however assumes that you know tries (at least conceptually!) and hashing before you are reading it. If not, please refer to the pre-requisite links or learn the topics from any medium comfortable with you

In setter’s solution, we shall discuss the following:

How to identify it as a trie problem.
The transformation & rough intuition on why it works.
How to calculate answer via DFS on trie

In tester’s solution, we will discuss the following-

The simple, yet tricky greedy idea
Hashing and Essence of the idea

So lets get started!

Setter

Identifying it as a Trie problem

On first looks, the fact that a trie is to be used might not be clear to you (which is essentially why you are reading this section)!

If I talk crudely, the following 4 types of string problems are most common:

Easy observation ones.
Pattern Finding or related concept.
Tries
Hashing

Like, if you go through the list of string problems (and, I mean, actual string problems and not ones where string is given just for sake of it), you will see above classification helping.

The next doubt to address is that its not primarily a DP problem. I am careful with the wording - perhaps someone with a brilliant mind saw some DP solution here - but I do not see one. And the reason is that, if I ask myself “Hey, if I got answer for upto first (i-1) words, can it help me get answer for all words upto index i ? Is there some particular order of traversal and filling the DP table which makes recurrence more visible?” For me the answers were no.

Now, lets answer why tries. For most of the string questions, if they involve computing some formula involving “common prefix” its trie. Like, I am not even kidding, most of the times it can be solved by trie. Its possible that the problem is having an easier solution, but it will be solvable by trie. Another similar popular usage of tries is problems where we have to maximize/minimize result of some bitwise operation like XOR, but thats a story of another day.

The twist here is that it kind of involves suffix as well, which kind of throws people to stuff like Suffix Arrays, Suffix Automaton etc. That is definitely not an invalid thing to consider actually. If you read my previous CHEFSPA editoiral, I discussed that you should never completely eliminate an algo until you find the right solution. Don’t narrow down by trimming feasible candidates - thats a mistake many, many people do. Once you trim/throw out the correct algo out of consideration, any further time you put into the problem is very less efficient - so be considerate of developing this mindset while practicing. It especially helps in short contests

The Transformation and Why it works

First we will discuss some intuition on what hints towards this transformation. Then we will formally write the transformation and see why it works.

Now, what if the problem is reduced to - “Beauty of the pair is l_p^2 where l_p is length of common prefix.” ? If you looked at pre-requisites or read trie, you know this is a very classical trie problem. But the problem says that beauty is min(l_p,l_s)^2 . Now what?

If we can somehow, somehow check the suffix as well while checking prefix, we are done! One thing where we shouldn’t be confused is the definition of what suffix is. For example, ``abcdef" has ``def" as suffix of length 3. ``fed" is NOT a suffix, BUT it is the reverse of suffix!

Now, when doing the prefix check, what do we do? We check if the first characters match. Then we check if second character match, and so on. What if we change it to -

We check the first characters match. We also check that the last characters match. If both do, we know min(l_p,l_s)=1.
If above is true, then we check the second character and second last character. If they match, we know min(l_p,l_s)=2. How? For prefix, its the classical stuff we discussed already (checking first character, second character, …, and so on). For suffix, we know last characters match. Then we also know that second last characters match. This directly implies that suffix of length 2 is possible!

What we achieved with above is, checking the first and last, second and second-last, third and third-last (and so on) characters simultaneously. Again, see that the intuition is simply that “If first k characters match, prefix of length k is possible, and if last k characters match, suffix of length k is possible”.

So, we change the string such that each character is replaced by a pair of required characters. In other words, if S=C_1C_2...C_N (where C_i= the i'th character of the string) then we change S to S=(C_1,C_N) (C_2,C_{N-1})...(C_N,C_1) where (C_1,C_N) is the first “character”, and so on. This is similar to changing our simple English alphabets (which are from A-Z) to be from (A,A)-(Z,Z).

Note that the entire purpose of above transformation is to provide a medium by which we can check the N+1-i'th character along with the i'th character as we go along the trie. This reduces the problem to simply checking the length of matching prefix which reduces this problem to a more classical form.

Wrapping up

All that is left is to now calculate the answer. There are multiple approaches possible, I will describe the setter’s approach.

The setter decides to incrementally add the answer as he does DFS on the trie. Each node of this trie stores the characters ( i.e. the new pair of characters we defined) and count of how many strings have common prefix obtained by concatenating all characters from root to current node. Lets call this count for u'th node as cnt_u.

What setter does is as follows:

Say we are at Node u. cnt_u is count of words with required prefix. The length of this prefix is L. Recall that by using our transformation, we reduced the problem so that beauty is now square of matching prefix.
Add \large \lfloor \frac{cnt_u}{2} \rfloor * L to answer.
Goto its children. Say we are at child v. The corresponding count for v is cnt_v and length is L+1.
Out of cnt_u words, cnt_v match for one character more. Hence we will add \large \lfloor \frac{cnt_v}{2} \rfloor *( (L+1)^2-L^2) to the answer
You can verify that this finally evaluates to \large \lfloor \frac{(cnt_u-cnt_v)}{2} \rfloor * L + \lfloor \frac{cnt_v}{2} \rfloor * (L+1)^2, i.e. “Number of pairs with min(l_p,l_s)=L multiplied by L, and similar for L+1.”

So it is like, we check which nodes have min(l_p,l_s)=1, add the corresponding value to answer. Then we check out of those words, how many have min(l_p,l_s)=2. We then finetune the value to add so that calculate the value of beauty correctly. This goes on and on until we process all nodes of the tree.

Of course, this is just one of the ways, there can be multiple other ways possible to achieve the same. Compared to the magnitude of problem, this is a very trivial implementation thing on which we have to work on.

Tester

The simple, tricky greedy idea

The idea of tester’s solution was that, it should be always optimal to match words with highest beauty first (into a pair). In other words, if we first pair words which give highest beauty, then words which give second highest beauty, and so on, then we will arrive at an optimal solution.

Now, the question is why? If this greedy holds, then this makes the problem fundamentally simpler! The tester has given a very comprehensive proof about this in his notes, which I will try to explain below-

Let O be an optimal solution where words with highest beauty (among all current pairs), A and B are not matched. Let A be matched with C and similarly B be matched with D. Lets define a few variables to use later-

Beauty(A,B)=X^2
Beauty(A,C)=Y^2
Beauty(B,D)=Z^2

Current beauty of stanza with pairing (A,C) and (B,D) is Y^2+Z^2.

Now, lets see the beauty if we pair (A,B) and (C,D). (A,B) contributes X^2 to the score. What about (C,D)? We know that A and B have min(l_p,l_s)=X, A and C have min(l_p,l_s)=Y and similarly B and D have min(l_p,l_s)=Z. This means, C and D MUST have at least min(l_p,l_s)=min(X,Y,Z), else one of the above values (i.e. X, Y or Z) will be wrong, which isnt possible. Hence, our new score becomes X^2+min(Y,Z)^2.

Without loss of generality, lets say Y<Z, so the score now becomes X^2+Y^2 which is more than Y^2+Z^2. But we assumed O is an optimal solution! Hence, we get a contradiction, that if the solution does not pair words with highest current beauty, we do not get an optimal solution.

Hashing and Essence of Idea

Knowing that we need to pair the words with highest (current) beauty first, we follow the below algorithm:

(PS: I assume you guys know how to hash strings. If no, I’d recommend solving SHIFTPAL from related problem section first.)

We maintain the prefix and suffix hashes for all the words.
Say the highest length among all words is L.
We loop from i=L to 1 and do the following:
- Insert words with Length=i into consideration.
- Among all words under consideration, see how many have prefix and suffix match of i characters. Note that its guaranteed that all words have length \geq i because of our manner of looping.
- The above check can be done by checking the prefix and suffix hash of the word for prefix and suffix lengths of i. If 2 words have equal hash values for this, we assume min(l_p,l_s)=i for them. We pair them, add the corresponding score to answer and remove them from consideration now.

Believe it or not, more or less it does the trick. The only real pain here is implementing this idea, which is not as trivial as it may sound - but rest assured, its not extremely complex or undoable either.

Regarding hashing part, there is actually nothing new to add here. Its the same classical hashing of a string, for which I would recommend you the SHIFTPAL problem to get familiar with.

With this, we end the editorial - hope you all understood at least one of the solutions

SOLUTION

Setter

#include<bits/stdc++.h>
 
using namespace std;
 
const int A=26;
 
struct node
{
    node* sons[A*A];
    int cnt;
    node():cnt(0)
    {
        for(auto&to:sons)
            to=0;
    }
};
typedef node* pnode;
 
pnode root;
 
void add(string s)
{
    pnode v=root;
    for(int i=0;i<s.size();i++)
    {
        char x=s[i],y=s[s.size()-1-i];
        int ch=(x-'a')*A+(y-'a');
        if(!v->sons[ch])
            v->sons[ch]=new node();
        v=v->sons[ch];
        v->cnt++;
    }
}
 
void del(pnode v=root)
{
    if(!v)
        return;
    for(auto to:v->sons)
        del(to);
    delete v;
}
 
long long solve(pnode v=root,long long h=0)
{
    if(!v)
        return 0;
    long long ans=v->cnt/2*(h*h-(h-1)*(h-1));
    for(auto to:v->sons)
        ans+=solve(to,h+1);
    return ans;
}
 
int main()
{
    int t;
    cin>>t;
    while(t--)
    {
        del();
        root=new node();
        int n;
        cin>>n;
        while(n--)
        {
            string s;
            cin>>s;
            add(s);
        }
        cout<<solve()<<"\n";
    }
}

Tester

#include <iostream>
#include <stdio.h>
#include <string.h>
#include <vector>
#include <map>
using namespace std;
typedef unsigned long long ullong;
typedef long long llong;
 
int t,n;
char inp[100111];
int L;
 
ullong B = 127ULL;
 
vector<char> words[100111];
vector<int> active;
 
vector<int> activated[100111];
 
vector<ullong> forw[100111];
vector<ullong> backw[100111];
 
map<ullong, int> mymap;
map<ullong, int>::iterator myit;
bool disabled[100111];
 
int main()
{
    int i,j;
    int test;
    ullong Bshift = B;
 
    for (i=1;i<=100000;i++)
    {
        Bshift *= B;
    }
 
    scanf("%d",&t);
 
    for (test=1;test<=t;test++)
    {
        llong answer = 0;
        int maxL = 0;
 
        scanf("%d",&n);
 
        active.clear();
 
        for (i=1;i<=n;i++)
        {
            disabled[i] = false;
 
            scanf("%s",inp);
            L = strlen(inp);
 
            maxL = max(maxL, L);
 
            words[i].clear();
            forw[i].clear();
            backw[i].clear();
 
            ullong H = 0;
            for (j=0;j<L;j++)
            {
                words[i].push_back(inp[j]);
 
                H *= B;
                H += inp[j];
                forw[i].push_back(H);
            }
 
            H = 0;
            for (j=L-1;j>=0;j--)
            {
                H *= B;
                H += inp[j];
                backw[i].push_back(H);
            }
        }
 
        for (i=1;i<=maxL;i++)
        {
            activated[i].clear();
        }
 
        for (i=1;i<=n;i++)
        {
            activated[ words[i].size() ].push_back(i);
        }
 
        for (i=maxL;i>=1;i--)
        {
            for (j=0;j<activated[i].size();j++)
            {
                active.push_back(activated[i][j]);
            }
 
            mymap.clear();
            int uk = 0;
            while(uk < active.size())
            {
                if (disabled[ active[uk] ])
                {
                    active[uk] = active.back();
                    active.pop_back();
                    continue;
                }
 
                ullong H = forw[ active[uk] ][i-1] * Bshift + backw[ active[uk] ][i-1];
                myit = mymap.find(H);
 
                if (myit == mymap.end())
                {
                    mymap.insert(make_pair(H, active[uk]));
                }
                else
                {
                    answer += (llong)i * (llong)i;
                    disabled[ active[uk] ] = true;
                    disabled[ (*myit).second ] = true;
 
                    mymap.erase(myit);
                }
 
                uk++;
            }
        }
 
        printf("%lld\n",answer);
    }
}

Time Complexity=O(\sum W_i) for setter and O((\sum W_i) LogN) for tester.
Space Complexity=O(\sum W_i)

CHEF VIJJU’S CORNER

Prove or disprove the following:
- The transformation discussed in setter’s solution is not required. We can solve the problem by using 2 tries within the time and memory constraints. Proof by AC is acceptable.
- The setter’s transformation works if beauty is defined as max(L_p,L_s) instead of min(L_p,L_s). If true, give a proof. If false, a counter example is acceptable.
- In worst case the trie can have \large O(K^{depth}) nodes where there are K letters in alphabet and maximum depth of trie is depth. Then why does the current trie solution pass? Prove that the worst case is inachievable under given constraints.
- You are at Toogle developing their search engine. You want to calculate the beauties fast without using a lot of memory. In this light, i.e. w.r.t. time and memory consumption, discuss the pro’s and con’s of setter and tester’s approach.
- In tester’s solution, prove that the step where we remove words from consideration is correct. How can you be sure that its matching words with maximum current beauty? (Hint: See when string of length \geq i remain under consideration for next iteration!)
Mr Tash is learning hashing. He designs various hashing functions. For each of them, tell if they are good hashing functions or not:
- H=\sum_{i=1}^{i=N}V_i where V_i=0 for 'a', 1 for 'b' , ..., 25 for 'z'.
- H=\sum_{i=1}^{i=N}i*V_i where V_i=0 for 'a', 1 for 'b' , ..., 25 for 'z'.
- H=\sum_{i=1}^{i=N}2^i*V_i where V_i=0 for 'a', 1 for 'b' , ..., 25 for 'z'.
- H=\sum_{i=1}^{i=N}26^i*V_i where V_i=0 for 'a', 1 for 'b' , ..., 25 for 'z'.
- H=( \sum_{i=1}^{i=N}26^i*V_i )mod \ 2 where V_i=0 for 'a', 1 for 'b' , ..., 25 for 'z'.
- H=( \sum_{i=1}^{i=N}26^i*V_i )mod \ 1000000000 where V_i=0 for 'a', 1 for 'b' , ..., 25 for 'z'.
- H=( \sum_{i=1}^{i=N}26^i*V_i )mod \ 1000000007 where V_i=0 for 'a', 1 for 'b' , ..., 25 for 'z'.
- H=( \sum_{i=1}^{i=N}26^i*V_i )mod \ 2^{64} where V_i=0 for 'a', 1 for 'b' , ..., 25 for 'z'.

Hint

A good hashing function is such that: H=\sum_{i=1}^{i=N} L^i*V_i such that L > K where K is the size of alphabet. For these functions, if we can somehow store all large numbers they produce, we will never have a collision!

But however, since we cannot store such large numbers, we modulo them with suitably large primes as to minimize chances of collision. We usually prefer primes here. Also, modulo 2^i are considered bad numbers to modulo from as its very easy to frame anti hash tests against them.

With respect to setter's solution, what is the significance of Subtask 1?

You can get those points without coming up with the transformation as it C_i=C_{N+1-i} is guaranteed in a palindrome string.

Setter's Notes

Consider an alphabet consisting of 26 ^ 2 = 676 characters, where each character is a pair of two characters of the English alphabet. Convert each word from the input data as follows:
s = s_1 s_2 ... s_ {(|s| - 1)} s_ {|s|} \implies (s_1, s_ {|s|}) (s_2, s_ {(|s| - 1)}) ... (s_ {(|s| - 1)}, s_2) (s_ {|s|}, s_1)
That is, the first letter with the last, the second with the penultimate …, the last with the first.

Now the beauty of a pair of words is the square of the length of their greatest common prefix.
Construct a trie of these words. In each of its vertices we will store a list of sons of this vertex and the number of words whose prefix is the line corresponding to this vertex (let’s call this value cnt and add 1 to this value each time, when adding a word, when we will go through this vertex). We solve the problem by dfs into the tree.

Let we are at some vertex. We know its depth h and the value cnt. Among all words, there are exactly cnt words whose prefix is the string corresponding to this vertex. All these words have a common prefix of length h, that is, a rhyme of length not less than h. Of these, you can make no more than \lfloor \frac{cnt}{ 2} \rfloor word pairs. Since the function is recursive, the beauty for each of these words is calculated as (h-1) ^ 2. It is necessary to change it to h ^ 2. That is, all you need to add to the answer is the value: \lfloor \frac{cnt} {2} \rfloor * (h ^ 2- (h-1) ^ 2). It should be noted that when we are at the root, we don’t need to add anything to the answer, since the root corresponds to an empty string.

Tester's Notes

My solution is based on the greedy idea that picking the highest-scoring pair first is always optimal. I’ll write a short proof of this fact at the end.

Let L be the maximum length of any word. For each value from L to 1 in order we try to find all unmatched pairs that have a matching prefix and suffix of this length. Suppose we want to find all such pairs for some value K. We create a hash of the first K characters and the last K characters of every unmatched word whose length is at least K and then simply pair words with equal hashes, marking them matched.

Let S be the sum of the lengths of all words. In order to implement this efficiently, we calculate all prefix and suffix polynomial hashes of all words, which takes O(S). This allows finding the hashes for a fixed value K for a fixed word to be done in O(1). Each word W is considered exactly |W| times and since we use sorting or map/set to find pairings, the total complexity of the main part of the algorithm is O(S log N). It is important that for some fixed value K we don’t consider words whose length is less than K at all, as otherwise the complexity will blow up to O(S^2) in the worst case (e.g. on a test with one really long word and many 1-letter words).

Each time we match a pair we just add the beauty to the score. The whole algorithm is thus O(S log N) per test.

Proof of the greedy:

Let the words A and B have the largest beauty among all pairs and consider any pairing in which A is not paired with B. If they are not paired then there exist some words C and D such that A is paired with C and B is paired with D (consider C,D empty words if A or B is unpaired).

Let beauty(A, B) = X^2, beauty(A, C) = Y^2 and beauty(B, D) = Z^2. In the current pairing those four words contribute a total of Y^2 + Z^2.

Now consider instead pairing A with B and C with D. The pair A and B would contribute X^2. Since C and A have a common prefix/suffix of length Y, A and B of length X and B and D of length Z, then beauty(C, D) is at least min(X, Y, Z)^2 = min(Y, Z)^2 (since X >= Y,Z by definition). It is then clear that Y^2 + Z^2 \leq X^2 + min(Y, Z)^2 since X \leq Y, Z and min(Y, Z) is either Y or Z. We’ve thus proved that pairing the highest scoring pair in an arbitrary pairing will not make the score worse. By induction we can then easily show that always choosing the pair with the maximum beauty is optimal.