### PROBLEM LINKS

### DIFFICULTY

MEDIUM

### PREREQUISITES

Suffix Array, Longest Common Prefix Array

### PROBLEM

Given two strings **a** and **b**, let **A** be the set of all substrings of **a**, and **B** be the set of all substrings of **b**.

Find the number of **unique** strings in **A** plus the number of **unique** strings in **B**, that are not common to both **A** and **B**.

### QUICK EXPLANATION

Efficient solutions to a large set of problems on strings are approachable by use of **Suffix** **Arrays** (or Trees). If you have not yet added **Suffix** **Sorting** (for construction of Suffix Arrays) to your skill-set, then this is the perfect problem to do so.

Suffix Sorting can be done using

**Manber Myers algorithm in O(L log L) time****Karkkainan Sander’s algorithm in O(L) time**

**Longest Common Prefixes** of adjacent items in the list of sorted suffixes is a classical problem in strings. Most texts discuss construction of Suffix Arrays immediately followed by calculation of **LCP arrays**.

The **LCP Array** can be found by **augmenting** the Suffix Sorting algorithms above. The time complexity for LCP Array calculation will be **exactly** equal to the time complexity of the Suffix Sorting. You can find it as an entirely different step as well, following the Suffix Sorting.

Both **O(L)** and **O(L log L)** approaches will fit within the time limit for this problem. Hence, choose as you please

Let us assume that we can find the **number of unique sub-strings of a string** by use of the **LCP array** for the suffixes of that string. (We will see the algorithm to do so in the EXPLANATION section)

U(A) = number of unique strings in A

We are given two strings **a** and **b** and their set of substrings **A** and **B** respectively. We wish to find

U(A) + U(B) - U(AintersectionB)

Which is equal to

U(A) + U(B) - (U(AunionB) - U(A) - U(B))

Or

2*U(A) + 2*U(B) - U(AunionB)

### EXPLANATION

Let us see how to find the number of unique substrings of a string by using the LCP array.

- If we consider
**all the prefixes of all the suffixes**, we would be considering the entire set of sub-strings. - The LCA array helps us determine how many prefixes to
**ignore**for each suffix - We of course do not ignore any prefix of the first suffix. These are all valid and unique substrings.

Let s be given string indexes in s are 1-based Let S be the suffix array S stores the list of 1-based indexes that represent the start position of the suffix of s Let L be the LCP array L(i) is the longest common prefix between the suffixes starting from S(i) and S(i-1) Thus, L is only defined from 2 to |s| uniq_sub_strings = |s| - S[1] + 1 // thus we count all prefixes of the first suffix for i = 2 to N uniq_sub_strings += |s| - S* + 1 - L*

Let us try this with an example to see why this is correct.

s = abaabba S = [ 7, // a 3, // aabba 1, // abaabba 4, // abba 6, // ba 2, // baabba 5 // bba ] L = [ 0, // not defined for L[1] 1, // a is the common prefix 1, // a 2, // ab 0, // nothing 2, // ba 1 // b ] uniq_sub_strings = 1 + 4 + 6 + 2 + 2 + 4 + 2 = 21

Thus, there are **21** substrings (out of **28**) that are unique. You can work this out by deducing that the sub-strings that are not unique (and hence weren’t counted are)

{ a, // prefix of aabba // since a was counted as prefix of "a" a, // prefix of abaabba a, // prefix of abba ab, // prefix of abba // since ab was counted as prefix of "abaabba" b, // prefix of baabba // since b was counted as prefix of "ba" ba, // prefix of baabba // since ba was counted as prefix of "ba" b // prefix of bba }

Now, you can find **U(A)** and **U(B)** by using the above algorithm. The only part that remains is to calculate **U(A union B)**.

This can be done by considering a string

c = a + "$" + b

We can find all the unique substrings of **c** and reduce from the result, all the strings that contain the character **‘'</b>. We know that all the different strings that contain <b>'’** are unique anyway, since **‘$’** is not part of the alphabet.

There are ** (|a|+1) * (|b|+1)** substrings of

**c**that contain

**‘$’**.

Thus, we can find **U(A union B)**, the number of unique substrings that exists in either **A**, or **B**.

We can find the number of unique sub-strings of either **a**, or **b**, but not both, in time **O(F(|a|) + F(|b|) + F(|a+b|)**, where **F(n)** is the complexity of calculating the Suffix Array or the LCA Array (which ever is more). In the best implementation, **F(n) = O(n)**. But the problem may be solved even with **F(n) = O(n log n)**.

### SETTER’S SOLUTION

Can be found here.

### TESTER’S SOLUTION

Can be found here.