Help with ukkonen's algorithm

anon25504910 · December 24, 2019, 7:47am

I’m trying to use (almost a copy paste but rewritten) Segrey Makagonov’s code for Ukken’s algorithm for suffix tree construction and then dfs to construct suffix array.
However for N \sim 10^6 my program uses \sim 1.4 GB of space, and when used on problems on CF it gives memory limit exceeded.
Can someone suggest some changes so that I can bring the memory usage down?
Thanks.
This is the code I’m using:

#include <bits/stdc++.h>
using namespace std;
#define MAXN 1000000 //change as needed
#define alphabet 256 //(use 26 for lowercase latin etc)


//suffix tree implementation mostly based on sergey makagonov's work

const int upper_limit = 1 << 25;

int root; //pos of root
int last_added, pos, needSL;
int rem, active_node, active_edge, active_length;

struct node{
	//corresponds also to edge ending at this node
	int start;
	int end;
	int suffix_link;
	int next[alphabet];

	int edge_length(){
		return min(end,pos+1) - start; //distance from start
	}
};

node suffix_tree[2*MAXN];
char text[MAXN];

int new_node(int start, int end = upper_limit){
	node n;
	n.start = start;
	n.end = end;
	n.suffix_link = 0;
	for(int i = 0; i < alphabet; i++)
		n.next[i] = 0;
	suffix_tree[++last_added] = n;
	return last_added;
}

char active_edg(){
	return text[active_edge];
}

void add_suffix_link(int nod){
	if(needSL > 0) suffix_tree[needSL].suffix_link = nod;
	needSL = nod;
}

bool walk_down(int nod){
	if(active_length >= suffix_tree[nod].edge_length()){
		active_edge += suffix_tree[nod].edge_length();
		active_length -= suffix_tree[nod].edge_length();
		active_node = nod;
		return true;
	}
	return false;
}

void suffix_tree_init(){
	needSL = 0, last_added = 0, pos = -1, rem = 0, active_node = 0;
	active_edge = 0, active_length = 0;
	root = active_node = new_node(-1,-1);
}

void suffix_tree_extend(char c){
	text[++pos] = c;
	needSL = 0;
	rem++;
	while(rem > 0){
		if(active_length == 0) active_edge = pos;
		if(suffix_tree[active_node].next[active_edg()] == 0){
			int leaf = new_node(pos);
			suffix_tree[active_node].next[active_edg()] = leaf;
			add_suffix_link(active_node);
		}
		else{
			int nxt = suffix_tree[active_node].next[active_edg()];
			if(walk_down(nxt)) continue;
			if(text[suffix_tree[nxt].start + active_length] == c){
				active_length++;
				add_suffix_link(active_node);
				break;
			}
			int split = new_node(suffix_tree[nxt].start, suffix_tree[nxt].start + active_length);
			suffix_tree[active_node].next[active_edg()] = split;
			int leaf = new_node(pos);
			suffix_tree[split].next[c] = leaf;
			suffix_tree[nxt].start += active_length;
			suffix_tree[split].next[text[suffix_tree[nxt].start]] = nxt;
			add_suffix_link(split);
		}
		rem--;
		if(active_node == root && active_length > 0){
			active_length--;
			active_edge = pos - rem + 1;
		}
		else{
			active_node = suffix_tree[active_node].suffix_link > 0? suffix_tree[active_node].suffix_link : root;
		}
	}
}

int suffix_array[MAXN];   
int temp, temp2, ins = 0;

void dfs(int cur, int curlen){
	//cout << suffix_tree[cur].start << ' ' << suffix_tree[cur].end << ' ' << curlen << endl;
	temp = 0;
	if(suffix_tree[cur].next['$']){
		temp2 = suffix_tree[cur].next['$'];
		dfs(temp2, curlen + suffix_tree[cur].end - suffix_tree[cur].start);
	}
	for(char c = 'a'; c <= 'z'; c++){
		if(suffix_tree[cur].next[c]){
			//cout << c << endl;
			temp2 = suffix_tree[cur].next[c];
			dfs(temp2, curlen + suffix_tree[cur].end - suffix_tree[cur].start);
			//cout << '~' << c << endl;
		}
	}
	if(temp == 0){
		suffix_array[ins++] = suffix_tree[cur].start - curlen;
	}
	temp = 1;
}

int main(){
	ios_base::sync_with_stdio(false);
	cin.tie(NULL);
	string S;
	cin >> S;
	suffix_tree_init();
	for(int i = 0; i < S.size(); i++)
		suffix_tree_extend(S[i]);
	suffix_tree_extend('$');
	dfs(root,0);
	for(int i = 1; i <= S.size(); i++)
		cout << suffix_array[i] << ' ';
	cout << endl;
	return 0;
}

ssjgz · December 24, 2019, 7:51am

Do you really need an alphabet of 256 characters?

anon25504910 · December 24, 2019, 7:52am

I think not. Wait, I’ll check with a smaller size and revert.

anon25504910 · December 24, 2019, 8:06am

So when I changed my next[] size to 63 (for a-z + A-Z + 0-9 + ‘$’) it uses 356 MB for 10^6, and when I changed it to 27 for ‘a’-‘z’ and ‘$’ it is 150ish MB which is still quite high, though it helped.
What else can I do to reduce the memory usage?
Thanks btw.
The added lines :

#define alphabet 63

static inline int indx(char c){
	if(c == '$') return 0;
	else if(c >= 'a' && c <= 'z') return c-'a'+1;
	else if (c >= 'A' && c <= 'Z') return c-'A'+27;
	else return c-'0'+53;
}

Also, less I/O with disk => faster ! My program is now 2x faster. Thanks @ssjgz!
Though, it still uses 350 MB . Can you suggest further changes?

ssjgz · December 24, 2019, 8:10am

Hmmm … nothing leaps out. I always use a vector for my equivalent of your next (so next contains no “empty entries”: it’s size is precisely the set of transitions leading from a Node), but then there’s a sharp runtime vs memory tradeoff

anon25504910 · December 24, 2019, 8:14am

I still get memory limit exceeded on this one

The limit seems to be 262 MB ish, and the string is alphanumeric and not only lowercase, meaning that I can’t trim my array down further.
I don’t see where else I can cut memory usage…
Maybe I should use arrays instead of structs but that means rewriting the whole thing from scratch

ssjgz · December 24, 2019, 8:22am

Wait - do you actually need fancy suffix-tree stuff for this Problem? If i’m understanding it correctly, you just merge the words in order, and if merged is the merged string so far, and w is the next word for merge, then the time taken to merge w into merged to get the next merged string is just |w|, so the whole algorithm is linear in the sum over all w of |w|, which is 10^6.

anon25504910 · December 24, 2019, 8:24am

Yeah I don’t think I need suffix trees. Though, on problems where I would need them, this would still pop up as an issue.

ssjgz · December 24, 2019, 8:25am

I’ve done plenty of string Problems on Hackerrank, and I don’t think I’ve encountered one where memory usage was an issue

Edit:

Hmmm … actually, not so sure if there is a way to do this without suffix-trees, now.

anon25504910 · December 24, 2019, 8:29am

Can you post your implementation or the link to a good one?
Edit : BTW I tried using unordered_maps with a custom hash instead of arrays, but it takes way too long.

ssjgz · December 24, 2019, 9:01am

Yeah, I misjudged and don’t have a proper solution at the moment.

This might work, though I absolutely hate this approach as it is more of a “heuristic that is astronomically unlikely to fail” than an algorithm, but what the hey:

// Simon St James (ssjgz) - 2019-12-24
// 
// Solution to: https://codeforces.com/contest/1200/problem/E
//
#include <iostream>
#include <vector>

#include <cassert>

using namespace std;

template <typename T>
T read()
{
    T toRead;
    cin >> toRead;
    assert(cin);
    return toRead;
}

//#define VERIFY_HASH // Debugging - all seems to be correct, though.

class HashedString
{
    public:
        void appendLetter(const char letter)
        {
            const int letterIndex = letter - 'a' + 1;
            m_hash += letterIndex * m_powerOfPrime;
            m_powerOfPrime *= largeRandomPrime;

#ifdef VERIFY_HASH
            m_string += letter;
            assertHashIsCorrect();
#endif
        }
        void prependLetter(const char letter)
        {
            const int letterIndex = letter - 'a' + 1;
            m_hash *= largeRandomPrime;
            m_hash += letterIndex * largeRandomPrime;

            m_powerOfPrime *= largeRandomPrime;
#ifdef VERIFY_HASH
            m_string = letter + m_string;
            assertHashIsCorrect();
#endif
        };
        uint64_t hash() const
        {
            return m_hash;
        }
#ifdef VERIFY_HASH
       void assertHashIsCorrect()
       {
           uint64_t powerOfPrime = largeRandomPrime;
           uint64_t correctHash = 0;
           for (int i = 0; i < m_string.length(); i++)
           {
               const int letterIndex = m_string[i] - 'a' + 1;
               correctHash += letterIndex * powerOfPrime;

               powerOfPrime = powerOfPrime * largeRandomPrime;
           }
           cout << "String: " << m_string << " correctHash: " << correctHash << " m_hash: " << m_hash << endl;
           assert(correctHash == m_hash);
       }
#endif
    private:
#ifdef VERIFY_HASH
        string m_string;
#endif
        uint64_t m_hash = 0;
        static constexpr uint64_t largeRandomPrime = 189'385'063'559;
        uint64_t m_powerOfPrime = largeRandomPrime;
};

string mergeWords(const vector<string>& wordsToMerge)
{
    string merged;

    for (const auto& wordToMerge : wordsToMerge)
    {
        HashedString hashedPrefixOfNew;
        HashedString hashedSuffixOfMerged;
        int largestMatchLen = 0;
        for (int i = 0; i < min(wordToMerge.length(), merged.length()); i++)
        {
            hashedPrefixOfNew.appendLetter(wordToMerge[i]);
            hashedSuffixOfMerged.prependLetter(merged[merged.length() - 1 - i]);
            // Yuck - assumes that strings have the same hash if and only if they
            // are equivalent, which is mathematically provably false but 
            // enormously unlikely to be wrong in real-life.
            //
            // See https://www.hackerrank.com/challenges/gridland-provinces/problem for a
            // Problem where this sort of logic is the basis for the official Editorial
            // solution.
            if (hashedPrefixOfNew.hash() == hashedSuffixOfMerged.hash())
            {
                largestMatchLen = i + 1;
            }
        }

        merged += wordToMerge.substr(largestMatchLen);
    }
    
    return merged;
}

int main(int argc, char* argv[])
{
    ios::sync_with_stdio(false);

    const int numWords = read<int>();
    vector<string> wordsToMerge(numWords);
    for (auto& word : wordsToMerge)
    {
        word = read<string>();
    }

    cout << mergeWords(wordsToMerge) << endl;


    assert(cin);
}

I haven’t tested it beyond the sample testcases.

Edit:

The naive solution would be, for each 1 \le i \le |\textit{wordToMerge}|, to try and see if the suffix of \textit{merged} of length i is equal to the prefix of \textit{wordToMerge} - this sounds very, very similar to the kind of situation that KMP helps with.

anon25504910 · December 24, 2019, 9:10am

Thanks.
I tried implementing the cp algorithms code, with dfs for suffix array.
113 MB, somewhat better, but the code is slow: 1.8 seconds for 10^6.
I didn’t even try to read the compressed implementation given there on cp algos, its just so repelling.
Edit: the compressed implementation there has only 26 slots, i.e. no ‘$’ char support meaning no easy way to do dfs to find “leaves”. I don’t want to read the implementation and change it more, its already so repelling.

anon25504910 · December 24, 2019, 4:39pm

Wow, look at this.

I changed the endl’s to \n’s here, added some small tweaks and rewrote the lcp array function as it seemed to give a wrong output, and now for 10^6 it takes 0.3 seconds and 16 MB space. WOW.

malekfouda · December 7, 2020, 8:34pm

can you please send me the full source code of suffix tree? I need it urgently