Distance Nodes

Consider the following sequence:
Tree = ((((A,B),((C,D),E)),F),G,H)

This is a representation of an un-rooted tree includes matches parenthesis, commas and letters, each letter is indicating a leaf or node.

We should calculate distances between every two leaves. Distance between leaves into a pair parenthesis is one. For example distance (A,B) = (C,D) = 1.
Also, distance between (Nodei, Nodej) = (Nodej, Nodei). For example, distance(C,F) = distance (F,C).
For calculating nodes, I did following approach:
I put every pair nodes into a nod:

X = (A,B), Y = (C,D) ==> Tree = (((X,(Y,E)),F),G,H) 
Z = (Y,E) ==> Tree = (((X,Z),F),G,H)
T = (X,Z) ==> Tree = ((T,F),G,H)
U = (T,F) ==> Tree = (U,G,H)

Also, I put all of distances in a table:

	A	B	C	D	E	F	G	H
A	0	1	4	4	3	3	4	4
B	1	0	4	4	3	3	4	4
C	4	4	0	1	2	4	5	5
D	4	4	1	0	2	4	5	5
E	3	3	2	2	0	3	4	4
F	3	3	4	4	3	0	2	2
G	4	4	5	5	4	2	0	1
H	4	4	5	5	4	2	1	0

I need to a program to calculate distances between every two nodes.

Thanks a lot.

1 Like

If you have constructed tree you can find common parent for both nodes and distance is max path length form both nodes…

And how to find common parent?

int parentDistance( nodeA, nodeB )
    // it holds that nodeA.parent != nodeB && nodeB.parent != nodeA
    if ( nodeA == nodeB ) return 0;
    int da = nodeA.depth
    int db = nodeB.depth
    if ( da == db ) return parentDistance( nodeA.parent, nodeB.parent ) + 1;
    if ( da < db ) return parentDistance( nodeA, nodeB.parent ) + 1;
    if ( da > db ) return parentDistance( nodeA.parent, nodeB ) + 1;
1 Like

Obviously, the betlista solution is correct. It gives you the distance between a chosen pair of leaves, but it does not compute the matrix itself. That’s why i wondered if it was possible to do it. Following is the thinking i just had, without pretending it’s correct, nor it’s fast enough ! :slight_smile:

I think the solution will really depends on the data representation you choose. Let me explain.

Your problem is typically constructing a 0-diagonal symmetric distance matrix in a leaf-labeled tree. It’s a very common routine in phylogenetics, applied on so-called phylogenetic trees, to valuate the patristic distance between groups of DNA-related organisms. Of course, computing pairwise distances between nodes could be achieved with usual shortest path algorithms that compute the length of paths between all pairs of nodes (as Dijkstra, for example). That principle holds for any structure representation : trees, for instance.

But let me show you something.

If you represent your tree as a correctly parenthesized string, the distance between two nodes is simply the number of mispaired parenthesis between them. Thus, if you want to compute the distance between one chosen pair of leaves, it’s fast (it IS an equivalent solution to betlista’s one). For the whole matrix, the gap is probably located in turning this algorithm to a dynamic programming one.

Here is the Python code i wrote for a single pair of leaves :

#!/usr/bin/env python
#-*- coding:utf-8 -*-

def distance(T, start, stop):
    if start == stop: return 0
    a, b = T.index(start), T.index(stop)
    a, b = min(a, b), max(a, b)
    UP, PBP = 0, 0  # unbalanced / possibly balanced
    for c in T[a:b]:
        if   c == ')' and not PBP: UP += 1
        elif c == ')' and     PBP: PBP -= 1
        elif c == '(':             PBP += 1
    return UP + PBP + 1

def main():
    Tree = '((((A,B),((C,D),E)),F),G,H)'
    print ' ',
    for c2 in 'ABCDEFGH':
        print c2,
    print
    for c1 in 'ABCDEFGH':
        print c1,
        for c2 in 'ABCDEFGH':
            print distance(Tree, c1, c2),
        print

main()

If you want the whole matrix, just use the same idea with dynamic programming.

#!/usr/bin/env python
#-*- coding:utf-8 -*-

def init(T):
    UP, PBP = 0, 0  # unbalanced / possibly balanced
    result = {}
    for c in T:
        if   c in 'ABCDEFGH':
             result[c] = (UP, PBP)
        elif c == ')' and not PBP: UP  += 1
        elif c == ')' and     PBP: PBP -= 1
        elif c == '(':             PBP += 1
    return result

def distance(d, c1, c2):
    if  c1 == c2:
        return 0
    else:
        return abs(d[c2][0]-d[c1][0]) + abs(d[c2][1]-d[c1][1]) + 1

def main():
    Tree = '((((A,B),((C,D),E)),F),G,H)'
    d = init(Tree[4:-1])
    print ' ',
    for c2 in 'ABCDEFGH':
        print c2,
    print
    for c1 in 'ABCDEFGH':
        print c1,
        for c2 in 'ABCDEFGH':
            print distance(d, c1, c2),
        print

main()

Please note it’s only a proof-of-concept algorithm, as leaves names are hard-coded.

(d[c] is the numbers of ‘)’ and ‘(’ mispaired parenthesis between the first leaf and ‘c’)

This dp solution is in O(n) where n is the size of the string (representing the tree).

The previous one was in O(n^2) where n was the number of leaves.

Depending on how big is your tree, it could make a difference. Has it many leaves ? What is its maximum depth ? You should really consider these constraints, it matters. :slight_smile:

Well, it’s a possible solution. I don’t know if it’s the best one, but in my opinion, it’s worth giving it a try ! :slight_smile:

pictures uploading is not working on CodeChef, share it somewhere please (f.e. I’m using http://imgur.com/ for this)

Hi.

Thank you for your comment.
Actually I can build my tree from the mentioned approach (scan the tree every time and assign pair nodes to a parent and scan again)

But I am looking for faster way. I could find an approach for that and I am testing.
I will publish my solution after.