Obviously, the betlista solution is correct. It gives you the distance between a chosen pair of leaves, but it does not compute the matrix itself. That’s why i wondered if it was possible to do it. Following is the thinking i just had, without pretending it’s correct, nor it’s fast enough !
I think the solution will really depends on the data representation you choose. Let me explain.
Your problem is typically constructing a 0-diagonal symmetric distance matrix in a leaf-labeled tree. It’s a very common routine in phylogenetics, applied on so-called phylogenetic trees, to valuate the patristic distance between groups of DNA-related organisms. Of course, computing pairwise distances between nodes could be achieved with usual shortest path algorithms that compute the length of paths between all pairs of nodes (as Dijkstra, for example). That principle holds for any structure representation : trees, for instance.
But let me show you something.
If you represent your tree as a correctly parenthesized string, the distance between two nodes is simply the number of mispaired parenthesis between them. Thus, if you want to compute the distance between one chosen pair of leaves, it’s fast (it IS an equivalent solution to betlista’s one). For the whole matrix, the gap is probably located in turning this algorithm to a dynamic programming one.
Here is the Python code i wrote for a single pair of leaves :
#!/usr/bin/env python
#-*- coding:utf-8 -*-
def distance(T, start, stop):
if start == stop: return 0
a, b = T.index(start), T.index(stop)
a, b = min(a, b), max(a, b)
UP, PBP = 0, 0 # unbalanced / possibly balanced
for c in T[a:b]:
if c == ')' and not PBP: UP += 1
elif c == ')' and PBP: PBP -= 1
elif c == '(': PBP += 1
return UP + PBP + 1
def main():
Tree = '((((A,B),((C,D),E)),F),G,H)'
print ' ',
for c2 in 'ABCDEFGH':
print c2,
print
for c1 in 'ABCDEFGH':
print c1,
for c2 in 'ABCDEFGH':
print distance(Tree, c1, c2),
print
main()
If you want the whole matrix, just use the same idea with dynamic programming.
#!/usr/bin/env python
#-*- coding:utf-8 -*-
def init(T):
UP, PBP = 0, 0 # unbalanced / possibly balanced
result = {}
for c in T:
if c in 'ABCDEFGH':
result[c] = (UP, PBP)
elif c == ')' and not PBP: UP += 1
elif c == ')' and PBP: PBP -= 1
elif c == '(': PBP += 1
return result
def distance(d, c1, c2):
if c1 == c2:
return 0
else:
return abs(d[c2][0]-d[c1][0]) + abs(d[c2][1]-d[c1][1]) + 1
def main():
Tree = '((((A,B),((C,D),E)),F),G,H)'
d = init(Tree[4:-1])
print ' ',
for c2 in 'ABCDEFGH':
print c2,
print
for c1 in 'ABCDEFGH':
print c1,
for c2 in 'ABCDEFGH':
print distance(d, c1, c2),
print
main()
Please note it’s only a proof-of-concept algorithm, as leaves names are hard-coded.
(d[c] is the numbers of ‘)’ and ‘(’ mispaired parenthesis between the first leaf and ‘c’)
This dp solution is in O(n) where n is the size of the string (representing the tree).
The previous one was in O(n^2) where n was the number of leaves.
Depending on how big is your tree, it could make a difference. Has it many leaves ? What is its maximum depth ? You should really consider these constraints, it matters.
Well, it’s a possible solution. I don’t know if it’s the best one, but in my opinion, it’s worth giving it a try !