Friday, January 29, 2010

Object-to-Object Similarity

Recently, I read a paper published at VLDB'2008 titled: Accuracy Estimates and Optimization Techniques for SimRank Computation by Lizorkin et al. [1].

In summary, the paper studies a specific algorithm -- SimRank -- that determines similarity scores between objects by augmenting an available link structure that connects the objects (e.g., hyperlinks between pages, friendship links in online social networks, similar tags between photos in Flickr). The intuition behind SimRank is that "two objects are similar if they are referred to by similar objects".

As the title indicates, the work contributions can be divided into two parts: 1) it provides accuracy estimates for the iterative computation of SimRank scores; and, 2) it proposes and analysis three optimization techniques that reduces the computational complexity of the original algorithm from O(n^4) to O(n^3).

One point, which seems to need a deeper investigation, is whether the similarity scores produced by SimRank have high quality (by quality I mean they do capture the true similarity between objects). Indeed, the original SimRank work by Jeh & Widom [2] and Lizorkin's paper lack a discussion (and experiments) on the quality of scores.

Nevertheless, the beauty of Lizorkin's paper resides on the fact that the proposed optimization techniques may be helpful in other contexts, though they originate from a specific context -- improving SimRank. In the following, I briefly describe the intuition behind each optimization.

Essential nodes -- by exploiting the graph structure, the authors are able to define the set of essential nodes of a particular node v, and more importantly, to prove that if a given node b is outside the set of essential nodes of v, the similarity score between them is zero. It turns out, that computing the set of essential nodes is much cheaper than computing the similarity score. Hence, it leads to the first optimization.

Partial sums -- SimRank similarity is recursive by definition. This implies that the similarity score of between some pairs may be reused (without recomputation) to compute the similarity between several other pairs of nodes. The idea applied here was to use memoization to reduce the cost of computing parts of the sum in the computation of a final score. It occurs to me that memoization may be good even in a dynamic graph (i.e., a graph that changes over time), depending, of course, on the trade off between accuracy of the scores and the computational cost implied by the recomputation.

Threshold-sieved similarity -- here, the idea relies on the fact that setting a minimum threshold on the similarity that should be considered enables a reduction in the number of node pairs that the similarity have to be computed. The interesting point is that the threshold does not break the relative ranking between similarity scores, as one could expect due to the fact that some nodes are discarded.

Currently, I am investigating the applicability of one or two of the techniques above in a problem that I have in hands now. More about it soon.