Mundaú - Distributed Computing (or not): 2010

Wednesday, December 01, 2010

Theory Tuesdays -- What do Wikipedia, car-pooling, couch surfing, and social bookmarking have in common?

Last fall, I worked for four months in Google, Zürich. It was a fun and enriching experience, indeed. The opportunity to learn and contribute to the technology that is used by millions of people daily is quite exciting.

Besides enjoying the Google-life, during my days in Zürich, I was invited by two artist friends, Silvan Käelin [1,2,3] and Philip Matesic [4,5,6]), to give a talk about my PhD thesis research at Perla Mode -- as part of the Theory Tuesdays project.

Philip organizes a weekly event at Perla Mode named Theory Tuesdays. The goal is to bring together artists, researchers (of multiple disciplines), and the public to discuss a variety of topics such as art, technology, society, and their intersection.

I received the invitation with surprise and interest. It was a new experience to talk to an audience completely outside my field of research. Also, it was a chance to receive feedback (from such non-technical group) about what I think it is a relevant topic of study.

The idea was to have a middle ground between what I have been investigating (i.e., techniques to assess the value of contributions in peer production systems [pdf]) and a the broader topic that could interest the attendance.

Therefore, it seemed appropriate to introduce the notion of peer production systems [7], hint the questions I am interested in answering in this context, and asking the participants related questions such as: how do they perceive the value of information they consume online? do they often perceive themselves contributing to others by producing information online? what is the main incentive to do so? What are the aspects they take into account to decided whether an information provider produces value to them?

I am glad that the "talk" turned into a lively conversation about about all these questions and other aspects related to online peer production. We covered topics from the basic notion of social production (and why it works so well in certain scenarios), passed through specifics about the utility of tagging (e.g., classification languages may emerge through collaboration), and talked about the intuition behind the techniques I am designing to assess the value of contributions in social tagging systems.

Although anecdotal, it was possible to observe from the discussion two explicit trends on the perception of value of online peer-produced information: novelty and trust on the information producer. These aspects came up in the discussions as crucial to the users to assess the value of peer-produced information. It is important to note that the information consumer's interest is an implicitly aspect considered in the value assessment.

The observations are somehow intuitive, but for me it was quite important and helpful to have a first-hand discussion with the real users of the systems I study. It does help one to tune the questions to ask and where the relevancy of one's research. I hope to have more opportunities like this. Thanks to Silvan and Philip for the first one.

References:

[1] One Man
[2] Lagoa do Ouro
[3] Temps de Poussiére (Time of Dust)
[4] An Bonus
[5] Mau Series
[6] To Don Pedro with Mr. Gonzalez
[7] Y. Benkler. "The Wealth of Networks: How Social Production Transforms Markets and Freedom"

Friday, October 01, 2010

The Small World of File Sharing

The IEEE TPDS has recently published a preprint of an interesting piece of work that I had the pleasure to collaborate with.

Adriana Iamnitchi, Matei Ripeanu, Elizeu Santos-Neto, Ian Foster, "The Small World of File Sharing," IEEE Transactions on Parallel and Distributed Systems, 28 Sept. 2010. IEEE computer Society Digital Library. IEEE Computer Society.

Abstract:

Web caches, content distribution networks, peer-to-peer file sharing networks, distributed file systems, and data grids all have in common that they involve a community of users who use shared data. In each case, overall system performance can be improved significantly by first identifying and then exploiting the structure of community's data access patterns. We propose a novel perspective for analyzing data access workloads that considers the implicit relationships that form among users based on the data they access. We propose a new structure —the interest-sharing graph— that captures common user interests in data and justify its utility with studies on four data-sharing systems: a high-energy physics collaboration, the Web, the Kazaa peer-to-peer network, and a BitTorrent file-sharing community. We find small-world patterns in the interest-sharing graphs of all four communities. We investigate analytically and experimentally some of the potential causes that lead to this pattern and conclude that user preferences play a major role. The significance of small-world patterns is twofold: it provides a rigorous support to intuition and it suggests the potential to exploit these naturally emerging patterns. As a proof of concept, we design and evaluate an information dissemination system that exploits the small-world interest-sharing graphs by building an interest-aware network overlay. We show that this approach leads to improved information dissemination performance.

Thursday, July 08, 2010

Assessing the Value of Contributions in Tagging Systems

During the past two and a half months, I have been visiting the InWeb/UFMG at Belo Horizonte - MG - Brazil. Besides enjoying the great 'Cozinha Mineira', in this opportunity, we studied the problem of assessing the value of contributions in social tagging systems, and took the first steps towards a solution. The ideas will be presented in an article at the 2nd IEEE International Symposium on Social Intelligence and Networking (SIN-10) in 20-22August 2010. Next stop: to give a talk at the Laboratório de Sistemas Distribuídos (UFCG).

Abstract -- Assessing the value of individual users' contributions in peer-production systems is paramount to the design of mechanisms that support collaboration and improve users’ experience. For instance, to incentivize contributions, file-sharing systems based on the BitTorrent protocol equate value with volume of contributed content and use a prioritization mechanism to reward users who contribute more. This approach and similar techniques used in resource-sharing systems rely on the fact that the physical resources shared among users are easily quantifiable. In contrast, information-sharing systems, like social tagging systems, lack the notion of a physical resource unit (e.g., content size, bandwidth) that facilitates the task of evaluating user contributions. For this reason, the issue of estimating the value of user contributions in information sharing systems remains largely unexplored. This paper introduces this problem and takes the first steps towards a solution. More precisely, it presents a framework to design algorithms that estimate the value of user contributions in tagging systems, proposes three complementary success criteria for potential solutions, and outlines the methodological evaluation challenges.

Friday, January 29, 2010

Object-to-Object Similarity

Recently, I read a paper published at VLDB'2008 titled: Accuracy Estimates and Optimization Techniques for SimRank Computation by Lizorkin et al. [1].

In summary, the paper studies a specific algorithm -- SimRank -- that determines similarity scores between objects by augmenting an available link structure that connects the objects (e.g., hyperlinks between pages, friendship links in online social networks, similar tags between photos in Flickr). The intuition behind SimRank is that "two objects are similar if they are referred to by similar objects".

As the title indicates, the work contributions can be divided into two parts: 1) it provides accuracy estimates for the iterative computation of SimRank scores; and, 2) it proposes and analysis three optimization techniques that reduces the computational complexity of the original algorithm from O(n^4) to O(n^3).

One point, which seems to need a deeper investigation, is whether the similarity scores produced by SimRank have high quality (by quality I mean they do capture the true similarity between objects). Indeed, the original SimRank work by Jeh & Widom [2] and Lizorkin's paper lack a discussion (and experiments) on the quality of scores.

Nevertheless, the beauty of Lizorkin's paper resides on the fact that the proposed optimization techniques may be helpful in other contexts, though they originate from a specific context -- improving SimRank. In the following, I briefly describe the intuition behind each optimization.

Essential nodes -- by exploiting the graph structure, the authors are able to define the set of essential nodes of a particular node v, and more importantly, to prove that if a given node b is outside the set of essential nodes of v, the similarity score between them is zero. It turns out, that computing the set of essential nodes is much cheaper than computing the similarity score. Hence, it leads to the first optimization.

Partial sums -- SimRank similarity is recursive by definition. This implies that the similarity score of between some pairs may be reused (without recomputation) to compute the similarity between several other pairs of nodes. The idea applied here was to use memoization to reduce the cost of computing parts of the sum in the computation of a final score. It occurs to me that memoization may be good even in a dynamic graph (i.e., a graph that changes over time), depending, of course, on the trade off between accuracy of the scores and the computational cost implied by the recomputation.

Threshold-sieved similarity -- here, the idea relies on the fact that setting a minimum threshold on the similarity that should be considered enables a reduction in the number of node pairs that the similarity have to be computed. The interesting point is that the threshold does not break the relative ranking between similarity scores, as one could expect due to the fact that some nodes are discarded.

Currently, I am investigating the applicability of one or two of the techniques above in a problem that I have in hands now. More about it soon.

References:

[1] http://www.citeulike.org/user/elsantosneto/article/6509946
[2] http://www.citeulike.org/user/ldietz/article/349900

Mundaú - Distributed Computing (or not)