Recently, I read a paper published at VLDB 2008, "Accuracy Estimates and Optimization Techniques for SimRank Computation", by Lizorkin et al. [1].
In summary, the paper studies a specific algorithm -- SimRank -- that determines similarity scores between objects by exploiting an available link structure that connects the objects (e.g., hyperlinks between pages, friendship links in online social networks, similar tags between photos in Flickr). The intuition behind SimRank is that "two objects are similar if they are referred to by similar objects".
As the title indicates, the contributions can be divided into two parts: 1) it provides accuracy estimates for the iterative computation of SimRank scores; and 2) it proposes and analyzes three optimization techniques that reduce the computational complexity of the original algorithm from O(n^4) to O(n^3).
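To make the discussion concrete, here is a minimal sketch of the naive iterative computation as I understand it from Jeh & Widom's definition. It is written in Python, and the toy graph, the decay factor C and the number of iterations are illustrative choices of mine, not the paper's setup. The double sum over in-neighbor pairs for every node pair is where the high cost mentioned above comes from.

```python
from collections import defaultdict

def simrank(edges, C=0.8, iterations=5):
    """Naive iterative SimRank: the score of (a, b) is the decayed
    average similarity over all pairs of their in-neighbors."""
    nodes = sorted({u for e in edges for u in e})
    in_nbrs = defaultdict(list)
    for u, v in edges:                 # directed edge u -> v
        in_nbrs[v].append(u)

    # s_0: 1 on the diagonal, 0 elsewhere
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}

    for _ in range(iterations):
        new_sim = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_sim[(a, b)] = 1.0
                    continue
                ia, ib = in_nbrs[a], in_nbrs[b]
                if not ia or not ib:
                    new_sim[(a, b)] = 0.0
                    continue
                total = sum(sim[(i, j)] for i in ia for j in ib)
                new_sim[(a, b)] = C * total / (len(ia) * len(ib))
        sim = new_sim
    return sim

# toy graph: a university page links to two professors, and each
# professor links to a student; the students become similar through
# their professors
edges = [("univ", "profA"), ("univ", "profB"),
         ("profA", "studA"), ("profB", "studB")]
scores = simrank(edges)
print(round(scores[("studA", "studB")], 2))   # 0.64 with C = 0.8
```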
One point that seems to need deeper investigation is whether the similarity scores produced by SimRank are of high quality (by quality I mean that they capture the true similarity between objects). Indeed, both the original SimRank work by Jeh & Widom [2] and Lizorkin's paper lack a discussion (and experiments) on the quality of the scores.
Nevertheless, the beauty of Lizorkin's paper resides in the fact that the proposed optimization techniques may be helpful in other contexts, though they originate from a specific one -- improving SimRank. In the following, I briefly describe the intuition behind each optimization.
Essential nodes -- by exploiting the graph structure, the authors are able to define the set of essential nodes of a particular node v and, more importantly, to prove that if a given node b is outside the set of essential nodes of v, the similarity score between them is zero. It turns out that computing the set of essential nodes is much cheaper than computing the similarity score; hence the first optimization.
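My reading of the structural fact behind this (a paraphrase, not the paper's exact definition of essential nodes): SimRank propagates scores along equal-length backward walks that meet at a common node, so if no node reaches both u and v by walks of equal length, their score stays zero. A rough sketch of using that as a cheap zero-score filter, reusing the edge list from the sketch above:

```python
def backward_levels(in_nbrs, start, max_len):
    """Sets of nodes reachable from `start` by backward walks of
    length exactly 0, 1, ..., max_len (walks may revisit nodes)."""
    levels = [{start}]
    for _ in range(max_len):
        nxt = set()
        for u in levels[-1]:
            nxt.update(in_nbrs.get(u, ()))
        levels.append(nxt)
    return levels

def may_be_similar(in_nbrs, u, v, max_len):
    """If this returns False, no common node reaches both u and v by
    equal-length walks of length <= max_len, so the iterative SimRank
    score is still zero after max_len iterations."""
    lu = backward_levels(in_nbrs, u, max_len)
    lv = backward_levels(in_nbrs, v, max_len)
    return any(a & b for a, b in zip(lu, lv))

in_nbrs = {}
for s, t in edges:                       # edges from the sketch above
    in_nbrs.setdefault(t, set()).add(s)
print(may_be_similar(in_nbrs, "studA", "studB", 3))  # True (score can be > 0)
print(may_be_similar(in_nbrs, "univ", "studA", 3))   # False (score is 0)
```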
Partial sums -- SimRank similarity is recursive by definition. This implies that the similarity scores of some pairs may be reused (without recomputation) when computing the similarity of several other pairs of nodes. The idea applied here is to use memoization to reduce the cost of computing parts of the sum that yields a final score. It occurs to me that memoization may be useful even in a dynamic graph (i.e., a graph that changes over time), depending, of course, on the trade-off between the accuracy of the scores and the computational cost implied by recomputation.
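A rough sketch of the memoization, as I read it: for a fixed node a, the inner sum of s(i, j) over the in-neighbors i of a is computed once per candidate j and then reused for every pair (a, b), instead of being recomputed inside each pair.

```python
def simrank_partial_sums(edges, C=0.8, iterations=5):
    """SimRank with memoized partial sums over in-neighbor sets."""
    nodes = sorted({u for e in edges for u in e})
    in_nbrs = {}
    for u, v in edges:
        in_nbrs.setdefault(v, []).append(u)

    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}

    for _ in range(iterations):
        new_sim = {}
        for a in nodes:
            ia = in_nbrs.get(a, [])
            # partial[j] = sum of s(i, j) over i in I(a), computed once
            partial = {j: sum(sim[(i, j)] for i in ia) for j in nodes}
            for b in nodes:
                if a == b:
                    new_sim[(a, b)] = 1.0
                    continue
                ib = in_nbrs.get(b, [])
                if not ia or not ib:
                    new_sim[(a, b)] = 0.0
                    continue
                total = sum(partial[j] for j in ib)
                new_sim[(a, b)] = C * total / (len(ia) * len(ib))
        sim = new_sim
    return sim

print(round(simrank_partial_sums(edges)[("studA", "studB")], 2))  # same 0.64
```

For the toy graph it returns the same scores as the naive version; the gain is that the double sum per pair collapses into a single sum over I(b), which, as I understand it, is where the complexity reduction comes from.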
Threshold-sieved similarity -- here, the idea is that setting a minimum threshold on the similarity scores worth considering reduces the number of node pairs for which the similarity has to be computed. The interesting point is that the threshold does not break the relative ranking of similarity scores, as one might expect given that some node pairs are discarded.
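The paper applies the threshold during the iterations, so low-similarity pairs are skipped early while the ranking guarantee still holds. The snippet below only illustrates the output side of that idea with a naive post-hoc cutoff; it is not the paper's construction.

```python
def sieve(sim, delta=0.05):
    """Keep only off-diagonal pairs whose score is at least delta;
    a fixed cutoff cannot reorder the pairs that survive it."""
    kept = {p: s for p, s in sim.items() if p[0] != p[1] and s >= delta}
    return sorted(kept.items(), key=lambda item: -item[1])

for (a, b), s in sieve(simrank_partial_sums(edges), delta=0.1):
    print(a, b, round(s, 2))
```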
Currently, I am investigating the applicability of one or two of the techniques above to a problem I have at hand. More about it soon.
References:
[1] http://www.citeulike.org/user/elsantosneto/article/6509946
[2] http://www.citeulike.org/user/ldietz/article/349900
Friday, January 29, 2010
Friday, October 23, 2009
Spam in social software
From Greg Linden's blog:
An arms race in spamming social software
by Greg Linden (16 Oct 2009)
http://glinden.blogspot.com/2009/10/arms-race-in-spamming-social-software.html
The post presents an ultra-short, yet interesting, summary of spam strategies observed in social software. In particular, Greg Linden mentions several observed badmouthing strategies, where the spammer tries to taint competitors' reputations, as opposed to attempting to use spam to obtain benefits directly. Interestingly, Chen and Friedman [1] conjecture that badmouthing attacks are harder to tackle than attacks that attempt to artificially boost one's reputation.
The post ends by suggesting that incentive-based approaches to deterring spam are (possibly good) alternatives to techniques that detect spam through content analysis (e.g., detecting commercial intent in blog comments). I would say they are complementary, though.
[1] Chen & Friedman. "Sybilproof Reputation Mechanisms".
http://doi.acm.org/10.1145/1080192.1080202
Wednesday, August 26, 2009
Relationship between cross-field citations and work impact
A recent work by Shi, Adamic, Tseng and Clarkson [1] presents an interesting analysis of the relationship between works that draw from different areas (i.e., cite papers outside their fields) and their subsequent impact.
One of the interesting bits:
[...]
Intuitively, any individual citation will at most have a very weak impact on the success of a citing paper. It will only be one of possibly dozens of references made in an article or patent. Other factors, such as the publication venue and the reputation of the authors, are more likely to contribute to the impact of the article than any individual citation the authors include. We nevertheless see a significant relationship between the interdisciplinarity of citations and the impact of the publication.
[...]
This reminds me of previous results on the relationship between network constraint and value of ideas [2]. The intuition is that a person who is in a bridge position in her social network (i.e., connecting two distinct groups) is more exposed to different ways of thinking, which may lead to that person having more valuable ideas. Here, the social network is the citation network, and the bridges are papers that cite otherwise unconnected clusters (i.e., fields).
A recipe for higher impact research?
[1] Shi et al. 2009. The Impact of Boundary Spanning Scholarly Publications and Patents. PLoS ONE.
[2] Burt, R., 2003. Structural Holes and Good Ideas. American Journal of Sociology.
Wednesday, July 15, 2009
The Internet and its Topology
A while ago, I came across an article that discusses points related to network modeling and characterization [1], particularly the Internet physical topology.
The motivation used by the authors is, as the article puts it, the power-law argument. In particular, the authors highlight important points researchers should focus on when performing similar studies. They also go further and challenge the now traditional assumption that the Internet topology resembles a scale-free network. In this post, I briefly summarize the paper.
The paper recounts the now well-known and accepted scale-free Internet argument (i.e., that the Internet topology is well modeled by a network with a power-law node degree distribution). These arguments, the authors point out, rest on traceroute measurements, which are limited in how accurately they can determine the physical topology, since traceroute only captures router interfaces instead of physical boxes.
In short, the work proposes to focus on the decision-making process an Internet service provider goes through when planning and deploying its physical infrastructure, as opposed to using traceroute traces to inspire models of the Internet topology. Moreover, the authors suggest that the right tool for this is constrained optimization, which allows one to formalize that decision-making process.
Other interesting points extracted from the paper:
a) A node's high degree implies low-capacity links and, conversely, a low degree implies high-capacity links, due to a router's limited capacity for processing traffic. Thus, high-degree nodes would sit at the edge of the network rather than at the core, which differs from what most traceroute-based studies claim.
b) From the paper: to avoid confusion, and to emphasize the fact that preferential attachment is just one of many mechanisms capable of generating scale-free graphs, we will refer here to the network models proposed in [2].
Suggested principles on characterization and modeling, which, I think, are sufficiently general to be applied in fields other than network topology characterization:
1. Know your data: the data used by Faloutsos, Faloutsos, and Faloutsos [3] was not intended to capture the physical topology, but rather to "get some experimental data on the shape of multicast trees one can actually obtain in the real Internet".
2. When modeling is more than data fitting: if we wish to increase our confidence in a proposed model, we ought also to ask what new types of measurements are either already available or could be collected and used for validation. (By new they mean completely new types of data not involved whatsoever with the original modelling exercise.)
3. Know your statistic: once one agrees that the data set is problematic, one could try to use a more robust statistic to avoid such mistakes.
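As a concrete illustration of the "know your statistic" point (my own example, not taken from the article): fitting a straight line to a log-log degree histogram by least squares is the fragile approach, while a maximum-likelihood estimate of the power-law exponent, in the spirit of Clauset, Shalizi and Newman, is the kind of more robust statistic one could reach for. A minimal sketch in Python:

```python
import math

def powerlaw_alpha_mle(degrees, x_min=1):
    """Maximum-likelihood estimate of the exponent alpha of a power-law
    tail P(x) ~ x^(-alpha), using the standard discrete approximation
    alpha ~= 1 + n / sum(ln(x_i / (x_min - 0.5))) over x_i >= x_min."""
    tail = [d for d in degrees if d >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    return 1.0 + len(tail) / sum(math.log(d / (x_min - 0.5)) for d in tail)

# made-up degree sequence, purely illustrative
degrees = [1, 1, 1, 2, 2, 3, 3, 4, 5, 8, 13, 40]
print(round(powerlaw_alpha_mle(degrees, x_min=2), 2))
```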
[1] W. Willinger, D. Alderson and C. Doyle. "Mathematics and the Internet: A source of enormous confusion and great potential", Notices of the AMS, Vol. 56, no. 5, May 2009.
[2] A.-L. Barabási and R. Albert, Emergence of scaling in random networks, Science 286 (1999).
[3] M. Faloutsos, P. Faloutsos, and C. Faloutsos, On power-law relationships of the Internet topology, ACM SIGCOMM (1999).
Tuesday, June 23, 2009
The role of the scientific method
Here is an interesting article by John Polanyi published by The Globe and Mail.
Hope lies in the scientific method by John Polanyi.
The article discusses the role of science and, more importantly in my opinion, the importance of scientists keeping a critical view of the impact that political decisions may have on the lives of many.
Food for thought, indeed.
Wednesday, June 03, 2009
Data Reliability Tradeoffs
Abdullah Gharaibeh (my colleague at the NetSysLab) has recent work that explores combining storage components that are heterogeneous in terms of cost, reliability and throughput.
The work proposes a storage architecture that leverages idle storage resources, located on volatile nodes (e.g., desktops), to provide high-throughput access at low cost. Moreover, the system is designed to provide durability through a low-throughput durable storage component (e.g., tape).
What I like about the solution is that it nicely shows how to decouple the components that provide two important features for data-intensive applications: availability and durability. Moreover, this separation (together with the evidence shown by the experiments) helps system administrators reason about deployment cost.
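A toy discrete-time simulation (entirely my own sketch, with made-up failure and repair parameters, not the analytical model from the paper) of why the two concerns can be provisioned separately: the durable back-end guarantees the data is never lost, so the interesting question becomes how often the fast, volatile front-end actually has a replica online.

```python
import random

def front_end_availability(steps=100_000, replicas=3, p_fail=0.01,
                           repairs_per_step=1, seed=1):
    """Fraction of time at least one volatile replica is online, given
    that a durable back-end restores lost replicas at a bounded rate."""
    random.seed(seed)
    online = replicas
    available = 0
    for _ in range(steps):
        # each volatile replica fails independently in this time step
        online -= sum(1 for _ in range(online) if random.random() < p_fail)
        # the back-end's bandwidth bounds how many replicas it can restore
        online = min(replicas, online + repairs_per_step)
        if online > 0:
            available += 1
    return available / steps

print(front_end_availability(repairs_per_step=1))  # ample repair bandwidth
print(front_end_availability(repairs_per_step=0))  # no repairs at all
```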
Here is the abstract:
Abdullah Gharaibeh and Matei Ripeanu. Exploring Data Reliability Tradeoffs in Replicated Storage Systems. ACM/IEEE International Symposium on High Performance Distributed Computing (HPDC 2009), Munich, Germany, June 2009.
This paper explores the feasibility of a cost-efficient storage architecture that offers the reliability and access performance characteristics of a high-end system. This architecture exploits two opportunities: First, scavenging idle storage from LAN connected desktops not only offers a low-cost storage space, but also high I/O throughput by aggregating the I/O channels of the participating nodes. Second, the two components of data reliability – durability and availability – can be decoupled to control overall system cost. To capitalize on these opportunities, we integrate two types of components: volatile, scavenged storage and dedicated, yet low-bandwidth durable storage. On the one hand, the durable storage forms a low-cost back-end that enables the system to restore the data the volatile nodes may lose. On the other hand, the volatile nodes provide a high-throughput front-end. While integrating these components has the potential to offer a unique combination of high throughput, low cost, and durability, a number of concerns need to be addressed to architect and correctly provision the system. To this end, we develop analytical- and simulation-based tools to evaluate the impact of system characteristics (e.g., bandwidth limitations on the durable and the volatile nodes) and design choices (e.g., replica placement scheme) on data availability and the associated system costs (e.g., maintenance traffic). Further, we implement and evaluate a prototype of the proposed architecture: namely a GridFTP server that aggregates volatile resources. Our evaluation demonstrates an impressive, up to 800MBps transfer throughput for the new GridFTP service.
Wednesday, April 22, 2009
Individual and Social Behavior in Tagging Systems
As part of a much broader investigation of the peer production of information, Individual and Social Behavior in Tagging Systems [1] is a recent work that focuses on quantitative aspects of tag reuse, item re-tagging, and the implicit social relations inferred from the similarity of user interests.
The observations point to interesting directions for the design of systems, such as recommendation systems, that aim at exploiting past user activity. For instance, the paper provides quantitative evidence for why item recommendation tends to be less effective than tag recommendation in these systems (based on the relatively higher level of tag reuse compared to item re-tagging).
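For intuition on what tag reuse versus item re-tagging means in practice, here is one simple way to compute the two rates from a chronological (user, item, tag) log; this is my own simplification for illustration, not necessarily the exact metrics used in the paper.

```python
def reuse_rates(assignments):
    """assignments: list of (user, item, tag) tuples in chronological order.
    Tag reuse: fraction of assignments whose tag already appeared earlier.
    Item re-tagging: the same, but for items."""
    seen_tags, seen_items = set(), set()
    tag_reused = item_retagged = 0
    for user, item, tag in assignments:
        if tag in seen_tags:
            tag_reused += 1
        if item in seen_items:
            item_retagged += 1
        seen_tags.add(tag)
        seen_items.add(item)
    n = len(assignments)
    return tag_reused / n, item_retagged / n

log = [("u1", "photo1", "sunset"), ("u2", "photo2", "sunset"),
       ("u2", "photo3", "beach"), ("u3", "photo1", "vacation")]
print(reuse_rates(log))  # tag reuse 0.25, item re-tagging 0.25
```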
I must mention that this work is the result of a collaboration with an enthusiastic team: Nazareno Andrade, David Condon, Adriana Iamnitchi and Matei Ripeanu.
Reference:
[1] Elizeu Santos-Neto, David Condon, Nazareno Andrade, Adriana Iamnitchi and Matei Ripeanu. "Individual and Social Behavior in Tagging Systems". In the 20th ACM Conference on Hypertext and Hypermedia. Torino, Italy, June 29-July 1, 2009.
Tuesday, March 10, 2009
Tracing Influenza Epidemics via Crowdsourcing
Researchers from Google (Mountain View, CA, USA) and the Centers for Disease Control and Prevention (Atlanta, GA, USA) recently reported a method for building a model that helps detect the spread of influenza.
The article describes the analysis of search logs in combination with records of doctor visits related to Influenza-like Illness (ILI). The goal of the model is to predict the percentage of doctor visits that are ILI-related. The authors report a correlation of up to 0.96 between the model predictions and the Centers for Disease Control (CDC) reports.
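Roughly speaking (and this is my paraphrase of the modeling idea, not the authors' code), the model is a simple linear fit between log-odds-transformed quantities: the fraction of ILI-related search queries on one side and the fraction of ILI-related doctor visits on the other. A minimal sketch with numpy, using made-up weekly numbers:

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def fit_ili_model(query_fraction, ili_visit_fraction):
    """Fit logit(ILI visit fraction) = b0 + b1 * logit(query fraction)
    by ordinary least squares (a sketch of the modeling idea)."""
    x = logit(np.asarray(query_fraction))
    y = logit(np.asarray(ili_visit_fraction))
    b1, b0 = np.polyfit(x, y, 1)
    return b0, b1

def predict(b0, b1, query_fraction):
    z = b0 + b1 * logit(np.asarray(query_fraction))
    return 1 / (1 + np.exp(-z))   # back from log-odds to a fraction

# made-up weekly data, purely illustrative
queries = [0.010, 0.012, 0.020, 0.035, 0.030, 0.015]
visits  = [0.008, 0.009, 0.018, 0.040, 0.033, 0.012]
b0, b1 = fit_ili_model(queries, visits)
print(predict(b0, b1, [0.025]).round(3))
```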
Citation:
Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski and Larry Brilliant. Detecting influenza epidemics using search engine query data. Nature 457, 1012-1014 (19 February 2009).
Tuesday, March 03, 2009
One Microsoft Way
Last week, in Redmond, Microsoft Research showcased projects from its labs worldwide during TechFest 2009. As part of my internship last year at the lovely Microsoft Research Cambridge, UK, I worked on the Mobile Content-Casting project, which was also showcased.
Ars Technica has an article summarizing some of the projects presented at TechFest 2009.
It is worth taking a look: Microsoft Research TechFest 2009: a glance at the road ahead.
I found SecondLight and the suite of Social-Digital demos particularly cool. Both are from MS Research Cambridge, UK.
Friday, February 27, 2009
A nice short story
This is a fun short story published by Nature that I came across recently.
"Lost in sun and silence -- The golden age of communication"
by Vincenzo Palermo. Nature, Vol 457, 26 February 2009.
"Lost in sun and silence -- The golden age of communication"
by Vincenzo Palermo. Nature, Vol 457, 26 February 2009.
Wednesday, August 13, 2008
The Anticommons
The New Yorker's Financial Page has an interesting article: The Permission Problem, by James Surowiecki (the author of The Wisdom of Crowds).
James Surowiecki discusses the notion of anticommons as presented by Professor Michael Heller (The Gridlock Economy).
To illustrate the point, Surowiecki points out two extreme scenarios of the resource-sharing problem: i) the common-good model, where the resource is deemed public and is shared among individuals with no notion of individual ownership over it; and ii) the private-property model, with unlimited property rights, where the resource is owned by a subset of individuals who may charge others who want to consume units of it.
The article says that, on the one hand, common goods may lead to the well-known tragedy of the commons: overuse. On the other hand, unlimited property rights may lead to exactly the opposite: waste of resources (or the tragedy of the anticommons).
The article has nice examples:
[...]
The commons leads to overuse and destruction; the anticommons leads to underuse and waste. In the cultural sphere, ever tighter restrictions on copyright and fair use limit artists’ abilities to sample and build on older works of art. In biotechnology, the explosion of patenting over the past twenty-five years—particularly efforts to patent things like gene fragments—may be retarding drug development, by making it hard to create a new drug without licensing myriad previous patents. Even divided land ownership can have unforeseen consequences. Wind power, for instance, could reliably supply up to twenty per cent of America’s energy needs—but only if new transmission lines were built, allowing the efficient movement of power from the places where it’s generated to the places where it’s consumed. Don’t count on that happening anytime soon. Most of the land that the grid would pass through is owned by individuals, and nobody wants power lines running through his back yard.
[...]
It seems to me that certain computational environments present an interesting middle ground between the two extremes discussed above. For example, Nazareno pointed a while ago to large-scale commons-based Wi-Fi: users who own an Internet connection may share their spare capacity in exchange for either using the spare capacity of others later or being paid for it. The wonderful insight of this resource-sharing model is that people buy more Internet bandwidth (like many other computational goods, if I can call them that) than they are able to use. Hence, resources are mostly underutilized. So, why not share the spare capacity in exchange for access to others' spare capacity in the future?
Finally, a question comes to my mind: besides the fact that certain resource units bear extra capacity by definition (e.g., my CPU is often 99% idle), does any other intrinsic resource characteristic play a role in suggesting which model is suitable for sharing that resource?
Sunday, July 27, 2008
Information Management in Living Organisms
Nature has an article by Paul Nurse (Life, Logic and Information), where he discusses some ideas on studying living organisms as information management systems.
Paul Nurse suggests that analyzing the information flow in living organisms would help us understand certain behaviors that are not completely understood today.
From a computer systems researcher's standpoint, the interesting aspect of Paul Nurse's idea is that it goes in the opposite direction of previous studies: instead of drawing inspiration from nature to build information management systems (e.g., ant-inspired algorithms), the author proposes to use the tools of information science to study nature.
A piece from the article:
Systems analyses of living organisms have used a variety of biochemical and genetic interaction traps with the emphasis on identifying the components and describing how these interact with each other. These approaches are essential but need to be supplemented by more investigation into how living systems gather, process, store and use information, as was emphasized at the birth of molecular biology.
This sounds exciting, as a better understanding of living systems could feed back into the bio-inspired approach to designing distributed computational systems.
A couple of years ago, I briefly explored the design of a distributed storage system based on the behavior of Messor barbarus ants (for more details on M. barbarus, see Anderson, C., J.J. Boomsma, and J.J. Bartholdi, III. 2002. Task partitioning in insect societies: bucket brigades. Insectes Sociaux 49(2): 171-180).
The rationale behind it is quite simple: every time an unloaded larger ant encounters a loaded smaller ant, the load is passed from the smaller to the larger ant. This naturally spreads the work among the workers according to their capacity (strength and speed).
Bringing it back to the context of distributed storage systems, the idea is to enable a self-organizing load-balancing scheme by having higher-capacity nodes request more load from lower-capacity nodes. The goal is to improve throughput and data availability.
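A toy round-based sketch of my own (made-up capacities and loads, and a simplified transfer rule, not a real protocol): when two nodes meet, they split their combined load in proportion to their capacities, so load drifts from low-capacity nodes toward high-capacity ones.

```python
import random

def rebalance(capacities, loads, rounds=1000, seed=42):
    """Toy bucket-brigade variant: two nodes meet at random and split
    their combined load in proportion to their capacities."""
    random.seed(seed)
    loads = list(loads)
    for _ in range(rounds):
        a, b = random.sample(range(len(capacities)), 2)
        shared_util = (loads[a] + loads[b]) / (capacities[a] + capacities[b])
        loads[a] = shared_util * capacities[a]
        loads[b] = shared_util * capacities[b]
    return [round(x, 2) for x in loads]

capacities = [10, 20, 40, 80]   # e.g., node bandwidth or storage
loads      = [10, 10, 10, 10]   # load starts out uniform
print(rebalance(capacities, loads))
# converges to load proportional to capacity: [2.67, 5.33, 10.67, 21.33]
```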
Obviously, a comprehensive performance evaluation is necessary before claiming that this strategy would lead to a globally efficient system.
Thursday, July 03, 2008
HPDC'08 - Part II
Gilles Fedak delivered an invited talk at the UPGRADE-CN workshop (part of HPDC'08). He presented BitDew, a programmable environment that targets data management in desktop grids [1].
The rationale behind BitDew is that applications can define routines for data manipulation. These routines are expressed via predefined metadata, which are used by the infrastructure mechanisms to perform data management tasks, such as replication.
In a technical report, Gilles and colleagues present use cases and a performance evaluation of the mechanisms that provide data management functionality in BitDew.
In particular, I found the approach of leveraging metadata interesting. The predefined set of metadata allows the application layer to communicate requirements to the infrastructure regarding the desired level of fault tolerance and transfer protocols, for example.
In fact, this intersects with one of our projects at the NetSys Lab, where we investigate the use of custom metadata as a cross-layer communication method for storage systems [2].
As we use the file system interface to separate the application and storage layers, the two approaches (BitDew and our custom-metadata approach) seem complementary: the metadata passed by applications via BitDew could propagate to the file system, where it would interact with the extended attributes interface (which our solution exploits).
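For concreteness, this is roughly what that cross-layer channel looks like at the file system level on Linux, through the user.* extended-attribute namespace. The attribute names below are made up for illustration; they are not the actual tags used by BitDew or by our prototype.

```python
import os

path = "/tmp/dataset.bin"     # needs a file system with user xattr support
open(path, "wb").close()      # make sure the file exists

# the application layer tags the file with hints for the storage layer
os.setxattr(path, "user.replication_level", b"3")
os.setxattr(path, "user.preferred_protocol", b"bittorrent")

# the storage layer (or any other layer) reads the hints back
for attr in os.listxattr(path):
    if attr.startswith("user."):
        print(attr, "=", os.getxattr(path, attr).decode())
```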
More coding fun to come...
[1] Fedak et al. "BitDew: A Programmable Environment for Large-Scale Data Management and Distribution". Technical Report, 6427, INRIA.
[2] Elizeu Santos-Neto, Samer Al-Kiswany, Nazareno Andrade, Sathish Gopalakrishnan and Matei Ripeanu. "Enabling Cross-Layer Optimizations in Storage Systems with Custom Metadata". In HPDC'08 - HotTopics. Boston, MA, USA. September, 2008.
Monday, June 30, 2008
HPDC'2008 - Part I
Last week I participated in two great conferences: the International Symposium on High Performance Distributed Computing (HPDC) and USENIX. Both events took place in Boston, MA, USA.
There are a lot of interesting things to mention. So, to avoid a single long post, I will describe a few of the presentations I attended (and discuss some ideas) in a series of short posts.
In the first two days at HPDC, there were two interesting workshops: UPGRADE-CN (P2P and Grids for the Development of Content Networks) and MMS (Managed Multicore Systems).
Both workshops featured work related to the research projects I am currently working on.
Molina [1] presented his work on designing two protocols that enable collaborative content delivery in mobile transient networks. By transient networks, the authors mean networks composed of devices that are geographically co-located for a short period, such as during a music festival.
The authors suggest exploiting the multiple network interfaces currently available in most mobile devices and enabling collaborative use of these multihomed devices.
The idea is quite interesting. In particular, it raises some issues from the perspective of distributed resource sharing. It would be good to understand whether incentive mechanisms are necessary in transient networks to encourage users to share their connections with a community for collaborative downloading/streaming of content.
On top of that, a nice follow-up would be to investigate the feasibility, in transient-network scenarios, of collaborative data dissemination protocols that are widely used on the Internet (e.g., BitTorrent).
[1] Molina et al. "A Social Framework for Content Distribution in Mobile Transient Networks". In UPGRADE-CN'2008.
Thursday, June 05, 2008
IPTV Viewing Habits and Netflix Player
IPTPS 2008 has an interesting paper on exploiting TV viewing habits to reduce the traffic that IPTV consumers generate on the ISP backbone.
"On Next-Generation Telco-Managed P2P TV Architectures" by Meeyoung Cha (KAIST), Pablo Rodriguez (Telefonica Research, Barcelona), Sue Moon (KAIST), and Jon Crowcroft (University of Cambridge).
In this paper, the authors analyze the use of P2P content distribution techniques to reduce network overhead in an Internet Service Provider's IPTV infrastructure. To exploit the patterns of channel holding time, channel popularity, and the correlation between the time of day and the number of viewers, the authors propose a locality-aware P2P content distribution scheme that reduces the traffic on the ISP backbone.
From the paper:
we ascertain the sweet spots and the overheads of server-based unicast, multicast, and serverless P2P and also show the empirical lower bound network cost of P2P (where cost reduction is up to 83% compared to current IP multicast distribution).
[...]
We believe that our work provides valuable insights to service providers in designing the next-generation IPTV architecture. Especially, it highlights that dedicated multicast is useful for few of the extremely popular channels and that P2P can handle a much larger number of channels while imposing very little demand for infrastructure.
This week I saw some news on the internal characteristics of the Netflix Player. Immediately, I thought of the paper from IPTPS as a possible optimization for the Netflix Player.
The Netflix Player is supposed to use a conventional broadband connection, as opposed to the well-provisioned IPTV architectures described in Cha et al. Perhaps the locality-aware P2P content distribution technique is even more interesting in the Netflix Player case.
Nevertheless, the viewing habits and interest sharing among Netflix users may differ dramatically from what is observed in the IPTV environment, which would impact the efficiency of the locality-aware P2P content distribution.
Tuesday, June 03, 2008
Oh My Goosh!
If you like typing away at commands to interact with your computer, you will like this: Goosh :-)
From Slashdot: goosh, the Unofficial Google Shell posted by kdawson on Monday June 02, @07:26PM.
It's essentially a browser-oriented, shell-like interface that allows you to quickly search Google (and images and news) and Wikipedia and get information in a text-only format.
Wednesday, May 28, 2008
"Yes, There Is a Correlation"
This week I came across an interesting paper: "Yes, There is a Correlation - From Social Networks to Personal Behavior on the Web" by Parag Singla (University of Washington) and Matthew Richardson (Microsoft Research) in WWW'2008.
In summary, they show that the similarity between the personal interests and attributes of two users who are MSN contacts is much higher than that of two random users. Moreover, I found the problem formulation elegant and the scale of the data non-trivial to handle (approx. 13 million unique users).
From the paper:
Summarizing the results, we showed that people who talk to each other on the messenger network are more likely to be similar than a random pair of users, where similarity is measured in terms of matching on attributes such as queries issued, query categories, age, zip and gender. Further, this similarity increases with increasing talk time. The similarities tend to decrease with increasing average time spent per message. Also, we showed that even within the same demographics, people who talk to each other are more likely to be similar. Finally, as we hop away in the messenger network, the similarity still exists, though it is reduced.
I wonder whether a similar level of correlation would be observed in online communities with other purposes, such as content-sharing (e.g. Flickr and YouTube).
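If one wanted to probe that question in a content-sharing community, a minimal version of the comparison (my own sketch, not the paper's methodology or data) could contrast the attribute overlap of connected pairs with that of random pairs:

```python
import random

def similarity(attrs_a, attrs_b):
    """Jaccard overlap of attribute sets (e.g., interests, groups, tags)."""
    if not attrs_a or not attrs_b:
        return 0.0
    return len(attrs_a & attrs_b) / len(attrs_a | attrs_b)

def avg_similarity(pairs, attrs):
    return sum(similarity(attrs[a], attrs[b]) for a, b in pairs) / len(pairs)

# toy data: user -> set of interests, plus a contact list
attrs = {"u1": {"photo", "travel"}, "u2": {"photo", "music"},
         "u3": {"cooking"}, "u4": {"photo", "travel", "music"}}
contacts = [("u1", "u2"), ("u1", "u4"), ("u2", "u4")]

random.seed(0)
users = list(attrs)
random_pairs = [tuple(random.sample(users, 2)) for _ in range(100)]

print("contacts:", round(avg_similarity(contacts, attrs), 2))
print("random:  ", round(avg_similarity(random_pairs, attrs), 2))
```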
Friday, May 16, 2008
OurGrid 4.0 released
Good news from the South! OurGrid 4.0 is out.
In summary: OurGrid is an open source, free-to-join, peer-to-peer grid, where users trade computational resources. The loosely coupled computational infrastructure is ideal for the execution of embarrassingly parallel applications.
I am particularly glad about this release, as OurGrid has been a useful tool in my previous studies. I used it to analyze activity traces from content-sharing communities. OurGrid makes it easy to harness the idle time of our desktop machines and to monitor the progress of the computations.
Next week I will definitely give the new version a try (as we are still using version 3.3).
The new site looks great too. Congratulations, OurGrid Community! :-)
Thursday, May 15, 2008
Deep Water(*) at 45rpm
A bit about recent music experiences...
It has been a while since I last listened to the voice of Beth Gibbons. So I recently indulged myself with the addition of a new record to my collection: Portishead's new album, Third. By record, I mean an LP. :-)
The immediate surprise after playing the first song was to hear some voices in Portuguese. For a moment, I thought that the record player switch was set to FM Radio on some independent radio station.
The second big surprise was that the record sounded dark, with very slow beats and bass sounds. After a couple of seconds, I noticed that the album is a 45 rpm LP, as opposed to the usual 33 rpm records (more common today). A quick adjustment of the player's rotation speed and voilà!
Overall, Portishead's new album sounds quite different from their previous work. I feel there is more emphasis on the drums, which is really nice. Beth Gibbons's voice is more discreet than on her solo album or on previous Portishead recordings.
In my opinion, the best song of the album so far is Magic Doors (side 4, as the package comes with 2 LPs). :-)
Unfortunately, I missed their Coachella concert, but I hope to have another opportunity soon...
(*) Deep Water is one of the songs of the album. I think it would work as a nice title for it.
Sunday, April 27, 2008
Non-absolute numbers
As you come back to life (after a series of conference and project deadlines), you are able to recover clever and funny pieces of literature, such as the explanation of what Bistromathics is and its relation to non-absolute numbers.