Wednesday, August 13, 2008

The Anticommons

The New Yorker's Financial Page has an interesting article: The Permission Problem, by James Surowiecki (the author of The Wisdom of Crowds).

James Surowiecki discusses the notion of the anticommons, as presented by Professor Michael Heller in The Gridlock Economy.

To illustrate the point, Surowiecki describes two extreme models of the resource sharing problem: i) the common good model, in which the resource is deemed public and shared among individuals, with no notion of individual ownership over it; and ii) the private property model, with unlimited property rights, in which the resource is owned by a subset of individuals, who may charge other individuals that want to consume units of it.

The article says that, on the one hand, common goods may lead to the well-known tragedy of the commons: overuse. On the other hand, unlimited property rights may lead to exactly the opposite: underuse and waste of resources (the tragedy of the anticommons).

The article has nice examples:

The commons leads to overuse and destruction; the anticommons leads to underuse and waste. In the cultural sphere, ever tighter restrictions on copyright and fair use limit artists’ abilities to sample and build on older works of art. In biotechnology, the explosion of patenting over the past twenty-five years—particularly efforts to patent things like gene fragments—may be retarding drug development, by making it hard to create a new drug without licensing myriad previous patents. Even divided land ownership can have unforeseen consequences. Wind power, for instance, could reliably supply up to twenty per cent of America’s energy needs—but only if new transmission lines were built, allowing the efficient movement of power from the places where it’s generated to the places where it’s consumed. Don’t count on that happening anytime soon. Most of the land that the grid would pass through is owned by individuals, and nobody wants power lines running through his back yard.

It seems to me that certain computational environments present an interesting middle ground between the two extremes discussed above. For example, Nazareno pointed me a while ago to a large-scale commons-based Wi-Fi model: users who own an Internet connection may share their spare capacity, in exchange for either using the spare capacity of others later or being paid for it. The key insight of this resource sharing model is that people buy more Internet bandwidth (as with many other computational goods, if I may call them that) than they are able to use. Hence, resources are mostly underutilized. So, why not share the spare capacity in exchange for access to others' spare capacity in the future?
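The accounting behind such a reciprocity scheme can be sketched with a toy credit ledger. This is a minimal illustration of the idea, not any real system's design; the one-credit-per-megabyte rule and all names here are my own assumptions.

```python
class BandwidthLedger:
    """Toy credit ledger for a commons-based Wi-Fi sharing scheme.

    Hypothetical rule: donating 1 MB of spare capacity earns 1 credit,
    and consuming capacity shared by others spends credits.
    """

    def __init__(self):
        self.credits = {}

    def donate(self, user, megabytes):
        # Sharing spare capacity now earns credits for later use.
        self.credits[user] = self.credits.get(user, 0) + megabytes

    def consume(self, user, megabytes):
        # Consumption is only allowed against previously earned credits.
        balance = self.credits.get(user, 0)
        if balance < megabytes:
            raise ValueError("not enough credits")
        self.credits[user] = balance - megabytes


ledger = BandwidthLedger()
ledger.donate("alice", 100)   # Alice shares 100 MB of idle capacity...
ledger.consume("alice", 30)   # ...and later uses 30 MB through a peer
print(ledger.credits["alice"])  # 70
```

A paid variant would simply let users cash credits out instead of spending them on others' capacity.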

Finally, a question comes to my mind: besides the fact that certain resource units bear an extra capacity by definition (e.g. often my CPU is 99% idle), does any other intrinsic resource characteristic play a role in suggesting which model is suitable for the sharing of that resource?

Sunday, July 27, 2008

Information Management in Living Organisms

Nature has an article by Paul Nurse (Life, Logic and Information), where he discusses some ideas on studying living organisms as information management systems.

Paul Nurse suggests that analyzing the information flow in living organisms would help explain certain behaviors that are not yet well understood.

From a computer systems researcher's standpoint, the interesting aspect of Paul Nurse's idea is that it goes in the opposite direction of previous studies: instead of drawing inspiration from nature to build information management systems (e.g. ant-inspired algorithms), the author proposes to use the tools of information science to study nature.

A piece from the article:

Systems analyses of living organisms have used a variety of biochemical and genetic interaction traps with the emphasis on identifying the components and describing how these interact with each other. These approaches are essential but need to be supplemented by more investigation into how living systems gather, process, store and use information, as was emphasized at the birth of molecular biology.

This sounds exciting, as a better understanding of living systems could feedback into the bio-inspired approach of designing distributed computational systems.

A couple of years ago, I briefly explored the design of a distributed storage system based on the behavior of Messor barbarus ants (for more details on M. barbarus, see Anderson, C., J.J. Boomsma, and J.J. Bartholdi, III. 2002. Task partitioning in insect societies: bucket brigades. Insectes Sociaux 49(2): 171-180).

The rationale behind it is quite simple: every time an unloaded larger ant encounters a loaded smaller ant, the load is passed from the smaller to the larger ant. This naturally spreads the work among the workers according to their capacity (strength and speed).

Bringing it back to the context of distributed storage systems, the idea is to enable a self-organizing load balancing scheme by having larger nodes request more load from lower-capacity nodes. The goal is to improve throughput and data availability.
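The bucket-brigade rule above can be sketched in a few lines. This is my own toy simulation of the principle, not the storage system design itself: whenever two nodes meet, the higher-capacity one takes as much load as it has room for from the lower-capacity one.

```python
import random

def rebalance(nodes, rounds=100, seed=42):
    """Bucket-brigade-style rebalancing sketch.

    `nodes` maps a node id to a dict with 'capacity' and 'load'.
    Each round, two random nodes meet; the higher-capacity one takes
    load from the other, mimicking larger ants taking over loads.
    """
    rng = random.Random(seed)
    ids = list(nodes)
    for _ in range(rounds):
        a, b = rng.sample(ids, 2)
        # Orient the pair so `big` is the higher-capacity node.
        big, small = (a, b) if nodes[a]["capacity"] >= nodes[b]["capacity"] else (b, a)
        room = nodes[big]["capacity"] - nodes[big]["load"]
        transfer = min(room, nodes[small]["load"])
        nodes[small]["load"] -= transfer
        nodes[big]["load"] += transfer
    return nodes

cluster = {
    "n1": {"capacity": 100, "load": 10},  # large, fast node
    "n2": {"capacity": 10, "load": 10},   # small node, fully loaded
}
rebalance(cluster, rounds=10)
print(cluster["n1"]["load"], cluster["n2"]["load"])  # 20 0
```

With only pairwise, local interactions, the small node is relieved and the work migrates toward capacity, which is exactly the self-organizing behavior the ants exhibit.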

Obviously, a comprehensive performance evaluation is necessary to claim that this strategy would lead to a globally efficient system.

Thursday, July 03, 2008

HPDC'08 - Part II

Gilles Fedak delivered an invited talk at the UPGRADE-CN workshop (part of HPDC'08). He presented BitDew, a programmable environment that targets data management in Desktop Grids [1].

The rationale behind BitDew is that applications can define routines for data manipulation. These routines are expressed via predefined metadata, which the infrastructure mechanisms use to perform data management tasks, such as replication.

In a technical report, Gilles and colleagues present use cases and a performance evaluation of the mechanisms that provide data management functionality in BitDew.

In particular, I found the approach of leveraging metadata interesting. The predefined set of metadata allows the application layer to communicate requirements to the infrastructure regarding, for example, the desired level of fault tolerance and the transfer protocols to use.
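To make the idea concrete, here is a minimal sketch of metadata-driven placement. The attribute names (`replica`, `fault_tolerance`, `protocol`) are loosely inspired by this style of system but are my own invented labels, not BitDew's actual API, and the scheduler is deliberately trivial.

```python
# Hypothetical per-item metadata: the application states *what* it needs,
# and the infrastructure decides *how* to satisfy it.
data_item = {
    "name": "results.dat",
    "replica": 3,                    # desired number of copies
    "fault_tolerance": "resilient",  # survive host departures
    "protocol": "bittorrent",        # preferred transfer protocol
}

def schedule(item, hosts):
    """Toy scheduler: place `item['replica']` copies on the first
    available hosts. A real system would also honor the fault
    tolerance and protocol attributes."""
    return hosts[: item["replica"]]

placements = schedule(data_item, ["h1", "h2", "h3", "h4"])
print(placements)  # ['h1', 'h2', 'h3']
```

The appeal is the separation of concerns: the application never names hosts or triggers transfers; it only declares requirements through metadata.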

In fact, this intersects with one of our projects at the NetSys Lab, where we investigate the use of custom metadata as a cross-layer communication method for storage systems [2].

As we use the file system interface to separate the application and storage layers, the two approaches (BitDew and our custom metadata approach) seem complementary. The metadata passed by the applications via BitDew could propagate to the file system, where it would interact with the Extended Attributes interface (which our solution exploits).
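The Extended Attributes side of this pipeline can be sketched directly: an application tags a file with a hint, and the storage layer reads it back. The `user.replication` key is a hypothetical hint name I made up for illustration; `os.setxattr` is Linux-only and needs a filesystem with user xattr support, so the sketch falls back gracefully where that is missing.

```python
import os
import tempfile

# Create a scratch file in the current directory to tag with a hint.
fd, path = tempfile.mkstemp(dir=".")
os.close(fd)
try:
    # Application layer: attach a (hypothetical) replication hint.
    os.setxattr(path, b"user.replication", b"3")
    # Storage layer: read the hint back and act on it.
    hint = os.getxattr(path, b"user.replication").decode()
except (OSError, AttributeError):
    # Non-Linux platform or filesystem without user xattr support:
    # fall back so the demo still shows the intended value.
    hint = "3"
finally:
    os.remove(path)
print(hint)  # 3
```

The nice property is that the hint travels with the file through the ordinary file system interface, so no side channel between layers is needed.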

More coding fun to come...

[1] Fedak et al. "BitDew: A Programmable Environment for Large-Scale Data Management and Distribution". Technical Report 6427, INRIA.

[2] Elizeu Santos-Neto, Samer Al-Kiswany, Nazareno Andrade, Sathish Gopalakrishnan and Matei Ripeanu. "Enabling Cross-Layer Optimizations in Storage Systems with Custom Metadata". In HPDC'08 - HotTopics. Boston, MA, USA. September, 2008.

Monday, June 30, 2008

HPDC'2008 - Part I

Last week I participated in two great conferences: the International Symposium on High Performance Distributed Computing (HPDC) and USENIX. Both events took place in Boston, MA, USA.

There are a lot of interesting things to mention. Thus, to avoid a single long post, I will describe a few presentations that I attended (and discuss some ideas) in a series of short posts.

In the first two days at HPDC, there were two interesting workshops: UPGRADE-CN (P2P and Grids for the Development of Content Networks) and MMS (Managed Multicore Systems).

Both workshops featured work related to the research projects I am currently working on.

Molina [1] presented his work on designing two protocols that enable collaborative content delivery in mobile transient networks. By transient networks, the authors mean networks composed of devices that are geographically co-located for a short period of time, such as at a music festival.

The authors suggest exploiting the multiple network interfaces currently available in most mobile devices and enabling collaborative use of these multihomed devices.

The idea is quite interesting. In particular, it raises some issues from the perspective of distributed resource sharing. It would be good to understand whether incentive mechanisms are necessary in transient networks to encourage users to share their connections with a community for collaborative downloading/streaming of content.

On top of that, a nice follow-up work would be to investigate the feasibility of collaborative data dissemination protocols, which are widely used in the Internet (e.g. BitTorrent), in the transient networks scenarios.

[1] Molina et al. "A Social Framework for Content Distribution in Mobile Transient Networks". In UPGRADE-CN'2008.

Thursday, June 05, 2008

IPTV Viewing Habits and Netflix Player

IPTPS 2008 has an interesting paper on exploiting TV viewing habits to reduce the traffic that IPTV consumers generate on the ISP backbone.

"On Next-Generation Telco-Managed P2P TV Architectures" by Meeyoung Cha (KAIST), Pablo Rodriguez (Telefonica Research, Barcelona), Sue Moon (KAIST), and Jon Crowcroft (University of Cambridge).

In this paper the authors analyze the use of P2P content distribution techniques to reduce network overhead in an Internet Service Provider's IPTV infrastructure. Exploiting the patterns of channel holding time, channel popularity, and the correlation between the time of day and the number of viewers, the authors propose a locality-aware P2P content distribution scheme that reduces the traffic on the ISP backbone.

From the paper:

we ascertain the sweet spots and the overheads of server-based unicast, multicast, and serverless P2P and also show the empirical lower bound network cost of P2P (where cost reduction is up to 83% compared to current IP multicast distribution)


We believe that our work provides valuable insights to service providers in designing the next-generation IPTV architecture. Especially, it highlights that dedicated multicast is useful for few of the extremely popular channels and that P2P can handle a much larger number of channels while imposing very little demand for infrastructure.
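The flavor of locality-aware peer selection can be sketched with a toy policy: prefer peers that watch the same channel inside the same ISP region, so chunks cross the backbone as rarely as possible. This is my own simplified illustration, not the paper's actual algorithm, and all names are invented.

```python
def pick_peers(requester, peers, k=2):
    """Toy locality-aware selection: same-region, same-channel peers
    first; fall back to remote peers only to fill the remaining slots."""
    local = [p for p in peers
             if p["region"] == requester["region"]
             and p["channel"] == requester["channel"]]
    remote = [p for p in peers if p not in local]
    return [p["id"] for p in (local + remote)[:k]]

viewer = {"id": "v0", "region": "west", "channel": 7}
peers = [
    {"id": "p1", "region": "east", "channel": 7},  # remote viewer
    {"id": "p2", "region": "west", "channel": 7},  # local, same channel
    {"id": "p3", "region": "west", "channel": 9},  # local, other channel
]
print(pick_peers(viewer, peers))  # ['p2', 'p1']
```

Even this crude preference order captures the intuition: backbone traffic is only paid for when no local peer holds the content.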

This week I saw some news on the internal characteristics of the Netflix Player. Immediately, I thought of the IPTPS paper as a possible optimization for the Netflix Player.

The Netflix Player is supposed to use a conventional broadband connection, as opposed to the well-provisioned IPTV architectures described by Cha et al. Perhaps the locality-aware P2P content distribution technique is even more interesting in the Netflix Player case.

Nevertheless, the viewing habits and interest sharing among Netflix users may differ dramatically from what is observed in the IPTV environment, which would impact the efficiency of the locality-aware P2P content distribution.

Tuesday, June 03, 2008

Oh My Goosh!

If you like typing your commands away to interact with your computer, you will like this: Goosh :-)

From Slashdot: goosh, the Unofficial Google Shell posted by kdawson on Monday June 02, @07:26PM.

It's essentially a browser-oriented, shell-like interface that allows you to quickly search Google (and images and news) and Wikipedia and get information in a text-only format.

Wednesday, May 28, 2008

"Yes, There Is a Correlation"

This week I came across an interesting paper: "Yes, There is a Correlation - From Social Networks to Personal Behavior on the Web" by Parag Singla (University of Washington) and Matthew Richardson (Microsoft Research) in WWW'2008.

In summary, they show that the similarity between the personal interests and attributes of two users who are MSN contacts is much higher than that of two random users. Moreover, I found the problem formulation elegant and the scale of the data non-trivial to handle (approx. 13 million unique users).

From the paper:

Summarizing the results, we showed that people who talk to each other on the messenger network are more likely to be similar than a random pair of users, where similarity is measured in terms of matching on attributes such as queries issued, query categories, age, zip and gender. Further, this similarity increases with increasing talk time. The similarities tend to decrease with increasing average time spent per message. Also, we showed that even within the same demographics, people who talk to each other are more likely to be similar. Finally, as we hop away in the messenger network, the similarity still exists, though it is reduced.

I wonder whether a similar level of correlation would be observed in online communities with other purposes, such as content-sharing (e.g. Flickr and YouTube).
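The core measurement behind this kind of study can be sketched with a toy similarity function over users' query sets. The data and user names below are invented; the paper's actual attributes and methodology are richer, so this only shows the shape of the comparison.

```python
def jaccard(a, b):
    """Set overlap: |A ∩ B| / |A ∪ B|, a common similarity measure."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy query logs (invented data, just to illustrate the measurement).
queries = {
    "u1": {"python", "hiking", "coffee"},
    "u2": {"python", "coffee", "jazz"},  # u1's messenger contact
    "u3": {"gardening", "opera"},        # a random user
}

contact_sim = jaccard(queries["u1"], queries["u2"])  # 0.5
random_sim = jaccard(queries["u1"], queries["u3"])   # 0.0
print(contact_sim > random_sim)  # True
```

Repeating this comparison over millions of contact pairs versus random pairs, and over attributes such as age, zip, and gender, is what yields the "yes, there is a correlation" result.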

Friday, May 16, 2008

OurGrid 4.0 released

Good news from the South! OurGrid 4.0 is out.

In summary: OurGrid is an open source, free-to-join, peer-to-peer grid, where users trade computational resources. The loosely coupled computational infrastructure is ideal for the execution of embarrassingly parallel applications.

I am particularly glad about this release, as OurGrid has been a useful tool in my previous studies: I used it to analyze traces of activity in content sharing communities. OurGrid makes it easy to harness the idle time of our desktop machines and to monitor the progress of the computations.

Next week I will definitely give the new version a try (as we are still using version 3.3).

The new site looks great too. Congratulations, OurGrid Community! :-)

Thursday, May 15, 2008

Deep Water(*) at 45rpm

A bit about recent music experiences...

It has been a while since I last listened to the voice of Beth Gibbons. So, I recently indulged myself with the addition of a new record to my collection: Portishead's new album, Third. By record, I mean an LP. :-)

The immediate surprise after playing the first song was to hear some voices in Portuguese. For a moment, I thought that the record player switch was set to FM Radio on some independent radio station.

The second big surprise was that the record sounded dark, with very slow beats and bass sounds. After a couple of seconds, I noticed that the album is a 45 RPM LP, as opposed to the usual 33 RPM records (more popular today). A quick adjustment of the player's rotation speed and voilà!

Overall, Portishead's new album sounds quite different from their previous works. I feel that there is more emphasis on the drums, which is really nice. Beth Gibbons's voice is more discreet than in her solo album or in the previous Portishead recordings.

In my opinion, the best song on the album so far is Magic Doors (side 4, as the package comes with 2 LPs). :-)

Unfortunately, I missed their concert at Coachella, but I hope to have another opportunity soon...

(*) Deep Water is one of the songs of the album. I think it would work as a nice title for it.

Sunday, April 27, 2008

Non-absolute numbers

As you come back to life (after a series of conference and project deadlines), you are able to recover clever and funny pieces of literature, such as the explanation of what Bistromathics is and its relation to non-absolute numbers.

Wednesday, March 05, 2008

Making faces in virtual worlds

Today, a friend sent me a link to this device named: Emotiv.

A rough description of it would be: a helmet that monitors your brain activity and converts it into signals that can be used in a variety of ways.

Although games might be the first thought, the Expressive application particularly reminded me of a previous post by Ian Foster on a Second Life hack that allows the manipulation of objects in virtual worlds as a response to some physical process.

Thursday, February 21, 2008

Hadoop - now in larger scales!

Yahoo! just reported their new deployment of a Hadoop-based application. The achievement is considered to be the world's largest Hadoop deployment in a production environment.

The scale of the application is quite impressive. They used Hadoop to process the Webmap, as part of their search engine architecture. From their post (check their website for a video with some discussion about Hadoop in this context):

* Number of links between pages in the index: roughly 1 trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes

I can even see the difference in the quality of the search results now. :-)

Update: Greg Linden also posted about the new Hadoop-cluster. It's nice that he puts the numbers above in perspective, by comparing to Google's infrastructure.

Tuesday, February 12, 2008

Interesting Articles: IPTPS 2008

For those interested in the convergence of Online Social Networks and Peer-to-Peer Systems, it is worth taking a look at some articles in the program of the International Workshop on Peer-to-Peer Systems (IPTPS).

Tuesday, January 29, 2008

Taking photography to new heights

In 1906, some panoramic pictures of San Francisco after the big earthquake were taken. These were not ordinary pictures. George Lawrence used kites to place a camera at the right spot to record the extent of the damage caused by the earthquake. Besides the historic value of the pictures, they are the outcome of a quite interesting engineering project.

Two years ago, Lawrence's project was revisited. Although they did not use kites this time, the picture is still impressive. They also have an interactive version that allows you to zoom in and see more details of the landscape.

Saturday, January 19, 2008

Scientific Data For All!

The Wired Blog is running a brief article about yet another Google initiative. The idea is to provide storage and, as far as I understood, free access to scientific data sets.

One interesting point of the article is the following:

(Google people) are providing a 3TB drive array (Linux RAID5). The array is provided in a "suitcase" and shipped to anyone who wants to send their data to Google. Anyone interested gives Google the file tree, and they SLURP the data off the drive. I believe they can extend this to a larger array (my memory says 20TB).

It sounds exciting that in the near future we might have access to a long list of data sets, perhaps under a standard API. If you like buzzwords, this might be named (if it is not already) Science in the Cloud. Whatever the name, I think this initiative can bring a long list of advantages for the scientific community.

Finally, I wonder when the RFC for the "suitcase-based transport protocol" will be available (similar to RFC 1149). :-)