Friday, October 23, 2009

Spam in social software

From Greg Linden's blog:

An arms race in spamming social software
by Greg Linden (16 Oct 2009)
http://glinden.blogspot.com/2009/10/arms-race-in-spamming-social-software.html

The post presents an ultra-short yet interesting summary of spam strategies observed in social software. In particular, Greg Linden mentions several observed badmouthing strategies, in which the spammer tries to taint a competitor's reputation rather than use spam to obtain benefits directly. Interestingly, Cheng and Friedman [1] conjecture that badmouthing attacks are harder to tackle than attacks that attempt to artificially boost one's own reputation.

The post ends by suggesting that incentive-based approaches to deterring spam are a (possibly good) alternative to techniques that detect spam through content analysis (e.g., detecting commercial intent in blog comments). I would say the two are complementary, though.

[1] A. Cheng and E. Friedman. "Sybilproof Reputation Mechanisms".
http://doi.acm.org/10.1145/1080192.1080202

Wednesday, August 26, 2009

Relationship between cross-field citations and work impact

A recent work by Shi, Adamic, Tseng, and Clarkson presents an interesting analysis of the relationship between works that draw from different areas (i.e., cite papers outside their own fields) and their subsequent impact [1].

One of the interesting bits:

[...]
Intuitively, any individual citation will at most have a very weak impact on the success of a citing paper. It will only be one of possibly dozens of references made in an article or patent. Other factors, such as the publication venue and the reputation of the authors, are more likely to contribute to the impact of the article than any individual citation the authors include. We nevertheless see a significant relationship between the interdisciplinarity of citations and the impact of the publication.
[...]

This reminds me of previous results on the relationship between network constraint and the value of ideas [2]. The intuition is that a person who occupies a bridge position in her social network (i.e., connects two otherwise distinct groups) is exposed to more diverse ways of thinking, which may lead to more valuable ideas. Here, the social network is the citation network, and the bridges are papers that cite otherwise unconnected clusters (i.e., fields).
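To make the bridging intuition concrete, here is a minimal sketch (my own toy example, not taken from either paper) that computes Burt's network constraint on a small citation-style graph using the networkx library; a node that bridges otherwise unconnected clusters gets the lowest constraint score.

# Toy sketch: Burt's network constraint on a small citation-style graph.
# Nodes and edges are made up for illustration; lower constraint indicates
# a node that bridges otherwise unconnected groups (a structural hole).
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a1", "a2"), ("a2", "a3"), ("a1", "a3")])  # tightly knit field A
G.add_edges_from([("b1", "b2"), ("b2", "b3"), ("b1", "b3")])  # tightly knit field B
G.add_edges_from([("bridge", "a1"), ("bridge", "b1")])        # a paper citing into both

constraint = nx.constraint(G)
for node, c in sorted(constraint.items(), key=lambda kv: kv[1]):
    print(f"{node:7s} constraint = {c:.3f}")
# The "bridge" node shows the lowest constraint, matching the intuition
# that brokers spanning structural holes are exposed to more diverse ideas.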

A recipe for higher impact research?

[1] Shi et al. 2009. The Impact of Boundary Spanning Scholarly Publications and Patents. PLoS ONE.
[2] Burt, R., 2004. Structural Holes and Good Ideas. American Journal of Sociology.

Wednesday, July 15, 2009

The Internet and its Topology

A while ago, I came across an article that discusses issues in network modeling and characterization [1], particularly regarding the Internet's physical topology.

The authors' motivation is what the article calls the power-law argument. In particular, they highlight important points researchers should focus on when performing similar studies. They also go further and challenge the now-traditional assumption that the Internet topology resembles a scale-free network. In this post, I briefly summarize the paper.

The paper recounts the now well-known and widely accepted scale-free Internet argument (i.e., that the Internet topology is well modeled by a network with a power-law node degree distribution). The authors' counterargument is rooted in the limitations of traceroute for accurately determining the physical topology: it captures router interfaces rather than physical boxes.
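As a quick, informal illustration of what the scale-free claim refers to (a toy sketch of my own, not taken from the paper), one can compare the degree distribution of a preferential-attachment graph with that of a random graph of similar size and density:

# Toy sketch: contrast the degree distributions of a preferential-attachment
# (Barabasi-Albert) graph and an Erdos-Renyi random graph with the same
# number of nodes and edges. Parameters are arbitrary, chosen for illustration.
import networkx as nx

n = 10000
ba = nx.barabasi_albert_graph(n, 2, seed=42)                 # heavy-tailed degrees
er = nx.gnm_random_graph(n, ba.number_of_edges(), seed=42)   # same edge count

def describe_tail(g, label):
    degrees = [d for _, d in g.degree()]
    print(f"{label}: max degree = {max(degrees)}, "
          f"nodes with degree >= 50: {sum(d >= 50 for d in degrees)}")

describe_tail(ba, "preferential attachment")
describe_tail(er, "random (Erdos-Renyi)")
# The preferential-attachment graph has a few very high-degree hubs, while the
# random graph's degrees concentrate around the mean -- the contrast that
# underlies the scale-free argument.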

In short, the work proposes to focus on the decision-making process an Internet service provider goes through when planning and deploying its physical infrastructure, as opposed to using traceroute traces to inspire models for the Internet topology. Moreover, the authors suggest that the right tool for this is constrained optimization, which allows one to formalize that decision-making process.

Other interesting points extracted from the paper:

a).
A node with high degree implies low-capacity links and, conversely, low degree implies high-capacity links; this is due to a router's limited capacity for processing traffic. Thus, high-degree nodes would sit at the edge of the network rather than in the core, which differs from what most traceroute-based studies claim (a back-of-the-envelope example follows these points).


b).
To avoid confusion, and to emphasize the fact that preferential attachment is just one of many mechanisms capable of generating scale-free graphs, the authors refer to these networks explicitly as the models proposed in [2].
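Here is the back-of-the-envelope illustration of point a) mentioned above (my own numbers, purely hypothetical): if a router's switching fabric bounds the total traffic it can forward, then the more links it terminates, the less bandwidth each link can carry.

# Toy illustration of the degree-vs-link-capacity trade-off from point a).
# The aggregate capacity figure is hypothetical, not taken from the paper.
aggregate_capacity_gbps = 640.0   # assumed total switching capacity of one router

for degree in (4, 16, 64, 256):
    per_link_gbps = aggregate_capacity_gbps / degree
    print(f"degree {degree:4d} -> at most {per_link_gbps:6.1f} Gbps per link")
# High-degree routers can only offer many low-speed ports, so they fit the
# network edge (aggregating slow customer links) rather than the high-speed core.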


Suggested principles for characterization and modeling, which, I think, are general enough to apply to fields other than network topology characterization:

1. Know your data:
The data used by Faloutsos, Faloutsos, and Faloutsos [3] was not intended to capture the physical topology, but rather to "get some experimental data on the shape of multicast trees one can actually obtain in the real Internet".


2. When modeling is more than data fitting:
If we wish to increase our confidence in a proposed model, we ought also to ask what new types of measurements are either already available or could be collected and used for validation. (By new, they mean completely new types of data, not involved in any way with the original modeling exercise.)


3. Know your statistic:
Once one agrees that the data set is problematic, one could try to use a more robust statistic to avoid such mistakes (see the sketch below).
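On that last point, a minimal sketch of my own (on synthetic data, and only one of several possible robust approaches): the maximum-likelihood estimator for a continuous power-law exponent is a better-behaved statistic than eyeballing a slope on a log-log histogram.

# Toy sketch for "know your statistic": estimate a power-law exponent with the
# continuous maximum-likelihood estimator rather than a least-squares fit on a
# log-log histogram. The synthetic sample below is purely illustrative.
import math
import random

random.seed(0)
alpha_true, x_min, n = 2.5, 1.0, 50000

# Draw samples from a continuous power law via inverse-transform sampling.
samples = [x_min * (1.0 - random.random()) ** (-1.0 / (alpha_true - 1.0))
           for _ in range(n)]

# MLE for the exponent: alpha_hat = 1 + n / sum(ln(x_i / x_min)).
alpha_hat = 1.0 + n / sum(math.log(x / x_min) for x in samples)
print(f"true alpha = {alpha_true}, MLE estimate = {alpha_hat:.3f}")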


[1] W. Willinger, D. Alderson and C. Doyle. "Mathematics and the Internet: A source of enormous confusion and great potential", Notices of the AMS, Vol. 56, no. 5, May 2009.

[2] A.-L. Barabási and R. Albert, Emergence of scaling in random networks, Science 286 (1999).

[3] M. Faloutsos, P. Faloutsos, and C. Faloutsos, On power-law relationships of the Internet topology, ACM SIGCOMM (1999).

Tuesday, June 23, 2009

The role of the scientific method

Here is an interesting article by John Polanyi published by The Globe and Mail.

Hope lies in the scientific method by John Polanyi.

The article discusses the role of science and, more importantly in my opinion, the importance of scientists taking a critical view of the impact that political decisions may have on the lives of many.

Food for thought, indeed.

Wednesday, June 03, 2009

Data Reliability Tradeoffs

Abdullah Gharaibeh (my colleague at NetSysLab) has recent work that explores combining heterogeneous storage components in terms of cost, reliability, and throughput.

The work proposes a storage architecture that leverages idle storage resources located on volatile nodes (e.g., desktops) to provide high-throughput access at low cost. Moreover, the system is designed to provide durability with a low-throughput durable storage component (e.g., tape).

What I like about the solution is that it nicely shows how to decouple the components that provide two important features for data-intensive applications: availability and durability. Moreover, this separation (together with the evidence shown by the experiments) helps system administrators reason about deployment cost. (A toy simulation in this spirit follows the abstract below.)

Here is the abstract:

Abdullah Gharaibeh and Matei Ripeanu. Exploring Data Reliability Tradeoffs in Replicated Storage Systems. ACM/IEEE International Symposium on High Performance Distributed Computing (HPDC 2009), Munich, Germany, June 2009.


This paper explores the feasibility of a cost-efficient storage architecture that offers the reliability and access performance characteristics of a high-end system. This architecture exploits two opportunities: First, scavenging idle storage from LAN connected desktops not only offers a low-cost storage space, but also high I/O throughput by aggregating the I/O channels of the participating nodes. Second, the two components of data reliability – durability and availability – can be decoupled to control overall system cost. To capitalize on these opportunities, we integrate two types of components: volatile, scavenged storage and dedicated, yet low-bandwidth durable storage. On the one hand, the durable storage forms a low-cost back-end that enables the system to restore the data the volatile nodes may lose. On the other hand, the volatile nodes provide a high-throughput front-end. While integrating these components has the potential to offer a unique combination of high throughput, low cost, and durability, a number of concerns need to be addressed to architect and correctly provision the system. To this end, we develop analytical- and simulation-based tools to evaluate the impact of system characteristics (e.g., bandwidth limitations on the durable and the volatile nodes) and design choices (e.g., replica placement scheme) on data availability and the associated system costs (e.g., maintenance traffic). Further, we implement and evaluate a prototype of the proposed architecture: namely a GridFTP server that aggregates volatile resources. Our evaluation demonstrates an impressive, up to 800MBps transfer throughput for the new GridFTP service
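To make the decoupling concrete, here is a minimal simulation sketch of my own (a deliberately simplified model, not the paper's analytical or simulation tools): volatile nodes hold replicas that can be lost to churn, the repair bandwidth from the durable back-end is limited, and availability is measured as the fraction of objects with at least one volatile replica online, while durability is guaranteed by the back-end copy.

# Toy churn simulation of a volatile front-end backed by a durable store.
# All parameters are made up for illustration.
import random

random.seed(1)
N_OBJECTS = 1000        # objects stored in the system
REPLICAS = 3            # target volatile replicas per object
P_FAIL = 0.02           # per-step probability that a volatile replica is lost
REPAIR_PER_STEP = 40    # replicas the durable back-end can restore per step
STEPS = 2000

replicas = [REPLICAS] * N_OBJECTS
availability_sum = 0.0

for _ in range(STEPS):
    # Churn: each volatile replica is lost independently with probability P_FAIL.
    for i in range(N_OBJECTS):
        replicas[i] -= sum(random.random() < P_FAIL for _ in range(replicas[i]))
    # Repair: the bandwidth-limited durable back-end restores the neediest objects.
    budget = REPAIR_PER_STEP
    for i in sorted(range(N_OBJECTS), key=lambda j: replicas[j]):
        if budget == 0 or replicas[i] >= REPLICAS:
            break
        restored = min(REPLICAS - replicas[i], budget)
        replicas[i] += restored
        budget -= restored
    availability_sum += sum(r > 0 for r in replicas) / N_OBJECTS

print(f"average availability over {STEPS} steps: {availability_sum / STEPS:.4f}")
# Durability is never at risk here (the back-end always keeps a copy); only
# availability depends on churn and on the repair bandwidth budget.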

Wednesday, April 22, 2009

Individual and Social Behavior in Tagging Systems

As part of a much broader investigation of the peer production of information, Individual and Social Behavior in Tagging Systems [1] is a recent work that focuses on the quantitative aspects of tag reuse, item re-tagging, and the implicit social relations inferred from the similarity of user interests.

The observations point to interesting directions for the design of systems, such as recommender systems, that aim to exploit past user activity. For instance, the work provides quantitative evidence for why item recommendation tends to be less effective than tag recommendation in these systems (based on the relatively higher level of tag reuse compared to item re-tagging).
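As a rough illustration of the two measures (my own toy definitions on made-up data; the paper defines and measures them precisely), tag reuse and item re-tagging can be read off a set of (user, item, tag) assignments:

# Toy sketch of tag reuse vs. item re-tagging on made-up tagging data.
# The paper's exact definitions differ in detail; this is illustrative only.
from collections import Counter

assignments = [  # (user, item, tag) tag assignments
    ("u1", "photo1", "sunset"), ("u2", "photo2", "sunset"),
    ("u3", "photo3", "sunset"), ("u2", "photo1", "beach"),
    ("u3", "photo4", "beach"),  ("u1", "photo5", "holiday"),
]

tag_counts = Counter(tag for _, _, tag in assignments)
item_counts = Counter(item for _, item, _ in assignments)

# Fraction of assignments whose tag (resp. item) had been used before.
tag_reuse = sum(c - 1 for c in tag_counts.values()) / len(assignments)
item_retagging = sum(c - 1 for c in item_counts.values()) / len(assignments)

print(f"tag reuse:       {tag_reuse:.2f}")       # higher: tags are shared across users
print(f"item re-tagging: {item_retagging:.2f}")  # lower: items are rarely tagged twice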

I must mention that this work is the result of a collaboration with an enthusiastic team: Nazareno Andrade, David Condon, Adriana Iamnitchi, and Matei Ripeanu.

Reference:

[1] Elizeu Santos-Neto, David Condon, Nazareno Andrade, Adriana Iamnitchi and Matei Ripeanu. "Individual and Social Behavior in Tagging Systems". In the 20th ACM Conference on Hypertext and Hypermedia. Torino, Italy, June 29-July 1, 2009.

Tuesday, March 10, 2009

Tracing Influenza Epidemics via Crowdsourcing

Researchers from Google (Mountain View, CA, USA) and the Centers for Disease Control and Prevention (Atlanta, GA, USA) recently reported a method to build a model that helps detect the spread of influenza.

The article describes the analysis of search logs in combination with records of doctor visits related to influenza-like illness (ILI). The goal of the model is to predict the percentage of doctor visits that are ILI-related. The authors report a correlation of up to 0.96 between the model's predictions and the Centers for Disease Control (CDC) reports.
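Here is a toy version of this kind of model (my own sketch on synthetic numbers; as I understand it, the paper fits a linear model relating the log-odds of the ILI visit percentage to the log-odds of the ILI-related query fraction):

# Toy sketch of a log-odds linear model in the spirit of the paper, fit on
# synthetic data (real query logs and CDC records are obviously not included).
import math
import random

random.seed(7)

def logit(p):
    return math.log(p / (1.0 - p))

# Synthetic weekly data: ILI-related query fraction q and ILI visit fraction v.
weeks = 100
q = [random.uniform(0.001, 0.02) for _ in range(weeks)]
v = [max(0.001, 3.0 * a + random.gauss(0.0, 0.002)) for a in q]

x = [logit(a) for a in q]   # log-odds of the query fraction
y = [logit(a) for a in v]   # log-odds of the ILI visit fraction

# Ordinary least-squares fit of y = b0 + b1 * x.
n = weeks
mx, my = sum(x) / n, sum(y) / n
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx

# Pearson correlation between the model's predictions and the "observed" series.
pred = [b0 + b1 * a for a in x]
mp = sum(pred) / n
corr = (sum((p - mp) * (b - my) for p, b in zip(pred, y)) /
        math.sqrt(sum((p - mp) ** 2 for p in pred) * sum((b - my) ** 2 for b in y)))
print(f"slope = {b1:.2f}, intercept = {b0:.2f}, correlation = {corr:.2f}")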

Citation:

Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski and Larry Brilliant. Detecting influenza epidemics using search engine query data. Nature 457, 1012-1014 (19 February 2009).

Tuesday, March 03, 2009

One Microsoft Way

Last week, in Redmond, Microsoft Research showcased projects from its labs worldwide during TechFest 2009. As part of my internship last year at the lovely Microsoft Research Cambridge (UK), I worked on the Mobile Content-Casting project, which was also showcased.

Ars Technica has an article summarizing some of the projects presented at TechFest 2009.

It is worth taking a look: Microsoft Research TechFest 2009: a glance at the road ahead.

I found SecondLight and the suite of Social-Digital demos particularly cool. Both are from MS Research in Cambridge, UK.

Friday, February 27, 2009

A nice short story

This is a fun short story published by Nature that I came across recently.

"Lost in sun and silence -- The golden age of communication"
by Vincenzo Palermo. Nature, Vol 457, 26 February 2009.