Monday, June 30, 2008

Facebook's next open source projects: Hive and Cassandra

A couple weeks ago, Jeff Hammerbacher from Facebook presented some details on Cassandra (see later slides), a structured p2p storage system similar to Google's Bigtable or Amazon's Dynamo. What is most interesting about Cassandra is that they seem to be preparing to open source it imminently. Jeff bookmarked two things on delicious last night:
  1. Cassandra: Welcome to your new Wikidot site

  2. Cassandra: A Structured Storage System on a P2P Network in Launchpad

Both sites are empty as of now, but it looks like they're planning on releasing the source some time soon using bzr for version control.

Another interesting Facebook project is Hive, a sort of data warehousing solution built on Hadoop. They've been discussing open sourcing this for several months now, but it looks like things are starting to happen with HADOOP-3601: Hive as contrib project.

On the non-Facebook open source front, we've got some news coming soon as well. We've made the decision to open source several of our internal tools under an MIT license - hold tight for more info.

Monday, June 23, 2008

How to work around an MSIE bug with Scriptaculous

For several months we've noticed a lot of 404s in our access logs that look like this:
x.x.x.x - - [2008-06-23 13:14:30] "GET /static/r/toLRhQ/js/'+libraryName+' HTTP/1.1" 404 549 - "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"
x.x.x.x - - [2008-06-23 13:14:30] "GET /static/r/toLRhQ/js/'+libraryName+' HTTP/1.1" 404 568 - "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"
x.x.x.x - - [2008-06-23 13:17:06] "GET /static/r/toLRhQ/js/'+libraryName+' HTTP/1.1" 404 568 - "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"
x.x.x.x - - [2008-06-23 13:17:06] "GET /static/r/toLRhQ/js/'+libraryName+' HTTP/1.1" 404 549 - "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"
x.x.x.x - - [2008-06-23 13:18:50] "GET /static/r/toLRhQ/js/'+libraryName+' HTTP/1.1" 404 549 - "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"

For a long time we just ignored them, but I finally decided to track the issue down yesterday, and found the following line in scriptaculous.js:

document.write('<script type="text/javascript" src="'+libraryName+'"></script>');

For whatever reason, this particular version of MSIE decides to interpret this script tag in the middle of the javascript and go trying to find a JS named "'+libraryName+'". This is clearly incorrect behavior, and it seems like the bug was fixed in a later version of IE.

Eric came up with the following clever fix:
    document.write('<' + 'script type="text/javascript" src="'+libraryName+'"></script>');

Splitting the '<' and the 'script' tricks the broken version of IE into not seeing the script tag and solves the issue.

Hope this is helpful to others who find these mysterious requests in their logs.

Tuesday, June 17, 2008

Automatically parallelizing list comprehensions in Erlang

This evening I got to chatting with some people on IRC about how it would be neat if Erlang had syntax to automatically parallelize a list comprehension. For those who aren't familiar with Erlang yet, here's a brief introduction:

A list comprehension is a quick way of performing an operation over a list of values. For example, at the erlang prompt:
1> MyList = lists:seq(1,10).
[1,2,3,4,5,6,7,8,9,10]
2> [X * 2 || X <- MyList].
[2,4,6,8,10,12,14,16,18,20]

For those familiar with Python list comprehensions, this is equivalent to [x * 2 for x in range(1,11)]. It's also just like the map operator common in pretty much every functional language. In both Python and Erlang you can get a bit more complicated:
3> [X * 2 || X <- MyList, X rem 2 == 1]. %% Filter out the odd numbers from the list
[2,6,10,14,18]

or even take the cartesian product of several lists in the same list comprehension:
4> [{X, Y, X * Y} || X <- lists:seq(1,3), Y <- lists:seq(1,3)].
[{1,1,1},
{1,2,2},
{1,3,3},
{2,1,2},
{2,2,4},
{2,3,6},
{3,1,3},
{3,2,6},
{3,3,9}]

Now, back to the interesting part. Everyone talks about how Erlang is so easy to make concurrent. But if we add the self() function, which returns the current running process id, to the output of the list comprehension we see that it is not parallelized:
5> [{self(), X, Y, X * Y} || X <- lists:seq(1,3), Y <- lists:seq(1,3)].
[{<0.31.0>,1,1,1},
{<0.31.0>,1,2,2},
{<0.31.0>,1,3,3},
{<0.31.0>,2,1,2},
{<0.31.0>,2,2,4},
{<0.31.0>,2,3,6},
{<0.31.0>,3,1,3},
{<0.31.0>,3,2,6},
{<0.31.0>,3,3,9}]

All of the results were computed in the same process.

Handily, Erlang provides a way of inserting user code in between the parser and the compiler in the form of parse transforms. A parse transform is specified in the compile options for a module and can make arbitrary modifications to the abstract syntax tree of an Erlang program before the compiler gets to it, with the important caveat: "programmers are strongly advised not to engage in parse transformations and no support is offered for problems encountered." Everyone knows caveats are lame, so I went ahead and built a parse transform. Here's example usage:
-module(test).
-compile({parse_transform, plc}).
-export([test/0]).

test() ->
Result = plc:lc([{self(), A, B, C} || A <- lists:seq(1,2),
B <- lists:seq(3,4),
C <- lists:seq(5,6),
A * B < 5
]),
io:format("result: ~p~n", [Result]).

Output:

result: [{<0.3145.0>,1,3,5},
{<0.3146.0>,1,4,5},
{<0.3147.0>,1,3,6},
{<0.3148.0>,1,4,6}]

We can see that the list comprehension has been performed in parallel this time - each result element has a different value of self().

Want to check it out? Clone the git repo on github.

This probably doesn't work that well. Don't use it on production software. Unless you want to test it thoroughly first.