Thursday, May 29, 2008

Confidence intervals for Jaccard Similarity?

Hoping that someone is googling for the right terms here:

Anyone out there know how to calculate a confidence interval around an estimate of the Jaccard similarity coefficient?

For Pearson correlation you can use Fisher's z transformation, but I can't quite figure out a principled way of doing the same for Jaccard similarity.
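For reference, here is a minimal sketch of the Pearson case, so it's clear what kind of interval I'm after. It assumes the usual normal approximation (standard error 1/sqrt(N - 3) on the z scale) and a fixed 95% level; the module and function names are made up for illustration, and nothing here applies to Jaccard as-is.

```erlang
-module(fisher_ci).
-export([interval/2]).

%% Approximate 95% confidence interval for a Pearson correlation R
%% estimated from N paired observations, using Fisher's z transformation.
%% Assumption: the normal approximation on the z scale is acceptable.
interval(R, N) when is_number(R), R > -1, R < 1, is_integer(N), N > 3 ->
    Z = math:atanh(R),              % move to the z scale
    SE = 1 / math:sqrt(N - 3),      % approximate standard error of z
    Crit = 1.96,                    % two-sided 95% normal quantile
    {math:tanh(Z - Crit * SE), math:tanh(Z + Crit * SE)}.
```

Calling fisher_ci:interval(0.5, 103) returns a {Lower, Upper} tuple that brackets 0.5. The interval is symmetric on the z scale but not on the correlation scale.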

Friday, May 16, 2008

FB Engineering blog post on Facebook Chat

Eugene at Facebook posted an interesting article about the technology behind the new Facebook Chat. This new service has large parts written in Erlang and communicates with the rest of the system using the Thrift bindings Amie Street and Facebook have been collaborating on for the last couple of months.

The good news for us: our thrift bindings are pretty much guaranteed to be stable and leak/bug free now that they're used for millions of messages/second over at FB.

If you're interested, check them out over at the Thrift git repository.

Tuesday, May 13, 2008

Forcing a process to garbage collect in Erlang

We upgraded our dynamic pricing service tonight with a new version of thrift, so I was checking top to make sure everything was cool a few hours later. I noticed that one of the pricers was using 1.1G of RAM - significantly more than I'd ever seen it using before. Figuring it was a memory leak, I started a console node and connected it to the erlang cluster:

amiest@app2:~$ erl -name console
Erlang (BEAM) emulator version 5.5.2 [source] [64-bit] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.5.2 (abort with ^G)
(console@app2.prod.amiestreet.com)1> P = pricer@app2.prod.amiestreet.com.
'pricer@app2.prod.amiestreet.com'
(console@app2.prod.amiestreet.com)2> net_adm:ping(P).
pong

First step of diagnostics was to fire up etop with etop:start([{node, P}]). This showed that the process count was remaining stable at an appropriate number -- we'd had a bug with a process leak once before but it didn't seem to be the cause this time. The amount of RAM used by processes seemed pretty high though. Next step:

(console@app2.prod.amiestreet.com)4> rpc:call(P, shell_default, i, []).

This is equivalent to running the i() command on the remote node, and shows all the running processes along with some info.

This printed out something useful: the rex process was using almost 900MB of RAM for no apparent reason. I'd never heard of rex, but it is the registered name of the rpc server in the kernel application, the process that serves rpc:call requests from other nodes. Checking on some other Erlang nodes, I saw that rex usually only used a few hundred KB.

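After much googling, I came across this article, which gives my only plausible explanation for how rex got so big: Erlang garbage collection is per-process and is only triggered while that process is running, so a process that allocates a lot and then goes idle can hold on to its garbage indefinitely.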

The solution? rpc:call(P, erlang, garbage_collect, [pid(5038,10,0)]). (5038.10.0 was the pid shown by i()). This kicked the memory usage back down where it should be.
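The same steps can be wrapped up so there's no pid to copy by hand. This is only a sketch: the module name remote_gc and the function collect/2 are hypothetical, and it assumes the target process is registered under a name (as rex is). It uses only rpc:call/4 with the erlang:whereis/1, erlang:process_info/2 and erlang:garbage_collect/1 BIFs, so the module only needs to be loaded on the console node.

```erlang
-module(remote_gc).
-export([collect/2]).

%% Force a garbage collection of the process registered as Name on Node.
%% Returns {ok, BytesBefore, BytesAfter}, or {error, Reason} if the name
%% is not registered there or the node cannot be reached.
%% Sketch only: if the process exits between calls, this crashes with a
%% badmatch rather than returning an error tuple.
collect(Node, Name) ->
    case rpc:call(Node, erlang, whereis, [Name]) of
        Pid when is_pid(Pid) ->
            {memory, Before} = rpc:call(Node, erlang, process_info, [Pid, memory]),
            true = rpc:call(Node, erlang, garbage_collect, [Pid]),
            {memory, After} = rpc:call(Node, erlang, process_info, [Pid, memory]),
            {ok, Before, After};
        Other ->
            %% undefined (no such name) or {badrpc, Reason}
            {error, Other}
    end.
```

From the console node above, this would be called as remote_gc:collect(P, rex). Comparing the two byte counts shows whether the collection actually freed anything.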

Friday, May 02, 2008

io_lib_pretty - a nice secret module

There are some modules in the Erlang stdlib that aren't exactly advertised, but are quite useful. My newest discovery is io_lib_pretty. It hasn't got a manpage, but there are some docs in the source comments if you run less `locate io_lib_pretty.erl`.

io_lib_pretty is the module used by the shell to print records in a nicely formatted way. Plain io:format can't do this, since it has no record definitions to work from at runtime, but record-aware printing can make program output a lot nicer.

Take for example a logging program that deals with records that look like this:

5> L = #logMessage{actor=23507, server_ip = <<123,234,123,234>>}.
#logMessage{actor = 23507,
            server_ip = <<"{\352{\352">>,
            timestamp = undefined,
            level = undefined,
            log_filename = undefined,
            message = undefined}

If you just try to print it out, you get:


7> io:format("Logged: ~p", [L]).
Logged: {logMessage,23507,<<"{\352{\352">>,undefined,undefined,undefined,undefined}ok

Pretty useless output.

Using io_lib_pretty you can get:

9> io:format("~s", [io_lib_pretty:print(L, fun(logMessage, 6) -> [actor, server_ip, timestamp, level, log_filename, message] end)]).
#logMessage{actor = 23507,
            server_ip = <<"{\352{\352">>,
            timestamp = undefined,
            level = undefined,
            log_filename = undefined,
            message = undefined}ok

Just like the shell. I listed the record information manually in the function above, but you can easily use the record_info macro to accomplish the same without code duplication. Or even easier, use the exprecs parse transform (pretty printing example available there).
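Here is a rough sketch of the record_info variant. The module name pretty_sketch and its functions are made up for illustration, and the record definition is repeated here only so the module compiles on its own (in real code it would come from a shared .hrl). Since record_info is expanded at compile time, it has to be called with a literal record name inside a module that can see the definition.

```erlang
-module(pretty_sketch).
-export([format/1, demo/0]).

%% Assumed definition, mirroring the fields shown in the shell output.
-record(logMessage, {actor, server_ip, timestamp,
                     level, log_filename, message}).

%% Record definition callback for io_lib_pretty:print/2: given a record
%% tag and a field count, return the field names, or 'no' if unknown.
fields(logMessage, NoFields) ->
    Fields = record_info(fields, logMessage),
    case length(Fields) of
        NoFields -> Fields;
        _ -> no
    end;
fields(_Tag, _NoFields) ->
    no.

%% Pretty-print any term, rendering logMessage records with field names.
format(Term) ->
    io_lib_pretty:print(Term, fun fields/2).

demo() ->
    L = #logMessage{actor = 23507, server_ip = <<123,234,123,234>>},
    %% Pass the text as an argument, not as the format string, so a
    %% stray ~ in the data cannot be misread as a format directive.
    io:format("Logged: ~s~n", [format(L)]).
```

Tuples that don't match a known record fall through to the 'no' clause and are printed as ordinary tuples.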

Next time: how to load record definitions dynamically at runtime.