Oct 25 2012
 

The other day I got a low disk space warning because my Thunderbird Inbox had grown to over 100 GB. It turned out my Inbox, Trash, and Sent mailbox folders were all impacted by some bug in which Thunderbird would keep fetching the same messages again and again from the server (IMAP) and appending them to the mailbox file. Compacting the mailbox would recover the disk space, but the mailboxes would start growing again shortly thereafter.

The magic incantation to resolve the problem was some quick succession of compacting the mailbox (right-click -> Compact) and repairing it (right-click -> Properties... -> Repair Folder). I did have the "Compact all folders when it will save over 1 MB in total" option enabled (Preferences -> Advanced -> Network & Disk Space), but it wasn’t kicking in.

Oct 11 2012
 

Google Music Manager uploads are based on looking for music files in a particular directory. This isn’t helpful if you have a large directory structure of music and want to upload a subset of it. In my case, I want to use Banshee’s smart playlist feature to select songs to upload. Fortunately Banshee has a .m3u playlist export, but this is only half the battle. The other half is to use symlinks to fool Google Music Manager into thinking the songs in the playlist are in its directory.

The following shell command does the trick. It takes input lines in the .m3u of the form <path>/<artist>/<album>/<song> (e.g. ../../../mnt/onion/media/Music/Banshee/Wir Sind Helden/Soundso/01. (Ode) An Die Arbeit.mp3) and makes symlinks of the form <artist>_<album>_<song>.

cat ~/my_playlist.m3u | ruby -ne 'IO.popen(["ln", "-s", "#{$&}", "./#{$2[0..50]}_#{$3[0..50]}_#{$4[0..50]}.#{$5}"]) if $_.strip =~ /^([^#].*)\/([^\/]*)\/([^\/]*)\/([^\/]*)\.([^.\/]*)$/'

For each line ($_) that matches the pattern (not starting with #, having at least the expected number of slashes), execute: ln -s <input line ($&)> <composed filename> . The [0..50] ranges keep filename length manageable.
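If the one-liner is hard to read, the same logic can be spelled out as a small Ruby script. This is only a sketch: the names M3U_LINE and link_for are made up for illustration, and the commented-out usage relies on File.symlink rather than shelling out to ln.

```ruby
# Same pattern as the one-liner: a non-comment line with at least three
# slashes and a file extension.
# Captures: path prefix, artist, album, song, extension.
M3U_LINE = %r{\A([^#].*)/([^/]*)/([^/]*)/([^/]*)\.([^./]*)\z}

# Hypothetical helper: returns [target, link_name] for a playlist line, or nil
# if the line is a comment (e.g. #EXTM3U) or lacks the expected shape.
def link_for(line, max_len = 51)
  m = M3U_LINE.match(line.strip)
  return nil unless m

  # Truncate artist, album and song, like the [0..50] ranges above.
  name = [m[2], m[3], m[4]].map { |part| part[0, max_len] }.join("_")
  [m[0], "#{name}.#{m[5]}"]
end

# Usage sketch: create the symlinks in the current directory.
# File.foreach(File.expand_path("~/my_playlist.m3u")) do |line|
#   target, name = link_for(line)
#   File.symlink(target, name) if target && !File.exist?(name)
# end
```

Keeping the parsing separate from the symlink creation makes it easy to print the planned link names first and check them before touching the filesystem.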

Aug 17 2012
 

Solr has a handy ability to ingest local CSV files. The neatest aspect is that you can populate multi-valued fields by sub-parsing an individual field. E.g. the following will ingest /tmp/input.csv and split SomeField into multiple values on semicolon delimiters:

curl http://localhost:80/solr/my_core/update\?stream.file\=/tmp/input.csv\&stream.contentType\=text/csv\;charset\=utf-8\&commit\=true\&f.SomeField.split\=true\&f.SomeField.separator\=%3B
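The shell escaping makes the parameters hard to pick out, so here is a sketch that builds the same kind of URL in Ruby. The parameter names are the ones from the curl command; the method name solr_csv_update_url and its arguments are hypothetical.

```ruby
require "uri"

# Hypothetical helper: builds the update URL for ingesting a local CSV file
# and splitting one field into multiple values on the given separator.
def solr_csv_update_url(core_url, csv_path, split_field, separator)
  params = {
    "stream.file" => csv_path,
    "stream.contentType" => "text/csv;charset=utf-8",
    "commit" => "true",
    "f.#{split_field}.split" => "true",
    "f.#{split_field}.separator" => separator,
  }
  # encode_www_form percent-encodes the values, so ";" is sent as %3B.
  "#{core_url}/update?#{URI.encode_www_form(params)}"
end

puts solr_csv_update_url("http://localhost:80/solr/my_core", "/tmp/input.csv", "SomeField", ";")
```

Both per-field parameters are derived from the same field name here, which avoids the two names drifting apart.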

When running an ingest, I got the following response, which was confusing since myField was, in fact, defined in my schema:

<?xml version="1.0" encoding="UTF-8"?>
<response>
    <lst name="responseHeader">
        <int name="status">400</int>
        <int name="QTime">1</int>
    </lst>
    <lst name="error">
        <str name="msg">undefined field: "myField"</str>
        <int name="code">400</int>
    </lst>
</response>

A peek in the log provided a clue (note the leading question mark):

SEVERE: org.apache.solr.common.SolrException: undefined field: "?myField"

Examining a hex dump of the CSV file revealed that it started with a UTF-8 Byte Order Mark:

xxd /tmp/input.csv | head
0000000: efbb bf...

One way to strip the BOM is with Bomstrip, a collection of BOM-stripping implementations in various languages, including a Perl one-liner. Alternatively, just open the file in Vim, do :set nobomb and save. Done!
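If neither tool is handy, the check and the fix are only a few lines of Ruby. This is a minimal sketch, not what Bomstrip does internally: strip_bom is a made-up name and the output path in the usage comment is hypothetical.

```ruby
# The UTF-8 byte order mark is the three bytes EF BB BF.
UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: returns the given bytes without a leading UTF-8 BOM.
# Input without a BOM comes back unchanged (as a binary string).
def strip_bom(bytes)
  data = bytes.b # binary copy; the argument itself is not modified
  data.start_with?(UTF8_BOM) ? data.byteslice(3..-1) : data
end

# Usage sketch:
# cleaned = strip_bom(File.binread("/tmp/input.csv"))
# File.binwrite("/tmp/input.nobom.csv", cleaned)
```

Working on raw bytes avoids any dependence on the file's declared encoding, which is what hid the problem in the first place.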

Aug 09 2012
 

Given a working Perl 5.12 install (via MacPorts), doing a sudo port install perl5.16 does not update the perl symlink:

% ls -alF /opt/local/bin/perl
lrwxr-xr-x 1 root admin 8 Jun 6 15:19 /opt/local/bin/perl@ -> perl5.12

The magic incantation is to install the perl5_16 variant of the perl5 package:

sudo port install perl5 +perl5_16

With this done, the symlink is updated and perl loads the expected version.

% ls -alF /opt/local/bin/perl
lrwxr-xr-x  1 root  admin  8 Aug  9 12:45 /opt/local/bin/perl@ -> perl5.16
% perl -v
This is perl 5, version 16, subversion 0 (v5.16.0) built for darwin-thread-multi-2level

Aug 09 2012
 

Stackable traits in Scala refers to being able to mix in multiple traits that work together to apply multiple modifications to a method. This involves invoking super.theMethod and modifying its input and/or output. But what is super in the context of a trait? The class (or trait) the trait extends from? The class the trait is being mixed into? It depends! All the mixed in traits and all the superclasses are linearized. super invokes the nearest preceding definition further up the chain. The general effect is that mixins to the right (and their ancestor classes) come earlier than those to the left. However, ancestors that are shared are deduped to only show up once, and they show up as late as possible. Here’s a detailed description of the Scala object hierarchy linearization algorithm.

If a trait which extends MyInterface tries to invoke super.myMethod but MyInterface.myMethod is abstract, the compiler generates this error:

error: method myMethod in trait MyInterface is accessed from super. It may not be abstract unless it is overridden by a member declared `abstract' and `override'

What this means is: generally, invoking an abstract method of a superclass is an error. However, with traits, the meaning of super is not known at compile time. The call would be valid if the trait were mixed into a class that had an implementation of the method. But the compiler errs on the side of caution unless told otherwise. Declaring abstract override def myMethod signals that you expect an implementation of the method to be available at run-time, so the compiler should not treat the super.myMethod invocation as an error. (Note: this applies regardless of whether the trait itself provides an implementation of the method.)

Here are some examples:

trait Munger {
  def munge(l : List[String]) : List[String]
}

trait Replace1 extends Munger {
  override def munge(l : List[String]) = l :+ "Replace1"
}

trait Replace2 extends Munger {
  override def munge(l : List[String]) = l :+ "Replace2"
}

//abstract override def munge is required in the Stack* traits because they invoke
//the abstract super.munge

trait Stack1 extends Munger {
  abstract override def munge(l : List[String]) = super.munge(l) :+ "Stack1"
}

trait Stack2Parent extends Munger {
  abstract override def munge(l : List[String]) = super.munge(l) :+ "Stack2Parent"
}

trait Stack2 extends Stack2Parent {
  abstract override def munge(l : List[String]) = super.munge(l) :+ "Stack2"
}

class Bottom {
  this : Munger =>

  def apply(): Unit = {
    println(
      munge(List("bottom"))
    )
  }
}

scala> (new Bottom with Replace1)()
List(bottom, Replace1)

scala> (new Bottom with Replace1 with Replace2)()
List(bottom, Replace2) //Replace1's munge was overridden and never ran

scala> (new Bottom with Replace1 with Stack1)()
List(bottom, Replace1, Stack1) //Stack1 called super.munge, which invoked the
//munge from the trait to the left

scala> (new Bottom with Replace1 with Stack2)()
List(bottom, Replace1, Stack2Parent, Stack2) //Stack2's super.munge called to its
//superclass, whereas Stack2Parent's super.munge called the trait to the left

Jul 31 2012
 

Trying to unit test some code that was to run inside Solr, I bumped into this:

Cannot mock/spy class org.apache.solr.core.SolrCore
Mockito cannot mock/spy following:
  - final classes
  - anonymous classes
  - primitive types

Fortunately, there’s a simple solution: PowerMock. After adding the following two annotations to my test class definition (and the requisite Maven dependency declarations), everything just worked. No changes needed to the actual Mockito calls themselves. Sweet.

@RunWith(PowerMockRunner.class)
@PrepareForTest( { SolrCore.class })

Jul 03 2012
 

When trying to use Apache Pig in local mode to connect to a stand-alone HBase using HBaseStorage, I kept getting errors like this:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. Not a host:port pair: ?42548@endive.local10.1.10.70,    64058,1341349176322

The unrecognized host:port pair corresponds to a happy sign-on message from the HBase log:

INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: The identifier of this process is 42548@endive.local

The problem is a version mismatch: HBase apparently changed the format of this data in 0.92. As of Pig 0.10.0, the solution is to downgrade HBase to 0.90.6.

Mar 28 2012
 

There are two tricks to using VNC from a non-Mac to connect to a Mac running OS X Lion.

  1. Turn on the VNC server by enabling System Preferences -> Sharing -> Screen Sharing. Even though it provides little security, a VNC password must be set so that OS X will present an authentication scheme that makes sense to a standard VNC client. Enable “VNC viewers may control screen with password”.
  2. After connecting, you will see a grey linen-backgrounded desktop with nothing in it. Type your user name and password. After logging in, your desktop contents will display!

Mar 10 2012
 

TL;DR: Install Do Not Track Plus, use Duck Duck Go (with !sp sometimes) for web searches. To go the extra mile, also install Straight Google (requires Greasemonkey), Cookie Whitelist and BetterPrivacy.

I don’t like the idea of advertisers, search engines, and social networks building extensive profiles about what I do online (why). A short-list of tools to avoid such tracking:

Prevent Inter-Website Tracking

  • Abine’s Do Not Track Plus is nearly a one-stop shop. I wish more details were available about what it does, but the gist is:
    • Install and maintain a large number of generic do-not-track-me cookies for many ad networks and tracking services. When content is fetched from these sites, the generic cookie is sent rather than one which is unique to you
    • Special handling for social buttons (e.g. Like this on Facebook), in which the button is fetched anonymously, but, should you choose to click on it, the veil is lifted and the Like is associated with your account
    • Many ads are blocked from rendering too, which I hadn’t expected. Those that remain are innocuous enough that I do not use Ad Block Plus any more.

Reduce Google Information Gathering

I store some personal information on Google (thanks to Google+, Google Calendar, etc.). I do not want Google to associate that personal information with all the web searches I do every day. Do Not Track Plus is of limited value here: if you sign in to Google, Do Not Track Plus will be obliged to permit your identity to be sent. Additional steps are needed:

  • Don’t search with Google. I prefer Duck Duck Go for most searches, thanks to their Zero-click Info and other goodies.
  • For needle-in-the-haystack searches, I find Google often has the best results. Startpage is an anonymous Google Search proxy. Rather than use it directly, I just prefix my Duck Duck Go searches with !sp when needed.
  • Straight Google (requires Greasemonkey) prevents Google’s click-tracking. This is less important if you follow the above steps to avoid doing your web searches at google.com. However, they still track links clicked on their other products, which Straight Google can prevent.

Control Intra-Website Tracking

The above steps should take care of attempts to track your movement across the web. However, most websites will still store long-term cookies in your browser to track your history of interaction with that particular website.

  • Cookie Whitelist is designed to accept only whitelisted cookies. In practice, this breaks too many websites. For less hassle, configure as follows:
    • Cookie button (the red one): ON. This lets any website set a cookie, but it will be deleted at the end of the session
    • For the few websites you wish to remain logged in to (or otherwise personalized) click the green button to whitelist as needed
    • Do not accept third-party cookies
  • BetterPrivacy is to Flash LSOs (local shared objects, or Flash cookies) what Cookie Whitelist is to regular cookies, alas with a more confusing set of configuration options.

Note: this post is (obviously?) not about how to avoid your employer/ISP/government monitoring what you do online. To hide what you are doing from someone who has access to all your traffic, you need encryption and proxying. A good first stop to get some encryption is EFF’s HTTPS Everywhere. This goes a long way to prevent the person nearby in the coffee shop from stealing your Facebook account.

Originally published 2012-03-10. Updated 2012-03-14 with intra-website tracking steps.