Jun 28, 2013
 

Here is how to customize how Jackson serializes Joda-Time dates to JSON:

objectMapperFactory.registerModule(new SimpleModule() {
    {
        addSerializer(DateTime.class, new StdSerializer<DateTime>(DateTime.class) {
            @Override
            public void serialize(DateTime value, JsonGenerator jgen, SerializerProvider provider) throws IOException, JsonGenerationException {
                 jgen.writeString(ISODateTimeFormat.date().print(value));
            }
        });
    }
});

You can use this in combination with JodaModule; just place it after the JodaModule is registered.
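For example, the registration order would look like this (a minimal sketch; customDateTimeModule is a hypothetical name for the SimpleModule shown above):

objectMapperFactory.registerModule(new JodaModule());
objectMapperFactory.registerModule(customDateTimeModule); // registered after JodaModule, so the custom DateTime serializer takes effect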

Alternatively, if all you need is to write DateTimes in ISO 8601 format instead of as Unix epoch timestamps, you can use the following:

objectMapperFactory.registerModule(new JodaModule());
objectMapperFactory.disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS);

JodaModule registers a custom DateTimeSerializer that takes this setting into account. However, unlike with the standard java.util.Date handling, SerializationFeature.WRITE_DATE_KEYS_AS_TIMESTAMPS and getSerializationConfig().setDateFormat(myDateFormat) are ignored, so there is no way to fine-tune the serialization.

Ultimately, a more elegant solution would be to give JodaModule some additional constructors or setters that allow passing in a DateTimeFormatter for its various helper classes to use.
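Purely as a hypothetical sketch (JodaModule offers no such constructor today), the usage might then look like:

objectMapperFactory.registerModule(new JodaModule(ISODateTimeFormat.date())); // hypothetical constructor accepting the DateTimeFormatter to use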

Mar 28, 2013
 

If you use the timestamptz data type, Postgres does timezone conversions automatically.

First, some test data:

pg=> create table time_test (id text, stamp timestamptz);
CREATE TABLE
pg=> insert into time_test values('foo', now());
INSERT 0 1
pg=> insert into time_test values('foo', now());
INSERT 0 1
pg=> select * from time_test;
 id  |             stamp
-----+-------------------------------
 foo | 2013-01-22 00:53:40.325041+00
 foo | 2013-01-22 00:54:02.021018+00
(2 rows)

Client-supplied data in other timezones is automatically converted for comparisons:

pg=> select * from time_test where stamp > '2013-01-21 16:54:00 PST';
 id  |             stamp
-----+-------------------------------
 foo | 2013-01-22 00:54:02.021018+00
(1 row)

Results can be converted on the fly:

pg=> select id, stamp at time zone 'PST' from time_test;
 id  |          timezone
-----+----------------------------
 foo | 2013-01-21 16:53:40.325041
 foo | 2013-01-21 16:54:02.021018
(2 rows)

…once, or for the whole session:

pg=> set session time zone "pst8pdt";
SET
pg=> select * from time_test;
 id  |             stamp
-----+-------------------------------
 foo | 2013-01-21 16:53:40.325041-08
 foo | 2013-01-21 16:54:02.021018-08
(2 rows)

pg=> insert into time_test values ('bar', '2013-01-21 16:55:03');
INSERT 0 1
pg=> select * from time_test;
 id  |             stamp
-----+-------------------------------
 foo | 2013-01-21 16:53:40.325041-08
 foo | 2013-01-21 16:54:02.021018-08
 bar | 2013-01-21 16:55:03-08
(3 rows)

 

Aug 17, 2012
 

Solr has a handy ability to ingest local CSV files. The neatest aspect is that you can populate multi-valued fields by sub-parsing an individual field. E.g. the following will ingest /tmp/input.csv and split SomeField into multiple values on semicolon delimiters:

curl http://localhost:80/solr/my_core/update\?stream.file\=/tmp/input.csv\&stream.contentType\=text/csv\;charset\=utf-8\&commit\=true\&f.SomeField.split\=true\&f.SomeField.separator\=%3B

When running an ingest, I got the following response, which was confusing since myField was, in fact, defined in my schema:

<?xml version="1.0" encoding="UTF-8"?>
<response>
    <lst name="responseHeader">
        <int name="status">400</int>
        <int name="QTime">1</int></lst>
        <lst name="error">
            <str name="msg">undefined field: "myField"</str>
        <int name="code">400</int>
    </lst>
</response>

A peek in the log provided a clue (note the leading question mark):

SEVERE: org.apache.solr.common.SolrException: undefined field: "?myField"

Examining a hex dump of the CSV file revealed that it started with a UTF-8 Byte Order Mark:

xxd /tmp/input.csv | head
0000000: efbb bf...

One way to strip the BOM is with Bomstrip, a collection of BOM-stripping implementations in various languages, including a Perl one-liner. Alternatively, just open the file in Vim, do :set nobomb and save. Done!
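If you would rather strip the BOM programmatically, here is a minimal sketch in Java (the file path is just an example) that drops a leading UTF-8 BOM in place:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class StripBom {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get("/tmp/input.csv"); // example path
        byte[] bytes = Files.readAllBytes(path);
        // The UTF-8 BOM is the three-byte sequence EF BB BF.
        if (bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF) {
            Files.write(path, Arrays.copyOfRange(bytes, 3, bytes.length));
        }
    }
}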

Aug 09, 2012
 

Stackable traits in Scala refer to mixing in multiple traits that work together to apply successive modifications to a method. Each trait invokes super.theMethod and modifies its input and/or output. But what is super in the context of a trait? The class (or trait) the trait extends? The class the trait is being mixed into? It depends! All the mixed-in traits and all the superclasses are linearized, and super invokes the nearest preceding definition further up the chain. The general effect is that mixins to the right (and their ancestor classes) come earlier than those to the left. However, ancestors that are shared are deduplicated so they appear only once, and they appear as late as possible. Here’s a detailed description of the Scala object hierarchy linearization algorithm.

If a trait which extends MyInterface tries to invoke super.myMethod but MyInterface.myMethod is abstract, the compiler generates this error:

error: method myMethod in trait MyInterface is accessed from super. It may not be abstract unless it is overridden by a member declared `abstract' and `override'

What this means is: generally, invoking an abstract method via super is an error. However, with traits, the meaning of super is not known at compile time; the call would be valid if the trait were mixed into a class that provides an implementation of the method. The compiler errs on the side of caution unless told otherwise. Declaring the method as abstract override def myMethod signals that you expect an implementation to be available at run time and that the super.myMethod invocation should not be treated as an error. (Note: this applies regardless of whether the trait itself provides an implementation of the method.)

Here are some examples:

trait Munger {
  def munge(l : List[String]) : List[String]
}

trait Replace1 extends Munger {
  override def munge(l : List[String]) = l :+ "Replace1"
}

trait Replace2 extends Munger {
  override def munge(l : List[String]) = l :+ "Replace2"
}

//abstract override def munge is required in the Stack* traits because they invoke
//the abstract super.munge

trait Stack1 extends Munger {
  abstract override def munge(l : List[String]) = super.munge(l) :+ "Stack1"
}

trait Stack2Parent extends Munger {
  abstract override def munge(l : List[String]) = super.munge(l) :+ "Stack2Parent"
}

trait Stack2 extends Stack2Parent {
  abstract override def munge(l : List[String]) = super.munge(l) :+ "Stack2"
}

class Bottom {
  this : Munger =>

  def apply() {
    println(
      munge(List("bottom"))
    )
  }
}

scala> (new Bottom with Replace1)()
List(bottom, Replace1)

scala> (new Bottom with Replace1 with Replace2)()
List(bottom, Replace2) //Replace1's munge was overridden and never ran

scala> (new Bottom with Replace1 with Stack1)()
List(bottom, Replace1, Stack1) //Stack1 called super.munge, which invoked the
//munge from the trait to the left

scala> (new Bottom with Replace1 with Stack2)()
List(bottom, Replace1, Stack2Parent, Stack2) //Stack2's super.munge resolved to its
//parent trait Stack2Parent, whereas Stack2Parent's super.munge called the trait to the left

Jul 31, 2012
 

Trying to unit test some code that was to run inside Solr, I bumped into this:

Cannot mock/spy class org.apache.solr.core.SolrCore
Mockito cannot mock/spy following:
  - final classes
  - anonymous classes
  - primitive types

Fortunately, there’s a simple solution: PowerMock. After adding the following two annotations to my test class definition (and the requisite Maven dependency declarations), everything just worked. No changes needed to the actual Mockito calls themselves. Sweet.

@RunWith(PowerMockRunner.class)
@PrepareForTest( { SolrCore.class })
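For reference, a bare-bones sketch of what such a test class might look like (the class and test names here are illustrative, not from the original post):

import static org.mockito.Mockito.mock;

import org.apache.solr.core.SolrCore;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.powermock.core.classloader.annotations.PrepareForTest;
import org.powermock.modules.junit4.PowerMockRunner;

@RunWith(PowerMockRunner.class)
@PrepareForTest({ SolrCore.class })
public class MySolrComponentTest {

    @Test
    public void mocksAFinalClass() {
        // A plain Mockito call; PowerMock prepares SolrCore so that mocking it no longer fails.
        SolrCore core = mock(SolrCore.class);
        // ... stub and verify with the usual Mockito API ...
    }
}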

Jul 03, 2012
 

When trying to use Apache Pig in local mode to connect to a stand-alone HBase using HBaseStorage, I kept getting errors like this:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. Not a host:port pair: ?42548@endive.local10.1.10.70,    64058,1341349176322

The unrecognized host:port pair corresponds to a happy sign-on message from the HBase log:

INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: The identifier of this process is 42548@endive.local

The problem is a version mismatch: HBase apparently changed the format of this data in 0.92. As of Pig 0.10.0, the solution is to downgrade HBase to 0.90.6.

Feb 21, 2012
 

After my previous adventures in slicing and dicing a huge XML file, I wanted a means to randomly select files. But first, the directory had so many entries that it was unwieldy on my laptop. The Python script below divvies the files up into directories of up to 1000 files each. (Adaptable to other contexts via slight tweaking of the filename regex and subdir name generation.)

#!/usr/bin/python
import os
import re
where = '.' # source directory 

ls = os.listdir(where)
for f in ls:
  m = re.search('.*_COMM-([0-9]+).xml', f)
  if m:
    subdir = "%03d" % (int(m.group(1)) / 1000)
    try:
      os.mkdir(subdir)
    except OSError as e:
      pass
    os.rename(f, os.path.join(subdir, f))

Now on to the random selection, again with Python:

#!/usr/bin/python
import os
import random
import re
import sys

if len(sys.argv) > 1:
  where = sys.argv[1]
else:
  where = '.' # source directory 


subdirs = filter(lambda x: re.search('^[0-9]*$', x), os.listdir(where))
subdir = os.path.join(where,random.choice(subdirs))
print os.path.join(subdir,random.choice(os.listdir(subdir)))

A quick shell loop leverages the Python script to grab files and dump them into a repository of test data. Works in zsh, Bash, and perhaps others:

for i in {1..250}; do cp $(./pick_a_file.py sub_dir_with_files) /destination/dir/filename_prefix_$(printf "%03d" $i).xml; done;


Feb 08, 2012
 

Often Array(arg) is used to wrap an arbitrary argument in an array, but that approach is flawed. Note the last result when applied to a Hash:

> Array(42)
 => [42] 
> Array([1,2,3])
 => [1, 2, 3] 
> Array(nil)
 => [] 
> Array("foo")
 => ["foo"] 
> Array({"foo" => "bar", "biz" => "baz"})
 => [["foo", "bar"], ["biz", "baz"]]

What went wrong is that Array() calls the (now deprecated) to_a on its argument, and Hash has a custom to_a implementation with different semantics. Instead, do this:

class Array
  def self.wrap(args)
    return [] unless args
    args.is_a?(Array) ? args : [args]
  end
end

That yields the expected results, even for Hashes:

> Array.wrap(42)
 => [42] 
> Array.wrap([1,2,3])
 => [1, 2, 3] 
> Array.wrap(nil)
 => [] 
> Array.wrap("foo")
 => ["foo"] 
> Array.wrap({"foo" => "bar", "biz" => "baz"})
 => [{"foo"=>"bar", "biz"=>"baz"}]

Use of is_a? is deliberate; duck-typing in this situation ([:[], :each].all? { |m| args.respond_to? m }) yields surprises, since e.g. String is Enumerable and would not get wrapped.

For further discussion see the ruby-forum thread “shortcut for x = [x] unless x.is_a?(Array)” and the StackOverflow question “Ruby: Object.to_a replacement”.

Feb 08, 2012
 

Slicing up XML files is best done with an XML parser. (Regular expressions, csplit, etc. are too easily confused by arbitrary strings in CDATA sections.) xml_split (distributed with the XML::Twig module on CPAN) mostly does the trick. Given a file like:

<?xml version="1.0" encoding="UTF-8"?>
<foo:Root xmlns:foo="http://www.foo.bar/fnarf/foo">
  <foo:child>
    ...
  </foo:child>
  <foo:child>
    ...
  </foo:child>
</foo:Root>

…xml_split can create many files, each containing:

<?xml version="1.0" encoding="UTF-8"?>
<foo:child>
  ...
</foo:child>

However, this loses the namespace declaration and the enclosing root element. Luckily, a little sed magic can bring those back:

find . -name '*.xml' | xargs -n1 sed -e '1 a\ 
<foo:Root xmlns:foo="http://www.foo.bar/fnarf/foo">
' -e '$ a\
</foo:Root>
' -i ''

find lists all the files, xargs invokes sed on them one by one (-n1), and sed adds the opening tag with namespace declaration after the first line (1 a) and the closing tag after the last line ($ a). Now each file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<foo:Root xmlns:foo="http://www.foo.bar/fnarf/foo">
  <foo:child>
    ...
  </foo:child>
</foo:Root>