Feb 21 2012
 

After my previous adventures in slicing and dicing a huge XML file, I wanted a means to randomly select files. But first, the directory had so many entries it was unwieldy on my laptop. The Python script below divvies the files up into directories of up to 1000 files each. (Adaptable to other contexts via slight tweaking of the filename regex and subdir name generation.)

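#!/usr/bin/python
import os
import re
where = '.' # source directory

for f in os.listdir(where):
  # Match names like foo_COMM-12345.xml and capture the number
  m = re.search(r'.*_COMM-([0-9]+)\.xml$', f)
  if m:
    # Integer division: 0-999 go to 000, 1000-1999 to 001, ...
    subdir = os.path.join(where, "%03d" % (int(m.group(1)) // 1000))
    if not os.path.isdir(subdir):
      os.mkdir(subdir)
    # Paths are joined with the source directory so the script also works when where is not '.'
    os.rename(os.path.join(where, f), os.path.join(subdir, f))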
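
To see where a given file would land before moving anything, the bucketing rule can be checked on its own. A minimal sketch, assuming the same _COMM-<number>.xml naming; the function name bucket_for and the sample filenames are made up for illustration:

```python
import re

# Same pattern as the script above, with the dot escaped and the end anchored.
PATTERN = re.compile(r'.*_COMM-([0-9]+)\.xml$')

def bucket_for(filename, size=1000):
    """Return the zero-padded subdirectory name for filename, or None if it does not match."""
    m = PATTERN.search(filename)
    if m is None:
        return None
    # Integer division groups numbers 0-999 into 000, 1000-1999 into 001, and so on.
    return "%03d" % (int(m.group(1)) // size)

if __name__ == "__main__":
    # Hypothetical filenames, purely for demonstration
    for name in ["a_COMM-7.xml", "a_COMM-1000.xml", "a_COMM-12345.xml", "notes.txt"]:
        print(name, "->", bucket_for(name))
```

Changing size adjusts how many files end up in each directory.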

Now on to the random selection, again with Python:

#!/usr/bin/python
import os
import random
import re
import sys

if len(sys.argv) > 1:
  where = sys.argv[1]
else:
  where = '.' # source directory

# Keep only the all-numeric subdirectory names (a list, so random.choice can index it)
subdirs = [d for d in os.listdir(where) if re.search('^[0-9]+$', d)]
subdir = os.path.join(where, random.choice(subdirs))
print(os.path.join(subdir, random.choice(os.listdir(subdir))))

A quick shell loop uses the Python script to grab files and dump them into a repository of test data. Works in Zsh and Bash, perhaps others:

for i in {1..250}; do cp "$(./pick_a_file.py sub_dir_with_files)" "/destination/dir/filename_prefix_$(printf "%03d" "$i").xml"; done
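
Note that each pass of the loop picks independently, so the same file can be copied more than once. If distinct files are wanted, one option is to gather all candidates first and use random.sample. A rough sketch, assuming the numbered-subdirectory layout from earlier; sample_files and its parameters are hypothetical names, not part of the original scripts:

```python
import os
import random
import re

def sample_files(where, count):
    """Pick up to `count` distinct files from the all-numeric subdirectories of `where`."""
    candidates = []
    for d in sorted(os.listdir(where)):
        path = os.path.join(where, d)
        # Only descend into directories whose names are purely digits
        if re.search('^[0-9]+$', d) and os.path.isdir(path):
            candidates.extend(os.path.join(path, f) for f in sorted(os.listdir(path)))
    # random.sample never returns the same element twice
    return random.sample(candidates, min(count, len(candidates)))
```

Calling sample_files('sub_dir_with_files', 250) would return up to 250 distinct paths. Unlike the two-step pick above, this weights every file equally rather than every subdirectory, at the cost of listing all files up front.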

 

 

Feb 14 2012
 

There is a lot to be said about the intricacies of IMAP delete flags versus actual expunging of deleted messages, and about the confusion caused when something is merely flagged for deletion but the user expected it to be really gone. This post is not about that. Everyone agrees that once a message is expunged, it definitely should be gone. But sometimes expunged messages still display in Thunderbird!

I often observe this:

  1. Delete message on the way to work using K-9 on my phone.
  2. Arrive at work and the message is gone from my Inbox in Mail.app.
  3. Come home, download new mail in Thunderbird and see an Inbox full of undead messages.

No amount of re-expunging and re-fetching mail helps. Grepping through the server-side Maildir shows the messages really are gone from the folders in which Thunderbird is still showing them.
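
The server-side check can also be scripted rather than grepped. A minimal sketch using Python's standard mailbox module, assuming shell access to the mail server and a standard Maildir layout; the function name maildir_has_message is a placeholder of my own:

```python
import mailbox

def maildir_has_message(maildir_path, message_id):
    """Return True if any message in the Maildir carries the given Message-ID header."""
    # create=False: raise instead of silently creating an empty Maildir on a typo
    box = mailbox.Maildir(maildir_path, create=False)
    try:
        for msg in box:
            if str(msg.get('Message-ID', '')).strip() == message_id:
                return True
        return False
    finally:
        box.close()
```

The Message-ID of an undead message can be read from its headers in Thunderbird; if this returns False for the folder in question, the server really has expunged it.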

It turns out the reason they are still displaying in Thunderbird is mundane client-side index corruption. To clean things up:

  1. Right-click on mailbox
  2. Choose Properties...
  3. Click Repair Folder
  4. Rejoice at tidy mailbox
Feb 08 2012
 

Often Array(arg) is used to make sure a value is an array, but it is flawed. Note the last result when applied to a Hash:

> Array(42)
 => [42] 
> Array([1,2,3])
 => [1, 2, 3] 
> Array(nil)
 => [] 
> Array("foo")
 => ["foo"] 
> Array({"foo" => "bar", "biz" => "baz"})
 => [["foo", "bar"], ["biz", "baz"]]

What went wrong is that Array() converts its argument by calling to_ary and, failing that, to_a. The generic Object#to_a is deprecated, but Hash has a custom to_a implementation with different semantics: it returns an array of [key, value] pairs. Instead, do this:

class Array
  def self.wrap(args)
    return [] unless args
    args.is_a?(Array) ? args : [args]
  end
end

That yields the expected results, even for Hashes:

> Array.wrap(42)
 => [42] 
> Array.wrap([1,2,3])
 => [1, 2, 3] 
> Array.wrap(nil)
 => [] 
> Array.wrap("foo")
 => ["foo"] 
> Array.wrap({"foo" => "bar", "biz" => "baz"})
 => [{"foo"=>"bar", "biz"=>"baz"}]

Use of is_a? is deliberate; duck-typing in this situation ([:[], :each].all? { |m| args.respond_to? m }) yields surprises, since e.g. String is Enumerable in Ruby 1.8 and would not get wrapped.

For further discussion see the Ruby-forum thread “shortcut for x = [x] unless x.is_a?(Array)” and the StackOverflow question “Ruby: Object.to_a replacement”.

Feb 08 2012
 

Slicing up XML files is best done with an XML parser. (Regular expressions, csplit, etc. are too easily confused by arbitrary strings in CDATA sections.) xml_split (may be obtained with CPAN by installing XML::Twig) mostly does the trick. Given a file like:

<?xml version="1.0" encoding="UTF-8"?>
<foo:Root xmlns:foo="http://www.foo.bar/fnarf/foo">
  <foo:child>
    ...
  </foo:child>
  <foo:child>
    ...
  </foo:child>
</foo:Root>

…xml_split can create many files, each containing:

<?xml version="1.0" encoding="UTF-8"?>
<foo:child>
  ...
</foo:child>

However, this loses the namespace declaration and the enclosing root element. Luckily, a little sed magic can bring those back:

find . -name '*.xml' | xargs -n1 sed -e '1 a\
<foo:Root xmlns:foo="http://www.foo.bar/fnarf/foo">
' -e '$ a\
</foo:Root>
' -i ''

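find lists all the files, xargs invokes sed on them one by one (-n1), and sed adds the opening tag with namespace declaration after the first line (1 a) and the closing tag after the last line ($ a). The -i '' makes sed edit each file in place without keeping a backup; this is the BSD/macOS spelling, whereas GNU sed takes a bare -i. Now each file looks like this: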

<?xml version="1.0" encoding="UTF-8"?>
<foo:Root xmlns:foo="http://www.foo.bar/fnarf/foo">
  <foo:child>
    ...
  </foo:child>
</foo:Root>
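
The same re-wrapping can also be done without sed. A sketch in Python, assuming each split file starts with the XML declaration on a line of its own; rewrap and its argument names are illustrative only:

```python
def rewrap(text, open_tag, close_tag):
    """Insert open_tag after the first line of text and append close_tag at the end."""
    # Split off the first line (the XML declaration) from everything after it
    first, _, rest = text.partition('\n')
    if rest and not rest.endswith('\n'):
        rest += '\n'
    return first + '\n' + open_tag + '\n' + rest + close_tag + '\n'

if __name__ == "__main__":
    fragment = '<?xml version="1.0" encoding="UTF-8"?>\n<foo:child>\n  ...\n</foo:child>\n'
    print(rewrap(fragment,
                 '<foo:Root xmlns:foo="http://www.foo.bar/fnarf/foo">',
                 '</foo:Root>'), end='')
```

To apply it to real files, read each one, pass its contents through rewrap, and write the result back.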