Feb 212012

After my previous adventures in slicing and dicing a huge XML file, I wanted a means to randomly select files. But first, the directory had so many entries it was unwieldy on my laptop. The Python script below divvies the files up into directories of up to 1000 files each. (Adaptable to other contexts via slight tweaking of the filename regex and subdir name generation.)

import os
import re
where = '.' # source directory 

ls = os.listdir(where)
for f in ls:
  m = re.search('.*_COMM-([0-9]+).xml', f)
  if m:
    subdir = "%03d" % (int(m.group(1)) / 1000)
    except OSError as e:
    os.rename(f, os.path.join(subdir, f))

Now on to the random selection, again with Python:

import os
import random
import re
import sys

if len(sys.argv) > 1:
  where = sys.argv[1]
  where = '.' # source directory 

subdirs = filter(lambda x: re.search('^[0-9]*$', x), os.listdir(where))
subdir = os.path.join(where,random.choice(subdirs))
print os.path.join(subdir,random.choice(os.listdir(subdir)))

A quick shell loop leverages the Python script to grab files and dump into a repository of test data. Works on ZSH, Bash, perhaps others:

for i in {1..250}; do cp $(./pick_a_file.py sub_dir_with_files) /destination/dir/filename_prefix_$(printf "%03d" $i).xml; done;