Feb 21, 2012
 

After my previous adventures in slicing and dicing a huge XML file, I wanted a way to select files at random. But first, the directory had so many entries that it was unwieldy on my laptop. The Python script below divvies the files up into subdirectories of up to 1000 files each. (It can be adapted to other contexts by tweaking the filename regex and the subdirectory name generation.)

#!/usr/bin/python
import os
import re
where = '.' # source directory 

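for f in os.listdir(where):
  # Filenames look like <anything>_COMM-<number>.xml
  m = re.search(r'.*_COMM-([0-9]+)\.xml$', f)
  if m:
    # Integer division groups the numbers by thousands: 000, 001, ...
    subdir = os.path.join(where, "%03d" % (int(m.group(1)) // 1000))
    if not os.path.isdir(subdir):
      os.mkdir(subdir)
    # Build both paths from 'where' so the script also works when the
    # source directory is not the current directory
    os.rename(os.path.join(where, f), os.path.join(subdir, f))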
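
For anyone adapting the script, the two pieces that need tweaking (the filename regex and the subdirectory name) can be pulled into one small function. This is only a sketch: `bucket_name`, `bucket_size`, `width` and the sample filenames are names I made up for illustration, not part of the script above.

```python
import re

# Same pattern as the script above; change it to match other naming schemes.
FILENAME_RE = re.compile(r'.*_COMM-([0-9]+)\.xml$')

def bucket_name(filename, bucket_size=1000, width=3):
    """Return the zero-padded subdirectory name for filename, or None if it does not match."""
    m = FILENAME_RE.match(filename)
    if m is None:
        return None
    # Integer division maps 0..999 to bucket 0, 1000..1999 to bucket 1, etc.
    return "%0*d" % (width, int(m.group(1)) // bucket_size)

# Hypothetical filenames, for illustration only
print(bucket_name("dump_COMM-0001234.xml"))  # 001
print(bucket_name("dump_COMM-999.xml"))      # 000
print(bucket_name("notes.txt"))              # None
```

Changing `bucket_size` changes how many files land in each subdirectory, and `width` controls the zero padding of the directory name.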

Now on to the random selection, again with Python:

#!/usr/bin/python
import os
import random
import re
import sys

if len(sys.argv) > 1:
  where = sys.argv[1]
else:
  where = '.' # source directory 


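# Only all-digit names count, i.e. the subdirectories made by the first script.
# A list (rather than filter) keeps random.choice working on Python 3 as well.
subdirs = [d for d in os.listdir(where) if re.search('^[0-9]+$', d)]
subdir = os.path.join(where, random.choice(subdirs))
print(os.path.join(subdir, random.choice(os.listdir(subdir))))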

A quick shell loop uses the Python script to grab files and dump them into a repository of test data. Each pick is independent, so the same file may be copied more than once. It works in zsh and Bash, and perhaps other shells:

for i in {1..250}; do cp "$(./pick_a_file.py sub_dir_with_files)" "/destination/dir/filename_prefix_$(printf "%03d" "$i").xml"; done
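
If duplicates in the test data are a problem, the picking and copying could be done in one Python pass that samples without replacement. The sketch below is an alternative, not what I actually ran: `copy_sample` and its parameters are made-up names, and it assumes the numeric subdirectory layout produced by the first script. It also draws uniformly over all files instead of choosing a subdirectory first.

```python
import os
import random
import re
import shutil

def copy_sample(src_root, dest_dir, count, prefix="filename_prefix_"):
    """Copy `count` distinct files from the all-digit subdirs of src_root into dest_dir."""
    candidates = []
    for name in sorted(os.listdir(src_root)):
        subdir = os.path.join(src_root, name)
        # Same filter as the picker script: all-digit directory names only
        if re.search('^[0-9]+$', name) and os.path.isdir(subdir):
            for f in sorted(os.listdir(subdir)):
                candidates.append(os.path.join(subdir, f))
    # random.sample never repeats an element; it raises ValueError
    # if count is larger than the number of candidate files
    chosen = random.sample(candidates, count)
    copied = []
    for i, path in enumerate(chosen, start=1):
        target = os.path.join(dest_dir, "%s%03d.xml" % (prefix, i))
        shutil.copy(path, target)
        copied.append(target)
    return copied
```

Calling `copy_sample('sub_dir_with_files', '/destination/dir', 250)` would produce the same numbered filenames as the shell loop, minus the possible repeats.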