{"id":109,"date":"2012-02-21T10:28:50","date_gmt":"2012-02-21T18:28:50","guid":{"rendered":"http:\/\/www.lorrin.org\/blog\/?p=109"},"modified":"2012-03-16T23:26:31","modified_gmt":"2012-03-17T06:26:31","slug":"random-file-selections-with-python","status":"publish","type":"post","link":"https:\/\/www.lorrin.org\/blog\/2012\/02\/21\/random-file-selections-with-python\/","title":{"rendered":"Random file selections with Python"},"content":{"rendered":"<p>After my previous adventures in <a title=\"Splitting large XML files with xml_split and sed (preserving root element and namespace declaration)\" href=\"http:\/\/www.lorrin.org\/blog\/?p=93\">slicing and dicing a huge XML<\/a> file, I wanted a means to randomly select files. But first, the directory had so many entries it was unwieldy on my laptop. The Python script below divvies the files up into directories of up to 1000 files each. (Adaptable to other contexts via slight tweaking of the filename regex and subdir name generation.)<\/p>\n<pre class=\"brush:py\">#!\/usr\/bin\/python\r\nimport os\r\nimport re\r\nwhere = '.' # source directory\r\n\r\nfor f in os.listdir(where):\r\n  # The number after _COMM- decides which bucket the file lands in.\r\n  m = re.search(r'.*_COMM-([0-9]+)\\.xml', f)\r\n  if m:\r\n    # Floor division: 0-999 go to 000, 1000-1999 to 001, and so on.\r\n    subdir = os.path.join(where, \"%03d\" % (int(m.group(1)) \/\/ 1000))\r\n    try:\r\n      os.mkdir(subdir)\r\n    except OSError:\r\n      pass # subdirectory already exists\r\n    os.rename(os.path.join(where, f), os.path.join(subdir, f))<\/pre>\n<p>Now on to the random selection, again with Python:<\/p>\n<pre class=\"brush:py\">#!\/usr\/bin\/python\r\nimport os\r\nimport random\r\nimport re\r\nimport sys\r\n\r\nif len(sys.argv) &gt; 1:\r\n  where = sys.argv[1]\r\nelse:\r\n  where = '.' # source directory\r\n\r\n# Consider only the all-digit subdirectories created by the script above.\r\nsubdirs = [d for d in os.listdir(where) if re.search('^[0-9]+$', d)]\r\nsubdir = os.path.join(where, random.choice(subdirs))\r\nprint(os.path.join(subdir, random.choice(os.listdir(subdir))))<\/pre>\n<p>A quick shell loop uses the Python script to grab files and dump them into a repository of test data. 
Works in zsh and Bash, and perhaps other shells:<\/p>\n<pre class=\"brush:shell\">for i in {1..250}; do cp \"$(.\/pick_a_file.py sub_dir_with_files)\" \/destination\/dir\/filename_prefix_$(printf \"%03d\" $i).xml; done<\/pre>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>After my previous adventures in slicing and dicing a huge XML file, I wanted a means to randomly select files. But first, the directory had so many entries it was unwieldy on my laptop. The Python script below divvies the files up into directories of up to 1000 files each. (Adaptable to other contexts via <a href='https:\/\/www.lorrin.org\/blog\/2012\/02\/21\/random-file-selections-with-python\/' class='excerpt-more'>[&#8230;]<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[48],"tags":[35,11,36,34],"_links":{"self":[{"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/posts\/109"}],"collection":[{"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/comments?post=109"}],"version-history":[{"count":4,"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/posts\/109\/revisions"}],"predecessor-version":[{"id":153,"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/posts\/109\/revisions\/153"}],"wp:attachment":[{"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/media?parent=109"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/categories?post=109"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/tags?post=109"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}