Feb 082012
 

Slicing up XML files is best done with an XML parser. (Regular expressions, csplit, etc. are too easily confused by arbitrary strings in CDATA sections.) xml_split (may be obtained with CPAN by installing XML::Twig) mostly does the trick. Given a file like:

<?xml version="1.0" encoding="UTF-8"?>
<foo:Root xmlns:foo="http://www.foo.bar/fnarf/foo">
  <foo:child>
    ...
  </foo:child>
  <foo:child>
    ...
  </foo:child>
</foo:Root>

…xml_split can create many files, each containing:

<?xml version="1.0" encoding="UTF-8"?>
<foo:child>
  ...
</foo:child>

However, this loses the namespace declaration and the enclosing root element. Luckily, a little sed magic can bring those back:

find . -name '*.xml' | xargs -n1 sed -e '1 a\ 
<foo:Root xmlns:foo="http://www.foo.bar/fnarf/foo">
' -e '$ a\
</foo:Root>
' -i ''

find lists all the files, xargs invokes sed on them one by one (-n1), and sed adds the opening tag with namespace declaration after the first line (1 a) and the closing tag after the last line ($ a). Now each file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<foo:Root xmlns:foo="http://www.foo.bar/fnarf/foo">
  <foo:child>
    ...
  </foo:child>
</foo:Root>