{"id":93,"date":"2012-02-08T20:47:42","date_gmt":"2012-02-09T04:47:42","guid":{"rendered":"http:\/\/www.lorrin.org\/blog\/?p=93"},"modified":"2012-03-16T23:27:17","modified_gmt":"2012-03-17T06:27:17","slug":"splitting-large-xml-files-with-xml_split-and-sed","status":"publish","type":"post","link":"https:\/\/www.lorrin.org\/blog\/2012\/02\/08\/splitting-large-xml-files-with-xml_split-and-sed\/","title":{"rendered":"Splitting large XML files with xml_split and sed (preserving root element and namespace declaration)"},"content":{"rendered":"<p>Slicing up XML files is best done with an XML parser. (Regular expressions, csplit, etc. are too easily confused by arbitrary strings in CDATA sections.) <a href=\"http:\/\/search.cpan.org\/perldoc?xml_split\">xml_split<\/a> (may be obtained with CPAN by installing <a href=\"http:\/\/search.cpan.org\/~mirod\/XML-Twig-3.39\/\">XML::Twig<\/a>) <em>mostly<\/em> does the trick. Given a file like:<\/p>\n<pre class=\"brush:xml\">&lt;?xml version=\"1.0\" encoding=\"UTF-8\"?&gt;\r\n&lt;foo:Root xmlns:foo=\"http:\/\/www.foo.bar\/fnarf\/foo\"&gt;\r\n  &lt;foo:child&gt;\r\n    ...\r\n  &lt;\/foo:child&gt;\r\n  &lt;foo:child&gt;\r\n    ...\r\n  &lt;\/foo:child&gt;\r\n&lt;\/foo:Root&gt;<\/pre>\n<p>&#8230;xml_split can create many files, each containing:<\/p>\n<pre class=\"brush:xml\">&lt;?xml version=\"1.0\" encoding=\"UTF-8\"?&gt;\r\n&lt;foo:child&gt;\r\n  ...\r\n&lt;\/foo:child&gt;<\/pre>\n<p>However, this loses the namespace declaration and the enclosing root element. Luckily, a little sed magic can bring those back:<\/p>\n<pre class=\"brush:shell\">find . 
-name '*.xml' | xargs -n1 sed -e '1 a\\ \r\n&lt;foo:Root xmlns:foo=\"http:\/\/www.foo.bar\/fnarf\/foo\"&gt;\r\n' -e '$ a\\\r\n&lt;\/foo:Root&gt;\r\n' -i ''<\/pre>\n<p><code>find<\/code> lists all the files, <code>xargs<\/code> invokes sed on them one at a time (<code>-n1<\/code>), and sed appends the opening tag with its namespace declaration after the first line (<code>1 a<\/code>) and the closing tag after the last line (<code>$ a<\/code>). <code>-i ''<\/code> edits each file in place without keeping a backup; that spelling is for BSD\/macOS sed, whereas GNU sed takes <code>-i<\/code> with no argument. Now each file looks like this:<\/p>\n<pre class=\"brush:xml\">&lt;?xml version=\"1.0\" encoding=\"UTF-8\"?&gt;\r\n&lt;foo:Root xmlns:foo=\"http:\/\/www.foo.bar\/fnarf\/foo\"&gt;\r\n  &lt;foo:child&gt;\r\n    ...\r\n  &lt;\/foo:child&gt;\r\n&lt;\/foo:Root&gt;<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Slicing up XML files is best done with an XML parser. (Regular expressions, csplit, etc. are too easily confused by arbitrary strings in CDATA sections.) xml_split (may be obtained with CPAN by installing XML::Twig) mostly does the trick. Given a file like: &lt;?xml version=\"1.0\" encoding=\"UTF-8\"?&gt; &lt;foo:Root xmlns:foo=\"http:\/\/www.foo.bar\/fnarf\/foo\"&gt; &lt;foo:child&gt; &#8230; &lt;\/foo:child&gt; &lt;foo:child&gt; &#8230; &lt;\/foo:child&gt; &lt;\/foo:Root&gt; &#8230;xml_split <a href='https:\/\/www.lorrin.org\/blog\/2012\/02\/08\/splitting-large-xml-files-with-xml_split-and-sed\/' 
class='excerpt-more'>[&#8230;]<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[48],"tags":[30,36],"_links":{"self":[{"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/posts\/93"}],"collection":[{"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/comments?post=93"}],"version-history":[{"count":8,"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/posts\/93\/revisions"}],"predecessor-version":[{"id":107,"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/posts\/93\/revisions\/107"}],"wp:attachment":[{"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/media?parent=93"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/categories?post=93"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/tags?post=93"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}