{"id":213,"date":"2012-08-17T14:38:33","date_gmt":"2012-08-17T21:38:33","guid":{"rendered":"http:\/\/www.lorrin.org\/blog\/?p=213"},"modified":"2013-03-25T22:59:32","modified_gmt":"2013-03-26T05:59:32","slug":"stripping-byte-order-mark","status":"publish","type":"post","link":"https:\/\/www.lorrin.org\/blog\/2012\/08\/17\/stripping-byte-order-mark\/","title":{"rendered":"Stripping Unicode Byte Order Mark to resolve SolrException: undefined field: &#8220;?myField&#8221; during Ingest"},"content":{"rendered":"<p><a href=\"http:\/\/lucene.apache.org\/solr\/\">Solr<\/a> has a handy ability to <a href=\"https:\/\/wiki.apache.org\/solr\/UpdateCSV\">ingest local CSV files<\/a>. The neatest aspect is that you can populate multi-valued fields by sub-parsing an individual field. For example, the following ingests <tt>\/tmp\/input.csv<\/tt> and splits <tt>SomeField<\/tt> into multiple values on semicolon delimiters:<\/p>\n<pre class=\"brush:shell;gutter:false;toolbar:false\">curl http:\/\/localhost:80\/solr\/my_core\/update\\?stream.file\\=\/tmp\/input.csv\\&amp;stream.contentType\\=text\/csv\\;charset\\=utf-8\\&amp;commit\\=true\\&amp;f.SomeField.split\\=true\\&amp;f.SomeField.separator\\=%3B<\/pre>\n<p>When running an ingest, I got the following response, which was confusing since <tt>myField<\/tt> was, in fact, defined in my schema:<\/p>\n<pre class=\"brush:xml\">&lt;?xml version=\"1.0\" encoding=\"UTF-8\"?&gt;\r\n&lt;response&gt;\r\n    &lt;lst name=\"responseHeader\"&gt;\r\n        &lt;int name=\"status\"&gt;400&lt;\/int&gt;\r\n        &lt;int name=\"QTime\"&gt;1&lt;\/int&gt;&lt;\/lst&gt;\r\n        &lt;lst name=\"error\"&gt;\r\n            &lt;str name=\"msg\"&gt;undefined field: \"myField\"&lt;\/str&gt;\r\n        &lt;int name=\"code\"&gt;400&lt;\/int&gt;\r\n    &lt;\/lst&gt;\r\n&lt;\/response&gt;<\/pre>\n<p>A peek in the log provided a clue (note the leading question mark):<\/p>\n<pre>SEVERE: org.apache.solr.common.SolrException: undefined field: 
\"?myField\"<\/pre>\n<p>Examining a hex dump of the CSV file revealed that it started with a UTF-8 <a href=\"https:\/\/en.wikipedia.org\/wiki\/Byte_order_mark\">Byte Order Mark<\/a>:<\/p>\n<pre>xxd \/tmp\/input.csv | head\r\n0000000: efbb bf...<\/pre>\n<p>Solr was reading those three bytes (<tt>EF BB BF<\/tt>) as part of the first column name in the CSV header, so the field it looked up was not <tt>myField<\/tt> at all. One way to strip the BOM is with <a href=\"http:\/\/www.ueber.net\/who\/mjl\/projects\/bomstrip\/\">Bomstrip<\/a>, a collection of BOM-stripping implementations in various languages, including a Perl one-liner. Alternatively, just <strong>open the file in Vim, run <tt>:set nobomb<\/tt>, and save<\/strong>. Done!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Solr has a handy ability to ingest local CSV files. The neatest aspect is that you can populate multi-valued fields by sub-parsing an individual field. For example, the following ingests \/tmp\/input.csv and splits SomeField into multiple values on semicolon delimiters: curl http:\/\/localhost:80\/solr\/my_core\/update\\?stream.file\\=\/tmp\/input.csv\\&amp;stream.contentType\\=text\/csv\\;charset\\=utf-8\\&amp;commit\\=true\\&amp;f.SomeField.split\\=true\\&amp;f.SomeField.separator\\=%3B When running an ingest, I got the following response, which was <a href='https:\/\/www.lorrin.org\/blog\/2012\/08\/17\/stripping-byte-order-mark\/' 
class='excerpt-more'>[&#8230;]<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[48],"tags":[70,68,69,38],"_links":{"self":[{"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/posts\/213"}],"collection":[{"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/comments?post=213"}],"version-history":[{"count":5,"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/posts\/213\/revisions"}],"predecessor-version":[{"id":252,"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/posts\/213\/revisions\/252"}],"wp:attachment":[{"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/media?parent=213"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/categories?post=213"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.lorrin.org\/blog\/wp-json\/wp\/v2\/tags?post=213"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}