Saturday Jan 12, 2013

The non-UTF-8 character encoding "invalid byte sequence" error

An ongoing issue in XML transaction processing is UTF-8 character conformance. In an ideal world your computer would simply process your content stream, store it and move on. XML engineers, however, have other ideas.

Content created in Microsoft Excel or Word, or in a Web page application on a Windows desktop, uses the Windows-1252 character set by default. Often, however, this content ends up in XML document instances labelled as UTF-8.

A conforming XML parser such as Xerces will then report invalid byte sequence errors when attempting to process the content, because Windows-1252 bytes above 0x7F (curly quotes, for example) are generally not valid UTF-8 sequences. It turns out the really simple answer is to change the encoding declaration in the XML prolog to say "Windows-1252", e.g.


<?xml version="1.0" encoding="Windows-1252" standalone="yes"?>

and then retry. Of course, if you know you are using a different character encoding, substitute that for the Windows-1252 value instead.
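To see why the label matters, here is a minimal Python sketch (not part of the original tooling) that decodes the same bytes under both encodings. The sample bytes and the helper name try_decode are made up for illustration; 0x93 and 0x94 are the Windows-1252 curly quotes.

```python
# Hypothetical sample: curly quotes saved as Windows-1252 bytes 0x93 / 0x94.
raw = b'<note>\x93quoted\x94</note>'

def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short description of the decode error."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the zero-based offset of the first offending byte.
        return f"error at byte offset {err.start}: {err.reason}"

print(try_decode(raw, "utf-8"))    # reports an error at the first curly quote
print(try_decode(raw, "cp1252"))   # decodes cleanly, curly quotes intact
```

The bytes never change between the two calls; only the declared encoding does, which is exactly what editing the XML prolog achieves.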

Now, for automated batch processes, you will need a simple piece of XSLT to switch or add the correct encoding.
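If XSLT is not to hand, the same relabelling can be sketched in Python by working on the raw bytes. This is an alternative, not the XSLT approach referred to above, and it rests on some assumptions: the file is in an ASCII-compatible encoding, any XML declaration sits at the very start of the file, and the function name set_declared_encoding is hypothetical.

```python
import re

# XML declaration at the start of the data: version, then anything up to "?>".
_XML_DECL = re.compile(rb'^<\?xml\s+version\s*=\s*(["\'])([^"\']+)\1(.*?)\?>', re.DOTALL)
# An existing encoding pseudo-attribute inside the declaration.
_ENCODING = re.compile(rb'\s+encoding\s*=\s*(["\'])[A-Za-z][A-Za-z0-9._-]*\1')

def set_declared_encoding(data: bytes, encoding: str = "Windows-1252") -> bytes:
    """Switch or add the encoding in the XML declaration; the content bytes are untouched."""
    # Assumption: a UTF-8 byte order mark would contradict the new label, so drop it.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    enc_attr = b' encoding="' + encoding.encode("ascii") + b'"'
    match = _XML_DECL.match(data)
    if match is None:
        # No declaration present: add one in front of the document.
        return b'<?xml version="1.0"' + enc_attr + b"?>\n" + data
    # Remove any old encoding, keep the rest (e.g. standalone="yes").
    rest = _ENCODING.sub(b"", match.group(3), count=1)
    new_decl = b'<?xml version="' + match.group(2) + b'"' + enc_attr + rest + b"?>"
    return new_decl + data[match.end():]
```

In a batch job you would read each file with open(path, "rb"), pass the bytes through this function, and write the result back in binary mode. Note that this only relabels the content; it does not transcode it.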

You can find more tips and tricks on all this, plus links to XSLT tools that help with it, on the CAM Editor wiki page.

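Another issue is simply locating the offending characters inside an XML instance. For that you can use this handy command-line grep statement (the LC_ALL=C prefix asks grep to match raw bytes; in a UTF-8 locale it may otherwise interpret the pattern as characters rather than bytes):

LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" file.xml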
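The grep pattern flags every byte above 0x7F, which also includes perfectly valid UTF-8 multi-byte characters. As a narrower check, here is a small Python sketch that reports only the lines that fail to decode as UTF-8. The function name find_invalid_utf8_lines is hypothetical, and only the first offending byte on each line is reported.

```python
def find_invalid_utf8_lines(data: bytes) -> list:
    """Return (line_number, column, byte_value) for each line that is not valid UTF-8."""
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is zero-based; report a one-based column like most editors.
            hits.append((lineno, err.start + 1, line[err.start]))
    return hits
```

To use it, read the file in binary mode, for example with open("file.xml", "rb"), and pass the bytes in; each hit gives the line, the column and the numeric value of the first bad byte.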

All this allows you to diagnose potential character set conflicts and, hopefully, build smoothly functioning XML interfaces. For XML content validation you can of course use the CAMV validation engine. You can find out more about it from this YouTube resource site, which has a video on the topic (various NIEM training materials are included there too).


