Wednesday, February 14, 2007

Invalid XML Characters: when valid UTF8 does not mean valid XML

I was working on a Java application recently when I got the following exception:

org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x1a) was found in the element content of the document.

I was using Castor in an attempt to unmarshal (XML->Java Objects) an XML string. The XML originated from a bookmarks file that was uploaded into the application by the user, tidied up, transformed and then stored in an Oracle database. The error occurred at the point where the XML string was returned from the database and converted into Java objects.

I immediately assumed it to be some kind of character encoding conversion problem.

With some databases (e.g. MySQL) it is possible to have data passed as UTF8 by supplying certain settings in the JDBC URL. I do not use MySQL, so I do not know whether this would have fixed my problem, although I suspect it would not.

I tried various methods to convert my XML string into valid UTF8 and I was pretty sure that I had achieved satisfactory UTF8 conversion but I still got the error.

This is when I discovered that not all valid UTF8 characters are valid XML characters, which probably makes sense (what with control characters and such) but I have never had to think about this before.
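
To make the distinction concrete, here is a minimal sketch, assuming the JDK's built-in DOM parser. The class and method names (`ControlCharDemo`, `parses`, `roundTripsAsUtf8`) are made up for illustration. It shows a string that survives a UTF-8 round trip unchanged yet is rejected as XML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical demo class; the names are illustrative, not from the original post.
class ControlCharDemo {

    // Returns true if the JDK's default XML parser accepts the document.
    static boolean parses(String xml) {
        try {
            byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(utf8));
            return true;
        } catch (Exception e) {
            // For an illegal character this is typically a SAXParseException.
            return false;
        }
    }

    // Returns true if the string is unchanged by a UTF-8 encode/decode round trip.
    static boolean roundTripsAsUtf8(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return s.equals(new String(utf8, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String bad = "<a>x" + (char) 0x1A + "y</a>";
        System.out.println(roundTripsAsUtf8(bad)); // the control character is perfectly good UTF-8
        System.out.println(parses(bad));           // but the parser refuses it
        System.out.println(parses("<a>xy</a>"));   // the same document without it is fine
    }
}
```

The point of the sketch is that no amount of re-encoding helps: the byte sequence is legal, it is the character itself that XML 1.0 forbids.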

After several hours spent trying numerous UTF8 conversion techniques, I eventually found a solution on the Xalan mailing list. I am reproducing it here because it was not mentioned in the context of the "Unicode: 0x1a" error; if it had been, I would have found it more quickly. The XML standard specifies which Unicode characters are valid in XML documents, so it is possible to take a UTF8 document and filter out all the invalid characters using a method like this:


/**
 * This method ensures that the output String has only
 * valid XML unicode characters as specified by the
 * XML 1.0 standard. For reference, please see
 * <a href="http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char">the
 * standard</a>. This method will return an empty
 * String if the input is null or empty.
 *
 * @param in The String whose non-valid characters we want to remove.
 * @return The in String, stripped of non-valid characters.
 */
public String stripNonValidXMLCharacters(String in) {
    StringBuffer out = new StringBuffer(); // Used to hold the output.
    char current; // Used to reference the current character.

    if (in == null || ("".equals(in))) return ""; // vacancy test.
    for (int i = 0; i < in.length(); i++) {
        current = in.charAt(i); // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF)))
            out.append(current);
    }
    return out.toString();
}

I hope somebody else finds this useful (and it saves them a few hours of head scratching), alternatively if people reading this know of a better solution then please do let me know!

91 comments:

ismjml said...

Thank you so very much, I was searching the internet almost this whole afternoon... Thanks!
Note: Comment imported. Original by Tjeerd at 2007-02-22 18:33

ismjml said...

Here's the .NET version:

public static string stripNonValidXMLCharacters(string s)
{
    if (string.IsNullOrEmpty(s)) return string.Empty; // vacancy test; must run before s is dereferenced.

    StringBuilder _validXML = new StringBuilder(s.Length, s.Length); // Used to hold the output.
    char current; // Used to reference the current character.
    char[] charArray = s.ToCharArray();

    for (int i = 0; i < charArray.Length; i++)
    {
        current = charArray[i]; // NOTE: No IndexOutOfRangeException caught here; it should not happen.
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF)))
            _validXML.Append(current);
    }
    return _validXML.ToString();
}
Note: Comment imported. Original by Meno website: http://www.workpac.com/ at 2007-04-04 00:53

ismjml said...

Thank you, I was indeed looking for such a solution for long
Note: Comment imported. Original by Santthosh website: http://www.santthosh.info at 2007-04-04 05:29

ismjml said...

Just wanted to add my thanks for this - and add some google bait. I was seeing



An invalid XML character (Unicode: 0xb) was found in the CDATA section



for my XML, now fixed using this snippet.



Dave
Note: Comment imported. Original by Anonymous at 2007-05-04 13:57

ismjml said...

Hi Mark,



Just wanted to say "Thanks!"



I am pulling data from an AS400 and this saved the day for me!!



Cheers!

Bob
Note: Comment imported. Original by Anonymous at 2007-05-04 20:01

ismjml said...

Thank you so much for sharing this code. That saved me on a major project I am working on.
Note: Comment imported. Original by Karen at 2007-05-22 18:28

ismjml said...

Thanks for the solution, this saved me a lot of time!
Note: Comment imported. Original by Nathan at 2007-06-05 23:47

ismjml said...

Thanks, found your blog after

javax.servlet.ServletException: javax.xml.transform.TransformerException: com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: An invalid XML character (Unicode: 0x0) was found in the element content of the document.
Note: Comment imported. Original by Robot website: http://www.moesol.com at 2007-06-07 08:07

ismjml said...

Thanks. Such a big problem was solved in a few minutes. Thank you very much, dude.
Note: Comment imported. Original by Leon John at 2007-06-29 19:26

ismjml said...

Thanks!!
Note: Comment imported. Original by gurito website: http://softygen.cross-solution.de at 2007-08-17 11:43

ismjml said...

Thank you very much Mark
Note: Comment imported. Original by Anonymous at 2007-08-27 20:53

ismjml said...

Thanks... This really helped!
Note: Comment imported. Original by Anonymous at 2007-09-25 21:19

ismjml said...

tnx!
Note: Comment imported. Original by Anonymous at 2007-10-05 14:06

ismjml said...

Also wanted to say, thanks! This just saved me a serious parsing headache.
Note: Comment imported. Original by GS at 2007-11-07 18:41

ismjml said...

Hi Mark and everyone,



Thank you so bloody much!



Questions: Is the character 'range' up-to-date? And does it apply to CData Sections as well?
Note: Comment imported. Original by Håkan Jacobsson at 2007-11-09 17:06

ismjml said...

Hi Håkan,



I think the character ranges are correct according to the latest XML specification (Fourth edition, last edited 29 September 2006).





http://www.w3.org/TR/xml/#charsets





Also, CDATA sections are about storing strings of characters that are not to be treated as markup. I am almost certain that they should not contain out of range characters (as this would invalidate the entire XML document).


Note: Comment imported. Original by markmc website: http://content.mark-mclaren.info/ at 2007-11-09 17:50

ismjml said...

Mark,



Thanks so much

I have one more question.

Can I use this range of characters for validation of XML documents not using UTF-8 as encoding

(sorry if this is a newbie question)?

The XML documents I deal with are client product feeds and may not be encoded in UTF-8.
Note: Comment imported. Original by Håkan Jacobsson at 2007-11-11 12:44

ismjml said...

I found this a very difficult question to answer. I find reading detailed specifications hard at the best of times! I have done some digging around and I am still not certain that I have the right answer. My guess is whatever character encoding you use, valid XML must only contain characters from the defined range. I base my guess on the following:





"Remember that character encodings, despite their name do not apply to characters - they apply to byte sequences which represent characters. If you have a char variable in Java it has no character encoding as far as you are concerned, it’s just that character."





However, if you do decide to opt for a character encoding other than UTF-8/UTF-16 you will have additional constraints in order to satisfy XML validity.





You must have an encoding declaration

All data must be encoded in the encoding named in the declaration

All byte sequences must be legal for that encoding




Note: Comment imported. Original by markmc website: http://cse-mjmcl.cse.bris.ac.uk/blog at 2007-11-12 09:33
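
On the encoding question above: a minimal sketch, assuming the bytes are first decoded with the encoding the document declares and only then checked. The class and method names (`DecodedCharCheck`, `isXmlChar`, `hasOnlyXmlChars`) are invented for illustration. The range test is applied to decoded code points, so it reads the same whether the bytes were ISO-8859-1, UTF-8 or anything else.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative only.
class DecodedCharCheck {

    // The XML 1.0 Char production, applied to a Unicode code point.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Decodes the bytes with the declared charset, then checks every code point.
    static boolean hasOnlyXmlChars(byte[] data, Charset declared) {
        String text = new String(data, declared);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isXmlChar(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] latin1Text = {0x63, 0x61, 0x66, (byte) 0xE9}; // "cafe" with an acute accent, in ISO-8859-1
        byte[] withControl = {0x61, 0x1A};                   // 'a' followed by the SUB control character
        System.out.println(hasOnlyXmlChars(latin1Text, StandardCharsets.ISO_8859_1));
        System.out.println(hasOnlyXmlChars(withControl, StandardCharsets.ISO_8859_1));
    }
}
```

In this sketch the control character 0x1A is rejected regardless of which single-byte or UTF encoding carried it, which matches the guess in the reply above.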

ismjml said...

Thank you for this function, here is the PHP version:



function strip_invalid_xml_chars( $in ) {
    $out = ""; // Used to hold the output.
    if ( $in === null || $in === "" ) {
        return ""; // vacancy test (empty() would also reject the string "0").
    }
    $length = strlen($in);
    for ( $i = 0; $i < $length; $i++) {
        // ord() works on single bytes (0-255), so the upper ranges below never match;
        // bytes of multi-byte UTF-8 sequences pass through the 0x20-0xD7FF test.
        $current = ord($in[$i]);
        if ( ($current == 0x9) ||
             ($current == 0xA) ||
             ($current == 0xD) ||
             (($current >= 0x20) && ($current <= 0xD7FF)) ||
             (($current >= 0xE000) && ($current <= 0xFFFD)) ||
             (($current >= 0x10000) && ($current <= 0x10FFFF))) {
            $out .= chr($current);
        }
        else {
            $out .= " ";
        }
    }
    return $out;
}
Note: Comment imported. Original by Konzhang website: http://www.qq.com/ at 2007-12-10 05:55

ismjml said...

Greetings,



Thanks a lot Mark, I was able to fix an XML transformation problem by stripping all the non-valid XML characters.


Note: Comment imported. Original by ABr website: http://abr3.wordpress.com at 2007-12-31 10:54

ismjml said...

Mil gracias.
Note: Comment imported. Original by Anonymous at 2008-01-08 12:48

ismjml said...

If you just want to validate a string (and not replace the characters), you can do it easily with a regular expression:



public static boolean iSValidXMLText(String xml) {
    boolean valid = true;

    if( xml != null ) {
        valid = xml.matches("^([\\x09\\x0A\\x0D\\x20-\\x7E]|" //# ASCII
            + "[\\xC2-\\xDF][\\x80-\\xBF]|" //# non-overlong 2-byte
            + "\\xE0[\\xA0-\\xBF][\\x80-\\xBF]|" //# excluding overlongs
            + "[\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2}|" //# straight 3-byte
            + "\\xED[\\x80-\\x9F][\\x80-\\xBF]|" //# excluding surrogates
            + "\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}|" //# planes 1-3
            + "[\\xF1-\\xF3][\\x80-\\xBF]{3}|" //# planes 4-15
            + "\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2})*$"); //# plane 16
    }

    return valid;
}



(borrowed from Here)


Note: Comment imported. Original by dcendents at 2008-01-11 11:56
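
A caveat on the regular expression above: its alternatives describe UTF-8 byte sequences, whereas a Java String is matched character by character, so non-ASCII text such as a lone "é" would be reported as invalid. A minimal sketch of an alternative, assuming Java 7 or later (for the `\x{...}` code point syntax) and using made-up names (`XmlTextRegex`, `isValidXmlText`), expresses the XML 1.0 ranges directly as code points:

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class XmlTextRegex {

    // Character class for the XML 1.0 Char production, written in code points.
    private static final String XML_CHAR =
            "[\\x{9}\\x{A}\\x{D}\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]";

    private static final Pattern VALID = Pattern.compile(XML_CHAR + "*");

    // Null is treated as valid, mirroring the method in the comment above.
    static boolean isValidXmlText(String text) {
        return text == null || VALID.matcher(text).matches();
    }
}
```

Because java.util.regex works on code points, a well-formed surrogate pair is tested as one supplementary character, while an unpaired surrogate falls outside every range and fails the match.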

ismjml said...

Anybody ever done something like this over an entire file?
Note: Comment imported. Original by Anonymous at 2008-02-21 16:11
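
On filtering an entire file: a minimal sketch, assuming the file is UTF-8 and that dropping the offending characters is acceptable. The names (`XmlCharStreamFilter`, `copyValid`, `filterFile`, `filter`) are invented here. It streams characters from a Reader to a Writer so the whole file never has to sit in memory.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; names are illustrative only.
class XmlCharStreamFilter {

    // XML 1.0 Char production for a single UTF-16 unit outside the surrogate range.
    static boolean isXmlChar(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    // Copies characters from in to out, dropping anything XML 1.0 does not allow.
    static void copyValid(Reader in, Writer out) throws IOException {
        int c = in.read();
        while (c != -1) {
            char ch = (char) c;
            if (Character.isHighSurrogate(ch)) {
                int next = in.read();
                if (next != -1 && Character.isLowSurrogate((char) next)) {
                    // A well-formed pair encodes 0x10000..0x10FFFF, which is allowed.
                    out.write(ch);
                    out.write(next);
                    c = in.read();
                } else {
                    // Unpaired high surrogate: drop it and re-examine next on the following pass.
                    c = next;
                }
            } else {
                if (isXmlChar(ch)) { // unpaired low surrogates fail this test and are dropped
                    out.write(ch);
                }
                c = in.read();
            }
        }
    }

    // File-to-file variant; assumes both files are UTF-8.
    static void filterFile(Path source, Path target) throws IOException {
        try (Reader in = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            copyValid(in, out);
        }
    }

    // String convenience wrapper, mainly useful for trying the filter out.
    static String filter(String s) {
        StringWriter out = new StringWriter();
        try {
            copyValid(new StringReader(s), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Note that `Files.newBufferedReader` reports malformed byte sequences with an exception, so a file that is not really UTF-8 would need a different decoding step first.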

ismjml said...

I used the method, but could not get it to work. I still get $#19;, an invalid XML char. I decoded the String using UTF-8; str.charAt(i) returned $, #, 1, 9 and ; respectively. Individually, these are valid XML chars. Could you advise what I did wrong? Thanks
Note: Comment imported. Original by Anonymous at 2008-02-25 17:57

ismjml said...

Please note: Your code is broken.



The condition "current <= 0x10FFFF" will not work, as a Java char has only 16 bits.

Java stores Strings as UTF-16 and "exposes" this encoding in charAt(x).

This is at least a "quirk" in Java; e.g. String.length() returns too much for text containing characters above 0xFFFF.

Fixing the code would require "decoding" the UTF-16.
Note: Comment imported. Original by Anonymous at 2008-02-25 23:35
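
The effect described in the comment above can be seen with a small sketch. The class name `SurrogateDemo` and the method `stripByChar` are made up; the method simply reproduces the char-by-char test from the post so its behaviour on a character above 0xFFFF can be observed.

```java
// Hypothetical demo class; names are illustrative only.
class SurrogateDemo {

    // The char-by-char filter from the post, reproduced for comparison.
    static String stripByChar(String in) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < in.length(); i++) {
            char current = in.charAt(i);
            if ((current == 0x9) || (current == 0xA) || (current == 0xD)
                    || ((current >= 0x20) && (current <= 0xD7FF))
                    || ((current >= 0xE000) && (current <= 0xFFFD))
                    || ((current >= 0x10000) && (current <= 0x10FFFF))) { // never true for a char
                out.append(current);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+10400 is one Unicode character but two UTF-16 code units.
        String s = new String(Character.toChars(0x10400));
        System.out.println(s.length());                      // 2
        System.out.println(s.codePointCount(0, s.length())); // 1
        System.out.println(stripByChar(s).length());         // 0: both halves were dropped
    }
}
```

So the char-based version never lets an invalid character through; its flaw is that it also removes valid characters from U+10000 upwards, because each half of the surrogate pair fails the range test on its own.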

ismjml said...

Couldn't you also do this? We did not want to actually remove the characters just replace them. Feedback/corrections welcome:



public String stripNonValidXMLCharacters(String in) {
    StringBuffer out = new StringBuffer(); // Used to hold the output.
    char current; // Used to reference the current character.

    if (in == null || ("".equals(in))) {
        return "";
    } // vacancy test.
    for (int i = 0; i < in.length(); i++) {
        current = in.charAt(i); // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF))) {
            out.append(current);
        } else {
            out.append("_");
        }
    }

    return out.toString();
}


Note: Comment imported. Original by John Seals at 2008-02-26 21:11

ismjml said...

I'm working on doing that I'll post it when I'm done
Note: Comment imported. Original by sidd at 2008-03-03 18:42

ismjml said...

Thank you so much. I had a huge data set that we converted into XML, and about 300 of them failed due to invalid chars in them. I was trying to code some method to do this, but what you have is much better than what I had in mind. THANK YOU!!!
Note: Comment imported. Original by sidd at 2008-03-03 18:41

ismjml said...

Hi,

Thx for your solution.



(I apologize I do not speak very good English)



I have the same problem, using JAXB and JAX-WS.

I have a web service that builds an Adobe PDF file and forwards it in a String object, and a client service that rebuilds the PDF file on the other side.

When the web service client unmarshals the response, it generates this exception: An invalid XML character (Unicode: 0x2) was found in the element content of the document. The PDF document stored in the String object contains characters that are valid for the PDF reader but invalid for the XML parser.

When I use your solution, the invalid characters are removed and the web service works correctly.

But when I recover the PDF after calling my service, it is corrupted. The removed characters are necessary to read the PDF properly.

I tried multiple ways of converting the String object to UTF8 encoding, but nothing works. Those characters are always present and obviously necessary.

Is there a way to replace (not remove) the invalid characters in the original String so that it passes the XML parser, and then to restore them after the parsing step so that the original message is rebuilt identically?



Thanks for your help.


Note: Comment imported. Original by Florent at 2008-03-07 17:26

ismjml said...

In the case of marshalling binary files, the solution in the end is to encode the PDF stream in Base64.

More information here: http://www.javaworld.com/javaworld/javatips/jw-javatip117.html

In my case, I use:

sun.misc.BASE64Encoder and sun.misc.BASE64Decoder.
Note: Comment imported. Original by Florent at 2008-03-10 09:08
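
For readers on Java 8 or later, the same idea can be sketched with the public `java.util.Base64` class rather than the internal `sun.misc` classes. This is an illustrative sketch only; the class and method names (`BinaryInXml`, `toElementText`, `fromElementText`) are assumptions and not part of any of the libraries mentioned above.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper; names are illustrative only.
class BinaryInXml {

    // Encodes arbitrary bytes as text that uses only A-Z, a-z, 0-9, '+', '/' and '='.
    static String toElementText(byte[] binary) {
        return Base64.getEncoder().encodeToString(binary);
    }

    // Reverses toElementText; surrounding whitespace added by XML formatting is trimmed first.
    static byte[] fromElementText(String text) {
        return Base64.getDecoder().decode(text.trim());
    }

    public static void main(String[] args) {
        // "%PDF" followed by bytes that would not be legal XML characters.
        byte[] pdfLike = {0x25, 0x50, 0x44, 0x46, 0x02, 0x00, (byte) 0xFF};
        String text = toElementText(pdfLike);
        System.out.println(Arrays.equals(pdfLike, fromElementText(text)));
    }
}
```

Since the encoded text contains nothing but ASCII letters, digits and three punctuation characters, it is always safe as element content, and the original bytes come back unchanged on the receiving side.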

ismjml said...

Works like a charm... This was a real time saver for me.

Thanks
Note: Comment imported. Original by Anonymous at 2008-03-17 21:18

ismjml said...

Good piece of code!

Had problems getting MSXML to parse some apparently valid characters but it seems they weren't.



Cheers.
Note: Comment imported. Original by 5ubliminal website: http://www.tellinya.com/ at 2008-03-20 02:32

ismjml said...

THANK YOU THANK YOU THANK YOU



Whoever posted the PHP function saved me.

function strip_invalid_xml_chars2( $in )
{
    $out = "";
    $length = strlen($in);

    for ( $i = 0; $i < $length; $i++)
    {
        $current = ord($in[$i]);

        if ( ($current == 0x9) || ($current == 0xA) || ($current == 0xD) || (($current >= 0x20) && ($current <= 0xD7FF)) || (($current >= 0xE000) && ($current <= 0xFFFD)) || (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            $out .= chr($current);
        }
        else
        {
            $out .= " ";
        }
    }

    return $out;
}
Note: Comment imported. Original by ransom website: http://www.ransom.vg at 2008-03-21 19:35

ismjml said...

Mark,



I owe you much beer. Can I use this function (while giving you credit) in my AntiSamy project?



It's published under the BSD license.



Cheers,

Arshan




Note: Comment imported. Original by Arshan D. website: http://i8jesus.com/ at 2008-03-25 14:55

ismjml said...

Arshan, please use it freely with my blessing. (it wasn't my code originally anyhow, I just reproduced it in this context).



Mark
Note: Comment imported. Original by markmc website: http://cse-mjmcl.cse.bris.ac.uk at 2008-03-26 19:11

ismjml said...

Absolutely brilliant! Saved me from a big headache!



Thanks!
Note: Comment imported. Original by Anonymous at 2008-04-24 17:20

ismjml said...

Very nice, this fixed some problems we have over here. Here is a Delphi version:

function ValidXmlString(Value : WideString) : WideString;
var
  NewLen,
  idx : integer;
  CurChar : word;
begin
  // initialize
  Result := '';
  NewLen := 0;

  // check for empty string
  if Length(Value) = 0 then
    exit;

  // init length of result
  Result := stringofchar(' ',length(Value));

  // loop and check valid xml characters
  for idx := 1 to length(Value) do
  begin
    CurChar := ord(Value[idx]);
    if (CurChar = $9) or (CurChar = $A) or (CurChar = $D) or
       ((CurChar >= $20) and (CurChar <= $D7FF)) or
       ((CurChar >= $E000) and (CurChar <= $FFFD)) or
       ((CurChar >= $10000) and (CurChar <= $10FFFF)) then
    begin
      inc(NewLen);
      Result[NewLen] := Value[idx];
    end;
  end;

  // adjust size of result
  Result := copy(Result,1,NewLen);
end; // ValidXmlString


Note: Comment imported. Original by Maurits van RIjnen at 2008-05-06 08:40

ismjml said...

Thank you very much!
Note: Comment imported. Original by Dario at 2008-06-10 22:33

ismjml said...

I am using this method in my code, but I am not able to test it :( Please post if somebody has tested it before. The method should trim 0x13; if somebody has data containing this character, please post.
Note: Comment imported. Original by vishal paisal at 2008-07-03 12:54

ismjml said...

Thank you very much, I am sure you have saved at least 10 hours of our time.



Keep posting this kind of useful stuff.
Note: Comment imported. Original by Anonymous at 2008-07-10 20:10

ismjml said...

if you just want to get rid of control characters you can use a regex. In c#:



xml = Regex.Replace(xml, "&\\#x(?:0[0-8BCEF]|1[0-9A-F]);", "");


Note: Comment imported. Original by jba at 2008-09-11 09:09

ismjml said...

Could you please help me write the regular expression in Java?
Note: Comment imported. Original by Anonymous at 2008-09-12 11:14
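
A possible Java rendering of the C# one-liner above, offered as a sketch rather than a tested drop-in. `stripControlRefs` is a direct translation of that pattern, which only catches two-digit, upper-case hexadecimal references such as `&#x1A;`. `stripInvalidRefs` is a broader variant added here as an assumption about what might be wanted: it also handles decimal and lower-case references by checking the referenced code point. All names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class CharRefStripper {

    // Direct translation of the C# pattern: two-digit upper-case hex references only.
    private static final Pattern CONTROL_REF =
            Pattern.compile("&#x(?:0[0-8BCEF]|1[0-9A-F]);");

    // Any numeric character reference: group 1 is hex digits, group 2 is decimal digits.
    private static final Pattern ANY_REF =
            Pattern.compile("&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    static String stripControlRefs(String xml) {
        return CONTROL_REF.matcher(xml).replaceAll("");
    }

    // The XML 1.0 Char production, applied to a Unicode code point.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Removes every numeric reference whose target is not a legal XML 1.0 character.
    static String stripInvalidRefs(String xml) {
        Matcher m = ANY_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean keep;
            try {
                int cp = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2), 10);
                keep = isXmlChar(cp);
            } catch (NumberFormatException tooLarge) {
                keep = false; // a value too large for an int cannot be a legal code point
            }
            m.appendReplacement(out, keep ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Both methods deal only with escaped references in the markup. Raw control characters in the text still need a character filter like the one in the post.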

ismjml said...

More than one year later and the solution you've posted is still up to date! Thanks a lot for the code Mark. It certainly saved me a lot of headaches!



Cheers,

Ricardo
Note: Comment imported. Original by Ricardo at 2008-09-26 11:42

ismjml said...

Thanks for the help!
Note: Comment imported. Original by Craig Sumner website: http://www.craigsumner.net at 2008-10-16 18:55

ismjml said...

Thank you!!! Saved me a lot of time and frustration with this function.
Note: Comment imported. Original by V at 2008-10-28 20:36

ismjml said...

Thanks a lot!
Note: Comment imported. Original by geert at 2008-11-24 14:01

ismjml said...

function stripNonValidXMLCharacters(in)
{
    var current;
    var out = "";

    for (var i = 0; i < in.length; i++)
    {
        current = in.charCodeAt(i);

        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF)))
            out += in.charAt(i);
    }

    return out;
}


Note: Comment imported. Original by Leni website: http://www.zindus.com at 2008-11-25 00:13

ismjml said...

Thanks for helping me identify this problem.
Note: Comment imported. Original by Jay at 2008-12-19 03:43

ismjml said...

hey Leni.. isn't 'in' a reserved keyword in javascript ?
Note: Comment imported. Original by Anonymous at 2009-01-12 06:20

ismjml said...

"in" is a reserved word in JavaScript!
Note: Comment imported. Original by markmc website: http://content.mark-mclaren.info/ at 2009-01-12 08:19

ismjml said...

Thanks a Lot. I saved a lot of time and it worked.otherwise, I would have wasted so much of my time trying to find a solution for this problem



Thanks once again
Note: Comment imported. Original by Anonymous at 2009-01-22 13:40

ismjml said...

Thanks very much Mark!
Note: Comment imported. Original by Arash Jooracbhi at 2009-02-25 18:04

ismjml said...

I think that the solution is broken because Java chars never exceed 0xFFFF, so some of the comparisons are useless.

The idea is to compare Unicode code points, not UTF-16 code units (the Java char type).

Here is my solution:

/**
 * This method ensures that the output String has only valid XML unicode characters as specified by the
 * XML 1.0 standard. For reference, please see the
 * standard. This method will return an empty String if the input is null or empty.
 *
 * @author Donoiu Cristian, GPL
 * @param s The String whose non-valid characters we want to remove.
 * @return The in String, stripped of non-valid characters.
 */
public static String removeInvalidXMLCharacters(String s) {
    if (s == null) return ""; // As documented above: null yields an empty String.
    StringBuilder out = new StringBuilder(); // Used to hold the output.
    int codePoint; // Used to reference the current character.
    //String ss = "\ud801\udc00"; // This is actually one unicode character, represented by two code units!
    //System.out.println(ss.codePointCount(0, ss.length())); // See: 1
    int i = 0;
    while (i < s.length()) {
        codePoint = s.codePointAt(i); // This is the unicode code of the character.
        if ((codePoint == 0x9) || // Consider testing larger ranges first to improve speed.
            (codePoint == 0xA) ||
            (codePoint == 0xD) ||
            ((codePoint >= 0x20) && (codePoint <= 0xD7FF)) ||
            ((codePoint >= 0xE000) && (codePoint <= 0xFFFD)) ||
            ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
            out.append(Character.toChars(codePoint));
        }
        i += Character.charCount(codePoint); // Increment with the number of code units (java chars) needed to represent a Unicode char.
    }
    return out.toString();
}

Hope it helps.
Note: Comment imported. Original by Donoiu Cristian at 2009-03-02 11:16

ismjml said...

Anyone know why XML parsers or deserialization objects throw when an invalid character is found? Just curious about the motivation for throwing an exception rather than ignoring the character.
Note: Comment imported. Original by bock at 2009-03-18 17:29

ismjml said...

Hi Bock,



Good question! I will try to answer it! In the above error the XML parser is correctly reporting that this data does not conform to the strict guidelines of the XML specification. It is the parser's job to report any non-compliance to any extent (even if it seems minor). However, in my particular case it is sufficient to discard the non-standard characters and continue working with the data BUT in another situation this course of action might not be appropriate.



Mark


Note: Comment imported. Original by markmc website: http://content.mark-mclaren.info/ at 2009-03-19 21:26

ismjml said...

Thanks very much Mark!
Note: Comment imported. Original by Anonymous at 2009-04-16 18:42

ismjml said...

Saved me a bunch of time. Thanks!
Note: Comment imported. Original by Jessica King at 2009-04-23 20:29

ismjml said...

Thanks a lot.
Note: Comment imported. Original by Anonymous website: http://students.info.uaic.ro/~ionut.bujdei at 2009-05-20 05:43

ismjml said...

Doesn't filter this out:

§



The double S looking character breaks an XML document.
Note: Comment imported. Original by Anonymous at 2009-05-20 21:26

ismjml said...

Thanks a ton for this code. It really saved a lot of search and work !!
Note: Comment imported. Original by Saurabh Sule at 2009-05-22 06:55

ismjml said...

Or the Scala Version (I'm a scala noob, so don't rib me too hard)...





class ValidXmlChar(chr:Char) {
  def isValidXmlChr() = {
    ((chr == 0x9) ||
     (chr == 0xA) ||
     (chr == 0xD) ||
     ((chr >= 0x20) && (chr <= 0xD7FF)) ||
     ((chr >= 0xE000) && (chr <= 0xFFFD)) ||
     ((chr >= 0x10000) && (chr <= 0x10FFFF)))
  }
}

object ValidXmlChar {
  implicit def char2ValidXMLChar(chr:Char) = new ValidXmlChar(chr)
}

import ValidXmlChar.{_}

object Main {

  def main(args: Array[String]) :Unit = {
    // Assuming args(0) is a string you want to clean
    println(stripNonValidXMLCharacters(args(0)))
  }

  def stripNonValidXMLCharacters(s:String) = {
    val sb = new StringBuffer()
    s.filter(_.isValidXmlChr).foreach(sb.append(_))
    sb
  }

}


Note: Comment imported. Original by Koppe at 2009-06-08 18:08

ismjml said...

We're using Java 1.4.2 and a few methods can't seem to be found. Is there an equivalent for...?



-Character.toChars()

-Character.charCount()



I used the following:

codePoint = s.charAt(i);
Note: Comment imported. Original by Ken at 2009-06-09 14:54
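
For runtimes older than Java 5, where `String.codePointAt()`, `Character.toChars()` and `Character.charCount()` are not available, one possible workaround is to recognise surrogate pairs by hand. The sketch below uses only `charAt`, `length` and `StringBuffer`; the class name `Java14XmlFilter` is invented, and the code has not been run on an actual 1.4 JVM.

```java
// Hypothetical helper; avoids every API added in Java 5.
class Java14XmlFilter {

    static String stripNonValidXMLCharacters(String s) {
        if (s == null) {
            return "";
        }
        StringBuffer out = new StringBuffer(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            // A high surrogate followed by a low surrogate encodes one character
            // in the range 0x10000..0x10FFFF, which XML 1.0 allows: keep both halves.
            if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length()) {
                char d = s.charAt(i + 1);
                if (d >= 0xDC00 && d <= 0xDFFF) {
                    out.append(c).append(d);
                    i += 2;
                    continue;
                }
            }
            // Everything else is a single code unit; unpaired surrogates fail this test.
            if (c == 0x9 || c == 0xA || c == 0xD
                    || (c >= 0x20 && c <= 0xD7FF)
                    || (c >= 0xE000 && c <= 0xFFFD)) {
                out.append(c);
            }
            i++;
        }
        return out.toString();
    }
}
```

Simply substituting `s.charAt(i)` for `codePointAt` brings back the original behaviour, in which characters above 0xFFFF are silently dropped, so the explicit pair check is what matters here.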

ismjml said...

Thanks a lot
Note: Comment imported. Original by stroll at 2009-06-11 13:17

ismjml said...

worked perfectly! Thanks
Note: Comment imported. Original by Anonymous at 2009-06-24 00:22

ismjml said...

Thanks a lot! This function is very useful
Note: Comment imported. Original by Anonymous at 2009-06-25 14:46

ismjml said...

you can eliminate the for loop with regular expressions:



$clean_string=preg_replace('/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u','',$input_string);
Note: Comment imported. Original by asoki at 2009-08-12 10:16

ismjml said...

Thanks a ton! This code snippet helped us big time!!
Note: Comment imported. Original by Shilpa at 2009-08-19 15:50

ismjml said...

Very efficient, thanks a mill
Note: Comment imported. Original by eladco website: http://none at 2009-09-10 17:27

ismjml said...

Yeay, thanks for the Get-Out-Of-Jail-Free card :)
Note: Comment imported. Original by Andy Brook at 2009-10-19 14:10

ismjml said...

Thanks a lot! It saved my day too :)
Note: Comment imported. Original by Kiiro Roshi at 2009-10-21 16:41

ismjml said...

I appreciate your help, I was having the very same problem with JAXB when unmarshalling objects. Thx a lot.
Note: Comment imported. Original by alb at 2009-12-09 12:06

ismjml said...

Excellent ! Exactly what I was needing.
Note: Comment imported. Original by Nicolas at 2009-12-16 09:28

ismjml said...

Saving me too, thanks a lot :D
Note: Comment imported. Original by Gregory Jourdan at 2010-02-12 23:52

ismjml said...

Thank you!!!!!!!
Note: Comment imported. Original by Anonymous at 2010-03-05 06:42

ismjml said...

Hi Mark,



I am dealing with a value that is sourced in systems not in my control/scope and I cannot strip any character that is a valid UTF-8 but not valid XML 1.0. Is there a way to transform these invalid characters into valid XML 1.0 equivalent.



This question has already been asked a few times in the thread.



Many thanks in advance.
Note: Comment imported. Original by Mani at 2010-04-28 10:13
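
XML 1.0 has no built-in way to represent these characters, so any transformation has to be a convention agreed between sender and receiver. One possible convention, sketched below purely as an assumption, escapes each invalid character as a backslash, the letter u and four hex digits, and doubles literal backslashes so the process can be reversed. The names (`XmlSafeEscaper`, `escape`, `unescape`) are hypothetical, and `unescape` assumes its input was produced by `escape`.

```java
// Hypothetical helper; the escape convention is an assumption, not a standard.
class XmlSafeEscaper {

    // The XML 1.0 Char production, applied to a Unicode code point.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces each invalid character with backslash + 'u' + 4 hex digits; doubles backslashes.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp == '\\') {
                out.append("\\\\");
            } else if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                // Every invalid code point is at most 0xFFFF, so four digits always suffice.
                out.append(String.format("\\u%04X", cp));
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // Reverses escape(); expects well-formed input produced by that method.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\') {
                out.append(c);
            } else if (s.charAt(i + 1) == '\\') {
                out.append('\\');
                i += 1; // skip the second backslash
            } else {
                // Skip the 'u' and read the four hex digits that follow it.
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 5;
            }
        }
        return out.toString();
    }
}
```

The escaped text contains only characters that XML 1.0 allows, and the receiving side can call `unescape` after parsing to get the original value back. For genuinely binary payloads, Base64 is usually the simpler choice.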

ismjml said...

Thank you so much! Little shorter for java:



public static String rmNonValidChars(String str) {
    if(str==null) return null;
    StringBuffer s = new StringBuffer();
    for (char c : str.toCharArray()) {
        if ((c == 0x9) || (c == 0xA) || (c == 0xD)
            || ((c >= 0x20) && (c <= 0xD7FF))
            || ((c >= 0xE000) && (c <= 0xFFFD))
            || ((c >= 0x10000) && (c <= 0x10FFFF))) {
            s.append(c);
        }
    }
    return s.toString();
}


Note: Comment imported. Original by Anonymous at 2010-05-10 12:18

Diogo said...

tks, nice solution.

pravin's blog said...

here :"http://professionals-helpdesk.blogspot.com/2011/12/invalid-xml-character-unicode-0x-was.html"
you can find software for removing those invalid unicode character

Mrudula M said...

Thanks. This works even after four years!

sanket said...

This is an excellent post.
I used the C# code mentioned in one of the comments and it fixed my issue.

Thanks a lot for saving me a lot of time.

Stefan Kleineikenscheidt (K15t Software) said...

Hey Mark, very helpful code.

I'd like to include this code in a library for Confluence plugins, which we will probably open-source eventually under a BSD license.

As you haven't indicated any license for the code on your website, I'd like to ask for permission to include that code.

Thanks for feedback!

Cheers,
-Stefan

Lazy Java Developer said...

Hi,

Just a note:

((current >= 0x10000) && (current <= 0x10FFFF))

that is never true as char in Java is 16 bit (0xFFFF) so it can never be anything larger.

Igor Roháľ said...

Thanks so much, finding this saved me a lot of time today :)

Mitesh said...

Thank you so much ..After struggling for half a day ,I found this ..Saved me a day or 2 ..

Pascual said...

Thanx soooooo much!

"when invalid doesn't mean invalid, but means invalid"

Srini said...

Amazing! Thank you!

Kireeti Satya Venkata Ratna said...

Hi, how do I specify a null character in XML with UTF-8 encoding?