Wednesday, February 14, 2007

Invalid XML Characters: when valid UTF8 does not mean valid XML

I was working on a Java application recently when I got the following exception:

org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x1a) was found in the element content of the document.

I was using Castor in an attempt to unmarshal (XML->Java Objects) an XML string. The XML originated from a bookmarks file that was uploaded into the application by the user, tidied up, transformed and then stored in an Oracle database. The error occurred at the point where the XML string was returned from the database and converted into Java objects.

I immediately assumed it to be some kind of character encoding conversion problem.

With some databases (e.g. MySQL) it is possible to have data passed as UTF-8 by adding certain settings to the JDBC URL. I do not use MySQL, so I do not know whether this would have fixed my problem, although I suspect it would not.

I tried various methods to convert my XML string into valid UTF-8, and I was pretty sure that I had achieved a satisfactory conversion, but I still got the error.

This is when I discovered that not all valid UTF-8 characters are valid XML characters. That probably makes sense (what with control characters and such), but I had never had to think about it before.
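To make that distinction concrete, here is a minimal sketch (the class and method names are my own, purely for illustration). It uses the XML parser that ships with the JDK to show that the control character 0x1A survives a UTF-8 round trip untouched, while the same bytes are rejected as XML element content.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Illustrative helper, not part of the original application.
class Utf8VersusXml {

    /** True if the text comes back unchanged after encoding to and decoding from UTF-8. */
    static boolean survivesUtf8RoundTrip(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8).equals(text);
    }

    /** True if the JDK's XML parser accepts the text as the content of an element. */
    static boolean parsesAsElementContent(String text) {
        String doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><a>" + text + "</a>";
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(doc.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // the parser reported a well-formedness error
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String suspect = "book" + (char) 0x1A + "mark";
        System.out.println("UTF-8 round trip ok: " + survivesUtf8RoundTrip(suspect));
        System.out.println("Accepted as XML:     " + parsesAsElementContent(suspect));
    }
}
```

The parser may also print its own error line to standard error when it rejects the document; that is its default error handler at work, not part of the sketch.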

After spending several hours messing with numerous UTF-8 conversion techniques, I eventually found a solution on the Xalan mailing list. I am reproducing it here because it was not mentioned in the context of the "Unicode: 0x1a" error; if it had been, I would have found it more quickly. The XML standard specifies which Unicode characters are valid in XML documents, so it is possible to take a string and filter out all the invalid characters using a method like this:


/**
 * This method ensures that the output String has only
 * valid XML unicode characters as specified by the
 * XML 1.0 standard. For reference, please see
 * <a href="http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char">the
 * standard</a>. This method will return an empty
 * String if the input is null or empty.
 *
 * @param in The String whose non-valid characters we want to remove.
 * @return The in String, stripped of non-valid characters.
 */
public String stripNonValidXMLCharacters(String in) {
    StringBuffer out = new StringBuffer(); // Used to hold the output.
    char current; // Used to reference the current character.

    if (in == null || ("".equals(in))) return ""; // vacancy test.
    for (int i = 0; i < in.length(); i++) {
        current = in.charAt(i); // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF)))
            out.append(current);
    }
    return out.toString();
}

I hope somebody else finds this useful and that it saves them a few hours of head scratching. Alternatively, if people reading this know of a better solution, then please do let me know!

139 comments:

ismjml said...

Thank you so very much, I was searching the internet almost this whole afternoon.... Thanks!
Note: Comment imported. Original by Tjeerd at 2007-02-22 18:33

ismjml said...

Here's the .NET version:

public static string stripNonValidXMLCharacters(string s)
{
    if (string.IsNullOrEmpty(s)) return string.Empty; // vacancy test, done before s is dereferenced.

    StringBuilder _validXML = new StringBuilder(s.Length, s.Length); // Used to hold the output.
    char current; // Used to reference the current character.
    char[] charArray = s.ToCharArray();

    for (int i = 0; i < charArray.Length; i++)
    {
        current = charArray[i]; // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF)))
            _validXML.Append(current);
    }
    return _validXML.ToString();
}
Note: Comment imported. Original by Meno website: http://www.workpac.com/ at 2007-04-04 00:53

ismjml said...

Thank you, I was indeed looking for such a solution for long
Note: Comment imported. Original by Santthosh website: http://www.santthosh.info at 2007-04-04 05:29

ismjml said...

Just wanted to add my thanks for this - and add some google bait. I was seeing



An invalid XML character (Unicode: 0xb) was found in the CDATA section



for my XML, now fixed using this snippet.



Dave
Note: Comment imported. Original by Anonymous at 2007-05-04 13:57

ismjml said...

Hi Mark,



Just wanted to say "Thanks!"



I am pulling data from an AS400 and this saved the day for me!!



Cheers!

Bob
Note: Comment imported. Original by Anonymous at 2007-05-04 20:01

ismjml said...

Thank you so much for sharing this code. That saved me on a major project I am working on.
Note: Comment imported. Original by Karen at 2007-05-22 18:28

ismjml said...

Thanks for the solution, this saved me a lot of time!
Note: Comment imported. Original by Nathan at 2007-06-05 23:47

ismjml said...

Thanks, found your blog after

javax.servlet.ServletException: javax.xml.transform.TransformerException: com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: An invalid XML character (Unicode: 0x0) was found in the element content of the document.
Note: Comment imported. Original by Robot website: http://www.moesol.com at 2007-06-07 08:07

ismjml said...

Thanks. Such a big problem was solved in a few mins. Thank you very much dude.
Note: Comment imported. Original by Leon John at 2007-06-29 19:26

ismjml said...

Thanks!!
Note: Comment imported. Original by gurito website: http://softygen.cross-solution.de at 2007-08-17 11:43

ismjml said...

Thank you very much Mark
Note: Comment imported. Original by Anonymous at 2007-08-27 20:53

ismjml said...

Thanks... This really helped!
Note: Comment imported. Original by Anonymous at 2007-09-25 21:19

ismjml said...

tnx!
Note: Comment imported. Original by Anonymous at 2007-10-05 14:06

ismjml said...

Also wanted to say, thanks! This just saved me a serious parsing headache.
Note: Comment imported. Original by GS at 2007-11-07 18:41

ismjml said...

Hi Mark and everyone,



Thank you so bloody much!



Questions: Is the character 'range' up-to-date? And does it apply to CData Sections as well?
Note: Comment imported. Original by Håkan Jacobsson at 2007-11-09 17:06

ismjml said...

Hi Håkan,



I think the character ranges are correct according to the latest XML specification (Fourth edition, last edited 29 September 2006).





http://www.w3.org/TR/xml/#charsets





Also, CDATA sections are about storing strings of characters that are not to be treated as markup. I am almost certain that they should not contain out of range characters (as this would invalidate the entire XML document).


Note: Comment imported. Original by markmc website: http://content.mark-mclaren.info/ at 2007-11-09 17:50

ismjml said...

Mark,



Thanks so much

I have one more question.

Can I use this range of characters for validation of XML documents not using UTF-8 as encoding

(sorry if this is a newbie question)?

The XML documents I deal with are client product feeds and may not be encoded in UTF-8.
Note: Comment imported. Original by Håkan Jacobsson at 2007-11-11 12:44

ismjml said...

I found this a very difficult question to answer. I find reading detailed specifications hard at the best of times! I have done some digging around and I am still not certain that I have the right answer. My guess is whatever character encoding you use, valid XML must only contain characters from the defined range. I base my guess on the following:





"Remember that character encodings, despite their name do not apply to characters - they apply to byte sequences which represent characters. If you have a char variable in Java it has no character encoding as far as you are concerned, it’s just that character."





However, if you do decide to opt for a character encoding other than UTF-8/UTF-16 you will have additional constraints in order to satisfy XML validity.





You must have an encoding declaration

All data must be encoded in the encoding named in the declaration

All byte sequences must be legal for that encoding




Note: Comment imported. Original by markmc website: http://cse-mjmcl.cse.bris.ac.uk/blog at 2007-11-12 09:33

ismjml said...

Thank you for this function, here is the PHP version:



function strip_invalid_xml_chars( $in ) {
    $out = ""; // Used to hold the output.
    if ( $in === null || $in === "" ) {
        return ""; // vacancy test.
    }
    $length = strlen($in);
    for ( $i = 0; $i < $length; $i++ ) {
        $current = ord($in[$i]); // Byte value of the current character.
        if ( ($current == 0x9) ||
             ($current == 0xA) ||
             ($current == 0xD) ||
             (($current >= 0x20) && ($current <= 0xD7FF)) ||
             (($current >= 0xE000) && ($current <= 0xFFFD)) ||
             (($current >= 0x10000) && ($current <= 0x10FFFF))) {
            $out .= chr($current);
        }
        else {
            $out .= " ";
        }
    }
    return $out;
}
Note: Comment imported. Original by Konzhang website: http://www.qq.com/ at 2007-12-10 05:55

ismjml said...

Greetings,



Thanks a lot Mark, I was able to fix a XML transformation problem by stripping all the non-valid XML characters.


Note: Comment imported. Original by ABr website: http://abr3.wordpress.com at 2007-12-31 10:54

ismjml said...

Mil gracias.
Note: Comment imported. Original by Anonymous at 2008-01-08 12:48

ismjml said...

If you just want to validate a string (and not replace the characters), you can do it easily with a regular expression:



public static boolean iSValidXMLText(String xml) {
    boolean valid = true;

    // NOTE: this pattern describes UTF-8 byte sequences. Applied to a Java String it
    // only behaves as intended when every char holds a single byte value
    // (for example, bytes decoded as ISO-8859-1).
    if( xml != null ) {
        valid = xml.matches("^([\\x09\\x0A\\x0D\\x20-\\x7E]|" //# ASCII
            + "[\\xC2-\\xDF][\\x80-\\xBF]|" //# non-overlong 2-byte
            + "\\xE0[\\xA0-\\xBF][\\x80-\\xBF]|" //# excluding overlongs
            + "[\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2}|" //# straight 3-byte
            + "\\xED[\\x80-\\x9F][\\x80-\\xBF]|" //# excluding surrogates
            + "\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}|" //# planes 1-3
            + "[\\xF1-\\xF3][\\x80-\\xBF]{3}|" //# planes 4-15
            + "\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2})*$"); //# plane 16
    }

    return valid;
}



(borrowed from Here)


Note: Comment imported. Original by dcendents at 2008-01-11 11:56

ismjml said...

Anybody ever done something like this over an entire file?
Note: Comment imported. Original by Anonymous at 2008-02-21 16:11
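On the question of filtering an entire file: one possible approach is to stream the file through the same character test instead of loading it into a String. The following is only a sketch under my own assumptions (the class and method names are hypothetical, and the file is assumed to be UTF-8). It keeps well-formed surrogate pairs, which always encode code points in the allowed range 0x10000 to 0x10FFFF, and drops lone surrogates.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: streams characters and drops those XML 1.0 does not allow.
class XmlStreamFilter {

    /** XML 1.0 Char test for a single UTF-16 code unit that is not a surrogate. */
    static boolean isValidBmpChar(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    /** Copies in to out, skipping invalid characters and unpaired surrogates. */
    static void copyFiltered(Reader in, Writer out) throws IOException {
        PushbackReader reader = new PushbackReader(in, 1);
        int c;
        while ((c = reader.read()) != -1) {
            if (Character.isHighSurrogate((char) c)) {
                int next = reader.read();
                if (next != -1 && Character.isLowSurrogate((char) next)) {
                    out.write(c);      // well-formed pair: keep both halves
                    out.write(next);
                } else if (next != -1) {
                    reader.unread(next); // lone high surrogate dropped; re-examine next
                }
            } else if (isValidBmpChar(c)) {
                out.write(c);
            }
        }
    }

    /** Filters a UTF-8 file into a new UTF-8 file. */
    static void filterFile(Path source, Path target) throws IOException {
        try (Reader in = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            copyFiltered(in, out);
        }
    }

    /** Convenience wrapper for in-memory text. */
    static String filter(String text) {
        StringWriter out = new StringWriter();
        try {
            copyFiltered(new StringReader(text), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Note that this filters raw characters only. If the file is itself an XML document, markup such as numeric character references is left untouched.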

ismjml said...

I used the method, but could not get it to work. I still get $#19; as an invalid XML char. I decoded the String using UTF-8, and str.charAt(i) returned $, #, 1, 9, and ; respectively. Individually, these are valid XML chars. Could you advise what I did wrong? Thanks
Note: Comment imported. Original by Anonymous at 2008-02-25 17:57

ismjml said...

Please note: your code is broken.

The condition "current <= 0x10FFFF" will not work, as a Java char has only 16 bits.

Java stores Strings as UTF-16 and "exposes" this encoding in charAt(x).

This is at least a quirk in Java; e.g. String.length() returns too much.

Fixing the code would require decoding UTF-16.
Note: Comment imported. Original by Anonymous at 2008-02-25 23:35
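The point about UTF-16 can be checked with a few lines of Java. This is a small illustrative sketch (the class name is made up) using U+1F600, a character outside the Basic Multilingual Plane: the String holds two chars for one code point, and no single char can ever fall in the 0x10000 to 0x10FFFF range.

```java
// Illustrative only: shows how one supplementary character appears through the char API.
class CodeUnitDemo {

    // U+1F600 is stored as the surrogate pair 0xD83D, 0xDE00.
    static final String GRINNING_FACE = new String(Character.toChars(0x1F600));

    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int codeUnits() {
        return GRINNING_FACE.length();
    }

    /** Number of Unicode code points actually present. */
    static int codePoints() {
        return GRINNING_FACE.codePointCount(0, GRINNING_FACE.length());
    }

    /** The range test from the original method; a 16-bit char can never satisfy it. */
    static boolean inSupplementaryRange(char c) {
        return c >= 0x10000 && c <= 0x10FFFF;
    }

    public static void main(String[] args) {
        System.out.println("length()        = " + codeUnits());
        System.out.println("codePointCount  = " + codePoints());
        System.out.println("charAt(0)       = 0x" + Integer.toHexString(GRINNING_FACE.charAt(0)));
        System.out.println("codePointAt(0)  = 0x" + Integer.toHexString(GRINNING_FACE.codePointAt(0)));
    }
}
```

Because both halves of the pair lie between 0xD800 and 0xDFFF, the char-by-char test discards them, so the original method silently removes valid supplementary characters.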

ismjml said...

Couldn't you also do this? We did not want to actually remove the characters just replace them. Feedback/corrections welcome:



public String stripNonValidXMLCharacters(String in) {
    StringBuffer out = new StringBuffer(); // Used to hold the output.
    char current; // Used to reference the current character.

    if (in == null || ("".equals(in))) {
        return "";
    } // vacancy test.
    for (int i = 0; i < in.length(); i++) {
        current = in.charAt(i); // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF))) {
            out.append(current);
        } else {
            out.append("_");
        }
    }

    return out.toString();
}
Note: Comment imported. Original by John Seals at 2008-02-26 21:11

ismjml said...

I'm working on doing that I'll post it when I'm done
Note: Comment imported. Original by sidd at 2008-03-03 18:42

ismjml said...

Thank you so much. I had a huge data set that we converted into XML, and about 300 XML files failed due to invalid chars in them. I was trying to code some method to do this myself, but what you have is much better than what I had in mind. THANK YOU!!!
Note: Comment imported. Original by sidd at 2008-03-03 18:41

ismjml said...

Hi,

Thanks for your solution.

(I apologize, I do not speak very good English.)

I have the same problem. I use JAXB and JAX-WS.

I have a web service that builds an Adobe PDF file and forwards it in a String object. A client of the service rebuilds the PDF file on the other side.

When the web service client unmarshals the response, it generates this exception: An invalid XML character (Unicode: 0x2) was found in the element content of the document. The PDF document stored in the String object contains characters that are valid for the PDF reader but invalid for the XML parser.

When I use your solution, the invalid characters are removed and the web service works correctly.

But when I recover the PDF after calling my service, it is corrupted. The removed characters are necessary to read the PDF properly.

I tried multiple ways of converting the String object to UTF-8 encoding, but nothing works. Those characters are always present and obviously necessary.

Is there a way to replace (not remove) the invalid characters in the original String so that it passes the XML parser, and then to restore them after parsing so that the original message is rebuilt identically?

Thanks for your help.
Note: Comment imported. Original by Florent at 2008-03-07 17:26

ismjml said...

In the case of marshalling binary files, the solution in the end is to encode the PDF data in Base64.

More information here: http://www.javaworld.com/javaworld/javatips/jw-javatip117.html

In my case, I use:

sun.misc.BASE64Encoder and sun.misc.BASE64Decoder.
Note: Comment imported. Original by Florent at 2008-03-10 09:08
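The Base64 approach from the comment above can be sketched as follows. Note the assumption: this uses java.util.Base64, which is part of the standard library from Java 8 onwards, rather than the internal sun.misc classes the commenter used. The class name is hypothetical.

```java
import java.util.Base64;

// Illustrative wrapper around the standard Base64 codec (Java 8+).
class BinaryInXml {

    /** Turns arbitrary bytes into text made only of A-Z, a-z, 0-9, '+', '/' and '='. */
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    /** Restores the exact original bytes on the receiving side. */
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // "%PDF" followed by bytes that would be illegal as XML characters.
        byte[] pdfLike = {0x25, 0x50, 0x44, 0x46, 0x02, 0x00, (byte) 0xFF};
        String safe = encode(pdfLike);
        System.out.println(safe);
        System.out.println(java.util.Arrays.equals(pdfLike, decode(safe)));
    }
}
```

Every character in the encoded text is a legal XML character, so nothing needs to be stripped and the binary payload is recovered byte for byte.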

ismjml said...

Works like a charm... This was a real time saver for me.

Thanks
Note: Comment imported. Original by Anonymous at 2008-03-17 21:18

ismjml said...

Good piece of code!

I had problems getting MSXML to parse some apparently valid characters, but it seems they weren't valid.



Cheers.
Note: Comment imported. Original by 5ubliminal website: http://www.tellinya.com/ at 2008-03-20 02:32

ismjml said...

THANK YOU THANK YOU THANK YOU

Whoever posted the PHP function saved me.

function strip_invalid_xml_chars2( $in )
{
    $out = "";
    $length = strlen($in);

    for ( $i = 0; $i < $length; $i++ )
    {
        $current = ord($in[$i]);

        if ( ($current == 0x9) || ($current == 0xA) || ($current == 0xD) ||
             (($current >= 0x20) && ($current <= 0xD7FF)) ||
             (($current >= 0xE000) && ($current <= 0xFFFD)) ||
             (($current >= 0x10000) && ($current <= 0x10FFFF)) )
        {
            $out .= chr($current);
        }
        else
        {
            $out .= " ";
        }
    }

    return $out;
}
Note: Comment imported. Original by ransom website: http://www.ransom.vg at 2008-03-21 19:35

ismjml said...

Mark,



I owe you much beer. Can I use this function (while giving you credit) in my AntiSamy project?



It's published under the BSD license.



Cheers,

Arshan




Note: Comment imported. Original by Arshan D. website: http://i8jesus.com/ at 2008-03-25 14:55

ismjml said...

Arshan, please use it freely with my blessing. (it wasn't my code originally anyhow, I just reproduced it in this context).



Mark
Note: Comment imported. Original by markmc website: http://cse-mjmcl.cse.bris.ac.uk at 2008-03-26 19:11

ismjml said...

Absolutely brilliant! Saved me from a big headache!



Thanks!
Note: Comment imported. Original by Anonymous at 2008-04-24 17:20

ismjml said...

Very nice, it fixed some problems we have over here. Here is a Delphi version:





function ValidXmlString(Value : wideString) : WideString;
var
  NewLen,
  idx : integer;
  CurChar : word;
begin
  // initialize
  Result := '';
  NewLen := 0;

  // check for empty string
  if Length(Value) = 0 then
    exit;

  // init length of result
  Result := stringofchar(' ',length(Value));

  // loop and check valid xml characters
  for idx := 1 to length(Value) do
  begin
    CurChar := ord(Value[idx]);
    if (CurChar = $9) or (CurChar = $A) or (CurChar = $D) or
       ((CurChar >= $20) and (CurChar <= $D7FF)) or
       ((CurChar >= $E000) and (CurChar <= $FFFD)) or
       ((CurChar >= $10000) and (CurChar <= $10FFFF)) then
    begin
      inc(NewLen);
      Result[NewLen] := Value[idx];
    end;
  end;

  // adjust size of result
  Result := copy(Result,1,NewLen);
end; // ValidXmlString
Note: Comment imported. Original by Maurits van RIjnen at 2008-05-06 08:40

ismjml said...

Thank you very much!
Note: Comment imported. Original by Dario at 2008-06-10 22:33

ismjml said...

I am using the service in my code, but I am not able to test it :( Please post if somebody has tested the service before. The service will trim 0x13; if somebody has data with this character, please post.
Note: Comment imported. Original by vishal paisal at 2008-07-03 12:54

ismjml said...

Thanks you very much, I am sure you have saved atleast 10hrs of our time.



Keep posting this kind of useful stuff.
Note: Comment imported. Original by Anonymous at 2008-07-10 20:10

ismjml said...

if you just want to get rid of control characters you can use a regex. In c#:



xml = Regex.Replace(xml, "&\\#x(?:0[0-8BCEF]|1[0-9A-F]);", "");


Note: Comment imported. Original by jba at 2008-09-11 09:09

ismjml said...

Could you please help me write the regular expression in Java?
Note: Comment imported. Original by Anonymous at 2008-09-12 11:14
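A direct Java translation of the C# regular expression above might look like the sketch below (the class and method names are hypothetical). Like the original, it removes hexadecimal character references such as &#x1A; that point at control characters. It keeps the original's limits too: references must be written with exactly two upper-case hex digits, and decimal references such as &#26; are not touched.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a Java port of the C# Regex.Replace call from the comment above.
class CharRefStripper {

    // Matches &#x00; to &#x08;, &#x0B;, &#x0C;, &#x0E;, &#x0F; and &#x10; to &#x1F;.
    // &#x09;, &#x0A; and &#x0D; (tab, line feed, carriage return) are deliberately kept.
    private static final Pattern CONTROL_CHAR_REF =
            Pattern.compile("&#x(?:0[0-8BCEF]|1[0-9A-F]);");

    /** Removes character references to control characters that XML 1.0 forbids. */
    static String strip(String xml) {
        return CONTROL_CHAR_REF.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("<a>book&#x1A;mark&#x0A;</a>"));
    }
}
```

This complements the character filter from the post rather than replacing it: the filter removes raw control characters, while this pattern removes their escaped form inside markup.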

ismjml said...

More than one year later and the solution you've posted is still up to date! Thanks a lot for the code Mark. It certainly saved me a lot of headaches!



Cheers,

Ricardo
Note: Comment imported. Original by Ricardo at 2008-09-26 11:42

ismjml said...

Thanks for the help!
Note: Comment imported. Original by Craig Sumner website: http://www.craigsumner.net at 2008-10-16 18:55

ismjml said...

Thank you!!! Saved me a lot of time and frustration with this function.
Note: Comment imported. Original by V at 2008-10-28 20:36

ismjml said...

Thanks a lot!
Note: Comment imported. Original by geert at 2008-11-24 14:01

ismjml said...

function stripNonValidXMLCharacters(in)
{
    var current;
    var out = "";

    for (var i = 0; i < in.length; i++)
    {
        current = in.charCodeAt(i);

        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF)))
            out += in.charAt(i);
    }

    return out;
}
Note: Comment imported. Original by Leni website: http://www.zindus.com at 2008-11-25 00:13

ismjml said...

Thanks for helping me identify this problem.
Note: Comment imported. Original by Jay at 2008-12-19 03:43

ismjml said...

hey Leni.. isn't 'in' a reserved keyword in javascript ?
Note: Comment imported. Original by Anonymous at 2009-01-12 06:20

ismjml said...

"in" is a reserved word in JavaScript!
Note: Comment imported. Original by markmc website: http://content.mark-mclaren.info/ at 2009-01-12 08:19

ismjml said...

Thanks a Lot. I saved a lot of time and it worked.otherwise, I would have wasted so much of my time trying to find a solution for this problem



Thanks once again
Note: Comment imported. Original by Anonymous at 2009-01-22 13:40

ismjml said...

Thanks very much Mark!
Note: Comment imported. Original by Arash Jooracbhi at 2009-02-25 18:04

ismjml said...

I think that the solution is broken because Java chars cannot exceed 0xFFFF, so some comparisons are useless.

The idea is to compare Unicode code points, not UTF-16 code units (the Java char type).

Here is my solution:

/**
 * This method ensures that the output String has only valid XML unicode characters as specified by the
 * XML 1.0 standard. For reference, please see the
 * standard. This method will return an empty String if the input is null or empty.
 *
 * @author Donoiu Cristian, GPL
 * @param s The String whose non-valid characters we want to remove.
 * @return The in String, stripped of non-valid characters.
 */
public static String removeInvalidXMLCharacters(String s) {
    if (s == null) return ""; // As documented: null or empty input gives an empty String.
    StringBuilder out = new StringBuilder(); // Used to hold the output.
    int codePoint; // Used to reference the current character.
    //String ss = "\ud801\udc00"; // This is actually one unicode character, represented by two code units!!!
    //System.out.println(ss.codePointCount(0, ss.length())); // See: 1
    int i = 0;
    while (i < s.length()) {
        codePoint = s.codePointAt(i); // This is the unicode code of the character.
        if ((codePoint == 0x9) || // Consider testing larger ranges first to improve speed.
            (codePoint == 0xA) ||
            (codePoint == 0xD) ||
            ((codePoint >= 0x20) && (codePoint <= 0xD7FF)) ||
            ((codePoint >= 0xE000) && (codePoint <= 0xFFFD)) ||
            ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
            out.append(Character.toChars(codePoint));
        }
        i += Character.charCount(codePoint); // Increment with the number of code units (java chars) needed to represent a Unicode char.
    }
    return out.toString();
}

Hope it helps.
Note: Comment imported. Original by Donoiu Cristian at 2009-03-02 11:16

ismjml said...

Anyone know why XML parsers or deserialization objects throw when an invalid character is found? Just curious about the motivation for throwing an exception rather than ignoring the character.
Note: Comment imported. Original by bock at 2009-03-18 17:29

ismjml said...

Hi Bock,



Good question! I will try to answer it! In the above error the XML parser is correctly reporting that this data does not conform to the strict guidelines of the XML specification. It is the parser's job to report any non-compliance to any extent (even if it seems minor). However, in my particular case it is sufficient to discard the non-standard characters and continue working with the data BUT in another situation this course of action might not be appropriate.



Mark


Note: Comment imported. Original by markmc website: http://content.mark-mclaren.info/ at 2009-03-19 21:26

ismjml said...

Thanks very much Mark!
Note: Comment imported. Original by Anonymous at 2009-04-16 18:42

ismjml said...

Saved me a bunch of time. Thanks!
Note: Comment imported. Original by Jessica King at 2009-04-23 20:29

ismjml said...

Thanks a lot.
Note: Comment imported. Original by Anonymous website: http://students.info.uaic.ro/~ionut.bujdei at 2009-05-20 05:43

ismjml said...

Doesn't filter this out:

§



The double S looking character breaks an XML document.
Note: Comment imported. Original by Anonymous at 2009-05-20 21:26

ismjml said...

Thanks a ton for this code. It really saved a lot of search and work !!
Note: Comment imported. Original by Saurabh Sule at 2009-05-22 06:55

ismjml said...

Or the Scala Version (I'm a scala noob, so don't rib me too hard)...





class ValidXmlChar(chr:Char) {
  def isValidXmlChr() = {
    ((chr == 0x9) ||
     (chr == 0xA) ||
     (chr == 0xD) ||
     ((chr >= 0x20) && (chr <= 0xD7FF)) ||
     ((chr >= 0xE000) && (chr <= 0xFFFD)) ||
     ((chr >= 0x10000) && (chr <= 0x10FFFF)))
  }
}

object ValidXmlChar {
  implicit def char2ValidXMLChar(chr:Char) = new ValidXmlChar(chr)
}

import ValidXmlChar.{_}

object Main {

  def main(args: Array[String]) :Unit = {
    // Assuming args(0) is a string you want to clean
    println(stripNonValidXMLCharacters(args(0)))
  }

  def stripNonValidXMLCharacters(s:String) = {
    val sb = new StringBuffer()
    s.filter(_.isValidXmlChr).foreach(sb.append(_))
    sb
  }

}
Note: Comment imported. Original by Koppe at 2009-06-08 18:08

ismjml said...

We're using Java 1.4.2 and a few methods can't seem to be found. Is there an equivalent for...?



-Character.toChars()

-Character.charCount()



I used the following:

codePoint = s.charAt(i);
Note: Comment imported. Original by Ken at 2009-06-09 14:54
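Regarding the Java 1.4.2 question: Character.toChars() and Character.charCount() were only added in Java 5, and substituting s.charAt(i) brings back the 16-bit limitation discussed earlier in the thread. One possible workaround, sketched here with a made-up class name, is to recognise surrogate pairs by their numeric ranges using nothing newer than charAt and StringBuffer.

```java
// Hypothetical sketch that avoids the code point API introduced in Java 5.
class XmlFilter14 {

    /** Removes characters not allowed by XML 1.0, keeping well-formed surrogate pairs. */
    static String strip(String s) {
        if (s == null) {
            return "";
        }
        StringBuffer out = new StringBuffer(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c >= 0xD800 && c <= 0xDBFF) {
                // High surrogate: valid only when a low surrogate follows immediately.
                if (i + 1 < s.length()) {
                    char d = s.charAt(i + 1);
                    if (d >= 0xDC00 && d <= 0xDFFF) {
                        // Together they encode a code point in 0x10000..0x10FFFF.
                        out.append(c).append(d);
                        i += 2;
                        continue;
                    }
                }
                i++; // lone high surrogate: drop it
            } else {
                // Lone low surrogates (0xDC00..0xDFFF) match none of these ranges and are dropped.
                if (c == 0x9 || c == 0xA || c == 0xD
                        || (c >= 0x20 && c <= 0xD7FF)
                        || (c >= 0xE000 && c <= 0xFFFD)) {
                    out.append(c);
                }
                i++;
            }
        }
        return out.toString();
    }
}
```

The explicit range checks stand in for Character.isHighSurrogate and Character.isLowSurrogate, which are also unavailable before Java 5.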

ismjml said...

Thanks a lot
Note: Comment imported. Original by stroll at 2009-06-11 13:17

ismjml said...

worked perfectly! Thanks
Note: Comment imported. Original by Anonymous at 2009-06-24 00:22

ismjml said...

Thanks a lot! This function is very useful
Note: Comment imported. Original by Anonymous at 2009-06-25 14:46

ismjml said...

you can eliminate the for loop with regular expressions:



$clean_string = preg_replace('/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u', '', $input_string);
Note: Comment imported. Original by asoki at 2009-08-12 10:16
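The same single-expression idea can be written in Java. This is a sketch with a hypothetical class name; it relies on java.util.regex matching by code point, and on the \x{...} escape that Pattern supports from Java 7 onwards, so supplementary characters are compared against the 0x10000 to 0x10FFFF range as whole characters.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a negated character class listing everything XML 1.0 allows.
class XmlRegexFilter {

    private static final Pattern INVALID_XML_CHAR = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    /** Returns the input with every character outside the allowed ranges removed. */
    static String strip(String s) {
        if (s == null) {
            return "";
        }
        return INVALID_XML_CHAR.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("book" + (char) 0x1A + "mark"));
    }
}
```

Compiling the pattern once and reusing it avoids the cost of recompiling on every call, which String.replaceAll would incur.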

ismjml said...

Thanks a ton! This code snippet helped us big time!!
Note: Comment imported. Original by Shilpa at 2009-08-19 15:50

ismjml said...

Very efficient, thanks a mill
Note: Comment imported. Original by eladco website: http://none at 2009-09-10 17:27

ismjml said...

Yeay, thanks for the Get-Out-Of-Jail-Free card :)
Note: Comment imported. Original by Andy Brook at 2009-10-19 14:10

ismjml said...

Thanks a lot! It saved my day too :)
Note: Comment imported. Original by Kiiro Roshi at 2009-10-21 16:41

ismjml said...

I appreciate your help, I was having the very same problem with JAXB when unmarshalling objects. Thx a lot.
Note: Comment imported. Original by alb at 2009-12-09 12:06

ismjml said...

Excellent ! Exactly what I was needing.
Note: Comment imported. Original by Nicolas at 2009-12-16 09:28

ismjml said...

Saving me too, thanks a lot :D
Note: Comment imported. Original by Gregory Jourdan at 2010-02-12 23:52

ismjml said...

Thank you!!!!!!!
Note: Comment imported. Original by Anonymous at 2010-03-05 06:42

ismjml said...

Hi Mark,



I am dealing with a value that is sourced in systems not in my control/scope and I cannot strip any character that is a valid UTF-8 but not valid XML 1.0. Is there a way to transform these invalid characters into valid XML 1.0 equivalent.



This question has already been asked a few times in the thread.



Many thanks in advance.
Note: Comment imported. Original by Mani at 2010-04-28 10:13
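XML 1.0 has no way to represent these characters at all, not even as character references, so any "equivalent" has to be a convention agreed between sender and receiver. One possible convention, sketched below purely as an assumption (the class, the backslash-based escape format and the method names are all invented for illustration), writes each forbidden character as a backslash, the letter u and four hex digits, and doubles literal backslashes so the mapping can be reversed exactly.

```java
// Hypothetical, application-level escaping scheme; it is not part of any XML standard.
class XmlSafeEscaper {

    private static boolean isValidBmpChar(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    /** Replaces characters XML 1.0 forbids with a visible, reversible marker. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean pair = Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1));
            if (pair) {
                out.append(c).append(s.charAt(++i)); // valid supplementary character
            } else if (c == '\\') {
                out.append("\\\\");                  // escape the escape character itself
            } else if (isValidBmpChar(c)) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    /** Reverses escape(); expects input that escape() produced. */
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            char next = s.charAt(++i);
            if (next == '\\') {
                out.append('\\');
            } else {
                // Marker letter followed by exactly four hex digits.
                out.append((char) Integer.parseInt(s.substring(i + 1, i + 5), 16));
                i += 4;
            }
        }
        return out.toString();
    }
}
```

The sender would call escape() before building the XML and the receiver unescape() after parsing. For genuinely binary payloads, Base64 as mentioned earlier in the thread is usually the simpler choice.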

ismjml said...

Thank you so much! Little shorter for java:



public static String rmNonValidChars(String str) {
    if(str==null) return null;
    StringBuffer s = new StringBuffer();
    for (char c : str.toCharArray()) {
        if ((c == 0x9) || (c == 0xA) || (c == 0xD)
            || ((c >= 0x20) && (c <= 0xD7FF))
            || ((c >= 0xE000) && (c <= 0xFFFD))
            || ((c >= 0x10000) && (c <= 0x10FFFF))) {
            s.append(c);
        }
    }
    return s.toString();
}
Note: Comment imported. Original by Anonymous at 2010-05-10 12:18

Diogo said...

tks, nice solution.

pravin's blog said...

here :"http://professionals-helpdesk.blogspot.com/2011/12/invalid-xml-character-unicode-0x-was.html"
you can find software for removing those invalid unicode character

Mrudula M said...

Thanks. This works even after four years!

sanket said...

This is an excellent post.
I used the C# code mentioned in one of the comments and it fixed my issue.

Thanks a lot for saving me a lot of time.

Stefan Kleineikenscheidt (K15t Software) said...

Hey Mark, very helpful code.

I'd like to include this code in a library for Confluence plugins, which I will probably open-source eventually under a BSD license.

As you haven't indicated any license for the code on your website, I'd like to ask for permission to include that code.

Thanks for feedback!

Cheers,
-Stefan

Lazy Java Developer said...

Hi,

Just a note:

((current >= 0x10000) && (current <= 0x10FFFF))

that is never true as char in Java is 16 bit (0xFFFF) so it can never be anything larger.

Igor Roháľ said...

Thanks so much, finding this saved me a lot of time today :)

Mitesh said...

Thank you so much ..After struggling for half a day ,I found this ..Saved me a day or 2 ..

Pascual said...

Thanx soooooo much!

"when invalid doesn't mean invalid, but means invalid"

Srini said...

Amazing! Thank you!

Kireeti Satya Venkata Ratna said...

Hi, how do I specify a null character in XML with UTF-8 encoding?

Bogdan Kulyk said...

Big thanks to you man!! You saved me a lot of time!!
Best wishes to you :)

Willem said...

Hi here.

I am a complete newbie to this kind of thing, so please try not to laugh too hard at my request.

I created a website with WordPress with about 5,000 products. Now I am trying to list my products on an action site in our country. He gave me a plugin to install to pull our product feed from the site. The problem is that the feed keeps failing because of a 0x3 invalid XML character error.

If I had to go through all the product descriptions to try to find the illegal characters, it would take me forever.

It's a wordpress site with woocommerce installed.

I know that most of the data gets stored on the database.

Is there a way to seek and destroy
these illegal characters via phpmyadmin

the table in question is wp_posts
and the columns in question are post_excerpt, post_content and possibly post_title.
the post type is 'product'

I would appreciate any help so much!!

isanjaykp said...

This has helped

Fernando Silva said...

Thanks a lot! It took me also some hours trying to fix and nothing was solving the error.

Oladipo said...

This should work in VB6, VBA, VBScript.


Function RemoveIllegalXMLCharacters(Content)
    Dim Copied(), I, J, current, currentW
    J = 0
    ReDim Copied(Len(Content))

    For I = 1 To Len(Content)
        current = Mid(Content, I, 1)
        currentW = AscW(current)
        ' AscW returns a signed 16-bit value, so map negative results back to 32768..65535.
        If currentW < 0 Then currentW = currentW + 65536

        If (currentW = 9 Or currentW = 10 Or currentW = 13) _
            Or ((currentW >= 32) And (currentW <= 55295)) _
            Or ((currentW >= 57344) And (currentW <= 65533)) _
            Or ((currentW >= 65536) And (currentW <= 1114111)) Then
            Copied(J) = current
            J = J + 1
        End If
    Next

    ReDim Preserve Copied(J)

    RemoveIllegalXMLCharacters = Join(Copied, "")
End Function

n1rmus said...

Thanks a lot!

Ansh said...

Thank you so much.

manoj said...

How is this applicable if we use the axis library.
I am not able to get the raw soap message.
I can only apply the validation once i get the message object in String.

Nilay Vora said...

Worked like a charm in my case! Thank You so much! Keep posting the good work Mark! (Y)
