svtools: HTML import: don't put lone surrogates in OUString

The bytes "ed b3 b5" in fdo67610-1.doc (which, as the name indicates, is an HTML file) are converted to the lone UTF-16 low surrogate "dcf5", which is inserted into SwTextNode and causes asserts later on.

The actual encoding of the HTML document is probably GBK (at least Vim doesn't display any missing characters with that), but because the document doesn't contain any indication of its encoding, it's apparently imported as UTF-8. It is a bit of a surprise that ImplConvertUtf8ToUnicode() treats a surrogate code point as valid even when the Java-compatible mode RTL_TEXTENCODING_JAVA_UTF8 is not specified.

Change-Id: Idd788d9d461fed150171dd907439166f3075a834
author: Michael Stahl <mstahl@redhat.com> 2017-09-07 23:01:26 +0200
committer: Michael Stahl <mstahl@redhat.com> 2017-09-07 23:22:11 +0200
commit: fc670f637d4271246691904fd649358ce2e7be59 (patch)
tree: 0eee10cd701f0479d4ed8ca7287defefef6af29e /svtools
parent: 554a79d793ee9546f71802643b79001749c3c695 (diff)
1 files changed, 2 insertions, 1 deletions
diff --git a/svtools/source/svrtf/svparser.cxx b/svtools/source/svrtf/svparser.cxx
index 541aa5276c2d..cef258f04dd2 100644
--- a/svtools/source/svrtf/svparser.cxx
+++ b/svtools/source/svrtf/svparser.cxx
@@ -423,7 +423,8 @@ sal_uInt32 SvParser<T>::GetNextChar()
         while( 0 == nChars  && !bErr );
     }
 
-    if ( ! rtl::isUnicodeCodePoint( c ) )
+    // Note: ImplConvertUtf8ToUnicode() may produce a surrogate!
+    if (!rtl::isUnicodeCodePoint(c) || rtl::isHighSurrogate(c) || rtl::isLowSurrogate(c))
         c = '?' ;
 
     if( bErr )
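
To illustrate the commit message, the following is a minimal standalone sketch, not the LibreOffice implementation. It assumes a naive three-byte UTF-8 decoder that does not reject surrogates (the behaviour the message attributes to ImplConvertUtf8ToUnicode()), and shows that "ed b3 b5" decodes to 0xDCF5 and that the patched condition replaces it with '?'. The names decodeThreeByteUtf8 and sanitize are hypothetical, and the three predicates are local stand-ins for the rtl:: functions used in the patch.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-ins (assumption) for rtl::isHighSurrogate, rtl::isLowSurrogate
// and rtl::isUnicodeCodePoint, so the sketch builds without LibreOffice headers.
constexpr bool isHighSurrogate(std::uint32_t c) { return c >= 0xD800 && c <= 0xDBFF; }
constexpr bool isLowSurrogate(std::uint32_t c) { return c >= 0xDC00 && c <= 0xDFFF; }
constexpr bool isUnicodeCodePoint(std::uint32_t c) { return c <= 0x10FFFF; }

// Hypothetical naive decoder for one three-byte UTF-8 sequence.
// It deliberately performs no surrogate check.
std::uint32_t decodeThreeByteUtf8(std::uint8_t b0, std::uint8_t b1, std::uint8_t b2)
{
    // Expect the bit patterns 1110xxxx 10xxxxxx 10xxxxxx.
    assert((b0 & 0xF0) == 0xE0 && (b1 & 0xC0) == 0x80 && (b2 & 0xC0) == 0x80);
    return (std::uint32_t(b0 & 0x0F) << 12)
         | (std::uint32_t(b1 & 0x3F) << 6)
         |  std::uint32_t(b2 & 0x3F);
}

// Hypothetical helper applying the same condition as the patched check
// in SvParser<T>::GetNextChar().
std::uint32_t sanitize(std::uint32_t c)
{
    if (!isUnicodeCodePoint(c) || isHighSurrogate(c) || isLowSurrogate(c))
        c = std::uint32_t('?');
    return c;
}
```

Calling sanitize(decodeThreeByteUtf8(0xED, 0xB3, 0xB5)) yields '?', whereas the pre-patch condition, which only tested isUnicodeCodePoint, would have let 0xDCF5 through unchanged.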