[API CHANGE] rtl_convertTextToUnicode behavior upon erroneous input

<http://udk.openoffice.org/cpp/man/spec/textconversion.html> specifies the following for FLAGS_UNDEFINED_ERROR, FLAGS_MBUNDEFINED_ERROR, and FLAGS_INVALID_ERROR: "Read past the [erroneous] code in the input buffer [...]"

But the actual behavior of rtl_convertTextToUnicode for the various rtl_TextEncoding values has been inconsistent. Some erroneous input (mostly single-byte UNDEFINED and INVALID ones) has not been consumed at all, some (multi-byte MBUNDEFINED and INVALID) has been consumed partly, and some has been consumed fully as required.

However, at least since 8dd4265b9ddbd7786b6237676909eae5b540da0e "CWS-TOOLING: integrate CWS hb18", Custom8BitToUnicode in sw/source/filter/ww8/ww8par.cxx appears to rely on the broken behavior of not consuming erroneous input. (It reads the chunk of valid input with e.g. some RTL_TEXTENCODING_MS_125x that happens to exhibit the broken behavior of not consuming erroneous input, then tries to re-read the erroneous input with RTL_TEXTENCODING_MS_1252. For example, opening sw/qa/core/data/ww8/pass/forcepoint50-grfanchor-1.doc triggers that code. For whatever reason, the am_faksas.dot attached to <https://bz.apache.org/ooo/show_bug.cgi?id=9240#c1> "Do not show lithuanian letter 'Š'" appears not to trigger that code, or at least no longer does.)

Therefore, it would be useful to have a mode in which rtl_convertTextToUnicode does not consume erroneous input. (I also plan to make changes in sal/osl/unx/file* that would benefit from that behavior.)

But changing rtl_convertTextToUnicode to generally not consume erroneous input would not be feasible: If calls do not set RTL_TEXTTOUNICODE_FLAGS_FLUSH, part of an erroneous input can already have been consumed by a previous call, so the current call cannot undo that.

A change that looks like it can work is to change the behavior only if RTL_TEXTTOUNICODE_FLAGS_FLUSH is set. In that case we can at least not consume the part of an erroneous input that has not yet been consumed by a previous call (which would necessarily have been done with RTL_TEXTTOUNICODE_FLAGS_FLUSH unset). The expectation is that code that relies on the don't-consume behavior will do only single calls with RTL_TEXTTOUNICODE_FLAGS_FLUSH set (and so reliably not consume any of the erroneous input), while other code (which might do calls in a loop) will not care whether erroneous input has been consumed anyway.

This can be considered a mild form of behavioral API CHANGE (but note that the old implementation didn't exhibit the specified behavior anyway).

So all implementations of rtl_convertTextToUnicode for the various rtl_TextEncoding values have been adapted to the new behavior. The only exceptions are ImplDummyToUnicode (sal/textenc/textcvt.cxx), which is a special case anyway, used by RTL_TEXTENCODING_DONTKNOW, and two out of three places (each marked with a "TODO") in ImplUTF7ToUnicode (sal/textenc/tcvtutf7.cxx), where it is hard to retrofit the expected behavior; RTL_TEXTENCODING_UTF7 is probably not relevant for the use cases relying on the don't-consume behavior anyway.

Whether a similar change should be made for rtl_convertUnicodeToText can be examined later.

Change-Id: I1ac2c4cfd99e2a0eca219f9a3855ef110b254855
Reviewed-on: https://gerrit.libreoffice.org/78584
Tested-by: Jenkins
Reviewed-by: Stephan Bergmann <sbergman@redhat.com>
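To illustrate the caller-side pattern the message describes (convert with a primary encoding, then re-read the unconsumed erroneous input with a fallback), here is a minimal sketch. It does not use the sal API: toyPrimary, toyFallback, and decodeWithFallback are hypothetical names, the primary converter is ASCII-only, and the fallback is a Latin-1-like mapping rather than real Windows-1252. It only shows why the caller needs the erroneous byte to be left unconsumed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical primary converter (not a sal function): ASCII only.
// It stops at the first byte >= 0x80 and leaves that byte unconsumed.
// Returns the number of bytes consumed.
std::size_t toyPrimary(const unsigned char* src, std::size_t len,
                       std::u16string& out)
{
    std::size_t i = 0;
    while (i < len && src[i] < 0x80) {
        out.push_back(static_cast<char16_t>(src[i]));
        ++i;
    }
    return i;
}

// Hypothetical fallback: maps a byte to the code point of the same value
// (Latin-1-like; an assumption, not the real MS-1252 table).
char16_t toyFallback(unsigned char b)
{
    return static_cast<char16_t>(b);
}

// Caller pattern: convert the valid chunk with the primary converter, then
// re-read the single erroneous byte with the fallback, and repeat.
std::u16string decodeWithFallback(const unsigned char* src, std::size_t len)
{
    assert(src != nullptr || len == 0);
    std::u16string out;
    std::size_t pos = 0;
    while (pos < len) {
        pos += toyPrimary(src + pos, len - pos, out);
        if (pos < len) {
            // The primary converter stopped without consuming src[pos],
            // so the byte is still available for the fallback.
            out.push_back(toyFallback(src[pos]));
            ++pos;
        }
    }
    return out;
}
```

If the primary converter had read past the erroneous byte, the caller would have no reliable way to know which input bytes to hand to the fallback.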
author: Stephan Bergmann <sbergman@redhat.com> 2019-09-04 09:36:03 +0200
committer: Stephan Bergmann <sbergman@redhat.com> 2019-09-04 13:02:11 +0200
commit: ca6ddfcc9385f1c31531eae31dfa81a9dda246f0 (patch)
tree: 554bf51eb5de4b9bd29ec1cbe966bf693da25f99 /sal/textenc/converteuctw.cxx
parent: c6ad32e03de01525a863171ed58df05e89e9f105 (diff)
1 files changed, 18 insertions, 4 deletions
diff --git a/sal/textenc/converteuctw.cxx b/sal/textenc/converteuctw.cxx
index 87becd9b11ec..abc214402636 100644
--- a/sal/textenc/converteuctw.cxx
+++ b/sal/textenc/converteuctw.cxx
@@ -93,6 +93,7 @@ sal_Size ImplConvertEucTwToUnicode(void const * pData,
     sal_Size nConverted = 0;
     sal_Unicode * pDestBufPtr = pDestBuf;
     sal_Unicode * pDestBufEnd = pDestBuf + nDestChars;
+    sal_Size startOfCurrentChar = 0;
 
     if (pContext)
     {
@@ -109,9 +110,10 @@ sal_Size ImplConvertEucTwToUnicode(void const * pData,
         {
         case IMPL_EUC_TW_TO_UNICODE_STATE_0:
             if (nChar < 0x80)
-                if (pDestBufPtr != pDestBufEnd)
+                if (pDestBufPtr != pDestBufEnd) {
                     *pDestBufPtr++ = static_cast<sal_Unicode>(nChar);
-                else
+                    startOfCurrentChar = nConverted + 1;
+                } else
                     goto no_output;
             else if (nChar >= 0xA1 && nChar <= 0xFE)
             {
@@ -210,13 +212,15 @@ sal_Size ImplConvertEucTwToUnicode(void const * pData,
                                 *pDestBufPtr++
                                     = static_cast<sal_Unicode>(pCns116431992Data[
                                               nOffset + (nChar - nFirst)]);
+                                startOfCurrentChar = nConverted + 1;
                             }
                             else
                                 goto no_output;
                         else
-                            if (pDestBufPtr != pDestBufEnd)
+                            if (pDestBufPtr != pDestBufEnd) {
                                 *pDestBufPtr++ = static_cast<sal_Unicode>(nUnicode);
-                            else
+                                startOfCurrentChar = nConverted + 1;
+                            } else
                                 goto no_output;
                     }
                     else
@@ -234,10 +238,16 @@ sal_Size ImplConvertEucTwToUnicode(void const * pData,
         {
         case sal::detail::textenc::BAD_INPUT_STOP:
             eState = IMPL_EUC_TW_TO_UNICODE_STATE_0;
+            if ((nFlags & RTL_TEXTTOUNICODE_FLAGS_FLUSH) == 0) {
+                ++nConverted;
+            } else {
+                nConverted = startOfCurrentChar;
+            }
             break;
 
         case sal::detail::textenc::BAD_INPUT_CONTINUE:
             eState = IMPL_EUC_TW_TO_UNICODE_STATE_0;
+            startOfCurrentChar = nConverted + 1;
             continue;
 
         case sal::detail::textenc::BAD_INPUT_NO_OUTPUT:
@@ -264,6 +274,10 @@ sal_Size ImplConvertEucTwToUnicode(void const * pData,
                         &nInfo))
             {
             case sal::detail::textenc::BAD_INPUT_STOP:
+                if ((nFlags & RTL_TEXTTOUNICODE_FLAGS_FLUSH) != 0) {
+                    nConverted = startOfCurrentChar;
+                }
+                [[fallthrough]];
             case sal::detail::textenc::BAD_INPUT_CONTINUE:
                 eState = IMPL_EUC_TW_TO_UNICODE_STATE_0;
                 break;
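
The startOfCurrentChar bookkeeping in the hunks above can be sketched in isolation with a toy decoder. This is an illustration under assumptions, not the sal implementation: TOY_FLAG_FLUSH, ToyResult, and toyConvert are hypothetical names, the two-byte encoding is made up, and there is no conversion context, so a pending lead byte at the end of unflushed input is simply reported as consumed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical flag standing in for RTL_TEXTTOUNICODE_FLAGS_FLUSH.
constexpr unsigned TOY_FLAG_FLUSH = 0x1;

struct ToyResult {
    std::u16string output;  // characters converted so far
    std::size_t consumed;   // source bytes reported as consumed
    bool error;             // true if conversion stopped at erroneous input
};

// Toy encoding: bytes < 0x80 are single-byte characters; a lead byte in
// 0xA1..0xFE must be followed by a trail byte in 0xA1..0xFE. Everything
// else is erroneous and stops the conversion.
ToyResult toyConvert(const unsigned char* src, std::size_t len, unsigned flags)
{
    assert(src != nullptr || len == 0);
    ToyResult r{ {}, 0, false };
    std::size_t startOfCurrentChar = 0;  // offset just past the last good char
    bool haveLead = false;
    unsigned char lead = 0;

    for (std::size_t i = 0; i < len; ++i) {
        unsigned char c = src[i];
        bool bad = false;
        if (!haveLead) {
            if (c < 0x80) {
                r.output.push_back(static_cast<char16_t>(c));
                startOfCurrentChar = i + 1;
            } else if (c >= 0xA1 && c <= 0xFE) {
                lead = c;
                haveLead = true;
            } else {
                bad = true;
            }
        } else if (c >= 0xA1 && c <= 0xFE) {
            // Made-up mapping: combine lead and trail into one code unit.
            r.output.push_back(static_cast<char16_t>((lead << 8) | c));
            haveLead = false;
            startOfCurrentChar = i + 1;
        } else {
            bad = true;
        }
        if (bad) {
            r.error = true;
            // With flush set, rewind to the start of the erroneous
            // character; otherwise consume through the erroneous byte.
            r.consumed = (flags & TOY_FLAG_FLUSH) != 0 ? startOfCurrentChar
                                                       : i + 1;
            return r;
        }
    }
    if (haveLead && (flags & TOY_FLAG_FLUSH) != 0) {
        // Truncated character at the end of flushed input: do not consume it.
        r.error = true;
        r.consumed = startOfCurrentChar;
        return r;
    }
    r.consumed = len;
    return r;
}
```

For the input bytes 0x41 0xA1 0x20, both modes produce "A" and report an error, but the flushing call reports 1 byte consumed while the non-flushing call reports 3.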