author | Stephan Bergmann <sbergman@redhat.com> | 2019-09-04 09:36:03 +0200 |
---|---|---|
committer | Stephan Bergmann <sbergman@redhat.com> | 2019-09-04 13:02:11 +0200 |
commit | ca6ddfcc9385f1c31531eae31dfa81a9dda246f0 (patch) | |
tree | 554bf51eb5de4b9bd29ec1cbe966bf693da25f99 /sal/textenc/convertiso2022jp.cxx | |
parent | c6ad32e03de01525a863171ed58df05e89e9f105 (diff) |
[API CHANGE] rtl_convertTextToUnicode behavior upon erroneous input

<http://udk.openoffice.org/cpp/man/spec/textconversion.html> specifies for
FLAGS_UNDEFINED_ERROR, FLAGS_MBUNDEFINED_ERROR, and FLAGS_INVALID_ERROR: "Read
past the [erroneous] code in the input buffer [...]" But actual behavior of
rtl_convertTextToUnicode for the various rtl_TextEncoding values has been
inconsistent. Some erroneous input (mostly single-byte UNDEFINED and INVALID
ones) has not been consumed at all, some (multi-byte MBUNDEFINED and INVALID)
has been consumed partly, and some has been consumed fully as required.

However, at least since 8dd4265b9ddbd7786b6237676909eae5b540da0e "CWS-TOOLING:
integrate CWS hb18", Custom8BitToUnicode in sw/source/filter/ww8/ww8par.cxx
appears to rely on the broken behavior of not consuming erroneous input. (It
reads the chunk of valid input with e.g. some RTL_TEXTENCODING_MS_125x that
happens to exhibit the broken behavior of not consuming erroneous input, then
wants to try to re-read the erroneous input with RTL_TEXTENCODING_MS_1252. For
example, opening sw/qa/core/data/ww8/pass/forcepoint50-grfanchor-1.doc triggers
that code. For whatever reason, the am_faksas.dot attached to
<https://bz.apache.org/ooo/show_bug.cgi?id=9240#c1> "Do not show lithuanian
letter 'Š'" appears to not, or at least no longer, trigger that code.)
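The fallback pattern described above can be sketched in isolation. This is a minimal, self-contained illustration and not the actual Custom8BitToUnicode code: the two lookup tables, `kUndefined`, `convertChunk`, and `convertWithFallback` are hypothetical names made up for this sketch, and the "fallback" table is a simple identity mapping standing in for a second encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical marker for "this byte has no mapping in the table".
constexpr char32_t kUndefined = 0xFFFFFFFF;

// Hypothetical primary table: ASCII plus one extra mapping (0xD0 -> U+0160).
char32_t primaryLookup(unsigned char b) {
    if (b < 0x80) return b;
    if (b == 0xD0) return 0x0160;
    return kUndefined;
}

// Hypothetical fallback table: every byte maps to the code point of the same value.
char32_t fallbackLookup(unsigned char b) { return b; }

// Converts bytes starting at pos with the primary table and stops at the first
// undefined byte WITHOUT consuming it. Returns the number of bytes consumed.
std::size_t convertChunk(const std::vector<unsigned char>& src, std::size_t pos,
                         std::u32string& out) {
    assert(pos <= src.size());
    std::size_t i = pos;
    for (; i < src.size(); ++i) {
        const char32_t c = primaryLookup(src[i]);
        if (c == kUndefined) break;   // leave the erroneous byte for the caller
        out.push_back(c);
    }
    return i - pos;
}

// Caller-side loop: convert a valid chunk, then re-read the one byte the
// primary table rejected with the fallback table, then continue.
std::u32string convertWithFallback(const std::vector<unsigned char>& src) {
    std::u32string out;
    std::size_t pos = 0;
    while (pos < src.size()) {
        pos += convertChunk(src, pos, out);
        if (pos < src.size()) {
            out.push_back(fallbackLookup(src[pos]));
            ++pos;
        }
    }
    return out;
}
```

The loop only works if the reported byte count stops exactly before the erroneous byte. If the converter had already skipped it, the caller would have nothing left to re-read with the fallback table.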

Therefore, it would be useful to have a mode in which rtl_convertTextToUnicode
does not consume erroneous input. (And I plan on doing changes in
sal/osl/unx/file* that would benefit from that behavior, too.) But changing
rtl_convertTextToUnicode to generally not consume erroneous input would not be
feasible: If calls do not set RTL_TEXTTOUNICODE_FLAGS_FLUSH, part of an
erroneous input can already have been consumed by a previous call, so the
current call cannot undo that.

But a change that looks like it can work is to change the behavior only if
RTL_TEXTTOUNICODE_FLAGS_FLUSH is set. In that case we can at least not consume
the part of an erroneous input that has not yet been consumed by a previous call
(which would necessarily have been done with RTL_TEXTTOUNICODE_FLAGS_FLUSH
unset). The expectation is that code that relies on the don't-consume behavior
will do only single calls with RTL_TEXTTOUNICODE_FLAGS_FLUSH set (so reliably
not consume the complete erroneous input), while other code (which might do
calls in a loop) will not care whether erroneous input has been consumed,
anyway. This can be considered a mild form of behavioral API CHANGE (but note
that the old implementation didn't exhibit the requested behavior anyway).
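The FLUSH-dependent rule described above can be illustrated with a toy decoder. This is a sketch under stated assumptions, not the sal implementation: the two-byte encoding, `ToyResult`, `decodeToy`, and the code-point mapping are all hypothetical. Only the `startOfCurrentChar` bookkeeping mirrors the idea of the change, and the `flush` parameter stands in for RTL_TEXTTOUNICODE_FLAGS_FLUSH. State carried between calls is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct ToyResult {
    std::u32string text;   // characters decoded before stopping
    std::size_t consumed;  // source bytes reported as consumed
    bool error;            // true if decoding stopped at erroneous input
};

// Hypothetical encoding: bytes < 0x80 are ASCII; a lead byte 0x81..0x9F
// followed by a trail byte 0x40..0x7E forms one character; all else is invalid.
ToyResult decodeToy(const std::vector<unsigned char>& src, bool flush) {
    ToyResult r{ {}, 0, false };
    std::size_t startOfCurrentChar = 0;  // index just past the last complete char
    unsigned lead = 0;                   // pending lead byte, 0 if none
    for (std::size_t i = 0; i < src.size(); ++i) {
        assert(startOfCurrentChar <= i);
        const unsigned char b = src[i];
        bool bad = false;
        if (lead == 0) {
            if (b < 0x80) {
                r.text.push_back(b);
                startOfCurrentChar = i + 1;
            } else if (b >= 0x81 && b <= 0x9F) {
                lead = b;
            } else {
                bad = true;
            }
        } else if (b >= 0x40 && b <= 0x7E) {
            // Arbitrary mapping, chosen only so that the output is well defined.
            r.text.push_back(static_cast<char32_t>(
                0x4E00 + (lead - 0x81) * 63 + (b - 0x40)));
            lead = 0;
            startOfCurrentChar = i + 1;
        } else {
            bad = true;
        }
        if (bad) {
            r.error = true;
            // With flush: rewind to the start of the erroneous character.
            // Without flush: read past the erroneous byte.
            r.consumed = flush ? startOfCurrentChar : i + 1;
            return r;
        }
    }
    // Input ended. A dangling lead byte is an error only when flushing;
    // a real converter would otherwise remember it in its context.
    r.error = (lead != 0) && flush;
    r.consumed = r.error ? startOfCurrentChar : src.size();
    return r;
}
```

For the input bytes 0x41 0x81 0xFF 0x42, a call with `flush` set reports 1 byte consumed (only the `A`), so the caller can re-read the whole erroneous sequence. A call with `flush` unset reports 3 bytes consumed.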

So all implementations of rtl_convertTextToUnicode for the various
rtl_TextEncoding values have been adapted to the new behavior. The only
exceptions are ImplDummyToUnicode (sal/textenc/textcvt.cxx), which is a special
case anyway used by RTL_TEXTENCODING_DONTKNOW, and two out of three places
(marked with a "TODO" each) in ImplUTF7ToUnicode (sal/textenc/tcvtutf7.cxx),
where it is hard to retrofit the expected behavior, and RTL_TEXTENCODING_UTF7 is
probably not relevant for the use cases relying on the don't-consume behavior,
anyway.

Whether a similar change should be done for rtl_convertUnicodeToText can be
examined later.

Change-Id: I1ac2c4cfd99e2a0eca219f9a3855ef110b254855
Reviewed-on: https://gerrit.libreoffice.org/78584
Tested-by: Jenkins
Reviewed-by: Stephan Bergmann <sbergman@redhat.com>
Diffstat (limited to 'sal/textenc/convertiso2022jp.cxx')
-rw-r--r-- | sal/textenc/convertiso2022jp.cxx | 18 |
1 files changed, 16 insertions, 2 deletions
diff --git a/sal/textenc/convertiso2022jp.cxx b/sal/textenc/convertiso2022jp.cxx
index f0eb5eb9a936..565c09ab36f5 100644
--- a/sal/textenc/convertiso2022jp.cxx
+++ b/sal/textenc/convertiso2022jp.cxx
@@ -94,6 +94,7 @@ sal_Size ImplConvertIso2022JpToUnicode(void const * pData,
     sal_Size nConverted = 0;
     sal_Unicode * pDestBufPtr = pDestBuf;
     sal_Unicode * pDestBufEnd = pDestBuf + nDestChars;
+    sal_Size startOfCurrentChar = 0;
 
     if (pContext)
     {
@@ -111,9 +112,10 @@ sal_Size ImplConvertIso2022JpToUnicode(void const * pData,
             if (nChar == 0x1B) // ESC
                 eState = IMPL_ISO_2022_JP_TO_UNICODE_STATE_ESC;
             else if (nChar < 0x80)
-                if (pDestBufPtr != pDestBufEnd)
+                if (pDestBufPtr != pDestBufEnd) {
                     *pDestBufPtr++ = static_cast<sal_Unicode>(nChar);
-                else
+                    startOfCurrentChar = nConverted + 1;
+                } else
                     goto no_output;
             else
             {
@@ -139,6 +141,7 @@ sal_Size ImplConvertIso2022JpToUnicode(void const * pData,
                         break;
                     }
                     *pDestBufPtr++ = static_cast<sal_Unicode>(nChar);
+                    startOfCurrentChar = nConverted + 1;
                 }
                 else
                     goto no_output;
@@ -178,6 +181,7 @@ sal_Size ImplConvertIso2022JpToUnicode(void const * pData,
                 {
                     *pDestBufPtr++ = static_cast<sal_Unicode>(nUnicode);
                     eState = IMPL_ISO_2022_JP_TO_UNICODE_STATE_0208;
+                    startOfCurrentChar = nConverted + 1;
                 }
                 else
                     goto no_output;
@@ -248,10 +252,16 @@ sal_Size ImplConvertIso2022JpToUnicode(void const * pData,
         {
         case sal::detail::textenc::BAD_INPUT_STOP:
             eState = IMPL_ISO_2022_JP_TO_UNICODE_STATE_ASCII;
+            if ((nFlags & RTL_TEXTTOUNICODE_FLAGS_FLUSH) == 0) {
+                ++nConverted;
+            } else {
+                nConverted = startOfCurrentChar;
+            }
             break;
 
         case sal::detail::textenc::BAD_INPUT_CONTINUE:
             eState = IMPL_ISO_2022_JP_TO_UNICODE_STATE_ASCII;
+            startOfCurrentChar = nConverted + 1;
             continue;
 
         case sal::detail::textenc::BAD_INPUT_NO_OUTPUT:
@@ -278,6 +288,10 @@ sal_Size ImplConvertIso2022JpToUnicode(void const * pData,
                 &nInfo))
         {
         case sal::detail::textenc::BAD_INPUT_STOP:
+            if ((nFlags & RTL_TEXTTOUNICODE_FLAGS_FLUSH) != 0) {
+                nConverted = startOfCurrentChar;
+            }
+            [[fallthrough]];
         case sal::detail::textenc::BAD_INPUT_CONTINUE:
             eState = IMPL_ISO_2022_JP_TO_UNICODE_STATE_ASCII;
             break;