path: root/sal/textenc/convertiso2022cn.cxx
Age | Commit message | Author
2023-09-19 | tdf#146619 Remove unused includes from sal/ [cpp files] | Gabor Kelemen
Change-Id: I11a54c1ddf73c16ce46a0d1c375bf43157870db7
Reviewed-on: https://gerrit.libreoffice.org/c/core/+/155856
Tested-by: Jenkins
Reviewed-by: Miklos Vajna <vmiklos@collabora.com>
2021-12-24 | Use rtl functions instead of own surrogate checking/combining | Mike Kaganski
Change-Id: I3eb05d8f5b0761bc3b672d4c855eb469f8cc1a29
Reviewed-on: https://gerrit.libreoffice.org/c/core/+/127375
Tested-by: Jenkins
Reviewed-by: Mike Kaganski <mike.kaganski@collabora.com>
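For readers unfamiliar with the term, the following is a minimal standalone sketch of what "surrogate checking/combining" means for UTF-16. The function names here are hypothetical stand-ins chosen for illustration. They are not taken from this commit, and the library helpers it switched to are not reproduced.

```cpp
#include <cstdint>

// Hypothetical stand-ins for library-provided surrogate helpers.

// High (leading) surrogates occupy U+D800..U+DBFF.
constexpr bool isHighSurrogate(std::uint32_t c) { return c >= 0xD800 && c <= 0xDBFF; }

// Low (trailing) surrogates occupy U+DC00..U+DFFF.
constexpr bool isLowSurrogate(std::uint32_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Combine a high/low pair into a supplementary-plane code point.
// The caller is expected to have checked both halves first.
constexpr std::uint32_t combineSurrogates(std::uint32_t high, std::uint32_t low)
{
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}
```

For instance, the pair 0xD83D, 0xDE00 combines to U+1F600. Centralizing such helpers avoids each converter carrying its own copy of the range checks.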
2019-09-04 | Do not exclude Unicode noncharacters from rtl_convertUnicodeToText | Stephan Bergmann
For one, that broke round-tripping with e.g. UTF-8 (see the test case added to Test::testComplex in sal/qa/rtl/textenc/rtl_textcvt.cxx) which did not treat noncharacters as invalid. For another, <https://unicode.org/faq/private_use.html#nonchar7> is meanwhile quite clear on the matter:

"Q: Are noncharacters prohibited in interchange?

"A: This question has led to some controversy, because the Unicode Standard has been somewhat ambiguous about the status of noncharacters. The formal wording of the definition of 'noncharacter' in the standard has always indicated that noncharacters 'should never be interchanged.' That led some people to assume that the definition actually meant 'shall not be interchanged' and that therefore the presence of a noncharacter in any Unicode string immediately rendered that string malformed according to the standard. But the intended use of noncharacters requires the ability to exchange them in a limited context, at least across APIs and even through data files and other means of 'interchange', so that they can be processed as intended. The choice of the word 'should' in the original definition was deliberate, and indicated that one should not try to interchange noncharacters precisely because their interpretation is strictly internal to whatever implementation uses them, so they have no publicly interchangeable semantics. But other informative wording in the text of the core specification and in the character names list was differently and more strongly worded, leading to contradictory interpretations.

"Given this ambiguity of intent, in 2013 the UTC issued Corrigendum #9, which deleted the phrase 'and that should never be interchanged' from the definition of noncharacters, to make it clear that prohibition from interchange is not part of the formal definition of noncharacters. Corrigendum #9 has been incorporated into the core specification for Unicode 7.0.

"Q: Are noncharacters invalid in Unicode strings and UTFs?

"A: Absolutely not. Noncharacters do not cause a Unicode string to be ill-formed in any UTF. This can be seen explicitly in the table above, where every noncharacter code point has a well-formed representation in UTF-32, in UTF-16, and in UTF-8. An implementation which converts noncharacter code points between one UTF representation and another must preserve these values correctly. The fact that they are called 'noncharacters' and are not intended for open interchange does not mean that they are somehow illegal or invalid code points which make strings containing them invalid."

Change-Id: I4fcc0156e3d2fd305a7c7bb0c7b3dbef846c9e64
Reviewed-on: https://gerrit.libreoffice.org/78598
Tested-by: Jenkins
Reviewed-by: Stephan Bergmann <sbergman@redhat.com>
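The round-trip property described in this commit message can be illustrated with a small standalone sketch. The two helpers below are hypothetical and written only for this illustration. They do not use the rtl text conversion API, and the decoder assumes well-formed input without bounds checking.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Encode one Unicode scalar value as UTF-8. Noncharacters such as U+FFFF
// take the same path as any other code point.
inline std::string encodeUtf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Decode the first code point of a well-formed UTF-8 string.
inline std::uint32_t decodeUtf8(const std::string& s)
{
    assert(!s.empty());
    auto byteAt = [&s](std::size_t i) {
        return static_cast<std::uint32_t>(static_cast<unsigned char>(s[i]));
    };
    const std::uint32_t b0 = byteAt(0);
    if (b0 < 0x80)
        return b0;
    if (b0 < 0xE0)
        return ((b0 & 0x1F) << 6) | (byteAt(1) & 0x3F);
    if (b0 < 0xF0)
        return ((b0 & 0x0F) << 12) | ((byteAt(1) & 0x3F) << 6) | (byteAt(2) & 0x3F);
    return ((b0 & 0x07) << 18) | ((byteAt(1) & 0x3F) << 12)
           | ((byteAt(2) & 0x3F) << 6) | (byteAt(3) & 0x3F);
}
```

With these helpers, `decodeUtf8(encodeUtf8(0xFFFF))` yields 0xFFFF again. A converter that rejected noncharacters on one side would break exactly this kind of round trip.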
2019-09-04 | [API CHANGE] rtl_convertTextToUnicode behavior upon erroneous input | Stephan Bergmann
<http://udk.openoffice.org/cpp/man/spec/textconversion.html> specifies that FLAGS_UNDEFINED_ERROR, FLAGS_MBUNDEFINED_ERROR, and FLAGS_INVALID_ERROR: "Read past the [erroneous] code in the input buffer [...]" But actual behavior of rtl_convertTextToUnicode for the various rtl_TextEncoding values has been inconsistent. Some erroneous input (mostly single-byte UNDEFINED and INVALID ones) has not been consumed at all, some (multi-byte MBUNDEFINED and INVALID) has been consumed partly, and some has been consumed fully as required.

However, at least since 8dd4265b9ddbd7786b6237676909eae5b540da0e "CWS-TOOLING: integrate CWS hb18", Custom8BitToUnicode in sw/source/filter/ww8/ww8par.cxx appears to rely on the broken behavior of not consuming erroneous input. (It reads the chunk of valid input with e.g. some RTL_TEXTENCODING_MS_125x that happens to exhibit the broken behavior of not consuming erroneous input, then wants to try to re-read the erroneous input with RTL_TEXTENCODING_MS_1252. For example, opening sw/qa/core/data/ww8/pass/forcepoint50-grfanchor-1.doc triggers that code. For whatever reason, the am_faksas.dot attached to <https://bz.apache.org/ooo/show_bug.cgi?id=9240#c1> "Do not show lithuanian letter 'Š'" appears to not, or at least no longer, trigger that code.)

Therefore, it would be useful to have a mode in which rtl_convertTextToUnicode does not consume erroneous input. (And I plan on doing changes in sal/osl/unx/file* that would benefit from that behavior, too.)

But changing rtl_convertTextToUnicode to generally not consume erroneous input would not be feasible: If calls do not set RTL_TEXTTOUNICODE_FLAGS_FLUSH, part of an erroneous input can already have been consumed by a previous call, so the current call cannot undo that.

But a change that looks like it can work is to change the behavior only if RTL_TEXTTOUNICODE_FLAGS_FLUSH is set. In that case we can at least not consume the part of an erroneous input that has not yet been consumed by a previous call (which would necessarily have been done with RTL_TEXTTOUNICODE_FLAGS_FLUSH unset). The expectation is that code that relies on the don't-consume behavior will do only single calls with RTL_TEXTTOUNICODE_FLAGS_FLUSH set (so reliably not consume the complete erroneous input), while other code (which might do calls in a loop) will not care whether erroneous input has been consumed, anyway.

This can be considered a mild form of behavioral API CHANGE (but note that the old implementation didn't exhibit the requested behavior anyway).

So all implementations of rtl_convertTextToUnicode for the various rtl_TextEncoding values have been adapted to the new behavior. The only exceptions are ImplDummyToUnicode (sal/textenc/textcvt.cxx), which is a special case anyway used by RTL_TEXTENCODING_DONTKNOW, and two out of three places (marked with a "TODO" each) in ImplUTF7ToUnicode (sal/textenc/tcvtutf7.cxx), where it is hard to retrofit the expected behavior, and RTL_TEXTENCODING_UTF7 is probably not relevant for the use cases relying on the don't-consume behavior, anyway.

Whether a similar change should be done for rtl_convertUnicodeToText can be examined later.

Change-Id: I1ac2c4cfd99e2a0eca219f9a3855ef110b254855
Reviewed-on: https://gerrit.libreoffice.org/78584
Tested-by: Jenkins
Reviewed-by: Stephan Bergmann <sbergman@redhat.com>
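The difference between consuming and not consuming an erroneous byte can be sketched with a toy converter. Everything below is hypothetical: `ConvertResult` and `convertToy` are invented for this illustration, the "encoding" simply treats bytes below 0x80 as valid, and the `flush` parameter only stands in for the role the commit message assigns to RTL_TEXTTOUNICODE_FLAGS_FLUSH. It does not model multi-byte sequences split across calls, and it is not the rtl_convertTextToUnicode signature.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type for the toy converter.
struct ConvertResult {
    std::u16string text;   // successfully converted prefix
    std::size_t consumed;  // number of source bytes consumed
    bool error;            // true if an invalid byte stopped the conversion
};

// Toy converter: bytes < 0x80 are valid, all others count as invalid.
// With flush == true the invalid byte is left unconsumed, so the caller
// can retry it with a different decoder. With flush == false the
// converter reads past the invalid byte.
inline ConvertResult convertToy(const std::string& src, bool flush)
{
    ConvertResult result{ {}, 0, false };
    for (std::size_t i = 0; i < src.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(src[i]);
        if (byte >= 0x80) {
            result.error = true;
            result.consumed = flush ? i : i + 1;
            return result;
        }
        result.text += static_cast<char16_t>(byte);
    }
    result.consumed = src.size();
    return result;
}
```

A caller in the style the message describes would make a single call with `flush` set, then hand `src.substr(result.consumed)` to a fallback decoder, knowing that the offending byte is still at the front of the remaining input.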
2018-01-12 | More loplugin:cstylecast: sal | Stephan Bergmann
auto-rewrite with <https://gerrit.libreoffice.org/#/c/47798/> "Enable loplugin:cstylecast for some more cases" plus solenv/clang-format/reformat-formatted-files
Change-Id: I7d89b011464ba5d2dd12e04d5fc9f65cb4daebde
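As a rough illustration of the kind of rewrite a C-style-cast cleanup performs, the sketch below shows the same conversion written both ways. The function names are hypothetical and the code is not taken from the file or from the plugin's actual output.

```cpp
// Before: a C-style cast, which can silently perform any of several
// different conversions.
constexpr unsigned char lowByteOld(int value)
{
    return (unsigned char)(value & 0xFF);
}

// After: a named cast that states the intended conversion explicitly.
constexpr unsigned char lowByteNew(int value)
{
    return static_cast<unsigned char>(value & 0xFF);
}
```

Both functions return the same value for every input. The change affects only how clearly the intent is stated and how easily the cast can be found by tools.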
2017-10-23 | loplugin:includeform: sal | Stephan Bergmann
Change-Id: I539ca8b9dee5edc5fc2282a2b9b0ffd78bad8b11
2017-07-17 | RTL_UNICODETOTEXT_INFO_{DEST|SCR}BUFFERTOSMALL should use TOO, not TO | Chris Sherlock
I have kept the old misspelled constant for backwards compatibility.
Change-Id: I128a2eec76d00cc5ef058cd6a0c35a7474d2411e
Reviewed-on: https://gerrit.libreoffice.org/39995
Reviewed-by: Chris Sherlock <chris.sherlock79@gmail.com>
Tested-by: Chris Sherlock <chris.sherlock79@gmail.com>
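Keeping a misspelled constant alive next to its corrected replacement usually looks something like the sketch below. The `EXAMPLE_` names and the value 0x0004 are hypothetical and chosen only for illustration. They are not the definitions from the rtl headers.

```cpp
#include <cstdint>

// Correctly spelled constant (hypothetical name and value).
#define EXAMPLE_INFO_DESTBUFFERTOOSMALL (static_cast<std::uint32_t>(0x0004))

// Old misspelled name, kept as an alias so existing callers still compile.
#define EXAMPLE_INFO_DESTBUFFERTOSMALL EXAMPLE_INFO_DESTBUFFERTOOSMALL

// New code tests the flag through the corrected name.
constexpr bool destBufferTooSmall(std::uint32_t info)
{
    return (info & EXAMPLE_INFO_DESTBUFFERTOOSMALL) != 0;
}
```

Because the old name expands to the new one, both spellings always carry the same value, and the alias can be retired later without a flag day.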
2015-04-15 | remove unnecessary use of void in function declarations | Noel Grandin
i.e.
    void f(void);
becomes
    void f();

I used the following command to make the changes:

    git grep -lP '\(\s*void\s*\)' -- *.cxx \
        | xargs perl -pi -w -e 's/(\w+)\s*\(\s*void\s*\)/$1\(\)/g;'

and ran it for both .cxx and .hxx files.
Change-Id: I314a1b56e9c14d10726e32841736b0ad5eef8ddd
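To make the substitution in that Perl one-liner easier to follow, here is the same pattern applied with `std::regex`. This is only an illustrative sketch: the helper name `dropVoidParameter` is hypothetical, and the commit itself used the git grep and perl pipeline quoted in its message, not C++ code.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Rewrite "name(void)" to "name()" using the same pattern as the Perl
// substitution: an identifier, optional whitespace, then a parenthesized
// lone "void" with optional surrounding whitespace.
inline std::string dropVoidParameter(const std::string& source)
{
    static const std::regex pattern(R"((\w+)\s*\(\s*void\s*\))");
    return std::regex_replace(source, pattern, "$1()");
}
```

Given `"void f(void);"` the helper returns `"void f();"`, while a declaration such as `"void h(int)"` is left unchanged because its parameter list is not a lone `void`.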
2015-01-20 | Some more loplugin:cstylecast: sal | Stephan Bergmann
Change-Id: Ie54d340478412e62b87d66e287fd8a3963e97898
2014-11-18 | More iwyu suggested headers removal | Riccardo Magliocchetti
Signed-off-by: Riccardo Magliocchetti <riccardo.magliocchetti@gmail.com>
Signed-off-by: Stephan Bergmann <sbergman@redhat.com>, undid one remove that was detrimental to loplugin:unreffun
Change-Id: I18d8252084d828f94ef7a954e1dbfb45743d7970
2013-11-27 | Unwind occurrences of deprecated sal_sChar, sal_uChar | Stephan Bergmann
Change-Id: I76be464200d486efef9c8a7e957c310c9adae3b8
2012-11-21 | re-base on ALv2 code. Includes: | Michael Meeks
Patch contributed by Herbert Duerr:
#i118662# remove berkeleyDB from module xmlhelp (author=orwitt)
    http://svn.apache.org/viewvc?view=revision&revision=1213188
#i119141# remove ISCII converter for now
    http://svn.apache.org/viewvc?view=revision&revision=1306246
make exceptions for cppunittester verbose
    http://svn.apache.org/viewvc?view=revision&revision=1174831
Patches contributed by Pedro Giffuni:
Avoid some uses of non portable #!/bin/bash in shell scripts.
    http://svn.apache.org/viewvc?view=revision&revision=1235297
Patch contributed by Oliver-Rainer Wittmann
88652: applied patch, remove unicows deps
    http://svn.apache.org/viewvc?view=revision&revision=1177585
drop OS/2 code, remove in-line assembler ARM atomics, and obsolete armarch header.
2012-01-06 | Made textenc/converter cleanly usable by both sal and sal_textenc. | Stephan Bergmann
2012-01-06 | Further clean up. | Stephan Bergmann
2012-01-05 | Changed C files to C++. | Stephan Bergmann