author | Khaled Hosny <khaledhosny@eglug.org> | 2018-04-26 12:55:26 +0200 |
---|---|---|
committer | Miklos Vajna <vmiklos@collabora.co.uk> | 2018-04-27 11:23:14 +0200 |
commit | c688b01d9102832226251fc84045408afe392459 (patch) | |
tree | e000d416369c3d4b032cf2614ce8e9d59eb0e68f /accessibility | |
parent | dfdc165a48d711b867961d1f75ee36a1c9596dc0 (diff) |
tdf#66597 Fix PDF text extraction for complex text
Implement a more thorough strategy for embedding textual content in PDF
files:
* If there is a unique one-to-one or one-to-many mapping between each
glyph index and Unicode code points, use ToUnicode CMAP.
* If there is a many-to-one or many-to-many mapping, use an ActualText
span embedding the original string, since ToUnicode can’t handle
these.
* If one glyph is used for several Unicode code points, also use
ActualText since ToUnicode can map each glyph in the font only once.
* Limit ActualText to a single cluster at a time, since using it for whole
words or sentences breaks text selection and highlighting in PDF
viewers (there will be no way to tell which glyphs belong to which
characters).
* Keep generating (now) redundant ToUnicode entries for compatibility
with old tools not supporting ActualText.
Change-Id: I33261811b59b3b8fe2164c2c21d3c52c417e6208
Reviewed-on: https://gerrit.libreoffice.org/53315
Tested-by: Jenkins <ci@libreoffice.org>
Reviewed-by: Miklos Vajna <vmiklos@collabora.co.uk>
Diffstat (limited to 'accessibility')
0 files changed, 0 insertions, 0 deletions
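
The per-cluster decision described in the commit message can be sketched roughly as follows. This is a minimal illustration in Python, not the code from this commit: the function name `plan_text_embedding`, the `(glyph ids, text)` cluster representation, and the returned plan format are all hypothetical. It also does not model the last bullet (emitting redundant ToUnicode entries for clusters that fall back to ActualText).

```python
from typing import Dict, List, Tuple

# Hypothetical representation: one cluster = (glyph ids, original text).
Cluster = Tuple[Tuple[int, ...], str]
PlanEntry = Tuple[str, Tuple[int, ...], str]


def plan_text_embedding(
    clusters: List[Cluster],
) -> Tuple[List[PlanEntry], Dict[int, str]]:
    """Choose ToUnicode or ActualText for each cluster, one cluster at a time."""
    to_unicode: Dict[int, str] = {}  # glyph id -> text; each glyph mapped once
    plan: List[PlanEntry] = []
    for glyphs, text in clusters:
        if len(glyphs) == 1:
            # One glyph for one or more code points: ToUnicode can express
            # this, provided the glyph is not already mapped to other text.
            known = to_unicode.setdefault(glyphs[0], text)
            if known == text:
                plan.append(("ToUnicode", glyphs, text))
                continue
        # Several glyphs in the cluster (many-to-one or many-to-many), or a
        # glyph reused for different text: wrap this cluster in ActualText.
        plan.append(("ActualText", glyphs, text))
    return plan, to_unicode


if __name__ == "__main__":
    sample = [
        ((10,), "a"),          # one-to-one
        ((20,), "ffi"),        # one-to-many (ligature glyph)
        ((30, 31), "\u0643"),  # many-to-one
        ((10,), "b"),          # glyph 10 reused for different text
    ]
    for entry in plan_text_embedding(sample)[0]:
        print(entry)
```

Deciding per cluster rather than per word mirrors the commit's point that a wider ActualText span would leave viewers unable to tell which glyphs belong to which characters.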