path: root/accessibility
author Khaled Hosny <khaledhosny@eglug.org> 2018-04-26 12:55:26 +0200
committer Miklos Vajna <vmiklos@collabora.co.uk> 2018-04-27 11:23:14 +0200
commit c688b01d9102832226251fc84045408afe392459
tree e000d416369c3d4b032cf2614ce8e9d59eb0e68f /accessibility
parent dfdc165a48d711b867961d1f75ee36a1c9596dc0
tdf#66597 Fix PDF text extraction for complex text
Implement a more thorough strategy for embedding textual content in PDF files:

* If there is a unique one-to-one or one-to-many mapping between each glyph index and Unicode code points, use a ToUnicode CMap.
* If there is a many-to-one or many-to-many mapping, use an ActualText span embedding the original string, since ToUnicode can’t handle these.
* If one glyph is used for several different Unicode code points, also use ActualText, since ToUnicode can map each glyph in the font only once.
* Limit ActualText to a single cluster at a time, since using it for whole words or sentences breaks text selection and highlighting in PDF viewers (there would be no way to tell which glyphs belong to which characters).
* Keep generating the (now redundant) ToUnicode entries for compatibility with old tools that do not support ActualText.

Change-Id: I33261811b59b3b8fe2164c2c21d3c52c417e6208
Reviewed-on: https://gerrit.libreoffice.org/53315
Tested-by: Jenkins <ci@libreoffice.org>
Reviewed-by: Miklos Vajna <vmiklos@collabora.co.uk>
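
The per-cluster decision described in the commit message can be sketched roughly as follows. This is an illustrative Python model, not the actual LibreOffice code (which is C++). The function name `plan_text_embedding`, the `(glyph_ids, text)` cluster representation, and the returned tuples are all assumptions made for the example. It also simplifies the compatibility rule: only the first text seen for a single glyph is recorded in the ToUnicode table.

```python
def plan_text_embedding(clusters):
    """Decide, cluster by cluster, how the text would be exposed in the PDF.

    clusters: list of (glyph_ids, text) pairs, one pair per cluster
              (hypothetical representation chosen for this sketch).
    Returns (plan, to_unicode), where plan is a list of
    (method, glyph_ids, text) tuples and to_unicode maps glyph id -> text.
    """
    to_unicode = {}  # each glyph can be mapped only once per font
    plan = []
    for glyph_ids, text in clusters:
        if len(glyph_ids) == 1:
            glyph = glyph_ids[0]
            # One glyph -> one or more code points: ToUnicode can express
            # this, provided the glyph is not already mapped to other text.
            mapped = to_unicode.setdefault(glyph, text)
            if mapped == text:
                plan.append(("ToUnicode", glyph_ids, text))
                continue
        # Several glyphs in one cluster, or a glyph reused for different
        # text: fall back to an ActualText span limited to this cluster.
        plan.append(("ActualText", glyph_ids, text))
    return plan, to_unicode


example_clusters = [
    ([10], "a"),        # one-to-one
    ([11], "ffi"),      # one-to-many (ligature glyph)
    ([12, 13], "xy"),   # many-to-many
    ([10], "A"),        # glyph 10 reused for different text
]
plan, to_unicode = plan_text_embedding(example_clusters)
methods = [method for method, _, _ in plan]
```

Because the fallback is applied one cluster at a time, each ActualText span in this model covers only the glyphs of its own cluster, which is what keeps selection and highlighting workable in viewers.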
Diffstat (limited to 'accessibility')
0 files changed, 0 insertions, 0 deletions