lo/core - LibreOffice core code repository

author	Khaled Hosny <khaledhosny@eglug.org>	2018-04-26 12:55:26 +0200
committer	Miklos Vajna <vmiklos@collabora.co.uk>	2018-04-27 11:23:14 +0200
commit	c688b01d9102832226251fc84045408afe392459 (patch)
tree	e000d416369c3d4b032cf2614ce8e9d59eb0e68f /accessibility
parent	dfdc165a48d711b867961d1f75ee36a1c9596dc0 (diff)

tdf#66597 Fix PDF text extraction for complex text

Implement a more thorough strategy for embedding textual content in PDF files:

* If there is a unique one-to-one or one-to-many mapping between each glyph index and Unicode code points, use the ToUnicode CMap.
* If there is a many-to-one or many-to-many mapping, use an ActualText span embedding the original string, since ToUnicode can’t handle these.
* If one glyph is used for several Unicode code points, also use ActualText, since ToUnicode can map each glyph in the font only once.
* Limit ActualText to a single cluster at a time, since using it for whole words or sentences breaks text selection and highlighting in PDF viewers (there would be no way to tell which glyphs belong to which characters).
* Keep generating the (now redundant) ToUnicode entries for compatibility with old tools that do not support ActualText.

Change-Id: I33261811b59b3b8fe2164c2c21d3c52c417e6208
Reviewed-on: https://gerrit.libreoffice.org/53315
Tested-by: Jenkins <ci@libreoffice.org>
Reviewed-by: Miklos Vajna <vmiklos@collabora.co.uk>
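The per-cluster decision described in the commit message can be sketched roughly as follows. This is an illustrative Python model, not the LibreOffice implementation (which is C++ and is not shown on this page). The function name `plan_text_embedding`, the `(glyph_ids, text)` cluster representation, and the "first mapping wins" rule for ToUnicode entries are all assumptions made for the example.

```python
def plan_text_embedding(clusters):
    """Decide, per cluster, how its text could be made extractable.

    `clusters` is a hypothetical list of (glyph_ids, text) pairs, where
    glyph_ids is a tuple of glyph indices and text is the original string
    that the cluster renders.
    Returns (plan, to_unicode): plan is a list of (glyph_ids, text, method)
    and to_unicode is the glyph -> string table for the ToUnicode CMap.
    """
    to_unicode = {}
    plan = []
    for glyphs, text in clusters:
        if len(glyphs) == 1:
            # One glyph to one or many code points: ToUnicode can express
            # this, but only one mapping per glyph. Assumption: the first
            # mapping seen is the one recorded.
            mapped = to_unicode.setdefault(glyphs[0], text)
            method = "ToUnicode" if mapped == text else "ActualText"
        else:
            # Several glyphs to one or many code points: ToUnicode cannot
            # express this, so wrap just this cluster in an ActualText span.
            # Simplification: this sketch adds no ToUnicode entries for
            # the glyphs of multi-glyph clusters.
            method = "ActualText"
        plan.append((glyphs, text, method))
    return plan, to_unicode


if __name__ == "__main__":
    sample = [
        ((10,), "a"),          # one glyph, one code point
        ((11,), "fi"),         # one glyph, two code points (ligature)
        ((12, 13), "\u00e9"),  # two glyphs, one code point
        ((10,), "A"),          # glyph 10 reused for a different string
    ]
    plan, table = plan_text_embedding(sample)
    for glyphs, text, method in plan:
        print(glyphs, repr(text), method)
    print(table)
```

Because the decision is made one cluster at a time, an ActualText span in this model never covers more than a single cluster, which matches the selection and highlighting constraint stated above.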

Diffstat (limited to 'accessibility')

0 files changed, 0 insertions, 0 deletions

