tdf#129353, tdf#129402: fix node creation on index import

The import code for ToC, bibliography, and index sections now follows what
Word does more closely. It makes sure that pre-rendered entries are not
imported as standalone paragraphs outside of the index sections, and that the
paragraph count is accurate (no missing or added paragraphs, as far as
possible).

In Word, an index may start and end in the middle of a paragraph:

    <w:p>
      <w:r> <w:t>Some text before index</w:t> </w:r>
      <w:r> <w:fldChar w:fldCharType="begin"/> </w:r>
      <w:r> <w:instrText> TOC ...</w:instrText> </w:r>
      <w:r> <w:fldChar w:fldCharType="separate"/> </w:r>
      <w:r> <w:t>First pre-rendered index entry</w:t> </w:r>
    </w:p>
    ...
    <w:p>
      <w:r> <w:t>Last pre-rendered index entry</w:t> </w:r>
      <w:r> <w:fldChar w:fldCharType="end"/> </w:r>
      <w:r> <w:t>Some text after index</w:t> </w:r>
    </w:p>

However, normally either no runs precede the index, or no runs of pre-rendered
content are present in the paragraph with the start mark.

When no SDT (<w:sdt>) elements are used, the typical situation is a normal
paragraph (possibly with some user text) that ends with the index start mark,
without any pre-rendered content in the same paragraph; all pre-rendered
content goes into the following paragraphs. Such an index normally ends with
the index end mark in the *first* run of a paragraph, which might then have
normal text runs.

When SDTs are used, paragraphs with marks usually have no leading or trailing
out-of-index runs; in this case, paragraphs with index marks are still treated
as part of the index even when they contain no pre-rendered entries.

In Writer, indexes are node sections (and so cannot be inline with other
paragraph content). When there was already some paragraph content before the
start-of-index mark, the paragraph is assumed to end before the index; in this
case, when the current <w:p> element ends, the importer decides whether a
separate starting paragraph is needed, depending on whether there were any
runs after the mark.

When there were no text runs before the start mark, the paragraph is treated
as the leading paragraph of the index. This avoids losing empty paragraphs
before the index, and avoids having two paragraphs where there was one in
Word. Only when the user had manually typed text both inside and outside the
index in the same paragraph in Word is the paragraph split into two in Writer.

For end marks, the behaviour depends on whether the mark is inside an SDT.
When inside, the ending paragraph that starts with the index end mark is
considered part of the index. Outside an SDT, it is considered a normal
paragraph (and measures are taken to make sure it is not dropped even if
empty, because such paragraphs sometimes have no other content but carry
section settings, which Writer usually treats as a "drop this paragraph"
sign).

A special problem is a multi-column index. Word wraps it into a continuous
section, and in Writer we also wrap it into a section. It might be useful to
detect whether this section is part of the index definition and, if so, drop
the section and put its properties into Writer's index section. That would
avoid an explicit section in the imported document. This is a TODO for someone
who figures out how to reliably detect whether the section belongs to the
index definition. See the comment in DomainMapper_Impl::appendTextSectionAfter.

By the way, the current export code is wrong: it produces an index that is
single-column in Word. This change doesn't touch that.

Several existing tests that used to test wrong results needed to be fixed.

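The paragraph rules described above can be summarised as a small decision
table. The following C++ is only an illustrative sketch of those rules as
stated in this message; StartParagraph, StartAction, EndAction, classifyStart
and classifyEnd are hypothetical names invented for the illustration and do
not exist in the writerfilter code, which tracks this state differently.

```cpp
// Illustrative sketch only: hypothetical names, not the actual importer code.

// What has been seen in the <w:p> element containing the index start mark.
struct StartParagraph
{
    bool bRunsBeforeMark; // text runs before the "begin" fldChar
    bool bRunsAfterMark;  // pre-rendered runs after the mark, in the same <w:p>
};

enum class StartAction
{
    LeadingIndexParagraph, // the paragraph becomes the first paragraph of the index
    EndBeforeIndex,        // paragraph ends before the index; no extra paragraph
    SplitInTwo             // text stays outside; a separate starting paragraph is added
};

// Decision taken when the <w:p> element with the start mark ends.
StartAction classifyStart(const StartParagraph& rPara)
{
    if (!rPara.bRunsBeforeMark)
        return StartAction::LeadingIndexParagraph;
    return rPara.bRunsAfterMark ? StartAction::SplitInTwo
                                : StartAction::EndBeforeIndex;
}

enum class EndAction
{
    PartOfIndex,    // the ending paragraph belongs to the index
    NormalParagraph // ordinary paragraph, kept even when it is empty
};

// Assumes the paragraph starts with the index end mark, as in the typical case.
EndAction classifyEnd(bool bInsideSdt)
{
    return bInsideSdt ? EndAction::PartOfIndex : EndAction::NormalParagraph;
}
```

Under these assumptions, only the case with runs both before and after the
start mark yields two Writer paragraphs for one Word paragraph.
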
Change-Id: I9597c8ab13f31ded9abcc24054d3478d3e3a3b40
Reviewed-on: https://gerrit.libreoffice.org/85089
Tested-by: Jenkins
Reviewed-by: Mike Kaganski <mike.kaganski@collabora.com>
Reviewed-on: https://gerrit.libreoffice.org/c/core/+/85278
Tested-by: Jenkins CollaboraOffice <jenkinscollaboraoffice@gmail.com>
Reviewed-on: https://gerrit.libreoffice.org/c/core/+/86464
Reviewed-by: Andras Timar <andras.timar@collabora.com>
author: Mike Kaganski <mike.kaganski@collabora.com> 2019-12-13 09:36:39 +0300
committer: Andras Timar <andras.timar@collabora.com> 2020-01-09 12:15:39 +0100
commit: 3018dcbf12fe96e57fe52dc1b2d7a1235b480eef
tree: 999f377206f9d7f79f1120ba5c997dbd6f060095 /sw
parent: 62eee51aeaee380139126e21ac550e6e367164ab
5 files changed, 64 insertions, 7 deletions
diff --git a/sw/qa/extras/ooxmlexport/data/tdf129353.docx b/sw/qa/extras/ooxmlexport/data/tdf129353.docx
new file mode 100644
index 000000000000..c5cf8865eda6
--- /dev/null
+++ b/sw/qa/extras/ooxmlexport/data/tdf129353.docx
diff --git a/sw/qa/extras/ooxmlexport/ooxmlexport13.cxx b/sw/qa/extras/ooxmlexport/ooxmlexport13.cxx
index 3800851c3b83..d5aff0be51f9 100644
--- a/sw/qa/extras/ooxmlexport/ooxmlexport13.cxx
+++ b/sw/qa/extras/ooxmlexport/ooxmlexport13.cxx
@@ -11,10 +11,12 @@
 
 #include <com/sun/star/beans/XPropertySet.hpp>
 #include <com/sun/star/text/WritingMode2.hpp>
+#include <com/sun/star/text/XDocumentIndex.hpp>
 #include <com/sun/star/text/XTextFrame.hpp>
 #include <com/sun/star/drawing/XControlShape.hpp>
 #include <com/sun/star/style/ParagraphAdjust.hpp>
 #include <IDocumentSettingAccess.hxx>
+#include <tools/lineend.hxx>
 
 #include <editsh.hxx>
 
@@ -478,6 +480,32 @@ DECLARE_OOXMLEXPORT_TEST(testTdf127741, "tdf127741.docx")
     CPPUNIT_ASSERT(visitedStyleName.equalsIgnoreAsciiCase("Visited Internet Link"));
 }
 
+DECLARE_OOXMLEXPORT_TEST(testTdf129353, "tdf129353.docx")
+{
+    CPPUNIT_ASSERT_EQUAL(8, getParagraphs());
+    getParagraph(2, "Bibliography");
+    getParagraph(4, "Christie, A. (1922). The Secret Adversary. ");
+    CPPUNIT_ASSERT_EQUAL(OUString(), getParagraph(8)->getString());
+
+    uno::Reference<text::XDocumentIndexesSupplier> xIndexSupplier(mxComponent, uno::UNO_QUERY);
+    uno::Reference<container::XIndexAccess> xIndexes = xIndexSupplier->getDocumentIndexes();
+    uno::Reference<text::XDocumentIndex> xIndex(xIndexes->getByIndex(0), uno::UNO_QUERY);
+    uno::Reference<text::XTextRange> xTextRange = xIndex->getAnchor();
+    uno::Reference<text::XText> xText = xTextRange->getText();
+    uno::Reference<text::XTextCursor> xTextCursor = xText->createTextCursor();
+    xTextCursor->gotoRange(xTextRange->getStart(), false);
+    xTextCursor->gotoRange(xTextRange->getEnd(), true);
+    OUString aIndexString(convertLineEnd(xTextCursor->getString(), LineEnd::LINEEND_LF));
+
+    // Check that all the pre-rendered entries are correct, including trailing spaces
+    CPPUNIT_ASSERT_EQUAL(OUString("\n" // starting with an empty paragraph
+                                  "Christie, A. (1922). The Secret Adversary. \n"
+                                  "\n"
+                                  "Verne, J. G. (1870). Twenty Thousand Leagues Under the Sea. \n"
+                                  ""), // ending with an empty paragraph
+                         aIndexString);
+}
+
 CPPUNIT_PLUGIN_IMPLEMENT();
 
 /* vim:set shiftwidth=4 softtabstop=4 expandtab: */
diff --git a/sw/qa/extras/ooxmlexport/ooxmlexport5.cxx b/sw/qa/extras/ooxmlexport/ooxmlexport5.cxx
index fde09e123f9f..8d53353aa13c 100644
--- a/sw/qa/extras/ooxmlexport/ooxmlexport5.cxx
+++ b/sw/qa/extras/ooxmlexport/ooxmlexport5.cxx
@@ -9,6 +9,8 @@
 
 #include <swmodeltestbase.hxx>
 
+#include <com/sun/star/text/XDocumentIndex.hpp>
+#include <com/sun/star/text/XDocumentIndexesSupplier.hpp>
 #include <com/sun/star/text/XFootnote.hpp>
 #include <com/sun/star/text/XTextTable.hpp>
 #include <com/sun/star/style/LineSpacing.hpp>
@@ -676,6 +678,33 @@ DECLARE_OOXMLEXPORT_TEST(testFdo77129, "fdo77129.docx")
     assertXPathContent(pXmlDoc, "/w:document/w:body/w:p[4]/w:r[1]/w:t", "Abstract");
 }
 
+// Test the same testdoc used for testFdo77129.
+DECLARE_OOXMLEXPORT_TEST(testTdf129402, "fdo77129.docx")
+{
+    // tdf#129402: ToC title must be "Contents", not "Content"; the index field must include
+    // pre-rendered element.
+
+    // Currently export drops empty paragraph after ToC, so skip getParagraphs test for now
+//    CPPUNIT_ASSERT_EQUAL(5, getParagraphs());
+    CPPUNIT_ASSERT_EQUAL(OUString("owners."), getParagraph(1)->getString());
+    CPPUNIT_ASSERT_EQUAL(OUString("Contents"), getParagraph(2)->getString());
+    CPPUNIT_ASSERT_EQUAL(OUString("How\t2"), getParagraph(3)->getString());
+//    CPPUNIT_ASSERT_EQUAL(OUString(), getParagraph(4)->getString());
+
+    uno::Reference<text::XDocumentIndexesSupplier> xIndexSupplier(mxComponent, uno::UNO_QUERY);
+    uno::Reference<container::XIndexAccess> xIndexes = xIndexSupplier->getDocumentIndexes();
+    uno::Reference<text::XDocumentIndex> xIndex(xIndexes->getByIndex(0), uno::UNO_QUERY);
+    uno::Reference<text::XTextRange> xTextRange = xIndex->getAnchor();
+    uno::Reference<text::XText> xText = xTextRange->getText();
+    uno::Reference<text::XTextCursor> xTextCursor = xText->createTextCursor();
+    xTextCursor->gotoRange(xTextRange->getStart(), false);
+    xTextCursor->gotoRange(xTextRange->getEnd(), true);
+    OUString aTocString(xTextCursor->getString());
+
+    // Check that the pre-rendered entry is inside the index
+    CPPUNIT_ASSERT_EQUAL(OUString("How\t2"), aTocString);
+}
+
 DECLARE_OOXMLEXPORT_TEST(testfdo79969_xlsm, "fdo79969_xlsm.docx")
 {
     // This UT for DOCX embedded with excel work sheet.
diff --git a/sw/qa/extras/ooxmlexport/ooxmlfieldexport.cxx b/sw/qa/extras/ooxmlexport/ooxmlfieldexport.cxx
index 1765a28fa64f..94ca46896548 100644
--- a/sw/qa/extras/ooxmlexport/ooxmlfieldexport.cxx
+++ b/sw/qa/extras/ooxmlexport/ooxmlfieldexport.cxx
@@ -282,9 +282,9 @@ DECLARE_OOXMLEXPORT_TEST(testAlphabeticalIndex_MultipleColumns,"alphabeticalInde
 
     // check for section breaks after and before the Index Section
     assertXPath(pXmlDoc, "/w:document/w:body/w:p[2]/w:pPr/w:sectPr/w:type","val","continuous");
-    assertXPath(pXmlDoc, "/w:document/w:body/w:p[9]/w:pPr/w:sectPr/w:type","val","continuous");
+    assertXPath(pXmlDoc, "/w:document/w:body/w:p[8]/w:pPr/w:sectPr/w:type","val","continuous");
     // check for "w:space" attribute for the columns in Section Properties
-    assertXPath(pXmlDoc, "/w:document/w:body/w:p[9]/w:pPr/w:sectPr/w:cols/w:col[1]","space","720");
+    assertXPath(pXmlDoc, "/w:document/w:body/w:p[8]/w:pPr/w:sectPr/w:cols/w:col[1]","space","720");
 }
 
 DECLARE_OOXMLEXPORT_TEST(testPageref, "testPageref.docx")
@@ -359,9 +359,9 @@ DECLARE_OOXMLEXPORT_TEST(test_FieldType, "99_Fields.docx")
     if (!pXmlDoc)
         return;
     // Checking for three field types (BIBLIOGRAPHY, BIDIOUTLINE, CITATION) in sequence
-    assertXPath(pXmlDoc, "/w:document[1]/w:body[1]/w:p[2]/w:r[2]/w:instrText[1]",1);
-    assertXPath(pXmlDoc, "/w:document[1]/w:body[1]/w:p[5]/w:r[2]/w:instrText[1]",1);
-    assertXPath(pXmlDoc, "/w:document[1]/w:body[1]/w:p/w:sdt/w:sdtContent/w:r[2]/w:instrText[1]",1);
+    assertXPath(pXmlDoc, "/w:document/w:body/w:p[2]/w:r[2]/w:instrText");
+    assertXPath(pXmlDoc, "/w:document/w:body/w:p[3]/w:r[2]/w:instrText");
+    assertXPath(pXmlDoc, "/w:document/w:body/w:p[4]/w:sdt/w:sdtContent/w:r[2]/w:instrText");
 }
 
 DECLARE_OOXMLEXPORT_TEST(testCitation,"FDO74775.docx")
@@ -411,7 +411,7 @@ DECLARE_OOXMLEXPORT_TEST(testFDO78654 , "fdo78654.docx")
         return;
     // In case of two "Hyperlink" tags in one paragraph and one of them
     // contains "PAGEREF" field then field end tag was missing from hyperlink.
-    assertXPath ( pXmlDoc, "/w:document/w:body/w:p[2]/w:hyperlink[3]/w:r[5]/w:fldChar", "fldCharType", "end" );
+    assertXPath ( pXmlDoc, "/w:document/w:body/w:sdt/w:sdtContent/w:p[2]/w:hyperlink[3]/w:r[5]/w:fldChar", "fldCharType", "end" );
 }
 
 
diff --git a/sw/qa/extras/rtfimport/rtfimport.cxx b/sw/qa/extras/rtfimport/rtfimport.cxx
index 6fa57b6de4ce..b35ee4b84717 100644
--- a/sw/qa/extras/rtfimport/rtfimport.cxx
+++ b/sw/qa/extras/rtfimport/rtfimport.cxx
@@ -919,7 +919,7 @@ DECLARE_RTFIMPORT_TEST(testFdo44984, "fdo44984.rtf")
 DECLARE_RTFIMPORT_TEST(testFdo82071, "fdo82071.rtf")
 {
     // The problem was that in TOC, chapter names were underlined, but they should not be.
-    uno::Reference<text::XTextRange> xRun = getRun(getParagraph(2), 1);
+    uno::Reference<text::XTextRange> xRun = getRun(getParagraph(1), 1);
     // Make sure we test the right text portion.
     CPPUNIT_ASSERT_EQUAL(OUString("Chapter 1"), xRun->getString());
     // This was awt::FontUnderline::SINGLE.