[tex4ht] [bug #618] Incomplete XML Document, domfilter error, truncated build on large file.

William F Hammond gellmu at gmail.com
Tue Dec 12 04:06:58 CET 2023


Hello Nasser,

You don't give us much to go on.  But it does provoke my curiosity.

I assume that you are able to build the 57,000 page pdf from the tex source
that you want to process with tex4ht.

Is HTML output the final tex4ht target?  I'm assuming it is.

You say:

[INFO]    make4ht-lib: parse_lg process file: reportsubsection1100.htm
[WARNING] domfilter: DOM parsing of reportsubsection1100.htm failed:
[WARNING] domfilter:
...ive/2023/texmf-dist/tex/luatex/luaxml/luaxml-mod-xml.lua:175: Incomplete
XML Document [char=33675]

From this I deduce that the 57,000-page document is being written in HTML
pieces by tex4ht, "reportsubsection1100.htm" is one of those pieces, and
perhaps not all expected pieces have been generated.

Have you checked whether "reportsubsection1100.htm" is well-formed XML
using, say, the tool "xmlwf" found in the expat distribution?
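
In case xmlwf is not at hand, here is a minimal sketch of the same kind of
check in Python, whose standard xml.parsers.expat module wraps the expat
library. The function name check_well_formed is my own invention, and I have
not run this against your files. It tests well-formedness only, so it may not
flag exactly what the luaxml parser objects to.

```python
import sys
from xml.parsers import expat


def check_well_formed(path):
    """Return None if the file is well-formed XML, else a one-line error report.

    check_well_formed is a hypothetical helper name, not part of tex4ht,
    make4ht, or luaxml.
    """
    parser = expat.ParserCreate()
    try:
        # ParseFile reads the whole stream and signals end of input itself,
        # so a truncated document is reported as an error.
        with open(path, "rb") as handle:
            parser.ParseFile(handle)
    except expat.ExpatError as err:
        # err.lineno is the line and err.offset the column within that line.
        return f"{path}:{err.lineno}:{err.offset}: {expat.ErrorString(err.code)}"
    return None


if __name__ == "__main__":
    # Check every file named on the command line and report each result.
    for name in sys.argv[1:]:
        message = check_well_formed(name)
        print(message if message else f"{name}: well-formed")
```

Saved under a name of your choosing, say checkwf.py, it could be run as
"python3 checkwf.py reportsubsection1100.htm". Note that it reports a line
and column, which is not the same measure as the [char=33675] position in
the domfilter warning.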

            -- Bill


William F Hammond
Email: gellmu at gmail.com
https://www.facebook.com/william.f.hammond
http://www.albany.edu/~hammond/

𝑻𝒉𝒆 𝒕𝒊𝒎𝒆 𝒕𝒐 𝒔𝒂𝒗𝒆 𝒂 𝒅𝒆𝒎𝒐𝒄𝒓𝒂𝒄𝒚 𝒊𝒔 𝒃𝒆𝒇𝒐𝒓𝒆 𝒊𝒕
𝒊𝒔 𝒍𝒐𝒔𝒕.   -- 𝐊𝐞𝐧 𝐁𝐮𝐫𝐧𝐬




On Mon, Dec 11, 2023 at 5:04 PM Nasser M. Abbasi <puszcza-hackers at gnu.org.ua>
wrote:

> URL:
>   <http://puszcza.gnu.org.ua/bugs/?618>
>
>                  Summary: Incomplete XML Document, domfilter error,
>                           truncated build on large file.
>                  Project: tex4ht
>             Submitted by: nma123
>             Submitted on: Tue Dec 12 01:04:12 2023
>                 Category: None
>                 Priority: 5 - Normal
>                 Severity: 7 - Important
>                   Status: None
>                  Privacy: Public
>              Assigned to: None
>         Originator Email:
>              Open/Closed: Open
>          Discussion Lock: Any
>
>     _______________________________________________________
>
> Details:
>
> I have been working with Michal on this via private email but thought I
> would enter a bug report for tracking and documentation.
>
> I have one large file (57,000 PDF pages) that takes 14 hours to compile
> with tex4ht. At about 10% of the way through generating the final HTML
> pages, it gets an XML error and stops.
>
> That is, the remaining 90% of the sections are missing from the final web
> pages.
>
> -------------------------------------------------------
>
> [INFO]    make4ht-lib: parse_lg process file: reportsubsection1100.htm
> [WARNING] domfilter: DOM parsing of reportsubsection1100.htm failed:
> [WARNING] domfilter:
> ...ive/2023/texmf-dist/tex/luatex/luaxml/luaxml-mod-xml.lua:175: Incomplete
> XML Document [char=33675]
>
> [INFO]    make4ht-lib: parse_lg process file: reportsubsection1100.htm
> [WARNING] domfilter: DOM parsing of reportsubsection1100.htm failed:
> [WARNING] domfilter:
> ...ive/2023/texmf-dist/tex/luatex/luaxml/luaxml-mod-xml.lua:175: Incomplete
> XML Document [char=33675]
>
> [INFO]    make4ht-lib: parse_lg process file: reportsubsection1100.htm
>
> ----------------------------------
>
> I've just sent Michal a link to a complete, self-contained ZIP file
> (450 MB) with instructions on how to run it standalone, so that he can see
> these errors on his end.
>
>
> I tried this on the latest TeX Live 2023 on a new Linux installation.
>
> I will work with Michal to provide any additional information he needs from
> me, to hopefully find the cause of this problem.
>
> This happens only on this file. I think it may be due to the large size,
> since the LaTeX code is all generated by the same program and only this
> file gives this error.
>
> --Nasser
>
>
>
>
>
>     _______________________________________________________
>
> Reply to this item at:
>
>   <http://puszcza.gnu.org.ua/bugs/?618>
>
> _______________________________________________
>   Message sent via/by Puszcza
>   http://puszcza.gnu.org.ua/
>
>