XML Entities, individual and grouped #183

Draft: wants to merge 11 commits into base: master

alfsb (Member) commented Nov 20, 2024

This PR creates a new doc-base/scripts/entities.ent file, which is called from configure.php but can also be run from the command line.

The new script starts by looking for global.ent, manual.ent and remove.ent in each doc-lang repository. Despite the .ent extension, these are normal XML files that use the same namespaces as the manual, so small entities placed here can be namespace clean(er).

The new script also looks for an entities/ dir in each doc-lang repository, and loads any .xml file found there as an individual entity file, so bigger entities get easier to edit and can now be revchecked individually.
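The directory scan described above can be sketched roughly as follows. This is a minimal illustration in Python, not the PR's PHP code: the function name `collect_entity_files` is hypothetical, the scan is assumed to be non-recursive, and the entity name is assumed to be the file name minus `.xml` (consistent with the callback.cmp.xml example later in this thread).

```python
from pathlib import Path


def collect_entity_files(lang_dir):
    """Map entity names to the .xml files under <lang_dir>/entities/.

    Assumption: the entity name is the file name without the trailing
    ".xml", so "callback.cmp.xml" yields the entity "callback.cmp".
    """
    entities = {}
    root = Path(lang_dir) / "entities"
    if not root.is_dir():
        # A language repository without an entities/ dir contributes nothing.
        return entities
    for path in sorted(root.glob("*.xml")):
        name = path.name[: -len(".xml")]
        entities[name] = path
    return entities
```

Because each entity lives in its own file, a revision checker can track it like any other manual file.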

Included are two other scripts, dtdent-conv.php and dtdent-split.php, that bulk convert (or split) big files of DTD Entities into XML Entities. These tools are not necessary for the implementation.

This will make entity experimentation a lot easier, and is the enabling step toward splitting the language-entities.ent file. It works well, but it is another possibly big change, so I do not plan to push for it until 2025, or until the PHP 8.4 doc changes slow down, unless there is some demand for early experimentation.

alfsb (Member Author) commented Nov 27, 2024

Added support for namespace-correct, bundled XML entity files.

So a line like this in global.ent

<entity name="link.composer"><link xlink:href="&url.pecl;">Composer</link></entity>

becomes this in doc-base/temp/entities.ent

<!ENTITY link.composer '<link xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="&url.pecl;">Composer</link>'>

and a file named doc-lang/entities/callback.cmp.xml with contents

<methodsynopsis>
 <type>int</type><methodname><replaceable>callback</replaceable></methodname>
 <methodparam><type>mixed</type><parameter>a</parameter></methodparam>
 <methodparam><type>mixed</type><parameter>b</parameter></methodparam>
</methodsynopsis>

becomes

<!ENTITY callback.cmp '<methodsynopsis xmlns="http://docbook.org/ns/docbook"><type>int</type><methodname><replaceable>callback</replaceable></methodname>
 <methodparam><type>mixed</type><parameter>a</parameter></methodparam>
 <methodparam><type>mixed</type><parameter>b</parameter></methodparam>
</methodsynopsis>'>
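The conversion shown above can be approximated with a few lines of string handling. This is a hedged sketch in Python rather than the PR's actual PHP code: `fragment_to_entity` and `NAMESPACES` are hypothetical names, only the two namespaces visible in the output above are handled, and whitespace is left untouched (the real output appears to trim some of it).

```python
import re

# Namespace URIs copied from the converted output shown above.
NAMESPACES = {
    "": "http://docbook.org/ns/docbook",
    "xlink": "http://www.w3.org/1999/xlink",
}


def fragment_to_entity(name, fragment):
    """Wrap an XML fragment in a DTD general entity declaration."""
    text = fragment.strip()
    match = re.match(r"<[A-Za-z_][\w.:-]*", text)
    if match is None:
        raise ValueError("fragment must start with an element")
    # Declare the default namespace, plus any known prefix the fragment uses,
    # on the root element so the entity is self-contained.
    decls = ""
    for prefix, uri in NAMESPACES.items():
        attr = f"xmlns:{prefix}" if prefix else "xmlns"
        used = prefix == "" or f"{prefix}:" in text
        if used and f"{attr}=" not in text:
            decls += f' {attr}="{uri}"'
    text = text[: match.end()] + decls + text[match.end():]
    # Inside an entity value "%" starts a parameter entity reference and "'"
    # would end the literal. Character references are expanded when the
    # declaration is read, so the replacement text gets the characters back.
    text = text.replace("%", "&#37;").replace("'", "&#39;")
    return f"<!ENTITY {name} '{text}'>"
```

General entity references such as `&url.pecl;` are deliberately left alone, since they are only expanded when the enclosing entity is itself referenced.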

Girgias (Member) left a comment

I had a very cursory glance at this, but you might want to rebase and squash some commits together?

alfsb (Member Author) commented Nov 28, 2024

Yes to squash and rebase. I just found out that manual building is not idempotent, and because of that, my regression testing so far was inadequate. Back to draft, until I can address all points above. I will ping #php-doc about the idempotent thing.

@alfsb alfsb marked this pull request as draft November 28, 2024 09:59
alfsb (Member Author) commented Nov 29, 2024

Some observations that I gathered while working on making the manual build idempotent. At a size of ~41 MB and with its current entity usage, the PHP manual is dangerously close to hitting some hard-coded libxml2 limits. Twice I changed the code to avoid the misleading error Detected an entity reference loop (there is no loop, only high entity usage). This could be a problem soon.

Specifically, it was necessary to keep the entity listings generated by file-entities.php as externally loaded files. This somewhat alleviates the problem for now, yet it slows down manual configure, because it requires generating (and loading) about a thousand more files.

Girgias (Member) commented Dec 3, 2024

Would changing the entities generated by file-entities.php to Process Instructions that are transformed into an XInclude (which takes a system path) fix the issue?

Yes this would require patching all of the docs, but wondering if this is a way forward as I was thinking of this already.

alfsb (Member Author) commented Dec 3, 2024

Would changing the entities generated by file-entities.php to Process Instructions that are transformed into an XInclude (which takes a system path) fix the issue?

Yes. Yet, a better solution would be having an option to change whatever metric libxml2 uses to identify high entity usage. The entity usage is huge in manuals, and this will bite someday.

Yes this would require patching all of the docs, but wondering if this is a way forward as I was thinking of this already.

In the end, I think directly replacing file entities with <xi:include href=""/> would work, and can be done automatically. This way, the entire file-entities.php and its entities cease to exist, and the manual becomes more XML and less DTD. I could explore this path, instead of rewriting file-entities.php, to make doc-base fully idempotent (same for doc-en, as file-entities.php writes a lot of temp files with .xml extension on each run...)

But, priorities. For now, I am focusing on XInclude/fallback, then XInclude by xml:id, then this infrastructure, then file-entities.php, then extricating the qaxmlsync sync tools from lib revcheck, and then, at last, changing qaxmlsync into something more translations could use.

Girgias (Member) commented Dec 3, 2024

Yes, obviously this is lower on the priority.
But thank you again for tackling those issues!

alfsb (Member Author) commented Dec 4, 2024

I'm thinking of opening an issue to keep track of these projects: a road map of the projects mentioned above, which would also document some bottlenecks found along the way, like the entity limit.

Girgias (Member) commented Dec 4, 2024

Feel free to do that :) it can be a meta issue like the doc tracking one, which you can update over time and split into individual issues if needed.

alfsb (Member Author) commented Dec 5, 2024

I will change my answer, after this comment.

Would changing the entities generated by file-entities.php to Process Instructions that are transformed into an XInclude (which takes a system path) fix the issue?

No. XInclude only "runs" by calling xinclude(). But xinclude_run_byid() requires that the entire file is loaded and that it runs before any other xinclude(). So file inclusion by XInclude and userland XInclude by xml:id are incompatible with each other. Having fully native XInclude 1.1 support might enable file inclusion by XInclude, but that is not even planned in libxml2.

Yes this would require patching all of the docs, but wondering if this is a way forward as I was thinking of this already.

Let me be clear about this. The PHP manuals are at breaking point as far as libxml2 is concerned. There are files it loads, and there are files it rejects. Full stop.

DITA DTDs have been unusable with libxml2 for several months, and there are other reports of files being rejected starting at ~40 MB in size. Looking ahead, the PHP community may need to ask for, contribute, or fund an "unlimited" option in libxml2, in a libxml2 version that it could use, compile and distribute (or building the manual outside the servers becomes impossible).

The linked fix for DITA only fixes half the problem (the size amplification one), but the PHP manual is already triggering another limit, the entity recursion level. This is the Detected an entity reference loop error that I commented on above.

Girgias mentioned this pull request Dec 5, 2024 (27 tasks)
alfsb closed this Dec 6, 2024
André L F S Bacci added 2 commits December 6, 2024 12:05
alfsb (Member Author) commented Dec 6, 2024

While doing rebase (and tests) efforts, I found an entity collision. Entity resource is defined in two places:

@alfsb alfsb reopened this Dec 6, 2024
Girgias (Member) commented Dec 6, 2024

While doing rebase (and tests) efforts, I found an entity collision. Entity resource is defined in two places:

* https://github.com/php/doc-base/blob/master/entities/global.ent#L590C65-L590C81

* https://github.com/php/doc-en/blob/master/language-snippets.ent#L799

Please remove the one in language-snippets.ent

alfsb (Member Author) commented Dec 6, 2024

Please remove the one in language-snippets.ent

I will do it Monday morning, after merging the small PRs.

alfsb (Member Author) commented Dec 9, 2024

Some other notes. I discovered only yesterday that there is a whole W3C recommendation for XML Fragments, and I'm surprised to see that the solution adopted here is the same as the one in said recommendation.

About replacing file entities with Process Instructions and/or XPointer, this might be possible. The problem is the bad interactions between entities and XInclude, so replacing file entities with PI/XI would need to exist as one of the two possible stacks below:

Without XInclude 1.1 native support

  1. First, recursively replace <xi:include href=""/> by userland code;
  2. Second, simulate XInclude 1.1 attribute copy by userland code (PR 198);
  3. Loop xinclude().

With XInclude 1.1 native support

  1. Just loop xinclude()

I think it is possible to create a userland XInclude by href, but I have not created a prototype yet to test whether the bad interaction can be overcome. The risk of succeeding here is that we may paint ourselves into an ugly corner of XML tooling in the end.
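For illustration only, the first step of the "without native support" stack (recursively replacing href includes in userland) could look like the sketch below. It uses Python's standard library instead of PHP/libxml2, `resolve_includes` and its `load` callback are hypothetical names, and it ignores xpointer, fallback and the entity problem entirely, so it is not the prototype mentioned above.

```python
import xml.etree.ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"


def resolve_includes(elem, load, depth=0, max_depth=10):
    """Replace each <xi:include href="..."/> below elem with the loaded tree.

    load(href) must return the parsed root element of the target file.
    """
    if depth > max_depth:
        raise ValueError("include nesting too deep (possible include loop)")
    for index, child in enumerate(list(elem)):
        if child.tag == XI_INCLUDE:
            href = child.get("href")
            if not href:
                continue  # xpointer-only includes are out of scope here
            included = load(href)
            # Included files may contain further includes of their own.
            resolve_includes(included, load, depth + 1, max_depth)
            included.tail = child.tail  # keep the text that followed the include
            elem[index] = included
        else:
            resolve_includes(child, load, depth, max_depth)
    return elem
```

A caller would pass something like `lambda href: ET.parse(href).getroot()` as `load`; keeping the loader separate makes it easy to add path resolution or caching.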

So the answer might be: if possible, change to an XML processor that does XInclude 1.1.

alfsb changed the title from "Infrastructure for individual entity files" to "XML Entities, individual and grouped" on Dec 9, 2024
alfsb (Member Author) commented Dec 9, 2024

Please remove the one in language-snippets.ent

Pushed a change to detect duplicated entity names in the first language loaded (so translations can detect internal duplications), and finally tested the inter-repository debug mode. Found two more duplicated entities between doc-base and doc-en, so there are three in total:

Expected global, replaced 1 times:     resource
Expected global, replaced 1 times:     foreach
Expected global, replaced 1 times:     yield
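A collision check of this kind can be sketched as below. This is a simplified illustration, not the detection code pushed in this PR: it scans DTD-style text for `<!ENTITY name ...>` declarations with a regular expression, and the names `entity_names` and `find_collisions` are hypothetical.

```python
import re

# Matches general entity declarations; parameter entities ("<!ENTITY % x")
# are skipped because "%" cannot start the captured name.
ENTITY_DECL = re.compile(r"<!ENTITY\s+([^\s%'\"]+)\s")


def entity_names(dtd_text):
    """Return the declared general entity names, in document order."""
    return ENTITY_DECL.findall(dtd_text)


def find_collisions(sources):
    """sources maps a label (e.g. a file path) to its DTD text.

    Returns a dict of entity name -> labels, for names declared more than once.
    """
    seen = {}
    for label, text in sources.items():
        for name in entity_names(text):
            seen.setdefault(name, []).append(label)
    return {name: labels for name, labels in seen.items() if len(labels) > 1}
```

Passing a single file's text also reports duplicates inside that one file, since each declaration is recorded separately.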

alfsb (Member Author) commented Dec 9, 2024

Remove the three duplicated entities from language-snippets.ent?

Meanwhile, this is waiting for idempotent to get merged (so regression tests get a little less random), but it's in good enough shape to be merged, if there is demand for experimentation while 8.4 changes are still high.

Girgias (Member) commented Dec 9, 2024

Remove the three duplicated entities from language-snippets.ent?

Meanwhile, this is waiting for idempotent to get merged (so regression tests get a little less random), but it's in good enough shape to be merged, if there is demand for experimentation while 8.4 changes are still high.

Yes, I think this is the best approach; those shouldn't be translated.

alfsb (Member Author) commented Dec 20, 2024

Would changing the entities generated by file-entities.php to Process Instructions that are transformed into an XInclude (which takes a system path) fix the issue?

Yes this would require patching all of the docs, but wondering if this is a way forward as I was thinking of this already.

And

So the answer might be: if possible, change to an XML processor that does XInclude 1.1.

After some tests today, my answer is that replacing file entities with anything else will only be possible by changing to an XML processor that does XInclude 1.1 and propagates entities between files, something that is not mandated by the standards.

The test. a.xml:

<!DOCTYPE a [<!ENTITY c "CC">]>
<a>
 <b>&c;</b>
 <b><xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="b.xml"/></b>
</a>

And b.xml:

<b>&c;</b>

That is, a.xml stands for the actual manual.xml, with 13k+ <!ENTITY> declarations, and b.xml is a "simply split" part of the manual. Then running xmllint --noout --xinclude a.xml results in

b.xml:1: parser error : Entity 'c' not defined
<b>&c;</b>
      ^
a.xml:6: element include: XInclude error : could not load b.xml, and no fallback was found

libxml fails to parse the included files. And that is because XIncludes are completely orthogonal to entities. XInclude focuses on copying "infosets" from one document to another.

Complete parsed infosets.

As b.xml is parsed completely separately from a.xml, b.xml will always fail to parse, and XInclude will also fail.
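The same effect can be reproduced without libxml2. The sketch below uses Python's standard-library parser (expat), so it only illustrates the general XML rule that a file parsed on its own cannot see entities declared in another document; `try_parse` is a hypothetical helper, and the strings mirror the a.xml and b.xml test above without the include.

```python
import xml.etree.ElementTree as ET

# a.xml declares the entity in its internal DTD subset.
A_XML = '<!DOCTYPE a [<!ENTITY c "CC">]>\n<a>\n <b>&c;</b>\n</a>'
# b.xml references the entity but declares nothing.
B_XML = "<b>&c;</b>"


def try_parse(text):
    """Return the serialized tree, or a short message if parsing fails."""
    try:
        return ET.tostring(ET.fromstring(text), encoding="unicode")
    except ET.ParseError as exc:
        return f"parse error: {exc}"


print(try_parse(A_XML))  # the entity is expanded inside <b>
print(try_parse(B_XML))  # fails: the entity is undefined in this document
```

Any XInclude step that parses b.xml as an independent document runs into the second case, whichever parser is used.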

alfsb (Member Author) commented Dec 28, 2024

Would changing the entities generated by file-entities.php to Process Instructions that are transformed into an XInclude (which takes a system path) fix the issue?

In the end, a simple LIBXML_PARSEHUGE may suffice to avoid the hard limit the manuals have already tripped and, in doing so, make it possible for file-entities.php to stop generating a thousand files per configure.

Yes this would require patching all of the docs, but wondering if this is a way forward as I was thinking of this already.

I found hacky ways to do controlled file loading/inclusion in userland code, by entity and XInclude, so in theory it is possible to get rid of file-entities.php completely, and/or replace file entity inclusions by XInclude syntax, gradually.

alfsb (Member Author) commented Jan 2, 2025

Conflict resolved. But the main question remains. Do the language manuals want to split language-snippets.ent and/or rework the Entity DTD files into these XML Entity bundles?
