Parsoid
A bidirectional parser between wikitext and HTML5
Design Document

Wikitext constructs split across top-level content and templates

There are two ways in which Parsoid trips on templates that cause a wikitext construct to be split across top-level-content and templates.

  1. In one form, the first part of the construct swallows the other part into its attributes, along with the tokens/content that follow it. The fix is to separate the two and move the swallowed content back out to the top level (T48811 is an example of this).
  2. In another form, the construct is split between two adjacent tokens. The fix is to bring the content together across tokens / DOM nodes (T69857, T69850, T52603, T46498).

Both these forms affect tables. The first form is seen when a template is used in the table-attribute (and possibly table-row-attribute) position. The second form is seen when a template is used in the table-cell-attribute position. We see these two different manifestations of the same problem because of the peculiarities of how the table-opening tag and table-cell tags are tokenized and what the tokenizer can assume about transclusions.

The second form is addressed by the code in Wt2Html/PP/Handlers/TableFixups.php, where the DOM is examined to bring together the split pieces. However, so far, it only supports templates that generate the attributes of a table cell and a single following table cell. We now need to generalize this to handle templates that generate multiple table cells, which covers the last of the big unsupported scenarios.

Example wikitext for this scenario:

{|
|{{convert| 400|m|ft|disp=table|sortable=on}}
|}

This affects enwiki:List_of_largest_container_ships (among possibly other pages). Generalizing the code in TableFixups.php should probably take care of this.
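
As a rough illustration of what generalizing to multiple cells involves, the sketch below splits the expanded output of a template used in cell-attribute position into (attributes, content) pairs, one per cell. It is a toy model in Python, not the logic in TableFixups.php: the name `split_cells` and the sample expansion string are hypothetical, and it ignores pipes inside links, nested templates, and extension tags, which a real implementation would have to account for.

```python
import re

# Hypothetical helper, not part of Parsoid. A new cell is assumed to start at
# "||" on the same line or at a "|" that begins a new line.
CELL_BOUNDARY = re.compile(r"\|\||\n\|")


def split_cells(expanded):
    """Split template output that follows a cell-opening '|' into
    (attributes, content) pairs, one pair per table cell."""
    cells = []
    for raw in CELL_BOUNDARY.split(expanded):
        attrs, sep, content = raw.partition("|")
        if not sep:
            # No '|' inside this piece: it is all content, with no attributes.
            attrs, content = "", raw
        cells.append((attrs.strip(), content.strip()))
    return cells


# Hypothetical expansion of a template that emits two cells.
sample = 'align="right"|400\n|align="right"|1,300'
cells = split_cells(sample)
```

With this sample, `cells` holds two pairs, so a generalized fixup would need to produce two table cells from one template instead of one.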

The longer-term fix for both these issues is to start scoping the output of templates so that they return DOM-representable strings, and to split some of these monolithic templates into multiple templates: one that generates just the attributes and another that returns just the additional content. This would also improve the WYSIWYG editability of some of these tables.

Mark up cite errors in embedded content

It's a feature of named refs that we only know at the time of inserting the references list whether they have content or not, and therefore whether they are in error. The initial strategy was to keep pointers to all named ref nodes so that if an error does occur, we can mark them up.

The problem with embedded content is that, at the time when we find out about the errors, it's been serialized and stored, and so any pointers we might have kept around are no longer live or relevant. We need to go back and process all that embedded content again to find where the refs with errors are hiding.

We slightly optimize that by keeping a map of all the errors for refs in embedded content so that only one pass is necessary, rather than one for each references list. It also helps that, in the common case, this pass won't need to run at all, since there won't be any errors in embedded content.
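
The bookkeeping described here can be sketched as follows. This is a minimal Python model under stated assumptions, not Parsoid's implementation: the class `EmbeddedRefErrors`, the error label `missing-content`, and the representation of embedded content as lists of ref dictionaries are all hypothetical.

```python
from collections import defaultdict


class EmbeddedRefErrors:
    """Collects errors for refs that live in embedded content (hypothetical)."""

    def __init__(self):
        # ref id -> error labels recorded while handling references lists
        self._errors = defaultdict(list)

    def record(self, ref_id, error):
        self._errors[ref_id].append(error)

    def apply(self, fragments):
        """Mark up refs in all embedded fragments in a single pass.

        Returns the number of refs that were marked."""
        if not self._errors:
            # Common case: no errors in embedded content, so skip the pass.
            return 0
        marked = 0
        for fragment in fragments:
            for ref in fragment:
                errors = self._errors.get(ref["id"])
                if errors:
                    ref["errors"] = list(errors)
                    marked += 1
        return marked


# Two embedded fragments, modeled as lists of ref records.
fragments = [[{"id": "ref-a"}, {"id": "ref-b"}], [{"id": "ref-c"}]]
tracker = EmbeddedRefErrors()
tracker.record("ref-b", "missing-content")
tracker.record("ref-c", "missing-content")
marked = tracker.apply(fragments)
```

Because errors from every references list accumulate in one map, `apply` only has to walk the embedded content once, and it returns immediately when the map is empty.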

Redefinitions in the face of nested refs are ambiguous / undefined behaviour

A redefinition looks like,

<ref name="name">123</ref> <ref name="name">345</ref>

The latter definition will result in an error.
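
A minimal sketch of this check, assuming that a redefinition is flagged only when the new content differs from the first definition. The class `RefRegistry` and the `"redefinition"` label are hypothetical and are not Parsoid or Cite identifiers.

```python
class RefRegistry:
    """Tracks the first content seen for each named ref (hypothetical)."""

    def __init__(self):
        self._contents = {}

    def define(self, name, content):
        """Return None if the definition is accepted, or an error label
        when the name was already defined with different content."""
        if name not in self._contents:
            self._contents[name] = content
            return None
        if self._contents[name] != content:
            return "redefinition"
        return None


registry = RefRegistry()
first_result = registry.define("name", "123")
second_result = registry.define("name", "345")
```

Here the first definition is accepted and the second, whose content differs, is reported as a redefinition.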

In the case of nested refs, we might have,

{{#tag:ref|123 <ref>haha</ref>|name="name"}} {{#tag:ref|123 <ref>haha</ref>|name="name"}}

If you go by the wikitext, those definitions sort of look the same and one might assume it shouldn't generate an error.

However, if you go by the HTML rendering of the content, because of linkback ids and whatnot, those will never be the same and will presumably always generate an error in the legacy parser.
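
The effect of linkback ids can be illustrated with a toy renderer. This is a sketch under assumptions, not what either parser does: the `render` function and the `cite_ref-N` id format are hypothetical, and the only point is that each nested ref receives a fresh id.

```python
import itertools
import re

NESTED_REF = re.compile(r"<ref>(.*?)</ref>")


def render(wikitext, ids):
    """Replace each nested <ref> with a marker carrying a fresh linkback id."""
    return NESTED_REF.sub(
        lambda m: '<sup id="cite_ref-%d"></sup>' % next(ids), wikitext
    )


ids = itertools.count(1)  # one id sequence shared across the whole page
source = "123 <ref>haha</ref>"
first = render(source, ids)
second = render(source, ids)
```

The two inputs are identical wikitext, yet `first` and `second` differ in their ids, so a comparison of rendered content reports that the contents differ.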

On the Parsoid side, we're doing a comparison that looks like,

...

vs

which is always going to be a redefinition error / contents differ.