Parsoid
A bidirectional parser between wikitext and HTML5
Wikimedia\Parsoid\Wt2Html\DOMPostProcessor Class Reference

Perform post-processing steps on an already-built HTML DOM. More...


Public Member Functions

 __construct (Env $env, array $options=[], string $stageId="", ?PipelineStage $prevStage=null)
 
 registerProcessors (?array $processors)
 
 getDefaultProcessors ()
 
 setSourceOffsets (SourceRange $so)
 Set the source offsets for the content being processed by this pipeline. This matters when a substring of the top-level page is being processed in its own pipeline. It ensures that all source offsets assigned to tokens and DOM nodes in this stage are relative to the top-level page.
Parameters
SourceRange $so

 
 resetState (array $options)
 Resets any internal state for this pipeline stage. This is usually called so that a cached pipeline can be reused.
Parameters
array $options

 
 addMetaData (Env $env, Document $document)
 FIXME: consider moving to DOMUtils or Env.
 
 doPostProcess (Node $node)
 
 process ( $node, array $opts=null)
 
 processChunkily ( $input, ?array $options)
 Process wikitext, an array of tokens, or a DOM document depending on what pipeline stage this is. This method will either directly or indirectly implement a generator that parses the input in chunks and yields output in chunks as well. Implementations that don't consume tokens (ex: Tokenizer, DOMPostProcessor) will provide specialized implementations that handle their input type.
Parameters
string | array | Document $input
?array $options
  • atTopLevel: (bool) Whether we are processing the top-level document
  • sol: (bool) Whether input should be processed in start-of-line context
Returns
Generator

 
- Public Member Functions inherited from Wikimedia\Parsoid\Wt2Html\PipelineStage
 __construct (Env $env, ?PipelineStage $prevStage=null)
 
 setPipelineId (int $id)
 
 getPipelineId ()
 
 getEnv ()
 
 addTransformer (TokenHandler $t)
 Register a token transformer.
 
 setFrame (Frame $frame)
 Set frame on this pipeline stage.
 
 process ( $input, ?array $options=null)
 Process wikitext, an array of tokens, or a DOM document depending on what pipeline stage this is.
 

Additional Inherited Members

- Protected Attributes inherited from Wikimedia\Parsoid\Wt2Html\PipelineStage
 $prevStage
 
 $pipelineId = -1
 
 $env = null
 
 $atTopLevel
 
 $frame
 

Detailed Description

Perform post-processing steps on an already-built HTML DOM.

Constructor & Destructor Documentation

◆ __construct()

Wikimedia\Parsoid\Wt2Html\DOMPostProcessor::__construct ( Env $env, array $options = [], string $stageId = "", ?PipelineStage $prevStage = null )
Parameters
Env $env
array $options
string $stageId
?PipelineStage $prevStage

Member Function Documentation

◆ addMetaData()

Wikimedia\Parsoid\Wt2Html\DOMPostProcessor::addMetaData ( Env $env, Document $document )

FIXME: consider moving to DOMUtils or Env.

Parameters
Env $env
Document $document

FIXME: The JS side has a bunch of other checks here
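
The kind of work addMetaData() performs can be illustrated with PHP's DOM extension. This is a hypothetical sketch, not Parsoid's implementation; addCharsetMeta is an invented helper name:

```php
<?php
// Illustrative sketch (not Parsoid code): attach a piece of metadata to
// an already-built document, the kind of operation addMetaData() does
// on the Parsoid output document.
function addCharsetMeta(DOMDocument $doc): void {
    $head = $doc->getElementsByTagName('head')->item(0);
    $meta = $doc->createElement('meta');
    $meta->setAttribute('charset', 'utf-8');
    $head->appendChild($meta);
}

$doc = new DOMDocument();
$doc->loadHTML('<html><head></head><body></body></html>');
addCharsetMeta($doc);
```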

◆ doPostProcess()

Wikimedia\Parsoid\Wt2Html\DOMPostProcessor::doPostProcess ( Node $node )
Parameters
Node $node

◆ getDefaultProcessors()

Wikimedia\Parsoid\Wt2Html\DOMPostProcessor::getDefaultProcessors ( )
Returns
array

FIXME: There are two potential ordering problems here.

  1. unpackDOMFragment should always run immediately before these extensionPostProcessors, which we do currently. This ensures packed content gets processed correctly by extensions before additional transformations are run on the DOM.

This ordering issue is handled through documentation.

  2. This has existed all along, in the PHP parser as well as in Parsoid (which is probably how the ref-in-ref hack works: because of how parser functions and extension tags are processed, #tag:ref doesn't see a nested ref anymore). This patch only exposes that problem more clearly with the unpackOutput property.
  • Consider the set of extensions that (a) process wikitext, (b) provide an extensionPostProcessor, and (c) run the extensionPostProcessor only on the top-level document. As of today, there is exactly one extension (Cite) that has all three properties, so the problem below is speculative for now. But it could become a real problem in the future.
  • Let us say there are at least two of them, E1 and E2 that support extension tags <e1> and <e2> respectively.
  • Let us say in an instance of <e1> on the page, <e2> is present and in another instance of <e2> on the page, <e1> is present.
  • In what order should E1's and E2's extensionPostProcessors be run on the top-level? Depending on what these handlers do, you could get potentially different results. You can see this quite starkly with the unpackOutput flag.
  • The ideal solution to this problem is to require that every extension's extensionPostProcessor be idempotent, which lets us run these post-processors repeatedly until the DOM stabilizes. But this still doesn't guarantee that ordering doesn't matter; it only guarantees that, with the unpackOutput flag set to false and multiple extensions present, all sealed fragments get fully processed. So we still need to worry about ordering.

    But idempotence could be a sufficient property in most cases. To see this, suppose there is a Footnotes extension similar to the Cite extension, in that both extract inline content from the page source into a separate section of the output and leave behind pointers to that global section in the output DOM. The Cite and Footnotes post-processors would each walk the DOM and move any remaining inline content into their global sections until done. So even if a <footnote> has a <ref> and a <ref> has a <footnote>, we ultimately end up with all footnote content in the footnotes section and all ref content in the references section, and the DOM stabilizes. Ordering is irrelevant here.

    So, perhaps one way of catching these problems is in code review: analyze what the DOM post-processor does and see whether it introduces potential ordering issues.
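
The stabilization idea above can be sketched as a fixpoint loop. This is a hypothetical illustration, not Parsoid API: post-processors are modeled as callables over a string, and runUntilStable is an invented name:

```php
<?php
// Hypothetical sketch: if every extension's extensionPostProcessor is
// idempotent, we can run them all repeatedly until the DOM (modeled
// here as a string) stabilizes, regardless of registration order.
function runUntilStable(array $postProcessors, string $dom, int $maxPasses = 10): string {
    for ($pass = 0; $pass < $maxPasses; $pass++) {
        $before = $dom;
        foreach ($postProcessors as $pp) {
            $dom = $pp($dom);
        }
        if ($dom === $before) {
            break; // fixpoint reached: another pass would be a no-op
        }
    }
    return $dom;
}
```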

◆ processChunkily()

Wikimedia\Parsoid\Wt2Html\DOMPostProcessor::processChunkily ( $input, ?array $options )

Process wikitext, an array of tokens, or a DOM document depending on what pipeline stage this is. This method will either directly or indirectly implement a generator that parses the input in chunks and yields output in chunks as well. Implementations that don't consume tokens (ex: Tokenizer, DOMPostProcessor) will provide specialized implementations that handle their input type.

Parameters
string | array | Document $input
?array $options
  • atTopLevel: (bool) Whether we are processing the top-level document
  • sol: (bool) Whether input should be processed in start-of-line context
Returns
Generator

Reimplemented from Wikimedia\Parsoid\Wt2Html\PipelineStage.
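
The generator contract described above can be sketched as follows. This is an illustrative example only: the real method operates on a DOM Node, while here an array of tokens is chunked, and both yieldInChunks and the 'chunkSize' option are invented for the demonstration:

```php
<?php
// Illustrative sketch only: a generator that consumes input in chunks
// and yields output in chunks, in the spirit of processChunkily().
function yieldInChunks(array $input, ?array $options = null): Generator {
    $chunkSize = $options['chunkSize'] ?? 2; // invented option
    foreach (array_chunk($input, $chunkSize) as $chunk) {
        // A real pipeline stage would transform each chunk before yielding.
        yield $chunk;
    }
}
```

Because the caller drives the generator, downstream stages can start consuming output before the whole input has been processed.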

◆ registerProcessors()

Wikimedia\Parsoid\Wt2Html\DOMPostProcessor::registerProcessors ( ?array $processors )
Parameters
?array $processors

◆ resetState()

Wikimedia\Parsoid\Wt2Html\DOMPostProcessor::resetState ( array $options )

Resets any internal state for this pipeline stage. This is usually called so that a cached pipeline can be reused.

Parameters
array $options

Reimplemented from Wikimedia\Parsoid\Wt2Html\PipelineStage.
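
Why a cached stage needs this reset can be shown with a minimal sketch. CountingStage is an invented class, not part of Parsoid; it stands in for any stage that accumulates per-document state:

```php
<?php
// Hypothetical sketch of why resetState() matters: a cached stage keeps
// per-document state that must be cleared before the stage is reused.
class CountingStage {
    private int $nodesSeen = 0;

    public function resetState(array $options): void {
        $this->nodesSeen = 0; // drop state left over from the previous run
    }

    public function process(array $nodes): int {
        $this->nodesSeen += count($nodes);
        return $this->nodesSeen;
    }
}
```

Without the resetState() call between documents, counts (or offsets, caches, and similar state) from the previous run would leak into the next one.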

◆ setSourceOffsets()

Wikimedia\Parsoid\Wt2Html\DOMPostProcessor::setSourceOffsets ( SourceRange $so )

Set the source offsets for the content being processed by this pipeline. This matters when a substring of the top-level page is being processed in its own pipeline. It ensures that all source offsets assigned to tokens and DOM nodes in this stage are relative to the top-level page.

Parameters
SourceRange $so

Reimplemented from Wikimedia\Parsoid\Wt2Html\PipelineStage.
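
The offset arithmetic this enables can be sketched in a few lines. toTopLevelOffsets is an invented helper, not Parsoid API; it only illustrates the shift from substring-relative to page-relative offsets:

```php
<?php
// Hypothetical sketch: offsets measured inside a substring are shifted
// by the substring's start position in the top-level page, so that all
// recorded offsets are relative to the top-level page.
function toTopLevelOffsets(int $substringStart, array $localOffsets): array {
    return array_map(fn(int $o) => $substringStart + $o, $localOffsets);
}
```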


The documentation for this class was generated from the following file: