MediaWiki REL1_31
MediaWiki\Tidy\Balancer Class Reference

An implementation of the tree building portion of the HTML5 parsing spec.

Public Member Functions

 __construct (array $config=[])
 Create a new Balancer.
 
 balance ( $text, $processingCallback=null, $processingArgs=[])
 Return a balanced HTML string for the HTML fragment given by $text, subject to the caveats listed in the class description.
 

Public Attributes

const VALID_COMMENT_REGEX
 Valid HTML5 comments.
 

Private Member Functions

 advance ()
 Grab the next "token" from $bitsIterator.
 
 endCaption ()
 
 endCell ()
 
 endRow ()
 
 endSection ()
 
 inBodyMode ( $token, $value, $attribs=null, $selfClose=false)
 
 inCaptionMode ( $token, $value, $attribs=null, $selfClose=false)
 
 inCellMode ( $token, $value, $attribs=null, $selfClose=false)
 
 inColumnGroupMode ( $token, $value, $attribs=null, $selfClose=false)
 
 inHeadMode ( $token, $value, $attribs=null, $selfClose=false)
 
 inRowMode ( $token, $value, $attribs=null, $selfClose=false)
 
 inSelectInTableMode ( $token, $value, $attribs=null, $selfClose=false)
 
 inSelectMode ( $token, $value, $attribs=null, $selfClose=false)
 
 insertForeignToken ( $token, $value, $attribs=null, $selfClose=false)
 
 insertToken ( $token, $value, $attribs=null, $selfClose=false)
 Pass a token to the tree builder.
 
 inTableBodyMode ( $token, $value, $attribs=null, $selfClose=false)
 
 inTableMode ( $token, $value, $attribs=null, $selfClose=false)
 
 inTableTextMode ( $token, $value, $attribs=null, $selfClose=false)
 
 inTemplateMode ( $token, $value, $attribs=null, $selfClose=false)
 
 inTextMode ( $token, $value, $attribs=null, $selfClose=false)
 
 parseRawText ( $value, $attribs=null)
 
 resetInsertionMode ()
 
 stopParsing ()
 
 switchMode ( $mode)
 
 switchModeAndReprocess ( $mode, $token, $value, $attribs, $selfClose)
 

Private Attributes

BalanceActiveFormattingElements $afe
 
 $allowComments
 
 $allowedHtmlElements
 
Iterator $bitsIterator
 
 $config
 
 $formElementPointer
 
 $fragmentContext
 
 $ignoreLinefeed
 
 $inRAWTEXT
 
 $inRCDATA
 
 $originalInsertionMode
 
 $parseMode
 
 $pendingTableText
 
array $processingArgs
 
callable|null $processingCallback
 
BalanceStack $stack
 
 $strict
 
 $textIntegrationMode
 

Detailed Description

An implementation of the tree building portion of the HTML5 parsing spec.

This is used to balance and tidy output so that the result can always be cleanly serialized/deserialized by an HTML5 parser. It does not guarantee "conforming" output, because the HTML5 spec contains a number of constraints which are not enforced by the HTML5 parsing process. But the result will be free of gross errors (misnested or unclosed tags, for example) and will be unchanged by spec-compliant parsing followed by serialization.

The tree building stage is structured as a state machine. When comparing the implementation to https://www.w3.org/TR/html5/syntax.html#tree-construction note that each state is implemented as a function with a name ending in Mode (because the HTML spec refers to them as insertion modes). The current insertion mode is held by the $parseMode property.
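The dispatch pattern described above can be sketched in a few lines. This is a toy illustration in Python rather than the PHP implementation: the class name, the two modes, and the token shapes are all hypothetical, and storing the mode as a method name is an assumption made for the sketch.

```python
class TinyTreeBuilder:
    """Toy state machine: one method per insertion mode (hypothetical names)."""

    def __init__(self):
        # Plays the role of $parseMode: the name of the current mode method.
        self.parse_mode = "in_body_mode"
        self.log = []

    def switch_mode(self, mode):
        self.parse_mode = mode

    def insert_token(self, token, value):
        # Dispatch the token to whichever mode method is current.
        handler = getattr(self, self.parse_mode)
        handler(token, value)

    def in_body_mode(self, token, value):
        self.log.append(("body", token, value))
        if token == "tag" and value == "table":
            self.switch_mode("in_table_mode")

    def in_table_mode(self, token, value):
        self.log.append(("table", token, value))
        if token == "endtag" and value == "table":
            self.switch_mode("in_body_mode")


builder = TinyTreeBuilder()
for token, value in [("tag", "table"), ("text", "cell"), ("endtag", "table")]:
    builder.insert_token(token, value)
```

After the loop, the builder is back in its body mode, and the log records which mode handled each token.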

The following simplifications have been made:

  • We handle body content only (i.e., we start in body).
  • The document is never in "quirks mode".
  • All occurrences of < and > that are not part of a tag have been entity-escaped, so we can parse tags by simply splitting on those two characters. (This also simplifies the handling of < inside <textarea>.) The character < must not appear inside comments. Similarly, all attributes have been "cleaned" and are double-quoted and escaped.
  • All null characters are assumed to have been removed.
  • The following elements are disallowed: <html>, <head>, <body>, <frameset>, <frame>, <plaintext>, <xmp>, <iframe>, <noembed>, <noscript>, <script>, <title>. As a result, further simplifications can be made:

    • frameset-ok is not tracked.
    • head element pointer is not tracked (but presumed non-null)
    • Tokenizer has only a single mode. (<textarea> wants RCDATA and <style>/<noframes> want RAWTEXT modes which we only loosely emulate.)

    We generally mark places where we omit cases from the spec due to disallowed elements with a comment: // OMITTED: <element-name>.

    The HTML spec keeps a flag during the parsing process to track whether or not a "parse error" has been encountered. We don't bother to track that flag; we just implement the error-handling process as specified.
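The "split on < and >" simplification listed above can be illustrated with a short sketch. It is written in Python for illustration only and is not the class's tokenizer: the function name and token labels are hypothetical, attributes are left unparsed, and comments are assumed to end with a plain `-->`.

```python
def split_bits(text):
    """Split pre-escaped HTML into (kind, value) tokens.

    Assumes every '<' or '>' that is not part of a tag is already
    escaped as &lt; / &gt;, so a bare '<' always starts a tag or comment.
    """
    tokens = []
    first, *bits = text.split("<")
    if first:
        tokens.append(("text", first))
    for bit in bits:
        if bit.startswith("!--"):
            # Comment: everything up to the first '-->' is the comment body.
            body, _, trailing = bit[3:].partition("-->")
            tokens.append(("comment", body))
        else:
            # Tag: everything up to the first '>' is the tag, the rest is text.
            tag, _, trailing = bit.partition(">")
            if tag.startswith("/"):
                tokens.append(("endtag", tag[1:]))
            else:
                tokens.append(("tag", tag))  # name plus attributes, unparsed
        if trailing:
            tokens.append(("text", trailing))
    return tokens
```

Because escaped text such as `&lt;c` contains no literal `<`, it stays inside a text token and is never mistaken for a tag.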

Since
1.27
See also
https://html.spec.whatwg.org/multipage/syntax.html#tree-construction

Definition at line 1804 of file Balancer.php.

Constructor & Destructor Documentation

◆ __construct()

MediaWiki\Tidy\Balancer::__construct ( array $config = [])

Create a new Balancer.

Parameters
array $config Balancer configuration. Includes:
    'strict': boolean, defaults to false.
        When true, enforces syntactic constraints on input:
        all non-tag '<' must be escaped, and all attributes must be
        separated by a single space and double-quoted. This is
        consistent with the output of the Sanitizer.
    'allowedHtmlElements': array, defaults to null.
        When present, the keys of this associative array give
        the acceptable HTML tag names. When not present, no
        tag sanitization is done.
    'tidyCompat': boolean, defaults to false.
        When true, the serialization algorithm is tweaked to
        provide historical compatibility with the old "tidy"
        program: <p>-wrapping is done to the children of
        <body> and <blockquote> elements, and empty elements
        are removed. The <pre>/<listing>/<textarea> serialization
        is also tweaked to allow lossless round trips.
        (See: https://github.com/whatwg/html/issues/944)
    'allowComments': boolean, defaults to true.
        When true, allows HTML comments in the input.
        The Sanitizer generally strips all comments, so if you
        are running on sanitized output you can set this to
        false to get a bit more performance.
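
As a rough illustration of the documented defaults, the following Python sketch merges a caller's options over them and applies the "keys give the acceptable tag names" rule. The helper names `resolve_config` and `is_tag_allowed` are hypothetical and do not exist in the class; only the option names and default values come from the description above.

```python
# Defaults as documented for the $config parameter.
DEFAULTS = {
    "strict": False,
    "allowedHtmlElements": None,
    "tidyCompat": False,
    "allowComments": True,
}


def resolve_config(config=None):
    """Return the documented defaults overridden by the caller's options."""
    merged = dict(DEFAULTS)
    merged.update(config or {})
    return merged


def is_tag_allowed(name, config):
    """Check a tag name against the keys of 'allowedHtmlElements'."""
    allowed = config["allowedHtmlElements"]
    # None means no tag sanitization is done: every tag name passes.
    return allowed is None or name in allowed


config = resolve_config({"strict": True, "allowedHtmlElements": {"b": True, "i": True}})
```

With this `config`, `is_tag_allowed("b", config)` is true and `is_tag_allowed("script", config)` is false, while a config without 'allowedHtmlElements' lets every name through.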

Definition at line 1888 of file Balancer.php.

References MediaWiki\Tidy\Balancer\$config, MediaWiki\Tidy\BalanceSets\$unsupportedSet, and MediaWiki\Tidy\BalanceSets\HTML_NAMESPACE.

Member Function Documentation

◆ advance()

MediaWiki\Tidy\Balancer::advance ( )
private

Grab the next "token" from $bitsIterator.

This is either an open/close tag or text or a comment, depending on whether the Sanitizer approves.

Definition at line 2159 of file Balancer.php.

References $attribs, $t, MediaWiki\Tidy\Balancer\insertToken(), and list.

Referenced by MediaWiki\Tidy\Balancer\balance().

◆ balance()

MediaWiki\Tidy\Balancer::balance ( $text,
$processingCallback = null,
$processingArgs = [] )

Return a balanced HTML string for the HTML fragment given by $text, subject to the caveats listed in the class description.

The result will typically be idempotent – that is, rebalancing the output would result in no change.

Parameters
string        $text                The markup to be balanced
callable      $processingCallback  Callback to do any variable or parameter replacements in HTML attribute values
array | bool  $processingArgs      Arguments for the processing callback
Returns
string The balanced markup
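
To make "balanced" and "idempotent" concrete, here is a deliberately tiny stand-in written in Python. It is not the HTML5 tree-building algorithm and not this class's behavior: `toy_balance` is a hypothetical function that only handles attribute-free tags, treats every element as a container (no void elements), closes whatever is left open, and drops stray end tags.

```python
import re

# Attribute-free start and end tags only; a simplification for this sketch.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)>")


def toy_balance(text):
    """Close unclosed tags and drop unmatched end tags (toy version)."""
    out, stack, pos = [], [], 0
    for m in TAG_RE.finditer(text):
        out.append(text[pos:m.start()])  # text before this tag
        pos = m.end()
        closing, name = m.group(1), m.group(2).lower()
        if not closing:
            stack.append(name)
            out.append(f"<{name}>")
        elif name in stack:
            # Close everything opened after the matching start tag, then it.
            while True:
                top = stack.pop()
                out.append(f"</{top}>")
                if top == name:
                    break
        # Otherwise: a stray end tag, which is dropped.
    out.append(text[pos:])
    while stack:
        out.append(f"</{stack.pop()}>")  # close whatever is still open
    return "".join(out)


once = toy_balance("<b><i>x</b>y")
twice = toy_balance(once)
```

Here `once` is `<b><i>x</i></b>y`, and balancing it again returns the same string, which is the idempotence property described above.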

Definition at line 1932 of file Balancer.php.

References $e, MediaWiki\Tidy\Balancer\$processingArgs, MediaWiki\Tidy\Balancer\$processingCallback, $result, MediaWiki\Tidy\Balancer\advance(), MediaWiki\Tidy\BalanceSets\HTML_NAMESPACE, MediaWiki\Tidy\Balancer\insertToken(), and MediaWiki\Tidy\Balancer\resetInsertionMode().

◆ endCaption()

MediaWiki\Tidy\Balancer::endCaption ( )
private

◆ endCell()

MediaWiki\Tidy\Balancer::endCell ( )
private

Definition at line 3352 of file Balancer.php.

References MediaWiki\Tidy\Balancer\inCellMode().

Referenced by MediaWiki\Tidy\Balancer\inCellMode().

◆ endRow()

MediaWiki\Tidy\Balancer::endRow ( )
private

◆ endSection()

MediaWiki\Tidy\Balancer::endSection ( )
private

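◆ inBodyMode()

MediaWiki\Tidy\Balancer::inBodyMode ( $token,
$value,
$attribs = null,
$selfClose = false )
private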

◆ inCaptionMode()

MediaWiki\Tidy\Balancer::inCaptionMode ( $token,
$value,
$attribs = null,
$selfClose = false )
private

◆ inCellMode()

MediaWiki\Tidy\Balancer::inCellMode ( $token,
$value,
$attribs = null,
$selfClose = false )
private

◆ inColumnGroupMode()

MediaWiki\Tidy\Balancer::inColumnGroupMode ( $token,
$value,
$attribs = null,
$selfClose = false )
private

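◆ inHeadMode()

MediaWiki\Tidy\Balancer::inHeadMode ( $token,
$value,
$attribs = null,
$selfClose = false )
private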

◆ inRowMode()

MediaWiki\Tidy\Balancer::inRowMode ( $token,
$value,
$attribs = null,
$selfClose = false )
private

◆ inSelectInTableMode()

MediaWiki\Tidy\Balancer::inSelectInTableMode ( $token,
$value,
$attribs = null,
$selfClose = false )
private

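◆ inSelectMode()

MediaWiki\Tidy\Balancer::inSelectMode ( $token,
$value,
$attribs = null,
$selfClose = false )
private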

◆ insertForeignToken()

MediaWiki\Tidy\Balancer::insertForeignToken ( $token,
$value,
$attribs = null,
$selfClose = false )
private

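◆ insertToken()

MediaWiki\Tidy\Balancer::insertToken ( $token,
$value,
$attribs = null,
$selfClose = false )
private

Pass a token to the tree builder.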

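◆ inTableBodyMode()

MediaWiki\Tidy\Balancer::inTableBodyMode ( $token,
$value,
$attribs = null,
$selfClose = false )
private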

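◆ inTableMode()

MediaWiki\Tidy\Balancer::inTableMode ( $token,
$value,
$attribs = null,
$selfClose = false )
private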

◆ inTableTextMode()

MediaWiki\Tidy\Balancer::inTableTextMode ( $token,
$value,
$attribs = null,
$selfClose = false )
private

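◆ inTemplateMode()

MediaWiki\Tidy\Balancer::inTemplateMode ( $token,
$value,
$attribs = null,
$selfClose = false )
private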

◆ inTextMode()

MediaWiki\Tidy\Balancer::inTextMode ( $token,
$value,
$attribs = null,
$selfClose = false )
private

◆ parseRawText()

MediaWiki\Tidy\Balancer::parseRawText ( $value,
$attribs = null )
private

Definition at line 2349 of file Balancer.php.

References $attribs, $value, and MediaWiki\Tidy\Balancer\switchMode().

Referenced by MediaWiki\Tidy\Balancer\inHeadMode().

◆ resetInsertionMode()

MediaWiki\Tidy\Balancer::resetInsertionMode ( )
private

◆ stopParsing()

MediaWiki\Tidy\Balancer::stopParsing ( )
private

◆ switchMode()

MediaWiki\Tidy\Balancer::switchMode ( $mode )
private

◆ switchModeAndReprocess()

MediaWiki\Tidy\Balancer::switchModeAndReprocess ( $mode,
$token,
$value,
$attribs,
$selfClose )
private

Member Data Documentation

◆ $afe

BalanceActiveFormattingElements MediaWiki\Tidy\Balancer::$afe
private

Definition at line 1810 of file Balancer.php.

◆ $allowComments

MediaWiki\Tidy\Balancer::$allowComments
private

Definition at line 1814 of file Balancer.php.

◆ $allowedHtmlElements

MediaWiki\Tidy\Balancer::$allowedHtmlElements
private

Definition at line 1808 of file Balancer.php.

◆ $bitsIterator

Iterator MediaWiki\Tidy\Balancer::$bitsIterator
private

Definition at line 1807 of file Balancer.php.

◆ $config

MediaWiki\Tidy\Balancer::$config
private

Definition at line 1815 of file Balancer.php.

Referenced by MediaWiki\Tidy\Balancer\__construct().

◆ $formElementPointer

MediaWiki\Tidy\Balancer::$formElementPointer
private

Definition at line 1821 of file Balancer.php.

Referenced by MediaWiki\Tidy\Balancer\inBodyMode().

◆ $fragmentContext

MediaWiki\Tidy\Balancer::$fragmentContext
private

Definition at line 1820 of file Balancer.php.

Referenced by MediaWiki\Tidy\Balancer\resetInsertionMode().

◆ $ignoreLinefeed

MediaWiki\Tidy\Balancer::$ignoreLinefeed
private

Definition at line 1822 of file Balancer.php.

◆ $inRAWTEXT

MediaWiki\Tidy\Balancer::$inRAWTEXT
private

Definition at line 1824 of file Balancer.php.

◆ $inRCDATA

MediaWiki\Tidy\Balancer::$inRCDATA
private

Definition at line 1823 of file Balancer.php.

◆ $originalInsertionMode

MediaWiki\Tidy\Balancer::$originalInsertionMode
private

Definition at line 1819 of file Balancer.php.

◆ $parseMode

MediaWiki\Tidy\Balancer::$parseMode
private

◆ $pendingTableText

MediaWiki\Tidy\Balancer::$pendingTableText
private

Definition at line 1818 of file Balancer.php.

Referenced by MediaWiki\Tidy\Balancer\inTableTextMode().

◆ $processingArgs

array MediaWiki\Tidy\Balancer::$processingArgs
private

Definition at line 1829 of file Balancer.php.

Referenced by MediaWiki\Tidy\Balancer\balance().

◆ $processingCallback

callable|null MediaWiki\Tidy\Balancer::$processingCallback
private

Definition at line 1827 of file Balancer.php.

Referenced by MediaWiki\Tidy\Balancer\balance().

◆ $stack

BalanceStack MediaWiki\Tidy\Balancer::$stack
private

Definition at line 1812 of file Balancer.php.

◆ $strict

MediaWiki\Tidy\Balancer::$strict
private

Definition at line 1813 of file Balancer.php.

◆ $textIntegrationMode

MediaWiki\Tidy\Balancer::$textIntegrationMode
private

Definition at line 1817 of file Balancer.php.

◆ VALID_COMMENT_REGEX

const MediaWiki\Tidy\Balancer::VALID_COMMENT_REGEX
Initial value:
= "~ !--
    ( # 1. Comment match detector
        > | -> | # Invalid short close
        ( # 2. Comment contents
            (?:
                (?! --> )
                (?! --!> )
                (?! --! \z )
                (?! -- \z )
                (?! - \z )
                .
            )*+
        )
        ( # 3. Comment close
            --> | # Normal close
            --!> | # Comment end bang
            ( # 4. Indicate matches requiring EOF
                --! | # EOF in comment end bang state
                -- | # EOF in comment end state
                - | # EOF in comment end dash state
                (?#nothing) # EOF in comment state
            )
        )
    )
    ([^<]*) \z # 5. Non-tag text after the comment
    ~xs"

Valid HTML5 comments.

Regex borrowed from Tim Starling's "remex-html" project.

Definition at line 1835 of file Balancer.php.
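
To see what the five capture groups hold, the pattern can be ported to Python's `re` module. This is an adaptation, not the PHP constant: the `~` delimiters are dropped, `\z` is written as `\Z` (Python's end-of-string anchor), and the possessive `*+` is replaced with a plain greedy `*` so the sketch runs on older Python versions. It has not been checked against PCRE on every input. Matching from the start of the string with `match` is also an assumption of the sketch.

```python
import re

# Python port of the pattern above; see the adaptations noted in the text.
COMMENT_RE = re.compile(
    r"""!--
    (                     # 1. Comment match detector
        > | -> |          # Invalid short close
        (                 # 2. Comment contents
            (?:
                (?! --> )
                (?! --!> )
                (?! --! \Z )
                (?! -- \Z )
                (?! - \Z )
                .
            )*
        )
        (                 # 3. Comment close
            --> |         # Normal close
            --!> |        # Comment end bang
            (             # 4. Indicate matches requiring EOF
                --! | -- | - | (?#nothing)
            )
        )
    )
    ([^<]*) \Z            # 5. Non-tag text after the comment
    """,
    re.VERBOSE | re.DOTALL,
)

closed = COMMENT_RE.match("!-- hello -->after")
unclosed = COMMENT_RE.match("!-- unterminated")
```

For `closed`, group 2 is the comment body (` hello `), group 3 is `-->`, and group 5 is `after`. For `unclosed`, groups 3 and 4 are both empty strings, which marks a comment that only ends at the end of input.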


The documentation for this class was generated from the following file: