MediaWiki master
MediaWiki\Parser\Sanitizer Class Reference

HTML sanitizer for MediaWiki.

Static Public Member Functions

static armorFrenchSpaces (string $text, string $space=' ')
 Armor French spaces with a replacement character.
 
static checkCss ( $value)
 Pick apart some CSS and check it for forbidden or unsafe structures.
 
static cleanUrl (string $url)
 
static decodeCharReferences (string $text)
 Decode any character references, numeric or named entities, in the text and return a UTF-8 string.
 
static decodeCharReferencesAndNormalize (string $text)
 Decode any character references, numeric or named entities, in the text and normalize the resulting string.
 
static decodeTagAttributes (string $text)
 Return an associative array of attribute names and values from a partial tag string.
 
static encodeAttribute (string $text)
 Encode an attribute value for HTML output.
 
static escapeClass (string $class)
 Given a value, escape it so that it can be used as a CSS class and return it.
 
static escapeHtmlAllowEntities (string $html)
 Given HTML input, escape with htmlspecialchars but un-escape entities.
 
static escapeIdForAttribute (string $id, int $mode=self::ID_PRIMARY)
 Given a section name or other user-generated or otherwise unsafe string, escapes it to be a valid HTML id attribute.
 
static escapeIdForExternalInterwiki (string $id)
 Given a section name or other user-generated or otherwise unsafe string, escapes it to be a valid URL fragment for external interwikis.
 
static escapeIdForLink (string $id)
 Given a section name or other user-generated or otherwise unsafe string, escapes it to be a valid URL fragment.
 
static fixTagAttributes (string $text, string $element, bool $sorted=false)
 Take a tag soup fragment listing an HTML element's attributes and normalize it to well-formed XML, discarding unwanted attributes.
 
static getRecognizedTagData (array $extratags=[], array $removetags=[])
 Return the various lists of recognized tags.
 
static hackDocType ()
 Hack up a private DOCTYPE with HTML's standard entity declarations.
 
static internalRemoveHtmlTags (string $text, ?callable $processCallback=null, $args=[], array $extratags=[], array $removetags=[])
 Cleans up HTML, removes dangerous tags and attributes, and removes HTML comments; BEWARE there may be unmatched HTML tags in the result.
 
static isReservedDataAttribute (string $attr)
 Given an attribute name, checks whether it is a reserved data attribute (such as data-mw-foo) which is unavailable to user-generated HTML so MediaWiki core and extension code can safely use it to communicate with frontend code.
 
static mergeAttributes (array $a, array $b)
 Merge two sets of HTML attributes.
 
static normalizeCharReferences (string $text)
 Ensure that any entities and character references are legal for XML and XHTML specifically.
 
static normalizeCss (string $value)
 Normalize CSS into a format we can easily search for hostile input.
 
static normalizeSectionNameWhitespace (string $section)
 Normalizes whitespace in a section name, such as might be returned by Parser::stripSectionName(), for use in the id's that are used for section links.
 
static removeHTMLcomments (string $text)
 Remove '<!--', '-->', and everything between.
 
static removeHTMLtags (string $text, ?callable $processCallback=null, $args=[], array $extratags=[], array $removetags=[])
 Cleans up HTML, removes dangerous tags and attributes, and removes HTML comments; BEWARE there may be unmatched HTML tags in the result.
 
static removeSomeTags (string $text, array $options=[])
 Cleans up HTML, removes dangerous tags and attributes, and removes HTML comments; the result will always be balanced and tidy HTML.
 
static safeEncodeAttribute (string $text)
 Encode an attribute value for HTML tags, with extra armoring against further wiki processing.
 
static safeEncodeTagAttributes (array $assoc_array)
 Build a partial tag string from an associative array of attribute names and values as returned by decodeTagAttributes.
 
static stripAllTags (string $html)
 Take a fragment of (potentially invalid) HTML and return a version with any tags removed, encoded as plain text.
 
static validateAttributes (array $attribs, array $allowed)
 Take an array of attribute names and values and normalize or discard illegal values.
 
static validateEmail (string $addr)
 Does a string look like an e-mail address?
 
static validateTagAttributes (array $attribs, string $element)
 Take an array of attribute names and values and normalize or discard illegal values for the given element type.
 

Public Attributes

const ID_FALLBACK = 1
 Tells escapeIdForAttribute() to encode the ID using the fallback encoding, or return false if no fallback is configured.
 
const ID_PRIMARY = 0
 Tells escapeIdForAttribute() to encode the ID using the wiki's primary encoding.
 

Detailed Description

HTML sanitizer for MediaWiki.

Definition at line 46 of file Sanitizer.php.

Member Function Documentation

◆ armorFrenchSpaces()

static MediaWiki\Parser\Sanitizer::armorFrenchSpaces ( string $text,
string $space = ' ' )
static

Armor French spaces with a replacement character.

Since
1.32
Parameters
string$textText to armor
string$spaceSpace character for the French spaces, defaults to ' '
Returns
string Armored text

Definition at line 854 of file Sanitizer.php.

◆ checkCss()

static MediaWiki\Parser\Sanitizer::checkCss ( $value)
static

Pick apart some CSS and check it for forbidden or unsafe structures.

Returns a sanitized string. This sanitized string will have character references and escape sequences decoded and comments stripped (unless it is itself one valid comment, in which case the value will be passed through). If the input is just too evil, only a comment complaining about evilness will be returned.

Currently, URL references, 'expression', and 'tps' are forbidden.

NOTE: Despite the fact that character references are decoded, the returned string may contain character references given certain clever input strings. These character references must be escaped before the return value is embedded in HTML.

Parameters
string$value
Returns
string

Definition at line 742 of file Sanitizer.php.

◆ cleanUrl()

static MediaWiki\Parser\Sanitizer::cleanUrl ( string $url)
static

Definition at line 1685 of file Sanitizer.php.

◆ decodeCharReferences()

static MediaWiki\Parser\Sanitizer::decodeCharReferences ( string $text)
static

Decode any character references, numeric or named entities, in the text and return a UTF-8 string.

Definition at line 1292 of file Sanitizer.php.

◆ decodeCharReferencesAndNormalize()

static MediaWiki\Parser\Sanitizer::decodeCharReferencesAndNormalize ( string $text)
static

Decode any character references, numeric or named entities, in the text and normalize the resulting string.

(T16952)

This is useful for page titles, not for text to be displayed; MediaWiki allows HTML entities to escape normalization as a feature.

Parameters
string$textAlready normalized, containing entities
Returns
string Still normalized, without entities

Definition at line 1310 of file Sanitizer.php.

◆ decodeTagAttributes()

static MediaWiki\Parser\Sanitizer::decodeTagAttributes ( string $text)
static

Return an associative array of attribute names and values from a partial tag string.

Attribute names are forced to lowercase, character references are decoded to UTF-8 text.
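
As a rough illustration of the behaviour described above (lowercased names, decoded character references), the following standalone PHP sketch parses a partial tag string. The function name parseAttributesSketch and the regular expression are assumptions made for this example; they are not taken from Sanitizer.php, and the real method may treat malformed input and valueless attributes differently.

```php
<?php
// Hypothetical helper for illustration only; not the MediaWiki implementation.
function parseAttributesSketch(string $text): array {
    $attribs = [];
    // A name, optionally followed by "=" and a double-quoted, single-quoted,
    // or unquoted value.
    $pattern = '/([A-Za-z_:][A-Za-z0-9_:.\-]*)'
        . '(?:\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'<>=]+)))?/';
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        // Only one of the three value groups can match; the others are
        // empty or absent, so concatenating them yields the value.
        $raw = ($m[2] ?? '') . ($m[3] ?? '') . ($m[4] ?? '');
        // Force the name to lowercase and decode character references.
        $attribs[strtolower($m[1])] =
            html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    }
    return $attribs;
}

$parsed = parseAttributesSketch('CLASS="a b" Title=\'R&amp;D\' hidden');
// $parsed is ['class' => 'a b', 'title' => 'R&D', 'hidden' => '']
```

In this sketch an attribute without a value maps to an empty string; that choice is an assumption of the example.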

Definition at line 1097 of file Sanitizer.php.

◆ encodeAttribute()

static MediaWiki\Parser\Sanitizer::encodeAttribute ( string $text)
static

Encode an attribute value for HTML output.

Parameters
string$text
Returns
string HTML-encoded text fragment

Definition at line 831 of file Sanitizer.php.

◆ escapeClass()

static MediaWiki\Parser\Sanitizer::escapeClass ( string $class)
static

Given a value, escape it so that it can be used as a CSS class and return it.

Todo
For extra validity, input should be validated UTF-8.
See also
https://www.w3.org/TR/CSS21/syndata.html Valid characters/format

Definition at line 1066 of file Sanitizer.php.

◆ escapeHtmlAllowEntities()

static MediaWiki\Parser\Sanitizer::escapeHtmlAllowEntities ( string $html)
static

Given HTML input, escape with htmlspecialchars but un-escape entities.

This allows (generally harmless) entities like &nbsp; to survive.

Parameters
string$htmlHTML to escape
Returns
string Escaped input
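
One way to picture the "escape but keep entities" idea is to decode entities first and escape afterwards, so that an existing entity is not double-escaped. The sketch below is a standalone approximation built on PHP's html_entity_decode() and htmlspecialchars(); the function name escapeKeepingEntitiesSketch is hypothetical and the exact flags used by Sanitizer.php are not shown in this page.

```php
<?php
// Hypothetical helper for illustration only; not the MediaWiki implementation.
function escapeKeepingEntitiesSketch(string $html): string {
    // Decode entities first, so "&amp;" becomes "&" and "&nbsp;" becomes U+00A0.
    $decoded = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Then escape the HTML special characters exactly once.
    return htmlspecialchars($decoded, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Plain htmlspecialchars() would turn "AT&amp;T" into "AT&amp;amp;T".
$escaped = escapeKeepingEntitiesSketch('<b>AT&amp;T</b>');
// $escaped is "&lt;b&gt;AT&amp;T&lt;/b&gt;"
```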

Definition at line 1083 of file Sanitizer.php.

◆ escapeIdForAttribute()

static MediaWiki\Parser\Sanitizer::escapeIdForAttribute ( string $id,
int $mode = self::ID_PRIMARY )
static

Given a section name or other user-generated or otherwise unsafe string, escapes it to be a valid HTML id attribute.

WARNING: The output of this function is not guaranteed to be HTML safe, so be sure to use proper escaping.

Parameters
string$idString to escape
int$modeOne of ID_* constants, specifying whether the primary or fallback encoding should be used.
Returns
string|false Escaped ID or false if fallback encoding is requested but it's not configured.
Since
1.30

Definition at line 923 of file Sanitizer.php.

◆ escapeIdForExternalInterwiki()

static MediaWiki\Parser\Sanitizer::escapeIdForExternalInterwiki ( string $id)
static

Given a section name or other user-generated or otherwise unsafe string, escapes it to be a valid URL fragment for external interwikis.

Parameters
string$idString to escape
Returns
string Escaped ID
Since
1.30

Definition at line 973 of file Sanitizer.php.

◆ escapeIdForLink()

static MediaWiki\Parser\Sanitizer::escapeIdForLink ( string $id)
static

Given a section name or other user-generated or otherwise unsafe string, escapes it to be a valid URL fragment.

WARNING: The output of this function is not guaranteed to be HTML safe, so be sure to use proper escaping.

Parameters
string$idString to escape
Returns
string Escaped ID
Since
1.30

Definition at line 950 of file Sanitizer.php.

◆ fixTagAttributes()

static MediaWiki\Parser\Sanitizer::fixTagAttributes ( string $text,
string $element,
bool $sorted = false )
static

Take a tag soup fragment listing an HTML element's attributes and normalize it to well-formed XML, discarding unwanted attributes.

Output is safe for further wikitext processing, with escaping of values that could trigger problems.

  • Normalizes attribute names to lowercase
  • Discards attributes not allowed for the given element
  • Turns broken or invalid entities into plaintext
  • Double-quotes all attribute values
  • Attributes without values are given the name as attribute
  • Double attributes are discarded
  • Unsafe style attributes are discarded
  • Prepends space if there are attributes.
  • (Optionally) Sorts attributes by name.
Parameters
string$text
string$element
bool$sortedWhether to sort the attributes (default: false)
Returns
string

Definition at line 809 of file Sanitizer.php.

◆ getRecognizedTagData()

static MediaWiki\Parser\Sanitizer::getRecognizedTagData ( array $extratags = [],
array $removetags = [] )
static

Return the various lists of recognized tags.

Parameters
string[]$extratagsFor any extra tags to include
string[]$removetagsFor any tags (default or extra) to exclude
Returns
array
Access: internal

Definition at line 155 of file Sanitizer.php.

◆ hackDocType()

static MediaWiki\Parser\Sanitizer::hackDocType ( )
static

Hack up a private DOCTYPE with HTML's standard entity declarations.

PHP 4 seemed to know these if you gave it an HTML doctype, but PHP 5.1 doesn't.

Use for passing XHTML fragments to PHP's XML parsing functions.

Deprecated
since 1.36; will be made private or removed in a future release.

Definition at line 1665 of file Sanitizer.php.

◆ internalRemoveHtmlTags()

static MediaWiki\Parser\Sanitizer::internalRemoveHtmlTags ( string $text,
?callable $processCallback = null,
$args = [],
array $extratags = [],
array $removetags = [] )
static

Cleans up HTML, removes dangerous tags and attributes, and removes HTML comments; BEWARE there may be unmatched HTML tags in the result.

Note
Callers are recommended to use ::removeSomeTags() instead of this method. Sanitizer::removeSomeTags() is safer and will always return well-formed HTML; however, it is significantly slower (especially for short strings where setup costs predominate). This method is for internal use by the legacy parser where we know the result will be cleaned up in a subsequent tidy pass.
Parameters
string$textOriginal string; see T268353 for why untainted.
callable | null$processCallbackCallback to do any variable or parameter replacements in HTML attribute values. This argument should be considered internal.
array | bool$argsArguments for the processing callback
array$extratagsFor any extra tags to include
array$removetagsFor any tags (default or extra) to exclude
Returns
string
Access: internal

Definition at line 307 of file Sanitizer.php.

◆ isReservedDataAttribute()

static MediaWiki\Parser\Sanitizer::isReservedDataAttribute ( string $attr)
static

Given an attribute name, checks whether it is a reserved data attribute (such as data-mw-foo) which is unavailable to user-generated HTML so MediaWiki core and extension code can safely use it to communicate with frontend code.

Parameters
string$attrAttribute name.
Returns
bool

Definition at line 633 of file Sanitizer.php.

◆ mergeAttributes()

static MediaWiki\Parser\Sanitizer::mergeAttributes ( array $a,
array $b )
static

Merge two sets of HTML attributes.

Conflicting items in the second set will override those in the first, except for 'class' attributes which will be combined (if they're both strings).

Todo
implement merging for other attributes such as style
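
The override-except-class rule can be sketched in a few lines of standalone PHP. The function name mergeAttributeSetsSketch is hypothetical, and removing duplicate class names is an assumption of this example; the description above only says the classes are combined.

```php
<?php
// Hypothetical helper for illustration only; not the MediaWiki implementation.
function mergeAttributeSetsSketch(array $a, array $b): array {
    // Later keys win: entries in $b override entries in $a.
    $out = array_merge($a, $b);
    // Exception: two string 'class' values are combined instead of overridden.
    if (isset($a['class'], $b['class'])
        && is_string($a['class']) && is_string($b['class'])
    ) {
        $classes = preg_split(
            '/\s+/', $a['class'] . ' ' . $b['class'], -1, PREG_SPLIT_NO_EMPTY
        );
        // Assumption: duplicate class names are dropped.
        $out['class'] = implode(' ', array_unique($classes));
    }
    return $out;
}

$merged = mergeAttributeSetsSketch(
    ['id' => 'x', 'class' => 'a b'],
    ['id' => 'y', 'class' => 'b c']
);
// $merged is ['id' => 'y', 'class' => 'a b c']
```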

Definition at line 651 of file Sanitizer.php.

◆ normalizeCharReferences()

static MediaWiki\Parser\Sanitizer::normalizeCharReferences ( string $text)
static

Ensure that any entities and character references are legal for XML and XHTML specifically.

Any stray bits will be &-escaped to result in a valid text fragment.

a. named char refs can only be &lt; &gt; &amp; &quot;, others are numericized (this way we're well-formed even without a DTD)
b. any numeric char refs must be legal chars, not invalid or forbidden
c. use lower cased "&#x", not "&#X"
d. fix or reject non-valid attributes

Access: internal

Definition at line 1200 of file Sanitizer.php.

◆ normalizeCss()

static MediaWiki\Parser\Sanitizer::normalizeCss ( string $value)
static

Normalize CSS into a format we can easily search for hostile input.

  • decode character references
  • decode escape sequences
  • remove comments, unless the entire value is one single comment
Parameters
string$valuethe css string
Returns
string normalized css
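
The three normalization steps listed above can be approximated as follows. This is a simplified standalone sketch under stated assumptions: the name normalizeCssSketch is hypothetical, only basic CSS escapes (a backslash followed by one to six hex digits, or by a single character) are decoded, invalid code points get no special handling, and each removed comment is replaced by a single space.

```php
<?php
// Hypothetical helper for illustration only; not the MediaWiki implementation.
function normalizeCssSketch(string $value): string {
    // Step 1: decode character references such as "&#101;" or "&amp;".
    $value = html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Step 2: decode CSS escape sequences.
    $value = preg_replace_callback(
        '/\\\\(?:([0-9A-Fa-f]{1,6})\s?|(.))/s',
        static function (array $m): string {
            if ($m[1] !== '') {
                // Hex escape: let html_entity_decode() build the UTF-8 character.
                return html_entity_decode(
                    '&#x' . $m[1] . ';', ENT_QUOTES | ENT_HTML5, 'UTF-8'
                );
            }
            // Any other escaped character stands for itself.
            return $m[2];
        },
        $value
    ) ?? $value;

    // Step 3: remove comments, unless the entire value is one single comment.
    if (!preg_match('!^\s*/\*[^*/]*\*/\s*$!', $value)) {
        $value = preg_replace('!/\*.*?\*/!s', ' ', $value) ?? $value;
    }
    return $value;
}

$hidden = normalizeCssSketch('e\\78 pression'); // input is: e\78 pression
// $hidden is "expression", which a later check can search for
```

Decoding before searching is what makes obfuscated input visible to a later check.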

Definition at line 672 of file Sanitizer.php.

◆ normalizeSectionNameWhitespace()

static MediaWiki\Parser\Sanitizer::normalizeSectionNameWhitespace ( string $section)
static

Normalizes whitespace in a section name, such as might be returned by Parser::stripSectionName(), for use in the id's that are used for section links.

Definition at line 1183 of file Sanitizer.php.

◆ removeHTMLcomments()

static MediaWiki\Parser\Sanitizer::removeHTMLcomments ( string $text)
static

Remove '<!--', '-->', and everything between.

To avoid leaving blank lines, when a comment is both preceded and followed by a newline (ignoring spaces), trim leading and trailing spaces and one of the newlines.
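
The blank-line rule can be illustrated with two regular-expression passes. This is only a sketch: the name stripHtmlCommentsSketch is hypothetical, the real method is not necessarily regex-based, and unclosed comments are left untouched here.

```php
<?php
// Hypothetical helper for illustration only; not the MediaWiki implementation.
function stripHtmlCommentsSketch(string $text): string {
    // A comment body: any characters that do not start a closing "-->".
    $body = '(?:(?!-->).)*';
    // Case 1: the comment sits alone between two newlines (spaces ignored).
    // Remove it together with the surrounding spaces and the leading newline.
    $text = preg_replace(
        '/\n *<!--' . $body . '--> *(?=\n)/s', '', $text
    ) ?? $text;
    // Case 2: every remaining closed comment is removed in place.
    return preg_replace('/<!--' . $body . '-->/s', '', $text) ?? $text;
}

$own_line = stripHtmlCommentsSketch("a\n <!-- c --> \nb");
// $own_line is "a\nb": no blank line is left behind
$inline = stripHtmlCommentsSketch("a <!-- c --> b");
// $inline is "a  b": surrounding spaces are kept for inline comments
```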

Definition at line 428 of file Sanitizer.php.

◆ removeHTMLtags()

static MediaWiki\Parser\Sanitizer::removeHTMLtags ( string $text,
?callable $processCallback = null,
$args = [],
array $extratags = [],
array $removetags = [] )
static

Cleans up HTML, removes dangerous tags and attributes, and removes HTML comments; BEWARE there may be unmatched HTML tags in the result.

Note
Callers are recommended to use ::removeSomeTags() instead of this method. Sanitizer::removeSomeTags() is safer and will always return well-formed HTML; however, it is significantly slower (especially for short strings where setup costs predominate). This method, although faster, should only be used where we know the result will be cleaned up in a subsequent tidy pass.
Parameters
string$textOriginal string; see T268353 for why untainted.
callable | null$processCallbackCallback to do any variable or parameter replacements in HTML attribute values. This argument should be considered internal.
array | bool$argsArguments for the processing callback
array$extratagsFor any extra tags to include
array$removetagsFor any tags (default or extra) to exclude
Returns
string
Deprecated
since 1.38. Use ::removeSomeTags(), which always gives balanced/tidy HTML.

Definition at line 270 of file Sanitizer.php.

◆ removeSomeTags()

static MediaWiki\Parser\Sanitizer::removeSomeTags ( string $text,
array $options = [] )
static

Cleans up HTML, removes dangerous tags and attributes, and removes HTML comments; the result will always be balanced and tidy HTML.

Parameters
string$textSource string; see T268353 for why untainted
array$optionsOptions controlling the cleanup:
  • string[] $options['extraTags'] Any extra tags to allow. (This property taints the whole array.)
  • string[] $options['removeTags'] Any tags (default or extra) to exclude.
  • callable(Attributes,...):Attributes $options['attrCallback'] Callback to do any variable or parameter replacements in HTML attribute values before further cleanup; should be considered internal and not for external use.
  • array $options['attrCallbackArgs'] Additional arguments for the attribute callback.
Returns
string The cleaned up HTML
Since
1.38
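
A call might look like the sketch below. The option keys and the method signature come from the description above; the tag names and the sample text are made up for illustration. The class_exists() guard is only there so the snippet also runs outside a MediaWiki installation, where the call is skipped.

```php
<?php
// Option keys as documented above; the tag names are hypothetical examples.
$options = [
    'extraTags' => ['custom'],  // allow one additional tag
    'removeTags' => ['span'],   // exclude a tag from the allowed set
];
$text = '<b>bold</b> <span>plain</span>';

$clean = null;
if (class_exists('MediaWiki\\Parser\\Sanitizer')) {
    // Only reached inside MediaWiki, where the class can be loaded.
    $clean = \MediaWiki\Parser\Sanitizer::removeSomeTags($text, $options);
}
```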

Definition at line 382 of file Sanitizer.php.

Referenced by MediaWiki\Parser\Parser\parse().

◆ safeEncodeAttribute()

static MediaWiki\Parser\Sanitizer::safeEncodeAttribute ( string $text)
static

Encode an attribute value for HTML tags, with extra armoring against further wiki processing.

Parameters
string$text
Returns
string HTML-encoded text fragment

Definition at line 875 of file Sanitizer.php.

◆ safeEncodeTagAttributes()

static MediaWiki\Parser\Sanitizer::safeEncodeTagAttributes ( array $assoc_array)
static

Build a partial tag string from an associative array of attribute names and values as returned by decodeTagAttributes.

Definition at line 1136 of file Sanitizer.php.

◆ stripAllTags()

static MediaWiki\Parser\Sanitizer::stripAllTags ( string $html)
static

Take a fragment of (potentially invalid) HTML and return a version with any tags removed, encoded as plain text.

Warning: this return value must be further escaped for literal inclusion in HTML output as of 1.10!

Parameters
string$htmlHTML fragment
Returns
string

Definition at line 1639 of file Sanitizer.php.

◆ validateAttributes()

static MediaWiki\Parser\Sanitizer::validateAttributes ( array $attribs,
array $allowed )
static

Take an array of attribute names and values and normalize or discard illegal values.

  • Discards attributes not on the given list
  • Unsafe style attributes are discarded
  • Invalid id attributes are re-encoded
Parameters
array$attribs
array$allowedList of allowed attribute names, as an associative array where keys give valid attribute names (since 1.34). Before 1.35, passing a sequential array of valid attribute names was permitted but that is now deprecated.
Returns
array
Todo

Check for legal values where the DTD limits things.

Check for unique id attribute :P
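
The associative form of $allowed lends itself to a key-based filter. The standalone sketch below shows only the first bullet (discarding attributes that are not on the list); the style and id handling is not reproduced, and the name filterAllowedAttributesSketch is hypothetical.

```php
<?php
// Hypothetical helper for illustration only; not the MediaWiki implementation.
function filterAllowedAttributesSketch(array $attribs, array $allowed): array {
    // Keep only the entries of $attribs whose key is also a key of $allowed.
    return array_intersect_key($attribs, $allowed);
}

// Build the allow-list as an associative array keyed by attribute name.
$allowed = array_fill_keys(['class', 'id', 'title'], true);
$attribs = ['class' => 'note', 'onclick' => 'evil()', 'title' => 'Hi'];

$kept = filterAllowedAttributesSketch($attribs, $allowed);
// $kept is ['class' => 'note', 'title' => 'Hi']
```

Keying the list by name makes each membership check a hash lookup rather than a scan of a sequential array.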

Definition at line 527 of file Sanitizer.php.

References wfDeprecated().

◆ validateEmail()

static MediaWiki\Parser\Sanitizer::validateEmail ( string $addr)
static

Does a string look like an e-mail address?

This validates an email address using the HTML5 specification found at http://www.whatwg.org/html/states-of-the-type-attribute.html#valid-e-mail-address, which as of 2011-01-24 says:

A valid e-mail address is a string that matches the ABNF production 1*( atext / "." ) "@" ldh-str *( "." ldh-str ) where atext is defined in RFC 5322 section 3.2.3, and ldh-str is defined in RFC 1034 section 3.5.

This function is an implementation of the specification as requested in T24449.

Client-side forms will use the same standard validation rules via JS or HTML 5 validation; additional restrictions can be enforced server-side by extensions via the 'isValidEmailAddr' hook.

Note that this validation doesn't 100% match RFC 2822, but is believed to be liberal enough for wide use. Some invalid addresses will still pass validation here.
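
The ABNF production quoted above translates fairly directly into a regular expression. The following standalone sketch is one possible reading of it, not a copy of Sanitizer.php: the name looksLikeEmailSketch is hypothetical, and ldh-str is simplified to a character class of letters, digits, and hyphens.

```php
<?php
// Hypothetical helper for illustration only; not the MediaWiki implementation.
function looksLikeEmailSketch(string $addr): bool {
    // atext from RFC 5322 section 3.2.3: letters, digits, and these specials.
    $atext = 'a-zA-Z0-9!#$%&\'*+\\-\\/=?^_`{|}~';
    // ldh-str from RFC 1034 section 3.5, simplified to a character class.
    $ldh = 'a-zA-Z0-9\\-';
    // 1*( atext / "." ) "@" ldh-str *( "." ldh-str )
    $pattern = '/^[' . $atext . '.]+@[' . $ldh . ']+(?:\\.[' . $ldh . ']+)*\\z/';
    return preg_match($pattern, $addr) === 1;
}

$ok = looksLikeEmailSketch('user@example.com');          // true
$liberal = looksLikeEmailSketch('user..name@example.com'); // true
$bad = looksLikeEmailSketch('a@b@c');                    // false
```

The second call shows the liberal side of the rule: consecutive dots in the local part pass, even though stricter RFCs reject them.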

Since
1.18
Parameters
string$addrE-mail address
Returns
bool

Definition at line 1794 of file Sanitizer.php.

◆ validateTagAttributes()

static MediaWiki\Parser\Sanitizer::validateTagAttributes ( array $attribs,
string $element )
static

Take an array of attribute names and values and normalize or discard illegal values for the given element type.

  • Discards attributes not allowed for the given element
  • Unsafe style attributes are discarded
  • Invalid id attributes are re-encoded

Todo

Check for legal values where the DTD limits things.

Check for unique id attribute :P

Definition at line 504 of file Sanitizer.php.

Referenced by MediaWiki\Parser\RemexRemoveTagHandler\startTag().

Member Data Documentation

◆ ID_FALLBACK

const MediaWiki\Parser\Sanitizer::ID_FALLBACK = 1

Tells escapeIdForAttribute() to encode the ID using the fallback encoding, or return false if no fallback is configured.

Since
1.30

Definition at line 90 of file Sanitizer.php.

◆ ID_PRIMARY

const MediaWiki\Parser\Sanitizer::ID_PRIMARY = 0

Tells escapeIdForAttribute() to encode the ID using the wiki's primary encoding.

Since
1.30

Definition at line 82 of file Sanitizer.php.


The documentation for this class was generated from the following file: