textlib — Changing Wikitext#

Functions for manipulating wiki-text.

class textlib.Content(header, sections, footer)[source]#

Bases: NamedTuple

A namedtuple returned by extract_sections() holding page content.

Changed in version 8.2: _Content becomes a public class.

Create new instance of Content(header, sections, footer)

Parameters:
  • header (str)

  • sections (SectionList)

  • footer (str)

footer: str#

the page footer

header: str#

the page header

sections: SectionList[Section]#

the page sections

property title: str#

Return the first main title found on the page.

The first main title is anything enclosed within triple apostrophes (''', i.e. bold wikitext markup).

Added in version 8.2.

class textlib.GetDataHTML(*, keeptags=None, removetags=None)[source]#

Bases: HTMLParser

HTML parser that removes unwanted HTML elements and optionally comments.

Tags listed in keeptags are preserved. Tags listed in removetags are removed entirely along with their content. Optionally strips HTML comments. Use via the callable interface or in a with closing(...) block.

Note

The callable interface is preferred because it is simpler and handles resource management automatically. When using contextlib.closing() instead, be sure to read textdata inside the with block, because close() clears it.

text = ('<html><head><title>Test</title></head>'
        '<body><h1><!-- Parse --> me!</h1></body></html>')

parser = GetDataHTML(keeptags=['html'])
clean_text = parser(text)

Usage:

>>> text = ('<html><head><title>Test</title></head>'
...         '<body><h1><!-- Parse --> me!</h1></body></html>')
>>> GetDataHTML()(text)
'Test me!'
>>> GetDataHTML(keeptags=['title'])(text)
'<title>Test</title> me!'
>>> GetDataHTML(removetags=['body'])(text)
'Test'
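
The contextlib.closing() usage mentioned in the note is sketched below; it assumes the inherited HTMLParser.feed() method is used to supply the text:

from contextlib import closing

text = '<html><body><h1><!-- Parse --> me!</h1></body></html>'
with closing(GetDataHTML()) as parser:
    parser.feed(text)
    clean = parser.textdata  # read before close() clears it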

Caution

Tag names must be given in lowercase.

Changed in version 9.2: No longer a context manager

Changed in version 10.3: Now a public class. Added support for removal of tag contents.

Parameters:
  • keeptags (list[str] | None) – List of tag names to keep, including their content and markup. Defaults to ['tt', 'nowiki', 'small', 'sup'] if None.

  • removetags (list[str] | None) – List of tag names whose tags and content should be removed. The tags can be preserved if listed in keeptags. Defaults to ['style', 'script'] if None.

  • removecomments (bool) – Whether to remove HTML comments. Defaults to True.

Initialize default tags and internal state.

close()[source]#

Clean current processing and clear textdata.

Return type:

None

handle_data(data)[source]#

Handle plain text content found between tags.

Text is added to the output unless it is located inside a tag marked for removal.

Parameters:

data (str) – The text data between HTML tags.

Return type:

None

handle_endtag(tag)[source]#

Handle a closing HTML tag.

Tags listed in keeptags are preserved in the output. A closing tag that matches the currently skipped tag will end the skip block.

Parameters:

tag (str) – The name of the closing tag.

Return type:

None

handle_starttag(tag, attrs)[source]#

Handle an opening HTML tag.

Tags listed in keeptags are preserved in the output. Tags listed in removetags begin a skip block, and their content will be excluded from the output.

Changed in version 10.3: Keep tag attributes.

Parameters:
  • tag (str) – The tag name (e.g., “div”, “script”) converted to lowercase.

  • attrs (list[tuple[str, str | None]]) – A list of (name, value) pairs with tag attributes.

Return type:

None

textdata#

The cleaned output text collected during parsing.

class textlib.MultiTemplateMatchBuilder(site)[source]#

Bases: object

Build template matcher.

pattern(template, flags=re.DOTALL)[source]#

Return a compiled regex to match template.

search_any_predicate(templates)[source]#

Return a predicate that matches any template.
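
Example (a minimal sketch; the site, template names and wikitext below are placeholders):

import pywikibot
from pywikibot.textlib import MultiTemplateMatchBuilder

site = pywikibot.Site('wikipedia:en')
builder = MultiTemplateMatchBuilder(site)
page_text = 'Some text.{{Citation needed}}'

# compiled regex matching a single template
regex = builder.pattern('Citation needed')
match = regex.search(page_text)

# predicate which is truthy if any of the given templates occurs
predicate = builder.search_any_predicate(['Citation needed', 'Cn'])
if predicate(page_text):
    print('template found')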

class textlib.Section(title, content)[source]#

Bases: NamedTuple

A namedtuple as part of Content describing a page section.

Changed in version 8.2: _Section becomes a public class.

Create new instance of Section(title, content)

Parameters:
  • title (str)

  • content (str)

content: str#

section content

property heading: str#

Return the section title without equal signs.

Added in version 8.2.

property level: int#

Return the section level.

Added in version 8.2.

title: str#

section title including equal signs

class textlib.SectionList(iterable=(), /)[source]#

Bases: list

List of Section objects with heading/level-aware index().

Introduced for handling lists of sections with custom lookup by Section.heading and level.

Added in version 10.4.

count(value, /)[source]#

Count the number of sections matching the given value.

Parameters:

value (str | tuple[str, int] | Section) – The section heading string, a (heading, level) tuple, or a Section instance to search for.

Returns:

The number of matching sections.

Return type:

int

index(value, start=0, stop=9223372036854775807, /)[source]#

Return the index of a matching section.

Works like list.index(value, start, stop) but also allows:

  • value as a string → match by Section.heading (any level)

  • value as a (heading, level) tuple → match both heading and level

  • value as a Section object → normal list.index() behavior

Parameters:
  • value (str | tuple[str, int] | Section) – The item to search for: a str to search by section heading, a tuple[str, int] to search by heading and section level, or a Section to search for an exact section object.

  • start (int) – Index to start searching from (inclusive).

  • stop (int) – Index to stop searching at (exclusive).

Returns:

The integer index of the matching section.

Raises:

ValueError – If no matching section is found.

Return type:

int

class textlib.TimeStripper(site=None)[source]#

Bases: object

Find timestamp in page and return it as a pywikibot.Timestamp object.

Changed in version 8.0: group attribute is a set instead of a list. patterns is a TimeStripperPatterns namedtuple instead of a list.

Example:

>>> site = pywikibot.Site('wikipedia:fr')
>>> sign = 'Merci bien Xqt (d) 15 mai 2013 à 20:34 (CEST)'
>>> ts = TimeStripper(site)
>>> ts.timestripper(sign)
Timestamp(2013, 5, 15, 20, 34, tzinfo=TZoneFixedOffset(3600, Europe/Paris))

timestripper(line)[source]#

Find timestamp in line and convert it to time zone aware datetime.

All of the following items must be matched, otherwise None is returned: year, month, day, hour, minute and tzinfo.

Changed in version 7.6: HTML parts are removed from line

Returns:

A timestamp found on the given line

Parameters:

line (str)

Return type:

Timestamp | None

class textlib.TimeStripperPatterns(time, tzinfo, year, month, day)[source]#

Bases: NamedTuple

Hold precompiled timestamp patterns for TimeStripper.

Attribute order is important to avoid mismatch when searching.

Added in version 8.0.

Create new instance of TimeStripperPatterns(time, tzinfo, year, month, day)

Parameters:
  • time (Pattern[str])

  • tzinfo (Pattern[str])

  • year (Pattern[str])

  • month (Pattern[str])

  • day (Pattern[str])

day: Pattern[str]#

Alias for field number 4

month: Pattern[str]#

Alias for field number 3

time: Pattern[str]#

Alias for field number 0

tzinfo: Pattern[str]#

Alias for field number 1

year: Pattern[str]#

Alias for field number 2

textlib._create_default_regexes()[source]#

Fill (and possibly overwrite) _regex_cache with default regexes.

The following keys are provided: category, comment, file, header, hyperlink, interwiki, invoke, link, pagelist, property, startcolon, startspace, table, template.

Return type:

None

textlib.add_text(text, add, *, site=None)[source]#

Add text to a page content above categories and interwiki.

Added in version 6.4.

Parameters:
  • text (str) – The page content to add text to.

  • add (str) – Text to add.

  • site (pywikibot.Site) – The site that the text is coming from, required to reorder the categories and interlanguage links. The default site is used otherwise.

Return type:

str
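
Example (a minimal sketch; the resulting placement is shown approximately):

import pywikibot
from pywikibot.textlib import add_text

site = pywikibot.Site('wikipedia:en')
text = 'Some text.\n\n[[Category:Example]]'
new_text = add_text(text, '{{stub}}', site=site)
# '{{stub}}' is placed above the category link, roughly:
# 'Some text.\n\n{{stub}}\n\n[[Category:Example]]'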

textlib.case_escape(case, string, *, underscore=False)[source]#

Return an escaped regex pattern which depends on ‘first-letter’ case.

Added in version 7.0.

Changed in version 8.4: Added the optional underscore parameter.

Parameters:
  • case (str) – if case is ‘first-letter’, the regex contains an inline re.IGNORECASE flag for the first letter

  • underscore (bool) – if True, expand the regex to detect spaces and underscores which are interchangeable and collapsible

  • string (str)

Return type:

str
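
Example (a sketch of how the returned pattern behaves; the exact pattern text is an implementation detail):

import re
from pywikibot.textlib import case_escape

pattern = case_escape('first-letter', 'Main page', underscore=True)
# the first letter matches either case; spaces and underscores
# are interchangeable
assert re.match(pattern, 'main page')
assert re.match(pattern, 'Main_page')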

textlib.categoryFormat(categories, insite=None)[source]#

Return a string containing links to all categories in a list.

Parameters:
  • categories (iterable) – A list of Category or Page objects or strings which can be either the raw name, [[Category:..]] or [[cat_localised_ns:…]].

  • insite (pywikibot.Site) – Used to localise the category namespace.

Returns:

String of categories

Return type:

str
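
Example (a minimal sketch; the exact joining of the links depends on the site):

import pywikibot
from pywikibot.textlib import categoryFormat

site = pywikibot.Site('wikipedia:en')
cats = ['Software', '[[Category:Programming]]']
print(categoryFormat(cats, insite=site))
# roughly:
# [[Category:Software]]
# [[Category:Programming]]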

textlib.compileLinkR(withoutBracketed=False, onlyBracketed=False)[source]#

Return a regex that matches external links.

Parameters:
  • withoutBracketed (bool)

  • onlyBracketed (bool)

textlib.does_text_contain_section(pagetext, section)[source]#

Determine whether the page text contains the given section title.

It does not care whether the section string contains spaces or underscores; both will match.

If the section parameter contains an internal link, it will match the section with or without a preceding colon, which is required for a text link, e.g. for categories and files.

Parameters:
  • pagetext (str) – The wikitext of a page

  • section (str) – a section of a page including wikitext markups

Return type:

bool
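
Example (a minimal sketch, assuming the section is given as its plain heading title):

from pywikibot.textlib import does_text_contain_section

pagetext = '== History of this ==\nSome text.'
does_text_contain_section(pagetext, 'History of this')  # True
does_text_contain_section(pagetext, 'History_of_this')  # True as well
does_text_contain_section(pagetext, 'Usage')            # False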

textlib.expandmarker(text, marker='', separator='')[source]#

Return the marker expanded to include the separator and whitespace before it.

It searches for the first occurrence of the marker and gets the combination of the separator and whitespace directly before it.

Parameters:
  • text (str) – the text which will be searched.

  • marker (str) – the marker to be searched.

  • separator (str) – the separator string allowed before the marker. If empty, whitespace is not included either.

Returns:

the marker with the separator and whitespace from the text in front of it. It’ll be just the marker if the separator is empty.

Return type:

str
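
Example (a minimal sketch with a comment used as marker and a dot separator):

from pywikibot.textlib import expandmarker

text = 'Some text. <!--marker-->'
expandmarker(text, marker='<!--marker-->', separator='.')
# returns '. <!--marker-->': the separator and the whitespace
# in front of the marker are included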

textlib.extract_sections(text, site=None)[source]#

Return section headings and contents found in text.

The returned namedtuple Content contains the text parsed into header, sections and footer parts. The main title found in the header, which is the first text enclosed in ''' like '''page title''', can be given by the title property.

The header part is a string containing the text above the first heading.

The sections part is a list of Section namedtuples, each containing a string with the section title (including equal signs) and a string with the section content. The section heading (the title without equal signs) is given by the heading property, and the section level, i.e. the number of equal signs around the heading, by the level property.

The footer part is a string containing the text after the last section.

Examples:

>>> text = """
... '''this''' is a Python module.
...
... == History of this ==
... This set of principles was posted in 1999...
...
... == Usage of this ==
... Enter "import this" for usage...
...
... === Details ===
... The Zen of Python...
...
... [[Category:Programming principles]]
... """
>>> site = pywikibot.Site('wikipedia:en')
>>> result = extract_sections(text, site)
>>> result.header.strip()
"'''this''' is a Python module."
>>> result.sections[0].title
'== History of this =='
>>> result.sections[1].content.strip()
'Enter "import this" for usage...'
>>> 'Details' in result.sections
True
>>> ('Details', 2) in result.sections
False
>>> result.sections.index('Details')
2
>>> result.sections.index(('Details', 2))
Traceback (most recent call last):
...
ValueError: ('Details', 2) not found in Section headings/levels
>>> result.sections[2].heading
'Details'
>>> result.sections[2].level
3
>>> result.footer.strip()
'[[Category:Programming principles]]'
>>> result.title
'this'

Note

Sections and text from templates are not extracted but embedded as plain text.

Added in version 3.0.

Changed in version 8.2: The Content and Section classes have additional properties.

Changed in version 10.4: Added custom index(), count() and in operator support for Content.sections.

Returns:

The parsed namedtuple.

Parameters:
  • text (str)

  • site (BaseSite | None)

Return type:

Content

textlib.extract_templates_and_params(text, remove_disabled_parts=False, strip=False)[source]#

Return a list of templates found in text.

Return value is a list of tuples. There is one tuple for each use of a template in the page, with the template title as the first entry and a dict of parameters as the second entry. Parameters are indexed by strings; as in MediaWiki, an unnamed parameter is given a parameter name with an integer value corresponding to its position among the unnamed parameters, and if this results in multiple parameters with the same name, only the last value provided will be returned.

This uses the package mwparserfromhell or wikitextparser as MediaWiki markup parser. mwparserfromhell is installed by default.

There are minor differences between the two implementations.

Both parser packages preserve whitespace in parameter names and values.

If there are multiple numbered parameters in the wikitext for the same position, MediaWiki will only use the last parameter value, e.g. {{a| foo | 2 <!-- --> = bar | baz }} is {{a|1=foo|2=baz}}. To replicate that behaviour, enable both the remove_disabled_parts and strip parameters.

Parameters:
  • text (str) – The wikitext from which templates are extracted

  • remove_disabled_parts (bool) – If enabled, remove disabled wikitext such as comments and pre.

  • strip (bool) – If enabled, strip arguments and values of templates.

Returns:

list of template name and params

Return type:

list[tuple[str, OrderedDict[str, str]]]

Changed in version 6.1: wikitextparser package is supported; either wikitextparser or mwparserfromhell is strictly recommended.
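
Example (a minimal sketch; the return value is shown approximately):

from pywikibot.textlib import extract_templates_and_params

wikitext = '{{Infobox | name = Foo | bar }}'
extract_templates_and_params(wikitext, strip=True)
# roughly: [('Infobox', OrderedDict([('name', 'Foo'), ('1', 'bar')]))]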

textlib.extract_templates_and_params_regex_simple(text)[source]#

Extract top-level templates with params using only a simple regex.

This function uses only a single regex, and returns an entry for each template called at the top-level of the wikitext. Nested templates are included in the argument values of the top-level template.

This method will incorrectly split arguments when an argument value contains a ‘|’, such as {{template|a={{b|c}} }}.

Parameters:

text (str) – The wikitext from which templates are extracted

Returns:

list of template name and params

Return type:

list of tuple of name and OrderedDict

textlib.findmarker(text, startwith='@@', append=None)[source]#

Find a string which is not part of text.

Parameters:
  • text (str)

  • startwith (str)

  • append (str | None)

Return type:

str
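
Example (a minimal sketch):

from pywikibot.textlib import findmarker

findmarker('no marker here')      # '@@'
findmarker('text containing @@')  # marker is extended until it is unique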

textlib.getCategoryLinks(text, site=None, include=None, expand_text=False)[source]#

Return a list of category links found in text.

Parameters:
  • include (list[str] | None) – list of tags which should not be removed by removeDisabledParts() and where CategoryLinks can be searched.

  • text (str)

  • expand_text (bool)

Returns:

all category links found

Return type:

list[Category]

textlib.getLanguageLinks(text, insite=None, template_subpage=False)[source]#

Return a dict of inter-language links found in text.

The returned dict uses the site as keys and Page objects as values. It does not contain its own site.

Do not call this routine directly, use page.BasePage.interwiki() method instead.

Parameters:
  • text (str)

  • template_subpage (bool)

Return type:

dict[BaseSite, Page]

textlib.get_regexes(keys, site=None)[source]#

Fetch compiled regexes.

Changed in version 8.2: _get_regexes becomes a public function. keys may be a single string; site is optional.

Parameters:
  • keys (str | Iterable[str]) – a single key or an iterable of keys whose regex pattern should be given

  • site (BaseSite | None) – a BaseSite object needed for category, file, interwiki, invoke and property keys

Raises:

ValueError – site is None although a site-dependent key was given.

Return type:

list[Pattern[str]]
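
Example (a minimal sketch using a site-independent key):

from pywikibot.textlib import get_regexes

comment_regex = get_regexes('comment')[0]
comment_regex.sub('', 'foo <!-- hidden --> bar')  # 'foo  bar'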

textlib.glue_template_and_params(template_and_params)[source]#

Return wiki text of template glued from params.

You can use items from extract_templates_and_params here to get an equivalent template wiki text (it may happen that the order of the params changes).

Return type:

str
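
Example (a minimal round trip with extract_templates_and_params(); the output is shown approximately):

from pywikibot.textlib import (
    extract_templates_and_params,
    glue_template_and_params,
)

pairs = extract_templates_and_params('{{Infobox|name=Foo}}')
glue_template_and_params(pairs[0])
# roughly '{{Infobox\n|name=Foo\n}}'; the parameter order may change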

textlib.ignore_case(string)[source]#

Return a case-insensitive pattern for the string.

Changed in version 7.2: _ignore_case becomes a public function

Parameters:

string (str)

Return type:

str
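
Example (a minimal sketch; the exact pattern text is an implementation detail):

import re
from pywikibot.textlib import ignore_case

pattern = ignore_case('nowiki')  # e.g. something like '[nN][oO][wW][iI][kK][iI]'
bool(re.match(pattern, 'NoWiki'))  # True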

textlib.interwikiFormat(links, insite=None)[source]#

Convert interwiki link dict into a wikitext string.

Parameters:
  • links (dict) – interwiki links to be formatted; a dict with the Site objects as keys and Page or Link objects as values.

  • insite (BaseSite) – site the interwiki links will be formatted for (defaulting to the current site).

Returns:

string including wiki links formatted for inclusion in insite

Return type:

str

textlib.interwikiSort(sites, insite=None)[source]#

Sort sites according to local interwiki sort logic.

textlib.isDisabled(text, index, tags=None)[source]#

Return True if text[index] is disabled, e.g. by a comment or nowiki tag.

For the tags parameter, see removeDisabledParts.

Parameters:
  • text (str)

  • index (int)

Return type:

bool
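
Example (a minimal sketch):

from pywikibot.textlib import isDisabled

text = 'foo <!-- bar --> baz'
isDisabled(text, text.index('bar'))  # True: inside an HTML comment
isDisabled(text, text.index('baz'))  # False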

textlib.reformat_ISBNs(text, match_func)[source]#

Reformat ISBNs.

Parameters:
  • text (str) – text containing ISBNs

  • match_func (callable) – function to reformat matched ISBNs

Returns:

reformatted text

Return type:

str
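
Example (a minimal sketch; match_func receives a regex match object for each ISBN found, and the reformatting shown is an arbitrary choice):

from pywikibot.textlib import reformat_ISBNs

def match_func(match):
    """Replace spaces with hyphens in the matched ISBN."""
    return match.group().replace(' ', '-')

reformat_ISBNs('ISBN 978 0 471 48648 0', match_func)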

textlib.removeCategoryLinks(text, site=None, marker='')[source]#

Return text with all category links removed.

Parameters:
  • text (str) – The text that needs to be modified.

  • site (pywikibot.Site) – The site that the text is coming from.

  • marker (str) – If defined, marker is placed after the last category link, or at the end of text if there are no category links.

Returns:

The modified text.

Return type:

str

textlib.removeCategoryLinksAndSeparator(text, site=None, marker='', separator='')[source]#

Return text with category links and preceding separators removed.

Parameters:
  • text (str) – The text that needs to be modified.

  • site (pywikibot.Site) – The site that the text is coming from.

  • marker (str) – If defined, marker is placed after the last category link, or at the end of text if there are no category links.

  • separator (str) – The separator string that will be removed if followed by the category links.

Returns:

The modified text

Return type:

str

textlib.removeDisabledParts(text, tags=None, include=None, site=None)[source]#

Return text without portions where wiki markup is disabled.

Parts that will be removed by default are:

  • HTML comments

  • nowiki tags

  • pre tags

  • includeonly tags

  • source and syntaxhighlight tags

Changed in version 7.0: the order of removals will correspond to the tags argument if provided as an ordered collection (list, tuple)

Parameters:
  • tags (Iterable | None) – The exact set of parts which should be removed using keywords from get_regexes().

  • include (Container | None) – Alternatively, default parts that shall not be removed.

  • site (BaseSite | None) – Site to be used for site-dependent regexes. Default disabled parts listed above do not need it.

  • text (str)

Returns:

text stripped of disabled parts.

Return type:

str
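
Example (a minimal sketch with the default parts):

from pywikibot.textlib import removeDisabledParts

text = 'a <!-- comment --> b <nowiki>[[c]]</nowiki> d'
removeDisabledParts(text)  # 'a  b  d'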

textlib.removeHTMLParts(text, keeptags=None, *, removetags=None)[source]#

Remove selected HTML tags, their content, and comments from text.

This function removes HTML tags and their contents for tags listed in removetags. Tags specified in keeptags are preserved along with their content and markup. This is a wrapper around the GetDataHTML parser class.

Example:

>>> remove = removeHTMLParts
>>> remove('<div><b><ref><tt>Hi all!</tt></ref></b></div>')
'<tt>Hi all!</tt>'
>>> remove('<style><b>This is stylish</b></style>', keeptags=['style'])
'<style></style>'
>>> remove('<a>Note:</a> <b>This is important!<!-- really? --></b>')
'Note: This is important!'
>>> remove('<a>Note:</a> <b>This is important!</b>', removetags=['a'])
' This is important!'

Caution

Tag names must be given in lowercase.

Changed in version 10.3: The removetags parameter was added. Refactored to use GetDataHTML and its __call__ method. tag attributes will be kept.

Parameters:
  • text (str) – The input HTML text to clean.

  • keeptags (list[str] | None) – List of tag names to keep, including their content and markup. Defaults to ['tt', 'nowiki', 'small', 'sup'] if None.

  • removetags (list[str] | None) – List of tag names whose tags and content should be removed. The tags can be preserved if listed in keeptags. Defaults to ['style', 'script'] if None.

Returns:

The cleaned text with specified HTML parts removed.

Return type:

str

textlib.removeLanguageLinks(text, site=None, marker='')[source]#

Return text with all inter-language links removed.

If a link to an unknown language is encountered, a warning is printed.

Parameters:
  • text (str) – The text that needs to be modified.

  • site (pywikibot.Site) – The site that the text is coming from.

  • marker (str) – If defined, marker is placed after the last language link, or at the end of text if there are no language links.

Returns:

The modified text.

Return type:

str

textlib.removeLanguageLinksAndSeparator(text, site=None, marker='', separator='')[source]#

Return text with inter-language links and preceding separators removed.

If a link to an unknown language is encountered, a warning is printed.

Parameters:
  • text (str) – The text that needs to be modified.

  • site (pywikibot.Site) – The site that the text is coming from.

  • marker (str) – If defined, marker is placed after the last language link, or at the end of text if there are no language links.

  • separator (str) – The separator string that will be removed if followed by the language links.

Returns:

The modified text

Return type:

str

textlib.replaceCategoryInPlace(oldtext, oldcat, newcat, site=None, add_only=False)[source]#

Replace old category with new one and return the modified text.

Parameters:
  • oldtext – The text containing the category to be replaced

  • oldcat – pywikibot.Category object of the old category

  • newcat – pywikibot.Category object of the new category

  • add_only (bool) – If add_only is True, the old category won’t be replaced and the category given will be added after it.

Returns:

the modified text

Return type:

str

textlib.replaceCategoryLinks(oldtext, new, site=None, add_only=False)[source]#

Replace all existing category links with new category links.

Changed in version 8.0: addOnly was renamed to add_only.

Parameters:
  • oldtext (str) – The text that needs to be replaced.

  • new (Iterable) – Should be a list of Category objects or strings which can be either the raw name or [[Category:..]].

  • site (BaseSite | None) – The site that the text is from.

  • add_only (bool) – If add_only is True, the old category won’t be deleted and the category(s) given will be added (and they won’t replace anything).

Returns:

The modified text.

Return type:

str

textlib.replaceExcept(text, old, new, exceptions, caseInsensitive=False, allowoverlap=False, marker='', site=None, count=0)[source]#

Return text with old replaced by new, ignoring specified text types.

Skip occurrences of old within exceptions, e.g. within nowiki tags or HTML comments. If caseInsensitive is true, use case-insensitive regex matching. If allowoverlap is true, overlapping occurrences are all replaced.

Caution

Watch out when using allowoverlap, it might lead to infinite loops!

Parameters:
  • text (str) – text to be modified

  • old (str | Pattern[str]) – a compiled or uncompiled regular expression

  • new (str | Callable[[Match[str]], str]) – a string (which can contain regular expression references), or a function which takes a match object as parameter. See parameter repl of re.sub().

  • exceptions (Sequence[str | Pattern[str]]) – a list of strings or already compiled regex objects which signal what to leave out. List of strings might be like ['math', 'table', 'template'] for example.

  • marker (str) – a string that will be added to the last replacement; if nothing is changed, it is added at the end

  • count (int) – how many replacements to do at most. See parameter count of re.sub().

  • caseInsensitive (bool)

  • allowoverlap (bool)

  • site (BaseSite | None)

Return type:

str
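
Example (a minimal sketch which skips nowiki content):

from pywikibot.textlib import replaceExcept

text = 'foo <nowiki>foo</nowiki> foo'
replaceExcept(text, 'foo', 'bar', ['nowiki'])
# 'bar <nowiki>foo</nowiki> bar'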

textlib.replaceLanguageLinks(oldtext, new, site=None, add_only=False, template=False, template_subpage=False)[source]#

Replace inter-language links in the text with a new set of links.

Changed in version 8.0: addOnly was renamed to add_only.

Parameters:
  • oldtext (str) – The text that needs to be modified.

  • new (Mapping[BaseSite, Page | Link]) – A dict with the Site objects as keys, and Page or Link objects as values (i.e., just like the dict returned by getLanguageLinks() function).

  • site (BaseSite | None) – The site that the text is from.

  • add_only (bool) – If True, do not remove old language links, only add new ones.

  • template (bool) – Indicates if text belongs to a template page or not.

  • template_subpage (bool) – Indicates if text belongs to a template sub-page or not.

Returns:

The modified text.

Return type:

str

textlib.replace_links(text, replace, site)[source]#

Replace wikilinks selectively.

The text is searched for links, and each link is replaced depending on the result of the replace parameter for that link. If the result is None, the link is skipped. If it is False, the link is unlinked and just the label is inserted. If it is a Link instance, its target, section and label are used. If it is a Page instance, only the target is taken from the replacement; the section and label are kept from the original link.

If the result is a string and the replacement was given as a sequence, the string is converted into a Page instance. If the replacement is done via a callable, a returned string is used like unlinking and directly replaces the link.

If either the section or the label should be kept, the replacement can be a function which returns a Link instance and copies the values which should remain.

Changed in version 7.0: site parameter is mandatory

Parameters:
  • text (str) – the text in which to replace links

  • replace (sequence of pywikibot.Page/pywikibot.Link/str or callable) – either a callable which behaves as described above. The callable must accept four parameters link, text, groups and rng and allows for user interaction. groups is a dict containing ‘title’, ‘section’, ‘label’ and ‘linktrail’, and rng is the start and end position of the link. The ‘label’ in groups contains everything after the first pipe, which might include additional data used in the File namespace, for example. Alternatively, it can be a sequence of two items where the first must be a Link or Page and the second has almost the same meaning as the result of the callable. It is converted into a callable which applies the second value from the sequence whenever the first item (the Link or Page) equals the found link.

  • site (BaseSite) – a Site object to use. It should match the origin or target site of the text

Raises:
  • TypeError – missing positional argument ‘site’

  • ValueError – Wrong site type

  • ValueError – Wrong replacement number

  • ValueError – Wrong replacement types

Return type:

str
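
Example (a minimal sketch of the callable form; it unlinks links to one placeholder title and skips all others):

import pywikibot
from pywikibot.textlib import replace_links

site = pywikibot.Site('wikipedia:en')

def unlink_foo(link, text, groups, rng):
    """Unlink [[Foo]], skip every other link."""
    if link.title == 'Foo':
        return False  # unlink: keep only the label
    return None       # skip this link

replace_links('See [[Foo|bar]] and [[Baz]].', unlink_foo, site)
# roughly: 'See bar and [[Baz]].'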

textlib.to_ascii_digits(phrase, langs=None)[source]#

Change non-ASCII digits to ASCII digits.

Added in version 7.0.

Changed in version 10.3: this function was renamed from to_latin_digits.

Parameters:
  • phrase (str) – The phrase to convert to ASCII digits.

  • langs (Sequence[str] | str | None) – Language codes. If langs parameter is None, use all known languages to convert.

Returns:

The string with ascii digits

Return type:

str

textlib.to_local_digits(phrase, lang)[source]#

Change ASCII digits based on language to localized version.

Attention

Be aware that this function only works for several languages, and that it returns an unchanged string if an unsupported language is given.

Changed in version 7.5: always return a string even if phrase is an int.

Parameters:
  • phrase (str | int) – The phrase to convert to localized digits

  • lang (str) – language code

Returns:

The localized version

Return type:

str
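
Example (a minimal sketch of both directions, using Persian digits):

from pywikibot.textlib import to_ascii_digits, to_local_digits

to_local_digits(1234, 'fa')    # '۱۲۳۴'
to_ascii_digits('۱۲۳۴', 'fa')  # '1234'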