`proofreadpage` — ProofreadPage Extension#

Objects used with ProofreadPage Extension.

OCR support of page scans via:

Wikimedia OCR, see: https://www.mediawiki.org/wiki/Help:Extension:Wikisource/Wikimedia_OCR
https://ocr.wmcloud.org/, inspired by https://wikisource.org/wiki/MediaWiki:GoogleOCR.js

class proofreadpage.PagesTagParser(text='<pages />')#

Parse text and extract the first <pages ... /> tag. Individual attributes are accessible with dot notation.

>>> tp = PagesTagParser('<pages />')
>>> tp
PagesTagParser('<pages />')

>>> tp = PagesTagParser(
... 'Text: <pages index="Index.pdf" from="first" to="last" />')
>>> tp
PagesTagParser('<pages index="Index.pdf" from="first" to="last" />')

Attributes can be modified via dot notation. If an attribute is a number, it is converted to int.

Note

from is exposed as ffrom because from is a reserved Python keyword.

>>> tp.ffrom = 1; tp.to = '"3"'
>>> tp.ffrom
1
>>> tp.to
3

Quotes are stripped from the value and added back in the str representation.

Note

Quotes are not mandatory.

>>> tp
PagesTagParser('<pages index="Index.pdf" from=1 to="3" />')

Attributes can be added via dot notation. Order is fixed (same order as attribute definition in the class).

>>> tp.fromsection = '"A"'
>>> tp.fromsection
'A'
>>> tp
PagesTagParser('<pages index="Index.pdf" from=1 to="3" fromsection="A" />')

Attributes can be deleted.

>>> del tp.fromsection
>>> tp
PagesTagParser('<pages index="Index.pdf" from=1 to="3" />')

Attribute presence can be checked.

>>> 'to' in tp
True

>>> 'step' in tp
False

Added in version 8.0.

Changed in version 8.1: text parameter is defaulted to '<pages />'.

exclude#: A descriptor tag.

Added in version 8.0.

ffrom#: A descriptor tag.

Added in version 8.0.

fromsection#: A descriptor tag.

Added in version 8.0.

classmethod get_descriptors()[source]#: Get TagAttrDesc descriptors.

header#: A descriptor tag.

Added in version 8.0.

include#: A descriptor tag.

Added in version 8.0.

index#: A descriptor tag.

Added in version 8.0.

onlysection#: A descriptor tag.

Added in version 8.0.

pat_attr = re.compile('(index=|from=|to=|include=|exclude=|step=|header=|fromsection=|tosection=|onlysection=)')#

pat_tag = re.compile('<pages (?P<attrs>[^/]*?)/>')#
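
As an illustration of how the two patterns above could work together, the following minimal sketch splits the first <pages ... /> tag into raw attribute/value pairs. The helper name split_pages_attrs is hypothetical and is not part of the module; quote stripping and int conversion, which the class performs, are left out here.

```python
import re

# Patterns copied from the class attributes documented above.
pat_tag = re.compile('<pages (?P<attrs>[^/]*?)/>')
pat_attr = re.compile(
    '(index=|from=|to=|include=|exclude=|step=|header='
    '|fromsection=|tosection=|onlysection=)')


def split_pages_attrs(text):
    """Hypothetical helper: map attribute names to raw value strings."""
    match = pat_tag.search(text)
    if match is None:
        return {}
    # Splitting on a capturing group keeps the separators, so the result
    # alternates: leading text, 'name=', raw value, 'name=', raw value, ...
    parts = pat_attr.split(match['attrs'])
    names = [name.rstrip('=') for name in parts[1::2]]
    values = [value.strip() for value in parts[2::2]]
    return dict(zip(names, values))
```

Because pat_tag uses search(), any text before the tag is ignored, which matches the "extract the first <pages ... /> tag" behaviour described for the parser.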

step#: A descriptor tag.

Added in version 8.0.

to#: A descriptor tag.

Added in version 8.0.

tokens = ('index', 'from', 'to', 'include', 'exclude', 'step', 'header', 'fromsection', 'tosection', 'onlysection')#

tosection#: A descriptor tag.

Added in version 8.0.

class proofreadpage.ProofreadPage(source, title='')[source]#

Bases: Page

ProofreadPage page used in MediaWiki ProofreadPage extension.

Instantiate a ProofreadPage object.

Raises:

UnknownExtensionError – source Site has no ProofreadPage Extension.

Parameters:

source (PageSourceType)
title (str)

NOT_PROOFREAD = 1#

PROBLEMATIC = 2#

PROOFREAD = 3#

PROOFREAD_LEVELS = [0, 1, 2, 3, 4]#

VALIDATED = 4#

WITHOUT_TEXT = 0#
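
The five quality-level constants can be read as an ordered scale. The sketch below restates them as an IntEnum purely for illustration; the class name QualityLevel is an assumption and does not exist in the module, which exposes plain integer class attributes.

```python
from enum import IntEnum


class QualityLevel(IntEnum):
    """Hypothetical enum mirroring the ProofreadPage class constants."""

    WITHOUT_TEXT = 0
    NOT_PROOFREAD = 1
    PROBLEMATIC = 2
    PROOFREAD = 3
    VALIDATED = 4


# Same content as the documented PROOFREAD_LEVELS list.
PROOFREAD_LEVELS = [level.value for level in QualityLevel]
```

Since IntEnum members compare equal to plain integers, a value such as QualityLevel.PROOFREAD can be checked against the documented constant 3 directly.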

property body: Any#

close_tag = '</noinclude>'#

property footer: Any#

property header: Any#

property index: IndexPage | None#

Get the Index page which contains ProofreadPage.

If several Index pages link to this ProofreadPage, and the ProofreadPage is titled Page:<index title>/<page number>, the Index page with the same title is returned. Otherwise, None is returned when multiple Index pages link to it.

To force a reload, delete the index attribute and access it again.

Returns:: the Index page for this ProofreadPage
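
The selection rule described above can be sketched on plain title strings. This is only an approximation under stated assumptions: the function name pick_index is hypothetical, titles are assumed to carry a single namespace prefix such as "Page:" or "Index:", and a page with exactly one linked Index page is assumed to return that page.

```python
def pick_index(page_title, index_titles):
    """Hypothetical helper: choose the Index title for a Page title."""
    if not index_titles:
        return None
    if len(index_titles) == 1:
        # Assumption: a single linked Index page is unambiguous.
        return index_titles[0]
    # 'Page:Book.pdf/12' -> 'Book.pdf/12' -> base title 'Book.pdf'
    without_ns = page_title.split(':', 1)[-1]
    base, _, _page_number = without_ns.rpartition('/')
    for index_title in index_titles:
        if index_title.split(':', 1)[-1] == base:
            return index_title
    # Several candidates and none shares the page's base title.
    return None
```

For example, pick_index('Page:Book.pdf/12', ['Index:Other.pdf', 'Index:Book.pdf']) selects 'Index:Book.pdf', while two unrelated Index titles yield None.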

not_proofread()[source]#

Set Page QL to “Not Proofread”.

Return type:: None

ocr(ocr_tool=None)[source]#

Do OCR of ProofreadPage scan.

The text returned by this function shall be assigned to body, otherwise the ProofreadPage format will not be maintained.

Warning

It is the user’s responsibility to reset quality level accordingly.

Changed in version 9.2: default for ocr_tool is wmfOCR.

Removed in version 9.2: phetools support is not available anymore.

Parameters:

ocr_tool (str | None) – ‘wmfOCR’ or ‘googleOCR’; default is ‘wmfOCR’

Returns:

OCR text for the page.

Raises:

TypeError – wrong ocr_tool keyword arg.
ValueError – something went wrong with OCR process.

Return type:

str

open_tag = '<noinclude>'#

p_close = re.compile('(</div>|\\n\\n\\n)?</noinclude>')#

p_close_no_div = re.compile('</noinclude>')#

p_open = re.compile('<noinclude>')#
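
To show what these patterns are for, here is a minimal sketch that splits page text into header, body and footer. It assumes the usual layout of a noinclude header section, then the body, then a noinclude footer section; the function name split_page is hypothetical and the module's own decomposition may differ in detail.

```python
import re

# Patterns copied from the class attributes documented above.
p_open = re.compile('<noinclude>')
p_close = re.compile('(</div>|\\n\\n\\n)?</noinclude>')


def split_page(text):
    """Hypothetical helper: return (header, body, footer) of a page."""
    opens = list(p_open.finditer(text))
    closes = list(p_close.finditer(text))
    if len(opens) < 2 or len(closes) < 2:
        raise ValueError('expected a header and a footer noinclude section')
    first_open, last_open = opens[0], opens[-1]
    first_close, last_close = closes[0], closes[-1]
    header = text[first_open.end():first_close.start()]
    body = text[first_close.end():last_open.start()]
    # p_close also swallows an optional '</div>' or three newlines that
    # precede the closing tag, so they do not end up in the footer.
    footer = text[last_open.end():last_close.start()]
    return header, body, footer
```

The optional group in p_close is what distinguishes it from p_close_no_div, which matches the bare closing tag only.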

property pre_summary: str#

Return trailing part of edit summary.

The edit summary shall be appended to pre_summary to highlight Status in the edit summary on wiki.

problematic()[source]#

Set Page QL to “Problematic”.

Return type:: None

proofread()[source]#

Set Page QL to “Proofread”.

Return type:: None

property ql: Any#

property quality_level: int#

Return the quality level of this page when it is retrieved from API.

This is only applicable if contentmodel equals ‘proofread-page’. None is returned otherwise.

This property is read-only and is applicable only when page is loaded. If quality level is overwritten during page processing, this property is no longer necessarily aligned with the new value.

In this way, no text parsing is necessary to check quality level when fetching a page.

save(*args, **kwargs)[source]#

Save page content after recomposing the page.

Parameters:

args (Any)
kwargs (Any)

Return type:

None

property status: Any#

property text: str#

Override text property.

Preload text returned by EditFormPreloadText to preload non-existing pages.

property url_image: str#

Get the file url of the scan of ProofreadPage.

Returns:: file url of the scan of ProofreadPage or None.

For MW version < 1.40:

Raises:

Exception – in case of http errors
ImportError – if bs4 is not installed, _bs4_soup() will raise
ValueError – in case of no prp_page_image src found for scan

property user: Any#

validate()[source]#

Set Page QL to “Validated”.

Return type:: None

without_text()[source]#

Set Page QL to “Without text”.

Return type:: None

class proofreadpage.PurgeRequest(**kwargs)[source]#

Bases: Request

Subclass of Request which skips the check on write rights.

Workaround for T128994.

Monkeypatch action in Request initializer.

Parameters:: kwargs (Any)

class proofreadpage.TagAttr(attr, value)[source]#

Bases: object

Tag attribute of <pages />.

Represent a single attribute. It is used internally in PagesTagParser and shall not be used stand-alone.

It manages string formatting for output, conversion between str and int, and quote handling. The input value must be a str or an int; a str must be either unquoted or wrapped in matching quotes.

>>> a = TagAttr('to', 3.0)
Traceback (most recent call last):
  ...
TypeError: value=3.0 must be str or int.

>>> a = TagAttr('to', 'A123"')
Traceback (most recent call last):
  ...
ValueError: value=A123" has wrong quotes.

>>> a = TagAttr('to', 3)
>>> a
TagAttr('to', 3)
>>> str(a)
'to=3'
>>> a.attr
'to'
>>> a.value
3

>>> a = TagAttr('to', '3')
>>> a
TagAttr('to', '3')
>>> str(a)
'to=3'
>>> a.attr
'to'
>>> a.value
3

>>> a = TagAttr('to', '"3"')
>>> a
TagAttr('to', '"3"')
>>> str(a)
'to="3"'
>>> a.value
3

>>> a = TagAttr('to', "'3'")
>>> a
TagAttr('to', "'3'")
>>> str(a)
"to='3'"
>>> a.value
3

>>> a = TagAttr('to', 'A123')
>>> a
TagAttr('to', 'A123')
>>> str(a)
'to=A123'
>>> a.value
'A123'

Added in version 8.0.
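
The quote and int handling shown in the doctests above can be approximated with a small standalone function. This is a sketch, not the class's implementation: the name split_quotes is hypothetical, and it returns the quote character and the converted value as a tuple rather than storing them on an object.

```python
def split_quotes(value):
    """Hypothetical helper: return (quote, value) for a raw attribute."""
    if not isinstance(value, (str, int)):
        raise TypeError(f'value={value} must be str or int.')
    if isinstance(value, int):
        return '', value
    quote = ''
    quotes = ('"', "'")
    if value[:1] in quotes or value[-1:] in quotes:
        # A quote on one side only, or mismatched quotes, is rejected.
        if len(value) < 2 or value[0] != value[-1]:
            raise ValueError(f'value={value} has wrong quotes.')
        quote, value = value[0], value[1:-1]
    # Purely numeric strings become int, mirroring the doctests above.
    if value.isascii() and value.isdigit():
        value = int(value)
    return quote, value
```

Keeping the quote character separate from the value is what allows the str representation to put the original quotes back.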

property value#: Attribute value.

class proofreadpage.TagAttrDesc[source]#

Bases: object

A descriptor tag.

Added in version 8.0.

proofreadpage.check_if_cached(fn)[source]#

Decorator for IndexPage to ensure data is cached.

Parameters:: fn (Callable)
Return type:: Callable

proofreadpage.decompose(fn)[source]#

Decorator for ProofreadPage.

Decompose text if needed and recompose text.

Parameters:: fn (Callable)
Return type:: Callable
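
The "decompose if needed, then recompose" idea can be sketched as a generic method decorator. Everything below is illustrative: ToyPage, its _decomposed flag and its helper methods are hypothetical stand-ins, and the "|" separator merely simulates the header, body and footer sections.

```python
import functools


def decompose(fn):
    """Sketch: split the text before calling fn, reassemble it afterwards."""
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        if not self._decomposed:  # decompose text only if needed
            self._decompose_page()
        result = fn(self, *args, **kwargs)
        self._compose_page()  # recompose text after the call
        return result
    return wrapper


class ToyPage:
    """Hypothetical stand-in for ProofreadPage, for illustration only."""

    def __init__(self, text):
        self.text = text
        self._decomposed = False

    def _decompose_page(self):
        self._parts = self.text.split('|')
        self._decomposed = True

    def _compose_page(self):
        self.text = '|'.join(self._parts)

    @decompose
    def set_body(self, value):
        self._parts[1] = value
```

With this arrangement, a call such as ToyPage('head|body|foot').set_body('new') leaves the text attribute consistent with the modified part without the caller having to recompose it.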
