proofreadpage — ProofreadPage Extension

Objects used with ProofreadPage Extension.

This module includes objects:

  • ProofreadPage(Page)

  • FullHeader

  • IndexPage(Page)

OCR support of page scans via:

  • https://phetools.toolforge.org/hocr_cgi.py

  • https://phetools.toolforge.org/ocr.php

  • inspired by https://en.wikisource.org/wiki/MediaWiki:Gadget-ocr.js

class proofreadpage.FullHeader(text=None)

Bases: object

Header of a ProofreadPage object.

Parameters:

text (Optional[str]) –

TEMPLATE_V1 = '<pagequality level="{0.ql}" user="{0.user}" /><div class="pagetext">{0.header}\n\n\n'
TEMPLATE_V2 = '<pagequality level="{0.ql}" user="{0.user}" />{0.header}'
p_header = re.compile('<pagequality level="(?P<ql>\\d)" user="(?P<user>.*?)" />(?P<has_div><div class="pagetext">)?(?P<header>.*)', re.DOTALL)
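
As a rough illustration of how p_header and TEMPLATE_V2 relate, the sketch below copies the documented pattern and template into standalone constants and parses a header string using only the standard library. The helpers parse_header and render_v2 are hypothetical names invented for this example; they are not part of FullHeader, and the class itself may handle these steps differently.

```python
import re
from types import SimpleNamespace

# Copied from the class attributes documented above.
P_HEADER = re.compile(
    r'<pagequality level="(?P<ql>\d)" user="(?P<user>.*?)" />'
    r'(?P<has_div><div class="pagetext">)?(?P<header>.*)',
    re.DOTALL)
TEMPLATE_V2 = '<pagequality level="{0.ql}" user="{0.user}" />{0.header}'


def parse_header(text):
    """Hypothetical helper: split a header string into its fields."""
    match = P_HEADER.search(text)
    if match is None:
        raise ValueError('no <pagequality /> tag found')
    return SimpleNamespace(
        ql=int(match['ql']),
        user=match['user'],
        # True when the header uses the <div class="pagetext"> layout.
        has_div=match['has_div'] is not None,
        header=match['header'],
    )


def render_v2(fields):
    """Hypothetical helper: rebuild the header text with TEMPLATE_V2."""
    return TEMPLATE_V2.format(fields)


sample = '<pagequality level="3" user="Alice" />{{rh|12|Title|}}'
fields = parse_header(sample)
```

The optional has_div group tells the TEMPLATE_V1 layout, which wraps the header in a div element, apart from the TEMPLATE_V2 layout.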
class proofreadpage.IndexPage(source, title='')

Bases: Page

Index page used in the MediaWiki ProofreadPage extension.

Instantiate an IndexPage object.

In this class:

  • page number is the number in the page title in the Page namespace, if the Wikisource site adopts this convention (e.g. page_number is 12 for Page:Popular Science Monthly Volume 1.djvu/12), or the sequential number of the pages linked from the index section in the Index page, if the index is built via transclusion of a list of pages (e.g. as on de wikisource).

  • page label is the label associated with a page in the Index page.

This class provides helper methods to get the pages contained in an Index page, together with their page numbers and labels.

It also provides a generator over the pages contained in an Index page, which can be restricted to a range and filtered by quality level and page existence.

Raises:
  • UnknownExtensionError – source Site has no ProofreadPage Extension.

  • ImportError – bs4 is not installed.

Parameters:
INDEX_TEMPLATE = ':MediaWiki:Proofreadpage_index_template'
_get_prp_index_pagelist()

Get all pages in an IndexPage page list.

Note

This method is called by the initializer and should not be called directly.

get_label_from_page(*args, **kwargs)
Parameters:
  • self (IndexPage) –

  • args (Any) –

  • kwargs (Any) –

Return type:

Any

get_label_from_page_number(*args, **kwargs)
Parameters:
  • self (IndexPage) –

  • args (Any) –

  • kwargs (Any) –

Return type:

Any

get_number(*args, **kwargs)
Parameters:
  • self (IndexPage) –

  • args (Any) –

  • kwargs (Any) –

Return type:

Any

get_page(*args, **kwargs)
Parameters:
  • self (IndexPage) –

  • args (Any) –

  • kwargs (Any) –

Return type:

Any

get_page_from_label(*args, **kwargs)
Parameters:
  • self (IndexPage) –

  • args (Any) –

  • kwargs (Any) –

Return type:

Any

get_page_number_from_label(*args, **kwargs)
Parameters:
  • self (IndexPage) –

  • args (Any) –

  • kwargs (Any) –

Return type:

Any

has_valid_content()

Test whether the page contains only a single call to the index template.

Return type:

bool

property num_pages: Any
page_gen(start=1, end=None, filter_ql=None, only_existing=False, content=True)

Return a page generator which yields pages contained in Index page.

Range is [start … end], extremes included.

Parameters:
  • start (int) – first page, defaults to 1

  • end (Optional[int]) – last page, defaults to num_pages if None

  • filter_ql (Optional[Sequence[int]]) – quality levels to include; if None, all levels except ‘Without Text’.

  • only_existing (bool) – yields only existing pages.

  • content (bool) – preload content.

Return type:

Iterable[Page]
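
The selection rules described above (inclusive range, quality-level filter, existence filter) can be modelled without a wiki connection. The sketch below is not the library's implementation: PageStub and select_pages are hypothetical names, pages are plain records, and the default filter is assumed to be every level except 0 (Without Text), following the parameter description.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Sequence


class PageStub(NamedTuple):
    """Hypothetical stand-in for a page listed in an Index page."""
    number: int
    ql: int
    exists: bool


def select_pages(pages: Iterable[PageStub], start: int = 1,
                 end: Optional[int] = None,
                 filter_ql: Optional[Sequence[int]] = None,
                 only_existing: bool = False) -> Iterator[PageStub]:
    """Yield the stubs that satisfy the range and filter arguments."""
    page_list = list(pages)
    if end is None:
        end = len(page_list)  # plays the role of num_pages here
    if filter_ql is None:
        filter_ql = [1, 2, 3, 4]  # assumed default: all but 'Without Text'
    for page in page_list:
        if not start <= page.number <= end:  # both extremes included
            continue
        if page.ql not in filter_ql:
            continue
        if only_existing and not page.exists:
            continue
        yield page


stubs = [
    PageStub(1, 0, True),
    PageStub(2, 3, True),
    PageStub(3, 1, False),
    PageStub(4, 4, True),
]
validated = [p.number for p in select_pages(stubs, filter_ql=[4])]
```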

pages(*args, **kwargs)
Parameters:
  • self (IndexPage) –

  • args (Any) –

  • kwargs (Any) –

Return type:

Any

purge()

Overwrite purge method.

Instead of a proper purge action, use PurgeRequest, which skips the check on write rights.

Return type:

None

save(*args, **kwargs)

Save page after validating the content.

Attempting to save any other content fails silently, and a parameterless INDEX_TEMPLATE is saved instead.

Parameters:
  • args (Any) –

  • kwargs (Any) –

Return type:

None

class proofreadpage.PagesTagParser(text)

Bases: Container

Parser for tag <pages />.

Parse text and extract the first <pages ... /> tag. Individual attributes will be accessible with dot notation.

>>> tp = PagesTagParser(
... 'Text: <pages index="Index.pdf" from="first" to="last" />')
>>> tp
PagesTagParser('<pages index="Index.pdf" from="first" to="last" />')

Attributes can be modified via dot notation. If an attribute is a number, it is converted to int.

Note

from is represented as ffrom due to conflict with keyword.

>>> tp.ffrom = 1; tp.to = '"3"'
>>> tp.ffrom
1
>>> tp.to
3

Quotes are stripped in the value and added back in the str representation.

Note

Quotes are not mandatory.

>>> tp
PagesTagParser('<pages index="Index.pdf" from=1 to="3" />')

Attributes can be added via dot notation. Order is fixed (same order as attribute definition in the class).

>>> tp.fromsection = '"A"'
>>> tp.fromsection
'A'
>>> tp
PagesTagParser('<pages index="Index.pdf" from=1 to="3" fromsection="A" />')

Attributes can be deleted.

>>> del tp.fromsection
>>> tp
PagesTagParser('<pages index="Index.pdf" from=1 to="3" />')

Attribute presence can be checked.

>>> 'to' in tp
True
>>> 'step' in tp
False

New in version 8.0.

exclude

A descriptor tag.

New in version 8.0.

ffrom

A descriptor tag.

New in version 8.0.

fromsection

A descriptor tag.

New in version 8.0.

classmethod get_descriptors()

Get TagAttrDesc descriptors.

header

A descriptor tag.

New in version 8.0.

include

A descriptor tag.

New in version 8.0.

index

A descriptor tag.

New in version 8.0.

onlysection

A descriptor tag.

New in version 8.0.

pat_attr = re.compile('(index=|from=|to=|include=|exclude=|step=|header=|tosection=|fromsection=|onlysection=)')
pat_tag = re.compile('<pages (?P<attrs>[^/]*?)/>')
step

A descriptor tag.

New in version 8.0.

to

A descriptor tag.

New in version 8.0.

tokens = '(index=|from=|to=|include=|exclude=|step=|header=|tosection=|fromsection=|onlysection=)'
tosection

A descriptor tag.

New in version 8.0.
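
To show what the pat_tag and pat_attr patterns listed above match, the sketch below applies copies of them to the sample string from the doctest. It only illustrates the regular expressions: raw_attributes is a hypothetical helper, and how PagesTagParser actually consumes the matches is not shown here.

```python
import re

# Copies of the documented class attributes.
TOKENS = ('(index=|from=|to=|include=|exclude=|step=|header=|'
          'tosection=|fromsection=|onlysection=)')
PAT_ATTR = re.compile(TOKENS)
PAT_TAG = re.compile(r'<pages (?P<attrs>[^/]*?)/>')


def raw_attributes(text):
    """Hypothetical helper: map attribute names to their raw values."""
    match = PAT_TAG.search(text)  # first <pages ... /> tag only
    if match is None:
        return {}
    # Splitting on a capturing group keeps the tokens in the result:
    # ['', 'index=', '"Index.pdf" ', 'from=', '"first" ', ...]
    pieces = PAT_ATTR.split(match['attrs'])
    names = pieces[1::2]
    values = pieces[2::2]
    # Drop the trailing '=' from each name; values keep their quotes.
    return {name[:-1]: value.strip() for name, value in zip(names, values)}


attrs = raw_attributes(
    'Text: <pages index="Index.pdf" from="first" to="last" />')
```

In this sketch the values keep their quotes; stripping them and converting numbers is the job described for TagAttr.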

class proofreadpage.ProofreadPage(source, title='')

Bases: Page

ProofreadPage page used in the MediaWiki ProofreadPage extension.

Instantiate a ProofreadPage object.

Raises:

UnknownExtensionError – source Site has no ProofreadPage Extension.

Parameters:
NOT_PROOFREAD = 1
PROBLEMATIC = 2
PROOFREAD = 3
PROOFREAD_LEVELS = [0, 1, 2, 3, 4]
VALIDATED = 4
WITHOUT_TEXT = 0
property body: Any
close_tag = '</noinclude>'
property footer: Any
property header: Any
property index: Optional[IndexPage]

Get the Index page which contains ProofreadPage.

If several Index pages link to this ProofreadPage and the ProofreadPage is titled Page:<index title>/<page number>, the Index page with the same title is returned. Otherwise, None is returned when multiple Index pages are linked.

To force reload, delete index and call it again.

Returns:

the Index page for this ProofreadPage
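
The "delete index and call it again" behaviour matches the usual cached-property-with-deleter pattern. The sketch below shows that pattern in isolation; CachedIndexSketch and its loader argument are hypothetical and do not reproduce how ProofreadPage actually looks up its Index page.

```python
class CachedIndexSketch:
    """Hypothetical class illustrating a cached property with a deleter."""

    def __init__(self, loader):
        self._loader = loader  # callable that performs the lookup

    @property
    def index(self):
        # Run the lookup only on first access, then reuse the result.
        if not hasattr(self, '_index'):
            self._index = self._loader()
        return self._index

    @index.deleter
    def index(self):
        # Dropping the cached value forces a fresh lookup next time.
        if hasattr(self, '_index'):
            del self._index


calls = []


def fake_lookup():
    calls.append('lookup')
    return 'Index:Example.djvu'


page = CachedIndexSketch(fake_lookup)
first = page.index
second = page.index   # served from the cache
del page.index        # force reload
third = page.index    # triggers a second lookup
```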

not_proofread()

Set Page QL to “Not Proofread”.

Return type:

None

ocr(ocr_tool=None)

Do OCR of ProofreadPage scan.

The text returned by this function shall be assigned to self.body, otherwise the ProofreadPage format will not be maintained.

It is the user’s responsibility to reset quality level accordingly.

Parameters:

ocr_tool (Optional[str]) – ‘phetools’, ‘wmfOCR’ or ‘googleOCR’; default is ‘phetools’

Returns:

OCR text for the page.

Raises:
  • TypeError – wrong ocr_tool keyword arg.

  • ValueError – something went wrong with OCR process.

Return type:

str

open_tag = '<noinclude>'
p_close = re.compile('(</div>|\\n\\n\\n)?</noinclude>')
p_close_no_div = re.compile('</noinclude>')
p_open = re.compile('<noinclude>')
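
One way to picture what the p_open and p_close patterns are for is to split a page's wikitext into header, body and footer. The sketch below assumes the page text starts with a header noinclude section and ends with a footer noinclude section; that layout, and the helper name split_page, are assumptions made for illustration rather than a description of the class internals.

```python
import re

# Copies of the documented class attributes.
P_OPEN = re.compile(r'<noinclude>')
P_CLOSE = re.compile(r'(</div>|\n\n\n)?</noinclude>')


def split_page(text):
    """Hypothetical helper: return (full_header, body, footer)."""
    opens = list(P_OPEN.finditer(text))
    closes = list(P_CLOSE.finditer(text))
    if len(opens) < 2 or len(closes) < 2:
        raise ValueError('expected header and footer <noinclude> sections')
    # Header: inside the first <noinclude> ... </noinclude> pair.
    full_header = text[opens[0].end():closes[0].start()]
    # Body: between the end of the header and the start of the footer.
    body = text[closes[0].end():opens[-1].start()]
    # Footer: inside the last <noinclude> ... </noinclude> pair.
    footer = text[opens[-1].end():closes[-1].start()]
    return full_header, body, footer


sample = ('<noinclude><pagequality level="3" user="Alice" />Head</noinclude>'
          'Body text'
          '<noinclude>Foot</noinclude>')
full_header, body, footer = split_page(sample)
```

Because p_close optionally consumes a preceding closing div tag or three newlines, the same split also works for text written with the TEMPLATE_V1 layout.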
property pre_summary: str

Return trailing part of edit summary.

The edit summary shall be appended to pre_summary to highlight Status in the edit summary on wiki.

problematic()

Set Page QL to “Problematic”.

Return type:

None

proofread()

Set Page QL to “Proofread”.

Return type:

None

property ql: Any
property quality_level: int

Return the quality level of this page when it is retrieved from API.

This is only applicable if contentmodel equals ‘proofread-page’. None is returned otherwise.

This property is read-only and is applicable only when page is loaded. If quality level is overwritten during page processing, this property is no longer necessarily aligned with the new value.

In this way, no text parsing is necessary to check quality level when fetching a page.

save(*args, **kwargs)

Save page content after recomposing the page.

Parameters:
  • args (Any) –

  • kwargs (Any) –

Return type:

None

property status: Any
property text: str

Override text property.

Preload text returned by EditFormPreloadText to preload non-existing pages.

property url_image: str

Get the file url of the scan of ProofreadPage.

Returns:

file URL of the scan of the ProofreadPage, or None.

Raises:
  • Exception – in case of http errors

  • ImportError – if bs4 is not installed, _bs4_soup() will raise

  • ValueError – in case of no prp_page_image src found for scan

property user: Any
validate()

Set Page QL to “Validated”.

Return type:

None

without_text()

Set Page QL to “Without text”.

Return type:

None

class proofreadpage.PurgeRequest(**kwargs)

Bases: Request

Subclass of Request which skips the check on write rights.

Workaround for T128994.

Monkeypatch action in Request initializer.

Parameters:

kwargs (Any) –

class proofreadpage.TagAttr(attr, value)

Bases: object

Tag attribute of <pages />.

Represent a single attribute. It is used internally in PagesTagParser and shall not be used stand-alone.

It manages string formatting for output, conversion between str and int, and quoting. The input value can only be str or int, and shall either be enclosed in matching quotes or have no quotes at all.

>>> a = TagAttr('to', 3.0)
Traceback (most recent call last):
  ...
TypeError: value=3.0 must be str or int.
>>> a = TagAttr('to', 'A123"')
Traceback (most recent call last):
  ...
ValueError: value=A123" has wrong quotes.
>>> a = TagAttr('to', 3)
>>> a
TagAttr('to', 3)
>>> str(a)
'to=3'
>>> a.attr
'to'
>>> a.value
3
>>> a = TagAttr('to', '3')
>>> a
TagAttr('to', '3')
>>> str(a)
'to=3'
>>> a.attr
'to'
>>> a.value
3
>>> a = TagAttr('to', '"3"')
>>> a
TagAttr('to', '"3"')
>>> str(a)
'to="3"'
>>> a.value
3
>>> a = TagAttr('to', "'3'")
>>> a
TagAttr('to', "'3'")
>>> str(a)
"to='3'"
>>> a.value
3
>>> a = TagAttr('to', 'A123')
>>> a
TagAttr('to', 'A123')
>>> str(a)
'to=A123'
>>> a.value
'A123'

New in version 8.0.

property value

Attribute value.
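
The conversion rules shown in the doctests above (quotes stripped, digits converted to int, quotes restored on output) can be approximated with a few lines of plain Python. This is a sketch written to agree with those doctests, not the TagAttr source; split_value and render are hypothetical helpers.

```python
def split_value(value):
    """Hypothetical helper: return (converted_value, quote_char)."""
    if not isinstance(value, (str, int)):
        raise TypeError(f'value={value} must be str or int.')
    if isinstance(value, int):
        return value, ''
    # Detect a quote character at either end of the string.
    opening = value[0] if value[:1] in ('"', "'") else ''
    closing = value[-1] if value[-1:] in ('"', "'") else ''
    if opening != closing or (opening and len(value) < 2):
        raise ValueError(f'value={value} has wrong quotes.')
    inner = value[1:-1] if opening else value
    # Purely numeric content is converted to int.
    if inner.isdecimal():
        return int(inner), opening
    return inner, opening


def render(attr, value):
    """Hypothetical helper: format as attr=value, restoring the quotes."""
    converted, quote = split_value(value)
    return f'{attr}={quote}{converted}{quote}'


quoted = render('to', '"3"')
bare = render('to', 3)
```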

class proofreadpage.TagAttrDesc

Bases: object

A descriptor tag.

New in version 8.0.

proofreadpage.check_if_cached(fn)

Decorator for IndexPage to ensure data is cached.

Parameters:

fn (Callable) –

Return type:

Callable
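
A decorator of this kind typically checks a flag on the instance and fills the cache before running the wrapped method. The sketch below shows the general shape under that assumption; ensure_cached, IndexSketch, _cached and _load_mappings are hypothetical names and are not taken from the module.

```python
import functools


def ensure_cached(fn):
    """Hypothetical decorator: populate the cache before the first call."""
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        if not self._cached:
            self._load_mappings()
        return fn(self, *args, **kwargs)
    return wrapper


class IndexSketch:
    """Hypothetical stand-in for an object with lazily loaded mappings."""

    def __init__(self, labels):
        self._source = labels
        self._cached = False
        self.loads = 0  # counts how often the data was loaded

    def _load_mappings(self):
        self._label_by_number = dict(self._source)
        self._cached = True
        self.loads += 1

    @ensure_cached
    def label_for(self, number):
        return self._label_by_number[number]


index = IndexSketch({1: 'Cover', 2: 'i', 3: '1'})
label = index.label_for(2)
```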

proofreadpage.decompose(fn)

Decorator for ProofreadPage.

Decompose text if needed and recompose text.

Parameters:

fn (Callable) –

Return type:

Callable
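
The "decompose if needed, then recompose" behaviour can be pictured as a wrapper that splits the text into parts before the wrapped method runs and joins the parts again afterwards. The sketch below uses a deliberately simplified text format (head and body separated by a single bar character); every name in it is hypothetical and the real decorator may differ.

```python
import functools


def decompose_sketch(fn):
    """Hypothetical decorator: split text lazily, rejoin after the call."""
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        if not self._decomposed:
            self._split()
        result = fn(self, *args, **kwargs)
        self._join()  # keep self.text in sync with the parts
        return result
    return wrapper


class ComposedPageSketch:
    """Hypothetical page whose text is 'head|body'."""

    def __init__(self, text):
        self.text = text
        self._decomposed = False

    def _split(self):
        self._head, self._body = self.text.split('|', 1)
        self._decomposed = True

    def _join(self):
        self.text = f'{self._head}|{self._body}'

    @decompose_sketch
    def set_body(self, new_body):
        self._body = new_body


page = ComposedPageSketch('head|old body')
page.set_body('new body')
```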