parser Package

This package contains the actual wikicode parser, split up into two main modules: the tokenizer and the builder. This module joins them together into one interface.

class mwparserfromhell.parser.Parser

Represents a parser for wikicode.

Actual parsing is a two-step process: first, the text is split up into a series of tokens by the Tokenizer, and then the tokens are converted into trees of Wikicode objects and Nodes by the Builder.

Instances of this class or its dependents (Tokenizer and Builder) should not be shared between threads. parse() can be called multiple times as long as it is not done concurrently. In general, there is no need to do this because parsing should be done through mwparserfromhell.parse(), which creates a new Parser object as necessary.
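When a long-lived parser is wanted anyway, one way to follow the no-sharing rule is to keep a separate instance per thread. The sketch below uses only the standard library. get_thread_parser and its factory argument are hypothetical names, not part of mwparserfromhell; in real use, factory would be mwparserfromhell.parser.Parser.

```python
import threading

# Per-thread storage: each thread sees its own "parser" attribute.
_local = threading.local()


def get_thread_parser(factory):
    """Return this thread's parser, creating it with factory() on first use.

    Hypothetical helper: factory is any zero-argument callable that
    returns a parser-like object.
    """
    parser = getattr(_local, "parser", None)
    if parser is None:
        parser = factory()
        _local.parser = parser
    return parser
```

With the library installed, get_thread_parser(Parser).parse(text) would reuse one Parser per thread and never share it across threads. For most code, calling mwparserfromhell.parse() directly remains the simpler choice.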

parse(text, context=0, skip_style_tags=False)

Parse text, returning a Wikicode object tree.

If given, context will be passed as a starting context to the parser. This is helpful when this function is used inside node attribute setters. For example, ExternalLink's url setter sets context to contexts.EXT_LINK_URI to prevent the URL itself from becoming an ExternalLink.

If skip_style_tags is True, then '' and ''' will not be parsed, but instead will be treated as plain text.

If there is an internal error while parsing, ParserError will be raised.

exception mwparserfromhell.parser.ParserError(extra)

Exception raised when an internal error occurs while parsing.

This does not mean that the wikicode was invalid, because invalid markup should still be parsed correctly. This means that the parser caught itself with an impossible internal state and is bailing out before other problems can happen. Its appearance indicates a bug.

builder Module

class mwparserfromhell.parser.builder.Builder

Builds a tree of nodes out of a sequence of tokens.

To use, pass a list of Tokens to the build() method. The list will be exhausted as it is parsed and a Wikicode object containing the node tree will be returned.

build(tokenlist)

Build a Wikicode object from a list of tokens and return it.
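The idea of turning a flat token list into a tree while exhausting the list can be illustrated with a toy model. Everything below is a stand-in written for this example: Text, TemplateOpen, TemplateClose, TemplateNode and toy_build are simplified, hypothetical definitions, not the library's classes or its actual algorithm, and the input is assumed to be balanced.

```python
from dataclasses import dataclass, field


# Toy stand-ins for illustration only; the real token classes live in
# mwparserfromhell.parser.tokens and carry more information.
@dataclass
class Text:
    text: str


class TemplateOpen:
    pass


class TemplateClose:
    pass


@dataclass
class TemplateNode:
    children: list = field(default_factory=list)


def toy_build(tokenlist):
    """Consume tokenlist and return a nested list of strings and nodes."""
    tokenlist.reverse()      # so pop() removes tokens in original order
    stack = [[]]             # one child list per open nesting level
    while tokenlist:
        token = tokenlist.pop()
        if isinstance(token, TemplateOpen):
            stack.append([])                 # start collecting children
        elif isinstance(token, TemplateClose):
            children = stack.pop()           # finish the innermost template
            stack[-1].append(TemplateNode(children))
        else:
            stack[-1].append(token.text)     # plain text at current level
    return stack[0]


tokens = [Text("a"), TemplateOpen(), Text("b"), TemplateClose()]
tree = toy_build(tokens)
# tree == ['a', TemplateNode(children=['b'])] and tokens is now empty
```

The stack of child lists is a design choice of this sketch: each opening token pushes a new level and each closing token folds that level into its parent, so nesting falls out naturally.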

contexts Module

This module contains various “context” definitions, which are essentially flags set during the tokenization process, either on the current parse stack (local contexts) or affecting all stacks (global contexts). They represent the context the tokenizer is in, such as inside a template’s name definition, or inside a level-two heading. They are used to determine which tokens are valid at the current point and whether the current parsing route is invalid.

The tokenizer stores context as an integer, with these definitions bitwise OR’d to set them, AND’d to check if they’re set, and XOR’d to unset them (XOR toggles a bit, so unsetting this way assumes the flag is currently set). The advantage of this is that contexts can have sub-contexts (as FOO == 0b11 will cover BAR == 0b10 and BAZ == 0b01).
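The set, check and unset operations can be sketched with the placeholder values from the paragraph above. FOO, BAR and BAZ are illustrative names only, not real constants; the actual definitions are in mwparserfromhell.parser.contexts.

```python
# Placeholder flags taken from the text above (not real library constants).
BAR = 0b10
BAZ = 0b01
FOO = BAR | BAZ   # aggregate context covering both sub-contexts (0b11)

context = 0
context |= BAR                  # set BAR
in_foo = bool(context & FOO)    # True: BAR is a sub-context of FOO
in_baz = bool(context & BAZ)    # False: BAZ was never set
context ^= BAR                  # unset BAR (safe because BAR is known to be set)
cleared = context == 0          # True: no flags remain
```

Checking against the aggregate FOO answers "is any sub-context active?" with a single AND, which is the advantage the sub-context layout provides.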

Local (stack-specific) contexts:

  • TEMPLATE

    • TEMPLATE_NAME

    • TEMPLATE_PARAM_KEY

    • TEMPLATE_PARAM_VALUE

  • ARGUMENT

    • ARGUMENT_NAME

    • ARGUMENT_DEFAULT

  • WIKILINK

    • WIKILINK_TITLE

    • WIKILINK_TEXT

  • EXT_LINK

    • EXT_LINK_URI

    • EXT_LINK_TITLE

  • HEADING

    • HEADING_LEVEL_1

    • HEADING_LEVEL_2

    • HEADING_LEVEL_3

    • HEADING_LEVEL_4

    • HEADING_LEVEL_5

    • HEADING_LEVEL_6

  • TAG

    • TAG_OPEN

    • TAG_ATTR

    • TAG_BODY

    • TAG_CLOSE

  • STYLE

    • STYLE_ITALICS

    • STYLE_BOLD

    • STYLE_PASS_AGAIN

    • STYLE_SECOND_PASS

  • DL_TERM

  • SAFETY_CHECK

    • HAS_TEXT

    • FAIL_ON_TEXT

    • FAIL_NEXT

    • FAIL_ON_LBRACE

    • FAIL_ON_RBRACE

    • FAIL_ON_EQUALS

    • HAS_TEMPLATE

  • TABLE

    • TABLE_OPEN

    • TABLE_CELL_OPEN

    • TABLE_CELL_STYLE

    • TABLE_TD_LINE

    • TABLE_TH_LINE

    • TABLE_CELL_LINE_CONTEXTS

  • HTML_ENTITY

Global contexts:

  • GL_HEADING

Aggregate contexts:

  • FAIL

  • UNSAFE

  • DOUBLE

  • NO_WIKILINKS

  • NO_EXT_LINKS

mwparserfromhell.parser.contexts.describe(context)

Return a string describing the given context value, for debugging.
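A debugging helper of this kind could, in principle, map set bits back to names. The following is a rough sketch under assumptions: the flag values in FLAGS are made up for the example, describe_flags is a hypothetical function, and the real describe() may produce a different format.

```python
# Hypothetical name-to-bit table; the real values are defined in
# mwparserfromhell.parser.contexts and are not reproduced here.
FLAGS = {
    "TEMPLATE_NAME": 1 << 0,
    "TEMPLATE_PARAM_KEY": 1 << 1,
    "HEADING_LEVEL_1": 1 << 2,
}


def describe_flags(context):
    """Return the names of all flags set in context, joined by '|'."""
    names = [name for name, value in FLAGS.items() if context & value]
    return "|".join(names)


summary = describe_flags(0b101)
# summary == 'TEMPLATE_NAME|HEADING_LEVEL_1'
```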

tokenizer Module

class mwparserfromhell.parser.tokenizer.Tokenizer

Creates a list of tokens from a string of wikicode.

MARKERS: list[str | Sentinel] = ['{', '}', '[', ']', '<', '>', '|', '=', '&', "'", '"', '#', '*', ';', ':', '/', '-', '!', '\n', Sentinel.START, Sentinel.END]
MAX_DEPTH = 100
URISCHEME = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+.-'
USES_C = False
regex = re.compile('([{}\\[\\]<>|=&\'#*;:/\\\\\\"\\-!\\n])', re.IGNORECASE)
tag_splitter = re.compile('([\\s\\"\\\'\\\\]+)')

tokenize(text: str, context=0, skip_style_tags=False)

Build a list of tokens from a string of wikicode and return it.
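To see what the regex and tag_splitter patterns listed above match, they can be exercised directly with the standard re module. This only demonstrates the patterns themselves; split_on_markers is a hypothetical helper, and how the Tokenizer consumes the resulting segments internally is an assumption not covered here.

```python
import re

# Patterns copied verbatim from the Tokenizer attributes listed above.
regex = re.compile('([{}\\[\\]<>|=&\'#*;:/\\\\\\"\\-!\\n])', re.IGNORECASE)
tag_splitter = re.compile('([\\s\\"\\\'\\\\]+)')


def split_on_markers(text):
    """Split text at marker characters, keeping markers and dropping empty strings."""
    return [segment for segment in regex.split(text) if segment]


segments = split_on_markers("{{foo|bar=baz}}")
# segments == ['{', '{', 'foo', '|', 'bar', '=', 'baz', '}', '}']

tag_parts = tag_splitter.split('name="value" other')
# tag_parts == ['name=', '"', 'value', '" ', 'other']
```

Because both patterns wrap their character class in a capturing group, re.split keeps the matched separators in the output, so no input characters are lost.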

exception mwparserfromhell.parser.tokenizer.BadRoute(context=0)

Raised internally when the current tokenization route is invalid.
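The general pattern of abandoning an invalid route and falling back can be sketched with a toy tokenizer. This is an illustration under assumptions, not the library's code: the BadRoute class here is a local stand-in, toy_tokenize and parse_template are hypothetical, and treating unclosed braces as plain text is this sketch's fallback choice.

```python
class BadRoute(Exception):
    """Toy stand-in for the tokenizer's internal exception."""


def parse_template(text, start):
    """Try to read '{{...}}' beginning at start; raise BadRoute if unclosed."""
    end = text.find("}}", start + 2)
    if end == -1:
        raise BadRoute()
    return ("template", text[start + 2:end]), end + 2


def toy_tokenize(text):
    """Return (kind, value) pairs, falling back to text on a bad route."""
    tokens, pos = [], 0
    while pos < len(text):
        if text.startswith("{{", pos):
            try:
                token, pos = parse_template(text, pos)
                tokens.append(token)
            except BadRoute:
                # The template route failed: keep the braces as plain text.
                tokens.append(("text", "{{"))
                pos += 2
            continue
        next_open = text.find("{{", pos)
        if next_open == -1:
            next_open = len(text)
        tokens.append(("text", text[pos:next_open]))
        pos = next_open
    return tokens


result = toy_tokenize("a {{b}} {{c")
# result == [('text', 'a '), ('template', 'b'), ('text', ' '),
#            ('text', '{{'), ('text', 'c')]
```

Using an exception for the failed route keeps the happy path linear: the caller attempts the structured interpretation first and only handles the fallback where it can actually recover.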

tokens Module

This module contains the token definitions that are used as an intermediate parsing data type; they are stored in a flat list, with each token being identified by its type and optional attributes. The token list is generated in a syntactically valid form by the Tokenizer, and then converted into the Wikicode tree by the Builder.

class mwparserfromhell.parser.tokens.ArgumentClose
class mwparserfromhell.parser.tokens.ArgumentOpen
class mwparserfromhell.parser.tokens.ArgumentSeparator
class mwparserfromhell.parser.tokens.CommentEnd
class mwparserfromhell.parser.tokens.CommentStart
class mwparserfromhell.parser.tokens.ExternalLinkClose
class mwparserfromhell.parser.tokens.ExternalLinkOpen
class mwparserfromhell.parser.tokens.ExternalLinkSeparator
class mwparserfromhell.parser.tokens.HTMLEntityEnd
class mwparserfromhell.parser.tokens.HTMLEntityHex
class mwparserfromhell.parser.tokens.HTMLEntityNumeric
class mwparserfromhell.parser.tokens.HTMLEntityStart
class mwparserfromhell.parser.tokens.HeadingEnd
class mwparserfromhell.parser.tokens.HeadingStart
class mwparserfromhell.parser.tokens.TagAttrEquals
class mwparserfromhell.parser.tokens.TagAttrQuote
class mwparserfromhell.parser.tokens.TagAttrStart
class mwparserfromhell.parser.tokens.TagCloseClose
class mwparserfromhell.parser.tokens.TagCloseOpen
class mwparserfromhell.parser.tokens.TagCloseSelfclose
class mwparserfromhell.parser.tokens.TagOpenClose
class mwparserfromhell.parser.tokens.TagOpenOpen
class mwparserfromhell.parser.tokens.TemplateClose
class mwparserfromhell.parser.tokens.TemplateOpen
class mwparserfromhell.parser.tokens.TemplateParamEquals
class mwparserfromhell.parser.tokens.TemplateParamSeparator
class mwparserfromhell.parser.tokens.Text
class mwparserfromhell.parser.tokens.Token

A token stores the semantic meaning of a unit of wikicode.

class mwparserfromhell.parser.tokens.WikilinkClose
class mwparserfromhell.parser.tokens.WikilinkOpen
class mwparserfromhell.parser.tokens.WikilinkSeparator