parser Package¶
This package contains the actual wikicode parser, split up into two main
modules: the tokenizer and the builder. This module joins them
together into one interface.
- class mwparserfromhell.parser.Parser[source]¶
Represents a parser for wikicode.
Actual parsing is a two-step process: first, the text is split up into a series of tokens by the Tokenizer, and then the tokens are converted into trees of Wikicode objects and Nodes by the Builder.
Instances of this class or its dependents (Tokenizer and Builder) should not be shared between threads. parse() can be called multiple times as long as it is not done concurrently. In general, there is no need to do this because parsing should be done through mwparserfromhell.parse(), which creates a new Parser object as necessary.
- parse(text, context=0, skip_style_tags=False)[source]¶
Parse text, returning a Wikicode object tree.
If given, context will be passed as a starting context to the parser. This is helpful when this function is used inside node attribute setters. For example, ExternalLink's url setter sets context to contexts.EXT_LINK_URI to prevent the URL itself from becoming an ExternalLink.
If skip_style_tags is True, then '' and ''' will not be parsed, but instead will be treated as plain text.
If there is an internal error while parsing, ParserError will be raised.
- exception mwparserfromhell.parser.ParserError(extra)[source]¶
Exception raised when an internal error occurs while parsing.
This does not mean that the wikicode was invalid, because invalid markup should still be parsed correctly. This means that the parser caught itself with an impossible internal state and is bailing out before other problems can happen. Its appearance indicates a bug.
builder Module¶
- class mwparserfromhell.parser.builder.Builder[source]¶
Builds a tree of nodes out of a sequence of tokens.
To use, pass a list of Tokens to the build() method. The list will be exhausted as it is parsed and a Wikicode object containing the node tree will be returned.
- _handle_parameter(default)[source]¶
Handle a case where a parameter is at the head of the tokens.
default is the value to use if no parameter name is defined.
contexts Module¶
This module contains various “context” definitions, which are essentially flags set during the tokenization process, either on the current parse stack (local contexts) or affecting all stacks (global contexts). They represent the context the tokenizer is in, such as inside a template’s name definition, or inside a level-two heading. This is used to determine what tokens are valid at the current point and also if the current parsing route is invalid.
The tokenizer stores context as an integer, with these definitions bitwise OR'd
to set them, AND'd to check if they're set, and XOR'd to unset them (XOR
toggles a bit, so this only clears a flag that is currently set). The
advantage of this is that contexts can have sub-contexts (as FOO == 0b11
will cover BAR == 0b10 and BAZ == 0b01).
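The flag arithmetic described above can be sketched in a few lines of plain Python. The constant names are borrowed from the lists on this page, but the numeric values here are assumptions chosen for illustration, not the values the library actually uses.

```python
# Hypothetical flag values for illustration only; the real constants are
# defined in mwparserfromhell.parser.contexts and may differ.
TEMPLATE_NAME = 1 << 0
TEMPLATE_PARAM_KEY = 1 << 1
TEMPLATE_PARAM_VALUE = 1 << 2

# A parent context is the OR of its sub-contexts, so one AND checks them all.
TEMPLATE = TEMPLATE_NAME | TEMPLATE_PARAM_KEY | TEMPLATE_PARAM_VALUE

context = 0
context |= TEMPLATE_NAME                           # set a flag
in_template = bool(context & TEMPLATE)             # a sub-context is set
in_param_key = bool(context & TEMPLATE_PARAM_KEY)  # this one is not set yet

context ^= TEMPLATE_NAME       # unset: the flag was set, so XOR clears it
context |= TEMPLATE_PARAM_KEY  # move on to the parameter key
```

Because XOR toggles, code that cannot be sure a flag is set may prefer `context &= ~FLAG`, which clears the bit unconditionally.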
Local (stack-specific) contexts:
TEMPLATE, TEMPLATE_NAME, TEMPLATE_PARAM_KEY, TEMPLATE_PARAM_VALUE
ARGUMENT, ARGUMENT_NAME, ARGUMENT_DEFAULT
WIKILINK, WIKILINK_TITLE, WIKILINK_TEXT
EXT_LINK, EXT_LINK_URI, EXT_LINK_TITLE
HEADING, HEADING_LEVEL_1, HEADING_LEVEL_2, HEADING_LEVEL_3, HEADING_LEVEL_4, HEADING_LEVEL_5, HEADING_LEVEL_6
TAG, TAG_OPEN, TAG_ATTR, TAG_BODY, TAG_CLOSE
STYLE, STYLE_ITALICS, STYLE_BOLD, STYLE_PASS_AGAIN, STYLE_SECOND_PASS
DL_TERM, SAFETY_CHECK, HAS_TEXT, FAIL_ON_TEXT, FAIL_NEXT, FAIL_ON_LBRACE, FAIL_ON_RBRACE, FAIL_ON_EQUALS, HAS_TEMPLATE
TABLE, TABLE_OPEN, TABLE_CELL_OPEN, TABLE_CELL_STYLE, TABLE_TD_LINE, TABLE_TH_LINE, TABLE_CELL_LINE_CONTEXTS
HTML_ENTITY
Global contexts:
GL_HEADING
Aggregate contexts:
FAIL, UNSAFE, DOUBLE, NO_WIKILINKS, NO_EXT_LINKS
errors Module¶
- exception mwparserfromhell.parser.errors.ParserError(extra)[source]¶
Exception raised when an internal error occurs while parsing.
This does not mean that the wikicode was invalid, because invalid markup should still be parsed correctly. This means that the parser caught itself with an impossible internal state and is bailing out before other problems can happen. Its appearance indicates a bug.
tokenizer Module¶
- class mwparserfromhell.parser.tokenizer.Tokenizer[source]¶
Creates a list of tokens from a string of wikicode.
- MARKERS: list[str | Sentinel] = ['{', '}', '[', ']', '<', '>', '|', '=', '&', "'", '"', '#', '*', ';', ':', '/', '-', '!', '\n', Sentinel.START, Sentinel.END]¶
- MAX_DEPTH = 100¶
- URISCHEME = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+.-'¶
- USES_C = False¶
- property _context¶
The current token context.
- _emit_style_tag(tag, markup, body)[source]¶
Write the body of a tag and the tokens that should surround it.
- _emit_table_tag(open_open_markup, tag, style, padding, close_open_markup, contents, open_close_markup)[source]¶
Emit a table tag.
- _fail_route()[source]¶
Fail the current tokenization route.
Discards the current stack/context/textbuffer and raises BadRoute.
- _handle_free_link_text(punct, tail, this)[source]¶
Handle text in a free ext link, including trailing punctuation.
- _handle_invalid_tag_start()[source]¶
Handle the (possible) start of an implicitly closing single tag.
- _handle_single_only_tag_end()[source]¶
Handle the end of an implicitly closing single-only HTML tag.
- _handle_table_cell(markup, tag, line_context)[source]¶
Parse as normal syntax unless we hit a style marker, then parse style as HTML attributes and the remainder as normal syntax.
- _handle_table_cell_end(reset_for_style=False)[source]¶
Returns the current context, with the TABLE_CELL_STYLE flag set if it is necessary to reset and parse style attributes.
- _handle_template_param_value()[source]¶
Handle a template parameter’s value at the head of the string.
- _memoize_bad_route()[source]¶
Remember that the current route (head + context at push) is invalid.
This will be noticed when calling _push with the same head and context, and the route will be failed immediately.
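A minimal sketch of this kind of memoization, assuming a route is identified by the pair (head, context). The class and method names below are hypothetical and only mirror the behaviour described here; they are not the tokenizer's actual code.

```python
class BadRoute(Exception):
    """Raised when a tokenization route is known to be invalid."""


class RouteMemo:
    """Toy stand-in that remembers which (head, context) pairs failed."""

    def __init__(self):
        self._bad_routes = set()

    def push(self, head, context):
        # Fail immediately if this exact route has already been tried.
        ident = (head, context)
        if ident in self._bad_routes:
            raise BadRoute(ident)
        return ident

    def memoize_bad_route(self, ident):
        # Record the route so a later push with the same ident is skipped.
        self._bad_routes.add(ident)


memo = RouteMemo()
ident = memo.push(head=3, context=1)  # first attempt is allowed
memo.memoize_bad_route(ident)         # the attempt failed, so remember it

try:
    memo.push(head=3, context=1)      # same head and context
    retried = True
except BadRoute:
    retried = False

other = memo.push(head=3, context=2)  # a different context is still allowed
```

The point of the cache is to avoid re-tokenizing the same span under the same context after it has already been shown to fail.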
- _parse_template_or_argument()[source]¶
Parse a template or argument at the head of the wikicode string.
- _pop(keep_context=False)[source]¶
Pop the current stack/context/textbuffer, returning the stack.
If keep_context is True, then we will replace the underlying stack's context with the current stack's.
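The effect of keep_context can be illustrated with a toy stack of frames. This is a sketch under the assumption that each frame holds a token list and an integer context; the `StackManager` class is hypothetical and leaves out the textbuffer handling.

```python
class StackManager:
    """Toy model: each frame is [token_stack, context]."""

    def __init__(self):
        self._frames = []

    def push(self, context=0):
        self._frames.append([[], context])

    def pop(self, keep_context=False):
        stack, context = self._frames.pop()
        if keep_context and self._frames:
            # Carry the popped frame's context over to the frame underneath.
            self._frames[-1][1] = context
        return stack

    @property
    def context(self):
        return self._frames[-1][1]


manager = StackManager()
manager.push(context=0b01)  # outer frame
manager.push(context=0b10)  # inner frame
manager.pop()               # default: outer context is untouched
plain = manager.context

manager.push(context=0b10)
manager.pop(keep_context=True)  # outer frame inherits the inner context
kept = manager.context
```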
- _read(*, strict: Literal[False] = False) -> str | Literal[Sentinel.END][source]¶
- _read(*, strict: Literal[True]) -> str
- _read(delta: int = 0, *, strict: Literal[False] = False) -> str | Literal[Sentinel.START, Sentinel.END]
- _read(delta: int = 0, *, strict: Literal[True]) -> str | Literal[Sentinel.START]
Read the value at a relative point in the wikicode.
The value is read from self._head plus the value of delta (which can be negative). If strict is True, the route will be failed (with _fail_route()) if we try to read from past the end of the string; otherwise, END is returned. If we try to read from before the start of the string, START is returned.
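The boundary rules above can be sketched as a standalone function. This assumes a plain string input and passes the head position explicitly; the `read` function, the `Sentinel` enum values, and the `BadRoute` class are hypothetical stand-ins that only follow the behaviour documented here.

```python
from enum import Enum


class Sentinel(Enum):
    START = "start"
    END = "end"


class BadRoute(Exception):
    """Stand-in for the failure raised by _fail_route()."""


def read(text, head, delta=0, strict=False):
    """Return the value at head + delta, or a sentinel when out of range."""
    index = head + delta
    if index < 0:
        return Sentinel.START  # before the start of the string
    if index >= len(text):
        if strict:
            raise BadRoute()   # strict reads fail the route instead
        return Sentinel.END    # past the end of the string
    return text[index]


current = read("ab", head=0)            # value under the head
before = read("ab", head=0, delta=-1)   # one step before the start
after = read("ab", head=1, delta=1)     # one step past the end

try:
    read("ab", head=1, delta=1, strict=True)
    strict_failed = False
except BadRoute:
    strict_failed = True
```

Returning sentinels rather than raising lets callers compare against START and END in the same branch logic they use for ordinary markers.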
- _remove_uri_scheme_from_textbuffer(scheme)[source]¶
Remove the URI scheme of a new external link from the textbuffer.
- property _stack¶
The current token stack.
- property _stack_ident¶
An identifier for the current stack.
This is based on the starting head position and context. Stacks with the same identifier are always parsed in the same way. This can be used to cache intermediate parsing info.
- property _textbuffer¶
The current textbuffer.
- regex = re.compile('([{}\\[\\]<>|=&\'#*;:/\\\\\\"\\-!\\n])', re.IGNORECASE)¶
- tag_splitter = re.compile('([\\s\\"\\\'\\\\]+)')¶
tokens Module¶
This module contains the token definitions that are used as an intermediate
parsing data type - they are stored in a flat list, with each token being
identified by its type and optional attributes. The token list is generated in
a syntactically valid form by the Tokenizer, and then converted into
the Wikicode tree by the Builder.
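To make the flat-list-to-tree step concrete, here is a toy sketch of the idea. It uses a hypothetical `Token` dataclass and a `build` function that returns nested Python lists and dicts rather than Wikicode objects, and it handles only text and template open/close tokens. Like the Builder described on this page, it exhausts the list it is given.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Toy token: a type name plus optional text (not the library's Token)."""
    kind: str
    text: str = ""


def build(tokens):
    """Consume a flat token list and return a nested node structure."""
    tokens.reverse()  # reverse in place so pop() yields tokens in order

    def parse_until(close_kind):
        nodes = []
        while tokens:
            token = tokens.pop()
            if token.kind == "Text":
                nodes.append(token.text)
            elif token.kind == "TemplateOpen":
                # Recurse: everything up to the matching close is a child.
                nodes.append({"template": parse_until("TemplateClose")})
            elif token.kind == close_kind:
                return nodes
            else:
                raise ValueError(f"unexpected token: {token.kind}")
        if close_kind is not None:
            raise ValueError(f"missing {close_kind}")
        return nodes

    return parse_until(None)


tokens = [
    Token("Text", "a"),
    Token("TemplateOpen"),
    Token("Text", "foo"),
    Token("TemplateClose"),
    Token("Text", "b"),
]
tree = build(tokens)
```

Because the token list is already syntactically valid by the time it reaches this stage, the sketch treats an unmatched open or close as a programming error rather than as recoverable input.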
- class mwparserfromhell.parser.tokens.ArgumentClose¶
- class mwparserfromhell.parser.tokens.ArgumentOpen¶
- class mwparserfromhell.parser.tokens.ArgumentSeparator¶
- class mwparserfromhell.parser.tokens.CommentEnd¶
- class mwparserfromhell.parser.tokens.CommentStart¶
- class mwparserfromhell.parser.tokens.ExternalLinkClose¶
- class mwparserfromhell.parser.tokens.ExternalLinkOpen¶
- class mwparserfromhell.parser.tokens.ExternalLinkSeparator¶
- class mwparserfromhell.parser.tokens.HTMLEntityEnd¶
- class mwparserfromhell.parser.tokens.HTMLEntityHex¶
- class mwparserfromhell.parser.tokens.HTMLEntityNumeric¶
- class mwparserfromhell.parser.tokens.HTMLEntityStart¶
- class mwparserfromhell.parser.tokens.HeadingEnd¶
- class mwparserfromhell.parser.tokens.HeadingStart¶
- class mwparserfromhell.parser.tokens.TagAttrEquals¶
- class mwparserfromhell.parser.tokens.TagAttrQuote¶
- class mwparserfromhell.parser.tokens.TagAttrStart¶
- class mwparserfromhell.parser.tokens.TagCloseClose¶
- class mwparserfromhell.parser.tokens.TagCloseOpen¶
- class mwparserfromhell.parser.tokens.TagCloseSelfclose¶
- class mwparserfromhell.parser.tokens.TagOpenClose¶
- class mwparserfromhell.parser.tokens.TagOpenOpen¶
- class mwparserfromhell.parser.tokens.TemplateClose¶
- class mwparserfromhell.parser.tokens.TemplateOpen¶
- class mwparserfromhell.parser.tokens.TemplateParamEquals¶
- class mwparserfromhell.parser.tokens.TemplateParamSeparator¶
- class mwparserfromhell.parser.tokens.Text¶
- class mwparserfromhell.parser.tokens.Token[source]¶
A token stores the semantic meaning of a unit of wikicode.
- class mwparserfromhell.parser.tokens.WikilinkClose¶
- class mwparserfromhell.parser.tokens.WikilinkOpen¶
- class mwparserfromhell.parser.tokens.WikilinkSeparator¶