parser Package¶
This package contains the actual wikicode parser, split up into two main
modules: the tokenizer and the builder. This module joins them
together into one interface.
- class mwparserfromhell.parser.Parser[source]¶
Represents a parser for wikicode.
Actual parsing is a two-step process: first, the text is split up into a series of tokens by the Tokenizer, and then the tokens are converted into trees of Wikicode objects and Nodes by the Builder.
Instances of this class or its dependents (Tokenizer and Builder) should not be shared between threads. parse() can be called multiple times as long as it is not done concurrently. In general, there is no need to do this because parsing should be done through mwparserfromhell.parse(), which creates a new Parser object as necessary.
- parse(text, context=0, skip_style_tags=False)[source]¶
Parse text, returning a Wikicode object tree.
If given, context will be passed as a starting context to the parser. This is helpful when this function is used inside node attribute setters. For example, ExternalLink’s url setter sets context to contexts.EXT_LINK_URI to prevent the URL itself from becoming an ExternalLink.
If skip_style_tags is True, then '' and ''' will not be parsed, but instead will be treated as plain text.
If there is an internal error while parsing, ParserError will be raised.
- exception mwparserfromhell.parser.ParserError(extra)[source]¶
Exception raised when an internal error occurs while parsing.
This does not mean that the wikicode was invalid, because invalid markup should still be parsed correctly. This means that the parser caught itself with an impossible internal state and is bailing out before other problems can happen. Its appearance indicates a bug.
builder Module¶
- class mwparserfromhell.parser.builder.Builder[source]¶
Builds a tree of nodes out of a sequence of tokens.
To use, pass a list of Tokens to the build() method. The list will be exhausted as it is parsed and a Wikicode object containing the node tree will be returned.
- _handle_parameter(default)[source]¶
Handle a case where a parameter is at the head of the tokens.
default is the value to use if no parameter name is defined.
contexts Module¶
This module contains various “context” definitions, which are essentially flags set during the tokenization process, either on the current parse stack (local contexts) or affecting all stacks (global contexts). They represent the context the tokenizer is in, such as inside a template’s name definition, or inside a level-two heading. This is used to determine what tokens are valid at the current point and also if the current parsing route is invalid.
The tokenizer stores context as an integer, with these definitions bitwise OR’d
to set them, AND’d to check if they’re set, and XOR’d to unset them. The
advantage of this is that contexts can have sub-contexts (as FOO == 0b11
will cover BAR == 0b10 and BAZ == 0b01).
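The flag arithmetic described above can be sketched in a few lines. The values below are the illustrative FOO/BAR/BAZ numbers from the paragraph, not the library's real context constants, and the variable names are made up for this example.

```python
# Hypothetical flag values mirroring the FOO/BAR/BAZ example above;
# these are NOT the real constants from the contexts module.
BAR = 0b10
BAZ = 0b01
FOO = BAR | BAZ  # aggregate context that covers both sub-contexts

context = 0
context |= BAR                  # OR sets a flag

in_bar = bool(context & BAR)    # AND checks a flag
in_foo = bool(context & FOO)    # an aggregate matches any of its sub-contexts
in_baz = bool(context & BAZ)

context ^= BAR                  # XOR unsets a flag that is currently set
cleared = context == 0

# XOR is a toggle, so it only "unsets" a flag that is known to be set.
# Masking with the complement clears a flag regardless of its state.
toggled = 0 ^ BAZ               # flag was unset, so XOR sets it
masked = 0 & ~BAZ               # stays cleared
```

Checking against an aggregate such as FOO answers "is the tokenizer in any of these sub-contexts?" with a single AND, which is the advantage the paragraph refers to.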
Local (stack-specific) contexts:
TEMPLATE
TEMPLATE_NAME
TEMPLATE_PARAM_KEY
TEMPLATE_PARAM_VALUE
ARGUMENT
ARGUMENT_NAME
ARGUMENT_DEFAULT
WIKILINK
WIKILINK_TITLE
WIKILINK_TEXT
EXT_LINK
EXT_LINK_URI
EXT_LINK_TITLE
HEADING
HEADING_LEVEL_1
HEADING_LEVEL_2
HEADING_LEVEL_3
HEADING_LEVEL_4
HEADING_LEVEL_5
HEADING_LEVEL_6
TAG
TAG_OPEN
TAG_ATTR
TAG_BODY
TAG_CLOSE
STYLE
STYLE_ITALICS
STYLE_BOLD
STYLE_PASS_AGAIN
STYLE_SECOND_PASS
DL_TERM
SAFETY_CHECK
HAS_TEXT
FAIL_ON_TEXT
FAIL_NEXT
FAIL_ON_LBRACE
FAIL_ON_RBRACE
FAIL_ON_EQUALS
HAS_TEMPLATE
TABLE
TABLE_OPEN
TABLE_CELL_OPEN
TABLE_CELL_STYLE
TABLE_TD_LINE
TABLE_TH_LINE
TABLE_CELL_LINE_CONTEXTS
HTML_ENTITY
Global contexts:
GL_HEADING
Aggregate contexts:
FAIL
UNSAFE
DOUBLE
NO_WIKILINKS
NO_EXT_LINKS
errors Module¶
- exception mwparserfromhell.parser.errors.ParserError(extra)[source]¶
Exception raised when an internal error occurs while parsing.
This does not mean that the wikicode was invalid, because invalid markup should still be parsed correctly. This means that the parser caught itself with an impossible internal state and is bailing out before other problems can happen. Its appearance indicates a bug.
tokenizer Module¶
- class mwparserfromhell.parser.tokenizer.Tokenizer[source]¶
Creates a list of tokens from a string of wikicode.
- END = <object object>¶
- MARKERS = ['{', '}', '[', ']', '<', '>', '|', '=', '&', "'", '"', '#', '*', ';', ':', '/', '-', '!', '\n', <object object>, <object object>]¶
- MAX_DEPTH = 100¶
- START = <object object>¶
- URISCHEME = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+.-'¶
- USES_C = False¶
- property _context¶
The current token context.
- _emit_style_tag(tag, markup, body)[source]¶
Write the body of a tag and the tokens that should surround it.
- _emit_table_tag(open_open_markup, tag, style, padding, close_open_markup, contents, open_close_markup)[source]¶
Emit a table tag.
- _fail_route()[source]¶
Fail the current tokenization route.
Discards the current stack/context/textbuffer and raises BadRoute.
- _handle_free_link_text(punct, tail, this)[source]¶
Handle text in a free ext link, including trailing punctuation.
- _handle_invalid_tag_start()[source]¶
Handle the (possible) start of an implicitly closing single tag.
- _handle_single_only_tag_end()[source]¶
Handle the end of an implicitly closing single-only HTML tag.
- _handle_table_cell(markup, tag, line_context)[source]¶
Parse as normal syntax unless we hit a style marker, then parse style as HTML attributes and the remainder as normal syntax.
- _handle_table_cell_end(reset_for_style=False)[source]¶
Returns the current context, with the TABLE_CELL_STYLE flag set if it is necessary to reset and parse style attributes.
- _handle_template_param_value()[source]¶
Handle a template parameter’s value at the head of the string.
- _memoize_bad_route()[source]¶
Remember that the current route (head + context at push) is invalid.
This will be noticed when calling _push with the same head and context, and the route will be failed immediately.
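As a rough illustration of the memoization idea, the sketch below keeps a set of (head, context) pairs and rejects any push that repeats one of them. ToyRouter and its BadRoute class are hypothetical stand-ins written for this example. They are not the tokenizer's actual implementation and leave out the textbuffer and token handling entirely.

```python
class BadRoute(Exception):
    """Stand-in for the exception raised when a route fails."""


class ToyRouter:
    """Hypothetical, minimal model of route memoization."""

    def __init__(self):
        self.head = 0            # current position in the text
        self.stacks = []         # identifiers of the routes currently open
        self.bad_routes = set()  # (head, context) pairs known to fail

    def push(self, context=0):
        ident = (self.head, context)
        if ident in self.bad_routes:
            # Same head and context as a route that already failed:
            # fail immediately instead of re-parsing.
            raise BadRoute(ident)
        self.stacks.append(ident)

    def fail_route(self):
        # Remember the identifier of the current route, then discard it.
        ident = self.stacks.pop()
        self.bad_routes.add(ident)
        raise BadRoute(ident)


router = ToyRouter()
router.push(context=0b10)
try:
    router.fail_route()
except BadRoute:
    pass

retry_allowed = True
try:
    router.push(context=0b10)   # same head, same context
except BadRoute:
    retry_allowed = False

router.push(context=0b01)       # same head, different context is still allowed
```

Keying the cache on both head and context matters because the same position can be a valid start for one construct and an invalid start for another.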
- _parse_template_or_argument()[source]¶
Parse a template or argument at the head of the wikicode string.
- _pop(keep_context=False)[source]¶
Pop the current stack/context/textbuffer, returning the stack.
If keep_context is True, then we will replace the underlying stack’s context with the current stack’s.
- _read(delta=0, wrap=False, strict=False)[source]¶
Read the value at a relative point in the wikicode.
The value is read from self._head plus the value of delta (which can be negative). If wrap is False, we will not allow attempts to read from the end of the string if self._head + delta is negative. If strict is True, the route will be failed (with _fail_route()) if we try to read from past the end of the string; otherwise, self.END is returned. If we try to read from before the start of the string, self.START is returned.
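The read semantics described above can be modelled with a small standalone class. ToyReader, its attribute names, and its BadRoute exception are assumptions made for this sketch. In particular, strict mode here simply raises, whereas the real _fail_route() also discards the current stack.

```python
class BadRoute(Exception):
    """Stand-in for the exception raised when a route fails."""


class ToyReader:
    """Hypothetical model of relative reads with START/END sentinels."""

    START = object()
    END = object()

    def __init__(self, text, head=0):
        self.text = text
        self.head = head

    def read(self, delta=0, wrap=False, strict=False):
        index = self.head + delta
        # A negative index is only honoured when wrapping is allowed
        # and the index still falls inside the string.
        if index < 0 and (not wrap or abs(index) > len(self.text)):
            return self.START
        try:
            return self.text[index]
        except IndexError:
            if strict:
                raise BadRoute(index)
            return self.END


reader = ToyReader("abc")
current = reader.read()                 # character at the head
before = reader.read(-1)                # before the start, no wrapping
wrapped = reader.read(-1, wrap=True)    # wraps around to the last character
past_end = reader.read(3)               # past the end, non-strict
```

Returning sentinel objects rather than raising lets callers compare the result with `is` and treat both string edges like ordinary markers.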
- _remove_uri_scheme_from_textbuffer(scheme)[source]¶
Remove the URI scheme of a new external link from the textbuffer.
- property _stack¶
The current token stack.
- property _stack_ident¶
An identifier for the current stack.
This is based on the starting head position and context. Stacks with the same identifier are always parsed in the same way. This can be used to cache intermediate parsing info.
- property _textbuffer¶
The current textbuffer.
- regex = re.compile('([{}\\[\\]<>|=&\'#*;:/\\\\\\"\\-!\\n])', re.IGNORECASE)¶
- tag_splitter = re.compile('([\\s\\"\\\'\\\\]+)')¶
tokens Module¶
This module contains the token definitions that are used as an intermediate
parsing data type - they are stored in a flat list, with each token being
identified by its type and optional attributes. The token list is generated in
a syntactically valid form by the Tokenizer, and then converted into
the Wikicode tree by the Builder.
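To make the flat-list-to-tree step concrete, here is a toy sketch. The token classes are redefined locally with the same names as a few of those listed below, and build() is a hypothetical helper that returns plain tuples. The real Builder produces Wikicode and Node objects and handles many more token types.

```python
class Token:
    """Toy token: identified by its type plus optional attributes."""

    def __init__(self, **attrs):
        self.attrs = attrs


class Text(Token):
    pass


class TemplateOpen(Token):
    pass


class TemplateClose(Token):
    pass


def build(tokens):
    """Consume a flat token list and return a nested list of nodes."""
    tokens.reverse()  # pop() from the end now yields tokens in source order
    return _build_until(tokens, closer=None)


def _build_until(tokens, closer):
    nodes = []
    while tokens:
        token = tokens.pop()
        if closer is not None and isinstance(token, closer):
            return nodes
        if isinstance(token, TemplateOpen):
            # Everything up to the matching close becomes a child subtree.
            nodes.append(("template", _build_until(tokens, TemplateClose)))
        elif isinstance(token, Text):
            nodes.append(("text", token.attrs["text"]))
        else:
            raise ValueError("unexpected token: %r" % type(token).__name__)
    if closer is not None:
        raise ValueError("token list ended inside an open construct")
    return nodes


tokens = [
    Text(text="a "),
    TemplateOpen(),
    Text(text="foo"),
    TemplateClose(),
    Text(text=" b"),
]
tree = build(tokens)
```

Because the tokenizer emits tokens in a syntactically valid order, a builder of this shape can rely on every open token having a matching close and treat anything else as an internal error.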
- class mwparserfromhell.parser.tokens.ArgumentClose¶
- class mwparserfromhell.parser.tokens.ArgumentOpen¶
- class mwparserfromhell.parser.tokens.ArgumentSeparator¶
- class mwparserfromhell.parser.tokens.CommentEnd¶
- class mwparserfromhell.parser.tokens.CommentStart¶
- class mwparserfromhell.parser.tokens.ExternalLinkClose¶
- class mwparserfromhell.parser.tokens.ExternalLinkOpen¶
- class mwparserfromhell.parser.tokens.ExternalLinkSeparator¶
- class mwparserfromhell.parser.tokens.HTMLEntityEnd¶
- class mwparserfromhell.parser.tokens.HTMLEntityHex¶
- class mwparserfromhell.parser.tokens.HTMLEntityNumeric¶
- class mwparserfromhell.parser.tokens.HTMLEntityStart¶
- class mwparserfromhell.parser.tokens.HeadingEnd¶
- class mwparserfromhell.parser.tokens.HeadingStart¶
- class mwparserfromhell.parser.tokens.TagAttrEquals¶
- class mwparserfromhell.parser.tokens.TagAttrQuote¶
- class mwparserfromhell.parser.tokens.TagAttrStart¶
- class mwparserfromhell.parser.tokens.TagCloseClose¶
- class mwparserfromhell.parser.tokens.TagCloseOpen¶
- class mwparserfromhell.parser.tokens.TagCloseSelfclose¶
- class mwparserfromhell.parser.tokens.TagOpenClose¶
- class mwparserfromhell.parser.tokens.TagOpenOpen¶
- class mwparserfromhell.parser.tokens.TemplateClose¶
- class mwparserfromhell.parser.tokens.TemplateOpen¶
- class mwparserfromhell.parser.tokens.TemplateParamEquals¶
- class mwparserfromhell.parser.tokens.TemplateParamSeparator¶
- class mwparserfromhell.parser.tokens.Text¶
- class mwparserfromhell.parser.tokens.Token[source]¶
A token stores the semantic meaning of a unit of wikicode.
- class mwparserfromhell.parser.tokens.WikilinkClose¶
- class mwparserfromhell.parser.tokens.WikilinkOpen¶
- class mwparserfromhell.parser.tokens.WikilinkSeparator¶