parser Package¶

This package contains the actual wikicode parser, split up into two main modules: the tokenizer and the builder. This module joins them together into one interface.

class mwparserfromhell.parser.Parser[source]¶

Represents a parser for wikicode.

Actual parsing is a two-step process: first, the text is split up into a series of tokens by the Tokenizer, and then the tokens are converted into trees of Wikicode objects and Nodes by the Builder.
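The two-step shape described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `toy_tokenize` and `toy_build` are hypothetical names, the tokens are plain tuples, and only balanced `{{`/`}}` pairs are recognised. It is not the library's Tokenizer or Builder.

```python
import re

def toy_tokenize(text):
    """Step 1: split text into a flat list of (kind, value) tokens."""
    tokens = []
    for piece in re.split(r"(\{\{|\}\})", text):
        if piece == "{{":
            tokens.append(("TemplateOpen", None))
        elif piece == "}}":
            tokens.append(("TemplateClose", None))
        elif piece:  # skip the empty strings re.split can produce
            tokens.append(("Text", piece))
    return tokens

def toy_build(tokens):
    """Step 2: fold the flat token list into a nested tree.

    Assumes the open/close tokens are balanced.
    """
    stack = [[]]
    for kind, value in tokens:
        if kind == "TemplateOpen":
            stack.append([])                      # start a nested node list
        elif kind == "TemplateClose":
            parts = stack.pop()                   # finish the nested list
            stack[-1].append({"template": parts})
        else:
            stack[-1].append(value)
    return stack[0]

tree = toy_build(toy_tokenize("a {{b {{c}}}} d"))
```

Keeping the two steps separate means the first step only has to decide where markup starts and ends, while the second only has to assemble nodes from an already valid token sequence.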

Instances of this class or its dependencies (Tokenizer and Builder) should not be shared between threads. parse() can be called multiple times on the same instance as long as the calls are not concurrent. In general, there is no need to manage Parser instances directly, because parsing should be done through mwparserfromhell.parse(), which creates a new Parser object as necessary.

parse(text, context=0, skip_style_tags=False)[source]¶

Parse text, returning a Wikicode object tree.

If given, context will be passed as a starting context to the parser. This is helpful when this function is used inside node attribute setters. For example, ExternalLink’s url setter sets context to contexts.EXT_LINK_URI to prevent the URL itself from becoming an ExternalLink.

If skip_style_tags is True, then '' and ''' will not be parsed, but instead will be treated as plain text.

If there is an internal error while parsing, ParserError will be raised.

exception mwparserfromhell.parser.ParserError(extra)[source]¶

Exception raised when an internal error occurs while parsing.

This does not mean that the wikicode was invalid, because invalid markup should still be parsed correctly. This means that the parser caught itself with an impossible internal state and is bailing out before other problems can happen. Its appearance indicates a bug.

`builder` Module¶

class mwparserfromhell.parser.builder.Builder[source]¶

Builds a tree of nodes out of a sequence of tokens.

To use, pass a list of Tokens to the build() method. The list will be exhausted as it is parsed and a Wikicode object containing the node tree will be returned.

_handle_argument(token)[source]¶: Handle a case where an argument is at the head of the tokens.

_handle_attribute(start)[source]¶: Handle a case where a tag attribute is at the head of the tokens.

_handle_comment(token)[source]¶: Handle a case where an HTML comment is at the head of the tokens.

_handle_entity(token)[source]¶: Handle a case where an HTML entity is at the head of the tokens.

_handle_external_link(token)[source]¶: Handle a case where an external link is at the head of the tokens.

_handle_heading(token)[source]¶: Handle a case where a heading is at the head of the tokens.

_handle_parameter(default)[source]¶

Handle a case where a parameter is at the head of the tokens.

default is the value to use if no parameter name is defined.

_handle_tag(token)[source]¶: Handle a case where a tag is at the head of the tokens.

_handle_template(token)[source]¶: Handle a case where a template is at the head of the tokens.

_handle_token(token)[source]¶: Handle a single token.

_handle_wikilink(token)[source]¶: Handle a case where a wikilink is at the head of the tokens.

_pop()[source]¶

Pop the current node list off of the stack.

The raw node list is wrapped in a SmartList and then in a Wikicode object.

_push()[source]¶: Push a new node list onto the stack.

_write(item)[source]¶: Append a node to the current node list.
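The push/write/pop discipline of these three helpers can be sketched with a small stand-in class. `ToyNodeStack` is a hypothetical name, and it returns plain lists, whereas the real `_pop()` is documented as wrapping the list in a SmartList and a Wikicode object.

```python
class ToyNodeStack:
    """Hypothetical stand-in for the Builder's stack of node lists."""

    def __init__(self):
        self._stacks = []

    def push(self):
        # Start collecting nodes for a new (possibly nested) construct.
        self._stacks.append([])

    def write(self, item):
        # Append a node to the innermost node list.
        self._stacks[-1].append(item)

    def pop(self):
        # Finish the innermost node list and hand it back to the caller.
        return self._stacks.pop()

stack = ToyNodeStack()
stack.push()                      # outer document
stack.write("text ")
stack.push()                      # nested construct, e.g. a template name
stack.write("name")
inner = stack.pop()
stack.write(("template", inner))  # nested result becomes one outer node
outer = stack.pop()
```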

build(tokenlist)[source]¶: Build a Wikicode object from a list of tokens and return it.

`contexts` Module¶

This module contains various “context” definitions, which are essentially flags set during the tokenization process, either on the current parse stack (local contexts) or affecting all stacks (global contexts). They represent the context the tokenizer is in, such as inside a template’s name definition, or inside a level-two heading. This is used to determine what tokens are valid at the current point and also if the current parsing route is invalid.

The tokenizer stores context as an integer, with these definitions bitwise OR’d to set them, AND’d to check if they’re set, and XOR’d to unset them. The advantage of this is that contexts can have sub-contexts (as FOO == 0b11 will cover BAR == 0b10 and BAZ == 0b01).
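The bit operations described above can be sketched with the placeholder names from the paragraph. `FOO`, `BAR`, and `BAZ` are illustrative values only, and the helper functions are hypothetical, not part of the module.

```python
BAR = 0b10
BAZ = 0b01
FOO = BAR | BAZ  # 0b11: an aggregate covering both sub-contexts

def set_flag(context, flag):
    return context | flag          # OR sets the flag

def has_flag(context, flag):
    return bool(context & flag)    # AND checks whether any bit of flag is set

def unset_flag(context, flag):
    # XOR toggles bits, so it only "unsets" a flag that is currently set.
    return context ^ flag

context = 0
context = set_flag(context, BAR)
in_bar = has_flag(context, BAR)    # True
in_foo = has_flag(context, FOO)    # True: FOO covers BAR
in_baz = has_flag(context, BAZ)    # False
context = unset_flag(context, BAR)
```

Because XOR flips a bit rather than clearing it, a caller that cannot be sure the flag is set could use `context & ~flag` instead; that alternative is a general bit-manipulation idiom, not something this documentation prescribes.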

Local (stack-specific) contexts:

TEMPLATE
- TEMPLATE_NAME
- TEMPLATE_PARAM_KEY
- TEMPLATE_PARAM_VALUE
ARGUMENT
- ARGUMENT_NAME
- ARGUMENT_DEFAULT
WIKILINK
- WIKILINK_TITLE
- WIKILINK_TEXT
EXT_LINK
- EXT_LINK_URI
- EXT_LINK_TITLE
HEADING
- HEADING_LEVEL_1
- HEADING_LEVEL_2
- HEADING_LEVEL_3
- HEADING_LEVEL_4
- HEADING_LEVEL_5
- HEADING_LEVEL_6
TAG
- TAG_OPEN
- TAG_ATTR
- TAG_BODY
- TAG_CLOSE
STYLE
- STYLE_ITALICS
- STYLE_BOLD
- STYLE_PASS_AGAIN
- STYLE_SECOND_PASS
DL_TERM
SAFETY_CHECK
- HAS_TEXT
- FAIL_ON_TEXT
- FAIL_NEXT
- FAIL_ON_LBRACE
- FAIL_ON_RBRACE
- FAIL_ON_EQUALS
- HAS_TEMPLATE
TABLE
- TABLE_OPEN
- TABLE_CELL_OPEN
- TABLE_CELL_STYLE
- TABLE_TD_LINE
- TABLE_TH_LINE
- TABLE_CELL_LINE_CONTEXTS
HTML_ENTITY

Global contexts:

GL_HEADING

Aggregate contexts:

FAIL
UNSAFE
DOUBLE
NO_WIKILINKS
NO_EXT_LINKS

mwparserfromhell.parser.contexts.describe(context)[source]¶: Return a string describing the given context value, for debugging.
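A debugging helper of this kind can be sketched as follows. The bit values in `TOY_FLAGS`, the function name `describe_flags`, and the `|`-joined output format are all assumptions made for illustration; they are not taken from the module.

```python
# Hypothetical flag table; the real module defines its own values.
TOY_FLAGS = {
    "TEMPLATE_NAME": 1 << 0,
    "TEMPLATE_PARAM_KEY": 1 << 1,
    "TEMPLATE_PARAM_VALUE": 1 << 2,
}

def describe_flags(context):
    """Return the names of all flags set in context, joined by '|'."""
    names = [name for name, bit in TOY_FLAGS.items() if context & bit]
    return "|".join(names)

description = describe_flags(0b101)
```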

`errors` Module¶

exception mwparserfromhell.parser.errors.ParserError(extra)[source]¶

Exception raised when an internal error occurs while parsing.

This does not mean that the wikicode was invalid, because invalid markup should still be parsed correctly. This means that the parser caught itself with an impossible internal state and is bailing out before other problems can happen. Its appearance indicates a bug.

`tokenizer` Module¶

class mwparserfromhell.parser.tokenizer.Tokenizer[source]¶

Creates a list of tokens from a string of wikicode.

MARKERS: list[str | Sentinel] = ['{', '}', '[', ']', '<', '>', '|', '=', '&', "'", '"', '#', '*', ';', ':', '/', '-', '!', '\n', Sentinel.START, Sentinel.END]¶

MAX_DEPTH = 100¶

URISCHEME = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+.-'¶

USES_C = False¶

_can_recurse()[source]¶: Return whether or not our max recursion depth has been exceeded.

property _context¶: The current token context.

_emit(token)[source]¶: Write a token to the end of the current token stack.

_emit_all(tokenlist)[source]¶: Write a series of tokens to the current stack at once.

_emit_first(token)[source]¶: Write a token to the beginning of the current token stack.

_emit_style_tag(tag, markup, body)[source]¶: Write the body of a tag and the tokens that should surround it.

_emit_table_tag(open_open_markup, tag, style, padding, close_open_markup, contents, open_close_markup)[source]¶: Emit a table tag.

_emit_text(text)[source]¶: Write text to the current textbuffer.

_emit_text_then_stack(text)[source]¶: Pop the current stack, write text, and then write the stack.

_fail_route()[source]¶

Fail the current tokenization route.

Discards the current stack/context/textbuffer and raises BadRoute.
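The general pattern of failing a route by raising an exception and falling back to plain text can be sketched with a toy parser. Everything here is a stand-in: `ToyBadRoute`, `toy_parse_template`, and `toy_parse` are hypothetical names, and the real tokenizer's state handling is more involved.

```python
class ToyBadRoute(Exception):
    """Stand-in for the internal exception; not the library's BadRoute."""

def toy_parse_template(text, head):
    """Try to read '{{...}}' starting at head; fail the route if unclosed."""
    end = text.find("}}", head + 2)
    if end == -1:
        raise ToyBadRoute()
    return text[head + 2:end], end + 2

def toy_parse(text):
    out, head = [], 0
    while head < len(text):
        if text.startswith("{{", head):
            try:
                name, head = toy_parse_template(text, head)
                out.append(("template", name))
            except ToyBadRoute:
                # The speculative route failed: keep the braces as text.
                out.append(("text", "{{"))
                head += 2
            continue
        out.append(("text", text[head]))
        head += 1
    return out

closed = toy_parse("a{{b}}")
unclosed = toy_parse("{{x")
```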

_handle_argument_end()[source]¶: Handle the end of an argument at the head of the string.

_handle_argument_separator()[source]¶: Handle the separator between an argument’s name and default.

_handle_blacklisted_tag()[source]¶: Handle the body of an HTML tag that is parser-blacklisted.

_handle_dl_term()[source]¶: Handle the term in a description list (foo in ;foo:bar).

_handle_end()[source]¶: Handle the end of the stream of wikitext.

_handle_free_link_text(punct, tail, this)[source]¶: Handle text in a free ext link, including trailing punctuation.

_handle_heading_end()[source]¶: Handle the end of a section heading at the head of the string.

_handle_hr()[source]¶: Handle a wiki-style horizontal rule (----) in the string.

_handle_invalid_tag_start()[source]¶: Handle the (possible) start of an implicitly closing single tag.

_handle_list()[source]¶: Handle a wiki-style list (#, *, ;, :).

_handle_list_marker()[source]¶: Handle a list marker at the head (#, *, ;, :).

_handle_single_only_tag_end()[source]¶: Handle the end of an implicitly closing single-only HTML tag.

_handle_single_tag_end()[source]¶: Handle the stream end when inside a single-supporting HTML tag.

_handle_table_cell(markup, tag, line_context)[source]¶: Parse as normal syntax unless we hit a style marker, then parse style as HTML attributes and the remainder as normal syntax.

_handle_table_cell_end(reset_for_style=False)[source]¶: Returns the current context, with the TABLE_CELL_STYLE flag set if it is necessary to reset and parse style attributes.

_handle_table_end()[source]¶: Return the stack in order to handle the table end.

_handle_table_row()[source]¶: Parse as style until end of the line, then continue.

_handle_table_row_end()[source]¶: Return the stack in order to handle the table row end.

_handle_table_style(end_token: str)[source]¶: Handle style attributes for a table until end_token.

_handle_tag_close_close()[source]¶: Handle the ending of a closing tag (</foo>).

_handle_tag_close_open(data, token)[source]¶: Handle the closing of an open tag (<foo>).

_handle_tag_data(data, text)[source]¶: Handle all sorts of text data inside of an HTML open tag.

_handle_tag_open_close()[source]¶: Handle the opening of a closing tag (</foo>).

_handle_tag_space(data, text)[source]¶: Handle whitespace (text) inside of an HTML open tag.

_handle_tag_text(text)[source]¶: Handle regular text inside of an HTML open tag.

_handle_template_end()[source]¶: Handle the end of a template at the head of the string.

_handle_template_param()[source]¶: Handle a template parameter at the head of the string.

_handle_template_param_value()[source]¶: Handle a template parameter’s value at the head of the string.

_handle_wikilink_end()[source]¶: Handle the end of a wikilink at the head of the string.

_handle_wikilink_separator()[source]¶: Handle the separator between a wikilink’s title and its text.

_is_uri_end(this, nxt)[source]¶: Return whether the current head is the end of a URI.

_memoize_bad_route()[source]¶

Remember that the current route (head + context at push) is invalid.

This will be noticed when calling _push with the same head and context, and the route will be failed immediately.
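The memoization idea can be sketched with a set of (head, context) pairs. `ToyRouteTracker` and `ToyBadRoute` are hypothetical names, and the `attempts` counter exists only to make the effect visible.

```python
class ToyBadRoute(Exception):
    """Stand-in for the internal exception; not the library's BadRoute."""

class ToyRouteTracker:
    def __init__(self):
        self.bad_routes = set()
        self.attempts = 0  # routes that were actually explored

    def push(self, head, context):
        ident = (head, context)
        if ident in self.bad_routes:
            # Known dead end: fail immediately instead of re-parsing.
            raise ToyBadRoute()
        self.attempts += 1
        return ident

    def memoize_bad_route(self, ident):
        self.bad_routes.add(ident)

tracker = ToyRouteTracker()
ident = tracker.push(5, 0b1)
tracker.memoize_bad_route(ident)   # the route starting here turned out invalid

try:
    tracker.push(5, 0b1)           # same head and context: fails at once
    failed_fast = False
except ToyBadRoute:
    failed_fast = True

tracker.push(5, 0b10)              # same head, different context: still tried
```

Keying on both the head position and the context matters because the same position may be valid under one context and invalid under another.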

_parse(context=0, push=True)[source]¶: Parse the wikicode string, using context for when to stop.

_parse_argument()[source]¶: Parse an argument at the head of the wikicode string.

_parse_bold()[source]¶: Parse wiki-style bold.

_parse_bracketed_uri_scheme()[source]¶: Parse the URI scheme of a bracket-enclosed external link.

_parse_comment()[source]¶: Parse an HTML comment at the head of the wikicode string.

_parse_entity()[source]¶: Parse an HTML entity at the head of the wikicode string.

_parse_external_link(brackets)[source]¶: Parse an external link at the head of the wikicode string.

_parse_free_uri_scheme()[source]¶: Parse the URI scheme of a free (no brackets) external link.

_parse_heading()[source]¶: Parse a section heading at the head of the wikicode string.

_parse_italics()[source]¶: Parse wiki-style italics.

_parse_italics_and_bold()[source]¶: Parse wiki-style italics and bold together (i.e., five ticks).

_parse_style()[source]¶: Parse wiki-style formatting (''/''' for italics/bold).

_parse_table()[source]¶: Parse a wikicode table by starting with the first line.

_parse_tag()[source]¶: Parse an HTML tag at the head of the wikicode string.

_parse_template(has_content)[source]¶: Parse a template at the head of the wikicode string.

_parse_template_or_argument()[source]¶: Parse a template or argument at the head of the wikicode string.

_parse_wikilink()[source]¶: Parse an internal wikilink at the head of the wikicode string.

_pop(keep_context=False)[source]¶

Pop the current stack/context/textbuffer, returning the stack.

If keep_context is True, then we will replace the underlying stack’s context with the current stack’s.

_push(context=0)[source]¶: Add a new token stack, context, and textbuffer to the list.

_push_tag_buffer(data)[source]¶: Write a pending tag attribute from data to the stack.

_push_textbuffer()[source]¶: Push the textbuffer onto the stack as a Text node and clear it.

_read(*, strict: Literal[False] = False) → str | Literal[Sentinel.END][source]¶

_read(*, strict: Literal[True]) → str

_read(delta: int = 0, *, strict: Literal[False] = False) → str | Literal[Sentinel.START, Sentinel.END]

_read(delta: int = 0, *, strict: Literal[True]) → str | Literal[Sentinel.START]

Read the value at a relative point in the wikicode.

The value is read from self._head plus the value of delta (which can be negative). If strict is True, the route will be failed (with _fail_route()) if we try to read from past the end of the string; otherwise, END is returned. If we try to read from before the start of the string, START is returned.
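The documented behaviour of `_read()` can be modelled with a free function. This is a sketch of the semantics only: `toy_read`, `ToySentinel`, and `ToyRouteFailed` are hypothetical names, and the head position is passed in explicitly rather than stored on an instance.

```python
from enum import Enum

class ToySentinel(Enum):
    START = "start"
    END = "end"

class ToyRouteFailed(Exception):
    """Stand-in for the effect of failing the route."""

def toy_read(segments, head, delta=0, *, strict=False):
    index = head + delta
    if index < 0:
        return ToySentinel.START      # before the start of the input
    if index >= len(segments):
        if strict:
            raise ToyRouteFailed()    # strict reads fail the route at the end
        return ToySentinel.END
    return segments[index]

segments = ["{", "{", "foo"]
current = toy_read(segments, 0)
before = toy_read(segments, 0, -1)
after = toy_read(segments, 2, 1)
```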

_really_parse_entity()[source]¶: Actually parse an HTML entity and ensure that it is valid.

_really_parse_external_link(brackets)[source]¶: Really parse an external link.

_really_parse_tag()[source]¶: Actually parse an HTML tag, starting with the open (<foo>).

_remove_uri_scheme_from_textbuffer(scheme)[source]¶: Remove the URI scheme of a new external link from the textbuffer.

property _stack¶: The current token stack.

property _stack_ident¶

An identifier for the current stack.

This is based on the starting head position and context. Stacks with the same identifier are always parsed in the same way. This can be used to cache intermediate parsing info.

property _textbuffer¶: The current textbuffer.

_verify_safe(this)[source]¶: Make sure we are not trying to write an invalid character.

regex = re.compile('([{}\\[\\]<>|=&\'#*;:/\\\\\\"\\-!\\n])', re.IGNORECASE)¶

tag_splitter = re.compile('([\\s\\"\\\'\\\\]+)')¶
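Both patterns wrap their character class in a capturing group, so `re.split()` keeps the matched delimiters in its result. The snippet below copies the two patterns verbatim and shows what splitting with them produces; that the tokenizer filters out empty segments in this way is an assumption made for the example.

```python
import re

regex = re.compile('([{}\\[\\]<>|=&\'#*;:/\\\\\\"\\-!\\n])', re.IGNORECASE)
tag_splitter = re.compile('([\\s\\"\\\'\\\\]+)')

# Each marker character becomes its own segment; text between markers
# stays together. Empty strings between adjacent markers are dropped here.
segments = [seg for seg in regex.split("{{foo|bar}}") if seg]

# Runs of whitespace, quotes, and backslashes are kept as separate chunks.
tag_chunks = tag_splitter.split('a href="x"')
```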

tokenize(text: str, context=0, skip_style_tags=False)[source]¶: Build a list of tokens from a string of wikicode and return it.

exception mwparserfromhell.parser.tokenizer.BadRoute(context=0)[source]¶: Raised internally when the current tokenization route is invalid.

`tokens` Module¶

This module contains the token definitions that are used as an intermediate parsing data type. They are stored in a flat list, with each token being identified by its type and optional attributes. The token list is generated in a syntactically valid form by the Tokenizer, and then converted into the Wikicode tree by the Builder.

class mwparserfromhell.parser.tokens.ArgumentClose¶

class mwparserfromhell.parser.tokens.ArgumentOpen¶

class mwparserfromhell.parser.tokens.ArgumentSeparator¶

class mwparserfromhell.parser.tokens.CommentEnd¶

class mwparserfromhell.parser.tokens.CommentStart¶

class mwparserfromhell.parser.tokens.ExternalLinkClose¶

class mwparserfromhell.parser.tokens.ExternalLinkOpen¶

class mwparserfromhell.parser.tokens.ExternalLinkSeparator¶

class mwparserfromhell.parser.tokens.HTMLEntityEnd¶

class mwparserfromhell.parser.tokens.HTMLEntityHex¶

class mwparserfromhell.parser.tokens.HTMLEntityNumeric¶

class mwparserfromhell.parser.tokens.HTMLEntityStart¶

class mwparserfromhell.parser.tokens.HeadingEnd¶

class mwparserfromhell.parser.tokens.HeadingStart¶

class mwparserfromhell.parser.tokens.TagAttrEquals¶

class mwparserfromhell.parser.tokens.TagAttrQuote¶

class mwparserfromhell.parser.tokens.TagAttrStart¶

class mwparserfromhell.parser.tokens.TagCloseClose¶

class mwparserfromhell.parser.tokens.TagCloseOpen¶

class mwparserfromhell.parser.tokens.TagCloseSelfclose¶

class mwparserfromhell.parser.tokens.TagOpenClose¶

class mwparserfromhell.parser.tokens.TagOpenOpen¶

class mwparserfromhell.parser.tokens.TemplateClose¶

class mwparserfromhell.parser.tokens.TemplateOpen¶

class mwparserfromhell.parser.tokens.TemplateParamEquals¶

class mwparserfromhell.parser.tokens.TemplateParamSeparator¶

class mwparserfromhell.parser.tokens.Text¶

class mwparserfromhell.parser.tokens.Token[source]¶: A token stores the semantic meaning of a unit of wikicode.

class mwparserfromhell.parser.tokens.WikilinkClose¶

class mwparserfromhell.parser.tokens.WikilinkOpen¶

class mwparserfromhell.parser.tokens.WikilinkSeparator¶
