Add line numbers on all block tokens during parsing (#144)
* add tracking of line numbers to block_tokenizer.FileWrapper
* assign line numbers to all block tokens during parsing
* add line_number as a repr_attribute on all block tokens
* update the developer's guide
anderskaplan committed Dec 2, 2023
1 parent ee7ce94 commit 13d1c11
Showing 9 changed files with 250 additions and 90 deletions.
65 changes: 48 additions & 17 deletions dev-guide.md
@@ -3,13 +3,25 @@
This document describes usage of mistletoe and its API
from the developer's point of view.

Understanding the AST
---------------------
Understanding the AST and the tokens
------------------------------------

When a markdown document gets parsed by mistletoe, the result is represented
as an "abstract syntax tree" (AST), stored in an instance of `Document`.
This object contains a hierarchy of
all the various tokens which were recognized during the parsing process.
as an _abstract syntax tree (AST)_, stored in an instance of `Document`.
This object contains a hierarchy of all the various _tokens_ which were recognized
during the parsing process, for example, `Paragraph`, `Heading`, and `RawText`.

The tokens which represent a line or a block of lines in the input markdown
are called _block tokens_. Examples include `List`, `Paragraph`, `ThematicBreak`,
and also the `Document` itself.

The tokens which represent the actual content within a block are called _span tokens_,
or, in CommonMark terminology, _inline tokens_.
In this category you will find tokens like `RawText`, `Link`, and `Emphasis`.

Block tokens may have block tokens, span tokens, or no tokens at all as children
in the AST; this depends on the type of token. Span tokens may *only* have span
tokens as children.

In order to see what exactly gets parsed, one can simply use the `AstRenderer`
on a given markdown input, for example:
@@ -36,9 +48,11 @@ Then we will get this JSON output from the AST renderer:
{
"type": "Document",
"footnotes": {},
"line_number": 1,
"children": [
{
"type": "Heading",
"line_number": 1,
"level": 1,
"children": [
{
@@ -49,6 +63,7 @@ Then we will get this JSON output from the AST renderer:
},
{
"type": "Paragraph",
"line_number": 3,
"children": [
{
"type": "RawText",
@@ -58,6 +73,7 @@ Then we will get this JSON output from the AST renderer:
},
{
"type": "Heading",
"line_number": 5,
"level": 1,
"children": [
{
@@ -68,6 +84,7 @@ Then we will get this JSON output from the AST renderer:
},
{
"type": "Paragraph",
"line_number": 7,
"children": [
{
"type": "Link",
@@ -86,12 +103,25 @@ Then we will get this JSON output from the AST renderer:
}
```

When passing this tree to a renderer, it is recursively traversed
### Line numbers

mistletoe records the 1-based starting line of every block token that it
encounters during parsing and stores it in the token's `line_number` attribute.
(This feature is not available for span tokens yet.)
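
As a rough illustration of the idea (not mistletoe's actual implementation),
the sketch below wraps a list of lines in a reader that counts the lines it has
consumed, and uses it to record the 1-based starting line of each
blank-line-separated block. The names `LineReader` and `split_blocks` are
assumptions made up for this example.

```python
class LineReader:
    """Hypothetical line iterator that remembers how far it has read."""

    def __init__(self, lines):
        self.lines = lines
        self._index = -1  # index of the most recently consumed line

    def __iter__(self):
        return self

    def __next__(self):
        if self._index + 1 >= len(self.lines):
            raise StopIteration
        self._index += 1
        return self.lines[self._index]

    def peek(self):
        """Return the next line without consuming it, or None at the end."""
        nxt = self._index + 1
        return self.lines[nxt] if nxt < len(self.lines) else None

    def line_number(self):
        """1-based number of the most recently consumed line."""
        return self._index + 1


def split_blocks(text):
    """Toy block splitter: return a (start_line, lines) pair per block."""
    reader = LineReader(text.splitlines())
    blocks = []
    for line in reader:
        if not line.strip():
            continue  # blank lines only separate blocks
        start_line = reader.line_number()  # captured right after the first line is read
        buffer = [line]
        while reader.peek() is not None and reader.peek().strip():
            buffer.append(next(reader))
        blocks.append((start_line, buffer))
    return blocks


sample = "# Heading 1\n\ntext\n\n# Heading 2\n\n[link](https://www.example.com)\n"
block_starts = [start for start, _ in split_blocks(sample)]
print(block_starts)  # → [1, 3, 5, 7]
```

The key point is that the starting line is captured immediately after the first
line of a block has been consumed, before any further lines are read into it.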

Rendering
---------
Sometimes all you need is the information from the AST. But more often, you'll
want to take that information and turn it into some other format like HTML.
This is called _rendering_. mistletoe provides a set of built-in renderers for
different formats, and it's also possible to define your own renderer.

When passing an AST to a renderer, the tree is recursively traversed
and methods corresponding to individual token types get called on the renderer
in order to create the output in the desired format.
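
The traversal described above can be sketched with a few stand-in classes.
This is a simplified illustration under stated assumptions, not mistletoe's own
code: the `Token` subclasses, `ToyHtmlRenderer`, and the `handlers` dictionary
are all defined here just for the example.

```python
class Token:
    """Stand-in token: holds child tokens and, for leaves, some text."""

    def __init__(self, children=(), content=""):
        self.children = list(children)
        self.content = content


class Document(Token):
    pass


class Paragraph(Token):
    pass


class RawText(Token):
    pass


class ToyHtmlRenderer:
    def __init__(self):
        # Map each token type name to the method that renders it.
        self.handlers = {
            "Document": self.render_document,
            "Paragraph": self.render_paragraph,
            "RawText": self.render_raw_text,
        }

    def render(self, token):
        # Dispatch on the token's type.
        return self.handlers[type(token).__name__](token)

    def render_inner(self, token):
        # Recursively render all children and join the results.
        return "".join(self.render(child) for child in token.children)

    def render_document(self, token):
        return self.render_inner(token)

    def render_paragraph(self, token):
        return "<p>{}</p>\n".format(self.render_inner(token))

    def render_raw_text(self, token):
        return token.content


ast = Document([
    Paragraph([RawText(content="hello")]),
    Paragraph([RawText(content="world")]),
])
html = ToyHtmlRenderer().render(ast)
print(html)
```

Each render method only needs to know how to format its own token type; the
recursion through `render_inner` takes care of the rest of the tree.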

Creating a custom renderer
--------------------------
Creating a custom token and renderer
------------------------------------

Here's an example of how to add GitHub-style wiki links to the parsing process,
and provide a renderer for this new token.
@@ -245,7 +275,8 @@ For more info, take a look at the `base_renderer` module in mistletoe.
The docstrings might give you a more granular idea of customizing mistletoe
to your needs.

## Markdown to Markdown
Markdown to Markdown parsing-and-rendering
------------------------------------------

Suppose you have some Markdown that you want to process and then output
as Markdown again. Thanks to the text-like nature of Markdown, it is often
@@ -254,12 +285,11 @@ example, if you want to replace a text fragment in the plain text, but not
in the embedded code samples, then the search-and-replace approach won't work.

In this case you can use mistletoe's `MarkdownRenderer`:
1. Parse Markdown to an AST tree (usually held in a `Document` token).
2. Make modifications to the AST tree.
1. Parse Markdown to an AST (usually held in a `Document` token).
2. Make modifications to the AST.
3. Render back to Markdown using `MarkdownRenderer.render()`.

Here is an example of how you can make text replacements in selected parts
of the AST:
Here is an example of how you can replace text in selected parts of the AST:

```python
import mistletoe
@@ -296,7 +326,8 @@ with open("README.md", "r") as fin:
print(md)
```

If you're making large changes, so that the formatting of the document is
affected, then it can be useful to also have the text reflowed. This can
be done by specifying a `max_line_length` parameter in the call to the
`MarkdownRenderer` constructor.
The `MarkdownRenderer` can also reflow the text in the document to a given
maximum line length, while preserving the formatting of code blocks and other
tokens where line breaks matter. To use this feature, pass a `max_line_length`
argument to the `MarkdownRenderer` constructor.
57 changes: 34 additions & 23 deletions mistletoe/block_token.py
@@ -88,7 +88,7 @@ class BlockToken(token.Token):
of the current token. Every subclass of BlockToken must define a
start function (see block_tokenizer.tokenize).
* BlockToken.read takes the rest of the lines in the ducment as an
* BlockToken.read takes the rest of the lines in the document as an
iterator (including the start line), and consumes all the lines
that should be read into this token.
@@ -107,7 +107,10 @@ class BlockToken(token.Token):
Attributes:
children (list): inner tokens.
line_number (int): starting line (1-based).
"""
repr_attributes = ("line_number",)

def __init__(self, lines, tokenize_func):
self.children = tokenize_func(lines)

@@ -138,6 +141,7 @@ def __init__(self, lines):
lines = lines.splitlines(keepends=True)
lines = [line if line.endswith('\n') else '{}\n'.format(line) for line in lines]
self.footnotes = {}
self.line_number = 1
token._root_node = self
self.children = tokenize(lines)
token._root_node = None
@@ -151,7 +155,7 @@ class Heading(BlockToken):
Attributes:
level (int): heading level.
"""
repr_attributes = ("level",)
repr_attributes = BlockToken.repr_attributes + ("level",)
pattern = re.compile(r' {0,3}(#{1,6})(?:\n|\s+?(.*?)(\n|\s+?#+\s*?$))')
level = 0
content = ''
@@ -192,7 +196,7 @@ class SetextHeading(BlockToken):
Attributes:
level (int): heading level.
"""
repr_attributes = ("level",)
repr_attributes = BlockToken.repr_attributes + ("level",)

def __init__(self, lines):
self.underline = lines.pop().rstrip()
@@ -236,6 +240,7 @@ def read(cls, lines):
if len(line) > 0 and line[0] == ' ':
line = line[1:]
line_buffer = [line]
start_line = lines.line_number()

# set booleans
in_code_fence = CodeFence.start(line)
@@ -271,7 +276,7 @@ def read(cls, lines):

# parse child block tokens
Paragraph.parse_setext = False
parse_buffer = tokenizer.tokenize_block(line_buffer, _token_types)
parse_buffer = tokenizer.tokenize_block(line_buffer, _token_types, start_line=start_line)
Paragraph.parse_setext = True
return parse_buffer

@@ -350,7 +355,7 @@ class BlockCode(BlockToken):
Attributes:
language (str): always the empty string.
"""
repr_attributes = ("language",)
repr_attributes = BlockToken.repr_attributes + ("language",)
def __init__(self, lines):
self.language = ''
self.children = (span_token.RawText(''.join(lines).strip('\n')+'\n'),)
@@ -406,7 +411,7 @@ class CodeFence(BlockToken):
Attributes:
language (str): language of code block (default to empty).
"""
repr_attributes = ("language",)
repr_attributes = BlockToken.repr_attributes + ("language",)
pattern = re.compile(r'( {0,3})(`{3,}|~{3,})( *(\S*)[^\n]*)')
_open_info = None

@@ -466,7 +471,7 @@ class List(BlockToken):
loose (bool): whether the list is loose.
start (NoneType or int): None if unordered, starting number if ordered.
"""
repr_attributes = ("loose", "start")
repr_attributes = BlockToken.repr_attributes + ("loose", "start")
pattern = re.compile(r' {0,3}(?:\d{0,9}[.)]|[+\-*])(?:[ \t]*$|[ \t]+)')
def __init__(self, matches):
self.children = [ListItem(*match) for match in matches]
@@ -537,11 +542,12 @@ class ListItem(BlockToken):
for continuation lines.
loose (bool): whether the list is loose.
"""
repr_attributes = ("leader", "indentation", "prepend", "loose")
repr_attributes = BlockToken.repr_attributes + ("leader", "indentation", "prepend", "loose")
pattern = re.compile(r'( {0,3})(\d{0,9}[.)]|[+\-*])($|\s+)')
continuation_pattern = re.compile(r'([ \t]*)(\S.*\n|\n)')

def __init__(self, parse_buffer, indentation, prepend, leader):
def __init__(self, parse_buffer, indentation, prepend, leader, line_number=None):
self.line_number = line_number
self.leader = leader
self.indentation = indentation
self.prepend = prepend
@@ -603,6 +609,7 @@ def read(cls, lines, prev_marker=None):

# first line
line = next(lines)
start_line = lines.line_number()
next_line = lines.peek()
indentation, prepend, leader, content = prev_marker if prev_marker else cls.parse_marker(line)
if content.strip() == '':
@@ -619,7 +626,7 @@ def read(cls, lines, prev_marker=None):
parse_buffer = tokenizer.ParseBuffer()
parse_buffer.loose = True
next_marker = cls.parse_marker(next_line) if next_line is not None else None
return (parse_buffer, indentation, prepend, leader), next_marker
return (parse_buffer, indentation, prepend, leader, start_line), next_marker
else:
line_buffer.append(content)

@@ -663,8 +670,8 @@ def read(cls, lines, prev_marker=None):

# block-level tokens are parsed here, so that footnotes can be
# recognized before span-level parsing.
parse_buffer = tokenizer.tokenize_block(line_buffer, _token_types)
return (parse_buffer, indentation, prepend, leader), next_marker
parse_buffer = tokenizer.tokenize_block(line_buffer, _token_types, start_line=start_line)
return (parse_buffer, indentation, prepend, leader, start_line), next_marker


class Table(BlockToken):
@@ -680,19 +687,20 @@ class Table(BlockToken):
header: header row (TableRow).
column_align (list): align options for each column (default to [None]).
"""
repr_attributes = ("column_align",)
repr_attributes = BlockToken.repr_attributes + ("column_align",)
interrupt_paragraph = True

def __init__(self, lines):
def __init__(self, match):
lines, start_line = match
if '---' in lines[1]:
self.column_align = [self.parse_align(column)
for column in self.split_delimiter(lines[1])]
self.header = TableRow(lines[0], self.column_align)
self.children = [TableRow(line, self.column_align) for line in lines[2:]]
self.header = TableRow(lines[0], self.column_align, start_line)
self.children = [TableRow(line, self.column_align, start_line + offset) for offset, line in enumerate(lines[2:], start=2)]
else:
# note: not reachable, because read() guarantees the presence of three dashes
self.column_align = [None]
self.children = [TableRow(line) for line in lines]
self.children = [TableRow(line, line_number=start_line + offset) for offset, line in enumerate(lines)]

@staticmethod
def split_delimiter(delimiter):
@@ -736,12 +744,13 @@ def check_interrupts_paragraph(cls, lines):
def read(lines):
anchor = lines.get_pos()
line_buffer = [next(lines)]
start_line = lines.line_number()
while lines.peek() is not None and '|' in lines.peek():
line_buffer.append(next(lines))
if len(line_buffer) < 2 or '---' not in line_buffer[1]:
lines.set_pos(anchor)
return None
return line_buffer
return line_buffer, start_line


class TableRow(BlockToken):
@@ -754,16 +763,17 @@ class TableRow(BlockToken):
Attributes:
row_align (list): align options for each column (default to [None]).
"""
repr_attributes = ("row_align",)
repr_attributes = BlockToken.repr_attributes + ("row_align",)
# Note: Python regex requires fixed-length look-behind,
# so we cannot use a more precise alternative: r"(?<!\\(?:\\\\)*)(\|)"
split_pattern = re.compile(r"(?<!\\)\|")
escaped_pipe_pattern = re.compile(r"(?<!\\)(\\\\)*\\\|")

def __init__(self, line, row_align=None):
def __init__(self, line, row_align=None, line_number=None):
self.row_align = row_align or [None]
self.line_number = line_number
cells = filter(None, self.split_pattern.split(line.strip()))
self.children = [TableCell(self.escaped_pipe_pattern.sub('\\1|', cell.strip()) if cell else '', align)
self.children = [TableCell(self.escaped_pipe_pattern.sub('\\1|', cell.strip()) if cell else '', align, line_number)
for cell, align in zip_longest(cells, self.row_align)]


@@ -777,9 +787,10 @@ class TableCell(BlockToken):
Attributes:
align (bool): align option for current cell (default to None).
"""
repr_attributes = ("align",)
def __init__(self, content, align=None):
repr_attributes = BlockToken.repr_attributes + ("align",)
def __init__(self, content, align=None, line_number=None):
self.align = align
self.line_number = line_number
super().__init__(content, span_token.tokenize_inner)

