Add line numbers on all block tokens during parsing (#144)

* add tracking of line numbers to block_tokenizer.FileWrapper
* assign line numbers to all block tokens during parsing
* add line_number as a repr_attribute on all block tokens
* update the developer's guide
miyuchina · Dec 2, 2023 · 13d1c11
1 parent ee7ce94
commit 13d1c11

Showing 9 changed files with 250 additions and 90 deletions.
diff --git a/dev-guide.md b/dev-guide.md
@@ -3,13 +3,25 @@
 This document describes usage of mistletoe and its API
 from the developer's point of view.
 
-Understanding the AST
----------------------
+Understanding the AST and the tokens
+------------------------------------
 
 When a markdown document gets parsed by mistletoe, the result is represented
-as an "abstract syntax tree" (AST), stored in an instance of `Document`.
-This object contains a hierarchy of
-all the various tokens which were recognized during the parsing process.
+as an _abstract syntax tree (AST)_, stored in an instance of `Document`.
+This object contains a hierarchy of all the various _tokens_ which were recognized
+during the parsing process, for example, `Paragraph`, `Heading`, and `RawText`.
+
+The tokens which represent a line or a block of lines in the input markdown
+are called _block tokens_. Examples include `List`, `Paragraph`, `ThematicBreak`,
+and also the `Document` itself.
+
+The tokens which represent the actual content within a block are called _span tokens_,
+or, with CommonMark terminology, _inline tokens_.
+In this category you will find tokens like `RawText`, `Link`, and `Emphasis`.
+
+Block tokens may have block tokens, span tokens, or no tokens at all as children
+in the AST; this depends on the type of token. Span tokens may *only* have span
+tokens as children.
 
 In order to see what exactly gets parsed, one can simply use the `AstRenderer`
 on a given markdown input, for example:
@@ -36,9 +48,11 @@ Then we will get this JSON output from the AST renderer:
 {
   "type": "Document",
   "footnotes": {},
+  "line_number": 1,
   "children": [
     {
       "type": "Heading",
+      "line_number": 1,
       "level": 1,
       "children": [
         {
@@ -49,6 +63,7 @@ Then we will get this JSON output from the AST renderer:
     },
     {
       "type": "Paragraph",
+      "line_number": 3,
       "children": [
         {
           "type": "RawText",
@@ -58,6 +73,7 @@ Then we will get this JSON output from the AST renderer:
     },
     {
       "type": "Heading",
+      "line_number": 5,
       "level": 1,
       "children": [
         {
@@ -68,6 +84,7 @@ Then we will get this JSON output from the AST renderer:
     },
     {
       "type": "Paragraph",
+      "line_number": 7,
       "children": [
         {
           "type": "Link",
@@ -86,12 +103,25 @@ Then we will get this JSON output from the AST renderer:
 }
 ```
 
-When passing this tree to a renderer, it is recursively traversed
+### Line numbers
+
+mistletoe records the starting line of all block tokens that it encounters during
+parsing and stores it as the `line_number` attribute of each token.
+(This feature is not available for span tokens yet.)
+
+Rendering
+---------
+Sometimes all you need is the information from the AST. But more often, you'll
+want to take that information and turn it into some other format like HTML.
+This is called _rendering_. mistletoe provides a set of built-in renderers for
+different formats, and it's also possible to define your own renderer.
+
+When passing an AST to a renderer, the tree is recursively traversed
 and methods corresponding to individual token types get called on the renderer
 in order to create the output in the desired format.
 
-Creating a custom renderer
---------------------------
+Creating a custom token and renderer
+------------------------------------
 
 Here's an example of how to add GitHub-style wiki links to the parsing process,
 and provide a renderer for this new token.
@@ -245,7 +275,8 @@ For more info, take a look at the `base_renderer` module in mistletoe.
 The docstrings might give you a more granular idea of customizing mistletoe
 to your needs.
 
-## Markdown to Markdown
+Markdown to Markdown parsing-and-rendering
+------------------------------------------
 
 Suppose you have some Markdown that you want to process and then output
 as Markdown again. Thanks to the text-like nature of Markdown, it is often
@@ -254,12 +285,11 @@ example, if you want to replace a text fragment in the plain text, but not
 in the embedded code samples, then the search-and-replace approach won't work.
 
 In this case you can use mistletoe's `MarkdownRenderer`:
-1. Parse Markdown to an AST tree (usually held in a `Document` token).
-2. Make modifications to the AST tree.
+1. Parse Markdown to an AST (usually held in a `Document` token).
+2. Make modifications to the AST.
 3. Render back to Markdown using `MarkdownRenderer.render()`.
 
-Here is an example of how you can make text replacements in selected parts
-of the AST:
+Here is an example of how you can replace text in selected parts of the AST:
 
 ```python
 import mistletoe
@@ -296,7 +326,8 @@ with open("README.md", "r") as fin:
         print(md)
 ```
 
-If you're making large changes, so that the formatting of the document is
-affected, then it can be useful to also have the text reflowed. This can
-be done by specifying a `max_line_length` parameter in the call to the
-`MarkdownRenderer` constructor.
+The `MarkdownRenderer` can also reflow the text in the document to a given
+maximum line length. And it can do so while preserving the formatting of code
+blocks and other tokens where line breaks matter. To use this feature,
+specify a `max_line_length` parameter in the call to the `MarkdownRenderer`
+constructor.
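
The `line_number` field that this change adds to the `AstRenderer` JSON can be read back with a small tree walk. The following is a minimal sketch, not part of the commit: the helper name `iter_line_numbers` and the sample content values are illustrative assumptions, and only the `type`, `line_number`, and `children` keys are taken from the output shown in the dev guide.

```python
import json

# Sample shaped like the AstRenderer output in the dev guide.
# The "content" values are made up for illustration.
SAMPLE = json.loads("""
{
  "type": "Document",
  "line_number": 1,
  "children": [
    {"type": "Heading", "line_number": 1, "level": 1,
     "children": [{"type": "RawText", "content": "Title"}]},
    {"type": "Paragraph", "line_number": 3,
     "children": [{"type": "RawText", "content": "Some text."}]}
  ]
}
""")


def iter_line_numbers(node):
    """Yield (type, line_number) for every node that carries a line number.

    Span tokens such as RawText have no "line_number" key, so they are
    skipped, but their children (if any) are still visited.
    """
    if "line_number" in node:
        yield node["type"], node["line_number"]
    for child in node.get("children", []):
        yield from iter_line_numbers(child)


block_lines = list(iter_line_numbers(SAMPLE))
```

Here `block_lines` holds one entry per block token, which could be useful for error messages that point back at the source line.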
diff --git a/mistletoe/block_token.py b/mistletoe/block_token.py
@@ -88,7 +88,7 @@ class BlockToken(token.Token):
           of the current token. Every subclass of BlockToken must define a
           start function (see block_tokenizer.tokenize).
 
-        * BlockToken.read takes the rest of the lines in the ducment as an
+        * BlockToken.read takes the rest of the lines in the document as an
           iterator (including the start line), and consumes all the lines
           that should be read into this token.
 
@@ -107,7 +107,10 @@ class BlockToken(token.Token):
 
     Attributes:
         children (list): inner tokens.
+        line_number (int): starting line (1-based).
     """
+    repr_attributes = ("line_number",)
+
     def __init__(self, lines, tokenize_func):
         self.children = tokenize_func(lines)
 
@@ -138,6 +141,7 @@ def __init__(self, lines):
             lines = lines.splitlines(keepends=True)
         lines = [line if line.endswith('\n') else '{}\n'.format(line) for line in lines]
         self.footnotes = {}
+        self.line_number = 1
         token._root_node = self
         self.children = tokenize(lines)
         token._root_node = None
@@ -151,7 +155,7 @@ class Heading(BlockToken):
     Attributes:
         level (int): heading level.
     """
-    repr_attributes = ("level",)
+    repr_attributes = BlockToken.repr_attributes + ("level",)
     pattern = re.compile(r' {0,3}(#{1,6})(?:\n|\s+?(.*?)(\n|\s+?#+\s*?$))')
     level = 0
     content = ''
@@ -192,7 +196,7 @@ class SetextHeading(BlockToken):
     Attributes:
         level (int): heading level.
     """
-    repr_attributes = ("level",)
+    repr_attributes = BlockToken.repr_attributes + ("level",)
 
     def __init__(self, lines):
         self.underline = lines.pop().rstrip()
@@ -236,6 +240,7 @@ def read(cls, lines):
         if len(line) > 0 and line[0] == ' ':
             line = line[1:]
         line_buffer = [line]
+        start_line = lines.line_number()
 
         # set booleans
         in_code_fence = CodeFence.start(line)
@@ -271,7 +276,7 @@ def read(cls, lines):
 
         # parse child block tokens
         Paragraph.parse_setext = False
-        parse_buffer = tokenizer.tokenize_block(line_buffer, _token_types)
+        parse_buffer = tokenizer.tokenize_block(line_buffer, _token_types, start_line=start_line)
         Paragraph.parse_setext = True
         return parse_buffer
 
@@ -350,7 +355,7 @@ class BlockCode(BlockToken):
     Attributes:
         language (str): always the empty string.
     """
-    repr_attributes = ("language",)
+    repr_attributes = BlockToken.repr_attributes + ("language",)
     def __init__(self, lines):
         self.language = ''
         self.children = (span_token.RawText(''.join(lines).strip('\n')+'\n'),)
@@ -406,7 +411,7 @@ class CodeFence(BlockToken):
     Attributes:
         language (str): language of code block (default to empty).
     """
-    repr_attributes = ("language",)
+    repr_attributes = BlockToken.repr_attributes + ("language",)
     pattern = re.compile(r'( {0,3})(`{3,}|~{3,})( *(\S*)[^\n]*)')
     _open_info = None
 
@@ -466,7 +471,7 @@ class List(BlockToken):
         loose (bool): whether the list is loose.
         start (NoneType or int): None if unordered, starting number if ordered.
     """
-    repr_attributes = ("loose", "start")
+    repr_attributes = BlockToken.repr_attributes + ("loose", "start")
     pattern = re.compile(r' {0,3}(?:\d{0,9}[.)]|[+\-*])(?:[ \t]*$|[ \t]+)')
     def __init__(self, matches):
         self.children = [ListItem(*match) for match in matches]
@@ -537,11 +542,12 @@ class ListItem(BlockToken):
                        for continuation lines.
         loose (bool): whether the list is loose.
     """
-    repr_attributes = ("leader", "indentation", "prepend", "loose")
+    repr_attributes = BlockToken.repr_attributes + ("leader", "indentation", "prepend", "loose")
     pattern = re.compile(r'( {0,3})(\d{0,9}[.)]|[+\-*])($|\s+)')
     continuation_pattern = re.compile(r'([ \t]*)(\S.*\n|\n)')
 
-    def __init__(self, parse_buffer, indentation, prepend, leader):
+    def __init__(self, parse_buffer, indentation, prepend, leader, line_number=None):
+        self.line_number = line_number
         self.leader = leader
         self.indentation = indentation
         self.prepend = prepend
@@ -603,6 +609,7 @@ def read(cls, lines, prev_marker=None):
 
         # first line
         line = next(lines)
+        start_line = lines.line_number()
         next_line = lines.peek()
         indentation, prepend, leader, content = prev_marker if prev_marker else cls.parse_marker(line)
         if content.strip() == '':
@@ -619,7 +626,7 @@ def read(cls, lines, prev_marker=None):
                 parse_buffer = tokenizer.ParseBuffer()
                 parse_buffer.loose = True
                 next_marker = cls.parse_marker(next_line) if next_line is not None else None
-                return (parse_buffer, indentation, prepend, leader), next_marker
+                return (parse_buffer, indentation, prepend, leader, start_line), next_marker
         else:
             line_buffer.append(content)
 
@@ -663,8 +670,8 @@ def read(cls, lines, prev_marker=None):
 
         # block-level tokens are parsed here, so that footnotes can be
         # recognized before span-level parsing.
-        parse_buffer = tokenizer.tokenize_block(line_buffer, _token_types)
-        return (parse_buffer, indentation, prepend, leader), next_marker
+        parse_buffer = tokenizer.tokenize_block(line_buffer, _token_types, start_line=start_line)
+        return (parse_buffer, indentation, prepend, leader, start_line), next_marker
 
 
 class Table(BlockToken):
@@ -680,19 +687,20 @@ class Table(BlockToken):
         header: header row (TableRow).
         column_align (list): align options for each column (default to [None]).
     """
-    repr_attributes = ("column_align",)
+    repr_attributes = BlockToken.repr_attributes + ("column_align",)
     interrupt_paragraph = True
 
-    def __init__(self, lines):
+    def __init__(self, match):
+        lines, start_line = match
         if '---' in lines[1]:
             self.column_align = [self.parse_align(column)
                     for column in self.split_delimiter(lines[1])]
-            self.header = TableRow(lines[0], self.column_align)
-            self.children = [TableRow(line, self.column_align) for line in lines[2:]]
+            self.header = TableRow(lines[0], self.column_align, start_line)
+            self.children = [TableRow(line, self.column_align, start_line + offset) for offset, line in enumerate(lines[2:], start=2)]
         else:
             # note: not reachable, because read() guarantees the presence of three dashes
             self.column_align = [None]
-            self.children = [TableRow(line) for line in lines]
+            self.children = [TableRow(line, line_number=start_line + offset) for offset, line in enumerate(lines)]
 
     @staticmethod
     def split_delimiter(delimiter):
@@ -736,12 +744,13 @@ def check_interrupts_paragraph(cls, lines):
     def read(lines):
         anchor = lines.get_pos()
         line_buffer = [next(lines)]
+        start_line = lines.line_number()
         while lines.peek() is not None and '|' in lines.peek():
             line_buffer.append(next(lines))
         if len(line_buffer) < 2 or '---' not in line_buffer[1]:
             lines.set_pos(anchor)
             return None
-        return line_buffer
+        return line_buffer, start_line
 
 
 class TableRow(BlockToken):
@@ -754,16 +763,17 @@ class TableRow(BlockToken):
     Attributes:
         row_align (list): align options for each column (default to [None]).
     """
-    repr_attributes = ("row_align",)
+    repr_attributes = BlockToken.repr_attributes + ("row_align",)
     # Note: Python regex requires fixed-length look-behind,
     # so we cannot use a more precise alternative: r"(?<!\\(?:\\\\)*)(\|)"
     split_pattern = re.compile(r"(?<!\\)\|")
     escaped_pipe_pattern = re.compile(r"(?<!\\)(\\\\)*\\\|")
 
-    def __init__(self, line, row_align=None):
+    def __init__(self, line, row_align=None, line_number=None):
         self.row_align = row_align or [None]
+        self.line_number = line_number
         cells = filter(None, self.split_pattern.split(line.strip()))
-        self.children = [TableCell(self.escaped_pipe_pattern.sub('\\1|', cell.strip()) if cell else '', align)
+        self.children = [TableCell(self.escaped_pipe_pattern.sub('\\1|', cell.strip()) if cell else '', align, line_number)
                          for cell, align in zip_longest(cells, self.row_align)]
 
 
@@ -777,9 +787,10 @@ class TableCell(BlockToken):
     Attributes:
         align (bool): align option for current cell (default to None).
     """
-    repr_attributes = ("align",)
-    def __init__(self, content, align=None):
+    repr_attributes = BlockToken.repr_attributes + ("align",)
+    def __init__(self, content, align=None, line_number=None):
         self.align = align
+        self.line_number = line_number
         super().__init__(content, span_token.tokenize_inner)
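
The hunks above call `lines.line_number()` right after `next(lines)` and pass the result on as `start_line` when nested content is tokenized. The sketch below shows one way such tracking could work. It is a hypothetical stand-in, not the `block_tokenizer.FileWrapper` code from this commit: the class name `LineTracker` and its internals are assumptions, and only the `peek()`, `line_number()`, and `start_line` names mirror what the diff uses.

```python
class LineTracker:
    """Hypothetical line iterator that knows its position in the document."""

    def __init__(self, lines, start_line=1):
        self.lines = list(lines)
        self.start_line = start_line  # document line number of lines[0]
        self._index = -1              # index of the most recently consumed line

    def __iter__(self):
        return self

    def __next__(self):
        if self._index + 1 >= len(self.lines):
            raise StopIteration
        self._index += 1
        return self.lines[self._index]

    def peek(self):
        """Return the next line without consuming it, or None at the end."""
        nxt = self._index + 1
        return self.lines[nxt] if nxt < len(self.lines) else None

    def line_number(self):
        """1-based document line of the most recently consumed line."""
        return self.start_line + self._index


# Outer document: the quote starts on line 3.
outer = LineTracker(["# Title\n", "\n", "> quote\n", "> more\n"])
next(outer)
next(outer)
next(outer)
quote_start = outer.line_number()

# The stripped quote lines are re-tokenized with the outer start line,
# so nested tokens still report document-relative line numbers.
nested = LineTracker(["quote\n", "more\n"], start_line=quote_start)
next(nested)
next(nested)
nested_second = nested.line_number()
```

Passing the outer start line into the nested tracker is what keeps tokens inside a quote or list item numbered relative to the whole document rather than to the extracted buffer.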