Conversation

@kormanowsky commented Mar 12, 2025

Ported the missing PythonLexerBase from TypeScript to C++.
This is needed so that one can create a fully working Python parser.

@kaby76 (Contributor) commented Mar 13, 2025

Thanks for this, but the code doesn't compile/link. After correcting the compilation errors, it crashes.

The first thing you'll need to do is fit this into the build, because as-is nothing is compiled, linked, or tested for the Cpp target.

You need these changes.

  1. The desc.xml must have Cpp added to the <targets>. Testing won't build anything unless you declare that the Cpp target actually "works".
  2. The grammar file needs @header to add #include "PythonLexerBase.h" to the generated PythonLexer.h. But an @header block is always target-specific code, and PythonLexer.g4 is supposed to stay "target agnostic". The workaround is a placeholder string in the .g4 that gets rewritten just before the Antlr4 Tool generates the .cpp and .h files; that is the whole purpose of transformGrammar.py (see the sketch after this list).
  3. I couldn't get anything to compile with using statements. Everything compiles if the Antlr4 types are fully qualified. So, instead of Token, it should be antlr4::Token.
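
For reference, here is how the point-2 rewrite plays out. This is a sketch reconstructed from the replacement rule in transformGrammar.py below; the committed grammar keeps only the placeholder comment, and only the Cpp build sees the rewritten line:

  // In the committed, target-agnostic PythonLexer.g4:
  // Insert here @header for lexer.

  // After transformGrammar.py runs, just before Antlr4 code generation:
  @header {#include "PythonLexerBase.h"}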

With those applied, these are the changes you need to make to your code.

$ git diff .
diff --git a/python/python3_13/Cpp/PythonLexerBase.cpp b/python/python3_13/Cpp/PythonLexerBase.cpp
index e35d8ee5..116d366c 100644
--- a/python/python3_13/Cpp/PythonLexerBase.cpp
+++ b/python/python3_13/Cpp/PythonLexerBase.cpp
@@ -1,4 +1,7 @@
+#include "antlr4-runtime.h"
 #include "PythonLexerBase.h"
+#include "PythonLexer.h"
+#include "PythonParser.h"

 using namespace antlr4;

@@ -148,7 +151,7 @@ void PythonLexerBase::setCurrentAndFollowingTokens() {
     if (this->ffgToken) {
         this->curToken = this->cloneToken(this->ffgToken);
     } else {
-        this->curToken = PythonLexer::nextToken();
+        this->curToken = this->Lexer::nextToken();
     }

     this->checkCurToken(); // ffgToken cannot be used in this method and its sub methods (ffgToken is not yet set)!
@@ -156,7 +159,7 @@ void PythonLexerBase::setCurrentAndFollowingTokens() {
     if (this->curToken->getType() == PythonLexer::EOF) {
         this->ffgToken = this->cloneToken(this->ffgToken);
     } else {
-        this->ffgToken = PythonLexer::nextToken();
+        this->ffgToken = this->Lexer::nextToken();
     }
 }

diff --git a/python/python3_13/Cpp/PythonLexerBase.h b/python/python3_13/Cpp/PythonLexerBase.h
index 24b63372..63305f43 100644
--- a/python/python3_13/Cpp/PythonLexerBase.h
+++ b/python/python3_13/Cpp/PythonLexerBase.h
@@ -35,23 +35,20 @@ THE SOFTWARE.
 #include <vector>
 #include <regex>
 #include "antlr4-runtime.h"
-#include "PythonLexer.h"
-#include "PythonParser.h"
+//#include "PythonLexer.h"

-using namespace antlr4;
-
-class PythonLexerBase : public PythonLexer {
+class PythonLexerBase : public antlr4::Lexer {
 public:
-    explicit PythonLexerBase(CharStream *input): PythonLexer(input) {
+    PythonLexerBase(antlr4::CharStream *input): antlr4::Lexer(input) {
         this->init();
     }
-    virtual std::unique_ptr<Token> nextToken() override;
+    virtual std::unique_ptr<antlr4::Token> nextToken() override;
     virtual void reset() override;
 private:
-    std::unique_ptr<Token> cloneToken(const std::unique_ptr<Token> &source);
-    std::unique_ptr<Token> cloneToken(const std::unique_ptr<Token> &source, size_t channel);
-    std::unique_ptr<Token> cloneToken(const std::unique_ptr<Token> &source, const std::string &text);
-    std::unique_ptr<Token> cloneToken(const std::unique_ptr<Token> &source, size_t channel, const std::string &text, size_t type);
+    std::unique_ptr<antlr4::Token> cloneToken(const std::unique_ptr<antlr4::Token> &source);
+    std::unique_ptr<antlr4::Token> cloneToken(const std::unique_ptr<antlr4::Token> &source, size_t channel);
+    std::unique_ptr<antlr4::Token> cloneToken(const std::unique_ptr<antlr4::Token> &source, const std::string &text);
+    std::unique_ptr<antlr4::Token> cloneToken(const std::unique_ptr<antlr4::Token> &source, size_t channel, const std::string &text, size_t type);
     void init();
     void checkNextToken();
     void setCurrentAndFollowingTokens();
@@ -81,10 +78,10 @@ private:
     bool isDictionaryComprehensionOrSetComprehension(const std::string &code);
     void insertTrailingTokens();
     void handleEOFtoken();
-    void hideAndAddPendingToken(const std::unique_ptr<Token> &token);
-    void createAndAddPendingToken(size_t type, size_t channel, const std::string &text, const std::unique_ptr<Token> &sampleToken);
-    void createAndAddPendingToken(size_t type, size_t channel, const std::unique_ptr<Token> &sampleToken);
-    void addPendingToken(const std::unique_ptr<Token> &token);
+    void hideAndAddPendingToken(const std::unique_ptr<antlr4::Token> &token);
+    void createAndAddPendingToken(size_t type, size_t channel, const std::string &text, const std::unique_ptr<antlr4::Token> &sampleToken);
+    void createAndAddPendingToken(size_t type, size_t channel, const std::unique_ptr<antlr4::Token> &sampleToken);
+    void addPendingToken(const std::unique_ptr<antlr4::Token> &token);
     size_t getIndentationLength(const std::string &identText);
     void reportLexerError(const std::string &errMsg);
     void reportError(const std::string &errMsg);
@@ -92,7 +89,7 @@ private:
     // A stack that keeps track of the indentation lengths
     std::stack<size_t> indentLengthStack;
     // A list where tokens are waiting to be loaded into the token stream
-    std::vector<std::unique_ptr<Token>> pendingTokens;
+    std::vector<std::unique_ptr<antlr4::Token>> pendingTokens;
     // last pending token types
     size_t previousPendingTokenType;
     size_t lastPendingTokenTypeFromDefaultChannel;
@@ -113,9 +110,9 @@ private:
     bool wasTabIndentation;
     bool wasIndentationMixedWithSpacesAndTabs;

-    std::unique_ptr<Token> curToken; // current (under processing) token
-    std::unique_ptr<Token> ffgToken; // following (look ahead) token
+    std::unique_ptr<antlr4::Token> curToken; // current (under processing) token
+    std::unique_ptr<antlr4::Token> ffgToken; // following (look ahead) token

     const ssize_t INVALID_LENGTH = -1;
     const std::string ERR_TXT = " ERROR: ";
-};
\ No newline at end of file
+};
diff --git a/python/python3_13/PythonLexer.g4 b/python/python3_13/PythonLexer.g4
index 98b99d4a..f2c036b9 100644
--- a/python/python3_13/PythonLexer.g4
+++ b/python/python3_13/PythonLexer.g4
@@ -32,6 +32,8 @@ lexer grammar PythonLexer;

 options { superClass=PythonLexerBase; }

+// Insert here @header for lexer.
+
 tokens {
     ENCODING // https://docs.python.org/3.13/reference/lexical_analysis.html#encoding-declarations
   , INDENT, DEDENT // https://docs.python.org/3.13/reference/lexical_analysis.html#indentation
diff --git a/python/python3_13/desc.xml b/python/python3_13/desc.xml
index 8aa6fdea..3ef04309 100644
--- a/python/python3_13/desc.xml
+++ b/python/python3_13/desc.xml
@@ -3,7 +3,7 @@
    <antlr-version>^4.13.2</antlr-version>
    <targets>CSharp;Java;Python3;JavaScript;TypeScript</targets>
    <test>
-      <targets>CSharp;Java;Python3;JavaScript;TypeScript</targets>
+      <targets>Cpp;CSharp;Java;Python3;JavaScript;TypeScript</targets>
       <entry-point>file_input</entry-point>
       <inputs>examples</inputs>
    </test>

You will need to place this file, transformGrammar.py, in the Cpp/ directory along with PythonLexerBase.h and PythonLexerBase.cpp.

$ cat Cpp/transformGrammar.py
import sys, os, shutil
from glob import glob

def main(argv):
    # Rewrite every grammar in the current directory for the Cpp target.
    for file in glob("./*.g4"):
        fix(file)

def fix(file_path):
    print("Altering " + file_path)
    if not os.path.exists(file_path):
        print(f"Could not find file: {file_path}")
        sys.exit(1)
    # Keep the original grammar as a .bak backup and rewrite it in place.
    shutil.move(file_path, file_path + ".bak")
    with open(file_path + ".bak", 'r') as input_file, \
         open(file_path, 'w') as output_file:
        for x in input_file:
            # Swap the target-agnostic placeholder for the Cpp @header block.
            if '// Insert here @header for lexer.' in x:
                x = x.replace('// Insert here @header for lexer.', '@header {#include "PythonLexerBase.h"}')
            # Translate embedded actions to C++ member-access syntax.
            if 'this.' in x:
                x = x.replace('this.', 'this->')
            output_file.write(x)
    print("Writing ...")

if __name__ == '__main__':
    main(sys.argv)
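
For a manual run outside the test harness, the workflow would look roughly like this. This is a sketch: the antlr4 launcher name and the step of copying the grammars next to the script are assumptions, since the harness normally stages everything into one build directory.

$ cd python/python3_13/Cpp
$ cp ../PythonLexer.g4 ../PythonParser.g4 .
$ python transformGrammar.py
$ antlr4 -Dlanguage=Cpp PythonLexer.g4 PythonParser.g4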

@kormanowsky (Author) commented:

@kaby76 ok, I see. This is because I used the TypeScript version as a source, and it contains a class inherited from PythonLexer, not from Lexer.

I will try the changes you suggested in a couple of days.

@kaby76 (Contributor) commented Mar 13, 2025

That doesn't sound right, because PythonLexerBase is derived from Lexer, not PythonLexer, in the TypeScript port:

export default abstract class PythonLexerBase extends Lexer {
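
For comparison, the C++ declaration from the diff above mirrors this, with the base class coming from the antlr4 C++ runtime:

class PythonLexerBase : public antlr4::Lexer {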

@teverett (Member) commented:

@kaby76 should I hold off on merging this one?

@kaby76 (Contributor) commented Mar 20, 2025

@teverett Hold, please. The PR is not being tested, and it contains many compilation issues.

@kormanowsky (Author) commented:

@kaby76 I applied the modifications you mentioned and was able to compile the code on my machine. I also fixed some typos that were causing segfaults. Could you approve the workflow to test my changes, please?

P.S. Sorry for the late reply, I've had too much work.

@kaby76 (Contributor) commented Mar 21, 2025

@teverett Could we get a build of this PR? Thanks!

@kormanowsky changed the title from "feat: added C++ port of TypeScript PythonParserBase for Python 3.13" to "feat: added C++ port of TypeScript PythonLexerBase for Python 3.13" on Mar 24, 2025