Improve parser speed #33
I am currently using pymlir to parse an MLIR file that uses quite a few custom dialects. A file of ~2200 lines takes about 5 minutes to parse and produce an MLIRFile object when I use the parse_string command. Are there any suggestions or strategies to improve the parser speed?
I'm not sure where the problem stems from. The best approach would be to profile the Python code. A timeline profiler like py-spy could be very useful. Please let us know what you find out!
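As an illustration, a profiling run could look like the sketch below (the script and file names are placeholders; py-spy samples the process from the outside, so no code changes are needed):

    # profile_parse.py -- minimal harness (sketch; 'input.mlir' is a placeholder)
    # Record a flame graph of the run with:
    #   py-spy record -o parse_profile.svg -- python profile_parse.py
    import mlir

    module = mlir.parse_path('input.mlir')  # the slow call under investigation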
Sure, I'll work on this and keep you posted, @tbennun. I might need some time to get back to you, so it would be great if you could kindly leave the issue open.
@tbennun, circling back to this topic. The following is the specific way I use pymlir: I define custom dialects using the DialectOp and DialectType classes to describe the syntax of the dialects I would like to parse. The MLIR files I parse don't use any of the standard dialects that pymlir supports. I did not profile the code with py-spy; instead, I tried some high-level optimizations, which improved parsing speed by ~25%. My goal was to understand how parsing speed relates to the number of rules.
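For context, such a dialect definition looks roughly like this (a sketch based on the toy-dialect example in pymlir's README; the op name, syntax string, and file name are illustrative):

    import mlir
    from mlir.dialect import Dialect, DialectOp

    class DensifyOp(DialectOp):
        # pymlir's {name.rule} placeholder notation describes the op's syntax
        _syntax_ = 'toy.densify {arg.ssa_use} : {type.tensor_type}'

    toy = Dialect('toy', ops=[DensifyOp], types=[])
    source = open('custom.mlir').read()
    module = mlir.parse_string(source, dialects=[toy])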
There are a couple of things I am currently working on. As a next step, I would like your thoughts on the topics below.
Thank you very much for looking into this, @ShivaShankarMovidius! I think you and the Lark team are both right that the file is not very efficient right now. I focused on matching the documentation rather than on parsing performance, and didn't notice a performance difference on the small files I parse. Additionally, I don't remember any "optimization / best practices" guidelines in the Lark project, which would have been great. The rules can be simplified quite a lot (definitely in terms of tree shaping, which should reduce overheads). I also tried LALR first and switched to Earley after encountering the same issue (it might have to do with the recursive rule mentioned in the issue).

As for your questions: I have a suggestion that would address questions (1) and (2). We can split mlir.lark into an mlir-syntax.lark for the basic syntax, plus a lark file for each dialect. One could then choose a list of dialects (or pass in an empty list for a "lean" parser), where the default would be all the core dialects. This would add an argument to the Parser constructor.

For (3), I am not exactly sure what you mean. Do you have an example of such a use case? One way to cleanly pass a custom transformer would be to have a transformer argument in the constructor.

I would still look at the code through py-spy if I were you, just to make sure which part of the code is taking all the time (i.e., is it regex, tree depth, etc.).
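For reference, the parsing algorithm is chosen when the Lark object is constructed, so switching is a one-argument change, provided the grammar is free of LALR conflicts (sketch; the grammar path follows the repository layout):

    from lark import Lark

    grammar = open('mlir/lark/mlir.lark').read()

    # Earley tolerates ambiguous grammars but is considerably slower
    earley_parser = Lark(grammar, parser='earley')

    # LALR(1) is much faster, but requires an unambiguous, conflict-free grammar
    lalr_parser = Lark(grammar, parser='lalr')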
Hi @tbennun, many thanks for getting back to me with your thoughts. I also did not find any documentation on best practices for defining lark grammars; that is probably worth giving the Lark team as feedback. My proposal would be a two-step process.
To start with 1), I would like to propose adding the following argument to the pymlir Parser class: lark - a lark file that end users want to provide; if this is not provided, we default to mlir.lark. I feel this is a sequential improvement that reuses the current codebase rather than redefining the lark definitions for all the dialects. The high-level pseudocode that I envision for the Parser class is as follows:

    class Parser(object):
        def __init__(self,
                     dialects: Optional[List[Dialect]] = None,
                     std_dialects: List[str] = STANDARD_DIALECTS,
                     lark: Optional[str] = None,
                     transformer: TreetoMLIR = None):
            self.dialects = dialects or []
            self.lark = lark
            self.std_dialects = std_dialects

            # Lazy-load (if necessary) the Lark files
            if self.lark is None:
                _lazy_load()           # load the bundled mlir.lark / mlir-syntax.lark
            else:
                _lazy_load(self.lark)  # load a custom mlir.lark defined by end users

            for dialect in [*self.std_dialects, *self.dialects]:
                ...  # add the rules to pymlir_dialect_ops / pymlir_dialect_types

            # Create the parser and tree transformer
            self.parser = Lark(parser_src, parser='earley')
            self.transformer = transformer
            # Need to make sure transformer is a derivative of the TreetoMLIR class

In this way, all requirements from 1-3 can easily be satisfied. Even if you decide to split mlir.lark into multiple files, that can go in as an incremental change. Can I get back to you towards the end of the week on code profiling?
Hi @tbennun. The following is the output from cProfile when I ran the code on pymlir. It looks like a big portion of the run time is spent in the parsing portion of the code. I am not sure I can share the complete cProfile output, but a glimpse of where the most time is spent across a range of functions is shown below.

[screenshot of cProfile output omitted]
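For reference, output like that can be produced with the standard library alone (a sketch; 'input.mlir' is a placeholder):

    import cProfile
    import pstats

    import mlir

    # Profile a single parse and dump the stats to a file
    cProfile.run("mlir.parse_path('input.mlir')", 'parse.prof')

    # Show the 20 functions with the highest cumulative time
    pstats.Stats('parse.prof').sort_stats('cumulative').print_stats(20)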
@ShivaShankarMovidius it looks like using LALR would yield the largest performance gain, then. For your API suggestion, I would propose the following:

    class Parser(object):
        def __init__(self, dialects: Optional[List[Dialect]] = None,
                     std_dialects: List[str] = STANDARD_DIALECTS,
                     transformer: Union[Type[Transformer], Transformer] = TreetoMLIR):
            # ...
            if isinstance(transformer, type):  # a Transformer subclass: instantiate it
                self.transformer = transformer()
            else:                              # an already-constructed instance
                self.transformer = transformer

I also don't see how passing in a custom lark file makes sense; you could potentially pass in a lark file that is not MLIR-related.
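As a usage sketch of that transformer argument (Parser as proposed above, which is still a proposal rather than the current API; MyTransformer is a made-up subclass):

    class MyTransformer(TreetoMLIR):
        pass  # override rule callbacks as needed

    p1 = Parser(transformer=MyTransformer)    # pass the class; Parser instantiates it
    p2 = Parser(transformer=MyTransformer())  # or pass a ready-made instance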
Hi @tbennun. Yes, I agree with you: LALR looks like the best way forward for improving the performance of the pymlir parser. I have currently developed two modes of operation. The first flow parses a complete MLIR file, while the second performs preprocessing to extract information at a lower granularity from an MLIR file (say, per function/operation) and iteratively uses parse_string() to parse the MLIR grammar. In this way, I am able to linearize the complexity by parsing an MLIR file hierarchically. I also removed a set of rules from mlir.lark associated with mlirfile/function etc. to make the grammar lean enough to achieve the latter. So, long story short, if a user like me wants to remove specific rules, passing their own modified mlir.lark makes that possible under my proposal. I also foresee another use case where users could append custom rules to mlir.lark rather than always going through the DialectOp/DialectType flow. Writing custom rules directly in lark allows defining specific priorities for rules, which I see as an added advantage.
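A rough sketch of that second, hierarchical flow (the function-splitting regex below is a naive placeholder for real preprocessing, and it assumes the grammar's start rule has been retargeted to a single function, as described above):

    import re
    import mlir

    def parse_per_function(source: str):
        # Naive preprocessing: carve out top-level 'func' definitions and
        # parse each one independently, linearizing the overall complexity.
        pieces = re.findall(r'func\s+@[\s\S]*?\n\}', source)
        return [mlir.parse_string(piece) for piece in pieces]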
I see; it makes sense that you'd want a different file. Since your behavior is general-purpose, why not have a class hierarchy: a BaseParser that takes in a lark file, a Parser that parses everything, and a ModuleHierarchyParser/FunctionParser that produces the module hierarchy?
I don't see a need for two separate parser classes. The lark file needs only one modification: updating the root node of the tree by changing the start rule in the lark grammar. https://github.com/spcl/pymlir/blob/master/mlir/lark/mlir.lark#L327 That's the reason I proposed a single parser class that consumes a lark file as input. But I understand your concern about the possibility of someone passing an invalid lark grammar file; I don't know a way to distinguish a valid from an invalid lark grammar in its current form.
Then you can add an enum:

    from enum import Enum, auto  # or just autoenum

    class ParsingKind(Enum):
        ALL = auto()
        MODULE_AND_FUNCTION_NAMES = auto()

and use that to pick a starting rule.
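For instance (a sketch: parser_src is the grammar source from the earlier snippet, and the rule names are illustrative, not pymlir's actual ones):

    from lark import Lark

    _START_RULES = {
        ParsingKind.ALL: 'mlir_file',
        ParsingKind.MODULE_AND_FUNCTION_NAMES: 'function',
    }

    def make_parser(kind: ParsingKind) -> Lark:
        # One grammar, different root rule depending on the parsing kind
        return Lark(parser_src, parser='earley', start=_START_RULES[kind])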
The root node is defined in the lark grammar, not in the Python file. If you feel strongly that the Parser should not consume a lark file as input, that's OK with me. Would you like me to work on this and push a PR with the parser modifications?
That would be great, thank you.
Sure, I'll get this sorted. I will do the following. I won't be able to spend time optimizing the mlir grammar for LALR; if you can help me with that, it would be really great.