I have been trying out the CodeQL CLI for quite some time now. While running queries in …. Also, similar kinds of queries show different behavior (w.r.t. the time they take to analyze), for example …. I also observed this for most of the queries where some custom value can be given, for example …. So is there a way of understanding which queries will usually take a long time?

Additional info: …
Replies: 2 comments 2 replies
Generally speaking, resolving call targets in Python isn't simply a matter of finding calls with a particular name, since Python doesn't provide static types for objects, and module members can potentially be redefined on the fly. Instead, we need to track the set of possible definitions for a function or method name at each point in the program, which is a rather expensive computation. The result of that analysis is then cached, so it only needs to be done once for each database. Also, note that in Python 3 the builtin print is a function instead of a keyword, so it has the same potential to be shadowed or replaced as any other function name (and /python/ql/examples/snippets/print.ql does …
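As a hypothetical illustration of that last point (this snippet is not from the thread and is not the query in print.ql), here is a small Python example where a call that is syntactically a call to print never reaches the builtin, which is why a purely name-based search and actual call-target resolution answer different questions:

```python
# Hypothetical sketch: in Python 3, print is an ordinary builtin function,
# so a module can shadow or replace it like any other name. A query that
# only matches calls *named* print and a query that resolves which function
# the call actually targets will therefore disagree on code like this.
import builtins


def print(*args, **kwargs):
    # Shadows the builtin inside this module and forwards to it.
    builtins.print("[log]", *args, **kwargs)


print("hello")  # syntactically a call named "print", but it targets the
                # module-level wrapper above, not builtins.print
```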
Unless those files are quite small, this would be an extremely large codebase, especially for Python.
Are you running on a specific project that's that large, or is it files from multiple projects that you're combining into a single directory? If it's the latter, I expect there's a better way to accomplish your underlying goals. Maybe try analyzing one project at a time and combining the results at the end?
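If it helps, here is a rough, hypothetical sketch (not an official workflow) of driving the CLI one project at a time from Python. It assumes codeql is on your PATH, that each subdirectory of projects/ is an independent Python project, and that the standard python-security-and-quality.qls suite is resolvable in your setup; adjust paths and the query suite as needed:

```python
# Hypothetical sketch: build and analyze one CodeQL database per project
# instead of one huge database over a combined directory, keeping a SARIF
# file per project that can be reviewed or merged afterwards.
import pathlib
import subprocess

pathlib.Path("dbs").mkdir(exist_ok=True)
pathlib.Path("results").mkdir(exist_ok=True)

for project in sorted(pathlib.Path("projects").iterdir()):
    if not project.is_dir():
        continue
    db = f"dbs/{project.name}"
    # Create a Python database for just this project.
    subprocess.run(
        ["codeql", "database", "create", db, "--overwrite",
         "--language=python", f"--source-root={project}"],
        check=True,
    )
    # Run a query suite and write one SARIF file per project.
    subprocess.run(
        ["codeql", "database", "analyze", db,
         "python-security-and-quality.qls",
         "--format=sarif-latest", f"--output=results/{project.name}.sarif"],
        check=True,
    )
```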
While these two examples are similar, the similarity is only skin deep. The main difference is that the first query searches only for integer literals (that is, constants appearing directly in source code), while the second searches for all instances where ….

The part of the Python analysis that keeps track of this is called the "points-to" analysis. Basically, we compute (an approximation of) what a given name can point to at a given point in the source code. As you can imagine, this analysis depends on itself, since in order to figure out the meaning of ….

So, as a rule of thumb, anything that involves the points-to analysis is going to be time-consuming. On the other hand, this analysis is cached, so you only have to pay that price once for a given codebase. Most of our existing queries take advantage of this fact (as the points-to analysis allows for greater precision and generality), so you will find that "most" of the queries are slow (but hopefully only for the first query).
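To make the distinction concrete, here is a small hypothetical Python snippet (the names are made up for illustration). A literal-only query would stop at the return line, while finding the call where that value is actually used requires knowing what the name port can point to:

```python
# Hypothetical sketch: the constant 443 appears exactly once as an integer
# literal, but it reaches the connect() call only through a function return
# and an assignment. A query over literals sees only default_port(); a query
# over all uses of the value needs the points-to analysis to follow it.
def default_port():
    return 443  # the only integer literal in this snippet


def connect(port):
    print("connecting on port", port)


port = default_port()  # points-to: `port` can point to the value 443
connect(port)          # no literal here, yet 443 flows into this call
```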