Securing CodeQL queries with Semgrep

We're excited to announce that Semgrep now offers GA support for CodeQL's query language. We shipped this as a fun April Fools' joke, but it also serves as a great demo of Semgrep's extensibility and scalability when it comes to adding support for new languages.

Secure guardrails for your secure guardrails

It’s no secret that at Semgrep, we’re big fans of secure guardrails.

Secure guardrails are techniques and practices that nudge developers towards secure-by-default coding and design principles. The aim is to have secure code as the default state, so engineering teams spend more time building and less time addressing accrued technical debt.

But ‘who shaves the barber?’ Who will secure your secure guardrails? In our case, we have many Semgrep rules that run on Semgrep rules, specifically targeting the effectiveness of the rules contributed by our community users. But this is only one facet of the problem: we obviously aren’t the only code-scanning tool around!

Semgrep’s mission is to profoundly improve software security, regardless of vendor. So today, we are happy to announce that Semgrep now supports scanning CodeQL's query language.

Using Semgrep to scan CodeQL's query language

CodeQL is an analysis engine originally developed by Semmle Inc. and subsequently acquired by GitHub. Most simply defined, CodeQL allows you to write checks for poor code patterns across the languages it supports. It is not dissimilar to Semgrep: you can use it to target security issues or enforce code consistency. However, it requires you to use its domain-specific query language.

For instance, here is a CodeQL query which asserts that all class names should begin with uppercase letters:

import python

predicate lower_case_class(Class c) {
  exists(string first_char |
    first_char = c.getName().prefix(1) and
    not first_char = first_char.toUpperCase()
  )
}

from Class c
where
  c.inSource() and
  lower_case_class(c) and
  not exists(Class c1 |
    c1 != c and
    c1.getLocation().getFile() = c1.getLocation().getFile() and
    lower_case_class(c1)
  )
select c, "Class names should start in uppercase."

The Semgrep alternative rule would be:

rules:
  - id: classes-should-be-capitalized
    severity: WARNING
    message: |
      The class "$C" should be capitalized for readability!
    languages: [python]
    patterns:
      - pattern: |
          class $C:
            ...
      - metavariable-regex:
          metavariable: $C
          regex: "[a-z].*"

Note that I actually changed the CodeQL rule slightly, to add a bug. Can you see it?

The query asserts that c1.getLocation().getFile() = c1.getLocation().getFile(), which is always true! This is clearly a typo for c1.getLocation().getFile() = c.getLocation().getFile(), which is exactly what it was before I changed it. This is a bad mistake because it affects the correctness of the query: the "same file" condition no longer restricts anything.
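To see how much the typo changes the result, here is a small Python simulation of the query's logic. It is only a sketch: the `Cls` record, the `report` function, and the sample file names are hypothetical stand-ins for CodeQL's `Class` and its database, and the `c.inSource()` check is left out for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Cls:
    """Hypothetical stand-in for a CodeQL `Class`: a name and a source file."""
    name: str
    file: str


def lower_case_class(c: Cls) -> bool:
    # Mirrors the predicate: the first character differs from its uppercase form.
    first_char = c.name[:1]
    return first_char != first_char.upper()


def report(classes: list[Cls], buggy: bool) -> list[Cls]:
    """Emulate the from/where/select clause over an in-memory list of classes."""
    results = []
    for c in classes:
        if not lower_case_class(c):
            continue

        def same_file(c1: Cls) -> bool:
            if buggy:
                return c1.file == c1.file  # the typo: always true
            return c1.file == c.file  # the intended comparison

        # `not exists(Class c1 | ...)`: no *other* lowercase class in the same file.
        if not any(
            c1 != c and same_file(c1) and lower_case_class(c1) for c1 in classes
        ):
            results.append(c)
    return results


classes = [Cls("foo", "a.py"), Cls("Baz", "a.py"), Cls("bar", "b.py")]
print([c.name for c in report(classes, buggy=False)])
print([c.name for c in report(classes, buggy=True)])
```

With the intended comparison, both `foo` and `bar` are reported, since each is the only lowercase class in its file. With the typo, neither is reported, because any other lowercase class anywhere in the codebase now suppresses the result.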

In the CodeQL language server, there is no in-editor warning for this kind of error:

[Image: CodeQL Blog 1]

This means the typo would only get fixed if someone happened to catch it, presumably when the query is run and returns the wrong results. There is no guardrail; you have just driven off the mountain.

Enter Semgrep. We can very easily write a rule that statically catches the mistake in the query:

rules:
  - id: codeql-redundant-equality
    severity: ERROR
    message: |
      You should not compare the same expression to itself!
    languages: [ql]
    pattern: |
      $X = $X

Running the rule now, we see:

[Image: CodeQL Blog 2]

This still requires us to run the rule manually from the command line, though. Fortunately, with the Semgrep VS Code Extension, we can easily add a rule to our configuration and set it up to scan automatically within our IDE:

[Image: CodeQL Blog 3]
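For intuition about why the `$X = $X` pattern fires on the buggy comparison and not on the fixed one, here is a toy matcher in Python. It is a sketch of the general idea of metavariable unification (a repeated metavariable must bind to the same subtree each time), not Semgrep's actual implementation; the tuple encoding of expressions and the `get_file` helper are made up for illustration.

```python
from __future__ import annotations

from typing import Optional, Union

# A toy expression tree: a leaf string, or a tuple of (kind, *children).
Node = Union[str, tuple]


def match(pattern: Node, node: Node, env: dict) -> Optional[dict]:
    """Return the updated metavariable bindings if `pattern` matches `node`, else None."""
    if isinstance(pattern, str) and pattern.startswith("$"):
        if pattern in env:
            # A repeated metavariable must bind to an identical subtree.
            return env if env[pattern] == node else None
        return {**env, pattern: node}
    if isinstance(pattern, tuple) and isinstance(node, tuple):
        if len(pattern) != len(node):
            return None
        for p, n in zip(pattern, node):
            result = match(p, n, env)
            if result is None:
                return None
            env = result
        return env
    # Plain leaves (or mismatched shapes) match only when equal.
    return env if pattern == node else None


def get_file(var: str) -> Node:
    # Hypothetical encoding of `<var>.getLocation().getFile()`.
    return ("call", ("call", var, "getLocation"), "getFile")


redundant_equality = ("eq", "$X", "$X")
buggy = ("eq", get_file("c1"), get_file("c1"))
fixed = ("eq", get_file("c1"), get_file("c"))

print(match(redundant_equality, buggy, {}) is not None)
print(match(redundant_equality, fixed, {}) is not None)
```

The first comparison matches because both sides are the same subtree; the second does not, because `$X` is already bound to the `c1` expression when the matcher reaches the `c` expression.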

N.B. Some might take issue with my usage of the term ‘security guardrail’, given that we are scanning CodeQL queries and we aren’t able to actively nudge developers down more secure paths. For the purposes of this article, the term ‘security guardrail’ refers to a ‘guardrail which advances the interests of security’. More correct CodeQL queries lead directly to more secure code, which is a win all around. Ka-chow. (This is an April Fool's post, remember?)

Supporting a Language

The second part of this story is the journey we undertook to implement CodeQL within Semgrep. If you haven’t been following up until now, this isn’t a joke. We actually did this.

Thanks to tree-sitter (as well as our streamlined process for handling each language), this entire quest took a day and a half.

Semgrep makes use of the community-maintained tree-sitter project, which provides grammars and associated parsers for many languages. From a language's grammar, tree-sitter generates a parser.c file which is able to turn programs in that language into a grammar-defined parse tree.

[Image: CodeQL Blog 4]

We’ve developed ocaml-tree-sitter-core, which lets us consume that parse tree by generating a Parse.ml OCaml file that converts the plaintext parse tree into an OCaml value, in the form of a typed CST (concrete syntax tree).

[Image: CodeQL Blog 5]

Once we obtain an OCaml CST, we just need to convert it into the Generic AST, which is the common representation that Semgrep uses for every single programming language. From there, the Semgrep engine knows what to do and can take care of everything else, with no further language-specific work on our part.

This can be done in a single step, but for maintainability (and readability) purposes, I decided to do it in two. We translated from the CodeQL CST to an intermediate CodeQL AST, before translating from that to the Generic AST.

[Image: CodeQL Blog 6]

Then, we just stitch it all together and everything just works. It’s really that easy.

[Image: CodeQL Blog 7]
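The two-step translation (CST to a language-specific AST, then to the Generic AST) can be sketched roughly as follows. The real implementation is in OCaml; this Python version is only an illustration, and every type and function name in it (`CstNode`, `QlCompare`, `GCall`, `cst_to_ql`, `ql_to_generic`, the `comp_term` and `variable` node kinds) is hypothetical rather than taken from the Semgrep codebase or the tree-sitter grammar.

```python
from __future__ import annotations

from dataclasses import dataclass


# Stage 0: a concrete syntax tree node, roughly as a generated parser might produce it.
@dataclass
class CstNode:
    kind: str
    children: list  # nested CstNode values or raw token strings


# Stage 1: a small language-specific AST for QL expressions.
@dataclass
class QlVar:
    name: str


@dataclass
class QlCompare:
    op: str
    lhs: object
    rhs: object


# Stage 2: a language-agnostic representation shared by all languages.
@dataclass
class GName:
    ident: str


@dataclass
class GCall:
    func: str
    args: list


def cst_to_ql(node: CstNode):
    """First step: drop concrete-syntax detail and build the QL-specific AST."""
    if node.kind == "variable":
        return QlVar(node.children[0])
    if node.kind == "comp_term":
        lhs, op, rhs = node.children
        return QlCompare(op, cst_to_ql(lhs), cst_to_ql(rhs))
    raise ValueError(f"unsupported CST node: {node.kind}")


def ql_to_generic(expr):
    """Second step: map QL constructs onto generic ones the engine understands."""
    if isinstance(expr, QlVar):
        return GName(expr.name)
    if isinstance(expr, QlCompare):
        op = {"=": "Eq", "!=": "NotEq"}[expr.op]
        return GCall(op, [ql_to_generic(expr.lhs), ql_to_generic(expr.rhs)])
    raise TypeError(f"unsupported QL node: {type(expr).__name__}")


# `c1 != c` travelling through both stages.
cst = CstNode("comp_term", [CstNode("variable", ["c1"]), "!=", CstNode("variable", ["c"])])
print(ql_to_generic(cst_to_ql(cst)))
```

Keeping the intermediate AST means each function only has to worry about one concern: the first about the quirks of the grammar, the second about how QL concepts map onto generic ones.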

Limitations and Challenges

The main technical hiccups that we ran into over the course of this project had to do with the fact that CodeQL is not a programming language.

Semgrep is focused on things which look like programming languages, so when it comes to things which look slightly different, the translation can be somewhat laborious. However, thanks to some creative type work using our any type (not to be confused with the unconstrained any type which causes lots of problems for engineers), full support can still be achieved.

Another difficulty came from the overall complexity of the grammar. The QL language specification is not bad as far as language grammars go, as it’s reasonably comprehensive. However, there are quite a few nuanced constructs, such as the distinction between an algebraic datatype and a type union. Also, what the hell is going on with predicates? Predicates compute their own results. Plebs (we all are) expect predicates to be a function-like recipe. Don’t get me started on unions! Why would a union also be declared with the class keyword? At the end of the day, these are just the usual edge cases that steepen the learning curve for anyone diving into CodeQL for the first time (like me!).

Irregularities and special cases in the grammar can also make implementation difficult. For instance, there is the construct exists(<decls> | <expr>), which lets one assert the existence of a semantic object that satisfies the condition. Syntactically, the expression may be omitted, giving exists(<decls>). This in itself is not so bad, but you may also add an additional bar and expression… but only once. This means that the only other valid form is exists(<decls> | <expr1> | <expr2>), which is semantically equivalent to exists(<decls> | <expr1> and <expr2>), and you may include no more expressions than that. This results in annoying corner cases. For the five of you who have had to deal with this, please write to me - we can set up a cathartic Discord.
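One way to tame this corner case is to normalize all three surface forms into a single shape as early as possible. The Python sketch below shows the idea; the `desugar_exists` function and the tuple encoding are hypothetical and are not how Semgrep's translation is actually written.

```python
from __future__ import annotations


def desugar_exists(decls: list[str], formulas: list[object]) -> tuple:
    """Normalize exists(...) with zero, one, or two bar-separated formulas.

    Returns ("exists", decls, body), where body is None when no formula was given.
    """
    if not decls:
        raise ValueError("exists needs at least one declaration")
    if len(formulas) > 2:
        # The grammar allows at most two formulas after the declarations.
        raise ValueError("exists accepts at most two bar-separated formulas")
    if not formulas:
        body = None
    elif len(formulas) == 1:
        body = formulas[0]
    else:
        # exists(d | f1 | f2) is equivalent to exists(d | f1 and f2).
        body = ("and", formulas[0], formulas[1])
    return ("exists", tuple(decls), body)


print(desugar_exists(["Class c1"], []))
print(desugar_exists(["Class c1"], ["f1"]))
print(desugar_exists(["Class c1"], ["f1", "f2"]))
```

After this step, later stages only ever see one form of `exists`, with an optional body, and never need to know that the two-bar spelling existed.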

Reflection

Now, at the end of our journey, we can proudly state that Semgrep supports scanning CodeQL!

I’ll leave you with one last fun fact. The Semgrep Supported Languages documentation outlines a 99% parse rate + support for all of the Semgrep pattern constructs as a requirement for a language to be considered ‘generally available’ (aka ‘GA’). Our parse rate for CodeQL, as it stands, is currently 99.999%:

[Image: CodeQL 7]

While implementing CodeQL, I went to great pains to ensure that we would support all of the Semgrep rule-writing constructs.

So, not only does Semgrep support CodeQL, but it also meets our standards for GA languages. Reach out to us if you’re seeing value from this April experiment.

Happy scanning!
