Rijnard van Tonder on February 12, 2020
We're introducing a new way to search code at Sourcegraph with structural code search. Structural code search lets you match nested expressions and whole code blocks that can be difficult or awkward to match using regular expressions.
Structural code search is the idea that you can search for syntactic
structures in code that correspond more closely to a program's underlying
concrete syntax tree (or parse tree). For example, for
loops in
Rust look something like
this:
for var in expression {
code
}
The code
block can contain nested for
loops, if
statements, and so on. To
match all of the code
block contents for these expressions, and search for
patterns inside them, a search engine must understand that code
exists
inside the balanced braces { ... }
. Regular expressions can go a long way to
match such syntactic structures but they are not
ideal.
In practice we use parsing to interpret and convert syntax for nested
expressions like {...}
into trees, which encode richer structural properties
than the textual representation.
Figure 1: Nested
expressions can expand inside code blocks. Parsing converts nested expressions
into tree data structures.
Most code search today is not based on true parsing or tree data structures. Instead, we use literal strings or regular expressions which is good enough for many kinds of searches. But these methods make it tricky to match precisely on blocks or expressions that can expand inside statements like the loop in Figure 1. We could more easily and precisely search for richer syntactic patterns if today's search tools also treated code as syntax trees, and that's the key idea behind structural search.
Many neat developer and compiler tools already exist for querying or matching tree structures (see additional resources at the end of this post!). But none are available at your fingertips, just seconds away from running on some of today's largest and most popular codebases. That is why we are happy to announce that Sourcegraph now supports a first release of structural search available at scale, for nearly every language, directly from your browser.
Let's look for things in the Linux kernel. After all, why not show structural search working on one of the largest and most popular projects in open source software?
One important function is copy_from_user
, which copies content from userspace
memory into the kernelspace memory. This function has a history of careful
auditing
because incorrect uses can (and have) lead to vulnerabilities. We can find all
copy_from_user
calls with a query like copy_from_user(:[args])
. Try it
live (the toggle means structural search is active):
The :[args]
syntax is a structural hole that matches all text between
balanced parentheses. The args
part is just a descriptive identifier. We
support comby syntax, which is currently the
underlying engine behind structural search. You can find out more about the
match syntax in our usage
docs, but for now it's
enough to just follow along this blog post!
Of course, we could have run a simpler regex search for the prefix with something like copy_from_user( and get results more quickly, and sometimes that's the right thing to do.
But in other cases we can do more interesting things with structural search that become awkward otherwise. For example, one result for the above query is:
copy_from_user(&txc.tick, &txc_p->tick, sizeof(struct timex32) -
offsetof(struct timex32, tick))
where the argument spans multiple lines. By default, :[hole]
syntax matches
across multiple lines just like code structures can. An interesting thing about
the call above is that it calculates the size of memory using sizeof(...) -
...
. Calculating and checking the size of memory to copy can be more
error-prone than simple or static values. So, one thing we could check is
whether there are other calls that calculate the size of memory in a similar way to the
above, using subtraction and sizeof
:
This query breaks up the original :[args]
hole into holes for the destination
buffer dst
, source buffer src
, and the calculation for the memory size. The
:[_]
syntax is just a hole that we don't particularly care to name. This
query finds just a handful of results, so this is a rather uncommon pattern in
the code base!
Structural search can also identify patterns to clean up. For example, one cleanup patch in the kernel looks like this:
Here's a query to easily find more of these patterns:
This time, we're using the same identifier :[x]
twice, to make sure that the
argument is the same for both list_del
and list_add
calls. We could choose
any identifier, except for :[_]
, which is just a placeholder. The whitespace
between the calls is interpreted to possibly include newlines, so there's no
issue matching these calls across newlines.
Do note that structural search is purely syntactic, so there are some matches that cannot cleaned up:
if (!list_empty(&page->lru))
list_del(&page->lru);
list_add(&page->lru, &pool->lru);
In this case, the problem is that the list_del
call is in scope of the if
block. However, the majority of matches carry our intended meaning. In the
future, we are introducing rules to refine
queries further, giving you greater control for avoiding unintended matches.
Structural search works on practically all languages, and understands the basic syntactic structures for them (like strings, comments, and code blocks). Here's a short list that gives just a taste of some patterns you can try out:
Java. Find try-catch-finally statements where the catch statement has no body (the catch
clause could be omitted)
Python. Find old-style string formatted print
statements
Rust. Find chained filter(...).next()
that could simplified to .find(...)
(based on lint)
π .filter(:[_]).next() and .filter(:[_]) .next() (the latter matches across newlines)
ReactJS. Look for opportunities to optimize away arrow functions (see the React FAQ)
Go. Find .type(...)
switches that contain a nil:
case
Clojure. Find cond
expressions with an :else nil
branch at the end
Dart. Find Image.asset
constructors in the Flutter API where width is initialized to 100
*Here are a couple of things to keep in mind
*A quick note on regular expressions: How is structural search different?
Structural search is not a replacement for regexp search. It's another tool in your toolkit that works well for matching blocks of code or expressions, and simplifies catching buggy syntactic patterns. If you only want to find a simple string or pattern, consider using Sourcegraph's literal or regexp search, because these queries are typically much faster! For a more detailed breakdown, see the short comparison at the end of this post.
That's it for now. You'll find some additional resources and discussion below if you're interested in reading more. Happy searching!
There is an immense amount of existing parsing and query tools for syntax trees. Most compilers today offer a library or visitor framework, and linters or static analyzers may build on them to implement checks. Here is a non-exhaustive short list of tools related to structural search and matching that you may be familiar with or find interesting:
SSR
[1]Sgrep
, or Syntactical grep (multiple languages) [1], [2]tree-sitter
, parsing and query framework (multiple languages) [1]gogrep
for declaratively matching Go syntax trees [1]Spoon
for declaratively matching Java code [1], [2]Spoofax
, AST querying using the Spoofax Language Workbench (multiple languages) [1]CodeQL
, querying tree and graph properties for a number of poular languages [1]At Sourcegraph we're continually looking to improve developer tools, and to integrate richer search functionality. If you find these tools or others valuable, share your thoughts with us at feedback@sourcegraph.com.
Here are some key differences and comparisons to regexp-based text search:
lang:
filter. If omitted, we perform a best-effort to infer the language based on matching file extensions, or fall back to a generic structural matcher.:[hole]
syntax matches across multiple lines by default.{}
, []
, ()
are expected to always be balanced
(depending on language). For example, a dangling parenthesis in Java is
considered a syntax error and can't be matched. A dangling delimiter in the
pattern implies a syntax error (prefer regexp search if you want to match
dangling delimiters).foo(:[x], :[x])
. This is similar to, e.g., backreferences in regular
expressions.\d+
yet.For a complete overview, refer to comby.dev.