About CodeQL
CodeQL is the analysis engine used by developers to automate security checks, and by
security researchers to perform variant analysis.
In CodeQL, code is treated like data. Security vulnerabilities, bugs,
and other errors are modeled as queries that can be executed against databases
extracted from code. You can run the standard CodeQL queries, written by GitHub
researchers and community contributors, or write your own to use in custom
analyses. Queries that find potential bugs highlight the result directly in the
source file.
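To make "code is treated like data" more concrete, here is a rough analogy written in Python rather than CodeQL. It is only a sketch: it uses Python's standard `ast` module to turn a snippet of source into rows of (function name, line number), then "queries" those rows for calls to `eval`. The names `SOURCE`, `extract_calls`, and `find_calls_to` are made up for this illustration, and real CodeQL extractors and databases capture far more than this.

```python
import ast

# Hypothetical code under analysis, held as a plain string.
SOURCE = """\
def load(text):
    return eval(text)

def safe(text):
    return len(text)
"""


def extract_calls(source):
    """Turn source code into rows: one (function_name, line) per call to a plain name."""
    rows = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            rows.append((node.func.id, node.lineno))
    return rows


def find_calls_to(rows, name):
    """A 'query' over the extracted rows: line numbers of calls to the given function."""
    return sorted(line for func, line in rows if func == name)


calls = extract_calls(SOURCE)
eval_lines = find_calls_to(calls, "eval")
print(eval_lines)
```

Once the code has been reduced to rows, finding a risky pattern becomes a filter over data, which is the same shape of problem that a CodeQL query solves against a CodeQL database.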
About variant analysis
Variant analysis is the process of using a known security vulnerability as a
seed to find similar problems in your code. It’s a technique that security
engineers use to identify potential vulnerabilities, and ensure these threats
are properly fixed across multiple codebases.
Querying code using CodeQL is the most efficient way to perform variant
analysis. You can use the standard CodeQL queries to identify seed
vulnerabilities, or find new vulnerabilities by writing your own custom CodeQL
queries. Then, develop or iterate over the query to automatically find logical
variants of the same bug that could be missed using traditional manual
techniques.
CodeQL analysis
CodeQL analysis consists of three steps:
- Preparing the code, by creating a CodeQL database
- Running CodeQL queries against the database
- Interpreting the query results
Database creation
To create a database, CodeQL first extracts a single relational representation
of each source file in the codebase.
For compiled languages, extraction works by monitoring the normal build process.
Each time a compiler is invoked to process a source file, a copy of that file is
made, and all relevant information about the source code is collected. This includes
syntactic data about the abstract syntax tree and semantic data about name
binding and type information.
For interpreted languages, the extractor runs directly on the source code,
resolving dependencies to give an accurate representation of the codebase.
There is one extractor for each language supported by CodeQL to ensure that the
extraction process is as accurate as possible. For multi-language codebases,
databases are generated one language at a time.
After extraction, all the data required for analysis (relational data, copied
source files, and a language-specific database schema, which specifies the
mutual relations in the data) is imported into a single directory, known as a
CodeQL database.
Query execution
After you’ve created a CodeQL database, one or more queries are executed
against it. CodeQL queries are written in a specially designed object-oriented
query language called QL. You can run the queries checked out from the CodeQL
repository (or custom queries that you’ve written yourself) using the CodeQL
for VS Code extension or the CodeQL CLI. For more information about queries,
see “About CodeQL queries.”
Query results
The final step converts results produced during query execution into a form that
is more meaningful in the context of the source code. That is, the results are
interpreted in a way that highlights the potential issue that the queries are
designed to find.
Queries contain metadata properties that indicate how the results should be
interpreted. For instance, some queries display a simple message at a single
location in the code. Others display a series of locations that represent steps
along a data-flow or control-flow path, along with a message explaining the
significance of the result. Queries that don’t have metadata are not
interpreted: their results are output as a table and not displayed in the source
code.
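The following Python sketch illustrates the idea of metadata-driven interpretation. It is not CodeQL's implementation: the `interpret` function and the shape of the rows are hypothetical, and the two kind values used here (`problem` and `path-problem`) are assumed to correspond to the single-location and path-style results described above.

```python
def interpret(kind, rows):
    """Render raw result rows according to a query's 'kind' metadata (hypothetical helper)."""
    if kind == "problem":
        # Each row is (location, message): one message at a single location.
        return [f"{location}: {message}" for location, message in rows]
    if kind == "path-problem":
        # Each row is (list_of_locations, message): steps along a path.
        return [f"{' -> '.join(path)}: {message}" for path, message in rows]
    # No metadata: the rows stay an uninterpreted table.
    return [" | ".join(str(cell) for cell in row) for row in rows]


alerts = interpret("problem", [("App.java:12", "Empty block.")])
paths = interpret(
    "path-problem",
    [(["App.java:5", "App.java:9"], "Tainted value reaches sink.")],
)
table = interpret(None, [("App.java", 12, "x")])
```

The same raw rows can therefore be presented as an alert, as a path, or as a plain table, depending only on what the query declares about itself.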
Following interpretation, results are output for code review and triaging. In
CodeQL for Visual Studio Code, interpreted query results are automatically
displayed in the source code. Results generated by the CodeQL CLI can be output
into a number of different formats for use with different tools.
About CodeQL databases
CodeQL databases contain queryable data extracted from a codebase, for a single
language at a particular point in time. The database contains a full,
hierarchical representation of the code, including a representation of the
abstract syntax tree, the data flow graph, and the control flow graph.
Each language has its own unique database schema that defines the relations used
to create a database. The schema provides an interface between the initial
lexical analysis during the extraction process, and the actual complex analysis
using CodeQL. The schema specifies, for instance, that there is a table for
every language construct.
For each language, the CodeQL libraries define classes to provide a layer of
abstraction over the database tables. This provides an object-oriented view of
the data which makes it easier to write queries.
For example, in a CodeQL database for a Java program, two key tables are:
- The expressions table containing a row for every single expression in the
source code that was analyzed during the build process.
- The statements table containing a row for every single statement in the
source code that was analyzed during the build process.
The CodeQL library defines classes to provide a layer of abstraction over each
of these tables (and the related auxiliary tables): Expr and Stmt.
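As a loose analogy for this class layer, the Python sketch below wraps two tiny in-memory SQLite tables in `Expr` and `Stmt` classes. Everything about the tables is an assumption made for illustration: the column names (`kind`, `line`, `stmt_id`), the sample rows, and the methods are hypothetical, and the real CodeQL schema and QL classes are considerably more detailed.

```python
import sqlite3

# Hypothetical, heavily simplified tables; a real CodeQL schema differs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE statements (id INTEGER PRIMARY KEY, kind TEXT, line INTEGER);
    CREATE TABLE expressions (id INTEGER PRIMARY KEY, kind TEXT, stmt_id INTEGER);
    INSERT INTO statements VALUES (1, 'return', 3), (2, 'if', 5);
    INSERT INTO expressions VALUES
        (1, 'methodcall', 1), (2, 'literal', 1), (3, 'compare', 2);
""")


class Stmt:
    """Object view of one row in the statements table."""

    def __init__(self, row_id):
        self.row_id = row_id

    def line(self):
        query = "SELECT line FROM statements WHERE id = ?"
        return conn.execute(query, (self.row_id,)).fetchone()[0]

    def expressions(self):
        # Follow the stmt_id column to find the expressions inside this statement.
        query = "SELECT id FROM expressions WHERE stmt_id = ? ORDER BY id"
        return [Expr(row[0]) for row in conn.execute(query, (self.row_id,))]


class Expr:
    """Object view of one row in the expressions table."""

    def __init__(self, row_id):
        self.row_id = row_id

    def kind(self):
        query = "SELECT kind FROM expressions WHERE id = ?"
        return conn.execute(query, (self.row_id,)).fetchone()[0]

    def enclosing_stmt(self):
        query = "SELECT stmt_id FROM expressions WHERE id = ?"
        return Stmt(conn.execute(query, (self.row_id,)).fetchone()[0])


kinds_in_first_stmt = [expr.kind() for expr in Stmt(1).expressions()]
line_of_compare = Expr(3).enclosing_stmt().line()
```

Callers of `Stmt` and `Expr` never write a join by hand; they navigate from a statement to its expressions and back through methods, which is the kind of convenience an object-oriented layer over the tables is meant to provide.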