Skip to content

Fundamentals of Program Analysis

Author: Guangliang Yang
Last Updated: 2026-01-04
Copyright: This document is licensed under the CC BY-NC-SA 4.0 license

Program analysis determines whether a program is correct by analyzing its syntax and behavior. It consists of several key stages:

    Source code
->  Abstract Syntax Tree (AST)
->  Scope Hierarchy
->  Type Analysis and Checking
->  Intermediate Representation (IR)
->  Control Flow Graph (CFG)
->  Static Single Assignment (SSA)
->  Data Flow Graph (DFG)
->  Pointer Analysis
->  Taint Analysis

These stages do not form a linear pipeline. Rather, they constitute a set of semantic modeling techniques. Different program analysis frameworks select, adjust, and combine these stages in different ways.

This chapter focuses on the stages preceding data-flow analysis. The remaining stages will be discussed in detail in subsequent chapters.

1. Tokens

In the first stage of compiler analysis, the input is source code represented as a string. After scanning by a lexical analyzer, the output is a sequence of lexical units, or tokens. That is, the character stream is segmented into individual tokens. A token consists of two parts: the extracted text, called the lexeme, and its category. A token can therefore be represented as (Lexeme, Type).

Consider the following C source code:

[01] int x = 1;
[02]
[03] int f(int y) {
[04]     int z;
[05]     if (y > 0) {
[06]         z = x + y;
[07]     } else {
[08]         z = x - y;
[09]     }
[10]     return z;
[11] }

When this code is scanned by the lexer, semantic information is not considered. For example, the lexer does not distinguish if as a control-flow construct, nor does it track multiple assignments to z. Instead, it focuses only on syntactic categories: if is treated as a keyword, while z, x, and y are all identifiers (IDENT). After ignoring whitespace and line breaks, the resulting token sequence is:

[01] int     , KEYWORD_INT
[02] x       , IDENTIFIER
[03] =       , ASSIGN
[04] 1       , INT_LITERAL
[05] ;       , SEMICOLON

[06] int     , KEYWORD_INT
[07] f       , IDENTIFIER
[08] (       , LPAREN
[09] int     , KEYWORD_INT
[10] y       , IDENTIFIER
[11] )       , RPAREN
[12] {       , LBRACE

[13] int     , KEYWORD_INT
[14] z       , IDENTIFIER
[15] ;       , SEMICOLON

[16] if      , KEYWORD_IF
[17] (       , LPAREN
[18] y       , IDENTIFIER
[19] >       , GREATER_THAN
[20] 0       , INT_LITERAL
[21] )       , RPAREN
[22] {       , LBRACE

[23] z       , IDENTIFIER
[24] =       , ASSIGN
[25] x       , IDENTIFIER
[26] +       , PLUS
[27] y       , IDENTIFIER
[28] ;       , SEMICOLON
[29] }       , RBRACE

[30] else    , KEYWORD_ELSE
[31] {       , LBRACE
[32] z       , IDENTIFIER
[33] =       , ASSIGN
[34] x       , IDENTIFIER
[35] -       , MINUS
[36] y       , IDENTIFIER
[37] ;       , SEMICOLON
[38] }       , RBRACE

[39] return  , KEYWORD_RETURN
[40] z       , IDENTIFIER
[41] ;       , SEMICOLON

[42] }       , RBRACE

2. Abstract Syntax Tree (AST)

An abstract syntax tree (AST) is an abstract representation of source code and forms the basis of semantic analysis. It is a tree structure composed of nodes that represent elements of the source code, such as variables, functions, operators, and expressions.

Using the token sequence as input, the parser applies grammar rules to construct a parse tree. These grammar rules are manually specified and are typically expressed using BNF (Backus–Naur Form). The following grammar rules describe an if branch structure. Elements enclosed in double quotes represent concrete keywords or syntax symbols. In this grammar, Cond denotes a condition expression, Then and Else denote branches, Block denotes a block, and Stmt denotes a statement.

IfStmt ::= "if" "(" Cond ")" Then "else" Else
Cond ::= BoolExpr
Then ::= Block
Else ::= Block
Block ::= "{" Stmt* "}"

By applying these rules in a top-down manner, the token stream is converted into a parse tree. Since parse trees are verbose, they are simplified to produce an abstract syntax tree. The AST corresponding to the C source code above is shown below:

Global
├── VarDecl
│   ├── Type: int
│   ├── Name: x
│   └── Init
│       └── IntegerLiteral(1)
│
└── FunctionDecl
    ├── ReturnType: int
    ├── Name: f
    ├── Params
    │   └── ParamDecl
    │       ├── Type: int
    │       └── Name: y
    └── Body: CompoundStmt
        ├── VarDecl
        │   ├── Type: int
        │   └── Name: z
        │
        ├── IfStmt
        │   ├── Cond
        │   │   └── BinaryExpr (>)
        │   │       ├── Identifier(y)
        │   │       └── IntegerLiteral(0)
        │   │
        │   ├── Then
        │   │   └── CompoundStmt
        │   │       └── ExprStmt
        │   │           └── AssignExpr (=)
        │   │               ├── Identifier(z)
        │   │               └── BinaryExpr (+)
        │   │                   ├── Identifier(x)
        │   │                   └── Identifier(y)
        │   │
        │   └── Else
        │       └── CompoundStmt
        │           └── ExprStmt
        │               └── AssignExpr (=)
        │                   ├── Identifier(z)
        │                   └── BinaryExpr (-)
        │                       ├── Identifier(x)
        │                       └── Identifier(y)
        │
        └── ReturnStmt
            └── Identifier(z)

3. Scope

Scope is a basic unit of semantic analysis. It defines the visibility range of variables and functions. A scope can be viewed as a namespace: variables and functions defined within a scope are accessible inside that scope, while those defined outside are not. In practice, scopes are nested, and their nesting structure is determined by declaration order.

To characterize a scope, it is necessary to record both its enclosing and enclosed scopes, as well as all variables and functions defined within it.

Using the same C example, the scope hierarchy is:

GlobalScope
├── x : int
├── f : function(int) -> int
│
└── FunctionScope(f)
    ├── y : int
    │
    └── BlockScope (function body)
        ├── z : int
        ├── BlockScope (then)
        └── BlockScope (else)

Consider line [08], z = x - y. It belongs to the BlockScope (else). The analysis checks whether x and y are defined in this scope; they are not, so it proceeds to the enclosing scope, BlockScope (function body). Still not found, it proceeds to FunctionScope(f), where y is found. Continuing upward, x is located in GlobalScope.

For efficient lookup, symbols are often assigned unique identifiers. These identifiers may be sequential or based on line numbers, as long as they are unique. Assuming line-number-based identifiers, the scope hierarchy becomes:

GlobalScope: ID-00
├── x : ID-01, int
├── f : ID-02, function(int) -> int
│
└── FunctionScope(f): ID-02
    ├── y : ID-03, int
    │
    └── BlockScope (function body): ID-04
        ├── z : ID-05, int
        ├── BlockScope (then): ID-06
        └── BlockScope (else): ID-07

Identifiers play a critical role in program analysis. For example, in line [08] z = x - y, the identifiers of z, x, and y are ID-05, ID-01, and ID-03, respectively. The same identifiers appear in line [06] z = x + y. Therefore, both lines use variables ID-01 and ID-03 and modify variable ID-05.

4. Type Checking

Once variables are identified within the scope hierarchy, static type checking and inference can be performed. This requires modeling each instruction according to the language’s syntax and semantics, specifying the corresponding type rules.

For line [08] z = x - y, the identifiers of x and y are ID-01 and ID-03, both of type int. The identifier of z is ID-05, also of type int. The type rule reduces to int = int - int, which is valid in C.

Type information is essential for correct analysis. For example, given a function call f(1), if f is identified as ID-02, the analysis can determine that the call targets the function defined in FunctionScope: ID-02.

5. Control Flow Graph (CFG)

A control flow graph (CFG) represents the possible execution paths of a function at runtime. Nodes correspond to instructions or basic blocks (BBs), and edges represent possible control transfers. A basic block is a sequence of instructions with no internal control transfer except at the end.

Revisiting the function f:

[03] int f(int y) {
[04]     int z;
[05]     if (y > 0) {
[06]         z = x + y;
[07]     } else {
[08]         z = x - y;
[09]     }
[10]     return z;
[11] }

The CFG of f is:

        [Entry]
           |
           v
         [04]
           |
           v
         [05]
         /  \
        T    F
        |    |
      [06]  [08]
        \    /
         \  /
          vv
         [10]
           |
        [Exit]

6. Intermediate Representation (IR)

While the example code is simple, real programs often contain complex statements. For example, w = f(1) + f(2) combines function calls, arithmetic, and assignment. Direct analysis of such statements is difficult.

To address this, complex instructions are decomposed into simpler ones using an intermediate representation (IR), also called an intermediate language. A commonly used form is three-address code, where each instruction involves at most three operands.

For w = f(1) + f(2), the IR is:

%v0 = f(1)
%v1 = f(2)
w = %v0 + %v1

IR generation takes the AST as input and traverses it recursively from the root.

7. Static Single Assignment (SSA)

Static single assignment (SSA) assigns each variable exactly once. This allows variable names to uniquely identify definitions, making it easy to locate the most recent assignment.

In the example, determining the latest assignment to z before return z can be non-trivial. SSA resolves this:

x0 = 1
y0 = param y

if (y0 > 0) {
    z1 = x0 + y0
} else {
    z2 = x0 - y0
}

z3 = phi(z1, z2)
return z3

Here, phi merges multiple definitions into one.

SSA construction algorithms are well established, notably:

  • Efficiently Computing Static Single Assignment Form and the Control Dependence Graph (1991)
  • Simple and Efficient Construction of Static Single Assignment Form (2013)

In some frameworks, SSA plays a central role and can be generated directly from the AST, making an explicit IR step unnecessary. From a data-flow perspective, SSA and data-flow graphs are equivalent representations.

8. Additional Topics

8.1 Class Hierarchy

In object-oriented languages, inheritance is fundamental. A class hierarchy represents inheritance relationships and supports property and method resolution.

Example in Java:

class Animal { ... }
class Dog extends Animal { ... }
class Cat extends Animal { ... }
class Other extends Animal {}

Hierarchy:

Animal
 ├── Dog
 ├── Cat
 └── Other

For:

Animal a = new Other();
a.speak();

The call resolves to Animal.speak().

This introduces pointer relationships between variables and objects. For example:

void makeSound(Animal a) {
    a.speak();
}

The variable a may refer to different object types. Analysis approaches include class hierarchy analysis and pointer analysis.

8.2 Call Graph (CG)

A call graph represents calling relationships between functions. It is used to identify sensitive functions and call chains.

Construction depends on call forms:

  • method()
  • receiver.method()
  • implicit calls such as constructors

Higher-order functions often require more advanced analyses, such as pointer analysis.

8.3 Interprocedural Control Flow Graph (ICFG)

A CFG describes control flow within a single function. An interprocedural control flow graph (ICFG) combines CFGs with the call graph to represent global control flow across functions. It serves as the foundation for analyses such as data flow and pointer analysis.