Pointer Analysis¶
Author: Yang Guangliang
Last Updated: 2026-01-06
Copyright Notice: This document is licensed under CC BY-NC-SA 4.0
Pointer Analysis, also known as Points-to Analysis, is a technique used to analyze the memory objects that pointer variables in a program may point to.
From the overall workflow of program analysis, pointer analysis is not mandatory. From the abstract syntax tree (AST) to the intermediate representation (IR), then to the control flow graph (CFG) and interprocedural control flow graph (ICFG), as well as interprocedural data-flow analysis—these steps can all be completed without pointer analysis. In addition, pointer analysis is extremely difficult and costly to implement, which ultimately leads many program analysis frameworks to choose not to implement it.
Even so, in many complex and important scenarios, pointer analysis is the key to solving these problems:
-
Aliasing: When two pointers point to the same memory, a modification made through one pointer affects what the other pointer reads. Data-flow analysis alone is often insufficient to handle aliasing; aliasing can cause missed reports during data-flow propagation. Points-to analysis is a powerful tool for aliasing: it can detect whether two pointer variables may point to the same content. The same aliasing issue also applies to object references.
-
Call Graph Generation: The call graph (CG) is the foundation for many whole-program analyses, such as interprocedural control-flow and data-flow analysis. Some tools build CGs using type systems and class hierarchies, but the presence of complex code such as higher-order functions (functions that take other functions as parameters or return functions) can make the CG incomplete or highly inaccurate. Pointer analysis can extract type information from the memory objects that function-name variables point to, thereby generating a more precise call graph.
-
Memory Safety: Pointer analysis helps detect memory-safety vulnerabilities such as null dereference, dangling pointers, memory leaks, and more.
Pointer analysis is challenging. It is an NP-hard problem (see Pointer-induced aliasing: a problem taxonomy, 1991) and can be viewed as the "crown jewel" of the entire program analysis field. Its history also reflects its complexity. The Gen/Kill algorithm for data-flow analysis was proposed in the 1970s and widely adopted. By contrast, pointer-analysis algorithms were lacking for a long time. Not until 1994 did Andersen propose a set-constraint-based pointer analysis in his PhD thesis "Program Analysis and Specialization for the C Programming Language", laying the foundation for modern pointer analysis. Andersen's algorithm remains one of the mainstream approaches today; almost all modern pointer analysis tools are based on it or its variants.
This chapter introduces the basic concepts of pointer analysis, the classic Andersen algorithm, precision metrics, and algorithmic improvements and practical applications.
1. Basic Concepts¶
1.1 Pointers and Points-to Relations¶
In a program, a pointer or reference is a special kind of variable that stores the address of the referenced memory or object rather than an ordinary value. Through pointers or references, a program can indirectly access and manipulate data in memory. A points-to relation describes the relationship between a pointer variable and the memory objects it may point to.
We use the notation p → o to indicate that pointer variable p may point to memory object o. Also, the points-to set pts (Points-to set) denotes all objects that pointer variable p may point to, i.e., pts(p) = {o}.
The following C code demonstrates points-to relations and their effects:
[01] int x = 10;
[02] int* p = &x;
[03] *p = 20;
In this example, instruction [02] makes p point to x, i.e., p → x. This points-to relation takes effect in instruction [03], which modifies the memory object that p points to, changing it to 20. Although x is not explicitly mentioned in [03], it is effectively modifying x. The key to the analysis here is to clarify the points-to relation between p and x.
1.2 Aliasing¶
Aliasing means that two or more different pointer variables point to or reference the same memory location. Aliasing is an extension of points-to relations, is widespread in real-world programming, and poses challenges for program analysis.
A slight modification of the above example illustrates aliasing and its impact:
[01] int x = 10;
[02] int* p = &x;
[03] int* q = p;
[04] *q = 20;
In this example, instruction [03] makes p and q aliases: they both point to the memory location of x. Therefore, when [04] executes, x is still updated to 20. Likewise, x does not appear in [03] or [04], yet the final result is that the memory content of x is modified. If you only perform a simple data-flow analysis, it is hard to understand the modification to x.
1.3 Function Calls¶
Functions are the basic units of program analysis, and how function calls are handled directly determines the quality of interprocedural processing. According to how directly the callee is resolved, calls can be classified as:
- Direct call: the caller directly calls the callee.
- Indirect call: the caller invokes the callee via a pointer variable.
[00] void foo() {}
......
[01] void (*fp)();
[02] fp = foo;
[03] fp();
[04] foo();
In this code, the call to foo in instruction [04] is a direct call and does not depend on any additional pointer variables. Instruction [03], however, calls foo through the function pointer fp. If the analysis is compiler-based (e.g., Clang/LLVM), the compiler normalizes the call format at [03] (i.e., the function signature: including the types of parameters and return value), and the signature can help locate the callee. But if the analysis is source-based, you need pointer analysis to obtain the points-to relation of fp in [03]: fp → foo, and thus determine that the callee is foo.
In object-oriented languages, function calls are often written as receiver.method(), where receiver is the object owning the call and method is the function name. Similarly, in compiler-based analysis (e.g., Java or Dalvik bytecode), the signature can determine the callee. But in source-based analysis, pointer analysis is needed to find the concrete object that receiver references, then look up method in that object to locate the callee.
1.4 Heap Object Abstraction¶
A heap object is a dynamically allocated memory unit created at runtime in the "heap" memory region. Heap objects are the core of pointer analysis. In practice, however, many heap objects may be created, which reduces analysis efficiency. Therefore, heap objects need to be abstracted and merged.
[01] for (int i = 0; i < n; i++) {
[02] *p = malloc(sizeof(int));
[03] ...
[04] }
In this example, at [02], pointer variable p points to a heap memory object. If each loop iteration creates a new object, then at [04], p may point to multiple objects, making analysis difficult.
Typically, heap objects are abstracted (or merged). Common abstraction strategies include:
-
Type-based Abstraction: abstract all heap objects of the same type into one. This is coarser and less precise, but more efficient.
-
Allocation-site Abstraction: abstract all heap objects created by the same allocation instruction into one abstract object. In the example above, although the loop may execute many times and create multiple distinct objects, allocation-site abstraction merges them into one object, denoted
object_L02(the abstract object created at [02]). -
Context-sensitive Abstraction: based on allocation-site abstraction, further distinguish objects created under different calling contexts. This improves precision but increases complexity.
Allocation-site abstraction is a very common strategy; the pointer analysis algorithms in this section mainly use allocation-site abstraction.
2. Andersen's Algorithm¶
Andersen's algorithm is one of the classic pointer analysis algorithms. It reduces points-to analysis to a set-constraint solving problem. It is also called inclusion-based analysis, or subset-based analysis.
The main idea of Andersen's algorithm is to maintain a points-to set pts for each pointer variable, representing all objects the pointer may point to. It then generates corresponding constraints from program statements, and finally computes the points-to results by iteratively collecting and solving these constraints.
2.1 Constraint Generation Rules¶
The key of Andersen's algorithm is generating and collecting constraints over points-to sets pts. The table below summarizes the rules for generating points-to constraints for each kind of statement:
| Statement Type | Statement | Generated Constraint | Meaning |
|---|---|---|---|
| Allocation | x = malloc() |
{object_stmt} ⊆ pts(x) |
The newly allocated abstract object object_stmt is included in x's points-to set |
| Assignment | x = y |
pts(y) ⊆ pts(x) |
y's points-to set is included in x's points-to set |
| Address-of | x = &y |
{y} ⊆ pts(x) |
Memory object y is included in x's points-to set |
| Pointer store | *x = y |
∀o ∈ pts(x): pts(y) ⊆ pts(o) |
For each object o pointed to by x, y's points-to set is included in o's points-to set |
| Pointer load | x = *y |
∀o ∈ pts(y): pts(o) ⊆ pts(x) |
For each object o pointed to by y, o's points-to set is included in x's points-to set |
For the call statement x=method(args), if method is a function pointer used for indirect calls, its points-to constraints are:
∀o ∈ pts(method):
// Find the callee function based on method's points-to set
callee = dispatch(o)
// Align the points-to sets of actual arguments args and formal parameters parameters in callee
pts(args_i) ⊆ pts(callee.parameters_i)
// Set the points-to set of the return value
pts(callee.return) ⊆ pts(x)
From the table above, Andersen's algorithm adopts a conservative strategy to ensure soundness. Each rule collects a points-to set inclusion relationship: as long as a points-to relation is possible, it will be collected.
2.2 Andersen Pointer Analysis¶
Based on the inclusion constraints over points-to sets, the analysis uses a worklist and fixed-point iteration to solve the constraints. The intra-procedural Andersen pointer analysis algorithm is as follows:
Algorithm: Andersen pointer analysis pointer_analysis()
Input: IR code, entry function main
Output: Points-to set pts for every pointer in the whole program
1. Global initialization:
// Initialize points-to sets
pts = ∅
// Initialize Worklist
Worklist = ∅
// Initialize visited functions
VisitedMethods = ∅
2. Add helper function: scan_method()
function scan_method(method):
if method ∉ VisitedMethods:
VisitedMethods.add(method)
// Generate constraints for all statements in the function body and add to Worklist
for stmt in method.body:
constraint = generate_constraint(stmt)
Worklist.push(constraint)
3. Add entry function:
// Initialize entry function
scan_method(main)
4. Iterative solving:
while Worklist ≠ ∅:
constraint = Worklist.pop()
// Process constraint
case constraint of:
// [Case 1] Allocation: x = malloc()
{o} ⊆ pts(x):
if {o} ⊄ pts(x):
pts(x) = pts(x) ∪ {o}
// Add constraints affected by x to Worklist
add_dependent_constraints(x, Worklist)
// [Case 2] Assignment: x = y
pts(y) ⊆ pts(x):
diff = pts(y) - pts(x)
if diff ≠ ∅:
pts(x) = pts(x) ∪ diff
// Add constraints affected by x to Worklist
add_dependent_constraints(x, Worklist)
// [Case 3] Address-of: x = &y
{y} ⊆ pts(x):
if y ∉ pts(x):
pts(x) = pts(x) ∪ {y}
// Add constraints affected by x to Worklist
add_dependent_constraints(x, Worklist)
// [Case 4] Pointer store: *x = y
∀o ∈ pts(x) pts(y) ⊆ pts(o):
diff = pts(y) - pts(o)
if diff ≠ ∅:
pts(o) = pts(o) ∪ pts(y)
// Add constraints affected by o to Worklist
add_constraint_to_worklist(o, Worklist)
// [Case 5] Pointer load: x = *y
∀o ∈ pts(y), pts(o) ⊆ pts(x):
diff = pts(o) - pts(x)
if diff ≠ ∅:
pts(x) = pts(x) ∪ pts(o)
// Add constraints affected by x to Worklist
add_constraint_to_worklist(x, Worklist)
// [Case 6] Function call: x = method(args)
∀callee ∈ pts(method):
// Ensure callee function body is scanned
scan_method(callee)
// Parameter passing (actual -> formal)
for i in 0..len(args):
new_constraint = pts(args[i]) ⊆ pts(callee.params[i])
Worklist.push(new_constraint)
// 3. Return value passing
if x exists:
new_constraint = pts(callee.return_var) ⊆ pts(x)
Worklist.push(new_constraint)
5. Return pts
Andersen pointer analysis models points-to relations in a program as a set of set-inclusion constraints by maintaining a monotonically increasing mapping from pointers to sets of abstract objects. Since all constraints are monotone functions, the analysis must converge to a fixed point in a finite number of steps, which guarantees termination. The algorithm adopts a conservative May-analysis strategy: any potentially existing points-to relation is included in the result, so the result is sound.
In implementing Andersen pointer analysis, several difficulties are worth noting:
- How to store points-to sets pts? Points-to sets store objects and need union and even difference operations with other sets. Designing an efficient data structure for object sets becomes an issue.
- How to store constraints? Constraints are not a structured data type, and expressing points-to relations cleanly is not easy.
- How to find affected constraints and add them to the Worklist? Given a points-to set, it may be necessary to traverse all constraints to find those affected and push them into the Worklist, which is inefficient.
For Problem 1, a bit vector can store object sets, and union/difference operations become bitwise integer operations, which is efficient. For Problems 2 and 3, the introduction and use of the Pointer Assignment Graph (PAG) greatly improves feasibility.
3. PAG-based Andersen Algorithm¶
In practical implementations of Andersen's algorithm, directly handling the constraint set faces many efficiency issues. The Pointer Assignment Graph (PAG) addresses this well. It was first proposed and implemented in the SPARK points-to module of the Java analysis framework Soot (see Scaling Java Points-to Analysis using SPARK, 2003). With a PAG, points-to sets can be propagated, turning constraint solving into a graph reachability problem.
3.1 PAG Construction¶
PAG was originally constructed for Java bytecode. In a PAG, nodes abstract pointer variables and memory objects, and edges abstract constraints over points-to relations.
PAG nodes can be of the following types:
- VarNode: represents pointer variables in the program
- AllocNode: represents objects abstracted from allocation sites
- FieldRefNode: represents fields of objects
PAG edges can be of the following types:
| Type | Corresponding Statement | Generated Constraint | Meaning | PAG Edge | Edge Type |
|---|---|---|---|---|---|
| Allocation | x = new T | {object_stmt} ⊆ pts(x) | x points to the newly allocated object | AllocNode → VarNode(x) | New(object_stmt, x) |
| Pointer assignment | x = y | pts(y) ⊆ pts(x) | points-to set propagates from y to x | VarNode(y) → VarNode(x) | Assign(y, x) |
| Field store | x.f = y | ∀o ∈ pts(x): pts(y) ⊆ pts(o.f) | via objects pointed to by x, propagate y to field f | VarNode(y) → FieldRefNode(o.f) | Store(y, x.f) |
| Field load | x = y.f | ∀o ∈ pts(y): pts(o.f) ⊆ pts(x) | propagate from field f of objects pointed to by y to x | FieldRefNode(o.f) → VarNode(x) | Load(y.f, x) |
For the call statement x=receiver.method(args), the points-to constraints are:
∀o ∈ pts(receiver):
// Find the callee based on receiver's points-to set
callee = dispatch(o.type, method)
// If it is a virtual function, set the this parameter
pts(receiver) ⊆ pts(this)
// Align the points-to sets of actual arguments args and formal parameters parameters in callee
pts(args_i) ⊆ pts(callee.parameters_i)
// Set the points-to set of the return value
pts(callee.return) ⊆ pts(x)
3.2 Implementation of PAG-based Andersen Algorithm¶
Based on PAG, Andersen pointer analysis becomes a graph-based propagation problem over points-to sets. The new algorithm has two parts: building the PAG, and propagating points-to sets using the PAG. The algorithm is as follows:
Algorithm: PAG-based Andersen pointer analysis
Input: program IR
Output: points-to set pts for each pointer
1. Build initial PAG:
PAG = Graph()
// Initialize points-to sets
pts = ∅
// Initialize Worklist
Worklist = ∅
// Initialize visited functions
VisitedMethods = ∅
// Initialize field stores
StoreConstraints = ∅
// Initialize field loads
LoadConstraints = ∅
// Initialize function calls
CallConstraints = ∅
2. Add helper function: first-time visit scan_method()
function scan_method(method):
if method ∉ VisitedMethods:
VisitedMethods.add(method)
for stmt in method.body:
case stmt of:
// Create New edges for each allocation site
x = new T (at line L):
pts(x) = pts(x) ∪ {object_L}
Worklist.push(x)
// Create Assign edges for assignments
x = y:
PAG.add(Assign(y, x))
if pts(y) ≠ ∅:
Worklist.push(y)
// Record Store relations for field accesses
x.f = y:
StoreConstraints.add(x, f, y)
// Record Load relations for field loads
x = y.f:
LoadConstraints.add(x, y, f)
// Record call relations
x = receiver.method(args):
CallConstraints.add(x, receiver, method, args)
3. Add entry function:
// Initialize entry function
scan_method(main)
4. Iterative solving:
while Worklist ≠ ∅:
n = Worklist.pop()
// [Case 1] Assign edges: m = n
for Assign(n, m) ∈ PAG:
diff = pts(n) - pts(m)
if diff ≠ ∅:
pts(m) = pts(m) ∪ diff
Worklist.push(m)
// [Case 2] Store edges: x.f = y
for each (x, f, y) ∈ StoreConstraints where x == n:
for each o ∈ pts(x):
if Assign(y, o.f) ∉ PAG:
PAG.add(Assign(y, o.f))
if pts(y) ≠ ∅:
Worklist.push(y)
// [Case 3] Load edges: x = y.f
for each (x, y, f) ∈ LoadConstraints where y == n:
for each o ∈ pts(y):
if Assign(o.f, x) ∉ PAG:
PAG.add(Assign(o.f, x))
if pts(o.f) ≠ ∅:
Worklist.push(o.f)
// [Case 4] Function calls
for each (x, receiver, method, args) ∈ CallConstraints where receiver == n:
for each o ∈ pts(n):
callee = dispatch(o.type, method)
// Ensure callee function body is scanned
scan_method(callee)
// If it is a virtual function, align this
PAG.add(Assign(receiver, callee.this))
if pts(receiver) ≠ ∅:
Worklist.push(receiver)
// Parameter passing (actual -> formal)
for i in 0..len(args):
PAG.add(Assign(args[i], callee.params[i]))
if pts(args[i]) ≠ ∅:
Worklist.push(args[i])
// 3. Return value passing
if x exists:
PAG.add(Assign(callee.ret, x))
if pts(callee.ret) ≠ ∅:
Worklist.push(callee.ret)
5. Return pts
For each object o generated at an allocation site, its field o.f is modeled as an independent node in the PAG. dispatch(o.type, method) ensures that as receiver's points-to set is updated, new callee functions are incorporated into the analysis. Compared with pre-building a coarse call graph, this on-the-fly approach uses pointer-analysis precision to build a more accurate call graph.
3.3 Example¶
[01] class Box {
[02] int data;
[03] }
[04]
[05] void main() {
[06] Box b = new Box();
[07] Box p = b;
[08] Box q = b;
[09]
[10] p.data = 10;
[11] int x = q.data;
[12] }
In the Java code above, p and q both reference object b, so they are aliases. When p.data is modified, q.data is affected as well.
Below shows how Andersen pointer analysis handles aliasing:
Step 1: Scan main, update PAG¶
| Statement | Operation | Notes |
|---|---|---|
[06] b = new Box() |
pts(b) = {object_L06} Worklist.push(b) |
Create allocation-site abstract object object_L06 |
[07] p = b |
PAG.add(Assign(b, p)) Worklist.push(b) |
Add assignment edge b → p into PAG |
[08] q = b |
PAG.add(Assign(b, q)) Worklist.push(b) |
Add assignment edge b → q into PAG |
[10] p.data = 10 |
StoreConstraints.add(p, data, 10) | Record Store |
[11] x = q.data |
LoadConstraints.add(x, q, data) | Record Load |
After Step 1, the initial pts state is:
pts(b) = {object_L06}
The resulting PAG is
b
/ \
/ \
assign assign
| |
v v
p q
And Worklist = [b]
Step 2: Iteratively propagate pts¶
2.1: First, process b, pts(b) = {object_L06}
[Assign edge] Assign(b, p):
pts(p) = {object_L06}
Worklist.push(p)
[Assign edge] Assign(b, q):
pts(q) = {object_L06}
Worklist.push(q)
2.2: Next, process p, pts(p) = {object_L06}
[Store edge] Find (p, data, 10):
For object_L06 ∈ pts(p):
Record: object_L06.data is modified
2.3: Then, process q, pts(q) = {object_L06}
[Load edge] Find (x, q, data):
For object_L06 ∈ pts(q):
Read: object_L06.data
Final result¶
pts(b) = {object_L06}
pts(p) = {object_L06}
pts(q) = {object_L06}
Through pointer analysis, we find:
pts(p) ∩ pts(q) = {object_L06} ∩ {object_L06} = {object_L06} ≠ ∅
Therefore, p and q are aliases!
Conclusion¶
- Statement [10]
p.data = 10modifies object_L06.data to 10 - Statement [11]
x = q.datareads object_L06.data - Therefore, the value of x is 10
Pointer analysis identifies that p and q are aliases pointing to the same object, enabling accurate data-flow tracking: p.data → object_L06.data → q.data.
4. Analysis Precision¶
Andersen pointer analysis laid the foundation of pointer analysis. Many improved versions have since been proposed; the algorithms have been refined and optimized, and precision has improved. The main directions of improvement include:
4.1 Context Sensitive vs Context Insensitive¶
Context sensitivity refers to whether the analysis can distinguish the behavior of the same function under different call sites.
[01] int* identity(int *p) {
[02] return p;
[03] }
[04]
[05] void main(){
[06] int a, b;
[07] int *x = identity(&a); // Call Site 1
[08] int *y = identity(&b); // Call Site 2
[09] }
In this example, identity returns its input pointer p. At runtime, the calls at [07] and [08] return pointers to a and b, respectively.
With a context-sensitive pointer analysis, the results differ: pts(x) = {a}, pts(y) = {b}, because the analysis maintains separate results for each call site. With a context-insensitive analysis, the results are merged: pts(x) = {a, b}, pts(y) = {a, b}, because the two calls to identity at [07] and [08] are not distinguished.
Common context-sensitive strategies include:
- Call-site Sensitivity: distinguish contexts by call instruction locations.
- Object Sensitivity: in receiver.method() calls, distinguish contexts by different receivers; especially suitable for OO languages.
- Type Sensitivity: distinguish contexts using type information.
Sensitivity is typically bounded to k levels, commonly k=1 or k=2. When k is too large, analysis cost increases sharply.
4.2 Flow Sensitive vs Flow Insensitive¶
Flow sensitivity determines whether statement execution order is considered. A flow-insensitive analysis ignores order and merges effects of all statements.
[01] p = &a;
[02] p = &b;
In this code, flow-insensitive analysis merges [01] and [02], producing pts(p) = {a, b}. Flow-sensitive analysis yields pts(p) = {a} after [01], and pts(p) = {b} after [02].
4.3 Path Sensitive vs Path Insensitive¶
Path sensitivity determines whether different execution paths are distinguished. A path-insensitive analysis merges results from all possible control-flow paths.
[01] int a, b;
[02] int *p;
[03] if (condition) {
[04] p = &a;
[05] } else {
[06] p = &b;
[07] }
[08] *p = 10;
For this code, path-insensitive analysis yields pts(p) = {a, b} at [08]; the two branch assignments are merged, so *p = 10 may modify a or b. With path-sensitive analysis, different paths are distinguished and separate states are maintained:
Path 1: condition is true
pts(p) = {a}
*p = 10 modifies only a
Path 2: condition is false
pts(p) = {b}
*p = 10 modifies only b
5. Improvements and Applications of Andersen's Algorithm¶
Given the precision dimensions above, we can conclude that the original Andersen algorithm is context-insensitive, flow-insensitive, and path-insensitive.
Subsequent improved versions mainly enhance precision along these three axes to make pointer analysis more accurate:
5.1 Context Sensitive¶
A function usually has one entry and multiple exits. After entering analysis of the callee, caller information is often lost, so it becomes unclear where to return. Context-insensitive analysis treats this as irrelevant, considers all exits to ensure soundness, and thus loses precision.
Context-sensitive improvements incorporate caller information by extending heap abstractions and variables from one dimension to two, i.e., (context, element). This retains caller context and improves precision.
5.2 Flow Sensitive¶
Flow sensitivity is commonly addressed by introducing SSA (Static Single Assignment).
5.3 Path Sensitive¶
Path sensitivity requires analyzing branch statements, introducing path constraints, and combining with a SAT (Boolean Satisfiability) solver.
6. Convergence of Andersen Pointer Analysis¶
Andersen pointer analysis converges and terminates. The core reasons are monotonicity and finiteness.
-
Monotonic growth of points-to sets
- Andersen's algorithm continuously unions and expands points-to sets; it never removes elements. PAG edges, once added, are never deleted. Therefore, points-to sets grow monotonically.
-
The elements handled by pointer analysis are finite:
- Finite number of pointer variables
- Finite number of abstract objects (using allocation-site abstraction and other strategies)
- Finite number of points-to relations: each pointer can point to at most all objects, which has an upper bound
Therefore, within a finite search space, Andersen's algorithm approaches the boundary in a monotone increasing way. As long as the program size is finite, it must stop in a finite number of steps.