
Data Flow Analysis

Author: Guangliang Yang
Last Updated: 2026-01-05
Copyright: This document is licensed under the CC BY-NC-SA 4.0 license

Data Flow Analysis is a core technique in program analysis. It describes how data propagates and evolves along the control flow graph (CFG). Data flow analysis algorithms were proposed as early as the 1970s and can be traced back to Kildall’s 1973 work “A Unified Approach to Global Program Optimization”. These algorithms are built around key concepts such as In/Out and Gen/Kill, and are solved through fixed-point iteration.

A fixed point is the state at which the analysis result, after being updated repeatedly during execution of the algorithm, stops changing: further iteration produces no new information, so the analysis has converged and can terminate. Fixed-point theory can be traced back to Tarski’s 1955 work “A Lattice-Theoretical Fixpoint Theorem”, which studied complete lattices over partially ordered sets and proved that any monotone function on a complete lattice has a least fixed point and a greatest fixed point; on the finite-height lattices used in program analysis, both can be computed by iteration in a finite number of steps. This result is fundamental to the development of program analysis theory. Many algorithms, including data flow analysis and pointer analysis, rely on fixed-point theory to prove that the analysis will eventually converge and terminate.

This chapter focuses on data flow analysis algorithms, explains basic concepts such as def/use and In/Out, and introduces worklist-based data flow algorithms.

1. Instruction-Level def/use

The definitions of def and use are fundamental concepts in data flow analysis. They describe how data is modified and referenced within a program. For a given instruction, def refers to the data modified by that instruction, while use refers to the data used by that instruction. It should be noted that def/use in this chapter refers only to variable-level syntactic definitions and uses, without involving aliasing, pointers, or memory-sensitive modeling.

From the semantic perspective of instructions, instruction-level def/use is typically static, meaning that the data modified and referenced by an instruction is fixed.

| Instruction Type | Instruction | def | use | Notes |
| --- | --- | --- | --- | --- |
| variable_decl | var name | ∅ | ∅ | def may also be {name}, depending on the analysis strategy |
| assign | target = operand1 operand2 | {target} | {operand1, operand2} | |
| array_write | array[index] = source | {array[index]} | {array, index, source} | Whether array[index] is treated as an independent definition unit depends on whether the analysis distinguishes memory locations |
| array_read | target = array[index] | {target} | {array, index} | |
| field_write | receiver_object.field = source | {receiver_object.field} | {receiver_object, source} | field is a static literal and may be excluded from use; in field-insensitive analysis, this degenerates into a weak update on receiver_object |
| field_read | target = receiver_object.field | {target} | {receiver_object} | |
| call | target = func(param1, param2, ...) | {target} | {func, param1, param2, ...} | def and use for function calls are often complex and may involve side effects; they require careful interprocedural handling |
| return | return value | ∅ | {value} | the value may be propagated to the caller; like call instructions, this requires interprocedural handling |
| if | if condition | ∅ | {condition} | |
  • ∅ denotes the empty set, meaning that the instruction does not modify or use any variable

  • Assignment instructions: the left-hand side is modified, the right-hand side is used

  • Variable declarations: do not use or modify any value; some frameworks treat the declared variable as modified for convenience, depending on the analysis strategy

  • Array operations: in array_write, the array is modified and the array reference, index, and source value are used; in array_read, the target variable is modified and the array reference and index are used

  • Object field operations: in field_write, the object is modified and the object reference and source value are used; in field_read, the target variable is modified and the object reference is used

  • Function calls: if there is a return value, the target variable is modified; all parameters and the function reference are used; function calls may have side effects affecting global variables or heap objects and require special handling in interprocedural analysis

  • Control-flow instructions: return does not modify variables and uses the return value (if any); if and similar instructions do not modify variables but use values
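
The rules above can be captured in a minimal sketch; the tuple-based instruction encoding and the `def_use` helper are hypothetical illustrations, not taken from any particular framework:

```python
# Syntactic def/use extraction for a few simplified instruction forms.
# The tuple-based instruction encoding below is a hypothetical illustration.

def def_use(instr):
    kind = instr[0]
    if kind == "assign":            # ("assign", target, [operands])
        _, target, operands = instr
        return {target}, set(operands)
    if kind == "array_write":       # ("array_write", array, index, source)
        _, array, index, source = instr
        return {f"{array}[{index}]"}, {array, index, source}
    if kind == "return":            # ("return", value): nothing defined
        return set(), {instr[1]}
    if kind == "if":                # ("if", condition): nothing defined
        return set(), {instr[1]}
    raise ValueError(f"unknown instruction kind: {kind}")

print(def_use(("assign", "b", ["a"])))            # b = a
print(def_use(("array_write", "a", "i", "v")))    # a[i] = v
```

A real framework would of course cover the full instruction set and model memory locations more carefully; this sketch only mirrors the table above.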

For complex instructions, the def/use modeling strategy depends on the desired analysis precision:

  • Strong Update: when the modified element can be precisely determined, it can be modeled precisely. For example, a[0] = 1 modifies only the element at index 0.

  • Weak Update: when the modified element cannot be precisely determined (e.g., the index is a variable), the analysis must conservatively assume that the entire data structure may be modified. For example, a[i] = 1 may modify any element of the array.

In practice, if array[index] or receiver.field cannot be precisely distinguished, the analysis may fall back to weak updates to ensure soundness.
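
The fallback decision can be sketched in a few lines; the helper name and the convention of returning the whole-array location for a weak update are assumptions made for this illustration:

```python
# Decide the def set of an array write, falling back to a weak update
# when the index is not a compile-time constant.

def array_write_def(array, index):
    if isinstance(index, int):
        # Strong update: exactly one element is known to be modified.
        return {f"{array}[{index}]"}
    # Weak update: the index is symbolic, so conservatively treat the
    # whole array as possibly modified (soundness over precision).
    return {array}

print(array_write_def("a", 0))    # a[0] = 1: strong update
print(array_write_def("a", "i"))  # a[i] = 1: weak update
```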

Soundness and related metrics

Soundness is a critical concept and metric in program analysis. A related concept is completeness.

  • Soundness: every problem that actually occurs is included in the analysis result; the report is a superset of the real cases. A sound analysis is conservative and has no false negatives.
  • Completeness: every reported result actually occurs; the report is a subset of the real cases, i.e., there are no false positives.

Many program analysis frameworks prioritize soundness. If an analysis is sound, it means that all actual cases are already included in the analysis result, and the result can be considered safe.

However, these terms are highly academic, and the Chinese translation of completeness often causes confusion. To avoid ambiguity, false positives and false negatives are more commonly used:

  • False Positive (FP): issues reported by the analysis that will not actually occur. The false positive rate (FPR) is FP / total reported issues.
  • False Negative (FN): issues that actually occur but are not reported by the analysis. The false negative rate (FNR) is FN / total actual issues.

Correspondingly, Precision is 1 − FPR, and Recall is 1 − FNR.
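
These four metrics can be checked with a small calculation; the sets of issue identifiers below are invented for the example:

```python
# FPR, FNR, precision, and recall over sets of issue identifiers.

def metrics(reported, actual):
    fp = len(reported - actual)          # reported but not real
    fn = len(actual - reported)          # real but not reported
    fpr = fp / len(reported)             # FP / total reported issues
    fnr = fn / len(actual)               # FN / total actual issues
    return fpr, fnr, 1 - fpr, 1 - fnr    # (FPR, FNR, precision, recall)

# 3 reported issues, 3 real issues, 2 in common.
fpr, fnr, precision, recall = metrics({1, 2, 3}, {2, 3, 4})
print(precision, recall)
```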

2. Instruction-Level In/Out

def/use captures how an instruction statically modifies and uses variables, but it does not describe dynamic data dependencies between instructions. To address this, the concepts of In and Out are introduced. In represents the data-flow state before executing the instruction, and Out represents the data-flow state after the instruction completes.

“State” is a critical concept, but it is entirely determined by the specific data flow task. Different data flow analyses define state differently. For clarity, the classic task of Reaching Definitions is used as an example. Reaching definitions determines, for any instruction, which variable definitions remain accessible and usable. For example:

[01] a=1
[02] a=3
[03] b=a

Both [01] and [02] define variable a. However, for instruction [03], only the definition at [02] is usable, because the definition at [01] has been overwritten by [02]. Therefore, before executing [03], the state set is {[02] a}, indicating that only the definition of a at [02] is reachable.

Based on this notion of state, In and Out can be fully defined.

In State Set

For an instruction stmt, its In state set is obtained by merging the Out state sets of all predecessor instructions in the control flow graph:

In_stmt = ∪(Out_P) for all P ∈ pred(stmt)

where pred(stmt) denotes all predecessor instructions of stmt in the CFG.

Out State Set

For an instruction stmt, its Out state set is computed from its In state set combined with its def/use:

Out_stmt = Gen_stmt ∪ (In_stmt - Kill_stmt)

Here, two important functions are introduced. Gen_stmt represents the new definitions generated by the instruction, and Kill_stmt represents the definitions invalidated by it; for reaching definitions, these are all other definitions of the variables the instruction defines.

Example

Using the same code:

[01] a=1
[02] a=3
[03] b=a

The In and Out state computation for instruction [02] is as follows:

  • For instruction [02], its In state set is the Out state set of instruction [01]:
In_2 = {[01] a}
  • For instruction [02], its Out state set is computed from its In state set and its def/use. Gen_2 is {[02] a}, and Kill_2 is {[01] a}.

Thus, the Out state set of [02] is:

Out_2 = Gen_2 ∪ (In_2 - Kill_2)
      = {[02] a} ∪ ({[01] a} - {[01] a})
      = {[02] a} ∪ ∅
      = {[02] a}

Therefore, the Out state set of instruction [02] is {[02] a}.
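
The same derivation can be written directly in code, representing each definition as a (label, variable) pair (an encoding chosen only for this example):

```python
# Out = Gen ∪ (In − Kill), with definitions as (label, variable) pairs.

def transfer(in_set, gen, kill):
    return gen | (in_set - kill)

in_2 = {("01", "a")}       # the definition of a at [01] reaches [02]
gen_2 = {("02", "a")}      # [02] generates a new definition of a
kill_2 = {("01", "a")}     # ...and kills the earlier definition at [01]

print(transfer(in_2, gen_2, kill_2))
```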

3. Reaching Definitions Analysis

Reaching Definitions Analysis is an important data flow analysis task that computes the reachability of variable definitions. This section introduces a classic worklist algorithm to perform reaching definitions analysis based on In/Out states.

The worklist algorithm aligns well with fixed-point-based analysis. It repeatedly updates analysis results until stability is reached. Many program analysis algorithms are implemented using worklist algorithms.

The following is pseudocode for reaching definitions analysis using a worklist algorithm:

Algorithm: Reaching Definitions Analysis
Input: Control Flow Graph (CFG)
Output: In and Out state sets for each instruction

1. Initialization:
    for each stmt:
        In_stmt = ∅
        Out_stmt = ∅

    Worklist = {entry instruction of CFG}

2. Iteration:
    while Worklist ≠ ∅:
        stmt = pop Worklist

        In_stmt = ∪(Out_P) for all P ∈ pred(stmt)

        Out_stmt_new = Gen_stmt ∪ (In_stmt - Kill_stmt)

        if Out_stmt_new ≠ Out_stmt:
            Out_stmt = Out_stmt_new
            Worklist = Worklist ∪ succ(stmt)

3. Return In and Out state sets for all instructions
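
The pseudocode can be realized as a short runnable sketch, using the (label, variable) encoding of definitions from earlier; the dictionary-based CFG and the Gen/Kill tables are assumptions made for this example:

```python
from collections import deque

def reaching_definitions(cfg, entry, gen, kill):
    """cfg maps each statement label to its list of successor labels."""
    preds = {n: [] for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].append(n)

    in_sets = {n: set() for n in cfg}
    out_sets = {n: set() for n in cfg}
    worklist = deque([entry])
    while worklist:
        stmt = worklist.popleft()
        # In = union of the predecessors' Out sets
        in_sets[stmt] = set().union(*(out_sets[p] for p in preds[stmt]))
        # Out = Gen ∪ (In − Kill)
        new_out = gen[stmt] | (in_sets[stmt] - kill[stmt])
        if new_out != out_sets[stmt]:
            out_sets[stmt] = new_out
            worklist.extend(cfg[stmt])   # re-examine the successors
    return in_sets, out_sets

# Straight-line example from earlier: [01] a=1; [02] a=3; [03] b=a
cfg = {"01": ["02"], "02": ["03"], "03": []}
gen = {"01": {("01", "a")}, "02": {("02", "a")}, "03": {("03", "b")}}
kill = {"01": {("02", "a")}, "02": {("01", "a")}, "03": set()}
ins, outs = reaching_definitions(cfg, "01", gen, kill)
print(ins["03"], outs["03"])
```

As expected, only the definition at [02] reaches [03]: the definition of a at [01] is killed along the way.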

Convergence follows from the fixed-point theory discussed earlier: the state space is finite (a program contains only finitely many definitions), and the transfer functions are monotone, so each Out set can only grow. Since every Out set is bounded above by the set of all possible definitions, the iteration must reach a fixed point, and thus terminate, in a finite number of steps.

4. Example

Consider function f in the following C code:

[01] int x = 1;
[02]
[03] int f(int y) {
[04]     int z;
[05]     if (y > 0) {
[06]         z = x + y;
[07]     } else {
[08]         z = x - y;
[09]     }
[10]     return z;
[11] }

For reaching definitions analysis of function f, assuming the initial state is {[01] x, [03] y}, the result is:

| Instruction | Code | def | use | In | Out |
| --- | --- | --- | --- | --- | --- |
| [04] | int z; | ∅ | ∅ | {[01]x, [03]y} | {[01]x, [03]y} |
| [05] | if (y > 0) | ∅ | y | {[01]x, [03]y} | {[01]x, [03]y} |
| [06] | z = x + y; | z | x, y | {[01]x, [03]y} | {[01]x, [03]y, [06]z} |
| [08] | z = x - y; | z | x, y | {[01]x, [03]y} | {[01]x, [03]y, [08]z} |
| [10] | return z; | ∅ | z | {[01]x, [03]y, [06]z, [08]z} | {[01]x, [03]y, [06]z, [08]z} |

Note that the Out state of the return instruction represents the state at function exit and is propagated to the caller. In intraprocedural analysis, the Out state of return may also be defined as the empty set.

5. Classical Data Flow Analysis Tasks

In addition to reaching definitions, data flow analysis is used for many other tasks. Depending on the task and objective, different strategies are chosen along dimensions such as analysis direction, analysis type, state elements, meet operation, and initial values. The following table summarizes several representative tasks.

| Analysis Task | Description | Direction | Type | State Element | Meet | Initial |
| --- | --- | --- | --- | --- | --- | --- |
| Reaching Definitions | which variable definitions are available at stmt | Forward | May | variable definition def(stmt, var) | ∪ | ∅ |
| Live Variables | which variables may be used in the future | Backward | May | variable | ∪ | ∅ |
| Reaching Uses | which uses are reached by definitions | Backward | May | use site | ∪ | ∅ |
| Available Expressions | which expressions are available at stmt | Forward | Must | expression | ∩ | Universe |
| Very Busy / Anticipated Expressions | which expressions will be used in the future | Backward | Must | expression | ∩ | Universe |

The following explains these analysis dimensions:

(1) Analysis Direction (Forward / Backward)

The analysis direction determines how data-flow information propagates along the CFG.

Forward analysis traverses the CFG from entry to exit.

Backward analysis traverses the CFG from exit to entry.

Reaching definitions is a forward analysis. In general, analyses related to def are forward, and analyses related to use are backward.

(2) Analysis Type (May / Must)

The analysis type determines whether the state represents “possibly true” or “definitely true”. In May analysis, if a condition holds on at least one path, it is considered true. In Must analysis, it must hold on all paths.

May analysis is commonly used for safety, while Must analysis is often used for optimization.

(3) State Meaning

State meaning defines the semantic interpretation of a single element in the In/Out set. For reaching definitions, each state element represents a definition available at the current instruction.

(4) Meet Operation (∪ / ∩)

The meet operation specifies how states are merged when an instruction has multiple predecessors or successors.

Union (∪): holds if at least one path satisfies the condition (May)

Intersection (∩): holds only if all paths satisfy the condition (Must)

Thus, May ⇔ ∪, Must ⇔ ∩.

(5) Initial Value

The initial value defines the default In/Out state at the start of the algorithm.

For May analysis, the initial value is the empty set. For Must analysis, the initial value is the universal set.
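
The pairing of meet operation and initial value can be seen directly with Python sets; the two predecessor-path states below are made up for illustration:

```python
from functools import reduce

# States arriving along two hypothetical predecessor paths.
path_states = [{"x", "y"}, {"y", "z"}]
universe = {"x", "y", "z"}

# May analysis: merge with ∪, starting from the empty set.
may_merge = reduce(set.union, path_states, set())

# Must analysis: merge with ∩, starting from the universal set
# (starting from ∅ would wrongly force every Must fact to be false).
must_merge = reduce(set.intersection, path_states, universe)

print(may_merge, must_merge)
```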

6. Interprocedural Analysis

The analyses above are intraprocedural. Interprocedural data flow analysis requires modeling function calls and returns and defining In and Out states for functions. Classical interprocedural data flow analysis relies heavily on the interprocedural control flow graph (ICFG). Once the ICFG is constructed, intraprocedural data flow algorithms can be lifted directly to the interprocedural setting.

The ICFG concept is described in 2-1. Program Analysis Basics.