P2: Bottom-up Semantic Analysis¶
At this stage, the Lian program analysis framework performs fine-grained bottom-up semantic analysis, including data-flow analysis and points-to analysis. The core goal of this stage is to precisely characterize data flow and points-to relations in the program without relying on type information, providing a high-quality semantic basis for subsequent interprocedural analysis, taint tracking, and security modeling.
This stage is implemented by the core/prelim_semantics module and driven by PreliminaryAnalysis.run(). The required inputs include GIR statements and the results from the previous stage, "basic structural analysis".
1 Main Approach¶
1-1 Overall Workflow¶
This stage uses a bottom-up semantic analysis strategy. The basic workflow has two steps: 1) use functions as the basic analysis unit, analyze each function independently, and compute its semantic summary; 2) synthesize upward along function call relations, introducing the semantic summary of the callee into the semantic analysis of the caller.
The core idea of this strategy is to characterize local, reusable function semantics first, then gradually build a complete understanding of behavior upward. This design has several important considerations:
First, all functions are analyzed, including potentially "dead code" functions. In traditional program analysis, call graph pruning is often used to exclude functions that will not be called, but this strategy is not always reliable in real systems. Due to the widespread presence of higher-order functions, dynamic property access, reflection mechanisms, and third-party library calls, static analysis often has deviations when judging "function reachability". Once some functions are misjudged and ignored, the completeness and extensibility of subsequent analysis will be irreversibly affected. Therefore, at this stage Lian does not rely on call reachability as a semantic premise, but constructs conservative local semantic summaries for all resolvable functions. Whether these summaries take effect in the final program semantics is determined by later analysis stages.
Second, treating functions as independent analysis units helps abstract semantic information independent of call sites.
[01] f(int a) {
[02] int b = a;
[03] return b;
[04] }
The semantics expressed by this function is "return its input parameter". This conclusion is independent of call sites and has high reuse value.
Of course, this assumption does not always hold. When a function references an external environment (such as this, global variables, implicit objects, or indirect call targets), its semantics will inevitably depend on the specific calling context. For example:
f(A a) {
a.speak();
}
In this case, the actual behavior of a.speak() depends on the runtime target object of a.
The goal of this stage is not to fully solve such problems, but to extract reusable and stable function semantics as much as possible. The remaining context-dependent semantics will be further handled in the later "top-down" analysis stage.
1-2 Gen/Kill Algorithm¶
The Gen/Kill algorithm has long demonstrated its effectiveness in data-flow analysis, and it is equally applicable to untyped language analysis. Therefore, the Lian system continues to use the Gen/Kill algorithm for data-flow analysis. Not limited to this, Lian also applies the Gen/Kill algorithm to points-to set analysis and achieves good results.
The Gen/Kill algorithm is the core mechanism in classic data-flow analysis and has been validated in many program analysis systems. This algorithm does not rely on strong typing assumptions, so it is also suitable for dynamic and weakly typed languages. In Lian, the Gen/Kill algorithm is used not only for symbol-level data-flow analysis, but is also extended to the evolution analysis of points-to sets. By applying the Gen/Kill approach at different levels, Lian can achieve relatively high precision while keeping the analysis controllable.
1-3 Trade-offs in SSA Analysis Strategy¶
Static Single Assignment (SSA) form has important value in compilation optimization and dependency analysis, and can efficiently characterize variable-level data dependencies. However, in Lian's target scenarios, SSA is not a universal solution. Its limitations are mainly reflected in the following aspects:
First, SSA is difficult to support field-sensitive analysis naturally. For example:
[01] x.f = 1;
[02] x.f = 2;
[03] c = x.f;
For the third statement [03], the dependency of field f of x is difficult to analyze and understand using SSA results. Similar problems are especially prominent in array element access and dynamic property access.
Second, in terms of expressiveness, SSA is equivalent to classic data-flow analysis in many scenarios. Therefore, when only one can be chosen, Lian prefers to use data-flow analysis as the core semantic modeling mechanism and integrate points-to analysis on this basis.
1-4 Points-to Analysis¶
Points-to analysis is the key technique for handling complex behaviors in dynamic languages, and is especially important for property access, indirect calls, and similar scenarios. For example:
- For property access
z = x[y], if the points-to sets of x and y are known, the concrete data type of x (possibly an array or object), and the type (possibly int or string) and value range of y can be inferred, thus computing the points-to set of z. - For indirect call
z = x(y), if the points-to set of x is known, potential callees can be analyzed on-the-fly, and the number of parameters, parameter points-to sets, and return semantics can be inferred. This mechanism also applies to object method callsz = x.f(y).
Without pointer analysis support, for static languages it is still possible to approximate call targets using compiler type information, but in dynamic languages, the callee is often only determined at runtime, and traditional type-driven methods are difficult to apply. Therefore, pointer analysis that does not depend on a type system is critical in such scenarios.
1-5 Relationship Between Pointer Analysis and Taint Analysis¶
In Lian, pointer analysis is not only used to handle aliasing and call dispatch, but its results are also directly reused in taint tracking analysis. The result of pointer analysis is a mapping from variables to abstract memory objects, and these abstract memory objects can also serve as carriers for taint propagation.
Take an assignment statement as an example, such as x = y. The points-to set of x is the points-to set of y. After y is tainted, the memory objects inside y's points-to set can also be marked as tainted. After this statement is analyzed, x's points-to set will also be updated, and the target objects pointed to by x's points-to set will also be marked as tainted. By extension, for other statements such as x.f = y, the same mechanism applies.
Therefore, in Lian's design, pointer analysis is a core capability in the semantic analysis stage and the basis for high-precision taint analysis.
1-6 Precision Targets of Pointer Analysis¶
Pointer analysis is a focus of the Lian program analysis framework. Its precision targets include:
- Context sensitive: for analysis in this stage, context sensitivity is achieved via summaries; for the next stage (top-down analysis), 1-CallSite precision is used.
- Flow sensitive: identify the impact of statement execution order on points-to relations through reaching-definition analysis.
- Field sensitive: explicitly model dependencies of object properties and array elements.
2 Preliminary Semantic Analysis¶
2-1 Basic Definitions¶
- Symbol: all identifiers in the program, such as variables, functions, classes, properties, array elements, etc.
- Abstract Memory Object: an abstraction of runtime memory entities that a symbol may correspond to, including address, statement ID, type, value, fields, and array elements.
- State: a mapping from symbols to sets of abstract memory objects (
symbol -> {memory objects}). - Reaching definition: symbol definitions and their corresponding abstract memory objects that are still valid at the current statement position.
2-2 Method Summary¶
A method summary describes a function's semantic behavior along two dimensions, input and output:
- Input: the function's dependency on the external environment, including parameters, external variable references, etc.
- Output: the function's impact on the external environment, including return values, external variable modifications, etc.
At this stage, all inputs of a function are uniformly abstracted as anything. Here, anything denotes the top element of the current abstract domain (abstract top), used to maintain conservative semantics when information is insufficient, rather than introducing extra assumptions. Through intraprocedural semantic analysis, its output behavior is computed gradually. When a callee summary is applied to a caller, it is only necessary to replace anything in the summary with the real abstract memory objects under the caller context.
To support summary computation, a method frame is introduced to record inputs, states, and intermediate results during function analysis.
2-3 Analysis Workflow¶
Semantic analysis takes functions as the basic unit and proceeds in the following order:
- methods with no callee
- methods with direct callee: direct call means the callee can be located directly via the function-name variable ID
- methods with indirect callee: indirect call means the callee cannot be determined via the function-name variable ID
For each function, initialize the method frame and analyze every statement in its CFG. The overall workflow is:
Input: method GIR statements, basic structural analysis results
Output: method summary summary
Worklist = [first stmt of method CFG]
while Worklist != empty:
// get a statement
stmt = Worklist.pop()
// analyze available symbol defs at this statement
reaching_symbol_defs = analyze_reaching_symbols(stmt)
// analyze points-to relations for this statement
compute_stmt_states(stmt, reaching_symbol_defs)
// if the above analysis changed symbol definitions (e.g., function call), update
update_reaching_defs(stmt)
// check whether symbol-level or state-level changes happened at this statement
if is_changed(symbol, states) or first_round:
Worklist = Worklist union succ(stmt)
2-4 Symbol-level Reaching Definition Analysis¶
analyze_reaching_symbols() implements the classic Gen/Kill algorithm at the symbol level to compute the reaching symbol definitions at each statement. Its detailed principle can be found in the data-flow analysis chapter on reaching-definition analysis.
update_reaching_defs() handles implicit effects introduced by pointer analysis, with function calls as a typical scenario. When the callee references or modifies external symbols, the symbol use and def sets of the current statement need to be updated accordingly.
2-5 Points-to Analysis¶
compute_stmt_states() computes the changes to points-to relations caused by each statement.
Flow-sensitive analysis¶
[01] x.f = 1;
[02] x.f = 2;
[03] c = x.f
In this example, in [01] the abstract memory object of x is {fields: {f: 1}}; in [02] it changes to {fields: {f: 2}}; in [03] the points-to set of x is {fields: {f: 2}}. With flow-insensitive analysis, in [03] the points-to set of x would be {fields: {f: {1, 2}}}, increasing false positives. Flow-sensitive analysis produces the correct result.
For this, Lian introduces a reaching-definition-based state selection mechanism at the abstract memory object level, ensuring that each field read is associated with the latest valid state definition. In [03], the used symbol is x and its field f. The latest state of x is computed, and its latest state definition is found at [02], so the points-to set of x is {fields: {f: 2}}.
State copying and strong update¶
In the example above, without precise alias determination, directly using strong updates can easily introduce inconsistency in branch and join scenarios, especially when handling branches. Since this stage does not introduce SSA form, to avoid inconsistency caused by strong updates in branching, Lian uses a state copying strategy: create a copy and apply modifications to the copy, while keeping the source state unchanged. This state copying mechanism is not a replacement for SSA, but works at the abstract memory object level to express semantic evolution related to fields and aliasing, which SSA at the variable level cannot directly capture.
This raises the problem of distinguishing source states from copied states, which is decided by reaching-definition analysis at the abstract memory object level.
State computation¶
The points-to analysis workflow for a single instruction is:
Input: GIR instruction stmt to analyze, basic structural analysis results, symbol-level reaching definitions
Output: instruction-related points-to relations
// collect all corresponding states based on available symbol definitions at this statement
symbol_states = collect_newest_states(reaching_symbol_defs)
// check whether some symbols have no corresponding state; if so, fill them
complete_in_states(symbol_states)
// after preparing all input symbols and their memory states, compute statement state changes
// statement state changes are handled by `core/stmt_states.py`
run_stmt_state_analysis()
// after computation, check what new states were generated; if new states exist, update state-level reaching definitions
adjust_computation_results()
Applying method summaries¶
After a callee method summary is generated, it is applied to the caller. This application is implemented by call_stmt in core/stmt_states.py:
- collect all used symbols and collect the newest states of these reaching symbols
- use the summary as a template, instantiate the method summary by replacing anything with the newest states
- treat the instantiated summary as new states and update caller semantics