Semantic Storage and Retrieval¶
Semantic storage and retrieval (src/lian/util/loader) is the core infrastructure module for data management. During large-scale static analysis, this module efficiently manages, stores, and retrieves massive intermediate data produced by analysis. At the same time, under limited memory resources, it provides a data management foundation for Lian to run complex program analysis through caching strategies, bundled storage, and unified data access interfaces.
The semantic storage and retrieval module contains multiple loaders to manage different types of analysis results in Lian. The loader framework uses two organizational approaches:
- Template loader classes: define the common pattern of data access.
- Concrete loader implementations: bind to a specific class of analysis results.
Template classes answer "how this kind of data is accessed", while concrete loaders answer "what this data is".
2 Loader Template System¶
The loader framework implements a layered abstraction from general to specific through an inheritance hierarchy. Low-level data serialization is handled by DataModel, and loaders only focus on data organization logic and access pattern optimization.
2-1 General Data Management Template: GeneralLoader¶
GeneralLoader is the base class of loaders and implements the core caching and bundled storage mechanisms.
Caching mechanism¶
It uses a two-level cache system:
access request
->
item cache (Item Cache) <- LRU, stores deserialized objects
-> (miss)
bundle cache (Bundle Cache) <- LRU, stores raw data blocks
-> (miss)
disk files (Bundle Files)
Data storage mechanism¶
For data storage, a bundled storage strategy is used. Data is not written to disk item by item. Instead, it is accumulated in memory into an "active bundle":
save() -> Active Bundle (memory)
-> (accumulate to MAX_ROWS)
export() -> bundle file (disk)
When the active bundle reaches the configured row limit (config.MAX_ROWS), a batch write is triggered, generating files like cfg.bundle0. This batch I/O strategy merges hundreds of small writes into one large write, reducing disk overhead.
Index mechanism¶
GeneralLoader uses a separate index file to record the mapping item_id -> bundle_file_id, for example:
method_id: 1001 -> bundle0
method_id: 1002 -> bundle0
method_id: 2001 -> bundle1
During query, it locates the bundle via the index first, then extracts the specific data from the bundle.
Abstract interfaces¶
GeneralLoader defines three abstract methods that subclasses must implement to adapt to different data structures:
query_flattened_item_when_loading(): query records of a specific item from a bundle.unflatten_item_dataframe_when_loading(): reconstruct a complex object from a flattened DataFrame.flatten_item_when_saving(): convert a complex object into a flattened list of dicts.
This design separates data storage responsibility (DataModel) from data organization responsibility (Loader).
2-2 UnitLevelLoader: Source-file-granularity data management¶
UnitLevelLoader handles data whose granularity is the source file (unit_id), such as scope hierarchies and exported symbols.
2-3 MethodLevelAnalysisResultLoader: Method-granularity data management¶
MethodLevelAnalysisResultLoader handles analysis results at the method (method_id) level or finer granularity (such as call site CallSite).
2-4 OneToManyMapLoader / ManyToOneMapLoader: Mapping relationship management¶
These loaders handle mappings between IDs, such as "one class contains multiple methods" and "one method belongs to one class". These loaders provide fast bidirectional queries, offering index support for call relationships, type relationships, and symbol ownership relationships.
3 Concrete Loader Implementations¶
Concrete loaders are instantiated implementations of loader template classes. Each concrete loader manages specific analysis data. They inherit the corresponding loader template class and implement specific serialization logic. These loaders do not care about how data is physically stored (handled by DataModel), but only about the logical structure and access patterns.
3-1 Structural Data Loaders¶
ModuleSymbolsLoader:
- manage metadata of all source files and directories
- build a bidirectional mapping between file paths and
unit_id - serve as the starting point for unit-path mapping in the analysis pipeline
UnitGIRLoader:
- store a unit's generic intermediate representation (GIR)
- provide fine-grained statement-level access
3-2 Graph Structure Loaders¶
CFGLoader:
- store control-flow graph edges:
(src_stmt_id, dst_stmt_id, control_flow_type) - rebuild a NetworkX directed graph when loading
SymbolGraphLoader:
- handle mixed-node graphs: nodes include statement IDs and
SymbolDefNode - distinguish "def" edges and "use" edges by edge direction
- implement sparse storage via conditional fields (
defined_symbol_*vsused_symbol_*)
StateFlowGraphLoader:
- store the state flow graph, with nodes as
SFGNodeand edges asSFGEdge - serialize complex objects via
to_tuple()/from_tuple()
ImportGraphLoader:
-
manage three related data structures:
- symbol-level import graph (nodes are symbol IDs)
- node metadata (symbol details)
- unit-level dependency graph (nodes are unit_id)
-
separate storage, unified queries
TypeGraphLoader:
- store the type inheritance graph
- support querying parent classes and child classes
3-3 Data-flow Analysis Loaders¶
BitVectorManagerLoader:
- store the mapping from semantic nodes (
SymbolDefNode/StateDefNode) to bit positions - use one-way storage and two-way reconstruction
StmtStatusLoader:
- store statement def-use information
- fields include:
defined_symbol,used_symbols,in_symbol_bits,out_symbol_bits, etc.
SymbolStateSpaceLoader:
- store full information of symbols and states
- distinguish
SymbolandState Statecontains access paths (a list ofAccessPoint)
3-4 Method Summary Loaders¶
MethodDefUseSummaryLoader:
- store basic method def-use summary
- fields: parameter symbols, local symbols, external symbols, return symbol
MethodSummaryLoader:
-
handle two object types:
MethodSummaryTemplate(keyed bymethod_id)MethodSummaryInstance(keyed byCallSite)
-
distinguish types by checking the
caller_idfield
CalleeParameterMappingLoader:
- store parameter mappings at call sites
ParameterMappingcontains parameter indices, state IDs, access paths, etc.
3-5 Call Relationship Loaders¶
CallGraphLoader:
- store call relationship graphs between methods
- edges include call statement IDs
CallPathLoader:
- store complete call paths (
CallPath) - each path is a sequence of
CallSite - provide composite query interfaces:
get_callers_by_method_id(): get callersget_call_path_between_two_methods(): get the path between two methodsget_lowest_common_ancestor(): find the lowest common ancestor
4 Unified Interfaces¶
The semantic storage and retrieval module aggregates dozens of concrete loaders and provides unified, semantic access interfaces.
4-1 Unified get/save interfaces¶
The Loader class provides semantic access methods for each specialized loader, forming a unified naming convention:
Basic pattern:
# save data
save_xxx(id, data)
# load data
get_xxx(id) -> data
# convert mapping
convert_xxx_to_yyy(xxx_id) -> yyy_id
# type check
is_xxx(id) -> bool
Examples:
# module symbols
save_module_symbols(module_symbol_results)
get_module_symbol_table() -> DataModel
# unit GIR
save_unit_gir(unit_id, unit_gir)
get_unit_gir(unit_id) -> DataModel
# method CFG
save_method_cfg(method_id, cfg)
get_method_cfg(method_id) -> nx.DiGraph
# ID conversion
convert_method_id_to_unit_id(method_id) -> unit_id
convert_stmt_id_to_method_id(stmt_id) -> method_id
# type checks
is_method_decl(stmt_id) -> bool
is_parameter_decl_of_method(stmt_id, method_id) -> bool
This unified interface hides the complexity of underlying loaders, so upper-layer analysis code does not need to care where data is stored or how it is serialized.
4-2 Intelligent query encapsulation¶
Loaders provide many high-level query methods that encapsulate complex logic:
Call relationship queries:
get_callers(method, phase="p3", match_by="name", entry_point_id=-1)
get_callees(method, phase="p3", match_by="name", entry_point_id=-1)
Select implementations by parameters:
phase="p2": useCallGraphLoaderphase="p3": useCallPathLoadermatch_by="name": return a set of method namesmatch_by="id": return a set of method IDs
Source code queries:
get_stmt_source_code_with_comment(stmt_id) -> List[str]
get_method_decl_source_code(method_id) -> str
get_unit_source_code_by_stmt_id(stmt_id) -> List[str]
These methods read files and handle comments and boundary cases based on fixed rules:
- comment identification and inclusion
- distinguishing method declarations from method bodies
- line number mapping and boundary handling
Method name queries:
get_method_name_by_method_id(method_id) -> str
# return "ClassName.method_name" or "method_name"
Class names are concatenated automatically for more user-friendly display.
5 Multi-stage Analysis Support (P1 / P2 / P3)¶
Lian's analysis pipeline has three stages, with increasing data scale and precision:
- P1: basic structural analysis
- P2: bottom-up interprocedural semantic analysis
- P3: top-down global analysis
Loaders isolate stage data by separating filesystem paths:
workspace/
|-- frontend/ # frontend parsing results (GIR, module symbols, etc.)
|-- semantic_p1/ # P1 basic structural analysis
|-- semantic_p2/ # P2 bottom-up interprocedural analysis
`-- semantic_p3/ # P3 top-down global analysis
5-1 P1: Basic Structural Analysis¶
Goal: run basic intraprocedural analysis and build structural information.
Core data:
-
control flow and scopes:
CFGLoader: method-level CFGScopeHierarchyLoader: scope hierarchyUnitSymbolDeclSummaryLoader: symbol declaration summary
-
symbol and state foundations:
SymbolBitVectorManagerLoader (P1): symbol bit-vector mappingStmtStatusLoader (P1): statement def-use infoSymbolStateSpaceLoader (P1): symbol state space
-
method and class info:
MethodDefUseSummaryLoader: method def-use summaryMethodInternalCalleesLoader: internal call sites in methodsClassIDToMethodsLoader: class-to-method mapping
-
initial call graph:
CallGraphLoader: initial call relationsGroupedMethodsLoader: grouped methods
5-2 P2: Bottom-up Interprocedural Analysis¶
Goal: propagate information from callees to callers and perform interprocedural data-flow analysis.
Core data:
-
symbol and state propagation:
SymbolBitVectorManagerLoader (P2): extended symbol bit vectorsStateBitVectorManagerLoader (P2): state bit vectorsStmtStatusLoader (P2): updated statement statusSymbolStateSpaceLoader (P2): extended symbol state spaceSymbolStateSpaceSummaryLoader (P2): summary info
-
graph refinement:
SymbolGraphLoader (P2): symbol dependency graphStateFlowGraphLoader (P2): state flow graph
-
method summaries:
MethodSummaryLoader (Template): method summary templatesCalleeParameterMappingLoader (P2): parameter mapping
-
call graph refinement:
CallGraphLoader (P2): more precise static call graph
5-3 P3: Top-down Global Analysis¶
Goal: start from entry points and perform path-sensitive global analysis.
Core data:
-
global symbols and states:
SymbolBitVectorManagerLoader (P3): global symbol bit vectorsStateBitVectorManagerLoader (P3): global state bit vectorsStmtStatusLoader (P3): context-sensitive statement statusSymbolStateSpaceLoader (P3): global symbol state spaceSymbolStateSpaceSummaryLoader (P3): global summary
-
call paths:
CallPathLoader: core component storing complete call paths
-
context sensitivity:
MethodSummaryLoader (Instance): context-specific method summariesCalleeParameterMappingLoader (P3): context-sensitive parameter mappingSymbolGraphLoader (P3): context-sensitive symbol graphStateFlowGraphLoader (P3): global state flow graph
6 Extensibility¶
The semantic storage and retrieval module provides data read/write and query capabilities for direct use by upper-layer analysis.
6-1 Source code and GIR association¶
Problem: analysis results are based on GIR (statement IDs), but users need source code.
Solution:
get_stmt_source_code_with_comment(stmt_id) -> List[str]
- automatically locate the source file containing the statement
- extract corresponding code based on line number information in GIR
- handle comments intelligently (identify comments before methods via
util.determine_comment_line()) - for method declarations, return the whole method body; for normal statements, return only the statement
6-2 Rule matching and taint analysis¶
Problem: taint analysis needs to quickly locate code positions based on rules (such as "which functions are sources/sinks").
Solution:
get_source_sink_stmt_ids_by_rules() -> (source_stmt_ids, sink_stmt_ids)
Workflow:
- get all source/sink rules from
RuleManager - filter rules by file path (avoid invalid matches)
-
traverse each unit's GIR, for each statement:
- convert to a readable GIR string (
readable_gir.get_gir_str()) - use regex to match symbol names (
\bword boundaries to avoid false matches) - if the rule specifies a line number, additionally check line number match
- convert to a readable GIR string (
-
return the matched statement ID sets
6-3 Unified call relationship query¶
Problem: call relationships are stored differently across stages (P2 uses graphs, P3 uses paths), and query interfaces differ.
Solution:
get_callers(method, phase="p3", match_by="name", entry_point_id=-1)
get_callees(method, phase="p3", match_by="name", entry_point_id=-1)
Parameters:
method: method name or method IDphase: "p2" / "p3" / "all" (auto-merge)match_by: "name" (return method name set) / "id" (return method ID set)entry_point_id: specify entry point (only valid for P3)
6-4 Type hierarchy queries¶
Capability:
get_parent_class_by_class_name(class_name) -> List[TypeNode]
get_son_class_by_class_name(class_name) -> List[TypeNode]
Use cases: analyze inheritance, virtual calls, type casting, etc.
Implementation: traverse the type graph in TypeGraphLoader and query by edge field parent_name.
6-5 Module import relationship queries¶
Capability:
get_import_node_with_name(unit_id, import_name) -> SymbolNodeInImportGraph
get_edges_and_nodes_with_edge_attrs_in_import_graph(node_id, attr_dict)
Use cases: analyze module dependencies and resolve symbol references.
Design:
ImportGraphLoadermaintains both symbol-level import graph and unit-level dependency graph- support queries by import name, edge type (internal/external), statement ID, and other attributes
- return combinations of edges and nodes
(edge, node)
7 Summary¶
The semantic storage and retrieval framework implements efficient, flexible, and maintainable data management. With controllable memory and I/O overhead, it provides the data management foundation for high-precision global analysis.