Skip to content

Semantic Storage and Retrieval

Semantic storage and retrieval (src/lian/util/loader) is the core infrastructure module for data management. During large-scale static analysis, this module efficiently manages, stores, and retrieves massive intermediate data produced by analysis. At the same time, under limited memory resources, it provides a data management foundation for Lian to run complex program analysis through caching strategies, bundled storage, and unified data access interfaces.

The semantic storage and retrieval module contains multiple loaders to manage different types of analysis results in Lian. The loader framework uses two organizational approaches:

  • Template loader classes: define the common pattern of data access.
  • Concrete loader implementations: bind to a specific class of analysis results.

Template classes answer "how this kind of data is accessed", while concrete loaders answer "what this data is".

2 Loader Template System

The loader framework implements a layered abstraction from general to specific through an inheritance hierarchy. Low-level data serialization is handled by DataModel, and loaders only focus on data organization logic and access pattern optimization.

2-1 General Data Management Template: GeneralLoader

GeneralLoader is the base class of loaders and implements the core caching and bundled storage mechanisms.

Caching mechanism

It uses a two-level cache system:

access request
  ->
item cache (Item Cache) <- LRU, stores deserialized objects
  -> (miss)
bundle cache (Bundle Cache) <- LRU, stores raw data blocks
  -> (miss)
disk files (Bundle Files)

Data storage mechanism

For data storage, a bundled storage strategy is used. Data is not written to disk item by item. Instead, it is accumulated in memory into an "active bundle":

save() -> Active Bundle (memory)
  -> (accumulate to MAX_ROWS)
export() -> bundle file (disk)

When the active bundle reaches the configured row limit (config.MAX_ROWS), a batch write is triggered, generating files like cfg.bundle0. This batch I/O strategy merges hundreds of small writes into one large write, reducing disk overhead.

Index mechanism

GeneralLoader uses a separate index file to record the mapping item_id -> bundle_file_id, for example:

method_id: 1001 -> bundle0
method_id: 1002 -> bundle0
method_id: 2001 -> bundle1

During query, it locates the bundle via the index first, then extracts the specific data from the bundle.

Abstract interfaces

GeneralLoader defines three abstract methods that subclasses must implement to adapt to different data structures:

  • query_flattened_item_when_loading(): query records of a specific item from a bundle.
  • unflatten_item_dataframe_when_loading(): reconstruct a complex object from a flattened DataFrame.
  • flatten_item_when_saving(): convert a complex object into a flattened list of dicts.

This design separates data storage responsibility (DataModel) from data organization responsibility (Loader).

2-2 UnitLevelLoader: Source-file-granularity data management

UnitLevelLoader handles data whose granularity is the source file (unit_id), such as scope hierarchies and exported symbols.

2-3 MethodLevelAnalysisResultLoader: Method-granularity data management

MethodLevelAnalysisResultLoader handles analysis results at the method (method_id) level or finer granularity (such as call site CallSite).

2-4 OneToManyMapLoader / ManyToOneMapLoader: Mapping relationship management

These loaders handle mappings between IDs, such as "one class contains multiple methods" and "one method belongs to one class". These loaders provide fast bidirectional queries, offering index support for call relationships, type relationships, and symbol ownership relationships.

3 Concrete Loader Implementations

Concrete loaders are instantiated implementations of loader template classes. Each concrete loader manages specific analysis data. They inherit the corresponding loader template class and implement specific serialization logic. These loaders do not care about how data is physically stored (handled by DataModel), but only about the logical structure and access patterns.

3-1 Structural Data Loaders

ModuleSymbolsLoader:

  • manage metadata of all source files and directories
  • build a bidirectional mapping between file paths and unit_id
  • serve as the starting point for unit-path mapping in the analysis pipeline

UnitGIRLoader:

  • store a unit's generic intermediate representation (GIR)
  • provide fine-grained statement-level access

3-2 Graph Structure Loaders

CFGLoader:

  • store control-flow graph edges: (src_stmt_id, dst_stmt_id, control_flow_type)
  • rebuild a NetworkX directed graph when loading

SymbolGraphLoader:

  • handle mixed-node graphs: nodes include statement IDs and SymbolDefNode
  • distinguish "def" edges and "use" edges by edge direction
  • implement sparse storage via conditional fields (defined_symbol_* vs used_symbol_*)

StateFlowGraphLoader:

  • store the state flow graph, with nodes as SFGNode and edges as SFGEdge
  • serialize complex objects via to_tuple() / from_tuple()

ImportGraphLoader:

  • manage three related data structures:

    • symbol-level import graph (nodes are symbol IDs)
    • node metadata (symbol details)
    • unit-level dependency graph (nodes are unit_id)
  • separate storage, unified queries

TypeGraphLoader:

  • store the type inheritance graph
  • support querying parent classes and child classes

3-3 Data-flow Analysis Loaders

BitVectorManagerLoader:

  • store the mapping from semantic nodes (SymbolDefNode / StateDefNode) to bit positions
  • use one-way storage and two-way reconstruction

StmtStatusLoader:

  • store statement def-use information
  • fields include: defined_symbol, used_symbols, in_symbol_bits, out_symbol_bits, etc.

SymbolStateSpaceLoader:

  • store full information of symbols and states
  • distinguish Symbol and State
  • State contains access paths (a list of AccessPoint)

3-4 Method Summary Loaders

MethodDefUseSummaryLoader:

  • store basic method def-use summary
  • fields: parameter symbols, local symbols, external symbols, return symbol

MethodSummaryLoader:

  • handle two object types:

    • MethodSummaryTemplate (keyed by method_id)
    • MethodSummaryInstance (keyed by CallSite)
  • distinguish types by checking the caller_id field

CalleeParameterMappingLoader:

  • store parameter mappings at call sites
  • ParameterMapping contains parameter indices, state IDs, access paths, etc.

3-5 Call Relationship Loaders

CallGraphLoader:

  • store call relationship graphs between methods
  • edges include call statement IDs

CallPathLoader:

  • store complete call paths (CallPath)
  • each path is a sequence of CallSite
  • provide composite query interfaces:
    • get_callers_by_method_id(): get callers
    • get_call_path_between_two_methods(): get the path between two methods
    • get_lowest_common_ancestor(): find the lowest common ancestor

4 Unified Interfaces

The semantic storage and retrieval module aggregates dozens of concrete loaders and provides unified, semantic access interfaces.

4-1 Unified get/save interfaces

The Loader class provides semantic access methods for each specialized loader, forming a unified naming convention:

Basic pattern:

# save data
save_xxx(id, data)

# load data
get_xxx(id) -> data

# convert mapping
convert_xxx_to_yyy(xxx_id) -> yyy_id

# type check
is_xxx(id) -> bool

Examples:

# module symbols
save_module_symbols(module_symbol_results)
get_module_symbol_table() -> DataModel

# unit GIR
save_unit_gir(unit_id, unit_gir)
get_unit_gir(unit_id) -> DataModel

# method CFG
save_method_cfg(method_id, cfg)
get_method_cfg(method_id) -> nx.DiGraph

# ID conversion
convert_method_id_to_unit_id(method_id) -> unit_id
convert_stmt_id_to_method_id(stmt_id) -> method_id

# type checks
is_method_decl(stmt_id) -> bool
is_parameter_decl_of_method(stmt_id, method_id) -> bool

This unified interface hides the complexity of underlying loaders, so upper-layer analysis code does not need to care where data is stored or how it is serialized.

4-2 Intelligent query encapsulation

Loaders provide many high-level query methods that encapsulate complex logic:

Call relationship queries:

get_callers(method, phase="p3", match_by="name", entry_point_id=-1)
get_callees(method, phase="p3", match_by="name", entry_point_id=-1)

Select implementations by parameters:

  • phase="p2": use CallGraphLoader
  • phase="p3": use CallPathLoader
  • match_by="name": return a set of method names
  • match_by="id": return a set of method IDs

Source code queries:

get_stmt_source_code_with_comment(stmt_id) -> List[str]
get_method_decl_source_code(method_id) -> str
get_unit_source_code_by_stmt_id(stmt_id) -> List[str]

These methods read files and handle comments and boundary cases based on fixed rules:

  • comment identification and inclusion
  • distinguishing method declarations from method bodies
  • line number mapping and boundary handling

Method name queries:

get_method_name_by_method_id(method_id) -> str
# return "ClassName.method_name" or "method_name"

Class names are concatenated automatically for more user-friendly display.

5 Multi-stage Analysis Support (P1 / P2 / P3)

Lian's analysis pipeline has three stages, with increasing data scale and precision:

  • P1: basic structural analysis
  • P2: bottom-up interprocedural semantic analysis
  • P3: top-down global analysis

Loaders isolate stage data by separating filesystem paths:

workspace/
  |-- frontend/          # frontend parsing results (GIR, module symbols, etc.)
  |-- semantic_p1/       # P1 basic structural analysis
  |-- semantic_p2/       # P2 bottom-up interprocedural analysis
  `-- semantic_p3/       # P3 top-down global analysis

5-1 P1: Basic Structural Analysis

Goal: run basic intraprocedural analysis and build structural information.

Core data:

  • control flow and scopes:

    • CFGLoader: method-level CFG
    • ScopeHierarchyLoader: scope hierarchy
    • UnitSymbolDeclSummaryLoader: symbol declaration summary
  • symbol and state foundations:

    • SymbolBitVectorManagerLoader (P1): symbol bit-vector mapping
    • StmtStatusLoader (P1): statement def-use info
    • SymbolStateSpaceLoader (P1): symbol state space
  • method and class info:

    • MethodDefUseSummaryLoader: method def-use summary
    • MethodInternalCalleesLoader: internal call sites in methods
    • ClassIDToMethodsLoader: class-to-method mapping
  • initial call graph:

    • CallGraphLoader: initial call relations
    • GroupedMethodsLoader: grouped methods

5-2 P2: Bottom-up Interprocedural Analysis

Goal: propagate information from callees to callers and perform interprocedural data-flow analysis.

Core data:

  • symbol and state propagation:

    • SymbolBitVectorManagerLoader (P2): extended symbol bit vectors
    • StateBitVectorManagerLoader (P2): state bit vectors
    • StmtStatusLoader (P2): updated statement status
    • SymbolStateSpaceLoader (P2): extended symbol state space
    • SymbolStateSpaceSummaryLoader (P2): summary info
  • graph refinement:

    • SymbolGraphLoader (P2): symbol dependency graph
    • StateFlowGraphLoader (P2): state flow graph
  • method summaries:

    • MethodSummaryLoader (Template): method summary templates
    • CalleeParameterMappingLoader (P2): parameter mapping
  • call graph refinement:

    • CallGraphLoader (P2): more precise static call graph

5-3 P3: Top-down Global Analysis

Goal: start from entry points and perform path-sensitive global analysis.

Core data:

  • global symbols and states:

    • SymbolBitVectorManagerLoader (P3): global symbol bit vectors
    • StateBitVectorManagerLoader (P3): global state bit vectors
    • StmtStatusLoader (P3): context-sensitive statement status
    • SymbolStateSpaceLoader (P3): global symbol state space
    • SymbolStateSpaceSummaryLoader (P3): global summary
  • call paths:

    • CallPathLoader: core component storing complete call paths
  • context sensitivity:

    • MethodSummaryLoader (Instance): context-specific method summaries
    • CalleeParameterMappingLoader (P3): context-sensitive parameter mapping
    • SymbolGraphLoader (P3): context-sensitive symbol graph
    • StateFlowGraphLoader (P3): global state flow graph

6 Extensibility

The semantic storage and retrieval module provides data read/write and query capabilities for direct use by upper-layer analysis.

6-1 Source code and GIR association

Problem: analysis results are based on GIR (statement IDs), but users need source code.

Solution:

get_stmt_source_code_with_comment(stmt_id) -> List[str]
  • automatically locate the source file containing the statement
  • extract corresponding code based on line number information in GIR
  • handle comments intelligently (identify comments before methods via util.determine_comment_line())
  • for method declarations, return the whole method body; for normal statements, return only the statement

6-2 Rule matching and taint analysis

Problem: taint analysis needs to quickly locate code positions based on rules (such as "which functions are sources/sinks").

Solution:

get_source_sink_stmt_ids_by_rules() -> (source_stmt_ids, sink_stmt_ids)

Workflow:

  1. get all source/sink rules from RuleManager
  2. filter rules by file path (avoid invalid matches)
  3. traverse each unit's GIR, for each statement:

    • convert to a readable GIR string (readable_gir.get_gir_str())
    • use regex to match symbol names (\b word boundaries to avoid false matches)
    • if the rule specifies a line number, additionally check line number match
  4. return the matched statement ID sets

6-3 Unified call relationship query

Problem: call relationships are stored differently across stages (P2 uses graphs, P3 uses paths), and query interfaces differ.

Solution:

get_callers(method, phase="p3", match_by="name", entry_point_id=-1)
get_callees(method, phase="p3", match_by="name", entry_point_id=-1)

Parameters:

  • method: method name or method ID
  • phase: "p2" / "p3" / "all" (auto-merge)
  • match_by: "name" (return method name set) / "id" (return method ID set)
  • entry_point_id: specify entry point (only valid for P3)

6-4 Type hierarchy queries

Capability:

get_parent_class_by_class_name(class_name) -> List[TypeNode]
get_son_class_by_class_name(class_name) -> List[TypeNode]

Use cases: analyze inheritance, virtual calls, type casting, etc.

Implementation: traverse the type graph in TypeGraphLoader and query by edge field parent_name.

6-5 Module import relationship queries

Capability:

get_import_node_with_name(unit_id, import_name) -> SymbolNodeInImportGraph
get_edges_and_nodes_with_edge_attrs_in_import_graph(node_id, attr_dict)

Use cases: analyze module dependencies and resolve symbol references.

Design:

  • ImportGraphLoader maintains both symbol-level import graph and unit-level dependency graph
  • support queries by import name, edge type (internal/external), statement ID, and other attributes
  • return combinations of edges and nodes (edge, node)

7 Summary

The semantic storage and retrieval framework implements efficient, flexible, and maintainable data management. With controllable memory and I/O overhead, it provides the data management foundation for high-precision global analysis.