GIR Translation

The GIR (Generic Intermediate Representation) translation module is the foundation of Lian's multi-language analysis capabilities. It translates code from different source languages into GIR, a unified, language-agnostic intermediate representation. With the abstraction GIR provides, subsequent semantic analyses (control-flow analysis, data-flow analysis, pointer analysis, taint analysis, and so on) are decoupled from the syntax and type system of any specific language, which enables unified program analysis.

An important design principle of GIR is that, although syntax varies widely across languages, semantic structures at the behavioral level share strong commonality. GIR is a concrete representation of this commonality. This matters most when adding a new language: instead of rebuilding a full set of analysis algorithms per language, as the traditional approach requires, developers implement only a lightweight frontend translator that maps the source language to GIR. With this design, the Lian system supports multiple languages, including C, Java, Python, JavaScript, PHP, TypeScript, and Go.

1 Limitations of Existing Intermediate Languages

In program analysis, several mature intermediate representations already exist, but they still face major obstacles in multi-language scenarios:

  • LLVM IR: LLVM IR targets low-level compiler optimization and machine code generation, and its semantics depend heavily on static type information. Its instructions presuppose complete types, which makes it efficient for C/C++ and easy to lower to machine code. For dynamic languages such as Python and JavaScript, however, types are often an analysis result rather than a prerequisite, so translating them effectively into LLVM IR is very challenging.

    LLVM IR also lowers high-level operations (such as field access and array element access) into address-based instructions (such as getelementptr), so high-level semantic information such as field names and indices is lost. Because dynamic languages additionally allow properties to be added at runtime, this flattened representation is poorly suited to recovering semantic information and understanding program behavior.

  • WALA IR: Although WALA IR provides some language abstraction, constructing it depends heavily on a series of prerequisite analyses (such as control-flow graph construction, SSA transformation, and type analysis). For each new language, developers must re-implement these complex pipelines, which leads to very high engineering and maintenance costs and prevents rapid expansion to new languages.

  • Truffle AST: Truffle AST is designed for runtime execution, and its node semantics are tightly bound to the runtime execution model. This makes it difficult to use as a high-level, unified representation for whole-program static analysis.

Existing intermediate representations are therefore not a good fit for the Lian program analysis framework. GIR is designed to fill this gap as a high-level, language-agnostic intermediate representation oriented toward static analysis.
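The information loss described for LLVM IR above can be illustrated with a small sketch. The two record types and the `lower` function are hypothetical and exist only for this illustration; they are neither LLVM's nor Lian's actual data structures. The sketch assumes a fixed struct layout that maps field names to byte offsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRead:
    """High-level form: the receiver and the field name are both kept."""
    target: str
    receiver: str
    field_name: str


@dataclass(frozen=True)
class AddressLoad:
    """Lowered form: only a base pointer and a byte offset survive."""
    target: str
    base: str
    offset: int


def lower(instr: FieldRead, layout: dict) -> AddressLoad:
    """Lower a field read using a static layout (field name -> offset).

    A field that is missing from the layout raises KeyError, which mirrors
    why properties added at runtime do not fit this model.
    """
    return AddressLoad(instr.target, instr.receiver, layout[instr.field_name])


layout = {"name": 0, "age": 8}  # hypothetical struct layout
lowered = lower(FieldRead("t0", "user", "age"), layout)
print(lowered)
```

After lowering, an analysis sees only `user + 8` and can no longer tell that the program read a field called `age`. A field such as `nickname` that is attached dynamically has no offset at all.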

2 Design Philosophy of GIR

The core of GIR's design is to use program behavioral semantics, rather than types, as the basis for abstraction. GIR therefore does not assume any specific language features; by building on commonality in behavioral semantics, it supports both statically typed and dynamically typed languages.

Mainstream programming languages share a set of core structures at the behavioral level:

  • A program consists of modules and functions;
  • Inside functions, state is managed through variable declarations and scopes;
  • Data flows through assignment, reading, and passing;
  • Execution logic is driven by conditionals and loops;
  • Object-oriented paradigms further introduce the dimensions of classes, fields, and methods.

GIR defines 79 basic instructions for these commonalities, covering the core semantic units of programs. The instructions are named intuitively, and each has a single meaning with clear boundaries, which makes it convenient for subsequent analyses to establish precise semantic relationships. Language-specific syntactic sugar is expanded by the frontend into combinations of multiple basic GIR instructions.
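As a rough illustration of how syntactic sugar might be expanded, the sketch below turns a conditional expression such as Python's `x = a if c else b` into a plain conditional containing two assignments. The instruction names `if_stmt` and `assign_stmt` and their fields are assumptions made for this example and are not taken from the actual GIR instruction set.

```python
def expand_conditional_assign(target, condition, if_true, if_false):
    """Expand `target = if_true if condition else if_false`.

    The result is one conditional instruction whose two branches each hold
    a simple assignment. All instruction and field names are hypothetical.
    """
    return [
        {
            "if_stmt": {
                "condition": condition,
                "then_body": [
                    {"assign_stmt": {"target": target, "operand": if_true}}
                ],
                "else_body": [
                    {"assign_stmt": {"target": target, "operand": if_false}}
                ],
            }
        }
    ]


expanded = expand_conditional_assign("x", "c", "a", "b")
print(expanded)
```

With this kind of expansion, an analysis only needs to understand conditionals and assignments, with no special case for the conditional-expression syntax of any one language.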

Notably, GIR aims for expressive accuracy rather than a minimal instruction set, so that subsequent analysis algorithms can establish precise semantic associations directly.

GIR does not attempt to forcibly erase language-specific differences (such as JavaScript's prototype chain or Python's dynamic attribute injection). Its role is to faithfully record “what behavior happened”, while leaving “the concrete meaning of this behavior in a specific language” to the language-difference compatibility module. This separation of responsibilities keeps the GIR structure stable over the long term: core analysis algorithms remain generic, and language-specific logic can be layered on flexibly as plugins.
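One way to picture this separation is a generic record of a field read plus per-language resolvers that decide what the read means. The sketch below is a simplified model under stated assumptions: the `field_read` record, the toy heap, the `__proto__` and `__class__` link keys, and the `RESOLVERS` table are all hypothetical and do not describe Lian's actual compatibility module.

```python
def resolve_javascript(heap, receiver, field):
    """Look the field up along a (toy) prototype chain."""
    seen = set()
    current = receiver
    while current is not None and current not in seen:
        seen.add(current)  # guard against cyclic chains in the toy heap
        fields = heap.get(current, {})
        if field in fields:
            return fields[field]
        current = fields.get("__proto__")
    return None


def resolve_python(heap, receiver, field):
    """Look the field up on the instance, then on its (toy) class."""
    own = heap.get(receiver, {})
    if field in own:
        return own[field]
    cls = own.get("__class__")
    return heap.get(cls, {}).get(field) if cls else None


# Hypothetical plugin table: language name -> resolver.
RESOLVERS = {"javascript": resolve_javascript, "python": resolve_python}


def analyze_field_read(instr, lang, heap):
    """Generic core: it only knows that a field read happened."""
    op = instr["field_read"]
    return RESOLVERS[lang](heap, op["receiver"], op["field"])


heap = {
    "obj": {"__proto__": "base"},
    "base": {"__proto__": "root"},
    "root": {"greet": "fn_greet"},
}
read = {"field_read": {"target": "%t0", "receiver": "obj", "field": "greet"}}
print(analyze_field_read(read, "javascript", heap))
print(analyze_field_read(read, "python", heap))
```

The same record yields different answers depending on which resolver is plugged in, while `analyze_field_read` itself stays unchanged.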

3 GIR Frontend

The GIR frontend uses a top-down translation process:

  • First, it uses tree-sitter to parse source code into a syntax tree (typically treated as an abstract syntax tree, AST);
  • Second, it performs semantic mapping: lang_parser.py dispatches to the parser for the specific language, and a top-down recursive traversal from the AST root node maps each syntax node into one or more atomic GIR instructions.

During this process, the frontend translator does not need to perform complex control-flow computation or SSA transformation. It only needs to preserve the original information, such as variable declarations, scope switches, function calls, and explicit data-flow relations. This design greatly lowers the barrier to onboarding new languages and avoids the analysis distortion caused by introducing assumptions too early in translation.
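The top-down mapping can be sketched with a toy frontend. To stay self-contained, this example uses Python's built-in `ast` module in place of tree-sitter, and it handles only simple assignments, binary operations, and calls to plain names. The class `ToyFrontend`, the `%t` temporaries, and the instruction names `assign_stmt` and `call_stmt` are assumptions for illustration and are not Lian's actual implementation.

```python
import ast
from itertools import count


class ToyFrontend:
    """Recursively map syntax nodes to flat, hypothetical instructions."""

    def __init__(self):
        self.instructions = []
        self._temps = count()

    def _new_temp(self):
        return f"%t{next(self._temps)}"

    def translate(self, source):
        for stmt in ast.parse(source).body:
            self._stmt(stmt)
        return self.instructions

    def _stmt(self, node):
        if isinstance(node, ast.Assign) and isinstance(node.targets[0], ast.Name):
            value = self._expr(node.value)
            self.instructions.append(
                {"assign_stmt": {"target": node.targets[0].id, "operand": value}}
            )
        elif isinstance(node, ast.Expr):
            self._expr(node.value)
        else:
            raise NotImplementedError(type(node).__name__)

    def _expr(self, node):
        """Emit instructions for a sub-expression; return the name holding it."""
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        if isinstance(node, ast.BinOp):
            left = self._expr(node.left)
            right = self._expr(node.right)
            temp = self._new_temp()
            self.instructions.append({"assign_stmt": {
                "target": temp, "operator": type(node.op).__name__,
                "operand": left, "operand2": right}})
            return temp
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            args = [self._expr(arg) for arg in node.args]
            temp = self._new_temp()
            self.instructions.append(
                {"call_stmt": {"target": temp, "name": node.func.id, "args": args}}
            )
            return temp
        raise NotImplementedError(type(node).__name__)


gir = ToyFrontend().translate("y = f(x + 1)")
for instruction in gir:
    print(instruction)
```

The translator never builds a control-flow graph or SSA form. It only records the operations and the explicit data flow between names, which is the kind of information the paragraph above says a frontend needs to preserve.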

Ultimately, the generated GIR instructions serve as a standardized, language-agnostic representation of program code and drive all subsequent analyses in the Lian system. Whether the task is complex pointer alias analysis or sensitive-data tracking, the analysis target is always the same stable, unified GIR. This is what allows the Lian system to achieve high-precision multi-language analysis and sustainable engineering development.
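To show what an analysis that targets only the intermediate form can look like, here is a deliberately small taint-propagation sketch. It assumes straight-line code and the hypothetical `call_stmt` and `assign_stmt` records used purely for illustration; Lian's real analyses and instruction formats are not shown here and are certainly more involved.

```python
def propagate_taint(instructions, sources, sinks):
    """Forward taint propagation over a straight-line instruction list.

    Returns a list of (instruction index, sink name) pairs. The function
    never inspects which source language produced the instructions.
    """
    tainted = set()
    findings = []
    for index, instr in enumerate(instructions):
        kind, op = next(iter(instr.items()))
        if kind == "call_stmt":
            args_tainted = any(arg in tainted for arg in op["args"])
            if args_tainted and op["name"] in sinks:
                findings.append((index, op["name"]))
            produces_taint = op["name"] in sources or args_tainted
        elif kind == "assign_stmt":
            produces_taint = any(
                op.get(key) in tainted for key in ("operand", "operand2")
            )
        else:
            continue  # instruction kinds outside this sketch are skipped
        if produces_taint:
            tainted.add(op["target"])
        else:
            tainted.discard(op["target"])  # target overwritten with clean data
    return findings


program = [
    {"call_stmt": {"target": "%t0", "name": "read_input", "args": []}},
    {"assign_stmt": {"target": "query", "operator": "Add",
                     "operand": "prefix", "operand2": "%t0"}},
    {"call_stmt": {"target": "%t1", "name": "execute", "args": ["query"]}},
]
findings = propagate_taint(program, sources={"read_input"}, sinks={"execute"})
print(findings)
```

Because the function consumes only the flat records, the same code would apply to instructions produced from any frontend that emits this shape.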