Data Layer Data Model
The data layer stores and operates on intermediate results produced during program analysis and provides stable read/write interfaces for each analysis stage. It does not take part in the analysis logic itself; it is responsible only for organizing, accessing, updating, and serializing data.
The current implementation uses tabular data as the main form and manages analysis results by wrapping pandas.DataFrame.
1 Data Model (util/data_model.py)
1-1 DataModel
DataModel represents an analysis result table. Internally it maintains a pandas.DataFrame instance and provides access interfaces oriented to program analysis scenarios.
Supported operations include:
- construct an instance from a DataFrame, raw data, or an existing DataModel;
- access data by row index, index set, or boolean conditions;
- read or update column data by column name;
- row/column slicing and appending data;
- in-place modification during analysis.
To reduce repeated scans, DataModel maintains the following internal structures:
- _schema: a mapping from column names to column indices;
- _rows: a cached NumPy representation of row data;
- _column_indexer: an equality index built for specific columns.
When data changes, these structures are marked invalid and rebuilt on demand.
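The invalidate-and-rebuild pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in util/data_model.py: the class name TableSketch, its method names, and the "operation" column are hypothetical, and only the three internal structure names come from the text.

```python
import pandas as pd


class TableSketch:
    """Hypothetical stand-in for DataModel; not the project's actual class."""

    def __init__(self, data):
        # Accept a DataFrame, raw data (e.g. a list of dicts), or another TableSketch.
        if isinstance(data, TableSketch):
            self._data = data._data.copy()
        else:
            self._data = pd.DataFrame(data)
        self._invalidate()

    def _invalidate(self):
        # Mark derived structures stale; they are rebuilt on the next access.
        self._schema = None
        self._rows = None
        self._column_indexer = {}

    @property
    def schema(self):
        # Column name -> column position, built lazily.
        if self._schema is None:
            self._schema = {name: i for i, name in enumerate(self._data.columns)}
        return self._schema

    @property
    def rows(self):
        # Cached NumPy view of the row data, built lazily.
        if self._rows is None:
            self._rows = self._data.to_numpy()
        return self._rows

    def positions_where_equal(self, column, value):
        # Build the equality index for `column` on first use, then reuse it.
        if column not in self._column_indexer:
            index = {}
            for pos, v in enumerate(self._data[column].tolist()):
                index.setdefault(v, []).append(pos)
            self._column_indexer[column] = index
        return self._column_indexer[column].get(value, [])

    def set_cell(self, pos, column, value):
        # Every write drops the caches so later reads see the new data.
        self._data.iat[pos, self.schema[column]] = value
        self._invalidate()
```

Building the equality index per column on first use keeps construction cheap for tables that are never filtered, at the cost of one full scan the first time a column is queried after a change.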
1-2 Row
Row represents one row of data in the table. It is a wrapper over the underlying NumPy row array and supports:
- access corresponding fields via attribute names;
- convert row data to a dict;
- copy the row object while keeping the column structure unchanged;
- directly modify field values during analysis.
Each Row records its row number in the original table, used to locate the corresponding statement or analysis entity.
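A row wrapper with these properties might look like the sketch below. The class name RowSketch, its constructor signature, and the "operation" column are assumptions made for illustration; the actual Row class may be organized differently.

```python
import numpy as np


class RowSketch:
    """Hypothetical stand-in for Row: a thin wrapper over one NumPy row."""

    def __init__(self, values, schema, index):
        # Bypass our own __setattr__ while the internal fields are set up.
        object.__setattr__(self, "_values", values)
        object.__setattr__(self, "_schema", schema)  # column name -> position
        object.__setattr__(self, "index", index)     # row number in the source table

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        schema = object.__getattribute__(self, "_schema")
        if name in schema:
            return object.__getattribute__(self, "_values")[schema[name]]
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # Writes to column names go straight into the underlying array.
        if name in self._schema:
            self._values[self._schema[name]] = value
        else:
            object.__setattr__(self, name, value)

    def to_dict(self):
        return {name: self._values[i] for name, i in self._schema.items()}

    def copy(self):
        # Copy the values; the schema is shared because the column layout is unchanged.
        return RowSketch(self._values.copy(), self._schema, self.index)
```

If the row is created from a NumPy view such as table[1], field assignments write through to the table, while copy() yields an independent row that still remembers the original row number.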
1-3 Column
Column inherits from pandas.Series and represents a single-column data view.
Besides basic series operations, Column supports fast filtering based on equality conditions. The related index is maintained uniformly by DataModel, built and cached when needed.
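The following sketch shows one way a pandas.Series subclass could offer equality filtering through a cached value-to-positions index. The names ColumnSketch, positions_equal, and where_equal are hypothetical. For brevity the sketch caches the index on the column itself, whereas the text above says DataModel maintains it, and it does not handle in-place mutation of the column.

```python
import pandas as pd


class ColumnSketch(pd.Series):
    """Hypothetical stand-in for Column; not the project's actual class."""

    # Class-level default. Deliberately not listed in `_metadata`, so derived
    # Series (slices, filter results) start without a stale index.
    _equality_index = None

    @property
    def _constructor(self):
        # Keep the subclass type when pandas builds new Series from this one.
        return ColumnSketch

    def positions_equal(self, value):
        # Build the value -> positions mapping once, then answer from the cache.
        if self._equality_index is None:
            index = {}
            for pos, v in enumerate(self.tolist()):
                index.setdefault(v, []).append(pos)
            self._equality_index = index
        return self._equality_index.get(value, [])

    def where_equal(self, value):
        # Positional selection of the matching entries.
        return self.iloc[self.positions_equal(value)]
```

After the first lookup, each further equality query on the same column is a dictionary access instead of a full scan.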
2 Data Access Methods
DataModel provides multiple access granularities to fit different analysis stages:
- row access: get Row by row number or a set of row numbers;
- column access: get Column by column name;
- conditional filtering: get a subtable based on indices or boolean conditions;
- range read: read a continuous statement range by stmt_id, used for statement-level analysis.
These interfaces are used repeatedly in semantic analysis, state-flow construction, and subsequent data propagation.
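Expressed directly in pandas, the four access granularities correspond roughly to the operations below. This is a sketch of the underlying idea rather than the DataModel interface: the helper read_stmt_range, its half-open range convention, and the "operation" column are assumptions.

```python
import pandas as pd


def read_stmt_range(df, start_id, end_id):
    """Return rows whose stmt_id lies in [start_id, end_id) (assumed convention)."""
    mask = (df["stmt_id"] >= start_id) & (df["stmt_id"] < end_id)
    return df[mask]


table = pd.DataFrame({
    "stmt_id": [10, 11, 12, 13],
    "operation": ["assign", "call", "assign", "return"],
})

one_row = table.iloc[1]                       # row access by row number
some_rows = table.iloc[[0, 2]]                # row access by a set of row numbers
column = table["operation"]                   # column access by name
calls = table[table["operation"] == "call"]   # conditional filtering
block = read_stmt_range(table, 11, 13)        # continuous statement range
```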
3 Data Updates
The data layer allows modifications to existing results, including:
- update a single cell or an entire row;
- batch modify column content;
- rename columns;
- delete rows that match conditions.
All update operations invalidate related caches to ensure subsequent access results match the actual data.
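As a rough guide, the update operations listed above map onto pandas calls like the ones below. The table contents and the "operation" column are made up for illustration; a wrapper such as DataModel would presumably also drop its caches after each step, which this sketch leaves out.

```python
import pandas as pd

table = pd.DataFrame({
    "stmt_id": [10, 11, 12],
    "operation": ["assign", "call", "assign"],
})

# Update a single cell.
table.at[0, "operation"] = "declare"

# Update an entire row (values given in column order).
table.loc[1] = [21, "invoke"]

# Batch-modify the content of one column.
table["operation"] = table["operation"].str.upper()

# Rename a column.
table = table.rename(columns={"operation": "op"})

# Delete rows that match a condition, then renumber the remaining rows.
table = table[table["op"] != "ASSIGN"].reset_index(drop=True)
```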
4 Data Persistence
DataModel supports writing the current table to a file in Feather format and reloading it in later stages. When saving, it can optionally reset the row index to ensure reproducibility of serialization results.
This mechanism is mainly used for result passing between analysis stages and offline inspection of intermediate states.
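A save/load round trip with an optional index reset could be sketched as follows. The function names and the reset_index parameter are hypothetical; DataFrame.to_feather and pandas.read_feather are standard pandas calls that need the optional pyarrow dependency.

```python
import pandas as pd


def prepare_for_save(df, reset_index=True):
    """Return the frame to serialize, optionally with a fresh 0..n-1 row index."""
    return df.reset_index(drop=True) if reset_index else df


def save_table(df, path, reset_index=True):
    # Writing Feather requires pyarrow to be installed.
    prepare_for_save(df, reset_index).to_feather(path)


def load_table(path):
    # Reload a table written by save_table in a later analysis stage.
    return pd.read_feather(path)
```

Resetting the index before writing means that two tables with the same rows in the same order serialize with identical row labels, regardless of which filtering or deletion steps produced them.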
5 Summary
The data layer organizes intermediate analysis results around DataModel, which wraps tabular data and provides row-, column-, and range-level access. The design relies on existing data processing libraries for storage and querying, which avoids introducing additional data management components while still supporting the repeated reads and writes of intermediate results that program analysis requires.