[Refactor] Add new AST node types and resolve AST TODOs#99
[Refactor] Add new AST node types and resolve AST TODOs#99colinthebomb1 wants to merge 7 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR refactors the internal AST representation to better model several SQL constructs (types/keywords, value lists, intervals, CASE expressions, and DISTINCT variants) and updates the expected AST fixtures accordingly.
Changes:
- Introduces new AST node types:
TypeNode,ListNode,IntervalNode, andCaseNode, plus newNodeTypeenum values. - Extends
SelectNodeto representDISTINCTandDISTINCT ON. - Updates
data/asts.pyexpected ASTs (and tweaks formatter tests) to use the new node types.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
core/ast/node.py |
Adds new node classes and extends SelectNode to model DISTINCT/DISTINCT ON. |
core/ast/enums.py |
Adds NodeType enum values for the new node types. |
data/asts.py |
Updates expected AST fixtures to use TypeNode, ListNode, IntervalNode, CaseNode, and DISTINCT metadata. |
tests/test_query_formatter.py |
Adjusts/relaxes formatter tests by commenting out some formatter calls and minor assertion key change. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
HazelYuAhiru
left a comment
There was a problem hiding this comment.
Overall looks good to me! Just to confirm, are we now using a more specific approach (creating specific new node types when they are encountered, such as IntervalNode and WhenThenNode) rather than a more general approach for handling new queries? Also, should we add code in ast_util to visualize the new nodes as well?
baiqiushi
left a comment
There was a problem hiding this comment.
Good refactoring overall — replacing FunctionNode workarounds with proper typed AST nodes is a solid improvement. CI is green. A few issues to address:
Bug: SelectNode._distinct_on is not included in children, which breaks generic AST traversals.
Suggestion: The hardcoded allowlists in DataTypeNode and TimeUnitNode are quite restrictive and will throw ValueError on valid SQL types/units not in the set.
| def __init__(self, _items: List['Node'], **kwargs): | ||
| """SELECT clause node. _distinct_on is the list of expressions for DISTINCT ON (e.g. ListNode of columns).""" | ||
| def __init__(self, _items: List['Node'], _distinct: bool = False, _distinct_on: Optional['Node'] = None, **kwargs): | ||
| super().__init__(NodeType.SELECT, children=_items, **kwargs) |
There was a problem hiding this comment.
Bug: _distinct_on is stored as an attribute but never added to children. Any generic AST traversal (formatting, rewriting, analysis) will silently skip the DISTINCT ON columns. This is inconsistent with how IntervalNode and CaseNode correctly include optional parts in their children.
Suggested fix:
| super().__init__(NodeType.SELECT, children=_items, **kwargs) | |
| children = list(_items) | |
| if _distinct_on is not None: | |
| children.append(_distinct_on) | |
| super().__init__(NodeType.SELECT, children=children, **kwargs) |
|
|
||
| class DataTypeNode(Node): | ||
| """SQL data type node used in CAST expressions (e.g. TEXT, DATE, INTEGER)""" | ||
| SQL_DATA_TYPES = {"TEXT", "DATE", "INTEGER", "TIMESTAMP", "VARCHAR", "BOOLEAN", "FLOAT"} |
There was a problem hiding this comment.
This allowlist is quite restrictive — common SQL types like BIGINT, DECIMAL, NUMERIC, CHAR, DOUBLE, REAL, SMALLINT, SERIAL, BYTEA, JSON are missing. This will raise ValueError on valid SQL.
Consider either expanding the set significantly or removing the strict validation (and relying on the parser to only produce valid types).
|
|
||
| class TimeUnitNode(Node): | ||
| """SQL time unit node used in INTERVAL and temporal functions (e.g. DAY, MONTH, SECOND)""" | ||
| TIME_UNITS = {"SECOND", "MINUTE", "HOUR", "DAY", "WEEK", "MONTH", "YEAR"} |
There was a problem hiding this comment.
Similarly, missing time units like QUARTER, MICROSECOND, MILLISECOND which are valid in PostgreSQL/MySQL.
| LITERAL = "literal" | ||
| DATA_TYPE = "data_type" | ||
| TIME_UNIT = "time_unit" | ||
| LIST = "list" |
There was a problem hiding this comment.
Nit: trailing whitespace on this line, and double space before = on the INTERVAL line below.
Overview
This PR adds new node types and adds functionality to existing ones to improve the constructed ASTs for test files, resolving most of the TODOs identified in PR #97.
Code Changes
DataTypeNodefor SQL data types used in CAST expressions (TEXT,DATE,INTEGER, etc.)TimeUnitNodefor SQL time units used in INTERVAL and temporal functions (DAY,SECOND,MONTH, etc.)ListNodefor value lists (e.g. the RHS ofINexpressions) — replaces raw Python listsIntervalNodeforINTERVALexpressions — replacesFunctionNode("INTERVAL", ...)WhenThenNodefor individual WHEN/THEN branches within a CASE expressionCaseNodeforCASE WHEN ... THEN ... ELSE ... END— usesWhenThenNodeinstead of raw Python tuples_distinctand_distinct_onparameters toSelectNodeforSELECT DISTINCT/DISTINCT ONNULLrepresented asLiteralNode(None)rather than a separate node typeNodeType.DATA_TYPE,TIME_UNIT,LIST,INTERVAL,CASE,WHEN_THENto enumsdata/asts.pyto use the new node types