[Refactor] Add new AST node types and resolve AST TODOs #99

Open
colinthebomb1 wants to merge 7 commits into main from feature/improve-ast-todos

Conversation

colinthebomb1 (Collaborator) commented Feb 28, 2026

Overview

This PR adds new node types and extends existing ones to improve the ASTs constructed for the test files, resolving most of the TODOs identified in PR #97.

Code Changes

  • Added DataTypeNode for SQL data types used in CAST expressions (TEXT, DATE, INTEGER, etc.)
  • Added TimeUnitNode for SQL time units used in INTERVAL and temporal functions (DAY, SECOND, MONTH, etc.)
  • Added ListNode for value lists (e.g. the RHS of IN expressions) — replaces raw Python lists
  • Added IntervalNode for INTERVAL expressions — replaces FunctionNode("INTERVAL", ...)
  • Added WhenThenNode for individual WHEN/THEN branches within a CASE expression
  • Added CaseNode for CASE WHEN ... THEN ... ELSE ... END — uses WhenThenNode instead of raw Python tuples
  • Added _distinct and _distinct_on parameters to SelectNode for SELECT DISTINCT / DISTINCT ON
  • Represented SQL NULL as LiteralNode(None) rather than adding a separate node type
  • Added NodeType.DATA_TYPE, TIME_UNIT, LIST, INTERVAL, CASE, WHEN_THEN to enums
  • Updated expected ASTs in data/asts.py to use the new node types
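
For illustration, the node shapes listed above could look roughly like the sketch below. This is not the code in core/ast/node.py: the base Node class, the constructor parameter names, and the count_nodes helper are simplified assumptions, kept only to show how ListNode, WhenThenNode, and CaseNode replace raw lists and tuples while exposing every part through children.

```python
from enum import Enum
from typing import Any, List, Optional


class NodeType(Enum):
    LITERAL = "literal"
    LIST = "list"
    CASE = "case"
    WHEN_THEN = "when_then"


class Node:
    """Simplified stand-in for the project's base Node (assumption)."""

    def __init__(self, type: NodeType, children: Optional[List["Node"]] = None):
        self.type = type
        self.children = list(children or [])


class LiteralNode(Node):
    def __init__(self, value: Any):
        super().__init__(NodeType.LITERAL)
        self.value = value  # None stands for SQL NULL


class ListNode(Node):
    """Value list, e.g. the right-hand side of an IN expression."""

    def __init__(self, items: List[Node]):
        super().__init__(NodeType.LIST, children=items)


class WhenThenNode(Node):
    """One WHEN ... THEN ... branch, replacing a raw (when, then) tuple."""

    def __init__(self, when: Node, then: Node):
        super().__init__(NodeType.WHEN_THEN, children=[when, then])
        self.when = when
        self.then = then


class CaseNode(Node):
    """CASE expression; the optional ELSE value is also a child."""

    def __init__(self, whens: List[WhenThenNode], else_val: Optional[Node] = None):
        children: List[Node] = list(whens)
        if else_val is not None:
            children.append(else_val)
        super().__init__(NodeType.CASE, children=children)
        self.whens = whens
        self.else_val = else_val


def count_nodes(node: Node) -> int:
    """Count a node and all of its descendants via children."""
    return 1 + sum(count_nodes(child) for child in node.children)


# CASE WHEN TRUE THEN 'yes' ELSE NULL END
case = CaseNode(
    [WhenThenNode(LiteralNode(True), LiteralNode("yes"))],
    else_val=LiteralNode(None),
)
# x IN (1, 2, 3) would carry this list on its right-hand side
in_list = ListNode([LiteralNode(1), LiteralNode(2), LiteralNode(3)])
```

Because each branch and list item is a real node, a generic helper such as count_nodes sees the whole expression without special-casing tuples or Python lists.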

Copilot AI (Contributor) left a comment

Pull request overview

This PR refactors the internal AST representation to better model several SQL constructs (types/keywords, value lists, intervals, CASE expressions, and DISTINCT variants) and updates the expected AST fixtures accordingly.

Changes:

  • Introduces new AST node types: TypeNode, ListNode, IntervalNode, and CaseNode, plus new NodeType enum values.
  • Extends SelectNode to represent DISTINCT and DISTINCT ON.
  • Updates data/asts.py expected ASTs (and tweaks formatter tests) to use the new node types.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File | Description
core/ast/node.py | Adds new node classes and extends SelectNode to model DISTINCT/DISTINCT ON.
core/ast/enums.py | Adds NodeType enum values for the new node types.
data/asts.py | Updates expected AST fixtures to use TypeNode, ListNode, IntervalNode, CaseNode, and DISTINCT metadata.
tests/test_query_formatter.py | Adjusts/relaxes formatter tests by commenting out some formatter calls and making a minor assertion key change.

colinthebomb1 and others added 4 commits March 3, 2026 12:30
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

HazelYuAhiru (Collaborator) left a comment

Overall looks good to me! Just to confirm, are we now using a more specific approach (creating specific new node types when they are encountered, such as IntervalNode and WhenThenNode) rather than a more general approach for handling new queries? Also, should we add code in ast_util to visualize the new nodes as well?
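
On the visualization question: if every new node keeps its parts in children, a generic printer may not need per-type code. A minimal sketch under that assumption follows; the Node stand-in and the render_tree helper are hypothetical and are not the existing ast_util API.

```python
from typing import List, Optional


class Node:
    """Hypothetical stand-in: a label plus child nodes."""

    def __init__(self, label: str, children: Optional[List["Node"]] = None):
        self.label = label
        self.children = list(children or [])


def render_tree(node: Node, depth: int = 0) -> List[str]:
    """Return one indented line per node, in depth-first order."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(render_tree(child, depth + 1))
    return lines


# INTERVAL '7' DAY modelled as an interval node with a value and a unit
interval = Node("INTERVAL", [Node("LITERAL 7"), Node("TIME_UNIT DAY")])
print("\n".join(render_tree(interval)))
```

Any node type that hides a part outside children would be missing from this output, which is one reason to keep the new nodes consistent on that point.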

baiqiushi (Contributor) left a comment

Good refactoring overall — replacing FunctionNode workarounds with proper typed AST nodes is a solid improvement. CI is green. A few issues to address:

Bug: SelectNode._distinct_on is not included in children, which breaks generic AST traversals.

Suggestion: The hardcoded allowlists in DataTypeNode and TimeUnitNode are quite restrictive and will throw ValueError on valid SQL types/units not in the set.

-    def __init__(self, _items: List['Node'], **kwargs):
+    """SELECT clause node. _distinct_on is the list of expressions for DISTINCT ON (e.g. ListNode of columns)."""
+    def __init__(self, _items: List['Node'], _distinct: bool = False, _distinct_on: Optional['Node'] = None, **kwargs):
         super().__init__(NodeType.SELECT, children=_items, **kwargs)

Bug: _distinct_on is stored as an attribute but never added to children. Any generic AST traversal (formatting, rewriting, analysis) will silently skip the DISTINCT ON columns. This is inconsistent with how IntervalNode and CaseNode correctly include optional parts in their children.

Suggested fix:

Suggested change
-        super().__init__(NodeType.SELECT, children=_items, **kwargs)
+        children = list(_items)
+        if _distinct_on is not None:
+            children.append(_distinct_on)
+        super().__init__(NodeType.SELECT, children=children, **kwargs)
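
To make the impact concrete, here is a small self-contained sketch of a generic traversal with and without the DISTINCT ON node in children. The classes are simplified stand-ins, and the include_distinct_on flag exists only in this example to reproduce both behaviours side by side.

```python
from typing import Iterator, List, Optional


class Node:
    """Simplified stand-in for the project's base Node (assumption)."""

    def __init__(self, kind: str, children: Optional[List["Node"]] = None):
        self.kind = kind
        self.children = list(children or [])


class ColumnNode(Node):
    def __init__(self, name: str):
        super().__init__("column")
        self.name = name


class SelectNode(Node):
    def __init__(
        self,
        items: List[Node],
        distinct_on: Optional[Node] = None,
        include_distinct_on: bool = True,  # demo-only switch, not in the PR
    ):
        children = list(items)
        if include_distinct_on and distinct_on is not None:
            children.append(distinct_on)
        super().__init__("select", children=children)
        self.distinct_on = distinct_on


def walk(node: Node) -> Iterator[Node]:
    """Generic depth-first traversal that only follows children."""
    yield node
    for child in node.children:
        yield from walk(child)


def column_names(root: Node) -> List[str]:
    return [n.name for n in walk(root) if isinstance(n, ColumnNode)]


# SELECT DISTINCT ON (dept) name ...
on_clause = Node("list", [ColumnNode("dept")])
buggy = SelectNode([ColumnNode("name")], distinct_on=on_clause, include_distinct_on=False)
fixed = SelectNode([ColumnNode("name")], distinct_on=on_clause)

print(column_names(buggy))  # the DISTINCT ON column is skipped
print(column_names(fixed))  # the DISTINCT ON column is visited
```

One thing to decide if the fix is adopted: consumers that treat children as exactly the select items would then also see the DISTINCT ON node, so they may need to filter it out by comparing against the _distinct_on attribute.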


class DataTypeNode(Node):
    """SQL data type node used in CAST expressions (e.g. TEXT, DATE, INTEGER)"""
    SQL_DATA_TYPES = {"TEXT", "DATE", "INTEGER", "TIMESTAMP", "VARCHAR", "BOOLEAN", "FLOAT"}

This allowlist is quite restrictive — common SQL types like BIGINT, DECIMAL, NUMERIC, CHAR, DOUBLE, REAL, SMALLINT, SERIAL, BYTEA, JSON are missing. This will raise ValueError on valid SQL.

Consider either expanding the set significantly or removing the strict validation (and relying on the parser to only produce valid types).
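
As one possible shape for the "remove the strict validation" option, the sketch below normalizes the type name and only rejects input that cannot be an identifier-like type name. It is a standalone illustration: the class does not inherit from the project's Node, the regular expression is an assumption, and parameterized forms such as VARCHAR(255) are deliberately out of scope.

```python
import re


class DataTypeNode:
    """Lenient data type node: normalizes the name instead of using an allowlist."""

    # Letters, digits, underscores, and single spaces (e.g. DOUBLE PRECISION).
    _NAME_RE = re.compile(r"[A-Z][A-Z0-9_ ]*")

    def __init__(self, name: str):
        # Upper-case and collapse runs of whitespace into single spaces.
        normalized = " ".join(name.upper().split())
        if not self._NAME_RE.fullmatch(normalized):
            raise ValueError(f"Not a plausible SQL type name: {name!r}")
        self.name = normalized


print(DataTypeNode("bigint").name)
print(DataTypeNode("double   precision").name)
```

This still catches obvious mistakes such as an empty string, while leaving the question of which types a dialect supports to the parser.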


class TimeUnitNode(Node):
    """SQL time unit node used in INTERVAL and temporal functions (e.g. DAY, MONTH, SECOND)"""
    TIME_UNITS = {"SECOND", "MINUTE", "HOUR", "DAY", "WEEK", "MONTH", "YEAR"}

Similarly, missing time units like QUARTER, MICROSECOND, MILLISECOND which are valid in PostgreSQL/MySQL.
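
If the allowlist is kept for time units, the "expand the set" option might look like the sketch below. It is a standalone illustration rather than the PR's class: it does not inherit from Node, and it adds only the three units named in this comment plus case-insensitive matching.

```python
class TimeUnitNode:
    """Time unit node with an expanded, case-insensitive allowlist (sketch)."""

    TIME_UNITS = frozenset({
        "MICROSECOND", "MILLISECOND", "SECOND", "MINUTE", "HOUR",
        "DAY", "WEEK", "MONTH", "QUARTER", "YEAR",
    })

    def __init__(self, name: str):
        unit = name.strip().upper()
        if unit not in self.TIME_UNITS:
            raise ValueError(f"Unsupported time unit: {name!r}")
        self.name = unit


print(TimeUnitNode("quarter").name)
```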

LITERAL = "literal"
DATA_TYPE = "data_type"
TIME_UNIT = "time_unit"
LIST = "list"

Nit: trailing whitespace on this line, and double space before = on the INTERVAL line below.

4 participants