Skip to content

[GRAPH-005] Graph Aggregations #623

@jtnelson

Description

@jtnelson

Phase: 5

Summary

Add graph-structural aggregate functions to the existing calculate infrastructure: degree centrality (in-degree, out-degree, total degree), connected components, and PageRank. These extend the existing count/sum/avg/min/max calculations with graph-aware analytics.

Motivation

  • Analytics: Users need to answer "who has the most connections?", "what are the clusters in my data?", "which records are most influential?"
  • Natural extension: Concourse already has a Calculator interface. Graph aggregations fit this pattern.
  • Operational insights: Degree distributions reveal data quality issues. Connected components reveal structural partitions.

New API Methods

Degree Centrality

long degree(long record);
long degree(String key, long record);
long indegree(long record);
long indegree(String key, long record);
long outdegree(long record);
long outdegree(String key, long record);

// Bulk variants
Map<Long, Long> degree(Collection<Long> records);
Map<Long, Long> degree(Criteria criteria);
Map<Long, Long> degree(String ccl);
// Same pattern for indegree, outdegree

Connected Components

Set<Set<Long>> connectedComponents();
Set<Set<Long>> connectedComponents(Collection<String> viaKeys);
long componentOf(long record);

PageRank

Map<Long, Double> pagerank();
Map<Long, Double> pagerank(Collection<String> viaKeys);
Map<Long, Double> pagerank(Collection<Long> records);
Map<Long, Double> pagerank(Criteria criteria);
Map<Long, Double> pagerank(String ccl);

API Design: New Graph Interface (Recommended)

Consolidate ALL graph operations under a dedicated namespace:

Graph graph = concourse.graph();
long d = graph.degree(record);
Map<Long, Double> pr = graph.pagerank();
Subgraph sg = graph.neighborhood(record, 2);
boolean r = graph.reachable(1, 42, 10);

Existing top-level navigate(), trace(), link(), unlink(), consolidate() remain for backward compatibility.

Algorithm Details

Algorithm Method Complexity
Degree Index lookup O(1) with graph index
Connected Components Union-Find O(V + E * alpha(V))
PageRank Power iteration O(iterations * (V + E)), default 20 iterations

PageRank Safety

  • Cap iterations (default 20, max 100)
  • Dangling nodes: redistribute rank equally across all nodes
  • Convergence tolerance: 1e-6

CCL Grammar Changes (ccl project)

Degree Functions

Fit existing CALCULATE command syntax -- may not need new grammar tokens if dispatched by function name string.

cash> calculate degree 1
cash> calculate indegree "friends" 1
cash> calculate outdegree where department = "eng"

New Top-Level Commands (for whole-graph operations)

New Tokens: CONNECTED_COMPONENTS, PAGERANK

New Productions:

connected_components [via <keys>]
pagerank [via <keys>] | pagerank <records> | pagerank where <criteria>

New Symbol Classes (2): ConnectedComponentsSymbol, PagerankSymbol

CaSH Examples

cash> connected_components
[[1, 3, 7, 12], [42, 43, 44], [100]]

cash> connected_components via friends
[[1, 3, 7], [12, 42], [43, 44], [100]]

cash> pagerank
{1: 0.042, 3: 0.038, 7: 0.105, 12: 0.003, ...}

cash> pagerank 1, 3, 7
{1: 0.042, 3: 0.038, 7: 0.105}

Thrift API Changes

  • ~20 new method signatures (degree/indegree/outdegree variants, connected components, pagerank)
  • 5 new TCommandVerb values: DEGREE = 30, INDEGREE = 31, OUTDEGREE = 32, CONNECTED_COMPONENTS = 33, PAGERANK = 34

Server-Side Implementation

New Classes

  • GraphAggregations.java -- algorithm implementations
  • UnionFind.java -- disjoint set data structure for connected components

Touch Surfaces

Layer File(s) Change
Thrift IDL interface/concourse.thrift ~20 new method signatures
Thrift IDL interface/data.thrift 5 new TCommandVerb values
Thrift codegen All stubs Regenerate
New class GraphAggregations.java Algorithm implementations
New class UnionFind.java Disjoint set data structure
Server ConcourseServer.java ~20 handler methods
Server ops Operations.java Atomic wrappers
Server dispatch TCommandDispatcher.java 5 verb mappings
Server calc Calculations.java Route graph aggregates
Driver Concourse.java or Graph.java New methods
Driver impl ConcourseThriftDriver.java Thrift implementations
CCL grammar ccl/grammar/grammar.jjt CONNECTED_COMPONENTS, PAGERANK tokens + productions
CCL symbols ccl/src/.../command/ 2 new symbol classes
CaSH ShellEngine.java, ApiMethodCatalog.java, CompletionService.java Dispatch + completions
Plugin API StatefulConcourseService.java New methods
All drivers Python, PHP, Ruby New methods
Integration tests GraphAggregationsTest.java
Unit tests UnionFindTest.java
Documentation docs/guide/src/graph.md Aggregations section
Changelog CHANGELOG.md Entry under 1.0.0 (TBD)

Testing Plan

Degree: counts both directions, filters by key, zero for isolated records
Connected Components: single component, multiple disconnected, single node, via keys filtering
PageRank: sums to ~1, higher for well-linked nodes, converges, handles dangling nodes
UnionFind: union merges, find returns root, path compression, components grouping

Dependencies

  • Depends on: GRAPH-002 (Graph Index) -- all aggregations use the index
  • Benefits from: GRAPH-004 (Subgraph Extraction) -- aggregations can be scoped to subgraphs

Acceptance Criteria

  • degree/indegree/outdegree return correct counts
  • Connected components correctly partitions the graph
  • PageRank produces valid scores (sum to ~1, converges)
  • All methods work with viaKeys filtering
  • Degree functions integrate with calculate syntax
  • Connected components and PageRank have dedicated CaSH commands
  • All tests pass
  • Changelog and docs updated

Risks

  • PageRank memory: Scores for every record must fit in memory
  • Connected components on large graphs: Union-Find is linear but requires iterating all edges
  • Staleness: Results may be slightly stale during concurrent writes

Complexity: Medium-High

Full spec: docs/specs/graph/GRAPH-005-graph-aggregations.md

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions