-
Notifications
You must be signed in to change notification settings - Fork 231
Description
Phase: 5
Summary
Add graph-structural aggregate functions to the existing calculate infrastructure: degree centrality (in-degree, out-degree, total degree), connected components, and PageRank. These extend the existing count/sum/avg/min/max calculations with graph-aware analytics.
Motivation
- Analytics: Users need to answer "who has the most connections?", "what are the clusters in my data?", "which records are most influential?"
- Natural extension: Concourse already has a
Calculatorinterface. Graph aggregations fit this pattern. - Operational insights: Degree distributions reveal data quality issues. Connected components reveal structural partitions.
New API Methods
Degree Centrality
long degree(long record);
long degree(String key, long record);
long indegree(long record);
long indegree(String key, long record);
long outdegree(long record);
long outdegree(String key, long record);
// Bulk variants
Map<Long, Long> degree(Collection<Long> records);
Map<Long, Long> degree(Criteria criteria);
Map<Long, Long> degree(String ccl);
// Same pattern for indegree, outdegreeConnected Components
Set<Set<Long>> connectedComponents();
Set<Set<Long>> connectedComponents(Collection<String> viaKeys);
long componentOf(long record);PageRank
Map<Long, Double> pagerank();
Map<Long, Double> pagerank(Collection<String> viaKeys);
Map<Long, Double> pagerank(Collection<Long> records);
Map<Long, Double> pagerank(Criteria criteria);
Map<Long, Double> pagerank(String ccl);API Design: New Graph Interface (Recommended)
Consolidate ALL graph operations under a dedicated namespace:
Graph graph = concourse.graph();
long d = graph.degree(record);
Map<Long, Double> pr = graph.pagerank();
Subgraph sg = graph.neighborhood(record, 2);
boolean r = graph.reachable(1, 42, 10);Existing top-level navigate(), trace(), link(), unlink(), consolidate() remain for backward compatibility.
Algorithm Details
| Algorithm | Method | Complexity |
|---|---|---|
| Degree | Index lookup | O(1) with graph index |
| Connected Components | Union-Find | O(V + E * alpha(V)) |
| PageRank | Power iteration | O(iterations * (V + E)), default 20 iterations |
PageRank Safety
- Cap iterations (default 20, max 100)
- Dangling nodes: redistribute rank equally across all nodes
- Convergence tolerance: 1e-6
CCL Grammar Changes (ccl project)
Degree Functions
Fit existing CALCULATE command syntax -- may not need new grammar tokens if dispatched by function name string.
cash> calculate degree 1
cash> calculate indegree "friends" 1
cash> calculate outdegree where department = "eng"
New Top-Level Commands (for whole-graph operations)
New Tokens: CONNECTED_COMPONENTS, PAGERANK
New Productions:
connected_components [via <keys>]
pagerank [via <keys>] | pagerank <records> | pagerank where <criteria>
New Symbol Classes (2): ConnectedComponentsSymbol, PagerankSymbol
CaSH Examples
cash> connected_components
[[1, 3, 7, 12], [42, 43, 44], [100]]
cash> connected_components via friends
[[1, 3, 7], [12, 42], [43, 44], [100]]
cash> pagerank
{1: 0.042, 3: 0.038, 7: 0.105, 12: 0.003, ...}
cash> pagerank 1, 3, 7
{1: 0.042, 3: 0.038, 7: 0.105}
Thrift API Changes
- ~20 new method signatures (degree/indegree/outdegree variants, connected components, pagerank)
- 5 new
TCommandVerbvalues:DEGREE = 30,INDEGREE = 31,OUTDEGREE = 32,CONNECTED_COMPONENTS = 33,PAGERANK = 34
Server-Side Implementation
New Classes
GraphAggregations.java-- algorithm implementationsUnionFind.java-- disjoint set data structure for connected components
Touch Surfaces
| Layer | File(s) | Change |
|---|---|---|
| Thrift IDL | interface/concourse.thrift |
~20 new method signatures |
| Thrift IDL | interface/data.thrift |
5 new TCommandVerb values |
| Thrift codegen | All stubs | Regenerate |
| New class | GraphAggregations.java |
Algorithm implementations |
| New class | UnionFind.java |
Disjoint set data structure |
| Server | ConcourseServer.java |
~20 handler methods |
| Server ops | Operations.java |
Atomic wrappers |
| Server dispatch | TCommandDispatcher.java |
5 verb mappings |
| Server calc | Calculations.java |
Route graph aggregates |
| Driver | Concourse.java or Graph.java |
New methods |
| Driver impl | ConcourseThriftDriver.java |
Thrift implementations |
| CCL grammar | ccl/grammar/grammar.jjt |
CONNECTED_COMPONENTS, PAGERANK tokens + productions |
| CCL symbols | ccl/src/.../command/ |
2 new symbol classes |
| CaSH | ShellEngine.java, ApiMethodCatalog.java, CompletionService.java |
Dispatch + completions |
| Plugin API | StatefulConcourseService.java |
New methods |
| All drivers | Python, PHP, Ruby | New methods |
| Integration tests | GraphAggregationsTest.java |
|
| Unit tests | UnionFindTest.java |
|
| Documentation | docs/guide/src/graph.md |
Aggregations section |
| Changelog | CHANGELOG.md |
Entry under 1.0.0 (TBD) |
Testing Plan
Degree: counts both directions, filters by key, zero for isolated records
Connected Components: single component, multiple disconnected, single node, via keys filtering
PageRank: sums to ~1, higher for well-linked nodes, converges, handles dangling nodes
UnionFind: union merges, find returns root, path compression, components grouping
Dependencies
- Depends on: GRAPH-002 (Graph Index) -- all aggregations use the index
- Benefits from: GRAPH-004 (Subgraph Extraction) -- aggregations can be scoped to subgraphs
Acceptance Criteria
-
degree/indegree/outdegreereturn correct counts - Connected components correctly partitions the graph
- PageRank produces valid scores (sum to ~1, converges)
- All methods work with
viaKeysfiltering - Degree functions integrate with
calculatesyntax - Connected components and PageRank have dedicated CaSH commands
- All tests pass
- Changelog and docs updated
Risks
- PageRank memory: Scores for every record must fit in memory
- Connected components on large graphs: Union-Find is linear but requires iterating all edges
- Staleness: Results may be slightly stale during concurrent writes
Complexity: Medium-High
Full spec: docs/specs/graph/GRAPH-005-graph-aggregations.md