Optimize document loading and pagination across all workspaces #823

@paullizer

Description

1

application/single_app/route_backend_public_documents.py

Like the internal list routes, this external public-docs list now loads all matching docs and paginates in Python after collapsing revisions. That can get expensive quickly for large workspaces and defeats Cosmos-side paging/sorting. Prefer filtering to current revisions directly in the query (with a legacy fallback for missing fields) so OFFSET/LIMIT can stay in Cosmos and the endpoint remains scalable.

GitHub Copilot: Read

Read skill cosmosdb-best-practices

I’m checking the actual public-documents list path against the internal list routes to see whether the suggestion is accurate and whether the current query shape is forcing Python-side pagination.

Read , lines 1 to 400

Searched for regex collapse|revision|OFFSET|LIMIT|current_revision|is_current|list_public|list.*documents (**/application/single_app/**/*.py), 200 results

Searched for files matching **/application/single_app/route_backend*_documents.py, 3 matches

I confirmed the public route is doing a full query, then select_current_documents, then Python slicing. I’m pulling the personal and group equivalents now to check whether the suggestion matches the existing pattern or if this route diverged.

Searched for regex def api_list_documents|def api_list_group_documents|SELECT \* FROM c WHERE|OFFSET|select_current_documents\(|sort_documents\( (**/application/single_app/route_backend*_documents.py), 23 results

Read , lines 1 to 260

Read , lines 1 to 260

Read , lines 280 to 360

I found the same pattern in the personal and group list endpoints, so the suggestion’s premise is directionally right on scalability, but it’s not a regression unique to this public route. I’m reading the exact list implementations and the revision helpers to see whether a Cosmos-side filter is safe with the project’s legacy revision fallback.

Read , lines 500 to 590

Read , lines 260 to 330

Read , lines 420 to 490

Read , lines 600 to 740

There’s already a stronger pattern elsewhere in the codebase: revision-aware queries fetch only the fields needed, then collapse with a current-version check plus a legacy fallback. I’m reading that helper because it’s probably the best evidence for or against the suggestion as written.

Read changed files in

Read , lines 7480 to 7575

Searched for text is_current_version|revision_family_id|search_visibility_state (**/application/single_app/**/*.py), 61 results

The main caveat is the legacy path: select_current_documents does more than check a boolean flag, because it groups by revision family and falls back to filename-based collapsing when revision metadata is missing. I’m checking the repo memory for prior revision behavior so I don’t overstate what a simple query rewrite can safely do.

Read memory

Searched for regex normalize_document_revision_families\( (**/application/single_app/**/*.py), 4 results

Read , lines 110 to 160

I have enough context to assess it, but I want exact anchors for the revision helpers before I answer so I can be precise about the caveat around legacy families and normalization.

Searched for regex def select_current_documents|def normalize_document_revision_families (**/application/single_app/functions_documents.py), 2 results

I mostly agree with it. The performance concern is real: route_backend_public_documents.py fetches all matching documents, collapses revisions in Python, and then paginates in Python. For a large public workspace, that is exactly the kind of pattern that burns RU and latency unnecessarily. It is also not unique to this route; the personal and group list endpoints do the same thing in route_backend_documents.py and route_backend_group_documents.py.

The part I would tighten is the legacy fallback wording. A simple current-version predicate is not fully equivalent to the existing behavior, because select_current_documents in functions_documents.py does more than check a flag: it collapses legacy families using revision-family metadata and, when that is missing, a filename-based fallback. The list routes do not run the normalization pass that search runs from functions_search.py through normalize_document_revision_families in functions_documents.py. So if you only add something like “is current or missing field” to the Cosmos query, older unnormalized multi-revision families can still leak duplicates and break counts/paging semantics.
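To make that caveat concrete, here is a minimal sketch of why an “is current or missing field” predicate is not equivalent to a family-aware collapse. It is not the project's select_current_documents; the `file_name` and `version` fields and the "highest version wins" rule are assumptions chosen for illustration, while `revision_family_id` and `is_current_version` are the field names mentioned in the searches above.

```python
from typing import Any

Doc = dict[str, Any]


def flag_only_filter(docs: list[Doc]) -> list[Doc]:
    """Mimic a query predicate of the form 'is current OR flag is missing'."""
    return [d for d in docs if d.get("is_current_version", True)]


def collapse_to_current(docs: list[Doc]) -> list[Doc]:
    """Keep one document per revision family, falling back to file name.

    Assumption: the highest 'version' in a family is the current revision.
    """
    latest: dict[tuple[str, str], Doc] = {}
    for doc in docs:
        family = doc.get("revision_family_id")
        key = ("family", family) if family else ("file", doc["file_name"])
        best = latest.get(key)
        if best is None or doc.get("version", 1) > best.get("version", 1):
            latest[key] = doc
    return list(latest.values())


# Two legacy revisions of the same file (no revision metadata at all),
# plus one normalized document that carries the new fields.
legacy_docs = [
    {"id": "a1", "file_name": "plan.pdf", "version": 1},
    {"id": "a2", "file_name": "plan.pdf", "version": 2},
    {"id": "b1", "file_name": "notes.md", "version": 1,
     "revision_family_id": "fam-b", "is_current_version": True},
]

leaked_ids = [d["id"] for d in flag_only_filter(legacy_docs)]
collapsed_ids = [d["id"] for d in collapse_to_current(legacy_docs)]
```

In this toy data set the flag-only filter returns both legacy revisions of plan.pdf, so total counts and page boundaries would drift, while the family-aware collapse keeps only the newest one.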

I would restate the suggestion as: move these list routes to a query-side fast path for normalized current revisions, but keep an explicit fallback or one-time normalization/migration path for legacy revision families. Also, if the API contract ever changes, Cosmos continuation tokens are a better long-term fit than deep OFFSET/LIMIT, though keeping filtering and sorting in Cosmos is still a good improvement within the current page/page_size contract.

  1. I can rewrite this into a tighter PR review comment.
  2. I can sketch the safest implementation approach for all three list routes, not just the public one.

2

application/single_app/route_backend_group_documents.py

This group list path now loads all matching documents and paginates client-side after select_current_documents(...). That can be very costly for large groups and removes Cosmos-side pagination (ORDER BY ... OFFSET ... LIMIT ...). Since the data model now tracks current revision state, prefer a Cosmos filter for current docs (with legacy fallback) so paging/sorting stays server-side and scalable.

GitHub Copilot: I’m checking the Cosmos DB guidance and the relevant group document path first, then I’ll judge whether this is a real scalability regression or just a theoretical concern.

Read skill cosmosdb-best-practices

Read , lines 1 to 400

Read

Read memory

The route does appear to materialize the full match set before slicing, so I’m tracing whether the revision flags already support a server-side filter and whether other routes already solved this better.

Searched for regex select_current_documents|is_current_version|search_visibility_state (**/application/single_app/**/*.py), 35 results

Read , lines 1 to 320

Read , lines 1 to 420

I found the same pattern in the personal and public list routes, but the key question is whether the group route can now rely on persisted current-revision flags instead of collapsing in Python. I’m checking that path next.

Read , lines 460 to 610

Read , lines 430 to 760

Read , lines 150 to 330

I’ve confirmed the current route does full materialization and in-memory pagination. I’m checking whether that was an intentional compromise tied to legacy revision handling or whether it can be replaced cleanly with a server-side current-doc filter.

Read changed files in

Read , lines 220 to 340

Searched for regex ORDER BY c\._ts|OFFSET|LIMIT (**/application/single_app/**/*.py), 100 results

I haven’t found a server-side current-only query helper yet, which makes this look like a duplicated stopgap rather than a deliberate long-term design. I’m checking whether the persisted flags are reliable enough for a filtered query with a legacy fallback.

Ran terminal command: git log -n 5 -- application/single_app/route_backend_group_documents.py

Searched for regex is_current_version = true|search_visibility_state = 'active'|search_visibility_state = "active"|IS_DEFINED\(c\.is_current_version\)|NOT IS_DEFINED\(c\.is_current_version\) (**/application/single_app/**/*.py), no results

Read , lines 2990 to 3085

The history confirms this behavior came in with the recent document versioning change, so I’m pulling that diff to see whether the fallback logic really required the in-memory collapse.

Ran terminal command: git show 5feca22 -- application/single_app/route_backend_group_documents.py

Ran terminal command: git show 5feca22 -- application/single_app/route_backend_documents.py application/single_app/route_backend_public_documents.py

The diff confirms this was a direct tradeoff: the prior server-side count and page query were dropped solely to dedupe revisions in Python. I’m reading the related route diff now to see whether there’s already a better pattern elsewhere.

Read , lines 1 to 260

I’ve got enough to assess it. I’m grabbing exact anchors for the relevant lines so I can give you a precise recommendation instead of a vague “sounds good.”

Searched for regex matching_docs = list|select_current_documents\(|total_count = len\(current_docs\)|docs = current_docs\[offset:offset \+ page_size\] (**/application/single_app/route_backend_group_documents.py), 4 results

Searched for regex "is_current_version": True|"search_visibility_state": "active"|existing_document\['is_current_version'\] = False|existing_document\['search_visibility_state'\] = 'archived' (**/application/single_app/functions_documents.py), 8 results

Searched for regex promoted_document\['is_current_version'\] = True|promoted_document\['search_visibility_state'\] = 'active' (**/application/single_app/functions_documents.py), 2 results

Searched for regex def normalize_document_revision_families|document_item\["is_current_version"\] = expected_current|document_item\["search_visibility_state"\] = "active"|document_item\["search_visibility_state"\] = "archived" (**/application/single_app/functions_documents.py), 5 results

Searched for regex legacy_q =|needs_legacy_update_check (**/application/single_app/route_backend_group_documents.py), 3 results

Searched for regex matching_docs = list|select_current_documents\(|total_count = len\(current_docs\)|docs = current_docs\[offset:offset \+ page_size\] (**/application/single_app/route_backend_documents.py|application/single_app/route_backend_public_documents.py), no results

One last check: I’m confirming whether this same pagination tradeoff exists in the sibling list routes, because if it does, the best fix is probably a shared helper rather than a one-off group patch.

Searched for regex matching_docs = list|select_current_documents\(|total_count = len\(current_docs\)|docs = current_docs\[offset:offset \+ page_size\] (**/application/single_app/route_backend_documents.py), 4 results

Searched for regex matching_docs = list|select_current_documents\(|total_count = len\(current_docs\)|docs = current_docs\[offset:offset \+ page_size\] (**/application/single_app/route_backend_public_documents.py), 5 results

I agree with the suggestion. It’s a real scalability regression, not just a stylistic preference.

The problem is visible in route_backend_group_documents.py: the route now fetches every matching document, runs select_current_documents (from functions_documents.py) in Python, then computes total_count and slices the page in memory. That means RU cost, payload size, and app memory all grow with the full result set and full revision history, instead of with the requested page size. For large groups, that’s exactly the wrong side of the tradeoff.

The suggestion is also supported by the data model. In functions_documents.py, new current revisions are persisted with is_current_version=True and search_visibility_state="active", prior revisions are archived, delete promotion restores the prior current revision, and there’s already a repair path (normalize_document_revision_families). So the route should prefer a Cosmos query filtered to current docs and keep ORDER BY ... OFFSET ... LIMIT on the server.

The one caveat is legacy data. A pure c.is_current_version = true filter can hide older docs that were never backfilled, and the route already knows legacy records may exist via its legacy update check in route_backend_group_documents.py. So the right implementation is usually:

  1. If no legacy docs are present for the requested groups, use server-side count + paged query with a current-only filter.
  2. If legacy docs are present, fall back to the current materialize-and-collapse path until those groups are normalized.
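The two branches above can be sketched as a small query builder. This is a hedged illustration, not the route's actual code: the `c.group_id` field, the single-group scope, and the helper names are assumptions, and the parameter list uses the name/value dictionary shape accepted by the azure-cosmos Python SDK's query_items call. OFFSET and LIMIT are inlined only after being coerced to integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PagedQuery:
    legacy_probe_sql: str  # counts docs that predate the revision flags
    count_sql: str         # total_count for the current-only fast path
    page_sql: str          # one page, sorted and sliced by Cosmos
    parameters: list[dict]


def build_current_only_queries(group_id: str, page: int, page_size: int) -> PagedQuery:
    """Build queries that keep filtering, sorting, and paging in Cosmos."""
    page, page_size = int(page), int(page_size)
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    offset = (page - 1) * page_size
    scope = "c.group_id = @group_id"  # assumed field name
    current = f"{scope} AND c.is_current_version = true"
    return PagedQuery(
        legacy_probe_sql=(
            f"SELECT VALUE COUNT(1) FROM c WHERE {scope} "
            "AND NOT IS_DEFINED(c.is_current_version)"
        ),
        count_sql=f"SELECT VALUE COUNT(1) FROM c WHERE {current}",
        page_sql=(
            f"SELECT * FROM c WHERE {current} "
            f"ORDER BY c._ts DESC OFFSET {offset} LIMIT {page_size}"
        ),
        parameters=[{"name": "@group_id", "value": group_id}],
    )


def choose_list_strategy(legacy_doc_count: int) -> str:
    """Pick the fast path only when the legacy probe finds nothing."""
    if legacy_doc_count > 0:
        return "materialize_and_collapse"
    return "server_side_paging"


query = build_current_only_queries("group-1", page=3, page_size=25)
```

A caller would run `legacy_probe_sql` first, pass its result to `choose_list_strategy`, and only execute `count_sql` and `page_sql` when the fast path is selected.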

One more thing: this same regression exists in the personal and public list routes too, at route_backend_documents.py and route_backend_public_documents.py. If you fix it, I’d do it as a shared helper rather than only in the group route.

If you want, I can implement that current-only server-side path with legacy fallback across all three routes.

3

application/single_app/route_backend_documents.py

These list endpoints now fetch all matching documents from Cosmos (SELECT * ... WHERE ...), then collapse/sort/paginate in Python. For users/groups with many documents this can become very expensive in RU, memory, and latency (and defeats server-side OFFSET/LIMIT). Since revision state is now explicitly modeled, consider filtering server-side for current revisions (e.g., c.is_current_version = true or c.search_visibility_state = 'active' with a legacy fallback for missing fields) so Cosmos can paginate efficiently.

GitHub Copilot: I’m checking the actual document list routes and the current revision logic first, then I’ll assess whether the suggestion is correct in this codebase and where the tradeoffs are.

Read

Read skill cosmosdb-best-practices

Read , lines 1 to 400

Read

Read memory

I’ve confirmed the repo already models current revisions explicitly. Next I’m tracing which list routes still fetch the full matching set and then collapse in Python, because that determines whether the suggestion is just valid in principle or actionable as-is here.

Searched for regex list_documents|select_current_documents|normalize_document_revision_families|OFFSET|LIMIT|page|pagination|sort (**/application/single_app/route_backend_documents.py), 53 results

Searched for regex def select_current_documents|def normalize_document_revision_families|is_current_version|search_visibility_state (**/application/single_app/functions_documents.py), 27 results

Read , lines 1 to 420

I found the personal documents list route that does exactly what the suggestion describes: broad fetch, then select_current_documents, sort, and slice in Python. I’m reading the surrounding implementation and the helper query code now to check whether a server-side filter is straightforward or whether legacy/shared-document rules make it more complicated.

Read , lines 400 to 620

Searched for regex _query_accessible_documents|get_all_documents|get_documents_by|shared_user_ids|shared_group_ids (**/application/single_app/functions_documents.py), 125 results

Read , lines 7480 to 7565

The suggestion is directionally right, but I want to be precise about the legacy edge case before I give you a yes/no. The current helper falls back to “latest per family or per legacy filename” when revision flags are missing, so a naive is_current_version = true filter would be incomplete unless those old docs have already been normalized.

Read , lines 420 to 520

Searched for regex normalize_document_revision_families\( (**/application/single_app/**/*.py), 4 results

Searched for regex @app.route\('/api/documents|@app.route\('/api/group|@app.route\('/api/public (**/application/single_app/*.py), 61 results

I’ve got the code path. I’m pulling one quick docs check on Cosmos pagination so I don’t hand-wave the RU tradeoff, because OFFSET/LIMIT and continuation tokens are not equivalent in Cosmos.

Calling documentation microsoft_docs_search...
Completed with input: {
"intent": "Find official Azure Cosmos DB guidance comparing OFFSET LIMIT and continuation tokens for efficient pagination and RU usage"
}

Read

I’m doing one last pass on the sibling group/public list routes. If they mirror this pattern, that strengthens the case for a shared helper rather than a one-off tweak in the personal route.

Searched for regex select_current_documents\(|matching_docs = list\(|offset = \(page - 1\) \* page_size|docs = current_docs\[offset:offset \+ page_size\] (**/application/single_app/route_backend_group_documents.py), 4 results

Searched for regex select_current_documents\(|matching_docs = list\(|offset = \(page - 1\) \* page_size|docs = current_docs\[offset:offset \+ page_size\] (**/application/single_app/route_backend_public_documents.py), 5 results

Read , lines 145 to 270

Read , lines 113 to 210

I’ve got enough to answer. One quick check remains: whether the existing “legacy upgrade” flow already normalizes revision flags. If it doesn’t, that’s the main reason I’d treat the suggestion as “good, but incomplete” instead of “ship it exactly as written.”

Read , lines 840 to 930

Searched for regex upgrade_legacy|legacy_update|normalize_document_revision_families|is_current_version|search_visibility_state (**/application/single_app/functions_documents.py), 27 results

Read , lines 6889 to 6975

Read , lines 6975 to 7015

I think the suggestion is good and worth acting on, but I would tighten the implementation details.

  • The performance diagnosis is accurate. The personal list route in route_backend_documents.py fetches all matching documents, then collapses revisions, sorts, and paginates in Python. For large workspaces that is expensive in RU, memory, and latency.

  • I would not replace that with only is_current_version = true yet. The current selection logic in functions_documents.py still supports legacy records by grouping on revision family when present, or on scope plus file name when it is not. The list route does not call normalize_document_revision_families, and the existing legacy upgrade flow in functions_documents.py does not backfill revision flags. So a strict current-only filter can hide valid legacy docs.

  • The right transitional move is to push down as much filtering as possible: prefer current rows server-side, use search_visibility_state = 'active' as a compatibility check, and keep a fallback path for records missing both fields until data is normalized. That will reduce the candidate set substantially, even though shared-document access still keeps the personal query cross-partition.

  • I would also tweak the pagination claim. Server-side pagination is better than materializing everything, but in Cosmos the efficient pagination primitive is continuation tokens, not large OFFSET/LIMIT pages. If the API must preserve page numbers and total_count, OFFSET/LIMIT can be a short-term step; it is not the best long-term Cosmos shape.

  • The same pattern exists in the group and public list routes, so this is really a shared-query/helper problem, not just a personal-route issue: route_backend_group_documents.py and route_backend_public_documents.py.
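The pagination point in the list above can be illustrated with a small in-memory sketch of resume-token paging. Real Cosmos continuation tokens are opaque values produced by the service, so the token format here (base64-encoded `_ts` and `id` of the last item) is purely an assumption that mimics the idea: each request resumes after the last item seen instead of skipping an ever-growing offset.

```python
import base64
import json
from typing import Any, Optional

Doc = dict[str, Any]


def encode_token(last_ts: int, last_id: str) -> str:
    """Pack the last-seen sort position into an opaque string."""
    raw = json.dumps({"ts": last_ts, "id": last_id}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token: str) -> tuple[int, str]:
    data = json.loads(base64.urlsafe_b64decode(token.encode("ascii")))
    return data["ts"], data["id"]


def fetch_page(
    docs: list[Doc], page_size: int, token: Optional[str] = None
) -> tuple[list[Doc], Optional[str]]:
    """Return (items, next_token), newest first, resuming after the token."""
    ordered = sorted(docs, key=lambda d: (-d["_ts"], d["id"]))
    if token is not None:
        last_ts, last_id = decode_token(token)
        # Keep only items that sort strictly after the last-seen position.
        ordered = [d for d in ordered if (-d["_ts"], d["id"]) > (-last_ts, last_id)]
    items = ordered[:page_size]
    has_more = len(ordered) > page_size
    next_token = encode_token(items[-1]["_ts"], items[-1]["id"]) if has_more else None
    return items, next_token


docs = [{"id": f"doc-{i}", "_ts": 1000 + i} for i in range(5)]
first, token_1 = fetch_page(docs, page_size=2)
second, token_2 = fetch_page(docs, page_size=2, token=token_1)
third, token_3 = fetch_page(docs, page_size=2, token=token_2)
```

The tradeoff is the one named above: this style gives cheap "next page" requests but no random access to page N and no free total_count, which is why it implies an API contract change.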

If I were turning that suggestion into work, I’d do it in this order:

  1. Normalize or backfill revision markers for legacy documents.
  2. Move the personal, group, and public list routes to a shared current-documents query helper.
  3. Decide whether to keep page-number pagination temporarily or switch the API to continuation tokens.
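As a rough sketch of step 1, a backfill pass could look like the following. It is not the project's normalize_document_revision_families; the `file_name` and `version` fields and the "highest version is current" rule are assumptions, while the `is_current_version` flag and the "active"/"archived" values of `search_visibility_state` come from the searches above. The function mutates the documents in place and returns only those that changed, which is what a caller would upsert.

```python
from collections import defaultdict
from typing import Any

Doc = dict[str, Any]


def backfill_revision_markers(docs: list[Doc]) -> list[Doc]:
    """Set revision markers per family and return the documents that changed."""
    families: dict[tuple[str, str], list[Doc]] = defaultdict(list)
    for doc in docs:
        family = doc.get("revision_family_id")
        key = ("family", family) if family else ("file", doc["file_name"])
        families[key].append(doc)

    changed: list[Doc] = []
    for members in families.values():
        # Assumption: the highest version number is the current revision.
        current = max(members, key=lambda d: d.get("version", 1))
        for doc in members:
            is_current = doc is current
            state = "active" if is_current else "archived"
            if (doc.get("is_current_version") != is_current
                    or doc.get("search_visibility_state") != state):
                doc["is_current_version"] = is_current
                doc["search_visibility_state"] = state
                changed.append(doc)
    return changed


docs = [
    {"id": "a1", "file_name": "plan.pdf", "version": 1},
    {"id": "a2", "file_name": "plan.pdf", "version": 2},
    {"id": "b1", "file_name": "notes.md", "version": 1,
     "revision_family_id": "fam-b", "is_current_version": True,
     "search_visibility_state": "active"},
]

first_pass_ids = [d["id"] for d in backfill_revision_markers(docs)]
second_pass_ids = [d["id"] for d in backfill_revision_markers(docs)]
```

Running the pass a second time changes nothing, so it is safe to repeat. Once every family carries these markers, a strict current-only filter in the list queries no longer hides legacy documents.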

If you want, I can turn that into a concrete refactor plan or implement the first pass.

GitHub Copilot: I’m checking the Cosmos guidance and the revision-handling caveat once, then I’ll tell you whether these comments are converging on the same fix or describing different work.

Read

Read skill cosmosdb-best-practices

I’ve loaded the general Cosmos best-practice guidance; I’m checking the repo’s explicit Cosmos instructions and the prior revision-state note to make sure the answer is grounded in this codebase’s constraints.

Read

Read memory

I’ve confirmed the shared theme; I’m pulling exact file anchors for the three list routes and the revision helper so the comparison is precise rather than just high-level.

Read , lines 150 to 280

Read , lines 500 to 590

Read , lines 110 to 220

Read , lines 420 to 520

I have the personal and public anchors; I’m grabbing the matching group-route block and the current-selection helper so the three routes can be compared apples to apples.

Read , lines 280 to 360

Read , lines 300 to 420

I have the comparison; I’m pulling exact line numbers for the shared fetch-collapse-slice pattern so I can answer with direct anchors instead of approximate references.

Searched for regex matching_docs = list|select_current_documents\(|total_count = len\(current_docs\)|docs = current_docs\[offset:offset \+ page_size\] (**/application/single_app/route_backend_documents.py), 4 results

Searched for regex matching_docs = list|select_current_documents\(|total_count = len\(current_docs\)|docs = current_docs\[offset:offset \+ page_size\] (**/application/single_app/route_backend_group_documents.py), 4 results

Searched for regex matching_docs = list|select_current_documents\(|total_count = len\(current_docs\)|docs = current_docs\[offset:offset \+ page_size\] (**/application/single_app/route_backend_public_documents.py), 5 results

Searched for regex def select_current_documents|def normalize_document_revision_families|is_current_version|search_visibility_state (**/application/single_app/functions_documents.py), 27 results

Yes. They’re all converging on the same underlying resolution, even if they phrase it a little differently.

The shared problem is the same in all three list routes: fetch the full matching Cosmos result set, collapse revisions with select_current_documents, then paginate in Python. You can see that pattern in route_backend_documents.py, route_backend_group_documents.py, and route_backend_public_documents.py. The shared caveat is also the same: current-selection is not just a simple flag check, because select_current_documents in functions_documents.py collapses legacy families too, and normalize_document_revision_families is the normalization path that repairs revision metadata.

So the comments are aiming at one common fix direction:

  • Move revision filtering closer to the Cosmos query so the list endpoints stop materializing everything first.
  • Keep a legacy fallback or backfill path, because a pure is_current_version = true filter is not fully equivalent for older data.
  • Ideally solve it once as a shared helper for personal, group, and public routes, rather than three separate one-off patches.
  • Treat continuation tokens as the better long-term Cosmos pagination model, while server-side filtering/paging is the immediate improvement.

So: same resolution family, not three unrelated suggestions. The later comments are basically refining the earlier ones by adding the legacy-data constraint and the shared-helper angle. No code changed.

Metadata

Assignees

No one assigned

    Labels

    No labels

    Type

    No type

    Projects

    Status

    Pending Evaluation

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
