Re-organize high availability cluster-based deployment docs #8812

roberson-io wants to merge 2 commits into master from
Conversation
Restructure the HA deployment documentation to match admin workflow:

- Add Preparation section with pre-deployment guidance
- Add Deployment guide section with step-by-step instructions
- Add Next steps section for scaling optimizations
- Preserve Operations and maintenance section for advanced topics
- Emphasize PostgreSQL hot_standby and hot_standby_feedback settings
- Guide admins toward database configuration over config.json

Closes #8811

Co-authored-by: Michael Roberson <roberson-io@users.noreply.github.com>
Newest code from mattermost has been published to preview environment for Git SHA 1bb1c44

Newest code from mattermost has been published to preview environment for Git SHA e5b2253
📝 Walkthrough

Reorganizes the high-availability cluster deployment documentation with a new Preparation section, expanded deployment guidance covering storage options (S3/NAS), database configuration (Aurora/RDS/self-managed PostgreSQL), NGINX proxy setup, node management procedures, network tuning, and disaster recovery sections. Adds operational details for system configuration, licensing, and cluster verification.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 1
🧹 Nitpick comments (2)
source/administration-guide/scale/high-availability-cluster-based-deployment.rst (2)
**540-540: Defaulting NFS mounts to `soft` is risky for application data consistency.**

The examples on lines 540/548/563 use `rw,soft,intr`, while line 573 only notes `hard,intr` as an alternative. For shared app data, `soft` can return I/O errors under transient network issues and may cause partial writes.

Suggested doc adjustment:

```diff
- sudo mount -t nfs -o rw,soft,intr NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data
+ sudo mount -t nfs -o rw,hard,intr NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data
  ...
- NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data nfs rw,soft,intr 0 0
+ NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data nfs rw,hard,intr 0 0
```

Also applies to: 548-548, 563-563, 573-573
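To make the audit step concrete, here is a minimal sketch (the helper name `check_soft_nfs` is hypothetical, and the sample entry mirrors the doc's placeholder `NFS_SERVER_IP` path) that flags fstab-style NFS entries still using the risky `soft` option:

```shell
#!/bin/sh
# Flag NFS entries that still use the risky "soft" mount option.
# Reads fstab-formatted text on stdin so it can be run against a sample.
check_soft_nfs() {
  grep -E '^[^#]*[[:space:]]nfs[[:space:]]' | grep -w 'soft' || true
}

# Sample entry matching the doc's example (server and paths are placeholders):
printf '%s\n' \
  'NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data nfs rw,soft,intr 0 0' \
  | check_soft_nfs
```

Running the same check against a `rw,hard,intr` entry would print nothing, which is the desired end state after the suggested doc fix.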
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/administration-guide/scale/high-availability-cluster-based-deployment.rst` at line 540, The NFS mount examples use the risky option "rw,soft,intr" which can cause I/O errors and partial writes; change the example mount options to use "rw,hard,intr" (or show both options and explicitly recommend "hard,intr" for application data) and add a brief explanatory note after the command that for shared application data you should prefer hard mounts to avoid transient-network-induced partial writes and data corruption; update every occurrence of the example mount command (the string "sudo mount -t nfs -o rw,soft,intr NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data") and add a short caution sentence referencing "hard,intr" as the recommended setting.
**709-709: Line 709 incorrectly claims RDS does not expose PostgreSQL configuration access.**

Both Aurora and RDS PostgreSQL expose configuration through DB parameter groups and cluster parameter groups, allowing administrators to tune settings like memory allocation, replication parameters, and other PostgreSQL options. Reword to clarify these configuration mechanisms are available, rather than suggesting configuration is unavailable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/administration-guide/scale/high-availability-cluster-based-deployment.rst` at line 709, Replace the incorrect sentence "Amazon RDS does not expose direct PostgreSQL configuration access" with a clarified statement that both Amazon RDS (for PostgreSQL) and Amazon Aurora expose configuration via DB parameter groups and cluster parameter groups; mention that these parameter groups allow tuning of memory, replication, and other PostgreSQL options and that monitoring should still be done via CloudWatch and RDS Performance Insights (locate the sentence containing "Amazon RDS does not expose direct PostgreSQL configuration access" in high-availability-cluster-based-deployment.rst and update it to reference DB parameter groups/cluster parameter groups and retained monitoring guidance).
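As a sketch of what the clarified doc text could point to (the parameter group names below are placeholders, not taken from the PR), managed PostgreSQL settings such as `hot_standby_feedback` are changed through parameter groups rather than `postgresql.conf`:

```shell
# Sketch only: group names are hypothetical. On RDS for PostgreSQL,
# instance-level settings are tuned via a DB parameter group:
aws rds modify-db-parameter-group \
  --db-parameter-group-name mattermost-pg-params \
  --parameters "ParameterName=hot_standby_feedback,ParameterValue=1,ApplyMethod=immediate"

# On Aurora PostgreSQL, cluster-wide settings use the cluster variant:
aws rds modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name mattermost-aurora-params \
  --parameters "ParameterName=hot_standby_feedback,ParameterValue=1,ApplyMethod=immediate"
```

Monitoring guidance (CloudWatch, RDS Performance Insights) stays unchanged; only the "no configuration access" claim needs rewording.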
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 85b65ef1-a743-47fb-ae48-49f1d75d59fd
📒 Files selected for processing (1)
source/administration-guide/scale/high-availability-cluster-based-deployment.rst
Diff context (reconstructed from the side-by-side view): the old **File storage configuration** and **Database configuration** sections are replaced by NGINX setup steps 3-6.

Removed:

```rst
File storage configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. note::

  1. File storage is assumed to be shared between all the machines that are using services such as NAS or Amazon S3.
  2. If ``"DriverName": "local"`` is used then the directory at ``"FileSettings":`` ``"Directory": "./data/"`` is expected to be a NAS location mapped as a local directory, otherwise high availability will not function correctly and may corrupt your file storage.
  3. If you're using Amazon S3 or another S3-compatible service for file storage then no other configuration is required.

If you're using the Compliance Reports feature in Mattermost Enterprise, you need to configure the ``"ComplianceSettings":`` ``"Directory": "./data/",`` to be shared between all machines or the reports will only be available from the System Console on the local Mattermost server.

Migrating to NAS or S3 from local storage is beyond the scope of this document.

Database configuration
^^^^^^^^^^^^^^^^^^^^^^

.. tip::

  Specifying configuration setting values using Mattermost environment variables ensures that they always take precedence over any other configuration settings.

For an AWS High Availability RDS cluster deployment, point the :ref:`datasource <administration-guide/configure/environment-configuration-settings:data source>` configuration setting to the write/read endpoint at the **cluster** level to benefit from the AWS failover handling. AWS takes care of promoting different database nodes to be the writer node. Mattermost doesn't need to manage this.

Use the :ref:`read replica <administration-guide/configure/environment-configuration-settings:read replicas>` feature to scale the database. The Mattermost server can be set up to use one master database and one or more read replica databases.
```

Added:

```rst
This configuration includes high-performance optimizations such as preventing servers from being marked unavailable (``max_fails=0``), user image caching for 24 hours, and extended timeouts for long-running operations. For additional NGINX performance tuning options, including worker process optimization and advanced buffer settings, see the :ref:`high-performance scaling configuration <deployment-guide/server/setup-nginx-proxy:high-performance scaling configuration>` section.

3. **Enable the site configuration:**

   **Ubuntu/Debian:**

   .. code-block:: bash

      sudo ln -sf /etc/nginx/sites-available/mattermost /etc/nginx/sites-enabled/mattermost
      sudo rm -f /etc/nginx/sites-enabled/default

   **RHEL/CentOS:** The configuration file in ``/etc/nginx/conf.d/`` is automatically enabled.

4. **Test and apply NGINX configuration:**

   .. code-block:: bash

      sudo nginx -t
      sudo systemctl restart nginx
      sudo systemctl enable nginx

5. **Configure TLS:** For production deployments, configure TLS on your NGINX proxy. See :doc:`Set up TLS </deployment-guide/server/setup-tls>` for detailed instructions on configuring TLS with NGINX. You can either use Let's Encrypt for automatic certificate management or provide your own TLS certificates.

6. **Configure health checks:** NGINX automatically stops routing traffic to backend servers that fail to respond. You can monitor server health using the Mattermost API endpoint ``http://SERVER_IP:8065/api/v4/system/ping`` which returns ``Status 200`` for healthy servers.
```
🧩 Analysis chain
🌐 Web query:
In NGINX upstream configuration, what is the exact effect of setting max_fails=0 on backend failure detection and routing behaviour?
💡 Result:

Setting max_fails=0 on an upstream server disables NGINX's passive failure "accounting" for that server: NGINX will not count unsuccessful attempts, and therefore will never mark the server "unavailable" based on max_fails/fail_timeout (so fail_timeout no longer has any effect for temporary removal of that peer). [1]

Practical routing impact:

- The server stays eligible for selection by the load-balancing algorithm even if it is repeatedly failing (it won't be temporarily skipped as "down" due to exceeded fails). [1]
- Per-request failover can still happen (e.g., `proxy_next_upstream` and related `*_next_upstream` rules determine what is an "unsuccessful attempt" and when NGINX may try another peer for that request), but those errors won't cause the peer to be quarantined for fail_timeout. [1]
- With hash-based selection, this also means NGINX won't "rehash away" from a preferred server due to it being "known down" via max_fails/fail_timeout (one reason it's used as a workaround). [2]

Sources:

[1] NGINX docs: max_fails "zero value disables the accounting of attempts" and how a server becomes "unavailable" via fail_timeout/max_fails and *_next_upstream. (nginx.org)
[2] NGINX trac ticket explaining that proxy_next_upstream doesn't prevent choosing another server when a peer is "known to be down", and that max_fails=0 makes NGINX ignore such down-status due to fails. (trac.nginx.org)
Clarify health-check documentation to explain both passive and active failure handling.
Line 360 documents that max_fails=0 disables passive failure detection (preventing the server from being marked temporarily unavailable), but line 383's statement that "NGINX automatically stops routing traffic to backend servers that fail to respond" can mislead readers into thinking this contradicts the max_fails=0 setting. In reality, those are two separate mechanisms: max_fails=0 disables passive quarantining of peers, while per-request failover (via proxy_next_upstream) is a separate active failover mechanism. Clarify that the automatic routing at line 383 refers to per-request failover, not peer-level quarantining, to avoid confusion during incident response.
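The two mechanisms the comment distinguishes can be sketched side by side in a minimal upstream block (server addresses and retry conditions here are illustrative placeholders, not the PR's actual config):

```nginx
upstream backend {
    # Passive detection disabled: failed attempts are not counted, so peers
    # are never quarantined for fail_timeout based on max_fails.
    server 10.10.10.2:8065 max_fails=0;
    server 10.10.10.4:8065 max_fails=0;
}

server {
    listen 80;
    location / {
        proxy_pass http://backend;
        # Per-request active failover: on these conditions, NGINX retries the
        # same request against the next peer, even though no peer is marked down.
        proxy_next_upstream error timeout http_502 http_503;
        proxy_next_upstream_tries 2;
    }
}
```

With this shape, "automatically stops routing traffic to failed servers" describes the per-request `proxy_next_upstream` retries, not peer-level quarantining, which is exactly the distinction the clarified doc text should draw.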
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@source/administration-guide/scale/high-availability-cluster-based-deployment.rst`
around lines 360 - 383, Clarify the health-check paragraph to explain the
difference between passive peer quarantining (controlled by max_fails=0) and
per-request active failover (controlled by proxy_next_upstream and similar
settings): explicitly state that setting max_fails=0 disables passive marking of
an upstream as unavailable, while NGINX can still perform per-request failover
using proxy_next_upstream so requests may be retried to other backends even when
peers are not quarantined; mention the Mattermost API ping endpoint
(http://SERVER_IP:8065/api/v4/system/ping) as a way to monitor server health but
note it does not change the distinction between passive and active failure
handling.