Re-organize high availability cluster-based deployment docs #8812

roberson-io wants to merge 2 commits into master from
Conversation
Restructure the HA deployment documentation to match admin workflow:

- Add Preparation section with pre-deployment guidance
- Add Deployment guide section with step-by-step instructions
- Add Next steps section for scaling optimizations
- Preserve Operations and maintenance section for advanced topics
- Emphasize PostgreSQL hot_standby and hot_standby_feedback settings
- Guide admins toward database configuration over config.json

Closes #8811

Co-authored-by: Michael Roberson <roberson-io@users.noreply.github.com>
Newest code from mattermost has been published to preview environment for Git SHA 1bb1c44

Newest code from mattermost has been published to preview environment for Git SHA e5b2253
📝 Walkthrough

Reorganizes the high-availability cluster deployment documentation with a new Preparation section, expanded deployment guidance covering storage options (S3/NAS), database configuration (Aurora/RDS/self-managed PostgreSQL), NGINX proxy setup, node management procedures, network tuning, and disaster recovery sections. Adds operational details for system configuration, licensing, and cluster verification.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 1
🧹 Nitpick comments (2)
source/administration-guide/scale/high-availability-cluster-based-deployment.rst (2)
**540-540: Defaulting NFS mounts to `soft` is risky for application data consistency.**

The examples on lines 540/548/563 use `rw,soft,intr`, while line 573 only notes `hard,intr` as an alternative. For shared app data, `soft` can return I/O errors under transient network issues and may cause partial writes.

Suggested doc adjustment:

```diff
- sudo mount -t nfs -o rw,soft,intr NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data
+ sudo mount -t nfs -o rw,hard,intr NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data
  ...
- NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data nfs rw,soft,intr 0 0
+ NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data nfs rw,hard,intr 0 0
```

Also applies to: 548-548, 563-563, 573-573
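To make the audit step concrete, here is a minimal sketch (the helper name `check_soft_nfs` is hypothetical, and the sample entry mirrors the doc's placeholder `NFS_SERVER_IP` path) that flags fstab-style NFS entries still using the risky `soft` option:

```shell
#!/bin/sh
# Flag NFS entries that still use the risky "soft" mount option.
# Reads fstab-formatted text on stdin so it can be run against a sample.
check_soft_nfs() {
  grep -E '^[^#]*[[:space:]]nfs[[:space:]]' | grep -w 'soft' || true
}

# Sample entry matching the doc's example (server and paths are placeholders):
printf '%s\n' \
  'NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data nfs rw,soft,intr 0 0' \
  | check_soft_nfs
```

Running the same check against a `rw,hard,intr` entry would print nothing, which is the desired end state after the suggested doc fix.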
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/administration-guide/scale/high-availability-cluster-based-deployment.rst` at line 540, The NFS mount examples use the risky option "rw,soft,intr" which can cause I/O errors and partial writes; change the example mount options to use "rw,hard,intr" (or show both options and explicitly recommend "hard,intr" for application data) and add a brief explanatory note after the command that for shared application data you should prefer hard mounts to avoid transient-network-induced partial writes and data corruption; update every occurrence of the example mount command (the string "sudo mount -t nfs -o rw,soft,intr NFS_SERVER_IP:/mnt/mattermost-data /opt/mattermost/data") and add a short caution sentence referencing "hard,intr" as the recommended setting.
**709-709: Line 709 incorrectly claims RDS does not expose PostgreSQL configuration access.**

Both Aurora and RDS PostgreSQL expose configuration through DB parameter groups and cluster parameter groups, allowing administrators to tune settings like memory allocation, replication parameters, and other PostgreSQL options. Reword to clarify these configuration mechanisms are available, rather than suggesting configuration is unavailable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/administration-guide/scale/high-availability-cluster-based-deployment.rst` at line 709, Replace the incorrect sentence "Amazon RDS does not expose direct PostgreSQL configuration access" with a clarified statement that both Amazon RDS (for PostgreSQL) and Amazon Aurora expose configuration via DB parameter groups and cluster parameter groups; mention that these parameter groups allow tuning of memory, replication, and other PostgreSQL options and that monitoring should still be done via CloudWatch and RDS Performance Insights (locate the sentence containing "Amazon RDS does not expose direct PostgreSQL configuration access" in high-availability-cluster-based-deployment.rst and update it to reference DB parameter groups/cluster parameter groups and retained monitoring guidance).
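As a sketch of what the clarified doc text could point to (the parameter group names below are placeholders, not taken from the PR), managed PostgreSQL settings such as `hot_standby_feedback` are changed through parameter groups rather than `postgresql.conf`:

```shell
# Sketch only: group names are hypothetical. On RDS for PostgreSQL,
# instance-level settings are tuned via a DB parameter group:
aws rds modify-db-parameter-group \
  --db-parameter-group-name mattermost-pg-params \
  --parameters "ParameterName=hot_standby_feedback,ParameterValue=1,ApplyMethod=immediate"

# On Aurora PostgreSQL, cluster-wide settings use the cluster variant:
aws rds modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name mattermost-aurora-params \
  --parameters "ParameterName=hot_standby_feedback,ParameterValue=1,ApplyMethod=immediate"
```

Monitoring guidance (CloudWatch, RDS Performance Insights) stays unchanged; only the "no configuration access" claim needs rewording.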
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 85b65ef1-a743-47fb-ae48-49f1d75d59fd
📒 Files selected for processing (1)
source/administration-guide/scale/high-availability-cluster-based-deployment.rst
Diff context (reconstructed from the side-by-side view): the old **File storage configuration** and **Database configuration** sections are replaced by NGINX setup steps 3-6.

Removed:

```rst
File storage configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. note::

  1. File storage is assumed to be shared between all the machines that are using services such as NAS or Amazon S3.
  2. If ``"DriverName": "local"`` is used then the directory at ``"FileSettings":`` ``"Directory": "./data/"`` is expected to be a NAS location mapped as a local directory, otherwise high availability will not function correctly and may corrupt your file storage.
  3. If you're using Amazon S3 or another S3-compatible service for file storage then no other configuration is required.

If you're using the Compliance Reports feature in Mattermost Enterprise, you need to configure the ``"ComplianceSettings":`` ``"Directory": "./data/",`` to be shared between all machines or the reports will only be available from the System Console on the local Mattermost server.

Migrating to NAS or S3 from local storage is beyond the scope of this document.

Database configuration
^^^^^^^^^^^^^^^^^^^^^^

.. tip::

  Specifying configuration setting values using Mattermost environment variables ensures that they always take precedence over any other configuration settings.

For an AWS High Availability RDS cluster deployment, point the :ref:`datasource <administration-guide/configure/environment-configuration-settings:data source>` configuration setting to the write/read endpoint at the **cluster** level to benefit from the AWS failover handling. AWS takes care of promoting different database nodes to be the writer node. Mattermost doesn't need to manage this.

Use the :ref:`read replica <administration-guide/configure/environment-configuration-settings:read replicas>` feature to scale the database. The Mattermost server can be set up to use one master database and one or more read replica databases.
```

Added:

```rst
This configuration includes high-performance optimizations such as preventing servers from being marked unavailable (``max_fails=0``), user image caching for 24 hours, and extended timeouts for long-running operations. For additional NGINX performance tuning options, including worker process optimization and advanced buffer settings, see the :ref:`high-performance scaling configuration <deployment-guide/server/setup-nginx-proxy:high-performance scaling configuration>` section.

3. **Enable the site configuration:**

   **Ubuntu/Debian:**

   .. code-block:: bash

      sudo ln -sf /etc/nginx/sites-available/mattermost /etc/nginx/sites-enabled/mattermost
      sudo rm -f /etc/nginx/sites-enabled/default

   **RHEL/CentOS:** The configuration file in ``/etc/nginx/conf.d/`` is automatically enabled.

4. **Test and apply NGINX configuration:**

   .. code-block:: bash

      sudo nginx -t
      sudo systemctl restart nginx
      sudo systemctl enable nginx

5. **Configure TLS:** For production deployments, configure TLS on your NGINX proxy. See :doc:`Set up TLS </deployment-guide/server/setup-tls>` for detailed instructions on configuring TLS with NGINX. You can either use Let's Encrypt for automatic certificate management or provide your own TLS certificates.

6. **Configure health checks:** NGINX automatically stops routing traffic to backend servers that fail to respond. You can monitor server health using the Mattermost API endpoint ``http://SERVER_IP:8065/api/v4/system/ping`` which returns ``Status 200`` for healthy servers.
```
🧩 Analysis chain
🌐 Web query:
In NGINX upstream configuration, what is the exact effect of setting max_fails=0 on backend failure detection and routing behaviour?
💡 Result:

Setting max_fails=0 on an upstream server disables NGINX's passive failure "accounting" for that server: NGINX will not count unsuccessful attempts, and therefore will never mark the server "unavailable" based on max_fails/fail_timeout (so fail_timeout no longer has any effect for temporary removal of that peer). [1]

Practical routing impact:

- The server stays eligible for selection by the load-balancing algorithm even if it is repeatedly failing (it won't be temporarily skipped as "down" due to exceeded fails). [1]
- Per-request failover can still happen (e.g., `proxy_next_upstream` and related `*_next_upstream` rules determine what is an "unsuccessful attempt" and when NGINX may try another peer for that request), but those errors won't cause the peer to be quarantined for fail_timeout. [1]
- With hash-based selection, this also means NGINX won't "rehash away" from a preferred server due to it being "known down" via max_fails/fail_timeout (one reason it's used as a workaround). [2]

Sources:

[1] NGINX docs: max_fails "zero value disables the accounting of attempts" and how a server becomes "unavailable" via fail_timeout/max_fails and *_next_upstream. (nginx.org)
[2] NGINX trac ticket explaining that proxy_next_upstream doesn't prevent choosing another server when a peer is "known to be down", and that max_fails=0 makes NGINX ignore such down-status due to fails. (trac.nginx.org)
Clarify health-check documentation to explain both passive and active failure handling.
Line 360 documents that max_fails=0 disables passive failure detection (preventing the server from being marked temporarily unavailable), but line 383's statement that "NGINX automatically stops routing traffic to backend servers that fail to respond" can mislead readers into thinking this contradicts the max_fails=0 setting. In reality, those are two separate mechanisms: max_fails=0 disables passive quarantining of peers, while per-request failover (via proxy_next_upstream) is a separate active failover mechanism. Clarify that the automatic routing at line 383 refers to per-request failover, not peer-level quarantining, to avoid confusion during incident response.
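The two mechanisms the comment distinguishes can be sketched side by side in a minimal upstream block (server addresses and retry conditions here are illustrative placeholders, not the PR's actual config):

```nginx
upstream backend {
    # Passive detection disabled: failed attempts are not counted, so peers
    # are never quarantined for fail_timeout based on max_fails.
    server 10.10.10.2:8065 max_fails=0;
    server 10.10.10.4:8065 max_fails=0;
}

server {
    listen 80;
    location / {
        proxy_pass http://backend;
        # Per-request active failover: on these conditions, NGINX retries the
        # same request against the next peer, even though no peer is marked down.
        proxy_next_upstream error timeout http_502 http_503;
        proxy_next_upstream_tries 2;
    }
}
```

With this shape, "automatically stops routing traffic to failed servers" describes the per-request `proxy_next_upstream` retries, not peer-level quarantining, which is exactly the distinction the clarified doc text should draw.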
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@source/administration-guide/scale/high-availability-cluster-based-deployment.rst`
around lines 360 - 383, Clarify the health-check paragraph to explain the
difference between passive peer quarantining (controlled by max_fails=0) and
per-request active failover (controlled by proxy_next_upstream and similar
settings): explicitly state that setting max_fails=0 disables passive marking of
an upstream as unavailable, while NGINX can still perform per-request failover
using proxy_next_upstream so requests may be retried to other backends even when
peers are not quarantined; mention the Mattermost API ping endpoint
(http://SERVER_IP:8065/api/v4/system/ping) as a way to monitor server health but
note it does not change the distinction between passive and active failure
handling.