docs/rfc/application_logging/application_logging_guidelines.md

# Guidelines for Developing Application Logging Solutions

An *application log* or *service log* is the textual data produced by a piece of software for the purpose of recording
its activity. It may come in many forms, such as output from a Docker container, Lambda function, or load balancer. This
information is often useful for development and debugging purposes, but may contain information that identifies
specific users or behaviors within our systems. This Personally Identifying Information (PII) and other sensitive data
must be carefully handled to respect our users' privacy.

This document defines the scope and limitations of application logging at TB Pro. It lays out guidelines for handling
logging and considerations for selecting specific logging tools and their configurations.


## Configuration

This section contains general guidelines for how logs should be produced and handled so as to protect user data. Any
specific logging solution (or a component thereof) should be built in accordance with these guidelines. If a logging
solution is already implemented that does not comply with these guidelines, issues should be created and prioritized to
resolve those discrepancies.


### Log Production

Application logs are often less useful for debugging if they do not contain information that identifies specific users.
Even so, effort must be made to reduce the amount of this information that gets recorded in the first place. We should
log it only when it is necessary to identify a problem in our systems or respond to an incident, and we should never
produce logs with the goal of collecting or analyzing user data or tracking user behavior.

Most application logging facilities include a "log level" feature, where each message is assigned a verbosity level and
the logging tool suppresses messages below a configured threshold. For most purposes, we should log sensitive data only
at the "debug" level (or a more verbose level like "trace"), and only when it is absolutely necessary.

Production environments should always use the default log level (typically "info") to suppress this output.
Staging and development environments should never contain real user PII, so more verbose logging there should not pose
a hazard. Where possible, debug logging of this sort should also be removed before code is deployed to production.
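
As a sketch of this level-gating idea, using Python's standard `logging` module (the logger name and `LOG_LEVEL` environment variable are illustrative conventions, not an established standard):

```python
import logging
import os

# Illustrative sketch: read the log level from the environment so production
# deployments default to INFO while developers can opt in to DEBUG locally.
log_level = os.environ.get("LOG_LEVEL", "INFO").upper()
logger = logging.getLogger("example-service")
logger.setLevel(getattr(logging, log_level, logging.INFO))

def handle_request(user_id: str) -> None:
    # Identifying details only at DEBUG; production (INFO) suppresses them.
    logger.debug("handling request for user %s", user_id)
    # Routine operational messages carry no PII.
    logger.info("request handled")
```

With the default level, the `debug` call above produces no output, so the user identifier never reaches production log storage.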


### Data Protection

Logs are always stored somewhere, at least temporarily. That might be the local disk of a server you control, a remote
service of some kind (CloudWatch, Splunk, Kibana, etc.), or a developer's local machine. Development and staging
environments should contain no sensitive data, but when we do produce PII in logs, we need to protect it.

When logs are stored on a disk, that disk should be encrypted. This includes originating servers and aggregation targets
as well as engineers' workstations. Encryption at rest must be enabled on all systems where logs potentially
containing PII will persist. Where possible, those encryption keys should be rotated automatically on a recurring
basis.

When logs are stored on another host, or they pass through an intermediary (such as fluentd/fluent-bit), the data
must be encrypted at all stages of transit using TLS or equivalent protection. Log aggregation systems should be
configured to limit access. This can be done by placing the system on a private network and controlling access to that
service. A password-based system where services sending logs must authenticate to do so is also recommended.

Data must be exfiltrated from these protected systems as infrequently as possible. For example, an engineer responding
to an incident may be tempted to copy some logged events locally for reference. We should resist this temptation where
possible. Where we cannot, we must make an effort to avoid copying private data, ensure the machine's storage is
fully encrypted as per company policy, and delete the data as soon as possible.

We may have to report on the contents of these logs as part of a postmortem investigation or similar document. In these
cases, we must make sure to scrub any private information from logs before including them in such a report.
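
A minimal sketch of such scrubbing is below; the patterns and placeholder names are assumptions, not a complete PII inventory, and real scrubbing must be reviewed case by case:

```python
import re

# Illustrative patterns only -- real PII takes many more forms than these.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def scrub(line: str) -> str:
    """Replace common PII patterns in a log line with placeholders."""
    line = EMAIL_RE.sub("<email>", line)
    line = IPV4_RE.sub("<ip>", line)
    return line
```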


### Data Access

Stored log data must be protected from unauthorized access. Different tools use different auth schemes, but
some good guidelines include:

- Create logical separation between log types. Logs should be separated by environment (stage, prod, etc) and purpose
(service log, load balancer access log, and so on).
- Provide access through individual user accounts, not shared credentials.
- Authorize each individual according to their needs, following the principle of least privilege. Incident response
teams may require production access, while developers may only need access to staging and development environments,
or even to a single application's logs.
- Accessing logs should produce auditable events where possible. For example, when a user accesses a CloudWatch log
stream, AWS emits a `GetLogEvents` event, which can be discovered using CloudTrail.
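
As an illustration of least-privilege access (the log group name and account ID are placeholders), an IAM policy granting read-only access to a single application's staging logs might look like:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogStreams",
        "logs:GetLogEvents",
        "logs:FilterLogEvents"
      ],
      "Resource": "arn:aws:logs:*:123456789012:log-group:/tb/example-app/stage:*"
    }
  ]
}
```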


### Data Retention

We do not need to store logs for long periods of time. Incident responders should be able to grab (and sanitize) the
necessary logs during the initial response period or during the post-mortem investigation in the days following.
Developers investigating service issues in lower environments should be able to reproduce their test scenario to
generate new log text. The longer we retain our logs, the longer that information is at risk and the greater our
burden of protection.

Service logs should be retained no longer than 3 days in production environments and no longer than 7 days in other
environments. Logs exceeding these ages should be fully destroyed, not moved to another location or storage tier, nor
backed up for later retrieval. This should happen through an automated process, such as a lifecycle rule on an S3 bucket
or similar mechanism, not left to an engineer to tidy up manually.
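
In Pulumi, for example, automated deletion can be a retention setting on the log group itself (a sketch; the resource and group names are illustrative):

```python
import pulumi_aws as aws

# Illustrative sketch: CloudWatch deletes events older than the retention
# window automatically, so no engineer has to tidy up manually.
prod_logs = aws.cloudwatch.LogGroup(
    "example-app-prod-logs",     # hypothetical resource name
    name="/tb/example-app/prod",
    retention_in_days=3,         # production retention limit
)
```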

Logs should not be exfiltrated from these systems under most circumstances. In the rare occasion when it is necessary to
offload logs, such as when preparing a postmortem incident report, that data should be scrubbed of private information
where possible and any data that remains un-anonymized should be destroyed as soon as feasible.


## Auditing

Ideally, as much of this process as possible should be auditable. That means we should be able to look at our systems
and answer specific questions about the data, such as:

- When did the data get created? How old is it? Does it exist outside of our data retention policy?
- Who has accessed this data?
- Has a particular user taken any other actions with the data?

To accomplish this, logging systems should be designed with auditability in mind. This may
include:

- Choosing system components which allow for detailed auditing (such as the way CloudWatch Logs produces CloudTrail
events).
- Choosing write-once, immutable systems of record.
- Designing components with repeatable code patterns, such as a tb_pulumi module, to ensure all installations of these
logging patterns are configured the same way and benefit from code and design improvements over time. Providing these
patterns to developers as the supported logging solution, with support for the configurations, encourages their use.
- Using automated tools like AWS GuardDuty and AWS Config to report on insecure configurations.
- Using automated tools like AWS CloudTrail and CloudWatch Alarms to alert when sensitive data is accessed.


## Implementations

We currently have no implementations of these guidelines.

- An [implementation for CloudWatch Logs](./logging_to_cloudwatch_logs.md) has been proposed.

docs/rfc/application_logging/logging_to_cloudwatch_logs.md

# RFC: Logging to CloudWatch Logs


## Proposal Overview

We should define a pattern of infrastructure resources that provides a unified experience for
applications sending their logs to CloudWatch Logs. This should come in the form of a
[tb_pulumi](https://github.com/thunderbird/pulumi/) module that can be used to stamp out logging
destinations that are compliant with our [guidelines](./application_logging_guidelines.md) in our
various infrastructure projects. Existing projects which send logs to CloudWatch Logs through
more or less default configurations can be adjusted to use the new pattern, bringing them into
compliance.


## Rationale

CloudWatch Logs can be configured to meet all of our guidelines pertaining to logging targets:

- KMS Keys provide encryption at rest for log data.
- KMS encryption keys can be set to auto-rotate at a custom interval.
- The AWS API provides encryption for log data in transit.
- Log data is passed through a VPC endpoint to the CloudWatch Logs service over a private network.
- Log streams are individual resources that IAM policies can target, allowing granularity in access control.
- Log groups are also access-controllable, adding flexibility in how access is scoped.
- IAM actions related to these groups and streams can be tracked through a CloudTrail event trail and alerted on with
CloudWatch Alarms.
- Policies granting access to these logs can be crafted around the logical separators of environment and application.
IAM user groups with these policies applied can be created to control log access to individuals.
- CloudWatch Log Groups can be configured with a retention window, automating the deletion of log data according to our
guidelines.


## Implementation Details

We should implement this as a tb_pulumi module: `tb:cloudwatch:LoggingDestination`. This module should implement default
values which align with our guidelines.

It should define the following resources:

- A [KMS Key](https://www.pulumi.com/registry/packages/aws/api-docs/kms/key/) to handle encryption at rest for the
logs. (Example: `mailstrom-logs-stage`)
- A [CloudWatch LogGroup](https://www.pulumi.com/registry/packages/aws/api-docs/cloudwatch/loggroup/) for the
environment. In keeping with AWS's log group naming conventions, this might be called something like
`/tb/mailstrom/stage` for the Mailstrom/Stalwart staging environment.
- A [CloudWatch LogStream](https://www.pulumi.com/registry/packages/aws/api-docs/cloudwatch/logstream/) for each
application. This division is somewhat arbitrary and can be broken up however it makes sense (e.g.
`/tb/mailstrom/stage/stalwart/mail` vs. `/tb/mailstrom/stage/stalwart/management-api`).
- A set of [IAM Policies](https://www.pulumi.com/registry/packages/aws/api-docs/iam/policy/) allowing various levels of
access to these streams. Applications will need a write access policy. Users will need read access policies. There
should be a level of customization here, allowing the engineers to design access in ways that make sense for their
use case. These policies can be applied to any existing set of permissions to extend access to logs.
- A [CloudTrail Event Trail](https://www.pulumi.com/registry/packages/aws/api-docs/cloudtrail/trail/) set up with an
appropriate filter for auditing log access.
- A [CloudWatch Alarm](https://www.pulumi.com/registry/packages/aws/api-docs/cloudwatch/metricalarm/) to alert when log
access is detected.
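
Taken together, a minimal sketch of the core resources such a module might declare (the names follow the examples above and are illustrative; the actual tb_pulumi module API is not defined here):

```python
import pulumi_aws as aws

# Illustrative sketch only, not the final module implementation.
key = aws.kms.Key(
    "mailstrom-logs-stage",
    description="Encrypts Mailstrom staging logs at rest",
    enable_key_rotation=True,  # recurring automatic key rotation
)
group = aws.cloudwatch.LogGroup(
    "mailstrom-stage",
    name="/tb/mailstrom/stage",
    kms_key_id=key.arn,        # encryption at rest via the KMS key
    retention_in_days=7,       # staging retention limit from the guidelines
)
stream = aws.cloudwatch.LogStream(
    "stalwart-mail",
    name="stalwart/mail",
    log_group_name=group.name,
)
```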

We have two primary use cases for this in our current system:

1. We need to [aggregate Stalwart logs](https://github.com/thunderbird/mailstrom/issues/196).
2. We need to apply our logging guidelines to existing services that produce logs.

In both cases, we need a target log stream with our retention/encryption/etc rules applied. In the first case, we can
use fluent-bit to ship logs from the Docker containers running Stalwart straight to a log group created with the common
pattern. In the second case, we can create new log destinations with the new pattern, then update the existing
container definitions to use the new log streams.
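
For the Stalwart case, the fluent-bit side might look roughly like this (the region, group, and stream names follow the hypothetical naming above; exact option names depend on the plugin version, and transport to the AWS API is TLS-encrypted):

```
[OUTPUT]
    Name             cloudwatch_logs
    Match            stalwart.*
    region           us-east-1
    log_group_name   /tb/mailstrom/stage
    log_stream_name  stalwart/mail
```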