From 00914b16dd46c911be32bd3e62c5aaf6584c38b8 Mon Sep 17 00:00:00 2001 From: Ryan Jung Date: Tue, 3 Mar 2026 14:04:32 -0700 Subject: [PATCH 1/4] Add rough draft of app logs doc --- .../application_logging.md | 39 +++++++++++++++++++ 1 file changed, 39 insertions(+) create mode 100644 docs/rfc/application_logging/application_logging.md diff --git a/docs/rfc/application_logging/application_logging.md b/docs/rfc/application_logging/application_logging.md new file mode 100644 index 0000000..8904b5c --- /dev/null +++ b/docs/rfc/application_logging/application_logging.md @@ -0,0 +1,39 @@ +# Application Logging at Thunderbird Pro Services + +Thunderbird Pro Services produce application logs describing their moment-to-moment operations. These logs are generally +useful for diagnosing service level problems, but may contain personally identifying information (PII) in some cases. +This document defines the scope and limitations of application logging at TB Pro. + + +## Logging Purposes + +We should produce and store logs for the following purposes: + +- Development and debugging +- Incident response + +We should **not** produce or store logs for purposes such as: + +- Data collection +- Analytics +- Any kind of user tracking purpose + +Generally, if the logs contain information that can identify a user, we should be very scrutinous about whether those +logs actually need to be produced. Logs should serve a purpose, and PII in logs needs to be fully justified. Staging and +other pre-prod environments should not contain PII in the first place. + + +## Log Storage and Access + +Application logs should always: + +- be stored in AWS CloudWatch Logs, +- be encrypted at rest and in transit, and +- have access restricted according to the principle of least privilege. + + +## Log Retention + +Log files should only be stored as long as they are useful for the purposes described above. In production environments, +logs should be preserved no longer than 3 days. 
In staging environments, logs should be preserved no longer than 7 days. +After this point, log files should be fully deleted. \ No newline at end of file From f3a19b660187c96858126321865c3fc2c4c1b894 Mon Sep 17 00:00:00 2001 From: Ryan Jung Date: Tue, 3 Mar 2026 14:07:26 -0700 Subject: [PATCH 2/4] A little copyediting --- docs/rfc/application_logging/application_logging.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/rfc/application_logging/application_logging.md b/docs/rfc/application_logging/application_logging.md index 8904b5c..de5efba 100644 --- a/docs/rfc/application_logging/application_logging.md +++ b/docs/rfc/application_logging/application_logging.md @@ -12,15 +12,15 @@ We should produce and store logs for the following purposes: - Development and debugging - Incident response -We should **not** produce or store logs for purposes such as: +We should **not** produce or store logs for other purposes such as: - Data collection - Analytics -- Any kind of user tracking purpose +- Any kind of user tracking Generally, if the logs contain information that can identify a user, we should be very scrutinous about whether those logs actually need to be produced. Logs should serve a purpose, and PII in logs needs to be fully justified. Staging and -other pre-prod environments should not contain PII in the first place. +other pre-prod environments should not contain PII. ## Log Storage and Access @@ -36,4 +36,4 @@ Application logs should always: Log files should only be stored as long as they are useful for the purposes described above. In production environments, logs should be preserved no longer than 3 days. In staging environments, logs should be preserved no longer than 7 days. -After this point, log files should be fully deleted. \ No newline at end of file +After this point, log files should be fully deleted. 
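The retention schedule above (3 days in production, 7 days in staging) reduces to a simple expiry rule that is easy to automate. The sketch below is only an illustration of that rule; the `is_expired` helper and environment labels are hypothetical, and actual deletion should still happen through an automated mechanism such as a retention policy on the log store:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Retention windows from this document: 3 days in production, 7 days in staging.
RETENTION = {
    "production": timedelta(days=3),
    "staging": timedelta(days=7),
}

def is_expired(created_at: datetime, environment: str, now: Optional[datetime] = None) -> bool:
    """Return True once a log object has outlived its environment's retention window."""
    now = now or datetime.now(timezone.utc)
    return now - created_at > RETENTION[environment]
```

A production log created four days ago is past its window, while the same log in staging is not.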
From 868aafc2e8c513627678034bc9fb48f4b69461fd Mon Sep 17 00:00:00 2001 From: Ryan Jung Date: Fri, 6 Mar 2026 12:23:10 -0700 Subject: [PATCH 3/4] Update broad logging guidelines RFC --- .../application_logging.md | 39 ------ .../application_logging_guidelines.md | 115 ++++++++++++++++++ 2 files changed, 115 insertions(+), 39 deletions(-) delete mode 100644 docs/rfc/application_logging/application_logging.md create mode 100644 docs/rfc/application_logging/application_logging_guidelines.md diff --git a/docs/rfc/application_logging/application_logging.md b/docs/rfc/application_logging/application_logging.md deleted file mode 100644 index de5efba..0000000 --- a/docs/rfc/application_logging/application_logging.md +++ /dev/null @@ -1,39 +0,0 @@ -# Application Logging at Thunderbird Pro Services - -Thunderbird Pro Services produce application logs describing their moment-to-moment operations. These logs are generally -useful for diagnosing service level problems, but may contain personally identifying information (PII) in some cases. -This document defines the scope and limitations of application logging at TB Pro. - - -## Logging Purposes - -We should produce and store logs for the following purposes: - -- Development and debugging -- Incident response - -We should **not** produce or store logs for other purposes such as: - -- Data collection -- Analytics -- Any kind of user tracking - -Generally, if the logs contain information that can identify a user, we should be very scrutinous about whether those -logs actually need to be produced. Logs should serve a purpose, and PII in logs needs to be fully justified. Staging and -other pre-prod environments should not contain PII. - - -## Log Storage and Access - -Application logs should always: - -- be stored in AWS CloudWatch Logs, -- be encrypted at rest and in transit, and -- have access restricted according to the principle of least privilege. 
- - -## Log Retention - -Log files should only be stored as long as they are useful for the purposes described above. In production environments, -logs should be preserved no longer than 3 days. In staging environments, logs should be preserved no longer than 7 days. -After this point, log files should be fully deleted. diff --git a/docs/rfc/application_logging/application_logging_guidelines.md b/docs/rfc/application_logging/application_logging_guidelines.md new file mode 100644 index 0000000..891df1a --- /dev/null +++ b/docs/rfc/application_logging/application_logging_guidelines.md @@ -0,0 +1,115 @@ +# Guidelines for Developing Application Logging Solutions + +An *application log* or *service log* is the textual data produced by a piece of software for the purpose of recording +its activity. It may come in many forms, such as output from a Docker container, Lambda function, or load balancer. This +information is often useful for development and debugging purposes, but may contain information that identifies +specific users or behaviors within our systems. This Personally Identifying Information (PII) and other sensitive data +must be carefully handled to respect our users' privacy. + +This document defines the scope and limitations of application logging at TB Pro. It lays out guidelines for handling +logging and considerations for selecting specific logging tools and their configurations. + + +## Configuration + +This section contains general guidelines for how logs should be produced and handled so as to protect user data. Any +specific logging solution (or a component thereof) should be built in accordance with these guidelines. If a logging +solution is already implemented that does not comply with these guidelines, issues should be created and prioritized to +resolve those discrepancies. + + +### Log Production + +Application logs are often less useful for debugging if they do not contain information which identifies specific users. 
+Effort must be taken to reduce the amount of this information that gets recorded in the first place. We should only ever +log this information when it is necessary to identify a problem in our systems or respond to an incident. We should +never produce logs with the goal of collecting or analyzing user data, or to track user behavior. + +Most application logging facilities include a "log level" feature where log messages of differing levels of detail are +associated with an appropriate verbosity level in the software's logging tool. For most purposes, we should only log +sensitive data at the "debug" level (or a more verbose level like "trace"), and only if it is absolutely necessary. + +Production environments should always use the default log level (typically "info") to further suppress this output. +Staging and development environments should never contain any real user PII, and therefore more verbose logging should +not pose a hazard. Where possible, debug logging of this sort should also be removed prior to deploying code to +production. + + +### Data Protection + +Logs are always stored somewhere, at least temporarily. That might be the local disk of a server you control, a remote +service of some kind (CloudWatch, Splunk, Kibana, etc), or a developer's local machine. In development and staging +environments, there should be no sensitive data present, but when we do produce PII in logs, we need to protect it. + +When logs are stored on a disk, that disk should be encrypted. This includes originating servers and aggregation targets +as well as engineer working machines. *Encryption must be enabled at rest on all systems where logs potentially +containing PII will persist.* Where possible, those encryption keys should be automatically rotated on a recurring +basis. 
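As a concrete illustration of the encryption-at-rest and key-rotation requirements, the sketch below pairs a new auto-rotating KMS key with a CloudWatch log group. This is a hedged example, not prescribed tooling: `encrypt_log_group` is a hypothetical helper, and `kms` and `logs` are assumed to be boto3-style clients (e.g. `boto3.client("kms")` and `boto3.client("logs")`) supplied by the caller:

```python
def encrypt_log_group(kms, logs, log_group: str) -> str:
    """Create an auto-rotating KMS key and attach it to a CloudWatch log group.

    `kms` and `logs` are assumed to be boto3-style clients; the helper name and
    log group are illustrative only. Returns the ARN of the new key.
    """
    key = kms.create_key(Description=f"log encryption for {log_group}")
    key_id = key["KeyMetadata"]["KeyId"]
    key_arn = key["KeyMetadata"]["Arn"]
    kms.enable_key_rotation(KeyId=key_id)  # recurring rotation, per these guidelines
    logs.associate_kms_key(logGroupName=log_group, kmsKeyId=key_arn)
    return key_arn
```

Injecting the clients keeps the helper testable without AWS credentials and leaves credential handling to the caller.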
+ +When logs are stored on another host, or they pass through an intermediary (such as fluentd or fluentbit), *the data +must be encrypted at all stages of transit using TLS or equivalent protection.* Log aggregation systems should be +configured to limit access. This can be done by placing the system on a private network and controlling access to that +service. A password-based system where services sending logs must authenticate to do so is also recommended. + +Data must be exfiltrated from these protected systems as infrequently as possible. For example, an engineer responding +to an incident may be tempted to copy some logged events locally for reference. We should resist this temptation when +possible. Where we cannot resist we must make an effort to avoid copying private data, ensure the machine's storage is +fully encrypted as per company policy, and delete the data as soon as possible. + +We may have to report on the contents of these logs as part of a postmortem investigation or similar document. In these +cases, we must make sure to scrub any private information from logs before including them in such a report. + + +### Data Access + +Access to stored log data must be protected from unauthorized access. Different tools use different auth schemes, but +some good guidelines include: + +- Create logical separation between log types. Logs should be separated by environment (stage, prod, etc) and purpose + (service log, load balancer access log, and so on). +- Provide access through individual user accounts, not shared credentials. +- Authorize each individual according to their needs, following the principle of least privilege. Incident response + teams may require production access while developers may only need access to staging and development environments. + They may only need access to a single application's logs. +- Accessing logs should produce auditable events where possible. 
For example, when a user accesses a CloudWatch log + stream, AWS emits a `GetLogEvents` event, which can be discovered using CloudTrail. + + +### Data Retention + +We do not need to store logs for a long period of time. Incident response should be able to grab the necessary logs +during the initial response period or during the post-mortem investigation in the days following. Developers +investigating service issues in lower environments should be able to reproduce their test scenario to generate new log +text. The longer we retain our logs, the longer that information is at risk and the greater our burden of protection. + +Service logs should be retained no longer than 3 days in production environments and no longer than 7 days in other +environments. Logs exceeding these ages should be fully destroyed, not moved to another location or storage tier, nor +backed up for later retrieval. This should happen through an automated process, such as a lifecycle rule on an S3 bucket +or similar mechanism, not left to an engineer to tidy up manually. + +Logs should not be exfiltrated from these systems under most circumstances. On the rare occasion when it is necessary to +offload logs, such as when preparing a postmortem incident report, that data should be scrubbed of private information +where possible and destroyed as soon as feasible. + + +## Auditing + +Ideally, we want as many of these processes as possible to be auditable. That means that we should be able to look at our systems +and answer questions about the data such as: + +- When did the data get created? How old is it? +- Who has accessed a piece of data? +- Has a user taken any other actions with the data? + +To accomplish this, attention should be paid when designing logging systems to maximize our ability to answer them. This may +include: + +- Choose system components which allow for detailed auditing (such as the way CloudWatch Logs produces CloudTrail + events). +- Choose write-once, unmodifiable systems of record.
+- Design components with repeatable code patterns, such as a tb_pulumi module, to ensure all installations of these + logging patterns are configured the same and benefit from code and design improvements over time. Provide these + patterns to developers as the logging solution to use and provide support for the configurations to encourage their + use. +- Use automated tools like AWS GuardDuty and AWS Config to report on insecure configurations. +- Use automated tools like AWS CloudTrail and CloudWatch Alarms to alert when sensitive data is accessed. From bc1b6ae13aa59ccea6fab881d71fd99c7f40e087 Mon Sep 17 00:00:00 2001 From: Ryan Jung Date: Fri, 6 Mar 2026 13:51:21 -0700 Subject: [PATCH 4/4] RFC revisions for logging --- .../application_logging_guidelines.md | 36 ++++++---- .../logging_to_cloudwatch_logs.md | 65 +++++++++++++++++++ 2 files changed, 87 insertions(+), 14 deletions(-) create mode 100644 docs/rfc/application_logging/logging_to_cloudwatch_logs.md diff --git a/docs/rfc/application_logging/application_logging_guidelines.md b/docs/rfc/application_logging/application_logging_guidelines.md index 891df1a..81c63bf 100644 --- a/docs/rfc/application_logging/application_logging_guidelines.md +++ b/docs/rfc/application_logging/application_logging_guidelines.md @@ -29,7 +29,7 @@ Most application logging facilities include a "log level" feature where log mess associated with an appropriate verbosity level in the software's logging tool. For most purposes, we should only log sensitive data at the "debug" level (or a more verbose level like "trace"), and only if it is absolutely necessary. -Production environments should always use the default log level (typically "info") to further suppress this output. +Production environments should always use the default log level (typically "info") to suppress this output. Staging and development environments should never contain any real user PII, and therefore more verbose logging should not pose a hazard.
Where possible, debug logging of this sort should also be removed prior to deploying code to production. @@ -42,12 +42,12 @@ service of some kind (CloudWatch, Splunk, Kibana, etc), or a developer's local m environments, there should be no sensitive data present, but when we do produce PII in logs, we need to protect it. When logs are stored on a disk, that disk should be encrypted. This includes originating servers and aggregation targets -as well as engineer working machines. *Encryption must be enabled at rest on all systems where logs potentially -containing PII will persist.* Where possible, those encryption keys should be automatically rotated on a recurring +as well as engineer working machines. Encryption must be enabled at rest on all systems where logs potentially +containing PII will persist. Where possible, those encryption keys should be automatically rotated on a recurring basis. -When logs are stored on another host, or they pass through an intermediary (such as fluentd or fluentbit), *the data -must be encrypted at all stages of transit using TLS or equivalent protection.* Log aggregation systems should be +When logs are stored on another host, or they pass through an intermediary (such as fluentd/fluent-bit), the data +must be encrypted at all stages of transit using TLS or equivalent protection. Log aggregation systems should be configured to limit access. This can be done by placing the system on a private network and controlling access to that service. A password-based system where services sending logs must authenticate to do so is also recommended. @@ -77,10 +77,11 @@ some good guidelines include: ### Data Retention -We do not need to store logs for a long period of time. Incident response should be able to grab the necessary logs -during the initial response period or during the post-mortem investigation in the days following. 
Developers -investigating service issues in lower environments should be able to reproduce their test scenario to generate new log -text. The longer we retain our logs, the longer that information is at risk and the greater our burden of protection. +We do not need to store logs for a long period of time. Incident response should be able to grab (and sanitize) the +necessary logs during the initial response period or during the post-mortem investigation in the days following. +Developers investigating service issues in lower environments should be able to reproduce their test scenario to +generate new log text. The longer we retain our logs, the longer that information is at risk and the greater our burden of +protection. Service logs should be retained no longer than 3 days in production environments and no longer than 7 days in other environments. Logs exceeding these ages should be fully destroyed, not moved to another location or storage tier, nor backed up for later retrieval. This should happen through an automated process, such as a lifecycle rule on an S3 bucket @@ -89,17 +90,17 @@ or similar mechanism, not left to an engineer to tidy up manually. Logs should not be exfiltrated from these systems under most circumstances. On the rare occasion when it is necessary to offload logs, such as when preparing a postmortem incident report, that data should be scrubbed of private information -where possible and destroyed as soon as feasible. +where possible and any data that remains un-anonymized should be destroyed as soon as feasible. ## Auditing Ideally, we want as many of these processes as possible to be auditable. That means that we should be able to look at our systems -and answer questions about the data such as: +and answer specific questions about the data such as: -- When did the data get created? How old is it? -- Who has accessed a piece of data? -- Has a user taken any other actions with the data? +- When did the data get created? How old is it? Does it exist outside of our data retention policy? +- Who has accessed this data?
+- Has a particular user taken any other actions with the data? To accomplish this, attention should be paid when designing logging systems to maximize our ability to answer them. This may include: @@ -113,3 +114,10 @@ include: use. - Use automated tools like AWS GuardDuty and AWS Config to report on insecure configurations. - Use automated tools like AWS CloudTrail and CloudWatch Alarms to alert when sensitive data is accessed. + + +## Implementations + +We currently have no implementations of these guidelines. + +- An [implementation for CloudWatch Logs](./logging_to_cloudwatch_logs.md) has been proposed. \ No newline at end of file diff --git a/docs/rfc/application_logging/logging_to_cloudwatch_logs.md b/docs/rfc/application_logging/logging_to_cloudwatch_logs.md new file mode 100644 index 0000000..9f72354 --- /dev/null +++ b/docs/rfc/application_logging/logging_to_cloudwatch_logs.md @@ -0,0 +1,65 @@ +# RFC: Logging to CloudWatch Logs + + +## Proposal Overview + +We should define a pattern of infrastructure resources that provides a unified experience for +applications sending their logs to CloudWatch Logs. This should come in the form of a +[tb_pulumi](https://github.com/thunderbird/pulumi/) module that can be used to stamp out logging +destinations that are compliant with our [guidelines](./application_logging_guidelines.md) in our +various infrastructure projects. Existing projects which send logs to CloudWatch Logs through +more or less default configurations can be adjusted to use the new pattern, bringing them into +compliance. + + +## Rationale + +CloudWatch Logs can be configured to meet all of our guidelines pertaining to logging targets: + +- KMS Keys provide encryption at rest for log data. +- KMS encryption keys can be set to auto-rotate at a custom interval. +- The AWS API provides encryption for log data in transit. +- Log data is passed through a VPC endpoint to the CloudWatch Logs service over a private network.
+- Log streams are individually policeable resources, allowing granularity in access control. +- Log groups are also access-controllable, increasing flexibility in access control. +- IAM actions related to these groups and streams can be tracked through a CloudTrail event trail and alerted on with + CloudWatch Alarms. +- Policies granting access to these logs can be crafted around the logical separators of environment and application. + IAM user groups with these policies applied can be created to control log access to individuals. +- CloudWatch Log Groups can be configured with a retention window, automating the deletion of log data according to our + guidelines. + + +## Implementation Details + +We should implement this as a tb_pulumi module: `tb:cloudwatch:LoggingDestination`. This module should implement default +values which align with our guidelines. + +It should define the following resources: + +- A [KMS Key](https://www.pulumi.com/registry/packages/aws/api-docs/kms/key/) to handle encryption at rest for the + logs. (Example: `mailstrom-logs-stage`) +- A [CloudWatch LogGroup](https://www.pulumi.com/registry/packages/aws/api-docs/cloudwatch/loggroup/) for the + environment. In implementation, in keeping with AWS's log group naming conventions, this might be called something like + `/tb/mailstrom/stage` for the Mailstrom/Stalwart staging environment. +- A [CloudWatch LogStream](https://www.pulumi.com/registry/packages/aws/api-docs/cloudwatch/logstream/) for each + application. This is somewhat arbitrary and can be broken up however it makes sense. (e.g. + `/tb/mailstrom/stage/stalwart/mail` vs `/tb/mailstrom/stage/stalwart/management-api`) +- A set of [IAM Policies](https://www.pulumi.com/registry/packages/aws/api-docs/iam/policy/) allowing various levels of + access to these streams. Applications will need a write access policy. Users will need read access policies.
There + should be a level of customization here, allowing the engineers to design access in ways that make sense for their + use case. These policies can be applied to any existing set of permissions to extend access to logs. +- A [CloudTrail Event Trail](https://www.pulumi.com/registry/packages/aws/api-docs/cloudtrail/trail/) set up with an + appropriate filter for auditing log access. +- A [CloudWatch Alarm](https://www.pulumi.com/registry/packages/aws/api-docs/cloudwatch/metricalarm/) to alert when log + access is detected. + +We have two primary use cases for this in our current system: + +1. We need to [aggregate Stalwart logs](https://github.com/thunderbird/mailstrom/issues/196). +2. We need to apply our logging guidelines to existing services that produce logs. + +In both cases, we need a target log stream with our retention/encryption/etc rules applied. In the first case, we can +use fluent-bit to ship logs from the Docker containers running Stalwart straight to a log group created with the common +pattern. In the second case, we can create new log destinations with the new pattern, then update the existing +container definitions to use the new log streams. \ No newline at end of file
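The group and stream naming convention described above (`/tb/mailstrom/stage`, `/tb/mailstrom/stage/stalwart/mail`) could be centralized in small helpers so that every `LoggingDestination` names its resources consistently. A minimal sketch, assuming hypothetical helper names (these functions are not part of tb_pulumi):

```python
def log_group_name(project: str, environment: str) -> str:
    """One group per environment, following the "/tb/<project>/<environment>" convention."""
    return f"/tb/{project}/{environment}"

def log_stream_name(project: str, environment: str, *components: str) -> str:
    """One stream per application; how finely to split `components` is left to the team."""
    return "/".join((log_group_name(project, environment),) + components)
```

For example, `log_stream_name("mailstrom", "stage", "stalwart", "mail")` yields `/tb/mailstrom/stage/stalwart/mail`, matching the stream names proposed above.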