Add application logging proposals #45
Merged
docs/rfc/application_logging/application_logging_guidelines.md (123 additions, 0 deletions)

# Guidelines for Developing Application Logging Solutions

An *application log* or *service log* is the textual data produced by a piece of software to record its activity. It may come in many forms, such as output from a Docker container, a Lambda function, or a load balancer. This information is often useful for development and debugging, but it may identify specific users or reveal their behavior within our systems. This Personally Identifiable Information (PII) and other sensitive data must be handled carefully to respect our users' privacy.

This document defines the scope and limitations of application logging at TB Pro. It lays out guidelines for handling logging and considerations for selecting specific logging tools and their configurations.

## Configuration

This section contains general guidelines for how logs should be produced and handled to protect user data. Any specific logging solution (or component thereof) should be built in accordance with these guidelines. If an existing logging solution does not comply with these guidelines, issues should be created and prioritized to resolve the discrepancies.

### Log Production

Application logs are often less useful for debugging if they contain no information identifying specific users. Even so, we must work to reduce the amount of this information that gets recorded in the first place. We should log it only when it is necessary to identify a problem in our systems or respond to an incident, and we should never produce logs with the goal of collecting or analyzing user data, or tracking user behavior.

Most application logging facilities include a "log level" feature, which associates each log message with a verbosity level in the software's logging tool. For most purposes, we should log sensitive data only at the "debug" level (or a more verbose level like "trace"), and only if it is absolutely necessary.

Production environments should always use the default log level (typically "info") to suppress this output. Staging and development environments should never contain real user PII, so more verbose logging there should not pose a hazard. Where possible, debug logging of this sort should also be removed before deploying code to production.

### Data Protection

Logs are always stored somewhere, at least temporarily. That might be the local disk of a server you control, a remote service of some kind (CloudWatch, Splunk, Kibana, etc.), or a developer's local machine. Development and staging environments should contain no sensitive data, but when we do produce PII in logs, we need to protect it.

When logs are stored on a disk, that disk should be encrypted. This includes originating servers and aggregation targets as well as engineers' workstations. Encryption at rest must be enabled on every system where logs potentially containing PII will persist. Where possible, the encryption keys should be rotated automatically on a recurring basis.

When logs are stored on another host, or pass through an intermediary (such as fluentd/fluent-bit), the data must be encrypted at every stage of transit using TLS or equivalent protection. Log aggregation systems should be configured to limit access, for example by placing the system on a private network and controlling access to that service. Requiring services that send logs to authenticate before doing so is also recommended.

Data should leave these protected systems as infrequently as possible. For example, an engineer responding to an incident may be tempted to copy some logged events locally for reference. We should resist this temptation when possible. Where we cannot, we must make an effort to avoid copying private data, ensure the machine's storage is fully encrypted per company policy, and delete the data as soon as possible.

We may have to report on the contents of these logs as part of a postmortem investigation or similar document. In these cases, we must scrub any private information from the logs before including them in such a report.

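As a sketch of what that scrubbing might look like (the patterns shown are illustrative; a real scrubber would cover whatever identifying fields our services actually log):

```python
import re

# Patterns for two common identifier types. These are deliberately simple
# illustrations, not an exhaustive PII taxonomy.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def scrub(line: str) -> str:
    """Replace identifying values with placeholders before a log line
    is quoted in a postmortem or similar report."""
    line = EMAIL_RE.sub("[email]", line)
    line = IPV4_RE.sub("[ip]", line)
    return line

print(scrub("login failed for alice@example.com from 203.0.113.7"))
# → login failed for [email] from [ip]
```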
### Data Access

Stored log data must be protected from unauthorized access. Different tools use different auth schemes, but some good guidelines include:

- Create logical separation between log types. Logs should be separated by environment (stage, prod, etc.) and purpose (service log, load balancer access log, and so on).
- Provide access through individual user accounts, not shared credentials.
- Authorize each individual according to their needs, following the principle of least privilege. Incident response teams may require production access, while developers may only need access to staging and development environments, or to a single application's logs.
- Accessing logs should produce auditable events where possible. For example, when a user accesses a CloudWatch log stream, AWS emits a `GetLogEvents` event, which can be discovered using CloudTrail.

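In an AWS-based setup, least-privilege access can be expressed as narrowly scoped IAM policy documents. A hedged sketch follows; the account ID, region, and log group name are placeholders:

```python
import json

# A read-only policy scoped to one application's staging log group.
# All identifiers below are illustrative placeholders.
read_stage_logs = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:GetLogEvents",
                "logs:FilterLogEvents",
                "logs:DescribeLogStreams",
            ],
            "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/tb/mailstrom/stage:*",
        }
    ],
}

print(json.dumps(read_stage_logs, indent=2))
```

Granting only read actions on a single log group keeps developers out of production logs (and out of write paths) by default; a separate write-only policy would be attached to the services producing the logs.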
### Data Retention

We do not need to store logs for a long period of time. Incident responders should be able to grab (and sanitize) the necessary logs during the initial response period or during the postmortem investigation in the days following. Developers investigating service issues in lower environments should be able to reproduce their test scenario to generate new log text. The longer we retain our logs, the longer that information is at risk and the greater our burden of protection.

Service logs should be retained no longer than 3 days in production environments and no longer than 7 days in other environments. Logs exceeding these ages should be fully destroyed, not moved to another location or storage tier, nor backed up for later retrieval. This should happen through an automated process, such as a lifecycle rule on an S3 bucket or a similar mechanism, not be left for an engineer to tidy up manually.

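Managed services enforce this with a retention setting or lifecycle rule; for logs on a disk we control, the same policy can be enforced by a small scheduled job. A sketch, assuming flat files named `*.log` (the directory layout and naming are hypothetical):

```python
import time
from pathlib import Path

RETENTION_DAYS = 3  # the production limit defined in these guidelines

def purge_old_logs(log_dir: Path, retention_days: int = RETENTION_DAYS) -> list[Path]:
    """Delete log files older than the retention window and return the
    paths removed. Intended to run on a schedule (cron or similar),
    never left for an engineer to do by hand."""
    cutoff = time.time() - retention_days * 86400
    removed = []
    for path in log_dir.glob("*.log"):
        if path.stat().st_mtime < cutoff:
            path.unlink()  # destroy outright; no archive tier, no backup
            removed.append(path)
    return removed
```

The key property is that expired logs are destroyed rather than demoted to cheaper storage, matching the "fully destroyed" requirement above.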
Logs should not be exfiltrated from these systems under most circumstances. On the rare occasion when it is necessary to offload logs, such as when preparing a postmortem incident report, the data should be scrubbed of private information where possible, and any data that remains un-anonymized should be destroyed as soon as feasible.

## Auditing

Ideally, we want as much of these processes as possible to be auditable. That means we should be able to look at our systems and answer specific questions about the data, such as:

- When did the data get created? How old is it? Does it exist outside of our data retention policy?
- Who has accessed this data?
- Has a particular user taken any other actions with the data?

To accomplish this, logging systems should be designed to maximize auditability. This may include:

- Choosing system components that allow for detailed auditing (such as the way CloudWatch Logs produces CloudTrail events).
- Choosing write-once, unmodifiable systems of record.
- Designing components with repeatable code patterns, such as a tb_pulumi module, to ensure all installations of these logging patterns are configured the same way and benefit from code and design improvements over time. Provide these patterns to developers as the logging solution to use, and support the configurations to encourage their use.
- Using automated tools like AWS GuardDuty and AWS Config to report on insecure configurations.
- Using automated tools like AWS CloudTrail and CloudWatch Alarms to alert when sensitive data is accessed.

## Implementations

We currently have no implementations of these guidelines.

- An [implementation for CloudWatch Logs](./logging_to_cloudwatch_logs.md) has been proposed.

docs/rfc/application_logging/logging_to_cloudwatch_logs.md (65 additions, 0 deletions)

# RFC: Logging to CloudWatch Logs

## Proposal Overview

We should define a pattern of infrastructure resources that provides a unified experience for applications sending their logs to CloudWatch Logs. This should come in the form of a [tb_pulumi](https://github.com/thunderbird/pulumi/) module that can be used to stamp out logging destinations compliant with our [guidelines](./application_logging_guidelines.md) across our various infrastructure projects. Existing projects that send logs to CloudWatch Logs through more or less default configurations can be adjusted to use the new pattern, bringing them into compliance.

## Rationale

CloudWatch Logs can be configured to meet all of our guidelines pertaining to logging targets:

- KMS keys provide encryption at rest for log data.
- KMS encryption keys can be set to auto-rotate at a custom interval.
- The AWS API provides encryption for log data in transit.
- Log data is passed through a VPC endpoint to the CloudWatch Logs service over a private network.
- Log streams are individually policeable resources, allowing granular access control.
- Log groups are also access-controllable, increasing flexibility in access control.
- IAM actions related to these groups and streams can be tracked through a CloudTrail event trail and alerted on with CloudWatch Alarms.
- Policies granting access to these logs can be crafted around the logical separators of environment and application. IAM user groups with these policies applied can be created to control individuals' log access.
- CloudWatch log groups can be configured with a retention window, automating the deletion of log data according to our guidelines.

## Implementation Details

We should implement this as a tb_pulumi module: `tb:cloudwatch:LoggingDestination`. This module should implement default values that align with our guidelines.

It should define the following resources:

- A [KMS Key](https://www.pulumi.com/registry/packages/aws/api-docs/kms/key/) to handle encryption at rest for the logs. (Example: `mailstrom-logs-stage`)
- A [CloudWatch LogGroup](https://www.pulumi.com/registry/packages/aws/api-docs/cloudwatch/loggroup/) for the environment. In keeping with AWS's log group naming conventions, this might be called something like `/tb/mailstrom/stage` for the Mailstrom/Stalwart staging environment.
- A [CloudWatch LogStream](https://www.pulumi.com/registry/packages/aws/api-docs/cloudwatch/logstream/) for each application. This division is somewhat arbitrary and can be broken up however it makes sense (e.g., `/tb/mailstrom/stage/stalwart/mail` vs. `/tb/mailstrom/stage/stalwart/management-api`).
- A set of [IAM Policies](https://www.pulumi.com/registry/packages/aws/api-docs/iam/policy/) allowing various levels of access to these streams. Applications will need a write-access policy; users will need read-access policies. There should be a level of customization here, allowing engineers to design access in ways that make sense for their use case. These policies can be attached to any existing set of permissions to extend access to logs.
- A [CloudTrail Event Trail](https://www.pulumi.com/registry/packages/aws/api-docs/cloudtrail/trail/) set up with an appropriate filter for auditing log access.
- A [CloudWatch Alarm](https://www.pulumi.com/registry/packages/aws/api-docs/cloudwatch/metricalarm/) to alert when log access is detected.

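To make the proposal concrete, the core of the module's resource definitions might look roughly like the following, using the standard pulumi-aws provider directly. This is a sketch only: the resource names and defaults are illustrative, not the final tb_pulumi API, and the IAM policies, CloudTrail trail, and alarm are omitted for brevity.

```python
import pulumi_aws as aws

# Illustrative sketch of a LoggingDestination's core resources.
# Names and defaults are placeholders, not the final tb_pulumi surface.
key = aws.kms.Key(
    "mailstrom-logs-stage",
    description="Encrypts /tb/mailstrom/stage log data at rest",
    enable_key_rotation=True,  # guideline: auto-rotating encryption keys
)

group = aws.cloudwatch.LogGroup(
    "mailstrom-stage-logs",
    name="/tb/mailstrom/stage",
    kms_key_id=key.arn,        # encryption at rest via the KMS key above
    retention_in_days=7,       # guideline: 3 days in production, 7 elsewhere
)

stream = aws.cloudwatch.LogStream(
    "stalwart-mail-logs",
    name="stalwart/mail",      # one stream per application
    log_group_name=group.name,
)
```

Because the retention window and KMS key are baked into the module's defaults, every project that stamps out a destination with it inherits the guidelines without per-project effort.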
We have two primary use cases for this in our current system:

1. We need to [aggregate Stalwart logs](https://github.com/thunderbird/mailstrom/issues/196).
2. We need to apply our logging guidelines to existing services that produce logs.

In both cases, we need a target log stream with our retention, encryption, and other rules applied. In the first case, we can use fluent-bit to ship logs from the Docker containers running Stalwart straight to a log group created with the common pattern. In the second case, we can create new log destinations with the new pattern, then update the existing container definitions to use the new log streams.

We might think about using:

We could create these roles and resources based on the service and environment names, like `service-env-fdsaf-Profile`, for example.
Yeah, my thinking here is to leave the implementation at the Policy level so that you can attach those Policies to any permissions model you want to run. You can do the users -> user groups -> policies model or the policy -> role -> OIDC auth model; whatever works for the org, and it's flexible.
Sure, that makes sense. I'm just advocating for defining both security profiles; the ones emitting logs to CloudWatch and any entity reading them can be separate. That way readers can't modify and apps can't read.
Actually, I would throw in a third role for lambdas or other actions that need to write to clear logs, as an example. Some resources require permissions on both ends.