PagerDuty incident resolution support #131

Open

ivachkov wants to merge 2 commits into klen:develop from ivachkov:develop

ivachkov commented Oct 11, 2016

PagerDuty incidents are generated with messages like:

[BEACON] CRITICAL <Test Wildcard Alert> (stats.gauges.server2.data) failed. Current value: 1.0

"Back to normal" messages look like:

[BEACON] NORMAL <Test Wildcard Alert> (stats.gauges.server1.data) is back to normal.

From those, I figured that the information unique to an incident is the combination of the alert name (<Test Wildcard Alert>) and the metric name ((stats.gauges.server1.data)). To avoid storing data on the file system, I decided to generate a hash from those two. Using this hash value as the incident key, incidents can be triggered and resolved in a stateless way.
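A minimal sketch of the hashing idea described above, not the code from this PR. The helper name `incident_key_from_message`, the way the metric is extracted from the parentheses, and the `:` separator are assumptions for illustration; only the alert-name slicing mirrors the diff under review.

```python
import hashlib


def incident_key_from_message(message):
    """Derive a stable incident key from a beacon message (hypothetical helper)."""
    # Alert name sits between the first "<" and the first ">".
    alert_name = message[message.find("<") + 1:message.find(">")]
    # Assumption: the metric name sits between the first "(" and the first ")".
    alert_metric = message[message.find("(") + 1:message.find(")")]
    # Assumption: join the two parts with ":" before hashing.
    combined = "{}:{}".format(alert_name, alert_metric)
    return hashlib.md5(combined.encode("utf-8")).hexdigest()


critical = ("[BEACON] CRITICAL <Test Wildcard Alert> "
            "(stats.gauges.server1.data) failed. Current value: 1.0")
normal = ("[BEACON] NORMAL <Test Wildcard Alert> "
          "(stats.gauges.server1.data) is back to normal.")

# Both messages refer to the same alert and metric, so the keys match.
same_key = incident_key_from_message(critical) == incident_key_from_message(normal)
```

Because the key depends only on the alert name and the metric name, the "trigger" and "resolve" events for the same metric share a key without any state being stored. Parsing the message text is fragile if an alert name itself contains `<`, `>`, or parentheses.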

The following tests were performed:

Tested alerts with exact metric match ("query": "stats.gauges.test")
Tested wildcard metrics alerts ("query": "stats.gauges.*.data")

ivachkov added 2 commits October 11, 2016 05:49

11d2f66 implementation of incident resolution for PagerDuty

1d8a391 fixing tabs to spaces and alignments

garrettheel reviewed

Collaborator

garrettheel left a comment

Thanks for contributing!

Can you help me understand what the original issue was that prompted you to do this? As far as I can tell, we're already using rule['raw'] as the incident_key, which should allow for stateless resolution.

graphite_beacon/handlers/pagerduty.py

-                          event_type = 'resolve'
+                      # Extract unique alert identifiers
+                      alert_name = message[message.find("<")+1:message.find(">")]

Collaborator

garrettheel Oct 15, 2016

Can you use the context passed into the function for the alert name and metric name instead of pulling it out of the message?

graphite_beacon/handlers/pagerduty.py

+                      h.update(alert_metric)
+                      # Use hash as incident key to support resolution
+                      incident_key = h.hexdigest()

Collaborator

garrettheel Oct 15, 2016

Is there any benefit to md5ing these? Why not just do "{alert_name:alert_metric}"?
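For comparison, the unhashed alternative raised in this comment could look roughly like the sketch below. The function name `plain_incident_key` and its two parameters are hypothetical; how the alert name and metric name reach the handler is not shown in this thread.

```python
def plain_incident_key(alert_name, alert_metric):
    """Build a human-readable incident key (hypothetical helper)."""
    # No hashing: the key is simply "<alert name>:<metric name>".
    return "{}:{}".format(alert_name, alert_metric)


key = plain_incident_key("Test Wildcard Alert", "stats.gauges.server1.data")
```

A readable key is just as deterministic as the md5 digest, and it is easier to recognise when debugging. The digest mainly buys a fixed length.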

graphite_beacon/handlers/pagerduty.py

+                      # Use hash as incident key to support resolution
+                      incident_key = h.hexdigest()
+                      if level == 'critical':

Collaborator

garrettheel Oct 15, 2016

if level in ['critical', 'warning']:

graphite_beacon/handlers/pagerduty.py

+                          event_type = "resolve"
                       else:
-                          event_type = 'trigger'
+                          return

Collaborator

garrettheel Oct 15, 2016

Is there a reason you're changing this?
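One possible reading of the two comments above, sketched as a standalone function. The name `event_type_for_level` is hypothetical, and treating unknown levels as "send nothing" is an assumption rather than behaviour confirmed in this thread.

```python
def event_type_for_level(level):
    """Map a beacon alert level to a PagerDuty event type (hypothetical helper)."""
    if level in ('critical', 'warning'):
        # Both failing levels open (or update) an incident.
        return 'trigger'
    if level == 'normal':
        # "Back to normal" closes the matching incident.
        return 'resolve'
    # Assumption: any other level produces no PagerDuty event.
    return None
```

With this mapping, a `warning` alert still pages, which the `if level == 'critical':` version in the diff would not do.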

graphite_beacon/handlers/pagerduty.py

                           "event_type": event_type,
                           "description": message,
                           "details": message,
-                          "incident_key":  rule['raw'] if rule is not None else 'graphite connect error',

Collaborator

garrettheel Oct 15, 2016

Looks like this logic has been lost, can you re-add it?
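A sketch of what keeping the fallback could look like. The `build_payload` function and its parameters are hypothetical; only the `rule['raw']` expression and the dictionary keys visible in the diff come from the source, and `service_key` is assumed here as the usual PagerDuty Events API field.

```python
def build_payload(service_key, event_type, message, rule=None):
    """Assemble a PagerDuty event body (hypothetical helper)."""
    # Preserve the original fallback: without a rule (for example when
    # Graphite itself is unreachable) use a fixed key.
    incident_key = rule['raw'] if rule is not None else 'graphite connect error'
    return {
        "service_key": service_key,
        "event_type": event_type,
        "description": message,
        "details": message,
        "incident_key": incident_key,
    }


payload = build_payload("hypothetical-service-key", "trigger", "connection failed")
```

The fixed key matters because connection errors have no rule to derive a key from. Without it, each failure could open a separate incident.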

