It’s easy for on-call engineers to become overwhelmed by alerts, especially as cloud environments continue to scale at a rapid pace. In fact, in a recent SecOps survey, 83% of professionals reported struggling with the inundation of security alerts.
However, context is king, and the right tools and integrations can help reduce alert fatigue by surfacing smarter insights so you can make more informed decisions. At Squadcast, we recognize just how important this is, so with key partners like Threat Stack, we help you resolve incidents quickly by providing maximum context around incoming alerts.
Anyone who has worked with cloud infrastructure long enough will understand the need to closely monitor system health. Since systems are always vulnerable to cyber attacks and internal failures, it is wise to have tools that monitor system health and performance.
For instance, a company whose infrastructure is on AWS may use native monitoring tools, like AWS CloudWatch, for basic monitoring purposes. It can also add dedicated tools for monitoring critical services and overall infrastructure health, which together make up its ‘Observability Stack’.
Threat Stack is one such security monitoring tool. It works at the application and infrastructure layers and looks out for anything suspicious. For us at Squadcast, Threat Stack is very much an ally in tackling such issues smoothly.
Quickly addressing security and compliance incidents is critical to the success of today’s modern SaaS business. If a website or service is down for even a minute, it can potentially cause significant revenue loss, damage to brand reputation, and distrust among customers.
This is where a modern incident response platform such as Squadcast comes into the picture. Threat Stack, in turn, provides full-stack cloud security, observability, and compliance for infrastructure and applications.
When Threat Stack detects a security risk or anomaly, it acts as the alert source and sends an alert to Squadcast. Leveraging Site Reliability Engineering (SRE) best practices, Squadcast aggregates these alerts and routes them to the on-call engineer. The SRE and Security teams, however, act only after such incidents have been reported.
There can be multiple teams responsible for different components of the infrastructure. Since Squadcast natively integrates with various Application Performance Monitoring (APM), logging, and error tracking tools, it can notify the appropriate team by intelligently routing alerts, and it helps them collaborate in real time within Squadcast to deal with incidents.
In order to fully understand the depth of the Squadcast + Threat Stack integration, let's take the earlier example of a customer whose infrastructure is on AWS. In this scenario, Threat Stack observes the various layers of modern infrastructure to detect a wide variety of behaviors within your environment. Combined with AWS CloudWatch, these solutions let you monitor the infrastructure and application stack and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs). If and when an SLO is breached, alerts are sent to the Squadcast platform.
Put simply, whenever there is an alert in the Threat Stack platform, the configured Webhook for Squadcast is signaled, and an incident is created. Similarly, if Threat Stack sends more than one alert, the Squadcast platform can organize and group alerts (deduplication), providing full context to users working on resolving the incident. Let’s explore why this is beneficial.
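To make the grouping idea concrete, here is a minimal Python sketch of deduplication. It is not Squadcast's implementation: the payload fields (`rule`, `host`, `ts`) and the choice of grouping key are assumptions made purely for illustration.

```python
from collections import defaultdict


def dedup_key(alert):
    # Hypothetical grouping key: alerts that share a rule and a host
    # are treated as belonging to the same incident.
    return (alert["rule"], alert["host"])


def group_alerts(alerts):
    """Group incoming alerts into incidents keyed by dedup_key."""
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[dedup_key(alert)].append(alert)
    return dict(incidents)


# Illustrative alert payloads (field names are assumptions).
alerts = [
    {"rule": "ssh-login-failure", "host": "web-1", "ts": 1},
    {"rule": "ssh-login-failure", "host": "web-1", "ts": 2},
    {"rule": "file-integrity", "host": "db-1", "ts": 3},
]
incidents = group_alerts(alerts)
```

Here the two `ssh-login-failure` alerts from `web-1` collapse into a single incident, while the `file-integrity` alert opens its own, so the responder sees two incidents instead of three notifications.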
The first step towards doing better incident management is adding enough context to incidents as they get detected.
Tags can be added to the incident payloads coming from Threat Stack into Squadcast to make alerts more context-rich. Examples of such tags are incident priority, incident severity, and environment name. They give engineers more context when they receive the alert, which in turn speeds up incident response and ultimately reduces MTTR.
However, factors such as the urgency required to solve the incident, or how the incident can affect other parts of the system, may not be taken into account when assigning the severity of an incident. Most solutions only allow incident severity as the form of classification, and in some cases severity is assigned manually rather than automatically based on the incoming alert context. Some incident management tools attempt to solve this by adding other forms of classification, such as incident urgency and incident priority.
With Squadcast, there is an added layer of flexibility that lets you define rules to classify alerts as Sev-1, Sev-2, or Sev-3. This rule-based auto-tagging system in Squadcast allows you to classify incidents as and when they are raised, thus making the alert notification more context-rich.
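A rule-based tagger of this kind could look roughly like the following Python sketch. The rule format, the numeric `severity` field on the payload, and the tag names are hypothetical and do not reflect Squadcast's actual rule syntax.

```python
# Hypothetical rules: (predicate over the payload, tags to attach).
# Rules are checked in order and the first match wins.
RULES = [
    (lambda p: p.get("severity") == 1, {"severity": "Sev-1", "priority": "P1"}),
    (lambda p: p.get("severity") == 2, {"severity": "Sev-2", "priority": "P2"}),
]

# Tags applied when no rule matches (an assumption for this sketch).
DEFAULT_TAGS = {"severity": "Sev-3", "priority": "P3"}


def tag_incident(payload):
    """Return a copy of the payload with a 'tags' dict attached."""
    for matches, tags in RULES:
        if matches(payload):
            return {**payload, "tags": dict(tags)}
    return {**payload, "tags": dict(DEFAULT_TAGS)}


tagged = tag_incident({"title": "Suspicious login", "severity": 1})
```

Because tagging happens as the payload arrives, the engineer who receives the notification already sees the severity and priority without having to triage by hand.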
Alerting specific personnel is sometimes highly critical. By using tags, alerts can be forwarded immediately to a specific team or person with the help of Routing Rules. These flexible conditional routing rules are based on incident properties and, together with multiple notification modes, help reduce alert fatigue, resulting in faster time to detect and resolve. For instance, when a firewall breach is detected, the Infrastructure Security team can be alerted immediately.
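As a rough illustration of conditional routing on incident properties, consider the Python sketch below. The tag keys (`category`, `severity`) and the team names are invented for the example and are not Squadcast identifiers.

```python
def route(incident):
    """Pick a team for an incident based on its tags (hypothetical rules)."""
    tags = incident.get("tags", {})
    # A firewall breach goes straight to the Infrastructure Security team.
    if tags.get("category") == "firewall-breach":
        return "infrastructure-security"
    # Any other Sev-1 incident goes to the primary SRE team.
    if tags.get("severity") == "Sev-1":
        return "sre-primary"
    # Everything else lands with whoever is on call.
    return "on-call-default"


team = route({"tags": {"category": "firewall-breach", "severity": "Sev-1"}})
```

The order of the checks matters: the more specific condition is evaluated first, so a Sev-1 firewall breach reaches the security team rather than the general SRE queue.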
We can also define Escalation Policies according to the Severity Level of the incident. For example, a high-severity incident needs to be escalated to a specific person or team immediately for fixing. Lower-severity incidents, however, can be looked into by the scheduled on-call engineer without any time constraints.
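One way to picture severity-based escalation is as an ordered list of steps per severity level. The Python sketch below assumes made-up delays and responder names; it only illustrates the idea and is not how Squadcast stores its policies.

```python
# Hypothetical policies: severity -> ordered steps of
# (minutes after the incident triggers, who to notify).
ESCALATION_POLICIES = {
    "Sev-1": [(0, "security-lead"), (5, "engineering-manager")],
    "Sev-2": [(0, "on-call-engineer"), (30, "security-lead")],
    "Sev-3": [(0, "on-call-engineer")],
}


def notified_so_far(severity, minutes_unacknowledged):
    """List everyone who should have been notified by now
    for an incident that is still unacknowledged."""
    # Unknown severities fall back to the Sev-3 policy (an assumption).
    steps = ESCALATION_POLICIES.get(severity, ESCALATION_POLICIES["Sev-3"])
    return [who for delay, who in steps if minutes_unacknowledged >= delay]


sev1_after_six_minutes = notified_so_far("Sev-1", 6)
```

In this sketch a Sev-1 incident reaches a second responder after five unacknowledged minutes, while a Sev-3 incident stays with the scheduled on-call engineer however long it waits.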
As a general rule of thumb, higher transparency not only results in a better incident management and response process, but more importantly it increases trust between team members and helps them better troubleshoot what went wrong before planning next steps to resolve the issue.
Teams can level up their reliability and transparency with Squadcast’s free public Status Page.
Status Pages (either Public or Private) can help communicate the status of your services internally to other teams or externally to your customers/stakeholders at all times. This can be done by configuring your services and their dependencies to show their status in real-time on the Status Page.
Together, Threat Stack and Squadcast enable a holistic approach to security and compliance incident management via full transparency into risk and real-time response capabilities, all of which minimize friction in the incident response lifecycle. Likewise, with better collaboration and transparency, the overall reliability of critical IT systems and services improves significantly.
“Blameless postmortems are a tenet of SRE culture. You can’t "fix" people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems”
Many incidents can be quickly rectified with tools like infrastructure automation, runbooks, feature flags, version control, and continuous delivery, and people can be kept in the loop with ChatOps and status pages. These actions, though helpful for fixing the situation at hand, do not really help you understand what failed and why. And understanding what failed and why is a crucial step towards preventing similar occurrences going forward.
This is where incident postmortems come in - the next logical step after any incident is to dissect and analyze the why, how and the what of the incident.
Ensuring a Blameless Postmortem for critical incidents gives the team increased confidence to escalate issues without fear in the future. Blameless postmortems are generally challenging to write, since a postmortem clearly identifies the actions that led to the incident.
To help teams develop a culture around blameless incident postmortem reviews, give them an easy, automated way to capture incident information and publish the final report using reusable checklists and templates. This could potentially make incident postmortem meetings less dreadful.
Squadcast’s incident postmortem feature helps build an insightful timeline in a matter of minutes. This is especially useful as automation ensures that you can quickly have a system-generated postmortem for pretty much any incident.
Squadcast’s reporting and analytics feature will reveal the team’s performance in acknowledging and resolving alerts, and help understand the areas for improvement. It can help you visualize and analyse the distribution of incidents across services for a specified period of time along with the status of each service.
As the number of incidents grows, patterns will emerge that help you double down on frequent issues. You can do exploratory data analysis using graphical representations to understand more about past incidents. This data can also be filtered by the Tags that the incidents carry, such as Severity Tags, Alert Source, Status, Date & Time, etc., and then exported.
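If you pull incident data out for your own analysis, filtering by tags before export is straightforward. The Python sketch below works on a plain list of dictionaries; the field names are assumptions and it does not call any Squadcast export API.

```python
import csv
import io

# Columns to include in the export (hypothetical field names).
FIELDS = ["id", "severity", "alert_source", "status"]


def export_csv(incidents, **filters):
    """Keep incidents whose fields match every filter, then render them as CSV text."""
    selected = [
        i for i in incidents
        if all(i.get(key) == value for key, value in filters.items())
    ]
    buffer = io.StringIO()
    # extrasaction="ignore" drops any fields not listed in FIELDS.
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(selected)
    return buffer.getvalue()


incidents = [
    {"id": 1, "severity": "Sev-1", "alert_source": "Threat Stack", "status": "resolved"},
    {"id": 2, "severity": "Sev-3", "alert_source": "Threat Stack", "status": "resolved"},
    {"id": 3, "severity": "Sev-1", "alert_source": "CloudWatch", "status": "triggered"},
]
report = export_csv(incidents, severity="Sev-1", alert_source="Threat Stack")
```

Only the first incident matches both filters, so `report` contains the header row followed by that single record.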
While the features mentioned above will help you get the best out of the integration between Squadcast and Threat Stack, there are a few best practices which you should keep in mind to enable smoother incident response:
Combining the power of Threat Stack and Squadcast will help you quickly drive critical alerts to cybersecurity professionals. This can contribute enormously to driving down KPIs like MTTA (Mean-Time-To-Acknowledge) and MTTR (Mean-Time-To-Respond).
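For readers who want to track these KPIs themselves, here is a small Python sketch of how they can be computed from incident timestamps. The field names and sample data are invented, and measuring MTTR from creation to resolution is an assumption of this sketch; adjust the end timestamp to match your own definition.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_field, end_field):
    """Average gap in minutes between two timestamps across incidents."""
    return mean(
        (i[end_field] - i[start_field]).total_seconds() / 60 for i in incidents
    )


# Illustrative incidents (timestamps and field names are assumptions).
incidents = [
    {
        "created": datetime(2021, 1, 1, 10, 0),
        "acknowledged": datetime(2021, 1, 1, 10, 4),
        "resolved": datetime(2021, 1, 1, 10, 30),
    },
    {
        "created": datetime(2021, 1, 1, 12, 0),
        "acknowledged": datetime(2021, 1, 1, 12, 6),
        "resolved": datetime(2021, 1, 1, 12, 50),
    },
]

mtta = mean_minutes(incidents, "created", "acknowledged")  # 5.0 minutes
mttr = mean_minutes(incidents, "created", "resolved")      # 40.0 minutes
```

Tracking both numbers over time shows whether richer alert context and tighter routing are actually shortening the path from detection to resolution.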
If you are interested in unifying these two tools, Squadcast and Threat Stack support teams are available to help you achieve success. If you have other best practices to share or just need help with the integration set-up, feel free to drop a line to our Support Team. For more information on Threat Stack and our Cloud Security Platform, check out: https://www.threatstack.com/cloud-security-platform.