Operating Datadog at Scale

August 25, 2022

Prelude

Thinking back on my years in the Datadog realm, both as a Datadog employee and as a Datadog partner, I’ve had countless conversations with customers across many industries and company sizes, and I’ve noticed common themes. As the value of Datadog becomes clear, a common line of follow-up questions emerges, which can be summarized as: how do we operationalize our work with Datadog at an organization of our size? Before we dive into the nitty-gritty, I want to point out that the main focus of this blog is to talk through the methods and strategies our team uses to maximize the value of Datadog by reducing the number of separate Datadog accounts within one company.

With that said, it’s important to note that Datadog sub-organizations sometimes cannot be avoided due to regulatory or legal requirements. This blog is meant to offer a perspective to weigh when making your Datadog account structuring decisions. Our team is happy to engage with your team directly for more pointed discussions around your individual business needs.

Datadog Governance at Scale

Striving for greater agility, velocity, and resilience has led many organizations to advocate for and adopt a self-service approach to development. In promoting this approach, organizations aim to equip dev, ops, and security teams with the tools they need to do their jobs successfully and efficiently. As the leader in the observability space, Datadog’s monitoring platform is a key driver not only in breaking down the silos that exist in IT organizations but also in promoting developer ownership of monitoring and improved service reliability. As organizations move toward a self-service platform approach, several questions arise:

  1. How do we enable self-service monitoring while ensuring our account has order and some level of guardrails on it? 
  2. How do we prevent unintended edits to, or deletion of, other teams’ monitors, dashboards, etc.?
  3. How do we enforce tagging standards?
  4. How should we onboard our hundreds of cloud sub-accounts into one Datadog account? 
  5. Can we accurately track and “bill-back” usage to application owners, teams, or business units? 

Thinking about a Sub-Organization structure  

When approaching these questions, many companies quickly jump to structuring their Datadog accounts the way they’ve structured their self-service cloud environments: by creating a sub-organization account structure.

Is a sub-organization account structure a solution for your Datadog usage, or does it create more dilemmas? There are many seemingly valid reasons to separate business units, teams, applications, etc. into different Datadog accounts that roll up into one billing account, but what problems emerge because of this choice?

1. Separation of team, business unit, and application data

You chose Datadog for a reason, and there’s a strong possibility it was to break down the silos that exist within your monitoring data and, more importantly, within your organization. Datadog was founded and built to enable Dev and Ops teams to collaborate on troubleshooting and proactively improve your applications’ resilience and reliability. With the addition of more tools, like Security, UX, Network Monitoring, Cloud Cost monitoring, etc., Datadog is no longer just bringing Dev and Ops teams together. It’s bringing Dev, Ops, Security, Product, and even business teams into one platform. By splitting your account structure, and thus separating your data, you detract from a major source of the value that likely drew your team to Datadog in the first place: bringing more teams into a single location.

2. Centralized SRE, NOC, and CCOE teams will need to pivot between different Datadog accounts when troubleshooting. 

Without a doubt, one of the biggest reasons your organization is seeking to improve its monitoring is to improve your customer experience, whether for internal or external customers. Outages and issues are inevitable, and a monitoring solution should help you decrease your time to detection and resolution. Datadog’s ability to provide rich context on incidents when they occur can help decrease time to resolution and improve resilience when these scenarios occur again. With that said, if your teams need to sign out of one account and into another to reach the monitoring data related to an incident, they lose valuable time in the middle of active troubleshooting.

3. Difficulty building at-a-glance executive overviews

Though this one sounds obvious, it is a headache you might not have seen coming. Executives want overview dashboards that show them at a glance how their systems and applications are performing across all business units. When data is separated into different accounts, someone will be tasked with consolidating it into an easy-to-interpret report that needs to be sent periodically.

4. You lose the benefits of group learning across the organization

This is similar to point 1, but worthy of a separate mention. The old saying attributed to Aristotle, “The whole is greater than the sum of its parts,” holds true in the monitoring world. Something as simple as how another team organizes its dashboards or collects custom metrics can help other teams spot opportunities for improvement in their own monitoring.

Operating at scale with a Single Datadog Account

For organizations that go the single Datadog account route, the team at RapDev has found an effective combination of implementation strategies and RapDev-built utilities that address the common concerns an enterprise has.

Monitor Datadog Tagging with Datadog

A critical component of harnessing the power of Datadog at scale is undoubtedly tagging. Tags are the bread and butter of what makes Datadog so powerful in the age of ephemeral and distributed systems. With tags, slicing and dicing your data becomes simple, ensuring new systems/applications that come online are properly monitored becomes automated, and gaining greater context when troubleshooting becomes natural. 

The criticality of tagging is the reason why, when RapDev engages with ANY customer, tags are an early core part of our engagements, regardless of the customer’s maturity with Datadog. Through discovery and learning about an organization’s business, applications, and team makeup, we help customers determine required and suggested tags that will be applied to all resources being monitored by Datadog. 

Once you have defined your tagging strategy, though, how do you ensure all the teams using Datadog are tagging their resources properly?

This is where RapDev’s first utility integration comes into play. RapDev has built a Tag Validator integration that effectively uses Datadog to monitor Datadog. The RapDev validator provides you with both a high-level and a granular understanding of which resources are in or out of compliance with your established tagging strategy. With some customers, we’ve taken this integration one step further and provide individual teams with monthly summaries showing which of their resources need to be updated with required and suggested tags. The RapDev Tag Validator is available as a Datadog Marketplace integration.
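To make the idea of tag compliance concrete, here is a minimal sketch in Python. It is not the RapDev Tag Validator itself; the required and suggested tag keys, the function names, and the sample resources are all hypothetical, and it only assumes that tags arrive as Datadog-style `key:value` strings.

```python
# Hypothetical tagging policy; a real policy comes out of discovery with each team.
REQUIRED_TAG_KEYS = {"env", "team", "service"}
SUGGESTED_TAG_KEYS = {"cost_center"}


def tag_keys(tags):
    """Return the set of keys found in a list of 'key:value' tag strings."""
    return {tag.split(":", 1)[0] for tag in tags if ":" in tag}


def check_resource(tags):
    """Compare one resource's tags against the policy."""
    keys = tag_keys(tags)
    return {
        "compliant": REQUIRED_TAG_KEYS <= keys,
        "missing_required": sorted(REQUIRED_TAG_KEYS - keys),
        "missing_suggested": sorted(SUGGESTED_TAG_KEYS - keys),
    }


def compliance_report(resources):
    """Build a per-resource report from a mapping of resource name -> tag list."""
    return {name: check_resource(tags) for name, tags in resources.items()}


if __name__ == "__main__":
    sample = {
        "web-01": ["env:prod", "team:payments", "service:checkout"],
        "db-01": ["env:prod"],
    }
    for name, result in compliance_report(sample).items():
        print(name, result)
```

A report like this can be grouped by the `team` tag to produce the kind of per-team summary described above, with untagged resources collected into their own bucket.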

Build/Edit Dashboards and Monitors without Fear

For self-service monitoring, enabling individuals to build their own customized Dashboards or Monitors is a requirement. How can you do this without the fear that someone will delete or edit a critical Dashboard or Monitor? 

RapDev helps Datadog customers ease this common fear through a combination of another Datadog utility integration and custom configuration of Datadog.

By assisting our customers in configuring custom role mapping rules, we help enterprises provide each team with the permissions needed to embrace and own their monitoring. We then use these same custom roles so that teams can limit the ability to edit their dashboards/monitors to members of their team’s custom role. This helps teams curb the majority of accidental changes to or deletions of critical monitors and dashboards, but it can’t prevent all of them.

For cases where a Dashboard or Monitor is changed or deleted and you want to restore a previous version, RapDev has built a Datadog utility integration that automates the backup of all your monitors, dashboards, and even synthetic tests. When you need to restore a previous version of a monitor or dashboard, you can use RapDev’s script to restore from these snapshots.

As a catchall, we also work with our customers to configure a Datadog audit monitor that notifies the necessary team members if a Monitor or Dashboard is edited or deleted.

If you’re interested in checking it out, the RapDev Backup utility is available as a Datadog Marketplace integration.
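The snapshot-and-restore idea can be sketched in a few lines of Python. This is an illustration only, not RapDev’s backup utility or restore script: the function names, the directory layout, and the timestamp format are assumptions, and fetching definitions from Datadog (or pushing a restored one back) is deliberately left out.

```python
import json
from pathlib import Path


def save_snapshot(backup_dir, kind, definitions, stamp):
    """Write a list of definitions to <backup_dir>/<kind>/<stamp>.json.

    `stamp` is assumed to sort chronologically as text, e.g. "20220825T120000Z".
    """
    target = Path(backup_dir) / kind
    target.mkdir(parents=True, exist_ok=True)
    path = target / f"{stamp}.json"
    path.write_text(json.dumps(definitions, indent=2, sort_keys=True))
    return path


def load_version(backup_dir, kind, object_id, stamp=None):
    """Return the saved definition of one object.

    With `stamp`, look only in that snapshot; otherwise search snapshots from
    newest to oldest. Returns None if the object is not found.
    """
    snapshots = sorted((Path(backup_dir) / kind).glob("*.json"), reverse=True)
    for path in snapshots:
        if stamp is not None and path.stem != stamp:
            continue
        for definition in json.loads(path.read_text()):
            if definition.get("id") == object_id:
                return definition
    return None
```

Keeping one file per run makes it easy to answer the question of what a monitor looked like before a given change, at the cost of storing unchanged definitions repeatedly.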

Bill-back your Datadog Costs

With both single-org and multi-org Datadog account structures, many companies want to track the ongoing cost of usage by team, application, or other business-specific variables. To track this effectively for bill-back purposes, tagging and log faceting play an integral role. Through tagging, log faceting, and custom dashboards, RapDev is helping many organizations establish and improve their bill-back practices for their Datadog usage.

The first step in improving any customer’s bill-back posture is leveraging the usage attribution functionality that Datadog offers out of the box for Enterprise customers. It’s important to note that you can apply only up to 3 tags to these billing metrics using this tool, and that it doesn’t assist in billing back for a few of the offerings available on Datadog’s platform, namely Logging, RUM, and Span indexing.

For the Datadog offerings where the usage attribution functionality will not help, RapDev’s experienced consultants approach the bill-back requirements from a different angle. With logging, for example, we’ve leveraged everything from Datadog’s Enrichment Table functionality to custom Log processors, and have even advised on application-side changes to log writing. Once the business context is applied to the logs, we configure custom Datadog dashboards that visualize each business unit’s Datadog usage and can correlate it to monthly and annual bills.
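Once logs carry business context, the bill-back arithmetic itself is a proportional split. The sketch below is one hypothetical way to do it in Python; the `team` and `bytes` field names, the function names, and the dollar figure are assumptions for illustration, not Datadog’s billing model or RapDev’s dashboards.

```python
from collections import defaultdict


def usage_by_team(log_events, tag_key="team"):
    """Sum ingested bytes per team; events without the tag land in 'untagged'."""
    totals = defaultdict(int)
    for event in log_events:
        team = event.get(tag_key) or "untagged"
        totals[team] += event["bytes"]
    return dict(totals)


def allocate_cost(usage, total_cost):
    """Split total_cost across teams in proportion to their share of usage."""
    total = sum(usage.values())
    if total == 0:
        return {team: 0.0 for team in usage}
    return {team: round(total_cost * used / total, 2) for team, used in usage.items()}


if __name__ == "__main__":
    events = [
        {"team": "payments", "bytes": 600},
        {"team": "search", "bytes": 300},
        {"bytes": 100},  # no business context applied yet
    ]
    print(allocate_cost(usage_by_team(events), 5000.0))
```

Keeping an explicit "untagged" bucket is a design choice: it shows how much of the bill cannot yet be attributed, which ties the bill-back effort back to tagging compliance. Because each share is rounded to cents, the shares may differ from the total by a cent or two on real data.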

Through RapDev’s Enterprise Datadog implementation services and the periodic usage reviews provided in our Datadog Whisperer offering, we assist customers in tuning Datadog to address their business requirements.  

Solving common problems

Naturally, other industry-wide dilemmas are bound to arise when it comes to scaling your use of Datadog, and our team at RapDev is committed to solving these problems through engineering and automation.

A final example that comes to mind is automating the onboarding of hundreds of individual AWS accounts into Datadog. Following AWS best practices for a well-architected environment, many of our customers isolate resources and workloads into multiple AWS accounts. For these customers in particular, this practice led to the creation of hundreds of individual AWS accounts that still required centralized monitoring. To automate this process while still maintaining a strong security posture, RapDev’s team built a Lambda function to automate the collection of monitoring data as existing and new AWS accounts come online. Having worked with many customers with different variations of this problem, we’ve continued to improve on the Lambda function. 
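The core decision such automation has to make is which accounts still need onboarding. The Python sketch below shows only that step and is not RapDev’s Lambda function: the function name and the exclusion list are hypothetical, and the account dictionaries are assumed to carry the `Id` and `Status` fields that AWS Organizations returns when listing accounts. Listing accounts and registering them with Datadog are left out.

```python
def accounts_to_onboard(org_accounts, integrated_ids, excluded_ids=frozenset()):
    """Return sorted IDs of active accounts that are not yet monitored.

    org_accounts:   iterable of dicts with at least "Id" and "Status" keys
    integrated_ids: set of account IDs already connected to Datadog
    excluded_ids:   hypothetical opt-out list, e.g. sandbox accounts
    """
    return sorted(
        account["Id"]
        for account in org_accounts
        if account.get("Status") == "ACTIVE"
        and account["Id"] not in integrated_ids
        and account["Id"] not in excluded_ids
    )


if __name__ == "__main__":
    accounts = [
        {"Id": "111111111111", "Name": "payments-prod", "Status": "ACTIVE"},
        {"Id": "222222222222", "Name": "search-prod", "Status": "ACTIVE"},
        {"Id": "333333333333", "Name": "old-poc", "Status": "SUSPENDED"},
    ]
    print(accounts_to_onboard(accounts, integrated_ids={"111111111111"}))
```

Running a check like this on a schedule, or whenever a new account is created, is what lets new accounts come under monitoring without a manual step.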

With RapDev consultants solving common problems like the ones outlined throughout this blog, we are always looking for opportunities to help our customers benefit from our pooled knowledge and our ever-expanding, ever-improving solutions. To make these solutions and our Datadog utilities available to more than just our implementation and marketplace customers, we’ve established a premium support offering called Datadog Whisperer, which includes access to consolidated repos. The Datadog Whisperer repos are chock-full of a growing number of solutions, ranging from Lambda functions to Dashboard/monitor templates and scripts. Leveraging Whisperer gives you the opportunity to tap into a growing pool of Datadog ecosystem knowledge and benefit from the help of our Datadog Engineers, utilities, and customer community.

In Conclusion

When deciding whether to establish your Datadog monitoring in a single-org or multi-org structure, it’s critical to reflect on your original goals for modernizing your observability while also addressing your valid concerns. Through a combination of utilities and platform tuning delivered with tailored Datadog implementation services, RapDev is continuing to improve how our customers operate the platform at scale. Of course, it’s difficult to address every question relevant to operating Datadog at scale in a single blog, so reach out to our team for a conversation around your business’s needs. Feel free to engage us via email (ddsales@rapdev.io) or by phone ((855) 857-0222). We’re here to help!

written by
Jesse Eddy
Boston