Meta-Support: Tools for Supporting IoT
SpinDance
In the IoT world, a common headache for customers is supporting their product after it's released. To provide support with excellence, it's important to rely on the intelligent use of meta-support tools to amplify the support team's work. In this post, we'll highlight some of these tools.
Datadog is an excellent monitoring service that can be extended and hooked into in some incredible ways. This system keeps 24/7 surveillance on our cloud infrastructure, dedicated hosts, websites, and databases.
The feature we use most is monitors. A monitor can be a simple smoke-test-style check where a tool requests a URL and makes sure it returns the expected HTTP status code. It can also be a complex, carefully designed check that passes input to the tool or a query to a database. For example, we have one check that queries a Redshift database's rows by time, averages the number of entries made in the past hour, and returns a warning if the result falls below a specified threshold.
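The two kinds of check described above can be sketched in plain Python. This is only an illustration of the logic, not Datadog's API or our actual monitor definitions: the function names, the default status code, and the one-hour window parameter are all assumptions, and the row timestamps would in practice come from a query against the database.

```python
from datetime import datetime, timedelta, timezone
from urllib.error import HTTPError
from urllib.request import urlopen


def smoke_check(url: str, expected_status: int = 200, timeout: float = 5.0) -> bool:
    """Return True when the URL answers with the expected HTTP status code."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == expected_status
    except HTTPError as err:
        # urlopen raises on 4xx/5xx responses; compare the code anyway.
        return err.code == expected_status
    except OSError:
        # DNS failure, refused connection, timeout, and similar.
        return False


def hourly_volume_status(entry_times, now, minimum_per_hour, window_hours=1):
    """Classify write volume as "ok" or "warn" from row timestamps.

    entry_times: iterable of datetimes, one per row (hypothetical input).
    minimum_per_hour: the threshold below which a warning is returned.
    """
    window_start = now - timedelta(hours=window_hours)
    recent = sum(1 for t in entry_times if window_start < t <= now)
    rate = recent / window_hours  # average entries per hour over the window
    return "warn" if rate < minimum_per_hour else "ok"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    rows = [now - timedelta(minutes=m) for m in (5, 20, 45, 90)]
    print(hourly_volume_status(rows, now, minimum_per_hour=5))
```

Keeping the threshold logic in a pure function, separate from the query that fetches the rows, makes it easy to test without a database.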
Datadog has excellent first-class support for all major cloud vendors, and if there isn't a pre-made tool, the platform is customizable enough to let you build it yourself!
In theory, this could be replicated by a master server pinging other servers or by a free tool (like the excellent Cockpit Project). However, Datadog adds some secret sauce of its own: its visualization of monitors and metrics is fantastic out of the box, and it has incredibly granular controls for, say, muting individual monitors once they have been triaged. When a support team member needs to be alerted, Datadog hooks into our Slack channel to notify the respective support role.
Certain critical-level events also send an automated text to the on-call team member using Twilio (more on that later).
We have multiple redundant Datadog monitoring instances, which we orchestrate and deploy automatically using the Chef tool.
Chef Software is an extensive suite of tools built around configuration management. Paired with Datadog, this enables us to quickly respond to issues and keep everything we're responsible for in a working, consistently reproducible state.
While creating a Chef "cookbook" of recipes and deploying infrastructure from it may be a bit more work the first time than just doing it yourself in a terminal, once the infrastructure exists as code, further changes are far simpler. Keeping our servers specified in Chef cookbooks enables us to, for example, apply Linux kernel security patches on a dev machine, automatically run tests against it, and, when comfortable, apply the update to every relevant machine and know exactly what has changed.
Chef also makes recovering from a disaster much more manageable. Ransomware? No problem: spin up a new EC2 instance and run your Chef cookbooks! Combined with other backup tools, a complete server replacement is possible with a simple invocation of the chef-client command.
To round things off, the Chef community keeps a public, up-to-date list of cookbooks in the Chef "Supermarket." For almost any common task you can think of, somebody has a cookbook to help implement it.
The only major downside of Chef is the unfortunate fact that its tools share names with actual cooking and shopping terms, which can test any developer's Google-fu. Example: Chef has a command-line tool called "knife" for managing cookbooks, nodes, and other objects on the Chef server. If you Google "chef knife not working," you get many results, very few of which will help you set up a server.
Fortunately for us, we have found that Chef's documentation, supplemented with a healthy seasoning of Stack Overflow, answers 99.9% of questions.
Listing AWS as a meta-tool is a bit like listing "a computer" as a developer's tech stack. It is technically accurate but, well, a bit general. AWS can do a lot (ask anyone who has studied for one of their exams), but today I will focus on the bits we use specifically for our support pipeline: CodeCommit and CodeBuild.
CodeCommit isn't too exciting on its own: it's a private Git host. Its usefulness comes, as with many AWS products, from its tight integration with other AWS tools. Storing code in CodeCommit allows us to trigger CloudWatch events on commits with certain messages and automatically run linting, CI/CD, and builds. We aren't too fancy; our support team has a nice but straightforward workflow where we keep each of our Chef cookbooks in its own CodeCommit repository. We can then use CodeBuild to produce artifacts. The Chef infrastructure then checks daily for any updates and applies them to all our relevant machines.
So far, this blog post has walked through the primary "loop" of support: infrastructure, coded for Chef, deployed with AWS, and monitored by Datadog. Of course, these are not all the tools, but I sadly don't have the time to write (nor do most of our dear readers have the time to read) the dozens of pages it would take to fill in every detail. Instead, I will briefly mention some of the other tools outside of this primary support pipeline.
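As a rough sketch of how a commit message could gate a build, the handler below is written in the style of a Lambda function invoked by a CodeCommit repository event. It is not our actual pipeline: the "[build]" tag, the function names, and the project name are hypothetical, and the event fields used (detail.repositoryName, detail.commitId) are assumed to follow the CodeCommit state-change event format. The two clients are passed in so the block runs without AWS; in a real function they would be boto3's codecommit and codebuild clients, whose get_commit and start_build calls are used here.

```python
BUILD_TAG = "[build]"  # hypothetical marker; any team convention would do


def should_build(commit_message: str, tag: str = BUILD_TAG) -> bool:
    """True when the first line of the commit message carries the tag."""
    lines = commit_message.strip().splitlines()
    return bool(lines) and tag in lines[0].lower()


def handle_commit_event(event, codecommit, codebuild, project_name):
    """Start a build for a tagged commit; return the build id, or None.

    codecommit / codebuild: client objects exposing get_commit and
    start_build (boto3 clients in a real deployment, fakes in tests).
    """
    detail = event["detail"]
    commit = codecommit.get_commit(
        repositoryName=detail["repositoryName"],
        commitId=detail["commitId"],
    )["commit"]
    if not should_build(commit["message"]):
        return None  # untagged commits are ignored
    build = codebuild.start_build(
        projectName=project_name,
        sourceVersion=detail["commitId"],  # build exactly this commit
    )
    return build["build"]["id"]
```

Passing the clients in as arguments, rather than creating them inside the handler, keeps the decision logic testable with simple stand-in objects.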
For some legacy systems and tools, CrashPlan is a dead-simple automated system backup tool. It can perform a complete system recovery, but we often find ourselves using it to help a customer recover an individual modified or deleted file. It is a great tool that you hope never to use, but boy will you be glad when you have to use it.
To ensure we can respond to emergencies at any time of the day, we have Datadog hooked into a Twilio connection, allowing us to text the on-call support team member under certain circumstances. In this instance, Twilio is just the pipe; Datadog and some cron jobs do the bulk of the work, but without the texts, our support response outside business hours would likely suffer.
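The routing rule described in this post (every alert goes to chat, critical ones also text the on-call person) can be sketched as below. This is an assumption-laden illustration rather than our Datadog or Twilio configuration: the Alert type, the severity names, the channel labels, and the 160-character cap are all hypothetical, and actually sending the message is left out.

```python
from dataclasses import dataclass
from datetime import datetime

SMS_LIMIT = 160  # assumed cap so the text fits a single SMS segment


@dataclass(frozen=True)
class Alert:
    monitor: str         # name of the monitor that fired
    severity: str        # "warning" or "critical" (assumed levels)
    raised_at: datetime


def channels_for(alert: Alert) -> list[str]:
    """Every alert goes to chat; critical ones also text the on-call phone."""
    channels = ["slack"]
    if alert.severity == "critical":
        channels.append("sms")
    return channels


def sms_body(alert: Alert) -> str:
    """Build a short text message, truncated to SMS_LIMIT characters."""
    text = f"[{alert.severity.upper()}] {alert.monitor} at {alert.raised_at:%Y-%m-%d %H:%M}"
    return text if len(text) <= SMS_LIMIT else text[: SMS_LIMIT - 3] + "..."


if __name__ == "__main__":
    alert = Alert("redshift-hourly-volume", "critical", datetime(2024, 1, 6, 2, 30))
    print(channels_for(alert), sms_body(alert))
```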
Realistically, a wide variety of wiki tools could fill this spot. Atlassian's Confluence product is polished, has every possible knob and button you could imagine, and ties in with other Atlassian tools. Documenting our solutions and pipeline in the wiki is crucial for raising our team's "bus factor": the hypothetical number of people who could, for one reason or another, disappear (be "hit by a bus") without our support suffering.
Long-term, our goal is to make the support role as unnecessary as possible. While we understand that a human support specialist role is always going to be needed, by magnifying the effectiveness of our team, we avoid burnout and wasted time on micro-managing issues. Instead, we can focus on the important part of our work: bringing our customers' ideas and solutions to fruition with IoT!
We are looking into some exciting and promising tools, including Ansible as a potential complement or replacement for Chef, an ACME client to replace all manual SSL certificate renewal with automatic renewal, and hybrid-cloud solutions using Terraform to allow cross-cloud fallback.
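Whatever ACME client ends up doing the renewal, something still has to decide when a certificate is due. A minimal sketch of that decision using only Python's standard library follows; the 30-day window and the function names are assumptions, and the notAfter string is expected in the format returned by ssl's getpeercert().

```python
import ssl
from datetime import datetime, timezone

RENEW_BEFORE_DAYS = 30  # assumed renewal window, not a recommendation


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    not_after: e.g. "Mar  1 00:00:00 2025 GMT", as found in the dict
    returned by SSLSocket.getpeercert().
    now: a timezone-aware datetime.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime,
                  renew_before_days: int = RENEW_BEFORE_DAYS) -> bool:
    """True once the certificate is inside the renewal window."""
    return days_until_expiry(not_after, now) <= renew_before_days


if __name__ == "__main__":
    print(needs_renewal("Mar  1 00:00:00 2025 GMT", datetime.now(timezone.utc)))
```

A check like this could also run as a monitor, so that a failed automatic renewal is noticed well before the certificate expires.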
We hope this short dive into support tooling inspires or encourages you in your support endeavors. As an industry, helping others increase their quality and longevity of support for IoT products and services ultimately helps everyone, customers and companies alike.