Like the proverbial iceberg, issues with your database can linger below the surface and out of sight. As you navigate across the surface, things appear to be calm, with only one minor obstacle, the tip, clearly visible. But below awaits potential catastrophe. To steer clear, you need to know the depth and breadth of the obstacle underneath. You need visibility into its size and shape to make a proper course correction. These corrections are much easier when it’s a distant blip on the radar rather than right in front of you.
For digital organizations, submerged database issues can result in slow site performance, a poor customer experience, and high abandonment rates. As the critical mechanism for storage and transaction execution, the database sits directly on the critical path to high-performing systems and happy customers. A great customer experience is when customers don’t notice site performance at all – they expect everything to perform flawlessly. To make this a daily reality, you must be able to not only spot and fix issues quickly, but proactively monitor and prevent issues before any customer would notice.
You need a complete view of database activity, you need to know which metrics are the most important to watch, and you need to be able to interpret what you find so that you can make the right call.
When Things Go Wrong
Database servers often pause, completely or partially, for brief periods of time. When this happens, queries don’t finish, and newly arrived queries wait in line. Existing connections don’t complete their work, and new connections continue to open. The effect is a “pile-up” where lots of connections are opened, many queries are running, query latency spikes, and server load increases dramatically. If a stall is long-lived (a minute or more), it’s not unusual for the server to exhaust resources: it may fail to open new connections, or run out of memory and crash.
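The pile-up dynamic can be sketched with a toy simulation. All the rates and durations below are made-up numbers, not measurements: queries keep arriving at a steady pace, but during a brief stall none of them complete, so open queries stack up and drain only slowly once the stall ends.

```python
# Toy pile-up simulation (all numbers are illustrative assumptions).
ARRIVAL_RATE = 200              # queries arriving per second (assumed)
SERVICE_RATE = 220              # queries the server can complete per second (assumed)
STALL_START, STALL_END = 5, 8   # a 3-second stall beginning at t=5s (assumed)

active = 0                      # queries currently open on the server
for second in range(15):
    active += ARRIVAL_RATE                    # new queries keep arriving regardless
    stalled = STALL_START <= second < STALL_END
    if not stalled:
        active -= min(active, SERVICE_RATE)   # healthy: completions drain the queue
    print(f"t={second:2d}s stalled={str(stalled):5s} open queries={active}")
```

Even though the stall itself lasts only three seconds, the backlog it creates lingers long afterward, because the server’s spare capacity (20 queries/second here) is small relative to the 600-query pile-up.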
There are two important things to know about these stalls:
They happen all the time, and most people don’t know it. It’s not a matter of whether they happen, but how long they last. On a well-behaved server, you hope the stalls are just microseconds long. If something is wrong, though, the stalls will grow as load and data increase, first to a second, then to 10 seconds or more. Most people won’t notice until the stalls become quite serious (10+ seconds long).
They are very hard to diagnose and can have any of hundreds of causes. By the time anyone notices a stall, it has usually become a chaotic mixture of symptoms: everything is going wrong at once and you can’t tell what the original problem was. Stalls can also strike randomly, which makes them practically impossible to catch in action.
You need to find stalls while they are still short (1 second or less). To do this, you need to capture huge amounts of data about the entire server. Both are hard tasks, but doing them lets you prevent serious performance problems later.
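One way to catch sub-second stalls is to sample a server-level counter at high resolution and compare each sample against a short rolling baseline. The sketch below assumes you can collect a running-query count every 100 ms (for example from a server status counter); the sample data is invented for illustration.

```python
from collections import deque

def detect_spikes(samples, window=30, spike_factor=3.0):
    """Return indices where a sample is several times the recent median --
    the signature of queries piling up behind a stall. `window` is the
    number of recent samples used as the baseline (assumed tuning values)."""
    history = deque(maxlen=window)
    spikes = []
    for i, value in enumerate(samples):
        if len(history) == window:
            baseline = sorted(history)[window // 2]   # median of recent samples
            if value > max(1, baseline) * spike_factor:
                spikes.append(i)
        history.append(value)
    return spikes

# 100 ms samples of running-query counts (made-up data): steady around 4,
# with a brief pile-up that a 1-minute average would smooth away entirely.
samples = [4, 5, 3, 4, 5, 4, 3, 4, 5, 4, 18, 25, 22, 5, 4]
print(detect_spikes(samples, window=5, spike_factor=3.0))   # flags the pile-up
```

A median baseline rather than a mean keeps one noisy sample from distorting the comparison, which matters when the whole point is to react to sub-second anomalies.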
System faults, such as stalls, are nearly impossible to detect by normal means such as threshold-based alerts. They are often hard to find until they become major problems (seconds or minutes of downtime), and their fleeting nature makes them some of the most difficult problems to diagnose. The symptoms and causes tend to be complex, because systems that stall often misbehave in several ways simultaneously. For these reasons, faults are best detected while they are small (a second or two), which means you need to be proactive. Short-duration faults are much easier to diagnose and fix.
Proactivity in database monitoring means preventing issues before they ever occur, yet most performance monitoring solutions focus on quick after-the-fact detection. There are, however, a few approaches you can take to be more proactive:
Early detection is key. You need to be able to find budding problems, escalate them, and surface them while they’re still too small for your customers to notice, yet still simple enough to diagnose cleanly. Problems like server stalls and query latencies have a “golden time” before their escalating effects make them more serious and harder to diagnose.
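One sketch of exploiting that “golden time”: instead of waiting for a latency metric to cross a static alert threshold, fit a simple trend to recent samples and flag steady growth while the values still look healthy. The p99 figures and the 500 ms threshold below are hypothetical.

```python
def trend_per_sample(values):
    """Least-squares slope of `values` over their indices (units per sample)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

# p99 query latency in ms, sampled each minute (invented data): every value
# is comfortably under a hypothetical 500 ms alert threshold, but the series
# is clearly climbing -- a budding problem a static alert would miss.
p99_ms = [42, 45, 44, 51, 58, 63, 71, 82, 95, 110]
slope = trend_per_sample(p99_ms)
if slope > 5:   # assumed tolerance: more than 5 ms of growth per minute
    print(f"latency rising ~{slope:.1f} ms/min - investigate before it escalates")
```

Here the trend fires while p99 is still around 100 ms; a 500 ms threshold alert would stay silent until the problem is far harder to untangle.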
Detection in pre-production environments. You need to be able to monitor performance in both pre-production and production deployments. This includes workload analytics capabilities that surface important changes when deployments to pre-production environments are executed. You also need textual analytics to highlight problems that may not show up in staging, but could cause issues in production. Finding problems in staging environments, before shipping to production, is a proven method for avoiding outages.
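Workload analytics of this kind can be sketched as a comparison of per-query-class latency between a pre-production run and the previous baseline. The query fingerprints, timings, and 2x regression factor below are all invented for illustration; real tools normalize literals out of query text to form the fingerprints.

```python
def regressions(baseline, candidate, factor=2.0):
    """Return fingerprints whose mean latency grew by `factor` or more,
    mapped to (old mean, new mean). `factor` is an assumed tuning value."""
    out = {}
    for fingerprint, times in candidate.items():
        new_mean = sum(times) / len(times)
        old = baseline.get(fingerprint)
        if old:
            old_mean = sum(old) / len(old)
            if new_mean >= old_mean * factor:
                out[fingerprint] = (old_mean, new_mean)
    return out

# Per-fingerprint latencies in ms (hypothetical data), keyed by a
# normalized query shape rather than the literal SQL text.
baseline = {"SELECT * FROM orders WHERE id = ?": [2.1, 1.9, 2.0],
            "UPDATE carts SET qty = ? WHERE user_id = ?": [3.0, 3.2]}
candidate = {"SELECT * FROM orders WHERE id = ?": [2.0, 2.2],
             "UPDATE carts SET qty = ? WHERE user_id = ?": [9.5, 10.1]}
print(regressions(baseline, candidate))   # the UPDATE class regressed ~3x
```

Running a comparison like this on every staging deployment turns “did this release slow anything down?” from a gut-feel question into a diff you can read before shipping.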
Cut down on firefighting. The best way to make your team proactive is to connect the feedback loops, smash the silos, and get the whole team focused on preventing and solving issues. Without this shift in mindset, those who operate the systems in production are doomed to spend all their time in reactive mode, battling problems they couldn’t prevent. But when the whole team has visibility into production database behavior, DBAs are freed up for the more strategic work they should be doing, preventing tomorrow’s problems instead of solving today’s.
Early detection can prevent future problems. That fleeting breakdown in performance may foretell a major stall, so getting more granular in your analysis and staying proactive will help you spot those blips on the radar, and allow you to course-correct well in advance of trouble. And the best part is, with rock-solid uptime and a great customer experience, your customers just may take notice.