AWS Operations Management: Improvements Needed

December 17, 2015

A version of this article originally appeared on TechTarget SearchAWS as “Where AWS monitoring tools fall short.”

AWS has a rich set of management APIs, automation tools and a central management console, but it can’t yet provide end-to-end performance and troubleshooting data.

AWS has an overwhelming list of services, but piecing together a multi-tier application design and then monitoring and managing the result can feel like ordering off a Chinese menu. There are so many choices, many with similar or overlapping features, that finding the best solution, or even a workable one, is an arduous task. The complex mix of services only complicates system management, particularly since most enterprise AWS applications don’t exist in isolation, spending their entire lifecycle in the cloud. Instead, they pull data from internal and third-party sources and target many different user groups and platforms, whether field service reps on tablets or B2B information exchanges with business partners. The resulting heterogeneous mix of services, networks and data sources makes comprehensive system and application management almost impossible.

While AWS services provide a good idea of what’s happening in the cloud, they can’t measure the big picture of end-to-end performance and reliability, which are ultimately the only parameters enterprises really care about. Furthermore, AWS management services are designed for use via AWS’s own console, not the system management platforms enterprises already have deployed, giving already overworked admins yet another tool to learn and monitor. In sum, AWS has some major holes or, for those of a more optimistic bent, significant opportunities for improvement in its operations management portfolio.

The holes/opportunities are apparent when you consider how hard it is to get a complete view of a reasonably elaborate enterprise app’s performance, never mind troubleshoot any anomalies. For example, at the last re:Invent conference, Coursera discussed the data flow and ETL processes for its AWS-based data warehouse. The system pulls data from 15 sources, including client events, external databases and third parties, into a pipeline of EC2 instances, S3 storage and EMR (Hadoop) processing. The output lands in a multi-TB Redshift warehouse, which combines it with even more data from internal business intelligence applications to power recommendations, search and other Coursera data products.

Even a simpler example, running SharePoint on AWS, shows the challenges of managing composite applications consisting of many different server and storage systems. The AWS SharePoint reference architecture includes no fewer than six AWS servers and two databases spread across two subnets, with both VPC (to an internal data center) and public Internet connections. Imagine trying to manage the performance of an internal Excel application that pulls data from an internal database and an AWS-resident SharePoint repository, crunches the data and writes a report back out to another SharePoint share. Each AWS SharePoint server could be operating fine, yet a bottleneck or resource contention at any point in the processing/communication chain could still cause the application to fail.


AWS SharePoint reference design. Source: AWS

Monitoring, much less guaranteeing, end-to-end transaction performance, or worse yet, finding and fixing problems when something goes wrong, isn’t something existing AWS tools are designed to do. Yet this is precisely what enterprises need. Indeed, the problem is so intractable that Bernd Herzog, founder and CEO of OpsDataStore, claims, “The bottom line is that today end to end service quality assurance in the public cloud is impossible.” Herzog founded OpsDataStore to solve this problem; however, the problem is sufficiently diverse and demanding that the company doesn’t intend to solve it alone. Instead, it is building a data platform that it hopes will support an ecosystem of point products spanning infrastructure, application performance, security, automation and financial management.

Operations data collection architecture. Source: OpsDataStore


Ops Management Roadmap

Examining typical enterprise cloud deployment scenarios illustrates the challenges and opportunities for improvement in AWS’s operations management capability. AWS currently relies on third-party marketplace suppliers like AppDynamics, New Relic or Splunk for more extensive monitoring and troubleshooting features, leaving the market open for multi-cloud management specialists like RightScale, Scalr, SevOne and Skeddly that augment or outright replace the AWS console with SaaS. Indeed, a SevOne post on monitoring public cloud infrastructure provides a good list upon which to build some recommendations for the AWS product roadmap.

Enterprise admins need to:

  • Monitor cloud and on-premises infrastructure from a single platform. The task for AWS: provide better integration with popular enterprise management software from the likes of CA, IBM, Microsoft and VMware.
  • Track both cloud and on-prem resource consumption, trend usage over time and trigger alerts on spikes or anomalies. The task for AWS: augment existing cloud-only capabilities by tying usage to users, projects and budgets. The AWS service must also tie into enterprise account and billing systems. Tying usage to projects will entail more thorough use of resource tags, which AWS needs to make easier to set up and use.
  • Measure the End User Application Experience end-to-end across the entire application stack. The task for AWS: develop, acquire or more seamlessly integrate tools for end-to-end system performance monitoring. Performance management features must also tie into troubleshooting software like log and configuration analysis tools. AWS has a piece of this with the new AWS Config Rules, but it needs much more.
  • Integrate Performance Metrics, Data Flows and System/Device Logs into an aggregated view of the entire infrastructure. This is what Splunk calls end-to-end Operational Intelligence, and it is the goal of OpsDataStore and other next-generation, cloud-centric management software firms. The task for AWS: again, integrate cloud data with existing enterprise management systems to create a single version of the truth.
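
To make the second item in the list above more concrete, here is a minimal Python sketch of what tying usage to projects through tags, and alerting on spikes, might look like. It is an illustration only, not an AWS API: the record layout (`day`, `cost`, `tags`), the function names `usage_by_tag` and `find_spikes`, and the threshold values are all assumptions made for the example. It also assumes cloud and on-prem usage records have already been exported into a common format.

```python
from collections import defaultdict
from statistics import mean, pstdev


def usage_by_tag(records, tag_key="project"):
    """Sum cost per tag value and day.

    Each record is a dict with hypothetical keys 'day' (ISO date string),
    'cost' (float) and an optional 'tags' dict. Records that lack the tag
    are grouped under 'untagged' so unattributed spend stays visible.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for rec in records:
        owner = rec.get("tags", {}).get(tag_key, "untagged")
        totals[owner][rec["day"]] += rec["cost"]
    return {owner: dict(days) for owner, days in totals.items()}


def find_spikes(daily_totals, threshold=3.0, min_history=3):
    """Return the days whose cost exceeds the trend of all earlier days.

    A day is flagged when its cost is above
    mean + threshold * max(stddev, 5% of mean) of the preceding days.
    The 5% floor (an arbitrary choice) avoids flagging tiny wobbles
    when the history is almost perfectly flat.
    """
    spikes = []
    days = sorted(daily_totals)  # ISO date strings sort chronologically
    for i, day in enumerate(days):
        history = [daily_totals[d] for d in days[:i]]
        if len(history) < min_history:
            continue  # not enough data yet to establish a baseline
        baseline = mean(history)
        spread = max(pstdev(history), 0.05 * baseline)
        if daily_totals[day] > baseline + threshold * spread:
            spikes.append(day)
    return spikes


# Example: one project whose spend mixes cloud and on-prem line items.
sample = [
    {"day": "2015-12-01", "cost": 60.0, "tags": {"project": "warehouse"}},
    {"day": "2015-12-01", "cost": 40.0, "tags": {"project": "warehouse"}},
    {"day": "2015-12-02", "cost": 102.0, "tags": {"project": "warehouse"}},
    {"day": "2015-12-03", "cost": 98.0, "tags": {"project": "warehouse"}},
    {"day": "2015-12-04", "cost": 101.0, "tags": {"project": "warehouse"}},
    {"day": "2015-12-05", "cost": 250.0, "tags": {"project": "warehouse"}},
    {"day": "2015-12-05", "cost": 12.0},  # no tags: lands in 'untagged'
]

usage = usage_by_tag(sample)
for project, days in usage.items():
    print(project, find_spikes(days))
```

The sketch also shows why tagging discipline matters: anything left untagged falls into a catch-all bucket that cannot be charged back to a user, project or budget.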