Start tracking these KPIs for a successful migration of your data infrastructure to the cloud
As enterprises go digital, the amount of data they have to handle has exploded. New use cases like machine learning and artificial intelligence are forcing them to rethink their data strategy.
It’s a new world where data is a strategic asset – the ability to manage torrents of data and extract value and insight from it is critical to a company’s success. A recent survey of 2,300 global IT leaders by MIT Technology Review Insights found that 83% of data-rich companies are prioritizing analytics as much as possible to gain a competitive advantage.
That’s where “data lakes” come in, as the underlying technology infrastructure to collect, store and process data.
The constraints of legacy data infrastructure
The term “data lake” originally comes from the on-premise Hadoop world. In fact, many enterprises today still run their analytics infrastructure on-premise. But despite years of work and millions invested in licenses, they have little to show for it. And now this legacy analytics infrastructure is running into four major limitations.
- Limited data. With on-premise infrastructure, data is often distributed across many locations. That creates data silos and a lack of a comprehensive view across all the available data.
- Limited scale. Scaling your infrastructure to add new capacity takes weeks and months, and is expensive.
- Limited analytics. The system runs canned queries that generate a descriptive view of the past. Changing these queries and introducing new tools is another time-consuming process.
- Limited self-service. IT is the gatekeeper of all data. It’s a static world, where the consumers of the data have little to no control over changing the queries.
It’s this set of constraints that is driving enterprises to adopt the cloud for their data and analytics infrastructure.
Shifting your data to the cloud
The “new” data infrastructure combines the cheap storage of cloud-based data lakes with the powerful analytics capabilities of cloud warehouses.
Cloud-based data lakes combine simplicity with flexibility. They give you a simple and scalable way to store any data, along with flexible choices for security, geography and access rights. Putting your data in a single place while keeping tight control over it breaks down data silos. Data lakes also offer pricing flexibility, with different storage tiers that reduce the cost of keeping historical data.
With data in a single place, you still need to join it to generate new insights. That’s where cloud warehouses come in, to run complex analytical models and queries across terabytes and petabytes of data. Cloud warehouses cost an order of magnitude less than on-premise alternatives.
Popular cloud warehouses include Amazon Redshift, Google BigQuery and Snowflake. Unlike on-premise products, cloud warehouses can scale on-demand to accommodate spikes in data and query volume.
Cloud warehouses offer the additional flexibility of letting data workers use the tools of their choice. And because cloud warehouses are both performant and scalable, they can handle data transformation use cases in-database rather than in an external processing layer. Add to this the separation of storage (in your data lake) and compute (in your warehouse), and there are few reasons to execute your data transformation jobs elsewhere.
As a result, much of the processing of large data sets today happens within cloud warehouses, based on traditional SQL. The primary benefit of SQL is that all data professionals understand it: data scientists, data engineers, data analysts and DBAs. Cloud warehouses are where data-driven enterprises run their “data pipelines”. And by giving people access to data with easy-to-use self-serve analytics tools, enterprises remove fences around their most valuable data and put it to work where it’s most productive.
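To make this concrete, here is a minimal sketch of what an in-warehouse transformation job can look like. It assumes an Amazon Redshift cluster reachable over the PostgreSQL protocol (via the psycopg2 driver); the schema, table and column names are hypothetical placeholders, not a prescribed model.

```python
# Minimal sketch of an in-warehouse ("ELT") transformation job.
# Assumptions: an Amazon Redshift cluster reachable over the PostgreSQL
# protocol, and hypothetical raw.events / analytics.daily_events tables.
import psycopg2

TRANSFORM_SQL = """
CREATE TABLE analytics.daily_events AS
SELECT
    user_id,
    DATE_TRUNC('day', event_ts) AS event_day,
    COUNT(*)                    AS events
FROM raw.events
GROUP BY 1, 2;
"""

def run_transformation():
    # Connection details are placeholders; point them at your own cluster.
    conn = psycopg2.connect(
        host="my-cluster.example.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="etl_user",
        password="***",
    )
    try:
        with conn.cursor() as cur:
            # Rebuild the aggregate from scratch; the scan and aggregation
            # run inside the warehouse, the client only submits SQL.
            cur.execute("DROP TABLE IF EXISTS analytics.daily_events;")
            cur.execute(TRANSFORM_SQL)
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    run_transformation()
```

The point is that the compute-heavy work happens where the data already sits; the orchestration layer only issues SQL and checks the result.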
New platform, new bottlenecks
In the same MIT Technology Review survey, 84% of respondents confirmed that the speed at which they receive, analyze and interpret data is central to their analytics initiatives.
Yet when first embarking on the cloud journey for their data infrastructure, teams run into a number of bottlenecks, ranging from slow queries to data pipeline failures and increasing load times.
A frequent reaction is to throw more capacity at the problem. But that adds to the cost and often doesn’t solve the underlying issue. For teams that go into a migration with high hopes, these bottlenecks can be frustrating. What can be even more frustrating is that the root causes are not always clear.
The reality is that data pipelines and queries fail, for a number of reasons. Unexpected changes at the source can break the workflows that extract the data. A delay in data loads can cause a backlog with downstream effects on data availability. Poor SQL statements can consume excessive resources and slow down an entire warehouse.
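One cheap safeguard against the “delayed load” failure mode is a freshness check that compares the latest load timestamp against an SLA. The sketch below assumes an open psycopg2 connection and a hypothetical loaded_at column holding UTC timestamps; neither is part of any particular warehouse out of the box.

```python
# Freshness check sketch: flag a table whose most recent load is older
# than the SLA. Assumes a psycopg2 connection and a hypothetical
# loaded_at column holding naive UTC timestamps.
from datetime import datetime, timedelta

FRESHNESS_SLA = timedelta(hours=2)  # example threshold, tune per pipeline

def check_freshness(conn, table="raw.events", ts_column="loaded_at"):
    with conn.cursor() as cur:
        # Identifiers are interpolated for brevity; validate them in real code.
        cur.execute(f"SELECT MAX({ts_column}) FROM {table};")
        last_load = cur.fetchone()[0]

    if last_load is None:
        return f"{table}: no loads recorded"

    lag = datetime.utcnow() - last_load
    if lag > FRESHNESS_SLA:
        return f"{table}: last load {lag} ago, SLA of {FRESHNESS_SLA} missed"
    return f"{table}: fresh (last load {lag} ago)"
```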
Defining data infrastructure and query KPIs
As the DBA or data engineer in charge, you want to be the first to know when these bottlenecks happen. Or even better, before they happen. That’s where your new KPIs come in.
For starters, I recommend three basic KPIs when running a cloud warehouse:
- 95th percentile of load times
- 95th percentile of runtime for your workflows
- 95th percentile of execution time for your ad-hoc queries
Just monitoring these three simple metrics will uncover bottlenecks that impact your SLAs. For example, a sharp increase in the number of rows in a table can lead to an increase in query execution time. From there, you can take action to fix the root cause of the slowdown.
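If you export your query and workflow history to a flat file (or pull it from your warehouse’s system tables), computing these percentiles is straightforward. The sketch below assumes a hypothetical CSV with one row per run, a category column ('load', 'workflow', 'ad_hoc'), a started_at timestamp and a duration_seconds column.

```python
# Sketch: compute the three 95th-percentile KPIs from an exported run log.
# The CSV layout (category, started_at, duration_seconds) is an assumption,
# not a format any particular warehouse produces out of the box.
import pandas as pd

def kpi_report(path="runtime_log.csv", window_days=7):
    runs = pd.read_csv(path, parse_dates=["started_at"])
    cutoff = runs["started_at"].max() - pd.Timedelta(days=window_days)
    recent = runs[runs["started_at"] >= cutoff]
    # p95 per category: data loads, scheduled workflows, ad-hoc queries
    return recent.groupby("category")["duration_seconds"].quantile(0.95)

if __name__ == "__main__":
    print(kpi_report())
```

Tracked week over week, a jump in any of these percentiles is your early warning that an SLA is about to slip.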
A common challenge is locating these KPIs, i.e. the metadata and logs that provide the relevant information. That’s where a product like intermix.io comes in. It provides these KPIs out of the box via custom dashboards. You can analyze query performance at both the individual and aggregate level, keep an eye on critical workloads, and drill down into the data to look at historic patterns.
Follow the process
Migrating your data infrastructure is as much about the process as it is about technology. Simple tasks like data pipeline execution can require complex migration steps to ensure that the resulting cloud infrastructure matches the desired workloads.
Much of the hype surrounding data lakes and cloud analytics is justified. But careful planning and anticipating bottlenecks are essential to a successful migration.
Co-founder & Chief Data Engineer at intermix.io