For many organisations, the prospect of realising functional applications in Azure Databricks may seem little more than a distant possibility. For others, it is something that has already been achieved and is taken as a given. In either scenario, there is always more that can be done to improve application performance. Moving to the cloud is an important first step, but considerable work remains once that migration is complete. Jobs are often inefficient, and the resulting costs quietly accumulate until the business feels them acutely. And no matter how finely tuned those jobs are, they can generally be refined further to generate greater efficiencies and a more optimised data ecosystem.
Business Intelligence is Power
Optimising your workloads in Azure Databricks fundamentally boils down to operational intelligence. With clear insight into which applications are running, how they are utilising resources, which users are accessing them, and when all of this is happening, a broad picture of the data landscape can be drawn. This kind of granular detail leads to faster diagnosis and resolution of application problems. For example, if an app is failing, rather than spending considerable time hunting down the metrics needed to inform a root cause analysis, we can draw those conclusions right away. This minimises downtime for what could be business-critical applications. So what kind of intelligence does the data team need to draw these insights? They will need a detailed view of usage, broken down and trended over time across workspaces and users. Equally, it is essential to have visibility into data usage: which tables are being accessed, when they are being accessed, by whom, and from which applications. Consideration should also be given to the extent of cluster usage, which can lead to the identification of hot, warm, and cold tables.
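To make this concrete, here is a minimal sketch of the kind of query a data team might run, assuming a Databricks notebook with Unity Catalog system tables enabled. The audit-log table, action name and column references follow the documented schema, but they should be treated as illustrative and verified against your workspace, and the hot/warm/cold thresholds are arbitrary placeholders.

```python
# Sketch: classify tables as hot, warm or cold from Unity Catalog audit logs.
# Assumes this runs in a Databricks notebook where `spark` is predefined and
# system tables are enabled; verify column names against your workspace.

access_counts = spark.sql("""
    SELECT
        request_params.full_name_arg            AS table_name,
        COUNT(*)                                 AS reads_last_30_days,
        COUNT(DISTINCT user_identity.email)      AS distinct_users
    FROM system.access.audit
    WHERE service_name = 'unityCatalog'
      AND action_name  = 'getTable'
      AND event_time  >= current_timestamp() - INTERVAL 30 DAYS
    GROUP BY request_params.full_name_arg
""")

# Bucket tables by read frequency, so cold tables can be archived or moved to
# cheaper storage and hot tables can be prioritised for tuning.
classified = access_counts.selectExpr(
    "table_name",
    "reads_last_30_days",
    "distinct_users",
    """CASE
         WHEN reads_last_30_days >= 100 THEN 'hot'
         WHEN reads_last_30_days >= 10  THEN 'warm'
         ELSE 'cold'
       END AS temperature""",
)
classified.orderBy("reads_last_30_days", ascending=False).show(truncate=False)
```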
While such comprehensive data may seem excessive, it allows inefficiencies in resource utilisation to be discovered and addressed. It lets the data team determine what resources are being used versus what is available, or see which applications are running at any one time. After all, it is impossible to ensure that resources are being utilised appropriately without this vital context. When we identify an instance where resources are being under- or over-utilised, we are in a position to redefine how the cluster is provisioned.
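As an illustration of that kind of provisioning check, the sketch below calls the Databricks Clusters REST API to list what is currently provisioned. The endpoint and fields are standard, but the environment-variable names and the thresholds used for flagging a cluster are assumptions you would adapt to your own environment.

```python
# Sketch: list provisioned clusters and flag likely over-provisioning.
# Assumes DATABRICKS_HOST and DATABRICKS_TOKEN are set in the environment;
# the thresholds below are illustrative, not recommendations.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-<workspace-id>.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.get(
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()

for cluster in resp.json().get("clusters", []):
    name = cluster.get("cluster_name", "<unnamed>")
    workers = cluster.get("num_workers") or cluster.get("autoscale", {}).get("max_workers", 0)
    auto_term = cluster.get("autotermination_minutes", 0)

    # Flag clusters that never auto-terminate or that are provisioned unusually
    # large, so they can be reviewed against their actual utilisation.
    if auto_term == 0:
        print(f"{name}: no auto-termination configured")
    if workers and workers > 20:
        print(f"{name}: provisioned with up to {workers} workers - review utilisation")
```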
Data as a Guide
To illustrate what this means in practice, let us say that we have three workspaces in our Azure Databricks environment. As is common in most organisations, each job function has its own workspace; in our example, these are finance, engineering and HR. Using the metrics described above, we quickly identify that the finance team is the biggest user by a considerable margin. That should immediately prompt the data team to investigate. First, we should ask whether inefficiencies are at fault or whether this dominant usage is essential to the business function and therefore justifiable. If it is an inefficiency, we can use further metrics to investigate the likely cause and how to go about remediating it. When we have comprehensive data on application performance and usage, we can extract the 'why' behind changes in usage, which goes a long way in this kind of decision-making.
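A sketch of how that comparison might be produced from the billing system table follows. The workspace-ID-to-team mapping is invented for the example, and the column names should be checked against the documented system.billing.usage schema in your environment.

```python
# Sketch: compare DBU consumption across the finance, engineering and HR
# workspaces from the billing system table. Runs in a Databricks notebook
# where `spark` is predefined; the workspace IDs below are placeholders.

workspace_teams = {
    "1111111111111111": "finance",
    "2222222222222222": "engineering",
    "3333333333333333": "hr",
}

usage = spark.sql("""
    SELECT workspace_id,
           SUM(usage_quantity) AS dbus_last_90_days
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 90)
    GROUP BY workspace_id
""").collect()

# Rank teams by consumption; a dominant consumer, like finance in the example,
# is the first place to look for either justified demand or inefficiency.
for row in sorted(usage, key=lambda r: r.dbus_last_90_days, reverse=True):
    team = workspace_teams.get(row.workspace_id, row.workspace_id)
    print(f"{team}: {row.dbus_last_90_days:,.0f} DBUs")
```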
Such comprehensive metrics also have value over the longer term. While they are certainly indispensable for resolving job issues, as the example above shows, they also provide value in a broader sense when it comes to ROI conversations. With granular metrics on who is using which clusters, for which applications, and to what extent, we can assess which clusters are providing value to the business and which can be optimised. Moreover, with sufficient data to hand, we can even project what these costs will look like in the future and maintain continued cost visibility. The ability to forecast in this way allows for better budgeting and provisioning, the benefits of which are felt more widely across different business functions.
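As a simple illustration of that kind of projection, the sketch below fits a linear trend to monthly spend and extrapolates it forward. The figures are made up, and in practice the series would come from the billing data queried above rather than being hard-coded, with a more careful model than a straight line.

```python
# Sketch: project future monthly Databricks spend from a historical trend.
# The monthly figures are illustrative placeholders.
import numpy as np

monthly_spend = [11200, 11950, 12400, 13100, 13800, 14650]  # last six months, in GBP
months = np.arange(len(monthly_spend))

# Fit a linear trend and extrapolate the next quarter.
slope, intercept = np.polyfit(months, monthly_spend, deg=1)
for step in range(1, 4):
    projected = slope * (len(monthly_spend) - 1 + step) + intercept
    print(f"Month +{step}: projected spend ~ {projected:,.0f} GBP")
```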
In summary, effective cost management is informed by intelligence: the more we have, the better we can optimise. Gathering this information, and finding ways to act on it, is a process that will pay considerable dividends in cost management and performance. For an organisation that wants to diagnose and resolve application problems quickly, this cannot be overlooked. More pressingly, users can enjoy greater productivity by eliminating the time spent on tedious, low-value tasks such as log data collection, root cause analysis and application tuning.
Shivnath Babu is the CTO at Unravel Data Systems and an adjunct professor of computer science at Duke University. His research focuses on ease-of-use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. Shivnath cofounded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.