Ashish Thusoo
Ashish Thusoo is the CEO and co-founder of Qubole. Before co-founding Qubole in 2011, Ashish led Facebook’s Data Infrastructure team and built one of the largest data processing and analytics platforms in the world. Ashish’s work at Facebook not only helped achieve the bold aim of making data accessible to analysts, engineers and data scientists, but also drove the big data revolution. In the process of scaling Facebook’s big data infrastructure, Ashish helped drive the creation of a host of tools, technologies and templates that are used industry-wide today, including the development of Apache Hive.

This is a difficult time for enterprises, which need to tightly control costs amid the threat of a recession while still investing sufficiently in technology to remain competitive. The public cloud has made it easy to scale capacity up and down as needed, but access to seemingly infinite resources also allows usage — and therefore costs — to escalate quickly and unpredictably.

This is especially true of analytics and machine learning projects. Data lakes, ideally suited for machine learning and streaming analytics, are a powerful way for businesses to develop new products and better serve their customers. But with data teams able to spin up new projects in the cloud easily, infrastructure must be managed closely to ensure every resource is optimized for cost and every dollar spent is justified. In the current economic climate, no business can tolerate waste.


But enterprises aren’t powerless. Strong financial governance practices let data teams control and even reduce their cloud costs without stifling innovation. Creating guardrails that prevent teams from using more resources than they need, and matching workloads with the correct instance types, will go a long way toward reducing waste while ensuring that critical SLAs are met.

Here are seven best practices CIOs can employ to manage cloud data lake costs. They will help you avoid unpredictable bills and keep spending in check during this uncertain period while still allowing your company to innovate and emerge stronger on the other side.

  • Monitor, monitor, monitor. Cost management starts with understanding exactly what resources are being used, when and by whom, and tracking this at least daily. Tracking usage closely at the job, cluster and user level lets you spot waste or inefficiency immediately and make the necessary changes (see the first sketch after this list). You can’t manage what you can’t see.
  • Use heterogeneous clusters. The nodes in a cluster can be of different instance types, depending on the workload and the cost and availability of each instance. For example, a cluster can mix on-demand instances with AWS Spot Instances or Google Preemptible VMs. Apply tooling (generally in the form of DIY scripts) to automate instance selection so that you’re running on the best-value infrastructure while still meeting the application’s resilience and availability needs (see the second sketch below).
  • Autoscale aggressively. Clusters don’t need to run when they’re not in use; automatically scaling them up on demand and shutting them down when idle saves considerable cost. We have at least one customer shutting down clusters after 15 minutes of idle time during the pandemic to aggressively reduce costs (the third sketch below shows a simple idle-shutdown watchdog). How far you can push this depends on the SLA needs of an application, but for development and proof-of-concept work, waiting a brief moment for a cluster to restart should not be an issue.
  • Test different engines. Many businesses employ multiple engines, such as Spark, Hive and Presto, because each is suited to different workload types. Test queries on multiple engines to see where they run fastest (a simple timing harness follows this list). Faster queries benefit end users, and they benefit your company too, because a shorter run time typically means fewer resources consumed.
  • Use schedule-based lifecycle management. Automate the creation, resizing and destruction of systems to match usage patterns (see the scheduling sketch below). If clusters are typically at capacity in the mornings or during peak trading hours, for example, look at the jobs that are running and see which can be spread out over the course of the day.
  • Resize under-utilized infrastructure. Capacity requirements aren’t always clear when new projects and applications are rolled out, so infrastructure gets over-provisioned and no one goes back to change it once real-world requirements become clear. Adjust the size of infrastructure to an appropriate level, guided by utilization data (the final sketch below shows one way to spot candidates). This requires careful policy creation, since capacity must still allow for expected spikes in usage.
  • Educate your users. Data teams will do their part to keep costs down if they understand the larger business imperative and the options available to them. Do they really need an r4.4xlarge instance for a proof-of-concept project? Probably not. Talk to them about the current situation and why it is in everyone’s interest to right-size infrastructure. One of our customers even ran an exercise with their team using colored Lego bricks to illustrate how different instance types can be combined in a cluster. Help your teams to help you.
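
The short sketches below illustrate several of these practices. They are minimal Python examples assuming an AWS environment and the boto3 SDK; every hostname, tag key, instance type and threshold is an illustrative assumption, not a prescription. First, monitoring: the Cost Explorer API can report daily spend grouped by a cost-allocation tag (here, a hypothetical "team" tag), giving the per-team visibility described above.

```python
import boto3

# The Cost Explorer API is served from us-east-1 regardless of where you run.
ce = boto3.client("ce", region_name="us-east-1")

def daily_cost_by_team(start: str, end: str) -> dict:
    """Sum unblended cost per 'team' cost-allocation tag between two ISO dates."""
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},   # e.g. "2020-05-01"
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": "team"}],  # assumes a "team" tag exists
    )
    totals = {}
    for day in resp["ResultsByTime"]:
        for group in day["Groups"]:
            team = group["Keys"][0]                # formatted like "team$analytics"
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[team] = totals.get(team, 0.0) + amount
    return totals

print(daily_cost_by_team("2020-05-01", "2020-05-08"))
```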
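
Next, heterogeneous clusters. On AWS EMR, instance fleets let a single cluster mix on-demand and Spot capacity across several instance types; the capacities and types below are placeholders you would tune to your own workloads.

```python
import boto3

emr = boto3.client("emr")

# Two baseline on-demand nodes for resilience, plus cheaper interruptible Spot
# capacity on top. EMR provisions whichever listed type is available cheapest.
core_fleet = {
    "Name": "data-lake-core",
    "InstanceFleetType": "CORE",
    "TargetOnDemandCapacity": 2,
    "TargetSpotCapacity": 6,
    "InstanceTypeConfigs": [
        {"InstanceType": "r4.2xlarge", "WeightedCapacity": 1},
        {"InstanceType": "r5.2xlarge", "WeightedCapacity": 1},
    ],
}

emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-5.30.0",
    Instances={
        "InstanceFleets": [
            {"InstanceFleetType": "MASTER", "TargetOnDemandCapacity": 1,
             "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]},
            core_fleet,
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```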
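
For aggressive autoscaling, a small watchdog run from cron can terminate clusters that have sat idle past a threshold, such as the 15 minutes mentioned above. The minutes_idle() helper here is hypothetical; in practice it would read YARN ResourceManager or platform metrics.

```python
import boto3

IDLE_LIMIT_MINUTES = 15
emr = boto3.client("emr")

def minutes_idle(cluster_id: str) -> float:
    """Hypothetical helper: minutes since the cluster last ran a job.
    Wire this up to YARN's ResourceManager or your platform's metrics."""
    raise NotImplementedError

def reap_idle_clusters():
    # Only clusters that are up and waiting for work are candidates.
    pages = emr.get_paginator("list_clusters").paginate(ClusterStates=["WAITING"])
    for page in pages:
        for cluster in page["Clusters"]:
            if minutes_idle(cluster["Id"]) >= IDLE_LIMIT_MINUTES:
                emr.terminate_job_flows(JobFlowIds=[cluster["Id"]])

reap_idle_clusters()  # run from cron every few minutes
```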
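
To test different engines, a simple harness can run the same query against each and compare wall-clock times. This sketch assumes Hive and Presto endpoints reachable through the PyHive client (Spark SQL can be reached the same way via its Thrift server); the hostnames and sample query are placeholders.

```python
import time
from pyhive import hive, presto  # pip install 'pyhive[hive,presto]'

# Connection factories for each engine; hostnames are placeholders.
ENGINES = {
    "hive":   lambda: hive.connect(host="hive-host", port=10000),
    "presto": lambda: presto.connect(host="presto-host", port=8080),
}

def time_query(sql: str) -> dict:
    """Run the same query on every engine and report wall-clock seconds."""
    timings = {}
    for name, connect in ENGINES.items():
        cursor = connect().cursor()
        start = time.time()
        cursor.execute(sql)
        cursor.fetchall()  # force the full result to materialize
        timings[name] = time.time() - start
    return timings

print(time_query("SELECT COUNT(*) FROM events WHERE dt = '2020-05-01'"))
```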
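
For schedule-based lifecycle management, a script run hourly can resize an EMR instance fleet to match a known daily usage curve. The hour-to-capacity table below is purely illustrative.

```python
from datetime import datetime
import boto3

emr = boto3.client("emr")

# Desired Spot capacity by hour of day (UTC): illustrative numbers for a
# morning-heavy usage curve with light overnight ETL.
CAPACITY_SCHEDULE = [(range(0, 8), 2), (range(8, 18), 10), (range(18, 24), 4)]

def target_capacity(hour: int) -> int:
    return next(cap for hours, cap in CAPACITY_SCHEDULE if hour in hours)

def apply_schedule(cluster_id: str, fleet_id: str):
    """Resize an EMR instance fleet to the scheduled capacity; run hourly."""
    emr.modify_instance_fleet(
        ClusterId=cluster_id,
        InstanceFleet={
            "InstanceFleetId": fleet_id,
            "TargetOnDemandCapacity": 1,
            "TargetSpotCapacity": target_capacity(datetime.utcnow().hour),
        },
    )
```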
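
Finally, to find resize candidates, CloudWatch CPU statistics can flag instances that have been mostly idle. The 20% threshold and two-week window are assumptions to tune for your environment.

```python
from datetime import datetime, timedelta
import boto3

cw = boto3.client("cloudwatch")

def is_underutilized(instance_id: str, threshold: float = 20.0) -> bool:
    """Flag an EC2 instance whose average CPU over two weeks is below threshold."""
    stats = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=datetime.utcnow() - timedelta(days=14),
        EndTime=datetime.utcnow(),
        Period=86400,  # one datapoint per day
        Statistics=["Average"],
    )
    points = stats["Datapoints"]
    if not points:
        return False  # no data; don't flag
    return sum(p["Average"] for p in points) / len(points) < threshold
```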

Cost management in the cloud is about optimizing utilization and providing financial guardrails that let teams move fast in a self-service environment without racking up unexpected costs. These best practices should be employed on an ongoing basis at any organization, but they are especially critical in this macroeconomic climate. Use these techniques and you will be able to make it through this crisis and emerge in good financial health.


Feature image via Pixabay.