March 24, 2022 by Phu Nguyen

Precomputation: The Key to High-Definition Cloud Analytics

For Analytics in the Cloud, How Fast Is Fast Enough?

“My query is slow! It’s unusable.”

This is probably the most common complaint we hear from business users about their analytics experience in the cloud. While everybody would love to conduct their analysis at the speed of thought, the reality is we spend too much time waiting for the dashboard to refresh. Naturally, we start looking for the next shiny BI tool hoping that it will run faster than the one we are using. However, the performance challenge is likely systemic, and not easily remedied by using another BI client.

Figure 1: Total Latency = Query Latency + Rendering Latency

Two factors contribute to the delay on dashboards, as illustrated in Figure 1: the time to fetch and prepare the data, and the time to render and update the UI. End users often fail to realize that the majority of the delay comes from the first step, commonly referred to as query latency. For complex queries and large datasets, query latency can range from tens of seconds to tens of minutes. In a situation like this, it doesn’t matter how fast your BI tool can draw the diagrams; the user experience is going to be slow.

Ideally, to truly support analytics at the speed of thought, we need the overall delay (the time from when a user clicks the mouse to when the dashboard is refreshed) to be around one second, and no more than three seconds. That means our query latency should be less than one second for the majority of use cases, and one to two seconds for the most complex scenarios.
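The budget above can be written down as a small check. This is only an illustrative sketch: the function names and the three labels are made up for this example, and the thresholds are simply the one-second and three-second figures quoted in the text.

```python
# Illustrative latency budget based on Figure 1:
# total latency = query latency + rendering latency.
# The thresholds come from the text; the names and labels are hypothetical.
TARGET_TOTAL_SECONDS = 1.0  # "around one second"
MAX_TOTAL_SECONDS = 3.0     # "no more than three seconds"


def total_latency(query_seconds: float, rendering_seconds: float) -> float:
    """Return the end-to-end delay the user experiences."""
    return query_seconds + rendering_seconds


def classify(query_seconds: float, rendering_seconds: float) -> str:
    """Label a dashboard interaction against the latency budget."""
    total = total_latency(query_seconds, rendering_seconds)
    if total <= TARGET_TOTAL_SECONDS:
        return "interactive"
    if total <= MAX_TOTAL_SECONDS:
        return "acceptable"
    return "too slow"
```

With a fixed rendering cost, the query latency is the only term left to shrink, which is why the rest of the discussion focuses on the backend.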

To feel that we are truly getting accurate analytics on high-definition data, we need this level of latency on petabyte-scale datasets. This has to apply to the most complex analysis scenarios where tens of attributes are analyzed by thousands of concurrent users. This is the type of performance modern analytics systems require of the backend data service layer.

Compute on the Fly vs. Precomputation

Data warehousing was created to maximize query performance by organizing data in a way that makes it easy to summarize for reporting and dashboarding. Data is grouped by subject area, organized in star schemas, and distributed based on the key filter or grouping conditions. As a result, summarizing data in a data warehouse is much faster than doing it in a transactional database.

In the big data world, columnar storage arranges data by columns instead of rows. Since most analytical queries summarize data in only a few columns, columnar storage dramatically improves their performance. New-generation query engines such as Presto, Apache Impala, and Kinetica are designed to process data in parallel across large numbers of servers. These software technologies, together with advancements in hardware such as GPUs, faster networks, and faster disks, have dramatically improved query performance.

In the last several years we’ve also seen the emergence of cloud data warehouses, with Snowflake being the best-known example. Storing data in a cloud data warehouse has become as simple as a couple of mouse clicks, and users don’t need to worry about many of the administration tasks that usually require a seasoned DBA to perform. The cloud data warehouse greatly reduces the effort needed to organize and analyze data in the cloud.

But all of these technologies run analytical processing after the data is read from the data warehouse or the file system. You want to know how many products were sold in Q3? Let the query engine read those records, calculate the total amount, and send the result back. How about from all of 2019? Well, we will just read all records from 2019 and give you the total amount.

This is obviously a simplified scenario. Data warehouses have many optimization methods (e.g. bitmap indexes, SQL optimization) to improve query performance, but the principle still holds, as shown in Figure 2: scan, compute, serve the result, rinse, and repeat.

Figure 2: Store, Query, and Compute
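The scan-and-compute pattern can be sketched in a few lines. This is a toy illustration, not how any particular engine works: the `SALES` records and the `units_sold_between` helper are invented for this example, and a real engine would parallelize and optimize the scan.

```python
from datetime import date

# Hypothetical raw sales records: (sale date, units sold).
SALES = [
    (date(2019, 2, 14), 3),
    (date(2019, 8, 1), 5),
    (date(2019, 9, 30), 2),
    (date(2020, 7, 4), 7),
]


def units_sold_between(records, start, end):
    """Scan every record and sum the units whose date falls in [start, end]."""
    total = 0
    for sale_date, units in records:
        if start <= sale_date <= end:
            total += units
    return total


# Each question triggers a fresh pass over the raw records.
q3_2019 = units_sold_between(SALES, date(2019, 7, 1), date(2019, 9, 30))
all_2019 = units_sold_between(SALES, date(2019, 1, 1), date(2019, 12, 31))
```

The cost of each call grows with the number of raw records, no matter how often the same question has been asked before.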

Move Computation to an Earlier Stage in the Process

Instead of reading the raw data and computing the summaries for every single query, what if we computed the summaries and persisted the results before the queries are even issued? That way, by the time users ask these questions, they get the answers almost instantaneously. This is the concept of precomputation (see Figure 3).

Figure 3: Compute, Store, and Query
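A minimal sketch of the compute-store-query order, using invented data: the quarterly totals are built once, and each later question becomes a dictionary lookup. The names `build_quarterly_index` and `units_sold_in_quarter` are hypothetical, and an in-memory dict stands in for a persisted index.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw sales records: (sale date, units sold).
SALES = [
    (date(2019, 2, 14), 3),
    (date(2019, 8, 1), 5),
    (date(2019, 9, 30), 2),
    (date(2020, 7, 4), 7),
]


def quarter_of(d):
    """Map a date to a (year, quarter) key, e.g. 2019-08-01 -> (2019, 3)."""
    return (d.year, (d.month - 1) // 3 + 1)


def build_quarterly_index(records):
    """One pass over the raw data, done ahead of time, before any query arrives."""
    index = defaultdict(int)
    for sale_date, units in records:
        index[quarter_of(sale_date)] += units
    return dict(index)


# Compute and store first ...
QUARTERLY_UNITS = build_quarterly_index(SALES)


# ... so that answering a query is only a lookup.
def units_sold_in_quarter(year, quarter):
    return QUARTERLY_UNITS.get((year, quarter), 0)
```

The lookup cost no longer depends on how many raw records went into each total; the trade-off is the up-front build time and the storage for the index.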

As many reading this article will know, precomputation is not a new concept in computing. People have been using various data indexing strategies and materialized views in RDBMS for years. And there are numerous examples in operating systems and microprocessors as well. For a bit more background on the subject, you can refer to a short blog I wrote on the Evolution of Precomputation.

Precomputation, a Paradigm Shift for Modern Analytics

In Apache Kylin environments, aggregations and other key measures are calculated upfront, or precomputed. The results of the precomputations are persisted as indexes stored in the storage layer. We can also define logical hierarchies for certain attributes such as time (day, week, month, quarter, year, etc.), location (city, state, country, etc.), product (product family, product category, product line, etc.), and so on to enable drill-down and roll-ups. Querying and fetching the results becomes a simple lookup operation, which is typically hundreds of times faster than brute force aggregation and calculation.
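To make the idea concrete, here is a toy sketch that precomputes a total for every combination of grouping attributes and then answers queries by lookup. It is not Apache Kylin's API or storage format: the rows, the dimension names, and the `build_aggregates` and `lookup` helpers are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical dimensions and fact rows; "units" is the measure being summed.
DIMENSIONS = ("year", "country", "product_line")

ROWS = [
    {"year": 2019, "country": "US", "product_line": "laptops", "units": 3},
    {"year": 2019, "country": "US", "product_line": "phones", "units": 5},
    {"year": 2019, "country": "DE", "product_line": "phones", "units": 2},
    {"year": 2020, "country": "US", "product_line": "phones", "units": 7},
]


def build_aggregates(rows, dimensions):
    """Precompute summed units for every subset of the dimensions."""
    aggregates = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row["units"]
            aggregates[dims] = dict(totals)
    return aggregates


AGGREGATES = build_aggregates(ROWS, DIMENSIONS)


def lookup(dims, key):
    """Answer a grouped query from the precomputed totals, without touching ROWS."""
    return AGGREGATES[dims].get(key, 0)
```

Rolling up from `("year", "country")` to `("year",)` is then just a lookup in a coarser table, for example `lookup(("year",), (2019,))`. The number of combinations doubles with each added dimension, which is why choosing what to precompute matters.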

The benefits of this architecture are dramatic. First and foremost, the query latency doesn’t increase with growing data volumes (to be fair, the query latency will be slightly longer when working with a large index, but this increase is negligible compared to the typical MPP query engines’ performance).

The latency curves are illustrated in Figure 4.

Figure 4: On-the-Fly Computation vs. Precomputation

Another benefit of the Apache Kylin precomputation layer is the ability to support a much larger number of concurrent users. Most traditional query engines need to compute data and assemble result sets in memory at run time, so supporting hundreds or thousands of concurrent users not only leads to unacceptable latencies but also requires an enormous amount of memory. With Kylin’s precomputation architecture, each query is very lightweight (usually just a simple lookup), so we can easily scale out to handle thousands of concurrent users without sacrificing performance.

Knowing What to Precompute

The first question that probably comes to mind is: how do I know what results should be precomputed? One method would be to manually analyze your SQL or MDX queries to understand which datasets and results are most used and therefore suggest which aggregate indexes should be created.
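The manual approach can be approximated by counting how often each set of grouping columns appears in past queries. The sketch below assumes the group-by columns have already been extracted from the SQL or MDX text; the `QUERY_LOG` contents and the `most_used_groupings` helper are hypothetical.

```python
from collections import Counter

# Hypothetical query history: the group-by columns used by each past query.
# Extracting these from SQL or MDX text is out of scope for this sketch.
QUERY_LOG = [
    ("quarter", "country"),
    ("country", "quarter"),
    ("product_line",),
    ("quarter", "country"),
    ("quarter",),
    ("product_line",),
]


def most_used_groupings(query_log, top_n):
    """Rank column sets by frequency, ignoring the order the columns were written in."""
    counts = Counter(tuple(sorted(columns)) for columns in query_log)
    return counts.most_common(top_n)


# The most frequent groupings are the first candidates for aggregate indexes.
candidates = most_used_groupings(QUERY_LOG, 2)
```

A frequency count like this only reflects questions that have already been asked, which is the limitation the prediction-based approach described next is meant to address.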

A more forward-looking approach suggests that this question provides a clear opportunity to apply AI to your precompute strategy. The key is to leverage machine learning algorithms to predict the questions that will be asked by your analysts. By learning users’ query histories, analysis behaviors, metadata, and data access, we can apply AI algorithms to predict what questions users will ask and, therefore, which will benefit from having precomputed answers. As with any good machine learning model, the more queries we receive, the more data we process, the more users we serve, the more accurate these predictions will be.

By knowing what questions users will ask, the software can automatically pick the best strategy to run the appropriate queries, get the correct answers, and store them in the distributed file system. Many people will probably ask, ‘Why the file system? Why not in memory?’ The reason is that storage layers such as a cloud data lake offer virtually unlimited capacity at a very reasonable cost, while still maintaining excellent data access performance. Thanks to today’s distributed compute and storage systems, we can now process trillions of rows of data in a short period of time at a low cost. That is the essence of achieving interactive analytics on high-definition data.

Summary

One of the challenges facing information workers is to quickly operationalize and analyze the huge amounts of data that are available to them. Our initial impulse was to make the data smaller, but this comes at the cost of time, effort, and operational expense. With the precomputation architecture of Apache Kylin, analytics data pipelines are simplified and users are able to interact with larger datasets to get a much clearer, high-definition picture of the types of insights we’ve been waiting a long time to behold.

Source: InApps.net
