Roman Zeyde
Roman Zeyde is a senior software engineer for Varada, the data lake query acceleration innovator, and a Presto/Trino expert.

Running complex queries on Presto, a high performance, distributed SQL query engine for big data, can take time to execute. With a properly tuned Presto cluster, you can run fast queries against big data with response times ranging anything from sub-second to minutes.

If you could only make them run faster. In order to accelerate your queries and reach top performance, you need to really understand what Presto is trying to do and why it’s trying to do it. There are several features and actions that greatly speed up Preso query performance, that can be applied throughout the query lifecycle to reduce query execution time and the associated compute resources.

In this article, I’ll provide practical tips and workflows that will help data ops engineers accelerate their queries. Let’s get started!

Step 1: Review Query Plan and Statistics

The first step is analyzing the query and associated tables to understand Presto’s execution plan. There are several alternatives available:

  1. Use EXPLAIN or EXPLAIN ANALYZE

Run EXPLAIN (plan structure + cost estimates) or EXPLAIN ANALYZE (plan structure + cost estimates + actual execution statistics) on your query and evaluate the plan to see if you get what you want. Find the most expensive query fragment (largest CPU time) and operator and check to see if you’re reading too much data? Did predicate pushdown work as expected? Are your filters implemented correctly, or are they really expensive due to bad selectivity? Do the JOINs look right (order and type)?

Now you can update your table statistics (if no estimates are available or they are stale) to improve Cost Based Optimizer (CBO) operations, override CBO if needed, and refine your query mechanics to improve cost and increase selectivity. Sometimes you’ll have distributed workload issues (over/under resource utilization or skewed workload distribution) that you’ll need to further investigate via observability tools.

The explain plan is a powerful tool, but it’s not very user-friendly. Reading the plans and making sense of them can be overwhelming for even the most experienced data ops engineers. For a deep dive into how Presto executes a query and how to read EXPLAIN, check out this great webinar by Martin Traverso from Trino: Understanding and Tuning Presto Query Processing.

Read More:   Top Video Conferencing Software Best Fit Small Businesses

2. Use the Workload Analyzer

Big data software provider Varada created and open-sourced a Workload Analyzer that collects details and metrics on every query, as well as aggregates and extracts information, and delivers dozens of charts describing all the facets of cluster performance. Using the Workload Analyzer, data teams can learn how resources are used on an hourly and weekly basis and define scaling rules, identify heavy spenders and improve the utilization of resources, and improve predicate pushdown. Workload Analyzer also helps data teams to significantly reduce IO and CPU, identify “hottest” data, improve JOINs performance, provide a better production roll-out experience, and identify upgrade risks upfront.

3. Use Presto’s Web Interface

Both Trino and PrestoDB provide a web-based user interface (UI) for monitoring a Presto cluster and managing queries. The main page has a list of queries along with information like unique query ID, query text, query state, percentage completed, username and source from which this query originated. The currently running queries are at the top of the page, followed by the most recently completed or failed queries. The query detail page contains multiple tabs of information ranging from resource utilization to live query plan and operators performance data.

Below are screenshots from Trino’s Web UI showing a query with a broadcast join plan between a large store_sales table and small time_dim table (correctly ordered by CBO), which uses dynamic filtering (can be seen in the operators’ view). The high-level statistics (query overview and resources) are useful for comparing similar queries, with different optimizations enabled.

Web UI Query Overview Tab

Web UI Query Resource Utilization Summary

Web UI Query Live Plan Tab showing Broadcast Join with DF

Web UI Stage (operators) Performance Tab

4. Update your table statistics

Presto’s CBO uses the table statistics provided by the connectors to estimate the costs for different join implementations and automatically picks the one with the lowest estimated costs. Therefore, it is recommended to keep table statistics up to date. This can be done using the SHOW STATS and ANALYZE commands. The SHOW STATS command returns current statistics for the named table or for the results of a (limited) query, while the ANALYZE command updates table and column statistics of a given table.

Step 2: Data Format and Organization

File format, size, compression, and organization can go a long way to improving your Presto query performance. Check out this great webinar by Dain Sundstrom from Trino on Configuring and Tuning Presto to learn more about tuning and sizing your Presto deployment for optimal performance.

  1. Use columnar formats such as ORC or Parquet.

Columnar format has become the de-facto standard for big data analytics. The process of converting rows into columns is time-consuming and compute-intensive but critical for optimal performance. We recommend using ORC or Parquet — both outperform text format considerably — but ORC is faster in Presto. Presto does not mandate how data is stored but it processes the data as vectorized columns. This means that Presto only holds the amount of data it processes without carrying around the additional fields that are not relevant to the query in process. Moreover, columnar data usually compresses better since each column can be compressed separately — according to its data distribution.

Read More:   Alation Delivers a New Data Governance App – InApps Technology 2022

2. Watch your file size and compress your data

File size is a big deal in Presto. Small files create many small IO requests which impact performance (high latency, getting throttled, or running out of IO capacity) and makes file listing slow. Each file requires separate handling, which increases the scheduling time and cost. We recommend using file sizes of at least 100MB to overcome potential IO issues.

Additionally, always consider compressing your data for better performance. If using high compression formats, prefer ZSTD over ZIP. For low compression, prefer LZ4 over Snappy. In any case, you should avoid using LZO altogether.

3. Partition and sort your data

Partitioning divides your table into parts and keeps the related data together based on column values such as date or product. You define partitions at table creation, and they help reduce the amount of data scanned for queries with predicates over the partitioned column, thereby improving performance. Additionally, sorting your data also helps narrow the range of partitions that need to be read by Presto. If we sort each file, we can efficiently skip irrelevant chunks, but only if we filter over the sorted column.

Step 3: Filtering Operations

Improving your filters to give you less results (high selectivity) is key to reducing latency and the amount of data scanned. Presto has several features it uses to optimize filtering operations.

  1. Pruning and predicate pushdown

One way Presto optimizes querying is by skipping over unnecessary data, referred to as pruning. Some statistics (e.g. the minimum and maximum values) of each column are recorded per file, per stripe (~1M rows), and for every 10K rows. Using this information, Presto can skip any segment that could not possibly match the query predicate. Below are screenshots of EXPLAIN ANALYZE outputs for queries with and without predicate pushdown.

EXPLAIN ANALYZE output for query without predicate pushdown: reading 2.79B rows with complex predicate — which currently cannot be pushed down.

EXPLAIN ANALYZE output for query with predicate pushdown: reading 2.13B rows with simple predicate.

2. Lazy reads

The last feature is lazy reads — delay reading columns until their data is actually needed. Predicate pushdown is amazing when it works, but for a lot of data sets, it doesn’t work at all. If the data has a large number of distinct values and is well-shuffled, the minimum and maximum stats will cover almost the entire range of values, rendering predicate pushdown ineffective. With lazy reads, the query engine always inspects the columns needed to evaluate the query filter, and only then reads other columns for segments that match the filter (if any are found). For most workloads, this feature saves more CPU in addition to predicate pushdown. Additionally, you can take advantage of another great optimization, late materialization which allows lazy reads to be propagated within the same stage. This can be enabled via the ‘experimental.late-materialization.enabled’ config property.

Read More:   Update Pachyderm Challenges Hadoop with Containerized Data Lakes

Step 4: Joins Ordering and Distribution via CBO

The order in which joins are executed in a query can have a significant impact on the query’s performance. If a join that produces a lot of data is performed early in the execution, then subsequent stages will need to process large amounts of data for longer than necessary, increasing the time and resources needed for the query. Presto’s CBO will always try to choose the most efficient plan for a given query, based on table statistics and resource availability. This greatly reduces query processing time.

  1. Probe side should be larger than build side

In order for advanced optimizations like dynamic filtering to work efficiently, the smaller dimension table needs to be chosen as a join’s build side. The cost-based optimizer can automatically do this using table statistics provided by connectors. Therefore, it is recommended to keep table statistics up to date and rely on the CBO to correctly choose the smaller table on the build side of the join.

2. Prefer broadcast over partitioned join

There are two types of join distributions in Presto:

  • PARTITIONED: each node participating in the query builds a hash table from only a fraction of the data
  • BROADCAST: each node participating in the query builds a hash table from all the data (data is replicated)

If the right-hand side (build) table is “small,” it can be replicated to all join workers — saving the CPU and network the cost of left-hand side repartitioning — which can be significant for a large left-hand side (probe) table. Otherwise, both tables are repartitioned using the join key, allowing joins with larger right-hand side tables.

3. Colocated joins execution for large tables

Very large join operations can sometimes run out of memory. Colocated joins (also known as grouped join execution) will use less memory, CPU, and shuffle less data among Presto workers. To leverage these benefits, you must make sure the two tables to be joined are partitioned on the same keys. When both sides of a join have the same table partitioning and the partitioning is addressable, partial data can be loaded into memory at a time, making it possible to execute the join with less peak memory usage.

4. Dynamic filtering can help selective joins

Presto was recently enhanced with the Dynamic Filtering feature (contributed to the Trino project by Varada) that allows pre-filtering the probe-side table before joining tables. Dynamic filtering optimizations significantly improve the performance of queries with selective joins by avoiding reading of data that would be otherwise filtered by join condition. This has shown to reduce query execution time by a factor of up to 10! Without dynamic filtering, Presto scans all the data in the fact table since there are no filters in the query. The join operator ends up throwing away most of the probe-side rows as the join criteria is highly selective.

Feature image via Pixabay.