Update From Big to Fast: Presto Continues to Shine for Cloud Data Lake Analytics

April 1, 2022 by Phu Nguyen

Why Presto?

A federated data computing framework that allows users to retrieve and analyze data from multiple data sources and compute in a centralized platform is a better solution for fast data analytics in today’s data environments.

Presto, an open source platform, was originally designed to replace Apache Hive, a batch approach to SQL on Hadoop, with higher performance and more interactivity. The idea behind Presto was to provide an MPP (massively parallel processing) framework for computing over large-scale data, so its architecture supports the disaggregation of compute and storage as well as real-time, high-performance data analytics. Presto does not store data itself; instead, data sources are accessed through various connectors.

After years of development, the latest version of PrestoDB supports SQL-on-Anything and is an interactive query framework that can fit in any enterprise architecture as an in-memory query engine.

How Does Presto Work?

Presto is based upon a standard MPP database architecture, which enables horizontal scalability and the ability to process large amounts of data. Presto’s in-memory capabilities allow for interactive querying across various platforms of data sources. In order to access data at different locations, Presto is designed to be extensible with a pluggable architecture. Many components can be added to Presto to extend it further using this architecture, including connectors and security integrations.

Basic Concepts:

The Presto cluster is a query engine that runs a single server process on each instance, or node. It consists of two types of service processes: a Coordinator node and Worker nodes. The Coordinator node's main purpose is to receive SQL statements from users, parse them, generate a query plan, and schedule tasks for dispatch across the Worker nodes. The Worker nodes execute the tasks they receive from the Coordinator, whose query plan is fragmented for distributed processing, and may communicate with other Worker nodes while doing so.
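The division of labor between the Coordinator and the Workers can be illustrated with a toy model. This is a minimal sketch, not Presto's scheduler: the names `Split` and `assign_splits` and the round-robin policy are assumptions made for illustration, and the real Coordinator also takes data locality into account.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """A hypothetical unit of work: one chunk of a table to scan."""
    table: str
    chunk: int


def assign_splits(splits, workers):
    """Assign splits to workers round-robin (illustrative policy only)."""
    if not workers:
        raise ValueError("no workers available")
    assignment = defaultdict(list)
    for i, split in enumerate(splits):
        assignment[workers[i % len(workers)]].append(split)
    return dict(assignment)


splits = [Split("orders", n) for n in range(5)]
plan = assign_splits(splits, ["worker-1", "worker-2"])
for worker, work in plan.items():
    print(worker, [s.chunk for s in work])
```

Each worker ends up with its own list of splits, which mirrors how a fragmented plan becomes independent tasks that can run in parallel.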

1. Coordinator

The Presto Coordinator is a single node deployed to manage the cluster. It lets users submit queries through the Presto CLI, applications using JDBC or ODBC drivers, or other available client libraries. The Coordinator is also responsible for communicating with Workers to get status updates and assign tasks, and for sending result sets back to users. All communication is done through a RESTful API exposed by the Coordinator's StatementResource class.

2. Worker

Inside a Presto cluster, there may be one Coordinator node with multiple Worker nodes. If the Coordinator is the leader, the Worker nodes are the followers. Each Worker node runs as a service process that listens for tasks from the Coordinator and performs the actual computation. Each Worker periodically sends a heartbeat to the Discovery Server through a RESTful API to report whether it is online. This lets the Coordinator learn from the Discovery Server which Worker nodes are available for task dispatch when a user submits a query.
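The heartbeat mechanism can be sketched as a small registry that remembers when each worker last checked in. This is a simplified illustration, not the Discovery Server's actual implementation: the class name `DiscoveryRegistry`, its methods, and the 30-second time-to-live are all hypothetical.

```python
class DiscoveryRegistry:
    """Toy stand-in for a discovery service that tracks worker heartbeats."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._last_seen = {}  # worker id -> timestamp of last heartbeat

    def heartbeat(self, worker_id, now):
        """Record that a worker announced itself at time `now`."""
        self._last_seen[worker_id] = now

    def active_workers(self, now):
        """Return workers whose last heartbeat is within the TTL."""
        return sorted(
            w for w, seen in self._last_seen.items()
            if now - seen <= self.ttl_seconds
        )


registry = DiscoveryRegistry(ttl_seconds=30)
registry.heartbeat("worker-1", now=0)
registry.heartbeat("worker-2", now=0)
registry.heartbeat("worker-1", now=25)   # worker-2 goes silent after t=0
print(registry.active_workers(now=40))   # worker-2 has exceeded the TTL
```

A coordinator in this model would call `active_workers` before dispatching tasks, so a worker that stops sending heartbeats simply drops out of the candidate list.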

The logical implementation of Presto is shown below. There are seven basic steps for running a query:

1. The user submits a query from a client API to the Presto Coordinator over HTTP.
2. The Coordinator receives the SQL statement as text, then parses and analyzes it and creates a logical plan for execution using an internal data structure called the Presto query plan. After the query plan is optimized, three internal classes (query execution, stage execution, and task distribution) are generated so that the Coordinator can create HTTP tasks according to data locality.
3. Depending on where the data is located, the Coordinator generates tasks and dispatches them to the designated Worker nodes through HttpRemoteTask's HttpClient. The HttpClient creates or updates the task request, and the TaskResource on the Worker node that is local to the data provides a RESTful API. The TaskResource takes the request and either starts a SqlTaskExecution object on that Worker node or updates the object's Split.
4. The upstream task reads data from the corresponding connector.
5. The downstream task consumes the output buffers of the upstream task and starts computing within its stage on all Worker nodes. Presto is an in-memory computing engine, so memory management must be fine-grained to keep query execution orderly and smooth, because conditions such as starvation and deadlock can otherwise occur. Worker nodes were designed for pure in-memory computing, so by default data is not spilled to disk when memory runs short, and a query can fail with an out-of-memory error. Recent versions of Presto do support spilling to disk as a tunable option, but enabling it is not recommended because of the high latency cost.
6. Once the Coordinator has dispatched tasks across the Worker nodes, it continuously listens for the computing results of the tasks in the final stage.
7. After submitting a SQL statement, the client continuously listens for the final result sets from the Coordinator. These result sets are streamed back to the client piece by piece over HTTP as outputs become available.
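The client side of this exchange can be sketched as a polling loop that keeps following a "next page" link until the results run out. This is a minimal sketch under stated assumptions: the response fields `nextUri`, `data`, and `error` follow Presto's client protocol as commonly documented, but the function `fetch_all_rows` and its `post` and `get` parameters are hypothetical, and a fake coordinator stands in for real HTTP calls.

```python
def fetch_all_rows(post, get, sql):
    """Submit `sql`, then follow `nextUri` links until results are exhausted.

    `post(sql)` and `get(uri)` are caller-supplied callables that return the
    decoded JSON response as a dict; they stand in for real HTTP requests.
    """
    rows = []
    response = post(sql)
    while True:
        if "error" in response:
            raise RuntimeError(response["error"].get("message", "query failed"))
        rows.extend(response.get("data", []))  # a page may carry no rows yet
        next_uri = response.get("nextUri")
        if next_uri is None:
            return rows
        response = get(next_uri)


# A fake coordinator that serves a canned result in two pages.
pages = {
    "page/1": {"data": [[1, "a"]], "nextUri": "page/2"},
    "page/2": {"data": [[2, "b"]]},
}
rows = fetch_all_rows(
    post=lambda sql: {"nextUri": "page/1"},
    get=lambda uri: pages[uri],
    sql="SELECT id, name FROM t",
)
print(rows)
```

Passing the transport in as callables keeps the loop testable without a running cluster; in a real client, `post` and `get` would wrap an HTTP library and add the required request headers.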

Presto doesn’t use MapReduce. It computes through a custom query and execution engine. All of its query processing is in memory, which is one of the main reasons for its high performance.
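The upstream and downstream relationship between tasks, with data flowing through memory rather than being materialized between stages, can be imitated with Python generators. This is only an analogy for pipelined execution, not Presto's operator model; the function names and the shared `log` list are illustrative assumptions.

```python
def scan(rows, log):
    """Upstream stage: yield rows one at a time, as a connector scan might."""
    for row in rows:
        log.append(f"scan {row}")
        yield row


def filter_positive(rows, log):
    """Downstream stage: consume upstream output as soon as it is produced."""
    for row in rows:
        if row > 0:
            log.append(f"emit {row}")
            yield row


log = []
pipeline = filter_positive(scan([3, -1, 7], log), log)

first = next(pipeline)          # pulls just enough input for one result
log_after_first = list(log)     # snapshot: only the first row was scanned
rest = list(pipeline)           # drain the remaining rows

print(first, log_after_first)
print(rest, log)
```

The snapshot taken after the first result shows that the downstream stage produced output before the upstream stage had read the rest of its input, which is the essence of pipelining.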

Presto’s Use Cases

Presto is a distributed SQL engine for data analysis on data warehouses and other disparate data sources. It can achieve excellent performance for real-time or quasi-real-time analytic computing, with query response times ranging from milliseconds to seconds. With the right configuration, a complex query can finish within minutes, versus hours or days on Hive. With its federated architecture, Presto is a proven technology that is most suitable for the following application scenarios:

Replace Hive queries for better performance. Presto’s execution model is a pure memory MPP model, which is at least 10 times faster than the MapReduce model of disk shuffle used by Hive.
Unified SQL execution engine. Presto is compatible with the ANSI SQL standard and can connect to multiple RDBMS and data warehouse data sources, using the same SQL syntax and SQL functions on these data sources.
Bring SQL execution capabilities to storage systems that do not have SQL execution capabilities. For example, Presto can bring SQL execution capabilities to HBase, Elasticsearch, and Kafka, and even local files, memory, JMX, and HTTP interfaces.
Construct a virtual unified data warehouse with federated querying of multiple data sources. If the data sources that need to be accessed are scattered in different RDBMS, data warehouses, and even other Remote Procedure Call (RPC) systems, Presto can directly associate these data sources together for analysis (SQL Join), without the need to copy data from the data source, and without needing to centralize it in one location.
Data migration and ETL tools. Presto can connect to multiple data sources, and it has a wealth of SQL functions and UDFs, which can conveniently help data engineers extract (E), transform (T), and load (L) data from one data source to another.
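The federated-query scenario above boils down to joining rows that live in different systems without first copying them into one place. The sketch below imitates that with two in-memory lists standing in for separate catalogs; the table names, the sample rows, and the `hash_join` helper are hypothetical, and a real engine would plan and distribute the join rather than run it in one process.

```python
# Two hypothetical "catalogs", standing in for separate systems
# (for example, a MySQL table and a Hive table).
mysql_users = [
    {"user_id": 1, "name": "Ana"},
    {"user_id": 2, "name": "Bao"},
]
hive_orders = [
    {"order_id": 10, "user_id": 1, "total": 25.0},
    {"order_id": 11, "user_id": 1, "total": 5.0},
    {"order_id": 12, "user_id": 3, "total": 9.0},
]


def hash_join(build_rows, probe_rows, key):
    """Inner join: index the build side by key, then stream the probe side."""
    index = {}
    for row in build_rows:
        index.setdefault(row[key], []).append(row)
    for probe in probe_rows:
        for match in index.get(probe[key], []):
            yield {**match, **probe}


joined = list(hash_join(mysql_users, hive_orders, "user_id"))
for row in joined:
    print(row["name"], row["order_id"], row["total"])
```

Only orders whose `user_id` appears in the users table survive the inner join, so the order for the unknown user 3 is dropped.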

A Popular Data Lake Analytic Engine

First of all, Presto adopts a full memory computing model with excellent performance, which is especially suitable for ad hoc query, data exploration, BI reporting and dashboarding, lightweight ETL and other business scenarios.

Secondly, unlike other engines that support only partial SQL semantics, Presto supports complete SQL semantics, so you rarely have to worry about requirements that Presto can't express. Furthermore, Presto has a very convenient plug-in mechanism: you can add your own plug-ins without changing the kernel. In theory, you can use Presto to connect any data source to meet your various business scenarios.
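The idea of extending an engine without changing its kernel can be shown with a small registry pattern. Presto's real plug-ins are written in Java against its service provider interface, so this is only a conceptual sketch: `register_connector`, `read_table`, and the `memory` catalog are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical registry mapping a catalog name to a function that reads rows.
ConnectorFn = Callable[[str], Iterable[dict]]
_connectors: Dict[str, ConnectorFn] = {}


def register_connector(name: str) -> Callable[[ConnectorFn], ConnectorFn]:
    """Decorator that adds a connector without touching the engine code."""
    def decorator(fn: ConnectorFn) -> ConnectorFn:
        _connectors[name] = fn
        return fn
    return decorator


def read_table(qualified_name: str) -> List[dict]:
    """Engine side: resolve 'catalog.table' and delegate to the connector."""
    catalog, table = qualified_name.split(".", 1)
    if catalog not in _connectors:
        raise KeyError(f"no connector registered for catalog {catalog!r}")
    return list(_connectors[catalog](table))


@register_connector("memory")
def memory_connector(table: str) -> Iterable[dict]:
    """A trivial plug-in that serves rows from a hard-coded dictionary."""
    data = {"numbers": [{"n": 1}, {"n": 2}]}
    return data.get(table, [])


print(read_table("memory.numbers"))
```

Adding a new data source in this model means writing one more decorated function; `read_table` itself never changes.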

Finally, Presto has a very active community. The project is part of the Linux Foundation's Presto Foundation, and in addition to Facebook, many large companies and services such as Twitter, Uber, Amazon Athena, and Alibaba have embraced Presto's data lake analytics capabilities, building features on Presto's codebase to support large-scale, high-volume OLAP workloads on top of their own data federation systems. Based on these advantages, Presto is a proven choice as the underlying analytics engine for cloud data lake analysis.

A data lake is designed first and foremost around open underlying file storage, which gives maximum flexibility to the data entering the lake. That data can be structured, semi-structured, or even completely unstructured raw logs. Open storage also gives more flexibility to the engines above it: various engines can read and write the data stored in the lake according to their own scenarios, and only need to follow relatively loose compatibility conventions (such loose conventions bring hidden challenges, which will be mentioned later). At the same time, direct file system access makes many higher-level functions difficult to implement. For example, fine-grained permission management (below file granularity) and unified file management are hard to provide, and upgrading the read-write interface is also very difficult, because every engine that accesses the files must be upgraded before the upgrade is complete.

A Fast SQL Engine

Presto also features performant SQL processing. Here are a few reasons why:

Presto supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions. As a substitute for Hive and Pig, both of which query HDFS data through MapReduce pipelines, Presto does not store data itself, but it can access multiple data sources and supports cascading queries across them.
YARN is a general-purpose resource management system. Whichever engine Hive uses to execute SQL, such as MR or Tez, each operator runs in a YARN container, and YARN is slow to launch containers (on the order of seconds). The difference resembles an application starting a new process versus starting a thread: a thread is lightweight and quick to start, which matters most for simple operations, whereas starting a process is more cumbersome and more easily constrained by the operating system. Presto schedules work on threads, not processes.
Presto's Coordinator/Worker architecture resembles Spark's standalone mode, in which the work is done by just two kinds of processes and services. However, Spark focuses more on the dependency relationships between RDDs, and stage failure and recovery lead to higher overhead. Spark input also relies directly on the Hadoop InputFormat API, which prevents Spark SQL from passing SQL optimization details to the InputFormat at runtime. Presto discards the Hadoop InputFormat but adopts a similar data partitioning technique. After SQL is parsed, Presto can generate a tuple domain from the WHERE conditions and pass it to the connector. Depending on the data source and its metastore data, the connector can then push a certain amount of filtering down to indexes, which greatly reduces the range of data scanned and the amount of data involved in the computation.
Presto is completely memory-based parallel computing. Unlike Hive on MR or Tez, which writes intermediate data to disk, or Spark, which spills overflow data to disk, Presto assumes that data fits in memory. Furthermore, thanks to its pipelined execution, Presto can determine from the SQL execution plan which data can be returned immediately. This gives users an impression of great speed that is partly an illusion, but a justifiable one: even when a large result set is extracted, the client traverses it with a cursor, and by the time the cursor reaches a given position, the subsequent results have already been computed, so the results are unaffected.
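The pushdown idea mentioned above, in which a constraint derived from the WHERE clause lets the connector skip data it never needs to read, can be sketched as partition pruning. This is a simplified illustration rather than Presto's connector interface: the `Range` class, the `scan_partitions` function, and the year-partitioned sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Range:
    """Hypothetical closed range constraint on one column (None = unbounded)."""
    low: Optional[int] = None
    high: Optional[int] = None

    def contains(self, value: int) -> bool:
        return (self.low is None or value >= self.low) and (
            self.high is None or value <= self.high
        )


def scan_partitions(partitions: Dict[int, List[dict]], constraint: Range):
    """Connector side: skip whole partitions the constraint rules out."""
    scanned = []
    rows = []
    for year, part_rows in sorted(partitions.items()):
        if not constraint.contains(year):
            continue  # pruned using the partition key alone; no rows are read
        scanned.append(year)
        rows.extend(part_rows)
    return scanned, rows


partitions = {
    2019: [{"id": 1}],
    2020: [{"id": 2}, {"id": 3}],
    2021: [{"id": 4}],
}
# Roughly what a filter such as "WHERE year >= 2020" could become after planning.
scanned, rows = scan_partitions(partitions, Range(low=2020))
print(scanned, len(rows))
```

Because the 2019 partition is rejected before any of its rows are touched, the amount of data scanned shrinks along with the amount of data that takes part in the computation.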

Conclusion

In many scenarios, Presto's ad hoc queries are expected to run 10 times faster than Hive, finishing in seconds or minutes. Presto supports multiple data sources, such as Hive, Kafka, MySQL, MongoDB, Redis, JMX, and more. As an open source distributed SQL query engine, Presto is a proven analytic framework for quickly querying data of any size. It supports both relational and non-relational data sources; the latter include the Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, MongoDB, and HBase. Furthermore, Presto supports JDBC/ODBC connections, ANSI SQL, window functions, joins, aggregations, complex queries, and more. These key features are the keystones of building cloud-based data lake analytics.

The Linux Foundation is a sponsor of InApps Technology.
