IT system architects have long been accustomed to the notion that data analysis should not be conducted directly on the transactional database, lest it overload that system and slow performance. It was this thinking that gave birth to data warehouses, where historical data could be shuttled off and analyzed in depth.
Now the PostgreSQL open source database management system, helped by the arrival of servers with large core counts, wants to take that analysis workload back, saving users the money and administrative hassle of setting up secondary data warehouse systems and giving them the ability to interrogate live data.
The new release gets the system closer to “using PostgreSQL as a pure analytics platform. Historically, PostgreSQL was an OLTP,” or online transaction processing, database, said Bruce Momjian, one of the chief maintainers of the PostgreSQL core team, and a senior database architect at EnterpriseDB, which offers a commercial distribution of PostgreSQL.
“People are tired of dumping all their OLTP into an analytics database. This data is old, stale and there is a lot of overhead. There is a whole bunch of things you can’t do on a copy of the data that you can do on the live data,” Momjian said. “We’ve seen a lot of requests for live data analytics, and this [release] gets us closer to that.”
PostgreSQL 9.5, now available for download, offers a number of new features to prepare it for data warehouse-style work, as well as performance improvements to help it handle multiple workloads.
In benchmark tests conducted by EnterpriseDB, PostgreSQL 9.5 showed a 96 percent improvement over PostgreSQL 9.4 while serving 64 concurrent connections on a 24-core system with 496GB of RAM.
“We’ve done a lot of work focusing on performance and scalability,” said Marc Linster, vice president of products and services at EnterpriseDB. “There have been significant additions that support analytics, but these capabilities rest on the fact that we can take advantage of the bigger iron that Moore’s Law has made available to us.”
Only a few years back, the largest PostgreSQL implementations were running on 8-core or 16-core servers, holding maybe 5TB of data. Now users want to run the database system on 32- and even 64-core machines, and hold dozens of terabytes of data, Linster said.
EnterpriseDB acts as a sponsor of PostgreSQL and contributes code to make it more palatable for large-scale use. For instance, the company contributed features that improve shared buffer concurrency and lock management, paving the way for supporting more users at once.
“With higher core counts, managing concurrency, locking and shared buffers becomes really critical,” Linster said. “You have dozens of users accessing that same information at the same time, so you want to make sure that locking mechanisms are as efficient as possible. If I read something from shared buffers and you want to read the same thing, we want to reduce your wait cycle as much as possible.”
New Analytic Features
The database system comes with a number of new analytic features typically found in data warehouses, including grouping sets, cube and rollup. The same results could already be produced by stringing together standard SQL operators such as UNION ALL, but the new features make this work much easier, speed the execution of complex queries and offer a way to craft more nuanced commands. Think of the need to summarize information such as employee headcount across different departments, locations and job roles.
“This [approach] has the efficiency of going through the data only once,” Momjian said. “Telling people to use UNION ALL gets awkward after a while.”
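The single-pass idea Momjian describes can be sketched in a few lines of Python. This is only an illustration of the concept, not PostgreSQL's executor: the sample rows, the column names and the helper functions are all hypothetical, and the SQL in the comment assumes an imaginary employees table.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample data: (department, location, role)
COLUMNS = ("department", "location", "role")
EMPLOYEES = [
    ("Engineering", "NYC", "Developer"),
    ("Engineering", "SF", "Developer"),
    ("Engineering", "NYC", "Manager"),
    ("Sales", "NYC", "Rep"),
    ("Sales", "SF", "Rep"),
]

# Roughly the query being imitated (table name is an assumption):
#   SELECT department, location, count(*)
#   FROM employees
#   GROUP BY GROUPING SETS ((department), (location), ());


def grouping_sets_count(rows, grouping_sets):
    """Count rows for every grouping set while reading `rows` only once."""
    counts = {gs: Counter() for gs in grouping_sets}
    for row in rows:
        record = dict(zip(COLUMNS, row))
        for gs in grouping_sets:
            key = tuple(record[col] for col in gs)
            counts[gs][key] += 1
    return counts


def rollup(columns):
    """ROLLUP (a, b) is shorthand for the sets (a, b), (a), ()."""
    return [tuple(columns[:i]) for i in range(len(columns), -1, -1)]


def cube(columns):
    """CUBE (a, b) is shorthand for every subset: (a, b), (a), (b), ()."""
    return [
        tuple(subset)
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]


headcount = grouping_sets_count(EMPLOYEES, [("department",), ("location",), ()])
print(headcount[("department",)])  # headcount per department
print(headcount[("location",)])    # headcount per location
print(headcount[()])               # grand total
```

Without grouping sets, each of those three summaries would be its own SELECT glued together with UNION ALL, and each would read the table again.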
Another new feature that should help in analytics is a new indexing type called BRIN (Block Range Index). BRIN can generate very small indexes to describe a range of information, such as minimum and maximum values, that allow queries to skip over vast numbers of rows when looking for data within a certain range. With BRIN, 100GB of data can be summarized within 100KB or so.
“The BRIN creates a filter index,” Momjian said. “You’re looking for a purple shirt in a multi-terabyte table, and I can basically whittle down the table to know that the purple shirt will be within one percent of the table.”
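The filtering Momjian describes can be imitated with a toy model in Python. This is a simplified sketch, not PostgreSQL's implementation: it summarizes ranges of rows in a list rather than ranges of disk pages, and the function names and sample data are hypothetical. In PostgreSQL itself the index is created with `CREATE INDEX ... USING brin (...)`.

```python
def build_block_range_index(values, rows_per_range):
    """Summarize each consecutive range of rows by its minimum and maximum."""
    summaries = []
    for start in range(0, len(values), rows_per_range):
        chunk = values[start:start + rows_per_range]
        summaries.append((start, min(chunk), max(chunk)))
    return summaries


def range_query(values, summaries, rows_per_range, low, high):
    """Return the matching values and how many ranges had to be scanned."""
    matches = []
    ranges_scanned = 0
    for start, range_min, range_max in summaries:
        if range_max < low or range_min > high:
            continue  # nothing in this range can match, so skip it entirely
        ranges_scanned += 1
        for value in values[start:start + rows_per_range]:
            if low <= value <= high:
                matches.append(value)
    return matches, ranges_scanned


# Hypothetical data that arrives in roughly sorted order, such as timestamps
# or sequential order numbers. This is the case where min/max summaries help.
order_ids = list(range(1000))
index = build_block_range_index(order_ids, rows_per_range=100)

found, scanned = range_query(order_ids, index, 100, low=250, high=260)
print(len(index), "summaries;", scanned, "range scanned;", len(found), "rows found")
```

The index holds one small entry per range rather than one per row, which is why it stays tiny. The benefit depends on the data being physically ordered by the indexed column. If values are scattered at random, every range's minimum and maximum will span most of the domain and few ranges can be skipped.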
UPSERT Done Right
One of the features users have been most excited about is the introduction of UPSERT, which combines the INSERT and UPDATE commands into a single call, so that an INSERT automatically turns into an UPDATE if the data already exists. PostgreSQL doesn’t have a specific UPSERT command, but rather offers a special clause, ON CONFLICT, that can be used with INSERT to achieve the same outcome.
This is a feature that other database systems have had for a while, and Momjian admitted he was slightly embarrassed that PostgreSQL did not possess it until now.
It turns out that many of the implementations of UPSERT (also called MERGE on some systems) on other database systems were “handled very badly,” Momjian said. Implementing this feature is a difficult task, especially when the database is being updated by multiple parties at once. In numerous other systems, two people doing an UPSERT of the same data at the same time would just result in one user, or both, getting error messages, which is not an optimal way to handle the situation.
The PostgreSQL team didn’t want to just hack something together that would result in technical debt that would have to be addressed later by the developers, or by the user. They are pleased by the results.
“What is nice about our implementation is that it never generates an unexpected error. You can have multiple people doing this, and there is very little performance impact,” Momjian said. Because it can work on multiple tables at once, it can even be used to merge one table into another.
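The shape of the INSERT clause can be tried without a PostgreSQL server. The sketch below is an assumption-laden stand-in: it uses Python's built-in sqlite3 module, because SQLite (version 3.24 and later) accepts an ON CONFLICT ... DO UPDATE clause modeled on PostgreSQL's. The inventory table and its columns are hypothetical, and the example says nothing about how either database behaves under concurrent writers.

```python
import sqlite3

# The same statement text is intended to be valid in PostgreSQL 9.5+ (with its
# own parameter placeholders); here it runs on SQLite as a stand-in.
UPSERT_SQL = """
INSERT INTO inventory (sku, quantity)
VALUES (?, ?)
ON CONFLICT (sku) DO UPDATE
SET quantity = excluded.quantity
"""


def upsert(conn, sku, quantity):
    """Insert the row, or overwrite the quantity if the sku already exists."""
    conn.execute(UPSERT_SQL, (sku, quantity))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (sku TEXT PRIMARY KEY, quantity INTEGER NOT NULL)"
)

upsert(conn, "purple-shirt", 5)   # no such row yet: behaves as an INSERT
upsert(conn, "purple-shirt", 3)   # row exists: behaves as an UPDATE
upsert(conn, "green-hat", 7)      # a different key: another INSERT

rows = conn.execute(
    "SELECT sku, quantity FROM inventory ORDER BY sku"
).fetchall()
print(rows)
```

The `excluded` name refers to the row that the INSERT tried to add, so the UPDATE branch can reuse the incoming values instead of repeating them.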
Feature Image: Simon Bolivar, NYC street art by Dasic Fernandez.