The SQL vs. NoSQL debate rages on as companies increasingly can’t accept the choice between scale or query power, but are insisting on both.
The latest entrant into the fray, the time-series database TimescaleDB, grew out of an Internet of Things platform called iobeam. It was co-founded by Ajay Kulkarni and Michael Freedman, a professor of computer science at Princeton University, and built out by a bunch of PhDs in New York City and Stockholm.
As it turns out, the IoT customers were more interested in the database they’d built for iobeam, so the company redirected its focus and changed its name. They open sourced TimescaleDB in March, and Freedman highlighted it at Strata+Hadoop World.
The IOT platform iobeam is a cache for Spark Streaming applications. In working with customer companies, the founders had 100,000 devices or more on a system and needed a place to put all this sensor data.
“We wanted something where we could quickly ask questions of it, power interactive monitoring charts, do exploration, but that was also scalable,” Kulkarni said.
They were frustrated with the current database offerings — they used some Postgres, Mongo and Cassandra — but were forced to choose between the scale of NoSQL or query power of a relational database.
“Either you can get joins and secondary indices and non-time-based aggregates and predicates or you could get scale,” as Kulkarni put it.
Ultimately they decided to build on top of 20-year-old Postgres because of its record of reliability and rich ecosystem. Users don’t have to learn a new system and the Postgres tools still work. It has integrations with Kafka, Tableau, Pentaho and more. There’s also an extension to query geospatial data.
It’s among a flurry of new databases including Google Spanner, TiDB and FaunaDB based on distributed SQL.
“There’s a pendulum that swung to NoSQL; now you have NoSQL databases adding a SQL layer, so now it’s ‘not only SQL.’ Then there’s the NewSQL world, with Cockroach, for example, Spanner to some degree. I think part of the pendulum swinging back is people saying, ‘You know, reliability really matters.’ Postgres is one of the most, if not the most reliable database,” Kulkarni explained
“We didn’t want to reinvent the wheel, or a better analogy might be that we didn’t want to build a whole new car from scratch. We have a car that’s pretty reliable and we want to put in a better engine and a few other pieces to make it go faster. That’s scale, obviously.”
“Time-series used to be this kind of niche thing that you only saw in a few industries — Wall Street, DevOps — but now it’s expanding in IoT, blockchain and others,” Kulkarni said.
Time-series requires a high write rate, but in IoT or similar applications, it’s primarily concerned with the latest data and less so with the update capabilities of transactional systems.
“In time-series data — in monitoring, DevOps, IoT sensor data — the workloads are mostly new data arriving to your system. I’m not sending you data from a year ago, I’m sending you data of what’s happening right now,” explained Freedman.
Timescale relies heavily on a natural partitioning across both time and space. Writes typically are made to the latest time interval(s) and across a “partitioning key” related to a primary data index, such as device identifiers for sensor data, locations, customers, or users.
Timescale creates a hypertable, treating the data as if it were just one table, regardless of its size, that can be queried as with standard SQL and divides the data into small chunks split across the two dimensions. The ability to query across both dimensions opens up a wealth of new opportunities, a paper on the technology explains.
Chunks are dynamically created by the runtime and placed on different servers when run as a cluster. They also can be automatically spread across disks on a single machine.
By keeping the chunks small, more data fits in memory and the system is optimized to access as few chunks as possible to limit latency and enhance throughput. It employs highly structured indexes to make querying fast and resource-efficient.
“We have the ability to scale out horizontally, but also provide better performance for run data and querying data even if you’re running on a single node,” Freedman said. Its own tests outlined in the paper showed it outperformed “vanilla” Postgres on just one node.
He explained it this way:
“Data takes place over two dimensions, time and space. So we’re portioning that grid into these chunks, and those are the chunks that we spread over servers. If you look at where new writes are going, it’s only in the latest time blocks. So those are the ones being constantly kept in memory. This all happens transparently to the user and administrator.
“We do a number of optimizations to make sure they’re ‘right size’ – that they fit in memory, and have good performance and query performance. And because we keep enough metadata around, you interact with this like it’s one giant table. Under the cover, we tie into a query planner and execution engine so if there’s a query like, ‘Tell me the average CPU load across all my devices, group by per hour, per location and order them by highest’ so I can give to DevOps so I can figure out if some nodes are misbehaving or something’s not normal.
“We do optimizations so the minimum number of chunks are needed to [answer this]. We do this to keep latency small and get good performance. That kind of query, a single-node traditional database cannot support, period.”
The advantage of the hypertable is that you don’t have to worry about writing to the right partition or the right node, and when you’re querying data, it’s just a single table.
It maintains all the advantages of a relational database, Kulkarni said, especially touting the utility of joins. With joins, at query time, you can combine your time-series and business data and the two can live alongside each other.
“Compared with NoSQL, there’s just lots of queries that you couldn’t support where you could on us,” Freedman said.
Feature Image: “Time” by Sean MacEntee, licensed under CC BY-SA 2.0.