Derek Robinson 00:00:02 Welcome to Data Points, a podcast by InterSystems Learning Services. Make sure to subscribe to the podcast on your favorite podcast app. Links can be found at datapoints.intersystems.com. I'm Derek Robinson, alongside Brenna Quirk today. And on this episode, we welcome Product Manager Benjamin De Boe to talk about columnar storage in InterSystems IRIS.
Derek Robinson 00:00:34 Welcome to the Data Points podcast by InterSystems Learning Services. I'm Derek Robinson. And first off, I'd like to welcome my colleague in Online Learning, Brenna Quirk, making her podcast debut. Brenna, how's it going?
Brenna Quirk 00:00:44 Good. I'm excited to be here.
Derek Robinson 00:00:46 Yeah. So, first time on one of these podcasts. And part of the reason we're doing that is our guest today, as I mentioned in that little intro sentence there, Benjamin De Boe, who is one of the product managers at InterSystems, we've had him on the podcast before, and he's gonna be talking to us about columnar storage, which kind of right away I felt that you'd be a good person to join for that conversation because you worked on a columnar video that we released in 2022. So what was that like, kind of being involved in that?
Brenna Quirk 00:01:13 That was a really exciting project. It was really great to work with Benjamin, you know, he really knows so much. He's really like an expert on this subject. So, it was really exciting to work with him, and especially this was kind of a new exciting feature at the time. We were really working on this video to present at Global Summit. This was one of the, you know, big new things coming out last year. So yeah, it was really great to work with him on that video.
Derek Robinson 00:01:38 Yeah, I think couple things there. First of all, we've had Benjamin on before, and he is extremely knowledgeable, and sometimes you gotta keep up with him because, like, he does not require time to stop and think, like, he just knows what he's talking about. And at Global Summit, in passing had a lot of like feedback to us, like as a learning group, that that video was really great and well-received. So, good job on that, and I think that'll be something in your mind as we do this interview with Benjamin. So, without further ado, let's take a listen to our discussion with Benjamin De Boe.
Derek Robinson 00:02:10 All right. Benjamin De Boe, welcome back to the podcast. How's it going?
Benjamin De Boe 00:02:13 Very good, very good. Thanks, Derek. Nice to be here.
Derek Robinson 00:02:15 Yeah. So today we're gonna be talking about columnar, as we mentioned in the intro. So, can you first start by summarizing for us what is columnar storage? It's a buzzword that's been thrown around a lot at Global Summit. It's been in some of our materials. What is columnar storage for the people that might not know?
Benjamin De Boe 00:02:30 If you want a really nice summary, you can go to intersystems.com/columnar-storage, and you'll see a beautifully produced video about columnar storage. But for those of you who are just listening to this while they're out running or cycling, I'll just walk you through it. So, on [InterSystems] IRIS, we store data in globals, globals being our unique storage paradigm for basic building blocks. Previously we were using traditional row storage, which means that we're storing the data organized per row. So everything that fits in one row goes into one global node. And now we've sort of tilted that on its side and we're storing it per column.
Brenna Quirk 00:03:08 So as I understand it, this is a new feature within InterSystems IRIS SQL capabilities, right?
Benjamin De Boe 00:03:14 Yes. SQL would be the primary way how you would get into it, because that's also the primary language for the use cases that we're targeting with it. But it is fully integrated with our object relational capability. So it means that it also works if you access your data through an object interface just as well as it works for SQL. It's also available as a new index type, so you can add it to existing tables that are row-organized. So it's really nicely interoperable. It's a nice complement to everything you already know and like about InterSystems IRIS.
Brenna Quirk 00:03:45 And I know one thing that we talked about, when we were working on previous learning content, like that video that you mentioned earlier, is the focus on increased performance. So is that one of the biggest drivers behind this feature?
Benjamin De Boe 00:03:57 Yes, absolutely. So we introduced this functionality specifically to address bottlenecks that we saw with traditional row storage, which are bottlenecks that affect everyone in the industry that stores data by…in a row format. So because data is stored in a separate global node for every for every row, if you have a large analytical query that needs to scan 10 million rows, then that's 10 million I/O operations, and then 10 million times that you need to extract the value from that row for calculating your aggregate. So that is very costly, the I/O that CPU costs, that's not something that you can just, that you can just conjure away. So that's why this new storage layout helps us tremendously to address these kinds of queries.
Derek Robinson 00:04:41 Right. One of the past episodes that we had you on, we talked about actually optimizing your SQL query performance. You mentioned there like massive analytical queries. Kind of honing in on that analytical part, what's kind of the difference between what you call an analytical query there, versus just any traditional query you might run?
Benjamin De Boe 00:04:59 There is no strict definition of when a query is analytic and when it's an operational query, but we like to think about analytical queries as those that involve scanning a large number of rows, and that do some aggregates, that do some groupings, et cetera, where from the results—the result typically being only handful of rows maybe, that are gonna serve to draw up a chart—where those do no longer…those result rows do not longer correspond one-to-one to the original rows in your tables, whereas with a traditional operational query—an application query—such as the ones to show you the most, the 10 most recent orders in your ordering application, those would be simple. They can also be about really big tables, but typically you're only retrieving a handful of them. You're only operating on a handful of them. And there is that link between the rows you see in the result and the rows you see in the original tables.
Derek Robinson 00:05:56 Right, right. So kind of drawing that out further, it sounds like with columnar storage, the analytical queries are the ones that stand to gain the most in performance and run faster.
Benjamin De Boe 00:06:05 Absolutely. That's exactly what we built it for. So we've been seeing queries that speed up with two orders of magnitude and some even more, especially if you look at the really simple ones that are really in the wheelhouse of columnar storage, but also for the complex ones, including joins involving multiple tables…we're now also seeing some really, really nice speedups, and that's why we're comfortable claiming that we're speeding them up by 10x.
Brenna Quirk 00:06:30 So I know that often we talk about the data storage format in InterSystems IRIS being a big reason for good performance. Is that what enables these performance gains using columnar?
Benjamin De Boe 00:06:42 Yeah, that storage layout is definitely an important part of it, but it's definitely not the only thing as well. So if we would've used the same encoding to encode a large number of column values that we use for encoding row values, that wouldn't have allowed us to make these gains. So we also invented, came up with a new encoding mechanism that we that we call vector compared to the list that we used for rows. So list is very much optimized for having a varying number of data types, varying number of values, et cetera. So it's meant for flexibility, the flexibility that you need in this kind of application-oriented use cases. In these analytic use cases, when you're storing all the values for a column, you can be quite sure that within a column, everything has the same data type, so you can optimize for that.
Benjamin De Boe 00:07:32 If it's a string column, you can apply a dictionary to it to store that data much more efficiently. If you are looking at integers, you can play with magnitude to also limit the number of bytes you're spending. So that's a very economical way of storing that data. Also handling with handling sparsity is something that we've been able to deal with really, really well in this new vector encoding. And that's not where it ends, because together with that data type, we also came up with a bunch of functions that operate directly on this data type that operate directly on all sets of values—thousands of values inside that data type at the same time. So these are using a set of special instructions that are available in modern CPUs called SIMD, single instruction multiple data, so that they can work on multiple data multiple data elements at the same time.
Benjamin De Boe 00:08:23 So that gives you a massive speedup compared to extracting value by value and then doing the sum or whatever aggregate that you are actually looking for at the ObjectScript level or outside of that SIMD, chipset level. So we've really pushed the computation deep into the iron. And then those operations, those vector operations, are what we use by the…in query optimizer. So when we're seeing a query that involves columnar data, we build our query plan in such a way that we can chain all of these operations, all of these vector operations, in kind of a pipeline, and we just push the chunks of data through it. And that's what's giving us these nice vectorized query plans, as we call them, that are giving us these 10x and more speedups.
Brenna Quirk 00:09:12 Right. Very cool. So Benjamin, how can people start taking advantage of this?
Benjamin De Boe 00:09:18 Starting with [InterSystems] IRIS 2022.2, there is a new optional class when you're creating your tables. You just have to add with storagetype = columnar. And that's it. It's really just a simple toggle per table. You can actually even refine it per column if you want, or add the indices as I mentioned earlier. And that's all there is to it. Uh, other than that, from a SQL querying perspective, everything remains the same. So your applications, your data analysts or the BI tools that they use, they can continue to work with these tables; they will just get faster results.
Derek Robinson 00:09:50 Right, right. Nice. I know a lot of the times when we cover these new features that are coming out, one of our top priorities is to get it into the hands of users, let them be able to play with it, let them kind of experiment with it. We often release stuff like that early in preview form. When it comes to this, like, how ready is this for production level? Like, how close are you to being able to, you know, tell somebody else this is ready for your production system?
Benjamin De Boe 00:10:13 So this has been a long time in the making; you can imagine that inventing that data type and those functions, that kept us busy for quite a while. So we announced this at Global Summit in 2022, and then we launched the Early Access Program. We worked with a couple of customers that were interested. We learned a lot. We were able to adjust and improve our implementation based on the feedback that we got from them. And we've improved. So in 2022.3, there's already a lot better support in our query optimizer and query plan generation. And now with 2023.1 to be released shortly, we plan to certify this for production use because we are now confident that we've really dealt with the full spectrum of queries and are able to give them nice vectorized query plans that offer the performance that we claim it gives you. And from there on, we will continue to add improvements in the same way that we add improvements across our SQL engines. So there might just happen to be a couple more that impact column queries rather than other ones for the next few releases. But we're really ready and very happy to be ready to see customers take this to production.
Derek Robinson 00:11:27 Right, right. Awesome. Well, Benjamin, thank you so much for joining us. We'll have to check back in with you once we've got some feedback from those customers who are rolling out to columnar storage.
Benjamin De Boe 00:11:34 Looking forward to that.
Derek Robinson 00:11:38 So once again, thanks to Benjamin for joining us. Brenna, what were some of your big takeaways from that conversation after kind of the original discussion we had about your work on the columnar stuff?
Brenna Quirk 00:11:47 Yeah, I found it really interesting to hear a little bit more about the underlying data storage in columnar, this vector storage. I remember when we were working on the video, that was something we didn't wanna get into too much detail on since it was really kind of a higher-level introduction-type video. So it's cool to see it all kind of actually coming together now and hear a little bit more about how it works.
Derek Robinson 00:12:07 Right, right. Yeah, I mean, kind of similar vibe for me, like, you know, the content was all interesting, and some of it probably overlapped with the columnar video as well, but for me, one of the big takeaways was sort of his readiness to stamp it production-ready. I think a lot of times we have these conversations with the product team and there's like, one of the things that I mentioned it in the interview with him that we try to do is to get things out into our user's hands, like to be able to try it, if it's an early access or a hackathon or something like that. So I thought it was really cool that he was basically ready to say like, yeah, this has been out for being able to try it and like with our upcoming release, it's gonna be production-ready. So I thought that was cool to see that come together as well.
Brenna Quirk 00:12:47 Yeah, agreed.
Derek Robinson 00:12:47 So thanks again to Benjamin, and thank you Brenna for joining, and we'll hopefully have you on for another one soon. That'll do it for this episode, and we will see you next time on Data Points.