In this episode, we chat with Benjamin De Boe, product manager for data management and analytics, about optimizing the performance of your SQL queries in InterSystems IRIS. Benjamin will go over some of the most common issues that cause performance loss within your queries, the easiest ways to fix them, and some other items to look for in order to ensure your queries are healthy and efficient.
For more information about Data Points, visit https://datapoints.intersystems.com. To try InterSystems IRIS today, head over to https://www.intersystems.com/try and launch your instance! You can find our SQL QuickStart at https://learning.intersystems.com/course/view.php?name=SQL%20QS, and if you'd like to discuss these SQL topics on the Developer Community, you can head over to https://community.intersystems.com.
TRANSCRIPT:
Derek Robinson 00:00:01 Welcome to Data Points, a podcast by InterSystems Learning Services. Make sure to subscribe to the podcast on your favorite podcast app, such as Spotify, Apple Podcasts, Google Play, or Stitcher. You can do this by searching for Data Points and hitting that Subscribe button. My name is Derek Robinson, and on today's episode, I'll chat with Benjamin De Boe, Product Manager for Data Management and Analytics at InterSystems, about SQL performance in InterSystems IRIS.
Derek Robinson 00:00:40 Welcome to Episode 4 of Data Points by InterSystems Learning Services. Hopefully you enjoyed the launch of the podcast earlier this month, which featured three episodes. Going forward, we plan to release one to two new episodes each month, so make sure you're subscribed on your favorite podcast app; that way you won't miss any new episodes. We also have a new homepage for the podcast with an easy-to-remember URL: datapoints.intersystems.com. That page has the latest episode, other recent episodes, and links to subscribe on the different podcast apps. So go check that out for sure. Today I'm talking with Benjamin De Boe about SQL performance tips in InterSystems IRIS. Benjamin has worked with us in Learning Services quite a bit over the last few years to create engaging content in his areas of expertise. One of my favorite things about working with Benjamin is his ability to clearly convey concepts and really make them easy to understand. I think that comes through here in our discussion with Benjamin De Boe.
Derek Robinson 00:01:35 All right, and welcome to the podcast Benjamin De Boe, Product Manager for Data Management and Analytics here at InterSystems. Benjamin, how's it going?
Benjamin De Boe I'm doing great. Thanks, Derek. How are you?
Derek Robinson I'm doing great. Thanks. So today we're going to talk about SQL performance in InterSystems IRIS. A lot of SQL developers out there might use different data models for their applications that use InterSystems IRIS; as we've mentioned in other episodes, it's a multi-model database where you can choose your approach based on your use case. But here we're going to talk about relational SQL. So where do we really begin? I think what we're looking to you for here is some of the common practices and best tips for enhancing your performance. What's your overview-level explanation of SQL performance in IRIS, and where would you start with it?
Benjamin De Boe 00:02:23 Okay. So there's a whole lot to talk about, of course, when we talk about SQL performance, and much of that is already covered in various separate articles. There's also a great book in our documentation that's focused on SQL optimization. But let's take more of a problem-solving look at what's happening here. So imagine you have users complaining that the system is slow, or users complaining that this particular piece of the system is slow. Of course, the second ones are easier to help, but the first category also exists.
Derek Robinson 00:03:00 Right. And so let's say that maybe I have a little more information to provide than "everything is slow, including my computer starting," right? Let's say I know that certain queries are slow, or I know that, in general, this part of my application is slow. Where can I go to find the information I need?
Benjamin De Boe 00:03:19 So one great entry point into this whole conversation is the lightweight query statistics. That's a bunch of metrics that are always on. For every SQL statement that you issue, we collect a few metrics that cause hardly any processing overhead, and that information is kept in the statement index. That means you can see how often every query is called, how long it takes on average, and what the standard deviation in the execution time is. So that already gives you a great head start for those users that come to you and say the whole application is slow: you can use the information from the lightweight query stats to identify the queries that might be the likely cause. There are two categories there. There are the queries that are just slow, that have a very long execution time, but some of those might be inherently slow. If it's the big fat accounting query that needs to run once a year, that gathers all the detail from the whole year and touches the entire database, okay, that query may take half an hour to complete. But if it's only executed once a year, that's not too bad. If it's something that gets executed thousands of times a day, though, and it can be brought from one and a half seconds to half a second, that's a much better investment of your tuning time.
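For listeners who want to try this themselves: the statement index is queryable through SQL. As a rough sketch, assuming your IRIS version exposes it through INFORMATION_SCHEMA.STATEMENTS with statistics columns like the ones below (column names can vary by version):

    -- Frequently executed statements with the highest average runtime;
    -- these are usually the best candidates for tuning effort.
    SELECT TOP 10 Statement, StatCount, StatAverage, StatStddev
    FROM INFORMATION_SCHEMA.STATEMENTS
    WHERE StatCount > 100
    ORDER BY StatAverage DESC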
Derek Robinson 00:04:41 Right. So, yeah, it really depends on your situation and where it's smartest to put your attention, as far as the different queries you're running and the uses you have for them.
Benjamin De Boe 00:04:51 Exactly. That's where the lightweight query statistics can help.
Derek Robinson 00:04:55 So, for me, being someone who's not necessarily an expert in this, I see the term lightweight statistics, and I think there's probably a bunch more beyond that, right? I don't want to make it sound burdensome, but what else is there? Let's say I want to take a deeper dive than what I see in those lightweight statistics. What's the next step I would take as someone interested in that?
Benjamin De Boe 00:05:13 OK. So we don't call them heavyweight statistics, but there are additional statistics that we can collect. They're more for very involved analysis of one particular query or set of queries, which is typically better to involve InterSystems support for. But then there's a whole lot of stuff that you can do all by yourself, with a little bit of investigation and experimentation, to improve your query's performance.
Derek Robinson 00:05:37 Right. So as far as some of those pieces of investigation, maybe there's one query that's really bugging me, and the performance isn't great. What more can I look at, beyond just the speed, to see the overall health and behavior of that query? What can I really dive into to fine-tune that performance and optimize it as best I can?
Benjamin De Boe 00:05:58 The first step is obviously to check what it's actually doing, so checking what the query plan looks like: what the query optimizer comes up with as the execution strategy for giving you the answer to the question you asked through SQL. That query plan is a step-by-step description of the actual code that gets executed to satisfy your query, and it might have steps such as: look at this particular index with the user-supplied value; use the IDs you retrieve from that index to look into the master map, which has the main table data; return those rows; join to another table. So that's what a query plan looks like. And you can read those through the System Management Portal, using the EXPLAIN command at the SQL prompt, or through an ObjectScript API, if you wish.
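As a quick illustration: in the IRIS SQL shell you can prefix a statement with EXPLAIN to see the plan the optimizer chose (the table and filter here are hypothetical examples):

    -- Show the query plan, including which index (if any) gets used
    EXPLAIN SELECT Name, DOB FROM Sample.Person WHERE Name = 'Smith'

From ObjectScript, a call along the lines of DO $SYSTEM.SQL.Explain("...") returns the same plan, though the exact signature depends on your IRIS version.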
Derek Robinson 00:06:51 Right. So let's say I take all those steps, and now it's basically given me the optimization that this query should receive. How much should I trust it? Is it always correct? Is it pretty much magic, or are there things you should watch out for, pitfalls that you might run into?
Benjamin De Boe 00:07:10 It's very close to magic, but it's still software. Essentially what it does, and there's decades' worth of engineering that went into it, is look at the question you asked, the SQL query that you presented, and look at the schema, so your tables and indices that contain the data to satisfy that query. Then it's going to come up with a couple of different plans, make an estimation of the cost of each of those plans, and eventually select the one that supposedly has the lowest cost. But obviously that cost formula is based on the information that you're giving it. And that information is not just the schema information, but also statistics about the data that's actually in your schema. We call those the table statistics. Table statistics are, for example: how many rows are actually in my table? What's the average length of a row, to be able to estimate the cost of retrieving a row from disk? How selective are the individual fields of every table? So if I filter on this field or on that field, which one is going to narrow down the set of rows to retrieve from disk more efficiently? Those are extremely valuable pieces of information that go into the query optimizer and help it do its magic.
Derek Robinson 00:08:28 Right. So from a layman's point of view, it kind of feels like that's basically metadata about your table that helps the optimizer know…not the data, I don't care about what's in the table, but what are the characteristics of the table, to know how to try to optimize those queries? Is that right?
Benjamin De Boe 00:08:46 Exactly, exactly. And that's also why, of course, those need to be up to date. Those need to be in line with the data that's actually in there. When you initialize a table, it starts off empty, and we'll take some default values to go by. But then, if it turns out that your query plans are not what you thought they would be, it might be that the optimizer is just basing the query plan on outdated table statistics. So the single most important thing that many of our support calls start with, and sometimes also end with, is just gathering those table statistics again, and making sure that the query optimizer has the best information to make the best decisions and do its magic.
Derek Robinson 00:09:28 Right. So, given that last thing you just said, this is obviously something important that could be very easy to do and save you time that you don't really have to be spending with support. How do I gather these statistics easily and quickly?
Benjamin De Boe 00:09:39 So again, through the Management Portal, there's a menu option through which you can gather those statistics. There's a method on the system SQL utility, which is an ObjectScript API, but the easiest way to do it is just to use the SQL command. It does, of course, need to read the actual data in there in order to figure out the selectivity of the fields and how many rows are in there, and that takes some effort. So you can sample it, so that it only looks at a subset of the data, and you can schedule it so that it runs off-peak; that's fairly easy to do. It could be something that you run after a certain ramp-up period. Those table statistics are typically fairly stable, so you would not have to run it all that regularly; once a week or once a month would be enough.
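For reference, the SQL command Benjamin is describing is TUNE TABLE; a minimal sketch, with a hypothetical table name:

    -- Gather table statistics (row count, average row size, field
    -- selectivity) so the optimizer can cost its plans accurately
    TUNE TABLE Sample.Person

The ObjectScript equivalent is along the lines of DO $SYSTEM.SQL.TuneTable("Sample.Person"), though the exact API method varies by IRIS version.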
Derek Robinson 00:10:37 Interesting. So basically, as far as that part of it, I just gather those statistics to update that information for the optimizer to be able to use, and then I'm pretty much done at that point, as far as what my obligation is.
Benjamin De Boe 00:10:49 As far as obligations go, those are all very simple things that you can do that don't take any additional knowledge or specialty or consideration. Those are almost no-brainers. If, once you have that information and the right table statistics, queries are still going slow, even though the optimizer is supposedly taking the best possible access path, then it might be worth considering additional indices, but only then. There's no point in adding indices before you've seen the particular workload and before you've made sure the optimizer is looking at up-to-date table statistics. And then there's the whole art of identifying the right index. Maybe "art" is a little bit of an overstatement, but there are several different types of indices that each have their advantages and the cases they're best suited for. In general, if you have a field with low cardinality, so few distinct values, use a bitmap index. If it's a field with more distinct values, for example a date field, use a regular index. That's a really good start, and it can get you going quite quickly.
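As a sketch of that rule of thumb in IRIS SQL DDL (table and field names are hypothetical):

    -- Low-cardinality field (few distinct values): bitmap index
    CREATE BITMAP INDEX StatusIdx ON Sample.Orders (Status)

    -- Higher-cardinality field, such as a date: regular index
    CREATE INDEX OrderDateIdx ON Sample.Orders (OrderDate)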
Derek Robinson 00:12:01 Nice. Yeah, I think there's obviously a lot more you can do to stay on top of it and make sure that you're continuing to get good performance. So as far as additional improvements to the platform going forward: if I'm a developer who does this all the time, and I'm really looking at my queries a lot, what can I be excited about in the coming months, years, and versions of InterSystems IRIS? What's in the plans that could make this even better and even easier for developers going forward?
Benjamin De Boe 00:12:26 So very shortly, we'll be releasing our SAM product, System Alerting and Monitoring, which is a standalone thing that watches carefully over your InterSystems deployment. That will also contain a nice interface on top of those lightweight query statistics that allows you to quickly drill into individual queries and look at those query plans. We're going to add a little more metadata to those query plans so that they get more readable and actionable; for example, including in the query plan itself how recent the table statistics were that it based its decisions on. That will help you identify whether that is something that needs action. We'll also include which indices it used and which ones it didn't use, and that may all be valuable input to those optimization decisions. And then, finally, we're also going to work on some more automation for automatically gathering those table statistics and keeping them up to date without causing unforeseen overhead on your system, and eventually also provide some automated recommendations on which indices might be good additions, based on table statistics and query statistics.
Derek Robinson 00:13:37 Nice. So a lot of good features coming up, it sounds like. In the interim, before those make it even easier, if people are going through all their queries and assessing performance, short of calling support (which we're always fine with everybody doing, because our support team here is great), what steps could they take on their own to engage with the community, or to find things online?
Benjamin De Boe 00:14:00 The Online Learning website is definitely a great resource to find information, but then also the Developer Community has a lot of interesting articles already on this subject. And of course you're always free to ask your questions right there. And if all else fails, just call in the cavalry. Our support team is ready for your call!
Derek Robinson 00:14:18 Very true. They are ready. But I have seen, on the Developer Community, a lot of good conversations on various topics, and sometimes even the InterSystems people helping with the problem learn something on their own, based on their discussion with users of our technology.
Benjamin De Boe 00:14:31 Exactly. It's really a very diverse community. Our developers are looking at it, our support folks are looking at it, our customers, our partners. So you really have people representing all the different perspectives on problems…and solutions.
Derek Robinson 00:14:48 Yeah, absolutely. And that's the important part. So, Benjamin De Boe, thank you so much for joining us, and we'll see you next time.
Benjamin De Boe You're welcome. Thanks, Derek.
Derek Robinson 00:14:59 So a big thank you to Benjamin for breaking down the tips and techniques for optimizing the performance of your SQL queries within InterSystems IRIS. I thought that was really helpful. Like he mentioned, there's a lot of interesting discussion about SQL topics on the Developer Community, and we also have an SQL QuickStart exercise on our Learning site. If you're just getting started with SQL in InterSystems IRIS, or you want to try something hands-on, that link will be in the description of this podcast episode. So that's a wrap for Episode 4. Remember to rate and subscribe on the Data Points podcast listing on whatever app you use. On Episode 5, we'll be covering Mirroring in InterSystems IRIS. See you then!