5. Mirroring Databases for High Availability (Bob Binstock)
In this episode, we chat with technical writer Bob Binstock about mirroring databases in InterSystems products — specifically in InterSystems IRIS. Bob is a technical writer at InterSystems with lots of knowledge about topics like these, and he walks us through the concept of mirroring for high availability. You'll hear about primaries and backups, journal files, failovers, and more.
To try out the First Look exercise on data resiliency and mirroring, visit https://docs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls?KEY=AFL_dataresil.
For more information about Data Points, visit https://datapoints.intersystems.com.
Derek Robinson 00:00:01 Welcome to Data Points, a podcast by InterSystems Learning Services. Make sure to subscribe to the podcast on your favorite podcast app, such as Spotify, Apple Podcasts, Google Play, or Stitcher. You can do this by searching for Data Points and hitting that Subscribe button. My name is Derek Robinson, and on today's episode, I'll chat with Bob Binstock, a Technical Writer here at InterSystems, about mirroring databases for high availability in InterSystems IRIS.
Derek Robinson 00:00:39 Welcome to Episode Five of Data Points by InterSystems Learning Services. As always, we hope you're enjoying the first month or so of Data Points. The home page for the podcast is now datapoints.intersystems.com, so you have an easy landing page to go subscribe and to see all of the latest episodes. Today I'm chatting with Bob Binstock about high availability and mirroring databases. Bob has a pretty wide purview of expertise, being a Technical Writer here at InterSystems. And one of the areas he's worked quite a bit on is mirroring. In our discussion, we'll talk about some of the reasons that applications today really require ultra-high availability, and then how mirroring in InterSystems IRIS and other products helps to address that issue.
Derek Robinson 00:01:22 Welcome to the podcast Bob Binstock, one of the Technical Writers here at InterSystems. Bob, how's it going?
Bob Binstock Good. How are you?
Derek Robinson 00:01:30 Good, thanks. So today we're going to be talking about mirroring and high availability, which is an important topic for people that have applications running on InterSystems products, or really any products I think, in the world today, and with the way that software development goes and application development goes. So, before getting into the specifics of that technology and those features, let's talk about the concept of high availability in the first place. What is high availability, and why is it important for application development and for really any applications running today?
Bob Binstock 00:01:57 Well, what you just said is actually very accurate because more and more, everybody needs their applications to be online 24-7, 365. We used to talk about mission-critical applications that couldn't afford any downtime. But the fact is that organizations in healthcare, government, financial services, manufacturing, all sorts of areas, live on the network now. So essentially everything they do is mission critical. I mean, it may not seem so critical to me that I can buy something on Amazon right now, but it is to them. Their business is in trouble if they have interruptions. So high availability basically means zero downtime, or as little as you can manage. And that's what InterSystems IRIS mirroring is for.
Derek Robinson 00:02:49 Right, right. So, yeah, the Amazon one's a great example because it matters to their business, and the way it is today, a lot of customers…it's amazing how you'd get mad as a customer now because you expect it to be so available and ready all the time.
Bob Binstock 00:03:04 Sure. And think about healthcare. Nobody wants healthcare to go back to the days when people were looking through stacks of binders to find information about a patient or about a medication. But if you think about it, healthcare happens around the clock, and clinicians and others need to be entering, sharing, and analyzing data around the clock. So as healthcare, more and more goes online, it has to be up all the time. And that's what high availability is about.
Derek Robinson 00:03:34 Yeah, exactly. So, taking that into, kind of how that is put into practice in the InterSystems stack of technology, particularly InterSystems IRIS and other products we'll get to, what is mirroring, starting with IRIS? And then you can kind of expand that to the other products that it's related to.
Bob Binstock 00:03:48 Sure. Mirroring is based on the failover pair, which is a common approach in which a synchronized partner, or standby system, takes over if the primary system fails or becomes unavailable. For example, there might be a network outage. The problem with most failover-based technologies is that they involve a single point of failure: either both failover nodes use the same storage, or the system relies on physical copying from one storage device to another, which makes you vulnerable to a storage failure or to data corruption getting introduced while the copy is made. So mirroring instead uses logical data replication, not physical. All of the needed information is passed over the network. So the synchronized backup copy of a database can be kept on a separate storage device or even in a separate data center. This eliminates that single point of failure as well as the possibility of data corruption.
Bob Binstock 00:04:45 So mirroring is based on journaling, which is one of the things that InterSystems IRIS uses to recover from crashes. So every update to a database is saved in its journal files. If your instance goes down or you have any kind of problem, the first thing you will do is restore the most recent backup of that database, but you still have to take care of the updates that happened after that backup was made. Well, what InterSystems IRIS does after the backup is restored is read the journal files and apply every update made since the backup was taken to the databases, which means you're back to the state they were in when the crash happened. Except of course that open transactions are rolled back, 'cause there's no way to close them. Now mirroring synchronizes the databases on the backup failover member with those on the primary by continually sending the updates that are in those journal files from the primary to the backup, where they're applied to the database copies just as they would be after a crash. So basically you insert a row on the production database. That insert gets passed to the backup, and the same thing is done there, very, very quickly. The primary and the backup communicate synchronously, with the backup acknowledging each set of journal updates that it receives. So the mirror always knows if the backup is caught up, that is, fully synchronized with the primary. So if there's an outage of the primary when the databases are synchronized, which they typically will be, the mirror automatically fails over, redirecting application connections to the backup, with zero data loss. And it usually is so quick that users don't even notice. The backup is now the primary, and when the former primary comes back online, it automatically becomes the backup. This can also be triggered deliberately when you need to bring the primary down for maintenance or upgrades.
When you're done, you just trigger a failover back to the original primary. So sometimes one or more databases may not be quite caught up. So in those cases, the backup usually can retrieve journal information from the primary's host. You know, it knows what journal information it's missing. The mirror knows the backup didn't have that latest update. And you can usually get that information from the primary's host. That might extend failover to just a few seconds, not long enough to cause any problem. If databases aren't synchronized and the whole primary's host goes down, you might have to get into manual procedures, but the point is, you can always quickly get back to where you were.
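The journal-based synchronization and failover described above can be pictured with a small Python sketch. This is only a toy model under stated assumptions: `JournalRecord`, `Member`, and `ToyMirror` are hypothetical names invented for illustration, not InterSystems IRIS APIs, and the catch-up step assumes the primary's host is still reachable.

```python
from dataclasses import dataclass, field


@dataclass
class JournalRecord:
    seq: int    # position of the update in the journal
    key: str
    value: str


@dataclass
class Member:
    """Hypothetical failover member: a database plus its journal."""
    name: str
    database: dict = field(default_factory=dict)
    journal: list = field(default_factory=list)

    def apply(self, record):
        # Apply the logical update, journal it, and acknowledge by sequence number.
        self.database[record.key] = record.value
        self.journal.append(record)
        return record.seq


class ToyMirror:
    """Toy model only; not how InterSystems IRIS is implemented."""

    def __init__(self):
        self.primary = Member("A")
        self.backup = Member("B")
        self.last_acked = 0  # highest journal sequence the backup has acknowledged

    def _send_pending(self):
        # Ship every journal record the backup has not yet acknowledged, in order.
        for record in self.primary.journal[self.last_acked:]:
            self.last_acked = self.backup.apply(record)

    def update(self, key, value, deliver=True):
        # Every update on the primary is journaled; deliver=False simulates
        # a record that has not reached the backup yet.
        seq = len(self.primary.journal) + 1
        self.primary.apply(JournalRecord(seq, key, value))
        if deliver:
            self._send_pending()

    def backup_caught_up(self):
        return self.last_acked == len(self.primary.journal)

    def fail_over(self):
        if not self.backup_caught_up():
            # Assumption: the primary's host is reachable, so the missing
            # journal records can still be retrieved before the switch.
            self._send_pending()
        self.primary, self.backup = self.backup, self.primary
```

For example, calling `update("row2", "b", deliver=False)` leaves the backup one record behind, and `fail_over()` then retrieves that record before swapping roles, so the new primary holds every update.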
Derek Robinson 00:07:28 Right. Yeah. So you talked about, basically the mirror that is your backup kind of knowing whether it's up to date with the primary. So let's say I want to go back to one of the points you made about the journal file, kind of being able to fill in the gaps between those, if it's not up to date. So let's say my primary goes down, and my backup knows that it is not up to date because it basically is not quite there to what the primary was at. Is the journal file accessible, despite the fact that the primary went down? Where do we get that journal file to make sure we can access it?
Bob Binstock 00:08:01 That's what I was saying...it's the difference between the InterSystems IRIS instance not being in operation and the host being down or, let's say, unavailable. The journal files are just files on the host file system, so each mirror member has what's called the ISC agent, which does things like retrieving journal information from the primary host. So if the host is available, you can get that stuff. If it's not, you might have to make a decision between waiting until you can restore the primary host to get that information, or going ahead with failover anyway, knowing that you've lost a little bit of data, but you'll know exactly what that is. You will be able to get that information eventually. So you could ultimately catch things up.
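The choice Bob outlines here can be written down as a small decision function. This is a hedged Python sketch of the reasoning in the conversation, not of any InterSystems IRIS procedure; `Action`, `choose_action`, and the `accept_temporary_gap` flag are hypothetical names introduced only for illustration.

```python
from enum import Enum


class Action(Enum):
    FAIL_OVER = "fail over immediately"
    FETCH_THEN_FAIL_OVER = "retrieve missing journal data, then fail over"
    FAIL_OVER_WITH_GAP = "fail over now and apply the missing journal data later"
    WAIT_FOR_HOST = "wait for the primary host to be restored"


def choose_action(backup_caught_up, primary_host_reachable, accept_temporary_gap):
    """Pick a failover path from the three conditions discussed above."""
    if backup_caught_up:
        # Databases are synchronized: nothing is missing.
        return Action.FAIL_OVER
    if primary_host_reachable:
        # The instance is down but its host is up, so the journal files
        # can still be read from the host file system.
        return Action.FETCH_THEN_FAIL_OVER
    if accept_temporary_gap:
        # Host unavailable: proceed with a known, recoverable gap.
        return Action.FAIL_OVER_WITH_GAP
    return Action.WAIT_FOR_HOST
```

The last two branches are the manual judgment call: whether a short, known gap in the data is more acceptable than extended downtime.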
Derek Robinson 00:08:56 Right, exactly. So, taking some of that, that's, I think a good explanation of the technology of mirroring and how it works in InterSystems IRIS. Now my understanding is that for pretty much all of InterSystems products, this concept really works the same way, as far as how mirroring functionally works. But what are some of the differences you can talk about when it comes to HealthShare—and at a high level, why there's more that you need to kind of know without just being able to do it as simply as you maybe can with InterSystems IRIS?
Bob Binstock 00:09:21 That's an excellent question. One of the important things to remember about mirroring is that it synchronizes only the mirrored databases. So information that's stored on the primary, but not on those databases, such as for instance, user accounts and security configurations, isn't synchronized, and your software settings and configurations aren't synchronized. So if you want any of that stuff to be the same on the backup, when the backup takes over, you've got to do that yourself, or you have to arrange for it to be done, which can almost always be automated pretty easily. It's not a big deal. But with HealthShare, which is a complex application, usually involving multiple nodes, there's a lot of that kind of thing that you need to do that is required before you can start the mirror going. In other words, with a typical IRIS data server, it's not the end of the world if you haven't synchronized some information outside the mirror databases that maybe you should have. You can usually deal with that. In HealthShare the information is too complex and too distributed. So there's a lot of things that you must do when setting up a HealthShare mirror that would be considered optional or, you know, things to consider when setting up a simple data server mirror. So it's really important for anyone using mirroring in HealthShare to, you know, go through all that documentation and make sure all the i's are dotted and the t's are crossed.
Derek Robinson 00:10:55 Right. And you can thank Bob afterwards for having that documentation be very helpful for you once you go through this process. So kind of moving into some of the, why does this matter? We talked about it a little bit at the top about some of the importance, you mentioned Amazon as an example, which is a really good one. If you have a couple others, like what are some common real-world use cases for mirroring databases that people might not think of, or that kind of is a good way to put it in perspective of why this is important?
Bob Binstock 00:11:20 So the first thing to do is…I mentioned the fact that basically almost every business or organization lives online now. The other thing to remember is that the amounts of data are truly staggering. So number one, you need your applications to keep running. Number two, you need to keep up with the data. If you miss some, it can be catastrophic. Let's say you are a manufacturer and you've equipped your factory with the latest Internet of Things technology to monitor all your devices. You can look at returns from your machinery and know exactly what's going on. Well, you can't afford gaps in that data. If you try to analyze it, your analysis will be wrong 'cause you're missing data. You might not get the alerts you need saying, you know, such-and-such equipment is about to break down unless certain maintenance is done, et cetera. And when you're talking about even one factory in IoT stuff, you're talking about just a constant, huge stream of data. Now, think about a financial services business, a big one, a worldwide one. So around the clock, they have customers, and just to take one thing that happens with these companies, people trade stocks, right? When you order a stock trade, you want it now, you don't want it 20 minutes from now when they manage to bring the system back online, because things will have changed! People demand instant action. And again, if you think about the millions of users that a company may have, in addition to user orders or requests, you've got constant querying going on. So you have transactions, which is one kind of query, database query, and you have analysis, which is another. And meanwhile, you have the third kind, which is incoming data streaming in from all sorts of sources. You've got to organize it and put it in the right places. So if any of these get interrupted, you've got a big problem.
So high availability is just, I don't think anybody in a modern enterprise will consider any technology that doesn't address that.
Derek Robinson 00:13:34 Yeah. It seems like a linchpin for really all modern apps that we really have today. So kind of bringing it back a little bit, two more questions. One is just, you know, I think we kind of talked about this when you were talking about using it in IRIS, but just kind of to reiterate it for the IRIS users out there, or, you know, you can have a footnote about other technologies that use it as well. We talked about HealthShare, but how easy is it, in contrast, to do this in InterSystems IRIS?
Bob Binstock 00:14:00 It's very easy in IRIS. Although, you know, so much depends on it. It might be intimidating to think of what happens if you make a mistake, but really we've got it set up so that it pretty much won't let you. If you're setting up a new mirror and creating the databases at that point, it's very simple. You configure the two instances as a mirror. If they don't already have SSL/TLS configurations for data security, you have to create those. That doesn't take a long time. And typically systems will in fact have that in place. You configure them as a mirror, and then you add the databases on the primary, which is, you know, it takes a minute or two to add a database in InterSystems IRIS. And as soon as you do that, the mirror makes copies on the backup and starts synchronizing them. So that is very easy. If you're adding existing databases to a mirror, it takes a bit longer because the journal files for a mirror database are different from those for a non-mirror database. So you can't just stick the database in there. Once you add it, it starts creating mirror journal files. So at that point, you've got to pause activity, back up or copy the database on the primary, then restore that to the backup and add it to the mirror there. In other words, you're backing up or copying a mirror database and putting it on the backup, and you add it to the mirror under the same name as it has on the primary. And again, it'll immediately start synchronizing and get caught up very quickly. Now, obviously there are related setup tasks you may need to do, and that's true for any mirror, not just HealthShare. For instance, there are network considerations. We have a good amount of documentation about getting your network and your subnets set up the right way. There are different ways of redirecting those application connections after failover.
Some of them are highly automatic, but of course the more automatic something is, you know, the more it requires getting set up in the first place. There's something called a virtual IP address, which basically means that the IP address represents whichever of the failover members is the primary at the time. That takes some configuration, but none of it's, you know, long-term or complex. Another concern in any failover technology is the risk that the backup will incorrectly take over while the primary is still operating, which would lead to two nodes both trying to act as a primary, which of course is a disaster for the data. Mirroring, however, uses a node called the arbiter to make sure this doesn't happen. The arbiter stays in constant contact with both the primary and the backup, and essentially gives the backup a go/no-go decision when it wants to take over. At the same time, if the primary loses touch with both the backup and the arbiter, it automatically goes into a trouble state. The arbiter therefore knows that it can tell the backup to take over at that point without the risk of the primary coming back online. So the arbiter pretty much takes care of that concern.
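The arbiter's role can be sketched as two small rules in Python. This is a simplification made for illustration, assuming connectivity can be reduced to booleans; `arbiter_allows_takeover` and `primary_state` are hypothetical names, and the exact conditions InterSystems IRIS uses are described in its mirroring documentation, not here.

```python
def arbiter_allows_takeover(backup_sees_primary: bool, arbiter_sees_primary: bool) -> bool:
    """Go/no-go for the backup: go only if neither the backup nor the
    arbiter can reach the primary (assumed rule for this sketch)."""
    return not backup_sees_primary and not arbiter_sees_primary


def primary_state(primary_sees_backup: bool, primary_sees_arbiter: bool) -> str:
    """A primary cut off from both the backup and the arbiter stops
    acting as primary, which is what rules out two primaries."""
    if not primary_sees_backup and not primary_sees_arbiter:
        return "trouble"
    return "active"
```

Taken together, the two rules show why a takeover is safe: in any situation where the arbiter says "go", an isolated primary has already put itself into the trouble state rather than continuing to accept updates.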
Derek Robinson 00:17:21 Right. Cool. So lots of good stuff there on mirroring. And, of course you can check out more details on this with the documentation that I'm sure Bob has written a lot of at docs.intersystems.com, as well as our Online Learning content at learning.intersystems.com. So, Bob Binstock, thank you so much for joining us, and we'll see you next time.
Bob Binstock Thank you, Derek.
Derek Robinson 00:17:40 So thanks to Bob for breaking down everything about mirroring. There's a lot of information about mirroring available in the documentation, including a First Look exercise on data resiliency and mirroring. You can head over to docs.intersystems.com to search through that content and to learn more. That'll do it for Episode Five. Don't forget to rate and subscribe on whatever podcast app you use. See you all next time on Data Points.
Data Points is brought to you by InterSystems Learning Services.