Hello everyone! Today I have the pleasure of having with us Oren Eini. Oren is the creator, CEO and founder of RavenDB, and is the guy who knows a lot about databases, their internals, and everything you need to know to take advantage of the way that you store data. So here we go with Oren. Hello.

Hi. It's great to be talking to all of you. Today, I'm going to be trying to work on two levels. I want to give you insight into how databases work. Not what to do with them, but how do they actually achieve their goals? Because usually you use a database and it's almost magic. What we're talking about here is the ability to properly understand exactly what's going on inside the database, so you can make the best use of it.

Let's talk about me for just a little bit. I've been doing databases for over 20 years. I've been working on RavenDB for almost 15. If you are interested in learning more about databases, I wrote a book. It's not completed, but it's a pretty good book, if I do say so myself, that shows you how to build a database from nothing.

And with that, let's talk about what databases are. It turns out that for most developers, for most professional developers, including seniors and people with 10-plus years of experience, databases are magic. You throw the data in, you get it out, and something happens in the middle. But this is magic. This is the sort of stuff that PhDs talk about. You need to go and get your master's degree just to peek under the hood. Which is absolutely, utterly false. Because the absolute beauty of database engines, the thing that I love the most, is that the things that you use most in databases are the things that you learn in Computer Science 101. The things that you learn in the fourth semester, when you start getting into complexity and algorithms and how to work with data. That's it. Everything else is just built on top of that. None of it is complicated, except that you have to deal with a lot of competing demands all at the same time.

So let's talk about this. Databases are super useful, and we would never be able to do anything without them. But the problem with thinking about databases as magic is that we are reduced to cargo cult: oh, I do this because I've always done that, and I don't understand why, but it usually works. Sometimes it doesn't, and then I call for help. And then, you know, that guy shows up, he looks at the database for a few minutes, and then he does a magic incantation. Maybe they sacrifice a chicken, and then it works okay.

But again, databases are almost purely computer science, and databases have been around forever, so there is a huge amount of resources behind them. And one of the beautiful things about this is that databases are constrained. It's not "I need to be able to run on an iPad with responsive design." It's not "I have to think about 16 different types of users, and some of them are blind, some of them are deaf, some of them really like to talk in French but want to use the German sorting rules," and all the other really annoying things that you have to deal with when you're building applications. Databases are almost purely input and output, and relatively simple to talk about, even if they have really complex bits.

So let's talk about this. What am I doing with databases? What is the role of a database engine? Well, I would really like to have transactions. What is a transaction?
It's ensuring that my data is persisted, and that I can operate concurrently inside the database without stepping all over myself. It means that when I issue queries to the system, I get reasonable results, hopefully very quickly. A huge part of what a database engine does is resource management: knowing how to deal with disk, memory and CPU, sometimes even GPU, and how to balance all of them together. How do I make sure that I'm using all of my resources as I should? A good database engine will make optimal use of your resources, which may mean that your disk is busy, your memory is full, your CPU is running. Which is good, as long as you're able to answer transactions and queries within the latency that is required. Finally, in today's world, we also have to deal with running in a distributed environment. What happens if I'm running a cluster of multiple nodes all at the same time, and how do I coordinate work among all of those? And finally, and this is the fun part, you have to present something to the users, some model in which they are going to work with your data. That model turns out to be really complicated to manage, because it has impact across the entire ecosystem.

Another aspect is that typically, if I have 10,000 records, I don't care. 10,000 records is super, super small. I can scan through that in memory with whatever language I want, too fast for me to count. If I have more than that, hundreds of thousands, millions, billions of records, now I start caring about this, which means that now I want to look at the machine and get results quickly. Which brings us to this: O(n), O(n²), O(log n), and O(1). Those are ways of thinking about the complexity of the operations that we have in the system. And if n is big, that's important. That's critical for the success of your system.

So let's think about what I actually have to do. I have to write data. I have to query data. And, okay, writing data is simple, right? Except that I also need to support transactions. I also have to be able to query; it's not just appending to a file. How do I do that? Think about this. You're starting from scratch. You have nothing. You're looking at a blank IDE. Here you go: write a database. What do you do? Well, that's an interesting problem. You have to support transactions, the ability to read and write, or at least write properly to the system, and then for queries to read the data out, hopefully in interesting ways.

The absolute simplest database you can possibly think of is something called a key-value store. It's basically a hash table. You write a value with a key, and then you query by the key. It's simple. It's easy. It's cheap. No worries. What's the problem with that? How do you actually write the data to the disk? How do you organize that? Are you going to allow yourself to have additional queries beyond just lookup by key? If you think about the world of databases, is it really sufficient to write something that can just do a get and a set?

Take a look at this. And, Oren, I don't know if you can do a zoom in. I absolutely can. Like this? Yeah, perfect. Okay. So what you can see here is about the simplest database you can possibly think of. The only thing that it does is allow me to get and set values. And the back end of that is a hash table, a map in Python. Really simple, really cheap. One problem with that, however: this is an in-memory database.
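To make that concrete, here is a minimal sketch of the same idea in C# (the language used for the other code in this talk): a hash table behind a get/set API and nothing more. The class and key names are just illustrative.

```csharp
using System;
using System.Collections.Generic;

// The simplest possible "database": a hash table with Get/Set, mirroring the Python map demo.
class TinyKeyValueStore
{
    private readonly Dictionary<string, string> _data = new();

    public void Set(string key, string value) => _data[key] = value;

    public string? Get(string key) => _data.TryGetValue(key, out var value) ? value : null;
}

class Program
{
    static void Main()
    {
        var db = new TinyKeyValueStore();
        db.Set("users/1", "{ \"Name\": \"Oren\" }");
        Console.WriteLine(db.Get("users/1")); // prints the stored value
        // Everything lives only in memory: restart the process and the data is gone.
    }
}
```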
I need to persist it to disk. I would also like to run more interesting queries on it. Take the thing that I've just shown you, something very similar to that, plug it into the network, and this is memcached. Add a few more operations, and this is Redis. Remember, the first version of Redis was written in Tcl, which is not much different from Python in terms of expressiveness.

Another aspect is the communication model that you use to talk to the database. This is really interesting, because it's really common for you to want to do things like this. Let me just change the language. Take a look at this: this is a conversation with the database. I'm sending this, I'm getting an ack. I'm sending this, it's going to tell me one row updated, one row inserted, and an ack for that. Perfect, right? Very simple, very obvious, and very, very flexible. This model is how you typically think about databases. This model is also a really bad model for talking to databases.

Why is that? Consider the following scenario. Right now, I need to go and talk to my database, which is a remote machine. I have to send it a message. One command, get the reply back. Another command, get the reply back. Another command, get the reply back. The back and forth is typically what's going to kill me. Alternatively, I can do something like this. Now the whole thing is sent as a single operation to the remote server. What does that mean? It means that I'm only going to go to the server once. The amount of work that the database has to do is the same, right? Two operations inside a transaction either way. In this case, I'm saving on the network latency cost, because I'm sending one message instead of four.

Except that this also has another huge impact on the overall design of the system. Look at this. If I'm writing this sort of code, and, if you are familiar, this is how SQL works, then this takes one second, this takes one second, this takes one second. That's just the network latency. Let's say I'm sitting here and the server that I'm talking to is on the other side of the world; just the network latency overhead puts me at a delay. That's one problem. But from the database engine's side, there is another aspect. You made an operation, and that operation is still sitting there, effectively in limbo, for a very long time. I update the record, and I have to ensure, because I'm running a transaction, that no other transaction is going to touch this record, which means that I have to deal with a huge amount of complexity internally to make sure that I have locks on this row. Whereas in this scenario, I don't have to do that. Why? Because I'm running the whole transaction as a single unit. What does that mean? It means that as I'm writing all of this, I can put one lock over the entire thing and run one transaction at a time. Because I have all of the transaction, I don't have to wait for it.
I don't have to wait for the next command; I can operate on it immediately and consistently. That gives me a lot more freedom and flexibility in how I operate. And that is a really interesting aspect of databases: the protocol that you use to talk to the database has an impact on how you manage transactions, how you write to disk, and so on. There is an absolutely insane level of interdependency, and we have to do it this way in order to be able to give you good performance.

So let's talk about the first steps. We have this thing: I want to write some value to the disk. How do I do that? Well, the simplest way is just to open a file, and I can use CSV for that. Now look here, this is the record that I have, and now I want to run this update. What's the problem with this? Here the value is one length, and here the new value is longer. So if I were trying to update this, I would have to do something like this. But that actually adds data in the middle, so what I would actually have to do is this. Now, because there is no API to insert bytes into the middle of a file, I have to rewrite the whole file. Which, again, makes sense if this is a small text file. If this is a 10 gigabyte file, it's not viable. So I cannot do an in-place update of the value. I cannot just have all of the values one after another in a CSV and expect that to work.

So what's the next stage? How do I actually write the data to the file? Okay, I do have the ability to overwrite at a particular offset in the file, for a particular amount. I can do that, and that's easy for me to do; there is a dedicated API for it. What's the problem? Well, I have to have space for this. In order to deal with this, we are not going to look at the file as a plain sequence of bytes. Instead, we're going to split the file into pages. Every page is going to be some multiple of 4 kilobytes; RavenDB, for example, uses 8 kilobytes. And this is because this is what the disk gives me: modern disks operate almost exclusively in 4 kilobyte increments. So the idea here is that I have my file, and you can think about this file not as a sequence of letters or bytes or anything like that, but as an array of 4 kilobyte pages. What this means is that I have the ability to overwrite, using this method, one of those pages. And why does it matter? Because 4 kilobytes is small enough that I can overwrite all of it in one shot.

So now my model is that I'm starting to organize my data using pages. And they are called pages because of how you would deal with physical documents. You remember you used to have... I don't know the term for that; I have the name for it in Hebrew, but not in English. A binder, that's the term. Oh, one second. Okay, here we go. So remember that there used to be stuff like this, and you would put papers into it. If you aren't old enough to know about those, I have no words. The idea here is that when you have a binder like that, you can pull a page out, operate on it, write on it, and put it back. And that means that this problem of modifying data in place has been resolved. Because worst case scenario, I need to update this, so I go and take out just the page that is needed and work on that.
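As a concrete illustration of the binder idea, here is a hedged sketch in C# of treating a file as an array of fixed-size pages and rewriting exactly one of them in place. The 4 KB page size and the class names are illustrative, not any particular engine's on-disk layout.

```csharp
using System;
using System.IO;

// Treat the file as an array of fixed-size pages; reading or overwriting one page
// is a seek to pageNumber * PageSize followed by a read/write of exactly one page.
class PagedFile : IDisposable
{
    public const int PageSize = 4096;
    private readonly FileStream _file;

    public PagedFile(string path) =>
        _file = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite);

    public void WritePage(long pageNumber, byte[] page)
    {
        if (page.Length != PageSize)
            throw new ArgumentException("A page must be exactly PageSize bytes", nameof(page));
        _file.Seek(pageNumber * PageSize, SeekOrigin.Begin); // pull out just this "binder page"
        _file.Write(page, 0, PageSize);                      // ...and write it back whole
    }

    public byte[] ReadPage(long pageNumber)
    {
        var page = new byte[PageSize];
        _file.Seek(pageNumber * PageSize, SeekOrigin.Begin);
        _file.ReadExactly(page);                             // ReadExactly requires .NET 7+
        return page;
    }

    public void Dispose() => _file.Dispose();
}
```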
Now, it means that the way that I write data into the database is in pages. And now, instead of having to worry about the structure of the entire file, I worry about the structure inside the page. So how do I look at the structure inside a page? It turns out that I can treat it in the most simple fashion: what if it were just a sorted list of data? That's about it. The data may actually even be in this format. Typically it's not, and I'm not going to go into the on-disk format, but you can think about it as me splitting the data into sorted lists, each of them up to 4 kilobytes in size.

Why is it important that this is sorted? Because if it's sorted, I can do so many interesting things. Let's assume that I have a single page in the file, and I want to find Eric. Well, here he is, and I was able to find him very quickly because, well, this is sorted.

Of course, now we have a different problem. Let's say that I want to add a new item to this page. How do I do that? Well, let's say that I want to add Henrik into this team. I could just add him here; cheap and easy. But what happens if I wanted to add Albert? Albert should go here. What do I do? Well, I already read the entire page into memory, so I can just add him here: modify the data in memory, shifting all of those values, and then write it back out by overwriting the full page. I just said this is not a good idea to do on the full file, but for a 4 kilobyte page it is absolutely viable. In fact, databases do not do it exactly this way; there is structure even inside the page, something called a slotted page, but I'm not going to go into that in more detail unless I have a lot more time.

So now, let's talk about what happens when the page is full. Here's the page, and now I need to add Hannah. I have no more room, so what I'm doing is splitting things up. Now I have three pages: I have the original page, I have a new page with just Hannah in it, and now I have a new root page with Abel and Hannah. Now, this looks like a binary tree. It is almost exactly a binary tree, and again, I told you that we're going to talk about CS101. The only difference is that the root page may contain multiple entries. We'll see exactly how this works in a bit.

So I added a name, and now I have this structure. What happens if I add Ben? Ben needs to go here. But that's a problem. How do I manage this? I couldn't just split the page; I'm splitting it in the middle. So now here is Abel, Ben, Brett, and Keynes. Those three are from the original page, and Ben is the new value. And here are the four that are the other half of this. Notice that I have two entries and two empty slots here, and I have some empty space here. What's the idea? The idea is that I'm able to maintain the tree structure that we have here.

This is called a B+tree. This is a very, very old algorithm, going back to the 1970s. And this algorithm, which I just described, is the basis for pretty much all database engines ever. Whether it is Postgres or RavenDB, Mongo or Oracle, all of them use some variant of a B-tree, of a B+tree. There's a whole bunch of variants. RavenDB itself has, I think at this point, six different implementations of B-trees for various needs and requirements. But all of them have the same general concept. If you understand those ideas, you'll be able to make use of them.
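Before looking at insert order, here is a hedged sketch of that "sorted list inside a page" idea in C#: binary search to find a key, insert in sorted position (shifting the rest), and a signal that the page is full and needs to be split. The entry limit stands in for "whatever fits in 4 kilobytes".

```csharp
using System;
using System.Collections.Generic;

// A page as a small sorted list: O(log n) lookup via binary search, sorted insert,
// and a "full" signal that tells the caller it is time to split the page.
class SortedPage
{
    private const int MaxEntries = 8; // stand-in for "as many entries as fit in 4 KB"
    private readonly List<string> _keys = new();

    public bool Contains(string key) =>
        _keys.BinarySearch(key, StringComparer.Ordinal) >= 0;

    // Returns false when the page is full; the caller then splits it, as described above.
    public bool TryInsert(string key)
    {
        if (_keys.Count >= MaxEntries)
            return false;
        int index = _keys.BinarySearch(key, StringComparer.Ordinal);
        if (index < 0) index = ~index;   // BinarySearch returns the complement of the insertion point
        _keys.Insert(index, key);        // shift the following entries, then write the page back out
        return true;
    }
}
```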
So let's look at the impact of adding things in order and out of order. You can see here that if I'm adding something in order, I'm able to just keep this page as is and then write this one as a new page. If I'm adding out of order, I have to split the page and I have more empty space. This is the result of the write pattern: here is the same data written in a sorted fashion, and here is the same data written in a random fashion. What's the difference? Well, you can see that this tree is deeper and this one is shallower. The shallower the tree, the faster our queries become and the better our entire system is. In the random case, we are also wasting a lot of space inside the tree. You can see it here: the only wasted space in the sorted tree is here, while in the random one I have it spread all across the tree, basically. And if I had a lot of random inserts, I would end up with a lot more wasted space on these pages.

Now, I mentioned that B-trees are old. The reference that I have here is from 1979, and at that point they were already called ubiquitous, the one that is found everywhere. So they are about as old as computing, basically. I think that if you spent enough time, you could see them showing up in the 1950s or something like that.

When we are talking about B-trees, there are two ways of thinking about them. You might have heard the terms clustered and non-clustered data; I use the terms embedded data and references. In the case of embedded data, you can see that the data itself is on the leaf pages, the ones here. In the other case, I have two separate structures: one of them is the data itself, the other one is the actual tree structure. What's the difference between them? Well, let's say that I want to find Derek. It's very easy for me to find him here: I found Derek, and here are all of the details that I have about Derek, right there in this page. In the other model, I find Derek again, I do a binary search, I find Derek, and then I have to jump from here to here to get those details.

What's the downside and upside of each option? If I'm using the embedded model, once I did a search and found the data, I'm done; I don't have to do anything else. The downside is that I can fit fewer items per page, because I have to store the data itself. And remember, all of this data sits in this page; if I run out of space, I have to split the page. In the references model, my B-tree is much more compact, because I'm only keeping the keys in order to be able to do the search, and once I found the key, I have to jump to the right location. In almost all database engines, even if you have clustered data, you also have non-clustered data. Why? Because if I'm searching based on the name, I can use the clustered B-tree. If I'm searching by a different property, let's say email, I would have to have either another copy of the data, a full copy here, which is a huge waste of space, or I have basically this, and I have to find the record in some other location.

Another issue that we have to deal with: pages are limited in size. 4, 8, 16 kilobytes, whatever. What happens if I have a field or a value that is too big for that? I cannot put it inside of the B+tree page. Instead, I do something called overflow: I allocate a sufficient number of pages to put it there, and then I put in the B+tree just the pointer to that overflow location. If we are thinking about values such as large strings, or big files that may be stored in the database, stuff like that, all of them are stored in exactly this fashion.
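As a rough illustration of that overflow mechanism, here is a hedged C# sketch: the large value is written to a run of dedicated pages, and only a small pointer to that run would go into the tree entry. The page size, the pointer record, and the layout are illustrative assumptions, not any engine's actual format.

```csharp
using System;
using System.IO;

// A value too large for one page goes into a run of overflow pages;
// the B+tree entry only keeps the small OverflowPointer.
static class Overflow
{
    public const int PageSize = 4096;

    public record struct OverflowPointer(long FirstPage, int PageCount, int Length);

    public static OverflowPointer Write(FileStream dataFile, long firstFreePage, byte[] value)
    {
        int pageCount = (value.Length + PageSize - 1) / PageSize;   // round up to whole pages
        dataFile.Seek(firstFreePage * PageSize, SeekOrigin.Begin);
        dataFile.Write(value, 0, value.Length);                     // the large value lives out here...
        return new OverflowPointer(firstFreePage, pageCount, value.Length); // ...only this goes in the tree
    }

    public static byte[] Read(FileStream dataFile, OverflowPointer pointer)
    {
        var value = new byte[pointer.Length];
        dataFile.Seek(pointer.FirstPage * PageSize, SeekOrigin.Begin);
        dataFile.ReadExactly(value);                                // ReadExactly requires .NET 7+
        return value;
    }
}
```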
Now, super important: a B+tree is sorted. It's sorted by some key. What does that mean? Let's look at this. Here is my sorted data. I want to find Brett: I found it here. I want to find all of the users whose names start with D: here it is. It's simple. Even if I'm going into a more complex query: give me all of the users whose names are between E and H. I start here, find, okay, here is E, F, G, go up, and then continue to traverse, here is H, and then I'm done. So the B-tree actually allows me to do any of these operations with O(log n) complexity. And that is a huge, huge deal. Think about that: I'm now able to do all sorts of really interesting queries.

Let's say that I want to store users, and my primary key is the user ID. What does that look like in terms of data on disk? So remember, I'm now operating over user records, and I have this, keyed by user ID. I'm now able to do queries like this: get by key, get by prefix. In this case, probably something like that is better; let's assume this is a string. All of them are super cheap. Except that, you know what, I want to query by email. How do you do that? Well, it's very simple: you add another B-tree.

What do I mean by that? Think about this. The whole idea of a B-tree is to allow you to do searches very efficiently. In fact, if I'm using in-memory algorithms, a sorted map is what I'm looking for; in C#, this would be a SortedDictionary or a SortedList, et cetera. Now, why does it matter? It matters because of how this works. Here I have the primary index, and the primary index is by user ID. And here I have a secondary index by email, and you can see how all of them point to the same location. This is the model of references. Conversely, I may have a secondary index that does not point to the actual value, but points to the primary key value, like this. This is how MySQL does it. Sorry, to be clear: the first model, two separate indexes pointing to the same location, is Postgres, and this one is MySQL. And it matters if you really, really care about performance. Uber did a really good study about those things, and they really cared about this impact, because for their scenario it meant a lot. By the way, for reference, RavenDB uses this model, and that makes things a lot easier for us.

Consider the different impact of the two models. I now want to do something like this: all users who logged in today. I have an index on the last login date, which means that I have something like this, right? Now I want to ask something like that. Now, there is a question here: what is this? In one model, I'm going to be given the actual entry. In the other model, I'm going to be given the primary key. So let's look at the first model. I can do something like this, right? I know this is incorrect code; everything is pseudocode, don't get too hung up on that. On the other hand, if I'm looking at the other one... So let's look at the two options that I have. In this case, which matches the first model, I'm actually going to do: okay, I found the values. This was a log n search inside of the tree, and then I'm going to have to do work that is proportional to the number of matches that I have. Okay, what about the other model? In that model, the cost is O(log n) to find this, and then log n times the number of matches.
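Here is a hedged in-memory sketch of those two trees in C#, using SortedDictionary (the sorted map mentioned above) as a stand-in for the on-disk B+trees. The secondary index in this sketch follows the primary-key model, so a lookup by email costs two searches; the type and field names are illustrative.

```csharp
using System;
using System.Collections.Generic;

// Primary index: user id -> record. Secondary index: email -> user id.
// Both are sorted maps, standing in for B+trees on disk.
record User(string Id, string Email, string Name);

class UserStore
{
    private readonly SortedDictionary<string, User> _byId = new(StringComparer.Ordinal);
    private readonly SortedDictionary<string, string> _idByEmail = new(StringComparer.Ordinal);

    public void Put(User user)
    {
        _byId[user.Id] = user;
        _idByEmail[user.Email] = user.Id;
    }

    public User? GetById(string id) =>
        _byId.TryGetValue(id, out var user) ? user : null;

    // Two lookups: the secondary index gives the primary key, the primary index gives the record.
    public User? GetByEmail(string email) =>
        _idByEmail.TryGetValue(email, out var id) ? GetById(id) : null;
}
```

The GetByEmail path in this sketch is exactly where the extra log n per match comes from.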
Why do I have this cost? Because I'm starting to scan here, and for every one of those matches, I have to do a search in the other tree to get the actual value. So I'm actually paying at query time for this indirection. It seems like a minor thing, but think about queries such as the logged-in-today example, where I'm scanning one index and I need to do a lookup in the other for each match. That gets expensive really, really fast.

Okay, enough about the low-level structure of how we store the data on disk. Let's talk about ACID, another acronym to have fun with. ACID stands for Atomic, Consistent, Isolated, Durable. Those are the basic properties of transactions. If you're trying to use a database that doesn't give you ACID transactions, stop. There is zero reason to ever want to do that.

Atomic means something very simple: I want to be able to make an operation, or multiple operations, and have them go in as a single unit. I used the example of a blog post before. Consider this scenario: update the post's comment count, insert the comment. So now I have two operations. If those two operations are done in an atomic fashion, I will not be able to see the comment count increase without also having the new comment, and vice versa. By the way, that means that from any observer's point of view, those two operations happen together. Consistent means that after the transaction is completed, all of my business invariants still hold. Isolated relates to being able to run multiple concurrent transactions at the same time. And Durable is the most interesting one, because it means that once the database told you that the transaction was committed, it is done. Any future operation, even if you pull the plug, do nasty things to the network, whatever you want, the data is going to be persistent. Another aspect here, where atomicity relates to durability, is that even if I wrote this data, this update, and then crashed, because I didn't commit, the entire thing gets rolled back. So both in terms of visibility and in terms of persistence, it goes in as a single unit. That's why it's called atomic.

And that leads to a very simple problem: it's really, really hard to do that in a performant manner. The reason for that is that disks suck. Like, a lot. A lot, a lot. Even if you're using something like an NVMe disk with the latest and greatest whatever, they still suck. They are still super slow compared to CPU and memory. So how do you handle that? How do you make it actually work?

Well, it turns out that everything you think you know is a lie. Here is a line of code in C#. If you think that it writes data to the disk, you're wrong. It doesn't do that. It writes data to a user-space buffer, and then you send it to the operating system, and then to the file system, then to the drive, and then to the disk. Only after all of those steps have happened are you actually hitting the actual physical media. And the reason for that is that I/O is slow, so there are a lot of buffers in the middle, which means that it may take upward of 30 seconds for these things to happen. So think about what this means. I write some code. It writes to a file. I close my process. I pull the plug. You know what happens to the data?
It does not exist. It went away. Or worse: some of it was written, but not all of it. And there are a lot of really interesting edge cases to deal with, so we have to really worry about how and why this is going to happen.

So let's look at the wrong way to go about this. I want to write some data to a file, and I want it to be a new file. So I open a new file, creating it if needed, then I write the data, then I close it, and then I go back, and I'm done. Sounds simple, easy, right? This code is wrong, because I'm not flushing the data to the disk. So let's fix that. I have now added fsync, so I'm flushing the data to the disk. And this is still wrong. In fact, I could go through this exact process seven different times in order to handle more and more ridiculous errors. For example, fsync can fail. Close can fail. What are you expected to do when close fails? I mean, seriously, there is absolutely nothing you can do, because you can't even close the file. But you have to deal with that as a database engine. A while ago there was this huge issue in the Linux community with Postgres, because fsync can fail, and Postgres didn't handle that, and that led to data corruption. There is a beautiful set of lectures by Dan Luu on this topic; I absolutely recommend you go and read them. And there is this nice paper that asks, can you recover from fsync failures? Zero applications are able to do that.

Just to give you some context, here is the set of operations that you need to do to ensure that the data hits the disk. You open the file, you write the data to the disk, you fsync the file, then you open the parent directory and you fsync the parent directory. Now, okay, so I have to do additional work to make sure the data is safe, et cetera. Here's the problem: this is the happy path. This does not include all of the nastiness that you have to do in order to recover from failures. Utterly ridiculous, and still you have to do it.

And let's see what this means. I cannot modify in place, because what happens if I have a failure midway through? Remember, fsync failing does not mean that the data is not on the disk. It just means that I couldn't ensure that it hit the disk. For fun, some of the data may have made it to the disk, and some of it will not be on the disk, and it's on you to figure that one out. Okay. So because I cannot modify in place, I have to write to some other location. I have to first ensure that I wrote there, and then write the actual data.

There are two ways of doing that. One of them is called a write-ahead log: I have a separate file dedicated just to writing what I intend to do before I do it. On startup of the process, I read the write-ahead log and apply all of the operations there that I can see were fully completed. Think about it: I write to the log, "I'm going to raise my hand, and then I'm going to scratch my forehead." Now, I crashed. So in the log, I have "raise your hand, scratch your forehead," et cetera. The log also contains transaction commit markers and all sorts of other stuff like that, which tell you when a transaction was successfully committed and needs to be reapplied, or needs to be discarded, and so on. The other option is something called append only: I never modify data in place; instead, I always create new data at the end of the file.
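To tie that back to the earlier "this line of C# does not hit the disk" point, here is a hedged sketch of the durable-write step in C#. FileStream.Flush(true) is the managed way to ask for the fsync-style flush; note that .NET has no built-in way to fsync the parent directory, which a database engine on Linux would still need to do (typically via a native call) when creating or resizing files.

```csharp
using System;
using System.IO;

static class DurableWrite
{
    // Write a buffer and force it through the OS caches to the device.
    public static void WriteDurably(string path, byte[] data)
    {
        using var file = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.None);
        file.Write(data, 0, data.Length); // so far the bytes only sit in in-memory buffers...
        file.Flush(flushToDisk: true);    // ...this is the fsync-equivalent step that makes them durable
        // Missing here (and not expressible in pure .NET): fsync of the parent directory,
        // which is needed so the new file's directory entry itself survives a crash.
    }
}
```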
Going back to the pages and the binder: when we're talking about the physical data, with append only I always add new pages to the binder. The problem is that at some point it gets really, really big, so I have to do compaction. I have to go over the file and keep only the data that I'm actually using, and that can have a huge impact on the total cost of the system.

So let's talk about the write-ahead log. On the one hand, it's very simple. I always write to the end of the write-ahead log, so we have sequential writes. I write the changes to the log before making the changes to the data. And when I'm writing to the log file, I do it in a way that is durable: I'm not going through all of the buffers in the middle; I say, take this data, make it persistent, and don't bother me until it is. Which means that when I'm writing to the data file, I can use those buffers and get the benefit of them. I only ever have to read from the log file when I'm starting up. And there are various ways of doing that, redo logs, undo logs; it doesn't actually matter, those are details. I'm more interested in you understanding the concept.

So let's talk about how writes to the log file must be made durable. Here are two options for doing that. I can do a write and an fsync each and every time; the problem is that I pay the cost of two system calls to do it. Or I can do this: open the file with O_DSYNC. What happens here is that I'm telling the operating system, at the time I open the file, make sure that every write is implicitly fsynced. So I save one system call. Now, I also have another option: I can use O_DIRECT together with O_DSYNC. This combination of file flags is almost only ever used by databases. It is a really beautiful feature, because it means: I want you to skip everything in the middle. I want you to write directly from my buffer to the disk, bypassing everything else, and make sure it is durable. There are a lot of limits on doing that. The data must be 4 kilobyte aligned, both in memory and in the file. And it turns out that for a write-ahead log, this is not something that is hard for us to do. We can do it very easily and cheaply. When we commit, we just write to the write-ahead log, and when the write is done, I'm done, because those flags say that if the write completed successfully, the data is durable.

That is almost sufficient, but I have to think about what happens if I wrote to the file and then crashed midway through. So when I write into the log file, I have to protect myself from partial writes. The typical way I do that is to write a checksum as part of my write. Then, when I'm reading, I can say, oh, this doesn't match the checksum, I can just skip this value.

Finally, we have to consider how I actually write to the end of this file. Here I have two options. Option number one is that I always write to the end of the file, letting the operating system automatically extend the file. This is the wrong way to go about it. Why? Because if I'm updating the size of the file, I have to call fsync on the parent directory. Instead, what is better is to say: let's allocate the log file at a specific size, let's say 256 megabytes, and just write into it as I go along.
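Here is a hedged sketch of that "append a self-checking record to the log" idea in C#. The record layout (length, payload, hash) is an illustrative assumption, SHA-256 stands in for whatever cheap checksum a real engine would use, and Flush(true) plays the role of the durable-write flags discussed above.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Each log record is [length][payload][hash(payload)]. On recovery, a record whose hash
// does not match is treated as a torn (partial) write and replay stops there.
static class WriteAheadLog
{
    public static void Append(FileStream log, byte[] payload)
    {
        log.Write(BitConverter.GetBytes(payload.Length));
        log.Write(payload);
        log.Write(SHA256.HashData(payload));
        log.Flush(flushToDisk: true); // the commit point: once this returns, the record is durable
    }

    public static bool TryReadNext(FileStream log, out byte[] payload)
    {
        payload = Array.Empty<byte>();
        var lengthBytes = new byte[4];
        if (log.Read(lengthBytes, 0, 4) < 4) return false;                // end of log
        int length = BitConverter.ToInt32(lengthBytes);
        if (length <= 0) return false;                                    // garbage length: stop replay
        var data = new byte[length];
        if (log.Read(data, 0, data.Length) < data.Length) return false;   // torn write
        var storedHash = new byte[32];
        if (log.Read(storedHash, 0, 32) < 32) return false;               // torn write
        if (!SHA256.HashData(data).AsSpan().SequenceEqual(storedHash)) return false; // corrupted record
        payload = data;
        return true;
    }
}
```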
Preallocating the file this way, I don't have to worry about flushing the parent directory, and hopefully the log is never going to be read at all. But what about replaying the log when I'm restarting? Let's say my system crashed and I want to get back up as soon as possible. If my log contains every operation I have ever done and I'm replaying all of that on startup, that is going to have a huge cost. How do I deal with that? What happens if I have a one terabyte database? Am I going to have to rebuild all of it from the log file? That's insane. Instead, what I'm going to do is keep the log only up to a certain point. Remember, I write to the log, and when I'm done writing to the log, I write to the data file. And after some time, I'm going to flush all of the data in the data file. This is called a checkpoint. At that point, I know that the state in the log file and the state in the data file are the same, so I can remove the log file. This way, when I restart, I only have the data since the last checkpoint to process. Hopefully this is very small, in the range of dozens to hundreds of megabytes.

So, writing to the log is slow. Why is it slow? Because of this: because I'm actually writing to the hardware, and the hardware is slow. So there are a lot of really interesting ways of making it faster. For example, remember when we talked about the difference in transaction behavior: a transaction that is "begin transaction, do operation, do operation, commit transaction", where each one of them is a single interaction with the server, versus "here is all of your data in one shot, do all of these operations". Well, if I have all of the data for multiple transactions at the same time, I can merge them into a single write to the log file. Remember, the cost isn't in the amount of data that you're writing here; the cost is in going to the disk. Think about it as going to the store: whether you pick up one item or ten items, the cost of the trip is roughly the same. But if you have to pick up ten items and you go to the store ten times, the time that you spend going back and forth is going to dominate all of your costs. So merging concurrent transactions into a single write is huge.

Another aspect is how you deal with execution order. Typically, you would execute transactions in a serial fashion. But you can play all sorts of tricks. For example, I'm starting to commit this transaction. Instead of waiting for this transaction's commit to finish so I can execute the next transaction, I can execute the commit and the next transaction in parallel. The next transaction is based on this one, after all. If the commit process of this one fails, then the next one would be discarded as well. But this is fine, because if the commit process of a transaction fails, that typically means that there is some sort of catastrophic error in place.

Okay, now we have to deal with concurrency, especially with concurrent transactions. How can I deal with reading data from the database at the same time that I'm modifying it? There are many different options for doing that, with locking, with isolation levels. I'm going to present something called MVCC, Multi-Version Concurrency Control. And the idea is quite simple. Pages are how we access the data. Instead of accessing the pages directly, we are going to add a layer of indirection. Remember, we are never modifying the data in place. So here is what happens.
Here are all of the pages that I have in my system: 0, 1, 2, 3, and so on. Each one of them is a page. Now, I want to read page number 4, so I go to the data file and get page number 4. I want to read page number 8, I go here. But what happens when I want to modify a page? Here I'm using something called COW, copy-on-write. I'm going to take this page and copy it to a scratch location, and now I can modify it in peace. In transaction 13, when I'm asking for page number 8, I'm going to go here; when I'm asking for page number 4, I'm going to go here. And that's the only thing that you need to do.

With this model in place, you get two very important properties. First of all, until I commit a transaction, nothing has been modified in the data file. Rolling back is basically just freeing this memory, and I'm done. Second, it means that transaction 12 doesn't need to coordinate anything with transaction 13, because they're not touching the same data. Instead, part of the commit process of transaction 13 is to publish something that I call a PTT, a page translation table, which basically says that page number 4 is now this thing. And that's it.

Here's how it works. Here is the state of transaction 13. Now I'm transaction 14, and transaction 14 has this page translation table that was already committed but hasn't yet been written to the data file. So whenever I want to get page number 4, I'm going to get this thing; whenever I want to get page number 8, I go to this thing. And that's it. The reason this makes me happy is that it leads to a really interesting scenario. Let's say that I want to modify page number 4 again. I use the same approach: I do a copy-on-write on page number 4, which gives me basically a copy of Emily, and then I can modify that. Older transactions are still going to be looking at Emily, newer transactions are going to be looking at the new version, and I have multiple versions running at the same time. When all of the old transactions that are looking at the old version of page number 4 are dead, I can start a process that takes all of the data from here and writes it to the data file. Once I've done that, I can flush the data file. This is the checkpoint process that I talked about earlier. And then I can remove the old log records that I have, because now they have been made redundant.

Okay, so this is basically it. When no one is looking, I write the pages in, update the page translation table to remove the references, and now transactions go back to the data file. It requires somewhat of a shift in thinking, but when you realize what's going on, it's actually really beautiful: you just need a hash table to manage all of that, and you're done.

There are all sorts of additional things that I can do. Writing to the log file is expensive, so I can do batch merging of transactions. And what do I write to the log? This is what I write to the log: the pages that were changed. At the end of the transaction, I have this list in the page translation table of the pages that were changed, and I can just write them to the log. And I can optimize that by saying, hey, I have the previous and current versions, I can diff them, and then I can compress that, to reduce as much as possible what I need to write to the disk.
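Before moving on to reads, here is a hedged C# sketch of the copy-on-write plus page-translation-table idea just described: readers capture the published table when they start, a writer modifies a private copy, and committing just publishes a new table. A real engine tracks versions per transaction and frees old copies carefully; this sketch ignores all of that.

```csharp
using System;
using System.Collections.Generic;

// Pages are byte arrays keyed by page number. The "data file" holds checkpointed pages;
// the published page translation table (PTT) holds pages that are committed but not yet
// written back to the data file.
class MvccPages
{
    private readonly Dictionary<long, byte[]> _dataFile;
    private Dictionary<long, byte[]> _publishedPtt = new();

    public MvccPages(Dictionary<long, byte[]> initialPages) => _dataFile = initialPages;

    // A reader captures the PTT as of the moment it starts; later commits do not affect it.
    public IReadOnlyDictionary<long, byte[]> BeginRead() => _publishedPtt;

    public byte[] ReadPage(IReadOnlyDictionary<long, byte[]> ptt, long pageNumber) =>
        ptt.TryGetValue(pageNumber, out var modified) ? modified : _dataFile[pageNumber];

    // A writer does copy-on-write into a private copy, then "commits" by publishing a new PTT.
    public void ModifyAndCommit(long pageNumber, Action<byte[]> modify)
    {
        var scratch = (byte[])ReadPage(_publishedPtt, pageNumber).Clone(); // COW: private copy
        modify(scratch);                                                   // modify it in peace
        _publishedPtt = new Dictionary<long, byte[]>(_publishedPtt) { [pageNumber] = scratch };
    }

    // Checkpoint: once no reader still needs the old versions, fold the PTT into the data file.
    public void Checkpoint()
    {
        foreach (var (page, bytes) in _publishedPtt)
            _dataFile[page] = bytes;
        _publishedPtt = new Dictionary<long, byte[]>();
    }
}
```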
When I'm reading, assume that I have a lot of data and I know that I need to scan all of it. I can issue prefetch instructions to the operating system that say, hey, I want this data. So right now I'm reading this thing, and by the time I get to this one, it has already been fetched from disk and is available to me.

And again, this comes back to the communication model. If I'm doing a chatty, back-and-forth conversation instead of getting the full message up front, I have to deal with two users who are concurrently talking to me and modifying records, and I have to account for that, which means that I have to spend a lot of time managing locks. There is a paper called "OLTP Through the Looking Glass" that looked into this: 40% of database performance goes into lock management, just because of the communication model choice. Typically, we talk about separation of concerns as an important value for your architecture. It does not work like that with databases. The network protocol dictates how you store things in memory, the on-disk data impacts the transaction boundaries, and all of those are tied very closely together. It's a lot of fun, but it is absolutely complex.

There are other models. I'm not going to touch on them, because I think this is complicated enough. I will say that there is something called the Bitcask model, which is basically: write the data to the disk and keep an in-memory hash table of the key and its position. That's very simple to work with and understand, but it has limitations as the amount of data that you have grows. And here you can see the interaction models: transaction started, inserted one row. Now try to imagine I have two conversations like that, versus getting it all in one package.

Here is another aspect that we have to deal with: how do we read the data? Let's say that I want to render this page. What do I have to do? Well, give me the last five orders that I have. And now I'm starting to issue a lot more queries. So far I talked about the database engine, the low-level way that we store and retrieve the data. Now we need to start talking about the way that we interact with the database.

Anyway, let's talk about this. I want to query data, and I want to query the data, hopefully, in a fast manner. How do I do that? I have to use indexes. You might have heard about indexes, but what are they? It turns out that indexes are super simple. They are references from the item that we are searching for to the record in question. It's really interesting, because you can combine indexes to make really complex queries.

With that, I'm going to stop for a second and introduce RavenDB itself. Here you can see RavenDB. This is actually running as a cloud instance inside of AWS; you can also run RavenDB on-premise or on your own systems. You can see that in this case I'm running it in Ireland, because this is close to where I am. I have three nodes that are part of a single cluster. Hey, Oren, sorry to interrupt, I don't know if you can zoom in as well. I absolutely can zoom in. Tell me when it's big enough. Yeah, I think it's great. Okay. So I can see that I have three nodes in the cluster, and in this case I have a database with some minimal amount of data.
The data that I have is stored as JSON documents, and it's not really that interesting; this is the Northwind dataset. It's basically some online shop, and you can see what it looks like. The interesting thing is that now that I have this data, I can start asking questions. So let's ask questions about the employees. Now you can see something really interesting: here is an employee from London. How do you think I answered this query? How does the database know what to do? Well, in this case, we can see something really interesting: it used an index. In fact, it used an auto-index. In the case of RavenDB specifically, when you issue a query that doesn't have an index, RavenDB is going to create one for you.

But what does it mean that it has an index? Let's look into that. We now record all of the values of the address city for all of the employees. And remember, this is basically a B-tree deep, deep inside of RavenDB. Now I'm able to ask questions like this, where the city is Redmond, and run a query like this. And if you remember, this is basically a cost of O(log n). But this is cool, because now you realize what's actually happening. For that matter, let's go back to this one. What's the cost of this? Well, this is what I have to do, and the cost is log n plus the number of matches; I have to scan internally through all of those values.

In fact, I can show you something a lot more interesting. Let me see how I can do that. Here we go. What we see here is the B-tree of all of the documents that we have in the system, and you can see that we have a total of 10,059 items. Easy and simple. Let's make this bigger. What I'm going to do is increase the number of orders that we have in the system by a factor of a hundred. So this operates over each one of the orders, and for each one of those, it's going to copy it a hundred times. And you know what, let's run it again so we have a sufficient number of records. What you see here is the actual B-tree inside of RavenDB that maps from the order ID, from the document ID, to its location. These are the sizes that you see here. And now we are going to start seeing something slightly different. We might actually crash Chrome if this gets too big. So now I have a branch and a branch, and now I have a leaf.

Now let's say that I want to ask a question: give me all of the orders from 50,000 to 60,000. So I go, I know that this is here and I know that this should be here. And here is 51,000. I don't know why we have 50 exactly; oh, they skip by two for some reason. And now I can start giving you values from here, and all of the items here, et cetera, until I get to this one, and I keep on going until I hit this thing. So you can actually see the cost of those operations. At the same time, you can see, by the way, that we have written quite a lot of data, and we are now at 250,000 records. I apologize, this is actually using standard disks, so we are not seeing a high ingest rate; we are only seeing something in the order of a few thousand per second. But the idea here is that you're able to see exactly the cost of the operations going on here. And obviously, as the data grows, so does my ability to benefit from the O(log n) nature of the B-tree.

Okay. So let's talk about how I can make more complex queries: where the address city is London and... here we go.
So now we found Anne, who lives in London. How does this work? Think about that. Again, I'm going to go here. Here is the indexing question: notice that now I have both the address city and the first name indexed. Now I have two separate items here. Here is Anne, and previously we also saw where the address city is London. So how do I get to one query like that? What's actually happening here? Well, think about it. First, I did a search of: give me all of the items with address city equal to London. Then I had a search of: give me all of the items with name equal to Anne. The end result is that I have two sets of matching values. If I intersect them, because I have an AND, I get the right result. If I have an OR, I don't do an intersection, I do a union. Et cetera, et cetera. So it's conceptually very easy. There is some tension here between wanting to do more during the indexing process or during the querying process. The more I do during the indexing process, the cheaper my queries become; the more I do during the querying process, the more options are available to me. So there is a tension between performance and flexibility.

Okay, now I think we are close to 700,000 orders, so we have quite a few orders. Let's look at this: I want to look at orders by company. Let's say group by company, select company, count. So the question that I'm now asking is how many orders I have per company. And this is an expensive question, because I have to go through all of those items, and I'm running on fairly crappy hardware. And now I can see that it gave me some results. Oh, wait a second, I didn't sort it, so it's hard to look at. Wait a second, this is interesting. This was way faster than before. In fact, this is now effectively instant; I'm only paying for the time to actually go to the server. What's going on here? Why is it so fast?

Well, let's look at the index. In the case of RavenDB, we are always using an index. So now I can see that I have an index that doesn't look like the previous index that we saw. Now we have something called a reduce key, which is what I'm grouping by. So let's look at this thing. Here I have order number 8807 belonging to company 85. And now I can see something really interesting. Here is the record from 8807, and here is the aggregation of all of the values that we have in this page. And here I have an aggregation of this, together with all of those as well, culminating in this value. So the way that RavenDB handles aggregation is quite simple: we are actually doing all of the aggregation at indexing time. So when you ask a question like that, regardless of the size of the data set that you have, we are able to answer it very, very quickly and cheaply.

It gets better, because you can do things that are a lot more complicated. For example, look at what I'm doing here. I'm operating over all of the orders, and here I'm summing the total value that I have for the order. And here I'm summing the total count of orders per company and the total sum of value per company. And now, when I query, currently company 63 is the most important one; they purchase the most. Notice that I'm operating over a lot of data very, very easily and cheaply. And the way that I'm doing that is by using the same structure. You can see it here. This document, here are the values for order 8815, and here's the sum total.
And I can aggregate that with additional values from here, leading to the total value that I have here. Now, what happens if I want to do a spatial query, basically to be able to query the data using a geographical location? What do you think I need to do? For that matter, how can I make it go fast? So, what is a spatial query? Let's look at that. Here we go. Here is an example of spatial data. Here is its location, here is this location. I want to query on that; I want to be able to find things based on that. I can issue a query like this: for employees, run a spatial query on this. Let's see what we got. I get five employees, and I can actually see how this is built; I'm going to minimize a little bit so you can see how this works. This is in miles; I want, let's say, 3 kilometers, 3 miles, whatever. And you can see how it works. In fact, I can do things in a lot more complicated fashion. Let's do some querying over companies, and this is what it looks like.

How does this work? How can we make it go fast? The answer is quite simple. Let's go and look. Here are the terms. And now you can see something really interesting: I'm actually looking at values like this. Here is what I've basically done: I started dividing the earth into a grid pattern, and the more characters I have, the more accurately I pin down the location. Each one of those values represents a square; the more detail, the narrower the square actually is. And now, when I'm looking at the actual query that I execute, what do I do? I compute the bounding square for it, and then I do basically this: companies by spatial, something like that. That gives me all of the values in here, and then I do a check for each one of them, whether it is included in this polygon. It's cheap and it's easy to do. And in the most beautiful possible manner, it works without putting a lot of complexity on you. Think about that: the same B-tree that we talked about at the beginning of this lecture has been applied to do spatial queries, range queries, point queries, all of those sorts of things. It is an incredibly versatile structure that allows me to work on a huge amount of data very easily and cheaply.

Now, in the case of aggregations, we saw that. We saw how we can do this, which is these things. In this case, this is by company, but let's do... The first time that I'm doing that, the database needs to adjust. This is a RavenDB feature: when you issue a query to RavenDB for the first time, it says, hey, I don't know how to handle this query, it's the first time I've seen it. So what I'm going to do is prepare myself, so that the next time you ask this question, I'm going to be insanely fast. And you can see that there are a lot of orders going to Rio de Janeiro, and the time that it takes me to query that dropped to almost nothing. This is because RavenDB chose to do the aggregation at indexing time.

And that leads to an interesting issue. Let's look here, at the orders. Here we go. Now you can see all of those cities and all of the orders that go to them. Here is how many go to Geneva. I already showed you this, so you should be somewhat familiar with it. But what I now want you to think about is: what does RavenDB do when I issue the query?
It goes and gives you this value, which was already pre-computed. But it actually gets better, because what happens if I update an order? So here I can see order 290. Let's go and update order 290. Here's an order; I'm going to clone it. Now we have a new order for Geneva. Let's see what this looks like now. Previously it was 1,210, now it's 1,211. How does that work? Let's look at the new document that we just created that goes to Geneva. Let's create another one. Now I expect that if I look at the value, it's 1,212, as expected. Now, searching for this, I can see something really, really beautiful. When I edited this, what did it do? It went ahead and updated this item, then it updated this one, and then all it had to do was update this one again. So the cost of updating the aggregation is proportional to the height of the tree. If I ran enough writes here, you would be able to see that we start getting multiple levels. This typically happens around 20 million records or so. I will let it run, and we'll see how it works. Okay. So this is aggregation. I'll come back to this later to show you how it works.

Another really important aspect is the notion of going distributed. I've shown you very, very briefly that RavenDB actually runs as a cluster. What I haven't talked about is that this database actually runs on three separate servers. In this case, I'm doing something a bit nasty. What do I mean? I'm throwing a lot of writes at this database, and you can see that this database has a number of indexes and it is struggling. Why is it struggling? Because it's a small machine, and we have 11 indexes that need to be updated all at the same time. So it's struggling. But what about the other nodes in the cluster? Let's go and look at them. Now I'm looking at node B and node C, and the reason they are marked here as yellow is that I'm writing to node A faster than the writes can be replicated to B and C. Again, this is basically just the I/O limits of the hardware that I've chosen. Now let's look here. You can see that this one is less loaded, because the writes are being replicated to it rather than being made to it directly. The same thing should be happening here. So RavenDB at this point is going to have to make a decision: which node am I going to talk to?

But you know what, let's make this even more interesting. Let's stop this. I'm on the wrong node; let's stop this and see how long it takes the data to fully sync. Remember, what we are actually showing here is a scenario where the system is under tremendous load with insufficient hardware. And in particular, if I asked RavenDB, it would say, you're killing me, because of the amount of writes to the disk that we have. Before I continue, I just want to see if we managed to get to the 20 million. Nope, we didn't get the third level. I'm not going to go into that. So right now, you can see that we have nodes A, B, and C, which compose the database group. Any write made to any of them is going to be replicated to the whole group.
And that's really interesting, because even though you saw me hammering the system far beyond its capacity, with thousands of writes per second, a lot of indexes, a slow disk and the CPU burning up, the moment it had enough room to breathe, everything was fully synced. I can actually go and do even more nasty stuff. Let's force node B to fail, basically just by restarting it, and let's see how that is reflected here. In this case the server fails and gets marked as red; you can see the server is down. And let me see... okay, so we notice the failure here, but here we're not seeing it yet; it may take a few more seconds. If I now tried to approach node B from the client, it wouldn't work. But, oh, now it does. Sorry, I didn't leave enough time for us to actually see the failure behavior. But the basic idea is that I don't need to worry about the availability of my system. It doesn't go down: the cluster detects the failure, the traffic is redirected among the remaining nodes, and I move on. It also means that I have a lot of interesting options with regard to how I operate. For example, I can say, hey, I want you to load balance the queries across all three members in the cluster, which means that as I put more load on the system, I can spread it across all of those nodes. Other features that are quite interesting to talk about are how I can run in a geo-distributed manner. In this case, you can see I have a New York and a London cluster, and they are tied together. They are able to operate in cooperation and share data, and still give you a local, fast access point to work against.

I'm trying to give you some context about all of the features and all of the behavior, but the fourth thing that I have to talk about is the model that you actually work with. When you start using RavenDB, or any database, one of the things that you really want is to get away from needing to worry about all of the details that I'm talking about here. I don't want you to think about database internals; that's my role. I don't know if you noticed, but this is what I love doing. I think it's fascinating and really interesting. But from your perspective, what you want to do is build an application, and in order to do that you just want to write some code. So let's look at some typical code that you would write using RavenDB. These are a few of the client APIs that we have; there are more, but you can see that I'm loading a document and getting an object back. This is a POCO, a plain old C# class, and you would have something similar in Node.js or Java or whatever. The idea here is that this instance is being tracked by the session, so when you modify it and then call SaveChanges, we automatically know that you want it persisted. For that matter, as I'm modifying... not this one, here we go. In this case, you can see that I'm writing three different documents to the database. Oren, can you zoom in again? Yes, I'm sorry, I keep forgetting. No problem.
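To give a flavor of the session pattern being described, here is a short sketch using the RavenDB C# client; the Company class, the document ID, the database name and the server URL are hypothetical examples rather than what was shown on screen.

using Raven.Client.Documents;

public class Company
{
    public string Id { get; set; }
    public string Name { get; set; }
    public string Phone { get; set; }
}

public static class SessionDemo
{
    public static void UpdateCompany()
    {
        using var store = new DocumentStore
        {
            Urls = new[] { "http://localhost:8080" },   // hypothetical server URL
            Database = "Northwind"                      // hypothetical database name
        };
        store.Initialize();

        using var session = store.OpenSession();

        // Load returns a tracked POCO; no SQL, no mapping code.
        var company = session.Load<Company>("companies/1-A");
        company.Phone = "555-0199";

        // The session knows which tracked instances changed and persists
        // exactly those, in a single transactional call.
        session.SaveChanges();
    }
}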
You can see here that I'm writing three different documents, and I'm doing that in a single transaction. This SaveChanges call is transactional: all of the values are going to be written in one shot, in an atomic fashion, durably, the whole thing. So we have a bunch of interesting features: transactions from the get-go, automatic change tracking, and you work with plain objects. But then you start writing queries. This one is not an interesting query, so let's look at an interesting one. Here we go. In this case I'm using LINQ, Language Integrated Query, and this will generate the appropriate RQL, the Raven Query Language, for the server. I don't know if you noticed, but when we query RavenDB we use something that is very similar to SQL, specifically because that makes it easy and cheap to pick up. Now let's look at this example, because it's a really nice one. What you can see here is a whole bunch of features operating at once. We start from orders, and we say: go from the order document to the related company document, and then do some projections on it; give me the country from the company, give me the line count, and so on. The idea is that you have a lot of freedom in the way that you operate. But there is one feature that I really love that I want to discuss: this thing. Look at what we have here. We have a query whose results require additional data, and each result would require additional queries to the database. How do I manage something like that with RavenDB? So here are my orders, and in the orders I have a company and an employee that I might want to show. In a relational database, I would need to either issue separate queries or do a join, which then forces me into a Cartesian product. In RavenDB, I'm going to do this. Sorry. And that didn't change anything, right? Except look at what happens when I scroll down: in one round trip to the database, I got all of the related documents, the employees and the companies that were referenced in these results. What is even more interesting, from my perspective, is that I can do the same for the lines: here are the related products. So in one round trip to the database, I can get all of this. And remember, going back and forth to the database is what's expensive, not just the queries themselves. Now, this is great when I'm looking at the actual query, but what does it look like in terms of the API? In the API, it's really simple, if I can just show that... are you kidding me, I don't have an example of that here? That's annoying; I'm sorry, I have to go to the documentation for this. The feature is called includes. Here we go, client API. So here you can see a document, and here I'm saying, hey, load this particular order, and along with the order I want you to give me the related customer. And I need to increase the font size. Now, when I load that customer, it does not go to the server. Why doesn't it go to the server? Because I already got this information here. So even though the API is the same, I'm reducing the number of operations that go to the database, and that means I can give you much better system behavior. For that matter, I also have another important feature to show here: the notion of lazy operations. This is funny, because this one does not call to the database at all.
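Here is roughly what that looks like in code, a sketch with the RavenDB C# client; the Order and Company classes and the document ID are hypothetical stand-ins for the documentation example being shown.

using Raven.Client.Documents.Session;

public class Order
{
    public string Id { get; set; }
    public string Company { get; set; }   // stores a document ID such as "companies/1-A"
}

public class Company
{
    public string Id { get; set; }
    public string Name { get; set; }
}

public static class IncludeDemo
{
    public static void LoadOrderWithItsCompany(IDocumentSession session)
    {
        // One round trip: the order and the company it references come back together.
        Order order = session.Include<Order>(o => o.Company)
                             .Load("orders/830-A");          // hypothetical ID

        // The company is already tracked by the session, so this does NOT
        // trigger another request to the server.
        Company company = session.Load<Company>(order.Company);
    }
}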
Why do I want a query that doesn't call to the database? Because I can make multiple such queries, and they're basically stored in a buffer; then, when I evaluate one of them by accessing its Value, I send all of the pending queries to the database in one shot. Again, the whole idea is to reduce the number of hops to the server. So there is lazy, there is include, there is a whole bunch of features around reducing the number of times you have to go to the server. Okay, what next? You start using the database and utilizing all sorts of interesting features. I talked a little bit about RavenDB in the context of documents and JSON data, and we have shown that we can actually operate on the data itself: give me the city, load a related document based on that, things like that. Now let's see something like this. I want to be able to do more complex searches, so I'm going to do this. What am I doing here? This says: do a full-text search in the notes section of the employee. You can see that here we have this guy; apparently he speaks French or Portuguese. Let's type Portuguese: this guy speaks Portuguese, but now we have multiple people who are speaking French. At least I think they are speaking French; let's look at this guy, I don't know if he speaks French or not. Oh, he does, but it took me a while to figure that out. I can also do this: now what RavenDB will do is highlight it for me. Here is this guy, and here is the "French" that it found in his notes. So the whole idea is that I have a very easy way of operating over whatever this is: geospatial, big aggregations, full-text search, highlighting, those sorts of things. I can also do other fun stuff. Let's say that I want to search for Anna. Really, I don't have an Anna? That's weird, I'm fairly certain that we have one. So what I'm asking RavenDB to do here is: hey, I have a term that I want to query; find me terms in the database that match it, that are close enough to it. In this case, I'm searching for Anna, and I get back the close-enough term that I actually wanted. This is really cool if you want to implement search features such as "did you mean", because it allows you to do things that look like magic to the user, but actually give you concrete results very easily and very cheaply. We have about 10 more minutes, and I don't know if we have enough time to cover any more items in depth. This is not a live session, so I don't know if we have any questions. If you do have any questions, feel free to send them to me, at oren at ravendb.net; I would be very happy to answer them. You can also go to RavenDB Cloud, where we offer a completely free version: you can spawn an instance, start it, and get going, and you don't even need to register or anything like that. If you want to, you can also run RavenDB itself on your own machine, and again, there is a free version available, so just pick your poison, run it, and have fun with it. I think this is it; this is everything that I had to say. Do you have... oh, there are some questions. Okay. This is a question from Gabriel: does a GUID as a string perform worse than an ID based on a timestamp or an incremental number? Does a GUID harm the B-tree because it's not strictly ordered? And the answer is yes.
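Since the lazy pattern is easy to get backwards, here is a small sketch of it with the RavenDB C# client; the Employee and Company classes and the filter values are hypothetical, and the point is simply that nothing is sent until a Value is touched, at which moment everything pending goes out in one request.

using System;
using System.Collections.Generic;
using System.Linq;
using Raven.Client.Documents;
using Raven.Client.Documents.Linq;
using Raven.Client.Documents.Session;

public class Employee { public string Id { get; set; } public string Notes { get; set; } }
public class Company  { public string Id { get; set; } public string Country { get; set; } }

public static class LazyDemo
{
    public static void TwoQueriesOneRoundTrip(IDocumentSession session)
    {
        // Neither of these hits the server yet; they are queued in the session.
        Lazy<IEnumerable<Employee>> employees = session.Query<Employee>()
            .Where(e => e.Notes != null)
            .Lazily();

        Lazy<IEnumerable<Company>> ukCompanies = session.Query<Company>()
            .Where(c => c.Country == "UK")
            .Lazily();

        // Touching one Value sends ALL pending lazy operations in a single
        // request; the second Value is then answered from what came back.
        var first = employees.Value.ToList();
        var second = ukCompanies.Value.ToList();
    }
}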
In fact, let me show you exactly how. Let's go here and create a database for Gabriel, with some data in it. Now I'm going to run a patch, and if I did this properly, it should generate a lot of documents whose IDs are GUIDs. Okay, that should be sufficient for our purposes. So now you can see that I have 10,000-something documents, and this is the depth of the tree. Here is something that you should notice: I have something like 93 items per page. Now I'm going to create another database, and this time I want to generate sequential IDs, with the same number of records and the same number of characters. And... okay, let's just do it. Sorry, I wanted the same number of characters, but this is annoying; it should be close enough. Okay, so I've done that, and again we have a hundred and something, that's fine. Hmm, I think I messed up; this is not sorted. I'm just going to give up on getting the exact same number of records, because I don't want to figure out how to do that right now. Anyway, this gives roughly the same number of documents, and the document IDs now look like this: they are sortable, so now we can actually see the behavior. In this case, you can see that I have 61 pages with 169 entries each, except the last one. So it's 61 pages versus 112 pages: with the random GUIDs, the keys don't pack nearly as densely. Now let me show you a trick: I can go here and ask to compact this database. This is before the compaction and this is afterwards, and apparently it didn't do quite what I expected; it did save something, just not as much as I thought it would. But anyway, to answer your question, Gabriel: the structure of the pages absolutely matters. Now, here's the deal: unless you get to ridiculous sizes, it doesn't really matter. Let's do the math. For the sorted version we have 169 entries per page, so if we are talking about 100 million records, that means we have about 600,000 pages. For the GUID version, let's say the average is about 80 entries per page, which gives about 1.25 million pages. Multiply by 8 kilobytes per page and you get roughly 9.5 gigabytes of pages for the GUID version, whereas for the sorted version it would be about 4.5 gigabytes. So there is a real cost, and that sounds like a huge impact, until you realize that this happens at 100 million records, where the 5 gigabyte difference is almost noise, because you're talking about something in the order of half a terabyte of data overall. So it absolutely matters, but it usually doesn't have a real impact unless you go to really big sizes. Which actually leads to an interesting observation: the client API that we use, when you store documents like that, generates sorted IDs for you. So those values are going to be sorted just fine; you don't really need to think about it, it just happens for you. So I think this is it. Again, if you have any questions, feel free to ping me.
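If you want to redo that arithmetic yourself, here is the back-of-the-envelope calculation as a tiny C# program, using the page size and per-page entry counts quoted in the demo; the fill factors are approximate numbers from this particular dataset, not universal constants.

using System;

static class BTreeSizing
{
    static void Main()
    {
        const long records = 100_000_000;
        const double pageKb = 8;                  // 8 KB pages

        double sequentialPages = records / 169.0; // ~169 entries/page when keys arrive sorted
        double guidPages       = records / 80.0;  // ~80 entries/page with random GUID keys

        Console.WriteLine($"sequential: {sequentialPages:N0} pages ~ {sequentialPages * pageKb / 1024 / 1024:F1} GB");
        Console.WriteLine($"guid:       {guidPages:N0} pages ~ {guidPages * pageKb / 1024 / 1024:F1} GB");
        // Prints roughly 4.5 GB vs 9.5 GB of leaf pages for 100 million records.
    }
}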
I would really love it if you took RavenDB for a spin and saw what you can do with it. And that's about it. Thank you very much.