Monday, March 31, 2014

MySQL Cluster 7.4.0 Lab Release Improvements

The year 2014 seems to be the year of in-memory databases. We've seen a lot of commotion around Microsoft's Hekaton product and other new products. At the same time everyone talks about WebScale and how to build distributed databases that can handle millions of queries per second. In addition, we've seen the development of quite efficient communication mechanisms over the last few years. Databases that make use of in-memory techniques, WebScale and efficient communication mechanisms are the talk of today in the database world.

The background of NDB

The design of the NDB storage engine started out with exactly those base requirements more than 20 years ago, when the aim was to build the next generation telecom database. The base requirements were a shared-nothing database (for superior scalability and for meeting telecom requirements on fail-over times). Already in those days one could buy a machine equipped with 1 GB of memory. Given that telecom databases were used for extremely quick lookups of small amounts of data, it was natural to consider an in-memory design. My research into databases at the time showed that they spent a considerable amount of time in the operating system handling communication and hard disks. So by moving data in-memory and by using communication mechanisms that avoided the operating system, we were able to deliver extremely efficient database operations already in the late 90s.

NDB Today

Today NDB is the storage engine of MySQL Cluster and has been in production use for more than 10 years. Almost everyone on the globe is touched by its operation in some telecom system, some computer game or some other type of web property. We have already delivered benchmarks with billions of reads per minute, and the scalability of MySQL Cluster is so high that we simply don't have big enough computers or computer sites to show off its limits. A while ago we had access to a computer lab with hundreds of computers connected using InfiniBand, with a total bandwidth between the machines of 1 Tbit/sec. This means we can transport 128 GBytes per second between the machines. Yet MySQL Cluster could theoretically produce enough parallel read operations to swamp even such a network. So we're getting to the point where it becomes less and less interesting to show the scalability limits of MySQL Cluster, which means that we also want to focus on efficiency and not only on scalability.

The Many-Core CPU Challenge

I gave a presentation to a group of students at Uppsala University about MySQL and MySQL Cluster. In this presentation I showed how the development of multi-core CPUs has presented a tremendous challenge to software designers. In a short span of only 8 years, Intel and other HW developers gave us the challenge of scaling our software 60x. At MySQL we were up to the challenge: we've increased the scalability of MySQL using InnoDB 20x in a time span of 5 years. At the same time we've increased the scalability of the NDB storage engine more than 30x, which means that MySQL Cluster, where we use MySQL together with the NDB storage engine, has actually scaled 60x in total. This means that my test machine with 8 sockets and 96 CPU threads is now the limiting factor in my benchmarks.

The Many-Core CPU Solution

How have we achieved this? With the MySQL Server it has been a long series of fixes to various bottlenecks, such as splitting the InnoDB buffer pool, handling the LOCK_open mutex and many more changes that collectively have made it possible to scale far beyond what our software in 2008 could achieve. This improvement of scalability continues, so stay tuned for more; there are blogs to read right now on what is going on in MySQL 5.7 development.

With the NDB storage engine the solution has been quite different. We started out building a distributed database from the beginning, consisting of a set of nodes that replicate data synchronously using transactions. In order to avoid use of the operating system, we built the architecture as a set of independent modules that interact through messages. This was based on the architecture of AXE, a telecom switch operating system of unique efficiency. The first version had each node implemented in a single signal thread. Since the development was based on independent modules, we inherited the simple task of dividing the thread into a number of functional modules. Currently we have separated it into the local database part (LDM), the transaction part (TC), the network send part, the network receive part, an asynchronous event part, and finally the main part containing features for metadata handling. Given that we developed a shared-nothing architecture, it was simple to continue the partitioning to gain even more independent LDM parts by having each LDM thread handle different parts of the data. The transaction part can simply use a round-robin scheme, and the network parts can easily be divided per socket we handle. In the future we could divide some functions even further.


NDB Layer by Layer approach


So what does this mean in effect? It means that we have actually built a distributed database inside each node of our distributed database. Given that we can also replicate using MySQL replication, we can go even further and have multiple clusters connected together. Those who want to think even further about how NDB could be used to build systems with millions of CPUs can google the word iClaustron. iClaustron is a hobby project I've played with since 2006, and I presented the aims of the project in a tech talk at Google which is available on YouTube.

The world is organised downwards into microcosmos and continues growing upwards into macrocosmos. So why would software be different? We need to build systems of any size by using layers upon layers of distribution.

So building MySQL Cluster is an interesting exercise in building distribution into the system, layer by layer.


The big challenge ahead of us


So what could be the next challenge that the hardware engineers will deliver to us software engineers? Personally, I am preparing for it already. The challenge that I hope they will bring us is persistent memory. This means that we will have to build databases where, all of a sudden, we can make persistent writes at speeds similar to those at which we currently write to main memory. This will be an interesting challenge, and personally I think that main-memory databases have a unique advantage here since they already work at memory speed. So I feel a bit like a horse in the gates before a race, kicking and eagerly waiting to get onto the track to see how fast we can run home the next big challenge. But we have to wait until the hardware engineers settle which technology will be the winner in this category and can be commercialised.


So after these small philosophical thoughts, let's get into what we're doing in the first lab release of MySQL Cluster version 7.4 to get further along the path to these goals.

The improvements in MySQL Cluster 7.4.0 Labs release

As mentioned, we are working on improving the efficiency of MySQL Cluster. We have specifically worked on scans in the NDB storage engine, which have been heavily optimised. In benchmarks that use a lot of scans, such as Sysbench, we have managed to improve performance per data node by 46% comparing 7.4.0 to 7.3.5. Compared to 7.2.16 the difference is even bigger than 100%, but going from 7.2 to 7.3 it was mainly inefficiencies in the MySQL Server that were fixed.





Another important thing we've done in 7.4.0 is to add a lot of documentation about both our restarts and our scans, in the form of extended comments in the code. We've also gone through the log messages presented to the operator during restarts and made them much more accessible and extensive.

MySQL Cluster 7.4.0 improvements for virtual machine environments

With 7.4 we're working hard on making MySQL Cluster more stable even when the underlying system isn't as stable as we would expect. MySQL Cluster is designed for high-availability environments, and now we're working on making sure that the system can continue to operate even when systems are overcommitted. When we're working in a virtual machine environment where we cannot be certain of the exact resources we have available, it is hard to operate a high-availability environment, but we still want to work as reliably as possible.

MySQL Cluster 7.4.0 Stability improvements

We have also been working on improving the availability of the system by improving restart times. There are many areas where we can work on this: we can remove subtle delays that add up to longer restart times, and we can use more parallelism in certain phases of the restarts. We have also made our local checkpoints more parallelised, which gives a more balanced load on the various LDM threads in our system. This has the nice side effect of paying off in 5-10% improved performance for any application. Naturally it also means that we can run the local checkpoints faster, since we no longer risk imbalances in the CPU load when doing so.

Another unique feature of MySQL Cluster is supporting Active-Active environments using MySQL replication. We've been working to extend the support of this feature even further.


Benchmark environment description

We executed a set of benchmarks using Sysbench 0.4.12.6 in our dbt2-0.37.50.6 environment. We used a big machine with 8 sockets of Intel Xeon CPUs running at 2 GHz. Each socket has 6 cores and 12 CPU threads. In most cases we run with hyperthreading enabled, but we have found that running LDM threads without hyperthreading is a good idea. This decreases the number of partitions and the number of threads to manage, which has a positive effect on performance.

We used 8 LDM threads; in this case the NDB data node used 2 sockets, and the benchmark program and the MySQL Server had access to 5 sockets. The MySQL Server used about 40 CPU threads out of the 60 it had access to, so in this configuration we had spare resources. But in the next step, where we went to 12 LDM threads, we could not use the full potential of the SW. In this case the data node needed 3 sockets and the benchmark program used 1 socket, so the MySQL Server only had access to 4 sockets. This meant that it could increase performance by 25% rather than the 50% made possible by going to 12 LDM threads (actually we squeezed a bit and made 52 CPU threads available to the MySQL Server, and thus got about a 30% improvement over 8 LDM threads). Using 7.3 the data nodes are less efficient, so there we could scale the LDM threads all the way to the 50% improvement (actually we even got a 52.7% improvement, i.e. perfect scaling of performance as more LDM threads are added).
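For reference, this kind of thread placement is expressed with the ThreadConfig parameter in the cluster configuration. A sketch along the lines of the 8-LDM setup could look as follows; the CPU ids and counts below are purely illustrative, not the exact ones used in this benchmark:

```ini
[ndbd default]
# 8 LDM threads, each bound to its own physical core (hyperthread siblings
# left idle), plus dedicated TC, send and receive threads.
# CPU ids are illustrative only.
ThreadConfig=ldm={count=8,cpubind=0,1,2,3,4,5,6,7},tc={count=2,cpubind=8,9},send={count=2,cpubind=10,11},recv={count=2,cpubind=12,13},main={cpubind=14},io={cpubind=14},rep={cpubind=14}
```

Binding each thread type to specific CPUs like this avoids the OS scheduler moving threads around, which matters when you are trying to saturate particular sockets as described above.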

So with 12 LDM threads we need a 54-core machine to make full use of the data node's potential. With 16 LDM threads we need even more: 4 sockets for the data node, 2 sockets for the benchmark program and 6 sockets to run the MySQL Server, thus a total of 12 sockets or 72 cores. This is probably as far as MySQL 5.6 can help us scale before the MySQL Server itself becomes the bottleneck. But this is an important area of focus for MySQL 5.7, which has already had a set of improvements implemented in the 5.7.4 DMR released now.

Final words

So with this kind of scalability we are now in a position to deliver more performance from one single node than we previously could deliver from an entire cluster. Imagine what performance we can get when connecting many nodes together, and we're still working on making each MySQL Cluster thread more efficient in its execution.

12 comments:

Jim Dowling said...

Great stuff Mikael. Did you run any of the flexAsynch benchmarks on 7.4? I guess if you've worked mostly on improving index scans, there won't have been big changes. When is 7.4 expected to go GA?

Mikael Ronstrom said...

No, I haven't tested flexAsynch comparatively yet on this release. I expect about a 10% improvement. Most of the improvements are scan-related, but there is some improvement of checksum calculation and early prefetch of tuples that will also have a positive effect on all forms of primary and unique key lookups.

Михаил Прокопив said...

Hi Mikael, very interesting to read your blog!
Do you plan any optimisation of internal data node communication?
I mean cases where we have a lot of parallel PK reads and the Ethernet stack looks like a bottleneck.
Could you recommend any algorithm to pre-route requests between node groups (PK reads)?
Best wishes, Michael

Mikael Ronstrom said...

Hi Michael,
Not sure I follow you all the way. We have already optimised PK read quite a lot as is.

Pre-routing requests, I gather, is what you can do using hints in the startTransaction call in the NDB API. We use these hints when starting transactions in the NDB storage engine.

We have noticed that in some cases the communication between data nodes is a bottleneck, that we cannot have more communication using a single socket, so more sockets are needed in this particular case to scale onwards (but this is a very special case with extremely high update rates).

Михаил Прокопив said...

We load a cluster (4 data nodes, Replicas=2) from 3 servers (3000 req/sec); each data node also has an API node (self-written) collocated.
The load is about 99000 NDB operations (12000 index scan operations, 24000 PK insert operations, 24000 PK update operations, 39000 PK read operations). Unfortunately we can't reach more.
CPU load is about 80%, disk waits are minimal, network traffic is ~62 Mb.
The idea was that if we pre-select, in the traffic generator, the API node co-located with the corresponding data node (group), it would minimise internal cluster traffic and gain overall performance.
But then we have to know how to route our requests between API nodes.
Could you recommend something?

Mikael Ronstrom said...

Hi Michael,
The way to route traffic to the correct node is through the startTransaction calls in the NDB API (returning an NdbTransaction object). You could use those also for pre-routing. In that case you would do startTransaction followed immediately by closeTransaction. This will provide the data node id which you want to go to. Then you need a mapping from that to the correct API node.

Mikael Ronstrom said...

Hi Michael,
In addition, you mention that you have 80% CPU load. But with ndbmtd there is much more to it than a single CPU load figure. You can divide the CPUs to have some run the TC threads, some the LDM threads, some the send threads and some the receive threads. For send, receive, LDM and TC threads you can have multiple threads of each type.

So which type is it that is at 80% CPU load?

Have you mapped the threads to specific CPUs using the config parameter ThreadConfig? This is usually a good idea to get the absolute best performance. That, together with top, gives you a very good picture of where your bottleneck resides.

Also have you ensured that the bottleneck isn't in the API nodes?

Rgrds Mikael

Михаил Прокопив said...

The LDM threads are at up to 80% of a CPU core, with
ThreadConfig=ldm={count=4,cpubind=0,1,2,3},tc={count=1,cpubind=4},main={cpubind=7},io={cpubind=7},rep={cpubind=7},recv={count=2,cpubind=6,8},send={count=3,cpubind=9,7,5}.

Yes, the API nodes take a lot of CPU, but the profiler shows that the top consumer is the NDB library in "Poll and Events Waiting".

Do you know Keith Hollman from Oracle? We are in touch with him.

P.S. Is it possible to contact you via email?

Mikael Ronstrom said...

Hi Michael,
You can reach me at
mikael dot ronstrom at oracle dot com

I have seen Keith Hollman's name in some customer emails.

Mikael Ronstrom said...

If the API takes up a lot of CPU, there are two things you can do.

1) Ensure that the API program is locked to different CPUs than the data node.

You can accomplish this using either taskset or numactl.

2) Use the feature to bind NDB API receive threads to specific CPUs (different from rest of API CPUs and different from the data node CPUs)

Deepak S said...

Hello Mikael, I am working with the MySQL query cache module. I found that the memory blocks are connected via doubly linked lists, and also that there are two doubly linked lists: queries and tables. I would be grateful if you could explain the architecture involved, as I could not find any articles on it.
Thanks