Wednesday, April 01, 2015

Benchmarking MySQL Cluster 7.4 on an Intel NUC

I have done a lot of benchmarks of MySQL Cluster on large servers,
which is obviously very interesting. As mentioned in a previous blog I
now have an Intel NUC machine easily accessible. So I thought it would
be fun to run some benchmarks on this machine to see how fast MySQL
Cluster runs on small HW.

First a little description of the HW. The CPU is an Intel Core i5-4250
CPU. It runs at 1.3 GHz and has a turbo frequency of 2.3 GHz. The CPU
has two cores and each core can run two threads simultaneously
(called hyperthreading in Intel CPUs). The NUC comes as a box
containing the motherboard and the CPU. Then you add one or two DRAM
modules and an SSD drive. I installed two DDR3L DRAM modules which
give me a total of 16 GByte of memory in the machine. In addition I
installed an SSD drive of 256 GByte. The box fits nicely into the
palm of your hand.

On top of this HW I installed Oracle Linux 7.

The configuration of the benchmark uses my laptop as benchmark driver
and uses the Intel NUC to run the data node. Since I only have one
Intel NUC available the benchmarks focused on a setup with only one
data node.

To start with I will focus this blog on the performance of a pure
main memory table that is not checkpointed: after a restart the table
still exists, but is empty. This kind of table is useful to keep
track of live state where it doesn't make sense to recover the table,
since the live state changes so often that the restored data would be
useless.

When recoverability is active on a table it adds lots of writes of
logs and checkpoints, which adds an extra overhead of about 30-50%
for inserts, deletes and updates. Reads are about as fast with
recoverability as without. I'll dig into this type of benchmark in a
later blog.

So to the results of the benchmark run. Inserts were processed at a
rate of 400k per second, updates at 402k per second and deletes at
433k per second. For reads I got up to 589k per second. For both
writes and reads we're operating here at full CPU speed and also at
full network speed (Gigabit Ethernet).

This benchmark was executed with one primary key of 8 bytes and one
field of 100 bytes.

I was curious to see what the top rate of key lookups would be with
very small records where the network speed wasn't a factor. So I
changed the field from 100 bytes to 8 bytes and reran the benchmark.
Now I got inserts at 476k per second, updates at 479k per second,
deletes at 474k per second and finally reads at 925k per second.
So close to a million reads per second on this tiny machine alone.

Given that I ran without logging to disk, most of the configuration
of the data node was pretty straightforward. The most important part
is obviously how to make use of the 4 CPU threads in the 2 available
cores.

As usual the LDM thread, which is where the actual database resides,
is the thread that requires the most CPU. In this particular
benchmark the TC threads also require a lot of CPU since they are
involved in each and every key lookup. For applications dominated by
scans the TC thread is a lot less involved. There is also a send
thread and a receive thread which are important to configure
correctly. These four thread types are the most important to
configure when running without recoverability. When adding
recoverability the configuration of the file IO threads is also
important. There are also a main thread, a replication thread and
some connection threads, but these have no impact on the performance
of this benchmark; they were configured to be on the CPU with the
lowest load.

The best results were achieved with only one thread per thread
type. The best config of those was to allow the LDM thread to
use its own CPU core and thus not use hyperthreading for this
thread to any great extent. Then one CPU thread in the other
core was used for the TC thread and the other CPU thread was
used for the send and the receive thread. This CPU thread also
handled the other threads.
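
A hedged sketch of what this could look like using the ThreadConfig
config parameter is shown below. This is not the exact configuration
I used, and the CPU numbering assumes that CPUs 0 and 2 are the
hyperthread siblings of one core and CPUs 1 and 3 of the other, which
you should verify on your own machine:

#sketch only; verify the hyperthread sibling numbering on your machine
ThreadConfig="ldm={count=1,cpubind=0},tc={count=1,cpubind=1},send={count=1,cpubind=3},recv={count=1,cpubind=3},main={count=1,cpubind=3},rep={count=1,cpubind=3}"

Here the LDM thread owns core 0 (its hyperthread sibling, CPU 2, is
left mostly idle), the TC thread gets CPU 1, and the send, receive
and remaining threads share CPU 3.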

When running with 2 LDM threads the number of threads increased and
the OS had to do more scheduling. Most likely this config could
provide better results under a more bursty load, but for sustained
good performance it was better to use only a few threads. The
performance difference was quite small, though, between using 1 LDM
and 1 TC compared to using 2 LDMs and 2 TCs.

The benchmark application was executed on my laptop connected
to the Intel NUC machine through a Gigabit Ethernet switch.

Monday, March 23, 2015

Quick setup to run sysbench using MySQL Cluster 7.4

In developing MySQL Server and MySQL Cluster we use four types of testing.
We use unit testing, we use functional testing, we use system testing and
we use benchmark testing.

For the unit tests and functional tests we use the MTR test
framework. This deals with most aspects of testing SQL queries
against the MySQL database.

For system testing the InnoDB team has a test framework that is used to ensure
that InnoDB will survive crashes of many sorts. In a similar fashion MySQL
Cluster uses the autotest framework that I described in an earlier blog:
Link to earlier blog.

For benchmark testing we have a few frameworks; I am going to describe
the one that I use on a regular basis. This framework can be
downloaded from:
Link to download.

I started developing this framework in 2006. So it has been extended for quite
some time and it is a lot simpler to run it now than it ever was before.

In this blog I am going to describe a very minimal setup needed to run
a sysbench benchmark against MySQL Cluster 7.4.

These benchmarks can be used to run the following types of tests:
1) Sysbench on a single server with InnoDB as storage engine
2) Sysbench on a single server with NDB as storage engine
3) Sysbench against InnoDB with InnoDB and sysbench on different servers
4) Sysbench against NDB with different programs on different servers
5) flexAsynch for MySQL Cluster testing on a set of servers
6) DBT2 on a single server with InnoDB as storage engine
7) DBT2 on a single server with NDB as storage engine
8) DBT2 using NDB as a storage engine against a lot of servers

Running against NDB as a storage engine means to use MySQL Cluster.
I mostly focus on 1) above for InnoDB testing. For MySQL Cluster I mostly
focus on 2) and 5). But occasionally I also test 8). 3), 4), 6) and
7) are perfectly possible setups and have been tested as well, but I don't
personally use those so much.

I will write a few blogs about how to setup a benchmark using these benchmark
scripts. I will first describe the basics of setting up a sysbench benchmark
against MySQL Cluster in this blog. Then a basic flexAsynch benchmark. Finally
a basic DBT2 benchmark on a single server.

After describing the basic configuration setup, I will describe in detail the
various parts to configure. There are numerous things to configure:

Mandatory items:
1) Which benchmark to run

2) Pointer to file locations for tarballs, data directory and install directory

3) Pointer to tarballs for MySQL, Sysbench and DBT2 benchmarks, also the
   benchmark scripts which are located in the DBT2 tarball.

4) Definition of the servers on which the various programs under test
   will be executed.

Optional parts of configuring SUT (System Under Test)
5) Configuration of the MySQL Server. The MySQL Server is started also
   in flexAsynch runs, but not really used in the flexAsynch benchmark.

6) Configuration of InnoDB if running test against InnoDB.

7) Configuration of NDB storage engine and NDB nodes if running against NDB
   storage engine.

Optional parts of configuring the actual benchmark
8) Sysbench setup. Setting up how to run the actual sysbench benchmark, for how
long to run, what sysbench variant to use, how many tables and a lot of other
things can be configured about how sysbench is to run. This part is obviously
only interesting when running a sysbench test.

9) DBT2 setup. There are numerous ways to adapt the basic DBT2 benchmark
   although the benchmark transactions are always the same.

10) flexAsynch benchmark setup. Also in flexAsynch there are multiple ways to
    adapt the benchmark.

11) In connection with the flexAsynch benchmark one can also benchmark restarts
    in MySQL Cluster. First the scripts use flexAsynch to load the database and
    then various forms of restarts can be tested.

The full capabilities of these benchmark scripts are quite extensive;
they are the result of 9 years of incremental design where we have
tried to extend the use cases of these benchmarks to support as many
environments as possible, but also to automate as much as possible.

Obviously to understand all of this in one go is a tad difficult. So we will
try to start by describing how to setup a very simple sysbench benchmark using
one machine running sysbench against the NDB storage engine using the latest
development version of MySQL Cluster 7.4.

This is a good first step to get acquainted with the benchmark scripts. I will
add more details in later blogs about running more advanced setups.

The scripts have been used mostly on Linux, with a fair deal of usage
on Solaris as well. When preparing this blog I ran them on Mac OS X.
This requires some changes that will soon arrive in an updated
version of the benchmark scripts. So preferably use Linux when trying
this out for the first time.

The first step in running this test is to create a directory where
the tests will be located on disk. I usually use my home directory
and create a directory called bench there. So on Mac OS X that would
be /Users/mikael/bench. On a Linux machine I would normally use
/home/mikael/bench.

In this directory I create four directories. I create one directory called
tarballs, this is where I will place the DBT2, sysbench and MySQL tarballs.
I create one directory called ndb, this will be used as the data directory
and will be automatically filled by the scripts. Next I will create a
directory mysql, this directory is where I will place the binary installation.
Finally I create a directory called sysbench.
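
In shell terms the directory setup is simply the following (using my
Mac OS X paths; substitute your own bench directory):

cd /Users/mikael
mkdir bench
cd bench
mkdir tarballs ndb mysql sysbench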

Next step is to download the DBT2 tarball and the sysbench tarball
from the download site linked to above.

Copy those tarballs to the tarballs directory created above.

Next step is to get the MySQL tarball that you want to test. It's ok
to use either a source tarball or a binary tarball. Personally I
obviously mostly work with source tarballs since I want to have full
control over everything, but a binary tarball works as well.

Place this tarball also in the tarballs directory.

Now in order to be able to run the benchmark scripts it is necessary to
get the top-level script copied from the DBT2 tarball. To do this perform
the following commands:

cd /Users/mikael/bench #use your bench directory here
cd tarballs
tar xfz dbt2-0.37.50.6.tar.gz
cp dbt2-0.37.50.6/scripts/ ..
rm -rf dbt2-0.37.50.6

Now ls from /Users/mikael/bench should look like this:
Mikaels-MacBook-Pro:bench mikael$ ls
mysql     ndb     sysbench     tarballs
Mikaels-MacBook-Pro:bench mikael$ pwd
/Users/mikael/bench
and ls from tarballs should look something like this:
Mikaels-MacBook-Pro:bench mikael$ cd tarballs
Mikaels-MacBook-Pro:tarballs mikael$ ls
dbt2-0.37.50.6.tar.gz mysql-cluster-gpl-7.4.5.tar.gz sysbench-

Now it's time to create the config file. This file is placed in a
directory of its own under the bench directory, with only this one
file in it. The file is always called autobench.conf.

Here is the outline of the content of the file in my simplistic test run.
#Define benchmark to run
#Define file locations of datadir, installdir and tarballs
#Define tarball versions
#Define engine to use
#Define way to build
#Define servers to use
#Configure MySQL Server
#Configure NDB part
#Configure sysbench
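
To make this skeleton concrete, here is a sketch of how the file
could be filled in. Only REMOTE_BIN_INSTALL_DIR,
USE_BINARY_MYSQL_TARBALL, NDB_REPLICAS and NDB_DISKLESS are variable
names confirmed in this text; the other names are illustrative
guesses, so check the README in the DBT2 tarball for the exact
spelling before relying on this:

#Sketch only -- names not confirmed in the text above are guesses
#Define benchmark to run
BENCHMARK_TO_RUN="sysbench"
#Define file locations of datadir, installdir and tarballs
TARBALL_DIR="/Users/mikael/bench/tarballs"
REMOTE_BIN_INSTALL_DIR="/Users/mikael/bench/mysql"
DATA_DIR_BASE="/Users/mikael/bench/ndb"
#Define tarball versions
MYSQL_VERSION="mysql-cluster-gpl-7.4.5"
MYSQL_BASE="5.6"
DBT2_VERSION="dbt2-0.37.50.6"
SYSBENCH_VERSION="sysbench-"   #version elided in the original post
#Define engine to use
ENGINE="ndb"
#Define way to build
USE_BINARY_MYSQL_TARBALL="no"
#Define servers to use
SERVER_HOST="127.0.0.1"
NDB_MGMD_NODES="127.0.0.1"
NDBD_NODES="127.0.0.1"
#Configure MySQL Server
MYSQL_SERVER_PORT="3316"
MALLOC_LIB=""
#Configure NDB part
NDB_REPLICAS="1"
NDB_DATA_MEMORY="2G"
NDB_INDEX_MEMORY="200M"
NDB_MAX_EXECUTION_THREADS="8"
#Configure sysbench
SYSBENCH_TEST="oltp_complex_rw"
SYSBENCH_ROWS="100000"
THREAD_COUNTS_TO_RUN="16;32"
MAX_TIME="60"
NUM_TEST_RUNS="1"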

First we need to define the benchmark to run, which is sysbench in
our case. We need to point out the tarball directory, the directory
where to place the installation (REMOTE_BIN_INSTALL_DIR), and also
the data directory.

Next step is to provide the names of the tarballs. Since the scripts
have to adapt to different MySQL versions we also need to specify the
base version of MySQL, which in the case of MySQL Cluster 7.4 is 5.6.
The scripts currently support 5.1, 5.5 and 5.6. Some adaptation is
needed for each new MySQL version, so 5.7 is not yet supported, but
work is under way for this.

Next we define if we want to use innodb or if we want to use ndb as the
storage engine in this test run.

We also define in the variable USE_BINARY_MYSQL_TARBALL whether we use a
source tarball or a binary tarball. In this case I use a source tarball.

We need to define the servers. In the case of sysbench there is
always one and only one MySQL Server; there can be one or many NDB
management servers, and there can also be one or many NDB data nodes.
But in this case we wanted a simple setup on one machine, so here
there is one of each sort and they are all located on the local host.

We always configure the MySQL Server port to ensure we don't have a
problem running the benchmark if another MySQL Server is running on
the host. By default the scripts use libtcmalloc instead of libc
malloc; we avoid this default since there are better malloc libraries
and this variable should be set purposely and not by default in most
cases. But setting it to not use a special malloc library will
definitely work, so we do that here as a first step.
Since we run with NDB storage engine we need to set up a few configuration
variables. The NDB configuration can be quite extensive, but here we strive
for a simplistic one. We want only one replica which we set using NDB_REPLICAS.
We need to set the size of the NDB data memory and the NDB index memory.
Finally we set the number of execution threads to 8 which is pretty standard.
This should be sufficient for a simple benchmark run of sysbench.
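
Under the hood these script variables map to the cluster
configuration (config.ini) roughly as follows. The parameter names
are the real NDB ones; the values are illustrative only:

[ndbd default]
#illustrative values matching the simple setup described above
NoOfReplicas=1
DataMemory=2G
IndexMemory=200M
MaxNoOfExecutionThreads=8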

The final section sets up a minimalistic set of parameters to run
sysbench. We run the OLTP complex RW case. We insert 100.000 rows
into the table and we run two tests, one with 16 MySQL connections
and one with 32 connections. Each test will run for 1 minute and we
will only run each test once.

Now that is a minimal configuration. It is possible to make it even
smaller, but then it's hard to understand what the test actually
does, since one needs to be an expert in the default settings, which
even I don't remember for all variables.

Now this file autobench.conf is placed in /Users/mikael/bench/sysbench
in my case.

Now it is time to run the test case.

The test is executed by the following commands:
cd /Users/mikael/bench
./ --default-directory /Users/mikael/bench/sysbench

The test will run everything until completion. Actually I more or
less always use one more parameter, --skip-cleanup. This means that I
don't have to repeat the building of the source code if I want to
rerun the test. If I then run it a second time I should use both
--skip-cleanup and --skip-build to ensure that I skip rebuilding the
source and skip cleaning away the built source code at the end of the
test run.
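
Putting this together, a typical rerun looks something like the lines
below. The name of the top-level script is elided in this post;
bench_prepare.sh is my guess at it, so verify the actual name against
the scripts directory of the DBT2 tarball:

cd /Users/mikael/bench
#bench_prepare.sh is a guess at the elided script name
./bench_prepare.sh --default-directory /Users/mikael/bench/sysbench \
  --skip-cleanup --skip-build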

The test result is located in the file:

The output from the sysbench program in this particular case can be found
in the file:

Good luck in benchmarking.

My favourite tool for following the benchmarks is top. I usually lock
the various threads onto different CPUs to ensure that I can
understand how the different threads behave. I haven't done this here
since this is just a basic setup. But top is still a good tool to
follow what is going on in the server.

One final note is that running this basic setup will require 20-30
GByte of disk space, and during the test run there will be a fairly
high load on the disk. As many developer machines are not suitable
for benchmarks of database engines, one way to adapt the benchmark
for a developer environment is to set the config parameter
NDB_DISKLESS="yes".
This essentially makes the file system behave as if it were located
on /dev/null.

Friday, March 20, 2015

Controlling checkpoint speed in MySQL Cluster 7.4

A question from digitalpoint on a previous blog about 7.4 restarts requires
an explanation of how we control checkpoint speed in MySQL Cluster 7.4.
Since the explanation is fairly lengthy I will do it in this blog instead
of as a reply to a blog comment.

First of all some introduction to checkpoints in MySQL Cluster 7.4.
We actually use the term checkpoint for two things. We have LCPs
(local checkpoints) and GCPs (global checkpoints). LCPs are the
traditional checkpoints and the ones that will be described in this
blog. GCPs are about forming groups of transactions that will be
durable after a restart. GCPs happen very often whereas an LCP is a
fairly long process.

So first I'll explain why we're doing checkpoints (LCPs) in the first
place. There are two reasons for doing LCPs. The first is that MySQL
Cluster uses a log-based approach for recovery. Every update, insert,
delete and write is logged into the REDO log in MySQL Cluster. We
also have an UNDO log for disk data. To avoid having the logs grow to
infinity we use LCPs to be able to cut the log tail.

The logs have a fixed size; the REDO log cannot be changed in size,
whereas for the UNDO log you can add new files to a logfile group.
When a log is full no updates are allowed anymore, since we can no
longer recover the database if we allow non-logged updates.

All the above is true for tables that require durability. MySQL Cluster
actually also supports non-durable tables. These tables will be available
as long as the cluster survives, if the entire cluster fails, then those
tables will be restored to an empty state. Most applications need durable
tables, but there are also applications where durability doesn't make
sense and therefore it is better to run those tables without durability.

Checkpointing is obviously only relevant to durable tables where we recover
based on a combination of logs and checkpoints.

An LCP is used to cut the log tail. This means that we always need to
write the checkpoint at such a speed that we don't get to a point
where we run out of log (both REDO log and UNDO log). As an example,
assume that we can execute 100.000 updates per second at top speed,
each updating 20 fields with a total of 200 bytes in size. The REDO
log record in this case will be about 200 bytes, plus 20 * 4 bytes of
field info and around 70 bytes of fixed overhead. So we create about
350 bytes times 100.000 per second of REDO log records, which is
35 MByte per second. Assume also that the data size is 50 GByte. We
can either calculate the REDO log size based on the choice of
checkpoint write speed, or the other way around, calculate the
checkpoint write speed based on the REDO log size. If we start with a
checkpoint write speed of 100 MByte per second then it will take
about 500 seconds to write an LCP. We need to have enough REDO log
for at least 2 checkpoints and it is good to have at least 100%
safety margin. So the REDO log needs to be big enough for 2000
seconds of log generation, which means 70 GByte of REDO log should be
sufficient. The size of the REDO log is the product of the number of
log parts, the number of log files and the log file size.
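
In config.ini terms the 70 GByte of this example could be provided
along the following lines. The parameter names are the real NDB ones;
the values are just one way to reach the target size (rounding up to
72 GByte):

NoOfFragmentLogParts=4
NoOfFragmentLogFiles=18
FragmentLogFileSize=1G
#4 parts * 18 files * 1 GByte = 72 GByte of REDO log
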
So being able to cut the log tail is the only reason for running LCPs in
normal operation when there are no restarts ongoing. Obviously cutting the
log to be short also means shorter restart times since there is less
log to execute. So increasing checkpoint write speed means faster restart
times at the expense of more work in normal operation.

We also use the LCPs as part of restarts. We need to run an entire LCP where
the starting node participates to ensure that the starting node is
recoverable. In this scenario we simply want the LCP to run as fast as
possible to complete the restart as soon as possible.

So this means that we have two different speeds for checkpoint write speeds,
we have the speed needed during normal operation and we have the speed needed
for restarts. Actually we can even separate out a total cluster restart
(we call this a system restart) from node restarts. The reason is
that during an LCP in a system restart there is no activity other
than performing the LCP. Thus we can set the checkpoint write speed
to the absolute maximum possible. During another node's restart we
also need to execute user transactions, so we want to find a balance
between speeding up the restart and being able to service current
users of the cluster.

So this means that we need three config parameters. We call them
MaxDiskWriteSpeed (speed during normal operation),
MaxDiskWriteSpeedOtherNodeRestart (speed while another node is
performing a node restart) and MaxDiskWriteSpeedOwnRestart (speed
during our own node restart, and more importantly, of all nodes
during a system restart).
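
As a hedged illustration, a config.ini fragment using these
parameters could look as follows. The values happen to match the
documented 7.4 defaults, but treat them as an example rather than a
recommendation:

[ndbd default]
#values match the documented 7.4 defaults; shown only as an example
MaxDiskWriteSpeed=20M
MaxDiskWriteSpeedOtherNodeRestart=50M
MaxDiskWriteSpeedOwnRestart=200M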

Obviously there is a cap to the write speed we can handle: there is
both a limit to what the disk drives can handle and a limit to how
much each CPU running an LDM thread can output.

My experiments show that one CPU running an LDM thread can output
about 100 MByte per second if used only for generation of checkpoint
writes. The checkpoint format is essentially packed rows in a format
that can be restored by simply running normal inserts. We have
changed back and forth a bit between different checkpoint formats; we
used page-based schemes that required UNDO logging also of memory
tables in earlier versions of MySQL Cluster. The benefit of the
current scheme is that it enables us to avoid UNDO logs entirely for
memory tables.

Modern disks can write a few hundred MBytes per second. Even cheap SSDs can
handle up to about 500 MBytes per second of writes. The disk has to handle
both writing of checkpoints and writes of the REDO log (and also of the
UNDO log for disk data).

In 7.3 and earlier the above config parameters were named
differently. They were named DiskCheckpointSpeed (more or less equal
to MaxDiskWriteSpeed in 7.4) and DiskCheckpointSpeedInRestart (more
or less equal to MaxDiskWriteSpeedOwnRestart).
MaxDiskWriteSpeedOtherNodeRestart has no equivalent in 7.3 and
earlier versions; DiskCheckpointSpeed is used also when another node
is restarting. This is one major factor in how we can speed up node
restarts and rolling upgrades.

One more thing is that the execution of LCPs has been improved in
7.4. In 7.4 all LDM threads can write checkpoints in parallel. The
config parameter specifies the total speed of the node, thus each LDM
thread will use a part of this speed. So a write speed set to
100 MByte per second with 4 LDM threads means that each LDM thread
will write at about 25 MByte per second, and thus use about 25% of
the available CPU capacity in the LDM thread for checkpoints. In
earlier versions up to 2 LDM threads could write in parallel and each
such LDM thread used the full value of the config parameter
DiskCheckpointSpeed/DiskCheckpointSpeedInRestart. Thus the write
speed of the node was usually twice as much as the config parameter
was set to.

The change in 7.4 to enable all LDM threads to write in parallel
means that we can set the write speed higher in 7.4 than earlier. The
CPU overhead of checkpointing is spread over all LDM threads and this
ensures that we are able to increase the speed without hurting normal
user operation.

There is also one more innovation in 7.4, seen in the config
parameter MinDiskWriteSpeed.

The problem here is that checkpoint writes compete for CPU capacity
with normal user operations. So if the user needs to run transactions
at the very highest speed, it is a good idea to slow down the CPU
usage from writing checkpoints.

Additionally if we experience an IO lag where the disk gets overloaded
we also want to slow down the writing of checkpoints.

What we have done here is that we have implemented an adaptive algorithm
that will slow down the checkpoint speed when we either have a CPU
overload situation or we have an IO overload situation. To get a balanced
adaptive algorithm we decrease speed faster than we increase the speed.

For CPU overload we start reacting when the CPU usage reaches 95% or more.
We use a mechanism to see how much CPU time the LDM thread has had each
second. This means that this algorithm will work best in an environment
where the LDM threads always have access to CPU time if they need to.
If the LDM threads are sharing CPU time with many other threads and processes
the data node will still function but this particular adaptive algorithm
won't work very well.

Personally I always set up my data nodes using CPU locking with the
ThreadConfig config parameter to get the optimal stable performance from
the data nodes.

It is actually even possible to setup a real-time environment with MySQL
Cluster, more on the how-tos of that in a separate blog.

IO lag is discovered by analysing how many seconds' worth of REDO log
is on its way to disk but hasn't yet been written.

The adaptive algorithm will always ensure that we never go below
MinDiskWriteSpeed, independent of CPU load and IO load. This
parameter should be set at a level that never jeopardizes the system,
so that we don't risk the log tail meeting the log head because LCPs
don't complete quickly enough.

The maximum checkpoint write speed is then based on whether we are in normal
operation, whether another node is currently restarting or whether we are
restarting our own node.

With this thorough introduction to the execution of LCPs in MySQL
Cluster 7.4 it is possible to answer digitalpoint's question.

In normal operation, when there is no high load on the node and no
overload of the IO subsystem, we will write LCPs at a constant speed
of MaxDiskWriteSpeed. We will approach MinDiskWriteSpeed in a
scenario with overload of either the CPUs or the IO subsystem due to
many transactions being executed by the users of MySQL Cluster.

So MinDiskWriteSpeed should be set at the minimum level required to
ensure that we don't run out of REDO log space. MaxDiskWriteSpeed
should be set to a level dependent on how fast restarts we want to
have; since this is the write speed used most of the time, one should
also take into account the possibility of wearing out the disk when
setting this parameter.

MaxDiskWriteSpeedOtherNodeRestart should be set quite high to enable
fast restarts. We should still not set it drastically high, to avoid
running into unnecessary overload situations during the node restart.

MaxDiskWriteSpeedOwnRestart can be set to a level that can be handled by
the IO subsystem and CPU subsystem since it has very little other
activity to handle during a restart.

The adaptive algorithm makes it possible to be fairly aggressive in
setting the last two config parameters, since the adaptive algorithm
will ensure that we still maintain a healthy level of CPU usage for
user transactions.

A final word is that it is possible to change those parameters in any
restart. Actually we even implemented an undocumented feature whereby
you can change these parameters in live nodes. The reason we haven't
documented this is that it is intended for people who completely
understand how their system behaves; we consider that they should be
able to read the code of Backup.cpp to understand how to change these
parameters. A change done through this interface won't survive a
restart; to make it permanent one needs to change the configuration.

We have also added a new NDBINFO table making it possible to follow
how the checkpoint write speed and other disk writes are progressing
in your own cluster. More on that in a later blog.

Wednesday, March 11, 2015

Methods of optimised module interaction in MySQL Cluster 7.4

One of the important optimisations we did in MySQL Cluster 7.4 was to
make more use of direct function calls in SCAN processing instead of
relying on asynchronous messages.

We want each message execution to be short; we should strive to keep
it below around 10 microseconds of execution per message. However, if
we decrease the time spent executing a message down to a microsecond
or smaller, then we spend too much time in the message scheduler. So
it is important to strike a balance here.

For primary key lookups we do most of the work in one message
execution, so here we were already behaving optimally in our model of
execution.

For scans however we often scan hundreds or even thousands of rows,
and in some cases even millions, so here there is a lot of
opportunity to decide how to execute the scan.

Previously the scan model used a number of messages per row scanned.
This was quite obviously a bit too conservative. We have now made
optimisations such that we can scan up to 5 rows within the execution
of one message.

When converting an asynchronous message into a function call there is
more than one way to achieve this. There are ways that are extremely
fast, but these leave no extra opportunity for tracing. So for each
such call we have to make a decision.

The excerpt below is taken from the code in MySQL Cluster 7.4
and explains the different ways to convert asynchronous messages
into function calls.

When calling another block immediately there is a set of ways to
do this.

Standard, non-optimised manner:
Use EXECUTE_DIRECT with four parameters that specify the block
reference of the receiver, the signal object, the global signal
number and the length of the signal (there is also a variant used
when sending such a signal to a different instance). This method is
fairly optimised but has quite a lot of potential for performance
improvement.

Standard, optimised manner:
In this case we optimise things by translating to the block object
and retrieving the function pointer in the block call. This gives the
compiler assistance to separate the loads and stores more from each
other.
The call will be seen as:
block->EXECUTE_DIRECT(signal, f);

This manner optimises the code but retains the flexibility and also
the possibility to trace signal execution between blocks.

Non-standard, optimised manner:
In this case we remove some of the flexibility of the call to enhance
the performance yet a bit more. We remove the possibility of a
flexible receiver of the signal; it is directed to a certain block
and we also call the method directly without any indirection.
The call will be seen as e.g.:
c_tup->execTUPKEYREQ(signal);

Both standard manners of calling EXECUTE_DIRECT always call functions
with one parameter, the signal object, and no return value. There are
however two ways of sending signals back. In some cases a signal is
always sent back when returning from the EXECUTE_DIRECT call; in this
case the returned signal object contains the data of the return
signal. In some cases one gets a return signal in this manner only
for a combination of the signal number and the parameters. In the
example with NEXT_SCANREQ we always get a return signal when
specifying ZSCAN_COMMIT but not in other cases.

The other manner of sending a return signal is to perform a new
EXECUTE_DIRECT call. In those cases one needs to ensure that the call
chain is bounded so as not to run out of stack. It is also a good
idea to try to ensure that the EXECUTE_DIRECT call can use the
tail-call optimisation, to avoid using too much stack, which is bad
for CPU caching. This means returning immediately after
EXECUTE_DIRECT, but also avoiding objects that need destruction after
return and avoiding taking references of stack variables.

Using the non-standard manner one can obviously also change more
things: one can return a bool, for example, as in the example with
execTUPKEYREQ; one can add parameters; and one can even change the
name to something not following the standard naming conventions.
Obviously doing this removes the flexibility of using blocks in a
flexible manner.