Monday 30 June 2014

OrientDB and ZFS Part II - (almost) real use case

The first time I analyzed the OrientDB performances related to ZFS tuning, I used a truly simple "import database" experiment.
This time, thanks to +Luigi Dell'Aquila from Orient Technologies LTD, we can analyze an (almost) real workload, filling an empty database with more than a million of vertexes and edges from scratch.
I used the same environment (host, disks, OS, jdk, ...) of the first analysis, but with OrientDB 1.7.4 community edition.
For general considerations, it was useful this reading.

Workload description

The assumption is that a graph can be expressed as a series of triplets, each of which representing a "Vertex → Edge → Vertex" relation.

From DBPedia web site, you can download many RDF files representing the Wikipedia's Italian version. My choice was the first file in the list: "Article Categories", whose first lines are shown below as an example:

<http://it.dbpedia.org/resource/Anni_1940> <http://purl.org/dc/terms/subject> <http://it.dbpedia.org/resource/Categoria:Decenni_del_XX_secolo>
<http://it.dbpedia.org/resource/Anni_1900> <http://purl.org/dc/terms/subject> <http://it.dbpedia.org/resource/Categoria:Decenni_del_XX_secolo>
<http://it.dbpedia.org/resource/Anni_1930> <http://purl.org/dc/terms/subject> <http://it.dbpedia.org/resource/Categoria:Decenni_del_XX_secolo>

An "Importer" Java program reads each line, checks if the first and third elements already exist, creates the vertexes and the edge between them.
Here the main source code:

public class Importer {

   public static void main(String[] args) throws Exception {
      if (args.length != 4) {
         System.out
            .println("sintassi corretta: Importer ");
         return;
      }
      creaSchema(args[0], args[1], args[2], args[3]);
      importa(args[0], args[1], args[2], args[3]);
   }

   private static void creaSchema(String fileName, String dbUrl, String user, String pw) throws Exception {
      OrientGraphNoTx db = new OrientGraphNoTx(dbUrl, user, pw);
      OSchema schema = db.getRawGraph().getMetadata().getSchema();
      OIndexManagerProxy indexMgr = db.getRawGraph().getMetadata().getIndexManager();
      OClass nodo = null;
      if (schema.existsClass("Nodo")) {
         nodo = schema.getClass("Nodo");
      } else {
         nodo = schema.createClass("Nodo", schema.getClass("V"));
      }
      if (!nodo.existsProperty("name")) {
         nodo.createProperty("name", OType.STRING);
      }
      if (!indexMgr.existsIndex("Nodo_name")) {
         nodo.createIndex("Nodo_name", INDEX_TYPE.UNIQUE, "name");
      }

      OClass arco = null;
      if (schema.existsClass("Arco")) {
         arco = schema.getClass("Arco");
      } else {
         arco = schema.createClass("Arco", schema.getClass("E"));
      }
      if (!arco.existsProperty("name")) {
         arco.createProperty("name", OType.STRING);
      }
      db.shutdown();
   }

   private static void importa(String fileName, String dbUrl, String user, String pw) throws Exception {
      File file = new java.io.File(fileName);
      FileInputStream input = new java.io.FileInputStream(file);
      BufferedReader reader = new java.io.BufferedReader(new java.io.InputStreamReader(input));

      OrientGraph gdb = new OrientGraph(dbUrl, user, pw);

      reader.readLine();
      String result = reader.readLine();
      while (result != null && result.trim().length() > 0) {
         String[] parts = result.split(" ");
         if (parts.length > 3 && parts[3].equals(".")) {
            Vertex vertex1 = createVertex(parts[0], gdb);
            Vertex vertex2 = createVertex(parts[2], gdb);

            Edge edge = vertex1.addEdge("Arco", vertex2);
            edge.setProperty("name", parts[1]);

            gdb.commit();
         }
         result = reader.readLine();
      }
      gdb.commit();
      gdb.shutdown();
   }

   private static Vertex createVertex(String name, OrientGraph gdb) {
      Iterable queryResult = gdb.command(new OSQLSynchQuery("select from Nodo where name = ?")).execute(name);
      for (Vertex v : queryResult) {
         return v;
      }
      Vertex result = gdb.addVertex("class:Nodo");
      result.setProperty("name", name);
      return result;
   }
}

Before starting the loading process, the Importer program checks if the basic schema already exists, if not it will be created.
Here the instructions if you prefer to create the schema using the OrientDB console:

CREATE DATABASE remote:localhost/dbpedia root root plocal

CONNECT remote:localhost/dbpedia admin admin

CREATE CLASS Nodo EXTENDS V
CREATE PROPERTY Nodo.name STRING
CREATE INDEX Nodo_name on Nodo(name) UNIQUE_HASH_INDEX

CREATE CLASS Arco EXTENDS E
CREATE PROPERTY Arco.name STRING


At last, you can execute the Java program:

java -cp "$ORIENTDB_HOME/lib/orientdb-server-1.7.4.jar:$ORIENTDB_HOME/lib/*:." Importer /opt/article-categories.txt remote:localhost/dbpedia admin admin

When the Import Java program finishes, you have quite a lot of records:

orientdb {dbpedia}> select count(*) from V

----+------+------
#   |@RID  |count
----+------+------
0   |#-1:-1|942949
----+------+------

orientdb {dbpedia}> select count(*) from E

----+------+-------
#   |@RID  |count
----+------+-------
0   |#-1:-1|1390193
----+------+-------

Workload characterization

As a single threaded process, we expect that the "Importer" program is going to load the host at its limits, without generating queues on its main resources.

For the purpose of this article is not very interesting to make a deep workload analysis but, since the whole process last around 1 hour, I ran the "Importer" program capturing system activities with sar, recording metrics every 60 seconds, for 60 minutes:

sar -A -o temp 60 60

First, let's verify the disk load:

root@globalZone:~# sar -d -f temp
[...]
Average  device      %busy   avque   r+w/s  blks/s  avwait  avserv
         adpu3200        0     0.0       0       0     0.0     0.0
         adpu3201       18     1.2      67   18397     0.0    17.8
         sd0            14     1.1      35   14287     0.0    30.7
         sd0,a           0     0.0       0       0     0.0     0.0
         sd0,b          14     1.1      35   14287     0.0    31.1
         sd1             5     0.1      29    4110     0.0     3.8
         sd1,a           5     0.1      29    4110     0.0     3.8

In particular, the column "avwait" (average wait time) confirms our assumption.


Further confirmations come from "queue lenght" e "CPU utilization" reports:

root@globalZone:~# sar -q -f temp
[...]
         runq-sz %runocc swpq-sz %swpocc
Average      1.3      35     0.0       0


root@globalZone:~# sar -u -f temp
[...]
            %usr    %sys    %wio   %idle
Average       48      14       0      38


Getting started

OrientdbZone configuration


I have 2 disks, each in one zpool, each zpool dedicated to a specific function: data files and Write Ahead Log:
root@globalZone:~/# zpool status
  pool: databases
 state: ONLINE
  scan: none requested
config:

        NAME       STATE     READ WRITE CKSUM
        databases  ONLINE       0     0     0
           c7t1d0  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME      STATE     READ WRITE CKSUM
        rpool     ONLINE       0     0     0
          c7t0d0  ONLINE       0     0     0

errors: No known data errors

I created a zfs file system for each zpool, without special option:
root@globalZone:~/# zfs create -o mountpoint=legacy databases/wal
root@globalZone:~/# zfs create -o mountpoint=legacy rpool/datafiles

I configured my orientdbZone in order to properly use these zfs file systems:
root@globalZone:~# zonecfg -z orientdbZone
zonecfg:orientdbZone> add fs
zonecfg:orientdbZone:fs> set type=zfs
zonecfg:orientdbZone:fs> set special=rpool/datafiles
zonecfg:orientdbZone:fs> set dir=/opt/orientdb/databases
zonecfg:orientdbZone:fs> end
zonecfg:orientdbZone> add fs
zonecfg:orientdbZone:fs> set type=zfs
zonecfg:orientdbZone:fs> set special=databases/wal
zonecfg:orientdbZone:fs> set dir=/opt/orientdb/wal
zonecfg:orientdbZone:fs> end
zonecfg:orientdbZone> verify
zonecfg:orientdbZone> commit
zonecfg:orientdbZone> exit

Before every test, I shutdown the orientdbZone, destroyed-recreated the zfs file systems, rebooted the orientdbZone.

Preliminary analysis


Using this simple DTrace program, I ran the "Import" Java program the first time only for gathering information about the page sizes written on the file system:
#! /usr/sbin/dtrace -Zs

syscall::*write*:entry
{
   self->fd = arg0
}
syscall::*write*:return
/(fds[self->fd].fi_mount == "/path/to/orientdb/databases") || (fds[self->fd].fi_mount == "/path/to/orientdb/wal")/
{
   @distrib[fds[self->fd].fi_fs, probefunc, fds[self->fd].fi_mount] = quantize(arg1);
}
syscall::*write*:return
/(fds[self->fd].fi_mount == "/path/to/orientdb/databases") || (fds[self->fd].fi_mount == "/path/to/orientdb/wal")/
{
   self->fd = 0;
}
dtrace:::END
{
   printa("\n %s %s %s %@d", @distrib);
}


The distribution is quite regular:
 CPU     ID                    FUNCTION:NAME
   1      2                             :END
 zfs write /path/to/orientdb/databases
           value  ------------- Distribution ------------- count
               0 |                                         0
               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    60
               2 |                                         0
               4 |                                         0
               8 |                                         0
              16 |                                         0
              32 |@@@                                      5
              64 |                                         0

 zfs write /path/to/orientdb/wal
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 102876
          131072 |                                         0

 zfs pwrite /path/to/orientdb/databases
           value  ------------- Distribution ------------- count
               0 |                                         0
               1 |                                         5
               2 |                                         0
               4 |                                         0
               8 |                                         5
              16 |                                         0
              32 |                                         0
              64 |                                         0
             128 |                                         0
             256 |                                         0
             512 |                                         0
            1024 |                                         0
            2048 |                                         0
            4096 |                                         0
            8192 |                                         0
           16384 |                                         0
           32768 |                                         0
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 289755
          131072 |                                         0

So, this time we cannot have doubts: we have to setup the zfs recordsize=64k.

As verified with the previous analysis, if you are using a dedicated WAL file system, the most important factor is the write() time to this file system.

Test #1: Compression

Having fixed the workload and related ZFS recordsize, I tried to compare different behaviors with different compression strategies.

OrientDB has its own compression strategies, based on snappy or gzip libraries.

ZFS has different compression algorithms:
  • lzjb  - optimized for performances 
  • gzip-N (1 ≤ N ≤ 9) - best compression if N > 5
  • zle - fast, but only compresses sequences of zeros
The default compression factor for gzip is 6.

Of course we have to evaluate the compression ratio, related to the whole execution time:

SecondsMegabytes
Total Import Timedisk usage waldisk usage data files
zfs compression=on36481911242
zfs compression=gzip48311147146
zfs default, OrientDB gzip=on45684090838
zfs default, snappy=on36924090813
zfs compression=zle35362803594

The OrientDB compression strategies seem not so useful in execution time nor in disk space saved.

In general we see that the I/O time isn't so important for this kind of workload, and the time saved in writes are balanced by the time spent in compression, so we cannot save much execution time, but we can save a lot of disk space:


The best compromise is to use ZFS default compression, combined with OrientDB default settings.

Test #2: Backup

Even dealing with backups, you have to compare OrientDB and ZFS compression strategies.
Here is the ODB config file part related to the backup options:

<handler class="com.orientechnologies.orient.server.handler.OAutomaticBackup">
  <parameters>
    <parameter name="enabled" value="false" />
    <parameter name="delay" value="4h" />
    <parameter name="target.directory" value="backup" />
    <parameter name="target.fileName" value="${DBNAME}-${DATE:yyyyMMddHHmmss}.zip" />
    <parameter name="db.include" value="" />
    <parameter name="db.exclude" value="" />
    <parameter name="compressionLevel" value="9"/>
    <parameter name="bufferSize" value="1048576"/>
  </parameters>
</handler>

The default compression level is set to the maximum value (value=0 means no compression).

First of all, we have to verify the page size distribution when OrientDB save the backup file:

root@globalZone:~/# ./rw_bytesBackup.d
dtrace: script './rw_bytesBackup.d' matched 13 probes

 CPU     ID                    FUNCTION:NAME
   0      2                             :END
 zfs write /zones/orientdbZone/root/opt/orientdb-community-1.7.4/backup
           value  ------------- Distribution ------------- count
           65536 |                                         0
          131072 |                                         1
          262144 |                                         0
          524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 811
         1048576 |                                         0


OrientDB saves the backup file writing only pages of 512 KB

How much time we can save using the right recordsize for the ZFS backup partition?


Not so much...

Since the OAutomaticBackup class puts the data files in read only mode, my opinion is to try to complete the backup process as soon as possible, therefore we shouldn't use compression at all:

exec time (ms)
benchmark287872
ZFS compression=on31803
ZFS compression=gzip-941654
NO ZFS compression25704
NO ZFS compression, recordsize=512k20333
if data files are compressed15024



But, what about disk space?


We cannot waste so much disk space, so my suggestion is to create a new ZFS partion having the most powerful compression algorithm (compression=gzip-9) aside the backup partition, then, using a scheduled script, move the backup files into the archive partition:

root@globalZone:~/# zfs create -o mountpoint=legacy -o compression=gzip-9 databases/archive root@globalZone:~# zonecfg -z orientdbZone
zonecfg:orientdbZone> add fs
zonecfg:orientdbZone:fs> set type=zfs
zonecfg:orientdbZone:fs> set special=databases/archive
zonecfg:orientdbZone:fs> set dir=/opt/orientdb/archive
zonecfg:orientdbZone:fs> end
zonecfg:orientdbZone> verify
zonecfg:orientdbZone> commit
zonecfg:orientdbZone> exit
root@globalZone:~# zoneadm -z orientdbZone shutdown
root@globalZone:~# zoneadm -z orientdbZone boot
root@globalZone:~# zlogin orientdbZone

root@orientdbZone:~# du -k /opt/orientdb/backup/dbpedia-20140625182046.json.gz
831031 /opt/orientdb/backup/dbpedia-20140625182046.json.gz

root@orientdbZone:~# time mv /opt/orientdb/backup/dbpedia-20140625182046.json.gz /opt/orientdb/archive/

real 2m19.710s
user 0m0.004s
sys 0m5.678s

root@orientdbZone:~# du -k /opt/orientdb/archive/dbpedia-20140625182046.json.gz
98764 /opt/orientdb/archive/dbpedia-20140625182046.json.gz

This way the compression process is decoupled and become asynchronous, and you can achieve the best performances in execution time and space saved.
Of course, you can use an external ZFS storage mounted via NFS.

Here we can underline one important difference between Solaris ZFS and open-zfs, used in Illumos-based systems, FreeBSD and MAC OSX.
Whereas open-zfs has a some useful extra features, the Solaris version gives you the opportunity to use a greater recordsize, very useful for such a stuff:
root@globalZone:~/# zfs set recordsize=1M databases/archive
root@orientdbZone:~/# time mv dbpedia-20140627101455.json.gz /opt/orientdb/archive/

real 0m17.464s
user 0m0.004s
sys 0m6.186s


Another, more expensive, strategy is to have a backup dedicated instance in a distributed topology, so that the main instances never slow down because of the backup process.

Test #3: Lucene indexes

Few weeks ago +Enrico Risa developed an OrientDB plugin which gives you the opportunity to use Lucene engine for full-text and spatial indexes:
orientdb> connect remote:localhost/dbpedia admin admin
Connecting to database [remote:localhost/dbpedia] with user 'admin'...OK
orientdb {dbpedia}> CREATE INDEX Nodo.content on Nodo (name) FULLTEXT ENGINE LUCENE;


Creating index...
2014-06-27 14:43:48:380 INFO - Rebuilding index dbpedia.Nodo.content (estimated 942949 items)... [OIndexRebuildOutputListener]
2014-06-27 14:43:58:383 INFO --> 1.13% progress, 10,653 indexed so far (1,065 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:44:10:663 INFO --> 12.20% progress, 115,065 indexed so far (10,441 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:44:20:716 INFO --> 26.91% progress, 253,784 indexed so far (13,871 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:44:30:717 INFO --> 44.41% progress, 418,770 indexed so far (16,498 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:44:40:718 INFO --> 60.86% progress, 573,886 indexed so far (15,511 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:44:50:719 INFO --> 80.00% progress, 754,379 indexed so far (18,049 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:45:00:849 INFO --> 98.30% progress, 926,889 indexed so far (17,251 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:45:01:896 INFO --> OK, indexed 942,949 items in 73,516 ms [OIndexRebuildOutputListener]
Created index successfully with 942949 entries in 74.220001 sec(s).

Index created successfully

Despite it isn't ready for production environments yet, I've found his work very interesting, surely it is for my projects!

Let's start investigating its behavior.
As always, the first run is for the the page size distribution analysis during the index creation, having previously created a dedicated ZFS partition:
root@globalZone:~/# ./rw_bytesIndex.d
dtrace: script './rw_bytesIndex.d' matched 13 probes

 CPU     ID                    FUNCTION:NAME
   0      2                             :END
 zfs write /your/path/to/dbpedia/luceneIndexes
           value  ------------- Distribution ------------- count
               4 |                                         0
               8 |                                         2
              16 |                                         5
              32 |                                         30
              64 |                                         1
             128 |                                         23
             256 |                                         13
             512 |                                         14
            1024 |                                         30
            2048 |                                         11
            4096 |                                         72
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  8983
           16384 |                                         0


So we have to use recordsize=8k fot this partition.

Here the performance comparison among different ZFS tuning:

creation time (ms)disk usage (KB)
recodsize=8k8273836800
default8296336800
compression=on8202822764

Writing only 36 MB in 82 seconds, doesn't let you to observe the gaining of having the right recordsize, because the process spend the most part of the time in reading the source and applying the archive algorithm.

In particular, our source isn't very suitable for this test, because every property indexed is a single word, with underscores instead of spaces.

In fact, the whole test isn't so meaningful, except for the disk space you can save, because, in my experience, a full-text index must be very fast in reading, not in writing.

Imagine you want to index the content of hundreds of thousands of PDF documents: you may can accept an asynchronous new document loading (and indexing), but surely you want a fast enough search system.

In this case, my suggestion is to use a dedicated SSD device.

Extra considerations


  • Using ZFS file system can reduce the cost and complexity of managing data and storage devices. Storage product capabilities often overlap ZFS features, and my suggestion is to rely on ZFS ones, because they give you more flexibility. but if you are comfortable with traditional hardware RAID protection, then use it!
  • Since ZFS copy-on-write feature relocates all writes to free disk space, keeping a certain amount o free blocks allows ZFS to easily find space in big chunks, and allows ZFS to aggregate writes and reduce the write IOPS demand on hard disks.
  • ZFS manages synchronous writes internally with the ZFS intent log (ZIL). By default, synchronous writes are stored in the ZIL at low latency and shortly thereafter (within approximately 5 seconds), ZFS does a bulk update of the pool state by committing a transaction group. Another way exists for the ZIL to handle synchronous write activity, which is of slightly higher latency, but also involves almost half as much traffic on the channel to the storage array. This mode is triggered by the logbias=throughput value (thanks +Manuel Zach). This mode is recommended for file systems holding data files since writes to those files are highly threaded background operations from the database perspective, which does not warrant the utmost latency requirements. Using logbias=throughput for data files frees up the storage resource and as a consequence, allows the WAL to be handled with lower latency, especially if the storage pool is used by many zones.

Conclusions

The following table summarizes my suggestions about ZFS tuning for every partition you should use:

FILE SYSTEMRECORDSIZELOGBIASCOMPRESSION
data files64kthroughputon
WAL64klatencyon
backup512klatencyoff
archivemax availablethroughputgzip-9
Lucene indexes8kthroughputon