Discussion:
Riak performance problems when LevelDB database grows beyond 16GB
J***@seznam.cz
2012-10-11 16:12:14 UTC
Hello,

I am writing a new application and I am testing it on a cluster with 4 Riak
nodes (16 GB RAM, 2 x i3 3.4GHz - 2 cores).

The application is tested with the expected load of 1000 requests/second;
90% of the requests cause a Riak read and a write of a new key. The problem
is that performance starts falling after 18-20 hours, and one of the Riak
nodes stops responding after 23-25 hours.

(The key is approx. 61 bytes long, the data is just 3 timestamps converted to
binary, and there is a secondary index containing an expiration time. There
should be a mapred job to delete keys older than 24 hours, but it is turned
off while researching the performance problem.)
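
For illustration, each write is roughly of this form in the Erlang client
(the bucket and index names are made up here, and
riakc_obj:set_secondary_index/2 is the API of client versions newer than the
1.2.1 client listed below):

store_entry(Pid, Key, Value, ExpiresAt) ->
    %% Value is the 3-timestamp binary; ExpiresAt is an integer timestamp
    Obj0 = riakc_obj:new(<<"entries">>, Key, Value),
    MD0 = riakc_obj:get_update_metadata(Obj0),
    MD1 = riakc_obj:set_secondary_index(MD0,
              [{{integer_index, "expires"}, [ExpiresAt]}]),
    Obj1 = riakc_obj:update_metadata(Obj0, MD1),
    riakc_pb_socket:put(Pid, Obj1).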

Logs on the other nodes show that the problematic node cannot be contacted:

2012-10-11 11:33:57.473 [error] <0.908.0> ** Node '***@172.16.0.2' not
responding **
** Removing (timedout) connection **

The problematic node itself does not respond to "/usr/sbin/riak ping", but
beam.smp is running and ALIVE messages are written regularly to the erlang
log. There is nothing suspicious in the logs on the node; its error log is
empty.

The beam.smp process consumes 20% of memory and 50-100% of 1 CPU (the other 3
CPUs sit idle), and it has 267 LevelDB files open.
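
(The open-file count is from something like "lsof -p <pid> | grep -c sst",
where <pid> is a placeholder for the beam.smp process id.)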

The database sizes are:

node1: 16249M, 281 files in 21 dirs (with 4 additional files like
/home/riak/leveldb/0/lost/BLOCKS.bad); this is the problematic node
node2: 16183M, 264 files in 16 dirs
node3: 16664M, 264 files in 16 dirs
node4: 16205M, 265 files in 16 dirs

I tried to attach to the beam.smp process with Erlang, but it does not
respond to net_adm:ping/1.
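
The check was of this form (the debug shell's node name here is made up):

$ erl -name dbg@172.16.0.9 -setcookie riak_0627f0bc6c9
1> net_adm:ping('***@172.16.0.2').
pang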

I attached gdb to the process, and gdb shows that most of its 93 threads are
idle (in ethr_event_wait), but 2 threads are in LevelDB code:
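
The gdb session was essentially the following, with <pid> standing for the
beam.smp process id:

$ gdb -p <pid>
(gdb) info threads
(gdb) thread apply all bt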

Thread 24 (Thread 0x7f1a8ecc0700 (LWP 3912)):
#0  0x00007f1a91f74d84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f1a0ee0ae9d in leveldb::port::CondVar::Wait() () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#2  0x00007f1a0ede3841 in leveldb::DBImpl::MakeRoomForWrite(bool) () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#3  0x00007f1a0ede91ad in leveldb::DBImpl::Write(leveldb::WriteOptions const&, leveldb::WriteBatch*) () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#4  0x00007f1a0eddeca4 in eleveldb_write () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#5  0x0000000000534c16 in process_main ()
#6  0x00000000004987e3 in ?? ()
#7  0x0000000000595320 in ?? ()
#8  0x00007f1a91f70e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#9  0x00007f1a91a964bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#10 0x0000000000000000 in ?? ()

Thread 20 (Thread 0x7f19fc727700 (LWP 3967)):
#0  0x00007f1a0ee05a67 in leveldb::crc32c::Extend(unsigned int, char const*, unsigned long) () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#1  0x00007f1a0ee012b9 in leveldb::TableBuilder::WriteRawBlock(leveldb::Slice const&, leveldb::CompressionType, leveldb::BlockHandle*) () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#2  0x00007f1a0ee01444 in leveldb::TableBuilder::WriteBlock(leveldb::BlockBuilder*, leveldb::BlockHandle*) () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#3  0x00007f1a0ee015e4 in leveldb::TableBuilder::Flush() () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#4  0x00007f1a0ee0178b in leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&) () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#5  0x00007f1a0ede7cad in leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*) () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#6  0x00007f1a0ede8456 in leveldb::DBImpl::BackgroundCompaction() () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#7  0x00007f1a0ede9038 in leveldb::DBImpl::BackgroundCall() () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#8  0x00007f1a0ee06c1e in ?? () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#9  0x00007f1a91f70e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#10 0x00007f1a91a964bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x0000000000000000 in ?? ()

When I looked at thread 20 again, the stack showed some Snappy compression,
and many later inspections showed a call to fdatasync(2), which was in turn
replaced by more compaction work. Thread 24 still sits in
leveldb::DBImpl::MakeRoomForWrite.

Thread 20 samples:
#0  0x00007f1a0ee0ed6d in ?? () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#1  0x00007f1a0ee0edb3 in ?? () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#2  0x00007f1a0ee0f9dc in snappy::internal::CompressFragment(char const*, unsigned long, char*, unsigned short*, int) () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#3  0x00007f1a0ee10dc1 in snappy::Compress(snappy::Source*, snappy::Sink*) () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#4  0x00007f1a0ee1115a in snappy::RawCompress(char const*, unsigned long, char*, unsigned long*) () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#5  0x00007f1a0ee014eb in leveldb::TableBuilder::WriteBlock(leveldb::BlockBuilder*, leveldb::BlockHandle*) () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#6  0x00007f1a0ee015e4 in leveldb::TableBuilder::Flush() () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#7  0x00007f1a0ee0178b in leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&) () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#8  0x00007f1a0ede7cad in leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*) () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#9  0x00007f1a0ede8456 in leveldb::DBImpl::BackgroundCompaction() () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#10 0x00007f1a0ede9038 in leveldb::DBImpl::BackgroundCall() () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#11 0x00007f1a0ee06c1e in ?? () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#12 0x00007f1a91f70e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007f1a91a964bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#14 0x0000000000000000 in ?? ()

#0  0x00007f1a91a8fa5d in fdatasync () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f1a0ee08d64 in ?? () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#2  0x00007f1a0ede3357 in leveldb::DBImpl::FinishCompactionOutputFile(leveldb::DBImpl::CompactionState*, leveldb::Iterator*) () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#3  0x00007f1a0ede7e6e in leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*) () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#4  0x00007f1a0ede8456 in leveldb::DBImpl::BackgroundCompaction() () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#5  0x00007f1a0ede9038 in leveldb::DBImpl::BackgroundCall() () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#6  0x00007f1a0ee06c1e in ?? () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#7  0x00007f1a91f70e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8  0x00007f1a91a964bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#9  0x0000000000000000 in ?? ()

#0  0x00007f1a0ee05765 in ?? () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#1  0x00007f1a0ee0b6da in leveldb::InternalKeyComparator::Compare(leveldb::Slice const&, leveldb::Slice const&) const () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#2  0x00007f1a0ee00218 in ?? () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#3  0x00007f1a0ee006aa in ?? () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#4  0x00007f1a0ede7ccd in leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*) () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#5  0x00007f1a0ede8456 in leveldb::DBImpl::BackgroundCompaction() () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#6  0x00007f1a0ede9038 in leveldb::DBImpl::BackgroundCall() () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#7  0x00007f1a0ee06c1e in ?? () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#8  0x00007f1a91f70e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#9  0x00007f1a91a964bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#10 0x0000000000000000 in ?? ()

#0  0x00007f1a0eb61dcb in std::string::_M_mutate(unsigned long, unsigned long, unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00007f1a0eb61e1c in std::string::_M_replace_safe(unsigned long, unsigned long, char const*, unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007f1a0ee03559 in ?? () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#3  0x00007f1a0ee037bd in ?? () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#4  0x00007f1a0ee00680 in ?? () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#5  0x00007f1a0ede7ccd in leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*) () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#6  0x00007f1a0ede8456 in leveldb::DBImpl::BackgroundCompaction() () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#7  0x00007f1a0ede9038 in leveldb::DBImpl::BackgroundCall() () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#8  0x00007f1a0ee06c1e in ?? () from /usr/lib/riak/lib/eleveldb-1.2.2p5/priv/eleveldb.so
#9  0x00007f1a91f70e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#10 0x00007f1a91a964bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x0000000000000000 in ?? ()

Software used:

OS: Ubuntu 12.04 LTS, amd64
Riak: riak_1.2.1rc2, installed from the Basho-provided deb package
client accesses Riak via riak-erlang-client 1.2.1

Any hints?

Thanks, Jan
Evan Vigil-McClanahan
2012-10-11 20:36:51 UTC
Can you attach the eleveldb portion of your app.config file?
Configuration problems, especially max_open_files being too low, can
often cause issues like this.

If it isn't sensitive, the whole app.config and vm.args files are also
often helpful.
Post by J***@seznam.cz
Hello,
I am writing a new application and I am testing it on a cluster with 4 Riak
nodes (16 GB RAM, 2 x i3 3.4GHz - 2 cores).
...
J***@seznam.cz
2012-10-12 12:02:36 UTC
Post by Evan Vigil-McClanahan
Can you attach the eleveldb portion of your app.config file?
Configuration problems, especially max_open_files being too low, can
often cause issues like this.
If it isn't sensitive, the whole app.config and vm.args files are also
often helpful.
Hello Evan,

thanks for responding.

I originally had default LevelDB settings. When the node stalled, I changed it
to

{eleveldb, [
{data_root, "/home/riak/leveldb"},
{max_open_files, 132},
{cache_size, 377487360}
]},

on all nodes and restarted them all. The application started at about 1000
requests/second, dropped below 500 requests/second after about a minute, and
the node stalled again after 41 minutes. BTW, according to lsof(1) it had 267
LevelDB files open, which is more than the 132 file limit (??).

Here is the configuration (with comments removed to stay under the 40 KB
message limit on the riak-users list).

/etc/riak/vm.args:

-name ***@172.16.0.2
-setcookie riak_0627f0bc6c9
+K true
+A 64
+W w
-env ERL_MAX_PORTS 4096
-env ERL_FULLSWEEP_AFTER 0
-env ERL_CRASH_DUMP /var/log/riak/erl_crash.dump
-env ERL_MAX_ETS_TABLES 8192

/etc/riak/app.config:

[
  {riak_api, [
    {pb_ip, "172.16.0.2" },
    {pb_port, 8087 }
  ]},
  {riak_core, [
    {ring_state_dir, "/var/lib/riak/ring"},
    %{ring_creation_size, 64},
    {http, [ {"127.0.0.1", 8098 } ]},
    %{https, [{ "127.0.0.1", 8098 }]},
    %{ssl, [
    %  {certfile, "/etc/riak/cert.pem"},
    %  {keyfile, "/etc/riak/key.pem"}
    %]},
    {handoff_port, 8099 },
    %{handoff_ssl_options, [{certfile, "/tmp/erlserver.pem"}]},
    {dtrace_support, false},
    {platform_bin_dir, "/usr/sbin"},
    {platform_data_dir, "/var/lib/riak"},
    {platform_etc_dir, "/etc/riak"},
    {platform_lib_dir, "/usr/lib/riak/lib"},
    {platform_log_dir, "/var/log/riak"}
  ]},
  {riak_kv, [
    {add_paths,
     ["/usr/local/share/riak-mapred/riak-1.2.0/deps/greylisting/ebin/"]},
    {storage_backend, riak_kv_eleveldb_backend},
    %{raw_name, "riak"},
    {mapred_name, "mapred"},
    {mapred_system, pipe},
    {mapred_2i_pipe, true},
    {map_js_vm_count, 8 },
    {reduce_js_vm_count, 6 },
    {hook_js_vm_count, 2 },
    {js_max_vm_mem, 8},
    {js_thread_stack, 16},
    %{js_source_dir, "/tmp/js_source"},
    {http_url_encoding, on},
    {vnode_vclocks, true},
    {legacy_keylisting, false},
    {listkeys_backpressure, true}
  ]},
  {riak_search, [
    {enabled, false}
  ]},
  {merge_index, [
    {data_root, "/var/lib/riak/merge_index"},
    {buffer_rollover_size, 1048576},
    {max_compact_segments, 20}
  ]},
  {bitcask, [
    {data_root, "/var/lib/riak/bitcask"}
  ]},
  {eleveldb, [
    {data_root, "/home/riak/leveldb"},
    {max_open_files, 132},
    {cache_size, 377487360}
  ]},
  {lager, [
    {handlers, [
      {lager_console_backend, info},
      {lager_file_backend, [
        {"/var/log/riak/error.log", error, 10485760, "$D0", 5},
        {"/var/log/riak/console.log", info, 10485760, "$D0", 5}
      ]}
    ]},
    {crash_log, "/var/log/riak/crash.log"},
    {crash_log_msg_size, 65536},
    {crash_log_size, 10485760},
    {crash_log_date, "$D0"},
    {crash_log_count, 5},
    {error_logger_redirect, true}
  ]},
  {riak_sysmon, [
    {process_limit, 30},
    {port_limit, 2},
    {gc_ms_limit, 100},
    {heap_word_limit, 40111000},
    {busy_port, true},
    {busy_dist_port, true}
  ]},
  {sasl, [
    {sasl_error_logger, false}
  ]},
  {riak_control, [
    {enabled, false},
    {auth, userlist},
    {userlist, [{"user", "pass"}]},
    {admin, true}
  ]}
].

Thanks.
--
Jan
Evan Vigil-McClanahan
2012-10-12 14:31:06 UTC
Hi there, Jan,

The lsof issue is that max_open_files is per backend instance (i.e. per
vnode), iirc, so if you're maxed out you'll see vnode count * max_open_files.
With the default ring_creation_size of 64 spread across 4 nodes, that is
about 16 vnodes per node, i.e. a ceiling of 16 * 132 = 2112 open files, so
267 is well under the real limit.

I think on the second try, you may have set the cache too high. I'd
drop it back to 8 or 16 MB, and possibly up the open files a bit more,
but you don't seem to be running into contention at this point.
There's a RAM cost, so maybe just leave it where it is for now, unless
you have quite a lot of memory.

Another thing to check is that vm.swappiness is set to 0 and that your
disk scheduler is set to deadline for spinning disks and noop for
SSDs.
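
On Linux those amount to something like this (sdX stands for whatever device
backs the leveldb data directory):

sysctl -w vm.swappiness=0
echo deadline > /sys/block/sdX/queue/scheduler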
Post by J***@seznam.cz
...
BTW according to lsof(1) it had 267 open LevelDB files which is more than
the 132 files limit (??).
J***@seznam.cz
2012-10-15 19:10:27 UTC
Hi Evan,

regarding the swappiness and disk scheduling: these were at their defaults; I will correct them and run another test.

The hosting provider sets up the machines with software RAID1 over 2 physical disks; do you think that is useful with Riak?

BTW, I suspected that part of the problem could be caused by the hardware of the first node, so I ran another test over the weekend with the node replaced. The result was slightly better: one of the nodes crashed after approx. 22 hours when its DB reached 14G, but the other 3 nodes worked for 2.8 days until the DB reached 40G (see http://janevangelista.rajce.idnes.cz/nastenka/#4Riak_2K_2.1RC2_3d_edited.jpg ). All the nodes crashed silently; there is nothing interesting in the Riak logs.

Thanks, Jan

---------- Original message ----------
From: Evan Vigil-McClanahan
Date: 12. 10. 2012
Subject: Re: Re: Riak performance problems when LevelDB database grows beyond 16GB
Hi there, Jan,
...
Another thing to check is that vm.swappiness is set to 0 and that your
disk scheduler is set to deadline for spinning disks and noop for SSDs.
Eli Janssen
2012-10-15 22:09:20 UTC
Anything in the system logs or dmesg?
With vm.swappiness set to the defaults, the oom-killer could be doing its job a bit too well.
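
Something like this should show it (standard Ubuntu log locations):

dmesg | grep -i 'out of memory'
grep -i 'killed process' /var/log/syslog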
Post by J***@seznam.cz
...
All the nodes crashed silently; there is nothing interesting in the Riak logs.
J***@seznam.cz
2012-10-16 10:17:44 UTC
You are right! Riak was killed by the OOM killer on all nodes except the one I was looking at.

Oct 14 18:34:01 gr-node03 kernel: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Oct 14 18:34:01 gr-node03 kernel: [31808] 106 31808 5178811 3884730 0 0 0 beam.smp
...
Oct 14 18:34:01 gr-node03 kernel: Out of memory: Kill process 31808 (beam.smp) score 955 or sacrifice child

(I found that the node I was looking at was made temporarily unreachable by a hosting provider's configuration error and then restarted.)

Thanks, Jan

---------- Original message ----------
From: Eli Janssen
Date: 16. 10. 2012
Subject: Re: Riak performance problems when LevelDB database grows beyond 16GB
Anything in the system logs or dmesg?
With vm.swappiness set to the defaults, the oom-killer could be doing its job a bit too well.
...
J***@seznam.cz
2012-10-17 12:59:08 UTC
Hi Evan,

I corrected the setup according to your recommendations:

- vm.swappiness is 0
- fs is ext4 on software RAID1, mounted with noatime
- disk scheduler is set to deadline (it was the default)
- eleveldb max_open_files is set to 200, cache is set to default
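
In app.config terms, the eleveldb section is now just:

{eleveldb, [
    {data_root, "/home/riak/leveldb"},
    {max_open_files, 200}
]},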

(BTW, why is Riak not using the new O_NOATIME open(2) flag?)

I restarted the last test with 3x40G and 1x14G DBs, and it was able to sustain 1000 ops/sec for 5 minutes. Then node5 stalled with the call stack described in the original mail, 1 of 4 cores almost 100% busy. The node was writing 29 MB/s (140 IOPS), with an occasional read (<5 IOPS), and had 252 LevelDB files open. The disk has 869G of free space.

When I looked at the performance graphs 17 hours later, it was still writing at approx. 29 MB/s (120 IOPS), with the same call stack. The Riak node was busy even after 17 hours without any application requests, and it was not even connected to the rest of the Riak cluster (the node was not listed by erlang:nodes() on the other nodes). I would suspect a bug in LevelDB, but people are using it in production, aren't they?

I intend to retry the test without the software RAID. Any other hints?

Best regards, Jan

---------- Original message ----------
From: Evan Vigil-McClanahan
Date: 12. 10. 2012
Subject: Re: Re: Riak performance problems when LevelDB database grows beyond 16GB
Hi there, Jan,
...
Another thing to check is that vm.swappiness is set to 0 and that your
disk scheduler is set to deadline for spinning disks and noop for SSDs.
Evan Vigil-McClanahan
2012-10-18 22:32:54 UTC
Ahoj, Jan,

Our leveldb developer, Matthew, sent along a reply. Reading it and
reading your last reply, I am at the limits of my ability to suggest
things, other than to note that if you're IO bound, running the disks
in RAID 0 rather than RAID 1 may help.

Please contact me off list if you have any issues getting those files
to Matthew because of their size.
Post by J***@seznam.cz
...
I intend to retry the test without the software RAID. Any other hints?
Matthew Von-Maszewski
2012-10-18 12:21:15 UTC
Greetings,

I am currently responsible for tuning Google's leveldb implementation for Riak. I have read through most of this thread and have a couple of information requests; then I will try to address various questions and comments from it. In general, you are filling leveldb faster than its background compaction (optimization) can keep up. I am willing to work with you to figure out why, and what can be done about it.

Questions / requests:

1. Execute the following on one of the servers:

sort /home/riak/leveldb/*/LOG* >log_jan.txt

Tar/gzip the log_jan.txt and email it back.

2. Execute the following on one of the servers:

grep -i flags /proc/cpuinfo

Include the output (actually just one line will do) in a reply.

3. On a running server that is processing data, execute:

grep -i swap /proc/meminfo

Include the full output (3 lines) in a reply.

4. Pick a server, then one directory in /home/riak/leveldb. Select 3 of the largest *.sst files. Tar/gzip those and email back.


Notes about other messages on this thread:

a. the gdb stack traces are nice! They clearly indicate that leveldb has intentionally entered a "stall" state because compaction is not keeping up with the input stream. Riak 1.2.1rc1 contains code that attempts to slow the write rate to allow the background compactions to catch up. It is not working in your case.

b. there is a performance bug in the cache code, though it is not your main problem. This is why Evan asked you to reduce the cache size from 377,487,360. Yes, I created the bug, and I will get it addressed soon.

c. the compaction process is disk and CPU intensive. The fact that your CPUs are not heavily loaded, yet the client/request code is stalled waiting for compaction to catch up, suggests the disk is thrashing and could use some help. Again, this is why Evan had you adjust some configuration settings there.

d. your comment about using O_NOATIME is valid. The issue is that the flag is relatively new, and we support some really old compilers and linux/solaris versions. It is easier to ask everyone to set noatime at the mount level than to have conditional code for some and mount-level tuning for others. But your comment is still correct.

e. a non-zero sized lost/BLOCKS.bad means data corruption. It looks like you already figured that out. Either the crc code or the decompression code found an issue during compaction and moved the bad data to the side.

f. max_open_files in 1.1 was a hard limit on the number of open files per vnode (per subdirectory in /home/riak/leveldb). 1.2 uses the number as more of a per-file memory-consumption suggestion. A future release will drop the option and substitute something like "file_cache_size". Memory is the critical resource, not file handles (at least for Riak … I am told Google uses this code in Android, so it might be critical there).

What issues did I miss?

Matthew
István
2012-10-19 02:27:12 UTC
Post by Matthew Von-Maszewski
d. you comment about using O_NOATIME is valid. The issue is that the flag is relatively new. We are supporting some really old compilers and linux/solaris versions. It is easier to ask everyone to work noatime at the mount level than have conditional code for some and mount level tuning for others. But your comment is still correct.
isn't O_NOATIME per file open [1]? (while the mount option is per mount
point, afaik.)

So relying on the mount option alone might mean a bit of a performance drop
in some cases.


1. http://www.kernel.org/doc/man-pages/online/pages/man2/open.2.html

O_NOATIME (since Linux 2.6.8)
    Do not update the file last access time (st_atime in the inode) when
    the file is read(2). This flag is intended for use by indexing or
    backup programs, where its use can significantly reduce the amount of
    disk activity. This flag may not be effective on all file systems.
    One example is NFS, where the server maintains the access time.


Regards,
Istvan
--
the sun shines for all
Matthew Von-Maszewski
2012-10-22 16:20:23 UTC
Jan,

I apologize for the delayed response.

1. Did you realize that the "log_jan.txt" file from #1 below documented a hard disk failure? You mentioned a failed drive once. I am not sure if this is the same drive.


2. The "sse4_2" tells me that your Intel cpu supports hardware CRC32c calculation. This feature is not useful to you at this moment (unless you want to pull the mv-hardware-crc branch from basho/leveldb). It will bring some performance improvements to you in the next release IF we do not decide that your problems are hard disk performance limited.


3. This just confirmed for me that the app.config is not accidentally blowing out your physical memory. The app.config file you posted suggested this was not the case, but I wanted to verify.

You also discuss a basho_bench failure. Is this the same test run as the log_jan.txt file? The hard drives had their first failure at:

2012/10/18-02:08:44.136238 7f8297fff700 Compaction error: Corruption: corrupted compressed block contents

And things go really bad at:

2012/10/18-06:10:37.657072 7f829effd700 Moving corrupted block to lost/BLOCKS.bad (size 1647)


4. I was looking to see if your data was compressing well. The answer is that it is: you are achieving a 2 to 2.6x compression ratio. Since you are concerned about throughput, I was verifying that the time leveldb spends on block compression is worthwhile for you (it is).


The next question from me is whether the drive / disk array problems are your only problem at this point. The data in log_jan.txt looks ok until the failures start. I am willing to work more, but I need to better understand your next level of problems.

Matthew
Post by J***@seznam.cz
Hi Matthew,
big thanks for responding. I see that you are the main committer to Basho's
leveldb code. :-)
Post by Matthew Von-Maszewski
sort /home/riak/leveldb/*/LOG* >log_jan.txt
See the attached file log_jan.txt.gz. It is from the stalled Riak node.
Post by Matthew Von-Maszewski
grep -i flags /proc/cpuinfo
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm
constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc
aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr
pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer xsave avx lahf_lm arat epb
xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
Post by Matthew Von-Maszewski
grep -i swap /proc/meminfo
I will restart the test and report back when it stalls again. In the meantime,
I am sending you yesterday's zabbix graph showing memory usage on the node
(attached file ZabbixMemory.png). The time when the node stopped responding is
I am also attaching the corresponding Basho bench output of the test. The test
was started on Oct 17 16:38 with an empty database, and it was run on a plain
ext4 partition (no RAID).
Post by Matthew Von-Maszewski
4. Pick a server, then one directory in /home/riak/leveldb. Select 3 of
the largest *.sst files. Tar/gzip those and email back.
I will send them in the next mail. I can also put the entire database
somewhere for you for download, if you need it.
Thanks, Jan

[Attachments: log_jan.txt.gz, ZabbixMemory.png, 4Riak_2K_2.1RC2_noraid.png]
J***@seznam.cz
2012-10-23 10:35:02 UTC
Post by Matthew Von-Maszewski
1. Did you realize that the "log_jan.txt" file from #1 below documented a
hard disk failure?

I did not know about the corruption (I did not know that LevelDB logs are
human-readable application logs), thanks for telling me. I did look at syslog
and did not find any traces of a disk failure.
Post by Matthew Von-Maszewski
You mentioned a failed drive once. I am not sure if this is the same drive.
No, it is a different disk. I am not using the problematic node anymore.
Post by Matthew Von-Maszewski
You also discuss a basho_bench failure. Is this the same test run as the
log_jan.txt file? The hard drives had their first failure at:

Yes, it is. (It is an application-level test written as a basho_bench driver.)
Post by Matthew Von-Maszewski
The next question from me is whether the drive / disk array problems are
your only problem at this point.
Post by Matthew Von-Maszewski
The data in log_jan.txt looks ok until the failures start. I am willing to
work more, but I need to better understand your next level of problems.

I will look into it and report back.

Thanks!

Jan
J***@seznam.cz
2012-11-21 15:52:51 UTC
---------- Original message ----------
From: Matthew Von-Maszewski
Date: 22. 10. 2012
Subject: Re: Riak performance problems when LevelDB database grows beyond 16GB
Jan,
...
The next question from me is whether the drive / disk array problems are your only problem at this point. The data in log_jan.txt looks ok until
the failures start. I am willing to work more, but I need to better understand your next level of problems.
Matthew
Hi Matthew,

thanks again for helping me. It was actually bad RAM. It took me some time to convince the hosting provider since the problem did not show up in their hardware tests. :-/

I ran a 4-day test after the two bad nodes got fixed, and the original issue (a Riak node getting stuck) did not appear again.

There are however other problems which seem to be caused by my "misuse" of LevelDB for storing short-lived data (the data is valid only for 24 hours).

Here is the application throughput during a 4-day test with 5 Riak nodes:

http://janevangelista.rajce.idnes.cz/nastenka#5Riak_4d_edited.jpg

This graph shows memory use on a Riak node:

http://janevangelista.rajce.idnes.cz/nastenka/#MemNode3-edited.jpg

(Memory use on the other nodes looks similar, but the OOM killer was not invoked there.)

And this graph shows disk space consumption on a Riak node:

http://janevangelista.rajce.idnes.cz/nastenka#DiskSpace-4d-edited.jpg

The OOM condition which killed one Riak node (and slowed down the other ones) seems to be caused by the map-reduce jobs which
periodically delete old data from the database. The entries are deleted with a mapred job querying the secondary index and using the reduce function published at http://contrib.basho.com/delete_keys.html .
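
For the record, the job is of this form (the bucket, index name, and deleting
reduce module are placeholders for my real ones, and the 2i input tuple
assumes a client recent enough to support secondary-index mapred inputs):

{ok, Pid} = riakc_pb_socket:start_link("172.16.0.2", 8087),
{Mega, Sec, _} = os:timestamp(),
Now = Mega * 1000000 + Sec,
riakc_pb_socket:mapred(Pid,
    {index, <<"entries">>, {integer_index, "expires"}, 0, Now},
    [{reduce, {modfun, my_delete, delete}, none, false}],
    600000). %% 10 min job timeout; see the timeout bug below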

I wish LevelDB could expire old entries in the same way as Bitcask does. :-)
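
For reference, Bitcask's knob for this is expiry_secs; in app.config terms it
would be (value matching my 24-hour lifetime):

{bitcask, [
    {expiry_secs, 86400} %% auto-expire entries after 24 hours
]},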

In an older 3-day test I had only a 5 min timeout for the mapred jobs (a bug). It caused premature cancellation of the jobs deleting the old data, but the throughput was better:

http://janevangelista.rajce.idnes.cz/nastenka#4Riak_3d_8K_edited.jpg

The memory use looked reasonable as well:

http://janevangelista.rajce.idnes.cz/nastenka/#Memory-3d-edited.jpg

The disk use in this case was:

http://janevangelista.rajce.idnes.cz/nastenka#DiskSpace.jpg

The databases were approx. 85 GB.

So the only problem now seems to be how to get rid of the old data. Any hints?

Thanks, Jan
Matthew Von-Maszewski
2012-11-21 16:01:17 UTC
map reduce is currently outside my skill set. I have forwarded the question to others on the team.

I have also asked the team if they can give me the key specifications used for auto delete in bitcask. Maybe, just maybe, I can slide the same logic into leveldb's compaction algorithm.

Matthew
Post by J***@seznam.cz
...
So the only problem now seems to be how to get rid of the old data. Any hints?
Thanks, Jan
J***@seznam.cz
2012-11-22 17:30:45 UTC
It would be sufficient if I could have two eleveldb instances in riak_kv_multi_backend and could drop one of them with riak_kv_eleveldb_backend:drop/1 while the other instance is in use; after another interval I would drop the second instance, and so on.
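
For clarity, the kind of configuration I mean (the backend names and paths
are just examples):

{riak_kv, [
    {storage_backend, riak_kv_multi_backend},
    {multi_backend_default, <<"level_a">>},
    {multi_backend, [
        {<<"level_a">>, riak_kv_eleveldb_backend,
            [{data_root, "/home/riak/leveldb_a"}]},
        {<<"level_b">>, riak_kv_eleveldb_backend,
            [{data_root, "/home/riak/leveldb_b"}]}
    ]}
]},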

Jan

---------- Original message ----------
From: Matthew Von-Maszewski
Date: 21. 11. 2012
Subject: Re: Riak performance problems when LevelDB database grows beyond 16GB
...
I have also asked the team if they can give me the key specifications used
for auto delete in bitcask. ...
