Discussion:
High disk usage on node
Travis Kirstine
2018-11-01 13:25:15 UTC
Permalink
I'm running Riak (v2.14) in a 5 node cluster and for some reason one of the nodes has higher disk usage than the others. The problem seems to be related to how Riak distributes the partitions: I'm using the default ring size of 64, and Riak has given each node 12 partitions except one node that gets 16 (4x12+16=64). As a result, the node with 16 partitions has filled its disk and become unreachable.

I have a node on standby with roughly the same disk space as the failed node; my concern is that if I add it to the cluster it will overflow as well.

How do I recover the failed node and add a new node without destroying the cluster? BTW, just to make things more fun, the new node is at a newer version of Riak, so I need to perform a rolling upgrade at the same time.

Any help would be greatly appreciated!
Nicholas Adams
2018-11-01 14:23:47 UTC
Permalink
Hi Travis,
I see you have encountered the disk space pickle. Just for the record, the safest way to run Riak in production is to keep all resources (CPU, RAM, network, disk space, I/O, etc.) below 70% utilization on all nodes at all times. The reason is to compensate for when one or more nodes go down: the remaining nodes have to carry the full load of the offline node(s) on top of their existing load, and therefore need sufficient free resources available to do it. If you are running right at the limit for any resource, you should expect issues like this, or worse, to happen on a regular basis.

Potential initial prevention
When you join all the nodes together to form your cluster, you get to run `riak-admin cluster plan` and it will show you how things will turn out. If you like this plan, run `riak-admin cluster commit` and partitions will be moved around accordingly. If not, you can cancel the plan and generate a new one, repeating until you are happy with the distribution. With an unfortunate combination of ring size and node count, one node will always get its fair share and then some, but a replan will often make the imbalance as small as possible.
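For reference, the staging workflow looks roughly like this (the node names are placeholders):

    # stage a join from the new node
    riak-admin cluster join riak@node1.example.com

    # review the proposed partition distribution
    riak-admin cluster plan

    # either commit the plan, or clear it and stage a different change
    riak-admin cluster commit
    riak-admin cluster clear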

Temporary escape from current situation
Before beginning here, let me mention that this method has the potential to go horribly wrong, so proceed at your own risk. Regarding the server with the filled disk, you can follow the method underneath (a rough shell sketch follows the list):

* Stop Riak
* Attach additional storage (USB, additional disks, NAS, whatever)
* Copy partitions from the data directory (presumably Bitcask) to the additional storage
* Once the copy has been completed, delete the data from the regular node's hard disk
* Create a symlink from the external storage to where you just deleted the data from
* Repeat until you have freed up sufficient disk space (new data may be written here during transfers, so make sure you have enough headroom)
* Start Riak
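A minimal sketch of the above, assuming a Bitcask data directory at /var/lib/riak/bitcask and the extra storage mounted at /mnt/extra (both paths and the partition ID are placeholders; adjust for your layout):

    riak stop
    mkdir -p /mnt/extra/bitcask
    # for each partition directory you want to relocate ($P is a placeholder)
    P=<partition_id>
    cp -a /var/lib/riak/bitcask/$P /mnt/extra/bitcask/$P
    rm -rf /var/lib/riak/bitcask/$P
    ln -s /mnt/extra/bitcask/$P /var/lib/riak/bitcask/$P
    riak start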

The above should bring your server back in touch with the cluster. Monitor transfers and once they have all finished, add your new node to the cluster. After this new node has been added and all transfers have finished, take the previously full node offline and reverse the steps above until you are able to remove the additional storage.
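Transfers and membership can be watched with the standard admin commands, and the new node is then staged like any other join (the node name is a placeholder):

    riak-admin transfers         # active and pending handoffs
    riak-admin member-status     # ring ownership per node

    # once transfers have settled, add the new node
    riak-admin cluster join riak@newnode.example.com
    riak-admin cluster plan
    riak-admin cluster commit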

Note: running a mixed version cluster for a prolonged period of time is not recommended in production. Out of preference, I would suggest installing the same version of Riak on the new node, going through the above and then looking at upgrading the cluster once everything is stable.

Good luck,

Nicholas
Fred Dushin
2018-11-01 14:29:49 UTC
Permalink
I think your best bet is to do a force-replace (and then a manual repair, if you are not using AAE) with a node that has higher capacity than your current standby. You are correct that replacing with your standby will fail: when you run repairs, it will end up running out of space as well.

I think you do NOT want to do a remove (or force remove) of the node that has run out of space, as you will likely just end up pushing those partitions to other nodes that are already stressed for capacity.

I would not tempt fate by doing an add on a cluster with a node that is down for the count. Fix that node first, and then add capacity to the cluster. But others here might have more experience with expanding clusters that are in ill health. For example, it might be possible to do a replace without a manual repair (which will leave a bunch of near-empty partitions on your new node), and then do an add with the node you took out of the cluster. You would then need to track down where all the partitions got moved to in the cluster, and do a manual repair of those partitions. (Or, I suppose, if you are using AAE, wait for a full set of exchanges to complete; but if it were me I'd turn off AAE and run the repairs manually, so you can monitor which partitions actually got repaired.)
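For reference, the replace is staged like any other membership change, and manual repairs are run per partition from the Riak console. Node names and the partition ID below are placeholders:

    # stage the replacement of the failed node with a larger one
    riak-admin cluster force-replace riak@failed.example.com riak@bigger.example.com
    riak-admin cluster plan
    riak-admin cluster commit

Then, from `riak attach` on the node that now owns a given partition, repair it by ID:

    riak_kv_vnode:repair(<partition_id>).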

If you are moving to the latest version of Riak (2.2.6), then you may find that the v2 claim algorithm does a better job of distributing (ownership) partitions more evenly around the cluster, but you should upgrade all your nodes first (before running your capacity add).

Also make use of `riak_core_claim_sim:help()` in the Riak console. You can get pretty good insight into what your capacity adds will look like, without having to do a cluster plan. As Drew has mentioned, you might not want to try the claim_v3 algorithms on an already under-provisioned cluster, as claim_v3 may move partitions to already heavily-loaded nodes as part of its ownership handoff. In your case that could be catastrophic: you currently have keys with only two replicas (and nodes that are presumably already growing in size with secondary partitions), so losing another node or two to disk space could result in permanent data loss.
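The simulator is reached from the console; beyond `help()` the exact entry points vary by version, so check the help output first (the node name in the prompt is a placeholder):

    $ riak attach
    (riak@node1)1> riak_core_claim_sim:help().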

-Fred

PS> I see Nicholas has posted a temporary workaround if you can add extra disk capacity to your node (NAS, etc.), and that seems like a good route to go, if you can do that. Otherwise, I'd find a machine with more disk and do the replace as described above.
Travis Kirstine
2018-11-01 15:18:44 UTC
Permalink
Fred / Nicholas,

Thanks for your input! We are running Riak (LevelDB with AAE) inside VMs, so we may be able to simply copy the VM to a new server and add more storage to the logical volume. Our plan would be to (rough commands sketched after the list):


- Shut down the failed node
- Copy the VM and allocate more storage
- Restart the node
- Wait for transfers to complete
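A rough sketch of those steps, assuming an LVM-backed ext4 data volume (the device path and size are hypothetical):

    riak stop
    # after the VM has been copied: grow the logical volume and filesystem
    lvextend -L +200G /dev/vg0/riak-data
    resize2fs /dev/vg0/riak-data
    riak start
    # watch handoff bring the node back up to date
    riak-admin transfers
    riak-admin ring-status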


Just to be clear: Riak will only redistribute the partitions if the failed node is removed from the cluster? My worry is that it may take 3-4 days to copy the VM, during which time the failed Riak node would be offline; if Riak attempts to redistribute the partitions in the meantime, all the nodes would fill up and it would be a nightmare!

Thanks

István
2018-11-02 09:43:46 UTC
Permalink
Hi,

A ring size of 64 is a bit low; I'd guess 128 would be better to avoid this kind of situation, since any leftover partitions are then a smaller fraction of each node's load.
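For reference, the ring size is set in riak.conf, but it can only be chosen before the cluster is first started; changing it on a live cluster requires a ring resize or a rebuild:

    ## riak.conf (set before first start; the value shown is an example)
    ring_size = 128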

I.

--
the sun shines for all