Reduce bandwidth costs with dm-cache: fast local SSD caching for network storage

devcenter.upsun.com

78 points by tlar 4 days ago

loeg 20 hours ago

Historically, I believe bcache offered a better design than dm-cache. I wonder if that has changed at all?

That said, for this use, I would be very concerned about coherency issues putting any cache in front of the actual distributed filesystem. (Unless this is the only node doing writes, I guess?)

miladyincontrol 3 hours ago

I just use fs-cache for networked storage caching. Good enough for redhat, good enough for me. Unsure how performance compares but I like that it works transparently with little more than a mount flag to activate, works fine in containers, and if managed with cachefilesd it can scale dynamically as per configured quotas.

For local disks though? bcache

rbranson a day ago

> For e-commerce workloads, the performance benefit of write-back mode isn’t worth the data integrity risk. Our customers depend on transactional consistency, and write-through mode ensures every write operation is safely committed to our replicated Ceph storage before the application considers it complete.

Unless the writer is always blindly overwriting entire files at once (no read-then-write), consistency requires consistent reads AND writes. Even then, potential ordering issues creep in. It would be really interesting to hear how they deal with it.

twotwotwo 18 hours ago

They mention it as a block device, and the diagram makes it look like there's one reader. If so, this seems like it has the same function as the page cache in RAM, just saving reads, and looks a lot like https://discord.com/blog/how-discord-supercharges-network-di... (which mentions dm-cache too).
If so, safe enough, though if they're going to do that, why stop at 512MB? The big win of Flash would be that you could go much bigger.

0xbadcafebee a day ago

This is good timing; I was just looking at a use-case where we need more iops and the only immediate solutions involve allocating way more high-performance disks or network storage. The problem with a cache is having a large dataset with random access, so repeated cache hits might not be frequent. But I had a theory that you could still make an impact on performance and lower your storage performance requirements. I may try this out, but it is block-level, so it's a bit intrusive.
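
A rough way to size up that theory, assuming uniformly random reads (the worst case for a cache) and a cache that is already warm: the hit ratio is about cache size divided by dataset size, and only the misses reach the slow storage. The function names and the numbers below are made up for illustration; a skewed access pattern would do better than this estimate.

```python
def uniform_hit_ratio(cache_bytes: int, dataset_bytes: int) -> float:
    """Expected hit ratio for uniformly random reads against a warm cache."""
    if dataset_bytes <= 0:
        raise ValueError("dataset_bytes must be positive")
    return min(1.0, cache_bytes / dataset_bytes)


def backend_read_iops(total_read_iops: float, hit_ratio: float) -> float:
    """Read IOPS that still reach the slow storage (cache misses only)."""
    return total_read_iops * (1.0 - hit_ratio)


if __name__ == "__main__":
    GiB = 1024 ** 3
    # Hypothetical workload: 1000 GiB dataset, 200 GiB SSD cache, 10k read IOPS.
    ratio = uniform_hit_ratio(cache_bytes=200 * GiB, dataset_bytes=1000 * GiB)
    print(f"hit ratio: {ratio:.0%}")
    print(f"backend read IOPS: {backend_read_iops(10_000, ratio):.0f}")
```

So even with no locality at all, a cache covering a fifth of the data would take roughly a fifth of the reads off the backing store.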

Another option I haven't tried is tmpfs with an overlay. Initial access is RAM, falls back to underlying slower storage. Since I'm mostly doing reads, should be fine, writes can go to the slower disk mount. No block storage changes needed.

otterley 7 hours ago

You don’t need a tmpfs to have the OS use memory to cache block reads for you. The kernel gives you that for free.

mrkurt 21 hours ago

dm-cache writeback mode is both amazing and terrifying. It reorders writes, so not only do you lose data if the cache fails, you probably just corrupted the entire backing disk.
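
A toy sketch of why reordering matters, not dm-cache's actual writeback policy: here the dirty blocks are flushed in block-number order (an assumption chosen only to make the reordering visible), and the cache device dies partway through the flush. The function names and block numbers are hypothetical.

```python
from typing import Dict, List, Tuple

Write = Tuple[int, str]  # (block number, payload)


def write_through(writes: List[Write]) -> Dict[int, str]:
    """Every write reaches the backing device, in issue order, before it is acked."""
    backing: Dict[int, str] = {}
    for block, payload in writes:
        backing[block] = payload
    return backing


def write_back_with_crash(writes: List[Write], flushed_before_crash: int) -> Dict[int, str]:
    """Writes are acked once cached; dirty blocks are flushed later in block order.

    The cache device is lost after `flushed_before_crash` blocks were flushed,
    so the returned dict is all that survives on the backing device.
    """
    dirty: Dict[int, str] = {}
    for block, payload in writes:
        dirty[block] = payload  # acked here, backing device not yet touched
    backing: Dict[int, str] = {}
    for block in sorted(dirty)[:flushed_before_crash]:  # flush order != issue order
        backing[block] = dirty[block]
    return backing


if __name__ == "__main__":
    # The application writes the data block first, then the commit record.
    issued = [(900, "data"), (5, "commit: block 900 is valid")]
    print(write_through(issued))
    print(write_back_with_crash(issued, flushed_before_crash=1))
```

In the write-back run the backing disk ends up with the commit record but not the data it points to, which is the kind of state a filesystem or database never expects to see.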

saltcured 6 hours ago

Yeah, when I used it on a workstation many years ago, I layered it on top of an MD RAID-1 SSD array for the cache and an MD RAID-5 HDD array for the bulk store.
I used writeback mode, but expected to wipe the machine if the caching layer ever collapsed. In the end, the SSDs outlived my interest in the machine, though I think I did fail over an HDD or two while the rest remained in normal operating mode.

otterley 9 hours ago

Why is two-thirds of their I/O crossing AZ boundaries for a read-heavy application? This application seems like it’s not well architected for AWS and puts them at availability risk in the event of a zonal impairment. It looks like they’re using Ceph instead of EBS, and it’s not clear why.

mgerdts 19 hours ago

I remember seeing another strategy where a remote block device was (lazily?) mirrored to a local SSD. The mirror was configured such that reads from the local device were preferred and writes would go to both devices. I think this was done by someone on GCP.

Does this ring any bells? I’ve searched for this a time or two and can’t find it again.

twotwotwo 18 hours ago

Discord: https://discord.com/blog/how-discord-supercharges-network-di...
(Somehow the name "SuperDisks" was burned into my brain for this. Although Discord's post does use 'Super-Disks' in a section header, if you search the Internet for SuperDisks you'll find everything's about the LS-120 floppies that went by that name.)
magicalhippo 11 hours ago

There was some discussion amongst the ZFS devs for such a feature.
As I recall it was to change the current mirrored read strategy to be aware of the speed of the underlying devices, and prefer the faster if it has capacity. Though perhaps a fixed pool property to always read from a given device was discussed, it's been a while so my memory is hazy.
The use-case was similar IIRC, where a customer wanted to combine local SSD with remote block device.
So, might come to ZFS.
Conch5606 14 hours ago

This is not quite the same, it's for migrating from one device to another while keeping the file system writable, but it's quite neat: dm-clone[1]
I've used it before for a low downtime migration of VMs between two machines - it was a personal project and I could have just kept the VM offline for the migration, but it was fun to play around with it.
You give it a read-only backing device and a writable device that's at least as big. It will slowly copy the data from the read-only device to the writable device. If a read is issued to the dm-clone target it's either gotten from the writable device if it's already cloned or forwarded to the read-only device. Writes are always going to the writable device and afterwards the read-only device is ignored for that block.
It's not the fastest, but it's relatively easy to set up, even though using device mapper directly is a bit clunky. It's also not super efficient: IIRC, if a read hits a chunk that hasn't been copied yet, the data is fetched from the read-only device and handed to the reading program, but it isn't stored on the writable device, so it has to be fetched again later. If the file system being copied isn't full, it's a good idea to run trimming after creating the dm-clone target, as discarded blocks are marked as not needing to be fetched.
[1] https://docs.kernel.org/admin-guide/device-mapper/dm-clone.h...
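
The read/write rules described above can be sketched as a toy Python model. This is only an illustration of the semantics as summarized in this comment, not the kernel implementation: `CloneModel` and its methods are made-up names, one block stands in for one dm-clone region, and writes are assumed to always cover a whole block.

```python
from typing import List, Optional


class CloneModel:
    """Toy model of the dm-clone behaviour described above (one block = one region)."""

    def __init__(self, source: List[bytes]) -> None:
        self.source = source                       # read-only device
        self.dest: List[Optional[bytes]] = [None] * len(source)  # writable device
        self.hydrated = [False] * len(source)      # True once dest is authoritative

    def read(self, block: int) -> Optional[bytes]:
        if self.hydrated[block]:
            return self.dest[block]
        # Forwarded to the source; the result is not kept on the destination.
        return self.source[block]

    def write(self, block: int, data: bytes) -> None:
        self.dest[block] = data
        self.hydrated[block] = True                # source is ignored from now on

    def discard(self, block: int) -> None:
        # A trimmed block never needs to be fetched from the source.
        self.dest[block] = None
        self.hydrated[block] = True

    def hydrate_one(self) -> bool:
        """Background copy of one not-yet-cloned block. Returns False when done."""
        for block, done in enumerate(self.hydrated):
            if not done:
                self.dest[block] = self.source[block]
                self.hydrated[block] = True
                return True
        return False


if __name__ == "__main__":
    clone = CloneModel([b"a", b"b", b"c"])
    clone.write(1, b"B")       # overwritten before it was copied
    clone.discard(2)           # trimmed, so never copied
    while clone.hydrate_one():
        pass
    print([clone.read(i) for i in range(3)])
```

After the background copy finishes, only block 0 was actually fetched from the source, which is why trimming an empty-ish file system first saves work.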
zipmapfoldright 13 hours ago

Google's L4 cache? https://cloud.google.com/blog/products/storage-data-transfer...
cperciva 18 hours ago

I've done this on EC2 -- in particular back in the days when EBS billed per I/O (as opposed to using a "reserved IOPs" model where you say up front how much I/O performance you need). I haven't bothered recently since EBS performance is good enough for most purposes and there's no automatic cost savings.

kayson a day ago

I was looking into SSD caching recently and decided to go with Open-CAS instead, which should be more performant (didn't test it personally): https://github.com/Open-CAS/open-cas-linux/issues/1221

It's maintained by Intel and Huawei and the devs were very responsive.

mgerdts 21 hours ago

Is Intel still working on it? Open-CAS bdev support was nearly removed from SPDK at a time when Intel still employed a SPDK development and QA team. Huawei stepped in to offer support to keep it alive, preventing its removal.
I’ve been under the impression that Intel got rid of pretty much all of their storage software employees.
- quickslowdown 19 hours ago
  
  I mean to ask a genuine, good faith question here, because I don't know much about Huawei's development team.
  My head goes to the xz attack when I hear that Intel decided to stop supporting an open source tool, and a Chinese company known to sell backdoored equipment "steps in" to continue development, and it makes me suspicious & concerned.
  This is to say nothing of the quality of the software they write or its functionality, they may be "good stewards" of it, but does it seem paranoid to be unsure of that arrangement?

AtlasBarfed a day ago

"When deploying infrastructure across multiple AWS availability zones (AZs), bandwidth costs can become a significant operational expense"

An expense in the age of 100gbit networking that is entirely because AWS can get away with charging the suckers, um, customers for it

0xbadcafebee a day ago

AZs are whole datacenters, so I imagine their backbone bandwidth between AZs is a fraction of total bandwidth inside the DC. If they didn't charge it'd probably get saturated and then there's not much point in using them for reliability.
The internet egress price is where they're bastards.
- martinald a day ago
  
  Definitely not. Azure doesn't charge for intra-region traffic, FWIW.
  Getting terabits and terabits of 'private' interconnect is unbelievably cheap at amazon scale. AWS even own some of their own cables and have plans to build more.
  There is _so_ much capacity available on fiber links. For example one newish (Anjana) cable between the US and Europe has 480Tbit/sec capacity. That's just one cable. And that could probably be upgraded to 10-20x that already with newer modulation techniques.
random3 a day ago

reduce network bandwidth from the network-attached SSD volumes, yes?

kosolam 10 hours ago

Hmm.. I have a few questions:

1. How is the cache invalidated to avoid reading stale data?
2. If the multi-AZ setup is for high availability, then I guess the only traffic between zones must be replication from the active zone to the standby zones. In such a setup, a read cache doesn't make much sense.