Sunday, July 29, 2012

Facebook's Flashcache

Facebook's Flashcache:

  • Use some Flash storage on existing servers
  • Simple to deploy and use
  • IO access patterns benefit from a cache

Introduction

  • Block cache for Linux - write back and write through modes
  • Layered below the filesystem at the top of the storage stack
  • Cache Disk Blocks on fast persistent storage (Flash, SSD)
  • Loadable Linux Kernel module, built using the Device Mapper (DM)
  • Primary use case InnoDB, but general purpose
  • Based on dm-cache by Prof. Ming

Caching Modes

Write Back

  • Lazy writing to disk
  • Persistent across reboot
  • Persistent across device removal

Write Through

  • Non-persistent
  • Are you a pessimist?

Cache Structure

  • Set associative hash
  • Hash with fixed sized buckets (sets) with linear probing within a set
  • 512-way set associative by default
  • dbn: Disk Block Number, address of block on disk
  • Set = (dbn / block size / set size) mod (number of sets)
  • A sequential range of dbns maps onto a single set
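
The set computation above can be sketched as follows. This is a minimal illustration, not the kernel module's code: it assumes dbn and block size are both counted in 512-byte sectors (the unit is not stated here), and the helper name and the number of sets are made up for the example.

```python
def cache_set(dbn, block_sectors=8, set_size=512, num_sets=1024):
    """Set = (dbn / block size / set size) mod (number of sets).

    block_sectors=8 models a 4 KB block of 512-byte sectors (assumption);
    set_size=512 is the default associativity; num_sets is illustrative.
    """
    return (dbn // block_sectors // set_size) % num_sets


# One set covers block_sectors * set_size = 4096 consecutive sectors here.
print(cache_set(0), cache_set(4095), cache_set(4096))
```

Because of the two integer divisions, a run of consecutive dbns lands in the same set, and the mapping wraps around to set 0 after num_sets such runs.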

Cache Structure - Write Back

  • Replacement policy is FIFO (default) or LRU within a set
  • Switch between FIFO and LRU on the fly (sysctl)
  • Metadata is kept per cache block (slot), both in memory and on SSD
  • 24 bytes per slot in memory, 16 bytes per slot on SSD


Cache Structure - Write Through

  • Replacement policy is FIFO
  • Metadata is kept per slot, in memory only
  • 17 bytes per slot in memory; no metadata is stored on SSD


Reads

  • Compute cache set for dbn
  • Cache hit
      • Verify checksums if configured
      • Serve read out of cache
  • Cache miss
      • Find free block or reclaim block based on replacement policy
      • Read block from disk and populate cache
      • Update block checksum if configured
      • Return data to user
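
The read steps above can be modeled with a toy, single-set cache. This is only a sketch under stated assumptions: the class name is hypothetical, a dict stands in for the disk, CRC32 stands in for whatever checksum Flashcache uses, reclaim is plain FIFO, and dirty blocks are ignored.

```python
import zlib
from collections import OrderedDict


class ToyReadCache:
    """Toy read path for a single cache set; not the kernel module's code."""

    def __init__(self, disk, set_size=2, checksums=False):
        self.disk = disk              # dict: dbn -> bytes, stands in for the disk
        self.set_size = set_size      # number of slots in this one set
        self.checksums = checksums    # models "if configured"
        self.slots = OrderedDict()    # dbn -> (data, checksum), oldest first

    def read(self, dbn):
        if dbn in self.slots:                          # cache hit
            data, csum = self.slots[dbn]
            if not self.checksums or zlib.crc32(data) == csum:
                return data, "hit"                     # serve read out of cache
            del self.slots[dbn]                        # bad checksum: do not use it
        if len(self.slots) >= self.set_size:           # cache miss, set is full
            self.slots.popitem(last=False)             # reclaim the oldest block (FIFO)
        data = self.disk[dbn]                          # read block from disk
        csum = zlib.crc32(data) if self.checksums else None
        self.slots[dbn] = (data, csum)                 # populate cache
        return data, "miss"


cache = ToyReadCache({0: b"a", 8: b"b", 16: b"c"}, set_size=2, checksums=True)
print(cache.read(0), cache.read(0))
```

The first read of a block is a miss that populates the set; the second is served from the cache.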

Write Through - writes

  • Compute cache set for dbn
  • Cache hit
      • Get cached block
  • Cache miss
      • Find free block or reclaim block
  • Write data block to disk
  • Write data block to cache
  • Update block checksum
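
A minimal sketch of the Write Through write steps for one set follows. The function name and the dict stand-ins for the disk, the slots, and the checksums are assumptions for illustration, and CRC32 is a placeholder for the real checksum.

```python
import zlib
from collections import OrderedDict


def write_through(slots, checksums, disk, dbn, data, set_size=2):
    """Toy Write Through write for one set; returns True on a cache hit.

    slots: OrderedDict dbn -> data, oldest first (FIFO order).
    checksums, disk: plain dicts keyed by dbn (stand-ins, assumptions).
    """
    hit = dbn in slots                           # cache hit: reuse the cached block
    if not hit and len(slots) >= set_size:       # cache miss with a full set
        victim, _ = slots.popitem(last=False)    # reclaim the oldest block
        checksums.pop(victim, None)
    disk[dbn] = data                             # write data block to disk
    slots[dbn] = data                            # write data block to cache
    checksums[dbn] = zlib.crc32(data)            # update block checksum
    return hit


slots, sums, disk = OrderedDict(), {}, {}
print(write_through(slots, sums, disk, 0, b"a"), write_through(slots, sums, disk, 0, b"b"))
```

Every write reaches the disk before the call returns, so the cache never holds the only copy of a block.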

Write Back - writes

  • Compute cache set for dbn
  • Cache hit
      • Write data block into cache
      • If data block not DIRTY, synchronously update on-SSD cache metadata to mark block DIRTY
  • Cache miss
      • Find free block or reclaim block based on replacement policy
      • Write data block to cache
      • Synchronously update on-SSD cache metadata to mark block DIRTY
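
The Write Back write steps can be sketched the same way. This toy model is an illustration only: the class and its counter are hypothetical, reclaim picks the oldest clean block, and raising an error when every block in the set is dirty is a simplification of the sketch, not a description of what Flashcache does.

```python
from collections import OrderedDict


class ToyWriteBackSet:
    """Toy Write Back write path for one cache set (illustrative only)."""

    def __init__(self, set_size=2):
        self.set_size = set_size
        self.slots = OrderedDict()     # dbn -> [data, dirty], oldest first
        self.ssd_metadata_writes = 0   # counts synchronous on-SSD metadata updates

    def _mark_dirty(self, slot):
        if not slot[1]:                # only a VALID -> DIRTY transition costs a write
            slot[1] = True
            self.ssd_metadata_writes += 1

    def write(self, dbn, data):
        slot = self.slots.get(dbn)
        if slot is not None:                       # cache hit
            slot[0] = data                         # write data block into cache
            self._mark_dirty(slot)
            return "hit"
        if len(self.slots) >= self.set_size:       # cache miss, set is full
            victim = next((d for d, s in self.slots.items() if not s[1]), None)
            if victim is None:
                raise RuntimeError("all blocks dirty; a real cache cleans first")
            del self.slots[victim]                 # reclaim the oldest clean block
        slot = [data, False]
        self.slots[dbn] = slot                     # write data block to cache
        self._mark_dirty(slot)
        return "miss"


wb = ToyWriteBackSet()
print(wb.write(0, b"a"), wb.write(0, b"b"), wb.ssd_metadata_writes)
```

The second write to an already DIRTY block needs no metadata update, which is why write hits keep the metadata write count low.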

Small or uncacheable requests

  • First invalidate blocks that overlap the request
  • There are at most 2 such blocks
  • For Write Back, if the overlapping blocks are DIRTY, they are cleaned first and then invalidated
  • Uncacheable full block reads are served from cache in case of a cache hit
  • Perform disk IO
  • Repeat invalidation to close races which might have caused the block to be cached while the disk IO was in progress
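
To see why at most 2 blocks overlap, consider a request no larger than one cache block: it either fits inside one block or straddles one block boundary. The helper below is a hypothetical sketch that assumes sector-based addressing and 8-sector (4 KB) blocks.

```python
def overlapping_blocks(start_sector, num_sectors, block_sectors=8):
    """Return the starting dbn of every cache block the request touches."""
    first = start_sector // block_sectors
    last = (start_sector + num_sectors - 1) // block_sectors
    return [b * block_sectors for b in range(first, last + 1)]


# A 4-sector request starting at sector 6 straddles the blocks at dbn 0 and 8.
print(overlapping_blocks(6, 4))
# The same request aligned to a block boundary touches only the block at dbn 8.
print(overlapping_blocks(8, 4))
```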

Write Back policy

  • Default expiration of 30 seconds (work in progress)
  • When the dirty blocks in a set exceed a configurable threshold, clean some blocks
  • Blocks selected for writeback based on replacement policy
  • Default dirty threshold 20%. Set higher for write heavy workloads
  • Sort the selected blocks and pick up any other blocks in the set that can be contiguously merged with them
  • Writes merged by the IO scheduler
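
The selection of blocks to clean might look roughly like the sketch below. Several details are assumptions made for illustration: the function name, FIFO as the replacement policy, 8-sector blocks, and cleaning just enough blocks to get back to the threshold (the text above only says "some blocks").

```python
def select_for_cleaning(fifo_slots, dirty_thresh_pct=20, block_sectors=8):
    """Pick dirty blocks of one set to write back, returned in sorted dbn order.

    fifo_slots: list of (dbn, dirty) tuples, oldest first.
    Cleaning just enough blocks to return to the threshold is an assumption.
    """
    threshold = len(fifo_slots) * dirty_thresh_pct // 100
    dirty = [dbn for dbn, is_dirty in fifo_slots if is_dirty]
    if len(dirty) <= threshold:
        return []
    picked = set(dirty[: len(dirty) - threshold])   # oldest dirty blocks first
    grew = True
    while grew:                                     # pull in contiguous neighbours
        grew = False
        for dbn in dirty:
            if dbn not in picked and (
                dbn - block_sectors in picked or dbn + block_sectors in picked
            ):
                picked.add(dbn)
                grew = True
    return sorted(picked)


slots = [(80, True), (0, False), (88, True), (8, False), (200, True),
         (16, False), (24, False), (32, False), (40, False), (48, False)]
print(select_for_cleaning(slots))
```

Here 3 of 10 blocks are dirty against a threshold of 2, so the oldest dirty block (dbn 80) is picked and its contiguous dirty neighbour (dbn 88) is added so the two writes can be merged.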

Write Back – cache metadata overhead

  • In-memory cache metadata footprint
      • 300GB cache with 4KB blocks -> ~1.8GB
      • 160GB cache with 4KB blocks -> ~960MB
  • Cache metadata writes per file system write
      • Worst case is 2 cache metadata updates per write (VALID->DIRTY, DIRTY->VALID)
      • Average case is much lower because of cache write hits and batching of cache metadata updates

Write Through – cache metadata overhead

  • In-memory cache metadata footprint
      • 300GB cache with 4KB blocks -> ~1.3GB
      • 160GB cache with 4KB blocks -> ~700MB
  • Cache metadata writes per file system write
      • 1 cache data write per file system write (no metadata is stored on SSD)
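
The in-memory footprint figures for both modes follow from the per-slot metadata sizes (24 bytes for Write Back, 17 bytes for Write Through). The arithmetic can be checked with a short sketch that assumes binary units (1 GB = 2^30 bytes); the function name is made up for the example.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def metadata_footprint(cache_bytes, block_bytes=4096, bytes_per_slot=24):
    """In-memory metadata size: one slot of metadata per cache block."""
    return (cache_bytes // block_bytes) * bytes_per_slot


for size_gb in (300, 160):
    wb = metadata_footprint(size_gb * GIB, bytes_per_slot=24)   # Write Back
    wt = metadata_footprint(size_gb * GIB, bytes_per_slot=17)   # Write Through
    print(f"{size_gb} GB cache: Write Back {wb // MIB} MiB, Write Through {wt // MIB} MiB")
```

Under those assumptions the results are 1800 MiB and 960 MiB for Write Back, and 1275 MiB and 680 MiB for Write Through, in line with the rounded figures quoted above.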

Write Back – metadata updates

  • Cache (on-SSD) metadata is updated only on writes and block cleanings (VALID->DIRTY or DIRTY->VALID)
  • Cache (on-SSD) metadata is not updated on cache population for reads
  • Reload after an unclean shutdown only loads DIRTY blocks
  • Fast and slow cache shutdowns
      • Fast shutdown writes only the metadata. Reload loads both dirty and clean blocks
      • Slow shutdown writes all dirty blocks to disk first, then writes out metadata to the SSD. Reload only loads clean blocks
  • Metadata updates to multiple blocks in the same sector are batched
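
Batching by sector can be illustrated with a small sketch. It assumes 16-byte on-SSD metadata entries packed back to back into 512-byte sectors, which gives 32 slots per sector; the packing and the function name are assumptions for the example, not taken from the Flashcache source.

```python
from collections import defaultdict


def batch_metadata_updates(slot_indices, md_bytes_per_slot=16, sector_bytes=512):
    """Group pending slot updates by the metadata sector that holds them."""
    slots_per_sector = sector_bytes // md_bytes_per_slot   # 32 under these assumptions
    batches = defaultdict(list)
    for slot in slot_indices:
        batches[slot // slots_per_sector].append(slot)
    return dict(batches)


# Five pending updates collapse into three sector writes.
print(batch_metadata_updates([0, 5, 31, 32, 70]))
```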

Torn Page Problem

  • Handle partial block write caused by power failure or other causes
  • Problem exists for Flashcache in Write Back mode
  • Detected via block checksums
  • Checksums are disabled by default
  • Pages with bad checksums are not used
  • Checksums increase cache metadata writes and memory footprint
  • Cache metadata checksums are updated on DIRTY->DIRTY block transitions for Write Back
  • Each per-slot metadata entry grows by 8 bytes to hold the checksum (a 33% increase, from 24 bytes to 32 bytes, for Write Back)

 

Cache controls for Write Back

  • Works best with O_DIRECT file access
  • Global modes – Cache All or Cache Nothing
  • Cache All has a blacklist of pids and tgids
  • Cache Nothing has a whitelist of pids and tgids
  • tgids can be used to tag all pthreads in the group as cacheable
  • Exceptions for threads within a group are supported
  • List changes are made via Flashcache ioctls
  • For non-cacheable tgids and pids, the cache can be read but is not written

Cache Nothing policy

  • If the thread id is whitelisted, cache all IOs for this thread
  • If the tgid is whitelisted, cache all IOs for this thread
  • If the thread id is blacklisted do not cache IOs
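
One way to read the three rules above is as a decision function. The sketch below is hypothetical: the function name is invented, and the precedence (a whitelisted thread id always caches, while a blacklisted thread id overrides a whitelisted tgid) is an assumption about how the rules combine.

```python
def cacheable_in_cache_nothing_mode(pid, tgid, whitelist, blacklist):
    """Decide whether an IO from thread `pid` in group `tgid` is cached."""
    if pid in whitelist:           # the thread itself is whitelisted
        return True
    if tgid in whitelist:          # the whole group is whitelisted...
        return pid not in blacklist   # ...unless this thread is an exception
    return False                   # Cache Nothing: default is not to cache


whitelist, blacklist = {100}, {102}
print(cacheable_in_cache_nothing_mode(101, 100, whitelist, blacklist))
print(cacheable_in_cache_nothing_mode(102, 100, whitelist, blacklist))
```

In this example tgid 100 is whitelisted, so thread 101 is cached while the blacklisted thread 102 in the same group is not.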

 

Utilities

flashcache_create
flashcache_create -b 4k -s 10g flashcache /dev/flash /dev/disk
flashcache_destroy
flashcache_destroy /dev/flash
flashcache_load
sysctl -a | grep flash

dev.flashcache.cache_all
dev.flashcache.fast_remove
dev.flashcache.zero_stats
dev.flashcache.write_merge
dev.flashcache.reclaim_policy
dev.flashcache.pid_expiry_secs
dev.flashcache.max_pids
dev.flashcache.do_pid_expiry
dev.flashcache.max_clean_ios_set
dev.flashcache.max_clean_ios_total
dev.flashcache.debug
dev.flashcache.dirty_thresh_pct
dev.flashcache.stop_sync
dev.flashcache.do_sync 

 

Status

cat /proc/flashcache_stats

 

Removing FlashCache

umount /data
dmsetup remove flashcache
flashcache_destroy /dev/flash

 

Resources

▪ GitHub : facebook/flashcache