Facebook's Flashcache:
Use some Flash storage on existing servers
Simple to deploy and use
IO access patterns benefit from a cache
Introduction
- Block cache for Linux - write back and write through modes
- Layered below the filesystem at the top of the storage stack
- Cache Disk Blocks on fast persistent storage (Flash, SSD)
- Loadable Linux Kernel module, built using the Device Mapper (DM)
- Primary use case InnoDB, but general purpose
- Based on dm-cache by Prof. Ming
Caching Modes
Write Back
- Lazy writing to disk
- Persistent across reboot
- Persistent across device removal
Write Through
- Non-persistent
- Are you a pessimist?
Cache Structure
- Set associative hash
- Hash with fixed sized buckets (sets) with linear probing within a set
- 512-way set associative by default
- dbn: Disk Block Number, address of block on disk
- Set = (dbn / block size / set size) mod (number of sets)
- A sequential range of dbns maps onto a single set
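The set mapping formula above can be tried out with a few lines of Python. This is only a sketch: it assumes dbn and block size are both counted in 512-byte sectors, and the geometry (8-sector blocks, 1000 sets) is a made-up example rather than a real configuration.

```python
def cache_set(dbn: int, block_size: int, set_size: int, num_sets: int) -> int:
    """Return the index of the set that a disk block number maps to.

    Assumption: dbn and block_size are both expressed in 512-byte sectors,
    and set_size is the associativity (slots per set).
    """
    return (dbn // block_size // set_size) % num_sets


# Hypothetical geometry: 4 KB blocks (8 sectors), 512-way sets, 1000 sets.
BLOCK_SIZE, SET_SIZE, NUM_SETS = 8, 512, 1000

# 8 * 512 = 4096 consecutive sectors share one set; the next range moves on.
print(cache_set(0, BLOCK_SIZE, SET_SIZE, NUM_SETS))      # 0
print(cache_set(4095, BLOCK_SIZE, SET_SIZE, NUM_SETS))   # 0
print(cache_set(4096, BLOCK_SIZE, SET_SIZE, NUM_SETS))   # 1
```

Because the division happens before the modulo, neighbouring blocks stay together in one set, which is what makes contiguous merging during cleaning possible.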
Structure
Write Back
- Replacement policy is FIFO (default) or LRU within a set
- Switch on the fly between FIFO/LRU (sysctl)
- Metadata per cache block: 24 bytes in memory, 16 bytes on ssd
- On-ssd metadata is kept per slot
- In-memory metadata is kept per slot
Write Through
- Replacement policy is FIFO
- Metadata per slot
- 17 bytes (memory), no metadata stored on ssd
- In memory metadata per-slot
Reads
- Compute cache set for dbn
- Cache Hit
- Verify checksums if configured
- Serve read out of cache
- Cache Miss
- Find free block or reclaim block based on replacement policy
- Read block from disk and populate cache
- Update block checksum if configured
- Return data to user
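The read path above can be modelled with a toy in-memory cache. This is a sketch under simplifying assumptions, not the kernel module's code: the class name `ToyReadCache` is hypothetical, the set is chosen with a plain modulo, replacement is FIFO, and `zlib.crc32` stands in for whatever checksum the module actually uses.

```python
import zlib


class ToyReadCache:
    """Toy model of the read path: set lookup, hit, miss, FIFO reclaim."""

    def __init__(self, disk, num_sets=4, set_size=2, use_checksums=True):
        self.disk = disk                  # dict: block number -> bytes
        self.num_sets = num_sets
        self.set_size = set_size
        self.use_checksums = use_checksums
        # Each set is a list of [block, data, checksum] slots, oldest first.
        self.sets = [[] for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def read(self, block):
        slots = self.sets[block % self.num_sets]   # simplified set mapping
        for slot in slots:
            if slot[0] == block:                   # cache hit
                if self.use_checksums and zlib.crc32(slot[1]) != slot[2]:
                    slots.remove(slot)             # bad checksum: do not use
                    break                          # fall through to the miss path
                self.hits += 1
                return slot[1]
        self.misses += 1                           # cache miss
        if len(slots) >= self.set_size:
            slots.pop(0)                           # reclaim the oldest slot
        data = self.disk[block]                    # read block from disk
        checksum = zlib.crc32(data) if self.use_checksums else None
        slots.append([block, data, checksum])      # populate the cache
        return data
```

With `num_sets=4` and `set_size=2`, reading blocks 1, 5 and 9 in turn fills set 1 and then evicts block 1, so a later read of block 1 is a miss again.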
Write Through - writes
- Compute cache set for dbn
- Cache hit
- Get cached block
- Cache miss
- Find free block or reclaim block
- Write data block to disk
- Write data block to cache
- Update block checksum
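A minimal sketch of the write through steps, assuming a single set held as a Python list with FIFO replacement. The function name `write_through` and the slot layout are hypothetical, and `zlib.crc32` is only a stand-in checksum.

```python
import zlib


def write_through(cache_set, disk, block, data, set_size=2):
    """Write one block in write through mode.

    cache_set is a list of [block, data, checksum] slots, oldest first.
    disk is a dict mapping block number to bytes.
    """
    slot = next((s for s in cache_set if s[0] == block), None)
    if slot is None:                  # cache miss
        if len(cache_set) >= set_size:
            cache_set.pop(0)          # reclaim the oldest slot (FIFO)
        slot = [block, None, None]
        cache_set.append(slot)
    disk[block] = data                # write data block to disk
    slot[1] = data                    # write data block to cache
    slot[2] = zlib.crc32(data)        # update block checksum
```

The disk copy is always current in this mode, so an evicted slot can simply be dropped without any cleaning step.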
Write Back - writes
- Compute cache set for dbn
- Cache Hit
- Write data block into cache
- If data block not DIRTY, synchronously update on-ssd cache metadata to mark block DIRTY
- Cache miss
- Find free block or reclaim block based on replacement policy
- Write data block to cache
- Synchronously update on-ssd cache metadata to mark block DIRTY
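The write back steps can be sketched the same way. The point of this toy is the metadata counter: a write only costs an on-ssd metadata update when the block is not already DIRTY. The class `ToyWriteBackSet` is hypothetical, and reclaiming only clean slots is a simplification made for this example.

```python
INVALID, VALID, DIRTY = "INVALID", "VALID", "DIRTY"


class ToyWriteBackSet:
    """One cache set in write back mode; nothing here touches the disk."""

    def __init__(self, set_size=2):
        self.set_size = set_size
        self.slots = []           # dicts with block, data, state; oldest first
        self.metadata_writes = 0  # count of synchronous on-ssd metadata updates

    def _mark_dirty(self, slot):
        # Only a block that is not yet DIRTY needs the on-ssd metadata update.
        if slot["state"] != DIRTY:
            slot["state"] = DIRTY
            self.metadata_writes += 1

    def write(self, block, data):
        slot = next((s for s in self.slots if s["block"] == block), None)
        if slot is None:                                   # cache miss
            if len(self.slots) >= self.set_size:
                # Toy simplification: reclaim the oldest clean slot only.
                victim = next(
                    (s for s in self.slots if s["state"] == VALID), None
                )
                if victim is None:
                    raise RuntimeError("no clean slot to reclaim in this toy")
                self.slots.remove(victim)
            slot = {"block": block, "data": None, "state": INVALID}
            self.slots.append(slot)
        slot["data"] = data       # write data block into the cache
        self._mark_dirty(slot)    # metadata update only if not already DIRTY
```

Repeated writes to a block that is already DIRTY leave `metadata_writes` unchanged, which is why hot blocks are cheap in this mode.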
Small or uncacheable requests
- First invalidate blocks that overlap the requests
- There are at most 2 such blocks
- For Write Back, if the overlapping blocks are DIRTY they are cleaned first then invalidated
- Uncacheable full block reads are served from cache in case of a cache hit.
- Perform disk IO
- Repeat invalidation to close races which might have caused the block to be cached while the disk IO was in progress
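The "at most 2 such blocks" claim follows from simple arithmetic: a request no larger than one cache block can straddle at most one block boundary. A small sketch, assuming offsets and lengths are counted in sectors and a block is 8 sectors (both are assumptions for the example, and `overlapping_blocks` is a hypothetical helper):

```python
def overlapping_blocks(start, length, block_size=8):
    """Return the starting dbn of every cache block a request touches.

    start and length are in sectors; block_size is sectors per cache block.
    """
    first = start // block_size
    last = (start + length - 1) // block_size
    return [b * block_size for b in range(first, last + 1)]


# A 4-sector request starting at sector 6 straddles the blocks at 0 and 8.
print(overlapping_blocks(6, 4))   # [0, 8]
# The same request aligned inside a block touches only that block.
print(overlapping_blocks(8, 4))   # [8]
```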
Write Back policy
- Default expiration of 30 seconds (work in progress)
- When the number of dirty blocks in a set exceeds a configurable threshold, clean some blocks
- Blocks selected for writeback based on replacement policy
- Default dirty threshold 20%. Set higher for write heavy workloads
- Sort selected blocks and pick up any other blocks in the set that can be contiguously merged with these
- Writes merged by the IO scheduler
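The selection step can be sketched as follows. This is an illustration only: it assumes cleaning brings the set back down to the threshold, that slots are listed oldest first (FIFO), and that "contiguous" means exactly one block apart on disk. The function `pick_blocks_to_clean` is hypothetical.

```python
def pick_blocks_to_clean(slots, set_size, dirty_thresh_pct=20, block_size=8):
    """Choose dirty blocks to write back for one set.

    slots is a list of (dbn, is_dirty) pairs in replacement order, oldest
    first. Returns the chosen dbns sorted by disk address.
    """
    dirty = [dbn for dbn, is_dirty in slots if is_dirty]
    limit = set_size * dirty_thresh_pct // 100
    excess = len(dirty) - limit
    if excess <= 0:
        return []                          # under the threshold: nothing to do
    selected = set(dirty[:excess])         # oldest dirty blocks go first
    remaining = set(dirty) - selected
    # Pick up other dirty blocks that sit next to a selected block on disk,
    # so the IO scheduler can merge them into one larger write.
    grew = True
    while grew:
        grew = False
        for dbn in sorted(remaining):
            if dbn - block_size in selected or dbn + block_size in selected:
                selected.add(dbn)
                remaining.discard(dbn)
                grew = True
    return sorted(selected)
```

For a 10-slot set with a 20% threshold and dirty blocks at 80, 16, 88 and 200 (oldest first), the two oldest are selected and 88 is picked up because it is adjacent to 80, giving `[16, 80, 88]`.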
Write Back – cache metadata overhead
- In-Memory cache metadata memory footprint
- 300GB/4KB cache -> ~1.8GB
- 160GB/4KB cache -> ~960MB
- Cache metadata writes per file system write
- Worst case is 2 cache metadata updates per write
- (VALID->DIRTY, DIRTY->VALID)
- Average case is much lower because of cache write hits and batching of cache metadata updates
Write Through – cache metadata overhead
- In-Memory Cache metadata footprint
- 300GB/4KB cache -> ~1.3GB
- 160GB/4KB cache -> ~700MB
- Cache metadata writes per file system write
- 1 cache data write per file system write
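The in-memory footprint figures for both modes follow from the per-slot sizes quoted earlier (24 bytes for Write Back, 17 bytes for Write Through). A short check, assuming the cache and block sizes are in binary units (GiB and KiB); `metadata_footprint_bytes` is a hypothetical helper name.

```python
def metadata_footprint_bytes(cache_bytes, block_bytes, bytes_per_slot):
    """In-memory metadata size: one slot per cache block."""
    return (cache_bytes // block_bytes) * bytes_per_slot


GIB = 1024 ** 3
MIB = 1024 ** 2

for mode, per_slot in (("Write Back", 24), ("Write Through", 17)):
    for cache_gib in (300, 160):
        size = metadata_footprint_bytes(cache_gib * GIB, 4096, per_slot)
        print(f"{mode}: {cache_gib}GB/4KB cache -> {size / MIB:.0f} MiB")
```

A 160GB Write Back cache has 41,943,040 slots, which at 24 bytes each is exactly 960 MiB, matching the slide; the other figures are rounded.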
Write Back – metadata updates
- Cache (on-ssd) metadata only updated on writes and block cleanings
- (VALID->DIRTY or DIRTY->VALID)
- Cache (on-ssd) metadata not updated on cache population for reads
- Reload after an unclean shutdown only loads DIRTY blocks
- Fast and Slow cache shutdowns
- Only metadata is written on fast shutdown. Reload loads both dirty and clean blocks
- Slow shutdown writes all dirty blocks to disk first, then writes out metadata to the ssd. Reload only loads clean blocks.
- Metadata updates to multiple blocks in same sector are batched
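Batching by sector can be pictured as grouping slot indexes by the metadata sector they live in. This sketch assumes 512-byte sectors and 16-byte on-ssd slots packed back to back from the start of the metadata area, so 32 slots share a sector; the real layout may differ, and `batch_by_sector` is a hypothetical helper.

```python
from collections import defaultdict


def batch_by_sector(slot_indexes, slot_bytes=16, sector_bytes=512):
    """Group pending metadata updates so each sector is written once."""
    per_sector = sector_bytes // slot_bytes     # 32 slots per sector here
    batches = defaultdict(list)
    for idx in slot_indexes:
        batches[idx // per_sector].append(idx)
    return dict(batches)


# Five pending updates collapse into three sector writes.
print(batch_by_sector([0, 5, 31, 32, 70]))
# {0: [0, 5, 31], 1: [32], 2: [70]}
```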
Torn Page Problem
- Handle partial block write caused by power failure or other causes
- Problem exists for Flashcache in Write Back mode
- Detected via block checksums
- Checksums are disabled by default
- Pages with bad checksums are not used
- Checksums increase cache metadata writes and memory footprint
- Update cache metadata checksums on DIRTY->DIRTY block transitions for Write Back
- Each cache slot's metadata grows by 8 bytes to hold the checksum (a 33% increase from 24 bytes to 32 bytes for the Write Back case)
Cache controls for Write Back
- Work best with O_DIRECT file access
- Global modes – Cache All or Cache Nothing
- Cache All has a blacklist of pids and tgids
- Cache Nothing has a whitelist of pids and tgids
- tgids can be used to tag all pthreads in the group as cacheable
- Exceptions for threads within a group are supported
- List changes done via FlashCache ioctls
- Cache can be read but is not written for non-cacheable tgids and pids
Cache Nothing policy
- If the thread id is whitelisted, cache all IOs for this thread
- If the tgid is whitelisted, cache all IOs for this thread
- If the thread id is blacklisted do not cache IOs
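One way to read the three rules above is as a decision function. This is an interpretation, not the module's code: it assumes a blacklisted thread id overrides a whitelisted tgid (the "exceptions for threads within a group" case) and that a thread id on the whitelist wins over everything else. The function name is hypothetical.

```python
def is_cacheable_cache_nothing(pid, tgid, whitelist, blacklist):
    """Decide whether an IO is cached under the Cache Nothing policy.

    whitelist and blacklist are sets of ids; pid is the thread id and
    tgid is the thread group id of the task issuing the IO.
    """
    if pid in whitelist:
        return True            # thread explicitly whitelisted
    if pid in blacklist:
        return False           # per-thread exception within a group
    return tgid in whitelist   # otherwise inherit from the thread group


whitelist = {100}      # tgid 100 tagged as cacheable
blacklist = {103}      # except thread 103 in that group
print(is_cacheable_cache_nothing(101, 100, whitelist, blacklist))  # True
print(is_cacheable_cache_nothing(103, 100, whitelist, blacklist))  # False
print(is_cacheable_cache_nothing(200, 200, whitelist, blacklist))  # False
```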
Utilities
flashcache_create
flashcache_create -b 4k -s 10g flashcache /dev/flash /dev/disk
flashcache_destroy
flashcache_destroy /dev/flash
flashcache_load
sysctl -a | grep flash
dev.flashcache.cache_all
dev.flashcache.fast_remove
dev.flashcache.zero_stats
dev.flashcache.write_merge
dev.flashcache.reclaim_policy
dev.flashcache.pid_expiry_secs
dev.flashcache.max_pids
dev.flashcache.do_pid_expiry
dev.flashcache.max_clean_ios_set
dev.flashcache.max_clean_ios_total
dev.flashcache.debug
dev.flashcache.dirty_thresh_pct
dev.flashcache.stop_sync
dev.flashcache.do_sync
Status
cat /proc/flashcache_stats
Removing FlashCache
umount /data
dmsetup remove flashcache
flashcache_destroy /dev/flash
Resources
- GitHub: facebook/flashcache