Advanced Topics in Computer Systems
Lecture 3
Joe Hellerstein & Eric Brewer
UNIX Fast File System and Modern Filesystem Metadata Handling
FFS
First, the original UNIX filesystem:
- All operations made to appear synchronous
- All disk blocks are the same size: 512 bytes.
- Disk has 3 sections:
- Superblock (#datablocks, max#files, freelist ptr)
- inode tree
- data blocks
- inodes stored at front of filesystem
- Performed abominably: as low as 2% of the potential bandwidth
of the spinning disk platters. Why?
- small block size,
- poor freelist organization -- "consecutive" blocks are far apart
- poor inode locality (inodes far from data, inodes of
different files in a directory not close to each other)
What are the main issues in the paper?
- Performance improvements based on disk layout: storage/xfer units and
data placement
- Feature enhancements, including redundancy for superblocks,
soft links, long file names, support for advisory ("soft") locks
on files, atomic rename, and quotas.
Block size:
- Increase the block size. Pros/Cons?
- Fragments: 2, 4, or 8 per block
- For small files and ends of files.
- Remind you of anything else?
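The block-size trade-off can be made concrete by counting the space lost in the tail of a file. The sketch below is illustrative only: the helper names `allocated_bytes` and `wasted_bytes`, and the 4096-byte block used in the usage note, are assumptions, not values from the notes or the paper.

```python
def allocated_bytes(file_size, block_size, frags_per_block=1):
    """Disk bytes consumed by a file whose tail may be stored in fragments.

    Whole blocks hold the body of the file; the remainder is rounded up
    to a whole number of fragments, where a fragment is
    block_size / frags_per_block bytes. frags_per_block=1 means "no
    fragments": the tail occupies a full block.
    """
    frag_size = block_size // frags_per_block
    full_blocks, tail = divmod(file_size, block_size)
    tail_frags = -(-tail // frag_size)  # ceiling division
    return full_blocks * block_size + tail_frags * frag_size


def wasted_bytes(file_size, block_size, frags_per_block=1):
    """Internal fragmentation: allocated space minus the file's real size."""
    return allocated_bytes(file_size, block_size, frags_per_block) - file_size
```

For a 100-byte file, `wasted_bytes(100, 512)` is 412, `wasted_bytes(100, 4096)` is 3996, and `wasted_bytes(100, 4096, 4)` is 924: fragments recover most of the space that a larger block would otherwise waste on small files.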
Unorganized Freelist
- Starts out linear and packed; over time it becomes random. Hard
to fix for good.
- Instead, switch to bitmaps for free space representation.
Locality
- Make sure not to overfill the disk, and you'll usually find a free
block nearby.
- Attempt to cluster data: i.e. keep related things close,
unrelated things far apart. Define related. Define close/far.
- Typical FS definition of "related":
- Blocks of a file, in sequence. Files in a directory and their
inodes.
- Try to keep all files in a directory in same cylinder
group.
- Try to get directories spread out across cylinder
groups.
- Big files are spread across cylinder groups, to prevent having a
single file fill up a cylinder group.
- Cylinder group: copy of superblock, fixed # of inodes, bitmap of
free blocks, usage summary for the high-level allocation policy, data
blocks.
- FFS definition of close: skip-sector
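The placement rules above (spread directories out, keep a directory's files together, break up big files) can be sketched as three small functions. This is one plausible heuristic in the spirit of those rules; the exact tie-breaking, the `CylinderGroup` fields, and the `blocks_per_chunk` parameter are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CylinderGroup:
    index: int
    free_inodes: int
    ndirs: int  # directories already placed in this group


def pick_group_for_dir(groups):
    """Spread directories: prefer groups with at least average free inodes,
    and among those the one holding the fewest directories."""
    avg = sum(g.free_inodes for g in groups) / len(groups)
    candidates = [g for g in groups if g.free_inodes >= avg]
    return min(candidates, key=lambda g: (g.ndirs, g.index)).index


def pick_group_for_file(parent_dir_group):
    """Cluster files: a new file starts in its directory's group."""
    return parent_dir_group


def group_for_block(start_group, block_no, blocks_per_chunk, ngroups):
    """Break up big files: move to the next group after every chunk."""
    return (start_group + block_no // blocks_per_chunk) % ngroups
```

With four groups, a file starting in group 2 and a chunk size of 256 blocks places block 0 in group 2, block 256 in group 3, and block 512 in group 0.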
Superblock replication
- spatially diverse replication of key data, on a single
device!
- Redundancy suggests metadata is more important than data? We'll
see this in a minute...
Metadata updates are one key source of FFS seek overhead.
- Metadata writes are poorly localized.
- E.g., extending a file requires writes to the inode, direct and
indirect blocks, cylinder group bit maps and summaries, and
the file block itself.
- Metadata writes can be delayed, but this incurs a higher risk
of file system corruption in a crash.
- If you lose your metadata, you are dead in the water.
- FFS schedules metadata block writes carefully to limit the
kinds of inconsistencies that can occur. Some metadata updates
must be synchronous on controllers that don't respect order of
writes.
Seltzer, et al
FFS metadata is basically superblock and inodes. At minimum, want
to have a valid superblock and valid inodes. How to assure
this?
- may need atomic update of multiple blocks (e.g. directory and the block it refers to)
- solution one: ordered writes (simple for manipulating subtrees)
- solution two: a more general mechanism for arbitrary
transactions: write-ahead logging (WAL).
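Solution one (ordered writes) can be illustrated with a file create: if the inode always reaches disk before the directory entry that names it, then a crash at any point leaves no entry pointing at an uninitialized inode. The sketch below simulates crashes as prefixes of the write sequence; the function names and the dictionary "disk" are hypothetical.

```python
def create_file_writes(name, ino):
    """Ordered writes for a create: the inode goes to disk before the name."""
    return [("inode", ino, "initialized"), ("dirent", name, ino)]


def apply_prefix(writes, n):
    """Simulate a crash after only the first n writes reached the disk."""
    disk = {"inodes": set(), "dirents": {}}
    for kind, key, val in writes[:n]:
        if kind == "inode":
            disk["inodes"].add(key)
        else:
            disk["dirents"][key] = val
    return disk


def has_dangling_entry(disk):
    """True if some directory entry names an inode that is not on disk."""
    return any(ino not in disk["inodes"] for ino in disk["dirents"].values())
```

With the order above, every crash point (0, 1, or 2 writes completed) is free of dangling entries. Reversing the two writes and crashing after the first one leaves a directory entry that refers to garbage.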
Soft updates postpone the writes and track dependencies (a poset
of actions) for subsequent writing. Problem: cyclic dependencies
between blocks. (Example from paper.) Solution: write pages with only
selected actions applied, as needed ("rollback/roll-forward"). Needed
because of update-in-place.
- Key feature is asynchrony of metadata updates. This is not just a soft updates thing.
- Another key feature is background delete of data (delete metadata
now, reclaim data later). Again, not tied to soft updates per
se.
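The rollback idea can be sketched with a create and a delete that share a directory block and an inode block, so that each block must be written "before" the other. This is a simplified model under stated assumptions, not the real soft updates code: the `Update` class, `write_block`, and the dictionary blocks are hypothetical.

```python
class Update:
    """One in-memory change to a block, with an optional ordering dependency."""

    def __init__(self, block, key, new, old, depends_on=None):
        self.block = block
        self.key = key
        self.new = new                # value after the update (None = removed)
        self.old = old                # value before the update (None = absent)
        self.depends_on = depends_on  # Update that must reach disk first
        self.on_disk = False


def write_block(name, memory, disk, updates):
    """Write block `name`, rolling back updates whose dependency is unmet."""
    image = dict(memory[name])  # copy; the in-memory block stays up to date
    safe = []
    for u in updates:
        if u.block != name or u.on_disk:
            continue
        dep = u.depends_on
        if dep is not None and not dep.on_disk:
            # Roll back: the written image shows the pre-update value.
            if u.old is None:
                image.pop(u.key, None)
            else:
                image[u.key] = u.old
        else:
            safe.append(u)
    disk[name] = image
    for u in safe:
        u.on_disk = True


def consistent(disk):
    """No on-disk directory entry may name an inode not allocated on disk."""
    return all(ino in disk["inodes"] for ino in disk["dir"].values())


# Create file A (inode 4) and delete file B (inode 5); both names live in
# one directory block and both inodes live in one inode block.
disk = {"dir": {"B": 5}, "inodes": {5: "allocated"}}
memory = {"dir": {"A": 4}, "inodes": {4: "allocated"}}

init_4 = Update("inodes", 4, "allocated", None)
add_a = Update("dir", "A", 4, None, depends_on=init_4)
del_b = Update("dir", "B", None, 5)
free_5 = Update("inodes", 5, None, "allocated", depends_on=del_b)
updates = [init_4, add_a, del_b, free_5]

history = []
for block in ["dir", "inodes", "dir"]:
    write_block(block, memory, disk, updates)
    history.append({k: dict(v) for k, v in disk.items()})
```

The first write of the directory block goes out without the entry for A (rolled back), the inode block then goes out in full, and a second write of the directory block rolls the entry forward. The disk is consistent after every step, at the cost of writing the directory block twice.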
Journaling filesystem basically does WAL-based transactional
recovery for metadata. Each cached (metadata) page has a first-update
LSN (since paged in) and a last-update LSN. The former allows you to
do log truncation: log records before the oldest first-update LSN in
cache can be discarded. The latter helps you ensure the WAL property:
before writing back the page, you flush the log up to the last-update
LSN. Superblock keeps location of beginning of log, which changes on
each flush! (Not the way ARIES works, BTW). "Checkpoint" in this
context means flush the log (not like an ARIES checkpoint record.)
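The two LSNs per cached page can be sketched as follows. This is a minimal model, assuming an in-memory list stands in for the log; the class names `Journal` and `MetadataCache` and their methods are hypothetical and do not correspond to any particular filesystem's code.

```python
class Journal:
    def __init__(self):
        self.records = []     # (lsn, page_id, payload) tuples, oldest first
        self.next_lsn = 1
        self.flushed_lsn = 0  # highest LSN forced to stable storage

    def append(self, page_id, payload):
        lsn = self.next_lsn
        self.next_lsn += 1
        self.records.append((lsn, page_id, payload))
        return lsn

    def flush(self, upto_lsn):
        self.flushed_lsn = max(self.flushed_lsn, upto_lsn)

    def truncate(self, keep_from_lsn):
        """Discard log records older than keep_from_lsn."""
        self.records = [r for r in self.records if r[0] >= keep_from_lsn]


class MetadataCache:
    def __init__(self, journal):
        self.journal = journal
        self.dirty = {}  # page_id -> [first_update_lsn, last_update_lsn]

    def update(self, page_id, payload):
        lsn = self.journal.append(page_id, payload)
        if page_id not in self.dirty:
            self.dirty[page_id] = [lsn, lsn]  # first update since paged in
        else:
            self.dirty[page_id][1] = lsn      # only the last-update LSN moves
        return lsn

    def write_back(self, page_id):
        first_lsn, last_lsn = self.dirty.pop(page_id)
        # WAL property: the log is flushed up to the page's last-update LSN
        # before the page itself is written in place.
        self.journal.flush(last_lsn)

    def truncation_point(self):
        """Oldest first-update LSN in cache; older records can be dropped."""
        if not self.dirty:
            return self.journal.next_lsn
        return min(first for first, _ in self.dirty.values())
```

After updating an inode block (LSN 1), a bitmap block (LSN 2), and the inode block again (LSN 3), writing back the inode block forces the log through LSN 3, and the truncation point becomes 2 because the bitmap block is still dirty.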
Notes:
- File vs. WAFS: the journal kept in a file inside the filesystem
vs. in a separate write-ahead file system (WAFS).
- Group commit. Not doing group commit makes all the journaling
numbers questionable: it should be possible to achieve much better
throughput at the expense of latency.
- async vs. sync logging. What is async logging? Each update will
generate a consistent result, but application code may not have the
right picture of which updates actually succeeded.
- What is soft updates trying to guarantee? "file system integrity
but not durability". What does that mean? Define integrity? No
dangling inodes. Is the soft semantics worth the performance gain?
- The word "semantics" is bandied about a lot here, but is very
operational (the following individual things can go wrong...).
E.g., two versions of a file name can be in the directory after a
crash during a rename.
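The group commit point in the notes above comes down to paying for one log flush per batch rather than one per transaction. The toy model below only counts flushes; the class name `GroupCommitLog` and the fixed `batch_size` trigger are assumptions, and a real implementation would typically also flush on a timer so that a lone commit is not delayed indefinitely.

```python
class GroupCommitLog:
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []   # committed in memory, not yet durable
        self.durable = []   # commits covered by a completed flush
        self.flushes = 0    # number of (expensive) synchronous log writes

    def commit(self, txn):
        """Queue a commit; flush once a full batch has accumulated."""
        self.pending.append(txn)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """One log write makes every pending commit durable."""
        if self.pending:
            self.durable.extend(self.pending)
            self.pending.clear()
            self.flushes += 1
```

Ten commits cost ten flushes with `batch_size=1` but only two with `batch_size=5`. The price is latency: the first commit in each batch is not durable until the batch fills.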
More on Journaling
A helpful paper on journaling filesystems is Prabhakaran & the
Arpaci-Dusseaus' "Analysis and Evolution of Journaling File Systems"
from USENIX '05. It analyzes a number of the systems that are out
there. A couple of examples (more in the paper):
- Linux Ext3:
- Three modes:
- writeback mode: metadata is journaled, but journal needn't be write-ahead. Guarantees consistent metadata, but can have inconsistent data blocks.
- ordered journaling mode: much like the Seltzer paper
- data journaling mode: both metadata & data block updates
are journaled. Can be faster or slower than ordered. Faster on
random async writes.
- Compound Transaction groups many updates into a single commit. Good if frequent updates to same block (e.g. free space bitmap)
- full block logging, not diffs.
- NTFS
- Every object in NTFS is a file. Even metadata is stored in
files. The journal is a file stored in the middle of the FS.
- NTFS does only metadata journaling. It does not journal whole blocks; it logs change records instead.
- Does indeed do WAL ("ordered journaling")
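The compound-transaction point under Ext3 above can be sketched briefly: with full-block logging, repeated updates to the same block within one transaction collapse into a single logged block image. The class `CompoundTransaction` and the list used as a journal are hypothetical, for illustration only.

```python
class CompoundTransaction:
    def __init__(self):
        self.blocks = {}       # block number -> latest full block image
        self.update_count = 0  # how many updates were folded in

    def log_update(self, block_no, image):
        """Record a full-block image; a later update to the same block
        simply replaces the earlier image."""
        self.blocks[block_no] = bytes(image)
        self.update_count += 1

    def commit(self, journal):
        """Append each touched block once, then a commit record.
        Returns the number of block images written."""
        for block_no in sorted(self.blocks):
            journal.append((block_no, self.blocks[block_no]))
        written = len(self.blocks)
        journal.append(("commit", written))
        self.blocks = {}
        self.update_count = 0
        return written
```

Five successive changes to a free-space bitmap block plus one inode-block change produce six updates but only two logged blocks, which is why grouping pays off for frequently modified blocks.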