
ZFS

Introduction


Once named the Zettabyte File System but now known simply as ZFS, it is a combined file system and logical volume manager designed by Sun Microsystems. The features of ZFS include protection against data corruption, support for high storage capacities, efficient data compression and deduplication, integration of the concepts of filesystem and volume management, snapshots and copy-on-write clones, encryption, continuous integrity checking and automatic repair, RAID-Z, and native NFSv4 ACLs.

ZFS was originally implemented as open-source software, licensed under the Common Development and Distribution License (CDDL). The ZFS name is registered as a trademark of Oracle Corporation.

OpenZFS is an umbrella project aimed at bringing together individuals and companies that use the ZFS file system and work on its improvements.

Structure, design and key principles

Structure



Design


A ZFS “pool” is analogous to a “volume group” in other logical volume management systems. Each pool is composed of “virtual devices,” which can be raw storage devices (disks, partitions, SAN devices, etc.), mirror groups, or RAID arrays.

ZFS RAID is similar in spirit to RAID 5 in that it uses one or more parity devices to provide redundancy for the array. However, ZFS calls the scheme RAID-Z and uses variable-sized stripes to eliminate the RAID 5 write hole. All writes to the storage pool are striped across the pool’s virtual devices, so a pool that contains only individual storage devices is effectively an implementation of RAID 0, although the devices in this configuration are not required to be of the same size.
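For example, a RAID-Z pool and a plain striped pool could be created as sketched below (pool and device names are hypothetical):

    # create a single-parity RAID-Z pool named "tank" from four disks
    zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0

    # a pool built only from bare disks has no redundancy and is effectively RAID 0
    zpool create scratch c2t0d0 c2t1d0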

Although you can turn over raw, unpartitioned disks to ZFS’s care, ZFS secretly writes a GPT-style partition table onto them and allocates all of each disk’s space to its first partition.

Historically, a big underlying problem with storage in general was how to manage metadata on larger disks: as disks became bigger, performance became compromised, and with fragmentation the metadata area would not grow linearly.
Key principles

  • ZFS allows for very quick additions of new disks because it just marks the physical device as ready and doesn’t have to build volumes, set up partitions or format.
  • ZFS consolidates the block layer and the DM/LVM layer into one. This allows particular parts of the physical disk to be allocated automatically on request, so that, for example, just the edges of the disk can be used for quicker access. The traditional extra-layer approach does not allow this because the information cannot be passed through or leap over the layer
  • ZFS record size is analogous to block size. It handles record size dynamically when the block allocation layer allocates space as part of the write transaction
  • De-fragmentation is handled automatically within ZFS
  • Data recovery is made easier because when a file is updated, the modified block isn't overwritten in place; a whole new block is written with the update, and the original is freed and marked in a special table as recently changed
  • Snapshots and clones on ZFS
  • Just as in a logical volume manager, ZFS brings copy-on-write to the user level by allowing you to create instantaneous snapshots. However, there’s an important difference: ZFS snapshots are implemented per-filesystem rather than per-volume, so they have arbitrary granularity.
  • ZFS just keeps a metadata record for each snapshot.
  • Unlike LVM, which has to refer to multiple tables and traverse every snapshot back in time (so LVM snapshots can affect performance), ZFS directly accesses the referenced snapshot. ZFS is effectively only impacted during the import and initialisation of the whole zpool.
  • ZFS is organized around the principle of copy-on-write. Instead of overwriting disk blocks in place, ZFS allocates new blocks and updates pointers. This approach makes ZFS resistant to corruption because operations can never end up half-completed in the event of a power failure or crash. Either the root block is updated or it’s not; the filesystem is consistent either way (though a few recent changes may be “undone”).
  • ZFS snapshots are read-only, and although they can bear properties, they are not true filesystems
  • Cloning isn’t a common operation, but it’s the only way to create a branch in a filesystem’s evolution. The zfs rollback operation can only revert a filesystem to its most recent snapshot, so to use it you must permanently zfs destroy any snapshots made since the snapshot that is your reversion target.
  • Cloning lets you go back in time without losing access to recent changes. For example, if you had a security breach within the last week and you want to revert a filesystem to its state a week ago (to be sure it contains no hacker-installed back doors whilst still retaining data from recent work or data for forensic analysis), the solution is (see the command sequence after this list):
  • clone the week-ago snapshot to a new filesystem
  • zfs rename the old filesystem
  • then zfs rename the clone in place of the original filesystem.
  • For good measure, you should also zfs promote the clone (this inverts the relationship between the clone and the filesystem of origin).
  • After promotion, the main-line filesystem has access to all the old filesystem’s snapshots, and the old, moved-aside filesystem becomes the “cloned” branch.
  • With ZFS, there is no need to regularly manage /etc/fstab (or /etc/vfstab): when a ZFS filesystem is created, it is automatically mounted under the pool's root directory as part of the creation
  • ZFS allows the delegation of permissions for volume creation. For example, if users' home directories were on a ZFS volume called home, users could be given permission to create their own sub-volumes (see the zfs allow sketch after this list).
  • ZFS RAID manages replication at the block layer and isn't tied to single-disk-to-single-disk replication, allowing more distribution and better performance. This even allows redundancy and mirroring on a single physical disk.
  • With every read operation, a checksum is verified and the data is recovered from a mirrored block if necessary, effectively correcting errors constantly and removing any need to fsck a filesystem
  • Compression is done at the block layer, meaning each block has a record of whether it is compressed or not. This means that compression can be enabled and disabled at will, with some blocks compressed and some not; several algorithms are available when enabling compression, for example lzjb, gzip and (in OpenZFS) lz4 (see the compression example after this list)
  • IOs are handled totally differently (Tomasz Kloczko to explain further)
  • L2ARC improves caching of disk I/O. It is a cache layer in between main memory and disk, using flash-memory-based SSDs, designed to improve the performance of random read workloads. Writes to the cache devices are done asynchronously. L2ARC attempts to cache data from the ARC before it is evicted, by periodically scanning buffers from the eviction end of the MFU and MRU ARC lists.
  • ZIL, the ZFS Intent Log, satisfies the POSIX requirements for synchronous writes and crash recovery. ZFS system calls are logged by the ZIL with sufficient information to replay them in the event of a system crash. Note that the maximum useful size of a log device is approximately half the size of physical memory, because that is the maximum amount of potential in-play data that can be stored (see the log device sketch after this list)
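A sketch of the clone-and-promote recovery described above, assuming a filesystem datapool/home and an existing week-old snapshot called week-ago:

    zfs clone datapool/home@week-ago datapool/home_clean   # branch off the known-good snapshot
    zfs rename datapool/home datapool/home_compromised     # move the current filesystem aside
    zfs rename datapool/home_clean datapool/home           # put the clone in place of the original
    zfs promote datapool/home                              # invert the clone/origin relationship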
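A minimal sketch of delegated administration with zfs allow, assuming a user alice and a filesystem datapool/home/alice:

    # let alice manage her own sub-filesystems without root privileges
    zfs allow alice create,destroy,mount,snapshot datapool/home/alice

    # alice can then run, for example:
    #   zfs create datapool/home/alice/projects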
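For example (dataset name assumed), compression can be toggled at any time; only blocks written while it is enabled are stored compressed:

    zfs set compression=lz4 datapool/docs    # lz4 on OpenZFS; lzjb or gzip on older releases
    zfs get compression datapool/docs        # confirm the current setting
    zfs set compression=off datapool/docs    # previously compressed blocks remain compressed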
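A sketch of adding a dedicated ZIL (log) device, with hypothetical device names:

    zpool add datapool log c2t0d0                   # a single dedicated log (SLOG) device
    # or, for safety, a mirrored pair:
    #   zpool add datapool log mirror c2t0d0 c2t1d0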

Filesystems and properties


By default, ZFS filesystems consume no particular amount of space. All filesystems that live in a pool can draw from the pool's available space.

Unlike traditional filesystems, which are independent of one another, ZFS filesystems are hierarchical and interact with their parent and child filesystems in several ways.
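A minimal sketch of this hierarchy (dataset names are illustrative): children are created beneath a parent, inherit its properties, and can be given quotas so they stop drawing freely from the pool:

    zfs create datapool/home                  # parent filesystem
    zfs create datapool/home/alice            # child inherits properties from datapool/home
    zfs set quota=10G datapool/home/alice     # cap the space this child can take from the pool
    zfs get -r compression datapool/home      # shows the (inherited) value on every child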

ZFS distributes writes among all a pool’s virtual devices. Therefore it is not necessary for all virtual devices to be the same size.

ZFS has an especially nice implementation of read caching that makes good use of SSDs. To set up this configuration, just add the SSDs to the storage pool as vdevs of type cache. The caching system then uses an adaptive replacement algorithm developed at IBM that is smarter than a normal LRU (least recently used) cache. It knows about the frequency at which blocks are referenced as well as their recency of use, so reads of large files are not supposed to wipe out the cache.
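For example (device name assumed), an SSD is added as a cache vdev and can then be seen in the pool layout:

    zpool add datapool cache c3t0d0    # add an SSD as an L2ARC cache device
    zpool iostat -v datapool           # the device appears under a separate "cache" section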

Hot spares are handled as vdevs of type spare. You can add the same disk to multiple storage pools; whichever pool experiences a disk failure first gets to claim the spare disk.
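A quick sketch of registering a hot spare (names are hypothetical):

    zpool add datapool spare c4t0d0    # register the disk as a hot spare
    zpool status datapool              # the disk is listed under "spares" until a failure activates it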

Configuration

To be completed

Administration


Only two main commands are required to manage ZFS; some example subcommands are listed below. Use zpool to build and manage storage pools. Use zfs to create and manage the entities created from pools, chiefly filesystems and the raw volumes used as swap space and database storage.
zfs - configures ZFS file systems
  • list -t snapshot - list all file-system snapshots
  • destroy datapool/filesys@11Jan - destroy a snapshot
  • create - create a file-system
  • set - set file-system properties such as compression, deduplication & quotas.
  • get - get the value of a file-system property
  • list - list file-system information
  • rename - rename a file-system and its mountpoint; un/mounting is handled automatically, and -p creates any parent datasets needed
zpool - configures ZFS storage pools
  • add - add new block devices (vdevs) to a pool; subsequent writes are striped across them as well
  • attach - attach a disk to another (as in a mirror)
  • clear - clears device error counts and fault state for a pool, so a pool that was degraded can be returned to a healthy state once the underlying problem is fixed
  • destroy - Destroys the given pool, freeing up any devices for other use.
  • get all rpool - shows all attributes of the zpool
  • get all volume - shows all attributes of a (sub)volume; note that dataset-level attributes are reported by zfs get all, as zpool get operates on whole pools
  • list - shows each pool in a list with its utilisation (zfs list does the same for datasets, where each parent volume shows the aggregated usage of all of its sub-volumes)
  • history - show list of all actions on the pool over time
  • iostat N N - gives I/O stats at the ZFS layer (i.e. to the physical disks), like normal iostat; the two numbers are the sampling interval and count
  • status - show zpool status (add -v for additional pool health detail); the output has columns for NAME, STATE, READ, WRITE & CKSUM. See http://docs.oracle.com/cd/E19253-01/819-5461/gamno/index.html
  • scrub - check data integrity on the pool. This operation traverses all the data in the pool once and verifies that all blocks can be read; it runs at a lower priority to avoid degrading performance. Progress can be viewed with the status command after the scrub has been started (see the example after this list)
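As a short illustration of the two most common health checks (pool name assumed):

    zpool scrub datapool        # start an integrity check of every block in the pool
    zpool status -v datapool    # shows scrub progress and per-device READ/WRITE/CKSUM counters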

In the output of zfs get all on a volume (and of zfs list), available is the amount of space still available to the dataset, used is all utilised space including space used by snapshots, and referenced is the amount of data directly accessible in the dataset itself (i.e. the dataset size) (Tomasz Kloczko to clarify the differences)

Snapshots can also be created simply by making a new directory (e.g. mkdir snapshot_dir) in the snapshot directory (.zfs/snapshot in the root of every volume). This can even be done remotely over NFS. By naming convention, a snapshot's full name contains an @ (dataset@snapname).
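A small sketch, assuming a filesystem mounted at /datapool/home; the directory name becomes the part after the @ in the snapshot's name:

    cd /datapool/home/.zfs/snapshot
    mkdir before_upgrade        # equivalent to: zfs snapshot datapool/home@before_upgrade
    zfs list -t snapshot        # now also shows datapool/home@before_upgrade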

Examples

Solaris Disk Addition
ZFS Volume Sharing


Other commands

  • kstat - can be used to show cache (ARC) statistics
  • df -h - shows all volumes of the pool as having the same amount of space available, because they all have access to the pool space unless quotas are set
  • iostat -nz 2 - operates as normal iostat output, but the figures will be higher than zpool iostat because it shows all I/Os, including those that hit the L2ARC (cache)

Logging

TBC

Further Reading


http://en.wikipedia.org/wiki/ZFS

http://www.freebsd.org/cgi/man.cgi?zpool%288%29