Configuring XFS Storage Pool with correct Disk Alignment


Understanding Disk Alignment

  • Throughout this document, be careful with the terminology:
  • Stripe unit is per disk (called "chunk" by mdadm).
  • Stripe width is per array.
  • Stripe size is ambiguous unless we explicitly mean the RAID6 stripe size; see the quick example below.
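As a quick illustration of how these terms relate (the numbers are only an example, not a mandated configuration): the stripe width of an array is the stripe unit multiplied by the number of data disks.
# Example only: a 10+2 RAID6 with a 256KB stripe unit (chunk)
# stripe width = stripe unit * number of data disks
echo "$((256 * 10))KB"   # prints 2560KB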

About RAID6

  • At PIC, storage disk arrays are configured as one or more RAID6.
  • We always configure all RAID6 arrays in a storage pool with the same size, unless this is not possible (e.g. Flytech).
  • Each RAID6 contains a number of data disks (N) and 2 parity disks. Hence, RAID6 = <N data disks>+2.
For example, depending on which configurations the controller allows, the following would be possible:
3*(10+2)
2*(16+2)
4*(7+2) [this configuration should be avoided because it wastes many disks on parity]
  • Each RAID6 should be configured with the following options:
Cached IO Policy
Enabled Disk Cache Policy
Always Read Ahead
Write Back with BBU
  • Each RAID6 will be configured with a Stripe Size. 256KB is the current PIC default, but it depends on the storage pool workload type:
  • For small files, consider decreasing the PIC-default Stripe Size.
  • For large files, consider increasing the PIC-default Stripe Size.
The Stripe Size also depends on the final Stripe Width configured in the formatted file system (see below). If the XFS Stripe Width becomes too big, consider reducing the RAID6 Stripe Size accordingly.
  • Each RAID6 is presented to the O.S. as a disk device. That means that for 3*(10+2) you should see something like /dev/sda, /dev/sdb and /dev/sdc. Names depend on the disk priority; with fdisk -l you can identify which is which by the size each one should have, as in the check below.
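A quick way to do this (only a sketch using standard SL/RHEL 6 tools; adjust the device names to the host) is to compare the sizes reported by the kernel with the expected size of each RAID6:
# Whole-disk sizes as reported by fdisk
fdisk -l 2>/dev/null | grep '^Disk /dev/sd'
# Alternatively, sizes in 1KB blocks
cat /proc/partitions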

About LVM2/MD

LVM2

  • We usually configure LVM2 as follows:
vgcreate dcvg_a /dev/sdc /dev/sdd /dev/sde
lvcreate -i <number of RAID6 devices> -I <RAID6 stripe width in KB> -n dcpool -l 100%FREE -v dcvg_a
Stripe Unit
  • In LVM2, the "StripeSize" (-I parameter) is limited to power-of-2 values (according to lvcreate(8)). For instance, with a RAID6 10+2 and a 256KB stripe unit you cannot use LVM2 for the outer stripe: with 10 data spindles per RAID6 the RAID6 stripe width is 2560KB, which is not a power of 2. In that case you must use md instead. See mdadm(8).
  • When nesting stripes, the "stripe width" of the RAID6 becomes the "stripe unit" of the outer stripe of the resulting RAID60. In essence, each RAID6 is treated as a "drive" in the outer stripe. For example, assuming a configuration of 12 drives per RAID6 (10 data spindles and 2 for parity), and 3 RAID6 arrays per nested stripe:

RAID6 stripe unit = 256 KB
RAID6 stripe width = 2560 KB
RAID60 stripe unit = 2560 KB
RAID60 stripe width = 7680 KB

For a RAID6 with a 1MB stripe unit:

RAID6 stripe unit = 1 MB
RAID6 stripe width = 10 MB
RAID60 stripe unit = 10 MB
RAID60 stripe width = 30 MB

Hence, as shown above, with a RAID6 10+2 and a 256KB stripe unit we end up with a final 2.5MB stripe width per RAID6, while the same configuration with a 1MB stripe unit gives a final 10MB stripe width. This calculation can also be scripted, as in the sketch below.
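The following is only a sketch (bash, with illustrative values for 3*(10+2) and a 256KB stripe unit) that prints the nested geometry and also checks the LVM2 power-of-2 limit mentioned above:
# Illustrative helper: compute RAID6/RAID60 geometry and check the LVM2 power-of-2 limit
SU=256        # RAID6 stripe unit (chunk) in KB
NDATA=10      # data disks per RAID6 (10+2)
NARRAYS=3     # number of RAID6 arrays in the outer stripe

SW=$((SU * NDATA))            # RAID6 stripe width = RAID60 stripe unit
RAID60_SW=$((SW * NARRAYS))   # RAID60 stripe width

echo "RAID6 stripe width / RAID60 stripe unit: ${SW}KB"
echo "RAID60 stripe width: ${RAID60_SW}KB"

# LVM2's StripeSize (-I) must be a power of 2
if [ $((SW & (SW - 1))) -eq 0 ]; then
    echo "LVM2 striping is possible: lvcreate -i $NARRAYS -I $SW ..."
else
    echo "${SW}KB is not a power of 2: use an md RAID0 with --chunk=$SW instead"
fi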

RAID0 MD

  • Sometimes the RAID6 devices cannot be aligned with the outer stripe using LVM2 because of its limits (newer versions may allow it, but the ones currently in the SL/RHEL 6.5 repositories do not). This happens when the Stripe Width is not a power of 2 (for example, 10+2 disks with a 256KB stripe unit result in a Stripe Width of 2560KB, which is not a power of 2).
  • In that case, mdadm can be used to create a stripe of disks (software RAID0). This can be done as follows:
mdadm -C /dev/md0 --raid-devices=<number of RAID6 devices> --chunk=<RAID6 stripe width in KB> --level=0 /dev/sd[abc]
For example, with a RAID6 10+2 and a 256KB stripe unit:
# Stripe Width = 256KB * 10 data disks = 2560KB
mdadm -C /dev/md0 --raid-devices=3 --chunk=2560 --level=0 /dev/sd[abc]
  • In the example above, the new (outer) Stripe Unit will be 2560KB; see the check below.
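Once the md device exists, the chunk size (i.e. the new Stripe Unit) can be verified; for the example above both commands should report a 2560K chunk:
# The reported chunk size must match the RAID6 stripe width (2560K in this example)
mdadm --detail /dev/md0
cat /proc/mdstat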

About XFS

  • XFS must be aligned to the outer stripe geometry (LVM2 or md).
  • Aligning XFS to the RAID6 geometry instead is not correct. For instance, with an LVM2 stripe over 3*RAID6 (16+2, 256KB stripe unit) you must not align with su=256k,sw=16; instead you must calculate the outer stripe geometry: 16*256KB=4096KB and 3 stripes (LVM2 over 3 devices), hence su=4096k,sw=3 is the correct alignment.
  • For example, to make the filesystem and align it to the md nested stripe RAID60 (3 x (10+2) with a 256KB stripe unit), this is all that is required:
mkfs.xfs -d su=2560k,sw=3 /dev/md0
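mkfs.xfs itself echoes the chosen geometry; it can also be re-checked afterwards on the unmounted device with xfs_db (only a sketch). The superblock stores sunit/swidth ("unit"/"width") in filesystem blocks (4KB by default), so 2560KB/4KB = 640 and 640*3 = 1920 are expected here:
# unit/width are stored in 4KB filesystem blocks: 640 and 1920 expected for this example
xfs_db -r -c "sb 0" -c "p" /dev/md0 | grep -E '^unit|^width'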

Examples

Flytech

  • Affected pools:
dc02[0-6].pic.es
  • RAID Controller Configuration (configuration depends on the disk distribution in each backplane):
1. RAID6: 22+2, StripeSize=256k, IO Policy=Cached, Always Read Ahead, Write Back with BBU, Enabled Disk Cache Policy
2. RAID6: 19+2, StripeSize=256k, IO Policy=Cached, Always Read Ahead, Write Back with BBU, Enabled Disk Cache Policy
3. RAID6: 16+2, StripeSize=256k, IO Policy=Cached, Always Read Ahead, Write Back with BBU, Enabled Disk Cache Policy
4. RAID6: 16+2, StripeSize=256k, IO Policy=Cached, Always Read Ahead, Write Back with BBU, Enabled Disk Cache Policy
  • The 4 RAID6 arrays are presented as disk devices, with no multipath and no LVM configured; hence the XFS alignment is calculated directly over each RAID6.
  • XFS Format as follows:
# Format each device depending on its data disks and stripe size

mkfs.xfs -f -d su=256k,sw=22 -l size=128m,lazy-count=1 /dev/sda
mkfs.xfs -f -d su=256k,sw=19 -l size=128m,lazy-count=1 /dev/sdb
mkfs.xfs -f -d su=256k,sw=16 -l size=128m,lazy-count=1 /dev/sdc
mkfs.xfs -f -d su=256k,sw=16 -l size=128m,lazy-count=1 /dev/sdd
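To verify that each device got the intended geometry, the same xfs_db check can be looped over the devices (sketch only; values are in 4KB filesystem blocks, e.g. su=256k,sw=22 should show unit = 64 and width = 1408):
# unit = 256KB/4KB = 64 blocks; width = 64 * <data disks> blocks
for dev in /dev/sd{a,b,c,d}; do
    echo "== $dev =="
    xfs_db -r -c "sb 0" -c "p" $dev | grep -E '^unit|^width'
done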

Supermicro

  • Affected pools:
dc03[4-7].pic.es
dc112.pic.es
  • RAID Controller Configuration (all RAIDs must be identical, and RAID configuration is the same for each backplane):
1. RAID6: 10+2, StripeSize=256k, IO Policy=Cached, Always Read Ahead, Write Back with BBU, Enabled Disk Cache Policy
2. RAID6: 10+2, StripeSize=256k, IO Policy=Cached, Always Read Ahead, Write Back with BBU, Enabled Disk Cache Policy
3. RAID6: 10+2, StripeSize=256k, IO Policy=Cached, Always Read Ahead, Write Back with BBU, Enabled Disk Cache Policy
  • The 3 RAID6 arrays are presented as disk devices, with no multipath, and LVM configured as follows. The alignment must be calculated for LVM2, and the XFS alignment must then be calculated from the LVM2 configuration.
Current configuration (which, by the way, is wrong):
vgcreate dcvg_a /dev/sdc /dev/sdd /dev/sde
lvcreate -i 3 -I 4096 -n dcpool -l 100%FREE -v dcvg_a
Correct configuration (to be applied in the future; LVM2 cannot be used because 2560 is not a power of 2, see the documentation above):
# Replaces the above LVM2 configuration
mdadm -C /dev/md0 --raid-devices=3 --chunk=2560 --level=0 /dev/sd[abc]
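Unlike the LVM2 setup, the md array definition should also be persisted so /dev/md0 is reassembled consistently at boot; a minimal sketch (assuming /etc/mdadm.conf is used on these hosts):
# Record the array so it is assembled with the same name after a reboot
mdadm --detail --scan >> /etc/mdadm.conf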
  • XFS format as follows:
# Format the pool device depending on its data disks and stripe size
# (in the LVM2 approach sw would be the number of data disks of one RAID6; all RAIDs must be identical)
Correct configuration (su and sw must be calculated from the outer stripe, which is the md RAID0):
# Replaces the above XFS configuration; note that the target is now /dev/md0, not the logical volume
mkfs.xfs -d su=2560k,sw=3 -l size=128m,lazy-count=1 /dev/md0
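Finally, the filesystem has to be mounted and the alignment can be double-checked; the mount point and options below are only illustrative, not the actual PIC pool settings:
# Hypothetical mount point and options; adjust to the actual pool layout
mkdir -p /pool_a
mount -t xfs -o inode64,noatime /dev/md0 /pool_a
# sunit/swidth are reported in 4KB filesystem blocks: 640 and 1920 expected here
xfs_info /pool_a | grep -E 'sunit|swidth'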