Internet Engineering Task Force (IETF)                          D. Black
Request for Comments: 5663                                   S. Fridella
Category: Standards Track                                EMC Corporation
ISSN: 2070-1721                                               J. Glasgow
                                                                  Google
                                                            January 2010
Parallel NFS (pNFS) Block/Volume Layout
Abstract
Parallel NFS (pNFS) extends Network File System Version 4 (NFSv4) to allow clients to directly access file data on the storage used by the NFSv4 server. This ability to bypass the server for data access can increase both performance and parallelism, but requires additional client functionality for data access, some of which is dependent on the class of storage used. The main pNFS operations document specifies storage-class-independent extensions to NFS; this document specifies the additional extensions (primarily data structures) for use of pNFS with block- and volume-based storage.
Status of This Memo
This is an Internet Standards Track document.
This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 5741.
Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at http://www.rfc-editor.org/info/rfc5663.
Copyright Notice
Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
Table of Contents
1. Introduction
   1.1. Conventions Used in This Document
   1.2. General Definitions
   1.3. Code Components Licensing Notice
   1.4. XDR Description
2. Block Layout Description
   2.1. Background and Architecture
   2.2. GETDEVICELIST and GETDEVICEINFO
        2.2.1. Volume Identification
        2.2.2. Volume Topology
        2.2.3. GETDEVICELIST and GETDEVICEINFO deviceid4
   2.3. Data Structures: Extents and Extent Lists
        2.3.1. Layout Requests and Extent Lists
        2.3.2. Layout Commits
        2.3.3. Layout Returns
        2.3.4. Client Copy-on-Write Processing
        2.3.5. Extents are Permissions
        2.3.6. End-of-file Processing
        2.3.7. Layout Hints
        2.3.8. Client Fencing
   2.4. Crash Recovery Issues
   2.5. Recalling Resources: CB_RECALL_ANY
   2.6. Transient and Permanent Errors
3. Security Considerations
4. Conclusions
5. IANA Considerations
6. Acknowledgments
7. References
   7.1. Normative References
   7.2. Informative References
1. Introduction

Figure 1 shows the overall architecture of a Parallel NFS (pNFS) system:
     +-----------+
     |+-----------+                                 +-----------+
     ||+-----------+                                |           |
     |||           |       NFSv4.1 + pNFS           |           |
     +||  Clients  |<------------------------------>|   Server  |
      +|           |                                |           |
       +-----------+                                |           |
            |||                                     +-----------+
            |||                                          |
            |||                                          |
            ||| Storage        +-----------+             |
            ||| Protocol       |+-----------+            |
            ||+----------------||+-----------+  Control  |
            |+-----------------|||           |  Protocol |
            +------------------+||  Storage  |-----------+
                                +|  Systems  |
                                 +-----------+
Figure 1: pNFS Architecture
The overall approach is that pNFS-enhanced clients obtain sufficient information from the server to enable them to access the underlying storage (on the storage systems) directly. See the pNFS portion of [NFSv4.1] for more details. This document is concerned with access from pNFS clients to storage systems over storage protocols based on blocks and volumes, such as the Small Computer System Interface (SCSI) protocol family (e.g., parallel SCSI, Fibre Channel Protocol (FCP) for Fibre Channel, Internet SCSI (iSCSI), Serial Attached SCSI (SAS), and Fibre Channel over Ethernet (FCoE)). This class of storage is referred to as block/volume storage. While the Server to Storage System protocol, called the "Control Protocol", is not of concern for interoperability here, it will typically also be a block/volume protocol when clients use block/volume protocols.
1.1. Conventions Used in This Document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].
1.2. General Definitions

The following definitions are provided to give the reader an appropriate context.
Byte
This document defines a byte as an octet, i.e., a datum exactly 8 bits in length.
Client
The "client" is the entity that accesses the NFS server's resources. The client may be an application that contains the logic to access the NFS server directly. The client may also be the traditional operating system client that provides remote file system services for a set of applications.
Server
The "server" is the entity responsible for coordinating client access to a set of file systems and is identified by a server owner.
1.3. Code Components Licensing Notice

The external data representation (XDR) description and scripts for extracting the XDR description are Code Components as described in Section 4 of "Legal Provisions Relating to IETF Documents" [LEGAL]. These Code Components are licensed according to the terms of Section 4 of "Legal Provisions Relating to IETF Documents".
1.4. XDR Description

This document contains the XDR ([XDR]) description of the NFSv4.1 block layout protocol. The XDR description is embedded in this document in a way that makes it simple for the reader to extract into a ready-to-compile form. The reader can feed this document into the following shell script to produce the machine readable XDR description of the NFSv4.1 block layout:
   #!/bin/sh
   grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'
That is, if the above script is stored in a file called "extract.sh", and this document is in a file called "spec.txt", then the reader can do:
sh extract.sh < spec.txt > nfs4_block_layout_spec.x
The effect of the script is to remove both leading white space and a sentinel sequence of "///" from each matching line.
The embedded XDR file header follows, with subsequent pieces embedded throughout the document:
/// /*
///  * This code was derived from RFC 5663.
///  * Please reproduce this note if possible.
///  */
/// /*
///  * Copyright (c) 2010 IETF Trust and the persons identified
///  * as the document authors.  All rights reserved.
///  *
///  * Redistribution and use in source and binary forms, with
///  * or without modification, are permitted provided that the
///  * following conditions are met:
///  *
///  * - Redistributions of source code must retain the above
///  *   copyright notice, this list of conditions and the
///  *   following disclaimer.
///  *
///  * - Redistributions in binary form must reproduce the above
///  *   copyright notice, this list of conditions and the
///  *   following disclaimer in the documentation and/or other
///  *   materials provided with the distribution.
///  *
///  * - Neither the name of Internet Society, IETF or IETF
///  *   Trust, nor the names of specific contributors, may be
///  *   used to endorse or promote products derived from this
///  *   software without specific prior written permission.
///  *
///  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
///  * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
///  * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
///  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
///  * FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO
///  * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
///  * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
///  * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
///  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
///  * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
///  * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
///  * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
///  * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
///  * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
///  * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
///  */
///
/// /*
///  * nfs4_block_layout_prot.x
///  */
///
/// %#include "nfsv41.h"
///
The XDR code contained in this document depends on types from the nfsv41.x file. This includes both nfs types that end with a 4, such as offset4, length4, etc., as well as more generic types such as uint32_t and uint64_t.
2. Block Layout Description

2.1. Background and Architecture

The fundamental storage abstraction supported by block/volume storage is a storage volume consisting of a sequential series of fixed-size blocks. This can be thought of as a logical disk; it may be realized by the storage system as a physical disk, a portion of a physical disk, or something more complex (e.g., concatenation, striping, RAID, and combinations thereof) involving multiple physical disks or portions thereof.
A pNFS layout for this block/volume class of storage is responsible for mapping from an NFS file (or portion of a file) to the blocks of storage volumes that contain the file. The blocks are expressed as extents with 64-bit offsets and lengths using the existing NFSv4 offset4 and length4 types. Clients must be able to perform I/O to the block extents without affecting additional areas of storage (especially important for writes); therefore, extents MUST be aligned to 512-byte boundaries, and writable extents MUST be aligned to the block size used by the NFSv4 server in managing the actual file system (4 kilobytes and 8 kilobytes are common block sizes). This block size is available as the NFSv4.1 layout_blksize attribute [NFSv4.1]. Readable extents SHOULD be aligned to the block size used by the NFSv4 server, but in order to support legacy file systems with fragments, alignment to 512-byte boundaries is acceptable.
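A minimal client-side sketch (in C; not part of this specification) of these alignment rules, assuming the extent offset and length fields defined later in Section 2.3 and a blksize value obtained from the layout_blksize attribute:

   #include <stdbool.h>
   #include <stdint.h>

   #define SECTOR_SIZE 512u  /* minimum alignment for readable extents */

   /* Returns true if an extent satisfies the alignment rules above. */
   static bool
   extent_alignment_ok(uint64_t bex_storage_offset, uint64_t bex_length,
                       uint32_t blksize, bool writable)
   {
       /* Writable extents MUST be aligned to the server's block size;
        * readable extents may fall back to 512-byte alignment. */
       uint64_t align = writable ? blksize : SECTOR_SIZE;

       return (bex_storage_offset % align == 0) &&
              (bex_length % align == 0);
   }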
The pNFS operation for requesting a layout (LAYOUTGET) includes the "layoutiomode4 loga_iomode" argument, which indicates whether the requested layout is for read-only use or read-write use. A read-only layout may contain holes that are read as zero, whereas a read-write layout will contain allocated, but un-initialized storage in those holes (read as zero, can be written by client). This document also supports client participation in copy-on-write (e.g., for file systems with snapshots) by providing both read-only and un-initialized storage for the same range in a layout. Reads are initially performed on the read-only storage, with writes going to the un-initialized storage. After the first write that initializes the un-initialized storage, all reads are performed to that now-initialized writable storage, and the corresponding read-only storage is no longer used.
The block/volume layout solution expands the security responsibilities of the pNFS clients, and there are a number of environments where the mandatory-to-implement security properties for NFS cannot be satisfied. The additional security responsibilities of the client follow, and a full discussion is present in Section 3, "Security Considerations".
o Typically, storage area network (SAN) disk arrays and SAN protocols provide access control mechanisms (e.g., Logical Unit Number (LUN) mapping and/or masking), which operate at the granularity of individual hosts, not individual blocks. For this reason, block-based protection must be provided by the client software.
o Similarly, SAN disk arrays and SAN protocols typically are not able to validate NFS locks that apply to file regions. For instance, if a file is covered by a mandatory read-only lock, the server can ensure that only readable layouts for the file are granted to pNFS clients. However, it is up to each pNFS client to ensure that the readable layout is used only to service read requests, and not to allow writes to the existing parts of the file.
Since block/volume storage systems are generally not capable of enforcing such file-based security, in environments where pNFS clients cannot be trusted to enforce such policies, pNFS block/volume storage layouts SHOULD NOT be used.
2.2. GETDEVICELIST and GETDEVICEINFO

2.2.1. Volume Identification

Storage systems such as storage arrays can have multiple physical network ports that need not be connected to a common network, resulting in a pNFS client having simultaneous multipath access to the same storage volumes via different ports on different networks.
The networks may not even be the same technology -- for example, access to the same volume via both iSCSI and Fibre Channel is possible, hence network addresses are difficult to use for volume identification. For this reason, this pNFS block layout identifies storage volumes by content, for example providing the means to match (unique portions of) labels used by volume managers. Volume identification is performed by matching one or more opaque byte sequences to specific parts of the stored data. Any block pNFS system using this layout MUST support a means of content-based unique volume identification that can be employed via the data structure given here.
/// struct pnfs_block_sig_component4 {  /* disk signature component */
///     int64_t bsc_sig_offset;         /* byte offset of component
///                                        on volume */
///     opaque  bsc_contents<>;         /* contents of this component
///                                        of the signature */
/// };
///
Note that the opaque "bsc_contents" field in the "pnfs_block_sig_component4" structure MUST NOT be interpreted as a zero-terminated string, as it may contain embedded zero-valued bytes. There are no restrictions on alignment (e.g., neither bsc_sig_offset nor the length are required to be multiples of 4). The bsc_sig_offset is a signed quantity, which, when positive, represents a byte offset from the start of the volume, and when negative represents a byte offset from the end of the volume.
Negative offsets are permitted in order to simplify the client implementation on systems where the device label is found at a fixed offset from the end of the volume. If the server uses negative offsets to describe the signature, then the client and server MUST NOT see different volume sizes. Negative offsets SHOULD NOT be used in systems that dynamically resize volumes unless care is taken to ensure that the device label is always present at the offset from the end of the volume as seen by the clients.
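A sketch of how a client might turn the signed bsc_sig_offset into an absolute offset; the helper name is illustrative, and the volume size used must be the size as seen by both client and server:

   #include <stdint.h>

   /* Resolve a signed signature-component offset to an absolute byte
    * offset on the volume.  Negative values count back from the end. */
   static uint64_t
   sig_component_abs_offset(int64_t bsc_sig_offset, uint64_t volume_size)
   {
       if (bsc_sig_offset >= 0)
           return (uint64_t)bsc_sig_offset;
       return volume_size - (uint64_t)(-bsc_sig_offset);
   }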
A signature is an array of up to "PNFS_BLOCK_MAX_SIG_COMP" (defined below) signature components. The client MUST NOT assume that all signature components are co-located within a single sector on a block device.
The pNFS client block layout driver uses this volume identification to map pnfs_block_volume_type4 PNFS_BLOCK_VOLUME_SIMPLE deviceid4s to its local view of a LUN.
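The following sketch shows one way a client block layout driver could perform this mapping: read each signature component from a candidate local device and compare it byte-for-byte with the server-supplied contents. read_device() and the component structure are stand-ins for illustration, not definitions from this document; sig_component_abs_offset() is the helper sketched above.

   #include <stdbool.h>
   #include <stdint.h>
   #include <string.h>

   struct sig_component {
       int64_t   offset;    /* bsc_sig_offset, may be negative */
       uint32_t  len;       /* length of bsc_contents */
       uint8_t  *contents;  /* bsc_contents; may contain zero bytes */
   };

   /* Assumed helpers (illustrative only). */
   int read_device(int fd, uint64_t off, void *buf, uint32_t len);
   uint64_t sig_component_abs_offset(int64_t off, uint64_t volume_size);

   static bool
   device_matches_signature(int fd, uint64_t volume_size,
                            const struct sig_component *sig,
                            unsigned ncomp)
   {
       uint8_t buf[4096];

       for (unsigned i = 0; i < ncomp; i++) {
           uint64_t off =
               sig_component_abs_offset(sig[i].offset, volume_size);

           if (sig[i].len > sizeof(buf) ||
               read_device(fd, off, buf, sig[i].len) != 0)
               return false;
           /* memcmp, not strcmp: contents may hold embedded zeros. */
           if (memcmp(buf, sig[i].contents, sig[i].len) != 0)
               return false;
       }
       return true;   /* every component matched this LUN */
   }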
2.2.2. Volume Topology

The pNFS block server volume topology is expressed as an arbitrary combination of base volume types enumerated in the following data structures. The individual components of the topology are contained in an array and components may refer to other components by using array indices.
/// enum pnfs_block_volume_type4 {
///     PNFS_BLOCK_VOLUME_SIMPLE = 0,  /* volume maps to a single
///                                       LU */
///     PNFS_BLOCK_VOLUME_SLICE  = 1,  /* volume is a slice of
///                                       another volume */
///     PNFS_BLOCK_VOLUME_CONCAT = 2,  /* volume is a
///                                       concatenation of
///                                       multiple volumes */
///     PNFS_BLOCK_VOLUME_STRIPE = 3   /* volume is striped across
///                                       multiple volumes */
/// };
///
/// const PNFS_BLOCK_MAX_SIG_COMP = 16; /* maximum components per
///                                        signature */
/// struct pnfs_block_simple_volume_info4 {
///     pnfs_block_sig_component4 bsv_ds<PNFS_BLOCK_MAX_SIG_COMP>;
///                                    /* disk signature */
/// };
///
///
/// struct pnfs_block_slice_volume_info4 {
///     offset4  bsv_start;            /* offset of the start of the
///                                       slice in bytes */
///     length4  bsv_length;           /* length of slice in bytes */
///     uint32_t bsv_volume;           /* array index of sliced
///                                       volume */
/// };
///
/// struct pnfs_block_concat_volume_info4 {
///     uint32_t bcv_volumes<>;        /* array indices of volumes
///                                       which are concatenated */
/// };
///
/// struct pnfs_block_stripe_volume_info4 {
///     length4  bsv_stripe_unit;      /* size of stripe in bytes */
///     uint32_t bsv_volumes<>;        /* array indices of volumes
///                                       which are striped across --
///                                       MUST be same size */
/// };
///
/// union pnfs_block_volume4 switch (pnfs_block_volume_type4 type) {
///     case PNFS_BLOCK_VOLUME_SIMPLE:
///         pnfs_block_simple_volume_info4 bv_simple_info;
///     case PNFS_BLOCK_VOLUME_SLICE:
///         pnfs_block_slice_volume_info4 bv_slice_info;
///     case PNFS_BLOCK_VOLUME_CONCAT:
///         pnfs_block_concat_volume_info4 bv_concat_info;
///     case PNFS_BLOCK_VOLUME_STRIPE:
///         pnfs_block_stripe_volume_info4 bv_stripe_info;
/// };
///
/// /* block layout specific type for da_addr_body */
/// struct pnfs_block_deviceaddr4 {
///     pnfs_block_volume4 bda_volumes<>; /* array of volumes */
/// };
///
The "pnfs_block_deviceaddr4" data structure allows arbitrarily complex nested volume structures to be encoded. The types of aggregations that are allowed are stripes, concatenations, and slices. Note that the volume topology expressed in the pnfs_block_deviceaddr4 data structure will always resolve to a set of pnfs_block_volume_type4 PNFS_BLOCK_VOLUME_SIMPLE. The array of volumes is ordered such that the root of the volume hierarchy is the last element of the array. Concat, slice, and stripe volumes MUST refer to volumes defined by lower indexed elements of the array.
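Because references always point to lower-indexed entries, a client can resolve a byte offset iteratively from the root down to a simple volume. The sketch below illustrates this with simplified stand-in types; volume_size() is an assumed helper returning the usable size of a constituent volume:

   #include <stdint.h>

   enum vol_type { VOL_SIMPLE, VOL_SLICE, VOL_CONCAT, VOL_STRIPE };

   struct volume {
       enum vol_type type;
       uint64_t  start;        /* slice: bsv_start */
       uint32_t  sliced;       /* slice: bsv_volume */
       uint32_t *vols;         /* concat/stripe: member indices */
       uint32_t  nvols;
       uint64_t  stripe_unit;  /* stripe: bsv_stripe_unit */
   };

   /* Assumed helper: size in bytes of the volume at index idx. */
   uint64_t volume_size(const struct volume *v, uint32_t idx);

   /* Resolve *off within volume idx down to a simple volume; returns
    * that volume's index and rewrites *off to the offset within it. */
   static uint32_t
   resolve_offset(const struct volume *v, uint32_t idx, uint64_t *off)
   {
       while (v[idx].type != VOL_SIMPLE) {
           const struct volume *cur = &v[idx];

           switch (cur->type) {
           case VOL_SLICE:
               *off += cur->start;           /* shift into parent */
               idx = cur->sliced;
               break;
           case VOL_CONCAT: {
               uint32_t i = 0;
               while (*off >= volume_size(v, cur->vols[i])) {
                   *off -= volume_size(v, cur->vols[i]);
                   i++;
               }
               idx = cur->vols[i];
               break;
           }
           case VOL_STRIPE: {
               uint64_t su = cur->stripe_unit;
               uint64_t stripe = *off / su;  /* global stripe number */
               uint32_t member = (uint32_t)(stripe % cur->nvols);
               *off = (stripe / cur->nvols) * su + (*off % su);
               idx = cur->vols[member];
               break;
           }
           default:
               return idx;                   /* unreachable */
           }
       }
       return idx;
   }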
The "pnfs_block_deviceaddr4" data structure is returned by the server as the storage-protocol-specific opaque field da_addr_body in the "device_addr4" structure, in response to a successful GETDEVICEINFO operation [NFSv4.1].
As noted above, all device_addr4 structures eventually resolve to a set of volumes of type PNFS_BLOCK_VOLUME_SIMPLE. These volumes are each uniquely identified by a set of signature components. Complicated volume hierarchies may be composed of dozens of volumes each with several signature components; thus, the device address may require several kilobytes. The client SHOULD be prepared to allocate a large buffer to contain the result. In the case of the server
returning NFS4ERR_TOOSMALL, the client SHOULD allocate a buffer of at least gdir_mincount_bytes to contain the expected result and retry the GETDEVICEINFO request.
2.2.3. GETDEVICELIST and GETDEVICEINFO deviceid4

The server in response to a GETDEVICELIST request typically will return a single "deviceid4" in the gdlr_deviceid_list array. This is because the deviceid4 when passed to GETDEVICEINFO will return a "device_addr4", which encodes the entire volume hierarchy. In the case of copy-on-write file systems, the "gdlr_deviceid_list" array may contain two deviceid4's, one referencing the read-only volume hierarchy, and one referencing the writable volume hierarchy. There is no required ordering of the readable and writable IDs in the array as the volumes are uniquely identified by their deviceid4, and are referred to by layouts using the deviceid4. Another example of the server returning multiple device items occurs when the file handle represents the root of a namespace spanning multiple physical file systems on the server, each with a different volume hierarchy. In this example, a server implementation may return either a list of device IDs used by each of the physical file systems, or it may return an empty list.
Each deviceid4 returned by a successful GETDEVICELIST operation is a shorthand id used to reference the whole volume topology. These device IDs, as well as device IDs returned in extents of a LAYOUTGET operation, can be used as input to the GETDEVICEINFO operation. Decoding the "pnfs_block_deviceaddr4" results in a flat ordering of data blocks mapped to PNFS_BLOCK_VOLUME_SIMPLE volumes. Combined with the mapping to a client LUN described in Section 2.2.1 "Volume Identification", a logical volume offset can be mapped to a block on a pNFS client LUN [NFSv4.1].
2.3. Data Structures: Extents and Extent Lists

A pNFS block layout is a list of extents within a flat array of data blocks in a logical volume. The details of the volume topology can be determined by using the GETDEVICEINFO operation (see discussion of volume identification, Section 2.2 above). The block layout describes the individual block extents on the volume that make up the file. The offsets and length contained in an extent are specified in units of bytes.
/// enum pnfs_block_extent_state4 {
///     PNFS_BLOCK_READ_WRITE_DATA = 0, /* the data located by this
///                                        extent is valid
///                                        for reading and writing. */
///     PNFS_BLOCK_READ_DATA       = 1, /* the data located by this
///                                        extent is valid for reading
///                                        only; it may not be
///                                        written. */
///     PNFS_BLOCK_INVALID_DATA    = 2, /* the location is valid; the
///                                        data is invalid.  It is a
///                                        newly (pre-) allocated
///                                        extent.  There is physical
///                                        space on the volume. */
///     PNFS_BLOCK_NONE_DATA       = 3  /* the location is invalid.
///                                        It is a hole in the file.
///                                        There is no physical space
///                                        on the volume. */
/// };
///
/// struct pnfs_block_extent4 {
///     deviceid4 bex_vol_id;          /* id of logical volume on
///                                       which extent of file is
///                                       stored. */
///     offset4   bex_file_offset;     /* the starting byte offset in
///                                       the file */
///     length4   bex_length;          /* the size in bytes of the
///                                       extent */
///     offset4   bex_storage_offset;  /* the starting byte offset
///                                       in the volume */
///     pnfs_block_extent_state4 bex_state;
///                                    /* the state of this extent */
/// };
///
/// /* block layout specific type for loc_body */
/// struct pnfs_block_layout4 {
///     pnfs_block_extent4 blo_extents<>;
///                                    /* extents which make up this
///                                       layout. */
/// };
///
The block layout consists of a list of extents that map the logical regions of the file to physical locations on a volume. The "bex_storage_offset" field within each extent identifies a location on the logical volume specified by the "bex_vol_id" field in the extent. The bex_vol_id itself is shorthand for the whole topology of
the logical volume on which the file is stored. The client is responsible for translating this logical offset into an offset on the appropriate underlying SAN logical unit. In most cases, all extents in a layout will reside on the same volume and thus have the same bex_vol_id. In the case of copy-on-write file systems, the PNFS_BLOCK_READ_DATA extents may have a different bex_vol_id from the writable extents.
Each extent maps a logical region of the file onto a portion of the specified logical volume. The bex_file_offset, bex_length, and bex_state fields for an extent returned from the server are valid for all extents. In contrast, the interpretation of the bex_storage_offset field depends on the value of bex_state as follows (in increasing order; a sketch of the corresponding client read path appears after the list):
o PNFS_BLOCK_READ_WRITE_DATA means that bex_storage_offset is valid, and points to valid/initialized data that can be read and written.
o PNFS_BLOCK_READ_DATA means that bex_storage_offset is valid and points to valid/initialized data that can only be read. Write operations are prohibited; the client may need to request a read-write layout.
o PNFS_BLOCK_INVALID_DATA means that bex_storage_offset is valid, but points to invalid un-initialized data. This data must not be physically read from the disk until it has been initialized. A read request for a PNFS_BLOCK_INVALID_DATA extent must fill the user buffer with zeros, unless the extent is covered by a PNFS_BLOCK_READ_DATA extent of a copy-on-write file system. Write requests must write whole server-sized blocks to the disk; bytes not initialized by the user must be set to zero. Any write to storage in a PNFS_BLOCK_INVALID_DATA extent changes the written portion of the extent to PNFS_BLOCK_READ_WRITE_DATA; the pNFS client is responsible for reporting this change via LAYOUTCOMMIT.
o PNFS_BLOCK_NONE_DATA means that bex_storage_offset is not valid, and this extent may not be used to satisfy write requests. Read requests may be satisfied by zero-filling as for PNFS_BLOCK_INVALID_DATA. PNFS_BLOCK_NONE_DATA extents may be returned by requests for readable extents; they are never returned if the request was for a writable extent.
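A client-side sketch of the read behavior implied by these states. read_volume() is an assumed helper; the copy-on-write case, where a PNFS_BLOCK_INVALID_DATA range is covered by a PNFS_BLOCK_READ_DATA extent, is handled by preferring the PNFS_BLOCK_READ_DATA extent as described in Section 2.3.4.

   #include <stdint.h>
   #include <string.h>

   enum extent_state {      /* mirrors pnfs_block_extent_state4 */
       READ_WRITE_DATA = 0,
       READ_DATA       = 1,
       INVALID_DATA    = 2,
       NONE_DATA       = 3
   };

   /* Assumed helper: read len bytes at storage_off from the volume. */
   int read_volume(uint64_t storage_off, void *buf, uint64_t len);

   static int
   read_extent(enum extent_state state, uint64_t storage_off,
               void *buf, uint64_t len)
   {
       switch (state) {
       case READ_WRITE_DATA:
       case READ_DATA:
           return read_volume(storage_off, buf, len);
       case INVALID_DATA:       /* allocated but uninitialized */
       case NONE_DATA:          /* hole: no physical space */
           memset(buf, 0, len); /* both read as zeros */
           return 0;
       }
       return -1;
   }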
An extent list contains all relevant extents in increasing order of the bex_file_offset of each extent; any ties are broken by increasing order of the extent state (bex_state).
2.3.1. Layout Requests and Extent Lists

Each request for a layout specifies at least three parameters: file offset, desired size, and minimum size. If the status of a request indicates success, the extent list returned must meet the following criteria (a small ordering check is sketched after the list):
o A request for a readable (but not writable) layout returns only PNFS_BLOCK_READ_DATA or PNFS_BLOCK_NONE_DATA extents (but not PNFS_BLOCK_INVALID_DATA or PNFS_BLOCK_READ_WRITE_DATA extents).
o A request for a writable layout returns PNFS_BLOCK_READ_WRITE_DATA or PNFS_BLOCK_INVALID_DATA extents (but not PNFS_BLOCK_NONE_DATA extents). It may also return PNFS_BLOCK_READ_DATA extents only when the offset ranges in those extents are also covered by PNFS_BLOCK_INVALID_DATA extents to permit writes.
o The first extent in the list MUST contain the requested starting offset.
o The total size of extents within the requested range MUST cover at least the minimum size. One exception is allowed: the total size MAY be smaller if only readable extents were requested and EOF is encountered.
o Extents in the extent list MUST be logically contiguous for a read-only layout. For a read-write layout, the set of writable extents (i.e., excluding PNFS_BLOCK_READ_DATA extents) MUST be logically contiguous. Every PNFS_BLOCK_READ_DATA extent in a read-write layout MUST be covered by one or more PNFS_BLOCK_INVALID_DATA extents. This overlap of PNFS_BLOCK_READ_DATA and PNFS_BLOCK_INVALID_DATA extents is the only permitted extent overlap.
o Extents MUST be ordered in the list by starting offset, with PNFS_BLOCK_READ_DATA extents preceding PNFS_BLOCK_INVALID_DATA extents in the case of equal bex_file_offsets.
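A sketch of checking the ordering rule above on a returned extent list; the struct is a simplified stand-in for pnfs_block_extent4:

   #include <stdbool.h>
   #include <stdint.h>

   struct extent {
       uint64_t file_offset;  /* bex_file_offset */
       int      state;        /* pnfs_block_extent_state4 value */
   };

   /* Extents must come in increasing bex_file_offset order; ties must
    * come in increasing bex_state order (READ_DATA = 1 before
    * INVALID_DATA = 2). */
   static bool
   extent_list_ordered(const struct extent *ex, unsigned n)
   {
       for (unsigned i = 1; i < n; i++) {
           if (ex[i].file_offset < ex[i - 1].file_offset)
               return false;
           if (ex[i].file_offset == ex[i - 1].file_offset &&
               ex[i].state <= ex[i - 1].state)
               return false;
       }
       return true;
   }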
If the minimum requested size, loga_minlength, is zero, this is an indication to the metadata server that the client desires any layout at offset loga_offset or less that the metadata server has "readily available". Readily is subjective, and depends on the layout type and the pNFS server implementation. For block layout servers, readily available SHOULD be interpreted such that readable layouts are always available, even if some extents are in the PNFS_BLOCK_NONE_DATA state. When processing requests for writable layouts, a layout is readily available if extents can be returned in the PNFS_BLOCK_READ_WRITE_DATA state.
2.3.2. Layout Commits

/// /* block layout specific type for lou_body */
/// struct pnfs_block_layoutupdate4 {
///     pnfs_block_extent4 blu_commit_list<>;
///                                    /* list of extents which
///                                     * now contain valid data.
///                                     */
/// };
///
The "pnfs_block_layoutupdate4" structure is used by the client as the block-protocol specific argument in a LAYOUTCOMMIT operation. The "blu_commit_list" field is an extent list covering regions of the file layout that were previously in the PNFS_BLOCK_INVALID_DATA state, but have been written by the client and should now be considered in the PNFS_BLOCK_READ_WRITE_DATA state. The bex_state field of each extent in the blu_commit_list MUST be set to PNFS_BLOCK_READ_WRITE_DATA. The extents in the commit list MUST be disjoint and MUST be sorted by bex_file_offset. The bex_storage_offset field is unused. Implementors should be aware that a server may be unable to commit regions at a granularity smaller than a file-system block (typically 4 KB or 8 KB). As noted above, the block-size that the server uses is available as an NFSv4 attribute, and any extents included in the "blu_commit_list" MUST be aligned to this granularity and have a size that is a multiple of this granularity. If the client believes that its actions have moved the end-of-file into the middle of a block being committed, the client MUST write zeroes from the end-of-file to the end of that block before committing the block. Failure to do so may result in junk (un-initialized data) appearing in that area if the file is subsequently extended by moving the end-of-file.
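A sketch of the end-of-file rule in the preceding paragraph: before committing the block containing the end-of-file, the client zeroes from EOF to the end of that block. write_volume() and the fixed buffer size are illustrative assumptions:

   #include <stdint.h>

   /* Assumed helper: write len bytes at storage_off on the volume. */
   int write_volume(uint64_t storage_off, const void *buf, uint64_t len);

   static int
   zero_fill_eof_block(uint64_t eof, uint32_t blksize,
                       uint64_t block_storage_off /* start of EOF block */)
   {
       static const uint8_t zeros[8192];   /* assumes blksize <= 8192 */
       uint64_t in_block = eof % blksize;  /* bytes of real data */

       if (in_block == 0)
           return 0;                       /* EOF is block-aligned */
       return write_volume(block_storage_off + in_block,
                           zeros, blksize - in_block);
   }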
2.3.3. Layout Returns

The LAYOUTRETURN operation is done without any block layout specific data. When the LAYOUTRETURN operation specifies a LAYOUTRETURN4_FILE return type, then the layoutreturn_file4 data structure specifies the region of the file layout that is no longer needed by the client. The opaque "lrf_body" field of the "layoutreturn_file4" data structure MUST have length zero. A LAYOUTRETURN operation represents an explicit release of resources by the client, usually done for the purpose of avoiding unnecessary CB_LAYOUTRECALL operations in the future. The client may return disjoint regions of the file by using multiple LAYOUTRETURN operations within a single COMPOUND operation.
Note that the block/volume layout supports unilateral layout revocation. When a layout is unilaterally revoked by the server, usually due to the client's lease time expiring, or a delegation being recalled, or the client failing to return a layout in a timely manner, it is important for the sake of correctness that any in-flight I/Os that the client issued before the layout was revoked are rejected at the storage. For the block/volume protocol, this is possible by fencing a client with an expired layout timer from the physical storage. Note, however, that the granularity of this operation can only be at the host/logical-unit level. Thus, if one of a client's layouts is unilaterally revoked by the server, it will effectively render useless *all* of the client's layouts for files located on the storage units comprising the logical volume. This may render useless the client's layouts for files in other file systems.
2.3.4. Client Copy-on-Write Processing

Copy-on-write is a mechanism used to support file and/or file system snapshots. When writing to unaligned regions, or to regions smaller than a file system block, the writer must copy the portions of the original file data to a new location on disk. This behavior can either be implemented on the client or the server. The paragraphs below describe how a pNFS block layout client implements access to a file that requires copy-on-write semantics.
Distinguishing the PNFS_BLOCK_READ_WRITE_DATA and PNFS_BLOCK_READ_DATA extent types in combination with the allowed overlap of PNFS_BLOCK_READ_DATA extents with PNFS_BLOCK_INVALID_DATA extents allows copy-on-write processing to be done by pNFS clients. In classic NFS, this operation would be done by the server. Since pNFS enables clients to do direct block access, it is useful for clients to participate in copy-on-write operations. All block/volume pNFS clients MUST support this copy-on-write processing.
When a client wishes to write data covered by a PNFS_BLOCK_READ_DATA extent, it MUST have requested a writable layout from the server; that layout will contain PNFS_BLOCK_INVALID_DATA extents to cover all the data ranges of that layout's PNFS_BLOCK_READ_DATA extents. More precisely, for any bex_file_offset range covered by one or more PNFS_BLOCK_READ_DATA extents in a writable layout, the server MUST include one or more PNFS_BLOCK_INVALID_DATA extents in the layout that cover the same bex_file_offset range. When performing a write to such an area of a layout, the client MUST effectively copy the data from the PNFS_BLOCK_READ_DATA extent for any partial blocks of bex_file_offset and range, merge in the changes to be written, and write the result to the PNFS_BLOCK_INVALID_DATA extent for the blocks for that bex_file_offset and range. That is, if entire blocks of data are to be overwritten by an operation, the corresponding
PNFS_BLOCK_READ_DATA blocks need not be fetched, but any partial-block writes must be merged with data fetched via PNFS_BLOCK_READ_DATA extents before storing the result via PNFS_BLOCK_INVALID_DATA extents. For the purposes of this discussion, "entire blocks" and "partial blocks" refer to the server's file-system block size. Storing of data in a PNFS_BLOCK_INVALID_DATA extent converts the written portion of the PNFS_BLOCK_INVALID_DATA extent to a PNFS_BLOCK_READ_WRITE_DATA extent; all subsequent reads MUST be performed from this extent; the corresponding portion of the PNFS_BLOCK_READ_DATA extent MUST NOT be used after storing data in a PNFS_BLOCK_INVALID_DATA extent. If a client writes only a portion of an extent, the extent may be split at block aligned boundaries.
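A sketch of the read-modify-write sequence described above, for a write that covers only part of a file-system block: fetch the old block through the PNFS_BLOCK_READ_DATA extent, merge the new bytes, and store the whole block through the PNFS_BLOCK_INVALID_DATA extent. The helpers and the fixed buffer size are illustrative assumptions:

   #include <stdint.h>
   #include <string.h>

   /* Assumed helpers (illustrative only). */
   int read_volume(uint64_t storage_off, void *buf, uint64_t len);
   int write_volume(uint64_t storage_off, const void *buf, uint64_t len);

   static int
   cow_partial_block_write(uint64_t ro_off,  /* PNFS_BLOCK_READ_DATA */
                           uint64_t rw_off,  /* PNFS_BLOCK_INVALID_DATA */
                           uint32_t blksize,
                           uint32_t off_in_block,
                           const void *data, uint32_t len)
   {
       uint8_t block[8192];                  /* assumes blksize <= 8192 */

       /* 1. Fetch the original block from the read-only extent. */
       if (read_volume(ro_off, block, blksize) != 0)
           return -1;
       /* 2. Merge in the caller's bytes. */
       memcpy(block + off_in_block, data, len);
       /* 3. Store the whole block to the uninitialized extent; the
        *    written range becomes PNFS_BLOCK_READ_WRITE_DATA and must
        *    later be reported via LAYOUTCOMMIT. */
       return write_volume(rw_off, block, blksize);
   }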
When a client wishes to write data to a PNFS_BLOCK_INVALID_DATA extent that is not covered by a PNFS_BLOCK_READ_DATA extent, it MUST treat this write identically to a write to a file not involved with copy-on-write semantics. Thus, data must be written in at least block-sized increments, aligned to multiples of block-sized offsets, and unwritten portions of blocks must be zero filled.
In the LAYOUTCOMMIT operation that normally sends updated layout information back to the server, for writable data, some PNFS_BLOCK_INVALID_DATA extents may be committed as PNFS_BLOCK_READ_WRITE_DATA extents, signifying that the storage at the corresponding bex_storage_offset values has been stored into and is now to be considered as valid data to be read. PNFS_BLOCK_READ_DATA extents are not committed to the server. For extents that the client receives via LAYOUTGET as PNFS_BLOCK_INVALID_DATA and returns via LAYOUTCOMMIT as PNFS_BLOCK_READ_WRITE_DATA, the server will understand that the PNFS_BLOCK_READ_DATA mapping for that extent is no longer valid or necessary for that file.
2.3.5. Extents are Permissions

Layout extents returned to pNFS clients grant permission to read or write; PNFS_BLOCK_READ_DATA and PNFS_BLOCK_NONE_DATA are read-only (PNFS_BLOCK_NONE_DATA reads as zeroes), while PNFS_BLOCK_READ_WRITE_DATA and PNFS_BLOCK_INVALID_DATA are read/write (PNFS_BLOCK_INVALID_DATA reads as zeros; any write converts it to PNFS_BLOCK_READ_WRITE_DATA). This is the only means a client has of obtaining permission to perform direct I/O to storage devices; a pNFS client MUST NOT perform direct I/O operations that are not permitted by an extent held by the client. Client adherence to this rule places the pNFS server in control of potentially conflicting storage device operations, enabling the server to determine what does conflict and how to avoid conflicts by granting and recalling extents to/from clients.
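A sketch of the check a client might apply before issuing direct I/O: the request must fall entirely within a held extent whose state permits the access. The types are simplified stand-ins for the structures defined earlier:

   #include <stdbool.h>
   #include <stdint.h>

   enum ext_state { RW_DATA, RO_DATA, INVALID, NONE };

   struct held_extent {
       uint64_t file_off, len;
       enum ext_state state;
   };

   static bool
   io_permitted(const struct held_extent *ex, unsigned n,
                uint64_t off, uint64_t len, bool is_write)
   {
       for (unsigned i = 0; i < n; i++) {
           bool writable =
               (ex[i].state == RW_DATA || ex[i].state == INVALID);

           if (off >= ex[i].file_off &&
               off + len <= ex[i].file_off + ex[i].len &&
               (!is_write || writable))
               return true;       /* covered by a held extent */
       }
       return false;              /* MUST NOT issue this direct I/O */
   }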
Block/volume class storage devices are not required to perform read and write operations atomically. Overlapping concurrent read and write operations to the same data may cause the read to return a mixture of before-write and after-write data. Overlapping write operations can be worse, as the result could be a mixture of data from the two write operations; data corruption can occur if the underlying storage is striped and the operations complete in different orders on different stripes. When there are multiple clients who wish to access the same data, a pNFS server can avoid these conflicts by implementing a concurrency control policy of single writer XOR multiple readers. This policy MUST be implemented when storage devices do not provide atomicity for concurrent read/write and write/write operations to the same data.
If a client makes a layout request that conflicts with an existing layout delegation, the request will be rejected with the error NFS4ERR_LAYOUTTRYLATER. This client is then expected to retry the request after a short interval. During this interval, the server SHOULD recall the conflicting portion of the layout delegation from the client that currently holds it. This reject-and-retry approach does not prevent client starvation when there is contention for the layout of a particular file. For this reason, a pNFS server SHOULD implement a mechanism to prevent starvation. One possibility is that the server can maintain a queue of rejected layout requests. Each new layout request can be checked to see if it conflicts with a previous rejected request, and if so, the newer request can be rejected. Once the original requesting client retries its request, its entry in the rejected request queue can be cleared, or the entry in the rejected request queue can be removed when it reaches a certain age.
NFSv4 supports mandatory locks and share reservations. These are mechanisms that clients can use to restrict the set of I/O operations that are permissible to other clients. Since all I/O operations ultimately arrive at the NFSv4 server for processing, the server is in a position to enforce these restrictions. However, with pNFS layouts, I/Os will be issued from the clients that hold the layouts directly to the storage devices that host the data. These devices have no knowledge of files, mandatory locks, or share reservations, and are not in a position to enforce such restrictions. For this reason the NFSv4 server MUST NOT grant layouts that conflict with mandatory locks or share reservations. Further, if a conflicting mandatory lock request or a conflicting open request arrives at the server, the server MUST recall the part of the layout in conflict with the request before granting the request.
The end-of-file location can be changed in two ways: implicitly as the result of a WRITE or LAYOUTCOMMIT beyond the current end-of-file, or explicitly as the result of a SETATTR request. Typically, when a file is truncated by an NFSv4 client via the SETATTR call, the server frees any disk blocks belonging to the file that are beyond the new end-of-file byte, and MUST write zeros to the portion of the new end-of-file block beyond the new end-of-file byte. These actions render any pNFS layouts that refer to the blocks that are freed or written semantically invalid. Therefore, the server MUST recall from clients the portions of any pNFS layouts that refer to blocks that will be freed or written by the server before processing the truncate request. These recalls may take time to complete; as explained in [NFSv4.1], if the server cannot respond to the client SETATTR request in a reasonable amount of time, it SHOULD reply to the client with the error NFS4ERR_DELAY.
Blocks in the PNFS_BLOCK_INVALID_DATA state that lie beyond the new end-of-file block present a special case. The server has reserved these blocks for use by a pNFS client with a writable layout for the file, but the client has yet to commit the blocks, and they are not yet a part of the file mapping on disk. The server MAY free these blocks while processing the SETATTR request. If so, the server MUST recall any layouts from pNFS clients that refer to the blocks before processing the truncate. If the server does not free the PNFS_BLOCK_INVALID_DATA blocks while processing the SETATTR request, it need not recall layouts that refer only to the PNFS_BLOCK_INVALID_DATA blocks.
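As a worked, non-normative example of the truncate arithmetic described above, the C sketch below computes the byte range the server zeros within the new end-of-file block and the index of the first whole block beyond the new end-of-file; the 4096-byte block size and the truncate offset are assumptions for illustration.

   /* Non-normative sketch: which bytes are zeroed and which blocks
    * become freeable when a file is truncated to new_eof.  The
    * conflicting layout portions MUST be recalled before the server
    * zeros or frees anything. */
   #include <inttypes.h>
   #include <stdint.h>
   #include <stdio.h>

   #define BLOCK_SIZE 4096u   /* example volume block size */

   int main(void)
   {
       uint64_t new_eof = 10000;                /* example truncate target */
       uint64_t tail    = new_eof % BLOCK_SIZE; /* bytes used in EOF block */

       if (tail != 0)
           printf("zero bytes [%" PRIu64 ", %" PRIu64 ") of the EOF block\n",
                  new_eof, new_eof + (BLOCK_SIZE - tail));

       /* First block index wholly beyond the new EOF; these blocks
        * (and, at the server's option, PNFS_BLOCK_INVALID_DATA blocks
        * past the new EOF) are freed only after the recalls complete. */
       uint64_t first_freed = (new_eof + BLOCK_SIZE - 1) / BLOCK_SIZE;
       printf("free whole blocks starting at index %" PRIu64 "\n",
              first_freed);
       return 0;
   }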
When a file is extended implicitly by a WRITE or LAYOUTCOMMIT beyond the current end-of-file, or extended explicitly by a SETATTR request, the server need not recall any portions of any pNFS layouts.
The SETATTR operation supports a layout hint attribute [NFSv4.1]. When the client sets a layout hint (data type layouthint4) with a layout type of LAYOUT4_BLOCK_VOLUME (the loh_type field), the loh_body field contains a value of data type pnfs_block_layouthint4.
///   /* block layout specific type for loh_body */
///   struct pnfs_block_layouthint4 {
///       uint64_t blh_maximum_io_time; /* maximum i/o time in seconds
///                                        */
///   };
///
The block layout client uses the layout hint data structure to communicate to the server the maximum time that it may take an I/O to execute on the client. Clients using block layouts MUST set the layout hint attribute before using LAYOUTGET operations.
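As a non-normative illustration of the wire form of this hint, the C sketch below encodes the loh_body: per [XDR], the uint64_t blh_maximum_io_time is represented as eight big-endian bytes, which the client carries in the layout hint (loh_type set to LAYOUT4_BLOCK_VOLUME) before its first LAYOUTGET. The 60-second value is an example.

   /* Non-normative sketch: XDR encoding of pnfs_block_layouthint4.
    * XDR [XDR] represents a 64-bit unsigned integer as eight bytes in
    * big-endian order, so loh_body for this layout type is simply the
    * eight-byte encoding of blh_maximum_io_time. */
   #include <stdint.h>
   #include <stdio.h>

   static void xdr_encode_uint64(uint8_t buf[8], uint64_t v)
   {
       for (int i = 7; i >= 0; i--) {
           buf[i] = (uint8_t)(v & 0xff);
           v >>= 8;
       }
   }

   int main(void)
   {
       uint64_t blh_maximum_io_time = 60;  /* seconds; example bound */
       uint8_t  loh_body[8];

       xdr_encode_uint64(loh_body, blh_maximum_io_time);

       for (int i = 0; i < 8; i++)
           printf("%02x ", loh_body[i]);
       printf("\n");  /* prints: 00 00 00 00 00 00 00 3c */
       return 0;
   }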
The pNFS block protocol must handle situations in which a system failure, typically a network connectivity issue, requires the server to unilaterally revoke extents from one client in order to transfer the extents to another client. The pNFS server implementation MUST ensure that when resources are transferred to another client, they are not used by the client originally owning them, and this must be ensured against any possible combination of partitions and delays among all of the participants to the protocol (server, storage and client). Two approaches to guaranteeing this isolation are possible and are discussed below.
One implementation choice for fencing the block client from the block storage is the use of LUN masking or mapping at the storage systems or storage area network to disable access by the client to be isolated. This requires server access to a management interface for the storage system and authorization to perform LUN masking and management operations. For example, the Storage Management Initiative Specification (SMI-S) [SMIS] provides a means to discover and mask LUNs, including a means of associating clients with the necessary World Wide Names or Initiator names to be masked.
In the absence of support for LUN masking, the server has to rely on the clients to implement a timed-lease I/O fencing mechanism. Because clients do not know if the server is using LUN masking, in all cases, the client MUST implement timed-lease fencing. In timed-lease fencing, we define two time periods: the first, "lease_time", is the length of a lease as defined by the server's lease_time attribute (see [NFSv4.1]); the second, "blh_maximum_io_time", is the maximum time it can take for a client I/O to the storage system to either complete or fail. This value is often 30 or 60 seconds, but may be longer in some environments. If the maximum client I/O time cannot be bounded, the client MUST use a value of all 1s as the blh_maximum_io_time.
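A non-normative sketch of how a client might derive this value follows; the per-path timeout and path count are hypothetical configuration inputs, and only the all-1s sentinel comes from this specification.

   /* Non-normative sketch: deriving blh_maximum_io_time on the client. */
   #include <stdbool.h>
   #include <stdint.h>

   #define BLH_UNBOUNDED UINT64_MAX /* all 1s: I/O time cannot be bounded */

   uint64_t compute_blh_maximum_io_time(bool io_time_bounded,
                                        uint64_t per_path_timeout_secs,
                                        uint64_t num_paths)
   {
       if (!io_time_bounded)
           return BLH_UNBOUNDED;

       /* With a multipath driver, an I/O may be retried on every path
        * before it finally completes or fails, so the bound must cover
        * the sum of all per-path attempts. */
       return per_path_timeout_secs * num_paths;
   }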
After a new client ID is established, the client MUST use SETATTR with a layout hint of type LAYOUT4_BLOCK_VOLUME to inform the server of its maximum I/O time prior to issuing the first LAYOUTGET operation. While the maximum I/O time hint is a per-file attribute, it is actually a per-client characteristic. Thus, the server MUST maintain the last maximum I/O time hint sent separately for each client. Each time the maximum I/O time changes, the server MUST
apply it to all files for which the client has a layout. If the client does not specify this attribute on a file for which a block layout is requested, the server SHOULD use the most recent value provided by the same client for any file; if that client has not provided a value for this attribute, the server SHOULD reject the layout request with the error NFS4ERR_LAYOUTUNAVAILABLE. The client SHOULD NOT send a SETATTR of the layout hint with every LAYOUTGET. A server that implements fencing via LUN masking SHOULD accept any maximum I/O time value from a client. A server that does not implement fencing may return an error NFS4ERR_INVAL to the SETATTR operation. Such a server SHOULD return NFS4ERR_INVAL when a client sends an unbounded maximum I/O time (all 1s), or when the maximum I/O time is significantly greater than that of other clients using block layouts with pNFS.
When a client receives the error NFS4ERR_INVAL in response to the SETATTR operation for a layout hint, the client MUST NOT use the LAYOUTGET operation. After responding with NFS4ERR_INVAL to the SETATTR for the layout hint, the server MUST return the error NFS4ERR_LAYOUTUNAVAILABLE to all subsequent LAYOUTGET operations from that client. Thus, the server, by returning either NFS4ERR_INVAL or NFS4_OK, determines whether or not a client with a large or unbounded maximum I/O time may use pNFS.
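The non-normative C sketch below pulls the server-side rules of the last three paragraphs together. The per-client state structure is illustrative; the numeric error values are those assigned by [NFSv4.1].

   /* Non-normative sketch of the server's layout hint bookkeeping.
    * The hint is a per-client characteristic even though it is carried
    * in a per-file attribute. */
   #include <stdbool.h>
   #include <stdint.h>

   #define NFS4_OK                   0
   #define NFS4ERR_INVAL             22     /* values per [NFSv4.1] */
   #define NFS4ERR_LAYOUTUNAVAILABLE 10059

   struct client_state {
       bool     hint_seen;      /* client has sent a usable hint */
       bool     hint_rejected;  /* hint was answered with NFS4ERR_INVAL */
       uint64_t max_io_time;    /* latest blh_maximum_io_time, seconds */
   };

   /* SETATTR with a LAYOUT4_BLOCK_VOLUME layout hint arrives. */
   int set_layout_hint(struct client_state *c, uint64_t max_io_time,
                       bool fencing_via_lun_masking)
   {
       if (!fencing_via_lun_masking && max_io_time == UINT64_MAX) {
           c->hint_rejected = true; /* unbounded I/O time, no LUN masking */
           return NFS4ERR_INVAL;
       }
       c->hint_seen   = true;
       c->max_io_time = max_io_time; /* applies to ALL of this client's
                                        files that have layouts */
       return NFS4_OK;
   }

   /* LAYOUTGET arrives for a file carrying no layout hint of its own. */
   int check_layoutget(const struct client_state *c)
   {
       if (c->hint_rejected || !c->hint_seen)
           return NFS4ERR_LAYOUTUNAVAILABLE;
       return NFS4_OK;              /* use c->max_io_time for fencing */
   }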
Using the lease time and the maximum I/O time values, we specify the behavior of the client and server as follows.
When a client receives layout information via a LAYOUTGET operation, those layouts are valid for at most "lease_time" seconds from when the server granted them. A layout is renewed by any successful SEQUENCE operation, or whenever a new stateid is created or updated (see the section "Lease Renewal" of [NFSv4.1]). If the layout lease is not renewed prior to expiration, the client MUST cease to use the layout after "lease_time" seconds from when it either sent the original LAYOUTGET command or sent the last operation renewing the lease. In other words, the client may not issue any I/O to blocks specified by an expired layout. In the presence of large communication delays between the client and server, it is even possible for the lease to expire prior to the server response arriving at the client. In such a situation, the client MUST NOT use the expired layouts, and SHOULD revert to using standard NFSv4.1 READ and WRITE operations. Furthermore, the client must be configured such that I/O operations complete within the "blh_maximum_io_time" even in the presence of multipath drivers that will retry I/Os via multiple paths.
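A non-normative client-side sketch of this rule follows; the structure and the use of the C library time() clock are illustrative.

   /* Non-normative sketch: conservative client-side layout lease check.
    * The lease clock starts when the renewing operation is *sent*, not
    * when the reply arrives, so a long network delay can expire a
    * layout before the client ever sees the server's response. */
   #include <stdbool.h>
   #include <time.h>

   struct layout_lease {
       time_t last_renewal_sent; /* when the last renewing op was sent */
       time_t lease_time;        /* server's lease_time attribute, secs */
   };

   bool layout_usable(const struct layout_lease *l)
   {
       /* If this returns false, the client MUST NOT issue I/O against
        * the layout and reverts to regular NFSv4.1 READ/WRITE. */
       return time(NULL) - l->last_renewal_sent < l->lease_time;
   }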
As stated in the "Dealing with Lease Expiration on the Client" section of [NFSv4.1], if any SEQUENCE operation is successful, but sr_status_flags has SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED, SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED, or SEQ4_STATUS_ADMIN_STATE_REVOKED set, the client MUST immediately cease to use all layouts and device-ID-to-device-address mappings associated with the corresponding server.
In the absence of known two-way communication between the client and the server on the fore channel, the server must wait for at least the time period "lease_time" plus "blh_maximum_io_time" before transferring layouts from the original client to any other client. The server, like the client, must take a conservative approach, and start the lease expiration timer from the time that it received the operation that last renewed the lease.
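The matching server-side timer can be sketched the same way (non-normative; names are illustrative):

   /* Non-normative sketch: server-side fencing wait.  With no known
    * two-way communication on the fore channel, extents may be given
    * to another client only after lease_time + blh_maximum_io_time
    * seconds, measured from receipt of the last renewing operation. */
   #include <stdbool.h>
   #include <stdint.h>
   #include <time.h>

   bool safe_to_transfer_extents(time_t last_renewal_received,
                                 uint64_t lease_time,
                                 uint64_t blh_maximum_io_time)
   {
       /* A client with an unbounded I/O time (all 1s) can never
        * satisfy this bound; such clients must be fenced by LUN
        * masking or be denied layouts in the first place. */
       if (blh_maximum_io_time == UINT64_MAX)
           return false;

       uint64_t elapsed = (uint64_t)(time(NULL) - last_renewal_received);
       return elapsed > lease_time + blh_maximum_io_time;
   }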
A critical requirement in crash recovery is that both the client and the server know when the other has failed. Additionally, it is required that a client sees a consistent view of data across server restarts. These requirements and a full discussion of crash recovery issues are covered in the "Crash Recovery" section of the NFSv4.1 specification [NFSv4.1]. This document contains additional crash recovery material specific only to the block/volume layout.
When the server crashes while the client holds a writable layout, and the client has written data to blocks covered by the layout, and the blocks are still in the PNFS_BLOCK_INVALID_DATA state, the client has two options for recovery. If the data that has been written to these blocks is still cached by the client, the client can simply re-write the data via NFSv4, once the server has come back online. However, if the data is no longer in the client's cache, the client MUST NOT attempt to source the data from the data servers. Instead, it should attempt to commit the blocks in question to the server during the server's recovery grace period, by sending a LAYOUTCOMMIT with the "loca_reclaim" flag set to true. This process is described in detail in Section 18.42.4 of [NFSv4.1].
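The decision the client faces can be summarized in the following non-normative sketch; only the choice itself and the loca_reclaim flag come from the text above, and the names are illustrative.

   /* Non-normative sketch: client recovery for written-but-uncommitted
    * PNFS_BLOCK_INVALID_DATA blocks after a server restart. */
   #include <stdbool.h>

   enum recovery_action {
       REWRITE_VIA_NFS,      /* data still cached: resend as NFSv4
                                WRITEs once the server is back */
       RECLAIM_LAYOUTCOMMIT, /* data only on the storage devices: send
                                LAYOUTCOMMIT with loca_reclaim = true
                                during the server's grace period (see
                                Section 18.42.4 of [NFSv4.1]) */
   };

   enum recovery_action choose_recovery(bool data_still_cached)
   {
       /* Reading the data back from the storage devices is not an
        * option: the client MUST NOT source it from there, because
        * uncommitted blocks are not yet part of the on-disk mapping. */
       return data_still_cached ? REWRITE_VIA_NFS : RECLAIM_LAYOUTCOMMIT;
   }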
The server may decide that it cannot hold all of the state for layouts without running out of resources. In such a case, it is free to recall individual layouts using CB_LAYOUTRECALL to reduce the load, or it may choose to request that the client return any layout.
The NFSv4.1 spec [NFSv4.1] defines the following types:
const RCA4_TYPE_MASK_BLK_LAYOUT = 4;
   struct CB_RECALL_ANY4args {
       uint32_t  craa_objects_to_keep;
       bitmap4   craa_type_mask;
   };
When the server sends a CB_RECALL_ANY request to a client specifying the RCA4_TYPE_MASK_BLK_LAYOUT bit in craa_type_mask, the client should immediately respond with NFS4_OK, and then asynchronously return complete file layouts until the number of files with layouts cached on the client is less than craa_objects_to_keep.
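A non-normative sketch of the client's side of this exchange follows; the cached-layout counter is illustrative, and the per-file LAYOUTRETURN (which a real client would issue asynchronously) is elided.

   /* Non-normative sketch: client handling of CB_RECALL_ANY with the
    * RCA4_TYPE_MASK_BLK_LAYOUT bit set in craa_type_mask. */
   #include <stdint.h>

   #define NFS4_OK 0

   static unsigned cached_file_layouts = 12;  /* example starting count */

   uint32_t handle_cb_recall_any(uint32_t craa_objects_to_keep)
   {
       /* The NFS4_OK reply goes out immediately; the returns below
        * would run asynchronously in a real client (shown inline here
        * for brevity). */
       while (cached_file_layouts > 0 &&
              cached_file_layouts >= craa_objects_to_keep) {
           /* send LAYOUTRETURN for one complete file layout (elided) */
           cached_file_layouts--;
       }
       return NFS4_OK;
   }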
The server may respond to LAYOUTGET with a variety of error statuses. These errors can convey transient conditions or more permanent conditions that are unlikely to be resolved soon.
The transient errors, NFS4ERR_RECALLCONFLICT and NFS4ERR_LAYOUTTRYLATER, are used to indicate that the server cannot immediately grant the layout to the client. In the former case, this is because the server has recently issued a CB_LAYOUTRECALL to the requesting client, whereas in the case of NFS4ERR_LAYOUTTRYLATER, the server cannot grant the request, possibly due to sharing conflicts with other clients. In either case, a reasonable approach for the client is to wait several milliseconds and retry the request. The client SHOULD track the number of retries, and if forward progress is not made, the client SHOULD send the READ or WRITE operation directly to the server.
The error NFS4ERR_LAYOUTUNAVAILABLE may be returned by the server if layouts are not supported for the requested file or its containing file system. The server may also return this error code if it is in the process of migrating the file from secondary storage, or for any other reason that causes the server to be unable to supply the layout. As a result of receiving NFS4ERR_LAYOUTUNAVAILABLE, the client SHOULD send future READ and WRITE requests directly to the server. It is expected that a client will not cache the file's layoutunavailable state forever, particularly if the file is closed, and thus, eventually, the client MAY reissue a LAYOUTGET operation.
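The following non-normative sketch combines the transient and permanent cases into one client-side policy; the retry limit is an illustrative tuning choice, and the error values are those assigned by [NFSv4.1].

   /* Non-normative sketch: classifying LAYOUTGET errors. */
   #include <stdint.h>

   #define NFS4ERR_LAYOUTTRYLATER    10058  /* values per [NFSv4.1] */
   #define NFS4ERR_LAYOUTUNAVAILABLE 10059
   #define NFS4ERR_RECALLCONFLICT    10061

   #define LAYOUTGET_MAX_RETRIES 5          /* illustrative bound */

   enum io_path {
       RETRY_LAYOUTGET, /* wait several milliseconds, then retry */
       USE_SERVER_IO,   /* send READ/WRITE directly to the server */
   };

   enum io_path classify_layoutget_error(uint32_t status, unsigned retries)
   {
       switch (status) {
       case NFS4ERR_RECALLCONFLICT: /* server recently recalled from us */
       case NFS4ERR_LAYOUTTRYLATER: /* e.g., sharing conflict */
           /* Transient: retry, but track attempts so that a client
            * making no forward progress falls back to server I/O. */
           return retries < LAYOUTGET_MAX_RETRIES ? RETRY_LAYOUTGET
                                                  : USE_SERVER_IO;
       case NFS4ERR_LAYOUTUNAVAILABLE:
       default:
           /* Longer-lived condition: use server I/O; the client MAY
            * reissue LAYOUTGET later (e.g., after the file is
            * reopened). */
           return USE_SERVER_IO;
       }
   }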
Typically, SAN disk arrays and SAN protocols provide access control mechanisms (e.g., LUN mapping and/or masking) that operate at the granularity of individual hosts. The functionality provided by such
mechanisms makes it possible for the server to "fence" individual client machines from certain physical disks -- that is to say, to prevent individual client machines from reading or writing to certain physical disks. Finer-grained access control methods are not generally available. For this reason, certain security responsibilities are delegated to pNFS clients for block/volume layouts. Block/volume storage systems generally control access at a volume granularity, and hence pNFS clients have to be trusted to only perform accesses allowed by the layout extents they currently hold (e.g., and not access storage for files on which a layout extent is not held). In general, the server will not be able to prevent a client that holds a layout for a file from accessing parts of the physical disk not covered by the layout. Similarly, the server will not be able to prevent a client from accessing blocks covered by a layout that it has already returned. This block-based level of protection must be provided by the client software.
An alternative method of block/volume protocol use is for the storage devices to export virtualized block addresses, which do reflect the files to which blocks belong. These virtual block addresses are exported to pNFS clients via layouts. This allows the storage device to make appropriate access checks, while mapping virtual block addresses to physical block addresses. In environments where the security requirements are such that client-side protection from access to storage outside of the authorized layout extents is not sufficient, pNFS block/volume storage layouts SHOULD NOT be used unless the storage device is able to implement the appropriate access checks, via use of virtualized block addresses or other means. In contrast, an environment where client-side protection may suffice consists of co-located clients, server and storage systems in a data center with a physically isolated SAN under control of a single system administrator or small group of system administrators.
This also has implications for some NFSv4 functionality outside pNFS. For instance, if a file is covered by a mandatory read-only lock, the server can ensure that only readable layouts for the file are granted to pNFS clients. However, it is up to each pNFS client to ensure that the readable layout is used only to service read requests, and not to allow writes to the existing parts of the file. Similarly, block/volume storage devices are unable to validate NFS Access Control Lists (ACLs) and file open modes, so the client must enforce the policies before sending a READ or WRITE request to the storage device. Since block/volume storage systems are generally not capable of enforcing such file-based security, in environments where pNFS clients cannot be trusted to enforce such policies, pNFS block/volume storage layouts SHOULD NOT be used.
Access to block/volume storage is logically at a lower layer of the I/O stack than NFSv4, and hence NFSv4 security is not directly applicable to protocols that access such storage directly. Depending on the protocol, some of the security mechanisms provided by NFSv4 (e.g., encryption, cryptographic integrity) may not be available or may be provided via different means. At one extreme, pNFS with block/volume storage can be used with storage access protocols (e.g., parallel SCSI) that provide essentially no security functionality. At the other extreme, pNFS may be used with storage protocols such as iSCSI that can provide significant security functionality. It is the responsibility of those administering and deploying pNFS with a block/volume storage access protocol to ensure that appropriate protection is provided to that protocol (physical security is a common means for protocols not based on IP). In environments where the security requirements for the storage protocol cannot be met, pNFS block/volume storage layouts SHOULD NOT be used.
When security is available for a storage protocol, it is generally at a different granularity and with a different notion of identity than NFSv4 (e.g., NFSv4 controls user access to files, iSCSI controls initiator access to volumes). The responsibility for enforcing appropriate correspondences between these security layers is placed upon the pNFS client. As with the issues in the first paragraph of this section, in environments where the security requirements are such that client-side protection from access to storage outside of the layout is not sufficient, pNFS block/volume storage layouts SHOULD NOT be used.
This document specifies the block/volume layout type for pNFS and associated functionality.
There are no IANA considerations in this document. All pNFS IANA Considerations are covered in [NFSv4.1].
This document draws extensively on the authors' familiarity with the mapping functionality and protocol in EMC's Multi-Path File System (MPFS) (previously named HighRoad) system [MPFS]. The protocol used by MPFS is called FMP (File Mapping Protocol); it is an add-on protocol that runs in parallel with file system protocols such as NFSv3 to provide pNFS-like functionality for block/volume storage. While drawing on FMP, the data structures and functional considerations in this document differ in significant ways, based on
lessons learned and the opportunity to take advantage of NFSv4 features such as COMPOUND operations. The design to support pNFS client participation in copy-on-write is based on text and ideas contributed by Craig Everhart.
Andy Adamson, Ben Campbell, Richard Chandler, Benny Halevy, Fredric Isaman, and Mario Wurzl all helped to review versions of this specification.
[LEGAL] IETF Trust, "Legal Provisions Relating to IETF Documents", http://trustee.ietf.org/docs/IETF-Trust-License-Policy.pdf, November 2008.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[NFSv4.1] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network File System (NFS) Version 4 Minor Version 1 Protocol", RFC 5661, January 2010.
[XDR] Eisler, M., Ed., "XDR: External Data Representation Standard", STD 67, RFC 4506, May 2006.
[MPFS] EMC Corporation, "EMC Celerra Multi-Path File System (MPFS)", EMC Data Sheet, http://www.emc.com/collateral/software/data-sheet/h2006-celerra-mpfs-mpfsi.pdf.
[SMIS] SNIA, "Storage Management Initiative Specification (SMI-S) v1.4", http://www.snia.org/tech_activities/standards/curr_standards/smi/SMI-S_Technical_Position_v1.4.0r4.zip.
Authors' Addresses
David L. Black
EMC Corporation
176 South Street
Hopkinton, MA 01748

Phone: +1 (508) 293-7953
EMail: black_david@emc.com
Stephen Fridella
Nasuni Inc
313 Speen St
Natick, MA 01760

EMail: stevef@nasuni.com
Jason Glasgow
Google
5 Cambridge Center
Cambridge, MA 02142

Phone: +1 (617) 575 1599
EMail: jglasgow@aya.yale.edu