Internet Engineering Task Force (IETF)                         B. Halevy
Request for Comments: 5664                                      B. Welch
Category: Standards Track                                     J. Zelenka
ISSN: 2070-1721                                                  Panasas
                                                            January 2010
Object-Based Parallel NFS (pNFS) Operations
Abstract
Parallel NFS (pNFS) extends Network File System version 4 (NFSv4) to allow clients to directly access file data on the storage used by the NFSv4 server. This ability to bypass the server for data access can increase both performance and parallelism, but requires additional client functionality for data access, some of which is dependent on the class of storage used, a.k.a. the Layout Type. The main pNFS operations and data types in NFSv4 Minor version 1 specify a layout-type-independent layer; layout-type-specific information is conveyed using opaque data structures whose internal structure is further defined by the particular layout type specification. This document specifies the NFSv4.1 Object-Based pNFS Layout Type as a companion to the main NFSv4 Minor version 1 specification.
Status of This Memo
This is an Internet Standards Track document.
This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 5741.
Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at http://www.rfc-editor.org/info/rfc5664.
Copyright Notice
Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
Table of Contents
   1. Introduction
      1.1. Requirements Language
   2. XDR Description of the Objects-Based Layout Protocol
      2.1. Code Components Licensing Notice
   3. Basic Data Type Definitions
      3.1. pnfs_osd_objid4
      3.2. pnfs_osd_version4
      3.3. pnfs_osd_object_cred4
      3.4. pnfs_osd_raid_algorithm4
   4. Object Storage Device Addressing and Discovery
      4.1. pnfs_osd_targetid_type4
      4.2. pnfs_osd_deviceaddr4
           4.2.1. SCSI Target Identifier
           4.2.2. Device Network Address
   5. Object-Based Layout
      5.1. pnfs_osd_data_map4
      5.2. pnfs_osd_layout4
      5.3. Data Mapping Schemes
           5.3.1. Simple Striping
           5.3.2. Nested Striping
           5.3.3. Mirroring
      5.4. RAID Algorithms
           5.4.1. PNFS_OSD_RAID_0
           5.4.2. PNFS_OSD_RAID_4
           5.4.3. PNFS_OSD_RAID_5
           5.4.4. PNFS_OSD_RAID_PQ
           5.4.5. RAID Usage and Implementation Notes
   6. Object-Based Layout Update
      6.1. pnfs_osd_deltaspaceused4
      6.2. pnfs_osd_layoutupdate4
   7. Recovering from Client I/O Errors
   8. Object-Based Layout Return
      8.1. pnfs_osd_errno4
      8.2. pnfs_osd_ioerr4
      8.3. pnfs_osd_layoutreturn4
   9. Object-Based Creation Layout Hint
      9.1. pnfs_osd_layouthint4
   10. Layout Segments
       10.1. CB_LAYOUTRECALL and LAYOUTRETURN
       10.2. LAYOUTCOMMIT
   11. Recalling Layouts
       11.1. CB_RECALL_ANY
   12. Client Fencing
   13. Security Considerations
       13.1. OSD Security Data Types
       13.2. The OSD Security Protocol
       13.3. Protocol Privacy Requirements
       13.4. Revoking Capabilities
   14. IANA Considerations
   15. References
       15.1. Normative References
       15.2. Informative References
   Appendix A. Acknowledgments
In pNFS, the file server returns typed layout structures that describe where file data is located. There are different layouts for different storage systems and methods of arranging data on storage devices. This document describes the layouts used with object-based storage devices (OSDs) that are accessed according to the OSD storage protocol standard (ANSI INCITS 400-2004 [1]).
An "object" is a container for data and attributes, and files are stored in one or more objects. The OSD protocol specifies several operations on objects, including READ, WRITE, FLUSH, GET ATTRIBUTES, SET ATTRIBUTES, CREATE, and DELETE. However, using the object-based layout the client only uses the READ, WRITE, GET ATTRIBUTES, and FLUSH commands. The other commands are only used by the pNFS server.
An object-based layout for pNFS includes object identifiers, capabilities that allow clients to READ or WRITE those objects, and various parameters that control how file data is striped across their component objects. The OSD protocol has a capability-based security scheme that allows the pNFS server to control what operations and what objects can be used by clients. This scheme is described in more detail in the "Security Considerations" section (Section 13).
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [2].
This document contains the external data representation (XDR [3]) description of the NFSv4.1 objects layout protocol. The XDR description is embedded in this document in a way that makes it simple for the reader to extract into a ready-to-compile form. The reader can feed this document into the following shell script to produce the machine readable XDR description of the NFSv4.1 objects layout protocol:
   #!/bin/sh
   grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'
That is, if the above script is stored in a file called "extract.sh", and this document is in a file called "spec.txt", then the reader can do:
sh extract.sh < spec.txt > pnfs_osd_prot.x
The effect of the script is to remove leading white space from each line, plus a sentinel sequence of "///".
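For readers without a POSIX shell, the same filtering can be sketched in Python. This is an illustrative equivalent of the grep/sed pipeline above, not part of the specification; the function name extract_xdr is an assumption made for this example.

```python
import re


def extract_xdr(text: str) -> str:
    """Mimic: grep '^ *///' | sed 's?^ */// ??' | sed 's?^ *///$??'"""
    out = []
    for line in text.splitlines():
        # grep '^ *///': keep only lines whose first non-space text is "///".
        if not re.match(r"^ *///", line):
            continue
        # First sed: drop leading spaces, the sentinel, and one following space.
        line = re.sub(r"^ */// ", "", line, count=1)
        # Second sed: a line holding only the sentinel becomes an empty line.
        line = re.sub(r"^ *///$", "", line, count=1)
        out.append(line)
    return "\n".join(out) + ("\n" if out else "")


if __name__ == "__main__":
    sample = "   /// struct a {\n   ///   int x;\n   ///\nprose\n   /// };\n"
    print(extract_xdr(sample), end="")
```

Lines that do not carry the sentinel, such as ordinary prose, are dropped; indentation that follows the sentinel and its single separating space is preserved.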
The embedded XDR file header follows. Subsequent XDR descriptions, marked with the sentinel sequence, are embedded throughout the document.
Note that the XDR code contained in this document depends on types from the NFSv4.1 nfs4_prot.x file ([4]). This includes both NFS types that end with a 4, such as offset4 and length4, and more generic types such as uint32_t and uint64_t.
The XDR description, marked with lines beginning with the sequence "///", as well as scripts for extracting the XDR description, are Code Components as described in Section 4 of "Legal Provisions Relating to IETF Documents" [5]. These Code Components are licensed according to the terms of Section 4 of "Legal Provisions Relating to IETF Documents".
   /// /*
   ///  * Copyright (c) 2010 IETF Trust and the persons identified
   ///  * as authors of the code.  All rights reserved.
   ///  *
   ///  * Redistribution and use in source and binary forms, with
   ///  * or without modification, are permitted provided that the
   ///  * following conditions are met:
   ///  *
   ///  * o Redistributions of source code must retain the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer.
   ///  *
   ///  * o Redistributions in binary form must reproduce the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer in the documentation and/or other
   ///  *   materials provided with the distribution.
   ///  *
   ///  * o Neither the name of Internet Society, IETF or IETF
   ///  *   Trust, nor the names of specific contributors, may be
   ///  *   used to endorse or promote products derived from this
   ///  *   software without specific prior written permission.
   ///  *
   ///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
   ///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
   ///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
   ///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
   ///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO
   ///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
   ///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
   ///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
   ///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   ///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
   ///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
   ///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
   ///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   ///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
   ///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   ///  *
   ///  * This code was derived from RFC 5664.
   ///  * Please reproduce this note if possible.
   ///  */
   ///
   /// /*
   ///  * pnfs_osd_prot.x
   ///  */
   ///
   /// %#include <nfs4_prot.x>
   ///
The following sections define basic data types and constants used by the Object-Based Layout protocol.
An object is identified by a number, somewhat like an inode number. The object storage model has a two-level scheme, where the objects within an object storage device are grouped into partitions.
   /// struct pnfs_osd_objid4 {
   ///     deviceid4       oid_device_id;
   ///     uint64_t        oid_partition_id;
   ///     uint64_t        oid_object_id;
   /// };
   ///
The pnfs_osd_objid4 type is used to identify an object within a partition on a specified object storage device. "oid_device_id" selects the object storage device from the set of available storage devices. The device is identified with the deviceid4 type, which is an index into addressing information about that device returned by the GETDEVICELIST and GETDEVICEINFO operations. The deviceid4 data type is defined in NFSv4.1 [6]. Within an OSD, a partition is identified with a 64-bit number, "oid_partition_id". Within a partition, an object is identified with a 64-bit number, "oid_object_id". Creation and management of partitions is outside the scope of this document, and is a facility provided by the object-based storage file system.
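As an illustration of how pnfs_osd_objid4 might look on the wire, the following Python sketch packs and unpacks the three fields using standard XDR rules. It assumes the NFSv4.1 definition of deviceid4 as a fixed 16-byte opaque; the helper names encode_objid4 and decode_objid4 are hypothetical and do not appear in this document.

```python
import struct

# Assumption: NFSv4.1 defines deviceid4 as opaque[NFS4_DEVICEID4_SIZE], size 16.
NFS4_DEVICEID4_SIZE = 16

# Fixed-length opaque (no length prefix) followed by two unsigned hypers,
# all in network (big-endian) byte order.
_OBJID4 = struct.Struct(">16sQQ")


def encode_objid4(device_id: bytes, partition_id: int, object_id: int) -> bytes:
    """Serialize (oid_device_id, oid_partition_id, oid_object_id)."""
    if len(device_id) != NFS4_DEVICEID4_SIZE:
        raise ValueError("deviceid4 must be exactly 16 bytes")
    return _OBJID4.pack(device_id, partition_id, object_id)


def decode_objid4(buf: bytes):
    """Return (device_id, partition_id, object_id) from the first 32 bytes."""
    return _OBJID4.unpack(buf[:_OBJID4.size])


if __name__ == "__main__":
    wire = encode_objid4(bytes(range(16)), 0x10000, 0x10001)
    print(len(wire), decode_objid4(wire))
```

Because 16 is a multiple of 4, the fixed opaque needs no XDR padding, so the encoded structure is always 32 bytes long.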
   /// enum pnfs_osd_version4 {
   ///     PNFS_OSD_MISSING    = 0,
   ///     PNFS_OSD_VERSION_1  = 1,
   ///     PNFS_OSD_VERSION_2  = 2
   /// };
   ///
pnfs_osd_version4 is used to indicate the OSD protocol version or whether an object is missing (i.e., unavailable). Some of the object-based layout-supported RAID algorithms encode redundant information and can compensate for missing components, but the data placement algorithm needs to know what parts are missing.
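The following Python sketch shows one way a client could note which components are missing before running a placement or reconstruction algorithm. The enum mirrors pnfs_osd_version4; the function missing_components and its list-of-versions input are assumptions made for illustration only.

```python
from enum import IntEnum


class PnfsOsdVersion(IntEnum):
    """Mirror of the pnfs_osd_version4 enum."""
    MISSING = 0
    VERSION_1 = 1
    VERSION_2 = 2


def missing_components(versions):
    """Return the indices of components whose version is PNFS_OSD_MISSING.

    `versions` is assumed to hold one oc_osd_version value per component,
    in layout order.
    """
    return [i for i, v in enumerate(versions)
            if PnfsOsdVersion(v) is PnfsOsdVersion.MISSING]


if __name__ == "__main__":
    print(missing_components([1, 0, 2, 0]))
```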
At this time, the OSD standard is at version 1.0, and we anticipate a version 2.0 of the standard (SNIA T10/1729-D [14]). The second generation OSD protocol has additional proposed features to support more robust error recovery, snapshots, and byte-range capabilities. Therefore, the OSD version is explicitly called out in the information returned in the layout. (This information can also be deduced by looking inside the capability type at the format field, which is the first byte. The format value is 0x1 for an OSD v1 capability. However, it seems most robust to call out the version explicitly.)
   /// enum pnfs_osd_cap_key_sec4 {
   ///     PNFS_OSD_CAP_KEY_SEC_NONE = 0,
   ///     PNFS_OSD_CAP_KEY_SEC_SSV  = 1
   /// };
   ///
   /// struct pnfs_osd_object_cred4 {
   ///     pnfs_osd_objid4         oc_object_id;
   ///     pnfs_osd_version4       oc_osd_version;
   ///     pnfs_osd_cap_key_sec4   oc_cap_key_sec;
   ///     opaque                  oc_capability_key<>;
   ///     opaque                  oc_capability<>;
   /// };
   ///
The pnfs_osd_object_cred4 structure is used to identify each component comprising the file. The "oc_object_id" identifies the component object; the "oc_osd_version" represents the OSD protocol version or indicates that the component is unavailable; and the "oc_capability" and "oc_capability_key", along with the "oda_systemid" from the pnfs_osd_deviceaddr4, provide the OSD security credentials needed to access that object. The "oc_cap_key_sec" value denotes the method used to secure the oc_capability_key (see Section 13.1 for more details).
To comply with the OSD security requirements, the capability key SHOULD be transferred securely to prevent eavesdropping (see Section 13). Therefore, a client SHOULD either issue the LAYOUTGET or GETDEVICEINFO operations via RPCSEC_GSS with the privacy service or previously establish a secret state verifier (SSV) for the sessions via the NFSv4.1 SET_SSV operation. The pnfs_osd_cap_key_sec4 type is used to identify the method used by the server to secure the capability key.
o PNFS_OSD_CAP_KEY_SEC_NONE denotes that the oc_capability_key is not encrypted, in which case the client SHOULD issue the LAYOUTGET or GETDEVICEINFO operations with RPCSEC_GSS with the privacy service, or the NFSv4.1 transport should be secured by using methods that are external to NFSv4.1, such as the use of IPsec [15] for transporting the NFSv4.1 protocol.
o PNFS_OSD_CAP_KEY_SEC_SSV denotes that the oc_capability_key contents are encrypted using the SSV GSS context and the capability key as inputs to the GSS_Wrap() function (see GSS-API [7]) with the conf_req_flag set to TRUE. The client MUST use the secret SSV key as part of the client's GSS context to decrypt the capability key using the value of the oc_capability_key field as the input_message to the GSS_unwrap() function. Note that to prevent eavesdropping of the SSV key, the client SHOULD issue SET_SSV via RPCSEC_GSS with the privacy service.
The actual method chosen depends on whether the client established an SSV key with the server and whether it issued the operation with the RPCSEC_GSS privacy method. Naturally, if the client did not establish an SSV key via SET_SSV, the server MUST use the PNFS_OSD_CAP_KEY_SEC_NONE method. Otherwise, if the operation was not issued with the RPCSEC_GSS privacy method, the server SHOULD secure the oc_capability_key with the PNFS_OSD_CAP_KEY_SEC_SSV method. The server MAY use the PNFS_OSD_CAP_KEY_SEC_SSV method also when the operation was issued with the RPCSEC_GSS privacy method.
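The selection rules above can be summarized as a small decision function. The Python sketch below is one possible reading of those rules; the function name choose_cap_key_sec and the prefer_ssv_with_privacy policy flag are assumptions introduced for this example, and the SHOULD case is treated here as always followed.

```python
from enum import IntEnum


class CapKeySec(IntEnum):
    """Mirror of the pnfs_osd_cap_key_sec4 enum."""
    NONE = 0
    SSV = 1


def choose_cap_key_sec(ssv_established: bool,
                       gss_privacy: bool,
                       prefer_ssv_with_privacy: bool = False) -> CapKeySec:
    """Pick the method a server uses to secure oc_capability_key."""
    if not ssv_established:
        # No SSV key was set via SET_SSV: the server MUST use NONE.
        return CapKeySec.NONE
    if not gss_privacy:
        # SSV key exists but the request lacked RPCSEC_GSS privacy:
        # the server SHOULD wrap the key with the SSV method.
        return CapKeySec.SSV
    # Both protections are available: SSV is a MAY, left to local policy
    # (hypothetical flag, not defined by the specification).
    return CapKeySec.SSV if prefer_ssv_with_privacy else CapKeySec.NONE


if __name__ == "__main__":
    for ssv in (False, True):
        for priv in (False, True):
            print(ssv, priv, choose_cap_key_sec(ssv, priv).name)
```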
   /// enum pnfs_osd_raid_algorithm4 {
   ///     PNFS_OSD_RAID_0     = 1,
   ///     PNFS_OSD_RAID_4     = 2,
   ///     PNFS_OSD_RAID_5     = 3,
   ///     PNFS_OSD_RAID_PQ    = 4     /* Reed-Solomon P+Q */
   /// };
   ///
pnfs_osd_raid_algorithm4 represents the data redundancy algorithm used to protect the file's contents. See Section 5.4 for more details.
Data operations to an OSD require the client to know the "address" of each OSD's root object. The root object is synonymous with the Small Computer System Interface (SCSI) logical unit. The client specifies SCSI logical units to its SCSI protocol stack using a representation local to the client. Because these representations are local, GETDEVICEINFO must return information that can be used by the client to select the correct local representation.
In the block world, a set offset (logical block number or track/sector) contains a disk label. This label identifies the disk uniquely. In contrast, an OSD has a standard set of attributes on its root object. For device identification purposes, the OSD System ID (root information attribute number 3) and the OSD Name (root information attribute number 9) are used as the label. These appear in the pnfs_osd_deviceaddr4 type below under the "oda_systemid" and "oda_osdname" fields.
In some situations, SCSI target discovery may need to be driven based on information contained in the GETDEVICEINFO response. One example of this is Internet SCSI (iSCSI) targets that are not known to the client until a layout has been requested. The information provided as the "oda_targetid", "oda_targetaddr", and "oda_lun" fields in the pnfs_osd_deviceaddr4 type described below (see Section 4.2) allows the client to probe a specific device given its network address and optionally its iSCSI Name (see iSCSI [8]), or, when the device network address is omitted, allows it to discover the object storage device using the provided device name or SCSI Device Identifier (see SPC-3 [9]).
The oda_systemid is implicitly used by the client, by using the object credential signing key to sign each request with the request integrity check value. This method protects the client from unintentionally accessing a device if the device address mapping was changed (or revoked). The server computes the capability key using its own view of the systemid associated with the respective deviceid present in the credential. If the client's view of the deviceid mapping is stale, the client will use the wrong systemid (which must be system-wide unique) and the I/O request to the OSD will fail to pass the integrity check verification.
To recover from this condition, the client should report the error and return the layout using LAYOUTRETURN, and invalidate all the device address mappings associated with this layout. The client can then, if it wishes, request a new layout using LAYOUTGET and resolve the referenced deviceids using GETDEVICEINFO or GETDEVICELIST.
The server MUST provide the oda_systemid and SHOULD also provide the oda_osdname. When the OSD name is present, the client SHOULD get the root information attributes whenever it establishes communication with the OSD and verify that the OSD name it got from the OSD matches the one sent by the metadata server. To do so, the client uses the root_obj_cred credentials.
The following enum specifies the manner in which a SCSI target can be specified. The target can be specified as a SCSI Name or as a SCSI Device Identifier.
   /// enum pnfs_osd_targetid_type4 {
   ///     OBJ_TARGET_ANON             = 1,
   ///     OBJ_TARGET_SCSI_NAME        = 2,
   ///     OBJ_TARGET_SCSI_DEVICE_ID   = 3
   /// };
   ///
The specification for an object device address is as follows:
   /// union pnfs_osd_targetid4 switch (pnfs_osd_targetid_type4 oti_type) {
   ///     case OBJ_TARGET_SCSI_NAME:
   ///         string              oti_scsi_name<>;
   ///
   ///     case OBJ_TARGET_SCSI_DEVICE_ID:
   ///         opaque              oti_scsi_device_id<>;
   ///
   ///     default:
   ///         void;
   /// };
   ///
   /// union pnfs_osd_targetaddr4 switch (bool ota_available) {
   ///     case TRUE:
   ///         netaddr4            ota_netaddr;
   ///     case FALSE:
   ///         void;
   /// };
   ///
   /// struct pnfs_osd_deviceaddr4 {
   ///     pnfs_osd_targetid4      oda_targetid;
   ///     pnfs_osd_targetaddr4    oda_targetaddr;
   ///     opaque                  oda_lun[8];
   ///     opaque                  oda_systemid<>;
   ///     pnfs_osd_object_cred4   oda_root_obj_cred;
   ///     opaque                  oda_osdname<>;
   /// };
   ///
When "oda_targetid" is specified as an OBJ_TARGET_SCSI_NAME, the "oti_scsi_name" string MUST be formatted as an "iSCSI Name" as specified in iSCSI [8] and [10]. Note that the specification of the oti_scsi_name string format is outside the scope of this document. Parsing the string is based on the string prefix, e.g., "iqn.", "eui.", or "naa."; more formats MAY be specified in the future in accordance with iSCSI Names properties.
Currently, the iSCSI Name provides for naming the target device using a string formatted as an iSCSI Qualified Name (IQN) or as an Extended Unique Identifier (EUI) [11] string. Those are typically used to identify iSCSI or SCSI RDMA Protocol (SRP) [16] devices. The Network Address Authority (NAA) string format (see [10]) provides for naming the device using globally unique identifiers, as defined in Fibre Channel Framing and Signaling (FC-FS) [17]. These are typically used to identify Fibre Channel or Serial Attached SCSI (SAS) [18] devices, in particular, devices that are dual-attached both over Fibre Channel or SAS and over iSCSI.
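As an illustration of the prefix-based parsing described above, the following minimal sketch classifies an oti_scsi_name string by its prefix. The function name and the returned labels are hypothetical and not part of this specification; the match is assumed to be an exact, case-sensitive prefix comparison.

```python
def scsi_name_format(oti_scsi_name: str) -> str:
    """Classify an iSCSI Name by its string prefix (hypothetical helper)."""
    # Prefixes named in the text; further formats may be defined in the future.
    prefixes = {"iqn.": "IQN", "eui.": "EUI", "naa.": "NAA"}
    for prefix, label in prefixes.items():
        if oti_scsi_name.startswith(prefix):
            return label
    # Assumption: anything else is reported as unknown rather than rejected.
    return "UNKNOWN"
```

A client could use such a classification to pick the discovery mechanism appropriate to the naming format.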
When "oda_targetid" is specified as an OBJ_TARGET_SCSI_DEVICE_ID, the "oti_scsi_device_id" opaque field MUST be formatted as a SCSI Device Identifier as defined in SPC-3 [9] VPD Page 83h (Section 7.6.3. "Device Identification VPD Page"). If the Device Identifier is identical to the OSD System ID, as given by oda_systemid, the server SHOULD provide a zero-length oti_scsi_device_id opaque value. Note that similarly to the "oti_scsi_name", the specification of the oti_scsi_device_id opaque contents is outside the scope of this document and more formats MAY be specified in the future in accordance with SPC-3.
The OBJ_TARGET_ANON pnfs_osd_targetid_type4 MAY be used for providing no target identification. In this case, only the OSD System ID, and optionally the provided network address, are used to locate the device.
The optional "oda_targetaddr" field MAY be provided by the server as a hint to accelerate device discovery over, e.g., the iSCSI transport protocol. The network address is given with the netaddr4 type, which specifies a TCP/IP based endpoint (as specified in NFSv4.1 [6]). When given, the client SHOULD use it to probe for the SCSI device at the given network address. The client MAY still use other discovery mechanisms such as Internet Storage Name Service (iSNS) [12] to locate the device using the oda_targetid. In particular, such an external name service SHOULD be used when the devices may be attached to the network using multiple connections, and/or multiple storage fabrics (e.g., Fibre-Channel and iSCSI).
The "oda_lun" field identifies the OSD 64-bit Logical Unit Number, formatted in accordance with SAM-3 [13]. The client uses the Logical Unit Number to communicate with the specific OSD Logical Unit. Its use is defined in detail by the SCSI transport protocol, e.g., iSCSI [8].
The layout4 type is defined in the NFSv4.1 [6] as follows:
   enum layouttype4 {
       LAYOUT4_NFSV4_1_FILES   = 1,
       LAYOUT4_OSD2_OBJECTS    = 2,
       LAYOUT4_BLOCK_VOLUME    = 3
   };

   struct layout_content4 {
       layouttype4             loc_type;
       opaque                  loc_body<>;
   };

   struct layout4 {
       offset4                 lo_offset;
       length4                 lo_length;
       layoutiomode4           lo_iomode;
       layout_content4         lo_content;
   };
This document defines the structure associated with the layouttype4 value LAYOUT4_OSD2_OBJECTS. NFSv4.1 [6] specifies the loc_body structure as an XDR type "opaque". The opaque layout is uninterpreted by the generic pNFS client layers but must be interpreted by the object storage layout driver. This section defines the structure of this opaque value, pnfs_osd_layout4.
   /// struct pnfs_osd_data_map4 {
   ///     uint32_t                    odm_num_comps;
   ///     length4                     odm_stripe_unit;
   ///     uint32_t                    odm_group_width;
   ///     uint32_t                    odm_group_depth;
   ///     uint32_t                    odm_mirror_cnt;
   ///     pnfs_osd_raid_algorithm4    odm_raid_algorithm;
   /// };
   ///
The pnfs_osd_data_map4 structure parameterizes the algorithm that maps a file's contents over the component objects. Instead of limiting the system to a simple striping scheme in which loss of a single component object results in data loss, the map parameters support mirroring and more complicated schemes that protect against loss of a component object.
"odm_num_comps" is the number of component objects the file is striped over. The server MAY grow the file by adding more components to the stripe while clients hold valid layouts until the file has reached its final stripe width. The file length in this case MUST be limited to the number of bytes in a full stripe.
The "odm_stripe_unit" is the number of bytes placed on one component before advancing to the next one in the list of components. The number of bytes in a full stripe is odm_stripe_unit times the number of components. In some RAID schemes, a stripe includes redundant information (i.e., parity) that lets the system recover from loss or damage to a component object.
The "odm_group_width" and "odm_group_depth" parameters allow a nested striping pattern (see Section 5.3.2 for details). If there is no nesting, then odm_group_width and odm_group_depth MUST be zero. The size of the components array MUST be a multiple of odm_group_width.
The "odm_mirror_cnt" is used to replicate a file by replicating its component objects. If there is no mirroring, then odm_mirror_cnt MUST be 0. If odm_mirror_cnt is greater than zero, then the size of the component array MUST be a multiple of (odm_mirror_cnt+1).
See Section 5.3 for more details.
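The size constraints on the components array stated above can be sketched as a small consistency check. This is only an illustrative sketch: the function name and message strings are hypothetical, and treating nesting as "both group parameters non-zero" is an assumption drawn from the requirement that both be zero when there is no nesting.

```python
def check_data_map(num_entries, group_width, group_depth, mirror_cnt):
    """Return a list of violated constraints (empty if none were found)."""
    problems = []
    # Assumption: nesting is in use only when both group parameters are non-zero.
    if (group_width == 0) != (group_depth == 0):
        problems.append("group_width and group_depth must both be zero or both non-zero")
    # The components array must be a multiple of odm_group_width when nesting.
    if group_width > 0 and num_entries % group_width != 0:
        problems.append("components array is not a multiple of group_width")
    # With mirroring, the array must be a multiple of (odm_mirror_cnt + 1).
    if mirror_cnt > 0 and num_entries % (mirror_cnt + 1) != 0:
        problems.append("components array is not a multiple of mirror_cnt + 1")
    return problems
```

A layout driver might run a check of this kind once when a layout is received, before computing any offsets.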
   /// struct pnfs_osd_layout4 {
   ///     pnfs_osd_data_map4      olo_map;
   ///     uint32_t                olo_comps_index;
   ///     pnfs_osd_object_cred4   olo_components<>;
   /// };
   ///
The pnfs_osd_layout4 structure specifies a layout over a set of component objects. The "olo_components" field is an array of object identifiers and security credentials that grant access to each object. The organization of the data is defined by the pnfs_osd_data_map4 type that specifies how the file's data is mapped onto the component objects (i.e., the striping pattern). The data placement algorithm that maps file data onto component objects assumes that each component object occurs exactly once in the array of components. Therefore, component objects MUST appear in the olo_components array only once. The components array may represent all objects comprising the file, in which case "olo_comps_index" is set to zero and the number of entries in the olo_components array is equal to olo_map.odm_num_comps. The server MAY return fewer components than odm_num_comps, provided that the returned components are sufficient to access any byte in the layout's data range (e.g., a sub-stripe of "odm_group_width" components). In this case, olo_comps_index represents the position of the returned components array within the full array of components that comprise the file.
Note that the layout depends on the file size, which the client learns from the generic return parameters of LAYOUTGET, by doing GETATTR commands to the metadata server. The client uses the file size to decide if it should fill holes with zeros or return a short read. Striping patterns can cause cases where component objects are shorter than other components because a hole happens to correspond to the last part of the component object.
This section describes the different data mapping schemes in detail. The object layout always uses a "dense" layout as described in NFSv4.1 [6]. This means that the second stripe unit of the file starts at offset 0 of the second component, rather than at offset stripe_unit bytes. After a full stripe has been written, the next stripe unit is appended to the first component object in the list without any holes in the component objects.
The mapping from the logical offset within a file (L) to the component object C and object-specific offset O is defined by the following equations:
   L = logical offset into the file
   W = total number of components
   S = W * stripe_unit
   N = L / S
   C = (L-(N*S)) / stripe_unit
   O = (N*stripe_unit)+(L%stripe_unit)
In these equations, S is the number of bytes in a full stripe, and N is the stripe number. C is an index into the array of components, so it selects a particular object storage device. Both N and C count from zero. O is the offset within the object that corresponds to the file offset. Note that this computation does not accommodate the same object appearing in the olo_components array multiple times.
For example, consider an object striped over four devices, <D0 D1 D2 D3>. The stripe_unit is 4096 bytes. The stripe width S is thus 4 * 4096 = 16384.
   Offset 0:
     N = 0 / 16384 = 0
     C = (0-(0*16384)) / 4096 = 0 (D0)
     O = (0*4096)+(0%4096) = 0

   Offset 4096:
     N = 4096 / 16384 = 0
     C = (4096-(0*16384)) / 4096 = 1 (D1)
     O = (0*4096)+(4096%4096) = 0

   Offset 9000:
     N = 9000 / 16384 = 0
     C = (9000-(0*16384)) / 4096 = 2 (D2)
     O = (0*4096)+(9000%4096) = 808

   Offset 132000:
     N = 132000 / 16384 = 8
     C = (132000-(8*16384)) / 4096 = 0 (D0)
     O = (8*4096)+(132000%4096) = 33696
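The simple striping equations can be checked mechanically. The following is a minimal sketch, assuming integer (truncating) division throughout; the function name map_simple is hypothetical and not part of this specification.

```python
def map_simple(L, W, stripe_unit):
    """Map logical file offset L to (stripe number N, component C, offset O)."""
    S = W * stripe_unit                    # bytes in one full stripe
    N = L // S                             # stripe number, counting from zero
    C = (L - N * S) // stripe_unit         # index into the components array
    O = N * stripe_unit + L % stripe_unit  # offset within component object C
    return N, C, O
```

With four components and a stripe_unit of 4096, map_simple(9000, 4, 4096) yields (0, 2, 808), which agrees with the worked example for offset 9000.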
The odm_group_width and odm_group_depth parameters allow a nested striping pattern. odm_group_width defines the width of a data stripe, and odm_group_depth defines how many stripes are written before advancing to the next group of components in the list of component objects for the file. The math used to map from a file offset to a component object and offset within that object is shown below. The computations map from the logical offset L to the component index C and the relative offset O within that component object.
   L = logical offset into the file
   W = total number of components
   S = stripe_unit * group_depth * W
   T = stripe_unit * group_depth * group_width
   U = stripe_unit * group_width
   M = L / S
   G = (L - (M * S)) / T
   H = (L - (M * S)) % T
   N = H / U
   C = (H - (N * U)) / stripe_unit + G * group_width
   O = L % stripe_unit + N * stripe_unit + M * group_depth * stripe_unit
In these equations, S is the number of bytes striped across all component objects before the pattern repeats. T is the number of bytes striped within a group of component objects before advancing to the next group. U is the number of bytes in a stripe within a group. M is the "major" (i.e., across all components) stripe number, and N is the "minor" (i.e., across the group) stripe number. G counts the groups from the beginning of the major stripe, and H is the byte offset within the group.
For example, consider an object striped over 100 devices with a group_width of 10, a group_depth of 50, and a stripe_unit of 1 MB. In this scheme, 500 MB are written to the first 10 components, and 5000 MB are written before the pattern wraps back around to the first component in the array.
   Offset 0:
     W = 100
     S = 1 MB * 50 * 100 = 5000 MB
     T = 1 MB * 50 * 10 = 500 MB
     U = 1 MB * 10 = 10 MB
     M = 0 / 5000 MB = 0
     G = (0 - (0 * 5000 MB)) / 500 MB = 0
     H = (0 - (0 * 5000 MB)) % 500 MB = 0
     N = 0 / 10 MB = 0
     C = (0 - (0 * 10 MB)) / 1 MB + 0 * 10 = 0
     O = 0 % 1 MB + 0 * 1 MB + 0 * 50 * 1 MB = 0

   Offset 27 MB:
     M = 27 MB / 5000 MB = 0
     G = (27 MB - (0 * 5000 MB)) / 500 MB = 0
     H = (27 MB - (0 * 5000 MB)) % 500 MB = 27 MB
     N = 27 MB / 10 MB = 2
     C = (27 MB - (2 * 10 MB)) / 1 MB + 0 * 10 = 7
     O = 27 MB % 1 MB + 2 * 1 MB + 0 * 50 * 1 MB = 2 MB

   Offset 7232 MB:
     M = 7232 MB / 5000 MB = 1
     G = (7232 MB - (1 * 5000 MB)) / 500 MB = 4
     H = (7232 MB - (1 * 5000 MB)) % 500 MB = 232 MB
     N = 232 MB / 10 MB = 23
     C = (232 MB - (23 * 10 MB)) / 1 MB + 4 * 10 = 42
     O = 7232 MB % 1 MB + 23 * 1 MB + 1 * 50 * 1 MB = 73 MB
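The nested striping computation can likewise be sketched in a few lines. This is an illustrative sketch only: the function name map_nested is hypothetical, integer division is assumed, and 1 MB is taken to be 2^20 bytes (the results are the same for any consistent unit).

```python
MB = 1 << 20  # assumption: 1 MB = 2**20 bytes

def map_nested(L, W, stripe_unit, group_width, group_depth):
    """Map logical offset L to (component index C, object offset O)."""
    S = stripe_unit * group_depth * W            # bytes before the pattern repeats
    T = stripe_unit * group_depth * group_width  # bytes written to one group
    U = stripe_unit * group_width                # bytes in one stripe of a group
    M = L // S                                   # major stripe number
    G = (L - M * S) // T                         # group within the major stripe
    H = (L - M * S) % T                          # byte offset within the group
    N = H // U                                   # minor stripe number
    C = (H - N * U) // stripe_unit + G * group_width
    O = L % stripe_unit + N * stripe_unit + M * group_depth * stripe_unit
    return C, O
```

For the 100-component example, map_nested(7232 * MB, 100, MB, 10, 50) returns (42, 73 * MB), matching the worked computation for offset 7232 MB.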
The odm_mirror_cnt is used to replicate a file by replicating its component objects. If there is no mirroring, then odm_mirror_cnt MUST be 0. If odm_mirror_cnt is greater than zero, then the size of the olo_components array MUST be a multiple of (odm_mirror_cnt+1). Thus, for a classic mirror on two objects, odm_mirror_cnt is one. Note that mirroring can be defined over any RAID algorithm and striping pattern (either simple or nested). If odm_group_width is also non-zero, then the size of the olo_components array MUST be a multiple of odm_group_width * (odm_mirror_cnt+1). Replicas are adjacent in the olo_components array, and the value C produced by the above equations is not a direct index into the olo_components array. Instead, the following equations determine the replica component index RCi, where i ranges from 0 to odm_mirror_cnt.
   C = component index for striping or two-level striping
   i ranges from 0 to odm_mirror_cnt, inclusive
   RCi = C * (odm_mirror_cnt+1) + i
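As a brief illustration of the replica index equation, the sketch below lists every olo_components index that holds a copy of component C. The function name replica_indices is hypothetical.

```python
def replica_indices(C, mirror_cnt):
    """Return the olo_components indices RC0..RCn for component index C."""
    # Replicas are adjacent, so each logical component occupies
    # (mirror_cnt + 1) consecutive slots in the array.
    return [C * (mirror_cnt + 1) + i for i in range(mirror_cnt + 1)]
```

For a classic two-way mirror (odm_mirror_cnt of one), component 2 maps to array entries 4 and 5.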
pnfs_osd_raid_algorithm4 determines the algorithm and placement of redundant data. This section defines the different redundancy algorithms. Note: The term "RAID" (Redundant Array of Independent Disks) is used in this document to represent an array of component objects that store data for an individual file. The objects are stored on independent object-based storage devices. File data is encoded and striped across the array of component objects using algorithms developed for block-based RAID systems.
PNFS_OSD_RAID_0 means there is no parity data, so all bytes in the component objects are data bytes located by the above equations for C and O. If a component object is marked as PNFS_OSD_MISSING, the pNFS client MUST either return an I/O error when a read of this component is attempted or, alternatively, retry the READ against the pNFS server.
PNFS_OSD_RAID_4 means that the last component object, or the last in each group (if odm_group_width is greater than zero), contains parity information computed over the rest of the stripe with an XOR operation. If a component object is unavailable, the client can read the rest of the stripe units in the damaged stripe and recompute the missing stripe unit by XORing the other stripe units in the stripe. Alternatively, the client can replay the READ against the pNFS server, which will presumably perform the reconstructed read on the client's behalf.
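The XOR reconstruction described above can be sketched as follows. This is a minimal illustration that operates on in-memory byte strings; the function name xor_units is hypothetical, and all stripe units are assumed to have the same length.

```python
def xor_units(units):
    """XOR equal-length stripe units together and return the result."""
    length = len(units[0])
    out = bytearray(length)
    for unit in units:
        if len(unit) != length:
            raise ValueError("stripe units must have equal length")
        for k, byte in enumerate(unit):
            out[k] ^= byte
    return bytes(out)
```

The same routine serves both purposes: XORing the data units of a stripe produces the parity unit, and XORing the parity unit with the surviving data units recovers the missing one.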
When parity is present in the file, then there is an additional computation to map from the file offset L to the offset that accounts for embedded parity, L'. First compute L', and then use L' in the above equations for C and O.
   L = file offset, not accounting for parity
   P = number of parity devices in each stripe
   W = group_width, if not zero, else size of olo_components array
   N = L / ((W-P) * stripe_unit)
   L' = N * (W * stripe_unit) + (L % ((W-P) * stripe_unit))
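The parity-adjusted offset can be sketched as below. The sketch reads the divisor as (W-P) * stripe_unit, that is, the number of data bytes in one stripe, which is an interpretation of the equations above; the function name is hypothetical.

```python
def parity_adjusted_offset(L, W, P, stripe_unit):
    """Map file offset L to L', the offset that accounts for embedded parity."""
    data_bytes = (W - P) * stripe_unit  # data bytes per stripe (assumed grouping)
    N = L // data_bytes                 # stripe number
    return N * (W * stripe_unit) + L % data_bytes
```

With W = 4, P = 1, and a stripe_unit of 4096, the first byte of the second stripe (L = 12288) maps to L' = 16384, skipping the 4096 bytes of parity in the first stripe.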
PNFS_OSD_RAID_5 means that the position of the parity data is rotated on each stripe or each group (if odm_group_width is greater than zero). In the first stripe, the last component holds the parity. In the second stripe, the next-to-last component holds the parity, and so on. In this scheme, all stripe units are rotated so that I/O is evenly spread across objects as the file is read sequentially. The rotated parity layout is illustrated here, with numbers indicating the stripe unit.
   0 1 2 P
   4 5 P 3
   8 P 6 7
   P 9 a b
To compute the component object C, first compute the offset that accounts for parity L' and use that to compute C. Then rotate C to get C'. Finally, increase C' by one if the parity information comes at or before C' within that stripe. The following equations illustrate this by computing I, which is the index of the component that contains parity for a given stripe.
   L = file offset, not accounting for parity
   W = odm_group_width, if not zero, else size of olo_components array
   N = L / ((W-1) * stripe_unit)
   (Compute L' as described above)
   (Compute C based on L' as described above)
   C' = (C - (N%W)) % W
   I = W - (N%W) - 1
   if (C' <= I) {
     C'++
   }
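As a small check of the parity index alone, the sketch below evaluates I for successive stripes. It covers only the position of the parity unit, not the rotation of the data units; the function name parity_index is hypothetical.

```python
def parity_index(N, W):
    """Return I, the index of the component holding parity for stripe N."""
    return W - (N % W) - 1
```

For W = 4, stripes 0 through 3 give parity indices 3, 2, 1, and 0, which matches the position of P in each row of the illustrated layout.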
PNFS_OSD_RAID_PQ is a double-parity scheme that uses the Reed-Solomon P+Q encoding scheme [19]. In this layout, the last two component objects hold the P and Q data, respectively. P is parity computed with XOR, and Q is a more complex equation that is not described here. The equations given above for embedded parity can be used to map a file offset to the correct component object by setting the number of parity components to 2 instead of 1 for RAID4 or RAID5. Clients may simply choose to read data through the metadata server if two components are missing or damaged.
RAID layouts with redundant data in their stripes require additional serialization of updates to ensure correct operation. Otherwise, if two clients simultaneously write to the same logical range of an object, the result could include different data in the same ranges of mirrored tuples, or corrupt parity information. It is the responsibility of the metadata server to enforce serialization requirements such as this. For example, the metadata server may do so by not granting overlapping write layouts within mirrored objects.
layoutupdate4 is used in the LAYOUTCOMMIT operation to convey updates to the layout and additional information to the metadata server. It is defined in the NFSv4.1 [6] as follows:
   struct layoutupdate4 {
       layouttype4             lou_type;
       opaque                  lou_body<>;
   };
The layoutupdate4 type is an opaque value at the generic pNFS client level. If the lou_type layout type is LAYOUT4_OSD2_OBJECTS, then the lou_body opaque value is defined by the pnfs_osd_layoutupdate4 type.
Object-Based pNFS clients are not allowed to modify the layout. Therefore, the information passed in pnfs_osd_layoutupdate4 is used only to update the file's attributes. In addition to the generic information the client can pass to the metadata server in LAYOUTCOMMIT such as the highest offset the client wrote to and the last time it modified the file, the client MAY use pnfs_osd_layoutupdate4 to convey the capacity consumed (or released) by writes using the layout, and to indicate that I/O errors were encountered by such writes.
   /// union pnfs_osd_deltaspaceused4 switch (bool dsu_valid) {
   ///     case TRUE:
   ///         int64_t     dsu_delta;
   ///     case FALSE:
   ///         void;
   /// };
   ///
pnfs_osd_deltaspaceused4 is used to convey space utilization information at the time of LAYOUTCOMMIT. For the file system to properly maintain capacity-used information, it needs to track how much capacity was consumed by WRITE operations performed by the client. In this protocol, the OSD returns the capacity consumed by a write (*), which can be different than the number of bytes written because of internal overhead like block-level allocation and indirect blocks, and the client reflects this back to the pNFS server so it can accurately track quota. The pNFS server can choose to trust this information coming from the clients and therefore avoid querying the OSDs at the time of LAYOUTCOMMIT. If the client is unable to obtain this information from the OSD, it simply returns an invalid olu_delta_space_used (i.e., with dsu_valid set to FALSE).
   /// struct pnfs_osd_layoutupdate4 {
   ///     pnfs_osd_deltaspaceused4    olu_delta_space_used;
   ///     bool                        olu_ioerr_flag;
   /// };
   ///
"olu_delta_space_used" is used to convey capacity usage information back to the metadata server.
The "olu_ioerr_flag" is used when I/O errors were encountered while writing the file. The client MUST report the errors using the pnfs_osd_ioerr4 structure (see Section 8.1) at LAYOUTRETURN time.
If the client updated the file successfully before hitting the I/O errors, it MAY use LAYOUTCOMMIT to update the metadata server as described above. Typically, in the error-free case, the server MAY turn around and update the file's attributes on the storage devices. However, if I/O errors were encountered, the server should not attempt to write the new attributes on the storage devices until it receives the I/O error report; therefore, the client MUST set the olu_ioerr_flag to true. Note that in this case, the client SHOULD send both the LAYOUTCOMMIT and LAYOUTRETURN operations in the same COMPOUND RPC.
The pNFS client may encounter errors when directly accessing the object storage devices. However, it is the responsibility of the metadata server to handle the I/O errors. When the LAYOUT4_OSD2_OBJECTS layout type is used, the client MUST report the I/O errors to the server at LAYOUTRETURN time using the pnfs_osd_ioerr4 structure (see Section 8.1).
The metadata server analyzes the error and determines the required recovery operations such as repairing any parity inconsistencies, recovering from media failures, or reconstructing missing objects.
The metadata server SHOULD recall any outstanding layouts to allow it exclusive write access to the stripes being recovered and to prevent other clients from hitting the same error condition. In these cases, the server MUST complete recovery before handing out any new layouts to the affected byte ranges.
Although it MAY be acceptable for the client to propagate a corresponding error to the application that initiated the I/O operation and drop any unwritten data, the client SHOULD attempt to retry the original I/O operation by requesting a new layout using LAYOUTGET and retry the I/O operation(s) using the new layout, or the client MAY just retry the I/O operation(s) using regular NFS READ or WRITE operations via the metadata server. The client SHOULD attempt to retrieve a new layout and retry the I/O operation using OSD commands first and, only if the error persists, retry the I/O operation via the metadata server.
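The recommended retry order can be sketched as follows. This is only an illustration of the control flow: the three callables are hypothetical stand-ins for a LAYOUTGET request, an I/O operation issued with OSD commands, and a regular NFS READ or WRITE via the metadata server, and an I/O failure is assumed to surface as an OSError.

```python
def retry_after_io_error(get_new_layout, do_osd_io, do_mds_io):
    """Retry a failed I/O: first over a fresh layout, then via the metadata server."""
    layout = get_new_layout()      # hypothetical: request a new layout (LAYOUTGET)
    try:
        return do_osd_io(layout)   # hypothetical: retry using OSD commands
    except OSError:
        # The error persisted, so fall back to the metadata server.
        return do_mds_io()         # hypothetical: regular NFS READ or WRITE
```

Reporting the original error at LAYOUTRETURN time is a separate step and is not shown in this sketch.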
layoutreturn_file4 is used in the LAYOUTRETURN operation to convey layout-type specific information to the server. It is defined in the NFSv4.1 [6] as follows:
   struct layoutreturn_file4 {
       offset4         lrf_offset;
       length4         lrf_length;
       stateid4        lrf_stateid;
       /* layouttype4 specific data */
       opaque          lrf_body<>;
   };

   union layoutreturn4 switch(layoutreturn_type4 lr_returntype) {
       case LAYOUTRETURN4_FILE:
           layoutreturn_file4  lr_layout;
       default:
           void;
   };

   struct LAYOUTRETURN4args {
       /* CURRENT_FH: file */
       bool                    lora_reclaim;
       layoutreturn_stateid    lora_recallstateid;
       layouttype4             lora_layout_type;
       layoutiomode4           lora_iomode;
       layoutreturn4           lora_layoutreturn;
   };
If the lora_layout_type layout type is LAYOUT4_OSD2_OBJECTS, then the lrf_body opaque value is defined by the pnfs_osd_layoutreturn4 type.
The pnfs_osd_layoutreturn4 type allows the client to report I/O error information back to the metadata server as defined below.
   /// enum pnfs_osd_errno4 {
   ///     PNFS_OSD_ERR_EIO            = 1,
   ///     PNFS_OSD_ERR_NOT_FOUND      = 2,
   ///     PNFS_OSD_ERR_NO_SPACE       = 3,
   ///     PNFS_OSD_ERR_BAD_CRED       = 4,
   ///     PNFS_OSD_ERR_NO_ACCESS      = 5,
   ///     PNFS_OSD_ERR_UNREACHABLE    = 6,
   ///     PNFS_OSD_ERR_RESOURCE       = 7
   /// };
   ///
pnfs_osd_errno4 is used to represent error types when read/write errors are reported to the metadata server. The error codes serve as hints to the metadata server that may help it in diagnosing the exact reason for the error and in repairing it.
o PNFS_OSD_ERR_EIO indicates the operation failed because the object storage device experienced a failure trying to access the object. The most common source of these errors is media errors, but other internal errors might cause this as well. In this case, the metadata server should examine the broken object more closely; hence, PNFS_OSD_ERR_EIO should be used as the default error code.
o PNFS_OSD_ERR_NOT_FOUND indicates the object ID specifies an object that does not exist on the object storage device.
o PNFS_OSD_ERR_NO_SPACE indicates the operation failed because the object storage device ran out of free capacity during the operation.
o PNFS_OSD_ERR_BAD_CRED indicates the security parameters are not valid. The primary cause of this is that the capability has expired, or the access policy tag (a.k.a., capability version number) has been changed to revoke capabilities. The client will need to return the layout and get a new one with fresh capabilities.
o PNFS_OSD_ERR_NO_ACCESS indicates the capability does not allow the requested operation. This should not occur in normal operation because the metadata server should give out correct capabilities, or none at all.
o PNFS_OSD_ERR_UNREACHABLE indicates the client did not complete the I/O operation at the object storage device due to a communication failure. Whether or not the I/O operation was executed by the OSD is undetermined.
o PNFS_OSD_ERR_RESOURCE indicates the client did not issue the I/O operation due to a local problem on the initiator (i.e., client) side, e.g., when running out of memory. The client MUST guarantee that the OSD command was never dispatched to the OSD.
   /// struct pnfs_osd_ioerr4 {
   ///     pnfs_osd_objid4     oer_component;
   ///     length4             oer_comp_offset;
   ///     length4             oer_comp_length;
   ///     bool                oer_iswrite;
   ///     pnfs_osd_errno4     oer_errno;
   /// };
   ///
The pnfs_osd_ioerr4 structure is used to return error indications for objects that generated errors during data transfers. These are hints to the metadata server that there are problems with that object. For each error, "oer_component", "oer_comp_offset", and "oer_comp_length" represent the object and byte range within the component object in which the error occurred; "oer_iswrite" is set to "true" if the failed OSD operation was data modifying, and "oer_errno" represents the type of error.
Component byte ranges in the optional pnfs_osd_ioerr4 structure are used for recovering the object and MUST be set by the client to cover all failed I/O operations to the component.
   /// struct pnfs_osd_layoutreturn4 {
   ///     pnfs_osd_ioerr4             olr_ioerr_report<>;
   /// };
   ///
When OSD I/O operations fail, "olr_ioerr_report<>" is used to report these errors to the metadata server as an array of elements of type pnfs_osd_ioerr4. Each element in the array represents an error that occurred on the object specified by oer_component. If no errors are to be reported, the size of the olr_ioerr_report<> array is set to zero.
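The requirement that reported byte ranges cover all failed I/O operations to a component can be illustrated with a short sketch. It is not part of the protocol: the FailedIo record, the build_ioerr_report helper, and the policy of emitting one entry per (component, oer_iswrite, oer_errno) combination are assumptions made for illustration only.

```python
from collections import namedtuple

# Hypothetical client-side record of one failed OSD I/O. The fields mirror
# pnfs_osd_ioerr4, but the type itself is illustrative.
FailedIo = namedtuple("FailedIo", "component offset length iswrite errno")

PNFS_OSD_ERR_EIO = 1  # value taken from pnfs_osd_errno4


def build_ioerr_report(failed_ios):
    """Collapse failed I/Os into one entry per (component, iswrite, errno).

    Each entry spans from the lowest failed offset to the highest failed
    end offset in its group, so it covers every failed I/O in that group.
    """
    spans = {}
    for io in failed_ios:
        key = (io.component, io.iswrite, io.errno)
        start, end = io.offset, io.offset + io.length
        if key in spans:
            old_start, old_end = spans[key]
            start, end = min(start, old_start), max(end, old_end)
        spans[key] = (start, end)
    return [
        FailedIo(component, start, end - start, iswrite, errno)
        for (component, iswrite, errno), (start, end) in spans.items()
    ]
```

A covering range may include bytes that did not fail. That is consistent with the entries being hints for recovery, although a real client might prefer to report disjoint ranges separately.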
The layouthint4 type is defined in NFSv4.1 [6] as follows:
   struct layouthint4 {
           layouttype4             loh_type;
           opaque                  loh_body<>;
   };
The layouthint4 structure is used by the client to pass a hint about the type of layout it would like created for a particular file. If the loh_type layout type is LAYOUT4_OSD2_OBJECTS, then the loh_body opaque value is defined by the pnfs_osd_layouthint4 type.
   /// union pnfs_osd_max_comps_hint4 switch (bool omx_valid) {
   ///     case TRUE:
   ///         uint32_t            omx_max_comps;
   ///     case FALSE:
   ///         void;
   /// };
   ///
   /// union pnfs_osd_stripe_unit_hint4 switch (bool osu_valid) {
   ///     case TRUE:
   ///         length4             osu_stripe_unit;
   ///     case FALSE:
   ///         void;
   /// };
   ///
   /// union pnfs_osd_group_width_hint4 switch (bool ogw_valid) {
   ///     case TRUE:
   ///         uint32_t            ogw_group_width;
   ///     case FALSE:
   ///         void;
   /// };
   ///
   /// union pnfs_osd_group_depth_hint4 switch (bool ogd_valid) {
   ///     case TRUE:
   ///         uint32_t            ogd_group_depth;
   ///     case FALSE:
   ///         void;
   /// };
   ///
   /// union pnfs_osd_mirror_cnt_hint4 switch (bool omc_valid) {
   ///     case TRUE:
   ///         uint32_t            omc_mirror_cnt;
   ///     case FALSE:
   ///         void;
   /// };
   ///
   /// union pnfs_osd_raid_algorithm_hint4 switch (bool ora_valid) {
   ///     case TRUE:
   ///         pnfs_osd_raid_algorithm4    ora_raid_algorithm;
   ///     case FALSE:
   ///         void;
   /// };
   ///
   /// struct pnfs_osd_layouthint4 {
   ///     pnfs_osd_max_comps_hint4        olh_max_comps_hint;
   ///     pnfs_osd_stripe_unit_hint4      olh_stripe_unit_hint;
   ///     pnfs_osd_group_width_hint4      olh_group_width_hint;
   ///     pnfs_osd_group_depth_hint4      olh_group_depth_hint;
   ///     pnfs_osd_mirror_cnt_hint4       olh_mirror_cnt_hint;
   ///     pnfs_osd_raid_algorithm_hint4   olh_raid_algorithm_hint;
   /// };
   ///
This type conveys hints for the desired data map. All parameters are optional so the client can give values for only the parameters it cares about, e.g., it can provide a hint for the desired number of mirrored components, regardless of the RAID algorithm selected for the file. The server should attempt to honor the hints, but it can ignore any or all of them at its own discretion and without failing the respective CREATE operation.
The "olh_max_comps_hint" can be used to limit the total number of component objects comprising the file. All other hints correspond directly to the different fields of pnfs_osd_data_map4.
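As an illustration of how the optional hints might appear on the wire, the following sketch encodes a pnfs_osd_layouthint4 body using the standard XDR rules for bool-discriminated unions (a 4-byte discriminant, followed by the value only when the discriminant is TRUE). The function names are hypothetical. The sketch assumes length4 is a 64-bit unsigned integer, as in NFSv4.1, and that the RAID algorithm enum value is non-negative.

```python
import struct


def _opt(fmt, value):
    """Encode an XDR bool-discriminated union: TRUE + value, or FALSE alone."""
    if value is None:
        return struct.pack(">I", 0)           # FALSE arm: void
    return struct.pack(">I" + fmt, 1, value)  # TRUE arm: value follows


def encode_layouthint(max_comps=None, stripe_unit=None, group_width=None,
                      group_depth=None, mirror_cnt=None, raid_algorithm=None):
    """Encode the six hints in pnfs_osd_layouthint4 field order."""
    return b"".join([
        _opt("I", max_comps),       # uint32_t omx_max_comps
        _opt("Q", stripe_unit),     # length4 osu_stripe_unit (64-bit)
        _opt("I", group_width),     # uint32_t ogw_group_width
        _opt("I", group_depth),     # uint32_t ogd_group_depth
        _opt("I", mirror_cnt),      # uint32_t omc_mirror_cnt
        _opt("I", raid_algorithm),  # enum, encoded as a 32-bit integer
    ])
```

For example, encode_layouthint(mirror_cnt=2) marks only the mirror-count hint as valid and leaves the remaining five unions on their FALSE arm.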
The pNFS layout operations operate on logical byte ranges. There is no requirement in the protocol for any relationship between byte ranges used in LAYOUTGET to acquire layouts and byte ranges used in CB_LAYOUTRECALL, LAYOUTCOMMIT, or LAYOUTRETURN. However, using OSD byte-range capabilities poses limitations on these operations since the capabilities associated with layout segments cannot be merged or split. The following guidelines should be followed for proper operation of object-based layouts.
In general, the object-based layout driver should keep track of each layout segment it has obtained, recording the segment's iomode, offset, and length. The server should allow the client to get multiple overlapping layout segments but is free to recall the layout to prevent overlap.
In response to CB_LAYOUTRECALL, the client should return all layout segments matching the given iomode and overlapping with the recalled range. When returning the layouts for this byte range with LAYOUTRETURN, the client MUST NOT return a sub-range of a layout segment it has; each LAYOUTRETURN sent MUST completely cover at least one outstanding layout segment.
The server, in turn, should release any segment that exactly matches the clientid, iomode, and byte range given in LAYOUTRETURN. If no exact match is found, then the server should release all layout segments that match the clientid and iomode and are fully contained in the returned byte range. If none are found and the byte range is a subset of an outstanding layout segment for the same clientid and iomode, then the client can be considered malfunctioning and the server SHOULD recall all layouts from this client to reset its state. If this behavior repeats, the server SHOULD deny all LAYOUTGETs from this client.
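The server-side matching rules above can be sketched as follows. This is a simplified illustration, not a reference implementation: the Segment type and handle_layoutreturn function are hypothetical, and the special "to end of file" length value is not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    clientid: int
    iomode: int
    offset: int
    length: int

    @property
    def end(self):
        return self.offset + self.length


def handle_layoutreturn(outstanding, ret):
    """Return (segments_to_release, client_misbehaving) for a LAYOUTRETURN."""
    same = [s for s in outstanding
            if s.clientid == ret.clientid and s.iomode == ret.iomode]

    # 1. Prefer a segment that matches the returned byte range exactly.
    exact = [s for s in same
             if s.offset == ret.offset and s.length == ret.length]
    if exact:
        return exact, False

    # 2. Otherwise release every segment fully contained in the range.
    contained = [s for s in same
                 if ret.offset <= s.offset and s.end <= ret.end]
    if contained:
        return contained, False

    # 3. A sub-range of a held segment indicates a malfunctioning client.
    partial = any(s.offset <= ret.offset and ret.end <= s.end for s in same)
    return [], partial
```

When the second element of the result is true, the server would go on to recall all layouts from that client, as described above.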
LAYOUTCOMMIT is only used by object-based pNFS to convey modified attribute hints and/or to report the presence of I/O errors to the metadata server (MDS). Therefore, the offset and length in LAYOUTCOMMIT4args are reserved for future use and should be set to 0.
The object-based metadata server should recall outstanding layouts in the following cases:
o When the file's security policy changes, i.e., Access Control Lists (ACLs) or permission mode bits are set.
o When the file's aggregation map changes, rendering outstanding layouts invalid.
o When there are sharing conflicts. For example, the server will issue stripe-aligned layout segments for RAID-5 objects. To prevent corruption of the file's parity, multiple clients must not hold valid write layouts for the same stripes. An outstanding READ/WRITE (RW) layout should be recalled when a conflicting LAYOUTGET is received from a different client for LAYOUTIOMODE4_RW and for a byte range overlapping with the outstanding layout segment.
The metadata server can use the CB_RECALL_ANY callback operation to notify the client to return some or all of its layouts. NFSv4.1 [6] defines the following types:
   const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN     = 8;
   const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX     = 9;

   struct CB_RECALL_ANY4args {
       uint32_t        craa_objects_to_keep;
       bitmap4         craa_type_mask;
   };
Typically, CB_RECALL_ANY will be used to recall client state when the server needs to reclaim resources. The craa_type_mask bitmap specifies the type of resources that are recalled and the craa_objects_to_keep value specifies how many of the recalled objects the client is allowed to keep. The object-based layout type mask flags are defined as follows. They represent the iomode of the recalled layouts. In response, the client SHOULD return layouts of the recalled iomode that it needs the least, keeping at most craa_objects_to_keep object-based layouts.
   /// enum pnfs_osd_cb_recall_any_mask {
   ///     PNFS_OSD_RCA4_TYPE_MASK_READ = 8,
   ///     PNFS_OSD_RCA4_TYPE_MASK_RW   = 9
   /// };
   ///
The PNFS_OSD_RCA4_TYPE_MASK_READ flag notifies the client to return layouts of iomode LAYOUTIOMODE4_READ. Similarly, the PNFS_OSD_RCA4_TYPE_MASK_RW flag notifies the client to return layouts of iomode LAYOUTIOMODE4_RW. When both mask flags are set, the client is notified to return layouts of either iomode.
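A client's handling of these mask flags might look like the sketch below. It assumes the NFSv4.1 bitmap4 convention (an array of 32-bit words in which bit n lives in word n / 32) and the NFSv4.1 iomode values 1 (READ) and 2 (RW). Using a last-used timestamp as a proxy for "needs the least" is purely an illustrative policy, and both helper functions are hypothetical.

```python
PNFS_OSD_RCA4_TYPE_MASK_READ = 8
PNFS_OSD_RCA4_TYPE_MASK_RW = 9
LAYOUTIOMODE4_READ = 1
LAYOUTIOMODE4_RW = 2


def _bit_is_set(bitmap, bit):
    """Test bit `bit` in a bitmap4 given as a list of 32-bit words."""
    word, pos = divmod(bit, 32)
    return word < len(bitmap) and bool(bitmap[word] & (1 << pos))


def recalled_iomodes(craa_type_mask):
    """Map the object-layout mask flags to the iomodes being recalled."""
    modes = set()
    if _bit_is_set(craa_type_mask, PNFS_OSD_RCA4_TYPE_MASK_READ):
        modes.add(LAYOUTIOMODE4_READ)
    if _bit_is_set(craa_type_mask, PNFS_OSD_RCA4_TYPE_MASK_RW):
        modes.add(LAYOUTIOMODE4_RW)
    return modes


def layouts_to_return(layouts, craa_type_mask, craa_objects_to_keep):
    """Pick layouts to return, least recently used first.

    `layouts` is a list of (name, iomode, last_used) tuples. At most
    craa_objects_to_keep layouts of the recalled iomodes are kept.
    """
    modes = recalled_iomodes(craa_type_mask)
    candidates = sorted((l for l in layouts if l[1] in modes),
                        key=lambda l: l[2])
    excess = max(0, len(candidates) - craa_objects_to_keep)
    return candidates[:excess]
```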
In cases where clients are uncommunicative and their lease has expired, or when clients fail to return recalled layouts within at least one lease period (see "Recalling a Layout" [6]), the server MAY revoke client layouts and/or device address mappings and reassign these resources to other clients. To avoid data corruption, the metadata server MUST fence off the revoked clients from the respective objects as described in Section 13.4.
The pNFS extension partitions the NFSv4 file system protocol into two parts, the control path and the data path (storage protocol). The control path contains all the new operations described by this extension; all existing NFSv4 security mechanisms and features apply to the control path. The combination of components in a pNFS system is required to preserve the security properties of NFSv4 with respect to an entity accessing data via a client, including security countermeasures to defend against threats that NFSv4 provides defenses for in environments where these threats are considered significant.
The metadata server enforces the file access-control policy at LAYOUTGET time. The client should use suitable authorization credentials for getting the layout for the requested iomode (READ or RW), and the server verifies the permissions and ACL for these credentials, possibly returning NFS4ERR_ACCESS if the client is not allowed the requested iomode. If the LAYOUTGET operation succeeds, the client receives, as part of the layout, a set of object capabilities allowing it I/O access to the specified objects corresponding to the requested iomode. When the client acts on I/O operations on behalf of its local users, it MUST authenticate and authorize the user by issuing respective OPEN and ACCESS calls to the metadata server, similar to having NFSv4 data delegations. If access is allowed, the client uses the corresponding (READ or RW) capabilities to perform the I/O operations at the object storage devices. When the metadata server receives a request to change a file's permissions or ACL, it SHOULD recall all layouts for that file and it MUST change the capability version attribute on all objects comprising the file to implicitly invalidate any outstanding capabilities before committing to the new permissions and ACL. Doing this will ensure that clients re-authorize their layouts according to the modified permissions and ACL by requesting new layouts. Recalling the layouts in this case is a courtesy of the server, intended to prevent clients from getting an error on I/Os done after the capability version changed.
The object storage protocol MUST implement the security aspects described in version 1 of the T10 OSD protocol definition [1]. The standard defines four security methods: NOSEC, CAPKEY, CMDRSP, and ALLDATA. To provide a minimum level of security allowing verification and enforcement of the server access control policy using the layout security credentials, the NOSEC security method MUST NOT be used for any I/O operation. The remainder of this section gives an overview of the security mechanism described in that standard. The goal is to give the reader a basic understanding of the object security model. Any discrepancies between this text and the actual standard are to be resolved in favor of the OSD standard.
There are three main data types associated with object security: a capability, a credential, and security parameters. The capability is a set of fields that specifies an object and what operations can be performed on it. A credential is a signed capability. Only a security manager that knows the secret device keys can correctly sign a capability to form a valid credential. In pNFS, the file server acts as the security manager and returns signed capabilities (i.e., credentials) to the pNFS client. The security parameters are values computed by the issuer of OSD commands (i.e., the client) that prove they hold valid credentials. The client uses the credential as a signing key to sign the requests it makes to OSD, and puts the resulting signatures into the security_parameters field of the OSD command. The object storage device uses the secret keys it shares with the security manager to validate the signature values in the security parameters.
The security types are opaque to the generic layers of the pNFS client. The credential contents are defined as opaque within the pnfs_osd_object_cred4 type. Instead of repeating the definitions here, the reader is referred to Section 4.9.2.2 of the OSD standard.
The object storage protocol relies on a cryptographically secure capability to control accesses at the object storage devices. Capabilities are generated by the metadata server, returned to the client, and used by the client as described below to authenticate their requests to the object-based storage device. Capabilities therefore achieve the required access and open mode checking. They allow the file server to define and check a policy (e.g., open mode) and the OSD to enforce that policy without knowing the details (e.g., user IDs and ACLs).
Since capabilities are tied to layouts, and since they are used to enforce access control, when the file ACL or mode changes, the outstanding capabilities MUST be revoked to enforce the new access permissions. The server SHOULD recall layouts to allow clients to gracefully return their capabilities before the access permissions change.
Each capability is specific to a particular object, an operation on that object, a byte range within the object (in OSDv2), and has an explicit expiration time. The capabilities are signed with a secret key that is shared by the object storage devices and the metadata managers. Clients do not have device keys so they are unable to forge the signatures in the security parameters. The combination of a capability, the OSD System ID, and a signature is called a "credential" in the OSD specification.
The details of the security and privacy model for object storage are defined in the T10 OSD standard. The following sketch of the algorithm should help the reader understand the basic model.
LAYOUTGET returns a CapKey and a Cap, which, together with the OSD System ID, are also called a credential. It is a capability and a signature over that capability and the SystemID. The OSD Standard refers to the CapKey as the "Credential integrity check value" and to the ReqMAC as the "Request integrity check value".
   CapKey = MAC<SecretKey>(Cap, SystemID)
   Credential = {Cap, SystemID, CapKey}
The client uses CapKey to sign all the requests it issues for that object using the respective Cap. In other words, the Cap appears in the request to the storage device, and that request is signed with the CapKey as follows:
   ReqMAC = MAC<CapKey>(Req, ReqNonce)
   Request = {Cap, Req, ReqNonce, ReqMAC}
The following is sent to the OSD: {Cap, Req, ReqNonce, ReqMAC}. The OSD uses the SecretKey it shares with the metadata server to compare the ReqMAC the client sent with a locally computed value:
   LocalCapKey = MAC<SecretKey>(Cap, SystemID)
   LocalReqMAC = MAC<LocalCapKey>(Req, ReqNonce)
and if they match, the OSD assumes that the capabilities came from an authentic metadata server and allows access to the object, as allowed by the Cap.
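The three MAC computations above can be traced end to end with a small sketch. It is only a model of the data flow: HMAC-SHA1 is used as the MAC, the fields are length-prefixed before hashing, and every function name is an assumption for illustration. The actual field layout and integrity check computation are defined by the OSD standard, not by this code.

```python
import hashlib
import hmac


def mac(key, *fields):
    """MAC<key>(fields...), modeled as HMAC-SHA1 over length-prefixed fields."""
    h = hmac.new(key, digestmod=hashlib.sha1)
    for field in fields:
        h.update(len(field).to_bytes(4, "big"))  # illustrative framing only
        h.update(field)
    return h.digest()


def issue_credential(secret_key, cap, system_id):
    """Metadata server: CapKey = MAC<SecretKey>(Cap, SystemID)."""
    return {"cap": cap, "system_id": system_id,
            "cap_key": mac(secret_key, cap, system_id)}


def sign_request(cred, req, nonce):
    """Client: ReqMAC = MAC<CapKey>(Req, ReqNonce). CapKey is never sent."""
    return {"cap": cred["cap"], "req": req, "nonce": nonce,
            "req_mac": mac(cred["cap_key"], req, nonce)}


def osd_verify(secret_key, system_id, request):
    """OSD: recompute CapKey and ReqMAC locally and compare."""
    local_cap_key = mac(secret_key, request["cap"], system_id)
    local_req_mac = mac(local_cap_key, request["req"], request["nonce"])
    return hmac.compare_digest(local_req_mac, request["req_mac"])
```

Because the OSD derives LocalCapKey from the Cap carried in the request, a client that alters the Cap without knowing SecretKey produces a ReqMAC that no longer verifies.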
Note that if the server's LAYOUTGET reply, holding CapKey and Cap, is snooped by another client, it can be used to generate valid OSD requests (within the Cap access restrictions).
To meet the privacy requirements for the capability key returned by LAYOUTGET, the GSS-API [7] framework can be used, e.g., by using the RPCSEC_GSS privacy method to send the LAYOUTGET operation or by using the SSV key to encrypt the oc_capability_key using the GSS_Wrap() function. In the absence of GSS-API, two general ways to provide privacy that are independent of NFSv4 are an isolated network, such as a VLAN, or a secure channel provided by IPsec [15].
At any time, the metadata server may invalidate all outstanding capabilities on an object by changing its POLICY ACCESS TAG attribute. The value of the POLICY ACCESS TAG is part of a capability, and it must match the state of the object attribute. If they do not match, the OSD rejects accesses to the object with the sense key set to ILLEGAL REQUEST and an additional sense code set to INVALID FIELD IN CDB. When a client attempts to use a capability and is rejected this way, it should issue a LAYOUTCOMMIT for the object and specify PNFS_OSD_ERR_BAD_CRED in the olr_ioerr_report parameter. The client may elect to issue a compound LAYOUTRETURN/LAYOUTGET (or LAYOUTCOMMIT/LAYOUTRETURN/LAYOUTGET) to attempt to fetch a refreshed set of capabilities.
The metadata server may elect to change the access policy tag on an object at any time, for any reason (with the understanding that there is likely an associated performance penalty, especially if there are outstanding layouts for this object). The metadata server MUST revoke outstanding capabilities when any one of the following occurs:
o the permissions on the object change,
o a conflicting mandatory byte-range lock is granted, or
o a layout is revoked and reassigned to another client.
A pNFS client will typically hold one layout for each byte range for either READ or READ/WRITE. The client's credentials are checked by the metadata server at LAYOUTGET time and it is the client's responsibility to enforce access control among multiple users accessing the same file. It is neither required nor expected that the pNFS client will obtain a separate layout for each user accessing a shared object. The client SHOULD use OPEN and ACCESS calls to check user permissions when performing I/O so that the server's access control policies are correctly enforced. The result of the ACCESS operation may be cached while the client holds a valid layout as the server is expected to recall layouts when the file's access permissions or ACL change.
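One way a client might cache ACCESS results for the lifetime of a layout is sketched below. The AccessCache class, its method names, and the access_rpc callable are hypothetical. A real ACCESS reply carries a bitmask of permitted operations rather than the single boolean used here for brevity.

```python
class AccessCache:
    """Per-(file, user) cache of ACCESS results, valid while a layout is held."""

    def __init__(self, access_rpc):
        # access_rpc is a hypothetical callable (filehandle, uid) -> bool
        # that performs the ACCESS call against the metadata server.
        self._access_rpc = access_rpc
        self._cache = {}

    def allowed(self, filehandle, uid):
        """Return the cached result, asking the metadata server on a miss."""
        key = (filehandle, uid)
        if key not in self._cache:
            self._cache[key] = self._access_rpc(filehandle, uid)
        return self._cache[key]

    def on_layout_lost(self, filehandle):
        """Drop cached results when the layout is recalled or returned."""
        for key in [k for k in self._cache if k[0] == filehandle]:
            del self._cache[key]
```

Calling on_layout_lost from the CB_LAYOUTRECALL and LAYOUTRETURN paths forces a fresh ACCESS check after any permission or ACL change that triggered the recall.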
As described in NFSv4.1 [6], new layout type numbers have been assigned by IANA. This document defines the protocol associated with the existing layout type number, LAYOUT4_OSD2_OBJECTS, and it requires no further actions for IANA.
[1] Weber, R., "Information Technology - SCSI Object-Based Storage Device Commands (OSD)", ANSI INCITS 400-2004, December 2004.
[2] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[3] Eisler, M., "XDR: External Data Representation Standard", STD 67, RFC 4506, May 2006.
[4] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network File System (NFS) Version 4 Minor Version 1 External Data Representation Standard (XDR) Description", RFC 5662, January 2010.
[5] IETF Trust, "Legal Provisions Relating to IETF Documents", November 2008, <http://trustee.ietf.org/docs/IETF-Trust-License-Policy.pdf>.
[6] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network File System (NFS) Version 4 Minor Version 1 Protocol", RFC 5661, January 2010.
[7] Linn, J., "Generic Security Service Application Program Interface Version 2, Update 1", RFC 2743, January 2000.
[8] Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M., and E. Zeidner, "Internet Small Computer Systems Interface (iSCSI)", RFC 3720, April 2004.
[9] Weber, R., "SCSI Primary Commands - 3 (SPC-3)", ANSI INCITS 408-2005, October 2005.
[10] Krueger, M., Chadalapaka, M., and R. Elliott, "T11 Network Address Authority (NAA) Naming Format for iSCSI Node Names", RFC 3980, February 2005.
[11] IEEE, "Guidelines for 64-bit Global Identifier (EUI-64) Registration Authority", <http://standards.ieee.org/regauth/oui/tutorials/EUI64.html>.
[12] Tseng, J., Gibbons, K., Travostino, F., Du Laney, C., and J. Souza, "Internet Storage Name Service (iSNS)", RFC 4171, September 2005.
[13] Weber, R., "SCSI Architecture Model - 3 (SAM-3)", ANSI INCITS 402-2005, February 2005.
[14] Weber, R., "SCSI Object-Based Storage Device Commands -2 (OSD-2)", January 2009, <http://www.t10.org/cgi-bin/ac.pl?t=f&f=osd2r05a.pdf>.
[15] Kent, S. and K. Seo, "Security Architecture for the Internet Protocol", RFC 4301, December 2005.
[16] T10 1415-D, "SCSI RDMA Protocol (SRP)", ANSI INCITS 365-2002, December 2002.
[17] T11 1619-D, "Fibre Channel Framing and Signaling - 2 (FC-FS-2)", ANSI INCITS 424-2007, February 2007.
[18] T10 1601-D, "Serial Attached SCSI - 1.1 (SAS-1.1)", ANSI INCITS 417-2006, June 2006.
[19] MacWilliams, F. and N. Sloane, "The Theory of Error-Correcting Codes, Part I", 1977.
Todd Pisek was a co-editor of the initial versions of this document. Daniel E. Messinger, Pete Wyckoff, Mike Eisler, Sean P. Turner, Brian E. Carpenter, Jari Arkko, David Black, and Jason Glasgow reviewed and commented on this document.
Authors' Addresses
   Benny Halevy
   Panasas, Inc.
   1501 Reedsdale St. Suite 400
   Pittsburgh, PA 15233
   USA

   Phone: +1-412-323-3500
   EMail: bhalevy@panasas.com
   URI:   http://www.panasas.com/
   Brent Welch
   Panasas, Inc.
   6520 Kaiser Drive
   Fremont, CA 95444
   USA

   Phone: +1-510-608-7770
   EMail: welch@panasas.com
   URI:   http://www.panasas.com/
   Jim Zelenka
   Panasas, Inc.
   1501 Reedsdale St. Suite 400
   Pittsburgh, PA 15233
   USA

   Phone: +1-412-323-3500
   EMail: jimz@panasas.com
   URI:   http://www.panasas.com/