Internet Engineering Task Force (IETF) Y. Collet Request for Comments: 8478 M. Kucherawy, Ed. Category: Informational Facebook ISSN: 2070-1721 October 2018
Internet Engineering Task Force (IETF) Y. Collet Request for Comments: 8478 M. Kucherawy, Ed. Category: Informational Facebook ISSN: 2070-1721 October 2018
Zstandard Compression and the application/zstd Media Type
Zstandard压缩和应用/zstd媒体类型
Abstract
摘要
Zstandard, or "zstd" (pronounced "zee standard"), is a data compression mechanism. This document describes the mechanism and registers a media type and content encoding to be used when transporting zstd-compressed content via Multipurpose Internet Mail Extensions (MIME).
Zstandard或“zstd”(发音为“zee标准”)是一种数据压缩机制。本文档描述了通过多用途Internet邮件扩展(MIME)传输zstd压缩内容时使用的机制,并注册了媒体类型和内容编码。
Despite use of the word "standard" as part of its name, readers are advised that this document is not an Internet Standards Track specification; it is being published for informational purposes only.
尽管使用了“标准”一词作为其名称的一部分,但建议读者,本文件不是互联网标准轨道规范;发布本手册仅供参考。
Status of This Memo
关于下段备忘
This document is not an Internet Standards Track specification; it is published for informational purposes.
本文件不是互联网标准跟踪规范;它是为了提供信息而发布的。
This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Not all documents approved by the IESG are candidates for any level of Internet Standard; see Section 2 of RFC 7841.
本文件是互联网工程任务组(IETF)的产品。它代表了IETF社区的共识。它已经接受了公众审查,并已被互联网工程指导小组(IESG)批准出版。并非IESG批准的所有文件都适用于任何级别的互联网标准;见RFC 7841第2节。
Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at https://www.rfc-editor.org/info/rfc8478.
有关本文件当前状态、任何勘误表以及如何提供反馈的信息,请访问https://www.rfc-editor.org/info/rfc8478.
Copyright Notice
版权公告
Copyright (c) 2018 IETF Trust and the persons identified as the document authors. All rights reserved.
版权所有(c)2018 IETF信托基金和确定为文件作者的人员。版权所有。
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
本文件受BCP 78和IETF信托有关IETF文件的法律规定的约束(https://trustee.ietf.org/license-info)自本文件出版之日起生效。请仔细阅读这些文件,因为它们描述了您对本文件的权利和限制。从本文件中提取的代码组件必须包括信托法律条款第4.e节中所述的简化BSD许可证文本,并提供简化BSD许可证中所述的无担保。
Table of Contents
目录
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 2. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 4 3. Compression Algorithm . . . . . . . . . . . . . . . . . . . . 5 3.1. Frames . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.1.1. Zstandard Frames . . . . . . . . . . . . . . . . . . 6 3.1.1.1. Frame Header . . . . . . . . . . . . . . . . . . 7 3.1.1.2. Blocks . . . . . . . . . . . . . . . . . . . . . 12 3.1.1.3. Compressed Blocks . . . . . . . . . . . . . . . . 14 3.1.1.4. Sequence Execution . . . . . . . . . . . . . . . 28 3.1.1.5. Repeat Offsets . . . . . . . . . . . . . . . . . 29 3.1.2. Skippable Frames . . . . . . . . . . . . . . . . . . 30 4. Entropy Encoding . . . . . . . . . . . . . . . . . . . . . . 30 4.1. FSE . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.1.1. FSE Table Description . . . . . . . . . . . . . . . . 31 4.2. Huffman Coding . . . . . . . . . . . . . . . . . . . . . 34 4.2.1. Huffman Tree Description . . . . . . . . . . . . . . 35 4.2.1.1. Huffman Tree Header . . . . . . . . . . . . . . . 36 4.2.1.2. FSE Compression of Huffman Weights . . . . . . . 37 4.2.1.3. Conversion from Weights to Huffman Prefix Codes . 38 4.2.2. Huffman-Coded Streams . . . . . . . . . . . . . . . . 39 5. Dictionary Format . . . . . . . . . . . . . . . . . . . . . . 40 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 42 6.1. The 'application/zstd' Media Type . . . . . . . . . . . . 42 6.2. Content Encoding . . . . . . . . . . . . . . . . . . . . 43 6.3. Dictionaries . . . . . . . . . . . . . . . . . . . . . . 43 7. Security Considerations . . . . . . . . . . . . . . . . . . . 43 8. Implementation Status . . . . . . . . . . . . . . . . . . . . 44 9. References . . . . . . . . . . . . . . . . . . . . . . . . . 45 9.1. Normative References . . . . . . . . . . . . . . . . . . 45 9.2. Informative References . . . . . . . . . . . . . . . . . 45 Appendix A. Decoding Tables for Predefined Codes . . . . . . . . 46 A.1. Literal Length Code Table . . . . . . . . . . . . . . . . 46 A.2. Match Length Code Table . . . . . . . . . . . . . . . . . 49 A.3. Offset Code Table . . . . . . . . . . . . . . . . . . . . 52 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 53 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 54
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 2. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 4 3. Compression Algorithm . . . . . . . . . . . . . . . . . . . . 5 3.1. Frames . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.1.1. Zstandard Frames . . . . . . . . . . . . . . . . . . 6 3.1.1.1. Frame Header . . . . . . . . . . . . . . . . . . 7 3.1.1.2. Blocks . . . . . . . . . . . . . . . . . . . . . 12 3.1.1.3. Compressed Blocks . . . . . . . . . . . . . . . . 14 3.1.1.4. Sequence Execution . . . . . . . . . . . . . . . 28 3.1.1.5. Repeat Offsets . . . . . . . . . . . . . . . . . 29 3.1.2. Skippable Frames . . . . . . . . . . . . . . . . . . 30 4. Entropy Encoding . . . . . . . . . . . . . . . . . . . . . . 30 4.1. FSE . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.1.1. FSE Table Description . . . . . . . . . . . . . . . . 31 4.2. Huffman Coding . . . . . . . . . . . . . . . . . . . . . 34 4.2.1. Huffman Tree Description . . . . . . . . . . . . . . 35 4.2.1.1. Huffman Tree Header . . . . . . . . . . . . . . . 36 4.2.1.2. FSE Compression of Huffman Weights . . . . . . . 37 4.2.1.3. Conversion from Weights to Huffman Prefix Codes . 38 4.2.2. Huffman-Coded Streams . . . . . . . . . . . . . . . . 39 5. Dictionary Format . . . . . . . . . . . . . . . . . . . . . . 40 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 42 6.1. The 'application/zstd' Media Type . . . . . . . . . . . . 42 6.2. Content Encoding . . . . . . . . . . . . . . . . . . . . 43 6.3. Dictionaries . . . . . . . . . . . . . . . . . . . . . . 43 7. Security Considerations . . . . . . . . . . . . . . . . . . . 43 8. Implementation Status . . . . . . . . . . . . . . . . . . . . 44 9. References . . . . . . . . . . . . . . . . . . . . . . . . . 45 9.1. Normative References . . . . . . . . . . . . . . . . . . 45 9.2. Informative References . . . . . . . . . . . . . . . . . 45 Appendix A. Decoding Tables for Predefined Codes . . . . . . . . 46 A.1. Literal Length Code Table . . . . . . . . . . . . . . . . 46 A.2. Match Length Code Table . . . . . . . . . . . . . . . . . 49 A.3. Offset Code Table . . . . . . . . . . . . . . . . . . . . 52 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 53 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 54
Zstandard, or "zstd" (pronounced "zee standard"), is a data compression mechanism, akin to gzip [RFC1952].
Zstandard或“zstd”(发音为“zee标准”)是一种数据压缩机制,类似于gzip[RFC1952]。
Despite use of the word "standard" as part of its name, readers are advised that this document is not an Internet Standards Track specification; it is being published for informational purposes only.
尽管使用了“标准”一词作为其名称的一部分,但建议读者,本文件不是互联网标准轨道规范;发布本手册仅供参考。
This document describes the Zstandard format. Also, to enable the transport of a data object compressed with Zstandard, this document registers a media type that can be used to identify such content when it is used in a payload encoded using Multipurpose Internet Mail Extensions (MIME).
本文档描述了ZS标准格式。此外,为了能够传输使用Zstandard压缩的数据对象,本文档注册了一种媒体类型,当该媒体类型用于使用多用途Internet邮件扩展(MIME)编码的有效负载时,可用于识别此类内容。
Some terms used elsewhere in this document are defined here for clarity.
为清晰起见,本文件其他地方使用的一些术语在此定义。
uncompressed: Describes an arbitrary set of bytes in their original form, prior to being subjected to compression.
未压缩:在进行压缩之前,描述原始形式的任意字节集。
compress, compression: The act of processing a set of bytes via the compression mechanism described here.
压缩:通过这里描述的压缩机制处理一组字节的行为。
compressed: Describes the result of passing a set of bytes through this mechanism. The original input has thus been compressed.
compressed:描述通过此机制传递一组字节的结果。因此,原始输入已被压缩。
decompress, decompression: The act of processing a set of bytes through the inverse of the compression mechanism described here, in an attempt to recover the original set of bytes prior to compression.
解压,解压:通过与此处描述的压缩机制相反的方式处理一组字节的行为,试图在压缩之前恢复原始字节集。
decompressed: Describes the result of passing a set of bytes through the reverse of this mechanism. When this is successful, the decompressed payload and the uncompressed payload are indistinguishable.
解压缩:描述通过与此机制相反的方式传递一组字节的结果。如果此操作成功,则无法区分解压缩的有效负载和未压缩的有效负载。
encode: The process of translating data from one form to another; this may include compression or it may refer to other translations done as part of this specification.
编码:将数据从一种形式转换为另一种形式的过程;这可能包括压缩,也可能涉及作为本规范一部分的其他翻译。
decode: The reverse of "encode"; describes a process of reversing a prior encoding to recover the original content.
解码:与“编码”相反;描述反转先前编码以恢复原始内容的过程。
frame: Content compressed by Zstandard is transformed into a Zstandard frame. Multiple frames can be appended into a single file or stream. A frame is completely independent, has a defined beginning and end, and has a set of parameters that tells the decoder how to decompress it.
框架:由Zstandard压缩的内容转换为Zstandard框架。可以将多个帧追加到单个文件或流中。帧是完全独立的,有定义的开始和结束,并且有一组参数告诉解码器如何解压缩。
block: A frame encapsulates one or multiple blocks. Each block contains arbitrary content, which is described by its header, and has a guaranteed maximum content size that depends upon frame parameters. Unlike frames, each block depends on previous blocks for proper decoding. However, each block can be decompressed without waiting for its successor, allowing streaming operations.
块:帧封装一个或多个块。每个块包含任意内容,这些内容由其标头描述,并具有保证的最大内容大小,该大小取决于帧参数。与帧不同,每个块依赖于前一块进行正确解码。但是,每个块都可以在不等待后续块的情况下进行解压缩,从而允许流操作。
natural order: A sequence or ordering of objects or values that is typical of that type of object or value. A set of unique integers, for example, is in "natural order" if when progressing from one element in the set or sequence to the next, there is never a decrease in value.
自然顺序:对象或值的序列或顺序,是该类型对象或值的典型。例如,如果从集合或序列中的一个元素前进到下一个元素时,值从未减少,则一组唯一整数是“自然顺序”的。
The naming convention for identifiers within the specification is Mixed_Case_With_Underscores. Identifiers inside square brackets indicate that the identifier is optional in the presented context.
规范中标识符的命名约定混合了大小写和下划线。方括号内的标识符表示标识符在呈现的上下文中是可选的。
This section describes the Zstandard algorithm.
本节介绍Zstandard算法。
The purpose of this document is to define a lossless compressed data format that is a) independent of the CPU type, operating system, file system, and character set and b) is suitable for file compression and pipe and streaming compression, using the Zstandard algorithm. The text of the specification assumes a basic background in programming at the level of bits and other primitive data representations.
本文档旨在定义无损压缩数据格式,该格式a)独立于CPU类型、操作系统、文件系统和字符集,b)适用于文件压缩以及管道和流式压缩,使用Zstandard算法。本规范的文本假定在位和其他原始数据表示级别编程的基本背景。
The data can be produced or consumed, even for an arbitrarily long sequentially presented input data stream, using only an a priori bounded amount of intermediate storage, and hence can be used in data communications. The format uses the Zstandard compression method, and an optional xxHash-64 checksum method [XXHASH], for detection of data corruption.
即使对于任意长的顺序呈现的输入数据流,也可以仅使用先验有界的中间存储量来产生或消耗数据,因此可以在数据通信中使用。该格式使用Zstandard压缩方法和可选的xxHash-64校验和方法[xxHash]来检测数据损坏。
The data format defined by this specification does not attempt to allow random access to compressed data.
本规范定义的数据格式不允许随机访问压缩数据。
Unless otherwise indicated below, a compliant compressor must produce data sets that conform to the specifications presented here. However, it does not need to support all options.
除非下文另有说明,否则符合要求的压缩机必须生成符合此处所述规范的数据集。但是,它不需要支持所有选项。
A compliant decompressor must be able to decompress at least one working set of parameters that conforms to the specifications presented here. It may also ignore informative fields, such as the checksum. Whenever it does not support a parameter defined in the compressed stream, it must produce a non-ambiguous error code and associated error message explaining which parameter is unsupported.
符合要求的解压器必须能够解压至少一组符合此处所述规范的工作参数。它还可能忽略信息字段,例如校验和。当它不支持压缩流中定义的参数时,它必须生成一个不含糊的错误代码和相关的错误消息,解释不支持哪个参数。
This specification is intended for use by implementers of software to compress data into Zstandard format and/or decompress data from Zstandard format. The Zstandard format is supported by an open source reference implementation, written in portable C, and available at [ZSTD].
本规范旨在供软件实施者使用,以将数据压缩为Zstandard格式和/或从Zstandard格式解压数据。Zstandard格式由一个开源参考实现支持,该实现是用portable C编写的,可在[ZSTD]上获得。
Zstandard compressed data is made up of one or more frames. Each frame is independent and can be decompressed independently of other frames. The decompressed content of multiple concatenated frames is the concatenation of each frame's decompressed content.
标准压缩数据由一个或多个帧组成。每个帧都是独立的,可以独立于其他帧进行解压缩。多个级联帧的解压缩内容是每个帧的解压缩内容的级联。
There are two frame formats defined for Zstandard: Zstandard frames and skippable frames. Zstandard frames contain compressed data, while skippable frames contain custom user metadata.
为Zstandard定义了两种帧格式:Zstandard帧和可跳过帧。Zstandard框架包含压缩数据,而可跳过的框架包含自定义用户元数据。
The structure of a single Zstandard frame is as follows:
单个Z标准框架的结构如下所示:
+--------------------+------------+ | Magic_Number | 4 bytes | +--------------------+------------+ | Frame_Header | 2-14 bytes | +--------------------+------------+ | Data_Block | n bytes | +--------------------+------------+ | [More Data_Blocks] | | +--------------------+------------+ | [Content_Checksum] | 0-4 bytes | +--------------------+------------+
+--------------------+------------+ | Magic_Number | 4 bytes | +--------------------+------------+ | Frame_Header | 2-14 bytes | +--------------------+------------+ | Data_Block | n bytes | +--------------------+------------+ | [More Data_Blocks] | | +--------------------+------------+ | [Content_Checksum] | 0-4 bytes | +--------------------+------------+
Magic_Number: 4 bytes, little-endian format. Value: 0xFD2FB528.
魔数:4字节,小尾端格式。值:0xFD2FB528。
Frame_Header: 2 to 14 bytes, detailed in Section 3.1.1.1.
帧头:2至14字节,详见第3.1.1.1节。
Data_Block: Detailed in Section 3.1.1.2. This is where data appears.
数据块:详见第3.1.1.2节。这就是数据出现的地方。
Content_Checksum: An optional 32-bit checksum, only present if Content_Checksum_Flag is set. The content checksum is the result of the XXH64() hash function [XXHASH] digesting the original (decoded) data as input, and a seed of zero. The low 4 bytes of the checksum are stored in little-endian format.
内容校验和:可选的32位校验和,仅在设置内容校验和标志时出现。内容校验和是XXH64()散列函数[XXHASH]将原始(解码)数据消化为输入的结果,种子为零。校验和的低位4字节以little endian格式存储。
The magic number was selected to be less probable to find at the beginning of an arbitrary file. It avoids trivial patterns (0x00, 0xFF, repeated bytes, increasing bytes, etc.), contains byte values outside of ASCII range, and doesn't map into UTF-8 space, all of which reduce the likelihood of its appearance at the top of a text file.
选择的幻数不太可能在任意文件的开头找到。它避免了琐碎的模式(0x00、0xFF、重复字节、增加字节等),包含ASCII范围之外的字节值,并且不映射到UTF-8空间,所有这些都降低了它出现在文本文件顶部的可能性。
The frame header has a variable size, with a minimum of 2 bytes and up to 14 bytes depending on optional parameters. The structure of Frame_Header is as follows:
帧头大小可变,根据可选参数,最小为2字节,最多为14字节。帧_头的结构如下:
+-------------------------+-----------+ | Frame_Header_Descriptor | 1 byte | +-------------------------+-----------+ | [Window_Descriptor] | 0-1 byte | +-------------------------+-----------+ | [Dictionary_ID] | 0-4 bytes | +-------------------------+-----------+ | [Frame_Content_Size] | 0-8 bytes | +-------------------------+-----------+
+-------------------------+-----------+ | Frame_Header_Descriptor | 1 byte | +-------------------------+-----------+ | [Window_Descriptor] | 0-1 byte | +-------------------------+-----------+ | [Dictionary_ID] | 0-4 bytes | +-------------------------+-----------+ | [Frame_Content_Size] | 0-8 bytes | +-------------------------+-----------+
The first header's byte is called the Frame_Header_Descriptor. It describes which other fields are present. Decoding this byte is enough to tell the size of Frame_Header.
第一个头的字节称为帧头描述符。它描述了存在哪些其他字段。对这个字节进行解码就足以说明帧头的大小。
+------------+-------------------------+ | Bit Number | Field Name | +------------+-------------------------+ | 7-6 | Frame_Content_Size_Flag | +------------+-------------------------+ | 5 | Single_Segment_Flag | +------------+-------------------------+ | 4 | (unused) | +------------+-------------------------+ | 3 | (reserved) | +------------+-------------------------+ | 2 | Content_Checksum_Flag | +------------+-------------------------+ | 1-0 | Dictionary_ID_Flag | +------------+-------------------------+
+------------+-------------------------+ | Bit Number | Field Name | +------------+-------------------------+ | 7-6 | Frame_Content_Size_Flag | +------------+-------------------------+ | 5 | Single_Segment_Flag | +------------+-------------------------+ | 4 | (unused) | +------------+-------------------------+ | 3 | (reserved) | +------------+-------------------------+ | 2 | Content_Checksum_Flag | +------------+-------------------------+ | 1-0 | Dictionary_ID_Flag | +------------+-------------------------+
In this table, bit 7 is the highest bit, while bit 0 is the lowest one.
在此表中,位7是最高位,而位0是最低位。
This is a 2-bit flag (equivalent to Frame_Header_Descriptor right-shifted 6 bits) specifying whether Frame_Content_Size (the decompressed data size) is provided within the header. Flag_Value provides FCS_Field_Size, which is the number of bytes used by Frame_Content_Size according to the following table:
这是一个2位标志(相当于帧\头\描述符右移6位),指定帧\内容\大小(解压缩数据大小)是否在头中提供。Flag_Value提供FCS_Field_Size,根据下表,它是帧_Content_Size使用的字节数:
+----------------+--------+---+---+---+ | Flag_Value | 0 | 1 | 2 | 3 | +----------------+--------+---+---+---+ | FCS_Field_Size | 0 or 1 | 2 | 4 | 8 | +----------------+--------+---+---+---+
+----------------+--------+---+---+---+ | Flag_Value | 0 | 1 | 2 | 3 | +----------------+--------+---+---+---+ | FCS_Field_Size | 0 or 1 | 2 | 4 | 8 | +----------------+--------+---+---+---+
When Flag_Value is 0, FCS_Field_Size depends on Single_Segment_Flag: If Single_Segment_Flag is set, FCS_Field_Size is 1. Otherwise, FCS_Field_Size is 0; Frame_Content_Size is not provided.
当Flag_值为0时,FCS_字段_大小取决于单_段_标志:如果设置了单_段_标志,则FCS_字段_大小为1。否则,FCS_字段_大小为0;未提供框架内容大小。
If this flag is set, data must be regenerated within a single continuous memory segment.
如果设置了此标志,则必须在单个连续内存段内重新生成数据。
In this case, Window_Descriptor byte is skipped, but Frame_Content_Size is necessarily present. As a consequence, the decoder must allocate a memory segment of size equal or larger than Frame_Content_Size.
在这种情况下,跳过窗口描述符字节,但帧内容大小必须存在。因此,解码器必须分配一个大小等于或大于帧内容大小的内存段。
In order to protect the decoder from unreasonable memory requirements, a decoder is allowed to reject a compressed frame that requests a memory size beyond the decoder's authorized range.
为了保护解码器不受不合理的内存需求的影响,允许解码器拒绝请求超出解码器授权范围的内存大小的压缩帧。
For broader compatibility, decoders are recommended to support memory sizes of at least 8 MB. This is only a recommendation; each decoder is free to support higher or lower limits, depending on local limitations.
为了更广泛的兼容性,建议解码器支持至少8MB的内存大小。这只是一项建议;根据本地限制,每个解码器可自由支持更高或更低的限制。
A decoder compliant with this specification version shall not interpret this bit. It might be used in a future version, to signal a property that is not mandatory to properly decode the frame. An encoder compliant with this specification must set this bit to zero.
符合本规范版本的解码器不得解释该位。它可能会在将来的版本中使用,以表示一个属性,该属性不是正确解码帧所必需的。符合此规范的编码器必须将此位设置为零。
This bit is reserved for some future feature. Its value must be zero. A decoder compliant with this specification version must ensure it is not set. This bit may be used in a future revision, to signal a feature that must be interpreted to decode the frame correctly.
此位保留用于将来的某些功能。它的值必须为零。符合此规范版本的解码器必须确保未设置。该位可在未来版本中使用,以表示必须解释以正确解码帧的特征。
If this flag is set, a 32-bit Content_Checksum will be present at the frame's end. See the description of Content_Checksum above.
如果设置了此标志,则帧末尾将出现32位内容校验和。请参阅上面的内容\u校验和说明。
This is a 2-bit flag (= Frame_Header_Descriptor & 0x3) indicating whether a dictionary ID is provided within the header. It also specifies the size of this field as DID_Field_Size:
这是一个2位标志(=Frame\u Header\u Descriptor&0x3),指示是否在报头中提供字典ID。它还指定此字段的大小,就像指定字段大小一样:
+----------------+---+---+---+---+ | Flag_Value | 0 | 1 | 2 | 3 | +----------------+---+---+---+---+ | DID_Field_Size | 0 | 1 | 2 | 4 | +----------------+---+---+---+---+
+----------------+---+---+---+---+ | Flag_Value | 0 | 1 | 2 | 3 | +----------------+---+---+---+---+ | DID_Field_Size | 0 | 1 | 2 | 4 | +----------------+---+---+---+---+
This provides guarantees about the minimum memory buffer required to decompress a frame. This information is important for decoders to allocate enough memory.
这保证了解压帧所需的最小内存缓冲区。这些信息对于解码器分配足够的内存非常重要。
The Window_Descriptor byte is optional. When Single_Segment_Flag is set, Window_Descriptor is not present. In this case, Window_Size is Frame_Content_Size, which can be any value from 0 to 2^64-1 bytes (16 ExaBytes).
窗口描述符字节是可选的。设置单_段_标志时,窗口_描述符不存在。在本例中,窗口大小是帧内容大小,可以是0到2^64-1字节(16 EB)之间的任何值。
+------------+----------+----------+ | Bit Number | 7-3 | 2-0 | +------------+----------+----------+ | Field Name | Exponent | Mantissa | +------------+----------+----------+
+------------+----------+----------+ | Bit Number | 7-3 | 2-0 | +------------+----------+----------+ | Field Name | Exponent | Mantissa | +------------+----------+----------+
The minimum memory buffer size is called Window_Size. It is described by the following formulae:
最小内存缓冲区大小称为窗口大小。它由以下公式描述:
windowLog = 10 + Exponent; windowBase = 1 << windowLog; windowAdd = (windowBase / 8) * Mantissa; Window_Size = windowBase + windowAdd;
windowLog = 10 + Exponent; windowBase = 1 << windowLog; windowAdd = (windowBase / 8) * Mantissa; Window_Size = windowBase + windowAdd;
The minimum Window_Size is 1 KB. The maximum Window_Size is (1<<41) + 7*(1<<38) bytes, which is 3.75 TB.
最小窗口大小为1 KB。最大窗口大小为(1<<41)+7*(1<<38)字节,即3.75 TB。
In general, larger Window_Size values tend to improve the compression ratio, but at the cost of increased memory usage.
通常,较大的窗口大小值往往会提高压缩比,但会增加内存使用量。
To properly decode compressed data, a decoder will need to allocate a buffer of at least Window_Size bytes.
为了正确解码压缩数据,解码器需要分配至少为Window_大小字节的缓冲区。
In order to protect decoders from unreasonable memory requirements, a decoder is allowed to reject a compressed frame that requests a memory size beyond decoder's authorized range.
为了保护解码器不受不合理的内存需求的影响,允许解码器拒绝请求超出解码器授权范围的内存大小的压缩帧。
For improved interoperability, it's recommended for decoders to support values of Window_Size up to 8 MB and for encoders not to generate frames requiring a Window_Size larger than 8 MB. It's merely a recommendation though, and decoders are free to support larger or lower limits, depending on local limitations.
为了提高互操作性,建议解码器支持最大8MB的窗口大小值,编码器不生成需要大于8MB的窗口大小的帧。不过这只是一个建议,解码器可以自由支持更大或更低的限制,具体取决于本地限制。
This is a variable size field, which contains the ID of the dictionary required to properly decode the frame. This field is optional. When it's not present, it's up to the decoder to know which dictionary to use.
这是一个可变大小的字段,其中包含正确解码帧所需的字典ID。此字段是可选的。当它不存在时,由解码器决定使用哪个词典。
Dictionary_ID field size is provided by DID_Field_Size. DID_Field_Size is directly derived from the value of Dictionary_ID_Flag. One byte can represent an ID 0-255; 2 bytes can represent an ID 0-65535; 4 bytes can represent an ID 0-4294967295. Format is little-endian.
Dictionary\u ID字段大小由DID\u field\u size提供。DID_字段_大小直接来自Dictionary_ID_标志的值。一个字节可以表示ID 0-255;2个字节可以表示ID 0-65535;4个字节可以表示ID 0-4294967295。格式是LittleEndian。
It is permitted to represent a small ID (for example, 13) with a large 4-byte dictionary ID, even if it is less efficient.
允许用一个大的4字节字典ID来表示一个小ID(例如13),即使它的效率较低。
Within private environments, any dictionary ID can be used. However, for frames and dictionaries distributed in public space, Dictionary_ID must be attributed carefully. The following ranges are reserved for use only with dictionaries that have been registered with IANA (see Section 6.3):
在私有环境中,可以使用任何字典ID。然而,对于分布在公共空间中的框架和字典,必须小心地对字典ID进行属性化。以下范围仅适用于已向IANA注册的词典(见第6.3节):
low range: <= 32767 high range: >= (1 << 31)
low range: <= 32767 high range: >= (1 << 31)
Any other value for Dictionary_ID can be used by private arrangement between participants.
参与者之间的私人安排可以使用Dictionary_ID的任何其他值。
Any payload presented for decompression that references an unregistered reserved dictionary ID results in an error.
为解压缩提供的任何有效负载引用未注册的保留字典ID都会导致错误。
This is the original (uncompressed) size. This information is optional. Frame_Content_Size uses a variable number of bytes, provided by FCS_Field_Size. FCS_Field_Size is provided by the value of Frame_Content_Size_Flag. FCS_Field_Size can be equal to 0 (not present), 1, 2, 4, or 8 bytes.
这是原始(未压缩)大小。此信息是可选的。帧内容大小使用由FCS字段大小提供的可变字节数。FCS_字段_大小由帧_内容_大小_标志的值提供。FCS_字段_大小可以等于0(不存在)、1、2、4或8字节。
+----------------+--------------+ | FCS Field Size | Range | +----------------+--------------+ | 0 | unknown | +----------------+--------------+ | 1 | 0 - 255 | +----------------+--------------+ | 2 | 256 - 65791 | +----------------+--------------+ | 4 | 0 - 2^32 - 1 | +----------------+--------------+ | 8 | 0 - 2^64 - 1 | +----------------+--------------+
+----------------+--------------+ | FCS Field Size | Range | +----------------+--------------+ | 0 | unknown | +----------------+--------------+ | 1 | 0 - 255 | +----------------+--------------+ | 2 | 256 - 65791 | +----------------+--------------+ | 4 | 0 - 2^32 - 1 | +----------------+--------------+ | 8 | 0 - 2^64 - 1 | +----------------+--------------+
Frame_Content_Size format is little-endian. When FCS_Field_Size is 1, 4, or 8 bytes, the value is read directly. When FCS_Field_Size is 2, the offset of 256 is added. It's allowed to represent a small size (for example 18) using any compatible variant.
帧\内容\大小格式为little endian。当FCS_字段_大小为1、4或8字节时,直接读取该值。当FCS_字段_大小为2时,将添加256的偏移量。允许使用任何兼容的变体表示小尺寸(例如18)。
After Magic_Number and Frame_Header, there are some number of blocks. Each frame must have at least 1 block, but there is no upper limit on the number of blocks per frame.
在Magic_编号和Frame_标题之后,有一些块。每个帧必须至少有一个块,但每个帧的块数没有上限。
The structure of a block is as follows:
区块的结构如下所示:
+--------------+---------------+ | Block_Header | Block_Content | +--------------+---------------+ | 3 bytes | n bytes | +--------------+---------------+
+--------------+---------------+ | Block_Header | Block_Content | +--------------+---------------+ | 3 bytes | n bytes | +--------------+---------------+
Block_Header uses 3 bytes, written using little-endian convention. It contains three fields:
Block_头使用3个字节,使用little endian约定写入。它包含三个字段:
+------------+------------+------------+ | Last_Block | Block_Type | Block_Size | +------------+------------+------------+ | bit 0 | bits 1-2 | bits 3-23 | +------------+------------+------------+
+------------+------------+------------+ | Last_Block | Block_Type | Block_Size | +------------+------------+------------+ | bit 0 | bits 1-2 | bits 3-23 | +------------+------------+------------+
The lowest bit (Last_Block) signals whether this block is the last one. The frame will end after this last block. It may be followed by an optional Content_Checksum (see Section 3.1.1).
最低位(最后一个块)表示该块是否为最后一个。帧将在最后一个块之后结束。之后可能会有可选的内容校验和(见第3.1.1节)。
The next 2 bits represent the Block_Type. There are four block types:
接下来的2位表示块类型。有四种块类型:
+-----------+------------------+ | Value | Block_Type | +-----------+------------------+ | 0 | Raw_Block | +-----------+------------------+ | 1 | RLE_Block | +-----------+------------------+ | 2 | Compressed_Block | +-----------+------------------+ | 3 | Reserved | +-----------+------------------+
+-----------+------------------+ | Value | Block_Type | +-----------+------------------+ | 0 | Raw_Block | +-----------+------------------+ | 1 | RLE_Block | +-----------+------------------+ | 2 | Compressed_Block | +-----------+------------------+ | 3 | Reserved | +-----------+------------------+
Raw_Block: This is an uncompressed block. Block_Content contains Block_Size bytes.
原始块:这是一个未压缩的块。块\内容包含块\大小字节。
RLE_Block: This is a single byte, repeated Block_Size times. Block_Content consists of a single byte. On the decompression side, this byte must be repeated Block_Size times.
RLE_块:这是一个单字节,重复块大小的时间。块内容由单个字节组成。在解压缩端,该字节必须重复块大小的次。
Compressed_Block: This is a compressed block as described in Section 3.1.1.3. Block_Size is the length of Block_Content, namely the compressed data. The decompressed size is not known, but its maximum possible value is guaranteed (see below).
压缩块:这是第3.1.1.3节所述的压缩块。Block_Size是Block_内容的长度,即压缩数据。解压后的大小未知,但可以保证其最大可能值(见下文)。
Reserved: This is not a block. This value cannot be used with the current specification. If such a value is present, it is considered to be corrupt data.
保留:这不是一个区块。此值不能与当前规范一起使用。如果存在这样的值,则认为该值是损坏的数据。
The upper 21 bits of Block_Header represent the Block_Size. Block_Size is the size of the block excluding the header. A block can contain any number of bytes (even zero), up to Block_Maximum_Decompressed_Size, which is the smallest of:
块_头的上21位表示块_大小。Block_Size是不包括标头的块的大小。一个块可以包含任意数量的字节(甚至为零),最大为块\u最大\u解压\u大小,最小为:
o Window_Size
o 窗口大小
o 128 KB
o 128 KB
A Compressed_Block has the extra restriction that Block_Size is always strictly less than the decompressed size. If this condition cannot be respected, the block must be sent uncompressed instead (i.e., treated as a Raw_Block).
压缩的块具有额外的限制,即块大小始终严格小于解压缩的大小。如果不能满足此条件,则必须以未压缩的方式发送该块(即,将其视为原始块)。
To decompress a compressed block, the compressed size must be provided from the Block_Size field within Block_Header.
要解压缩压缩块,必须从块\头中的块\大小字段提供压缩大小。
A compressed block consists of two sections: a Literals Section (Section 3.1.1.3.1) and a Sequences_Section (Section 3.1.1.3.2). The results of the two sections are then combined to produce the decompressed data in Sequence Execution (Section 3.1.1.4).
压缩块由两部分组成:文字部分(第3.1.1.3.1节)和序列部分(第3.1.1.3.2节)。然后将这两个部分的结果合并,以在序列执行中生成解压缩数据(第3.1.1.4节)。
To decode a compressed block, the following elements are necessary:
要解码压缩块,需要以下元素:
o Previous decoded data, up to a distance of Window_Size, or the beginning of the Frame, whichever is smaller. Single_Segment_Flag will be set in the latter case.
o 先前解码的数据,直到窗口大小的距离或帧的开始处,以较小者为准。在后一种情况下,将设置单_段_标志。
o List of "recent offsets" from the previous Compressed_Block.
o 上一压缩块的“最近偏移”列表。
o The previous Huffman tree, required by Treeless_Literals_Block type.
o 前一个哈夫曼树,无树\u文本\u块类型所需。
o Previous Finite State Entropy (FSE) decoding tables, required by Repeat_Mode, for each symbol type (literals lengths, match lengths, offsets).
o 对于每种符号类型(文字长度、匹配长度、偏移量),Repeat_模式所需的先前有限状态熵(FSE)解码表。
Note that decoding tables are not always from the previous Compressed_Block:
请注意,解码表并不总是来自上一个压缩块:
o Every decoding table can come from a dictionary.
o 每个解码表都可以来自字典。
o The Huffman tree comes from the previous Compressed_Literals_Block.
o 哈夫曼树来自前面的压缩文本块。
All literals are regrouped in the first part of the block. They can be decoded first and then copied during Sequence Execution (see Section 3.1.1.4), or they can be decoded on the flow during Sequence Execution.
所有文字在块的第一部分重新分组。它们可以在序列执行期间先解码然后复制(见第3.1.1.4节),也可以在序列执行期间在流上解码。
Literals can be stored uncompressed or compressed using Huffman prefix codes. When compressed, an optional tree description can be present, followed by 1 or 4 streams.
文本可以未压缩存储,也可以使用哈夫曼前缀代码进行压缩。压缩后,可以显示可选的树描述,后跟1或4个流。
+----------------------------+ | Literals_Section_Header | +----------------------------+ | [Huffman_Tree_Description] | +----------------------------+ | [Jump_Table] | +----------------------------+ | Stream_1 | +----------------------------+ | [Stream_2] | +----------------------------+ | [Stream_3] | +----------------------------+ | [Stream_4] | +----------------------------+
+----------------------------+ | Literals_Section_Header | +----------------------------+ | [Huffman_Tree_Description] | +----------------------------+ | [Jump_Table] | +----------------------------+ | Stream_1 | +----------------------------+ | [Stream_2] | +----------------------------+ | [Stream_3] | +----------------------------+ | [Stream_4] | +----------------------------+
This field describes how literals are packed. It's a byte-aligned variable-size bit field, ranging from 1 to 5 bytes, using little-endian convention.
此字段描述如何打包文字。它是一个字节对齐的可变大小的位字段,范围从1到5字节,使用小尾端约定。
+---------------------+-----------+ | Literals_Block_Type | 2 bits | +---------------------+-----------+ | Size_Format | 1-2 bits | +---------------------+-----------+ | Regenerated_Size | 5-20 bits | +---------------------+-----------+ | [Compressed_Size] | 0-18 bits | +---------------------+-----------+
+---------------------+-----------+ | Literals_Block_Type | 2 bits | +---------------------+-----------+ | Size_Format | 1-2 bits | +---------------------+-----------+ | Regenerated_Size | 5-20 bits | +---------------------+-----------+ | [Compressed_Size] | 0-18 bits | +---------------------+-----------+
In this representation, bits at the top are the lowest bits.
在这种表示法中,顶部的位是最低位。
The Literals_Block_Type field uses the two lowest bits of the first byte, describing four different block types:
“文字\块\类型”字段使用第一个字节的两个最低位,描述四种不同的块类型:
+---------------------------+-------+ | Literals_Block_Type | Value | +---------------------------+-------+ | Raw_Literals_Block | 0 | +---------------------------+-------+ | RLE_Literals_Block | 1 | +---------------------------+-------+ | Compressed_Literals_Block | 2 | +---------------------------+-------+ | Treeless_Literals_Block | 3 | +---------------------------+-------+
+---------------------------+-------+ | Literals_Block_Type | Value | +---------------------------+-------+ | Raw_Literals_Block | 0 | +---------------------------+-------+ | RLE_Literals_Block | 1 | +---------------------------+-------+ | Compressed_Literals_Block | 2 | +---------------------------+-------+ | Treeless_Literals_Block | 3 | +---------------------------+-------+
Raw_Literals_Block: Literals are stored uncompressed. Literals_Section_Content is Regenerated_Size.
原始文本块:文本未压缩存储。文字\u节\u内容重新生成\u大小。
RLE_Literals_Block: Literals consist of a single-byte value repeated Regenerated_Size times. Literals_Section_Content is 1.
RLE_Literals_Block:Literals由一个字节值组成,重复大小为\u次。文字\u节\u内容为1。
Compressed_Literals_Block: This is a standard Huffman-compressed block, starting with a Huffman tree description. See details below. Literals_Section_Content is Compressed_Size.
压缩文本块:这是一个标准的哈夫曼压缩块,从哈夫曼树描述开始。详情见下文。文字\u节\u内容已压缩\u大小。
Treeless_Literals_Block: This is a Huffman-compressed block, using the Huffman tree from the previous Compressed_Literals_Block, or a dictionary if there is no previous Huffman-compressed literals block. Huffman_Tree_Description will be skipped. Note that if this mode is triggered without any previous Huffman-table in the frame (or dictionary, per Section 5), it should be treated as data corruption. Literals_Section_Content is Compressed_Size.
Treeless_Literals_Block:这是一个Huffman压缩块,使用以前压缩的_Literals_块中的Huffman树,或者如果没有以前的Huffman压缩的Literals块,则使用字典。哈夫曼树描述将被跳过。请注意,如果触发此模式时帧中没有任何先前的Huffman表(或字典,根据第5节),则应将其视为数据损坏。文字\u节\u内容已压缩\u大小。
The Size_Format is divided into two families:
Size_格式分为两个系列:
o For Raw_Literals_Block and RLE_Literals_Block, it's only necessary to decode Regenerated_Size. There is no Compressed_Size field.
o 对于原始文本块和RLE文本块,只需要解码重新生成的文本大小。没有压缩的大小字段。
o For Compressed_Block and Treeless_Literals_Block, it's required to decode both Compressed_Size and Regenerated_Size (the decompressed size). It's also necessary to decode the number of streams (1 or 4).
o 对于压缩的\u块和无树的\u文本\u块,需要对压缩的\u大小和重新生成的\u大小(解压缩的大小)进行解码。还需要解码流的数量(1或4)。
For values spanning several bytes, the convention is little endian.
对于跨越多个字节的值,约定为小端。
Size_Format for Raw_Literals_Block and RLE_Literals_Block uses 1 or 2 bits. Its value is (Literals_Section_Header[0]>>2) & 0x3.
原始文本块和RLE文本块的大小格式使用1或2位。它的值是(文本\u节\u头[0]>>2)&0x3。
Size_Format == 00 or 10: Size_Format uses 1 bit. Regenerated_Size uses 5 bits (value 0-31). Literals_Section_Header uses 1 byte. Regenerated_Size = Literal_Section_Header[0]>>3.
大小\格式==00或10:大小\格式使用1位。重新生成的_大小使用5位(值0-31)。文本\u节\u头使用1字节。重新生成的\u大小=文字\u节\u头[0]>>3。
Size_Format == 01: Size_Format uses 2 bits. Regenerated_Size uses 12 bits (values 0-4095). Literals_Section_Header uses 2 bytes. Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4).
Size_Format == 01: Size_Format uses 2 bits. Regenerated_Size uses 12 bits (values 0-4095). Literals_Section_Header uses 2 bytes. Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4).
Size_Format == 11: Size_Format uses 2 bits. Regenerated_Size uses 20 bits (values 0-1048575). Literals_Section_Header uses 3 bytes. Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4) + (Literals_Section_Header[2]<<12)
Size_Format == 11: Size_Format uses 2 bits. Regenerated_Size uses 20 bits (values 0-1048575). Literals_Section_Header uses 3 bytes. Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4) + (Literals_Section_Header[2]<<12)
Only Stream_1 is present for these cases. Note that it is permitted to represent a short value (for example, 13) using a long format, even if it's less efficient.
对于这些情况,只有流_1存在。请注意,允许使用长格式表示短值(例如13),即使效率较低。
Size_Format for Compressed_Literals_Block and Treeless_Literals_Block always uses 2 bits.
压缩文本块和无树文本块的大小格式始终使用2位。
Size_Format == 00: A single stream. Both Regenerated_Size and Compressed_Size use 10 bits (values 0-1023). Literals_Section_Header uses 3 bytes.
Size_Format==00:单个流。重新生成的_大小和压缩的_大小都使用10位(值0-1023)。文本\u节\u头使用3个字节。
Size_Format == 01: 4 streams. Both Regenerated_Size and Compressed_Size use 10 bits (values 0-1023). Literals_Section_Header uses 3 bytes.
大小\格式==01:4个流。重新生成的_大小和压缩的_大小都使用10位(值0-1023)。文本\u节\u头使用3个字节。
Size_Format == 10: 4 streams. Both Regenerated_Size and Compressed_Size use 14 bits (values 0-16383). Literals_Section_Header uses 4 bytes.
大小\格式==10:4个流。重新生成的_大小和压缩的_大小都使用14位(值0-16383)。文本\u节\u头使用4个字节。
Size_Format == 11: 4 streams. Both Regenerated_Size and Compressed_Size use 18 bits (values 0-262143). Literals_Section_Header uses 5 bytes.
大小\格式==11:4个流。重新生成的_大小和压缩的_大小都使用18位(值0-262143)。文本\u节\u头使用5个字节。
Both the Compressed_Size and Regenerated_Size fields follow little-endian convention. Note that Compressed_Size includes the size of the Huffman_Tree_Description when it is present.
压缩的_Size和重新生成的_Size字段都遵循小端惯例。请注意,压缩的大小包括哈夫曼树描述的大小。
The data in Stream_1 is Regenerated_Size bytes long. It contains the raw literals data to be used during Sequence Execution (Section 3.1.1.3.2).
流_1中的数据是重新生成的_大小字节长。它包含序列执行期间使用的原始文字数据(第3.1.1.3.2节)。
Stream_1 consists of a single byte that should be repeated Regenerated_Size times to generate the decoded literals.
流_1由一个字节组成,该字节应重复_大小的次数以生成解码的文本。
Both of these modes contain Huffman-encoded data. For Treeless_Literals_Block, the Huffman table comes from the previously compressed literals block, or from a dictionary; see Section 5.
这两种模式都包含哈夫曼编码的数据。对于无树文本块,Huffman表来自先前压缩的文本块或字典;见第5节。
This section is only present when the Literals_Block_Type type is Compressed_Literals_Block (2). The format of Huffman_Tree_Description can be found in Section 4.2.1. The size of Huffman_Tree_Description is determined during the decoding process. It must be used to determine where streams begin.
本节仅在文字\u块\u类型为压缩\u文字\u块(2)时出现。哈夫曼树描述的格式见第4.2.1节。哈夫曼树描述的大小在解码过程中确定。它必须用于确定流从何处开始。
Total_Streams_Size = Compressed_Size - Huffman_Tree_Description_Size
总\u流\u大小=压缩\u大小-哈夫曼\u树\u描述\u大小
The Jump_Table is only present when there are 4 Huffman-coded streams.
只有当有4个哈夫曼编码流时,才会出现跳转_表。
(Reminder: Huffman-compressed data consists of either 1 or 4 Huffman-coded streams.)
(提醒:哈夫曼压缩数据由1或4个哈夫曼编码流组成。)
If only 1 stream is present, it is a single bitstream occupying the entire remaining portion of the literals block, encoded as described within Section 4.2.2.
如果仅存在1个流,则它是一个占用文字块的整个剩余部分的单个比特流,按照第4.2.2节所述进行编码。
If there are 4 streams, Literals_Section_Header only provides enough information to know the decompressed and compressed sizes of all 4 streams combined. The decompressed size of each stream is equal to (Regenerated_Size+3)/4, except for the last stream, which may be up to 3 bytes smaller, to reach a total decompressed size as specified in Regenerated_Size.
如果有4个流,则Literals\u Section\u Header仅提供足够的信息来了解所有4个流的解压缩和压缩大小。每个流的解压缩大小等于(重新生成的\u大小+3)/4,但最后一个流除外,该流可能小于3字节,以达到重新生成的\u大小中指定的总解压缩大小。
The compressed size of each stream is provided explicitly in the Jump_Table. The Jump_Table is 6 bytes long and consists of three 2-byte little-endian fields, describing the compressed sizes of the first 3 streams. Stream4_Size is computed from Total_Streams_Size minus sizes of other streams.
每个流的压缩大小在Jump_表中显式提供。Jump_表的长度为6字节,由三个2字节的小端字段组成,描述前3个流的压缩大小。Stream4_大小是根据总_流_大小减去其他流的大小来计算的。
Stream4_Size = Total_Streams_Size - 6 - Stream1_Size - Stream2_Size - Stream3_Size
Stream4_Size = Total_Streams_Size - 6 - Stream1_Size - Stream2_Size - Stream3_Size
Note that if Stream1_Size + Stream2_Size + Stream3_Size exceeds Total_Streams_Size, the data are considered corrupted.
请注意,如果Stream1\u Size+Stream2\u Size+Stream3\u Size超过总的\u Streams\u Size,则认为数据已损坏。
Each of these 4 bitstreams is then decoded independently as a Huffman-Coded stream, as described in Section 4.2.2.
如第4.2.2节所述,将这4个比特流中的每一个单独解码为哈夫曼编码流。
A compressed block is a succession of sequences. A sequence is a literal copy command, followed by a match copy command. A literal copy command specifies a length. It is the number of bytes to be copied (or extracted) from the Literals Section. A match copy command specifies an offset and a length.
压缩块是一系列序列。序列是文字复制命令,后跟匹配复制命令。文字复制命令指定长度。它是要从文本部分复制(或提取)的字节数。“匹配复制”命令指定偏移量和长度。
When all sequences are decoded, if there are literals left in the literals section, these bytes are added at the end of the block.
当解码所有序列时,如果文本部分中还剩下文本,则这些字节将添加到块的末尾。
This is described in more detail in Section 3.1.1.4.
第3.1.1.4节对此进行了更详细的描述。
The Sequences_Section regroups all symbols required to decode commands. There are three symbol types: literals lengths, offsets, and match lengths. They are encoded together, interleaved, in a single "bitstream".
Sequences_部分重新组合解码命令所需的所有符号。有三种符号类型:文字长度、偏移量和匹配长度。它们被编码在一起,交织在一个“比特流”中。
The Sequences_Section starts by a header, followed by optional probability tables for each symbol type, followed by the bitstream.
Sequences_部分首先是一个标头,然后是每个符号类型的可选概率表,然后是比特流。
Sequences_Section_Header [Literals_Length_Table] [Offset_Table] [Match_Length_Table] bitStream
Sequences_Section_Header [Literals_Length_Table] [Offset_Table] [Match_Length_Table] bitStream
To decode the Sequences_Section, it's necessary to know its size. This size is deduced from the size of the Literals_Section: Sequences_Section_Size = Block_Size - Literals_Section_Header - Literals_Section_Content
要解码序列_部分,必须知道它的大小。此大小是根据文本\u部分的大小推导得出的:序列\u部分\u大小=块\u大小-文本\u部分\u头-文本\u部分\u内容
This header consists of two items:
此标题由两项组成:
o Number_of_Sequences
o _序列的个数
o Symbol_Compression_Modes
o 符号\压缩\模式
Number_of_Sequences is a variable size field using between 1 and 3 bytes. If the first byte is "byte0":
_序列的数量_是一个大小可变的字段,使用1到3个字节。如果第一个字节为“字节0”:
o if (byte0 == 0): there are no sequences. The sequence section stops here. Decompressed content is defined entirely as Literals Section content. The FSE tables used in Repeat_Mode are not updated.
o if(byte0==0):没有序列。序列部分到此结束。解压缩内容完全定义为文本部分内容。重复模式下使用的FSE表不会更新。
o if (byte0 < 128): Number_of_Sequences = byte0. Uses 1 byte.
o if(字节0<128):字节序列的个数=字节0。使用1字节。
o if (byte0 < 255): Number_of_Sequences = ((byte0 - 128) << 8) + byte1. Uses 2 bytes.
o if(字节0<255):字节序列的个数=((字节0-128)<8)+字节1。使用2个字节。
o if (byte0 == 255): Number_of_Sequences = byte1 + (byte2 << 8) + 0x7F00. Uses 3 bytes.
o if(byte0==255):序列的个数=byte1+(byte2<<8)+0x7F00。使用3个字节。
Symbol_Compression_Modes is a single byte, defining the compression mode of each symbol type.
Symbol_Compression_Modes是一个字节,定义了每种符号类型的压缩模式。
+-------------+----------------------+ | Bit Number | Field Name | +-------------+----------------------+ | 7-6 | Literal_Lengths_Mode | +-------------+----------------------+ | 5-4 | Offsets_Mode | +-------------+----------------------+ | 3-2 | Match_Lengths_Mode | +-------------+----------------------+ | 1-0 | Reserved | +-------------+----------------------+
+-------------+----------------------+ | Bit Number | Field Name | +-------------+----------------------+ | 7-6 | Literal_Lengths_Mode | +-------------+----------------------+ | 5-4 | Offsets_Mode | +-------------+----------------------+ | 3-2 | Match_Lengths_Mode | +-------------+----------------------+ | 1-0 | Reserved | +-------------+----------------------+
The last field, Reserved, must be all zeroes.
保留的最后一个字段必须全为零。
Literals_Lengths_Mode, Offsets_Mode, and Match_Lengths_Mode define the Compression_Mode of literals lengths, offsets, and match lengths symbols, respectively. They follow the same enumeration:
文字长度模式、偏移量模式和匹配长度模式分别定义文字长度、偏移量和匹配长度符号的压缩模式。它们遵循相同的枚举:
+-------+---------------------+ | Value | Compression_Mode | +-------+---------------------+ | 0 | Predefined_Mode | +-------+---------------------+ | 1 | RLE_Mode | +-------+---------------------+ | 2 | FSE_Compressed_Mode | +-------+---------------------+ | 3 | Repeat_Mode | +-------+---------------------+
+-------+---------------------+ | Value | Compression_Mode | +-------+---------------------+ | 0 | Predefined_Mode | +-------+---------------------+ | 1 | RLE_Mode | +-------+---------------------+ | 2 | FSE_Compressed_Mode | +-------+---------------------+ | 3 | Repeat_Mode | +-------+---------------------+
Predefined_Mode: A predefined FSE (see Section 4.1) distribution table is used, as defined in Section 3.1.1.3.2.2. No distribution table will be present.
预定义的_模式:使用预定义的FSE(见第4.1节)分布表,如第3.1.1.3.2.2节所定义。没有分发表。
RLE_Mode: The table description consists of a single byte, which contains the symbol's value. This symbol will be used for all sequences.
RLE_模式:表格描述由单个字节组成,其中包含符号值。此符号将用于所有序列。
FSE_Compressed_Mode: Standard FSE compression. A distribution table will be present. The format of this distribution table is described in Section 4.1.1. Note that the maximum allowed accuracy log for literals length and match length tables is 9, and the maximum accuracy log for the offsets table is 8. This mode must not be used when only one symbol is present; RLE_Mode should be used instead (although any other mode will work).
FSE_压缩模式:标准FSE压缩。将提供分配表。第4.1.1节描述了该分配表的格式。请注意,文字长度和匹配长度表允许的最大精度日志为9,偏移量表允许的最大精度日志为8。只有一个符号时,不得使用此模式;应改用RLE_模式(尽管任何其他模式都可以)。
Repeat_Mode: The table used in the previous Compressed_Block with Number_Of_Sequences > 0 will be used again, or if this is the first block, the table in the dictionary will be used. Note that this includes RLE_Mode, so if Repeat_Mode follows RLE_Mode, the same symbol will be repeated. It also includes Predefined_Mode, in which case Repeat_Mode will have the same outcome as Predefined_Mode. No distribution table will be present. If this mode is used without any previous sequence table in the frame (or dictionary; see Section 5) to repeat, this should be treated as corruption.
Repeat_Mode(重复_模式):将再次使用前一个压缩的_块中使用的、_序列数>0的表,或者如果这是第一个块,则将使用字典中的表。请注意,这包括RLE_模式,因此如果重复_模式遵循RLE_模式,则相同的符号将重复。它还包括预定义的_模式,在这种情况下,重复_模式将与预定义的_模式具有相同的结果。没有分发表。如果使用此模式时框架(或字典;参见第5节)中没有任何先前的序列表重复,则应将其视为损坏。
Each symbol is a code in its own context, which specifies Baseline and Number_of_Bits to add. Codes are FSE compressed and interleaved with raw additional bits in the same bitstream.
每个符号都是其自身上下文中的一个代码,它指定要添加的基线和位的数量。代码经过FSE压缩,并与同一比特流中的原始附加比特交织。
Literals length codes are values ranging from 0 to 35 inclusive. They define lengths from 0 to 131071 bytes. The literals length is equal to the decoded Baseline plus the result of reading Number_of_Bits bits from the bitstream, as a little-endian value.
文字长度代码的值范围为0到35(含0到35)。它们定义从0到131071字节的长度。字面值长度等于解码基线加上从比特流中读取_比特数_的结果,作为一个小的endian值。
+----------------------+----------+----------------+ | Literals_Length_Code | Baseline | Number_of_Bits | +----------------------+----------+----------------+ | 0-15 | length | 0 | +----------------------+----------+----------------+ | 16 | 16 | 1 | +----------------------+----------+----------------+ | 17 | 18 | 1 | +----------------------+----------+----------------+ | 18 | 20 | 1 | +----------------------+----------+----------------+ | 19 | 22 | 1 | +----------------------+----------+----------------+ | 20 | 24 | 2 | +----------------------+----------+----------------+ | 21 | 28 | 2 | +----------------------+----------+----------------+ | 22 | 32 | 3 | +----------------------+----------+----------------+ | 23 | 40 | 3 | +----------------------+----------+----------------+ | 24 | 48 | 4 | +----------------------+----------+----------------+ | 25 | 64 | 6 | +----------------------+----------+----------------+ | 26 | 128 | 7 | +----------------------+----------+----------------+ | 27 | 256 | 8 | +----------------------+----------+----------------+ | 28 | 512 | 9 | +----------------------+----------+----------------+ | 29 | 1024 | 10 | +----------------------+----------+----------------+ | 30 | 2048 | 11 | +----------------------+----------+----------------+ | 31 | 4096 | 12 | +----------------------+----------+----------------+ | 32 | 8192 | 13 | +----------------------+----------+----------------+ | 33 | 16384 | 14 | +----------------------+----------+----------------+ | 34 | 32768 | 15 | +----------------------+----------+----------------+ | 35 | 65536 | 16 | +----------------------+----------+----------------+
+----------------------+----------+----------------+ | Literals_Length_Code | Baseline | Number_of_Bits | +----------------------+----------+----------------+ | 0-15 | length | 0 | +----------------------+----------+----------------+ | 16 | 16 | 1 | +----------------------+----------+----------------+ | 17 | 18 | 1 | +----------------------+----------+----------------+ | 18 | 20 | 1 | +----------------------+----------+----------------+ | 19 | 22 | 1 | +----------------------+----------+----------------+ | 20 | 24 | 2 | +----------------------+----------+----------------+ | 21 | 28 | 2 | +----------------------+----------+----------------+ | 22 | 32 | 3 | +----------------------+----------+----------------+ | 23 | 40 | 3 | +----------------------+----------+----------------+ | 24 | 48 | 4 | +----------------------+----------+----------------+ | 25 | 64 | 6 | +----------------------+----------+----------------+ | 26 | 128 | 7 | +----------------------+----------+----------------+ | 27 | 256 | 8 | +----------------------+----------+----------------+ | 28 | 512 | 9 | +----------------------+----------+----------------+ | 29 | 1024 | 10 | +----------------------+----------+----------------+ | 30 | 2048 | 11 | +----------------------+----------+----------------+ | 31 | 4096 | 12 | +----------------------+----------+----------------+ | 32 | 8192 | 13 | +----------------------+----------+----------------+ | 33 | 16384 | 14 | +----------------------+----------+----------------+ | 34 | 32768 | 15 | +----------------------+----------+----------------+ | 35 | 65536 | 16 | +----------------------+----------+----------------+
Match length codes are values ranging from 0 to 52 inclusive. They define lengths from 3 to 131074 bytes. The match length is equal to the decoded Baseline plus the result of reading Number_of_Bits bits from the bitstream, as a little-endian value.
匹配长度代码的值范围为0到52(含0到52)。它们定义了从3到131074字节的长度。匹配长度等于解码基线加上从比特流中读取的_比特数_的结果,作为一个小的endian值。
+-------------------+-----------------------+----------------+ | Match_Length_Code | Baseline | Number_of_Bits | +-------------------+-----------------------+----------------+ | 0-31 | Match_Length_Code + 3 | 0 | +-------------------+-----------------------+----------------+ | 32 | 35 | 1 | +-------------------+-----------------------+----------------+ | 33 | 37 | 1 | +-------------------+-----------------------+----------------+ | 34 | 39 | 1 | +-------------------+-----------------------+----------------+ | 35 | 41 | 1 | +-------------------+-----------------------+----------------+ | 36 | 43 | 2 | +-------------------+-----------------------+----------------+ | 37 | 47 | 2 | +-------------------+-----------------------+----------------+ | 38 | 51 | 3 | +-------------------+-----------------------+----------------+ | 39 | 59 | 3 | +-------------------+-----------------------+----------------+ | 40 | 67 | 4 | +-------------------+-----------------------+----------------+ | 41 | 83 | 4 | +-------------------+-----------------------+----------------+ | 42 | 99 | 5 | +-------------------+-----------------------+----------------+ | 43 | 131 | 7 | +-------------------+-----------------------+----------------+ | 44 | 259 | 8 | +-------------------+-----------------------+----------------+ | 45 | 515 | 9 | +-------------------+-----------------------+----------------+ | 46 | 1027 | 10 | +-------------------+-----------------------+----------------+ | 47 | 2051 | 11 | +-------------------+-----------------------+----------------+ | 48 | 4099 | 12 | +-------------------+-----------------------+----------------+ | 49 | 8195 | 13 | +-------------------+-----------------------+----------------+ | 50 | 16387 | 14 | +-------------------+-----------------------+----------------+ | 51 | 32771 | 15 | +-------------------+-----------------------+----------------+ | 52 | 65539 | 16 | +-------------------+-----------------------+----------------+
+-------------------+-----------------------+----------------+ | Match_Length_Code | Baseline | Number_of_Bits | +-------------------+-----------------------+----------------+ | 0-31 | Match_Length_Code + 3 | 0 | +-------------------+-----------------------+----------------+ | 32 | 35 | 1 | +-------------------+-----------------------+----------------+ | 33 | 37 | 1 | +-------------------+-----------------------+----------------+ | 34 | 39 | 1 | +-------------------+-----------------------+----------------+ | 35 | 41 | 1 | +-------------------+-----------------------+----------------+ | 36 | 43 | 2 | +-------------------+-----------------------+----------------+ | 37 | 47 | 2 | +-------------------+-----------------------+----------------+ | 38 | 51 | 3 | +-------------------+-----------------------+----------------+ | 39 | 59 | 3 | +-------------------+-----------------------+----------------+ | 40 | 67 | 4 | +-------------------+-----------------------+----------------+ | 41 | 83 | 4 | +-------------------+-----------------------+----------------+ | 42 | 99 | 5 | +-------------------+-----------------------+----------------+ | 43 | 131 | 7 | +-------------------+-----------------------+----------------+ | 44 | 259 | 8 | +-------------------+-----------------------+----------------+ | 45 | 515 | 9 | +-------------------+-----------------------+----------------+ | 46 | 1027 | 10 | +-------------------+-----------------------+----------------+ | 47 | 2051 | 11 | +-------------------+-----------------------+----------------+ | 48 | 4099 | 12 | +-------------------+-----------------------+----------------+ | 49 | 8195 | 13 | +-------------------+-----------------------+----------------+ | 50 | 16387 | 14 | +-------------------+-----------------------+----------------+ | 51 | 32771 | 15 | +-------------------+-----------------------+----------------+ | 52 | 65539 | 16 | +-------------------+-----------------------+----------------+
Offset codes are values ranging from 0 to N.
偏移代码是从0到N的值。
A decoder is free to limit its maximum supported value for N. Support for values of at least 22 is recommended. At the time of this writing, the reference decoder supports a maximum N value of 31.
解码器可自由将其最大支持值限制为N。建议支持至少22的值。在撰写本文时,参考解码器支持最大N值31。
An offset code is also the number of additional bits to read in little-endian fashion and can be translated into an Offset_Value using the following formulas:
偏移量代码也是以小端方式读取的附加位数,可以使用以下公式转换为偏移量_值:
Offset_Value = (1 << offsetCode) + readNBits(offsetCode); if (Offset_Value > 3) Offset = Offset_Value - 3;
Offset_Value = (1 << offsetCode) + readNBits(offsetCode); if (Offset_Value > 3) Offset = Offset_Value - 3;
This means that maximum Offset_Value is (2^(N+1))-1, supporting back-reference distance up to (2^(N+1))-4, but it is limited by the maximum back-reference distance (see Section 3.1.1.1.2).
这意味着最大偏移量_值为(2^(N+1))-1,支持高达(2^(N+1))-4的后参考距离,但受最大后参考距离的限制(见第3.1.1.2节)。
Offset_Value from 1 to 3 are special: they define "repeat codes". This is described in more detail in Section 3.1.1.5.
从1到3的偏移值是特殊的:它们定义了“重复代码”。第3.1.1.5节对此进行了更详细的描述。
FSE bitstreams are read in reverse of the direction they are written. In zstd, the compressor writes bits forward into a block, and the decompressor must read the bitstream backwards.
FSE比特流的读取方向与写入方向相反。在zstd中,压缩器将位向前写入块,而解压缩器必须向后读取位流。
To find the start of the bitstream, it is therefore necessary to know the offset of the last byte of the block, which can be found by counting Block_Size bytes after the block header.
为了找到比特流的开始,因此有必要知道块的最后一个字节的偏移量,这可以通过在块头之后计算块大小字节来找到。
After writing the last bit containing information, the compressor writes a single 1 bit and then fills the byte with 0-7 zero bits of padding. The last byte of the compressed bitstream cannot be zero for that reason.
写入包含信息的最后一位后,压缩器写入单个1位,然后用0-7个零位填充字节。因此,压缩比特流的最后一个字节不能为零。
When decompressing, the last byte containing the padding is the first byte to read. The decompressor needs to skip 0-7 initial zero bits until the first 1 bit occurs. Afterwards, the useful part of the bitstream begins.
解压缩时,包含填充的最后一个字节是要读取的第一个字节。解压器需要跳过0-7个初始零位,直到出现第一个1位。之后,比特流的有用部分开始。
FSE decoding requires a 'state' to be carried from symbol to symbol. For more explanation on FSE decoding, see Section 4.1.
FSE解码要求从一个符号到另一个符号携带“状态”。有关FSE解码的更多说明,请参见第4.1节。
For sequence decoding, a separate state keeps track of each literal lengths, offsets, and match lengths symbols. Some FSE primitives are also used. For more details on the operation of these primitives, see Section 4.1.
对于序列解码,单独的状态跟踪每个文字长度、偏移量和匹配长度符号。还使用了一些FSE原语。有关这些原语操作的更多详细信息,请参见第4.1节。
The bitstream starts with initial FSE state values, each using the required number of bits in their respective accuracy, decoded previously from their normalized distribution. It starts with Literals_Length_State, followed by Offset_State, and finally Match_Length_State.
比特流从初始FSE状态值开始,每个值使用先前从其标准化分布解码的各自精度所需的比特数。它从文字长度状态开始,然后是偏移量状态,最后是匹配长度状态。
Note that all values are read backward, so the 'start' of the bitstream is at the highest position in memory, immediately before the last 1 bit for padding.
请注意,所有值都是向后读取的,因此位流的“开始”位于内存中的最高位置,紧靠用于填充的最后1位之前。
After decoding the starting states, a single sequence is decoded Number_Of_Sequences times. These sequences are decoded in order from first to last. Since the compressor writes the bitstream in the forward direction, this means the compressor must encode the sequences starting with the last one and ending with the first.
在解码开始状态之后,单个序列被解码若干次。这些序列是按照从头到尾的顺序解码的。由于压缩器以正向写入比特流,这意味着压缩器必须编码从最后一个开始到第一个结束的序列。
For each of the symbol types, the FSE state can be used to determine the appropriate code. The code then defines the Baseline and Number_of_Bits to read for each type. The description of the codes for how to determine these values can be found in Section 3.1.1.3.2.1.
对于每种符号类型,FSE状态可用于确定适当的代码。然后,代码为每种类型定义基线和要读取的字节数。有关如何确定这些值的代码说明,请参见第3.1.1.3.2.1节。
Decoding starts by reading the Number_of_Bits required to decode offset. It does the same for Match_Length and then for Literals_Length. This sequence is then used for Sequence Execution (see Section 3.1.1.4).
解码首先读取解码偏移量所需的\u位数。它对Match_Length和Literals_Length执行相同的操作。然后将该序列用于序列执行(见第3.1.1.4节)。
If it is not the last sequence in the block, the next operation is to update states. Using the rules pre-calculated in the decoding tables, Literals_Length_State is updated, followed by Match_Length_State, and then Offset_State. See Section 4.1 for details on how to update states from the bitstream.
如果不是块中的最后一个序列,则下一个操作是更新状态。使用解码表中预先计算的规则,更新文字的长度状态,然后更新匹配长度状态,然后更新偏移量状态。有关如何从比特流更新状态的详细信息,请参见第4.1节。
This operation will be repeated Number_of_Sequences times. At the end, the bitstream shall be entirely consumed; otherwise, the bitstream is considered corrupted.
此操作将重复\u个\u序列次。最后,比特流将被完全消耗;否则,位流将被视为已损坏。
If Predefined_Mode is selected for a symbol type, its FSE decoding table is generated from a predefined distribution table defined here. For details on how to convert this distribution into a decoding table, see Section 4.1.
如果为符号类型选择了预定义的_模式,则其FSE解码表将根据此处定义的预定义分布表生成。有关如何将此分布转换为解码表的详细信息,请参见第4.1节。
The decoding table uses an accuracy log of 6 bits (64 states).
解码表使用6位(64个状态)的精度日志。
short literalsLength_defaultDistribution[36] = { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1, -1,-1,-1,-1 };
短文字长度分布[36]={4,3,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,2,2,2,2,2,2,2,2,3,2,1,1,1,1,-1,-1,1};
The decoding table uses an accuracy log of 6 bits (64 states).
解码表使用6位(64个状态)的精度日志。
short matchLengths_defaultDistribution[53] = { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1, -1,-1,-1,-1,-1 };
短匹配长度\u defaultDistribution[53]={1,4,3,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1};
The decoding table uses an accuracy log of 5 bits (32 states), and supports a maximum N value of 28, allowing offset values up to 536,870,908.
解码表使用5位(32个状态)的精度日志,并支持最大N值28,允许偏移值高达536870908。
If any sequence in the compressed block requires a larger offset than this, it's not possible to use the default distribution to represent it.
如果压缩块中的任何序列需要大于此值的偏移量,则不可能使用默认分布来表示它。
short offsetCodes_defaultDistribution[29] = { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
短偏移量代码_defaultDistribution[29]={1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1;
Once literals and sequences have been decoded, they are combined to produce the decoded content of a block.
一旦文字和序列被解码,它们就被组合起来产生块的解码内容。
Each sequence consists of a tuple of (literals_length, offset_value, match_length), decoded as described in the Sequences_Section (Section 3.1.1.3.2). To execute a sequence, first copy literals_length bytes from the decoded literals to the output.
每个序列由一个元组(文字长度、偏移量值、匹配长度)组成,按照序列部分(第3.1.1.3.2节)中的说明进行解码。要执行一个序列,首先将已解码的文本复制到输出。
Then, match_length bytes are copied from previous decoded data. The offset to copy from is determined by offset_value:
然后,从先前解码的数据复制匹配长度字节。要从中复制的偏移量由偏移量_值确定:
o if Offset_Value > 3, then the offset is Offset_Value - 3;
o 如果偏移量_值>3,则偏移量为偏移量_值-3;
o if Offset_Value is from 1-3, the offset is a special repeat offset value. See Section 3.1.1.5 for how the offset is determined in this case.
o 如果偏移量_值为1-3,则偏移量为特殊的重复偏移量值。关于这种情况下如何确定偏移量,请参见第3.1.1.5节。
The offset is defined as from the current position (after copying the literals), so an offset of 6 and a match length of 3 means that 3 bytes should be copied from 6 bytes back. Note that all offsets leading to previously decoded data must be smaller than Window_Size defined in Frame_Header_Descriptor (Section 3.1.1.1.1).
偏移量定义为从当前位置开始(复制文本后),因此偏移量为6且匹配长度为3意味着应从6个字节向后复制3个字节。请注意,导致先前解码数据的所有偏移量必须小于帧头描述符(第3.1.1.1.1节)中定义的窗口大小。
As seen above, the first three values define a repeated offset; we will call them Repeated_Offset1, Repeated_Offset2, and Repeated_Offset3. They are sorted in recency order, with Repeated_Offset1 meaning "most recent one".
如上所述,前三个值定义了重复偏移;我们将它们称为重复偏移1、重复偏移2和重复偏移3。它们按最近顺序排序,重复的_Offset1表示“最近的一个”。
If offset_value is 1, then the offset used is Repeated_Offset1, etc.
如果偏移量_值为1,则使用的偏移量将重复_Offset1,以此类推。
There is one exception: When the current sequence's literals_length is 0, repeated offsets are shifted by 1, so an offset_value of 1 means Repeated_Offset2, an offset_value of 2 means Repeated_Offset3, and an offset_value of 3 means Repeated_Offset1 - 1_byte.
有一个例外:当当前序列的文本长度为0时,重复的偏移量被移位1,因此偏移量值1表示重复的偏移量2,偏移量值2表示重复的偏移量3,偏移量值3表示重复的偏移量1-1字节。
For the first block, the starting offset history is populated with the following values: Repeated_Offset1 (1), Repeated_Offset2 (4), and Repeated_Offset3 (8), unless a dictionary is used, in which case they come from the dictionary.
对于第一个块,起始偏移历史使用以下值填充:Repeated_Offset1(1)、Repeated_Offset2(4)和Repeated_Offset3(8),除非使用字典,在这种情况下,它们来自字典。
Then each block gets its starting offset history from the ending values of the most recent Compressed_Block. Note that blocks that are not Compressed_Block are skipped; they do not contribute to offset history.
然后,每个块从最近压缩的块的结束值获取其起始偏移历史。请注意,跳过未压缩的块;它们不会抵消历史。
The newest offset takes the lead in offset history, shifting others back (up to its previous place if it was already present). This means that when Repeated_Offset1 (most recent) is used, history is unmodified. When Repeated_Offset2 is used, it is swapped with Repeated_Offset1. If any other offset is used, it becomes Repeated_Offset1, and the rest are shifted back by 1.
最新的偏移量在偏移量历史中处于领先地位,将其他偏移量移回原来的位置(如果已经存在的话)。这意味着当使用重复的_Offset1(最近的)时,历史不会被修改。使用重复_Offset2时,它将与重复_Offset1交换。如果使用任何其他偏移量,它将变为重复的偏移量1,其余偏移量向后移动1。
+--------------+------------+-----------+ | Magic_Number | Frame_Size | User_Data | +--------------+------------+-----------+ | 4 bytes | 4 bytes | n bytes | +--------------+------------+-----------+
+--------------+------------+-----------+ | Magic_Number | Frame_Size | User_Data | +--------------+------------+-----------+ | 4 bytes | 4 bytes | n bytes | +--------------+------------+-----------+
Skippable frames allow the insertion of user-defined metadata into a flow of concatenated frames.
可跳过帧允许将用户定义的元数据插入到连接帧流中。
Skippable frames defined in this specification are compatible with skippable frames in [LZ4].
本规范中定义的可跳过帧与[LZ4]中的可跳过帧兼容。
From a compliant decoder perspective, skippable frames simply need to be skipped, and their content ignored, resuming decoding after the skippable frame.
从兼容解码器的角度来看,只需跳过可跳过帧,忽略其内容,即可在可跳过帧之后恢复解码。
It should be noted that a skippable frame can be used to watermark a stream of concatenated frames embedding any kind of tracking information (even just a Universally Unique Identifier (UUID)). Users wary of such possibility should scan the stream of concatenated frames in an attempt to detect such frames for analysis or removal.
应当注意,可跳过帧可用于对嵌入任何类型的跟踪信息(甚至只是通用唯一标识符(UUID))的级联帧流进行水印。对这种可能性持谨慎态度的用户应扫描连接帧流,尝试检测此类帧以进行分析或删除。
The fields are:
这些字段是:
Magic_Number: 4 bytes, little-endian format. Value: 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F. All 16 values are valid to identify a skippable frame. This specification does not detail any specific tagging methods for skippable frames.
魔数:4字节,小尾端格式。值:0x184D2A5?,表示从0x184D2A50到0x184D2A5F的任意值。所有16个值都可用于标识可跳过帧。本规范未详细说明可跳过帧的任何特定标记方法。
Frame_Size: This is the size, in bytes, of the following User_Data (without including the magic number nor the size field itself). This field is represented using 4 bytes, little-endian format, unsigned 32 bits. This means User_Data can't be bigger than (2^32-1) bytes.
Frame_Size:这是以下用户_数据的大小(以字节为单位)(不包括幻数或大小字段本身)。此字段使用4字节、小尾端格式、无符号32位表示。这意味着用户数据不能大于(2^32-1)字节。
User_Data: This field can be anything. Data will just be skipped by the decoder.
用户数据:此字段可以是任何内容。解码器只会跳过数据。
Two types of entropy encoding are used by the Zstandard format: FSE and Huffman coding. Huffman is used to compress literals, while FSE is used for all other symbols (Literals_Length_Code, Match_Length_Code, and offset codes) and to compress Huffman headers.
Zstandard格式使用两种类型的熵编码:FSE和Huffman编码。哈夫曼用于压缩文字,而FSE用于所有其他符号(文字长度代码、匹配长度代码和偏移量代码)和压缩哈夫曼头。
FSE, short for Finite State Entropy, is an entropy codec based on [ANS]. FSE encoding/decoding involves a state that is carried over between symbols, so decoding must be done in the opposite direction as encoding. Therefore, all FSE bitstreams are read from end to beginning. Note that the order of the bits in the stream is not reversed; they are simply read in the reverse order from which they were written.
FSE是有限状态熵的缩写,是基于[ANS]的熵编解码器。FSE编码/解码涉及在符号之间进行的状态,因此解码必须以编码的相反方向进行。因此,所有FSE比特流都是从头到尾读取的。注意,流中位的顺序没有反转;它们只是按照写它们的相反顺序读。
For additional details on FSE, see Finite State Entropy [FSE].
有关FSE的更多详细信息,请参见有限状态熵[FSE]。
FSE decoding involves a decoding table that has a power of 2 size and contains three elements: Symbol, Num_Bits, and Baseline. The base 2 logarithm of the table size is its Accuracy_Log. An FSE state value represents an index in this table.
FSE解码涉及一个解码表,该表的幂为2,包含三个元素:符号、Num_位和基线。表大小的以2为底的对数是其精度。FSE状态值表示此表中的索引。
To obtain the initial state value, consume Accuracy_Log bits from the stream as a little-endian value. The next symbol in the stream is the Symbol indicated in the table for that state. To obtain the next state value, the decoder should consume Num_Bits bits from the stream as a little-endian value and add it to Baseline.
为了获得初始状态值,将流中的日志位作为一个小的endian值使用。流中的下一个符号是该状态表中指示的符号。为了获得下一个状态值,解码器应该使用流中的Num_位作为一个小的endian值,并将其添加到基线。
To decode FSE streams, it is necessary to construct the decoding table. The Zstandard format encodes FSE table descriptions as described here.
为了解码FSE流,需要构造解码表。ZS标准格式对FSE表描述进行编码,如下所述。
An FSE distribution table describes the probabilities of all symbols from 0 to the last present one (included) on a normalized scale of (1 << Accuracy_Log). Note that there must be two or more symbols with non-zero probability.
FSE分布表描述了从0到最后一个符号(包括在内)的所有符号的概率,其标准化标度为(1<<准确度\u Log)。请注意,必须有两个或多个非零概率符号。
A bitstream is read forward, in little-endian fashion. It is not necessary to know its exact size, since the size will be discovered and reported by the decoding process. The bitstream starts by reporting on which scale it operates. If low4bits designates the lowest 4 bits of the first byte, then Accuracy_Log = low4bits + 5.
比特流以小端方式向前读取。无需知道其确切大小,因为解码过程将发现并报告大小。比特流首先报告其运行的规模。如果low4bits指定第一个字节的最低4位,则精度=low4bits+5。
This is followed by each symbol value, from 0 to the last present one. The number of bits used by each field is variable and depends on:
然后是每个符号值,从0到最后一个符号值。每个字段使用的位数是可变的,取决于:
Remaining probabilities + 1: For example, presuming an Accuracy_Log of 8, and presuming 100 probabilities points have already been distributed, the decoder may read any value from 0 to (256 - 100 + 1) == 157, inclusive. Therefore, it must read log2sup(157) == 8 bits.
剩余概率+1:例如,假设精度为8,并且假设已经分布了100个概率点,解码器可以读取从0到(256-100+1)==157的任何值。因此,它必须读取log2sup(157)==8位。
Value decoded: Small values use 1 fewer bit. For example, presuming values from 0 to 157 (inclusive) are possible, 255 - 157 = 98 values are remaining in an 8-bit field. The first 98 values (hence from 0 to 97) use only 7 bits, and values from 98 to 157 use 8 bits. This is achieved through this scheme:
解码值:较小的值使用较少的1位。例如,假设值从0到157(包括0到157)是可能的,255-157=98个值保留在8位字段中。前98个值(因此从0到97)仅使用7位,而从98到157的值使用8位。这是通过该计划实现的:
+------------+---------------+-----------+ | Value Read | Value Decoded | Bits Used | +------------+---------------+-----------+ | 0 - 97 | 0 - 97 | 7 | +------------+---------------+-----------+ | 98 - 127 | 98 - 127 | 8 | +------------+---------------+-----------+ | 128 - 225 | 0 - 97 | 7 | +------------+---------------+-----------+ | 226 - 255 | 128 - 157 | 8 | +------------+---------------+-----------+
+------------+---------------+-----------+ | Value Read | Value Decoded | Bits Used | +------------+---------------+-----------+ | 0 - 97 | 0 - 97 | 7 | +------------+---------------+-----------+ | 98 - 127 | 98 - 127 | 8 | +------------+---------------+-----------+ | 128 - 225 | 0 - 97 | 7 | +------------+---------------+-----------+ | 226 - 255 | 128 - 157 | 8 | +------------+---------------+-----------+
Symbol probabilities are read one by one, in order. The probability is obtained from Value decoded using the formula P = Value - 1. This means the value 0 becomes the negative probability -1. This is a special probability that means "less than 1". Its effect on the distribution table is described below. For the purpose of calculating total allocated probability points, it counts as 1.
符号概率按顺序逐个读取。概率由使用公式P=值-1解码的值获得。这意味着值0变为负概率-1。这是一个特殊的概率,表示“小于1”。其对分布表的影响如下所述。为了计算分配的总概率点,它计为1。
When a symbol has a probability of zero, it is followed by a 2-bit repeat flag. This repeat flag tells how many probabilities of zeroes follow the current one. It provides a number ranging from 0 to 3. If it is a 3, another 2-bit repeat flag follows, and so on.
当一个符号的概率为零时,它后面跟着一个2位重复标志。这个repeat标志告诉我们有多少个零的概率跟在当前的后面。它提供了一个从0到3的数字。如果它是一个3,另一个2位重复标志随后出现,依此类推。
When the last symbol reaches a cumulated total of (1 << Accuracy_Log), decoding is complete. If the last symbol makes the cumulated total go above (1 << Accuracy_Log), distribution is considered corrupted.
当最后一个符号达到累计总数(1<<精度\u Log)时,解码完成。如果最后一个符号使累计总数超过(1<<精度\u Log),则认为分布已损坏。
Finally, the decoder can tell how many bytes were used in this process and how many symbols are present. The bitstream consumes a round number of bytes. Any remaining bit within the last byte is simply unused.
最后,解码器可以知道在这个过程中使用了多少字节以及存在多少符号。比特流消耗的字节数为整数。最后一个字节中的任何剩余位都是未使用的。
The distribution of normalized probabilities is enough to create a unique decoding table. The table has a size of (1 << Accuracy_Log). Each cell describes the symbol decoded and instructions to get the next state.
归一化概率的分布足以创建唯一的解码表。该表的大小为(1<<精度\对数)。每个单元描述解码的符号和获取下一个状态的指令。
Symbols are scanned in their natural order for "less than 1" probabilities as described above. Symbols with this probability are being attributed a single cell, starting from the end of the table and retreating. These symbols define a full state reset, reading Accuracy_Log bits.
如上所述,按照符号的自然顺序扫描“小于1”的概率。具有这种概率的符号被归于一个单元格,从表的末尾开始,然后后退。这些符号定义了完全状态重置,读取日志位的精度。
All remaining symbols are allocated in their natural order. Starting from symbol 0 and table position 0, each symbol gets allocated as many cells as its probability. Cell allocation is spread, not linear; each successor position follows this rule:
所有剩余符号按其自然顺序分配。从符号0和表位置0开始,每个符号分配的单元格数量与其概率相同。小区分配是分散的,不是线性的;每个继任职位都遵循这一规则:
position += (tableSize >> 1) + (tableSize >> 3) + 3; position &= tableSize - 1;
position += (tableSize >> 1) + (tableSize >> 3) + 3; position &= tableSize - 1;
A position is skipped if it is already occupied by a "less than 1" probability symbol. Position does not reset between symbols; it simply iterates through each position in the table, switching to the next symbol when enough states have been allocated to the current one.
如果某个位置已被“小于1”的概率符号占用,则跳过该位置。位置不会在符号之间重置;它只是遍历表中的每个位置,在为当前位置分配了足够的状态后切换到下一个符号。
The result is a list of state values. Each state will decode the current symbol.
结果是一个状态值列表。每个状态将解码当前符号。
To get the Number_of_Bits and Baseline required for the next state, it is first necessary to sort all states in their natural order. The lower states will need 1 more bit than higher ones. The process is repeated for each symbol.
要获得下一个状态所需的\u比特数和基线,首先需要按其自然顺序对所有状态进行排序。较低的状态需要比较高的状态多1位。对每个符号重复该过程。
For example, presuming a symbol has a probability of 5, it receives five state values. States are sorted in natural order. The next power of 2 is 8. The space of probabilities is divided into 8 equal parts. Presuming the Accuracy_Log is 7, this defines 128 states, and each share (divided by 8) is 16 in size. In order to reach 8, 8 - 5 = 3 lowest states will count "double", doubling the number of shares (32 in width), requiring 1 more bit in the process.
例如,假设一个符号的概率为5,它会收到五个状态值。状态是按自然顺序排序的。2的次幂是8。概率空间被分成8个相等的部分。假设精度为7,这定义了128个状态,每个共享(除以8)的大小为16。为了达到8,8-5=3个最低状态将计数为“double”,使共享数(32个宽度)增加一倍,需要在该过程中增加1位。
Baseline is assigned starting from the higher states using fewer bits, and proceeding naturally, then resuming at the first state, each taking its allocated width from Baseline.
基线从较高的状态开始分配,使用较少的位,然后自然进行,然后在第一个状态下恢复,每个状态从基线获取其分配的宽度。
+----------------+-------+-------+--------+------+-------+ | state order | 0 | 1 | 2 | 3 | 4 | +----------------+-------+-------+--------+------+-------+ | width | 32 | 32 | 32 | 16 | 16 | +----------------+-------+-------+--------+------+-------+ | Number_of_Bits | 5 | 5 | 5 | 4 | 4 | +----------------+-------+-------+--------+------+-------+ | range number | 2 | 4 | 6 | 0 | 1 | +----------------+-------+-------+--------+------+-------+ | Baseline | 32 | 64 | 96 | 0 | 16 | +----------------+-------+-------+--------+------+-------+ | range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 | +----------------+-------+-------+--------+------+-------+
+----------------+-------+-------+--------+------+-------+ | state order | 0 | 1 | 2 | 3 | 4 | +----------------+-------+-------+--------+------+-------+ | width | 32 | 32 | 32 | 16 | 16 | +----------------+-------+-------+--------+------+-------+ | Number_of_Bits | 5 | 5 | 5 | 4 | 4 | +----------------+-------+-------+--------+------+-------+ | range number | 2 | 4 | 6 | 0 | 1 | +----------------+-------+-------+--------+------+-------+ | Baseline | 32 | 64 | 96 | 0 | 16 | +----------------+-------+-------+--------+------+-------+ | range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 | +----------------+-------+-------+--------+------+-------+
The next state is determined from the current state by reading the required Number_of_Bits and adding the specified Baseline.
通过读取所需数量的\u位并添加指定的基线,从当前状态确定下一个状态。
See Appendix A for the results of this process that are applied to the default distributions.
有关应用于默认分布的此过程的结果,请参见附录A。
Zstandard Huffman-coded streams are read backwards, similar to the FSE bitstreams. Therefore, to find the start of the bitstream, it is necessary to know the offset of the last byte of the Huffman-coded stream.
Zstandard Huffman编码流是向后读取的,类似于FSE比特流。因此,为了找到比特流的开始,有必要知道哈夫曼编码流的最后一个字节的偏移量。
After writing the last bit containing information, the compressor writes a single 1 bit and then fills the byte with 0-7 0 bits of padding. The last byte of the compressed bitstream cannot be 0 for that reason.
写入包含信息的最后一位后,压缩器写入一个1位,然后用0-70位的填充填充字节。因此,压缩比特流的最后一个字节不能为0。
When decompressing, the last byte containing the padding is the first byte to read. The decompressor needs to skip 0-7 initial 0 bits and the first 1 bit that occurs. Afterwards, the useful part of the bitstream begins.
解压缩时,包含填充的最后一个字节是要读取的第一个字节。解压器需要跳过0-7个初始0位和出现的第一个1位。之后,比特流的有用部分开始。
The bitstream contains Huffman-coded symbols in little-endian order, with the codes defined by the method below.
比特流包含小端顺序的哈夫曼编码符号,其代码由以下方法定义。
Prefix coding represents symbols from an a priori known alphabet by bit sequences (codewords), one codeword for each symbol, in a manner such that different symbols may be represented by bit sequences of different lengths, but a parser can always parse an encoded string unambiguously symbol by symbol.
前缀编码通过位序列(码字)表示来自先验已知字母表的符号,每个符号对应一个码字,其方式使得不同的符号可以由不同长度的位序列表示,但是解析器始终可以逐个符号明确地解析编码字符串。
Given an alphabet with known symbol frequencies, the Huffman algorithm allows the construction of an optimal prefix code using the fewest bits of any possible prefix codes for that alphabet.
给定一个具有已知符号频率的字母表,哈夫曼算法允许使用该字母表中任何可能的前缀码的最少比特来构造最佳前缀码。
The prefix code must not exceed a maximum code length. More bits improve accuracy but yield a larger header size and require more memory or more complex decoding operations. This specification limits the maximum code length to 11 bits.
前缀代码不得超过最大代码长度。更多比特可以提高精度,但会产生更大的报头大小,并且需要更多内存或更复杂的解码操作。本规范将最大代码长度限制为11位。
All literal values from zero (included) to the last present one (excluded) are represented by Weight with values from 0 to Max_Number_of_Bits. Transformation from Weight to Number_of_Bits follows this pseudocode:
从零(包括)到最后一个当前值(排除)的所有文字值都由权重表示,其值从0到最大\u位数\u。_比特从权重到数量的转换遵循以下伪码:
if Weight == 0 Number_of_Bits = 0 else Number_of_Bits = Max_Number_of_Bits + 1 - Weight
if Weight == 0 Number_of_Bits = 0 else Number_of_Bits = Max_Number_of_Bits + 1 - Weight
The last symbol's Weight is deduced from previously decoded ones, by completing to the nearest power of 2. This power of 2 gives Max_Number_of_Bits the depth of the current tree.
最后一个符号的权重是从先前解码的符号中推导出来的,通过完成到最接近的2次方。这个2的幂为Max_Number_of_Bits提供了当前树的深度。
For example, presume the following Huffman tree must be described:
例如,假设必须描述以下哈夫曼树:
+---------------+----------------+ | Literal Value | Number_of_Bits | +---------------+----------------+ | 0 | 1 | +---------------+----------------+ | 1 | 2 | +---------------+----------------+ | 2 | 3 | +---------------+----------------+ | 3 | 0 | +---------------+----------------+ | 4 | 4 | +---------------+----------------+ | 5 | 4 | +---------------+----------------+
+---------------+----------------+ | Literal Value | Number_of_Bits | +---------------+----------------+ | 0 | 1 | +---------------+----------------+ | 1 | 2 | +---------------+----------------+ | 2 | 3 | +---------------+----------------+ | 3 | 0 | +---------------+----------------+ | 4 | 4 | +---------------+----------------+ | 5 | 4 | +---------------+----------------+
The tree depth is 4, since its longest element uses 4 bits. (The longest elements are those with the smallest frequencies.) Value 5 will not be listed as it can be determined from the values for 0-4, nor will values above 5 as they are all 0. Values from 0 to 4 will be listed using Weight instead of Number_of_Bits. The pseudocode to determine Weight is:
树深度为4,因为其最长元素使用4位。(最长的元素是频率最小的元素。)不会列出值5,因为它可以从0-4的值确定,也不会列出高于5的值,因为它们都是0。0到4之间的值将使用权重而不是位的数量列出。确定重量的伪代码为:
if Number_of_Bits == 0 Weight = 0 else Weight = Max_Number_of_Bits + 1 - Number_of_Bits
if Number_of_Bits == 0 Weight = 0 else Weight = Max_Number_of_Bits + 1 - Number_of_Bits
It gives the following series of weights:
它给出了以下一系列权重:
+---------------+--------+ | Literal Value | Weight | +---------------+--------+ | 0 | 4 | +---------------+--------+ | 1 | 3 | +---------------+--------+ | 2 | 2 | +---------------+--------+ | 3 | 0 | +---------------+--------+ | 4 | 1 | +---------------+--------+
+---------------+--------+ | Literal Value | Weight | +---------------+--------+ | 0 | 4 | +---------------+--------+ | 1 | 3 | +---------------+--------+ | 2 | 2 | +---------------+--------+ | 3 | 0 | +---------------+--------+ | 4 | 1 | +---------------+--------+
The decoder will do the inverse operation: having collected weights of literals from 0 to 4, it knows the last literal, 5, is present with a non-zero Weight. The Weight of 5 can be determined by advancing to the next power of 2. The sum of 2^(Weight-1) (excluding 0's) is 15. The nearest power of 2 is 16. Therefore, Max_Number_of_Bits = 4 and Weight[5] = 16 - 15 = 1.
解码器将执行相反的操作:收集了从0到4的文本权重后,它知道最后一个文本5具有非零权重。5的重量可以通过前进到下一次幂2来确定。2^(权重-1)(不包括0)之和为15。2的最近幂是16。因此,位的最大数量=4,权重[5]=16-15=1。
This is a single byte value (0-255), which describes how the series of weights is encoded.
这是一个单字节值(0-255),用于描述权重序列的编码方式。
headerByte < 128: The series of weights is compressed using FSE (see below). The length of the FSE-compressed series is equal to headerByte (0-127).
headerByte<128:使用FSE压缩权重序列(见下文)。FSE压缩系列的长度等于headerByte(0-127)。
headerByte >= 128: This is a direct representation, where each Weight is written directly as a 4-bit field (0-15). They are encoded forward, 2 weights to a byte with the first weight taking the top 4 bits and the second taking the bottom 4; for example, the following operations could be used to read the weights:
headerByte>=128:这是一种直接表示法,其中每个权重直接写入一个4位字段(0-15)。它们被向前编码,2个权值对应一个字节,第一个权值取前4位,第二个权值取下4位;例如,以下操作可用于读取权重:
Weight[0] = (Byte[0] >> 4) Weight[1] = (Byte[0] & 0xf), etc.
Weight[0] = (Byte[0] >> 4) Weight[1] = (Byte[0] & 0xf), etc.
The full representation occupies ceiling(Number_of_Symbols/2) bytes, meaning it uses only full bytes even if Number_of_Symbols is odd. Number_of_Symbols = headerByte - 127. Note that maximum Number_of_Symbols is 255 - 127 = 128. If any literal has a value over 128, raw header mode is not possible, and it is necessary to use FSE compression.
完整表示占用上限(符号数/2)字节,这意味着即使符号数为奇数,它也只使用完整字节。符号数=头字节-127。请注意,_符号的最大数量为255-127=128。如果任何文字的值超过128,则不可能使用原始标头模式,并且必须使用FSE压缩。
In this case, the series of Huffman weights is compressed using FSE compression. It is a single bitstream with two interleaved states, sharing a single distribution table.
在这种情况下,使用FSE压缩来压缩哈夫曼权重序列。它是一个具有两个交织状态的单比特流,共享一个分发表。
To decode an FSE bitstream, it is necessary to know its compressed size. Compressed size is provided by headerByte. It's also necessary to know its maximum possible decompressed size, which is 255, since literal values span from 0 to 255, and the last symbol's Weight is not represented.
要解码FSE比特流,必须知道其压缩大小。压缩大小由headerByte提供。还需要知道它的最大可能解压大小,即255,因为文字值的范围从0到255,最后一个符号的权重不表示。
An FSE bitstream starts by a header, describing probabilities distribution. It will create a decoding table. For a list of Huffman weights, the maximum accuracy log is 6 bits. For more details, see Section 4.1.1.
FSE比特流由描述概率分布的报头开始。它将创建一个解码表。对于哈夫曼权重列表,最大精度日志为6位。有关更多详细信息,请参见第4.1.1节。
The Huffman header compression uses two states, which share the same FSE distribution table. The first state (State1) encodes the even-numbered index symbols, and the second (State2) encodes the odd-numbered index symbols. State1 is initialized first, and then State2, and they take turns decoding a single symbol and updating their state. For more details on these FSE operations, see Section 4.1.
Huffman头压缩使用两个状态,它们共享相同的FSE分布表。第一个状态(状态1)编码偶数编号的索引符号,第二个状态(状态2)编码奇数编号的索引符号。首先初始化State1,然后初始化State2,然后依次解码单个符号并更新其状态。有关这些FSE操作的更多详细信息,请参见第4.1节。
The number of symbols to be decoded is determined by tracking the bitStream overflow condition: If updating state after decoding a symbol would require more bits than remain in the stream, it is assumed that extra bits are zero. Then, symbols for each of the final states are decoded and the process is complete.
要解码的符号数量通过跟踪比特流溢出条件来确定:如果解码符号后更新状态需要的比特数超过流中的剩余比特数,则假定额外比特数为零。然后,对每个最终状态的符号进行解码并完成该过程。
All present symbols will now have a Weight value. It is possible to transform weights into Number_of_Bits, using this formula:
现在,所有当前符号都将具有权重值。可以使用以下公式将权重转换为\u比特数\u:
if Weight > 0 Number_of_Bits = Max_Number_of_Bits + 1 - Weight else Number_of_Bits = 0
如果权重>0个\u位数=最大\u位数\u位数+1-其他权重\u位数\u=0
Symbols are sorted by Weight. Within the same Weight, symbols keep natural sequential order. Symbols with a Weight of zero are removed. Then, starting from the lowest Weight, prefix codes are distributed in sequential order.
符号按权重排序。在相同的权重范围内,符号保持自然的顺序。权重为零的符号将被删除。然后,从最小权重开始,按顺序分配前缀码。
For example, assume the following list of weights has been decoded:
例如,假设以下权重列表已解码:
+---------+--------+ | Literal | Weight | +---------+--------+ | 0 | 4 | +---------+--------+ | 1 | 3 | +---------+--------+ | 2 | 2 | +---------+--------+ | 3 | 0 | +---------+--------+ | 4 | 1 | +---------+--------+ | 5 | 1 | +---------+--------+
+---------+--------+ | Literal | Weight | +---------+--------+ | 0 | 4 | +---------+--------+ | 1 | 3 | +---------+--------+ | 2 | 2 | +---------+--------+ | 3 | 0 | +---------+--------+ | 4 | 1 | +---------+--------+ | 5 | 1 | +---------+--------+
Sorting by weight and then the natural sequential order yields the following distribution:
按重量排序,然后按自然顺序排序,得到以下分布:
+---------+--------+----------------+--------------+ | Literal | Weight | Number_Of_Bits | Prefix Codes | +---------+--------+----------------|--------------+ | 3 | 0 | 0 | N/A | +---------+--------+----------------|--------------+ | 4 | 1 | 4 | 0000 | +---------+--------+----------------|--------------+ | 5 | 1 | 4 | 0001 | +---------+--------+----------------|--------------+ | 2 | 2 | 3 | 001 | +---------+--------+----------------|--------------+ | 1 | 3 | 2 | 01 | +---------+--------+----------------|--------------+ | 0 | 4 | 1 | 1 | +---------+--------+----------------|--------------+
+---------+--------+----------------+--------------+ | Literal | Weight | Number_Of_Bits | Prefix Codes | +---------+--------+----------------|--------------+ | 3 | 0 | 0 | N/A | +---------+--------+----------------|--------------+ | 4 | 1 | 4 | 0000 | +---------+--------+----------------|--------------+ | 5 | 1 | 4 | 0001 | +---------+--------+----------------|--------------+ | 2 | 2 | 3 | 001 | +---------+--------+----------------|--------------+ | 1 | 3 | 2 | 01 | +---------+--------+----------------|--------------+ | 0 | 4 | 1 | 1 | +---------+--------+----------------|--------------+
Given a Huffman decoding table, it is possible to decode a Huffman-coded stream.
给定哈夫曼解码表,可以对哈夫曼编码流进行解码。
Each bitstream must be read backward, which starts from the end and goes up to the beginning. Therefore, it is necessary to know the size of each bitstream.
每个比特流必须向后读取,从末尾开始,一直到开头。因此,有必要知道每个比特流的大小。
It is also necessary to know exactly which bit is the last. This is detected by a final bit flag: the highest bit of the last byte is a final-bit-flag. Consequently, a last byte of 0 is not possible. And the final-bit-flag itself is not part of the useful bitstream. Hence, the last byte contains between 0 and 7 useful bits.
还必须准确地知道哪一位是最后一位。这由最终位标志检测:最后一个字节的最高位是最终位标志。因此,最后一个字节0是不可能的。最后一个位标志本身不是有用位流的一部分。因此,最后一个字节包含0到7个有用位。
Starting from the end, it is possible to read the bitstream in a little-endian fashion, keeping track of already used bits. Since the bitstream is encoded in reverse order, starting from the end, read symbols in forward order.
从末尾开始,可以以一种稍微末端的方式读取位流,跟踪已经使用的位。由于比特流是以相反的顺序编码的,所以从末尾开始,以正向顺序读取符号。
For example, if the literal sequence "0145" was encoded using the above prefix code, it would be encoded (in reverse order) as:
例如,如果使用上述前缀代码对文字序列“0145”进行编码,则其编码(按相反顺序)如下:
+---------+----------+ | Symbol | Encoding | +---------+----------+ | 5 | 0000 | +---------+----------+ | 4 | 0001 | +---------+----------+ | 1 | 01 | +---------+----------+ | 0 | 1 | +---------+----------+ | Padding | 00001 | +---------+----------+
+---------+----------+ | Symbol | Encoding | +---------+----------+ | 5 | 0000 | +---------+----------+ | 4 | 0001 | +---------+----------+ | 1 | 01 | +---------+----------+ | 0 | 1 | +---------+----------+ | Padding | 00001 | +---------+----------+
This results in the following 2-byte bitstream:
这将产生以下2字节比特流:
00010000 00001101
00010000 00001101
Here is an alternative representation with the symbol codes separated by underscores:
下面是一个用下划线分隔的符号代码的替代表示法:
0001_0000 00001_1_01
0001_0000 00001_1_01
Reading the highest Max_Number_of_Bits bits, it's possible to compare the extracted value to the decoding table, determining the symbol to decode and number of bits to discard.
读取最大最大\u位数的\u位,可以将提取的值与解码表进行比较,确定要解码的符号和要丢弃的位数。
The process continues reading up to the required number of symbols per stream. If a bitstream is not entirely and exactly consumed, hence reaching exactly its beginning position with all bits consumed, the decoding process is considered faulty.
该过程继续读取每个流所需的符号数。如果一个比特流没有完全和准确地被消耗,因此在消耗所有比特的情况下正好到达其起始位置,则解码过程被认为是错误的。
Zstandard is compatible with "raw content" dictionaries, free of any format restriction, except that they must be at least 8 bytes. These dictionaries function as if they were just the content part of a formatted dictionary.
Zstandard与“原始内容”词典兼容,不受任何格式限制,但它们必须至少为8字节。这些词典的功能就好像它们只是格式化词典的内容部分一样。
However, dictionaries created by "zstd --train" in the reference implementation follow a specific format, described here.
但是,参考实现中由“zstd--train”创建的字典遵循此处描述的特定格式。
Dictionaries are not included in the compressed content but rather are provided out of band. That is, the Dictionary_ID identifies which should be used, but this specification does not describe the
词典不包括在压缩内容中,而是在带外提供。也就是说,Dictionary_ID标识应该使用哪些,但是本规范没有描述
mechanism by which the dictionary is obtained prior to use during compression or decompression.
在压缩或解压缩期间使用之前获取词典的机制。
A dictionary has a size, defined either by a buffer limit or a file size. The general format is:
字典的大小由缓冲区限制或文件大小定义。一般格式为:
+--------------+---------------+----------------+---------+ | Magic_Number | Dictionary_ID | Entropy_Tables | Content | +--------------+---------------+----------------+---------+
+--------------+---------------+----------------+---------+ | Magic_Number | Dictionary_ID | Entropy_Tables | Content | +--------------+---------------+----------------+---------+
Magic_Number: 4 bytes ID, value 0xEC30A437, little-endian format.
幻数:4字节ID,值0xEC30A437,小尾端格式。
Dictionary_ID: 4 bytes, stored in little-endian format. Dictionary_ID can be any value, except 0 (which means no Dictionary_ID). It is used by decoders to check if they use the correct dictionary. If the frame is going to be distributed in a private environment, any Dictionary_ID can be used. However, for public distribution of compressed frames, the following ranges are reserved and shall not be used:
字典ID:4字节,以little-endian格式存储。Dictionary\u ID可以是除0(表示没有Dictionary\u ID)之外的任何值。解码器使用它来检查他们是否使用了正确的词典。如果要在私有环境中分发框架,则可以使用任何字典ID。但是,对于压缩框架的公共分发,保留以下范围,不得使用:
low range: <= 32767 high range: >= (2^31)
low range: <= 32767 high range: >= (2^31)
Entropy_Tables: Follow the same format as the tables in compressed blocks. See the relevant FSE and Huffman sections for how to decode these tables. They are stored in the following order: Huffman table for literals, FSE table for offsets, FSE table for match lengths, and FSE table for literals lengths. These tables populate the Repeat Stats literals mode and Repeat distribution mode for sequence decoding. It is finally followed by 3 offset values, populating repeat offsets (instead of using {1,4,8}), stored in order, 4-bytes little-endian each, for a total of 12 bytes. Each repeat offset must have a value less than the dictionary size.
熵_表:遵循与压缩块中的表相同的格式。有关如何解码这些表,请参阅相关的FSE和Huffman部分。它们按以下顺序存储:用于文字的霍夫曼表、用于偏移量的FSE表、用于匹配长度的FSE表和用于文字长度的FSE表。这些表填充用于序列解码的Repeat Stats literals模式和Repeat distribution模式。最后,它后面是3个偏移量值,填充重复偏移量(而不是使用{1,4,8}),按顺序存储,每个偏移量为4字节小端,总共12字节。每个重复偏移量的值必须小于字典大小。
Content: The rest of the dictionary is its content. The content acts as a "past" in front of data to be compressed or decompressed, so it can be referenced in sequence commands. As long as the amount of data decoded from this frame is less than or equal to Window_Size, sequence commands may specify offsets longer than the total length of decoded output so far to reference back to the dictionary, even parts of the dictionary with offsets larger than Window_Size. After the total output has surpassed Window_Size, however, this is no longer allowed, and the dictionary is no longer accessible.
内容:字典的其余部分是它的内容。内容在要压缩或解压缩的数据前面充当“过去”,因此可以在顺序命令中引用它。只要从该帧解码的数据量小于或等于Window_Size,序列命令就可以指定大于解码输出总长度的偏移量,以引用回字典,甚至是偏移量大于Window_Size的字典部分。但是,在总输出超过窗口大小后,将不再允许这样做,并且无法再访问字典。
IANA has made two registrations, as described below.
IANA进行了两次注册,如下所述。
The 'application/zstd' media type identifies a block of data that is compressed using zstd compression. The data is a stream of bytes as described in this document. IANA has added the following to the "Media Types" registry:
“application/zstd”媒体类型标识使用zstd压缩压缩的数据块。数据是本文档中描述的字节流。IANA已将以下内容添加到“媒体类型”注册表中:
Type name: application
类型名称:应用程序
Subtype name: zstd
子类型名称:zstd
Required parameters: N/A
所需参数:不适用
Optional parameters: N/A
可选参数:不适用
Encoding considerations: binary
编码注意事项:二进制
Security considerations: See Section 7 of RFC 8478
安全注意事项:见RFC 8478第7节
Interoperability considerations: N/A
互操作性注意事项:不适用
Published specification: RFC 8478
已发布规范:RFC 8478
Applications that use this media type: anywhere data size is an issue
使用此介质类型的应用程序:anywhere数据大小是一个问题
Additional information:
其他信息:
Magic number(s): 4 bytes, little-endian format. Value: 0xFD2FB528
幻数:4字节,小尾端格式。值:0xFD2FB528
File extension(s): zst
文件扩展名:zst
Macintosh file type code(s): N/A
Macintosh file type code(s): N/A
For further information: See [ZSTD]
有关更多信息,请参见[ZSTD]
Intended usage: common
预期用途:普通
Restrictions on usage: N/A
使用限制:不适用
Author: Murray S. Kucherawy
作者:Murray S.Kucherawy
Change Controller: IETF
更改控制器:IETF
Provisional registration: no
临时注册:否
IANA has added the following entry to the "HTTP Content Coding Registry" within the "Hypertext Transfer Protocol (HTTP) Parameters" registry:
IANA已将以下条目添加到“超文本传输协议(HTTP)参数”注册表中的“HTTP内容编码注册表”:
Name: zstd
姓名:zstd
Description: A stream of bytes compressed using the Zstandard protocol
描述:使用Zstandard协议压缩的字节流
Pointer to specification text: RFC 8478
指向规范文本的指针:RFC 8478
Work in progress includes development of dictionaries that will optimize compression and decompression of particular types of data. Specification of such dictionaries for public use will necessitate registration of a code point from the reserved range described in Section 3.1.1.1.3 and its association with a specific dictionary.
正在进行的工作包括开发字典,优化特定类型数据的压缩和解压缩。公共使用的此类词典的规范要求注册第3.1.1.1.3节所述保留范围内的代码点及其与特定词典的关联。
However, there are at present no such dictionaries published for public use, so this document makes no immediate request of IANA to create such a registry.
然而,目前还没有出版供公众使用的此类词典,因此本文件没有立即要求IANA创建此类注册表。
Any data compression method involves the reduction of redundancy in the data. Zstandard is no exception, and the usual precautions apply.
任何数据压缩方法都涉及减少数据中的冗余。Z标准也不例外,通常的预防措施也适用。
One should never compress a message whose content must remain secret with a message generated by a third party. Such a compression can be used to guess the content of the secret message through analysis of entropy reduction. This was demonstrated in the Compression Ratio Info-leak Made Easy (CRIME) attack [CRIME], for example.
决不能用第三方生成的消息压缩其内容必须保密的消息。这种压缩可以通过分析熵减少来猜测秘密消息的内容。例如,这在压缩比信息泄漏使易(犯罪)攻击[犯罪]中得到了证明。
A decoder has to demonstrate capabilities to detect and prevent any kind of data tampering in the compressed frame from triggering system faults, such as reading or writing beyond allowed memory ranges. This can be guaranteed by either the implementation language or careful bound checkings. Of particular note is the encoding of Number_of_Sequences values that cause the decoder to read into the block header (and beyond), as well as the indication of a Frame_Content_Size that is smaller than the actual decompressed data, in an attempt to trigger a buffer overflow. It is highly recommended
解码器必须能够检测并防止压缩帧中的任何类型的数据篡改触发系统故障,例如读取或写入超出允许的内存范围。这可以通过实现语言或仔细的绑定检查来保证。特别值得注意的是,编码导致解码器读入块头(及以上)的序列值的数量,以及指示小于实际解压缩数据的帧内容大小,以试图触发缓冲区溢出。强烈推荐
to fuzz-test (i.e., provide invalid, unexpected, or random input and verify safe operation of) decoder implementations to test and harden their capability to detect bad frames and deal with them without any adverse system side effect.
对解码器实现进行模糊测试(即,提供无效、意外或随机输入,并验证其安全操作),以测试和增强其检测坏帧并处理坏帧的能力,而不会产生任何不利的系统副作用。
An attacker may provide correctly formed compressed frames with unreasonable memory requirements. A decoder must always control memory requirements and enforce some (system-specific) limits in order to protect memory usage from such scenarios.
攻击者可能会提供格式正确、内存要求不合理的压缩帧。解码器必须始终控制内存需求并强制执行某些(特定于系统的)限制,以保护内存使用不受此类情况的影响。
Compression can be optimized by training a dictionary on a variety of related content payloads. This dictionary must then be available at the decoder for decompression of the payload to be possible. While this document does not specify how to acquire a dictionary for a given compressed payload, it is worth noting that third-party dictionaries may interact unexpectedly with a decoder, leading to possible memory or other resource exhaustion attacks. We expect such topics to be discussed in further detail in the Security Considerations section of a forthcoming RFC for dictionary acquisition and transmission, but highlight this issue now out of an abundance of caution.
压缩可以通过在各种相关内容有效负载上训练字典来优化。然后,该字典必须在解码器处可用,以便能够对有效负载进行解压缩。虽然本文档未指定如何获取给定压缩负载的词典,但值得注意的是,第三方词典可能会意外地与解码器交互,从而导致可能的内存或其他资源耗尽攻击。我们希望在即将发布的词典获取和传输RFC的安全注意事项部分进一步详细讨论此类主题,但出于谨慎考虑,现在强调此问题。
As discussed in Section 3.1.2, it is possible to store arbitrary user metadata in skippable frames. While such frames are ignored during decompression of the data, they can be used as a watermark to track the path of the compressed payload.
如第3.1.2节所述,可以在可跳过的帧中存储任意用户元数据。虽然这些帧在数据解压缩期间被忽略,但它们可以用作跟踪压缩有效负载路径的水印。
Source code for a C language implementation of a Zstandard-compliant library is available at [ZSTD-GITHUB]. This implementation is considered to be the reference implementation and is production ready; it implements the full range of the specification. It is routinely tested against security hazards and widely deployed within Facebook infrastructure.
符合ZSTD标准的库的C语言实现的源代码可从[ZSTD-GITHUB]获得。该实施被视为参考实施,并已准备好生产;它实现了规范的全部范围。它定期进行安全隐患测试,并广泛部署在Facebook基础设施中。
The reference version is optimized for speed and is highly portable. It has been proven to run safely on multiple architectures (e.g., x86, x64, ARM, MIPS, PowerPC, IA64) featuring 32- or 64-bit addressing schemes, a little- or big-endian storage scheme, a number of different operating systems (e.g., UNIX (including Linux, BSD, OS-X, and Solaris) and Windows), and a number of compilers (e.g., gcc, clang, visual, and icc).
参考版本针对速度进行了优化,并且具有高度的可移植性。经证明,它可以在多种体系结构(如x86、x64、ARM、MIPS、PowerPC、IA64)上安全运行,这些体系结构具有32位或64位寻址方案、小型或大型存储方案、多种不同的操作系统(如UNIX(包括Linux、BSD、OS-X和Solaris)和Windows)以及多种编译器(如gcc、clang、visual和icc)。
[ZSTD] "Zstandard", <http://www.zstd.net>.
[ZSTD]“Zstandard”<http://www.zstd.net>.
[ANS] Duda, J., "Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding", January 2014, <https://arxiv.org/pdf/1311.2540>.
[ANS]Duda,J.“非对称数字系统:熵编码结合了哈夫曼编码的速度和算术编码的压缩率”,2014年1月<https://arxiv.org/pdf/1311.2540>.
[CRIME] "CRIME", June 2018, <https://en.wikipedia.org/w/ index.php?title=CRIME&oldid=844538656>.
[犯罪]“犯罪”,2018年6月<https://en.wikipedia.org/w/ index.php?title=CRIME&oldid=844538656>。
[FSE] "FiniteStateEntropy", commit 6efa78a, June 2018, <https://github.com/Cyan4973/FiniteStateEntropy/>.
[FSE]“有限测试熵”,承诺6efa78a,2018年6月<https://github.com/Cyan4973/FiniteStateEntropy/>.
[LZ4] "LZ4 Frame Format Description", commit d03224b, January 2018, <https://github.com/lz4/lz4/blob/master/doc/ lz4_Frame_format.md>.
[LZ4]“LZ4帧格式说明”,提交d03224b,2018年1月<https://github.com/lz4/lz4/blob/master/doc/ lz4_Frame_format.md>。
[RFC1952] Deutsch, P., "GZIP file format specification version 4.3", RFC 1952, DOI 10.17487/RFC1952, May 1996, <https://www.rfc-editor.org/info/rfc1952>.
[RFC1952]Deutsch,P.,“GZIP文件格式规范版本4.3”,RFC 1952,DOI 10.17487/RFC1952,1996年5月<https://www.rfc-editor.org/info/rfc1952>.
[XXHASH] "XXHASH Algorithm", <http://www.xxhash.org>.
[XXHASH]“XXHASH算法”<http://www.xxhash.org>.
[ZSTD-GITHUB] "zstd", commit 8514bd8, August 2018, <https://github.com/facebook/zstd>.
[ZSTD-GITHUB]“ZSTD”,承诺8514bd8,2018年8月<https://github.com/facebook/zstd>.
This appendix contains FSE decoding tables for the predefined literal length, match length, and offset codes. The tables have been constructed using the algorithm as given above in Section 4.1.1. The tables here can be used as examples to crosscheck that an implementation has built its decoding tables correctly.
本附录包含预定义文字长度、匹配长度和偏移代码的FSE解码表。表是使用上述第4.1.1节中给出的算法构建的。这里的表可以用作交叉检查实现是否正确构建了解码表的示例。
+-------+--------+----------------+------+ | State | Symbol | Number_Of_Bits | Base | +-------+--------+----------------+------+ | 0 | 0 | 0 | 0 | +-------+--------+----------------+------+ | 0 | 0 | 4 | 0 | +-------+--------+----------------+------+ | 1 | 0 | 4 | 16 | +-------+--------+----------------+------+ | 2 | 1 | 5 | 32 | +-------+--------+----------------+------+ | 3 | 3 | 5 | 0 | +-------+--------+----------------+------+ | 4 | 4 | 5 | 0 | +-------+--------+----------------+------+ | 5 | 6 | 5 | 0 | +-------+--------+----------------+------+ | 6 | 7 | 5 | 0 | +-------+--------+----------------+------+ | 7 | 9 | 5 | 0 | +-------+--------+----------------+------+ | 8 | 10 | 5 | 0 | +-------+--------+----------------+------+ | 9 | 12 | 5 | 0 | +-------+--------+----------------+------+ | 10 | 14 | 6 | 0 | +-------+--------+----------------+------+ | 11 | 16 | 5 | 0 | +-------+--------+----------------+------+ | 12 | 18 | 5 | 0 | +-------+--------+----------------+------+ | 13 | 19 | 5 | 0 | +-------+--------+----------------+------+ | 14 | 21 | 5 | 0 | +-------+--------+----------------+------+ | 15 | 22 | 5 | 0 | +-------+--------+----------------+------+ | 16 | 24 | 5 | 0 |
+-------+--------+----------------+------+ | State | Symbol | Number_Of_Bits | Base | +-------+--------+----------------+------+ | 0 | 0 | 0 | 0 | +-------+--------+----------------+------+ | 0 | 0 | 4 | 0 | +-------+--------+----------------+------+ | 1 | 0 | 4 | 16 | +-------+--------+----------------+------+ | 2 | 1 | 5 | 32 | +-------+--------+----------------+------+ | 3 | 3 | 5 | 0 | +-------+--------+----------------+------+ | 4 | 4 | 5 | 0 | +-------+--------+----------------+------+ | 5 | 6 | 5 | 0 | +-------+--------+----------------+------+ | 6 | 7 | 5 | 0 | +-------+--------+----------------+------+ | 7 | 9 | 5 | 0 | +-------+--------+----------------+------+ | 8 | 10 | 5 | 0 | +-------+--------+----------------+------+ | 9 | 12 | 5 | 0 | +-------+--------+----------------+------+ | 10 | 14 | 6 | 0 | +-------+--------+----------------+------+ | 11 | 16 | 5 | 0 | +-------+--------+----------------+------+ | 12 | 18 | 5 | 0 | +-------+--------+----------------+------+ | 13 | 19 | 5 | 0 | +-------+--------+----------------+------+ | 14 | 21 | 5 | 0 | +-------+--------+----------------+------+ | 15 | 22 | 5 | 0 | +-------+--------+----------------+------+ | 16 | 24 | 5 | 0 |
+-------+--------+----------------+------+ | 17 | 25 | 5 | 32 | +-------+--------+----------------+------+ | 18 | 26 | 5 | 0 | +-------+--------+----------------+------+ | 19 | 27 | 6 | 0 | +-------+--------+----------------+------+ | 20 | 29 | 6 | 0 | +-------+--------+----------------+------+ | 21 | 31 | 6 | 0 | +-------+--------+----------------+------+ | 22 | 0 | 4 | 32 | +-------+--------+----------------+------+ | 23 | 1 | 4 | 0 | +-------+--------+----------------+------+ | 24 | 2 | 5 | 0 | +-------+--------+----------------+------+ | 25 | 4 | 5 | 32 | +-------+--------+----------------+------+ | 26 | 5 | 5 | 0 | +-------+--------+----------------+------+ | 27 | 7 | 5 | 32 | +-------+--------+----------------+------+ | 28 | 8 | 5 | 0 | +-------+--------+----------------+------+ | 29 | 10 | 5 | 32 | +-------+--------+----------------+------+ | 30 | 11 | 5 | 0 | +-------+--------+----------------+------+ | 31 | 13 | 6 | 0 | +-------+--------+----------------+------+ | 32 | 16 | 5 | 32 | +-------+--------+----------------+------+ | 33 | 17 | 5 | 0 | +-------+--------+----------------+------+ | 34 | 19 | 5 | 32 | +-------+--------+----------------+------+ | 35 | 20 | 5 | 0 | +-------+--------+----------------+------+ | 36 | 22 | 5 | 32 | +-------+--------+----------------+------+ | 37 | 23 | 5 | 0 | +-------+--------+----------------+------+ | 38 | 25 | 4 | 0 | +-------+--------+----------------+------+ | 39 | 25 | 4 | 16 | +-------+--------+----------------+------+ | 40 | 26 | 5 | 32 |
+-------+--------+----------------+------+ | 17 | 25 | 5 | 32 | +-------+--------+----------------+------+ | 18 | 26 | 5 | 0 | +-------+--------+----------------+------+ | 19 | 27 | 6 | 0 | +-------+--------+----------------+------+ | 20 | 29 | 6 | 0 | +-------+--------+----------------+------+ | 21 | 31 | 6 | 0 | +-------+--------+----------------+------+ | 22 | 0 | 4 | 32 | +-------+--------+----------------+------+ | 23 | 1 | 4 | 0 | +-------+--------+----------------+------+ | 24 | 2 | 5 | 0 | +-------+--------+----------------+------+ | 25 | 4 | 5 | 32 | +-------+--------+----------------+------+ | 26 | 5 | 5 | 0 | +-------+--------+----------------+------+ | 27 | 7 | 5 | 32 | +-------+--------+----------------+------+ | 28 | 8 | 5 | 0 | +-------+--------+----------------+------+ | 29 | 10 | 5 | 32 | +-------+--------+----------------+------+ | 30 | 11 | 5 | 0 | +-------+--------+----------------+------+ | 31 | 13 | 6 | 0 | +-------+--------+----------------+------+ | 32 | 16 | 5 | 32 | +-------+--------+----------------+------+ | 33 | 17 | 5 | 0 | +-------+--------+----------------+------+ | 34 | 19 | 5 | 32 | +-------+--------+----------------+------+ | 35 | 20 | 5 | 0 | +-------+--------+----------------+------+ | 36 | 22 | 5 | 32 | +-------+--------+----------------+------+ | 37 | 23 | 5 | 0 | +-------+--------+----------------+------+ | 38 | 25 | 4 | 0 | +-------+--------+----------------+------+ | 39 | 25 | 4 | 16 | +-------+--------+----------------+------+ | 40 | 26 | 5 | 32 |
+-------+--------+----------------+------+ | 41 | 28 | 6 | 0 | +-------+--------+----------------+------+ | 42 | 30 | 6 | 0 | +-------+--------+----------------+------+ | 43 | 0 | 4 | 48 | +-------+--------+----------------+------+ | 44 | 1 | 4 | 16 | +-------+--------+----------------+------+ | 45 | 2 | 5 | 32 | +-------+--------+----------------+------+ | 46 | 3 | 5 | 32 | +-------+--------+----------------+------+ | 47 | 5 | 5 | 32 | +-------+--------+----------------+------+ | 48 | 6 | 5 | 32 | +-------+--------+----------------+------+ | 49 | 8 | 5 | 32 | +-------+--------+----------------+------+ | 50 | 9 | 5 | 32 | +-------+--------+----------------+------+ | 51 | 11 | 5 | 32 | +-------+--------+----------------+------+ | 52 | 12 | 5 | 32 | +-------+--------+----------------+------+ | 53 | 15 | 6 | 0 | +-------+--------+----------------+------+ | 54 | 17 | 5 | 32 | +-------+--------+----------------+------+ | 55 | 18 | 5 | 32 | +-------+--------+----------------+------+ | 56 | 20 | 5 | 32 | +-------+--------+----------------+------+ | 57 | 21 | 5 | 32 | +-------+--------+----------------+------+ | 58 | 23 | 5 | 32 | +-------+--------+----------------+------+ | 59 | 24 | 5 | 32 | +-------+--------+----------------+------+ | 60 | 35 | 6 | 0 | +-------+--------+----------------+------+ | 61 | 34 | 6 | 0 | +-------+--------+----------------+------+ | 62 | 33 | 6 | 0 | +-------+--------+----------------+------+ | 63 | 32 | 6 | 0 | +-------+--------+----------------+------+
+-------+--------+----------------+------+ | 41 | 28 | 6 | 0 | +-------+--------+----------------+------+ | 42 | 30 | 6 | 0 | +-------+--------+----------------+------+ | 43 | 0 | 4 | 48 | +-------+--------+----------------+------+ | 44 | 1 | 4 | 16 | +-------+--------+----------------+------+ | 45 | 2 | 5 | 32 | +-------+--------+----------------+------+ | 46 | 3 | 5 | 32 | +-------+--------+----------------+------+ | 47 | 5 | 5 | 32 | +-------+--------+----------------+------+ | 48 | 6 | 5 | 32 | +-------+--------+----------------+------+ | 49 | 8 | 5 | 32 | +-------+--------+----------------+------+ | 50 | 9 | 5 | 32 | +-------+--------+----------------+------+ | 51 | 11 | 5 | 32 | +-------+--------+----------------+------+ | 52 | 12 | 5 | 32 | +-------+--------+----------------+------+ | 53 | 15 | 6 | 0 | +-------+--------+----------------+------+ | 54 | 17 | 5 | 32 | +-------+--------+----------------+------+ | 55 | 18 | 5 | 32 | +-------+--------+----------------+------+ | 56 | 20 | 5 | 32 | +-------+--------+----------------+------+ | 57 | 21 | 5 | 32 | +-------+--------+----------------+------+ | 58 | 23 | 5 | 32 | +-------+--------+----------------+------+ | 59 | 24 | 5 | 32 | +-------+--------+----------------+------+ | 60 | 35 | 6 | 0 | +-------+--------+----------------+------+ | 61 | 34 | 6 | 0 | +-------+--------+----------------+------+ | 62 | 33 | 6 | 0 | +-------+--------+----------------+------+ | 63 | 32 | 6 | 0 | +-------+--------+----------------+------+
+-------+--------+----------------+------+ | State | Symbol | Number_Of_Bits | Base | +-------+--------+----------------+------+ | 0 | 0 | 0 | 0 | +-------+--------+----------------+------+ | 0 | 0 | 6 | 0 | +-------+--------+----------------+------+ | 1 | 1 | 4 | 0 | +-------+--------+----------------+------+ | 2 | 2 | 5 | 32 | +-------+--------+----------------+------+ | 3 | 3 | 5 | 0 | +-------+--------+----------------+------+ | 4 | 5 | 5 | 0 | +-------+--------+----------------+------+ | 5 | 6 | 5 | 0 | +-------+--------+----------------+------+ | 6 | 8 | 5 | 0 | +-------+--------+----------------+------+ | 7 | 10 | 6 | 0 | +-------+--------+----------------+------+ | 8 | 13 | 6 | 0 | +-------+--------+----------------+------+ | 9 | 16 | 6 | 0 | +-------+--------+----------------+------+ | 10 | 19 | 6 | 0 | +-------+--------+----------------+------+ | 11 | 22 | 6 | 0 | +-------+--------+----------------+------+ | 12 | 25 | 6 | 0 | +-------+--------+----------------+------+ | 13 | 28 | 6 | 0 | +-------+--------+----------------+------+ | 14 | 31 | 6 | 0 | +-------+--------+----------------+------+ | 15 | 33 | 6 | 0 | +-------+--------+----------------+------+ | 16 | 35 | 6 | 0 | +-------+--------+----------------+------+ | 17 | 37 | 6 | 0 | +-------+--------+----------------+------+ | 18 | 39 | 6 | 0 | +-------+--------+----------------+------+ | 19 | 41 | 6 | 0 | +-------+--------+----------------+------+ | 20 | 43 | 6 | 0 |
+-------+--------+----------------+------+ | State | Symbol | Number_Of_Bits | Base | +-------+--------+----------------+------+ | 0 | 0 | 0 | 0 | +-------+--------+----------------+------+ | 0 | 0 | 6 | 0 | +-------+--------+----------------+------+ | 1 | 1 | 4 | 0 | +-------+--------+----------------+------+ | 2 | 2 | 5 | 32 | +-------+--------+----------------+------+ | 3 | 3 | 5 | 0 | +-------+--------+----------------+------+ | 4 | 5 | 5 | 0 | +-------+--------+----------------+------+ | 5 | 6 | 5 | 0 | +-------+--------+----------------+------+ | 6 | 8 | 5 | 0 | +-------+--------+----------------+------+ | 7 | 10 | 6 | 0 | +-------+--------+----------------+------+ | 8 | 13 | 6 | 0 | +-------+--------+----------------+------+ | 9 | 16 | 6 | 0 | +-------+--------+----------------+------+ | 10 | 19 | 6 | 0 | +-------+--------+----------------+------+ | 11 | 22 | 6 | 0 | +-------+--------+----------------+------+ | 12 | 25 | 6 | 0 | +-------+--------+----------------+------+ | 13 | 28 | 6 | 0 | +-------+--------+----------------+------+ | 14 | 31 | 6 | 0 | +-------+--------+----------------+------+ | 15 | 33 | 6 | 0 | +-------+--------+----------------+------+ | 16 | 35 | 6 | 0 | +-------+--------+----------------+------+ | 17 | 37 | 6 | 0 | +-------+--------+----------------+------+ | 18 | 39 | 6 | 0 | +-------+--------+----------------+------+ | 19 | 41 | 6 | 0 | +-------+--------+----------------+------+ | 20 | 43 | 6 | 0 |
+-------+--------+----------------+------+ | 21 | 45 | 6 | 0 | +-------+--------+----------------+------+ | 22 | 1 | 4 | 16 | +-------+--------+----------------+------+ | 23 | 2 | 4 | 0 | +-------+--------+----------------+------+ | 24 | 3 | 5 | 32 | +-------+--------+----------------+------+ | 25 | 4 | 5 | 0 | +-------+--------+----------------+------+ | 26 | 6 | 5 | 32 | +-------+--------+----------------+------+ | 27 | 7 | 5 | 0 | +-------+--------+----------------+------+ | 28 | 9 | 6 | 0 | +-------+--------+----------------+------+ | 29 | 12 | 6 | 0 | +-------+--------+----------------+------+ | 30 | 15 | 6 | 0 | +-------+--------+----------------+------+ | 31 | 18 | 6 | 0 | +-------+--------+----------------+------+ | 32 | 21 | 6 | 0 | +-------+--------+----------------+------+ | 33 | 24 | 6 | 0 | +-------+--------+----------------+------+ | 34 | 27 | 6 | 0 | +-------+--------+----------------+------+ | 35 | 30 | 6 | 0 | +-------+--------+----------------+------+ | 36 | 32 | 6 | 0 | +-------+--------+----------------+------+ | 37 | 34 | 6 | 0 | +-------+--------+----------------+------+ | 38 | 36 | 6 | 0 | +-------+--------+----------------+------+ | 39 | 38 | 6 | 0 | +-------+--------+----------------+------+ | 40 | 40 | 6 | 0 | +-------+--------+----------------+------+ | 41 | 42 | 6 | 0 | +-------+--------+----------------+------+ | 42 | 44 | 6 | 0 | +-------+--------+----------------+------+ | 43 | 1 | 4 | 32 | +-------+--------+----------------+------+ | 44 | 1 | 4 | 48 |
+-------+--------+----------------+------+ | 21 | 45 | 6 | 0 | +-------+--------+----------------+------+ | 22 | 1 | 4 | 16 | +-------+--------+----------------+------+ | 23 | 2 | 4 | 0 | +-------+--------+----------------+------+ | 24 | 3 | 5 | 32 | +-------+--------+----------------+------+ | 25 | 4 | 5 | 0 | +-------+--------+----------------+------+ | 26 | 6 | 5 | 32 | +-------+--------+----------------+------+ | 27 | 7 | 5 | 0 | +-------+--------+----------------+------+ | 28 | 9 | 6 | 0 | +-------+--------+----------------+------+ | 29 | 12 | 6 | 0 | +-------+--------+----------------+------+ | 30 | 15 | 6 | 0 | +-------+--------+----------------+------+ | 31 | 18 | 6 | 0 | +-------+--------+----------------+------+ | 32 | 21 | 6 | 0 | +-------+--------+----------------+------+ | 33 | 24 | 6 | 0 | +-------+--------+----------------+------+ | 34 | 27 | 6 | 0 | +-------+--------+----------------+------+ | 35 | 30 | 6 | 0 | +-------+--------+----------------+------+ | 36 | 32 | 6 | 0 | +-------+--------+----------------+------+ | 37 | 34 | 6 | 0 | +-------+--------+----------------+------+ | 38 | 36 | 6 | 0 | +-------+--------+----------------+------+ | 39 | 38 | 6 | 0 | +-------+--------+----------------+------+ | 40 | 40 | 6 | 0 | +-------+--------+----------------+------+ | 41 | 42 | 6 | 0 | +-------+--------+----------------+------+ | 42 | 44 | 6 | 0 | +-------+--------+----------------+------+ | 43 | 1 | 4 | 32 | +-------+--------+----------------+------+ | 44 | 1 | 4 | 48 |
+-------+--------+----------------+------+ | 45 | 2 | 4 | 16 | +-------+--------+----------------+------+ | 46 | 4 | 5 | 32 | +-------+--------+----------------+------+ | 47 | 5 | 5 | 32 | +-------+--------+----------------+------+ | 48 | 7 | 5 | 32 | +-------+--------+----------------+------+ | 49 | 8 | 5 | 32 | +-------+--------+----------------+------+ | 50 | 11 | 6 | 0 | +-------+--------+----------------+------+ | 51 | 14 | 6 | 0 | +-------+--------+----------------+------+ | 52 | 17 | 6 | 0 | +-------+--------+----------------+------+ | 53 | 20 | 6 | 0 | +-------+--------+----------------+------+ | 54 | 23 | 6 | 0 | +-------+--------+----------------+------+ | 55 | 26 | 6 | 0 | +-------+--------+----------------+------+ | 56 | 29 | 6 | 0 | +-------+--------+----------------+------+ | 57 | 52 | 6 | 0 | +-------+--------+----------------+------+ | 58 | 51 | 6 | 0 | +-------+--------+----------------+------+ | 59 | 50 | 6 | 0 | +-------+--------+----------------+------+ | 60 | 49 | 6 | 0 | +-------+--------+----------------+------+ | 61 | 48 | 6 | 0 | +-------+--------+----------------+------+ | 62 | 47 | 6 | 0 | +-------+--------+----------------+------+ | 63 | 46 | 6 | 0 | +-------+--------+----------------+------+
+-------+--------+----------------+------+ | 45 | 2 | 4 | 16 | +-------+--------+----------------+------+ | 46 | 4 | 5 | 32 | +-------+--------+----------------+------+ | 47 | 5 | 5 | 32 | +-------+--------+----------------+------+ | 48 | 7 | 5 | 32 | +-------+--------+----------------+------+ | 49 | 8 | 5 | 32 | +-------+--------+----------------+------+ | 50 | 11 | 6 | 0 | +-------+--------+----------------+------+ | 51 | 14 | 6 | 0 | +-------+--------+----------------+------+ | 52 | 17 | 6 | 0 | +-------+--------+----------------+------+ | 53 | 20 | 6 | 0 | +-------+--------+----------------+------+ | 54 | 23 | 6 | 0 | +-------+--------+----------------+------+ | 55 | 26 | 6 | 0 | +-------+--------+----------------+------+ | 56 | 29 | 6 | 0 | +-------+--------+----------------+------+ | 57 | 52 | 6 | 0 | +-------+--------+----------------+------+ | 58 | 51 | 6 | 0 | +-------+--------+----------------+------+ | 59 | 50 | 6 | 0 | +-------+--------+----------------+------+ | 60 | 49 | 6 | 0 | +-------+--------+----------------+------+ | 61 | 48 | 6 | 0 | +-------+--------+----------------+------+ | 62 | 47 | 6 | 0 | +-------+--------+----------------+------+ | 63 | 46 | 6 | 0 | +-------+--------+----------------+------+
+-------+--------+----------------+------+ | State | Symbol | Number_Of_Bits | Base | +-------+--------+----------------+------+ | 0 | 0 | 0 | 0 | +-------+--------+----------------+------+ | 0 | 0 | 5 | 0 | +-------+--------+----------------+------+ | 1 | 6 | 4 | 0 | +-------+--------+----------------+------+ | 2 | 9 | 5 | 0 | +-------+--------+----------------+------+ | 3 | 15 | 5 | 0 | +-------+--------+----------------+------+ | 4 | 21 | 5 | 0 | +-------+--------+----------------+------+ | 5 | 3 | 5 | 0 | +-------+--------+----------------+------+ | 6 | 7 | 4 | 0 | +-------+--------+----------------+------+ | 7 | 12 | 5 | 0 | +-------+--------+----------------+------+ | 8 | 18 | 5 | 0 | +-------+--------+----------------+------+ | 9 | 23 | 5 | 0 | +-------+--------+----------------+------+ | 10 | 5 | 5 | 0 | +-------+--------+----------------+------+ | 11 | 8 | 4 | 0 | +-------+--------+----------------+------+ | 12 | 14 | 5 | 0 | +-------+--------+----------------+------+ | 13 | 20 | 5 | 0 | +-------+--------+----------------+------+ | 14 | 2 | 5 | 0 | +-------+--------+----------------+------+ | 15 | 7 | 4 | 16 | +-------+--------+----------------+------+ | 16 | 11 | 5 | 0 | +-------+--------+----------------+------+ | 17 | 17 | 5 | 0 | +-------+--------+----------------+------+ | 18 | 22 | 5 | 0 | +-------+--------+----------------+------+ | 19 | 4 | 5 | 0 | +-------+--------+----------------+------+ | 20 | 8 | 4 | 16 |
+-------+--------+----------------+------+ | State | Symbol | Number_Of_Bits | Base | +-------+--------+----------------+------+ | 0 | 0 | 0 | 0 | +-------+--------+----------------+------+ | 0 | 0 | 5 | 0 | +-------+--------+----------------+------+ | 1 | 6 | 4 | 0 | +-------+--------+----------------+------+ | 2 | 9 | 5 | 0 | +-------+--------+----------------+------+ | 3 | 15 | 5 | 0 | +-------+--------+----------------+------+ | 4 | 21 | 5 | 0 | +-------+--------+----------------+------+ | 5 | 3 | 5 | 0 | +-------+--------+----------------+------+ | 6 | 7 | 4 | 0 | +-------+--------+----------------+------+ | 7 | 12 | 5 | 0 | +-------+--------+----------------+------+ | 8 | 18 | 5 | 0 | +-------+--------+----------------+------+ | 9 | 23 | 5 | 0 | +-------+--------+----------------+------+ | 10 | 5 | 5 | 0 | +-------+--------+----------------+------+ | 11 | 8 | 4 | 0 | +-------+--------+----------------+------+ | 12 | 14 | 5 | 0 | +-------+--------+----------------+------+ | 13 | 20 | 5 | 0 | +-------+--------+----------------+------+ | 14 | 2 | 5 | 0 | +-------+--------+----------------+------+ | 15 | 7 | 4 | 16 | +-------+--------+----------------+------+ | 16 | 11 | 5 | 0 | +-------+--------+----------------+------+ | 17 | 17 | 5 | 0 | +-------+--------+----------------+------+ | 18 | 22 | 5 | 0 | +-------+--------+----------------+------+ | 19 | 4 | 5 | 0 | +-------+--------+----------------+------+ | 20 | 8 | 4 | 16 |
+-------+--------+----------------+------+ | 21 | 13 | 5 | 0 | +-------+--------+----------------+------+ | 22 | 19 | 5 | 0 | +-------+--------+----------------+------+ | 23 | 1 | 5 | 0 | +-------+--------+----------------+------+ | 24 | 6 | 4 | 16 | +-------+--------+----------------+------+ | 25 | 10 | 5 | 0 | +-------+--------+----------------+------+ | 26 | 16 | 5 | 0 | +-------+--------+----------------+------+ | 27 | 28 | 5 | 0 | +-------+--------+----------------+------+ | 28 | 27 | 5 | 0 | +-------+--------+----------------+------+ | 29 | 26 | 5 | 0 | +-------+--------+----------------+------+ | 30 | 25 | 5 | 0 | +-------+--------+----------------+------+ | 31 | 24 | 5 | 0 | +-------+--------+----------------+------+
+-------+--------+----------------+------+ | 21 | 13 | 5 | 0 | +-------+--------+----------------+------+ | 22 | 19 | 5 | 0 | +-------+--------+----------------+------+ | 23 | 1 | 5 | 0 | +-------+--------+----------------+------+ | 24 | 6 | 4 | 16 | +-------+--------+----------------+------+ | 25 | 10 | 5 | 0 | +-------+--------+----------------+------+ | 26 | 16 | 5 | 0 | +-------+--------+----------------+------+ | 27 | 28 | 5 | 0 | +-------+--------+----------------+------+ | 28 | 27 | 5 | 0 | +-------+--------+----------------+------+ | 29 | 26 | 5 | 0 | +-------+--------+----------------+------+ | 30 | 25 | 5 | 0 | +-------+--------+----------------+------+ | 31 | 24 | 5 | 0 | +-------+--------+----------------+------+
Acknowledgments
致谢
zstd was developed by Yann Collet.
zstd由Yann Collet开发。
Bobo Bose-Kolanu, Felix Handte, Kyle Nekritz, Nick Terrell, and David Schleimer provided helpful feedback during the development of this document.
Bobo Bose Kolanu、Felix Handte、Kyle Nekritz、Nick Terrell和David Schleimer在本文档的开发过程中提供了有用的反馈。
Authors' Addresses
作者地址
Yann Collet Facebook 1 Hacker Way Menlo Park, CA 94025 United States of America
美国加利福尼亚州门罗公园Yann Collet Facebook 1 Hacker Way Menlo Park 94025
Email: cyan@fb.com
Email: cyan@fb.com
Murray S. Kucherawy (editor) Facebook 1 Hacker Way Menlo Park, CA 94025 United States of America
Murray S.Kucherawy(编辑)Facebook 1 Hacker Way Menlo Park,加利福尼亚州,美国94025
Email: msk@fb.com
Email: msk@fb.com