1 Block diagram of the cryptographic transformation algorithm 1

2 Easy replacement mode 4

3 Gamma mode 8

4 Gamma mode with feedback 11

5 Simulation insert generation mode 14

Appendix 1 Terms used in this standard and their definitions 16

Appendix 2 Values of constants C1, C2 18

Appendix 3 Schemes software implementation cryptographic algorithm

transformations. 19

Appendix 4 Rules for summation modulo 2 32 and modulo (2 32 -I) 25

STATE STANDARD

USSR UNION

INFORMATION PROCESSING SYSTEMS. CRYPTOGRAPHIC SECURITY

Cryptographic conversion algorithm

Date of introduction 07/01/90

This standard establishes a unified cryptographic transformation algorithm for information processing systems in networks of electronic computers (computers), individual computing systems and computers, which defines the rules for data encryption and the development of imitations.

The cryptographic transformation algorithm is designed for hardware or software implementation, satisfies cryptographic requirements and, in its capabilities, does not impose restrictions on the degree of secrecy of the protected information.

The standard is mandatory for organizations, enterprises and institutions that use cryptographic protection data stored and transmitted in computer networks, in separate computer systems or in computers.

The terms used in this standard and their definitions are given in Appendix 1.

I. BLOCK DIAGRAM OF THE CRYPTOGRAPHIC TRANSFORMATION ALGORITHM

1.1. The block diagram of the cryptographic transformation algorithm (crypto-scheme) contains (see Figure 1):

Official publication ★

a 256-bit key storage device (KSD), consisting of eight 32-bit drives (X0, Xt. X2, A3 L4, X$, X6, Xu); four 32-bit drives (/V (, N 2, Nj, /V 4);

Reproduction is prohibited

two 32-bit drives L/$,) with permanent fillings C 2, C\\ recorded in them

two 32-bit modulo 2 32 adders (SM|, SL/3);

32-bit adder of bitwise summation modulo 2 (SL/2);

32-bit modulo adder (2 32 - 1) (SL/ 4);

modulo 2(SL/5) adder, there is no restriction on the capacity of the SL/$ adder;

substitution block (A);

cyclic shift register eleven steps towards the most significant digit (R).

1.2. The substitution block A" consists of eight substitution nodes A'j,

A 2, A“z, K 4, A5, A7, A 8 with 64-bit memory each. Post

A 32-bit vector applied to a substitution block is split into eight sequential 4-bit vectors, each of which is converted into a 4-bit vector by the corresponding substitution node, which is a table of sixteen rows containing four padding bits per row. The input vector determines the address of a row in the table, the filling of this row is the output vector. The 4-bit output vectors are then concatenated sequentially into a 32-bit vector.

1.3. When adding and cyclically shifting binary vectors, the most significant bits are considered to be the bits of the storage devices with large numbers.

1.4. When writing the key (I", W 2 ..., W q e (0,1), d = N256, in

The RCU value W\ is entered into the i-th digit of the drive Xq, the value W 2 is entered into the 2nd digit of the drive L#, ..., the value W^ 2 is entered into the 32nd digit of the drive Xq; the value W33 is entered into the 1st digit of the drive X\ y, the value is entered into the 2nd digit of the drive X\ y..., the value W M is entered into the 32nd digit of the drive X\\ the value W 6 5 is entered into the 1st digit of the drive X 2 etc., the value 1U 2 5b is entered into the 32nd bit of the Xy drive.

1.5. When rewriting information, the contents of the p-th digit of one drive (adder) are rewritten into the p-th digit of another drive (adder).

1.6. The values of constant fillings Cj, C 2 (constants) of storage devices /V 6, /V5 are given in Appendix 2.

1.7. The keys that determine the filling of the KZU and tables of the substitution block K are secret elements and are supplied in the prescribed manner.

Filling the tables of the substitution block K is a long-term key element common to the computer network.

Organization various types communication is achieved by building an appropriate key system. In this case, the possibility of generating keys (filling the KZU) in a simple replacement mode and encrypting them in a simple replacement mode while providing imitational protection for transmission over communication channels or storage in computer memory can be used.

1.8. The crypto scheme provides for four types of work: encryption (decryption) of data in simple replacement mode; encryption (decryption) of data in gamma mode;

encryption (decryption) of data in gamma mode with feedback;

simulation insert generation mode.

Schemes for software implementation of the cryptographic transformation algorithm are given in Appendix 3.

2. EASY CHANGE MODE

2.1. Encrypting clear data using simple replacement mode

2.1.1. The cryptoscheme that implements the encryption algorithm in simple replacement mode should have the form shown in Figure 2.

Open data to be encrypted is divided into blocks of 64 bits each. Input of any block T () = (D|(0), ^(O), ..., d 3 1(0), i 32 (0), £|(0), b 2 (0) y.. . , Z> 32 (0)) of binary information into drives N\ and N 2 is carried out so that the value D|(0) is entered into the 1st bit of N|, the value a 2 (0) is entered into the 2nd bit / Vj, etc., the value i 32 (0) is entered into the 32nd digit iVj; value />|(0) is entered into

1st digit L/ 2, the value b 2 (0) is entered into the 2nd digit N 2, etc., the value >> 32 (0) is entered into the 32nd digit N 2. As a result, the state is obtained (i 32 (0), i 3 |(0), ..., and 2 (0) y<7|(0)) накопителя yVj и состояние (/>32 (0), b 2 1(0), ... , />|(0)) storage unit N 2.

2.1.2. 256 bits of the key are entered into the RCU. The contents of eight 32-bit drives Aq, X\ t ..., Xj have the form:

^0 = (^32^3.....

*1 =(^64^63, . ^34^33)

*7 = (^56> ^255. ... , I/ 226, ^ 225)

2.1.3. The encryption algorithm for a 64-bit block of plain data in simple replacement mode consists of 32 cycles.

In the first cycle, the initial filling of the storage is summed modulo 2 32 in the adder CM\ with the filling of the storage Xq, while the filling of the storage Nj is maintained.

The result of the summation is converted in the substitution block K and the resulting vector is sent to the input of the register /?, where it is cyclically shifted by eleven steps towards the most significant bits. The result of the shift is summed bit by bit modulo 2 in the adder CM 2 with 32-bit filling of the drive yV 2 . The result obtained in CM 2 is written in N\%, with the old filling N| rewritten to N 2. The first cycle ends.

Subsequent cycles are carried out similarly, with

In the 2nd cycle, the filling X\ is read from the RCU, in the 3rd cycle from the RCU

the filling X2 is read, etc., in the 8th cycle the filling Xj is read from the RCU. In cycles from 9 to 16, as well as in cycles from 17 to 24, fillings from the RCU are read in the same order:

In the last eight cycles from the 25th to the 32nd, the order of reading the RCD fills is reversed:

hell, hell, hell, hell.

Thus, when encrypting in 32 cycles, the following order of selecting storage fills is carried out:

hell, ^2,^),^4>^5,^6"^7, hell, ^2,^3"^4,^5,-^6,^7, hell, hell, hell, hell, hell ,hell,hell,hell.

In the 32nd cycle, the result from the adder SL/ 2 is entered into the drive CU 2, and the old filling is stored in the drive N\.

Received after the 32nd cycle of encryption of filling the N| and N2 are the encrypted data block corresponding to the plain data block.

2.1 4 Encryption equations in simple replacement mode have the form:

J*Cr> "(

I b(/) = a(/~ I)

at y = I -24;

\bO) - a O - O at / 8* 25 -g 31; a(32) = a(31)

A (32) = (d (31) ffl X 0)KRG> b (31)

where d(0) = (a 32 (0), «з|(0), ... , Д|(0)) is the initial filling of N\ before the first encryption cycle;

6(0) = (632(0), 63j(0), ... , 6j(0)) - initial filling /U 2 before the first encryption cycle;

a(j) = (032(7), 0з|(/) e... , 0|(/)) - filling of the control unit, after the y-th encryption cycle;

b(j) = (6з 2 (/), 63j(/"), ... , 6|(/)) - filling /V 2 after the yth encryption cycle, y = 032.

The sign φ means the bitwise summation of 32-bit vectors modulo 2.

The sign Ш means the summation of 32-bit vectors modulo 2 32. The rules for summation modulo 2 32 are given in Appendix 4;

/? - the operation of a cyclic shift of eleven steps towards the highest digits, i.e.

^(g 32"O|>g 30>g 29>g 28>g 27>g 26"g 25>g 24>g 23'G 22"G 2b G 20>"g 2* g |)~

= (g 21" g 20> - "g 2* g 1 * G 32>G31 *GzO" g 29* g 28*, 27e"26e/"25>, 24>G23", 22)*

2.1.5. The 64-bit block of encrypted data Tsh is output from drives L^, VU 2 in the following order: from the 1st, 2nd, ..., 32nd bits of drive L7|, then from the 1st, 2nd , ... , 32nd bit of storage W 2, i.e.

t w - (a,<32),0 2 (32),0 32 (32), 6,(32), 6 2 <32),6 32 <32».

The remaining blocks of open data in simple replacement mode are encrypted in the same way.

2.2. Decrypting encrypted data using easy replacement mode

2.2.1. The cryptoscheme that implements the decryption algorithm in simple replacement mode has the same form (see Chsrt.2) as for encryption. 256 bits of the same key used for encryption are entered into the RCU. The encrypted data to be decrypted is divided into blocks of 64 bits each. Enter any block

T w - (0,(32),o 2 (32), ..., 0 32 (32), 6,(32), 6 2 (32), ..., 6 32 (32))

into accumulators L', and N 2 are produced so that the value dj(32) is entered into the 1st digit /V, the value o 2 (32) is entered into the 2nd digit /V, etc., the value a 32 (32) is entered into the 32nd digit /V,; the value 6,(32) is entered into the 1st digit of N2, etc., the value 6 32 (32) is entered into the 32nd digit of N2.

2.2.2. Decryption is carried out according to the same algorithm as the encryption of open data, with the difference that the fillings of the drives Xq, X\y..., Xj are read from the RCU in decryption cycles in the following order:

hell, hell 3, hell, hell, hell, hell, hell, hell 0,

hell 6, hell 4, hell 2, hell, hell, hell, hell 2, hell.

2.2.3. The decryption equations look like:

G d (32 - /) = (d (32 - / + 1) SHLG,.,) *LF6(32 - / + 1) b (32 - /) = d (32 - / + 1) at, / = 1+8;

I o(32- /) = (a(32-/M)SHDG (32 _ /)(tod8))KLFL(32./M) |6(32-/) = d (32 - / + 1)

at /= 9 + 31;

b(0) = (a (1) ШДГо) ОФй(1)

2.2.4. The W and N 2 drives obtained after 32 cycles of operation constitute a block of open data.

Then = (fli(O), a 2 (0), ... , Az 2 (0)" 6, (0), 6 2 (0), ... , 6 32 (0)), corresponding to the block of encrypted data, while the value o,(0) of block 7о corresponds to the contents of the 1st bit yV, the value 02(0) corresponds to

P. 8 GOST 28147-89

corresponds to the contents of the 2nd bit N\, etc., the value Dz2(0) corresponds to the contents of the 32nd bit N\; the value 6j(0) corresponds to the contents of the 1st bit; the value ^(0) corresponds to the contents of the 2nd bit N2, etc., the value £зг(0) corresponds to the contents of the 32nd bit N2-

The remaining blocks of encrypted data are decrypted in the same way.

2.3. The encryption algorithm in the mode of simple replacement of the 64-bit block G 0 is denoted by A y i.e.

A (T 0) = A (a (0), b (0)) = (a (32), b (32)) = T w.

2.4. The simple replacement mode can be used to encrypt (decrypt) data only in the cases given in clause 1.7.

3. GAMING MODE

3.1. Encrypting open data in gamma mode

3.1.1. The cryptoscheme that implements the encryption algorithm in gamma mode has the form shown in Figure 3.

Open data, divided into 64-bit blocks T\)\ 7), 2) ..., 7)) m “, 1 7[) M), is encrypted in gamma mode by bitwise summation modulo 2 in the adder SL/5 with the cipher gamma Гш, which is produced in blocks of 64 bits, i.e.

G _/L1) R2) Lm-1) LM)\

"ill V 1 w e * w * » " Sh » " * * * " 111 /»

where M is determined by the volume of encrypted data.

Tjj) - Yth 64-bit block, /“ the number of binary digits in block 7J) M) can be less than 64, while the part of the cipher gamut not used for encryption from block Г\^] is discarded.

3.1.2. 256 bits of the key are entered into the RCU. A 64-bit binary sequence (sync message) S = (5*1, S 2, ..., 5^4) is introduced into the drives iVj, N 2, which is the initial filling of these drives for the subsequent generation of M-blocks of the cipher gamma. The sync message is entered into jV| and L^so that the value 5[ is entered into the 1st digit of VU), the value S 2 is entered into the 2nd digit N\, etc., the value ^is entered into the 32nd digit 7V|; the value S33 is entered into the 1st digit N2, the value 4S34 is entered into the 2nd digit N2, etc., the value is entered into the 32nd digit N2.

3.1.3. The initial filling of drives /Vj and N 2 (sync message.5) is encrypted in simple replacement mode in accordance with

A well-known researcher, the founder of algebraic cryptanalysis, Nicolas Courtois, claims that the GOST block cipher, which was soon to become an international standard, has actually been cracked and expects many publications in the future that should develop his ideas about the instability of this algorithm.

The following are brief excerpts from this work, which can be considered as an alarmist attack in the midst of international standardization (the author was known for similar exaggerations regarding AES, but his work then had a great theoretical impact on cryptanalysis, but has not led to practical attacks on AES itself). But perhaps this is a real warning about the beginning of the “spin-diving airplane” stage, which could end with the collapse of the national encryption standard, as was the case with the SHA-1 hashing algorithm and the “de facto” MD5 hashing algorithm.

GOST 28147-89 was standardized in 1989 and became the first official standard for protecting confidential information, but the cipher specification remained closed. In 1994, the standard was declassified, published and translated into English. By analogy with AES (and unlike DES), GOST is allowed to protect classified information without restrictions, in accordance with how it is specified in the Russian standard. That. GOST is not an analogue of DES, but a competitor to 3-DES with three independent keys or AES-256. It is obvious that GOST is a fairly serious cipher that meets military criteria, created with the most serious applications in mind. At least two sets of GOST S-boxes have been identified based on applications used by Russian banks. These banks need to conduct secret communications with hundreds of branches and protect billions of dollars from fraudulent thefts.

GOST is a block cipher with a simple Feistel structure, with a block size of 64 bits, a 256-bit key and 32 rounds. Each round contains a modulo 2^32 key addition, a set of eight 4-bit S-boxes, and a simple 11-bit round shift. A feature of GOST is the ability to store S-blocks secretly, which can be thought of as a second key, increasing the effective key material to 610 bits. One set of S-boxes was published in 1994 as part of the GOST-R 34.11-94 hash function specification and, as Schneier wrote, was used by the Central Bank of the Russian Federation. It was also included in the RFC4357 standard in the part "id-GostR3411-94-CryptoProParamSet". There was an error in the source codes at the end of Schneier's book (in the S-box order). The most accurate reference implementation of original Russian origin can now be found in the OpenSSL library. If secret S-boxes are used somewhere, then they can be extracted from software implementations and from microcircuits, on which the corresponding works have been published.

GOST is a serious competitor

In addition to its very large key size, GOST has a significantly lower execution cost compared to AES and other similar encryption systems. In fact, it costs much less than AES, which requires four times as many hardware logic gates for a significantly lower stated level of security.

It is not surprising that GOST has become an Internet standard, in particular, it is included in many crypto libraries such as OpenSSL and Crypto++, and is becoming increasingly popular outside its country of origin. In 2010, GOST was submitted to ISO standardization as a worldwide encryption standard. An extremely small number of algorithms have managed to become international standards. ISO/IEC 18033-3:2010 describes the following algorithms: four 64-bit ciphers - TDEA, MISTY1, CAST-128, HIGHT - and three 128-bit ciphers - AES, Camellia, SEED. GOST is proposed to be added to the same ISO/IEC 18033-3 standard.

For the first time in the history of industrial standardization, we are dealing with such a competitive algorithm in terms of optimality between cost and security level. GOST has 20 years of cryptanalysis attempts behind it, and until recently its military-grade security was not questioned.

As the author recently learned through private correspondence, the majority of countries voted against GOST at the ISO vote in Singapore, but the results of this vote will still be considered at the ISO SC27 plenary meeting, so GOST is still in the process of standardization at the time of publication of this work.

Expert opinions on GOST

None of the available information and literature regarding GOST gives any reason to believe that the cipher may be unsafe. On the contrary, the large key size and large number of rounds make GOST, at first glance, suitable for decades of use.

Anyone familiar with Moore's Law understands that, in theory, 256-bit keys will remain secure for at least 200 years. GOST has been widely studied by leading cryptography experts known in the field of block cipher analysis, such as Schneier, Biryukov, Dunkelman, Wagner, many Australian, Japanese and Russian scientists, ISO cryptography experts, and all researchers have expressed that it looks like this that it can be or should be safe. Although it is widely understood that the GOST structure itself is extremely weak, for example, compared to DES, in particular, diffusion is not so good, but this has always been due to the fact that this must be compensated by a large number of rounds (32), as well as additional nonlinearity and diffusion provided by modulo addition.

Biryukov and Wagner wrote: "The large number of rounds (32) and the well-studied Feistel construction, combined with successive Shannon permutations, provide a solid basis for GOST security." In the same paper we read: "after considerable investment of time and effort, no progress has been made in the cryptanalysis standard in the open literature." Thus, there were no significant attacks that would allow decryption or recovery of the key in a realistic scenario where GOST is used in encryption with many different random keys. In contrast, there are a lot of works on attacks on weak keys in GOST, attacks with linked keys, attacks on the recovery of secret S-boxes. At Crypto 2008, a hack of a hash function based on this cipher was presented. In all attacks, the attacker has a significantly greater level of freedom than he is normally allowed. In traditional applications of encryption using randomly selected keys, no serious cryptographic attacks on GOST have been found to date, which in 2010 was expressed by the summary phrase: “despite the significant efforts of cryptanalysts over the past 20 years, GOST has not yet been cracked” (Axel Poschmann , San Ling, and Huaxiong Wang: 256 Bit Standardized Crypto for 650 GE GOST Revisited, In CHES 2010, LNCS 6225, pp. 219-233, 2010).

Linear and differential analysis GOST

In Schneier's well-known book we read: "Against differential and linear cryptanalysis, GOST is probably more robust than DES." The main assessment of the security of GOST was given in 2000 by Gabidulin et al. Their results are very impressive: with the intended security level of 2^256, five rounds are enough to protect GOST from linear cryptanalysis. Moreover, even when replacing S-blocks with identical ones and the only non-linear operation of the cipher - addition modulo 2^32 - the cipher is still resistant to linear cryptanalysis after 6 rounds out of 32. GOST differential cryptanalysis looks comparatively easier and attracts more attention. For the 2^128 security level, researchers (Vitaly V. Shorin, Vadim V. Jelezniakov and Ernst M. Gabidulin: Linear and Differential Cryptanalysis of Russian GOST, Preprint submitted to Elsevier Preprint, April 4, 2001) assumed sufficient resistance at the level of 7 rounds. According to them, hacking GOST in more than five rounds is “extremely difficult.” Moreover, two Japanese researchers showed that a classic forward differential attack with one differential characteristic has an extremely low probability of surviving through a large number of rounds. Based on the fact of studying a fairly “good” iterative differential characteristic for a limited number of rounds (which itself has a probability of passing no better than 2-11.4 per round), the values of the set of suitable keys are less than half. For a full-round GOST, such an attack with a single characteristic will only work with a negligible part of the keys on the order of 2-62 (and even in this small part it will have a probability of passing no more than 2-360).

More complex attacks involve multiple differentials following specific patterns, such as using individual S-boxes that have zero differentials while other bits have non-zero ones. We are talking about discriminating attacks based on the poor diffusion properties of GOST. The best of these attacks works against 17 rounds of GOST, depends on the key and has an extremely weak discriminator from random data at the output, so that it can somehow be used to obtain information about the key.

Sliding and reflecting attacks

According to Biryukov and Wagner, GOST's structure, which includes reversing the order of subkeys in the final rounds, makes it resistant to sliding attacks (aka "slide attacks"). However, due to the presence of a large amount of self-similarity in the cipher, it allows key inversion attacks on combinations of fixed points and "reflection" properties (aka "reflective attacks") for certain weak keys. The complexity of this attack is 2^192 and 2^32 matched plaintexts.

Latest results

New attacks also use reflection and actually broke GOST, which was presented at the FSE 2011 conference. These attacks were also discovered independently by the author of this work. The attack requires 2^132 bytes of memory, which is actually worse than slower attacks with lower memory requirements.

Many new attacks based on self-similarity work against all GOST keys and allow you to crack a full-round GOST with a 256-bit key, and not only for weak keys, as was previously the case. All of these attacks require significantly less memory and are significantly faster.

These new attacks can be seen as examples of a new general paradigm for the cryptanalysis of block ciphers called "algebraic complexity reduction", generalizing these attacks to many special cases of attacks with known fixed points, slips, involutions and cycles. It is important that among the family of all these attacks there are those that allow GOST cryptanalysis without any reflections and without any symmetrical points that appear during calculations. One example is a simple GOST hacking attack without reflection in this work.

Algebraic cryptanalysis and low-complexity attacks on reduced-round ciphers

Algebraic attacks on block and stream ciphers can be represented as the problem of solving a large system of Boolean algebraic equations that follows the geometry and structure of a proprietary cryptographic scheme. The idea itself goes back to Shannon. In practice, it was introduced for DES (first introduced by the author of this work) as a formal encoding method and can crack 6 rounds on just one known plaintext. Manipulation with equations occurs on the basis of XL algorithms, Gröbner bases, the ElimLin method, and SAT solvers.

In practice, algebraic attacks have been implemented against a very small number of rounds of block ciphers, but have already led to the breaking of stream ciphers, and there have also been successes in breaking ultra-lightweight ciphers for microhardware. Due to difficulties in memory space and computational cost estimates, they are combined with other attacks.

How to hack GOST?

An algebraic attack on the full-round GOST is presented in more detail in the publication under review. In a previous work, the author has already outlined 20 algebraic attacks on GOST and expects a large number of them in the near future. The attack proposed in this work is not the best of them, but it opens a simple (at least for cryptographers to understand) path for subsequent developments to create a specific methodology for breaking GOST.

The practical result is still modest: 2^64 known plaintexts and 2^64 memory for storing plaintext/ciphertext pairs make it possible to crack GOST 2^8 faster than simple brute force. But in terms of cryptanalysis, this makes the statement that “GOST has been hacked” completely true.

conclusions

GOST is designed to ensure a military level of safety for 200 years to come. Most of the leading experts who studied GOST agreed that "despite significant cryptanalytic efforts for 20 years, GOST has not yet been cracked." In 2010, GOST was promoted to ISO 18033 as a global encryption standard.

The basis of ideas about algebraic cryptanalysis arose more than 60 years ago. But it is only in the last 10 years that efficient software tools have been developed to (partially) solve a variety of NP-complete problems. A number of stream ciphers have been broken. Only one block cipher has been broken by this method - the weak KeeLoq itself. In this work, the author cracks an important, actually used GOST cipher. He notes that this is the first time in history that a standardized government cipher has been broken by algebraic cryptanalysis.

A simple "MITM with reflection" attack on GOST has already been presented at the FSE 2011 conference. In the work discussed in this article, another attack is presented only to illustrate the fact of how many attacks on GOST have already appeared now, many of which are faster, and an algebraic attack requires much less memory and opens up a virtually inexhaustible space of possibilities for an adversary to attack the cipher in different ways. This work also shows that there is no need to use the reflection property for hacking.

The author states: it is obvious that GOST has serious flaws and no longer provides the level of resistance required by ISO. Many attacks on GOST are presented as part of the confirmation of the algebraic complexity reduction paradigm.

Finally, the researcher especially notes some facts that are not yet available to the reader, but are known to the researcher, which are important during the ISO standardization process. This attack is not just a “certification” attack on GOST, which is faster than brute force. In fact, standardizing GOST now would be extremely dangerous and irresponsible. This is so because some of the attacks are feasible in practice. Some GOST keys can even be decrypted in practice, be they weak keys or keys from particular real-life applications of GOST. In a previous work, the author provides a detailed examination of cases where practical attacks are possible. What's also important is that "this is the first time in history that a serious, standardized block cipher, designed to protect military-grade secrets and designed to protect government secrets for governments, major banks and other organizations, has been broken by a mathematical attack."

). At the same time, in the Russian media and blogs of Russian users, the number of notes about this algorithm is growing: both covering the results of attacks on the Russian standard with varying degrees of reliability, and containing opinions about its operational characteristics. The authors (and, consequently, readers) of these notes often get the impression that the domestic encryption algorithm is outdated, slow and has vulnerabilities that make it significantly more susceptible to attacks than foreign encryption algorithms with a similar key length. With this series of notes we would like to tell you in an accessible form about the current state of affairs with the Russian standard. The first part will cover all attacks on GOST 28147-89 known to the international cryptographic community and current assessments of its strength. In future publications, we will also take a closer look at the properties of the standard from the point of view of the ability to build effective implementations.

Nicolas Courtois - "the great and terrible"

Let's start with a story about the activities of Nicolas Courtois, who is the author of a whole series of works devoted to the Russian block encryption standard ().

In October 2010, the process of considering the inclusion of the GOST 28147-89 algorithm in the international standard ISO/IEC 18033-3 began. Already in May 2011, an article by the famous cryptographer Nicolas Courtois appeared on the ePrint electronic archive, marked by a very ambiguous attitude towards him from the world cryptographic community. Courtois's publications are a sad example of the manipulation of concepts, which does not reveal any new properties of the object in question, but with a claim to sensation provokes the spread of erroneous opinions about its actual properties in an incompetent environment.

Algebraic method

Courtois's reasoning is built around two classes of cryptanalysis methods: algebraic methods and differential ones. Let's look at the first class of methods.

In a simplified way, the method of algebraic cryptanalysis can be described as the compilation and solution of a large system of equations, each of the solutions of which corresponds to the goal of the cryptanalyst (for example, if a system is compiled using one pair of plaintext and ciphertext, then all solutions of this system correspond to the keys at which this plaintext is converted into this one is encrypted). That is, in the case of the problem of cryptanalysis of a block cipher, the essence of the algebraic method of cryptanalysis is that the key is found as a result of solving a system of polynomial equations. The main difficulty is to be able to create as simple a system as possible, taking into account the characteristics of a particular cipher, so that the process of solving it takes as little time as possible. Here the key role is played by the features of each specific cipher being analyzed.

The algebraic method used by Courtois can be briefly described as follows. At the first stage, such properties of GOST 28147-89 are used as the existence of a fixed point for part of the encryption transformation, as well as the so-called reflection point. Thanks to these properties, several pairs are selected from a sufficiently large number of plaintext-ciphertext pairs, which make it possible to consider transformations not at 32, but only at 8 rounds. The second stage is that, based on the results of 8 round transformations obtained at the first stage, a system of nonlinear equations is constructed, in which the key bits are unknowns. Next, this system is solved (this sounds simple, but in reality is the most time-consuming part of the method, since the system consists of non-linear equations).

As noted above, nowhere in the work is there a detailed description and analysis of the complexity of the second and main stage of determining the key. It is the complexity of the second stage that determines the complexity of the entire method as a whole. Instead, the author provides the notorious “facts” on the basis of which he makes estimates of labor intensity. These "facts" are said to be based on experimental results. An analysis of the “facts” from Courtois’s work as a whole is given in the work of domestic authors. The authors of this work note that many of Courtois’ “facts” presented without any evidence were found to be false during experimental testing. The authors of the article went further and, following Courtois, carried out an analysis of the complexity of the second stage using well-founded algorithms and estimates. The resulting complexity estimates show the complete inapplicability of the presented attack. In addition to domestic authors, the big problems that Courtois has with assessments and justification of his methods were also noted, for example, in the work.

Differential method

Let's consider the second Courtois method, which is based on differential cryptanalysis.

The general method of differential cryptanalysis is based on the exploitation of the properties of nonlinear mappings used in cryptographic primitives, associated with the influence of the key value on the dependencies between the differences between pairs of input and pairs of output values of these mappings. Let us describe the main idea of the differential method of cryptographic analysis of a block cipher. Typically, block ciphers transform the input data in stages using a number of so-called round transformations, and each round transformation does not use the entire key, but only some part of it. Let's consider a slightly “truncated” cipher, which differs from the original one in that it does not have the last round. Let us assume that it has been possible to establish that encrypting two plaintexts that differ in some fixed positions using such a “truncated” cipher will most likely result in ciphertexts that also differ in some fixed positions. This property shows that a “truncated” cipher most likely leaves a dependency between some plaintexts and the results of their encryption. In order to recover part of the key using this obvious flaw, it is necessary to be able to encrypt pre-selected plaintexts on the key we want to recover (the so-called “chosen plaintext attack”). At the beginning of the “key opening” procedure, a number of pairs of plaintexts are randomly generated, differing in those same fixed positions. All texts are encrypted using a “full” cipher. The resulting ciphertext pairs are used to recover those key bits that were used in the last round transformation, as follows. Using some randomly selected value of the desired key bits, a transformation inverse to the last round transformation is applied to all ciphertexts. In fact, if we guessed the desired value of the key bits, we will get the result of a “truncated” cipher, and if we didn’t guess, we will actually “encrypt the data even more,” which will only reduce the dependence between the blocks noted above (the difference is in some fixed positions). In other words, if among the results of such “additional processing” of ciphertexts there were quite a lot of pairs that differ in the fixed positions known to us, then this means that we have guessed the required key bits. Otherwise, there will be significantly fewer such pairs. Since only part of the key is used in each round, the searched bits (that is, the key bits used in the last round) are not as numerous as the bits in the full key and can simply be iterated over by repeating the steps above. In this case, we will definitely stumble upon the correct meaning someday.

From the above description it follows that the most important thing in the differential analysis method is the numbers of those very positions in plaintexts and ciphertexts, the differences in which play a key role in recovering the key bits. The fundamental presence of these positions, as well as the set of their numbers, directly depends on the properties of those nonlinear transformations that are used in any block cipher (usually all the “nonlinearity” is concentrated in the so-called S-boxes or replacement nodes).

Courtois uses a slightly modified version of the differential method. Let us immediately note that Courtois conducts his analysis for S-boxes that are different from the current ones and from those proposed in ISO. The work provides differential characteristics (those numbers in which the blocks should differ) for a small number of rounds. The justification for extending the characteristics for more rounds is, as usual, based on “facts”. Courtois expresses, again, with nothing other than his authority, an unsupported assumption that changing the S-boxes will not affect the resistance of GOST 28147-89 against his attack (while for unknown reasons, the S-boxes from the 1st working draft of the addition to the standard ISO/IEC 18033-3 was not considered). The analysis carried out by the authors of the article shows that even if we take Courtois’s unfounded “facts” on faith and analyze GOST 28147-89 with other S-blocks, the attack again turns out to be no better than a complete search.

A detailed analysis of Courtois’s works with a detailed substantiation of the groundlessness of all statements about the decrease in the resistance of the Russian standard was carried out in the works [,].

At the same time, even Courtois himself admits the absolute lack of accuracy in the calculations! The following slide is taken from Courtois' presentation at the FSE 2012 short notice section.

It should be noted that Courtois’s work has also been repeatedly criticized by foreign researchers. For example, his work on constructing attacks on the AES block encryption algorithm using the XSL method contained the same fundamental flaws as the work on the analysis of the Russian standard: most of the labor intensity estimates appear in the text completely unfounded and unsubstantiated - detailed criticism can be found, for example, in work. In addition, Courtois himself admits to widespread refusals to publish his work at major cryptography conferences and in established peer-reviewed journals, often leaving him only the opportunity to speak in the short announcement section. For example, you can read about this in section 3 of the work. Here are some quotes from Courtois himself relating to his work:

“I think that the audiences of Asiacrypt will not feel it is interesting.” Asiacrypt 2011 reviewer.
“… there is a big, big, big problem: this attack, which is the main contribution of the paper has already been published at FSE’11 (it was even the best paper), …”. Reviewer for Crypto 2011.

Thus, the professional part of the international cryptographic community regards the quality of Courtois’s work with no less doubt than, say, the statements of some Russian specialists about their ability to crack AES for 2,100, which are not confirmed by any consistent calculations, or the latest “evidence” of a two-page hypothesis on the inequality of complexity classes P and NP.

Attacks by Isobe and Dinur-Dankelman-Shamir

The general idea of the Isobe () and Dinur-Dankelman-Shamir attacks (hereinafter: DDS attack) () is to construct for a certain (key-dependent) narrow set of plaintexts an equivalent transformation on this set, which has a structure simpler than the encryption transformation itself . In the case of the Isobe method, this is the set of 64-bit blocks x such that F 8 -1 (Swap(F 8 (z))) = z, where z = F 16 (x), through F 8 (x) and F 16 ( x) indicate the first 8 and first 16 rounds of GOST 28147-89 encryption, respectively, through Swap - the operation of swapping halves of a 64-byte word. If the plaintext is included in this set, the result of the full 32-round transformation of GOST 28147-89 coincides with the result of the 16-round one, which is what the author of the attack exploits. In the case of the DDS method, this is the set of x such that F 8 (x) = x (fixed point of the transformation F 8). For any plaintext from this set, the GOST 28147-89 transformation works exactly the same as its last 8 rounds, which simplifies the analysis.

The complexity of the Isobe attack is 2,224 encryption operations, the DDS attack is 2,192. However, all questions about whether the Isobe and DDS attacks introduce new restrictions on the conditions for using our algorithm are removed by assessing the requirements for the volume of material required to carry out each of the attacks: the Isobe method requires 2 32 pairs of plaintext and ciphertext, and for the DDS method - 2 64. Processing such volumes of material without changing the key is a priori unacceptable for any block cipher with a block length of 64: on material of volume 2 32 , taking into account the problem of birthdays (see, for example, ), the probability of occurrence of repeated blocks is close to 1/2, which will provide the attacker is able to draw certain conclusions about the plaintexts from the ciphertexts without determining the key. The presence of 2 64 pairs of plain and encrypted texts obtained on one key actually allows the enemy to carry out encryption and decryption operations without knowing this key at all. This is due to a purely combinatorial property: the enemy in this case has the entire encryption conversion table. This situation is absolutely unacceptable under any reasonable operating conditions. For example, in CryptoPro CSP there is a technical limitation on the volume of encrypted material (without key conversion) of 4 MB (see). Thus, a strict prohibition on using a key on material of such volume is inherent in any block cipher with a block length of 64 bits, and therefore, Isobe and DDS attacks in no way narrow the scope of use of the GOST 28147-89 algorithm while maintaining the maximum possible strength of 2,256.

Of course, it should be noted that researchers (Isobe and Dinur-Dankelman-Shamir) have shown that some properties of the GOST 28147-89 algorithm make it possible to find analysis paths that were not taken into account by the creators of the algorithm. The simple form of the key schedule, which significantly simplifies the task of constructing effective implementations, also allows for some rare cases of keys and plaintexts to construct simpler descriptions of the transformations produced by the algorithm.

The work demonstrates that this negative property of the algorithm can be easily eliminated with full preservation of operational characteristics, but, unfortunately, it is an integral part of the algorithm in its commonly used form.

Note that certain negligence in the estimates of average labor intensity is also present in the work of Dinur, Dunkelman and Shamir. Thus, when constructing an attack, due attention is not paid to the following point: for a significant proportion of keys, the set of plaintexts x, such that F 8 (x) = x, is empty: there may simply be no fixed points for 8 rounds of transformation. The existence of fixed points also depends on the choice of replacement nodes. Thus, the attack is only applicable for certain replacement nodes and keys.

It is also worth mentioning one more work with an attack on GOST 28147-89. In February 2012, an updated version of the article (dated November 2011) appeared on the ePrint electronic archive of the international cryptographic association, which contained a new attack on GOST 28147-89. The characteristics of the presented attack are as follows: the volume of material is 2 32 (like Isobe), and the labor intensity is 2 192 (like DDS). Thus, this attack improved the time-record DDS attack in terms of material volume from 2 64 to 2 32. We note separately that the authors honestly presented all the calculations with justification for the complexity and volume of the material. After 9 months, a fundamental error was found in the above calculations, and since November 2012, the updated version of the article in the electronic archive no longer contains any results regarding the domestic algorithm.

Attacks assuming the attacker knows "something" about the keys

Finally, we note that in the literature there is also a number of works (see, for example, and ) devoted to attacks on GOST 28147-89 in the so-called linked key model. This model basically contains the assumption that an attacker can gain access for analysis not just to pairs of open texts and encrypted using the desired key, but also to pairs of open and encrypted texts obtained using (also unknown) keys that differ from the desired one by known regular way (for example, in fixed bit positions). In this model, it is indeed possible to obtain interesting results about GOST 28147-89, but in this model it is possible to obtain no less strong results about, for example, the AES standard, which is most widely used in modern public networks (see, for example,). Note that the conditions for carrying out this type of attack arise when using a cipher in a certain protocol. It should be noted that results of this kind, although they are of undoubted academic interest from the point of view of studying the properties of cryptographic transformations, are actually not relevant to practice. For example, all cryptographic information protection tools certified by the FSB of Russia meet the strictest requirements for encryption key generation schemes (see, for example,). As indicated in the results of the analysis, if there are 18 associated keys and 2 10 pairs of plaintext and ciphertext blocks, the complexity of completely opening the private key, with a probability of success of 1-10 -4, is actually 2 26. However, if the above-mentioned requirements for the development of key material are met, the probability of finding such keys is 2 -4352, that is, 24096 times less than if you simply try to guess the secret key on the first try.

Works related to the model with linked keys also include work, which in 2010 caused a lot of noise in Russian electronic publications, which do not suffer from the habit of carefully checking the material in the race for sensations. The results presented in it were not supported by any rigorous justification, but contained loud statements about the ability to hack the state standard of the Russian Federation on a weak laptop in a matter of seconds - in general, the article was written in the best traditions of Nicolas Courtois. But, despite the absolutely obvious groundlessness of the article to the reader who is more or less familiar with the basic principles of scientific publications, it was precisely to reassure the Russian public after the work that Rudsky wrote a detailed and thorough text containing a comprehensive analysis of this shortcoming. The article with the self-explanatory title “On the zero practical significance of the work “Key recovery attack on full GOST block cipher with zero time and memory”” provides justification that the average complexity of the method given in is no less than the complexity of a complete search.

Dry residue: what is durability in practice?

In conclusion, we present a table containing data on all the results of strictly described and justified attacks on GOST 28147-89 known to the international cryptographic community. Note that the complexity is given in the encryption operations of the GOST 28147-89 algorithm, and the memory and material are indicated in algorithm blocks (64 bits = 8 bytes).

Attack	Labor intensity	Memory	Required material
Isobe	2 224	2 64	2 32
Dinur-Dankelman-Shamir, FP, 2DMitM	2 192	2 36	2 64
Dinur-Dankelman-Shamir, FP, low-memory	2 204	2 19	2 64
	2 224	2 36	2 32
Dinur-Dankelman-Shamir, Reflection, 2DMitM	2 236	2 19	2 32
Complete search	2 256	1	4
Number of nanoseconds since the creation of the Universe	2 89

Despite a fairly large-scale cycle of research in the field of resistance of the GOST 28147-89 algorithm, at the moment there is not a single attack known, the conditions for the implementation of which would be achievable with the accompanying operational requirements for a block length of 64 bits. The restrictions on the volume of material that can be processed on one key resulting from the cipher parameters (key bit length, block bit length) are significantly stricter than the minimum volume required to carry out any currently known attack. Consequently, when meeting existing operational requirements, none of the cryptanalysis methods proposed to date GOST 28147-89 allows determining a key with a labor intensity less than exhaustive search.

The popularly known term “processor performance” is an objective, calculated parameter that is measured in flops. However, most people measure it in gigahertz, naively believing that they are the same thing. Nobody knows the term “code performance”, and I’ll immediately explain why.

The reason is that I only recently came up with it and haven’t told anyone about it yet. However, code performance, like processor performance, has objective characteristics that can be measured. This article is specifically about the performance of code executed by the processor core.

How is code performance measured? Since I was the first to talk about this, by right of the discoverer I will measure it in RTT units;).

Now seriously. In modern processors, the main transformations are operations on 32-bit numbers; everything else is, by and large, exotic. Therefore, we will take into account the main thing - operations with 32-bit numbers. How many 32-bit operations do you think a modern processor core can perform simultaneously?

The student will answer - one, his teacher will think and say that there are four, the professional - that there are only twelve operations so far.

So, a program code that loads all processor execution units simultaneously throughout the entire execution time of the code will have a performance of 12 RTTs. Maximum! To be honest, I have never written such code before, but in this article I will try to make an effort.

I will prove that code with simultaneous execution of twelve 32-bit operations is possible

Program code that uses one actuator in the processor core will naturally have a performance of 1 RTT. Programs generated by high-level language compilers and virtual machine interpreters can boast of such code performance. There is no need to assume that the processor load indicator, which can be seen in the OS task manager, can serve as an objective criterion for the efficiency of the code. The processor core load can be 100%, but the program code will use one execution unit in it (performance 1 RTT). In this case, at 100% load, the processor core will operate at 1/12 of its maximum performance. In other words, when the Windows Task Manager shows the maximum processor load, its actual performance may vary from 1 to 12 RTT. If you see 100% load on any processor core in the performance window, it is wrong to assume that all actuators are working in this core, not at all!

The only criterion for indirectly assessing the operation of a processor core at maximum performance can be its power consumption and, as a consequence, the noise of the cooler. Now, if the cooler is noisy, then yes, the load has gone to maximum. However, it’s time to finish with general concepts and move on to harsh practice.

Traditional implementation of GOST 28147-89

I am not a professional in the field of information security, but I am still familiar with the topic of encryption. I was inspired to study symmetric stream encryption specifically by conversations with a professional cryptographer whom I deeply respect. And, having taken up this topic, I tried to do it well, and not just well, but also quickly, performing the maximum number of operations per unit of time. In other words, I was faced with the task of writing program code with the maximum RTT value.

Cryptographic transformation in accordance with GOST 28147-89 is used for stream encryption of information in communication channels and on disk drives.

Currently, software implementation of this GOST on the RON of the central processor is widely used. In known methods of implementing GOST, all secret information (encryption keys, replacement blocks) is located in RAM. This reduces the reliability of encryption, since having a RAM dump can completely reveal all the secret elements of the crypto-transformation. In addition, the method has performance limitations due to the location of the main crypto conversion objects in the OP and incomplete loading of the ALU executive units. Modern processors, implementing the crypto procedure using a well-known method, can provide encryption speeds of 40–60 megabytes per second. And if we really understand it to the end, the reason for the low performance and weak security of crypto conversion is the software implementation of the substitution block. For its description in GOST, see Fig. 1.

According to clause 1.2 of GOST, this block implements tetrad (four bits) permutations in a 32-bit word, but the x86/64 processor architecture and its instruction system are not capable of effectively manipulating tetrads.

For the software implementation of the substitution block, special tables in RAM are used, prepared at the stage of initialization of the cryptofunction. These tables combine the replacement nodes of adjacent tetrads into 8 × 8-bit byte tables, thus placing four 256-byte tables in RAM.

In more advanced implementations, these tables are 1024 bytes in size (256 words of four bytes). This was done in order to implement in the tables an additional cyclic shift by 11 positions of the 32-bit word obtained as a result of substitution (the next operation of the conversion algorithm according to GOST). An example of GOST implementation using this method is shown in Appendix 1 (on disk).

The information of the substitution block is a secret component of the cryptofunction (as formulated in GOST, see Fig. 2).

Placing these tables with the keys of the substitution block in the OP contradicts the requirements of GOST (clause 1.7), since secret information becomes available to third-party programs running on the computing installation. The FSB, which also certifies software implementations of encryption according to GOST, looks at this violation, to put it mildly, condescendingly. If, in order to place keys in the OP, the FSB still requires a “fig leaf” - masking the keys using the XOR operation, then nothing is required for replacement blocks in the OP; they are stored in clear form.

In short, the FSB allows such software implementations of cryptoprocedures to pass, despite the obvious decrease in the security of such a solution and a direct violation of its own requirements according to GOST (clause 1.7). And this despite the well-known methods of breaking ciphers by taking a memory dump...

We will return to the issue of storing keys and replacement blocks in internal registers of the processor a little later (there is a beautiful and fast solution), but for now we will only store encryption keys in MMX registers, this is more reliable.

But enough of the lyrics, what is important for the topic under consideration is that this program code has a performance of 1 RTT. Now let's write code with the performance of 2 RTTs.

Multithreaded implementation of GOST 28147-89

The only way to speed up crypto procedures in the known algorithm is to introduce multithreading. The point of this change in the implementation of the algorithm is to calculate several blocks of data in parallel.

Most programmers mean by parallel processing exclusively the work of several processor cores, synchronized through interrupts and semaphores in memory.

However, there is another option for parallel data processing on a single processor core. Let me explain this non-obvious idea.

Modern processors include at least two, and even three to six arithmetic-logical units. These ALUs (FPUs, address arithmetic units, and so on) can operate independently of each other; the only condition for their parallel operation is that the software objects they operate on are disjoint. In other words, in instructions that simultaneously execute an ALU, the memory addresses and register numbers must be different. Or, no writes should be performed to shared registers and memory addresses accessed by various processor execution units.

The workload of all ALUs is controlled by a special hardware unit inside the processor core - the scheduler, which scans the executable code forward, to a depth of 32–64 bytes. If the scheduler discovers instructions that can be run on the ALU without conflicts, then it runs them simultaneously on different execution devices. In this case, the counter of executed commands indicates the executing command (there are several of them in such a scheme), after which all commands have already been executed.

Most program sequences generated automatically (by compilers) cannot load all the ALUs and FPUs located in the processor core. In this case, the processor hardware is idle, which significantly reduces its resulting performance. Processor developers understand this and introduce modes to increase core frequency when the hardware is not fully utilized. This is also what hypertrading systems are designed for, and I will use this system to “press” the code to the maximum in the future.

Compilers, even the most optimized ones, and even more so virtual machine engines, cannot generate optimized code in terms of performance. Only a programmer with engineering knowledge can write such optimized code, and the tool for writing it is exclusively assembler.

A typical illustration of the possibility of executing several independent program threads on one processor core is the GOST implementation, executed in two threads on a single processor core. The idea of the code is simple: there are two blocks of data to encrypt/decrypt, but one processor core that will perform the conversion. It is possible to perform the conversion on these two data blocks sequentially, and this has been done to this day. In this case, the time required to complete the transformations doubles.

But you can do it differently: alternate commands related to the processing of different data blocks. These options are presented graphically in Fig. 3.

In the figure, the top example shows the usual order of processing two independent blocks of data. The first block is processed first, then the processor proceeds to process the second block. Naturally, the resulting time is equal to twice the time required to process one block, and the actuators of the processor core are not fully loaded.

The following is an example of interleaving commands from different processing threads. In this case, commands related to different data blocks are interleaved. The scheduler selects instructions that are independent of each other and transmits them for execution to ALU1 and ALU2. The grouping of commands of the first and second threads on these ALUs is carried out automatically, since the scheduler operating algorithm includes a grouping of commands linked to common data on the same executive device.

In order for such program code to work without ALU downtime, it is necessary that each program thread work with its own set of registers. The cache in this scheme becomes a bottleneck (it has only two data output ports), so we store the keys in MMX registers. Since in this case the replacement (and shift) nodes in memory are only read, they can be common to both program threads.

This, of course, is a very simplified explanation of the principle of parallel execution of program threads on a single core; in reality, everything is much more complicated. In practice, you need to take into account the pipeline architecture of actuators, restrictions on simultaneous access to the cache and the RON register block, the presence of address arithmetic nodes, switches and much more... So this is a topic for professionals, who can be counted on the fingers... of one hand.

The parallel encryption method is effectively implemented only for the 64-bit processor operating mode, since in this mode there is a sufficient number of RON (as many as 16 pieces!). An example of GOST implementation using this method is shown in Appendix 2 (on disk).

It is clear that this GOST implementation has a code performance of 2 RTT codes. Now let's see how this affects execution time.

The encryption cycle for one stream (Appendix 1) is 352 clock cycles, and during this time 8 bytes of data are calculated; for a two-threaded implementation of GOST (Appendix 2) 416 processor cycles are required, but 16 bytes are calculated. Thus, the resulting conversion speed increases from 80 to 144 megabytes for a 3.6 GHz processor.

An interesting picture emerges: the code contains exactly twice as many commands, and executes only 15% longer, but I think readers have already understood the reason for this phenomenon...

Theoretically, the code from the second example should be executed in the same number of clock cycles as the code from the first example, but the scheduler node is developed by Intel engineers, but they are also human, and we are all far from perfect. So it is possible to evaluate the effectiveness of their creation. This code will also run on an AMD processor, and you can compare their results.

If anyone doesn’t take my word for it, then for such non-believers, test programs with clock counters are included on the disk. The programs are in source code, of course in assembler, so there is an opportunity to check my words, and at the same time, see some of the tricks of professional coding.

Using SSE registers and AVX commands of modern processors to implement GOST 28147-89

Modern processors of the x86/64 architecture include a set of SSE registers of 16 bytes in size and specialized FPUs (at least two) to perform various operations on these registers. It is possible to implement GOST on this equipment, and in this case, replacement nodes can be placed not in the form of tables in RAM, but directly on dedicated SSE registers.

One SSE register can accommodate two tables of 16 rows at once. Thus, four SSE registers will completely accommodate all replacement tables. The only condition for such placement is the interleaving requirement, according to which tetrads of the same byte must be placed in different SSE registers. In addition, it is advisable to place the low and high tetrads of the input bytes, respectively, in the low and high tetrads of the bytes of the SSE registers.

These requirements are determined by optimization for the existing set of AVX commands. Thus, each byte of the SSE register will contain two tetrads corresponding to different bytes of the input register of the substitution block, while the position of the byte in the SSE register uniquely corresponds to the index in the substitution block substitution table.

A diagram of one of the possible placements of replacement nodes on SSE registers is shown in Fig. 4.

Placing secret information of replacement nodes on SSE registers increases the security of the crypto procedure, but complete isolation of this secret information is possible if the following conditions are met:

The processor core is switched to hypervisor host mode, and the interrupt block (APIC) is forcibly disabled in it. In this case, the processor core is completely isolated from the OS and applications running on the computing installation.
SSE registers are loaded and the computing core is isolated before the OS starts; it is optimal to perform these procedures from the trusted boot module (TLM).
Programs for cryptoprocedures in accordance with GOST are located in a non-modifiable memory area of the computing installation (either BIOS or in flash memory of the MDZ).

Compliance with these requirements will ensure complete isolation and immutability of the program code of cryptoprocedures and the secret information used in them.

For efficient sampling of tetrad SSE registers, the multi-input byte switches included in the FPU blocks are used. These switches allow transfers from any source byte to any destination byte, using indices located in a special SSE index register. Moreover, transfer is performed in parallel for all 16 bytes of the SSE receiver register.

Having substitution storage nodes on SSE registers and a multi-input switch in FPU blocks, it is possible to organize the following transformation in the substitution block (Fig. 5).

In this scheme, the input register in each tetrad specifies the address for the corresponding switch, which transmits information from the drives of the replacement nodes to the output register via the data bus. This scheme can be organized in three ways:

Create an appropriate chip design, but this is fantastic for us.
Reprogramming the microcode and creating your own processor command to implement this function on existing processors is no longer a fantasy, but, unfortunately, it is unrealistic in the current conditions.
Write a program using official AVX commands. The option may not be very effective, but it can be implemented “here and now.” So that's what we'll do next.

The operation of switches is controlled by a special three-address command AVX VPSHUFB. Its first operand is the receiver of information from the switches, the second is the source to which the switches' inputs are connected. The third operand is a control register for switches, each byte of which is associated with a corresponding switch; the value in it specifies the number of the direction from which the switch reads information. For a description of this command from the official Intel documentation, see Fig. 5. In Fig. Figure 6 shows a diagram of how this command works - only half of the SSE registers are shown, for the second half everything is similar.

The switch uses only the lowest four bits to determine the direction of commutation, the last bit in each byte is used to force the corresponding receiver byte to zero, but this switch function is not yet in demand in our case.

A program for sampling tetrads through FPU switches was written, but I didn’t even put it in the application - it was too pathetic. Having a 128-bit register and only using 32 bits in it is unprofessional.

As they say, “Our finish line is the horizon,” so squeeze it out, squeeze it out... we’ll press it and put it in bags!

This is not a play on words, but a harsh FPU reality - SSE registers can be divided into equal parts and the same transformations can be performed on these parts with one command. In order for the processor to understand this, there is a magic letter “P” - a packet that is placed before the command mnemonics, and no less magical letters “Q”, “D”, “W”, “B”, which are placed at the end and declare What parts are the SSE registers divided into in this command?

We are interested in batch mode with the SSE register divided into four 32-bit blocks; accordingly, all commands will be prefixed with “P” and at the end with the symbol “D”. This makes it possible to process four 32-bit blocks in parallel with one processor command, that is, calculate four data blocks in parallel.

The program that implements this method is available in Appendix 3, with all the explanations there.

However, press so much! Modern processors have at least two FPUs, and two independent instruction threads can be used to fully load them. If you correctly alternate commands from independent threads, you can load both FPU blocks with work completely and get eight parallel processed data streams at once. Such a program was written, and it can be viewed in Appendix 4, but you need to watch it carefully - you can go crazy. This is what is called “the code is not for everyone...”.

Price issue

The use of SSE registers to store replacement nodes is understandable - it provides some guarantee of isolation of secret information, but the meaning of calculating the cryptofunction itself on the FPU is not obvious. Therefore, the execution time of standard procedures was measured using the direct replacement method in accordance with GOST for four and eight threads.

For four threads, an execution speed of 472 processor cycles was obtained. Thus, for a processor with a frequency of 3.6 GHz, one thread is calculated at a speed of 59 megabytes per second, and four threads, respectively, at a speed of 236 megabytes per second.

For eight threads, an execution speed of 580 processor cycles was obtained. Thus, for a 3.6 GHz processor, one thread counts at 49 megabytes per second, and eight threads at 392 megabytes per second.

As the reader can see, the code in example #3 has a performance of 4 RTT, and the code in example #4 has a performance of 8 RTT. In these examples on SSE registers, the patterns are the same as when using RON, only the scheduler has reduced its efficiency. It currently provides a 20% increase in duration while doubling the code length.

Moreover, these results were obtained using universal AVX commands available in both Intel and AMD processors. If you optimize for an AMD processor, the result will be much better. It sounds contrary to the trend, but nevertheless it is true, and here’s why: AMD processors have an additional set of instructions, the so-called XOP extension, and in this additional set of instructions there are those that significantly simplify the implementation of the GOST algorithm.

This refers to the commands for logical burst byte shift and burst cyclic shift of double words. In the examples given in Appendices 3 and 4, sequences of universal commands are used that implement the necessary transformation: in the first case, one “extra” command, and in the other case, four extra commands at once. So there are optimization reserves, and considerable ones.

When it comes to further optimization, it’s worth remembering the presence of 256-bit registers (YMM registers), using which you can theoretically double the speed of calculations. But for now this is only a prospect; at the moment, processors slow down very much when executing 256-bit instructions (FPUs have a path width of 128 bits). Experiments have shown that on modern processors, counting 16 threads on YMM registers does not give any benefit. But this is only for now; on new processor models, the performance of 256-bit instructions will undoubtedly be increased, and then the use of 16 parallel threads will become advisable and will lead to an even greater increase in the speed of the crypto procedure.

Theoretically, you can count on a speed of 600–700 megabytes per second if the processor has two FPUs with a working path width of 256 bits each. In this case, we can talk about writing code with an efficiency of 16 RTT, and this is not fantasy, but the near future.

Mixed mode

Again the question of the number of registers arises; there are not enough of them to promote such an algorithm. But the hypertrading mode will help us. The processor core has a second set of registers available in logical processor mode. Therefore, we will execute the same code on two logical processors at once. In this mode, of course, we will not have more actuators, but due to alternation we can get a full load of all actuators.

You can’t count on an increase of 50% here; the bottleneck is the cache memory where the technological masks are stored, but you can still get an increase of 100 additional megabytes. This option is not shown in the appendices (the macros are similar to those used in the 8 RTT code), but it is available in the program files. So if anyone does not believe in the possibility of encryption at a speed of 500 megabytes per second on a single processor core, let them run test files. There are also texts with comments so that no one thinks that I am lying.

This focus is only possible on Intel processors; AMD has only two FPU units per two processor modules (analogous to the hypertrading mode). But there are four more ALUs that would be a sin not to use.

You can put the Bulldozer processor modules in a mode similar to the hypertrading mode, but run the conversion to RON on different modules in one thread, and on SSE registers in another thread and get the same 12 RTTs. I haven’t tested this option, but I think 12 RTT code will work more efficiently on AMD. Those interested can try it; test programs can be adjusted to work on “Bulldozers” quite easily.

Who needs it?

A serious question, but with a simple answer - everyone needs this. Soon we will all be hooked on the clouds, we will store both data and programs there, and there, oh, how we want to create our own, private corner. To do this, you will have to encrypt the traffic, and the speed of crypto conversion will be the main determining factor of comfortable work in the cloud. Our choice of encryption algorithm is small - either GOST or AES.

Moreover, oddly enough, the AES encryption algorithm built into processors turns out to be much slower; tests show speeds of 100–150 megabytes per second, and this is with a hardware implementation of the algorithm! The problem lies in the single-threaded count and the replacement block, which operates on bytes (a table of 256 rows). So GOST turns out to be more effective when implemented on x86/64 architecture, who would have thought...

This is if we talk about the achieved level of encryption speed. And if we keep in mind theoretical refinements in the field of increasing code efficiency, then most likely no one needs this. There are practically no specialists at the 3–6 RTT level, compilers generally generate code at the 1–2.5 RTT level, and the majority of programmers do not know assembler, and even if they know its spelling, they do not understand the design of a modern processor. And without this knowledge, it doesn’t matter whether it’s an assembler or some kind of SI-sharp.

But not everything is so sad: the bottom line after a week of sleepless nights is a new algorithm for implementing GOST, which would be a sin not to patent. And applications for patents (three in total) have already been completed and filed, so, gentlemen, businessmen, line up - women and children get a discount.

The history of this cipher is much older. The algorithm, which later formed the basis of the standard, was born, presumably, in the bowels of the Eighth Main Directorate of the KGB of the USSR (now within the structure of the FSB), most likely, in one of the closed research institutes subordinate to it, probably back in the 1970s as part of projects to create software and hardware implementations of the cipher for various computer platforms.

Since the publication of GOST, it was marked with the restrictive stamp “For official use”, and formally the cipher was declared “fully open” only in May 1994. The history of the creation of the cipher and the criteria of the developers have not been published as of 2010.

Description

GOST 28147-89 is a block cipher with a 256-bit key and 32 conversion cycles, operating on 64-bit blocks. The basis of the cipher algorithm is the Feistel Network. There are four operating modes according to GOST 28147-89:

simulated insertion mode.

Easy replacement mode

To encrypt in this mode, the plaintext is first split into two halves (the least significant bits are A, the most significant bits are B). On the i-th cycle, the subkey K i is used:

( = binary "exclusive or")

To generate subkeys, the original 256-bit key is divided into eight 32-bit blocks: K 1 ... K 8 .

Keys K 9 ... K 24 are a cyclic repetition of keys K 1 ... K 8 (numbered from least significant to highest bits). Keys K 25...K 32 are keys K 8...K 1.

After completing all 32 rounds of the algorithm, blocks A 33 and B 33 are glued together (note that the most significant bit becomes A 33 and the least significant bit becomes B 33) - the result is the result of the algorithm.

Decryption is performed in the same way as encryption, but the order of the subkeys K i is inverted.

Function is calculated as follows:

A i and K i are added modulo 2 32.

The result is divided into eight 4-bit subsequences, each of which is input to its own substitution table node(in ascending order of bit precedence), called below S-block. The total number of GOST S-blocks is eight, that is, the same number as subsequences. Every S-block is a permutation of numbers from 0 to 15. The first 4-bit subsequence goes to the input of the first S-box, the second - to the input of the second, etc.

If S-block looks like that:

1, 15, 13, 0, 5, 7, 10, 4, 9, 2, 3, 14, 6, 11, 8, 12

and the input of the S-block is 0, then the output will be 1, if 4, then the output will be 5, if the input is 12, then the output will be 6, etc.

The outputs of all eight S-boxes are combined into a 32-bit word, then the entire word is cyclically shifted left (towards the most significant bits) by 11 bits.

The simple replacement mode has the following disadvantages:

Can only be used to encrypt plaintexts with a length that is a multiple of 64 bits
By encrypting identical blocks of plaintext, identical blocks of ciphertext are obtained, which can provide certain information to a cryptanalyst.

The text of the standard states that the supply of filling replacement nodes (S-boxes) is carried out in the prescribed manner, that is, by the developer of the algorithm. The community of Russian CIPF developers has agreed on replacement nodes used on the Internet, see RFC 4357.

Advantages of GOST

the futility of a force attack (XSL attacks are not taken into account, since their effectiveness has not yet been fully proven);
efficient implementation and, accordingly, high performance on modern computers.
the presence of protection against the imposition of false data (development of imitative insertion) and the same encryption cycle in all four GOST algorithms.

Cryptanalysis

It is believed (see, for example, Vitaly V. Shorin, Vadim V. Jelezniakov and Ernst M. Gabidulin: Linear and Differential Cryptanalysis of Russian GOST) that GOST is resistant to such widely used methods as linear and differential cryptanalysis. Reversing the order of the keys in the last eight rounds provides protection against slide attacks and reflection attacks.

In May 2011, the famous cryptanalyst Nicolas Courtois proved the existence of an attack on this cipher, which has a complexity 2 8 (256) times less than the complexity of direct search of keys, provided there are 2 64 plaintext/plaintext pairs. This attack cannot be carried out in practice due to its too high computational complexity. Moreover, knowing 2 64 plaintext/privatetext pairs apparently allows you to read ciphertexts without even computing the key. Most other works also describe attacks that are applicable only under certain assumptions, such as a certain kind of keys or replacement tables, some modification of the original algorithm, or that require still unattainable amounts of memory or computation. The question of whether there are practical attacks without exploiting the weakness of individual keys or replacement tables remains open.

Criticism of GOST

The main problems of GOST are related to the incompleteness of the standard in terms of generating keys and replacement tables. It is believed that GOST has “weak” keys and replacement tables, but the standard does not describe the criteria for selecting and eliminating “weak” ones. Also, the standard does not specify an algorithm for generating a substitution table (S-boxes). On the one hand, this may be additional secret information (in addition to the key), and on the other hand, it raises a number of problems:

it is impossible to determine the cryptographic strength of an algorithm without knowing the substitution table in advance;
implementations of the algorithm from different manufacturers may use different replacement tables and may be incompatible with each other;
the possibility of deliberate provision of weak replacement tables by licensing authorities of the Russian Federation;
the potential (no prohibition in the standard) of using substitution tables in which the nodes are not permutations, which can lead to an extreme reduction in the strength of the cipher.

In October 2010, at the meeting of the 1st Joint Technical Committee of the International Organization for Standardization (ISO/IEC JTC 1/SC 27), GOST was nominated for inclusion in the international block encryption standard ISO/IEC 18033-3. In this regard, in January 2011, fixed sets of replacement nodes were formed and their cryptographic properties were analyzed. However, GOST was not adopted as a standard, and the corresponding replacement tables were not published

Possible applications

Notes

Literature

Melnikov V.V. Protection of information in computer systems. - M.: Finance and Statistics, 1997.
Romanets Yu. V., Timofeev P. A., Shangin V. F. Protection of information in computer systems and networks. - M.: Radio and communication, 1999.
Kharin Yu. S., Bernik V. I., Matveev G. V. Mathematical foundations of cryptology. - Mn. : BSU, 1999.
Gerasimenko V. A., Malyuk A. A. Fundamentals of information security. - M.: MGIFI, 1997.
Leonov A. P., Leonov K. P., Frolov G. V. Security of automated banking and office technologies. - Mn. : National book Chamber of Belarus, 1996.
Winter V. M.. Moldovyan A. A., Moldovyan N. A. Computer networks and protection of transmitted information. - St. Petersburg. : St. Petersburg State University, 1998.
Schneier B. 14.1 Algorithm GOST 28147-89 // Applied cryptography. Protocols, algorithms, source texts in C language = Applied Cryptography. Protocols, Algorithms and Source Code in C. - M.: Triumph, 2002. - pp. 373-377. - 816 p. - 3000 copies. - ISBN 5-89392-055-4
Popov, V., Kurepkin, I., and S. Leontiev Additional Cryptographic Algorithms for Use with GOST 28147-89, GOST R 34.10-94, GOST R 34.10-2001, and GOST R 34.11-94 Algorithms (English) // RFC 4357. - IETF, January 2006.

Wikimedia Foundation. 2010.

GOST 28147 cryptographic conversion algorithm