
Efficient Algorithms in Software

Julio López
jlopez@ic.unicamp.br

Institute of Computing, University of Campinas

September 2017, Habana, Cuba.

ASCrypto 2017
Agenda

1 Efficient Software Implementations
Software Efficiency
Parallel Computation -SIMD

2 Symmetric-Key Cryptography
Data Encryption
Hash Functions
SHA2 Implementation
SHA3 Implementation

3 Elliptic Curve Cryptography
Elliptic Curves
Elliptic Curve Diffie-Hellman
Digital Signatures
EdDSA Scheme

Julio López (IC-UNICAMP) Efficient Algorithms in Software ASCrypto 2017 2 / 83


Section 1

Efficient Software Implementations


1.1

Software Efficiency
Optimizing a software implementation of a cryptographic algorithm is a task with several goals:

• Ensuring security.
• Reducing running time.
• Reducing code size.
• Reducing memory consumption.
• Exploiting the characteristics of the computer platform.
• Reducing energy consumption.

Sometimes these goals conflict with each other. For example, accelerating an operation with look-up tables increases code size and, if not implemented carefully, can leave the implementation vulnerable to cache-timing attacks.
How Is Performance Measured?

• Elapsed time is hard to compare across different computers; instead, clock cycles are measured.
• Use the RDTSC instruction to read the processor's Time-Stamp Counter.

    #include <stdint.h>

    /* Read the 64-bit Time-Stamp Counter (x86, GCC/Clang inline assembly). */
    uint64_t get_cycles(void) {
        uint32_t lo, hi;
        /* RDTSC returns the low half in EAX and the high half in EDX. */
        __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

• To reduce certain sources of randomness during measurements, it is recommended to turn off technologies such as Turbo Boost and Hyper-Threading.


1.2

Parallel Computation -SIMD


Single Instruction Multiple Data

• Single Instruction Multiple Data (SIMD) is a class of computers in which a single instruction is applied simultaneously to a set of data.
• Recent processors support SIMD through a bank of wider registers, also known as vector registers.


Vector instructions

Instructions that operate on vector registers are known as vector instructions. They operate on words packed into the vector registers.


Releases of Vector Instructions

[Figure: timeline (1994 to 2020) of vector instruction set releases, grouped by category.]

Categories: Integer Arithmetic, Floating-point Arithmetic, String Manipulation, Cryptography, Bit Manipulation.

Extensions, in the order they are added to the timeline: MMX, SSE, SSE2, SSE3, SSE4, AES-NI + CLMUL, AVX, AVX2 and BMI, SHA1-SHA2, AVX-512.

Register banks: MMX (64 bits), XMM (128 bits), YMM (256 bits), ZMM (512 bits).


Relevant AVX2 Instructions

Integer arithmetic for 64-bit words:
• 1 cycle for add/sub.
• 5 cycles for multiplications.

Variable logic shifts:
• 1 cycle for fixed shifts.
• 2 cycles for variable shifts.

Permutation of words:
• 3 cycles for permutations.

Combination/selection of registers:
• Up to 3 instructions per cycle without dependencies.

Examples on registers holding four 64-bit words:

C = ADD(A, B):      [c3, c2, c1, c0] = [a3 + b3, a2 + b2, a1 + b1, a0 + b0]
C = VSHL(A, B):     [c3, c2, c1, c0] = [a3 << b3, a2 << b2, a1 << b1, a0 << b0]
C = PERM(A, M):     [c3, c2, c1, c0] = [a_m3, a_m2, a_m1, a_m0], with each m_i in {0, 1, 2, 3}
C = BLEND(A, B, M): each c_i is either a_i or b_i, selected by a mask bit m_i in {0, 1}
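The lane-wise behaviour of these four operations can be modelled in portable scalar C. This is only a sketch of the semantics, not AVX2 code: the type `vec4x64`, the function names, and the convention that a mask bit of 1 selects the word from B are assumptions made for illustration.

```c
#include <stdint.h>

/* A 256-bit register modelled as four 64-bit lanes; lane[0] is the lowest word. */
typedef struct { uint64_t lane[4]; } vec4x64;

/* C = ADD(A, B): independent addition modulo 2^64 in every lane. */
vec4x64 vec_add(vec4x64 a, vec4x64 b) {
    vec4x64 c;
    for (int i = 0; i < 4; i++) c.lane[i] = a.lane[i] + b.lane[i];
    return c;
}

/* C = VSHL(A, B): lane i of A shifted left by lane i of B (0 when the count is >= 64). */
vec4x64 vec_shl_var(vec4x64 a, vec4x64 b) {
    vec4x64 c;
    for (int i = 0; i < 4; i++)
        c.lane[i] = (b.lane[i] < 64) ? (a.lane[i] << b.lane[i]) : 0;
    return c;
}

/* C = PERM(A, M): lane i of C is lane m[i] of A, with m[i] in {0, 1, 2, 3}. */
vec4x64 vec_perm(vec4x64 a, const int m[4]) {
    vec4x64 c;
    for (int i = 0; i < 4; i++) c.lane[i] = a.lane[m[i] & 3];
    return c;
}

/* C = BLEND(A, B, M): lane i comes from B when m[i] is 1, otherwise from A
   (the polarity of the mask is an assumption of this sketch). */
vec4x64 vec_blend(vec4x64 a, vec4x64 b, const int m[4]) {
    vec4x64 c;
    for (int i = 0; i < 4; i++) c.lane[i] = m[i] ? b.lane[i] : a.lane[i];
    return c;
}
```

On real hardware each function body would be a single vector instruction, so the per-lane loop costs nothing extra.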


Vector Instruction Guide

Full documentation is available at:
http://software.intel.com/sites/landingpage/IntrinsicsGuide


Skylake Execution Engine

The Skylake processor has eight execution ports for instructions. Having several ports improves Instruction-Level Parallelism (ILP), because independent instructions can be issued in the same cycle.


Section 2

Symmetric-Key Cryptography
2.1

Data Encryption
Secure Communication

• Alice and Bob would like to communicate through an insecure channel.
• Charles is a malicious third party who also has access to the channel.
• Charles should not be able to read the messages exchanged by Alice and Bob.


Symmetric Data Encryption

Using a secret key k, Alice and Bob can exchange encrypted messages. Charles cannot read the messages without knowledge of the key k.

[Figure: key generation gives k to both Alice and Bob. Alice encrypts (M, k) into C, sends C over the channel, and Bob decrypts it back to M.]

Encryption: C = E_k(M)
Decryption: M = D_k(C)


Advanced Encryption Standard (AES)

• AES (Rijndael) was designed by Daemen and Rijmen in 1998.
• AES (selected in 2000) is the current NIST standard for encrypting data using a symmetric key.
• AES is a cipher that encrypts a 128-bit plaintext (M), producing a 128-bit ciphertext (C), using a key k: C = AES_k(M).
• AES supports three key sizes, |k| ∈ {128, 192, 256}, leading to three algorithms:
  • AES-128.
  • AES-192.
  • AES-256.
AES State Representation

AES keeps track of a 128-bit state, which can be seen as a 4 × 4 matrix of bytes.

[Figure: the state goes from M to C through a sequence of rounds keyed with the round keys k_0, ..., k_{Nr}.]

In each round, AES applies a series of transformations over the matrix. The number of rounds is

\[
N_r = \begin{cases}
10 & \text{if } |k| = 128 \\
12 & \text{if } |k| = 192 \\
14 & \text{if } |k| = 256
\end{cases}
\]

After Nr rounds, the last state is returned as the ciphertext.

AES State Transformations

• SubBytes

• ShiftRows

• MixColumns

• AddRoundKey

For decryption, transformations are inverted and applied in reverse order.



AES MixColumns: Encryption

p_e = {03}x^3 + {01}x^2 + {01}x + {02}

c' = p_e ⊗ c = M_e c

\[
\begin{pmatrix} c'_0 \\ c'_1 \\ c'_2 \\ c'_3 \end{pmatrix}
=
\begin{pmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix}
\]
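The matrix product above can be sketched in C for a single column. The function names `xtime` and `mix_column` are chosen here for illustration; the only fact used beyond the slide is that AES byte arithmetic is carried out in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1, as specified in FIPS 197.

```c
#include <stdint.h>

/* Multiply a byte by {02} in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t xtime(uint8_t a) {
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* Apply the matrix M_e to one state column c[0..3], in place. */
void mix_column(uint8_t c[4]) {
    uint8_t c0 = c[0], c1 = c[1], c2 = c[2], c3 = c[3];
    /* {03}*a is computed as {02}*a xor a. */
    c[0] = (uint8_t)(xtime(c0) ^ (xtime(c1) ^ c1) ^ c2 ^ c3);
    c[1] = (uint8_t)(c0 ^ xtime(c1) ^ (xtime(c2) ^ c2) ^ c3);
    c[2] = (uint8_t)(c0 ^ c1 ^ xtime(c2) ^ (xtime(c3) ^ c3));
    c[3] = (uint8_t)((xtime(c0) ^ c0) ^ c1 ^ c2 ^ xtime(c3));
}
```

Because the coefficients are only {01}, {02}, and {03}, one conditional shift per byte is enough and no general field multiplication is needed.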


AES MixColumns: Decryption

p_d = {0b}x^3 + {0d}x^2 + {09}x + {0e}

c' = p_d ⊗ c = M_d c

\[
\begin{pmatrix} c'_0 \\ c'_1 \\ c'_2 \\ c'_3 \end{pmatrix}
=
\begin{pmatrix}
0e & 0b & 0d & 09 \\
09 & 0e & 0b & 0d \\
0d & 09 & 0e & 0b \\
0b & 0d & 09 & 0e
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix}
\]
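The inverse transformation has larger coefficients, so a sketch needs a general multiplication in GF(2^8). The helper names `gf_mul` and `inv_mix_column` are assumptions of this example; the coefficients are those of the circulant InvMixColumns matrix in FIPS 197, whose rows are rotations of (0e, 0b, 0d, 09).

```c
#include <stdint.h>

/* Multiply a and b in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (shift-and-add). */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;                                       /* add a for this bit of b */
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));    /* a = a * {02} */
        b >>= 1;
    }
    return r;
}

/* Apply the matrix M_d to one state column c[0..3], in place. */
void inv_mix_column(uint8_t c[4]) {
    uint8_t c0 = c[0], c1 = c[1], c2 = c[2], c3 = c[3];
    c[0] = (uint8_t)(gf_mul(0x0e, c0) ^ gf_mul(0x0b, c1) ^ gf_mul(0x0d, c2) ^ gf_mul(0x09, c3));
    c[1] = (uint8_t)(gf_mul(0x09, c0) ^ gf_mul(0x0e, c1) ^ gf_mul(0x0b, c2) ^ gf_mul(0x0d, c3));
    c[2] = (uint8_t)(gf_mul(0x0d, c0) ^ gf_mul(0x09, c1) ^ gf_mul(0x0e, c2) ^ gf_mul(0x0b, c3));
    c[3] = (uint8_t)(gf_mul(0x0b, c0) ^ gf_mul(0x0d, c1) ^ gf_mul(0x09, c2) ^ gf_mul(0x0e, c3));
}
```

The data-dependent loop in `gf_mul` keeps the sketch short; it is not meant as a constant-time implementation.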


The AES-NI Instruction Set

In 2010, Intel released a set of instructions to perform the AES algorithm.

Encryption: Plaintext → AddRoundKey → (Nr − 1) × AESENC → AESENCLAST → Ciphertext
Decryption: Ciphertext → AddRoundKey → (Nr − 1) × AESDEC → AESDECLAST → Plaintext

• AESENC: SubBytes, ShiftRows, MixColumns, AddRoundKey.
• AESENCLAST: SubBytes, ShiftRows, AddRoundKey.
• AESDEC: InvShiftRows, InvSubBytes, InvMixColumns, AddRoundKey.
• AESDECLAST: InvShiftRows, InvSubBytes, AddRoundKey.


AES-128 Encryption

Encrypting a 128-bit block (held in xmm15) using the key schedule (stored in xmm0-xmm10). Nr = 10.

    MOVDQA     xmm15, [rsi]    ; Load message block
    PXOR       xmm15, xmm0     ; AddRoundKey
    AESENC     xmm15, xmm1     ; Round 1
    AESENC     xmm15, xmm2     ; Round 2
    AESENC     xmm15, xmm3     ; Round 3
    AESENC     xmm15, xmm4     ; Round 4
    AESENC     xmm15, xmm5     ; Round 5
    AESENC     xmm15, xmm6     ; Round 6
    AESENC     xmm15, xmm7     ; Round 7
    AESENC     xmm15, xmm8     ; Round 8
    AESENC     xmm15, xmm9     ; Round 9
    AESENCLAST xmm15, xmm10    ; Round 10
    MOVDQA     [rdi], xmm15    ; Store cipher block

Analogously, for decryption use AESDEC and AESDECLAST, and invert the key schedule using AESIMC.
Modes of Operation

Splitting a long message into 128-bit blocks and encrypting each block independently (ECB mode) is not secure!

Modes of operation are used for encrypting arbitrary-length messages using a block cipher as a building block.
• CBC. Cipher block chaining.
• CTR. Counter mode.
• GCM. Galois/counter mode (authenticated encryption).


Cipher Block Chaining (CBC)

[Figure: CBC encryption and decryption of four blocks P1, ..., P4 into C1, ..., C4 and back.]

Encryption (sequential execution): C_i = E_k(P_i ⊕ C_{i−1}), with C_0 = IV.
Decryption (parallel execution): P_i = D_k(C_i) ⊕ C_{i−1}, with C_0 = IV.
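The different data dependencies of CBC encryption and decryption can be seen in a short C sketch. The functions `toy_encrypt` and `toy_decrypt` are hypothetical stand-ins for E_k and D_k: they are invertible but offer no security, and a real implementation would call AES instead. All other names are also chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16  /* block size in bytes (128 bits) */

/* Hypothetical placeholder for E_k: an invertible byte-wise map, NOT a secure cipher. */
static void toy_encrypt(uint8_t out[BLOCK], const uint8_t in[BLOCK], const uint8_t k[BLOCK]) {
    for (int i = 0; i < BLOCK; i++) out[i] = (uint8_t)((in[i] ^ k[i]) + 1);
}

/* Hypothetical placeholder for D_k: the inverse of toy_encrypt. */
static void toy_decrypt(uint8_t out[BLOCK], const uint8_t in[BLOCK], const uint8_t k[BLOCK]) {
    for (int i = 0; i < BLOCK; i++) out[i] = (uint8_t)((in[i] - 1) ^ k[i]);
}

/* CBC encryption: C_i = E_k(P_i xor C_{i-1}), C_0 = IV.
   Block i needs the ciphertext of block i-1, so the loop is inherently sequential. */
void cbc_encrypt(uint8_t *c, const uint8_t *p, size_t nblocks,
                 const uint8_t iv[BLOCK], const uint8_t k[BLOCK]) {
    uint8_t prev[BLOCK], x[BLOCK];
    memcpy(prev, iv, BLOCK);
    for (size_t b = 0; b < nblocks; b++) {
        for (int i = 0; i < BLOCK; i++) x[i] = p[b * BLOCK + i] ^ prev[i];
        toy_encrypt(&c[b * BLOCK], x, k);
        memcpy(prev, &c[b * BLOCK], BLOCK);
    }
}

/* CBC decryption: P_i = D_k(C_i) xor C_{i-1}.
   Every input is already known, so the iterations are independent of each other.
   The buffers p and c must not overlap. */
void cbc_decrypt(uint8_t *p, const uint8_t *c, size_t nblocks,
                 const uint8_t iv[BLOCK], const uint8_t k[BLOCK]) {
    for (size_t b = 0; b < nblocks; b++) {
        const uint8_t *prev = (b == 0) ? iv : &c[(b - 1) * BLOCK];
        uint8_t x[BLOCK];
        toy_decrypt(x, &c[b * BLOCK], k);
        for (int i = 0; i < BLOCK; i++) p[b * BLOCK + i] = x[i] ^ prev[i];
    }
}
```

Since the decryption loop has no loop-carried dependency, several block decryptions can be interleaved, which is what makes pipelined AES-NI code effective for CBC decryption but not for CBC encryption.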


Counter mode (CTR)

[Figure: CTR encryption and decryption of four blocks using the counter values IV+1, ..., IV+4.]

Encryption: C_i = P_i ⊕ E_k(IV + i).
Decryption: P_i = C_i ⊕ E_k(IV + i).

Both encryption and decryption can be executed in parallel. Only the encryption function of the block cipher is used.


Performance of AES-128-CBC Encryption

The performance is determined by the latency of the AESENC instruction.

[Figure: clock timeline (cycles 0 to 24) showing a chain of AESENC instructions, each starting only after the previous one has completed.]

µ-arch          AESENC latency (cycles)   CBC-ENC (cycles per byte)
Intel Haswell   7                         4.49
Intel Skylake   4                         2.71
AMD Zen         4                         2.44
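A back-of-the-envelope estimate shows why latency dominates. Assume, for illustration only, that each of the Nr = 10 rounds of AES-128 costs one AESENC latency L (treating AESENCLAST like AESENC) and that loads, stores, and XORs are free. One 16-byte block then costs about Nr · L cycles:

```latex
\[
\text{cycles per byte} \approx \frac{N_r \cdot L}{16}
\qquad\Longrightarrow\qquad
\frac{10 \cdot 7}{16} = 4.375 \ \ (L = 7),
\qquad
\frac{10 \cdot 4}{16} = 2.5 \ \ (L = 4).
\]
```

These values are close to the measured 4.49 (Haswell) and 2.71 (Skylake) cycles per byte. The figure reported for Zen, 2.44, is slightly below the estimate, so the model should be read as a rough approximation rather than a strict lower bound.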


Pipelined AES Implementation

The execution of an AESENC instruction can be overlapped with other instructions of the same type.

[Figure: clock timeline (cycles 0 to 24). A single chain of dependent AESENC instructions is bound by latency; with w = 4 independent chains interleaved, the instructions overlap and the cost is bound by throughput.]

The processor's pipeline improves the performance of the CBC-DEC and CTR modes.


Performance of AES-128-CBC Decryption

[Figure: running time (cycles per byte, 0.0 to 1.4) of AES-128-CBC decryption on Haswell, Skylake, and Zen for w = 1, 2, 4, 8.]

Scheduling w = 4 AES-NI instructions improves the performance of decryption.

Can we do better? Yes! Zen has two execution units for AES-NI instructions.

µ-arch          Latency (cycles)   CBC-ENC (cycles per byte)   CBC-DEC (cycles per byte)
Intel Haswell   7                  4.49                        0.63
Intel Skylake   4                  2.71                        0.62
AMD Zen         4                  2.44                        0.37


Performance of AES-128-CTR Mode

[Figure: running time (cycles per byte, 0.0 to 1.4) of AES-128-CTR on Haswell, Skylake, and Zen for sequential execution and w = 2, 4, 8.]

Running times in cycles per byte:

µ-arch          Latency (cycles)   CBC-ENC   CBC-DEC   CTR
Intel Haswell   7                  4.49      0.63      0.74
Intel Skylake   4                  2.71      0.62      0.62
AMD Zen         4                  2.44      0.37      0.39


2.2

Hash Functions
Hash Function

A hash function maps an arbitrary-length bit string into an n-bit string.

h : {0, 1}* → {0, 1}^n

The output of a hash function is called the digest or hash value.


Cryptographic Properties

1st pre-image resistance. Given a hash value r, it should be difficult to find any message M such that r = h(M).
2nd pre-image resistance. Given an input M1, it should be difficult to find a different input M2 such that h(M1) = h(M2).
Collision resistance. It should be difficult to find two different messages M1 and M2 such that h(M1) = h(M2).
Applications of Hash Functions

Cryptographic hash functions have a large number of applications:

• Verifying the integrity of files or messages.
• Password verification.
• Pseudo-random number generation.
• Key derivation functions.
• Digital signatures.


NIST Hash Functions

1993 • SHA-0: Secure Hash Algorithm, output: 160 bits.
1995 • SHA-1, output: 160 bits.
2001 • SHA-2, output: 224, 256, 384, 512 bits.
2015 • SHA-3 (Keccak), output: 224, 256, 384, 512 bits.
2015 • SHA-3 (SHAKE128, SHAKE256), output: m bits (arbitrary).

SHA-1 and SHA-2 are specified in FIPS 180-4; SHA-3 is specified in FIPS 202.


2.3

SHA2 Implementation
SHA2 Algorithm

SHA2-256 operates as follows.

• Initialize state S_0 with constant values.
• After padding, the message is split into n 512-bit blocks: M_1, ..., M_n.
• For each block M_j:

S_j = Update(S_{j−1}, M_j) for 1 ≤ j ≤ n

• The digest of M is H(M) = S_n.

Update consists of two phases:

1 Message Schedule.
2 State Update.


Update Phase 1: Message Schedule

Let w_0, ..., w_15 be the message block M_j split into 16 words of 32 bits; then, the message schedule calculates 48 new words:

w_i ← σ0(w_{i−15}) + σ1(w_{i−2}) + w_{i−7} + w_{i−16}, for 16 ≤ i < 64,

where

σ0(x) = Rot(x, 7) ⊕ Rot(x, 18) ⊕ Shr(x, 3)
σ1(x) = Rot(x, 17) ⊕ Rot(x, 19) ⊕ Shr(x, 10)

Here Rot is a right rotation and Shr a right shift of a 32-bit word, and additions are modulo 2^32.
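The recurrence can be written directly in scalar C. This is a minimal sketch with names chosen for illustration (`rotr32`, `sigma0`, `sigma1`, `sha256_message_schedule`); it takes Rot to be a right rotation and Shr a right shift, as in FIPS 180-4.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by n bits, for 0 < n < 32. */
static uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n));
}

uint32_t sigma0(uint32_t x) { return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3); }
uint32_t sigma1(uint32_t x) { return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10); }

/* Expand the 16 message words w[0..15] into the 64 words w[0..63], in place.
   Unsigned 32-bit arithmetic wraps around, which gives addition modulo 2^32. */
void sha256_message_schedule(uint32_t w[64]) {
    for (int i = 16; i < 64; i++)
        w[i] = sigma0(w[i - 15]) + sigma1(w[i - 2]) + w[i - 7] + w[i - 16];
}
```

Each new word depends on w[i − 2], so at most two consecutive words can be computed independently of each other. This short dependency is one reason the schedule does not vectorize trivially.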


Update Phase 2: State Update

(a_0, b_0, c_0, d_0, e_0, f_0, g_0, h_0) ← S
for i ← 0 to 63 do
    T1 ← h_i ⊞ Σ1(e_i) ⊞ Ch(e_i, f_i, g_i) ⊞ k_i ⊞ w_i
    T2 ← Σ0(a_i) ⊞ Maj(a_i, b_i, c_i)
    h_{i+1} ← g_i,  g_{i+1} ← f_i
    f_{i+1} ← e_i,  e_{i+1} ← d_i ⊞ T1
    d_{i+1} ← c_i,  c_{i+1} ← b_i
    b_{i+1} ← a_i,  a_{i+1} ← T1 ⊞ T2
end for
S' ← (a_0 ⊞ a_64, ..., h_0 ⊞ h_64)

⊞ is addition modulo 2^32.

[Figure: data flow of one iteration, from (a_i, ..., h_i) to (a_{i+1}, ..., h_{i+1}), with T1 fed by h_i, k_i, and w_i.]
SHA New Instructions (SHA-NI)

In 2013, Intel released the specification of the SHA New Instructions (SHA-NI).
• Intel's Goldmont micro-architecture has supported them since 2016.
• AMD's Zen micro-architecture added support in 2017.

SHA1:
• SHA1MSG1
• SHA1MSG2
• SHA1NEXTE
• SHA1RNDS4

SHA2-256 (and SHA2-224):
• SHA256MSG1
• SHA256MSG2
• SHA256RNDS2


Implementation of Phase 1a: Message Schedule

The SHA256MSG1 instruction performs the following operation:

x_i = σ0(w_{i+1}) + w_i, for 0 ≤ i < 4.

Inputs: xmm0 = [w7, w6, w5, w4], xmm1 = [w3, w2, w1, w0]
Output: xmm2 = [x3, x2, x1, x0]


Implementation of Phase 1b: Message Schedule

The SHA256MSG2 instruction performs the following operation:

w_{i+16} = σ1(w_{i+14}) + y_i, for 0 ≤ i < 4.

Inputs: xmm0 = [y3, y2, y1, y0], xmm1 = [w15, w14, w13, w12]
Output: xmm2 = [w19, w18, w17, w16]

For i = 2 and i = 3, σ1 is applied to w16 and w17, which are produced by the same instruction.


Implementation of Phase 2: Two Iterations

Let A_i = [a_i, b_i, e_i, f_i] and C_i = [c_i, d_i, g_i, h_i] be the state at the i-th iteration. Then, it holds that:

C_{i+2} = A_i

The remaining values A_{i+2} = [a_{i+2}, b_{i+2}, e_{i+2}, f_{i+2}] are calculated by the SHA256RNDS2 instruction:

A_{i+2} = SHA256RNDS2(C_i, A_i, X)

where X = [w_i + k_i, w_{i+1} + k_{i+1}].


Implementation of Phase 2: Two Iterations

[Figure: data flow of two consecutive iterations, from (a_i, ..., h_i) through (a_{i+1}, ..., h_{i+1}) to (a_{i+2}, ..., h_{i+2}), with T1 and T2 computed in each iteration from k_i, w_i and k_{i+1}, w_{i+1}.]

After two iterations, half of the state is a copy of earlier values:

c_{i+2} = a_i
d_{i+2} = b_i
g_{i+2} = e_i
h_{i+2} = f_i
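The relation C_{i+2} = A_i can be checked with a scalar model of two iterations. This sketch does not use the SHA256RNDS2 instruction itself: the function names and the state layout s = {a, b, c, d, e, f, g, h} are assumptions made for illustration, and the definitions of Σ0, Σ1, Ch, and Maj are taken from FIPS 180-4.

```c
#include <stdint.h>

static uint32_t rotr32(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }

/* SHA-256 round functions as defined in FIPS 180-4. */
static uint32_t Sigma0(uint32_t x) { return rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22); }
static uint32_t Sigma1(uint32_t x) { return rotr32(x, 6) ^ rotr32(x, 11) ^ rotr32(x, 25); }
static uint32_t Ch(uint32_t e, uint32_t f, uint32_t g) { return (e & f) ^ (~e & g); }
static uint32_t Maj(uint32_t a, uint32_t b, uint32_t c) { return (a & b) ^ (a & c) ^ (b & c); }

/* One iteration of the state update on s = {a, b, c, d, e, f, g, h};
   kw is the precomputed sum k_i + w_i modulo 2^32. */
void sha256_round(uint32_t s[8], uint32_t kw) {
    uint32_t t1 = s[7] + Sigma1(s[4]) + Ch(s[4], s[5], s[6]) + kw;
    uint32_t t2 = Sigma0(s[0]) + Maj(s[0], s[1], s[2]);
    s[7] = s[6]; s[6] = s[5]; s[5] = s[4]; s[4] = s[3] + t1;
    s[3] = s[2]; s[2] = s[1]; s[1] = s[0]; s[0] = t1 + t2;
}

/* Scalar model of the two iterations grouped by SHA256RNDS2,
   with x = [w_i + k_i, w_{i+1} + k_{i+1}]. */
void sha256_two_rounds(uint32_t s[8], const uint32_t x[2]) {
    sha256_round(s, x[0]);
    sha256_round(s, x[1]);
}
```

After `sha256_two_rounds`, the words c, d, g, h of the state equal the old a, b, e, f, so only the four words a, b, e, f have to be computed. This is why the instruction needs to return a single 128-bit result.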


Implementation of Phase 2: Four Iterations

Using two SHA256RNDS2 instructions, one can compute four iterations of the Update function:

C_{i+2} = A_i
A_{i+2} = SHA256RNDS2(C_i, A_i, X)
C_{i+4} = A_{i+2}
A_{i+4} = SHA256RNDS2(C_{i+2}, A_{i+2}, Y)

where X = [w_i + k_i, w_{i+1} + k_{i+1}] and Y = [w_{i+2} + k_{i+2}, w_{i+3} + k_{i+3}].

This is equivalent to:

C_{i+4} = SHA256RNDS2(C_i, A_i, X)
A_{i+4} = SHA256RNDS2(A_i, C_{i+4}, Y)


Performance of SHA2-256 using SHA-NI

SHA-NI is 4-5× faster than 64-bit implementations of SHA2-256.

[Figure: two panels plotted against message size (1 byte to 1 MB): running time in cycles per byte (logarithmic scale) and speedup (1× to 5×), comparing sphlib (supercop), OpenSSL, and SHA-NI.]

Can we do better?


Pipelined Implementation of SHA-NI

Like AES-NI instructions, SHA-NI instructions can be executed in a pipeline.

[Figure: clock timeline (cycles 0 to 26). A single chain of dependent SHA256RNDS2 instructions is bound by latency; with w = 4 independent chains interleaved, the instructions overlap and the cost is bound by throughput.]

Target scenario: multiple hashing ⇒ hash-based signatures (PQ-Crypto).


Performance of Pipelined Implementation of SHA-NI

Example: calculating four hashes (pipelined) is 20% faster than a sequential implementation.

[Figure: running time (cycles per byte, 1.0 to 2.5) against message size (256 bytes to 1 MB) on Zen (Ryzen 7 1800X processor), for 1, 2, 4, and 8 messages hashed at once.]


2.4

SHA3 Implementation
The SHA-3 Family of Functions

SHA-3 is composed of four hash functions and two XOFs called SHAKE.

Function    Output size (n)   Bit-rate (r)   Security level
SHA3-224    224               1,152          112
SHA3-256    256               1,088          128
SHA3-384    384               832            192
SHA3-512    512               576            256
SHAKE128    n                 1,344          min(n/2, 128)
SHAKE256    n                 1,088          min(n/2, 256)

The input of a SHA-3 function is split into blocks of r bits. The larger the bit-rate, the faster the execution.


Extendable-Output Function

An extendable-output function (XOF) maps an arbitrary-length bit string to a digest value of variable length.

XOF : {0, 1}* × N → {0, 1}*
(a, n) ↦ XOF(a, n) ∈ {0, 1}^n


The SHA-3 Design

SHA-3 was designed using a sponge construction proposed in 2009 by Bertoni et al.

The construction has three phases: Initializing, Absorbing, and Squeezing.


Sponge Construction

Initializing: The state has 1,600 bits that are initialized to 0; then, the input is split into blocks of r bits.

Absorbing: Each block is added (XORed) to the first r bits of the state; then, the state is processed by a permutation function P.

Squeezing: After the input has been consumed, the function P is used to produce ⌊n/r⌋ output blocks of r bits, concatenated with n mod r bits taken from the last state.
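The three phases can be sketched in C with a byte-oriented state. This is an illustration of the sponge structure only: `toy_permutation` is a hypothetical placeholder and is not the Keccak permutation, padding is omitted (the input length is assumed to be a multiple of the rate), and the rate is given in bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STATE_BYTES 200  /* 1,600 bits */

/* Hypothetical placeholder for P: a simple invertible mixing, NOT Keccak. */
static void toy_permutation(uint8_t s[STATE_BYTES]) {
    for (int round = 0; round < 4; round++)
        for (int i = 0; i < STATE_BYTES; i++)
            s[(i + 1) % STATE_BYTES] ^= (uint8_t)(s[i] * 5 + 0x3b + round);
}

/* Absorb len bytes of input and squeeze n bytes of output.
   Assumptions: 0 < rate <= STATE_BYTES, and len is a multiple of rate. */
void toy_sponge(uint8_t *out, size_t n, const uint8_t *in, size_t len, size_t rate) {
    uint8_t s[STATE_BYTES];

    /* Initializing: the whole state starts at zero. */
    memset(s, 0, sizeof s);

    /* Absorbing: XOR each block into the first `rate` bytes, then apply P. */
    for (size_t off = 0; off + rate <= len; off += rate) {
        for (size_t i = 0; i < rate; i++) s[i] ^= in[off + i];
        toy_permutation(s);
    }

    /* Squeezing: output up to `rate` bytes at a time, applying P between blocks. */
    while (n > 0) {
        size_t take = (n < rate) ? n : rate;
        memcpy(out, s, take);
        out += take;
        n -= take;
        if (n > 0) toy_permutation(s);
    }
}
```

Because output blocks are produced one after another from the same state, a shorter output is always a prefix of a longer one for the same input. This is the behaviour that lets an XOF deliver a digest of any requested length.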


Permutation Function P

The state has 1,600 bits and is represented by a 5 × 5 matrix S; each entry of the matrix is a 64-bit word.

\[
S = \begin{pmatrix}
s_0 & s_1 & s_2 & s_3 & s_4 \\
s_5 & s_6 & s_7 & s_8 & s_9 \\
s_{10} & s_{11} & s_{12} & s_{13} & s_{14} \\
s_{15} & s_{16} & s_{17} & s_{18} & s_{19} \\
s_{20} & s_{21} & s_{22} & s_{23} & s_{24}
\end{pmatrix};
\qquad S[x, y] = s_{5x+y} \text{ for } 0 \le x, y < 5.
\]

The permutation P consists of 24 rounds, each applying the transformations θ, ρ, π, χ, ι.
Using 256-bit instructions

The SHA-3 state is stored in seven 256-bit registers.

Y0 = [s0,  s1,  s2,  s3]
Y1 = [s5,  s6,  s7,  s8]
Y2 = [s10, s11, s12, s13]
Y3 = [s15, s16, s17, s18]
Y4 = [s20, s21, s22, s23]
Y5 = [s24, s24, s24, s24]
Y6 = [s4,  s9,  s14, s19]

• Yi: 256-bit vector registers.
• si: 64-bit words.

Pros:
• It uses just a few 256-bit vector registers.
Cons:
• The permutation instructions of AVX2 are expensive.


Using 128-bit instructions

• State representation.

X0  = [s0,  s1]     X7  = [s15, s16]
X1  = [s2,  s3]     X8  = [s17, s18]
X2  = [s5,  s6]     X9  = [s14, s19]
X3  = [s7,  s8]     X10 = [s20, s21]
X4  = [s4,  s9]     X11 = [s22, s23]
X5  = [s10, s11]    X12 = [s24, s24]
X6  = [s12, s13]

• Xi: 128-bit vector registers.
• The state uses 13 variables of 128 bits.
• Pros:
  • The permutation instructions of SSE4 are cheaper than those of AVX2.
• Cons:
  • It uses more variables.


4-way implementation

• State representation: register Yj holds word j of the states of four messages.

Y0  = [s^1_0,  s^2_0,  s^3_0,  s^4_0]
Y1  = [s^1_1,  s^2_1,  s^3_1,  s^4_1]
Y2  = [s^1_2,  s^2_2,  s^3_2,  s^4_2]
...
Y22 = [s^1_22, s^2_22, s^3_22, s^4_22]
Y23 = [s^1_23, s^2_23, s^3_23, s^4_23]
Y24 = [s^1_24, s^2_24, s^3_24, s^4_24]

• Yi: 256-bit vector registers.
• The state uses 25 variables of 256 bits.
• Pros:
  • No 64-bit permutations are needed.
• Cons:
  • It uses many variables and the processor has only 16 registers.


Performance of SHA3-128 Function

Cycles per byte taken for hashing a message of 4096 bytes.

[Figure: running time (cycles per byte, 0 to 18) on Haswell, Skylake, and Zen for the implementations x64, x64shld, AVX2, generic64, 2M-SSE, and 4M-AVX2.]

Measurements were taken using the official Keccak Code Package.


SHA3 Parallel Hashing: Two and Four Messages

(1M) 64-bit native instructions.
(2M) 128-bit vector instructions [SSE2/AVX].
(4M) 256-bit vector instructions [AVX2].

[Figure: speedup (1 to 4) against number of messages (1 to 4) on Haswell, Skylake, and Zen.]

The performance of Zen does not scale well when hashing 4 messages.


Section 3

Elliptic Curve Cryptography


3.1

Elliptic Curves

ECC: Software Implementation

• Introduction
• Point Multiplication kP
• Elliptic Curve Diffie-Hellman (X25519, X448)
• Digital Signature (EdDSA)
• Performance (vector instructions on Intel Haswell/Skylake)



Elliptic Curve Cryptography (ECC)

• In 1985, Koblitz [8] and Miller [9] independently suggested the use of elliptic curves for cryptographic purposes.
• ECC achieves the same security as RSA-based protocols using shorter key sizes. For example, at the 128-bit security level:
  • RSA uses keys of 3,072 bits.
  • ECC uses keys of 256 bits.
• Applications of ECC:
  • Key-agreement protocols.
  • Digital signatures.
  • Bitcoin.
  • End-to-end encryption.
  • Smart card security.


Mathematical Aspects of Elliptic Curves

• An elliptic curve is defined by the following equation:

\[
E/\mathbb{F}_p : y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6
\]

where a_1, a_2, a_3, a_4, a_6 ∈ F_p and p is a prime number.

• The points of an elliptic curve form a commutative group, with O as identity:

\[
(E, +) = \{(x, y) \in E\} \cup \{O\}
\]

• The addition of two different points (x_3, y_3) = (x_1, y_1) + (x_2, y_2), with x_1 ≠ x_2, on a curve in the short form y^2 = x^3 + a_4 x + a_6 is calculated as:

\[
x_3 = \left(\frac{y_2 - y_1}{x_2 - x_1}\right)^2 - x_1 - x_2,
\qquad
y_3 = \left(\frac{y_2 - y_1}{x_2 - x_1}\right)(x_1 - x_3) - y_1
\]
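The addition formula can be tried on a toy curve. The sketch below assumes a short Weierstrass curve over a small prime field (so that 64-bit integers never overflow) and two points with different x-coordinates; the names `mod`, `inv_mod`, `point`, and `ec_add_distinct` are chosen for illustration. The example values use the curve y^2 = x^3 + 2x + 2 over F_17, which is an assumption of this example and not a curve from the slides.

```c
#include <stdint.h>

/* Reduce x into the range [0, p). */
static int64_t mod(int64_t x, int64_t p) {
    int64_t r = x % p;
    return r < 0 ? r + p : r;
}

/* Modular inverse via Fermat's little theorem: a^(p-2) mod p, for prime p. */
static int64_t inv_mod(int64_t a, int64_t p) {
    int64_t result = 1, base = mod(a, p), e = p - 2;
    while (e > 0) {
        if (e & 1) result = result * base % p;
        base = base * base % p;
        e >>= 1;
    }
    return result;
}

typedef struct { int64_t x, y; } point;

/* P + Q for affine points with P.x != Q.x on y^2 = x^3 + a4*x + a6 over F_p.
   The curve coefficients do not appear in the chord formula. */
point ec_add_distinct(point P, point Q, int64_t p) {
    /* Slope of the line through P and Q. */
    int64_t lambda = mod((Q.y - P.y) * inv_mod(Q.x - P.x, p), p);
    point R;
    R.x = mod(lambda * lambda - P.x - Q.x, p);
    R.y = mod(lambda * (P.x - R.x) - P.y, p);
    return R;
}
```

For instance, on the toy curve above, adding (5, 1) and (6, 3) gives slope 2 and the result (10, 6). The division in the slope is the expensive step, which is why practical implementations prefer projective coordinates that postpone the inversion.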
Elliptic Curve Cryptography Elliptic Curves

Point Addition

Let P and Q two points in the curve, then we can compute P + Q using a
geometric construction:

Julio López (IC-UNICAMP) Efficient Algorithms in Software ASCrypto 2017 56 / 83


Elliptic Curve Cryptography Elliptic Curves

Point Addition

Let P and Q two points in the curve, then we can compute P + Q using a
geometric construction:

• Trace a line passing through P


and Q.

Julio López (IC-UNICAMP) Efficient Algorithms in Software ASCrypto 2017 56 / 83


Elliptic Curve Cryptography Elliptic Curves

Point Addition

Let P and Q two points in the curve, then we can compute P + Q using a
geometric construction:

• Trace a line passing through P


and Q.
• This line will intersect the curve in
a point R.

Julio López (IC-UNICAMP) Efficient Algorithms in Software ASCrypto 2017 56 / 83


Elliptic Curve Cryptography Elliptic Curves

Point Addition

Let P and Q two points in the curve, then we can compute P + Q using a
geometric construction:

• Trace a line passing through P


and Q.
• This line will intersect the curve in
a point R.
• Trace a vertical line passing
through R.

Julio López (IC-UNICAMP) Efficient Algorithms in Software ASCrypto 2017 56 / 83


Elliptic Curve Cryptography Elliptic Curves

Point Addition

Let P and Q be two points on the curve. Then P + Q can be computed using a
geometric construction:

• Trace a line passing through P and Q.
• This line intersects the curve at a third point R.
• Trace a vertical line passing through R.
• The point where this vertical line intersects the curve again is defined
  as the sum P + Q.


Point Doubling

The addition of a point P with itself can be computed as follows:

• Trace the line tangent to the curve at point P.
• This line intersects the curve at a point R.
• Trace a vertical line passing through R.
• The point where this vertical line intersects the curve again is defined as 2P.
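
The tangent construction corresponds to the formulas below for the short form $y^2 = x^3 + ax + b$. The slide gives only the geometric picture; the formulas are the standard ones, and the toy curve over $\mathbb{F}_{97}$ is an illustrative assumption.

```latex
\begin{align*}
\lambda &= \frac{3x_1^2 + a}{2y_1}, \qquad x_3 = \lambda^2 - 2x_1, \qquad y_3 = \lambda (x_1 - x_3) - y_1 \\
\text{Example: } & P = (3, 6) \text{ on } y^2 = x^3 + 2x + 3 \text{ over } \mathbb{F}_{97} \\
\lambda &= \frac{3 \cdot 9 + 2}{2 \cdot 6} = 29 \cdot 12^{-1} = 29 \cdot 89 \equiv 59 \pmod{97} \\
2P &= \left(59^2 - 6,\; 59 \cdot (3 - 80) - 6\right) \equiv (80, 10) \pmod{97}
\end{align*}
```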


Point Multiplication kP

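Given an integer k and a point P ∈ E, point multiplication is defined as:

$$kP = \underbrace{P + P + \cdots + P}_{k \text{ times}}$$

For example:

$$15P = (1111)_2 P = (2^3 + 2^2 + 2^1 + 1)P = 2^3 P + 2^2 P + 2^1 P + P$$

In general, for $k = (k_{n-1}, \ldots, k_1, k_0)_2$:

$$kP = k_{n-1} 2^{n-1} P + \cdots + k_1 2P + k_0 P$$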


Point Multiplication: Double-and-Add algorithm

Input: P ∈ E and k ∈ Z+.
Output: kP
    (k_{n−1}, ..., k_1, k_0)_2 ← k
    Q ← O
    for i ← n − 1 down to 0 do
        Q ← 2Q
        Q ← Q + k_i P
    end for
    return Q
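
The loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the toy curve $y^2 = x^3 + 2x + 3$ over $\mathbb{F}_{97}$, the point `G`, and the helper names `ec_add` and `double_and_add` are hypothetical and chosen only so that the values are easy to check by hand.

```python
# Toy parameters (assumed for illustration): y^2 = x^3 + A*x + B over F_P_MOD.
P_MOD, A, B = 97, 2, 3
O = None  # point at infinity (group identity)

def ec_add(p1, p2):
    """Affine addition on the toy curve, covering doubling and the identity."""
    if p1 is O:
        return p2
    if p2 is O:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return O  # p2 == -p1 (this also covers doubling a point with y == 0)
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD)  # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)         # chord slope
    lam %= P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def double_and_add(k, point):
    """Scan the bits of k from k_{n-1} down to k_0, as in the pseudocode."""
    q = O
    for bit in bin(k)[2:]:
        q = ec_add(q, q)          # Q <- 2Q
        if bit == "1":
            q = ec_add(q, point)  # Q <- Q + k_i P
    return q

G = (3, 6)                        # a point on the toy curve
print(double_and_add(2, G))       # (80, 10)
```

The branch on each bit of k makes the running time depend on the scalar, so a sketch like this is only suitable when k is public.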


Techniques for kP

The operation kP can be performed using different techniques:


• Double-and-Add Algorithm (right-to-left)
• Montgomery Algorithm.
• w-NAF representations.
• Fixed recoding representations.
• Elliptic curves with endomorphism, GLV/GLS curves.


Elliptic Curve Discrete Logarithm Problem (ECDLP)

Given two points P and Q, the problem of finding an integer k such that
Q = kP is known as the elliptic curve discrete logarithm problem.
• Pollard's rho algorithm is the best known generic algorithm for solving the
  ECDLP. Its complexity is

  $$O\left(\sqrt{\#E(\mathbb{F}_p)}\right),$$

  where $\#E(\mathbb{F}_p) \approx p$ is the number of points on the curve.
• For example, for an elliptic curve defined over a prime field with
  $p \approx 2^{256}$, about $2^{128}$ operations are required to solve the ECDLP.
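
The square-root estimate can be turned into a quick calculation of the security level in bits. The function name `generic_attack_bits` is a hypothetical helper, and the sketch uses the approximation $\#E(\mathbb{F}_p) \approx p$ from the slide rather than exact group orders.

```python
import math

def generic_attack_bits(group_order: int) -> float:
    """log2 of sqrt(group_order): rough cost of a generic square-root attack."""
    return math.log2(group_order) / 2

print(generic_attack_bits(2**256))                   # 128.0
print(round(generic_attack_bits(2**255 - 19), 1))    # 127.5
```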


The Standardized Elliptic Curves by NIST

• In 1999, NIST standardized a set of elliptic curves for computing digital
  signatures (ECDSA) and for the key-agreement protocol (ECDH) [10].
• NIST's curves have the following equation:

  $$E/\mathbb{F}_p : y^2 = x^3 - 3x + b$$

• Prime curves: P-256 and P-384

            P-256                                          P-384
  Security  128-bit                                        192-bit
  p         $2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$     $2^{384} - 2^{128} - 2^{96} + 2^{32} - 1$
  b         0x5ac635d...27d2604b                           0xb3312fa...d3ec2aef
  #E        $2^{256} - 2^{224} + 2^{192} - 2^{128} + t$    $2^{384} - t$
  t         0xbce6faa...fc632551                           0x389cb27...333ad68d


RFC7748: Edwards/Montgomery Elliptic Curves

In January 2016, RFC 7748 recommended the use of Curve25519 and
Curve448 in two elliptic curve models:
• Edwards curves: $E : ax^2 + y^2 = 1 + dx^2 y^2$.
• Montgomery curves: $E : v^2 = u^3 + Au^2 + u$.

             Curve25519, Bernstein [1, 2]               Curve448, Hamburg [5]
  Security   128-bit                                    224-bit
  p          $2^{255} - 19$                             $2^{448} - 2^{224} - 1$
  (a, d, A)  $(-1, -\frac{121665}{121666}, 486662)$     $(1, -39081, 156326)$
  #E         $8\ell$                                    $4\ell$

  ℓ (Curve25519) = $2^{252}$ + 0x14def9dea2f79cd65812631a5cf5d3ed
  ℓ (Curve448)   = $2^{446}$ − 0x8335dc163bb124b65129c96fde933d8d723a70aadc873d6d54a7bb0d
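
The Curve25519 column can be sanity-checked with a few lines of Python. The sketch assumes the standard relation $A = 2(a + d)/(a - d)$ between a twisted Edwards curve and its Montgomery form, which the slide does not state, and uses Hasse's bound to confirm that $8\ell$ is a plausible group order; the value of $\ell$ is the one given in RFC 7748.

```python
import math

p = 2**255 - 19
a, A = -1, 486662
d = (-121665 * pow(121666, -1, p)) % p            # d = -121665/121666 in F_p

# Edwards <-> Montgomery parameter relation (assumed standard identity).
assert (2 * (a + d) * pow((a - d) % p, -1, p)) % p == A

ell = 2**252 + 0x14def9dea2f79cd65812631a5cf5d3ed
order = 8 * ell                                   # #E = 8 * ell

# Hasse: |#E - (p + 1)| <= 2 * sqrt(p)
assert abs(order - (p + 1)) <= 2 * math.isqrt(p) + 1
print(ell.bit_length(), order.bit_length())       # 253 256
```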



3.2

Elliptic Curve Diffie-Hellman



Diffie-Hellman Protocol using Montgomery Curves

RFC 7748 recommends the use of two functions to compute a shared
secret:
  X25519   Keys of 32 bytes.
  X448     Keys of 56 bytes.

  Party A                                   Party B
  $a \xleftarrow{\$} \{0, 1\}^{256}$        $b \xleftarrow{\$} \{0, 1\}^{256}$
  $K_A \leftarrow X25519(9, a)$             $K_B \leftarrow X25519(9, b)$
                 (the parties exchange $K_A$ and $K_B$)
  $K = X25519(K_B, a)$                      $K = X25519(K_A, b)$

K is the shared secret.
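
The exchange can be sketched end to end in Python. The function below follows the X25519 description in RFC 7748 but keeps the slide's argument order (point first, scalar second), whereas the RFC writes the scalar first. It is a readable sketch, not a hardened implementation: the swap uses a Python branch and is therefore not constant time.

```python
import os

P = 2**255 - 19
A24 = 121665                      # (486662 - 2) / 4, the constant used in RFC 7748

def x25519(u: bytes, k: bytes) -> bytes:
    """X25519 with the slide's argument order: (u-coordinate, scalar)."""
    kb = bytearray(k)
    kb[0] &= 248
    kb[31] &= 127
    kb[31] |= 64                  # scalar clamping from RFC 7748
    scalar = int.from_bytes(kb, "little")
    x1 = int.from_bytes(u, "little") & ((1 << 255) - 1)
    x2, z2, x3, z3, swap = 1, 0, x1, 1, 0
    for t in range(254, -1, -1):
        bit = (scalar >> t) & 1
        swap ^= bit
        if swap:                  # NOT constant time; real code uses a masked cswap
            x2, x3, z2, z3 = x3, x2, z3, z2
        swap = bit
        a, b = (x2 + z2) % P, (x2 - z2) % P
        aa, bb = a * a % P, b * b % P
        e = (aa - bb) % P
        c, d = (x3 + z3) % P, (x3 - z3) % P
        da, cb = d * a % P, c * b % P
        x3 = (da + cb) ** 2 % P
        z3 = x1 * (da - cb) ** 2 % P
        x2 = aa * bb % P
        z2 = e * (aa + A24 * e) % P
    if swap:
        x2, x3, z2, z3 = x3, x2, z3, z2
    return (x2 * pow(z2, P - 2, P) % P).to_bytes(32, "little")

BASE = (9).to_bytes(32, "little")            # u = 9, the base point
a, b = os.urandom(32), os.urandom(32)        # private keys of the two parties
KA, KB = x25519(BASE, a), x25519(BASE, b)    # public keys
assert x25519(KB, a) == x25519(KA, b)        # both sides derive the same K
```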


The X Function

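Internally, X is the calculation of an elliptic curve point multiplication kP.

Montgomery ladder algorithm.
Input: P ∈ E and k ∈ Z+.
Output: kP
    (k_{n−1} = 1, ..., k_0)_2 ← k, and let k_n = 0
    Q0 ← O
    Q1 ← P
    for i ← n − 1 down to 0 do
        b ← k_i ⊕ k_{i+1}
        Q0, Q1 ← cswap(b, Q0, Q1)
        Q0, Q1 ← 2Q0, Q0 + Q1
    end for
    Q0, Q1 ← cswap(k_0, Q0, Q1)
    return Q0

Starting the loop at i = n − 1 from the pair (O, P) keeps the accumulated swap
state consistent with the first processed bit; it matches the initial values
Q0 ← O, Q1 ← P of the example below.

Example: 22P, with 22 = (10110)_2. The table lists the ladder pair after each
bit, in unswapped order, starting from Q0 ← O and Q1 ← P:

    k_i    Q0     Q1
    1      P      2P
    0      2P     3P
    1      5P     6P
    1      11P    12P
    0      22P    23P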
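
The control flow of the ladder, started from the pair (O, P) as in the example table, can be checked with a small model in which a point mP is represented by the integer m, so that doubling is `2 * q0` and the differential addition is `q0 + q1`. The integer model and the names `cswap` and `ladder` are assumptions made for illustration; a real implementation operates on curve points.

```python
def cswap(b, x, y):
    """Conditional swap written with a mask instead of a branch."""
    mask = -b                     # 0 if b == 0, all ones (-1) if b == 1
    t = mask & (x ^ y)
    return x ^ t, y ^ t

def ladder(k: int) -> int:
    """Return m such that the ladder output is mP (points modelled as integers)."""
    q0, q1 = 0, 1                 # Q0 <- O, Q1 <- P
    prev = 0                      # plays the role of k_n = 0
    for i in range(k.bit_length() - 1, -1, -1):
        ki = (k >> i) & 1
        q0, q1 = cswap(ki ^ prev, q0, q1)
        q0, q1 = 2 * q0, q0 + q1  # one doubling and one differential addition
        prev = ki
    q0, q1 = cswap(prev, q0, q1)  # prev equals k_0 here
    return q0

print(ladder(22))                 # 22
```

Every iteration performs the same doubling and addition regardless of the bit value, which is the property that makes the ladder attractive for secret scalars.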


Representation of Prime Field Elements

Elements of $\mathbb{F}_p$ are split into words of size $w$:

$$a \in \mathbb{F}_p, \qquad a = \sum_{i=0}^{t-1} a_i 2^{wi} = a_0 + a_1 2^{w} + a_2 2^{2w} + \cdots, \qquad \text{where } t = \left\lceil \frac{|p|}{w} \right\rceil.$$

Let W be the machine's word size; then there are two cases:
  w = W   Full-radix or saturated arithmetic.
  w < W   Reduced-radix, redundant representation, unsaturated arith...

E.g., for $p = 2^{255} - 19$ and a W = 64 instruction set,
use an array of t = 5 words storing coefficients of w = 51 bits.
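
The reduced-radix case can be illustrated with Python integers standing in for 64-bit words. The helper names `to_limbs`, `from_limbs` and `add_limbs` are hypothetical; the point of the sketch is that with w = 51 the limbs can be added without carry propagation while still fitting in a 64-bit word.

```python
P = 2**255 - 19
W, T = 51, 5
MASK = (1 << W) - 1

def to_limbs(a: int) -> list:
    """Split a (0 <= a < 2^255) into T limbs of W bits, least significant first."""
    return [(a >> (W * i)) & MASK for i in range(T)]

def from_limbs(limbs) -> int:
    """Evaluate sum(a_i * 2^(W*i)); limbs may exceed W bits (redundant form)."""
    return sum(ai << (W * i) for i, ai in enumerate(limbs))

def add_limbs(x, y):
    """Limb-wise addition with no carry propagation between limbs."""
    return [xi + yi for xi, yi in zip(x, y)]

x, y = P - 1, P - 2
z = add_limbs(to_limbs(x), to_limbs(y))
assert all(zi < 2**64 for zi in z)            # each limb still fits a 64-bit word
assert from_limbs(z) % P == (x + y) % P       # the value is preserved modulo p
```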


X25519 Shared Secret Computation

Full-radix: using MULX+ADCX/ADOX yields an 11-14% reduction in running time
relative to the fastest implementation reported in SUPERCOP.
Reduced-radix: an additional 8-10% is obtained by using AVX2.

[Figure: bar chart of X25519 running time (10^3 cycles, with a 100 Kcc mark) on Haswell and Skylake for four implementations: Moon (floodyberry), x64; Tung, SAC 2015, x64+SSE2; Oliveira et al., SAC 2017, x64 (MULX/ADCX); our code, AVX2 and MULX/ADCX.]


X448 Shared Secret Computation

Our code reduces the timings reported by Hamburg by 13% on Haswell and by 17%
on Skylake.

[Figure: bar chart of X448 running time (10^3 clock cycles) on Haswell and Skylake for three implementations: eBACS (supercop), x64; Hamburg, x64; our code, AVX2.]



3.3

Digital Signatures

Digital Signatures

• They are used to verify both the integrity and the authenticity of a message.
• Basic operations:
    Sign    Given a message, the signing algorithm uses the signer's private
            key to compute a bit string, called the signature.
    Verify  Determines whether a signature is valid, i.e., whether the
            signature for the message was created using the private key
            corresponding to the referenced public key.


Signature Generation

[Figure: signature generation diagram with a Hash block, a Signing block, and the Private Key as input.]

• The message is processed through a cryptographic hash function H
  to obtain a digest value.
• The digest, along with the private key, is used to generate a signature.
• Both message and signature must be sent together for later verification.

Signature Verification

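[Figure: signature verification diagram with a Verification block, the Public Key as input, and the outcomes Valid or Reject.]

• Using the signer's public key, the verification algorithm determines
  whether a signature is valid.
• A valid signature ensures the authenticity of the signer and the integrity
  of the message.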


Digital Signatures

1991  PKCS#1: Rivest-Shamir-Adleman scheme (RSA).
1993  FIPS 186: Digital Signature Algorithm (DSA).
1999  ANSI X9.62: Elliptic Curve Digital Signature Algorithm (ECDSA).
2011  Bernstein et al. proposed the Edwards Digital Signature Algorithm (EdDSA).
2015  EdDSA is in a draft of the IETF for discussion [6].
2017  EdDSA is described in RFC 8032 [7].

The use of EdDSA is increasing; for instance, OpenSSH now supports
Ed25519 signatures.



3.4

EdDSA Scheme

Edwards Digital Signature Algorithm

• This is a novel signature scheme based on Edwards curves.
• RFC 8032 [7] describes the usage of two instances of EdDSA: Ed25519 and Ed448.
• EdDSA delivers digital signatures faster than ECDSA.
• It consists of three primitive operations:
    • Key Generation.
    • Signing.
    • Verification.


EdDSA: Domain parameters

• Public key of b bits and signature size of 2b bits.
• $E_d(\mathbb{F}_p)$, an Edwards curve over a prime field.
• $\ell \cdot h = \#E_d(\mathbb{F}_p)$, the number of points on the curve.
• $B \neq (0, 1)$, a generator point.
• $c \in \{2, 3\}$ and $n = \log_2(\ell)$, two constants, with $c \le n < b$.
• $s = \mathrm{Encode}(P)$ converts a point $P = (x, y)$ into a string $s$:
  $$s = (x \bmod 2) \,\|\, y$$
• $(x, y) = \mathrm{Decode}(s)$ converts a string $s$ into a pair $(x, y)$:
  $$y = s \bmod 2^{b-1}, \qquad x = \sqrt{\frac{y^2 - 1}{d y^2 - a}}$$
  such that $x \equiv s_{b-1} \pmod 2$.
• H, a hash function producing 2b bits.
    • Ex: use of the SHAKE128 function, which is part of the SHA3 standard.
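
Encode and Decode can be sketched for the Ed25519 parameters (b = 256, a = −1, d = −121665/121666). The little-endian byte layout and the square-root method for $p \equiv 5 \pmod 8$ follow RFC 8032; the function names `encode` and `decode` are illustrative.

```python
p = 2**255 - 19
a = -1
d = (-121665 * pow(121666, -1, p)) % p
b = 256

def encode(x: int, y: int) -> bytes:
    """b-bit little-endian string: y in bits 0..b-2, parity of x in bit b-1."""
    return (y | ((x & 1) << (b - 1))).to_bytes(b // 8, "little")

def decode(s: bytes):
    """Recover (x, y), or return None if s does not encode a curve point."""
    v = int.from_bytes(s, "little")
    y, sign = v & ((1 << (b - 1)) - 1), v >> (b - 1)
    if y >= p:
        return None
    u = (y * y - 1) * pow((d * y * y - a) % p, -1, p) % p   # x^2
    x = pow(u, (p + 3) // 8, p)                             # candidate square root
    if (x * x - u) % p != 0:
        x = x * pow(2, (p - 1) // 4, p) % p                 # multiply by sqrt(-1)
    if (x * x - u) % p != 0:
        return None                                         # u is not a square
    if x == 0 and sign == 1:
        return None
    if (x & 1) != sign:
        x = p - x                                           # pick the root with the right parity
    return x, y

s = bytes.fromhex("58" + "66" * 31)      # encoding of the Ed25519 base point (y = 4/5)
x, y = decode(s)
assert 5 * y % p == 4
assert (a * x * x + y * y - 1 - d * x * x * y * y) % p == 0   # point is on the curve
assert encode(x, y) == s
```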

EdDSA: Key Generation

Computing the secret and public keys, (sk, pk):

    $sk \in_R [0, \ell)$
    $h = (h_{2b-1}, \ldots, h_0)_2 \leftarrow H(sk)$
    $a \leftarrow 2^n + \sum_{c \le i < n} 2^i h_i$    (a has n + 1 bits, with the bottom c bits cleared)
    $pk \leftarrow aB$
    return (sk, pk)
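
The derivation of the secret scalar a can be sketched for Ed25519, assuming the RFC 8032 values b = 256, c = 3, n = 254 and H = SHA-512 with little-endian bit order. The second function shows the equivalent byte-level "clamping" found in many implementations; both function names are hypothetical.

```python
import hashlib

b, c, n = 256, 3, 254             # Ed25519 values from RFC 8032 (assumed here)

def secret_scalar(sk: bytes) -> int:
    """a = 2^n + sum(2^i * h_i for c <= i < n), with h = H(sk) read little-endian."""
    h = int.from_bytes(hashlib.sha512(sk).digest(), "little")
    return (1 << n) + sum(((h >> i) & 1) << i for i in range(c, n))

def secret_scalar_clamped(sk: bytes) -> int:
    """The same value, computed by masking bytes of the lower half of H(sk)."""
    hl = bytearray(hashlib.sha512(sk).digest()[:32])
    hl[0] &= 248                  # clear the bottom c = 3 bits
    hl[31] &= 127                 # clear bit 255
    hl[31] |= 64                  # set bit n = 254
    return int.from_bytes(hl, "little")

sk = bytes(range(32))
a = secret_scalar(sk)
assert a == secret_scalar_clamped(sk)
assert a % 2**c == 0 and a.bit_length() == n + 1
```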


EdDSA: Signing

Given a message M and the pair of keys (sk, pk),
compute the signature (R, S) as:

    $h = (\underbrace{h_{2b-1}, \ldots, h_b}_{h_H}, \underbrace{h_{b-1}, \ldots, h_0}_{h_L})_2 \leftarrow H(sk)$
    $a \leftarrow 2^n + \sum_{c \le i < n} 2^i h_i$
    $r \leftarrow H(h_H \,\|\, M) \pmod{\ell}$
    $R' \leftarrow rB$
    $R \leftarrow \mathrm{Encode}(R')$
    $S \leftarrow r + H(R \,\|\, pk \,\|\, M) \cdot a \pmod{\ell}$
    return (R, S)


EdDSA: Verification

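Given a message M, a signature (R, S) and a public key pk:

    $P \leftarrow \mathrm{Decode}(pk)$
    $R' \leftarrow \mathrm{Decode}(R)$
    $h \leftarrow H(R \,\|\, pk \,\|\, M) \pmod{\ell}$

Accept the signature if the following holds:

    $P \in E_d(\mathbb{F}_p)$ and $S \in [0, \ell)$ and $SB = R' + hP$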
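
Key generation, signing and verification can be put together in a compact Python sketch for Ed25519. It assumes H = SHA-512, a 32-byte seed as sk (as in RFC 8032, rather than an integer in [0, ℓ)), and slow affine arithmetic with a branching double-and-add, so it is meant for reading, not for production use. It has not been checked against the RFC test vectors here; the assertions only test self-consistency.

```python
import hashlib

p = 2**255 - 19
ell = 2**252 + 0x14def9dea2f79cd65812631a5cf5d3ed
d = (-121665 * pow(121666, -1, p)) % p

def inv(z):
    return pow(z % p, p - 2, p)

def add(P, Q):
    """Addition law on -x^2 + y^2 = 1 + d x^2 y^2 (affine coordinates)."""
    (x1, y1), (x2, y2) = P, Q
    t = d * x1 * x2 * y1 * y2 % p
    return ((x1 * y2 + y1 * x2) * inv(1 + t) % p,
            (y1 * y2 + x1 * x2) * inv(1 - t) % p)

def mul(k, P):
    Q = (0, 1)                                   # identity element
    for bit in bin(k)[2:]:
        Q = add(Q, Q)
        if bit == "1":
            Q = add(Q, P)
    return Q

def encode(P):
    x, y = P
    return (y | ((x & 1) << 255)).to_bytes(32, "little")

def decode(s):
    v = int.from_bytes(s, "little")
    y, sign = v & ((1 << 255) - 1), v >> 255
    if y >= p:
        return None
    u = (y * y - 1) * inv(d * y * y + 1) % p
    x = pow(u, (p + 3) // 8, p)
    if (x * x - u) % p:
        x = x * pow(2, (p - 1) // 4, p) % p
    if (x * x - u) % p or (x == 0 and sign):
        return None
    return (p - x if (x & 1) != sign else x, y)

B = decode(bytes.fromhex("58" + "66" * 31))      # base point, y = 4/5

def H(*parts):
    return int.from_bytes(hashlib.sha512(b"".join(parts)).digest(), "little")

def expand(sk):
    h = hashlib.sha512(sk).digest()
    a = int.from_bytes(h[:32], "little")
    a = (a & ((1 << 254) - 8)) | (1 << 254)      # clear bottom 3 bits, set bit 254
    return a, h[32:]                             # (scalar a, upper half h_H)

def keygen(sk):
    return encode(mul(expand(sk)[0], B))         # pk = Encode(aB)

def sign(sk, pk, M):
    a, hH = expand(sk)
    r = H(hH, M) % ell
    R = encode(mul(r, B))
    S = (r + H(R, pk, M) * a) % ell
    return R, S

def verify(pk, M, sig):
    R, S = sig
    P, Rp = decode(pk), decode(R)
    if P is None or Rp is None or not 0 <= S < ell:
        return False
    h = H(R, pk, M) % ell
    return mul(S, B) == add(Rp, mul(h, P))       # SB = R' + hP

sk = bytes(range(32))
pk = keygen(sk)
sig = sign(sk, pk, b"ASCrypto 2017")
assert verify(pk, b"ASCrypto 2017", sig)
```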


Optimization Techniques for EdDSA

Optimization focuses on two main operations:

• kP, when P is known in advance (a fixed point).
• kP + lQ, when P is known and Q is an arbitrary point.


Fixed-point mult: computing kP when P is known


Input: k, an n-bit integer;
       w, an integer window size;
       P, a fixed point of order ℓ.
Output: Q, a point such that Q = kP.
Off-line computation:
    Compute the look-up tables {T_i ← d 2^{wi} P} for odd d ∈ [1, 2^{w−1}] and
    all i ∈ [0, t), where t = ⌈n/w⌉.
On-line computation:
    t ← ⌈n/w⌉
    Q ← O
    Let (K_0, K_1, ..., K_{t−1})_w be the signed radix-w representation of k.
    for i ← 0 to t − 1 do
        R ← Query(T_i, K_i)
        Q ← Q + R
    end for
    return Q

Query must be protected against side-channel attacks.
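
The table-based method can be modelled in Python by representing a point mP as the integer m. This sketch departs from the slide in two labelled ways: it stores every multiple d in [1, 2^{w−1}] rather than only odd d (the odd-only table relies on a recoding with odd digits that is not reproduced here), and it uses one extra digit to absorb the final carry. The names `signed_digits`, `query` and `fixed_point_mul` are hypothetical, and w ≥ 2 is assumed.

```python
def signed_digits(k: int, w: int, t: int):
    """Signed radix-2^w digits K_i in [-2^(w-1), 2^(w-1)) with k = sum(K_i * 2^(w*i))."""
    digits = []
    for _ in range(t):
        di = k & ((1 << w) - 1)
        if di >= 1 << (w - 1):
            di -= 1 << w          # negative digit, with a carry into the next one
        digits.append(di)
        k = (k - di) >> w
    assert k == 0, "t is too small for this k"
    return digits

def query(table, digit):
    """Select sign(digit) * table[|digit| - 1] by scanning every entry."""
    mag, result = abs(digit), 0
    for dval, entry in enumerate(table, start=1):
        hit = int(dval == mag)    # real code derives this mask without branching
        result += hit * entry
    return -result if digit < 0 else result

def fixed_point_mul(k: int, n: int, w: int, P: int = 1) -> int:
    t = -(-n // w) + 1            # ceil(n / w), plus one digit for the carry
    # Off-line: T_i[d - 1] = d * 2^(w*i) * P for d in [1, 2^(w-1)]
    tables = [[d * (1 << (w * i)) * P for d in range(1, (1 << (w - 1)) + 1)]
              for i in range(t)]
    # On-line: one table query and one addition per digit, no doublings
    Q = 0
    for Ti, Ki in zip(tables, signed_digits(k, w, t)):
        Q += query(Ti, Ki)
    return Q

assert fixed_point_mul(0xBEEF, 16, 4) == 0xBEEF
```

Scanning the whole table in `query` mimics the access pattern that a protected look-up needs, since the memory addresses touched do not depend on the secret digit.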



Double-point mult: computing kP + lQ when P is known and Q is an arbitrary point

One efficient algorithm is the interleaving method using ω-NAF.

• Obtain the ω-NAF of k and l: {k_i} ← k and {l_i} ← l.
• There exists a pair (ω_k, ω_l) that minimizes the number of operations.
• Precompute T_d = dP for odd d ∈ [1, 2^{ω_k − 1}].
• Compute U_d = dQ for odd d ∈ [1, 2^{ω_l − 1}].
      R ← O
      for i ← n − 1 down to 0 do
          R ← 2R
          if k_i ≠ 0 then R ← R + T_{k_i}
          if l_i ≠ 0 then R ← R + U_{l_i}
      end for
  For a negative digit, the table entry of its absolute value is subtracted.
• R is the required point kP + lQ.
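
The interleaving loop can be modelled in Python with integers standing in for points, so that the result can be compared directly with k·P + l·Q. The window sizes 5 and 3 and the function names `wnaf` and `double_mul` are arbitrary illustrative choices, not the optimal pair mentioned above.

```python
def wnaf(k: int, w: int):
    """Width-w NAF digits, least significant first; nonzero digits are odd, |digit| < 2^(w-1)."""
    digits = []
    while k > 0:
        if k & 1:
            di = k % (1 << w)
            if di >= 1 << (w - 1):
                di -= 1 << w
            k -= di
        else:
            di = 0
        digits.append(di)
        k >>= 1
    return digits

def double_mul(k, l, P, Q, wk=5, wl=3):
    """Compute k*P + l*Q with a single shared chain of doublings."""
    kd, ld = wnaf(k, wk), wnaf(l, wl)
    n = max(len(kd), len(ld))
    kd += [0] * (n - len(kd))
    ld += [0] * (n - len(ld))
    T = {dd: dd * P for dd in range(1, 1 << (wk - 1), 2)}   # odd multiples of P
    U = {dd: dd * Q for dd in range(1, 1 << (wl - 1), 2)}   # odd multiples of Q
    R = 0
    for i in range(n - 1, -1, -1):
        R = 2 * R
        if kd[i]:
            R += T[kd[i]] if kd[i] > 0 else -T[-kd[i]]
        if ld[i]:
            R += U[ld[i]] if ld[i] > 0 else -U[-ld[i]]
    return R

assert double_mul(1234, 5678, 1, 1000003) == 1234 * 1 + 5678 * 1000003
```

Because P is fixed, its table can use a larger window at no on-line cost, while the table for Q has to be built for every call; this is why the two window sizes are chosen separately.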


Improvements on Ed25519 Signature Generation

The synergy between AVX2, MULX, and ADCX/ADOX instructions increases
the performance of the signing operation.

[Figure: bar chart of Ed25519 signing running time (10^3 cycles) on Haswell and Skylake. Implementations, each labeled with a size: Moon (floodyberry), SSE2, 24 KB; Moon (floodyberry), x64, 24 KB; Schwabe (supercop), x64+SSE2, 30 KB; our code, AVX2 and MULX/ADCX, 12 KB; our code, AVX2 and MULX/ADCX, 24 KB.]


Improvements on Ed448 Signature Generation

Running time was reduced by around 16-18% on the Haswell and Skylake
platforms.

[Figure: bar chart of Ed448 signing running time (10^3 cycles) on Haswell and Skylake for three implementations: supercop, x64; Hamburg, x64; our code, AVX2.]


Thanks for your attention!


jlopez@ic.unicamp.br



References

[1] Daniel J. Bernstein.
    Curve25519: New Diffie-Hellman Speed Records.
    In Moti Yung, Yevgeniy Dodis, Aggelos Kiayias, and Tal Malkin, editors,
    Public Key Cryptography, volume 3958 of Lecture Notes in Computer Science,
    pages 207–228. Springer, 2006.

[2] Daniel J. Bernstein, Niels Duif, Tanja Lange, Peter Schwabe, and Bo-Yin Yang.
    High-speed high-security signatures.
    Journal of Cryptographic Engineering, 2(2):77–89, 2012.

[3] Joppe W. Bos, J. Alex Halderman, Nadia Heninger, Jonathan Moore,
    Michael Naehrig, and Eric Wustrow.
    Elliptic Curve Cryptography in Practice.
    In Nicolas Christin and Reihaneh Safavi-Naini, editors, Financial
    Cryptography and Data Security: 18th International Conference, FC 2014,
    Christ Church, Barbados, March 3-7, 2014, Revised Selected Papers,
    pages 157–175, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg.

[4] Intel Corporation.
    Intel Instruction Set Architecture Extensions.
    Available at https://software.intel.com/en-us/intel-isa-extensions,
    July 2013.

[5] Mike Hamburg.
    Ed448-Goldilocks, a new elliptic curve.
    Cryptology ePrint Archive, Report 2015/625, 2015.
    http://eprint.iacr.org/.

[6] Simon Josefsson and Niels Moeller.
    EdDSA and Ed25519 draft-josefsson-eddsa-ed25519-03.
    Available on https://tools.ietf.org/html/draft-josefsson-eddsa-ed25519-03,
    May 2015.

[7] Simon Josefsson and Ilari Liusvaara.
    Edwards-Curve Digital Signature Algorithm (EdDSA).
    RFC 8032, January 2017.

[8] Neal Koblitz.
    Elliptic Curve Cryptosystems.
    Mathematics of Computation, 48(177):203–209, January 1987.

[9] Victor S. Miller.
    Use of Elliptic Curves in Cryptography.
    In Hugh C. Williams, editor, Advances in Cryptology - CRYPTO '85
    Proceedings, volume 218 of Lecture Notes in Computer Science,
    pages 417–426. Springer Berlin Heidelberg, 1986.

[10] National Institute for Standards and Technology.
    Digital Signature Standard (DSS).
    http://csrc.nist.gov/publications/fips/archive/fips186-2/fips186-2.pdf,
    January 2000.
