Abstract—Although there is no theoretical weakness in a cryptographic algorithm, Side Channel Analysis can find out some secret data from the physical implementation of a cryptosystem. The analysis is based on extra information such as timing information, power consumption, electromagnetic leaks or even sound which can be exploited to break the system. Differential Power Analysis is one of the most popular analyses, as computing the statistical correlations of the secret keys and power consumptions. It is usually necessary to calculate huge data and takes a long time. It may take several weeks for some devices with countermeasures. We suggest and evaluate the methods to shorten the time to analyze cryptosystems. Our methods include distributed computing and parallelized processing.

Keywords—DPA, distributed computing, parallelized processing.

I. INTRODUCTION

Most computing devices including communication devices contain hardware or software cryptographic modules in them and the security relies on the modules. But, although there is no theoretical weakness in a cryptographic algorithm, the physical implementation of the cryptosystem may leak some information that makes the cryptosystem vulnerable to side channel attacks [1].

Side channel analysis (SCA) is a cryptanalysis which analyzes the side channel information such as timing information, power consumption, electromagnetic leaks or even sound of devices. If cryptosystems do not have countermeasures against SCA, they may face serious risks. SCA on devices is a necessary procedure to verify the safety of computing devices.

Differential Power Analysis (DPA) is a kind of SCA. To perform a DPA test for a device, it is needed to calculate statistics from huge data, which are the side channel information. It may take several weeks for some devices with countermeasures.

Our work suggests and evaluates the methods to shorten the time to analyze cryptosystems. It includes distributed computing over multiple computers and parallelized processing.

II. SIDE CHANNEL ANALYSIS

When a computing device performs some calculations, it consumes electric power, and the consumption varies with the values that it is calculating. So an attacker can use the power consumption data as the leakage information for SCA.

In a broad sense of SCA, there are passive and active analyses. Passive SCA is based on only the leakage information from a cryptosystem. Active SCA is based on some physical impacts to the system. Fault injection is an active analysis, which injects voltage glitch or laser to make devices malfunction. But usually, SCA means only passive SCA and our research focuses on it.

There are two kinds of passive SCAs of Simple Power Analysis (SPA) and Differential Power Analysis (DPA). Both analyses use the power traces of power consumption or electromagnetic radiation from the analyzing device.

SPA interprets the visual graph of power traces. The operations of a cryptographic algorithm are composed of repeating similar arithmetic operations, so the power traces show repeating patterns over time. For some cases, the pattern may mirror the key value.

Fig. 1 SPA for ECC

Fig. 1 is the graph of a power trace during the point multiplication of elliptic curve cryptography [2]. Only doubling calculation occurs when the key bit is 0, and doubling and addition occur when the key bit is 1. The patterns are apparent in the graph and we can easily find out the key.

Although the visual graphs of power traces do not show any useful information to human eyes, that is, SPA is not available DPA may succeed to find the keys.

The attacker of DPA computes the hypothetical changes of intermediate values within cryptographic operations, and compares them with the multiple power traces collected from cryptographic operations. The correct key makes strong correlations between the hypothetical values and the real power consumptions.

The attacker collects multiple power traces from the
cryptographic calculations with random plaintexts. Usually there are many useless noises in the raw power traces, the traces need to be refined by various preprocessing such as frequency filtering or time alignment. Finally, to find the key, the correlation values are calculated between the plaintext and each key of all available intermediate key for the whole traces. The correct key will result in high correlation. The operation quantity of these processes is very huge and takes much time.

This work proposes distributed computing and parallelized processing for acceleration. With distributed processing it also has the benefit that cooperating computers divide the required memory size.

A. Distributed Computing

The performance can be improved by distributing the analysis operations to multiple computers. But it is needed to manage data and resources properly. The overall performance differs as how to structure the distributed system.

Our distributed SCA system is composed of one master system that manages overall distributed processing and multiple slave systems that cooperate on the analysis.

In most trace preprocessing operations, processing a trace is independent to other trace data. So it is easy to distribute operations. Traces are assigned randomly to slave systems, and merged again to the master system after processing.

Distributing DPA operations is a little complicated. A unit DPA operation has to process the data of multiple traces, and the intermediate analysis data should be retained until overall analysis is completed. As a DPA operation of Fig. 2 can be divided by the l index, which is the point position of traces, we divided a trace by the number of computers and allocated each segment to each slave system. Fig. 3 depicts the allocation of trace data to the slaves.

Fig. 2 shows the procedure to get the correlation values $R[L, K, GK]$. If there are some points of which the correlation values are relative high for a guessing key of each key byte, the key is found. The calculation complexity is $O(N \cdot L \cdot K \cdot GK)$ and the required data memory size for $s, s^2, d, d^2, sd$ is about $L \cdot K \cdot GK$. For the case of AES128, $K$ is 16, $GK$ is 256. The value of $N$ or $L$ are various from thousands to millions by test cases.

III. ACCELERATING SCA OPERATIONS

As the time cost for SCA is not negligible, reducing the calculation time for SCA is important. One of accelerating method is using graphics processing units (GPU). GPU has powerful parallel processing function for cryptanalysis [3]. And [4] has implemented the GPU accelerating function for SCA. But there is the limit of the GPU memory size, and I/O overhead increases rapidly when the required memory is bigger than that of the GPU.

\[
\begin{align*}
\text{for each trace } & T[L], P, \\
\text{for each } & k, gk \text{ (where } 0 \leq k < K, 0 \leq gk < GK) \\
data[k, gk] &= \text{hypothetical values of } P \\
d[k, gk] &= data[k, gk] \\
d_{gk}[L, k, gk] &= data[k, gk]^2, \\
\text{for each } & f^{th} \text{ point of } T \text{ (where } 0 \leq f < L) \\
s[f] &= T[f] \\
s_{gk}[f] &= T[f]^2, \\
\text{for each } & l, k, gk \\
sd[l, k, gk] &= T[l] \cdot data[k, gk] \\
\text{finally, for each } & l, k, gk \\
R[l, k, gk] &= \frac{(L \cdot sd[l, k, gk] - d[k, gk] \cdot s_l)}{\sqrt{(L \cdot d_{gk}[l, k, gk] - d[k, gk]^2) \cdot (L \cdot s_{gk}[l] - s_l^2) - d_{gk}[l, k, gk] \cdot s_{gk}[l] - d[k, gk] \cdot s_l - sd[l, k, gk] \cdot s_{gk}[l]}}
\end{align*}
\]
B. Parallelized Processing

SCA procedures include 3 steps of collecting power traces, preprocessing with signal processing, and analyzing of calculating the correlation values. The collecting step which makes raw power traces does not require much computing resource, but the main time factor in collecting traces is related to the performance of the target device. The cryptographic operations of many security devices including smart cards takes much time. The signal processing and statistical computing for the analysis can be huge calculations and it may also take much time although with high performance analyzing systems.

These procedures are common to the commercially available SCA tools [4], [5], and they proceed the steps sequentially as Fig. 5 (a).

We enhanced overall performance by executing other steps concurrently during the collecting step. We designed our implementation to process preprocessing step and analyzing step simultaneously when each new power trace is available. Fig. 5 (b) depicts the parallelized procedure.

Fig. 6 is our comparison result for each procedure. The CPU utilization of the analyzing system with sequential procedure of Fig. 6 (a) is not effective during the collecting step. But the parallelized processing of Fig. 6 (b) utilizes the CPU more effectively as it proceeds the preprocess and analysis steps while collecting power traces and shortens the time to finish the test.

IV. CONCLUSION

Safety against SCA became a common requirement for various security devices. Many standards including CC, EMV, and FIPS140-3 require the safety verification against SCA for security hardware. Our accelerated SCA tools can enhance the verification tests more efficiently as it shortens the time for the test. And our methods are applicable not only DPA but other advanced analyzing method such as High Order DPA.

ACKNOWLEDGMENT

This work was supported by the K-SCARF project, the ICT R&D program of ETRI (Research on Key Leakage Analysis
and Response Technologies)

REFERENCES


