Friday, April 3, 2015

Bin-tropy analysis to detect cryptomalware

Bin-tropy calculation:

Analyzing the file contents in each path or drive is one of the preliminary steps in detecting crypto-malware execution. The main difference between an encrypted file and a normal file is that the characters in an encrypted file are more random than would be expected in a normal file.
The binary entropy (bin-tropy) calculation uses a statistical test based on the Discrete Fourier Transform (DFT) of the file's bit sequence.
The steps involved:
  1. Take the binary sequence of the file content to be analysed and convert each 0 to -1 and each 1 to +1. For example, Seq = 10110101 becomes Seq = 1, -1, 1, 1, -1, 1, -1, 1.
  2. Apply the discrete Fourier transform (DFT) to the sequence. The resulting complex values represent the periodic components of the bit sequence at different frequencies, which reveals any periodic repetition in the input data.
  3. Calculate the modulus of the first n/2 elements of the DFT output, where n is the number of bits in the sequence. This gives the sequence of peak heights.
  4. Compute the 95% peak height threshold value: T = √(log(1/0.05) · n), where log is the natural logarithm.
  5. Under the assumption of randomness, 95% of the peak heights obtained from the sequence should be less than T.
  6. Compare the expected number of peaks below T with the number actually observed. Compute:
       N   = 0.95 (n / 2), the expected (theoretical) number of peaks with heights less than T
       N_1 = the actual number of peaks with heights less than T (as observed)
  7. Find d, the normalized difference between the observed and the expected number of frequency components that are beyond the 95% threshold. In the NIST test suite, d = (N_1 - N) / √(n · 0.95 · 0.05 / 4).
  8. Compute the complementary error function value E = erfc(|d| / √2).
If the computed E value is greater than 0.01, conclude that the input sequence is random (encrypted). Otherwise, it is a non-random sequence (normal).
A d value that is too low means that there are too few peaks below T and too many peaks above T.
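
The steps above can be sketched in Python as follows. This is a minimal, standard-library-only illustration and not the script referred to in this post: the names bin_tropy, bytes_to_pm1 and peak_heights are hypothetical, the formula for d is taken from the NIST test suite this method is based on, and the DFT is a naive O(n^2) loop chosen only to keep the sketch self-contained. A practical version would more likely use an FFT routine such as numpy.fft.fft.

```python
import cmath
import math


def bytes_to_pm1(data: bytes) -> list:
    """Step 1: expand bytes to bits (MSB first), mapping 0 -> -1 and 1 -> +1."""
    return [1 if (byte >> shift) & 1 else -1
            for byte in data for shift in range(7, -1, -1)]


def peak_heights(x: list) -> list:
    """Steps 2-3: moduli of the first n/2 DFT coefficients (naive DFT)."""
    n = len(x)
    # Precompute the n-th roots of unity once and index them by (j * k) mod n.
    twiddle = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]
    return [abs(sum(x[k] * twiddle[(j * k) % n] for k in range(n)))
            for j in range(n // 2)]


def bin_tropy(data: bytes, alpha: float = 0.01):
    """Return (E, looks_random) for `data`; names and defaults are illustrative."""
    x = bytes_to_pm1(data)
    n = len(x)
    if n == 0:
        raise ValueError("empty input")
    m = peak_heights(x)
    t = math.sqrt(math.log(1 / 0.05) * n)            # step 4: threshold T
    n0 = 0.95 * n / 2                                # step 6: expected peaks below T
    n1 = sum(1 for h in m if h < t)                  # step 6: observed peaks below T
    d = (n1 - n0) / math.sqrt(n * 0.95 * 0.05 / 4)   # step 7: normalized difference
    e = math.erfc(abs(d) / math.sqrt(2))             # step 8
    return e, e > alpha


if __name__ == "__main__":
    # A strictly periodic bit pattern (1010...) should be flagged as non-random.
    print(bin_tropy(b"\xAA" * 64))
```

Because the naive DFT is quadratic in the number of bits, this sketch is only practical on a small sample of a file (for example the first few hundred bytes), not on whole drives.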

Limitations of the bin-entropy detection method:

The method is not perfect for very small files or for files that the user has already encrypted.
For example, take a txt file containing “SSN : 0123456789”.
The randomness test gives a false positive here, with E above the threshold, because all of the 14 characters except “S” are unique and therefore look random. Even though it is valid text, the entropy value would be higher than the threshold.
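
One way to guard against the small-file case is to skip files that are too short for the test to be meaningful. The sketch below assumes a cut-off of 1000 bits, which is the minimum sequence length the NIST suite recommends for its DFT test. The names MIN_BITS and enough_bits are hypothetical and are not part of the script referred to in this post.

```python
# Assumed cut-off: NIST recommends at least 1000 bits for the DFT test.
MIN_BITS = 1000


def enough_bits(data: bytes, min_bits: int = MIN_BITS) -> bool:
    """Return True if `data` is long enough for the spectral test to be meaningful."""
    return len(data) * 8 >= min_bits


sample = b"SSN : 0123456789"
print(len(sample) * 8)       # 128 bits, far below the assumed cut-off
print(enough_bits(sample))   # False
```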
In the case of user-encrypted files, the entropy is already high, so if malware starts encrypting the same file again, the script cannot differentiate between “legitimate user encryption” and “unauthorized encryption”.
Source : Python Sourcecode for the implementation can be found in
References: Bin-tropy is calculated based on ‘A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications’, published by the National Institute of Standards and Technology, U.S. Department of Commerce.
Source :

