Bin-entropy calculation:
Analyzing the file contents in each path or drive is one of the preliminary steps in detecting crypto-malware execution. The main difference between an encrypted file and a normal file is that the contents of an encrypted file look far more random than those of a normal file.
The binary entropy calculation uses the Discrete Fourier Transform (spectral) test from a statistical test suite, applied to the bit sequence of the file.
The steps are:
- Take the binary sequence of the file content to be analysed, with length n bits. Convert each 0 to -1 and each 1 to 1. For example, Seq = 10110101 is converted to Seq = 1, -1, 1, 1, -1, 1, -1, 1.
- Apply the discrete Fourier transform (DFT) to the sequence. This decomposes the sequence into periodic components at different frequencies, so any periodic repetition in the input bits shows up as a peak.
- Calculate the modulus of the first n/2 elements of the DFT output, which gives the sequence of peak heights.
- Compute the 95% peak height threshold: T = √(ln(1/0.05) · n)
- Under the assumption of randomness, 95% of the peak heights obtained from the sequence should be less than this threshold value.
- To compare the theoretical number of peaks below the threshold with the observed number, compute N_0 = 0.95 · (n / 2), the expected number of peaks with heights less than T, and N_1, the actual number of peaks with heights less than T (as observed).
- Compute d = (N_1 - N_0) / √(n · 0.95 · 0.05 / 4), the normalized difference between the observed and expected number of frequency components that are below the 95% threshold.
- Compute the complementary error function value E = erfc(|d| / √2).
If the computed E value is greater than 0.01, conclude that the input sequence is random (likely encrypted). Otherwise, conclude that it is non-random (normal).
A d value that is too low (strongly negative) means that there are too few peaks below T and too many peaks above T.
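The steps above can be sketched in Python as follows. This is a minimal illustration, not the code from the linked repository: the function names `bits_from_bytes` and `spectral_test` are hypothetical, and the DFT is computed naively with the standard library (quadratic time), which is only practical for short inputs. A real implementation would likely use an FFT routine.

```python
import cmath
import math


def bits_from_bytes(data):
    """Expand bytes into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def spectral_test(bits, alpha=0.01):
    """Run the DFT-based randomness test described above.

    Returns (n1, d, e, is_random). Hypothetical helper for illustration.
    """
    n = len(bits)
    if n < 2:
        raise ValueError("need at least 2 bits")

    # Step 1: map 0 -> -1 and 1 -> +1.
    x = [1 if b else -1 for b in bits]

    # Steps 2-3: naive DFT, keeping the modulus of the first n/2 outputs.
    half = n // 2
    peaks = []
    for k in range(half):
        s = sum(x[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n))
        peaks.append(abs(s))

    # Step 4: 95% peak height threshold T = sqrt(ln(1/0.05) * n).
    threshold = math.sqrt(math.log(1 / 0.05) * n)

    # Steps 5-6: expected (n0) and observed (n1) number of peaks below T.
    n0 = 0.95 * n / 2
    n1 = sum(1 for p in peaks if p < threshold)

    # Step 7: normalized difference between observed and expected counts.
    d = (n1 - n0) / math.sqrt(n * 0.95 * 0.05 / 4)

    # Step 8: complementary error function value.
    e = math.erfc(abs(d) / math.sqrt(2))

    # Decision rule from the text: E greater than 0.01 means "random".
    return n1, d, e, e > alpha


if __name__ == "__main__":
    # The example sequence from the text.
    example = [int(c) for c in "10110101"]
    print(spectral_test(example))
```

To test file content rather than a literal bit string, the bytes can be expanded first, for example `spectral_test(bits_from_bytes(open(path, "rb").read(1024)))`, where `path` and the 1024-byte prefix are arbitrary choices made here to keep the naive DFT fast.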
Limitations of the bin-entropy detection method:
The method is unreliable for very small files and for files that the user has already encrypted.
For example, take a txt file containing “SSN : 0123456789”.
The randomness test misclassifies this file, returning E > threshold: of its 14 non-space characters, all except “S” are unique, so the content looks random. Even though it is valid text, the entropy value would be higher than the threshold.
In the case of user-encrypted files, the entropy is already high, so if malware encrypts the same file again, the script cannot differentiate between “legitimate user encryption” and “unauthorized encryption” and is therefore not effective.
Source : Python source code for the implementation can be found at https://github.com/EC700/Charlie-2/tree/master/Entropy
References: The bin-entropy calculation is based on ‘Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications’,
published by the National Institute of Standards and Technology, U.S. Department of Commerce
Source : http://csrc.nist.gov/groups/ST/toolkit/rng/documents/SP800-22rev1a.pdf