Anti Virus 2.0 "Compilers in Disguise": Mihai G. Chiriac Bitdefender

The document discusses optimizing emulator performance for antivirus scanning. It begins by discussing the history of antivirus techniques and challenges with emulation. It then proposes using an intermediate language and optimizations like static single assignment form to improve performance of decryption loop emulation. The document suggests three execution modes: simulating micro-operations directly, linking micro-function simulations, or generating target-specific machine code. Faster execution speeds can be achieved by combining multiple micro-operations or generating native code.

Anti Virus 2.0
“Compilers in disguise”
Mihai G. Chiriac
BitDefender
Talk outline
• AV History
• Emulation basics
• Compiler technology
– Intermediate Language
– Optimizations
– Code generation
• Conclusions
AV History!
• String searching
– Aho-Corasick, KMP, Boyer-Moore
– PolySearch
– Bookmarks (from top, tail, EP?)
– Hashes (B, ofs1, sz1, crc1, ofs2, sz2, crc2)

NYB.A
AV History (cont’d)
• Encrypted viruses
– Static decryption loop (signature)
– Simple encryption (xray-ing)
– Algorithmic detection

Cascade.1706 decryption loop
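
X-raying can be illustrated with a minimal sketch, assuming the simplest case: a single-byte XOR key. The function name `xray_xor` and the sample bytes are hypothetical; real engines handle more encryption schemes. The XOR of two adjacent encrypted bytes does not depend on the key, so a known plaintext fragment can be located without running the decryption loop.

```python
def xray_xor(data: bytes, known: bytes):
    """Look for `known` encrypted with any single-byte XOR key.

    Returns (offset, key) for the first match, or None.
    """
    if len(known) < 2:
        raise ValueError("need at least two known bytes")
    # (a ^ k) ^ (b ^ k) == a ^ b, so adjacent-byte deltas are key-independent.
    delta = bytes(a ^ b for a, b in zip(known, known[1:]))
    enc_delta = bytes(a ^ b for a, b in zip(data, data[1:]))
    pos = enc_delta.find(delta)
    if pos < 0:
        return None
    # The key falls out of one encrypted byte and its known plaintext.
    return pos, data[pos] ^ known[0]
```

The search costs one pass over the buffer, regardless of the key value.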


AV History (cont’d)
• Polymorphic viruses
– Simple encryption (xray-ing)
– Algorithmic detection, heuristics

TPE.Giraffe.A
Emulation!
• Hardware
– Virtual CPU
– Virtual memory
– Virtual devices
• Software
– Partial OS simulation
• Bonus Goodies
– Fake IRC, SMTP, DNS, etc servers
Workflow
• Init the CPU / VM
• Init the virtual OS
– Modules
– Structures
• Map the file
• Start emulation from cs:eip
• Scan (when conditions are met)
• Quit (when conditions are met)
Sample

Win32.Parite (Pinfi) decryption loop


Ready to emulate our first instruction?
…Not yet…
Chores! 
• Pre-instruction tasks
– DRx handling
– Segment access rights
– Page access rights
• Post-instruction tasks
– TF handling
– Update the virtual EIP
– Update the EI number
Tasks
• Fetch instruction from cs:eip
• Decode
– Handle prefixes

• Emulate!
• Easy, huh? 
On average, one in every three instructions references memory…

Memory accesses
• We have to virtualize the entire 4 GB address space…
• Every memory access needs:
– Segment access checks
– Linear address computation
– Page access checks
– Hardware debugging checks
– SMC checks!! (for memory writes)
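
A minimal sketch of the per-access checks, assuming a flat memory model and page-aligned mappings. The class `VirtualMemory` and its fields are hypothetical names, and segment and debug-register checks are left out for brevity.

```python
PAGE = 0x1000

class VirtualMemory:
    def __init__(self):
        self.pages = {}          # page number -> bytearray
        self.perms = {}          # page number -> set of "r", "w", "x"
        self.code_pages = set()  # pages that hold already-decoded code
        self.smc_hits = []       # addresses where decoded code was overwritten

    def map(self, addr, data, perms):
        # Assumption: addr is page-aligned.
        for off in range(0, len(data), PAGE):
            page = (addr + off) // PAGE
            chunk = data[off:off + PAGE]
            self.pages[page] = bytearray(chunk.ljust(PAGE, b"\0"))
            self.perms[page] = set(perms)

    def _check(self, addr, need):
        page = addr // PAGE
        if need not in self.perms.get(page, ()):
            raise MemoryError(f"page fault at {addr:#x} ({need})")
        return page

    def read8(self, addr):
        page = self._check(addr, "r")
        return self.pages[page][addr % PAGE]

    def write8(self, addr, value):
        page = self._check(addr, "w")
        if page in self.code_pages:   # self-modifying code check
            self.smc_hits.append(addr)
        self.pages[page][addr % PAGE] = value & 0xFF
```

Even this reduced version does a dictionary lookup and a permission test on every byte, which hints at why memory accesses dominate emulation cost.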
Problems…
• Millions of instructions…
• Polymorphic decryption loops are full of do-nothing, “garbage” code…
• Decompression loops are optimized for size, not speed…
• …This results in unacceptable performance…
[Figure: UPX decompression. Bar chart of the number of parses (axis from 0 to 1,600,000) per basic-block address, from 0x00565577 to 0x7FF8055D.]
Advantages
• Typically, an emulator spends the most time in loops…
• A small percentage of code is responsible for a large percentage of emulation time…
• So… we know what to optimize!


The plan
• Identify hot-spots
– Basic blocks that execute very frequently

• Try to make them run as fast as possible
– Reducing the set of repetitive actions to a minimum
– Reducing the number of redundant operations to a minimum
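
Hot-spot identification can be sketched as a counter over basic-block entry addresses. This is a minimal illustration; `find_hot_blocks` and the threshold value are assumptions, not the engine's actual heuristic.

```python
from collections import Counter

def find_hot_blocks(trace, threshold):
    """trace: iterable of basic-block entry addresses, in execution order.

    Returns the addresses executed at least `threshold` times,
    most frequent first.
    """
    counts = Counter(trace)
    return [addr for addr, n in counts.most_common() if n >= threshold]
```

Only the blocks returned here would be worth translating and optimizing; everything else can stay on the slow path.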
Back to our code…
• .420010 (31 1C 3E) xor [esi+edi], ebx
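
As a worked example of the decode step, the sketch below decodes exactly this encoding: opcode 0x31 (XOR r/m32, r32) with a ModRM byte selecting a SIB operand. It handles only this one form; the function name is hypothetical.

```python
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_xor_rm32_r32(code: bytes) -> str:
    """Decode opcode 0x31 for the mod=00 + SIB case only."""
    opcode, modrm, sib = code[0], code[1], code[2]
    if opcode != 0x31:
        raise ValueError("only opcode 0x31 is handled in this sketch")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0 or rm != 4:
        raise ValueError("only [base+index*scale] operands are handled")
    scale, index, base = 1 << (sib >> 6), (sib >> 3) & 7, sib & 7
    if index == 4 or base == 5:
        raise ValueError("no-index and disp32 forms are not handled")
    idx = REGS[index] if scale == 1 else f"{REGS[index]}*{scale}"
    return f"xor [{REGS[base]}+{idx}], {REGS[reg]}"
```

For `31 1C 3E`, ModRM 0x1C gives mod=00, reg=011 (ebx), rm=100 (SIB follows), and SIB 0x3E gives scale=1, index=111 (edi), base=110 (esi).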
First thoughts
• For loops, keep the already-decoded opcodes!
• Memory model is usually flat…
– We can catch accesses to DS, SS,…
• Hardware debugging rarely used…
– We can catch accesses to DRx
• Trap Flag rarely used…
– We can monitor accesses to EFlags
Back to our code…
• .420010 (31 1C 3E) xor [esi+edi], ebx
But we can do much more!
• x86 - Very rich instruction set
– One instruction – many basic operations
– Different encodings, same result
– Hard(er) to optimize…

• Mike’s Intermediate Language Format
• …apparently the acronym is taken
IL Basics
• Very RISC-like
• Single-purpose micro-operations
• Infinite number of virtual registers
• Plenty of information, useful for optimizations
– Operation type, operands
– Input / output variables (use-define)
• Plenty of information, useful for dynamic analysis
– Memory access info
Parite.A decryption (1)
• Decrypt:
• .420010 xor [esi+edi], ebx
• .420013 sub esi, 2
• .420016 sub esi, 2
• .420019 jnz Decrypt

mm0 = esi + edi
tm0 = load32 (mm0)
tm1 = tm0 ^ ebx
store32 (mm0, tm1)
Compute_ZF (tm1)
Compute_SF (tm1)
Compute_PF (tm1)
Compute_OF (OP_XOR, …)
Compute_AF (OP_XOR, …)
Compute_CF (OP_XOR, …)
Parite.A decryption (2)
• Decrypt:
• .420010 xor [esi+edi], ebx
• .420013 sub esi, 2
• .420016 sub esi, 2
• .420019 jnz Decrypt

tm0 = esi
esi = esi - 2
Compute_ZF (esi)
Compute_SF (esi)
Compute_PF (esi)
Compute_OF (OP_SUB, …)
Compute_AF (OP_SUB, …)
Compute_CF (OP_SUB, …)
Parite.A decryption (3)
• Decrypt:
• .420010 xor [esi+edi], ebx
• .420013 sub esi, 2
• .420016 sub esi, 2
• .420019 jnz Decrypt

tm0 = esi
esi = esi - 2
Compute_ZF (esi)
Compute_SF (esi)
Compute_PF (esi)
Compute_OF (OP_SUB, …)
Compute_AF (OP_SUB, …)
Compute_CF (OP_SUB, …)
Parite.A decryption (4)

• We can follow the use-def chains and remove unnecessary micro-ops…

mm0 = esi + edi
tm0 = load32 (mm0)
tm1 = tm0 ^ ebx
store32 (mm0, tm1)
esi = esi - 2
tm0 = esi
esi = esi - 2
Compute_ZF (esi)
Compute_SF (esi)
Compute_PF (esi)
Compute_OF (OP_SUB, …)
Compute_AF (OP_SUB, …)
Compute_CF (OP_SUB, …)
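
The removal step can be sketched as a backward liveness pass over the block's micro-ops. This is a minimal illustration, not the engine's implementation. The tuple layout `(dst, srcs, side_effect)` is hypothetical, and the flag inputs are an assumption, since the slides elide the arguments of `Compute_OF/AF/CF`; here they read the saved value `tm0` and `esi`.

```python
FLAGS = ("zf", "sf", "pf", "of", "af", "cf")

def xor_mem_ops():
    # xor [esi+edi], ebx
    ops = [
        ("mm0", ("esi", "edi"), False),
        ("tm0", ("mm0",), False),        # load32
        ("tm1", ("tm0", "ebx"), False),
        (None, ("mm0", "tm1"), True),    # store32: visible side effect
    ]
    return ops + [(f, ("tm1",), False) for f in FLAGS]

def sub_esi_ops():
    # sub esi, 2
    ops = [("tm0", ("esi",), False), ("esi", ("esi",), False)]
    ops += [(f, ("esi",), False) for f in FLAGS[:3]]
    # Assumption: OF/AF/CF need the old value saved in tm0.
    ops += [(f, ("tm0", "esi"), False) for f in FLAGS[3:]]
    return ops

def remove_dead(ops, live_out):
    """Keep a micro-op only if it has a side effect or its result is used later."""
    live = set(live_out)
    kept = []
    for dst, srcs, side_effect in reversed(ops):
        if side_effect or dst in live:
            live.discard(dst)      # this definition satisfies later uses
            live.update(srcs)      # its inputs become live above it
            kept.append((dst, srcs, side_effect))
    kept.reverse()
    return kept
```

With the general registers and all six flags live on exit, 26 micro-ops shrink to the 13 shown above: the flag computations of the `xor` and of the first `sub` disappear, and so does the first `tm0 = esi`.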
Parite.A decryption (5)

• We can compute some values only if really needed…

mm0 = esi + edi
tm0 = load32 (mm0)
tm1 = tm0 ^ ebx
store32 (mm0, tm1)
esi = esi - 2
tm0 = esi
esi = esi - 2
Set_LazyFlags (OP_SUB, …)
Compute_ZF (esi)
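
Lazy flag evaluation can be sketched as follows: the emulator records the last flag-setting operation and derives an individual flag only when something reads it. The class below is a hypothetical illustration covering `sub` and `xor` and three of the six flags.

```python
MASK = 0xFFFFFFFF

class LazyFlags:
    """Remember the last flag-setting operation; derive flags on demand."""

    def __init__(self):
        self.op, self.a, self.b, self.result = None, 0, 0, 0

    def set(self, op, a, b):
        if op == "sub":
            result = (a - b) & MASK
        elif op == "xor":
            result = (a ^ b) & MASK
        else:
            raise ValueError(f"unsupported op: {op}")
        self.op, self.a, self.b, self.result = op, a & MASK, b & MASK, result
        return result

    @property
    def zf(self):
        return self.result == 0

    @property
    def sf(self):
        return (self.result >> 31) == 1

    @property
    def cf(self):
        if self.op == "sub":
            return self.a < self.b   # unsigned borrow
        return False                 # xor always clears CF
```

In the loop above only ZF is consumed (by `jnz`), so the other flags are never materialized while the loop runs.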
Static single assignment
Sample code…

int a, b, c;
a = 5;
b = 3;
c = a + b + 3;
b = c + 1;

Three-address code…

int a, b, c;
a = 5;
b = 3;
c = a + b;
c = c + 3;
b = c + 1;
SSA (cont’d)
Three-address code…

a = 5;
b = 3;
c = a + b;
c = c + 3;
b = c + 1;

SSA Form

a[0] = cnst(5)
b[0] = cnst(3)
c[0] = a[0]+b[0]
c[1] = c[0]+cnst(3)
b[1] = c[1]+cnst(1)

Easy! Create a different version for every variable state!
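
For straight-line code the renaming is a single pass, sketched below. The statement layout `(dst, operands)` is hypothetical, and the sketch ignores control flow, so no phi nodes are needed.

```python
def to_ssa(stmts):
    """stmts: list of (dst, operands); operands are variable names or ints.

    Returns the same statements with every variable carrying a version index.
    """
    version = {}
    out = []
    for dst, operands in stmts:
        # Uses refer to the current version of each variable.
        renamed = tuple(
            f"{x}[{version[x]}]" if isinstance(x, str) else x
            for x in operands
        )
        # Every assignment creates a fresh version of the destination.
        version[dst] = version.get(dst, -1) + 1
        out.append((f"{dst}[{version[dst]}]", renamed))
    return out
```

Feeding it the three-address code above reproduces the SSA column, with plain integers standing in for `cnst(...)`.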


SSA (cont’d)
SSA Form…

a[0] = cnst(5)
b[0] = cnst(3)
c[0] = a[0]+b[0]
c[1] = c[0]+cnst(3)
b[1] = c[1]+cnst(1)

Graph!

            b[1]
             +
            / \
        c[1]   cnst (1)
         +
        / \
    c[0]   cnst (3)
     +
    / \
a[0]   b[0]
SSA (cont’d)
• Very simple optimization framework
– Constant folding
– Constant propagation
– Common sub-expression elimination
– Dead code removal
• Expensive, so it’s used only when needed…
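
Constant propagation and folding fall out almost for free once every name has a single definition. A minimal sketch, assuming the hypothetical `(name, operands)` layout with integers as constants and every multi-operand statement being an addition, as in the running example:

```python
def fold_constants(ssa):
    """Return {name: value} for every SSA name whose value is a known constant."""
    known = {}
    for name, operands in ssa:
        # Propagation: replace operands that are already known constants.
        values = [known.get(x, x) for x in operands]
        # Folding: if everything is constant, evaluate at translation time.
        if all(isinstance(v, int) for v in values):
            known[name] = sum(values)
    return known
```

On the example, the whole computation collapses: `b[1]` is known to be 12 without executing anything.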
Memory!
0040517E 812B 84F1183C SUB DWORD PTR DS:[EBX],3C18F184
00405184 832B 96 SUB DWORD PTR DS:[EBX],-6A
00405187 013B ADD DWORD PTR DS:[EBX],EDI
00405189 D1CF ROR EDI,1
0040518D 832B DF SUB DWORD PTR DS:[EBX],-21
00405190 812B 69802E61 SUB DWORD PTR DS:[EBX],612E8069
00405196 29C9 SUB ECX,ECX
00405198 812B CD05B390 SUB DWORD PTR DS:[EBX],90B305CD
0040519E 832B 79 SUB DWORD PTR DS:[EBX],79
004051A3 87C1 XCHG ECX,EAX
004051A5 29D1 SUB ECX,EDX
004051A7 832B C9 SUB DWORD PTR DS:[EBX],-37
004051AE 2933 SUB DWORD PTR DS:[EBX],ESI

Win32.Harrier decryption loop (partial)


Challenges
• Memory locations = variables, but…
– Hard to prove the addresses are valid…
– Problems with pointer aliasing (including ESP!!)

• A possible solution
– Perform these optimizations only after we’ve gathered a set of run-time data…
Execution modes – 1
• No code generation! 
• Simply simulate the micro-ops
• Advantages:
– Very portable
– Easy to profile
– Easy to debug
• Disadvantages:
– Slow 
PSP 
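
A minimal sketch of this mode: a plain interpreter that dispatches on each micro-op of the optimized Parite.A block. The tuple layout `(op, dst, a, b)` and the dict-based register file and memory are assumptions made for illustration.

```python
MASK = 0xFFFFFFFF

PARITE_BLOCK = [
    ("add", "mm0", "esi", "edi"),
    ("load32", "tm0", "mm0", None),
    ("xor", "tm1", "tm0", "ebx"),
    ("store32", None, "mm0", "tm1"),
    ("subi", "esi", "esi", 2),
    ("subi", "esi", "esi", 2),
]

def run_block(ops, regs, mem):
    """Interpret micro-ops one by one; mem maps address -> 32-bit value."""
    for op, dst, a, b in ops:
        if op == "add":
            regs[dst] = (regs[a] + regs[b]) & MASK
        elif op == "subi":
            regs[dst] = (regs[a] - b) & MASK
        elif op == "xor":
            regs[dst] = regs[a] ^ regs[b]
        elif op == "load32":
            regs[dst] = mem[regs[a]]
        elif op == "store32":
            mem[regs[a]] = regs[b]
        else:
            raise ValueError(f"unknown micro-op: {op}")

def decrypt(regs, mem):
    while True:
        run_block(PARITE_BLOCK, regs, mem)
        if regs["esi"] == 0:   # jnz Decrypt falls through once ZF is set
            break
```

The dispatch chain runs for every micro-op on every iteration, which is the overhead the next two modes remove.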
Execution modes – 2
• Trivial code generation…

• Simply link the micro-functions that simulate the micro-ops
– Most of them are 2-4 x86 instructions
– Compiler generated, so they’re portable
– Need a (very basic) platform-specific linker
– Fast! 
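
The idea can be approximated in a high-level language by resolving every micro-op to a ready-to-call function once, before execution. This is only an analogy: the engine described here links compiler-generated machine-code fragments, while the hypothetical sketch below links Python closures.

```python
MASK = 0xFFFFFFFF

def make_add(dst, a, b):
    def fn(regs, mem):
        regs[dst] = (regs[a] + regs[b]) & MASK
    return fn

def make_subi(dst, a, imm):
    def fn(regs, mem):
        regs[dst] = (regs[a] - imm) & MASK
    return fn

def make_xor(dst, a, b):
    def fn(regs, mem):
        regs[dst] = regs[a] ^ regs[b]
    return fn

def make_load32(dst, a, _unused):
    def fn(regs, mem):
        regs[dst] = mem[regs[a]]
    return fn

def make_store32(_unused, a, b):
    def fn(regs, mem):
        mem[regs[a]] = regs[b]
    return fn

MAKERS = {"add": make_add, "subi": make_subi, "xor": make_xor,
          "load32": make_load32, "store32": make_store32}

def link(ops):
    """Resolve operands and dispatch once, ahead of execution."""
    return [MAKERS[op](dst, a, b) for op, dst, a, b in ops]

def run_linked(linked, regs, mem):
    for fn in linked:
        fn(regs, mem)
```

After `link`, the hot loop no longer decodes or dispatches anything; it only calls the pre-built functions in order.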
Execution modes – 3
• Generate code tailored for the target CPU!
• Advantages
– Fastest!
– We can combine multiple micro-ops into a single CPU instruction
– Special case: X86->X86
• Disadvantages
– Platform dependent
Speed statistics
Execution mode       Speed
Normal execution     1
uOp simulation       17.44
uOp link             9.11
Code gen             5.3
Exit conditions
• We want to quit as early as possible when analyzing clean files
– Too many GUI calls?
• We want to quit in less than X seconds, no matter what…
– Inject “time check” code…
• We want to “chew” as much from the file as possible in those X seconds…
What NOT to do…
• For every basic block!!!
• pushfd / pushad
• call GetTickCount
• sub eax, dword ptr [start_count]
• xor edx, edx
• mov ecx, 0x3e8
• div ecx
• cmp eax, dword ptr [max_seconds]
• jg __out
• popad / popfd
UPX CFG
An idea...
• We have the control-flow-graph…
• Why add “time check” code for every BB?
– We can check only once per cycle in the CFG
– Make sure there’s at least one “time check” per graph cycle
• Easier way!
– We can add “time check” code only for “backward” branches
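
The "backward branch" rule can be sketched in a few lines. The function below is a hypothetical helper; it assumes direct branches with known targets, so indirect jumps and returns would still need a check at run time.

```python
def place_time_checks(branches):
    """branches: iterable of (branch_addr, target_addr) pairs.

    Returns the branch addresses that need a time check: those that jump
    to the same or a lower address.
    """
    return sorted({src for src, dst in branches if dst <= src})
```

Fall-through and forward branches only move to higher addresses, so any cycle built from direct branches must contain at least one backward branch, and every cycle therefore gets a check.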
Scan conditions
• Old(er) techniques
– Specific APIs
– Common startup code (CRT?)
• New(er) techniques
– Execution from a “dirty” page
– Memory access patterns! (e.g. linear decryption loops)
– Suspicious branches, purging of decryption code etc …
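
The "dirty page" condition can be sketched as a set of written page numbers consulted whenever execution moves to a new address. The class and method names are hypothetical, and 4 KB pages are assumed.

```python
PAGE_SHIFT = 12   # 4 KB pages

class DirtyPageTracker:
    def __init__(self):
        self.dirty = set()

    def on_write(self, addr):
        """Call for every emulated memory write."""
        self.dirty.add(addr >> PAGE_SHIFT)

    def should_scan(self, eip):
        """True when execution reaches a page modified during emulation."""
        return (eip >> PAGE_SHIFT) in self.dirty
```

A jump into a page that the sample itself wrote usually means a decryption or decompression stage has finished, which makes it a natural moment to scan.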
Conclusions
• CPU-intensive packers are here to stay…
• …Ex: VMProtect requires 40 billion instructions…
• Code optimization is a good way to reduce analysis time…
• Compiler-like structures are good ways of solving other difficult AV problems
Questions?
[email protected]
