Authored by: Anandeshwar Unnikrishnan
Stage 1: GULoader Shellcode Deployment 
In recent GULoader campaigns, we are seeing a rise in NSIS-based installers, delivered as email malspam, that use plugin libraries to execute the GULoader shellcode on the victim system. The NSIS scriptable installer is a highly efficient software packaging utility. The installer behavior is dictated by an NSIS script, and users can extend the functionality of the packager by adding custom libraries (DLLs) known as NSIS plugins. Since its inception, adversaries have abused the utility to deliver malware.
NSIS stands for Nullsoft Scriptable Install System. NSIS installer files are self-contained archives, which lets malware authors bundle malicious assets together with junk data. The junk data serves as an antivirus evasion technique. The image below shows the structure of an NSIS GULoader staging executable archive.

The NSIS script, a file found in the archive, has the file extension “.nsi” as shown in the image above. The deployment strategy employed by the threat actor can be studied by analyzing the NSIS commands in the script file. The image below is an oversimplified view of the whole shellcode staging process.

The file that holds the encoded GULoader shellcode is dropped onto the victim’s disk, along with other data, based on the script configuration. Junk data is prepended to the encoded shellcode. The encoding style varies from sample to sample, but in almost all cases it is a simple XOR encoding. Because the shellcode follows the junk data, an offset is used to retrieve the encoded GULoader shellcode. In the image, the FileSeek NSIS command performs this offsetting. Some samples have unprotected GULoader shellcode appended to the junk data.
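As a minimal sketch of the carving step described above, the following Python helper skips the junk prefix and XOR-decodes whatever follows. The function name, the offset, and the key are hypothetical placeholders: both values differ per sample and have to be recovered from the NSIS script and the plugin call.

```python
from itertools import cycle


def carve_shellcode(blob: bytes, offset: int, key: bytes = b"") -> bytes:
    """Return the bytes after the junk prefix, XOR-decoded with a repeating key.

    offset and key are sample-specific assumptions. An empty key models the
    samples that store the shellcode unprotected.
    """
    if not 0 <= offset <= len(blob):
        raise ValueError("offset lies outside the dropped file")
    encoded = blob[offset:]
    if not key:
        return encoded
    return bytes(b ^ k for b, k in zip(encoded, cycle(key)))


# Synthetic demonstration: 16 junk bytes followed by a single-byte XOR payload.
junk = bytes(range(16))
plain = b"\x90\x90\xcc"
dropped = junk + bytes(b ^ 0x5A for b in plain)
recovered = carve_shellcode(dropped, offset=len(junk), key=b"\x5a")
```

In practice the offset would be the value passed to FileSeek in the recovered script, and the key would come from the decoding stub of the sample under analysis.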

A plugin used by the NSIS installer is simply a DLL that the installer program loads at runtime in order to invoke functions exported by the library. Two DLL files are dropped in the user’s TEMP directory. In all analyzed samples, one DLL is consistently named system.dll, while the name of the other varies.
The system.dll is responsible for allocating memory for the shellcode and its execution. The following image shows how the NSIS script calls functions in plugin libraries.

The system.dll has the exports shown in the image below. The function named “Call” is used to deploy the shellcode on the victim’s system.

The Call function exported by system.dll dynamically resolves the following functions and executes them to deploy the shellcode.
CreateFile – To read the shellcode dumped onto disk by the installer. As part of installer setup, all the files seen in the installer archive earlier are dumped onto disk in a new directory created on the C: drive.
VirtualAlloc – To allocate RWX memory that holds the shellcode.
SetFilePointer – To seek to the exact position of the shellcode in the dumped file.
ReadFile – To read the shellcode.
EnumResourceTypesA – Execution via callback mechanism. The second parameter is of the type ENUMRESTYPEPROCA, which is simply a pointer to a callback routine. The address where the shellcode is allocated in memory is passed as the second argument to this API, leading to execution of the shellcode. Callback function parameters are a good resource for indirect code execution.

Vectored Exception Handling in GULoader 
The operating system’s implementation of exception handling gives an adversary an opportunity to take over execution flow. Vectored Exception Handling on Windows lets the user register a custom exception handler, which is simply code that executes when an exception occurs. The interesting part of exception handling is how the system resumes the program’s normal execution flow after the exception. Adversaries exploit this mechanism to take ownership of the execution flow: malware can divert the flow to code under its control when the exception occurs. Malware normally employs it to achieve the following goals:

Covert code execution and anti-analysis 

GULoader employs VEH mainly to obfuscate the execution flow and slow down analysis. This section covers the internals of vectored exception handling on Windows and investigates how GULoader abuses the VEH mechanism to thwart analysis efforts.

Vectored Exception Handling (VEH) is an extension of Structured Exception Handling (SEH) that lets us add a vectored exception handler which is called regardless of our position in a call frame. Simply put, VEH is not frame-based.
VEH is abused by malware either to manipulate the control flow or to covertly execute user functions.
Windows provides the AddVectoredExceptionHandler Win32 API to add custom exception handlers. The function signature is shown below.

The Handler routine is of the type PVECTORED_EXCEPTION_HANDLER. Further checking the documentation, we can see the handler function takes a pointer to _EXCEPTION_POINTERS type as its input as shown in the image below. 

The _EXCEPTION_POINTERS type holds two important members: ExceptionRecord (of type PEXCEPTION_RECORD) and ContextRecord (of type PCONTEXT). ExceptionRecord points to all the information related to the exception raised by the system, such as the exception code. ContextRecord points to the CPU register values (RIP/EIP, debug registers, and so on), that is, the state of the thread captured when the exception occurred.


This means the exception handler can access both the ExceptionRecord and the ContextRecord. From within the handler, one can tamper with the data stored in the ContextRecord, manipulating EIP/RIP to control where execution resumes when the user application returns from exception handling.
One interesting detail of exception handling is that execution is handed back to the application via the NtContinue native routine. The exception dispatch routines call the handler, and when the handler returns to the dispatcher, the dispatcher passes the ContextRecord to NtContinue and execution resumes from the EIP/RIP in the record. Note that this is an oversimplified explanation of the whole exception handling process.

Vectored Handler in GULoader 

GULoader registers a vectored exception handler via the RtlAddVectoredExceptionHandler native routine. The image below shows the control flow of the handler code. Interestingly, most of the code blocks present here are junk added to thwart analysis efforts.


The GULoader’s handler implementation is as follows (disregarding the junk code). 
Reads ExceptionInfo passed to the handler by the system. 
Reads the ExceptionCode from ExceptionRecord structure. 
Checks the value of ExceptionCode field against the computed exception codes for STATUS_ACCESS_VIOLATION, STATUS_BREAKPOINT and STATUS_SINGLESTEP. 
Based on the exception code, malware takes a branch and executes code that modifies the EIP. 

GULoader intentionally sets the trap flag to trigger single stepping as a way to detect analysis. As discussed before, the handler executes a block of code selected by the exception code. If the exception is a single step (status code 0x80000004), the following actions take place:

GULoader reads the ContextRecord and retrieves the EIP value of the thread.
Increments the current EIP by 2 and reads one byte from that location.
Performs an XOR on the byte fetched in the previous step with a static value. The static value changes between samples; in our sample it is 0x1A.
The XORed value is then added to the EIP fetched from the ContextRecord.
Finally, the modified EIP value from the prior step is saved in the ContextRecord and control returns to the system (dispatcher).
The malware has the same logic for the access violation exception.


When the shellcode is executed without a debugger, the INT3 instruction invokes the vectored exception handler routine with an EXCEPTION_BREAKPOINT exception. The handler computes the new EIP by incrementing the EIP by 1, fetching the byte at the incremented location, and XORing it with a constant (0x1A in our case). The result is added to the current EIP value. The INT3 handling logic also scans the program code for 0xCC instructions placed by researchers. If such software breakpoints are found, the EIP is not calculated properly.

EIP Calculation Logic Summary 

Trigger via interrupt instruction (INT3) 

Trigger via Single Stepping (PUSHFD/POPFD)

*The value 0x1A changes with samples 
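
The two calculations summarized above can be sketched in Python. This is only a model of the logic as described in the text, not code recovered from the sample: EIP is treated as an index into a byte buffer, the function and table names are hypothetical, and 0x1A is the key of the analyzed sample only.

```python
STATUS_BREAKPOINT = 0x80000003
STATUS_SINGLE_STEP = 0x80000004
STATUS_ACCESS_VIOLATION = 0xC0000005

# Distance from the reported EIP to the encoded byte, per the description:
# 1 for INT3, 2 for single step, and the same logic for access violations.
OPERAND_OFFSET = {
    STATUS_BREAKPOINT: 1,
    STATUS_SINGLE_STEP: 2,
    STATUS_ACCESS_VIOLATION: 2,
}


def next_eip(code: bytes, eip: int, exception_code: int, key: int = 0x1A) -> int:
    """Return the EIP the handler would store back into the ContextRecord."""
    offset = OPERAND_OFFSET[exception_code]
    delta = code[eip + offset] ^ key  # decode the jump distance
    return eip + delta


# Synthetic buffer: an INT3 at index 4 that should skip 6 bytes ahead,
# and a single-step site at index 10 that should skip 9 bytes ahead.
code = bytearray(32)
code[4] = 0xCC
code[5] = 0x1A ^ 6
code[12] = 0x1A ^ 9
after_int3 = next_eip(code, 4, STATUS_BREAKPOINT)
after_step = next_eip(code, 10, STATUS_SINGLE_STEP)
```

An analyst could use a helper like this to follow the real control flow statically instead of stepping through every exception.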
Detecting Abnormal Execution Flow via VEH 

The shellcode is structured in such a way that the malware can detect abnormal execution flow from the order in which exceptions occur at runtime. The pushfd/popfd instructions are followed by code that throws STATUS_ACCESS_VIOLATION when executed. When the program runs normally, execution never reaches the code that follows the pushfd/popfd instruction block, so only STATUS_SINGLESTEP is raised. When the pushfd/popfd block is accidentally stepped over in a debugger, STATUS_SINGLESTEP is not delivered, because the debugger, which is already single stepping through the code, suppresses it. The handler logic detects this when execution reaches the code following the pushfd/popfd instruction block, which throws STATUS_ACCESS_VIOLATION. The malware is now in a nested exception situation (an access violation following a single-step exception that the debugger suppressed). For this reason, whenever an access violation occurs, the handler routine checks for nested exception information in the _EXCEPTION_POINTERS structure discussed at the beginning.

The image below shows the carefully laid out code that detects analysis.

The Egg hunting: VEH Assisted Runtime Padding 
One interesting feature seen in GULoader shellcode in the wild is runtime padding. Runtime padding is an evasive behavior to beat automated scanners and other security checks employed at runtime. It delays the malicious activities performed by the malware on the target system.  

The egg value in the analyzed sample is 0xAE74B61.
The loader initiates a search for this value in the data segment of its own shellcode.
Keep in mind that this is implemented via the VEH handler. The search alone adds 0.3 million VEH iterations on top of the regular VEH control flow manipulation employed in the code.
The loader ends the search when it retrieves the address of the egg value. To make sure the value has not been manipulated by a researcher, it performs two additional checks to validate the egg location.
If the checks fail, the search continues. The process of retrieving the location of the egg is shown in the image below.

As mentioned above, the validity of the egg location is checked by retrieving byte values from two offsets: one is 4 bytes away from the egg location and the value is 0xB8. The other is at 9 bytes from the egg location and the value is 0xC3. This check needs to be passed for the loader to proceed to the next stage of infection. Core malware activities are performed after this runtime padding loop. 

The following images show the egg location validity checks performed by GULoader. The values 0xB8 and 0xC3 are checked by using proper offsets from the egg location.
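
The search and the two validity checks can be modelled in a few lines of Python. This is a sketch under stated assumptions: the egg is assumed to be stored as a little-endian 32-bit value, the function name is hypothetical, and the real loader walks memory through its exception handler rather than calling a library search routine.

```python
import struct

EGG = 0x0AE74B61  # egg value reported for the analyzed sample


def find_egg(data: bytes, egg: int = EGG) -> int:
    """Return the offset of the first egg that passes both checks, or -1."""
    needle = struct.pack("<I", egg)  # assumption: little-endian dword
    pos = data.find(needle)
    while pos != -1:
        # 0xB8 must sit 4 bytes after the egg and 0xC3 must sit 9 bytes after it.
        if pos + 9 < len(data) and data[pos + 4] == 0xB8 and data[pos + 9] == 0xC3:
            return pos
        pos = data.find(needle, pos + 1)  # check failed, keep searching
    return -1


# Synthetic buffer: a decoy egg without the marker bytes, then a valid one.
needle = struct.pack("<I", EGG)
decoy = needle + bytes(8)
valid = needle + b"\xb8" + bytes(4) + b"\xc3"
buffer = b"\x11" * 7 + decoy + valid
egg_offset = find_egg(buffer)
```

The decoy in the demonstration shows why the extra checks matter: a bare copy of the egg value planted elsewhere in memory is rejected and the search simply continues.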

Stage 2: Environment Check and Code Injection  
In the second stage of the infection chain, GULoader performs anti-analysis and code injection. The major anti-analysis vectors are listed below. After making sure that the shellcode is not running in a sandbox, it injects code into a newly spawned process, where stage 3 is initiated to download and deploy the actual payload. This payload can be either a commodity stealer or a RAT.
Anti-analysis Techniques  

Employs runtime padding as discussed before. 
Scans whole process memory for analysis tool specific strings 
Uses DJB2 hashing for string checks and dynamic API address resolution. 
Strings are decoded at runtime 
Checks if qemu is installed on the system by checking the installation path: 
C:\Program Files\qqa\qqa.exe 
Patches the following APIs: 
The function’s prologue is patched with ExitProcess call 
The initial bytes are patched with the instruction “mov edi, edi”
Patches with instruction nop 
Clears hooks placed in ntdll.dll by security products or researcher for the analysis. 
Window Enumeration via EnumWindows 
Hides the shellcode thread from the debugger via ZwSetInformationThread by passing 0x11 (ThreadHideFromDebugger) 
Device driver enumeration via EnumDeviceDrivers and GetDeviceDriverBaseNameA
Installed software enumeration via MsiEnumProductsA and MsiGetProductInfoA 
System service enumeration via OpenSCManagerA and EnumServiceStatusA 
Checks use of debugging ports by passing ProcessDebugPort (0x7) class to NtQueryInformationProcess 
Use of CPUID and RDTSC instructions to detect virtual environments and instrumentation. 
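
To illustrate the hashing item in the list above, here is the classic DJB2 algorithm in Python together with a hash-based name lookup. This is a sketch, not the loader's exact routine: the text does not state whether GULoader alters the seed, the multiplier, or the character case, and resolve_by_hash is a hypothetical helper name.

```python
from typing import Iterable, Optional


def djb2(data: bytes) -> int:
    """Classic DJB2: h = h * 33 + byte, truncated to 32 bits."""
    h = 5381
    for b in data:
        h = (h * 33 + b) & 0xFFFFFFFF
    return h


def resolve_by_hash(target: int, export_names: Iterable[str]) -> Optional[str]:
    """Return the first export name whose hash equals target, if any."""
    for name in export_names:
        if djb2(name.encode("ascii")) == target:
            return name
    return None


# Demonstration with a small, made-up export list.
exports = ["CreateFileA", "VirtualAlloc", "ReadFile"]
wanted = djb2(b"VirtualAlloc")
resolved = resolve_by_hash(wanted, exports)
```

Because only hashes are stored in the shellcode, neither API names nor tool names appear as plain strings, which is why precomputing hashes over known export and process names is a common way to recover them.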

Anti-dump Protection 
Whenever GULoader invokes a Win32 API, the call is sandwiched between two XOR loops, as shown in the image below. The loop prior to the call encodes the active shellcode region where the call is taking place, to prevent the memory from being dumped by security products that trigger on event monitoring or API calls. Following the call, the shellcode region is decoded back to normal and execution resumes. The XOR key used is a word present in the shellcode itself.
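
The sandwich can be modelled with a Python context manager. This is purely a conceptual sketch: the two-byte key, the buffer contents, and the names are hypothetical, and real shellcode must of course keep the XOR loop itself and the call site outside the encoded range.

```python
from contextlib import contextmanager
from itertools import cycle


def xor_in_place(region: bytearray, key: bytes) -> None:
    """XOR every byte of region with the repeating key."""
    for i, k in zip(range(len(region)), cycle(key)):
        region[i] ^= k


@contextmanager
def masked(region: bytearray, key: bytes):
    xor_in_place(region, key)      # loop before the API call: encode
    try:
        yield
    finally:
        xor_in_place(region, key)  # loop after the API call: decode


shellcode = bytearray(b"\x55\x8b\xec\x90\x90\xc3")
original = bytes(shellcode)
key = b"\x37\x13"  # hypothetical key, stands in for the word taken from the shellcode

with masked(shellcode, key):
    # Whatever a monitoring product dumps while the API call runs.
    snapshot = bytes(shellcode)
```

Since XOR is its own inverse, the same loop serves for both encoding and decoding, and a dump triggered during the call only ever captures the encoded bytes.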

String Decoding  
This section covers the process GULoader follows to decode strings at runtime.

NtAllocateVirtualMemory is called to allocate a buffer to hold the encoded bytes.
The encoded bytes are computed by performing various arithmetic and logical operations on static values embedded as operands of assembly instructions. The image below shows the recovery of encoded bytes via these mathematical and logical operations. EAX points to the memory buffer where the computed encoded values are stored.

The first byte/word is reserved to hold the size of the encoded bytes. The image below shows 12 bytes of encoded data being written to memory.

Later, the first word is replaced by the first word of the actual encoded data. The image below shows the buffer after the first word is replaced.

The encoded data is now fully recovered, and the malware proceeds to decode it. A simple XOR is used for decoding, and the key is present in the shellcode. The assembly routine that does the decoding is shown in the image below. Each byte in the buffer is XORed with the key.

The result of the XOR operation is written to the same memory buffer that holds the encoded data. A final view of the memory buffer with the decoded data is shown below.

The image shows the decoding of the string “psapi.dll”. This string is later used to fetch the addresses of various functions employed for anti-analysis.
Stage 2 culminates in code injection. Specifically, GULoader employs a variation of the process hollowing technique. In process hollowing, the malware stager process spawns a benign process in a suspended state and overwrites the original content of the suspended process with malicious content. The state of the thread in the suspended process is then changed by modifying processor register values such as EIP, and finally the process resumes execution. By controlling EIP, the malware can direct the control flow in the spawned process to a desired code location. After a successful hollowing, the malware code runs under the cover of a legitimate application.
The variation of the hollowing technique employed by GULoader doesn’t replace the file contents, but instead injects the same shellcode and maps the memory in the suspended process. Interestingly, GULoader employs an additional technique if the hollowing attempt fails. More details are covered in the following section.
The Win32 native APIs listed below are dynamically resolved at runtime to perform the code injection.


Overview of Code Injection 

Initially, the image “%windir%\Microsoft.NET\Framework\<version>\CasPol.exe” (the path on 32-bit systems) is spawned in suspended mode via the CreateProcessInternalW API.
GULoader retrieves a handle to the file “C:\Windows\SysWOW64\iertutil.dll”, which is used in section creation. The section object created via NtCreateSection is backed by iertutil.dll.
This behavior is mainly meant to avoid suspicion, since a section object that is not backed by any file may draw unwanted attention from security systems.
The next phase of the code injection maps a view of the section backed by iertutil.dll into the spawned CasPol.exe process. Once the view is successfully mapped into the process, the malware can inject the shellcode into the mapped memory and resume the process, thus initiating stage 3. The native API ZwMapViewOfSection is used to perform this task. Following the execution of this API, the malware checks the result of the function call against the error statuses listed below.
If the mapping is unsuccessful and the status code returned by ZwMapViewOfSection matches any of the listed codes, the malware has a backup plan.
GULoader calls NtAllocateVirtualMemory by directly invoking the system call stub, which is normally found in the ntdll.dll library, to bypass EDR/AV hooks. The memory is allocated in the remote CasPol.exe process with RWX memory protection. The following image shows the direct use of the NtAllocateVirtualMemory system call.

After memory allocation, it writes itself into the remote process via NtWriteVirtualMemory as discussed above. GULoader shellcode samples taken from the field are large; the samples used for this analysis are all greater than 20 MB. In the samples analyzed, the buffer allocated to hold the shellcode is 2950000 bytes. The image below shows the GULoader shellcode in memory.

Misleading Entry point  

GULoader is highly evasive in nature. If the employed anti-analysis vectors detect an abnormal execution flow, the EIP and EBX fields of the thread context structure (of the CasPol.exe process), which are required for stage 3 of the malware execution, are overwritten with a decoy address. The location ebp+4 holds the entry point regardless of whether the program is being debugged.
GULoader uses the ZwGetContextThread and NtSetContextThread routines to modify the thread state. The CONTEXT structure is retrieved via ZwGetContextThread, and the value [ebp+14C] is used as the entry point address. The current value of the EIP field in the context structure of the thread is changed to an address recalculated from the value at ebp+4. The image below shows the RVA calculation. The base address of the executing shellcode (stage 2) is subtracted from the virtual address [ebp+4] to obtain the RVA.

The RVA is added to the base address of the newly allocated memory in the CasPol.exe process to obtain a new VA that can be used in the remote process. The new VA is written into the EIP and EBX fields of the thread context structure of the CasPol.exe process retrieved via ZwGetContextThread. The image below shows the modified context structure and the value of EIP.
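
The address translation described above is plain arithmetic, sketched here in Python. All three addresses in the demonstration are made up for illustration, and the function name is hypothetical.

```python
def rebase_entry_point(entry_va: int, local_base: int, remote_base: int) -> int:
    """Translate an entry point from the stage 2 mapping to the remote mapping."""
    rva = entry_va - local_base  # offset of the entry point inside the shellcode
    if rva < 0:
        raise ValueError("entry point lies below the local shellcode base")
    return (remote_base + rva) & 0xFFFFFFFF  # 32-bit address space


# Hypothetical values: stage 2 shellcode at 0x02A10000, entry point stored at
# [ebp+4] equal to 0x02A1F3C0, remote allocation in CasPol.exe at 0x00D00000.
new_eip = rebase_entry_point(0x02A1F3C0, 0x02A10000, 0x00D00000)
```

The result is the value that ends up in both the EIP and EBX fields of the remote thread's context.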

Finally, by calling ZwSetContextThread, the changes made to the CONTEXT structure are committed to the target thread of the CasPol.exe process. The thread is resumed by calling NtResumeThread. CasPol.exe resumes execution and performs stage 3 of the infection chain.
Stage 3: Payload Deployment  
The GULoader shellcode resumes execution from within a new host process. In the samples analyzed for this report, the shellcode is injected either into the same process spawned as a child process or into CasPol.exe. Stage 3 performs all the anti-analysis checks once again to make sure this stage is not being analyzed. After all checks, GULoader proceeds with its stage 3 activities by decoding the encoded C2 string in memory, as shown in the image below. The decoding method is the same as discussed before.

Later the addresses of following functions are resolved dynamically by loading wininet.dll: 



The image below shows the response from the content delivery network (CDN) server where the final payload is stored. In this analysis, a payload of size 0x2E640 bytes is sent to the loader. Interestingly, the loader ignores the first 40 bytes. The actual payload starts at offset 40, which is highlighted in the image.

The CDN server is well protected: it only serves clients that send the proper headers and cookies. If these are not present in the HTTP request, the following message is shown to the user.

Final Payload 
Quasi Key Generation 
The first step GULoader takes in decoding the downloaded final payload is generating a quasi key, which is later used to decode the actual key embedded in the GULoader shellcode. The encoded embedded key is 371 bytes long in the analyzed sample. The process of quasi key generation is as follows:

The 40th and 41st bytes (a word) are retrieved from the download buffer in memory.
This word is XORed with the first word of the encoded embedded key along with a counter value.
The process is repeated until the word taken from the downloaded data fully decodes to the value 0x4D5A “MZ”.
The value present in the counter when 4D5A is decoded is taken as the quasi key. This key is shown as “key-1” in the image below. In the analyzed sample, the value of this key is “0x5448”.
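
The counter search in the steps above can be sketched in Python. Several details are assumptions rather than facts from the sample: words are read as little-endian, the offset 40 is taken as stated in the text, the function name is hypothetical, and the buffers in the demonstration are synthetic.

```python
MZ = int.from_bytes(b"MZ", "little")  # the bytes 4D 5A read as one word


def find_quasi_key(download: bytes, encoded_key: bytes, skip: int = 40) -> int:
    """Return the counter value that decodes the payload's first word to MZ."""
    if len(download) < skip + 2 or len(encoded_key) < 2:
        raise ValueError("buffers are too short")
    dl_word = int.from_bytes(download[skip:skip + 2], "little")
    key_word = int.from_bytes(encoded_key[:2], "little")
    for counter in range(0x10000):
        if dl_word ^ key_word ^ counter == MZ:
            return counter
    raise ValueError("no 16-bit counter decodes the header")


# Synthetic buffers built so that the hidden counter is 0x5448.
encoded_key = bytes.fromhex("3412") + bytes(369)  # 371 bytes in total
first_word = MZ ^ 0x1234 ^ 0x5448
download = bytes(40) + first_word.to_bytes(2, "little")
quasi_key = find_quasi_key(download, encoded_key)
```

The loop mirrors the description, although the same value follows directly from XORing the three known words. It works because every PE file starts with the MZ signature, which hands the loader (and the analyst) a known plaintext.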

Decoding Actual Key 
The embedded key in the GULoader shellcode is 371 bytes long, as discussed before. The quasi key is used to decode the embedded key, as shown in the image below.

Each word in the embedded key is XORed with the quasi key key-1.
When the iteration counter exceeds the size value of 371 bytes, the loop stops and the loader proceeds to decode the downloaded payload with this new key.

The decoded 371 bytes of the embedded key are shown in the image below.

Decoding File 
Byte-level decoding takes place after the embedded key is decoded in memory. Each byte of the downloaded data is XORed with the key to obtain the actual data, which is a PE file. The decoded data overwrites the same buffer that held the downloaded encoded data.
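
The last two decoding steps can be sketched together in Python. The sketch makes assumptions the text does not confirm: the quasi key is applied as a little-endian word, the 371-byte key repeats cyclically over the payload, and the first 40 bytes of the download are skipped. All function names and demonstration data are hypothetical.

```python
from itertools import cycle


def decode_embedded_key(encoded_key: bytes, quasi_key: int) -> bytes:
    """XOR the embedded key word by word with the 16-bit quasi key."""
    mask = quasi_key.to_bytes(2, "little")
    # Cycling the two mask bytes also covers the trailing odd byte (371 is odd).
    return bytes(b ^ m for b, m in zip(encoded_key, cycle(mask)))


def decode_payload(download: bytes, key: bytes, skip: int = 40) -> bytes:
    """XOR each byte after the ignored prefix with the repeating key."""
    body = download[skip:]
    return bytes(b ^ k for b, k in zip(body, cycle(key)))


# Round trip on synthetic data.
real_key = bytes((i * 7 + 3) % 256 for i in range(371))
encoded_key = decode_embedded_key(real_key, 0x5448)  # XOR is symmetric
pe_image = b"MZ" + bytes(1000)
encoded_body = bytes(b ^ k for b, k in zip(pe_image, cycle(real_key)))
download = b"\xaa" * 40 + encoded_body

key = decode_embedded_key(encoded_key, 0x5448)
decoded = decode_payload(download, key)
```

With the quasi key and the embedded key recovered from a sample, the same two functions would let an analyst decode a captured download offline without running the loader.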

The final decoded PE file residing in the memory is shown in the image below: 

Finally, the loader loads the PE file by allocating memory with RWX permission in the stage 3 process. Based on the analysis of multiple samples, this is either the same process as in stage 2, spawned as a child process, or CasPol.exe. The loading involves code relocation and IAT correction, as expected in such a scenario. The final payload resumes execution from within the hollowed stage 3 process. The following malware families are usually seen deployed by GULoader:

Vidar (Stealer) 
Raccoon (Stealer) 
Remcos RAT 

The image below shows the injected memory regions in the stage 3 process, which is CasPol.exe in this report.

Malware loaders, popularly known as “crypters”, play a significant role in the deployment of Remote Administration Tools and stealer malware that target consumer data. The Personally Identifiable Information (PII) exfiltrated from compromised endpoints is largely collected and funneled to various underground data-selling marketplaces. This also impacts businesses, because critical information used for authentication is leaked from users’ personal systems, leading to initial access to company networks. GULoader is heavily used in mass malware campaigns to infect users with popular stealer malware like Raccoon, Vidar, and Redline. Commodity RATs like Remcos are also seen delivered in such campaigns. On the bright side, malware specimens used in mass campaigns are not difficult to fingerprint because of their volume and relevance, and detection rules and systems can be built around this very fact.
The following table summarizes all the dynamically resolved Win32 APIs.

Win32 API

The post GULoader Campaigns: A Deep Dive Analysis of a highly evasive Shellcode based loader appeared first on McAfee Blog.