
HTTP-Based Botnet Detection Using Network Traffic Traces




DOCUMENT INFORMATION

Basic information

Title: Http-Based Botnet Detection Using Network Traffic Traces
Author: Truong Dinh Tu
Supervisor: Prof. Cheng Guang
University: Southeast University
Major: Computer Science and Engineering
Document type: dissertation
Year: 2015
City: Nanjing
Format:
Pages: 147
Size: 3.43 MB

Structure

  • Chapter 1. Introduction (25)
    • 1.1 Botnet Definition (25)
      • 1.1.1 Bot and botnet (25)
      • 1.1.2 History of the Botnet (26)
      • 1.1.3 Botnet Architecture (28)
      • 1.1.4 Botnet lifecycle (32)
    • 1.2 Evolution of Botnet (35)
      • 1.2.1 IRC-Based Botnet (36)
      • 1.2.2 P2P-Based Botnet (36)
      • 1.2.3 HTTP-Based Botnet (37)
    • 1.3 Motivation and Challenges (38)
    • 1.4 The goal of the dissertation (40)
    • 1.5 Contributions and Outline of dissertation (40)
      • 1.5.1 Contributions (40)
      • 1.5.2 Outline of the Dissertation (43)
  • Chapter 2. Background and Related Works (43)
    • 2.1 Botnet Detection Techniques (45)
      • 2.1.1 Honeypots-based detection (45)
      • 2.1.2 Anomaly-based Detection (47)
      • 2.1.3 DNS-based Detection (47)
      • 2.1.4 Mining-based Detection (49)
    • 2.2 Detection evasion techniques (50)
      • 2.2.1 DGA-Based technique (50)
      • 2.2.2 Fast Flux-Based technique (51)
    • 2.3 Related Works (55)
  • Chapter 3. Detecting DGA-Bot Infected Machines Based On Analyzing The Similar (43)
    • 3.1 Introduction (59)
    • 3.2 Proposed methods (61)
      • 3.2.1 System Overview (61)
      • 3.2.2 Filtering DNS traffic (62)
      • 3.2.3 Similarity Analyzer (63)
      • 3.2.4 Clustering (65)
    • 3.3 Experiment Results (66)
      • 3.3.1 Bot samples collection (66)
      • 3.3.2 DNS traffic extraction (67)
      • 3.3.3 Detection and Clustering (71)
    • 3.4 Discussions (73)
    • 3.5 Conclusion and Future Work (74)
  • Chapter 4. Detecting C&C Servers Of Botnet With Analysis Features Of Network (43)
    • 4.1 Introduction (75)
    • 4.2 Related Works (77)
    • 4.3 Proposed Approach (78)
      • 4.3.1 System Overview (78)
      • 4.3.2 Training Phase (79)
      • 4.3.3 Detecting Phase (81)
      • 4.3.4 Feature extraction (83)
      • 4.3.5 C&C Detection (85)
    • 4.4 Experimental and Evaluation (86)
      • 4.4.1 Prepare the Training Data Set (86)
      • 4.4.2 Evaluation of features selection (86)
      • 4.4.3 The Classifier Comparison (88)
      • 4.4.4 Evaluation of the detection rate on real-world DNS traffic (90)
      • 4.4.5 Compare with other approaches (95)
    • 4.5 Discussion (97)
    • 4.6 Conclusion (98)
  • Chapter 5. Detecting Malicious Fast-Flux Service Networks Using Feature-Based (43)
    • 5.1 Introduction (99)
    • 5.2 Related works (102)
    • 5.3 Proposed Methods (104)
      • 5.3.1 System Overview (104)
      • 5.3.2 Data Aggregate (105)
      • 5.3.3 Data Pre-filtering (107)
      • 5.3.4 Feature Extraction (109)
    • 5.4 Experiment and Evaluation (119)
      • 5.4.1 Data Set (119)
      • 5.4.2 Experimental Results (121)
      • 5.4.3 Compare with previous works (128)
    • 5.5 Conclusion (129)
  • Chapter 6. Conclusion and Future Works (43)
    • 6.1 Summary of Research and Conclusions (131)
    • 6.2 Limitation and Future Work (133)

Content

Introduction

Botnet Definition

A bot, short for robot, is a type of malicious software installed on vulnerable hosts that executes harmful actions without user consent; bots are often created by criminal groups for various attacks. The term "bot" can also refer to the infected computer itself, commonly known as a compromised or zombie computer. A family of bots consists of multiple versions sharing the same source code, even when managed by different command and control (C&C) servers.

Figure 1.1: The attacks of a typical botnet (phishing, bank fraud, stealing information)

A botnet is a network of computers infected with malware, controlled by an attacker through a command and control (C&C) server. Typically managed by a botmaster or cybercriminal, a botnet's primary aim is to execute profitable malicious activities, including sending spam, conducting distributed denial of service (DDoS) attacks, stealing personal information, and engaging in bank fraud or phishing schemes. As a significant threat to Internet security, botnets facilitate large-scale coordinated attacks and enable various illicit activities across numerous compromised devices.

The concept of the botnet dates to 1993, with the introduction of the first botnet, Eggdrop. Created by Robey Pointer, Eggdrop is the most popular open-source IRC bot, known for its flexibility and user-friendliness, and is distributed under the GNU General Public License (GPL). Initially, IRC bots served legitimate purposes, such as keeping channels active and managing channel control. However, the emergence of malicious botnets marked a new era of cyber threats, combining various dangers into a single entity. These botnets have become significant tools for cybercrime, enabling users with minimal technical skills to disrupt targeted computer systems by simply renting botnet services from cybercriminals.

This section synthesizes data from diverse sources, including recent hearings on crime and terrorism, major IT firms like Microsoft, cybersecurity institutes such as Symantec, and academic publications. It features Table 1.1, which outlines the history of bot samples identified from 1993 to 2014, detailing each botnet's name, alias, year of discovery, estimated bot count, spam capacity, and communication type. While Table 1.1 may not encompass every bot on the Internet, it serves as a curated summary of notable and well-known bots.

Table 1.1: Summary of selective well-known botnets in history

Botnets listed include Torpig, Lethic, Kraken (alias Kracken), Sality (Sector, Kuku, Kookoo), Waledac (Waled, Waledpak), and Zeus (Zbot, PRG, Wsnpoem), among others with aliases such as DownadUp/Kido, Bobic/Oderoor, Oficla, Buzus/Bachsoy, and Pokier/Slogger; communication types span IRC, P2P, and HTTP.

The centralized command and control (C&C) approach resembles traditional client/server architecture, commonly seen in botnets utilizing the Internet Relay Chat (IRC) protocol. In this infrastructure, bots maintain a robust communication channel with one or more connection points, where servers dispatch commands and deliver malware updates. The primary protocols employed in this centralized architecture are IRC and Hyper Text Transfer Protocol (HTTP).

Figure 1.2: IRC/HTTP botnet C&C architectures

Centralized architecture offers several advantages, including direct feedback that allows botmasters to easily monitor the botnet's status. It ensures low latency, since commands are issued directly from a single server. The simple structure facilitates easy construction and deployment without the need for specialized hardware. Additionally, quick reaction times are achieved as the server coordinates directly with its bots, eliminating the need for third-party intervention.

IRC botnets operate through channels created by a botmaster, where compromised machines, or zombies, connect to receive malicious commands. While the IRC protocol is easily adapted for command and control (C&C) purposes, its use exposes the botnet to detection, as IRC traffic is uncommon in corporate networks and is often blocked. Consequently, network administrators can effectively thwart IRC botnet activity by monitoring for IRC traffic and implementing firewall rules to block it.

Due to corporate network restrictions on IRC traffic, the Hyper Text Transfer Protocol (HTTP) has gained popularity for command and control (C&C) communication. As the most widely used protocol for data delivery over the Internet, HTTP supports both human-readable content and binary data transfers, making it more favorable than IRC. Its acceptance in most networks and minimal filtering make it an ideal choice for bot operators to conceal communication between bots and their botmaster. According to the Symantec Global Internet Security Threat Report, centralized C&C servers utilizing HTTP account for 69% of all C&C servers, establishing HTTP as the predominant method for controlling botnets.

The primary drawback of centralized architecture lies in its inherent vulnerabilities: the Command and Control (C&C) server serves as a single point of failure for the entire botnet. If this central server is identified and neutralized, the entire botnet collapses. Additionally, this architecture exhibits minimal resilience: the IP lists of all bots held on the server reveal every bot's location and make it easy to enumerate the number of bots in the botnet.

To combat detection, botnets have increasingly adopted techniques like IP fluxing and domain fluxing. Notable examples, such as Zeus and Conficker, utilize domain fluxing to create a multitude of pseudo-random domain names (PDNs) that allow operators to maintain control over their bots. This strategy effectively helps botnets evade monitoring systems; when one or more command and control (C&C) domain names are identified and removed, the bots can swiftly relocate to new domains generated through DNS queries.

In a decentralized command and control (C&C) architecture, commonly known as peer-to-peer (P2P) botnets, there is no central infrastructure; instead, each bot functions as both a server and a client. This setup allows bots to attack victim computers while also distributing commands to their peer bots. When a bot acts as a client, it engages in attacks; as a server, it relays messages to other bots in its peer list. To issue commands within a P2P botnet, the botmaster injects commands into trusted bots, which then execute the commands and propagate them to all bots on their peer list, creating a dynamic and resilient network structure.

Decentralized command and control (C&C) architecture offers several key advantages, including high resiliency: there is no central server to take down, and multiple communication pathways exist among bot clients. This structure makes P2P botnets difficult to shut down or hijack; if a bot client is detected, it can only expose the limited number of other bots in its peer list. Additionally, the absence of a central C&C server means that even if multiple bots from the same botnet are identified, dismantling the entire network remains a challenge. Furthermore, the decentralized nature complicates the task for security researchers, making it hard to accurately assess the overall size of a P2P botnet.

The decentralized command and control (C&C) architecture presents notable disadvantages, primarily high latency and increased complexity. High latency arises from the inherent challenges in command delivery, as it is not guaranteed and may be delayed, especially if some bots are offline, leading to potential loss of control over significant portions of the botnet during real-time operations. Additionally, the peer-to-peer (P2P) structure is highly complex, necessitating considerable effort in the planning, implementation, and management of the botnet.

To enhance the effectiveness and resilience of their attacks, bot operators often combine various architectures, leading to the development of hybrid models that leverage the benefits of both centralized and P2P systems. These hybrid architectures are categorized into servant bots and client bots. Servant bots function as both clients and servers, utilizing routable static IP addresses, while client bots are configured with non-routable dynamic IP addresses and do not accept incoming connections. Servant bots play a crucial role by providing their IP address information to the peer list and remaining in listening mode for incoming connections. Additionally, they implement symmetric keys for each communication, which strengthens the botnet's ability to evade detection.

The Hybrid C&C architecture offers significant advantages, including high resiliency due to its complex layered structure, which enhances its robustness. Additionally, its intricate design makes it challenging for security investigators to accurately assess the scale of the botnet, complicating threat detection and mitigation efforts.

Figure 1.4: C&C architectures of hybrid P2P botnet

Evolution of Botnet

The earliest botnets featured a hardcoded IP address or domain name for their command and control (C&C) servers, limiting their mobility and making them vulnerable to takedowns by simply blocking these addresses. Bots relied on these fixed C&C servers to receive commands from the botmaster and send back collected data. Despite attempts by botmasters to obfuscate the C&C address to evade detection, the unchangeable nature of the hardcoded address meant that even a single alert or misuse report could lead to the quarantine of the C&C server and the suspension of the entire botnet.

To address the limitations of their command and control (C&C) infrastructure, botmasters have implemented advanced techniques that enhance reliability. Botnets emerged in the early 1990s, with IRC bots making their debut in 1993. The evolution continued with the transition to peer-to-peer (P2P) bots in 2003, and by 2006, the development of HTTP-based bots marked a significant advancement in this technology.

The first-generation botnets primarily utilized Internet Relay Chat (IRC) as their command and control (C&C) medium, starting from 1993 when several public IRC networks emerged. IRC's simple text-based command syntax allowed for almost real-time communication between infected machines and the C&C server. In this setup, botnet-infected devices would connect to an IRC server and join a designated channel to receive instructions from the botmaster. While botmasters often protected these channels with passwords, researchers could extract this information from malicious binaries, enabling them to access the channel and gather critical insights about the botnet. Additionally, the unique message format of the IRC protocol made its traffic easily identifiable, distinguishing it from normal Internet activity. Notable examples of IRC-based botnets include Agobot, Spybot, and Sdbot.

After researchers' relative success in tackling IRC botnets, the next step of cybercriminals in botnet evolution was Peer-to-Peer (P2P) botnet communication.

Since 2003, botmasters have been enhancing the resilience of their infrastructure by developing peer-to-peer botnets. Upon infection, a bot receives an initial list of active peers, which is regularly updated and can be concealed on the infected machine under an inconspicuous name. For instance, the Kelihos/Hlux botnet stores its peer list in the Windows registry at HKEY_CURRENT_USER\Software\Google, alongside other configuration details, while other botnets, like Nugache, utilize binary hardcoding for seeding. Initial seeding occurs either by retrieving the list from a small set of hardcoded default hosts within the bot binary or by pre-seeding the victim machine's Windows Registry before executing the malware.

Initially designed for file sharing among peer nodes, P2P networks have been exploited for botnet command and control (C&C) communication, making the detection of C&C servers challenging due to the ability to disperse commands across any node in the network. Furthermore, classifying P2P traffic poses significant difficulties for gateway security devices tasked with filtering and detecting such traffic. The evolution of P2P botnets began with Sinit, followed by notable successors like Phatbot, Storm, and Nugache.

The evolution of HTTP bots originated from exploit kits, notably the Zeus and SpyEye botnets, primarily developed by Russian cybercriminals. Given that the HTTP protocol is extensively used on the Internet, most network traffic involving HTTP-related ports is permitted through firewalls, making it a prime vector for network attacks. In traditional HTTP-based botnets, attackers issue commands through a data file on a command and control (C&C) server, with bots connecting to these servers at regular intervals. Modern bots have advanced beyond merely receiving commands; they can now also collect personal data from infected machines. A prominent example is the Zeus botnet, which is specifically designed to steal financial information and connects to its C&C server through URLs like http:// /gate.php.

HTTP-based botnets have evolved with the introduction of a technique known as Domain-flux, which enhances the reliability of their command and control (C&C) infrastructure. This method employs Domain Generation Algorithms (DGA) to create numerous pseudo-random domain names, allowing bots to communicate with their botmasters while evading detection. Each infected machine generates a list of potential C&C domains using DGA and resolves these through DNS queries until it connects to a pre-reserved malicious domain. This strategy effectively protects against takedowns, as bots can quickly shift to new automatically generated domains if any are compromised. Notable first-generation HTTP-based botnets include BlackEnergy, followed by others such as Conficker, Zeus, and Bobax.
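The DGA idea described above can be illustrated with a toy generator. This is purely illustrative and mirrors no real botnet's algorithm: it derives a deterministic list of pseudo-random domain labels from a date seed, so a bot and its botmaster can independently compute the same candidate list each day.

```python
import hashlib
from datetime import date

def toy_dga(seed_date: date, count: int = 5, tld: str = ".com") -> list[str]:
    """Generate deterministic pseudo-random domain names from a date seed."""
    domains = []
    for i in range(count):
        seed = f"{seed_date.isoformat()}-{i}".encode()
        digest = hashlib.md5(seed).hexdigest()
        # Map the hex digest onto lowercase letters to form a 12-char label
        label = "".join(chr(ord("a") + int(c, 16) % 26) for c in digest[:12])
        domains.append(label + tld)
    return domains

# Both sides regenerate the identical list for a given date; the botmaster
# pre-registers only one of the candidates.
candidates = toy_dga(date(2015, 1, 1))
print(candidates)
```

Because the list changes with the seed date, blacklisting today's domains does nothing against tomorrow's, which is precisely the takedown resistance the paragraph above describes.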

Motivation and Challenges

The evolution of cybercriminal tactics has shifted from IRC-based bots to more prevalent HTTP-based botnets, as illustrated in Figures 1.6 and 1.7. This transition has led to significant damage across various government organizations and industries, primarily through activities such as personal information theft, DDoS attacks, and spamming.

Cybercriminals increasingly favor HTTP botnets due to several key advantages. Firstly, HTTP-based botnets utilize a client-server model for communication and command and control (C&C), making them easier to establish than P2P botnets. Secondly, as HTTP is a widely used protocol, it allows botnets to conceal their C&C traffic within the vast volume of regular web traffic, enhancing their stealth. Lastly, since HTTP services are common and typically not blocked by firewalls, HTTP botnets operate in a more flexible environment than other types.

HTTP botnet detection researchers encounter several significant challenges. Firstly, HTTP botnets communicate through legitimate HTTP requests, making their traffic resemble normal user activity, which complicates the differentiation between botnet and legitimate behavior. Secondly, the small volume of traffic generated by HTTP botnets within large networks makes it difficult to identify malicious actions amidst the vast amount of data. Additionally, cybercriminals continuously enhance their bots' resilience against detection and takedown efforts, employing advanced evasion techniques such as domain generation algorithms (DGA) and fluxing methods. Lastly, there is a comparatively low number of researchers dedicated to HTTP botnet detection in contrast to those focusing on IRC-based and P2P botnets. These challenges serve as the primary motivation for the research presented in this dissertation.

The goal of the dissertation

In Section 1.3, we explored the motivations and challenges associated with detecting HTTP botnets, highlighting a trend where cybercriminals increasingly favor HTTP botnets due to their advantages. As botnet developers continuously innovate to evade detection, recent generations of HTTP botnets have adopted techniques like Domain Generation Algorithms (DGA), domain-flux, and fast-flux. These methods help them avoid blacklisting and conceal their server locations. Consequently, this dissertation aims to develop effective solutions for detecting HTTP botnets that utilize these evasion techniques, addressing three primary research problems:

(1) To detect machines infected by domain-flux or DGA-based botnets in an enterprise or monitored network;

(2) To detect C&C servers of botnets that use domain-flux or DGA-based detection evasion techniques;

(3) To detect botnets based on malicious fast-flux service networks.

These three problems are addressed in Chapters 3, 4, and 5 of this dissertation, respectively.

Contributions and Outline of the Dissertation

This dissertation makes significant contributions to botnet detection by introducing innovative approaches and exploring previously uncharted research areas. Key contributions include novel methodologies and insights aimed at enhancing the effectiveness of botnet detection techniques.

In Chapter 3, we introduce a novel method for detecting DGA-bot infected machines within enterprise networks by analyzing the periodicity of domain queries. Our study reveals that infected machines frequently query numerous non-existent domain names at similar time intervals to locate their command and control (C&C) servers. In contrast, legitimate hosts do not exhibit such querying behavior, making this a key differentiator. By examining the correlation between the time-interval series of queries, we can cluster domain names generated by the same botnet, allowing us to identify compromised hosts effectively. While this method demonstrates high accuracy in detecting DGA-bot infections, it does not identify the C&C server domains, which we address in Chapter 4 of this dissertation.
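The interval-similarity idea can be sketched as follows. This is a minimal illustration, not the dissertation's algorithm: given per-domain timestamps of failed (NXDOMAIN) queries, it builds each domain's inter-query interval series and greedily groups domains whose series correlate strongly, on the premise that domains polled by the same bot share a near-identical schedule. The domain names and timestamps are hypothetical.

```python
def intervals(timestamps):
    """Inter-query time intervals (seconds) for one queried domain."""
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]

def pearson(x, y):
    """Pearson correlation of two interval series, trimmed to equal length."""
    n = min(len(x), len(y))
    x, y = x[:n], y[:n]
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def cluster_by_similarity(queries, threshold=0.9):
    """Greedily group domains whose query-interval series correlate."""
    series = {d: intervals(ts) for d, ts in queries.items() if len(ts) >= 3}
    clusters = []
    for dom in series:
        for c in clusters:
            if pearson(series[dom], series[c[0]]) >= threshold:
                c.append(dom)
                break
        else:
            clusters.append([dom])
    return clusters

# Hypothetical NXDOMAIN timestamps: two bot-generated domains polled on a
# near-identical ~60 s schedule, and one human typo queried irregularly.
queries = {
    "qxkzjw.com": [0, 60, 121, 180, 241],
    "pmvtra.net": [5, 65, 126, 185, 246],
    "gogle.com":  [0, 300, 310, 1000, 1003],
}
print(cluster_by_similarity(queries))
```

Domains landing in the same cluster were queried on matching schedules, so the hosts issuing them are candidate infected machines.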

Chapter 4 introduces a traffic monitoring system designed to identify botnet Command and Control (C&C) servers, aiming to regain control over compromised networks. The system analyzes the flow of successful queries and responses from infected machines, extracting key features to differentiate between domains generated by bots and those created by humans. The research reveals significant biases in the feature value distribution between legitimate and malicious domains, demonstrating the effectiveness of the proposed detection features. Various machine-learning algorithms were employed to train classifiers and assess their detection capabilities, with experimental results indicating that the decision tree algorithm (J48) outperforms other methods in detecting botnet C&C server domains.
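Feature extraction of the kind described above can be sketched with a few lexical features commonly used to separate algorithmically generated domains from human-created ones. This is illustrative only and is not the dissertation's feature set; any classifier (such as the J48 decision tree mentioned above) could be trained on vectors like these.

```python
import math
from collections import Counter

def lexical_features(domain: str) -> dict:
    """Lexical features often used to separate DGA from human domains.
    (Illustrative; not the dissertation's exact feature set.)"""
    label = domain.split(".")[0].lower()
    counts = Counter(label)
    n = len(label)
    # Character entropy: random-looking labels score higher
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    vowels = sum(label.count(v) for v in "aeiou")
    return {
        "length": n,
        "entropy": round(entropy, 3),
        "digit_ratio": sum(ch.isdigit() for ch in label) / n,
        "vowel_ratio": vowels / n,  # pronounceable names tend to have more vowels
    }

print(lexical_features("google.com"))
print(lexical_features("xq7kd92mzpw4.net"))
```

The biased distributions the chapter reports show up even in this toy form: the random-looking label scores higher on entropy and digit ratio, lower on vowel ratio.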

To assess the detection rate on real-world traffic data, we gathered various well-known DGA-bot samples and executed them in a virtual machine environment connected to the Internet. The experimental results indicate that this method effectively identifies C&C server domains with high accuracy.

Chapter 5 introduces an approach for detecting malicious fast-flux networks using feature-based machine-learning classification techniques. Through examination and analysis of a large volume of collected data, 16 key features are extracted to distinguish ordinary legitimate domains from malicious fast-flux ones; 12 of these features are proposed for the first time. We measure the 95% confidence interval and standard error for each feature to analyze the distribution differences between benign and fast-flux domains. Utilizing various machine-learning algorithms, we find that the Random Forest algorithm is the most effective classifier for distinguishing malicious fast-flux domains from legitimate ones. This approach successfully detects a broad spectrum of fast-flux domains, including those associated with malware, demonstrating significant detection capability.

This dissertation significantly contributes to botnet detection by analyzing network traffic traces to identify key characteristics that differentiate botnet activity from legitimate user behavior. It investigates several critical features and employs popular machine learning algorithms to train datasets, demonstrating which algorithms achieve the highest efficiency in botnet detection.

Figure 1.8: Outline of the dissertation

This dissertation compiles research that has been published, accepted for publication, or is currently under review in reputable journals and conferences. As a result, the chapters may exhibit some overlap due to the nature of presenting these works collectively.

Background and Related Works

Botnet Detection Techniques

In recent years, botnet detection and tracking have emerged as significant challenges for network security researchers. Proposed solutions primarily fall into two categories: the first involves active analysis through honeypots and honeynets, with initial research focusing on honeynet configurations for botnet detection. The second category encompasses passive network monitoring and analysis, which includes signature-based, DNS-based, anomaly-based, and mining-based methods. These approaches and their sub-classifications are explored in detail below.

A honeypot is an environment intentionally designed with vulnerabilities to monitor and analyze attacks and intrusions. These systems are highly effective in detecting security threats, gathering malware signatures, and gaining insights into the motivations and techniques employed by attackers.

Honeypots are categorized into high-interaction and low-interaction types based on their emulation capabilities. High-interaction honeypots can replicate nearly all aspects of a real operating system, responding to known ports and protocols like an actual zombie computer, while low-interaction honeypots only mimic essential features. While high-interaction honeypots enable intruders to gain full control of the operating system, low-interaction ones do not. Additionally, honeypots can be classified based on their physical state.

A physical honeypot is a real machine running a real operating system; a virtual honeypot is an emulation of a real machine on a virtualization host.

Honeypots employ active methods to attract malware by simulating known vulnerabilities; however, these techniques have a significant drawback as they are easily detectable. Once identified, botnet operators can quickly adapt and bypass defenses. While active approaches may provide a fleeting opportunity to target botnet operators effectively, their success is limited due to the variety of tactics employed by attackers and the absence of global oversight on defense measures. Additionally, the reliance on manual analysis restricts the ability to examine most malware, leading to this method being reserved primarily for monitoring the most prevalent botnets.

The honeypot-based technique is highly effective for tracking and analyzing botnets, offering researchers valuable insights into malicious behaviors and binaries. This approach gathers crucial information, including signatures of hosts for content-based detection, details about botnet command and control (C&C) servers, identification of unknown security vulnerabilities that allow intrusions, the tools and techniques employed by attackers, and their underlying motivations.

Numerous studies have explored the use of honeypots for detecting botnets, highlighting their effectiveness in understanding botnet characteristics and technology. However, honeypots have inherent limitations, as they do not consistently detect all bot infections. One significant drawback is the limited scale of exploited activities that honeypots can effectively monitor.

The method discussed is limited in its ability to detect bots that propagate through means other than scanning, such as spam or web drive-by downloads. It only reports on infected machines that are intentionally set up as traps within the network, leaving out those that are infected but not designated as traps. Consequently, this technique requires waiting for a bot to infect a system within the network before any tracking or analysis can occur.

Botnet detection techniques focused on network behavior represent a significant area of research for experts in the field. Anomaly-based detection aims to identify botnet activities by monitoring various network behavior anomalies, including increased traffic volumes, elevated network latency, unusual port activity, and atypical system behaviors, all of which may indicate the presence of malicious actors within the network.

Karasaridis et al. proposed an innovative algorithm for detecting and characterizing botnets through passive analysis of transport layer flow data. This approach is particularly effective as it can identify encrypted botnet communications, since the algorithm operates independently of the encrypted payload data within network flows.

Gu et al. [76] introduced BotSniffer, a system that utilizes network-based anomaly detection to identify botnet command and control (C&C) channels within local area networks. The system operates on the premise that bots belonging to the same botnet exhibit significant similarities in their responses and activities. To achieve this, BotSniffer employs various correlation analysis algorithms to detect spatial-temporal correlations in network traffic, maintaining a remarkably low false positive rate.

Researchers have proposed various entropy-based solutions to detect network behavior anomalies, which extend beyond identifying botnets and malicious traffic. These approaches serve general purposes, including security-oriented applications such as botnet detection.
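The core of such entropy-based approaches can be sketched in a few lines: compute the Shannon entropy of the empirical distribution of some traffic feature (here, destination ports) per observation window, and flag windows whose entropy shifts sharply from the baseline. The traffic mixes and the alert threshold below are hypothetical, chosen only to illustrate the mechanism.

```python
import math
from collections import Counter

def shannon_entropy(observations) -> float:
    """Entropy (bits) of the empirical distribution of a traffic feature."""
    counts = Counter(observations)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical destination ports seen in two observation windows.
baseline = [80] * 70 + [443] * 25 + [53] * 5                  # ordinary web-heavy mix
window = [80] * 30 + [443] * 10 + list(range(1024, 1084))     # scan-like spread

h0, h1 = shannon_entropy(baseline), shannon_entropy(window)
print(round(h0, 2), round(h1, 2))

# A sharp rise in port entropy can indicate scanning; a collapse (one port
# dominating) can indicate a flood. Threshold is an illustrative choice.
alert = h1 > h0 + 1.0
print(alert)
```

In practice the same statistic is applied to other feature distributions (source IPs, queried domains, flow sizes), with the baseline estimated from historical windows rather than fixed by hand.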

DNS-based detection techniques leverage specific DNS information generated by botnets, employing anomaly detection algorithms on DNS traffic. Bots connect to Command and Control (C&C) servers to receive commands, necessitating DNS queries to identify the respective C&C servers, often hosted by Dynamic DNS (DDNS) providers. Consequently, monitoring DNS traffic allows for the detection of botnet activity and identification of DNS traffic anomalies.

In 2005, Dagon introduced a method for identifying botnet command and control (C&C) servers by detecting domain names with unusually high or concentrated dynamic DNS (DDNS) query rates. However, this approach has notable weaknesses, as it can be easily circumvented through the use of fake DNS queries. Additionally, an evaluation highlighted that this technique often results in numerous false positives, misclassifying legitimate and popular domains that utilize DNS with a short time-to-live (TTL).

Ramachandran et al. proposed innovative techniques for identifying botnets through passive analysis of DNS-based Black-hole List (DNSBL) lookup traffic. Their method focuses on detecting DNSBL reconnaissance activities, where botmasters check their bots' blacklist status. By distinguishing between legitimate DNSBL queries and those from botmasters, these heuristics enable real-time detection of reconnaissance activities and facilitate proactive countermeasures. This approach serves as an early warning system, enhancing response strategies without needing direct communication with the botnet, thus avoiding disruption of its operations. While this study marks a significant advancement by analyzing DNSBL logs to infer network behavior, it does face challenges such as potential false positives from countermeasures like reconnaissance poisoning and the inability to detect distributed reconnaissance.

In 2007, Choi et al. introduced an anomaly-based botnet detection mechanism that analyzes group activities within DNS traffic, specifically focusing on simultaneous DNS queries from distributed bots. By identifying unique features of DNS traffic indicative of group activity, they effectively differentiate botnet queries from legitimate ones. This method capitalizes on the presence of DNS traffic throughout various stages of the botnet life cycle, enabling the detection of botnets through their collective behaviors. Additionally, the mechanism addresses command and control (C&C) server migration and proves to be more robust than previous methods, capable of identifying various bot types, even those utilizing encrypted channels, by leveraging IP header information. However, a significant limitation of this approach is the substantial processing time required to monitor large-scale networks.

Detection evasion techniques

Researchers are actively working to disrupt botnets and identify malicious activities such as spamming, DDoS attacks, and the theft of personal information. However, these detection methods often fall short against the advanced evasion techniques employed by botnet operators. The upcoming sections explore various existing evasion strategies.

A widely used method for preventing command and control (C&C) communication is the static blacklist: each queried domain name is checked against the blacklist to identify any association with illegal activities, including botnets, spam, and phishing. However, this approach requires prior knowledge of the domain names linked to the C&C servers.

DGA-Based technique

To bypass static blacklists, modern HTTP botnets increasingly rely on a technique known as Domain Generation Algorithms (DGA). Research from Seculert's lab highlights a troubling trend: cybercriminals are leveraging DGAs to outsmart traditional detection methods. As these bots evolve, they become more sophisticated and complex, posing significant challenges for prevention and detection efforts. Consequently, DGA-generated domains are likely to remain prevalent among cybercriminals for the foreseeable future.

Attackers employing domain-flux techniques generate a large number of pseudo-random domain names (PRD) to facilitate communication with the bot-master. Each infected bot uses a Domain Generation Algorithm (DGA) to create a list of potential command and control (C&C) domains, then sends DNS queries to resolve these domains, seeking a successful connection to the malicious domain pre-registered by the bot-master. Although a vast array of DGA-generated domains is produced, only a select few are actually used for C&C purposes. This strategy effectively evades detection: if any C&C domains are identified and taken down, the bots simply switch to the next set of automatically generated domains through fresh DNS queries.
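The domain-generation step can be sketched as follows. This is a minimal, hypothetical DGA: the seed string, hash choice, and label length are illustrative and do not correspond to any real botnet family. The key property it demonstrates is that every bot sharing the seed derives the same daily domain list without any communication.

```python
import hashlib
from datetime import date

def generate_domains(seed: str, day: date, count: int = 5, tld: str = ".com"):
    """Illustrative DGA: derive pseudo-random domains from a shared
    seed and the current date, so every bot computes the same list."""
    domains = []
    for i in range(count):
        data = f"{seed}-{day.isoformat()}-{i}".encode()
        digest = hashlib.md5(data).hexdigest()
        # Map hex digits to letters to get a plausible-looking label
        label = "".join(chr(ord("a") + int(c, 16) % 26) for c in digest[:12])
        domains.append(label + tld)
    return domains

print(generate_domains("example-seed", date(2015, 1, 1)))
```

The bot-master, who knows the seed, pre-registers only a handful of the generated names; the bots resolve the whole list until one succeeds.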

Fast Flux-Based technique

Cybercriminals are increasingly using a technique known as fast-flux to host illegal content within botnets. This method associates a fully qualified domain name, such as www.malicious.com, with multiple rapidly changing IP addresses, selected in a round-robin manner from a vast pool of infected machines. By employing very short Time-to-Live (TTL) values in DNS responses, fast-flux constantly alters the resource records returned when the domain is resolved. The IP addresses involved do not host the content directly; they serve as proxies that redirect requests to the actual hidden server.

A fast-flux service network (FFSN) is a malicious network composed of compromised hosts that facilitates activities such as malware delivery, illegal content distribution, and credential theft. It operates under one or more fully qualified domain names (FQDNs) that resolve to numerous IP addresses of unaware compromised hosts, known as fast-flux agents or bots. The hallmark of an FFSN is its high availability, achieved by continuously updating the pool of agents, adding newly compromised hosts while removing inactive ones, and directing victims to the most reliable agents. This is accomplished through a very short time-to-live (TTL) for DNS records and a round-robin selection process. Typically, the agents do not perform malicious actions themselves; instead, they redirect requests to the fast-flux mothership, the hidden control center of the network, making it challenging to identify without direct control of an agent.

There are two different types of fast-flux networks: single-flux and double-flux.
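The round-robin rotation with a short TTL described above can be modeled with a small sketch. The class name, pool size, and TTL value here are illustrative assumptions, not measurements of any real FFSN; the point is that consecutive resolutions of the same FQDN return different agent IPs.

```python
import itertools

class FluxResolver:
    """Toy model of a single-flux DNS server: each resolution returns
    the next few agent IPs from a large pool, with a very short TTL so
    cached answers expire almost immediately."""
    def __init__(self, agent_pool, answers_per_reply=3, ttl=180):
        self.pool = itertools.cycle(agent_pool)
        self.n = answers_per_reply
        self.ttl = ttl

    def resolve(self, fqdn):
        ips = [next(self.pool) for _ in range(self.n)]
        return {"name": fqdn, "ttl": self.ttl, "a_records": ips}

# A pool of (fake) compromised-agent addresses
pool = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
resolver = FluxResolver(pool)
first = resolver.resolve("flux.example.com")
second = resolver.resolve("flux.example.com")
# Consecutive lookups return different A records -- the fast-flux hallmark
print(first["a_records"], second["a_records"])
```

A detector exploiting this behavior looks precisely for domains whose successive resolutions yield disjoint, short-TTL answer sets.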

To grasp the concept of fast-flux, it is helpful to walk through the standard DNS query process, disregarding unrelated steps. The following outlines the procedure for retrieving the content behind a web address.

(1) The user host asks the ".com" root name server for the IP address of the DNS server responsible for the domain "example.com". (2) The root name server responds with that IP address (50.40.7.9 in this example). (3) The user host then contacts this DNS server and asks for the IP address of "www.example.com". (4) The DNS server replies with an IP address (112.24.20.15 in this case). (5) The user host uses this IP address to contact the web server for the HTTP content of "www.example.com". (6) The web server responds with the requested content [7] (see Figure 2.1).

Figure 2.1: Normal content retrieval process

To conceal the IP addresses of unauthorized or illegal websites, botmasters often employ fast-flux, rapidly changing the IP addresses associated with those sites. As shown in Figure 2.2, retrieving content from a fluxed web address such as "flux.example.com" follows the same steps as normal content retrieval, with one key difference: the site's IP address changes frequently during the process. When "flux.example.com" is requested, the DNS response comes with a short TTL, so subsequent DNS queries are likely to return different IP addresses. When the user host uses the returned IP address (112.24.20.15) to contact the alleged web server for the content of "flux.example.com", the alleged web server performs two additional hidden steps: it first requests the content of "flux.example.com" from the mothership, which responds with the requested information, and it then relays the mothership's response back to the user host.

Figure 2.2: Single-flux content retrieval process

Botmasters often complicate detection further by also fluxing the IP address of the DNS server. Figure 2.3 demonstrates the process of accessing content from the fluxed web address "flux.example.com" served through a fluxed DNS. The user host first asks the ".com" root name server for the IP address of the DNS server for "example.com". The root name server responds with an IP address (50.40.7.9) that carries a short Time-to-Live (TTL). The user host then contacts this alleged DNS server and requests the IP address of "flux.example.com". The alleged DNS server forwards the request to the mothership, which responds with an IP address (112.24.20.15), and the alleged DNS server relays this response to the user host. The user host uses the returned IP address to contact the alleged web server for the HTTP content of "flux.example.com". The alleged web server in turn requests the content from the mothership, which sends back the requested information, and the alleged web server relays the mothership's response to the user host.

Figure 2.3: Double-flux content retrieval process

To summarize, Figures 2.1, 2.2, and 2.3 illustrate the differences in the content retrieval process between a normal network, a single-flux service network, and a double-flux service network.

Detecting DGA-Bot Infected Machines Based On Analyzing The Similar

Introduction

Remote-controllable computers, or bots, pose a significant threat on the Internet, as they merge the harmful capabilities of worms, rootkits, and Trojan horses. A botnet is a vast network of infected machines that enables bot-masters to execute various malicious activities, including distributed denial-of-service attacks, spam, phishing, click fraud, and espionage of sensitive information. Once a machine is compromised, it attempts to connect to a command and control (C&C) server managed by the bot-master in order to receive attack instructions. To safeguard network hosts, it is crucial to detect these infected machines. One effective strategy against a botnet is to sever the communication channel between the bots and their bot-master, preventing the bots from reaching the C&C server and halting the transmission of commands to the infected devices.

Over the past decade, malware has evolved significantly, transitioning from standalone programs to sophisticated systems capable of interacting with their creators and forming organized networks such as botnets. As these networks grow, attackers face the challenge of managing large-scale distributions of infected machines while maintaining a resilient service infrastructure that can seamlessly relocate if compromised. A prevalent method for establishing such infrastructures is the use of domain names.

New-generation HTTP botnets employ advanced techniques such as Domain Generation Algorithms (DGA), domain-flux, and fast-flux to evade detection and blacklisting. While some botnets use domain-flux to avoid being flagged, others rely on fast-flux to obscure their server locations. Research from Seculert's lab highlights a troubling trend: cybercriminals are increasingly adopting DGA methods to bypass traditional detection systems. As these bots grow in sophistication, they become more challenging to prevent and detect, indicating that DGA-generated domains will likely remain prevalent among cybercriminals in the future.

Attackers employing Domain Generation Algorithm (DGA) techniques automatically create a vast array of pseudo-random domain names (PRD) to facilitate communication with their bot-master. Each infected machine generates a list of potential command and control (C&C) domains using the DGA and sends DNS queries to resolve these names. The bot-master, however, registers only a small subset of the numerous DGA-generated domains for actual C&C operations. Consequently, a DNS query that fails to resolve to an IP address results in a non-existent domain (NXDomain) response.
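Assuming the captured DNS responses are available as (client, domain, RCODE) records — a hypothetical log format chosen for illustration — the resulting NXDomain concentration per host can be tallied as follows. RCODE 3 is the NXDOMAIN code defined by the DNS protocol (RFC 1035).

```python
from collections import defaultdict

NXDOMAIN = 3  # DNS RCODE for "non-existent domain" (RFC 1035)

def nxdomains_per_host(records):
    """Group NXDomain responses by the querying host; hosts with many
    distinct failed lookups are DGA-infection candidates."""
    seen = defaultdict(set)
    for client, domain, rcode in records:
        if rcode == NXDOMAIN:
            seen[client].add(domain)
    return {client: sorted(domains) for client, domains in seen.items()}

# Hypothetical captured DNS log: (client_ip, queried_domain, rcode)
log = [
    ("192.168.1.10", "qwkjhads.com", 3),
    ("192.168.1.10", "zzkpqmvn.net", 3),
    ("192.168.1.10", "example.com", 0),   # resolved normally
    ("192.168.1.20", "gooogle.com", 3),   # likely a user typo
]
print(nxdomains_per_host(log))
```

A host accumulating many distinct NXDomains, as 192.168.1.10 does here, is a candidate for the similarity analysis described next; a single typo-like failure is not.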

This study introduces a method for detecting bot-infected machines within a network by monitoring their direct communication with command and control (C&C) servers after infection. The approach focuses on a distinctive characteristic of this communication: the periodic time intervals of DNS queries. By passively capturing DNS traffic at the network gateway, we group queries to the same domain and extract the time intervals between consecutive queries. We then assess the periodicity of these DNS queries by computing the squared Euclidean distance between their time interval series. Using a hierarchical clustering algorithm, we cluster domain names with high similarity, and find that domain names generated by the same botnet or Domain Generation Algorithm (DGA) are grouped together. Consequently, hosts querying these clusters are identified as compromised, indicating the presence of a domain-flux botnet within the monitored network.

The main contributions of this work are as follows:

 We introduce a method that analyzes the correlation between the time interval series of DNS queries to assess the similarity of domain names associated with the same botnet or Domain Generation Algorithm (DGA).

 We show that domain-flux-infected machines often query a large number of non-existent domain names with similar periodic time interval series while looking for their C&C server.

 We conduct experiments on five distinct botnet samples to evaluate the effectiveness of the proposed method.

This chapter is structured as follows: Section 3.2 introduces the proposed approach, Section 3.3 presents the experimental results and evaluates the performance of the method, and Section 3.5 concludes the work.

Proposed methods

In the following sections, we provide more detail on the components of our proposed system.

Figure 3.1: Framework of the detection system

The proposed detection system framework, illustrated in Figure 3.1, comprises three key phases: DNS traffic filtering, similarity analysis, and domain clustering. The DNS traffic filtering phase captures only DNS traffic, discarding all other network traffic, to facilitate the detection of botnet activity. In the similarity analysis phase, we examine the large number of queried domain names to identify similar periodic time interval series of DNS queries. Finally, the clustering phase partitions domains with high similarity into distinct clusters, each cluster representing a group of candidate flux domain names associated with the same botnet or DGA algorithm. Hosts that query these domain clusters are identified as compromised, indicating their involvement in a specific domain-flux botnet.

Figure 3.2: DNS traffic is generated by DGA-bots

3.2.2 Filtering DNS traffic

We passively monitor and filter the NX-Domains generated by machines within the monitored network. An NX-Domain occurs when a DNS query fails to resolve to an IP address, which may indicate an attempt to reach a malicious command and control (C&C) server: although numerous DGA-generated domain names exist, only a select few are actually registered by bot-masters for C&C operations. A surge of NX-Domains in DNS traffic therefore serves as a key indicator of DGA-bot-infected machines. To improve accuracy, we exclude NX-Domains resulting from typos or misconfigurations by legitimate users, as these are unlikely to be associated with domain-flux botnets. Finally, we compile the set of NX-Domains observed across all monitored machines.

To identify machines compromised by domain-flux botnets within the monitored network, we analyze the compiled set of NX-Domains. We group the queries of hosts that request the same domain name and compute the time intervals between consecutive requests (the i-th time interval is Δt_i = t_{i+1} − t_i, for i = 1, …, T − 1, where T is the total number of requests to that domain). This produces a time interval series: a sequence of measurements, taken at discrete points, of the queries to a specific domain over a defined period. Ultimately, we obtain a collection of time interval series, one per domain.
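A minimal sketch of this grouping step follows; the (timestamp, domain) query-log format is an assumption for illustration.

```python
from collections import defaultdict

def interval_series(queries):
    """Build the per-domain time interval series from (timestamp, domain)
    pairs: for each domain, the gaps between consecutive queries."""
    times = defaultdict(list)
    for ts, domain in sorted(queries):   # sort by timestamp
        times[domain].append(ts)
    return {d: [t2 - t1 for t1, t2 in zip(ts, ts[1:])]
            for d, ts in times.items() if len(ts) > 1}

# Hypothetical query log: timestamps in seconds
queries = [(0, "a.com"), (10, "a.com"), (20, "a.com"),
           (1, "b.net"), (31, "b.net"), (61, "b.net")]
print(interval_series(queries))  # a.com every 10 s, b.net every 30 s
```

The strikingly regular gaps in the output are exactly the periodicity signal that the autocorrelation analysis below is designed to capture.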

A time interval series can be represented as a K × T matrix, where K denotes the number of queried domain names. This lets us describe the sequence of DNS queries through a structured set of time interval series, as in formula (3.1):

X = \{x_k\}_{k=1}^{K}, \quad x_k = \{x_{kt}\}_{t=1}^{T} \quad (3.1)

where x_k = \{x_{kt} : t = 1, \ldots, T\} is the time interval series of the k-th domain name (k = 1, \ldots, K), and x_{kt} denotes the t-th observation (t = 1, \ldots, T) of the time interval series of the k-th domain name.

To identify the similar periodic time interval series generated by bots, we employ an autocorrelation method that measures the correlation between a DNS query time interval series at times t and t − r (with lag r = 1, …, T − 1).

Given the time interval series of the k-th domain name, x_k = \{x_{kt} : t = 1, \ldots, T\}, the autocorrelation function at lag r (r = 1, \ldots, R; R = T − 1) is defined by formula (3.2):

\hat{\rho}_{kr} = \frac{\sum_{t=r+1}^{T} (x_{kt} - \bar{x}_k)(x_{k,t-r} - \bar{x}_k)}{\sum_{t=1}^{T} (x_{kt} - \bar{x}_k)^2} \quad (3.2)

where \bar{x}_k is the mean of the time interval series of the k-th domain name. To compare different time series, we work with their autocorrelation functions rather than comparing the observed DNS query data directly: the similarity or dissimilarity between time series is measured through their estimated autocorrelation coefficients, which gives a more robust representation of their temporal structure.

x_k \mapsto \hat{\rho}_k = (\hat{\rho}_{k1}, \ldots, \hat{\rho}_{kr}, \ldots, \hat{\rho}_{kR}) \quad (3.3)

where x_k is the time interval series of queries to the k-th domain name, and \hat{\rho}_k collects the estimated autocorrelation coefficients of that series for the lags r = 1, \ldots, R (R = T − 1).

Similarity detection in time series analysis involves computing the distance between two series. Here, the distance is computed not on the raw series but on their autocorrelation representations of the periodic time intervals. Representing the estimated autocorrelation measures of each time series as vectors \hat{\rho}_k and \hat{\rho}_{k'}, we quantify the similarity or dissimilarity between a pair of series by the squared Euclidean distance, formula (3.4):

d(x_k, x_{k'}) = \sum_{r=1}^{R} (\hat{\rho}_{kr} - \hat{\rho}_{k'r})^2 \quad (3.4)

The distance d(x_k, x_{k'}) between two domains is always nonnegative, and equals 0 only when a series is compared with itself. A low distance indicates high similarity between the domains: they exhibit similar periodic time interval series of DNS queries, which suggests that they are generated by the same botnet or DGA algorithm.
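Formulas (3.2) and (3.4) can be sketched in code as follows. This is a plain-Python illustration under the definitions above, not the dissertation's implementation; the sample interval series are invented to contrast a near-periodic DGA-like pattern with irregular, human-like traffic.

```python
def acf(series, max_lag=None):
    """Estimated autocorrelation coefficients of a time interval series
    (formula (3.2)) for lags r = 1..R, with R = T-1 by default."""
    T = len(series)
    R = max_lag or T - 1
    mean = sum(series) / T
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:               # constant series: zero variance
        return [0.0] * R
    return [sum((series[t] - mean) * (series[t - r] - mean)
                for t in range(r, T)) / denom
            for r in range(1, R + 1)]

def sq_euclidean(a, b):
    """Squared Euclidean distance between two ACF vectors (formula (3.4))."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

s1 = [10, 12, 10, 12, 10, 12]   # near-periodic, DGA-like intervals
s2 = [10, 12, 10, 12, 10, 13]   # almost the same pattern
s3 = [3, 40, 7, 90, 1, 25]      # irregular, human-like intervals
# Similar periodic series end up much closer than irregular ones
assert sq_euclidean(acf(s1), acf(s2)) < sq_euclidean(acf(s1), acf(s3))
```

Comparing ACF vectors rather than raw intervals makes the distance insensitive to when the queries started, keeping only the shape of their periodicity.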

We apply a hierarchical clustering algorithm to the distance matrix of domain pairs, grouping domains with high similarity into distinct clusters. This choice is motivated by the algorithm's capability to identify clusters of arbitrary shape, beyond the limitations of Euclidean distance. Each resulting cluster represents a collection of malicious flux domains associated with the same botnet or Domain Generation Algorithm (DGA). Consequently, all hosts that have queried the domains in these clusters are flagged as compromised, i.e., as running a specific domain-flux botnet.

We apply the hierarchical clustering algorithm to the dataset of estimated autocorrelation coefficients of the time interval series. The hierarchical clustering algorithm proceeds as follows:

Step 1: Compute the distance matrix of all pairs of domains (according to formula (3.4)).
Step 2: Let each data point be a cluster.
Step 3: Repeat:
        Merge the two closest clusters.
        Until only a single cluster remains.
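The steps above can be sketched as a naive single-linkage variant. Stopping once the closest pair of clusters exceeds a chosen threshold corresponds to cutting the full hierarchy at that distance; the toy ACF-like vectors and the threshold value are illustrative assumptions.

```python
def hierarchical_cluster(points, distance, threshold):
    """Naive single-linkage agglomerative clustering: start with one
    cluster per point and repeatedly merge the two closest clusters
    until the minimum inter-cluster distance exceeds the threshold."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(distance(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:   # cut the hierarchy here
            break
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

# Toy ACF vectors: two tight groups and one outlier
vecs = [(0.9, -0.8), (0.88, -0.79), (0.1, 0.2), (0.12, 0.18), (-0.5, 0.6)]
dist = lambda p, q: sum((x - y) ** 2 for x, y in zip(p, q))
print(hierarchical_cluster(vecs, dist, threshold=0.05))
```

The O(n³) pairwise scan is adequate for a sketch; a production system would use an optimized linkage routine over a precomputed condensed distance matrix.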

Experiment Results

Figure 3.3: The diagram for submitting files to VirusTotal (workflow: compute the MD5 value of the malware file, query VirusTotal with the API key, and, depending on whether the file already exists on VirusTotal, either obtain the existing scan results or submit the malware file)

Figure 3.4: VirusTotal reports the scan results
We gathered malware samples from a honeypot and from VirusShare, using VirusTotal's scan results to identify the bot samples among them. The process of submitting malware files to VirusTotal is illustrated in Figure 3.3, while Figure 3.4 presents the scan results reported by VirusTotal.
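The hash-then-lookup part of the workflow can be sketched as follows. The endpoint and parameter names are assumptions modeled on VirusTotal's public v2-style API, and the request is only constructed here, not sent; submitting the hash first avoids re-uploading a sample VirusTotal has already scanned.

```python
import hashlib

def md5_of_file(path: str) -> str:
    """Compute the MD5 digest used as the lookup key ('resource')."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def report_request(api_key: str, md5: str):
    """Build a (hypothetical v2-style) report-lookup request: if the
    hash is already known to VirusTotal, the existing scan report is
    returned without uploading the sample itself."""
    url = "https://www.virustotal.com/vtapi/v2/file/report"
    params = {"apikey": api_key, "resource": md5}
    return url, params
```

Only when the hash lookup misses would the actual file be submitted for scanning, which is the branch shown in Figure 3.3.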

For our experiments, we retain only the samples identified as bots by at least one of the 42 antivirus programs listed on VirusTotal. Our analysis yields five distinct bot samples that employ well-known domain-flux techniques: Conficker.C, Kraken, Zeus, Bobax, and Murofet.

Table 3.1: The bot samples collected

In a controlled LAN experiment in a VirtualBox environment, we generated malicious DNS traffic by executing each sample on a virtual machine directly connected to the Internet, while blocking all other traffic, for 24 hours. This yielded a total of 78,974 DNS queries from Conficker, 65,918 from Zeus, 19,548 from Bobax, 22,564 from Murofet, and 24,983 from Kraken.

We analyze the DNS queries by grouping requests for the same domain name from the various hosts and computing the time intervals between consecutive requests. Table 3.2 illustrates the extracted time intervals Δt_i = t_i − t_{i−1} (i = 1, …, T, where T is the total number of requests to that domain) for adjacent queries to a given domain. In essence, Table 3.2 is an example of the K × T matrix defined in formula (3.1), where K is the number of distinct domain names queried by the hosts and T is the total number of requests to each domain.

The similarity or dissimilarity between a pair of domains x_k and x_{k'} is assessed not by directly comparing the observed query time series of Table 3.2, but through a suitable parametric representation of those series: the estimated autocorrelation coefficients of formula (3.2). Table 3.3 shows an example of the estimated autocorrelation coefficients computed from the time interval series of each domain name x_k for the various lags r (r = 1, …, R).

Table 3.2: The example of time intervals (seconds) of queries to domain names generated by bots samples

Table 3.3: The estimated autocorrelation coefficients of time intervals series for different lags (R)

Legitimate machines typically neither query multiple domain names at similar periodic intervals nor generate high volumes of NX-Domain replies. This behavior is instead characteristic of machines infected by DGA (Domain Generation Algorithm) botnets, which query frequently in order to locate their command and control (C&C) server. For instance, Figure 3.5 shows eight different domain names queried at consistent intervals by bot-infected machines: four domains generated by the Conficker botnet and another four by the Kraken botnet. This pattern motivates the search for clusters of similar domain names produced by the same botnet or domain generation algorithm.

Figure 3.5: An example of the similar periodic time intervals of DNS queries

We begin by estimating the autocorrelation coefficients of the time intervals of DNS queries, as described in Section 3.2 and illustrated in Table 3.3. Next, we assess the similarity between pairs of domain names by computing the squared Euclidean distance of formula (3.4) between their estimated autocorrelation coefficients. We then apply the hierarchical clustering algorithm (see Section 3.2.4, Algorithm 3.1) to group similar domain names. Figure 3.6 shows the result of clustering the DNS queries of the five botnet samples.

Figure 3.6: An example of clustering domains based on the estimated autocorrelation coefficients of DNS queries

To determine the optimal number of clusters for the hierarchical clustering algorithm, we track the minimum distance between clusters as the number of clusters K is incrementally increased. As Figure 3.7 shows, the minimum distance decreases significantly until 451 clusters are reached. Beyond this point, at a threshold of h = 0.0266, further increases in K yield no meaningful reduction in cluster distances. We therefore take h = 0.0266 as the threshold for cutting the hierarchy into clusters.

Figure 3.7: Number of clusters with various thresholds

Some NX-Domains arise from typos or misconfigurations by legitimate users and are not associated with domain-flux botnets; such users typically do not generate multiple NX-Domains with similar DNS query patterns. We therefore flag only machines that query more than two NX-Domains within a single cluster as potentially compromised by DGA-based botnets. To streamline the analysis, we also discard clusters containing only one NX-Domain, since they are unlikely to reflect DGA activity, which is characterized by a high volume of NX-Domains.
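These two filtering rules can be sketched together as follows; the cluster lists and the domain-to-hosts map are hypothetical data structures chosen for illustration.

```python
def flag_infected_hosts(clusters, queried_by):
    """Discard singleton clusters, then flag every host that queried
    more than two NX-Domains inside any single remaining cluster."""
    flagged = set()
    for cluster in clusters:
        if len(cluster) < 2:
            continue                      # singleton: likely a typo
        counts = {}
        for domain in cluster:
            for host in queried_by.get(domain, ()):
                counts[host] = counts.get(host, 0) + 1
        flagged.update(h for h, c in counts.items() if c > 2)
    return flagged

clusters = [["aa.com", "bb.com", "cc.com"], ["typo.com"]]
queried_by = {"aa.com": ["10.0.0.5"], "bb.com": ["10.0.0.5"],
              "cc.com": ["10.0.0.5", "10.0.0.9"], "typo.com": ["10.0.0.9"]}
print(flag_infected_hosts(clusters, queried_by))
```

Here 10.0.0.5 queried three NX-Domains from one cluster and is flagged, while 10.0.0.9's one-off lookups are treated as benign noise.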

Figure 3.8: Number of distinct domain names in each cluster with the threshold h = 0.0266

At the optimal threshold value of h = 0.0266, we identified 6,723 unique NX-Domains organized into 220 clusters, each containing at least two domain names. Domain names associated with the same botnet or DGA during a given timeframe are grouped together in these clusters. Consequently, machines querying domain names within these clusters are flagged as compromised by a domain-flux botnet within the monitored network.

Discussions

Traditional antivirus software can identify bot-infected machines, but a significant drawback is its reliance on known signature patterns or blacklists, which may fail to detect new bot variants. This research does not aim to eliminate malware entirely, but to complement existing antivirus solutions; integrated with current technologies, the proposed approach could contribute to a more effective antivirus solution than those available today.

This study focuses on identifying bot-infected machines, specifically those using domain-flux or DGA techniques, within enterprise or monitored networks. While not exhaustive, the findings encourage the exploration of new methods for detecting botnet command and control servers, laying the groundwork for future research.

Detecting C&C Servers Of Botnet With Analysis Features Of Network

Detecting Malicious Fast-Flux Service Networks Use Feature-Based

Conclusion and Future Works
