INTRODUCTION
New software security vulnerabilities are identified nearly every day, posing significant risks to the essential software infrastructure that underpins our modern society and economies These vulnerabilities can lead to severe damage, as highlighted by findings from the CSI Computer Crime and Security Survey.
In 2008, 522 US companies collectively lost $3.5 billion annually due to attacks on critical business software applications, highlighting the prevalence of significant security weaknesses in long-used systems Over 90% of security incidents reported to the Computer Emergency Response Team (CERT) stem from software defects, emphasizing the importance of early detection and resolution of these issues Late error corrections can be up to 200 times more costly than addressing them early, making it essential to consult prior known vulnerabilities and corresponding patches Public databases, such as the National Vulnerability Database (NVD) and the Common Vulnerabilities and Exposures (CVE) database, along with specific software application websites, serve as vital resources for identifying known software security vulnerabilities.
Our empirical study on security vulnerabilities, utilizing databases such as NVD and CVE, reveals that recurring software vulnerabilities often stem from software reuse We identified numerous similar vulnerabilities across different systems, particularly in those that share a common code base or framework Vulnerable code fragments are frequently reused through practices like copy-and-paste or by branching existing code, leading to the propagation of these vulnerabilities Consequently, patches applied to one system are often delayed in reaching others, resulting in identical or highly similar vulnerable code structures, including function calls, variable names, and other elements We refer to these as Type 1 vulnerabilities.
Recurring vulnerabilities can arise in systems that share APIs or libraries, categorized as Type 2 vulnerabilities These systems often misuse the same API functions, leading to similar errors such as improper input/output checks and incorrect function call orders Additionally, Type 3 vulnerabilities occur when systems share higher-level abstractions like algorithms, protocols, and standards, resulting in common bugs or programming faults Detailed examples and findings related to all three types of vulnerabilities will be explored in Chapter 3.
Research indicates that unreported vulnerabilities in one software system can be identified and addressed by examining previously known vulnerabilities in other systems that share source code, libraries, or specifications To assist developers in this process, we created SecureSync, a tool designed to automatically detect recurring software vulnerabilities across systems that share code or libraries, which are the most common types of such vulnerabilities Future work will focus on identifying recurring vulnerabilities in systems that reuse code at higher levels of abstraction.
SecureSync is designed to work with a semi-automatically built knowledge base of the prior known/re- ported vulnerabilities, including the corresponding systems, libraries, and vulnerable and patched code.
It could support detecting and resolving vulnerabilities in the two following scenarios:
SecureSync evaluates vulnerability reports from system A by analyzing the associated vulnerable and patched code It then stores this information in its knowledge base and utilizes Google Code Search to identify other systems B that share source code and libraries with system A Afterward, SecureSync checks for similar vulnerabilities in system B and reports any identified locations.
SecureSync analyzes system X to determine if it shares code fragments or libraries with another system Y in its knowledge base If the shared code in system X closely resembles the vulnerable code found in system Y, SecureSync will identify it as potentially vulnerable and highlight the specific locations of the vulnerabilities.
SecureSync assists developers in identifying and rectifying vulnerable code by offering valuable suggestions, including the addition of missed function calls, implementing input/output checks before or after function calls, and replacing operators within expressions.
SecureSync focuses on two main technical objectives: representing vulnerable and patched code, and identifying code fragments similar to vulnerable ones To tackle these challenges, we have created two essential techniques that specifically address two types of recurring vulnerabilities.
SecureSync enhances code reuse by representing vulnerable code fragments as Abstract Syntax Tree (AST)-like structures, where node labels indicate both types and attributes For instance, a function call node is labeled with the type FUNCTION CALL, along with the function name and parameter list The similarity between code fragments is assessed through the structural similarity of these labeled trees Our previous technique, Exas, approximates the structural information of labeled trees and graphs using vectors, enabling the measurement of similarity through vector distance.
Traditional code clone detection techniques fall short in addressing recurring Type 2 vulnerabilities, which involve systems sharing libraries, as they require an understanding of program semantics, including API usage and relevant semantic information SecureSync addresses this challenge by representing vulnerable code fragments as graphs, where nodes denote function calls and control structures like if, while, or for, along with operations such as ==, !, or < Each node is labeled with its type and name, while edges illustrate the relationships between nodes, including control and data dependencies as well as the sequence of function calls The similarity between these graphs is assessed through the identification of their largest common subgraphs.
SecureSync enhances performance through various filtering techniques, including text-based filtering that retains only source files with identifiers linked to function names found in vulnerable code Additionally, it employs locality-sensitive hashing (LSH) to enable rapid searches for similar trees in its knowledge base, ensuring that only trees with identical hash codes are compared Furthermore, SecureSync implements set-based filtering to identify Type 2 candidates, keeping only graphs that contain nodes with matching or similar labels to those in its knowledge base for comparison.
In our evaluation of 48 Type 1 and 12 Type 2 vulnerabilities across 119 open-source software systems, we analyzed a total of 51 and 125 releases, respectively Our findings indicate that SecureSync demonstrates high accuracy in identifying vulnerable code locations, particularly in systems that share source code or libraries with its knowledge base Notably, it uncovered 90 releases with potentially vulnerable code fragments that have not been previously detected, reported, or patched, even in well-established systems Following the recommendations from our tool, we developed patches for these vulnerabilities and communicated them to the respective developers, some of which have been confirmed, while we await responses on others.
The contribution of this thesis includes:
1 An empirical study that confirms the existence of recurring/similar software vulnerabilities and our aforementioned software reuse hypothesis, and provides us insights on their characteristics;
2 Two representations and algorithms to detect recurring vulnerabilities on systems sharing source code and/or APIs/libraries;
3 SecureSync: An automatic prototype tool that detects recurring vulnerabilities and recommends the resolution for them;
4 An empirical evaluation of our tool on real-world datasets of vulnerabilities and systems shows its accuracy and usefulness.
The thesis is structured as follows: Chapter 2 provides a comprehensive literature review, while Chapter 3 presents our empirical study on recurring vulnerabilities Chapters 4 and 5 outline our approach and the specific techniques employed The empirical evaluation is detailed in Chapter 6, and the thesis concludes with insights in Chapter 7.
BACKGROUND
Terminology and Concepts
A system is a software product, program, module, or library under investigation (e.g Firefox, OpenSSL, etc) A releaserefers to a specific release/version of a software system (e.g Firefox 3.0.5, OpenSSL 0.9.8).
A software bug refers to an error or flaw in a computer program or system that leads to incorrect or unexpected outcomes, resulting in unintended behavior.
A patch is a software update aimed at resolving issues within a computer program, including addressing security vulnerabilities and bugs, as well as enhancing usability and performance.
A vulnerability refers to an exploitable software flaw present in certain versions of a system For instance, CVE-2008-5023 identified a security vulnerability in the checks of Firefox versions 3.0.0 to 3.0.4 Such software vulnerabilities often arise from bugs that enable attackers to exploit the application.
A recurring vulnerability is a security flaw that must be addressed in at least two different software releases, whether they are from the same system or different ones This term also encompasses a collection of vulnerabilities that share the same underlying causes across various systems or releases For detailed examples of recurring vulnerabilities, please refer to Chapter 3.
Bug Detection and Localization
Our research focuses on static approaches for detecting similar and recurring bugs, which can be divided into rule/pattern-based and peer-based methods Pattern-based approaches identify correct code patterns through mining frequent code usages or predefined rules, flagging any anomalies as potential bugs Conversely, peer-based methods compare the code fragment in question against a database of past cases and offer fix recommendations upon finding a match We selected the peer-based approach for SecureSync to ensure early detection of vulnerabilities, even if they are infrequent SecureSync optimally stores only the features of each case Various bug-finding methods utilize code pattern mining, with some, like Sun et al., employing template- and rule-based strategies to automate bug fix propagation through predefined templates However, unlike these pattern-based methods, SecureSync leverages a general graph-based representation for API-related code, enabling feature extraction that accommodates any vulnerable code alongside its patched versions.
JADET and GrouMiner analyze object usage patterns to identify potential bugs, employing various methods such as mining method call sequences and association rules Specific approaches, like those by Chang et al., focus on detecting error-handling bugs and negligent conditions using program dependence graphs Hipikat enhances project memory by extracting lexical information, while Song et al identify associations among different bug types for predictive analysis However, these methods are limited to particular patterns and bug types In contrast, BugMem utilizes a textual difference approach to find similar bugs, and SecureSync improves API usage context detection with a graph representation, allowing for more accurate identification of API-related vulnerabilities compared to GrouMiner.
FixWizard is a peer-based method for identifying recurring bugs by analyzing code peers, which are methods or classes with similar interactions within the same system, limiting its applicability across different systems Patch Miner identifies all code snapshots containing similar snippets, while CP-Miner focuses on mining frequent token subsequences to uncover bugs arising from inconsistent edits in cloned code Jiang et al detect clone-related bugs by identifying context-based inconsistencies Current tools for maintaining consistent editing of cloned code, such as CloneTracker, primarily rely on interactive synchronization In contrast, our approach to detecting code-reused vulnerabilities is capable of addressing non-continuous clones and employs a graph-based representation and specialized feature extraction to effectively identify fragments with similar API usages, enhancing the detection of API-related bugs across various systems.
Various methods have been suggested to assist users in identifying problematic areas in code These methods utilize historical data from the project, including metrics such as the volume of changed lines of code over a specific timeframe, modules that have been frequently or recently modified or fixed, and the relationships between code changes and bug locations Additionally, complexity metrics and the social networks of developers are also considered in these approaches.
In the realm of software security, various vulnerability prediction methods focus on project components and historical data Some researchers analyze vulnerability characteristics through discovery rates and the time and effort involved in patching Longstaff introduced the Ease of Exploit Classification, which categorizes vulnerabilities based on their exploitability However, a notable gap in current software security research is the lack of studies addressing recurring vulnerabilities and their specific traits.
Vulnerability Databases
to, Common Vulnerabilities and Exposure (CVE) [4], Internet Security Systems (ISS) [7], National
The Vulnerability Database (NVD) and the Open Source Vulnerability Database (ovsbd) are essential resources for identifying software flaws and security-related configuration issues They utilize the Security Content Automation Protocol (SCAP), which includes standards like Common Vulnerabilities and Exposures (CVE), Common Configuration Enumeration (CCE), Common Platform Enumeration (CPE), Common Vulnerability Scoring System (CVSS), Extensible Configuration Checklist Description Format (XCCDF), and Open Vulnerability and Assessment Language (OVAL) These standards help measure system vulnerabilities and rank their impact, facilitating effective security assessments and management.
EMPIRICAL STUDY
Hypotheses and Process
This study operates on the premise that similar code exhibits comparable properties, specifically in relation to software bugs, programming flaws, and vulnerabilities Given the prevalence of software reuse, many systems often contain similar code or share libraries and components, which may arise from common frameworks, copy-and-paste practices, identical API libraries, or shared algorithms and specifications Consequently, we hypothesize that (H1) software vulnerabilities recur across different systems, and (H2) one contributing factor to this phenomenon is software reuse Key terms relevant to this study will be defined and explored.
To validate our hypotheses, we analyzed approximately 3,000 vulnerability reports from various security databases, including the National Vulnerability Database (NVD), Open Source Computer Emergency Response Team Advisories (oCERT), Mozilla Foundation Security Advisories (MFSA), and the Apache Security Team (ASF) Each report typically details a specific vulnerability, with some recurring vulnerabilities appearing in multiple reports Given the volume of data, we employed a textual analysis technique to group similar reports, allowing us to manually review these clusters Notably, we identified groups that highlighted recurring vulnerabilities To deepen our understanding of these vulnerabilities, including their causes and patches, we also gathered and examined relevant source code, bug reports, and discussions We will now present representative examples and detailed results of our findings.
JSContext* cx = (JSContext*) context->GetNativeContext(); nsCOMPtr ourDocument; mPrototypeBinding->XBLDocumentInfo()->GetDocument(getter_AddRefs(ourDocument));
/* Vulnerable code: allows remote attackers to bypass the protection mechanism for codebase principals and execute arbitrary script via the -moz-binding CSS property in a signed JAR file */
PRBool canExecute; nsresult rv = mgr->CanExecuteScripts(cx, ourDocument->NodePrincipal(), &canExecute); return NS_SUCCEEDED(rv) && canExecute;
Figure 3.1 Vulnerable Code in Firefox 3.0.3
JSContext* cx = (JSContext*) context->GetNativeContext(); nsCOMPtr ourDocument; mPrototypeBinding->XBLDocumentInfo()->GetDocument(getter_AddRefs(ourDocument)); nsIPrincipal* principal = ourDocument->GetPrincipal(); if (!principal){ return PR_FALSE;
PRBool canExecute; nsresult rv = mgr->CanExecuteScripts(cx, principal, &canExecute); return NS_SUCCEEDED(rv) && canExecute;
Figure 3.2 Vulnerable Code in SeaMonkey 1.1.12
Representative Examples
CVE-2008-5023 identifies a vulnerability that enables remote attackers to circumvent the protection mechanisms for codebase principals, allowing the execution of arbitrary scripts through the -moz-binding CSS property in signed JAR files This security flaw affects multiple software systems, including all versions of Firefox 3.x prior to 3.0.4, Firefox 2.x before 2.0.0.18, and SeaMonkey 1.x before 1.1.13.
Figures 3.1 and 3.2 illustrate vulnerable code fragments from Firefox 3.0.3 and SeaMonkey 1.1.12, while Figures 3.3 and 3.4 display the corresponding patched code in Firefox 3.0.4 and SeaMonkey 1.1.13 The patches enhance security by incorporating additional checking mechanisms through the functions GetHasCertificate and Subsumes, effectively addressing the vulnerabilities present in the earlier versions.
JSContext* cx = (JSContext*) context->GetNativeContext(); nsCOMPtr ourDocument; mPrototypeBinding->XBLDocumentInfo()->GetDocument(getter_AddRefs(ourDocument));
PRBool canExecute; nsresult rv = mgr->CanExecuteScripts(cx, ourDocument->NodePrincipal(), &canExecute); if (NS_FAILED(rv) || !canExecute) { return PR_FALSE;
PRBool haveCert; doc->NodePrincipal()->GetHasCertificate(&haveCert); if (!haveCert){ return PR_TRUE;
PRBool subsumes; rv = ourDocument->NodePrincipal()->Subsumes(doc->NodePrincipal(), &subsumes); return NS_SUCCEEDED(rv) && subsumes;
The recurring vulnerability in Firefox 3.0.4 is attributed to the reuse of source code, as evidenced by the similarities in the vulnerable code fragments shown in Figures 3.1 and 3.2 Both Firefox and SeaMonkey, developed from the common Mozilla framework, share a significant amount of source code, which includes these vulnerable segments This shared codebase leads to persistent vulnerabilities in both systems.
This example is a representative example of the recurring vulnerabilities that we classify as Type
The recurring vulnerabilities in software systems often arise from the reuse of source code, where undetected flaws in one system are replicated in another This can occur through practices such as copy-and-pasting code or branching an entire codebase to create a new version Additionally, systems derived from the same codebase may be independently developed, yet still share vulnerabilities The reused code typically exhibits high similarity in both textual elements, such as function names and variable identifiers, and structural components, including statement arrangements and expressions This similarity can be leveraged to identify and address vulnerabilities across different systems.
A special case of Type 1 is related to different releases of a system (e.g Firefox 2.x and 3.x in this example) Generally, such versions/releases are developed from the same codebase, e.g later versions
JSContext* cx = (JSContext*) context->GetNativeContext(); nsCOMPtr ourDocument; mPrototypeBinding->XBLDocumentInfo()->GetDocument(getter_AddRefs(ourDocument)); nsIPrincipal* principal = ourDocument->GetPrincipal(); if (!principal){ return PR_FALSE;
PRBool canExecute; nsresult rv = mgr->CanExecuteScripts(cx, ourDocument->NodePrincipal(), &canExecute); if (NS_FAILED(rv) || !canExecute) { return PR_FALSE;
PRBool haveCert; doc->NodePrincipal()->GetHasCertificate(&haveCert); if (!haveCert){ return PR_TRUE;
PRBool subsumes; rv = ourDocument->NodePrincipal()->Subsumes(doc->NodePrincipal(), &subsumes); return NS_SUCCEEDED(rv) && subsumes;
The patched code in SeaMonkey 1.1.13 is derived from earlier versions, which likely contain the same vulnerabilities When vulnerabilities are addressed in later versions, it is essential to apply similar fixes to the earlier versions, a process known as back-porting.
OpenSSL is an open-source implementation of the SSL and TLS protocols, featuring a high-level API library called EVP for cryptographic functions The EVP documentation outlines a signature verification protocol that involves initializing a verification context with EVP VerifyInit, hashing the data with EVP VerifyUpdate, and finally verifying the data against public keys using EVP VerifyFinal This function returns three possible values: 1 for successful verification, 0 for failure, and -1 for any errors during the process However, many developers overlook the -1 return value, mistakenly believing that EVP VerifyFinal only returns 1 and 0, which can lead to flaws in various systems.
EVP VerifyInit(&ctx, peer->digest);
EVP VerifyUpdate (&ctx, (u_char *)&ep->tstamp, vallen + 12);
The EVP_VerifyFinal function indicates verification outcomes with three possible returns: 1 for correct, 0 for incorrect, and -1 for failure However, the expression erroneously treats both a return value of 1 and -1 as valid, resulting in a mishandling of verification failures as successful verifications.
Figure 3.5 Recurring Vulnerability in NTP 4.2.5 int gale_crypto_verify_raw( ) {
EVP VerifyInit(&context,EVP_md5());
EVP VerifyUpdate(&context,data.p,data.l); for (i = 0; is_valid && i < key_count; ++i) {
//Similarly vulnerable code if (!EVP VerifyFinal(&context,sigs[i].p,sigs[i].l,key)) { crypto_i_error(); is_valid = 0; goto cleanup;
} cleanup: EVP_PKEY_free(key);
In Gale 0.99, the recurring vulnerability arises from the use of the control expression (!EVP VerifyFinal( )), which checks for errors during the verification process When EVP VerifyFinal returns -1, indicating a verification failure, the control expression evaluates to false, similar to when it returns 1 Consequently, the program incorrectly assumes that the verification is successful, exposing it to potential exploitation.
The programming flaw identified in CVE-2009-0021 and CVE-2009-0047 affected the NTP and Gale systems using the EVP library, leading to a persistent vulnerability This vulnerability enables remote attackers to circumvent the validation of the certificate chain by exploiting a malformed SSL/TLS signature for DSA and ECDSA keys Vulnerable code examples from NTP are illustrated in Figures 3.5 and 3.6.
4.2.5 and Gale 0.99 Despite detailed differences, both of them use the signature verification protocol provided by EVP and incorrectly process the return value ofEVP VerifyFinalby the aforementioned ifstatement.
This vulnerability is categorized as Type 2, known as API-shared/reused recurring vulnerability (RV2), which arises in systems that share APIs or libraries Proper API usage requires adherence to protocols established by designers, including correct function call sequences and thorough validation of input/output However, developers may misuse APIs by failing to follow these protocols, leading to issues such as incorrect function order, missing essential calls, or mishandling input parameters and return values The reuse of the same library across different systems can result in similar coding errors, increasing the risk of identical vulnerabilities Each RV2 is specifically linked to the misuse of an API function or its associated protocol.
In addition to the previously mentioned types, recurring vulnerabilities are classified as Type 3 (RV3) For instance, CVE-2006-4339 and CVE-2006-7140 highlight a critical issue in OpenSSL 0.9.7 and Sun Solaris 9, where the use of an RSA key with an exponent of 3 leads to the removal of PKCS-1 padding prior to hash generation This flaw enables remote attackers to forge a PKCS, posing significant security risks.
The #1 v1.5 signature, signed by an RSA key, hinders libike's ability to accurately verify X.509 and other certificates utilizing PKCS #1 While both systems employ the same RSA encryption algorithm, they suffer from a critical flaw in their implementations: the removal of PKCS-1 padding prior to hash generation This oversight renders both systems susceptible to identical vulnerabilities.
Recurring Type 3 vulnerabilities often arise in systems that reuse higher-level artifacts, such as algorithms, specifications, or designs, to meet similar requirements If developers replicate implementation errors or if the shared algorithms and specifications contain flaws, these systems may exhibit the same or similar vulnerabilities This pattern distinguishes Type 3 vulnerabilities from Type 1 and Type 2, as they stem from the reuse of flawed components across different systems.
3 is harder to recognize/localize/match in those systems due to the wide varieties of implementation choices and differences in design, architecture, programming language, etc among systems.
Database Report Group RV RV1 RV2 RV3
Results and Implications
Table 3.1 presents the findings of our study, detailing the total number of vulnerability reports collected and the groups of recurring vulnerabilities analyzed The data confirms our hypotheses H1 and H2, indicating a significant presence of recurring vulnerabilities across different systems These vulnerabilities arise from the reuse of source code (RV1), APIs/libraries (RV2), and higher-level artifacts such as algorithms and specifications (RV3) It's important to note that while each group in oCERT contains a single report, each report typically covers multiple vulnerabilities, many of which are recurring, leading to a higher count of recurring vulnerabilities (RV) compared to the number of groups.
The analysis reveals a significant presence of recurring vulnerabilities, specifically RV1s (source code-reused) and RV2s (API-reused), particularly in systems developed on the Mozilla and Apache frameworks, which share substantial amounts of vulnerable code In contrast, Type 3 recurring vulnerabilities (RV3s) are less prevalent, likely due to the reduced likelihood of developers repeating the same mistakes in algorithm implementation compared to the more frequent flaws arising from reused source code and libraries Additionally, there are fewer systems that share designs or algorithms compared to those that reuse code and libraries.
The study validates our hypotheses regarding recurring software vulnerabilities, which are categorized into three types based on the artifacts reused by their systems This finding indicates that leveraging knowledge of previously identified vulnerabilities in reported systems can aid in detecting and addressing unreported vulnerabilities that may arise in other systems or releases utilizing the same source code, libraries, or algorithms.
The study also provides some insights about the characteristics of vulnerable code of Types 1 and
2 While Type 1 vulnerable code is generally similar in texts and structure, Type 2 vulnerable code tends to have similar method calls, and similar input checking and output handling before and after such calls Those insights are used in our detection and resolution of recurring vulnerabilities.
Threats to Validity
Our research relies on public vulnerability databases, which may be incomplete due to undisclosed vulnerabilities or the absence of publicly available patches Consequently, the findings of our empirical study, based on reported vulnerabilities with available patches, could be impacted Additionally, the identification of recurring vulnerabilities is conducted by humans, introducing the potential for bias stemming from subjective interpretations and errors.
APPROACH OVERVIEW
Problem Formulation
Building SecureSync involves two key challenges: representing and measuring the similarity of RV1s and RV2s, and localizing recurring vulnerabilities across different systems In SecureSync, vulnerabilities are represented by features extracted from both vulnerable and patched code, with similarity calculated based on these features The methods for feature extraction and similarity measurement differ for RV1s and RV2s, reflecting their distinct characteristics The detection of recurring vulnerabilities is approached through a specific formulation of the problem.
Definition 1 (Feature and Similarity) Two functionsF()andSim()are called the feature extraction and similarity measure functions for the code fragments F(A)is called the feature set of a fragment
A.Sim(A, B)is the similarity measurement of two fragmentsAandB.
Definition 2 (Recurring Vulnerable Code) Given a vulnerable code fragmentAand its correspond- ing patched code A ′ If a code fragment B is sufficiently similar to A and less similar to A ′ , i.e. Sim(B, A)≥σandSim(B, A ′ )< Sim(B, A), thenB is considered as a recurring vulnerable code fragment ofA.σis a chosen threshold.
AandA ′ could be similar becauseA ′ is modified fromA Thus,B could be similar to bothAand
A ′ The second condition requiresB to be more similar to vulnerable code than to patched code.
Definition 3 (Detecting Recurring Vulnerability) Given a knowledge base as a set of vulnerable and patched code fragmentsK ={(A 1 , A ′ 1 ),(A 2 , A ′ 2 ), ,(A n , A ′ n )}and a program as a set of code frag- mentsP={B 1 , B 2 , , B m } Find fragmentB i (s) that is recurring vulnerable code of some fragment
Algorithmic Solution and Techniques
The detection process for RV1s and RV2s begins with SecureSync generating candidate fragments from the program under investigation These candidates are then compared against both vulnerable and patched code within the knowledge base to identify recurring vulnerabilities Once detected, these vulnerabilities are reported to users along with recommendations for remediation.
1 function Detect(P, K, σ) //detect recurring vulnerability
3 for each fragment B ∈ C: //check against knowledge base for recurring
Figure 4.2 Detection of Recurring Vulnerabilities
This algorithm requires the following techniques:
SecureSync employs advanced methods for feature extraction and similarity measurement in code fragments For RV1s, it utilizes a tree-based representation known as extended AST (xAST), which captures both textual and structural features of the code The similarity between fragments is assessed through Exas, a technique designed for structural approximation and similarity assessment of trees and graphs In contrast, for RV2s, SecureSync adopts a novel graph-based representation called xGRUM, where each code fragment is depicted as a graph with nodes representing function calls, variables, operators, and control statement branching points The edges in this graph illustrate control and data dependencies among the nodes, enabling SecureSync to effectively represent API usage patterns, including function call sequences and input/output handling The similarity of code fragments is then determined by analyzing the alignment of nodes within their respective xGRUMs.
We develop a knowledge base for SecureSync through a semi-automated process that begins with accessing and manually analyzing vulnerability databases to select relevant vulnerabilities Next, we employ code search techniques to identify both the vulnerable and patched code associated with these vulnerabilities SecureSync then automatically generates corresponding xASTs and xGRUMs from the collected code fragments It is important to note that this knowledge-building process could be fully automated if vulnerability databases offered direct information on the vulnerable and patched code.
SecureSync enhances vulnerability detection by generating candidate code fragments from the analyzed program, utilizing established functions for feature extraction and similarity measurement To optimize performance, it employs a text-based filtering method, retaining only those files that contain tokens similar to the function names of known vulnerable code within its knowledge base.
SecureSync provides tailored patch recommendations for addressing vulnerabilities in code For Type 1 vulnerabilities, which involve source code reuse, the system identifies applicable patches from its knowledge base that require minimal modifications It highlights the vulnerable statements and offers sample patches In the case of Type 2 vulnerabilities, related to API usage, SecureSync employs an innovative graph alignment algorithm to suggest necessary additions, such as missed function calls or input/output checks before and after these calls.
SOFTWARE VULNERABILITY DETECTION
Type 1 Vulnerability Detection
To detect Type 1 recurring vulnerabilities, SecureSync represents code fragments, including vul- nerable and patched code in its knowledge base, via an AST-like structure, which we callextended AST
An xAST, or augmented Abstract Syntax Tree, enhances traditional ASTs by incorporating detailed labels for nodes representing function calls, variables, literals, or operators These labels include the node's type, signature, data type, value, or token, which enriches the semantic information of the tree For instance, two nodes of the same function call type with different labels signify distinct calls Figure 5.1 showcases two xASTs derived from similar vulnerable statements, while Figure 5.2 presents the corresponding patch code Despite their similar structures and node labels, such as those for the function calls CanExecuteScripts and the variable of data type nsresult, both trees share the same patch, although node types and parameter lists are omitted for clarity.
5.1.2 Feature Extraction and Similarity Measure
SecureSync defines the feature set F(A) of a code fragment A as a collection of xASTs, with each xAST representing an individual statement within A For instance, the vulnerable code fragments illustrated in Figure 3.1 exhibit feature sets of six and eight xASTs, respectively The similarity between two code fragments is then assessed based on the similarity of their corresponding xAST feature sets.
SecureSync uses Exas [44] to approximate the xAST structures and measure their similarity UsingExas, each xAST or a set of xASTsT is represented by a characteristic vector of occurrence-counts of
PRBool nsresult rv = mgr->CanExecuteScripts (cx, ourDocument->NodePrincipal(), &canExecute);
PRBool nsresult rv = mgr->CanExecuteScripts (cx, principal, &canExecute); a xAST tree in FireFox b xAST tree in SeaMonkey
Figure 5.1 xAST from Code in Figure 3.1 and Figure 3.2
NodePrincipal PRBool if (NS_FAILED(rv) || !canExecute) return PR_FALSE; doc->NodePrincipal()->GetHasCertificate(&haveCert); if (!haveCert) return PR_TRUE; rv = ourDocument->NodePrincipal()->Subsumes(doc-> NodePrincipal(), &subsumes);
PR_TRUE PR_FALSE nsIDocument nsCOMPtr nsIDocument
Figure 5.2 illustrates the xAST derived from the patched code shown in Figures 3.3 and 3.4, highlighting its structural features Ex(T) In the context of trees, a structural feature is defined as a sequence of node labels along a constrained path length For instance, both trees depicted in Figure 5.1 exhibit the features [ASSIGN]-[nsresult] and [ASSIGN]-[CanExecuteScripts]-[PRBool].
The similarity of two fragments are measured based on the Manhattan distance of two correspond- ing Exas vectors of their feature sets:
This formula normalizes vector distance with the vectors’ sizes Thus, with the same threshold for similarity, larger trees, which have larger vectors, are allowed to have more different vectors.
Type 1 vulnerable code fragments often appear in non-consecutive statements, making them difficult for traditional code clone detection techniques to identify To effectively locate these vulnerable code candidates, SecureSync compares each statement of the program to its knowledge base of known vulnerabilities and merges similar statements into larger fragments Additionally, SecureSync enhances its search efficiency through a two-level filtering process.
Text-based filtering is a method used to eliminate source files that lack code resembling known vulnerabilities SecureSync performs lexical analysis on each file, retaining only those that contain tokens or words similar to those found in vulnerable code, such as function or variable names This approach has proven to be highly effective; for instance, when evaluating Mozilla Firefox's extensive collection of over 6,000 source files, SecureSync successfully narrowed it down to approximately 100 relevant files.
Structure-based filtering focuses on retaining statements that share similar xAST structures with known vulnerabilities in the knowledge base SecureSync employs locality-sensitive hashing (LSH) to achieve this, generating hash codes for vectors so that similar vectors are more likely to produce identical hash codes Initially, SecureSync parses each source file from the previous step into an xAST format.
Then, for each sub-tree representing a statementSin the file, it extracts an Exas feature vectorEx(S).
SecureSync verifies the similarity between statement S and statement T in the knowledge base by comparing their LSH hash codes If Ex(S) and Ex(T) share common hash codes, it indicates that S and T likely possess similar xAST structures To enhance processing speed, each statement T from the vulnerable code is pre-hashed into a hash table, allowing SecureSync to disregard any statement S that does not share a hash code with this table After identifying potential candidates with similar xAST structures, SecureSync merges consecutive candidate statements into larger code fragments, typically at the method level These candidate code fragments are then compared to vulnerable and patched code fragments in the knowledge base, utilizing the established definitions and formulas for accurate analysis.
Ain the knowledge base that contain(s) the statementsTs correspondingly having some common LSH hash codes with the statementsSs ofA.
When developers reuse source code, they often rename code entities, which can impact the comparison of code fragments across different system versions For instance, the function "proxy_send_dir_filter" was renamed between Apache HTTP Server versions 2.0.x and 2.2.x To address this issue, SecureSync employs an origin analysis process that maps names between versions or code-sharing systems, ensuring accurate feature extraction from xASTs This mapping is achieved through the OperV technique, which models software systems as project trees and utilizes a tree alignment algorithm to compare sub-tree similarities, ultimately providing the necessary name mappings for effective feature extraction.
Type 2 Vulnerability Detection
Type 2 vulnerabilities are caused by the misuse or mishandling of APIs Therefore, we emphasize on API usages to detect such recurring vulnerabilities If a candidate code fragmentB has similar API usages to a vulnerable code fragmentA,B is likely to have a recurring vulnerability withA In
SecureSync utilizes a graph-based model to represent API usages, where nodes signify API function calls, data structures, and control structures, while edges illustrate the relationships or dependencies among them This innovative representation is known as the Extended Graph-based Usage Model (xGRUM).
Definition 4 (xGRUM) Each extended graph-based usage model is a directed, labeled, acyclic graph in which:
1 Each action node represents a function or method call;
2 Each data node represents a variable;
3 Each control node represents the branching point of a control structure (e.g if, for, while, switch );
4 Each operator node represents an operator (e.g not, and, or );
5 An edge connecting two nodesxandyrepresents the control and data dependencies betweenx andy; and
6 The label of an action, data, control, and operator node is the name, data type, or token of the corresponding function, variable, control structure, or operator, along with the type of the correspond- ing node.
The representation of API function usage is illustrated through action or data nodes, where the sequence of calls, such as x before y, is indicated by edges connecting the respective nodes This highlights the distinction between vulnerable API usage, as seen in NTP and Gale, and the patched API usage in NTP.
EVP_MD_CTX EVP_VerifyUpdate
RETURN EVP_PKEY_free crypto_i_error
EVP_MD_CTX EVP_VerifyUpdate
EVP_MD_CTX EVP_VerifyUpdate
Legends action node data node control node control dependency data dependency
Figure 5.3 illustrates xGRUMs derived from both vulnerable and patched code, as shown in Figure 3.5 The evaluation of API function inputs and outputs is represented through the control and operator nodes that encircle the action nodes associated with these function calls, along with the connections between the control, operator, and action nodes.
Figure 5.3 partially shows three xGRUMs of two vulnerable code fragments in two figures 3.5 and
3.6 and one patched code fragment (For better understanding, only the nodes/edges related to API usages and the vulnerabilities are drawn) The xGRUMs have the action nodes representing function callsEVP VerifyFinal,EVP VerifyInit, data nodeEVP MD CTX, control nodesIF,FOR, and oper- ator nodesNOT,LEQ An edge fromEVP VerifyUpdatetoEVP VerifyFinalrepresents both their control dependency, (i.e EVP VerifyUpdateis used beforeEVP VerifyFinal), and thedata de- pendency: those two nodes share data via data nodeEVP MD CTX The edge betweenEVP MD CTXand
EVP VerifyFinalshows that the corresponding variable is used as an input forEVP VerifyFinal
The control dependency is illustrated by the edge from the action node EVP VerifyFinal to the control node IF, indicating that EVP VerifyFinal is executed prior to the branching point of the if statement, meaning the condition check takes place after the function call.
Especially, operator nodeNOTrepresents the operator in the control expression!EVP VerifyFinal( ).
The control dependencies in the system involve the nodes EVP VerifyFinal and IF, with the control expression modified to EVP VerifyFinal( )