ALERT PRIORITIZATION AND STRENGTHENING: TOWARDS AN INDUSTRY STANDARD PRIORITY SCORING SYSTEM FOR IDS ANALYSTS USING OPEN SOURCE TOOLS AND MACHINE LEARNING MODELS

Date: 22 September 2020
Author: Dangi, Bikram
  1. INTRODUCTION

    Intrusion Detection Systems ("IDSs") are notorious for requiring extensive adjustment upon installation, and they must be tailored to each specific network on which they are deployed. The time spent on this adjustment, and the potential for human error, leaves systems open to hidden attacks. Machine Learning ("ML") has been increasingly leveraged to help cutting-edge IDS technology identify threats more accurately than traditional rule-based methods. Both ML-assisted technologies and custom signature-based solutions are popular methods in use today.

    One of the Information Security Research and Education ("INSuRE") research sponsors, Vince Streiff, at the discretion of the National Security Agency ("NSA"), posed the following question: How do we do a better job of letting defensive analysts know which alerts they need to focus on when they are configuring a new IDS instance? This question is important not only for security professionals in high-risk environments, but for the general public as well. Our work focused on IDS flow data logs, which are often available independently of alert data and provide favorable datasets for applying machine learning techniques. Our research goal of assisting the NSA with this question became an investigation into the plausibility of easily utilizing this nearly ubiquitous data source within the typical security environment to answer the question of IDS prioritization.

    This research explores the plausibility of addressing this IDS alert prioritization problem by bridging the gap between the users of traditional rule-based and open source solutions and the less accessible large-scale, cutting-edge corporate production software packages and novel solutions often researched by academic institutions. Our solution uses simple and scalable ML to produce a model that is bootstrapped with publicly available traffic flow data, so that a new IDS deployment can be configured quickly and easily. This model can then be retrained periodically with privately collected data samples. The product is an easily understandable metric that allows analysts to rank traffic flow data by likelihood of a threat, and that improves over time. We explored selected ML models by running them against modern datasets, including the CSE-CIC-IDS2018 dataset. We then tested and validated the models to ensure that the results were sensible.
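As a sketch of the kind of ranking metric described above, a classifier's predicted probability can serve directly as the priority score by which analysts sort flows. The feature names, weights, and logistic form below are purely illustrative stand-ins for a trained model, not the authors' actual implementation:

```python
import math

def threat_score(flow, weights, bias=0.0):
    """Map a flow's numeric features to a 0-1 priority score.

    A logistic model stands in here for any trained classifier's
    predicted probability; the features and weights are toy values.
    """
    z = bias + sum(weights[k] * flow.get(k, 0.0) for k in weights)
    return 1.0 / (1.0 + math.exp(-z))

def rank_flows(flows, weights):
    """Return flows sorted from most to least suspicious."""
    return sorted(flows, key=lambda f: threat_score(f, weights), reverse=True)

# Illustrative weights: long flows, large payloads, and risky
# destination ports score higher in this toy model.
WEIGHTS = {"duration_s": 0.01, "bytes_per_packet": 0.002, "dst_port_risk": 1.5}

flows = [
    {"id": "a", "duration_s": 2, "bytes_per_packet": 60, "dst_port_risk": 0.0},
    {"id": "b", "duration_s": 900, "bytes_per_packet": 1400, "dst_port_risk": 1.0},
]
ranked = rank_flows(flows, WEIGHTS)  # flow "b" ranks first
```

Because the score is a single number on a fixed 0-1 scale, it can be compared across retraining cycles, which is what makes it usable as a standard metric.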

  2. BACKGROUND

    1. INTRUSION DETECTION SYSTEMS

      An intrusion detection system ("IDS") is a network security system used to monitor the traffic flowing in and out of a network infrastructure. Based on its configuration, the system logs any suspicious activity according to the rules created by the administrator. These logs can later be analyzed for a more thorough attack analysis, and triggers can be set up to respond to malicious activity. This automates the traffic monitoring process, blocks potential intruders, and provides the following types of data.

      1) Alert Data: Depending on the configuration of an IDS, records saved as logs of network flow data and traffic alerts are usually generated by these systems and can vary widely. Many of these alerts can be false positives. If the false positive volume is sufficiently large, it takes significant time away from security analysts.

      2) Flow Data: The IDS-generated network flow data logs used in our research were not chosen to represent any individual environment but rather multiple environments, so that the model does not over-fit. We attempted to work from datasets meant to represent a 'global' threat envelope. In future research, it would be advantageous to pool data from more environment types and use it as better "bootstrapping datasets" when creating generic initial defensive models. However, the first step is to generalize this modeling process and build a functioning, simple workflow for analysts with the currently available data. This will make the initial IDS set-up process less time consuming and more productive than current practice in newly created network nodes.

    2. NETWORK FLOW DATA

      Network Flow is a concept introduced by Cisco in 1996 that summarizes the unidirectional sequence of packets between a source and a destination. (1) It was originally a feature of Cisco router products; however, since 1996 network flow data has come to be captured by a variety of network gateway devices, from firewalls to intrusion detection systems. The benefit of network flow over traditional packet capture is detailed data collection without the performance impact on the hosting device. (2) Flow data can be used for IP-flow-based intrusion detection and traffic classification, and by leveraging advanced techniques such as machine learning, it can be employed to identify intrusions.
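A minimal sketch of what such a unidirectional flow record contains may help make this concrete. The field set below is representative of NetFlow-style summaries keyed on the 5-tuple; it is an illustrative simplification, not a specific export format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FlowRecord:
    """One unidirectional flow: packets sharing the same 5-tuple.

    Representative of NetFlow-style summaries; not a specific
    vendor export format.
    """
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int     # e.g. 6 = TCP, 17 = UDP
    packets: int      # packet count for the flow
    bytes: int        # total bytes across the flow
    duration_ms: int  # time between first and last packet

    def key(self):
        """The 5-tuple identifying this flow and its direction."""
        return (self.src_ip, self.dst_ip, self.src_port,
                self.dst_port, self.protocol)

flow = FlowRecord("10.0.0.5", "192.0.2.10", 51514, 443, 6,
                  packets=12, bytes=9000, duration_ms=850)
```

Note that only summary counters are kept per flow, which is why collection is far cheaper than full packet capture.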

  3. OBJECTIVES

    Our first objective was to explore the current landscape of publicly available datasets commonly used in the security research literature. These datasets often corroborate current research on signature-based detection models and are relevant to this study. Because our hypothesis assumes a limited testing time frame for initial IDS installation, we knew that a comprehensive view of the currently available dataset resources was vital. We initially explored vector quantization analysis methods and produced visualizations of these already well understood datasets in order to find patterns as a means of comparison. Once we had explored the merits of these datasets, we planned to tailor our approach to the available data and move on to testing further methods, pairing the best performing models observed with our chosen datasets. After evaluating these models for efficacy, we tested our next objective: further developing our prescribed generalized training and deployment workflow with one additional generalized "global" dataset for validation. We used this additional validation set to establish the overall plausibility of our model and a baseline for understanding any potential over-fitting. As a final step, we planned to release selected code samples in an online open source code repository to aid future research and development. (3) The overall objective was to illustrate a method for Security Operations Center ("SOC") analysts to sort through large volumes of alerts quickly by using a standard scoring metric to compare against IDS-generated alert data. This research and our repository provide a supplemental guide for security professionals looking for an example of developing ML-supported analytical workflows within a Python environment using well established methods and practices.

  4. PROJECT PROCESS

    1. OUTLINE

      Our initial steps involved a survey and review of the current literature with regard to dataset selection. (4) We explored and tested multiple initial candidates and subsets, applying the K-means vector quantization algorithm to each dataset before testing the decision tree and random forest models. We ultimately selected the CSE-CIC-IDS2018 dataset for final testing and exploration, while also discussing the plausibility of other quantization, clustering, and neural network models for future development iterations. We initially used K-means clustering because it is often the first choice of model among security professionals from the many available unsupervised machine learning methods for threat detection. We created an early prototype using a subset of real world and research datasets in a preliminary solution, and commenced preliminary workflow explorations with various clustering methods, refining some of these subsets into clustering visualizations.
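The K-means step described above can be sketched in a few lines of NumPy. This is Lloyd's algorithm with fixed initial centroids for reproducibility, applied to toy two-dimensional "flow feature" points; in practice a library implementation (e.g. scikit-learn's) would be run on the full feature set:

```python
import numpy as np

def kmeans(X, centroids, n_iter=20):
    """Minimal Lloyd's algorithm: assign each point to its nearest
    centroid, then move each centroid to its cluster mean."""
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(n_iter):
        # Distance from every point to every centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid from its assigned points.
        for k in range(len(centroids)):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids

# Two well-separated toy clusters standing in for "benign" and
# "suspicious" regions of flow feature space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(5.0, 0.1, (20, 2))])
labels, centers = kmeans(X, centroids=[[0.0, 0.0], [5.0, 5.0]])
```

Being unsupervised, K-means needs no attack labels, which is exactly why it is a common first probe of an unfamiliar dataset.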

      Our rationale was not only to gain intuition about the types of threat environments portrayed by these datasets, but also to observe any patterns present within them via Principal Component Analysis ("PCA"). One of the first insights we gained with PCA was the mismatch of class balances between datasets as well as within the CSE-CIC-IDS2018 dataset itself. In fact, class size remains the largest difference between the datasets we observed, as well as among modern security datasets in general, and appears qualitatively to be the distinguishing "signature" feature between collection environments. Class imbalance can often hinder the efficiency of threat detection models, yet we managed to yield fairly productive results with the 2018 dataset in our decision trees without the need for under-sampling. (5, 6) Additionally, our goal was to compare and survey different methods and datasets for plausibility rather than to perfect the results of any one ML method. We therefore left further sample balancing exploration as an open-ended question for future research.
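The PCA projection used for such visualizations can be sketched via an SVD on mean-centered data; this is equivalent in spirit to a library PCA and is shown here only to illustrate the technique, with synthetic data standing in for flow features:

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto the top principal components.

    SVD of the mean-centered data gives the principal directions
    as rows of Vt, ordered by decreasing explained variance.
    """
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T, S

rng = np.random.default_rng(1)
# Toy data whose variance is dominated by one direction, mimicking
# the dominant-class structure seen in imbalanced flow datasets.
base = rng.normal(0, 1, (200, 1)) @ np.array([[3.0, 1.0, 0.5]])
X = base + rng.normal(0, 0.1, (200, 3))
Z, S = pca_project(X)  # Z: 2-D coordinates suitable for a scatter plot
```

Plotting the two columns of Z, colored by class label, is what surfaces the class-balance mismatches described above.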

      The initial unsupervised use of k-means proved less than useful as a final workflow path...
