The raw network packets of the dataset were created by the IXIA PerfectStorm tool in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) to generate a hybrid of real modern normal activities and synthetic contemporary attack behaviours. The Tcpdump tool was used to capture 100 GB of the raw traffic (e.g., Pcap files). The dataset has nine types of attacks, namely Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms. The Argus and Bro-IDS tools were used, and twelve algorithms were developed, to generate a total of 49 features with the class label.
a) The features are described here.
b) The number of attacks and their sub-categories is described here.
c) In this coursework, we use a total of 10 million records stored in a CSV file (download). The total size is about 600 MB, which is large enough to justify big data methodologies for analytics. As a big data specialist, you should first read and understand the features, then apply modeling techniques. To see a few records of this dataset, you can import it into Hadoop HDFS and then run a Hive query that prints the first 5-10 records.
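Before loading the file into HDFS, it can help to peek at a few records locally. The following is a minimal Python sketch of that idea, not the Hive query the task asks for. The function name head_records and the file name in the usage comment are hypothetical, and the sketch assumes a plain comma-separated file.

```python
import csv
from itertools import islice


def head_records(csv_path, n=5):
    """Return the first n rows of a CSV file as lists of strings.

    Only the first n rows are read, so this stays cheap even for a
    file of several hundred megabytes.
    """
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        return list(islice(reader, n))


# Hypothetical usage (the file name is an assumption, not from the brief):
# for row in head_records("dataset.csv", n=5):
#     print(row)
```

Because the reader is consumed lazily, the rest of the file is never loaded into memory.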
This task uses Apache Hive to convert big raw data into useful information for end users. To do so, first understand the dataset carefully. Then write at least 4 Hive queries (refer to the marking scheme). Apply appropriate visualization tools to present your findings numerically and graphically. Briefly interpret your findings.
In this section, you will conduct advanced analytics using PySpark.
3.1. Analyze and Interpret Big Data
We need to learn and understand the data through at least 4 analytical methods (descriptive statistics, correlation, hypothesis testing, density estimation, etc.). You need to present your work numerically and graphically. Apply tooltip text, a legend, a title, X-Y labels, etc. as appropriate to help end users gain insights.
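As an illustration of one of the analytical methods listed above, the sketch below computes the Pearson correlation coefficient between two numeric columns in plain Python. It is a small-scale stand-in for the same calculation in PySpark, not a prescribed solution. The function name pearson_correlation and the sample values are assumptions made for illustration.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values standing in for two numeric features.
feature_a = [1.0, 2.0, 3.0, 4.0]
feature_b = [2.0, 4.0, 6.0, 8.0]
print(pearson_correlation(feature_a, feature_b))
```

A coefficient near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship.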
3.2. Design and Build a Classifier
a) Design and build a binary classifier over the dataset. Explain your algorithm and its configuration. Present your findings in both numerical and graphical form. Evaluate the performance of the model and verify its accuracy and effectiveness.
b) Apply a multi-class classifier to classify the data into ten classes (categories): one normal and nine attacks (i.e., Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms). Briefly explain your model with supporting statements on its parameters, accuracy and effectiveness.
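Evaluating a classifier usually starts from the confusion matrix. The sketch below shows, in plain Python, how accuracy, precision, recall and F1 could be derived for the binary case. The function name binary_metrics, the label encoding (1 for attack, 0 for normal) and the sample labels are assumptions for illustration. In PySpark the same quantities would normally come from the library's evaluator classes.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # attack correctly flagged
            else:
                fp += 1  # normal traffic wrongly flagged
        else:
            if actual == positive:
                fn += 1  # attack missed
            else:
                tn += 1  # normal traffic correctly passed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 1 = attack, 0 = normal (an assumed encoding).
print(binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Accuracy alone can mislead when one class dominates the data, which is why precision and recall are reported alongside it.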
Discuss (1) what alternative technologies are available for tasks 2 and 3 and how they differ (use academic references), and (2) what surprising new thinking was evoked and/or neglected on your part.
Document all your work. Your final report must follow the five sections detailed in the “format of final submission” section (refer to the next page). Your work must demonstrate an appropriate understanding of academic writing and integrity.
Descriptive Analysis: This method involves summarizing and describing the key features of the data. It includes measures like mean, median, mode, standard deviation, range, and frequency distribution. This helps in getting an overall understanding of the data and identifying patterns or trends.
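The measures named above can be sketched with Python's standard statistics module. This is a minimal illustration on a small in-memory list, not a recipe for the full dataset. The function names describe and frequency_distribution and the sample values are assumptions.

```python
import statistics
from collections import Counter


def describe(values):
    """Summarize a numeric sequence with common descriptive measures."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "range": max(values) - min(values),
    }


def frequency_distribution(labels):
    """Count how often each category label occurs."""
    return dict(Counter(labels))


# Made-up values for illustration only.
print(describe([2, 4, 4, 6, 9]))
print(frequency_distribution(["Normal", "DoS", "Normal"]))
```

On data too large for one machine, the same summaries would be computed with distributed aggregations rather than in-memory lists.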
Inferential Analysis: This method involves making inferences about a population from a sample of data. It is useful when it is not possible or feasible to collect data from the entire population. It includes hypothesis testing and confidence intervals, and helps in determining how reliably estimates from the sample reflect the population.
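As one example of inference from a sample, the sketch below estimates a confidence interval for a population mean. It assumes a normal approximation, which is reasonable for large samples; for small samples a t distribution would be more appropriate. The function name mean_confidence_interval and the sample values are illustrative assumptions.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.mean(sample)
    # Standard error of the mean from the sample standard deviation.
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * standard_error, mean + z * standard_error


# Made-up sample for illustration only.
low, high = mean_confidence_interval([10, 12, 14, 16, 18])
print(low, high)
```

A narrower interval indicates that the sample mean is a more precise estimate of the population mean.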
Predictive Analysis: This method involves using statistical and machine learning algorithms to predict future outcomes based on historical data. It includes techniques like regression analysis, decision trees, and random forest. This method is useful in identifying patterns in the data and predicting trends in the future.
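Regression analysis, the simplest of the techniques listed above, can be sketched as an ordinary least-squares fit of a straight line. The function names fit_line and predict and the data points are hypothetical, chosen only to illustrate the calculation.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predict y for a new x using the fitted line."""
    return slope * x + intercept


# Made-up historical points for illustration only.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(predict(slope, intercept, 5))
```

Decision trees and random forests follow the same fit-then-predict pattern but can capture non-linear relationships.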
Prescriptive Analysis: This method involves identifying the best course of action to achieve a specific goal. It includes optimization and simulation techniques, which allow candidate actions to be compared under different scenarios.