Understanding Data Virtualisation, Apache Hadoop, MongoDB and Data Mining

Data Virtualisation

According to Bologna (2011), data virtualisation is the process that allows data held in multiple information sources, such as relational databases, databases exposed through web services, XML repositories and many others, to be accessed without taking into consideration its physical storage or heterogeneous structure. This is helpful in many ways: with today's fast flow of information and constant data transmission, enterprises are challenged to work with real-time data access. Data virtualisation works by creating a single data layer that delivers data services to different applications and users, allowing real-time data to be gathered with greater flexibility. Furthermore, enterprise data virtualisation allows you to integrate data from multiple sources, locations and formats without requiring the duplication of data, providing faster access to data, agility to change, less data redundancy and less time spent on design and implementation.
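To make the idea of a single virtual data layer more concrete, here is a minimal sketch in Python that federates two hypothetical sources, a relational table held in SQLite and a JSON document standing in for a web-service response, behind one query function, so that data is combined at request time without being duplicated into a new store. The names used (customers, get_customer_view) are illustrative assumptions, not part of any particular virtualisation product.

import json
import sqlite3

# Source 1: a relational database (in-memory SQLite for the sketch).
relational = sqlite3.connect(":memory:")
relational.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
relational.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ana", "Lisbon"), (2, "Ben", "Leeds")],
)

# Source 2: a JSON payload standing in for a web-service response.
web_service_payload = json.loads('[{"id": 3, "name": "Chen", "city": "Porto"}]')

def get_customer_view():
    """Virtual view: combine both sources at query time, duplicating nothing."""
    for row in relational.execute("SELECT id, name, city FROM customers"):
        yield {"id": row[0], "name": row[1], "city": row[2]}
    for record in web_service_payload:
        yield record

# Applications query the single layer without knowing where each record lives.
for customer in get_customer_view():
    print(customer)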

Apache Hadoop

Apache Hadoop can simply be defined as an open-source framework suited to efficiently storing and processing large amounts of data. Instead of using one large computer to store and process all the data, Hadoop allows you to cluster multiple computers to analyse huge datasets more quickly. Hadoop consists of four main modules: the Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN), MapReduce and Hadoop Common.

-Hadoop Distributed File System (HDFS) – A distributed file system that can run on standard or low-end hardware. It provides better data throughput than traditional file systems, along with high fault tolerance and native support for large datasets.

-Yet Another Resource Negotiator (YARN) – Used to schedule tasks and jobs. It also manages and monitors cluster nodes and resource usage.

-MapReduce – A framework that helps programs perform parallel computation on data. The map task takes the input data and converts it into a dataset that can be computed as key-value pairs; the output of the map task is then consumed by reduce tasks, which aggregate it to provide the desired output.

-Hadoop Common – Provides Java libraries that can be used by all of the other modules.

In addition, as the Hadoop ecosystem became extremely popular over the years, many tools and applications were created to help collect, store, process, analyse and manage big data. The most popular include Spark, Presto, Hive, HBase and Zeppelin.


MongoDB

MongoDB, as defined by TutorialsPoint (2020), is a cross-platform, document-oriented database that provides flexibility, scalability and high performance. As a NoSQL document database that stores data in JSON-like documents, MongoDB is used for high-volume data storage. Unlike the traditional relational way of storing data, MongoDB works by making use of collections and documents. Documents contain key-value pairs, which according to Guru99 (2020) are the basic unit of data in MongoDB. Collections contain sets of documents and are the equivalent of tables in a relational database; each document in a collection can differ from the others and contain a varying number of fields.

Furthermore, the document structure can be much clearer, as developers construct their classes and objects in their corresponding programming language, and they will often say that their classes are not rows and columns but have a clear structure of key-value pairs. In addition, the data model of MongoDB allows you to easily represent more complex structures, such as hierarchical relationships and arrays (Guru99, 2020). Figure .. below shows an example of how a collection in MongoDB is structured:
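To complement the structure described above (this is a sketch, not the referenced figure), the snippet below uses the pymongo driver to insert documents with differing fields into one collection and then query them. It assumes a MongoDB server running locally, and the database and collection names (school, students) are made up for illustration.

from pymongo import MongoClient

# Assumes a MongoDB server is running locally on the default port.
client = MongoClient("mongodb://localhost:27017")
collection = client["school"]["students"]

# Documents in the same collection may have different fields (flexible schema).
collection.insert_many([
    {"name": "Ana", "grade": 82, "subjects": ["maths", "physics"]},
    {"name": "Ben", "grade": 67},  # no "subjects" field
    {"name": "Chen", "grade": 91, "mentor": {"name": "Dr. Silva"}},  # nested document
])

# Query the collection: students with a grade of 70 or higher.
for student in collection.find({"grade": {"$gte": 70}}):
    print(student["name"], student["grade"])

The array field (subjects) and the embedded document (mentor) correspond to the arrays and hierarchical relationships mentioned above.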

Data Mining

Data mining can be defined as the process used to extract usable data from huge datasets of raw data. It is made possible by software that looks for patterns in large chunks of data. Businesses can then study the extracted data and learn more about their customers in order to offer a more effective strategy for their company (Twin, 2020).

As stated by Oracle (2020), the key properties of data mining are: automatic discovery of patterns, prediction of likely outcomes and the creation of actionable information.

-Automatic discovery – data mining is accomplished by data models, and these models use an algorithm that acts on a set of data; automatic discovery therefore refers to the execution of those models.

-Prediction – data mining can be predictive; for example, it can help predict the likely outcome of something based on the information collected. Predictions are associated with probabilities, also known as confidence. Some predictive data mining also produces rules, which are conditions that imply a given outcome.

-Actionable information – information produced by a data model, for example a prediction made from certain data, that can be acted upon to accomplish a task.

By exploring and analysing large blocks of data, we can transform them into meaningful patterns that can later be used in a variety of ways, for instance database marketing, credit risk management and fraud detection.
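As a small illustration of the prediction, confidence and rule ideas above, the sketch below scans a made-up list of shopping transactions for the rule "customers who buy bread also buy butter" and reports that rule's confidence. The data and the rule are assumptions for demonstration only, not taken from the cited sources.

# Made-up transaction data: each set lists the items in one purchase.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def rule_confidence(antecedent, consequent, transactions):
    # Confidence of the rule antecedent -> consequent: of the transactions
    # containing the antecedent, the fraction that also contain the consequent.
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    with_both = [t for t in with_antecedent if consequent in t]
    return len(with_both) / len(with_antecedent)

confidence = rule_confidence("bread", "butter", transactions)
print(f"Rule 'bread -> butter' holds with confidence {confidence:.0%}")  # 75%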

MapReduce is a framework and programming paradigm used for processing huge amounts of data across hundreds or thousands of servers in a Hadoop cluster. MapReduce, as the name says, works in two steps, Map and Reduce. Map takes a set of data and converts it into another set of data, in which individual elements are broken down into key-value pairs. The Reduce task takes the output from a map as its input and combines those key-value pairs into smaller sets in order to reduce the data. The whole process goes through four execution phases: splitting, mapping, shuffling and reducing (Guru99, 2020).
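A minimal, single-machine sketch of those four phases for a classic word count is shown below; it is plain Python written to illustrate splitting, mapping, shuffling and reducing, not the Hadoop MapReduce API itself.

from collections import defaultdict

text = "big data needs big ideas and big clusters"

# 1. Splitting: divide the input into chunks (here, two halves of the word list).
words = text.split()
chunks = [words[: len(words) // 2], words[len(words) // 2:]]

# 2. Mapping: each chunk is converted into (key, value) pairs independently.
mapped = [[(word, 1) for word in chunk] for chunk in chunks]

# 3. Shuffling: group all values by key across every map output.
shuffled = defaultdict(list)
for chunk_output in mapped:
    for word, count in chunk_output:
        shuffled[word].append(count)

# 4. Reducing: combine the grouped values into the final, smaller result.
reduced = {word: sum(counts) for word, counts in shuffled.items()}
print(reduced)  # {'big': 3, 'data': 1, 'needs': 1, 'ideas': 1, 'and': 1, 'clusters': 1}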

Therefore, this process provides many benefits, such as Scalability, Flexibility, Speed and Simplicity.

-Scalability – businesses are able to process petabytes of data stored in the Hadoop Distributed File System (HDFS).

-Flexibility – Hadoop enables you to easily access multiple sources and types of data.

-Speed – because of its parallel processing and minimal data movement, it can process large amounts of data very quickly.

-Simplicity – developers can choose which programming language to write their code in, including Python, Java and C++; a small Python example in the Hadoop Streaming style is sketched below.
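As a sketch of that language flexibility, the two short scripts below follow the Hadoop Streaming convention, in which the mapper and reducer read lines from standard input and write tab-separated key-value pairs to standard output; the file names mapper.py and reducer.py are illustrative. Outside a cluster, the pipeline can be tried locally with: cat input.txt | python mapper.py | sort | python reducer.py

# mapper.py - emits "word<TAB>1" for every word on standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - receives the mapper output sorted by key and sums the counts per word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")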

The V's of Big Data

The V's of Big Data are Volume, Velocity, Variety, Veracity and Value. The concept started in 2001 with three V's (Volume, Variety and Velocity); after that, Veracity was added as the fourth V and Value as the fifth, with lists later growing to eight V's and beyond. In this assignment, we are going to discuss the most important ones, the 5 V's (Volume, Variety, Velocity, Veracity and Value).
