Hadoop is an open-source Java framework developed by the Apache Software Foundation. It is used to store and process very large datasets with the help of the MapReduce programming model. The framework stores data on a distributed file system that offers a high data transfer rate among multiple nodes (De Mauro, Greco & Grimaldi, 2015). This design lessens the risk of catastrophic system failure and unexpected data loss, even if multiple nodes cease to operate. It follows the MapReduce model, in which an application is broken down into many smaller fragments of work that operate on pieces of the data known as blocks. These fragments can run on any node within the cluster. Hadoop has emerged as a foundation for big data processing tasks such as sales or business planning, scientific analysis and the processing of IoT (Internet of Things) sensor data.
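The MapReduce model mentioned above can be illustrated with a minimal, single-machine sketch. It is not Hadoop code and uses no Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase) and the sample data are assumptions chosen for illustration only. It shows how a word count might be split into a map step that runs per block, a shuffle step that groups values by key, and a reduce step that aggregates each group.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(blocks):
    mapped = []
    # On a real cluster each block would be mapped on a different node;
    # here the blocks are simply processed one after another.
    for block in blocks:
        mapped.extend(map_phase(block))
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical input: two "blocks" of a larger text file.
blocks = ["big data big cluster", "data node data block"]
counts = word_count(blocks)
```

Because each call to map_phase depends only on its own block, the map step can in principle be distributed across nodes, which is the property the framework relies on.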
Hadoop has gradually grown to become the backbone of big data analytics and storage technology. All across the globe, companies have begun to integrate the Hadoop framework into their existing systems and to use big data to enhance their business models. Studying Hadoop use cases helps in understanding both the big data problems that call for Hadoop and the infrastructure behind it. The following paragraphs examine some of the use cases that have been adopted by reputed business firms.
Inventory prediction, or forecasting, has long been one of the most prominent use cases of Hadoop. Cloudera.com (2017) reports that Marks & Spencer adopted the Cloudera Enterprise Data Hub Edition to enrich its digital platform experience and to gain a better understanding of customer behavior. Cloudera builds on the Apache Hadoop framework, adding services and support, in order to store and process large volumes of meaningful data seamlessly. Through this, M&S managed to handle big data in a more flexible and customized manner. Its use of data became more robust and scalable, with primary importance given to security. Using Hadoop gave the company an advantage over its competitors by providing important information, such as inventory forecasts and customer satisfaction indicators, through predictive analysis. Customer satisfaction improved because Hadoop-based analysis helped the company to understand customer behavior and to act accordingly.
Another well-known use case of Hadoop lies in its ability to analyze social media data and help companies reach out to their customers. This also helps with promotion and advertising. British Airways, too, adopted the Hadoop framework in order to broaden its scope of business. It set up the Know Me program, which allowed it to stay a step ahead of its competitors. The airline used this program to gather and analyze customer data and to identify loyal customers for offers and benefits. The program also allowed the airline to attend to customers who were stuck on the freeway, who were sent messages with offers to reschedule their flights. This, too, helped the company to showcase its customer service and convenience. The use of Hadoop allowed the company to store, index and process with ease the huge amounts of data accumulated from millions of passengers around the globe.
With the increasing popularity of the Hadoop 1.x versions, Hadoop version 2 was released in 2013. Hadoop 2 added support for running non-batch applications with the help of YARN (Yet Another Resource Negotiator). YARN was initially used as a resource manager in the Hadoop framework; later, however, it came to be recognized as a large-scale distributed operating system for big data processes. Hadoop 2 also updated the Hadoop Distributed File System (HDFS), which comprises a namespace layer and a block storage service. The namespace manages file and directory operations, while the block storage service manages the cluster of data nodes, block replication and block operations. This brought greater reliability and scalability to Hadoop. As mentioned earlier, however, it was the introduction of YARN that added most to the abilities of Hadoop. It made Hadoop friendlier to multiple processing models by allowing programs to be written for models other than the MapReduce model of Hadoop 1. Apart from these, there are several other aspects in which version 2 improved on its predecessor. Hadoop 1 was limited as a platform for streaming, real-time operation and event processing, whereas the Hadoop 2.x versions offer a platform that supports a wider variety of data analysis for these workloads. In the previous versions, a failure of the NameNode affected the whole stack. In Hadoop 2.x, stack components such as Hive, HBase and Pig are well equipped to tolerate such NameNode failures. Finally, the 2.x versions enabled Microsoft Windows support for Hadoop, which was previously not available. This had been one of the greatest drawbacks of the technology until 2013 (Zhu et al., 2014).
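To make the role of the block storage service more concrete, the following sketch shows how a file might be cut into blocks and how replicas of each block might be spread over data nodes. It is a toy model only: the block size of 128 MB and the replication factor of 3 are assumed here as typical default values, the data node names are hypothetical, and the round-robin placement is a simplification that does not reproduce the rack-aware placement policy of HDFS.

```python
BLOCK_SIZE_MB = 128   # assumed block size, not taken from the text above
REPLICATION = 3       # assumed replication factor, not taken from the text above


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes of the blocks a file is cut into; the last one may be smaller."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign every block to `replication` distinct data nodes (toy round-robin policy)."""
    if replication > len(datanodes):
        raise ValueError("not enough data nodes for the requested replication")
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


# Hypothetical example: a 300 MB file on a four-node cluster.
file_blocks = split_into_blocks(300)
placement = place_replicas(len(file_blocks), ["dn1", "dn2", "dn3", "dn4"])
raw_storage_mb = sum(file_blocks) * REPLICATION
```

Under these assumptions the 300 MB file occupies three blocks and 900 MB of raw disk space, and every block survives the loss of any two data nodes that hold it.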
Scalability has been a major area of focus for the 2.x versions of Hadoop. Through YARN, these versions overcome the node and task scalability bottlenecks of the 1.x versions, which were limited to about 4,000 nodes and 40,000 tasks. By virtue of its split resource management architecture, Hadoop 2.x is designed to scale to 10,000 nodes and 100,000 tasks.
Meanwhile, the Hadoop 3.0 release is imminent. It is meant to cut the physical disk usage of Hadoop-based applications by half. It is also intended to raise the fault tolerance level of Hadoop systems by 50%. The Hadoop shell scripts are to be rewritten to provide more far-reaching capabilities to programmers. Above all, reliability and scalability are to be enhanced with the help of the YARN Timeline Service v2 (ATS v2).
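The figures above, half the disk usage and 50% more fault tolerance, can be reproduced with simple arithmetic if one assumes that they refer to replacing 3-way replication with Reed-Solomon erasure coding using 6 data units and 3 parity units. That mechanism is an assumption made here for illustration; the paragraph above does not name it. The sketch below only compares the two schemes numerically and does not implement any encoding.

```python
def replication_cost(copies):
    """Raw storage per unit of data, and lost copies tolerated, for n-way replication."""
    return float(copies), copies - 1


def erasure_coding_cost(data_units, parity_units):
    """Raw storage per unit of data, and lost units tolerated, for a (data, parity) code."""
    return (data_units + parity_units) / data_units, parity_units


# Assumed schemes: 3-way replication versus a (6 data, 3 parity) erasure code.
rep_storage, rep_tolerance = replication_cost(3)
ec_storage, ec_tolerance = erasure_coding_cost(6, 3)

storage_saving = 1 - ec_storage / rep_storage     # fraction of raw disk space saved
tolerance_gain = ec_tolerance / rep_tolerance - 1  # relative gain in tolerated failures
```

Under these assumptions each unit of data needs 1.5 units of raw storage instead of 3, and a block group tolerates three lost units instead of two.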
As the structure of Hadoop evolves, its use cases are also changing. Compatibility of the public stable Java API is necessary to ensure that end-user programs execute without modification. Compatibility of the private stable API helps with rolling upgrades. Existing MapReduce, YARN and HDFS applications from previous versions are meant to execute unmodified on a Hadoop cluster running an updated version. Wire and network compatibility is also tested in the 2.x updates. Client-server and server-server compatibility is necessary at each level of the updates. For example, a Hadoop 2.4 client should be able to communicate with a 2.3 cluster, so that the client can be upgraded before the server. This enables seamless execution and allows client-side bug fixes to be deployed before the whole cluster needs to be upgraded.
Apache Spark is another open-source big data processing framework. It is built for ease of use, speed and sophisticated analysis. Spark has proved that it holds advantages over other big data technologies such as Hadoop MapReduce and Storm (Meng et al., 2016).
Spark provides a unified and comprehensive framework for managing big data processing requirements across datasets of varying nature, such as graphs and text. Spark can run applications in Hadoop clusters up to 100 times faster in memory and up to 10 times faster on disk. This gives it an advantage over Hadoop MapReduce alone, which cannot leverage cluster memory to the fullest. Spark also allows programmers to write applications quickly in Java, Python or Scala. It comes with a built-in set of around 80 high-level operators, which are useful for querying data interactively within the shell. Apart from MapReduce operations, Spark also supports data streaming, graph processing, machine learning and SQL queries (Sewak & Singh, 2015). These capabilities can be used in stand-alone applications or combined in a single data pipeline.
Data streaming and machine learning are crucial use cases of Apache Spark. They allow organizations to analyze huge volumes of continuously flowing data in real time. E-commerce websites like Amazon, eBay and Alibaba use this approach in order to grow their customer base. They use algorithms such as streaming k-means clustering or the Alternating Least Squares (ALS) technique. The results are then combined with data collected from social media sources, customer preferences on the respective websites, customer search histories, comments on product review forums and more. This gives these companies an edge over their competitors, because the analysis of customer data lets them improve their product recommendations. Since social media produces a wide and frequently updated set of real-time data, Spark is used to cope with big data of such variety and volume. Apart from e-commerce, another common example is Conviva, described as the second most famous video streaming company after YouTube. Conviva uses Spark Streaming to optimize video streams and manage live video traffic. This helps it to maintain consistency and quality.
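The idea behind streaming k-means mentioned above can be sketched in a few lines. The version below is a simplified, per-point online variant written in plain Python: each arriving point is assigned to its nearest center, and that center is moved towards the point by a running-mean update. It is not the Spark MLlib implementation, which works on mini-batches; the initial centers and the sample stream are hypothetical.

```python
def nearest(point, centers):
    """Return the index of the center closest to the point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def update(centers, counts, point):
    """Assign one streamed point to its nearest center and shift that center towards it."""
    i = nearest(point, centers)
    counts[i] += 1
    rate = 1.0 / counts[i]  # running-mean step size
    centers[i] = [c + rate * (p - c) for c, p in zip(centers[i], point)]
    return i


# Hypothetical initial centers; each counts as one observation already seen.
centers = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]

# Hypothetical stream of two-dimensional points arriving one at a time.
stream = [(1.0, 1.0), (9.0, 11.0), (0.0, 2.0), (11.0, 9.0)]
assignments = [update(centers, counts, point) for point in stream]
```

With the running-mean step size, each center always equals the average of the points assigned to it so far, so the model can be kept up to date without storing the stream.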
Stypinski (2017) mentions that Uber, the well-known multinational online taxi dispatch company, makes use of Apache Spark to collect terabytes of event data from its mobile app users. The company combines technologies such as Spark Streaming and HDFS to convert raw, unstructured event data into structured formats. The structured data is then used for more complex calculations and analysis.
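The conversion of raw event data into a structured format can be illustrated with a small plain-Python sketch. The event schema (rider_id, event_type, timestamp), the TripEvent class and the sample lines are all hypothetical and are not taken from Uber or from Stypinski (2017); the sketch only shows the general pattern of parsing each raw line, validating its fields and discarding malformed records.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class TripEvent:
    """Hypothetical structured record for one app event."""
    rider_id: str
    event_type: str
    timestamp: int


def parse_event(raw_line: str) -> Optional[TripEvent]:
    """Turn one raw JSON line into a TripEvent, or return None if it is malformed."""
    try:
        record = json.loads(raw_line)
        return TripEvent(
            rider_id=str(record["rider_id"]),
            event_type=str(record["event_type"]),
            timestamp=int(record["timestamp"]),
        )
    except (ValueError, KeyError, TypeError):
        # Invalid JSON, a missing field or a wrong type: skip the record.
        return None


# Hypothetical raw input: one valid event, one broken line, one incomplete event.
raw_events = [
    '{"rider_id": "r1", "event_type": "request", "timestamp": 1512900000}',
    "not json at all",
    '{"rider_id": "r2", "event_type": "pickup"}',
]
structured = [event for event in map(parse_event, raw_events) if event is not None]
```

In a streaming setting the same parsing function would be applied to each incoming batch of events before the structured records are written to storage.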
Hadoop no longer exists merely as a batch-processing tool. Organizations now also rely on Hadoop for ad hoc analysis. However, Apache Spark has established itself in the market. It gets its edge over Hadoop from its streaming capabilities and its more efficient use of memory and disk. Current trends suggest that Spark is now more widely used in the industry than Hadoop MapReduce. Nevertheless, the choice between Hadoop MapReduce and Spark depends entirely on the complexity of the workload and on the specific big data analysis needs of the company in question.
Cloudera.com. (2017). Marks & Spencer develops next-generation analytics capabilities. Cloudera. Retrieved 10 December 2017, from https://www.cloudera.com/more/news-and-blogs/press-releases/2015-05-13-marks-spencer-develops-next-generation-analytics-capabilities.html
De Mauro, A., Greco, M., & Grimaldi, M. (2015, February). What is big data? A consensual definition and a review of key research topics. In AIP Conference Proceedings (Vol. 1644, No. 1, pp. 97-104). AIP.
Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., ... & Xin, D. (2016). MLlib: Machine learning in Apache Spark. The Journal of Machine Learning Research, 17(1), 1235-1241.
Sewak, M., & Singh, S. (2015). A reference architecture and road map for enabling E-commerce on Apache Spark. Communications on Applied Electronics, 2(1), 37-42.
Stypinski, M. (2017). Apache Spark Streaming.
Zhu, N., Liu, X., Liu, J., & Hua, Y. (2014). Towards a cost-efficient MapReduce: Mitigating power peaks for Hadoop clusters. Tsinghua Science and Technology, 19(1), 24-32.