Track: Data Analytics and Big Data
Abstract
Big Data is a technology that aims to capture, store, manage, and analyze large volumes of data of different types: structured, semi-structured, and unstructured. These data are characterized by the rule of the 5 Vs: Volume, Velocity, Variety, Veracity, and Value. To analyze large amounts of information coming from several sources, the Big Data technological landscape relies on clearly identified tools, including the Hadoop framework and Apache Spark. Hadoop provides massive data storage through the Hadoop Distributed File System (HDFS) and data analysis through the MapReduce model, on a cluster of one or more machines. Apache Spark analyzes distributed data but does not include its own storage system. This article presents a comparison of the Big Data analysis approaches offered by Hadoop and Spark: their architectures, operating modes, and performance. We first give a general overview of Big Data technology. We then discuss the Hadoop framework and its components, HDFS and MapReduce, the first element of our comparison, studying its methodology, its data processing mode, and its architecture, and we close that section with a presentation of the Hadoop ecosystem. We continue with the Apache Spark framework, the second element of our study, describing its features, its ecosystem, and its approach to analyzing large datasets. Finally, we compare Hadoop and Spark on the basis of a detailed study.
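As an illustrative sketch (not taken from the paper), the classic word-count job below shows, in Scala, the division of labour the abstract describes: HDFS supplies the storage layer while Spark performs the distributed analysis, expressing the map and reduce steps of the MapReduce model as RDD transformations. The HDFS paths and application name are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Create a Spark session; Spark handles the analysis, not the storage.
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val sc = spark.sparkContext

    // Read input from HDFS (hypothetical path); HDFS provides the storage layer.
    val lines = sc.textFile("hdfs:///data/input.txt")

    val counts = lines
      .flatMap(_.split("\\s+"))   // split each line into words
      .map(word => (word, 1))     // emit (word, 1) pairs, like a MapReduce mapper
      .reduceByKey(_ + _)         // sum counts per word, like a MapReduce reducer

    // Write the results back to HDFS (hypothetical path).
    counts.saveAsTextFile("hdfs:///data/output")
    spark.stop()
  }
}
```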