5th Annual International Conference on Industrial Engineering and Operations Management

Parallel Implementation of Multiple Linear Regression Algorithm Based on MapReduce

Moufida Adjout
Publisher: IEOM Society International
Track: Systems Engineering
Abstract

Traditional business activities generate large amounts of data, creating data repositories ranging from terabytes to petabytes in size. This information cannot be practically analyzed on a single commodity computer because the data is too large to fit in memory. Processing data of this size therefore requires high-performance analytical systems running in distributed environments. The sheer size of the data also restricts the types of algorithms that can be considered, so standard analytics algorithms need to be adapted to take advantage of cloud computing models, which provide scalability and flexibility. This paper introduces a new distributed training method that combines MapReduce, the widely used framework for big data, with the traditional structure of multiple linear regression. Parallel processing of multiple linear regression is based on the QR decomposition and the ordinary least squares method adapted to MapReduce. Our platform is deployed on the Amazon EMR cloud service. Experimental results demonstrate that our parallel version of multiple linear regression can efficiently handle very large datasets on commodity hardware, with good performance across different evaluation criteria, including the number, size and structure of the machines in the cluster.
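
The abstract does not spell out how the QR decomposition and ordinary least squares are mapped onto MapReduce. As a rough illustration only, the sketch below shows one common arrangement, which is an assumption here and not necessarily the authors' method: each map task compresses its block of the augmented matrix [X | y] into a small upper-triangular R factor, the reduce step stacks R factors and re-factorizes them, and the coefficients are recovered by back substitution. The function names (`r_factor`, `map_block`, `combine`, `solve_ols`) are hypothetical, and the QR step uses plain-Python modified Gram-Schmidt purely for brevity.

```python
from functools import reduce
from math import sqrt


def r_factor(rows):
    """Return the upper-triangular R of a QR factorization of `rows`
    (a list of equal-length rows), using modified Gram-Schmidt."""
    n = len(rows[0])
    cols = [[row[j] for row in rows] for j in range(n)]  # column-major copy
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        norm = sqrt(sum(v * v for v in cols[j]))
        if norm < 1e-12:  # simplification: absolute tolerance for a dependent column
            continue
        R[j][j] = norm
        q = [v / norm for v in cols[j]]
        for k in range(j + 1, n):
            R[j][k] = sum(a * b for a, b in zip(q, cols[k]))
            cols[k] = [b - R[j][k] * a for a, b in zip(q, cols[k])]
    return R


def map_block(block):
    """Map step (assumed): compress one block of (x, y) records into the
    R factor of its augmented matrix [X | y]."""
    return r_factor([list(x) + [y] for x, y in block])


def combine(r_a, r_b):
    """Reduce step (assumed): stack two R factors and re-factorize."""
    return r_factor(r_a + r_b)


def solve_ols(R):
    """Back-substitute R_xx * beta = r_xy, where R = [[R_xx, r_xy], [0, rho]]."""
    p = len(R) - 1
    beta = [0.0] * p
    for i in reversed(range(p)):
        s = R[i][p] - sum(R[i][k] * beta[k] for k in range(i + 1, p))
        beta[i] = s / R[i][i]
    return beta


def make_record(x1, x2):
    """Toy noiseless record for y = 1 + 2*x1 - 3*x2, with an intercept column."""
    return ([1.0, x1, x2], 1.0 + 2.0 * x1 - 3.0 * x2)


# Two blocks stand in for two input splits handled by separate map tasks.
blocks = [
    [make_record(0, 0), make_record(1, 0), make_record(0, 1), make_record(1, 1)],
    [make_record(2, 1), make_record(3, 5), make_record(4, 2), make_record(5, 7)],
]
beta = solve_ols(reduce(combine, map(map_block, blocks)))
print(beta)  # should be close to the generating coefficients 1, 2, -3
```

The appeal of this arrangement is that only small R factors, whose size depends on the number of predictors and not on the number of rows, travel between the map and reduce stages. A production version would more likely use Householder-based QR from a numerical library than Gram-Schmidt.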

Published in: 5th Annual International Conference on Industrial Engineering and Operations Management, Dubai, United Arab Emirates

Date of Conference: March 3-5, 2015

ISBN: 978-0-9855497-2-5
ISSN/E-ISSN: 2169-8767