Benchmarking of distributed computing engines Spark and GraphLab for big data analytics

Jian Wei, Kai Chen*, Yi Zhou, Qu Zhou, Jianhua He

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper we evaluate and compare two representative and popular distributed processing engines for large scale big data analytics, Spark and graph based engine GraphLab. We design a benchmark suite including representative algorithms and datasets to compare the performances of the computing engines, from performance aspects of running time, memory and CPU usage, network and I/O overhead. The benchmark suite is tested on both local computer cluster and virtual machines on cloud. By varying the number of computers and memory we examine the scalability of the computing engines with increasing computing resources (such as CPU and memory). We also run cross-evaluation of generic and graph based analytic algorithms over graph processing and generic platforms to identify the potential performance degradation if only one processing engine is available. It is observed that both computing engines show good scalability with increase of computing resources. While GraphLab largely outperforms Spark for graph algorithms, it has close running time performance as Spark for non-graph algorithms. Additionally the running time with Spark for graph algorithms over cloud virtual machines is observed to increase by almost 100% compared to over local computer clusters.

Original language: English
Title of host publication: Proceedings, 2016 IEEE Second International Conference on Big Data Computing Service and Applications, BigDataService 2016
Place of Publication: Piscataway, NJ (US)
Publisher: IEEE
Pages: 10-13
Number of pages: 4
ISBN (Print): 978-1-5090-2251-9
DOIs: https://doi.org/10.1109/BigDataService.2016.11
Publication status: Published - 23 May 2016
Event: 2nd IEEE International Conference on Big Data Computing Service and Applications - Oxford, United Kingdom
Duration: 29 Mar 2016 to 1 Apr 2016

Conference

Conference: 2nd IEEE International Conference on Big Data Computing Service and Applications
Abbreviated title: BigDataService 2016
Country: United Kingdom
City: Oxford
Period: 29/03/16 to 1/04/16

Fingerprint

Distributed computer systems
Benchmarking
Electric sparks
Engines
Data storage equipment
Scalability
Processing
Program processors
Degradation
Big data
Distributed computing
Graph
Virtual machine
Resources
Benchmark

Cite this

Wei, J., Chen, K., Zhou, Y., Zhou, Q., & He, J. (2016). Benchmarking of distributed computing engines Spark and GraphLab for big data analytics. In Proceedings, 2016 IEEE Second International Conference on Big Data Computing Service and Applications, BigDataService 2016 (pp. 10-13). Piscataway, NJ (US): IEEE. https://doi.org/10.1109/BigDataService.2016.11
Wei, Jian ; Chen, Kai ; Zhou, Yi ; Zhou, Qu ; He, Jianhua. / Benchmarking of distributed computing engines Spark and GraphLab for big data analytics. Proceedings, 2016 IEEE Second International Conference on Big Data Computing Service and Applications, BigDataService 2016. Piscataway, NJ (US) : IEEE, 2016. pp. 10-13.
@inproceedings{285d7657d6894a4999ef3bda41713542,
title = "Benchmarking of distributed computing engines Spark and GraphLab for big data analytics",
abstract = "In this paper we evaluate and compare two representative and popular distributed processing engines for large scale big data analytics, Spark and graph based engine GraphLab. We design a benchmark suite including representative algorithms and datasets to compare the performances of the computing engines, from performance aspects of running time, memory and CPU usage, network and I/O overhead. The benchmark suite is tested on both local computer cluster and virtual machines on cloud. By varying the number of computers and memory we examine the scalability of the computing engines with increasing computing resources (such as CPU and memory). We also run cross-evaluation of generic and graph based analytic algorithms over graph processing and generic platforms to identify the potential performance degradation if only one processing engine is available. It is observed that both computing engines show good scalability with increase of computing resources. While GraphLab largely outperforms Spark for graph algorithms, it has close running time performance as Spark for non-graph algorithms. Additionally the running time with Spark for graph algorithms over cloud virtual machines is observed to increase by almost 100{\%} compared to over local computer clusters.",
author = "Jian Wei and Kai Chen and Yi Zhou and Qu Zhou and Jianhua He",
note = "-",
year = "2016",
month = "5",
day = "23",
doi = "10.1109/BigDataService.2016.11",
language = "English",
isbn = "978-1-5090-2251-9",
pages = "10--13",
booktitle = "Proceedings, 2016 IEEE Second International Conference on Big Data Computing Service and Applications, BigDataService 2016",
publisher = "IEEE",
address = "United States",

}

Wei, J, Chen, K, Zhou, Y, Zhou, Q & He, J 2016, Benchmarking of distributed computing engines Spark and GraphLab for big data analytics. in Proceedings, 2016 IEEE Second International Conference on Big Data Computing Service and Applications, BigDataService 2016. IEEE, Piscataway, NJ (US), pp. 10-13, 2nd IEEE International Conference on Big Data Computing Service and Applications, Oxford, United Kingdom, 29/03/16. https://doi.org/10.1109/BigDataService.2016.11

Benchmarking of distributed computing engines Spark and GraphLab for big data analytics. / Wei, Jian; Chen, Kai; Zhou, Yi; Zhou, Qu; He, Jianhua.

Proceedings, 2016 IEEE Second International Conference on Big Data Computing Service and Applications, BigDataService 2016. Piscataway, NJ (US) : IEEE, 2016. p. 10-13.

TY - GEN

T1 - Benchmarking of distributed computing engines Spark and GraphLab for big data analytics

AU - Wei, Jian

AU - Chen, Kai

AU - Zhou, Yi

AU - Zhou, Qu

AU - He, Jianhua

N1 - -

PY - 2016/5/23

Y1 - 2016/5/23

AB - In this paper we evaluate and compare two representative and popular distributed processing engines for large scale big data analytics, Spark and graph based engine GraphLab. We design a benchmark suite including representative algorithms and datasets to compare the performances of the computing engines, from performance aspects of running time, memory and CPU usage, network and I/O overhead. The benchmark suite is tested on both local computer cluster and virtual machines on cloud. By varying the number of computers and memory we examine the scalability of the computing engines with increasing computing resources (such as CPU and memory). We also run cross-evaluation of generic and graph based analytic algorithms over graph processing and generic platforms to identify the potential performance degradation if only one processing engine is available. It is observed that both computing engines show good scalability with increase of computing resources. While GraphLab largely outperforms Spark for graph algorithms, it has close running time performance as Spark for non-graph algorithms. Additionally the running time with Spark for graph algorithms over cloud virtual machines is observed to increase by almost 100% compared to over local computer clusters.

UR - http://www.scopus.com/inward/record.url?scp=84973661545&partnerID=8YFLogxK

UR - https://ieeexplore.ieee.org/document/7474329/

U2 - 10.1109/BigDataService.2016.11

DO - 10.1109/BigDataService.2016.11

M3 - Conference contribution

AN - SCOPUS:84973661545

SN - 978-1-5090-2251-9

SP - 10

EP - 13

BT - Proceedings, 2016 IEEE Second International Conference on Big Data Computing Service and Applications, BigDataService 2016

PB - IEEE

CY - Piscataway, NJ (US)

ER -

Wei J, Chen K, Zhou Y, Zhou Q, He J. Benchmarking of distributed computing engines Spark and GraphLab for big data analytics. In Proceedings, 2016 IEEE Second International Conference on Big Data Computing Service and Applications, BigDataService 2016. Piscataway, NJ (US): IEEE. 2016. p. 10-13. https://doi.org/10.1109/BigDataService.2016.11