vLocality: Revisiting Data Locality for MapReduce in Virtualized Clouds
Recent years have witnessed a surge of new-generation applications involving big data. MapReduce, the de facto framework for big data processing, has been increasingly embraced by both academic and industrial users. Data locality seeks to co-locate computation with data, which effectively reduces remote data access and improves MapReduce’s performance in physical machine clusters. State-of-the-art public clouds, however, rely heavily on virtualization to enable resource sharing and scaling for a massive number of users. In this article, through real-world experiments, we show strong evidence that the conventional notion of data locality is not always beneficial for MapReduce in a virtualized environment. The observations suggest that the notion of node-local must be extended to distinguish between physical and virtual entities. We develop vLocality, a comprehensive and practical solution for data locality in virtualized environments. It incorporates a novel storage architecture that efficiently mitigates shared-disk contention, and an enhanced task scheduling algorithm that prioritizes co-located VMs. We have implemented a prototype of vLocality based on Hadoop 1.2.1 and validated its effectiveness on a typical virtualized cloud platform consisting of 22 nodes. Our experimental results demonstrate that vLocality can improve the job finish time to around a quarter of that for typical Hadoop benchmark applications.
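As a rough illustration of what distinguishing physical and virtual entities could mean for scheduling, the sketch below ranks a pending task by how close its input replicas are to a free VM: on the same VM, on a co-located VM (same physical machine), in the same rack, or elsewhere. It is not the vLocality implementation. The `VM` record, the four-level ranking, and the names `locality_level` and `pick_task` are hypothetical and introduced only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class VM:
    """Hypothetical description of a worker VM (assumption for this sketch)."""
    name: str  # virtual machine identifier
    host: str  # physical machine the VM runs on
    rack: str  # rack that contains the physical machine


def locality_level(worker: VM, replicas: List[VM]) -> int:
    """Return 0 (same VM), 1 (co-located VM), 2 (same rack), or 3 (off rack)."""
    if any(r.name == worker.name for r in replicas):
        return 0
    if any(r.host == worker.host for r in replicas):
        return 1
    if any(r.rack == worker.rack for r in replicas):
        return 2
    return 3


def pick_task(worker: VM, pending: Dict[str, List[VM]]) -> Optional[str]:
    """Pick the pending task whose input data is closest to the free worker.

    `pending` maps a task id to the VMs holding replicas of its input block.
    Ties are broken by insertion order of the dictionary.
    """
    if not pending:
        return None
    return min(pending, key=lambda task: locality_level(worker, pending[task]))


# Example: vm_a and vm_b share physical machine pm-1.
vm_a = VM("vm-a", "pm-1", "rack-1")
vm_b = VM("vm-b", "pm-1", "rack-1")
vm_c = VM("vm-c", "pm-2", "rack-1")
vm_d = VM("vm-d", "pm-3", "rack-2")

pending_tasks = {"t1": [vm_d], "t2": [vm_c], "t3": [vm_b]}
chosen = pick_task(vm_a, pending_tasks)  # "t3": its data sits on a co-located VM
```

A conventional scheduler that only recognizes node-local and rack-local placement would treat `t2` and `t3` the same from the point of view of `vm-a`. The additional co-located level is what lets the sketch prefer `t3`.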