Parallel Disk-Based Processing of Massive Graphs on a Single Consumer Machine
Kyung Hee University, Department of Computer Engineering
Keywords: Graph Analytics; Parallel Processing; Disk-based Processing
Mining very large graphs has gained great importance due to its wide range of applications. A number of systems, mostly distributed, have been proposed to scale up graph computations. Recently, research in this area has focused on graph processing systems that process a graph on a single consumer machine. This approach is cost effective and reduces the effort needed to implement graph algorithms. However, such solutions rely on external memory, and an efficient solution requires careful design of the storage layout and data structures. Moreover, because the graph is processed in chunks, these solutions also require an efficient message passing interface that propagates the results of computation from one chunk to another in order to tackle the poor locality of graph data. In this thesis, we present two solutions, BiShard Parallel Processor and PostMaster, for parallel disk-based graph processing. Both solutions process massive graphs on a single consumer machine while addressing different issues in existing systems.

BiShard Parallel Processor (BSPP) introduces a new storage structure, BiShard, that significantly reduces the number of non-sequential I/Os. On top of the BiShard storage structure, it implements a new processing model, BiShard Parallel (BSP), that enables full CPU parallelism for processing the graph. We demonstrate through extensive experiments on large real graphs that our solution shows significant performance gains over its prior counterparts.

PostMaster introduces a novel Inter-Subgraph Message Passing Interface (ISMPI) that uses an Intra-Subgraph Messages Index (ISMI) to enable sharing of messages among receivers and to provide direct access to each message without searching. Moreover, PostMaster presents an efficient storage structure that is useful in three ways: 1) it reduces random disk accesses by packing everything together; 2) it stores data in large chunks, which ensures I/O parallelism on SSDs; and 3) it requires almost no effort to map to the in-memory data structures. We demonstrate through extensive experiments on large real-world graphs that our solution significantly and consistently outperforms the state of the art by more than six times.
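The direct-access property claimed for the ISMI can be illustrated with a minimal sketch: a flat message buffer plus an offset table that maps each receiver to its slice of the buffer, so retrieving a receiver's messages requires no scan over other receivers' messages. The class and method names below are illustrative assumptions, not the thesis's actual implementation.

```python
# Illustrative sketch (not the thesis implementation): an offset-based
# message index in the spirit of the Intra-Subgraph Messages Index (ISMI).
# Messages for all receivers are packed into one flat buffer; an offset
# table gives each receiver direct access to its slice without searching.

class MessageIndex:
    def __init__(self):
        self.messages = []   # flat buffer of messages, grouped by receiver
        self.offsets = {}    # receiver id -> (start, count) into the buffer

    def build(self, grouped):
        """grouped: dict mapping receiver id -> list of messages."""
        self.messages = []
        self.offsets = {}
        for receiver, msgs in grouped.items():
            # Record where this receiver's messages begin and how many there are.
            self.offsets[receiver] = (len(self.messages), len(msgs))
            self.messages.extend(msgs)

    def messages_for(self, receiver):
        # Direct slice by offset: no scan over other receivers' messages.
        start, count = self.offsets.get(receiver, (0, 0))
        return self.messages[start:start + count]

idx = MessageIndex()
idx.build({1: [0.5, 0.25], 2: [0.125]})
print(idx.messages_for(1))  # → [0.5, 0.25]
```

Packing all messages into one contiguous buffer also matches the abstract's storage goals: it keeps related data together on disk and maps trivially onto an in-memory array.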