TY - JOUR
T1 - Locality-aware allocation of multi-dimensional correlated files on the cloud platform
AU - Zhang, Xiaofei
AU - Tong, Yongxin
AU - Chen, Lei
AU - Wang, Min
AU - Feng, Shicong
N1 - Publisher Copyright: © 2014, Springer Science+Business Media New York.
PY - 2015/9/23
Y1 - 2015/9/23
N2 - The effective management of enormous data volumes on the Cloud platform has attracted considerable research effort. In this paper, we study the problem of allocating files with multi-dimensional correlations on the Cloud platform, such that files can be retrieved and processed more efficiently. Currently, most prevailing Cloud file systems allocate data following the principles of fault tolerance and availability, while inter-file correlations, i.e., files correlated with each other, are often neglected. In practice, however, data files are commonly correlated in various ways, and correlated files are most likely to be involved in the same computation process. This raises a new challenge: allocating files with multi-dimensional correlations while taking “subspace locality” into consideration to improve the system throughput. We propose two allocation methods for multi-dimensional correlated files stored on the Cloud platform, such that I/O efficiency and data access locality are improved in the MapReduce processing paradigm without compromising the fault tolerance and availability properties of the underlying file system. Unlike the techniques proposed in [1,2], which quickly map the locations of desired data for a given query $\mathcal{Q}$, we focus on improving the system throughput for batch jobs over correlated data files. We formulate the problem and study a series of solutions on HDFS [9]. Evaluations with real application scenarios demonstrate the effectiveness of our proposals: significant I/O and network costs can be saved during data retrieval and processing. In particular, for batch OLAP jobs, our solution achieves a well-balanced workload among distributed computing nodes.
AB - The effective management of enormous data volumes on the Cloud platform has attracted considerable research effort. In this paper, we study the problem of allocating files with multi-dimensional correlations on the Cloud platform, such that files can be retrieved and processed more efficiently. Currently, most prevailing Cloud file systems allocate data following the principles of fault tolerance and availability, while inter-file correlations, i.e., files correlated with each other, are often neglected. In practice, however, data files are commonly correlated in various ways, and correlated files are most likely to be involved in the same computation process. This raises a new challenge: allocating files with multi-dimensional correlations while taking “subspace locality” into consideration to improve the system throughput. We propose two allocation methods for multi-dimensional correlated files stored on the Cloud platform, such that I/O efficiency and data access locality are improved in the MapReduce processing paradigm without compromising the fault tolerance and availability properties of the underlying file system. Unlike the techniques proposed in [1,2], which quickly map the locations of desired data for a given query $\mathcal{Q}$, we focus on improving the system throughput for batch jobs over correlated data files. We formulate the problem and study a series of solutions on HDFS [9]. Evaluations with real application scenarios demonstrate the effectiveness of our proposals: significant I/O and network costs can be saved during data retrieval and processing. In particular, for batch OLAP jobs, our solution achieves a well-balanced workload among distributed computing nodes.
KW - Cloud storage
KW - Distributed data allocation
KW - Multi-dimensional correlation
KW - Subspace locality
UR - https://www.webofscience.com/wos/woscc/full-record/WOS:000360554400003
UR - https://openalex.org/W2002757262
UR - https://www.scopus.com/pages/publications/84937518577
U2 - 10.1007/s10619-014-7153-y
DO - 10.1007/s10619-014-7153-y
M3 - Journal Article
SN - 0926-8782
VL - 33
SP - 353
EP - 380
JO - Distributed and Parallel Databases
JF - Distributed and Parallel Databases
IS - 3
ER -