
Storage optimization for large wide tables in Hadoop

  • Wei Li

Student thesis: Master's thesis

Abstract

Recent advances in data warehousing technologies are enabling the storage and processing of extremely large data sets. In view of this opportunity, the leading cross-bank settlement institute in China is seeking more business intelligence from the large-volume historical transaction data it has accumulated over more than 10 years. Although Hive, a mature open-source data warehousing solution built on top of Hadoop, has been adopted in production, data storage and processing remain inefficient because the system lacks advanced customization and optimization. Specifically, overlapping fractions of the original data set are materialized into different tables, introducing inter-table redundancy. Columns within a table have also been added and changed inconsistently over the data sets' more than 10-year history, resulting in intra-table redundancy. Multiple user groups with different levels of need access the data sets through different processing engines, causing cross-platform redundancy. Based on these observations, we propose an optimization design that is transparent to all users of the system. It minimizes inter-table, intra-table and cross-platform redundancy. It employs a row columnar storage format to improve data storage and processing efficiency and to make the data accessible to multiple processing engines. It also applies workload-based partitioning and indexing strategies to further improve data processing efficiency. We implement our optimization strategies on the institute's system and conduct extensive experiments on one day of historical transaction data. Experimental results show a 40x improvement in data storage efficiency and a 5x speedup on typical query workloads.
Date of Award: 2015
Original language: English
Awarding Institution
  • The Hong Kong University of Science and Technology
