Big Data Architectures

In recent years, the term big data has come to refer to the exponential increase in the amount of data generated by all sorts of data generation and collection systems worldwide. The volume of data is large and growing rapidly, from an estimated 0.9 ZB (zettabytes) of data stored in digital format in 2009 to about 1.8 ZB by 2011, a trend expected to continue in the near future. This growth originates from very diverse sources, with a large variety of data formats requiring non-uniform forms of data processing, which poses its own set of computing challenges. In addition, the high rate at which data is generated every day makes the velocity with which the data can be processed an important challenge.

The analysis of these large data sets has been shown to provide valuable insights into a number of challenging problems in various fields, ranging from physics to medicine. However, carrying out such analyses continues to be hampered by the sheer size of the data sets on the one hand, and by the computational complexity of the algorithms used on the other. The CE lab performs research into HPC infrastructure to enable the efficient processing of these big data problems. Research topics in the lab include addressing the limitations of parallelism on scalable computational infrastructure, the overhead incurred by moving data between memory and processors, and the limitations of accessing and storing such data on storage devices.

Advances in this field have been primarily driven by the industrial need to increase the utilization efficiency of the available infrastructure. One important example is the development of the MapReduce paradigm, which enables the efficient distribution of big data problems across scalable computational clusters. Another example is the development of the Hadoop Distributed File System, which enables the distributed storage of large data files on multiple nodes while ensuring reliability in case of system failure. More recently, Spark has been proposed to enable reliable in-memory computation on big data volumes, thereby reducing the bottleneck incurred by hard disk access latency for memory-intensive tasks. In addition, a host of big data database systems (such as HBase, Cassandra, and MongoDB) is being developed to manage the continuously increasing volumes of data.
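To give a flavor of the MapReduce paradigm mentioned above, the sketch below shows the classic word-count example expressed with Spark's Python API. It is a minimal illustration only, assuming a local PySpark installation; the input path is hypothetical, and the lab's actual workloads are of course far more complex.

from pyspark import SparkContext

# Minimal MapReduce-style word count in PySpark (illustrative sketch only).
# The input path below is a placeholder, not a real data set.
sc = SparkContext("local[*]", "WordCount")

counts = (
    sc.textFile("hdfs:///data/sample.txt")   # read a (distributed) text file
      .flatMap(lambda line: line.split())    # map: emit individual words
      .map(lambda word: (word, 1))           # map: key-value pairs (word, 1)
      .reduceByKey(lambda a, b: a + b)       # reduce: sum the counts per word
)

for word, count in counts.take(10):          # trigger the computation, show a sample
    print(word, count)

sc.stop()

The same map and reduce steps can run unchanged on a single machine or on a large cluster, which is precisely the property that makes the paradigm attractive for scalable big data processing.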

One big data application domain the CE lab focuses on is the field of genomics and personalized medicine. This field in particular, and the whole domain of biomedical science in general, is becoming extremely data-driven. This is due to the dramatic decrease in the price of DNA sequencing over the past decade, which has shifted the bottleneck in DNA analysis from acquiring the DNA data to actually processing and analyzing it. More and more labs are producing ever larger amounts of genomic data that require ever-increasing computational capacity to process. The CE lab is working to streamline the whole genomic computational pipeline, starting from data storage and transfer all the way to data analysis and interpretation. This involves, for example, the optimization of various algorithms to run efficiently on large computer clusters, as well as the development of hardware-accelerated implementations of computationally intensive algorithms in the pipeline. In addition, research is done on efficient data management systems, in order to increase the performance of database systems, as well as on efficient domain-specific compression techniques.
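As a small illustration of what domain-specific compression can mean for genomic data, the toy sketch below packs the four DNA bases into two bits each, a factor-of-four reduction over one byte per base. This is a hypothetical example written for this page, not the lab's actual compression pipeline, and it ignores real-world complications such as ambiguous bases and quality scores.

# Toy domain-specific compression for DNA data: 2 bits per base (A/C/G/T).
# Illustrative sketch only; real genomic compressors handle far more cases.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = "ACGT"

def pack(seq: str) -> bytes:
    """Pack a DNA string containing only A/C/G/T into 2 bits per base."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | ENCODE[base]
        byte <<= 2 * (4 - len(group))  # left-align a final partial group
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, length: int) -> str:
    """Recover the original sequence given its length in bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(DECODE[(byte >> shift) & 0b11])
    return "".join(bases[:length])

seq = "ACGTACGGT"
packed = pack(seq)
assert unpack(packed, len(seq)) == seq
print(len(seq), "bases ->", len(packed), "bytes")

Even such a simple encoding already reduces storage and transfer volumes substantially, which is why tailoring compression to the structure of the data is an attractive research direction for genomics pipelines.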
