Columnar Database and In-memory Processing

Columnar Database, precisely as its name, means a database that primarily stores and processes data by columns, rather than by rows. Most of the traditional relational databases and OLTP (Online Transaction Processing) cubes store and access data by rows. Due to the increasing challenges brought by today’s Big Data problems, which require ever faster processing of huge amount of non-relational data, columnar database is referred more frequently in present-day database technology discussions.

The major advantages of columnar database are typically twofold: First, it offers more efficient compression of the data in storage blocks, which results in overall less storage space and smaller amount of data to load onto disk or memory. Second, due to both the compression and self-indexing nature of the columnar data, it can drastically reduce disk I/O requirements and offer much better query performance, especially for those column-oriented operations such as SUM or COUNT. If the query operations are performed on limited columns only for a large data set with millions or trillions of records, columnar database usually offers notable performance improvement. Today’s well-known commercial products of columnar database include HP Vertica and Amazon Redshift (in AWS offerings), etc.

With memory chip technologies keep improving, RAM price dropping and non-volatile memory technologies readily available at present, in-memory processing or in-memory database (IMDB) becomes naturally affordable and popular. When the entire database or large blocks of data are stored and directly processed in memory, without disk I/O overheads, query performance can be significantly faster and predictable. Under such scenarios, the performance differences of row-based vs. column-based processing for a standard-size database may be less of a concern. Many enterprise database providers are exploring in-memory processing as their fast DB or Big Data solutions today. For example, in-memory options in Microsoft SQL Server 2014, IBM Informix, Oracle RDBMS, etc.

In the coming cloud world, distributed data stores will become more common. When huge amount and large variety of data need to be accessed across various storage units and processed at the same time with hybrid processing nodes, smart optimizations on storage, retrieval mechanisms and query performance may still demand careful considerations. We will likely see more data technology innovations and solutions with more intelligent designs and optimization algorithms built in.