The Big Cost Of Big Data


Guest post written by John Bantleman

John Bantleman is CEO of RainStor, which sells database software used for big data projects.

We’ve entered the age of Big Data, where new business opportunities are discovered every day because innovative data management technologies now enable organizations to analyze all types of data. Semi-structured and unstructured data, generated in vast quantities at network speed, are rich sources of information that tell organizations exactly what customers need and want, and how and why they buy. But with new business opportunity comes cost, and the true costs are yet to be fully appreciated.

Big Data isn’t exactly new. Market leaders have been storing and analyzing multiple data types, not only to gain competitive advantage but also to achieve deeper insights into the customer behavior patterns that directly impact their business.

Two specific sectors - telecommunications and retail - have invested heavily in data warehousing solutions, where large quantities of customer transactions and interactions are accumulated and examined over time to determine key performance indicators, such as revenue per year or per customer, or the cost of customer acquisition through online promotions or seasonal peaks. However, even market leaders can’t afford to store and manage petabytes of raw, detailed data over time in traditional data warehouses. Often they store, say, the last four quarters and then offload the history to offline tape, which isn’t easily accessible. The business challenge comes when Christmas falls on a Saturday and they need to analyze data from seven years back to understand specific patterns. Reinstating older, voluminous data into the warehouse is not only challenging but also costly.

Two key factors come into play regarding enterprise-scale Big Data management and analytics. First, Web innovators such as Facebook, Google and Yahoo have developed a massively scalable storage and compute architecture to manage Big Data: Hadoop, which parallelizes large data sets across low-cost commodity hardware, scaling easily and dramatically reducing the cost of petabyte environments.
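For readers curious what that parallelized pattern actually looks like, here is a minimal, self-contained sketch in plain Python - purely illustrative, with made-up transaction records and no Hadoop cluster involved - showing the map step, the shuffle that groups results by key, and the reduce step that aggregates them. On Hadoop, the map and reduce steps run in parallel across many commodity nodes.

```python
from collections import defaultdict

# Hypothetical transaction records; on Hadoop these would be split into
# blocks and processed in parallel across commodity nodes.
transactions = [
    {"customer": "A", "amount": 120.0},
    {"customer": "B", "amount": 75.5},
    {"customer": "A", "amount": 30.0},
    {"customer": "C", "amount": 210.0},
]

def map_phase(record):
    # Emit (key, value) pairs; each mapper sees only its own slice of the data.
    yield record["customer"], record["amount"]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Aggregate per key; reducers also run in parallel.
    return key, sum(values)

pairs = (pair for record in transactions for pair in map_phase(record))
results = [reduce_phase(k, v) for k, v in shuffle(pairs).items()]
print(results)  # revenue per customer: [('A', 150.0), ('B', 75.5), ('C', 210.0)]
```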

Second, the technology requirements to manage Big Data have moved from the domain of a few distinct markets to growing demand and unique requirements across a range of sectors. Communications operators that manage petabyte-scale data today expect 10-100x growth as the shift to 4G and LTE connects ever more endpoint devices running thousands of mobile apps. The utility smart grid is being plunged into Big Data as cities across the globe join the new “digitized grid.” Financial services institutions are seeing 100 percent compound growth in trading and options data, which must be stored for 7+ years. Over the next three to five years, Big Data will be a key strategy for both private and public sector organizations. In fact, within the next five years, 50 percent of Big Data projects are expected to run on Hadoop.

The reality is that traditional database approaches don’t scale or write data fast enough to keep up with the speed of creation. Additionally, purpose-built data warehouses are great at handling structured data, but there’s a high hardware cost to scale out as volumes grow.

A key enabler for Big Data is the low-cost scalability of Hadoop. For example, a petabyte Hadoop cluster requires between 125 and 250 nodes and costs roughly $1 million. A supported Hadoop distribution adds a similar annual cost (~$4,000 per node), which is a small fraction of the cost of an enterprise data warehouse ($10 million to hundreds of millions). On initial evaluation, Big Data on Hadoop appears to be a great deal. Innovative enterprises have Hadoop today - the question is how they will leverage it and how quickly it will become mission-critical and central to IT focus.
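As a rough back-of-envelope illustration of that math (treating the node counts and per-node support figure above as assumptions rather than vendor quotes):

```python
# Back-of-envelope Hadoop cluster cost, using the figures cited above as assumptions.
nodes_low, nodes_high = 125, 250      # nodes for roughly one petabyte
hardware_cost = 1_000_000             # approximate total hardware spend (USD)
support_per_node_per_year = 4_000     # supported distribution, per node, per year

for nodes in (nodes_low, nodes_high):
    annual_support = nodes * support_per_node_per_year
    print(f"{nodes} nodes: ~${hardware_cost:,} hardware + ~${annual_support:,}/year support")

# 125 nodes: ~$1,000,000 hardware + ~$500,000/year support
# 250 nodes: ~$1,000,000 hardware + ~$1,000,000/year support
```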

The real cost, however, lies in operating, managing and integrating Big Data within the existing ecosystem. As Big Data environments scale - Yahoo manages roughly 200 petabytes across 50,000 nodes - more nodes must be added to deliver additional storage capacity. Many Web 2.0 organizations running Hadoop rely completely on the redundancy of data, but an enterprise bank or communications operator must adhere to standards-based security, disaster recovery and availability. As Hadoop exists today, it introduces more complex management and the need for skilled resources.

Behind the surface of Big Data on Hadoop deployments, many innovators of the open source platform have invested in and created the “Data Scientist” - essentially a statistician who can program natively and leverage MapReduce frameworks. In order to integrate MapReduce, most enterprises need to develop an entirely new skill base, and the human capital investment will quickly outweigh the infrastructure investment. Additionally, Big Data in Hadoop must be integrated with the existing data warehouse and business intelligence infrastructure so that existing tools and skills can be applied to it. The inability to use standards such as SQL on Hadoop requires further investment without reducing the cost of the data warehouse.

Big Data offers big business gains, but hidden costs and complexity present barriers that organizations will struggle with. Although Hadoop is relatively new to the enterprise, it’s making great strides toward improving reliability and ease of use. There’s no shortage of innovation coming from start-ups and major contributors to the Apache open source project. The two areas that will have the most serious impact on both ease of adoption and cost are:

  • the ability to leverage existing SQL query languages and BI tools against data within Hadoop; and
  • the ability to compress data at the most granular level, which will not only reduce storage requirements but also drive down the number of nodes and simplify the infrastructure (a rough sketch of this math follows the list).
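To illustrate that second point, here is a rough sketch of how compression ratios could translate into node counts; the per-node capacity and compression ratios are purely hypothetical assumptions, chosen only to line up with the ~250-nodes-per-petabyte figure cited earlier.

```python
import math

# Purely illustrative: how granular compression could shrink a Hadoop cluster.
raw_data_tb = 1_000           # roughly one petabyte of raw data
usable_tb_per_node = 4        # assumed usable capacity per commodity node (~250 nodes/PB)

for ratio in (1, 5, 10, 20):  # assumed compression ratios
    stored_tb = raw_data_tb / ratio
    nodes = math.ceil(stored_tb / usable_tb_per_node)
    print(f"{ratio}:1 compression -> ~{nodes} nodes for {raw_data_tb} TB of raw data")

# 1:1 -> ~250 nodes, 5:1 -> ~50 nodes, 10:1 -> ~25 nodes, 20:1 -> ~13 nodes
```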

Without these two capabilities, building the necessary skills will take time and money and won’t keep pace with business demands. Data growth rates will simply outpace the economics of scaling to manage the hundreds of terabytes to petabytes of Big Data that arrive every day.

CIOs and CTOs must take a closer look at the true cost of Big Data. One thing is proven: the benefits of leveraging Big Data will outweigh the IT investment, and for that we can thank our grassroots innovators. By how much is the question.