Abstract
In this white paper, we share some of the best practices and strategies for lowering the Total Cost of Ownership (TCO) of your Big Data solutions. We discuss the cost-related challenges of Big Data solutions and look at the technology options available to overcome them.
Table of Contents
Introduction
Using commodity hardware for Big Data
Using Open Source and cloud computing
The cost components of a Big Data Warehouse
    Entry cost
    Cost associated with migrating the data
    Other costs
Lowering the TCO of Big Data
Reducing the cost of storage
What technologies, where?
Big Data scenarios in OLAP
Analytics with Hadoop
    Indirect analytics over Hadoop
    Direct analytics over Hadoop
    Analytics over Hadoop with MPP Data warehouse
Selecting the right technologies
Opting for faster Map Reduce/Hadoop
NoSQL solutions
New era RDBMS versions
Impetus solutions and recommendations
Conclusion
Introduction
The potential of Big Data is growing at a rapid rate. According to IDC/EMC estimates, the computers, networks, and storage facilities that drive the digital universe currently cost a whopping USD 6 trillion. An extra USD 650 billion is spent on data that is of little use: the overload of information adds to productivity and storage costs, and much of the data eventually goes to waste.
Over the next few years, these numbers are expected to grow significantly. According to one study, the size of the digital universe doubles every 18 months. Yet although we have a rich pool of data at our disposal, we are still extracting poor information from it. There is much more that can be done to unearth better, actionable insights and intelligence from this Big Data. Currently, several solutions are available for cost-effective Big Data analytics. Let us examine some of the pros and cons of these offerings, beginning with commodity hardware.
Using cloud computing for Big Data also has its advantages and disadvantages. The advantage is that you can rent resources over the cloud to handle your data and analytics. Prominent service providers include Amazon Web Services and Microsoft, with its Windows Azure platform; you can select an offering from their portfolios that is appropriate for your needs. On the minus side, storage over the cloud is not always an economical option.
Entry cost
The first element is entry cost. This is the cost that you will incur while experimenting with your data and identifying whether a particular Big Data solution meets your requirements.
Other costs
The other important cost component is performing analytics over the data stored within the system. Manageability also incurs a cost, since the system must be easy to operate under both scaling and failure conditions. Ongoing, recurring maintenance is another cost factor that cannot be ignored: your Big Data Warehouse will always require monitoring and tuning as the data grows or changes are made. Together, all of these factors increase the costs associated with Big Data analytics.

Based on its extensive experience and expertise in Big Data, Impetus has identified some best practices that can help you reduce the Total Cost of Ownership (TCO) of your Big Data solutions. We categorize them here on the basis of hardware and software.
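As a rough illustration of how these components add up, the sketch below models TCO as one-time costs plus recurring maintenance over the system's lifetime. The function name, cost categories, and all figures are hypothetical placeholders, not from any vendor's pricing.

```python
def total_cost_of_ownership(entry, migration, analytics,
                            maintenance_per_year, years):
    """Sum one-time costs with recurring maintenance over the lifetime.

    All figures are illustrative; a real TCO model includes many more
    factors (licensing, staffing, power, support contracts).
    """
    one_time = entry + migration
    recurring = maintenance_per_year * years
    return one_time + analytics + recurring

# Hypothetical numbers (USD): a small cluster run for three years.
cost = total_cost_of_ownership(entry=50_000, migration=20_000,
                               analytics=30_000,
                               maintenance_per_year=15_000, years=3)
```

Note how the recurring maintenance term grows with the deployment's lifetime, which is why the monitoring and tuning costs above cannot be ignored.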
We do not recommend using Big Data solutions for the storage and retrieval of small amounts of data, as the relative latency of a fetch will be higher. Better insights and more conclusive findings can often be drawn from a small data set. Our advice, therefore, is not to take your eyes off the Small Data; it is very important too.
If you are considering the cloud as a potential solution for your Big Data, ask yourself whether moving to the cloud is the only answer to your data storage requirements. Such a move can be expensive, especially when the data is not already in the cloud: you will have to upload all the data required for processing, adding to your cost.

Having covered which technologies, let us now talk about where to use them. You can broadly classify the scenarios under two categories: Online Analytical Processing (OLAP) and Online Transaction Processing (OLTP). If you are generating or working with large sets of data in an OLTP scenario, cost-effective NoSQL solutions will come to your rescue. In a typical data warehouse situation, which requires analytical processing, Map Reduce or MPP-based systems are a good option. Let us look at the possible Big Data scenarios, where we largely deal with analytical processing.
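The OLTP-versus-OLAP split above can be expressed as a small decision helper. This is only a sketch of the classification described in the text; the function name and the mapping are ours, not a formal rule.

```python
def suggest_technology(workload):
    """Map a workload type to the technology class discussed above.

    'oltp' -> NoSQL stores (fast reads/writes on transactional data)
    'olap' -> Map Reduce/Hadoop or MPP warehouses (analytical processing)
    """
    recommendations = {
        "oltp": "NoSQL (e.g. HBase, Cassandra, MongoDB)",
        "olap": "Map Reduce/Hadoop or an MPP data warehouse",
    }
    try:
        return recommendations[workload.lower()]
    except KeyError:
        raise ValueError(f"unknown workload type: {workload!r}")
```

In practice, of course, the choice also depends on latency needs and data volume, as the following sections discuss.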
For smaller data volumes, you can use Massively Parallel Processing (MPP) systems, a traditional RDBMS, or the newer NoSQL databases; these systems offer the lowest latency in such a scenario. However, as you move from smaller amounts of data to Big Data, the latency of these systems increases while the cost per gigabyte decreases. We know that Hadoop systems are cost-effective. For small-data solutions where latency is the key factor, however, it is better to opt for customized, tailored solutions that enable quicker retrieval of data; the downside is that deploying these solutions pushes the storage cost per GB higher. MPP, on the other hand, provides significant benefits, and there are good reasons for a warehouse to use MPP solutions: these systems provide relational stores while accommodating larger sizes of data. In a typical scenario, you may need to deploy all or a combination of these systems at the same time. Let's look at some of them.
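The trade-off between cost per gigabyte and latency can be made concrete with a back-of-the-envelope comparison. All prices below are hypothetical placeholders for illustration, not quotes for any real product.

```python
def storage_cost(data_gb, cost_per_gb, fixed_cost):
    """Total storage cost: a fixed platform cost plus a per-GB rate."""
    return fixed_cost + data_gb * cost_per_gb

# Hypothetical: a low-latency appliance carries a high fixed cost and a
# high per-GB rate; a commodity Hadoop cluster is cheap per GB but slower.
appliance = storage_cost(data_gb=10_000, cost_per_gb=20.0, fixed_cost=100_000)
hadoop    = storage_cost(data_gb=10_000, cost_per_gb=1.0,  fixed_cost=30_000)
```

At 10 TB, the per-GB rate dominates, which is why commodity Hadoop clusters win on cost once data reaches Big Data scale, even though their latency is higher.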
However, the disadvantage of this approach is cost. Most MPP data warehouses are expensive to acquire, and some also require high-end servers for deployment, which can be an expensive proposition. Having discussed the Big Data strategies that can be adopted to reduce TCO, let us once again turn to which technologies to use.
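The analytics-over-Hadoop approaches discussed above all ultimately execute Map Reduce jobs. A minimal pure-Python word count sketches the programming model; this illustrates the map and reduce phases only, not Hadoop's distributed execution, shuffle, or storage layers.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs, as a Hadoop mapper would."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum counts per key, as reducers do after the shuffle."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data", "big insights from big data"]
word_counts = reduce_phase(map_phase(docs))
```

Because both phases operate on independent key-value pairs, the same logic parallelizes across a cluster of commodity machines, which is what keeps the cost per gigabyte low.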
NoSQL solutions
In an OLTP scenario you require faster reads and writes. Vendors providing solutions in this area have different underlying implementations, each suited to a particular business use case. If you require random, real-time read/write access to Bigtable-like data, you can use HBase. If you require faster writes, you might want to check Cassandra; these are suited to industries like banking and finance. If your transaction data is stored mostly for query purposes and you need to define indexes, you can use MongoDB or CouchDB. There are other databases too, such as graph databases like Neo4j, which allow Big-Data-heavy social media analytics problems to be solved more easily.
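The point about defining indexes for query-heavy transaction data can be illustrated with a toy in-memory secondary index. This is a sketch of the idea only; it is not how MongoDB or CouchDB actually implement indexing, and the record layout is hypothetical.

```python
from collections import defaultdict

def build_index(records, field):
    """Build a secondary index: field value -> list of record positions."""
    index = defaultdict(list)
    for pos, record in enumerate(records):
        index[record[field]].append(pos)
    return index

transactions = [
    {"id": 1, "account": "A", "amount": 250},
    {"id": 2, "account": "B", "amount": 75},
    {"id": 3, "account": "A", "amount": 40},
]
by_account = build_index(transactions, "account")

# Lookup by account is now a dictionary access, not a full scan.
account_a = [transactions[i] for i in by_account["A"]]
```

The index turns each query into a hash lookup instead of a scan over every record, which is exactly why index support matters when transactional data is kept mostly for querying.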
Our solutions can automatically provision multiple Hadoop clusters and also offer centralized management for these clusters. For ongoing maintenance, our success mantra is: automate, automate, automate! If a task must be carried out more than once, automate it; this goes for monitoring and tuning too. To deal with changing capacity, you can keep adding hardware or look for alternatives that help speed things up, such as using Graphics Processing Units (GPUs) for general-purpose computing. We also recommend RainStor and similar solutions that compress the data and reduce the hardware cost of data storage. Finally, faster, tailored Map Reduce solutions will help you complete more tasks in less time.
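The compression point can be illustrated with Python's standard zlib module. The sample data and the resulting ratio are illustrative only; specialized warehouse compressors such as RainStor's achieve very different ratios on real data.

```python
import zlib

# Repetitive, log-like data compresses extremely well; real warehouse
# data varies widely in how compressible it is.
raw = b"2013-01-01 INFO request served\n" * 1000
compressed = zlib.compress(raw, level=9)

ratio = len(raw) / len(compressed)
savings = 1 - len(compressed) / len(raw)
```

Even a modest compression ratio translates directly into fewer disks, which is how compression lowers the hardware component of TCO.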
Conclusion
In summary, best practices and robust strategies can help you lower the TCO of your Big Data solutions and overcome the challenges associated with Big Data. Impetus has successfully dealt with Big Data problems, and has used the Hadoop ecosystem to overcome many of these challenges.
About Impetus

Impetus Technologies offers Product Engineering and Technology R&D services for software product development. With ongoing investments in research and application of emerging technology areas, innovative business models, and an agile approach, we partner with our client base comprising large-scale ISVs and technology innovators to deliver cutting-edge software products. Our expertise spans the domains of Big Data, SaaS, Cloud Computing, Mobility Solutions, Test Engineering, Performance Engineering, and Social Media, among others.

Impetus Technologies, Inc.
5300 Stevens Creek Boulevard, Suite 450, San Jose, CA 95129, USA
Tel: 408.213.3310 | Email: inquiry@impetus.com
Regional Development Centers - INDIA: New Delhi, Bangalore, Indore, Hyderabad
Visit: www.impetus.com
Disclaimers
The information contained in this document is the proprietary and exclusive property of Impetus Technologies Inc. except as otherwise indicated. No part of this document, in whole or in part, may be reproduced, stored, transmitted, or used for design purposes without the prior written permission of Impetus Technologies Inc.