One of the hot topics of the past few years has been Big Data, and everyone is jumping on the bandwagon for good reason. Big data can provide insights into your business, improve forecasting, and drive better decision-making. Companies that use data for decision-making have demonstrated better performance results. But with the large amount of information that businesses can access today, a problem has emerged: the time required to process all this data.
So far, most analytics and decision-making have been done by processing data offline in batches. But what if we need results in real time? How do we get a faster response so that businesses can make the right choice in time? Every business I am familiar with today is looking for near on-demand reporting and the ability to react within a short span of time.
Our world is only getting more complex. Now we have two problems to worry about: storing a large amount of data on a constant basis, and analyzing that vast amount of data in real time.
What is the right approach to solve these problems?
Maybe you are already muttering the right answer: Hadoop 2.0. It is a strong solution for managing and processing structured and unstructured data. It introduces HDFS federation, which comprises two major components: a namespace service and a block storage service. While the namespace service manages operations on files and directories, such as creating and modifying them, the block storage service implements DataNode cluster management, block operations, and replication. HDFS federation brings important measures of scalability and reliability to Hadoop.
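To make the federation idea concrete, here is a minimal configuration sketch for hdfs-site.xml, assuming two hypothetical NameNodes; the nameservice IDs and hostnames are placeholders, not a recommended layout:

```xml
<!-- Sketch: two federated namespaces; hostnames are placeholders -->
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>namenode1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>namenode2.example.com:8020</value>
  </property>
</configuration>
```

With a setup along these lines, each DataNode registers with both NameNodes and stores blocks for both namespaces, which is how federation scales the namespace horizontally.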
The other major advantage of Hadoop 2.0 is YARN.
YARN is a resource manager created by separating the processing engine from the resource management capabilities that were bundled together in MapReduce in Hadoop 1. YARN is also called the operating system of Hadoop because it is responsible for managing and monitoring workloads, maintaining a multi-tenant environment, implementing security controls, and managing Hadoop's high availability features.
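As a rough sketch of how YARN's resource management is expressed in practice, a yarn-site.xml might declare the ResourceManager and per-node resource limits like this; the hostname and values are illustrative assumptions only:

```xml
<!-- Sketch: basic YARN resource settings; values are illustrative -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager.example.com</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
  </property>
</configuration>
```

The ResourceManager schedules containers across the cluster within these per-node limits, which is what allows multiple tenants and workloads to share one Hadoop cluster.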
With the improvements Hadoop 2.0 brings, plus the integration of Elasticsearch, a search engine based on Lucene, we are approaching the blueprint of a real-time analytics platform. From the first day, I was impressed with Elasticsearch. Having worked with Solr, another search engine built on Lucene, I ran into a big problem with indexing as my database grew in size. Elasticsearch resolved my problem very simply, as it has real-time indexing: a powerful feature for today's needs, when we want everything instantly. Here are some additional features that demonstrate how Elasticsearch is a natural fit with Hadoop:
  1. Elasticsearch integrates with Hadoop and enables Hadoop jobs to index documents directly, as shown in the image above, which I created for illustration purposes.
  2. Elasticsearch can use HDFS as a long-term archive.
  3. Elasticsearch allows queries to be issued directly from Hadoop.
  4. Elasticsearch provides a robust query language.
  5. Elasticsearch has a real-time indexing engine.
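To give a flavor of the query language mentioned above, here is a minimal sketch in Python that builds an Elasticsearch query DSL body as a plain dictionary. The index and field names (`trades`, `symbol`, `@timestamp`) are hypothetical, chosen only for illustration:

```python
# Build an Elasticsearch query body (the JSON query DSL) as a plain dict.
# Field names "symbol" and "@timestamp" are hypothetical illustrations.

def build_recent_trades_query(symbol, minutes=5):
    """Match documents for one symbol within the last `minutes` minutes."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"symbol": symbol}}
                ],
                "filter": [
                    {"range": {"@timestamp": {"gte": "now-%dm" % minutes}}}
                ],
            }
        },
        "sort": [{"@timestamp": {"order": "desc"}}],
    }

query = build_recent_trades_query("ACME", minutes=10)
```

With the official Python client, a body like this could be sent with something along the lines of `es.search(index="trades", body=query)`, assuming a running cluster.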
These are ideal examples of where real-time analytics makes a significant difference:
  • In Customer Relationship Management (CRM), real-time analytics can provide up-to-the-minute information about the customer so that business decisions can be made fast. Real-time analytics can also push instant updates to corporate dashboards to reflect these business changes.
  • In the financial sector, real-time analytics supports fast decisions (buy, sell, or hold) based on the information available at that moment.
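For the dashboard scenario above, a typical pattern is an aggregation over a recent time window. Here is a hedged sketch in Python of such an aggregation body; the index and field names (`crm-events`, `channel`, `@timestamp`) are hypothetical, not a fixed schema:

```python
# Sketch of an Elasticsearch aggregation body for an up-to-the-minute
# dashboard: count recent customer interactions per channel.
# Field names "channel" and "@timestamp" are hypothetical illustrations.

def build_dashboard_aggregation(window="now-1h"):
    """Bucket events since `window` by channel using a terms aggregation."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {
            "range": {"@timestamp": {"gte": window}}
        },
        "aggs": {
            "by_channel": {
                "terms": {"field": "channel", "size": 10}
            }
        },
    }

agg_body = build_dashboard_aggregation()
```

Because Elasticsearch indexes in real time, re-running an aggregation like this every few seconds is what keeps such a dashboard current.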