Faster Reporting with Hadoop
Wed 29 Apr 2015
Nathan Nickels, Head of Marketing and Operations at MetaScale, a big data company of Sears Holdings Corporation, discusses how open source tools and technologies are being leveraged to give business users across the enterprise access to more data faster than ever.
If you are in enterprise IT, you have likely been asked by Business (Marketing or Finance, perhaps) to provide a platform that gives them faster reporting capabilities. Or, being proactive, you are always pushing the envelope before the requests come. Either way, Hadoop has probably crossed your radar, if you haven’t implemented it already.
Hadoop can be many things to many people. The need for faster reporting, as in real-time fast, is a growing requirement for most businesses to stay competitive. By establishing Hadoop as an enterprise data hub, or data lake as it is sometimes called, organizations can build an enterprise reporting framework that enables interactive and sub-second reporting capabilities. These capabilities can be augmented with the evaluation and implementation of the right NoSQL database for real-time analytics needs.
Hadoop as an Enterprise Data Hub
Since its early days of adoption at enterprise organizations, Hadoop and its ecosystem of open source tools have helped solve many traditional enterprise challenges. Early on, the possibility of consolidating hardware and software within the Hadoop infrastructure led organizations to explore the concept of the enterprise data hub. Indeed, the ability to cost-effectively establish a single version of the truth and store data at its most granular level became an attractive use case for the massively parallel processing power of Hadoop.
The enterprise data hub gives business analysts greater access to data, both in the size and number of data sets available and in how quickly they can reach them. ETL processes have long been a bottleneck for business users, who had to wait for a batch process to be set up for each analytics job. With the enterprise data hub, the data is extracted and loaded into the hub once, so users can run as many transformations on it as needed. By integrating NoSQL databases into the environment, users can achieve near real-time access to data as soon as it is created.
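The "load once, transform many times" idea can be sketched in a few lines of Python. This is only an illustration: the `RAW_SALES` records, field names and both transformation functions are hypothetical, and a plain list stands in for data that would actually live in the Hadoop data hub.

```python
from collections import defaultdict

# Hypothetical raw sales records, loaded into the hub once at full granularity.
# In a real deployment these would be files or tables in Hadoop, not a list.
RAW_SALES = [
    {"store": "A", "sku": "tv-42", "units": 2, "price": 300.0},
    {"store": "A", "sku": "hdmi", "units": 5, "price": 10.0},
    {"store": "B", "sku": "tv-42", "units": 1, "price": 300.0},
]


def revenue_by_store(records):
    """Transformation 1: total revenue per store."""
    totals = defaultdict(float)
    for r in records:
        totals[r["store"]] += r["units"] * r["price"]
    return dict(totals)


def units_by_sku(records):
    """Transformation 2: total units sold per SKU."""
    totals = defaultdict(int)
    for r in records:
        totals[r["sku"]] += r["units"]
    return dict(totals)


if __name__ == "__main__":
    # Both reports read the same loaded data; no new extract or load step
    # is needed when an analyst asks a new question.
    print(revenue_by_store(RAW_SALES))
    print(units_by_sku(RAW_SALES))
```

The point of the sketch is that adding a third report means writing a third function, not scheduling another batch extract.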
Figure: Reference Architecture of an Enterprise Data Hub based on the Hadoop platform. This framework provides a presentation layer for business users to access enterprise data with necessary security controls.
Enterprise Reporting Framework
An open source framework based on the Hadoop enterprise data hub and NoSQL databases provides a range of analytics capabilities including batch, self-service, in-memory, advanced analytics, embedded analytics, multi-dimensional and real-time analytics. This reporting framework provides speed, scale and the ability to run SQL analytics. In Hadoop, the self-service and in-memory analytics capabilities can provide the fast reporting needed by Business.
The self-service reporting capabilities of Hadoop allow business users at various levels to run reports on enterprise data specific to their needs. For example, a Marketing Analyst may require access to complete data sets at their most granular level to generate detailed reports on customer behavior or campaign performance, whereas a VP of Marketing may only require access to aggregate data sets to report on high-level trends. Giving all stakeholders the appropriate level of access lets them generate the reports they need when they need them.
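As a rough illustration of role-appropriate access, the sketch below returns granular rows to an analyst and only aggregates to a VP. The role names, the `CAMPAIGN_EVENTS` data and the `report_for` function are all hypothetical; a real deployment would enforce this through the platform's own access controls rather than application code.

```python
from collections import defaultdict

# Hypothetical campaign data at its most granular level (one row per event).
CAMPAIGN_EVENTS = [
    {"customer": "c1", "campaign": "spring", "spend": 40.0},
    {"customer": "c2", "campaign": "spring", "spend": 60.0},
    {"customer": "c1", "campaign": "summer", "spend": 25.0},
]


def report_for(role, events):
    """Return the view of the data that matches the caller's role."""
    if role == "analyst":
        # Analysts see every row, copied so callers cannot mutate the source.
        return [dict(e) for e in events]
    if role == "vp":
        # VPs see only per-campaign totals, not customer-level rows.
        totals = defaultdict(float)
        for e in events:
            totals[e["campaign"]] += e["spend"]
        return [
            {"campaign": name, "total_spend": total}
            for name, total in sorted(totals.items())
        ]
    raise PermissionError(f"role {role!r} has no access to campaign data")


if __name__ == "__main__":
    print(report_for("analyst", CAMPAIGN_EVENTS))
    print(report_for("vp", CAMPAIGN_EVENTS))
```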
The in-memory reporting capabilities of Hadoop can provide the business user with sub-second, interactive analytics reports on large data sets. For example, by storing large data sets (millions or billions of rows) of historical stock market data in Hadoop, users can query price trends for specific stocks with sub-second results. The user can drill down to whatever level of detail is needed with minimal latency.
Real-Time Analytics with NoSQL
Many enterprises rely on the legacy relational database to store and process data. While the relational database has its place in enterprise data management, many organizations are finding that these legacy systems do not meet modern analytics requirements. For example, one retail organization struggled to accurately track inventory for items sold in-store and online.
In this example, the retailer was storing POS data in different formats in different legacy systems. Inventory reports were generated in a batch file that was sent once a day. This latency risked lost sales and caused customer dissatisfaction when items that showed online as being in stock were actually out of stock.
By using the NoSQL database Cassandra to handle messages from the POS queue for real-time processing, the retailer was able to reduce inventory management reporting times from once a day to minutes or even seconds. The integration of the NoSQL database gave BI teams access to more accurate, real-time reporting on inventory, pricing, sales and return data, which in turn resulted in more efficient truck load times and fewer complaint calls to the call center.
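The move from a daily batch file to per-message updates can be sketched as follows. This is not the retailer's implementation: the message fields, the `apply_pos_messages` function, and the use of Python's standard `queue.Queue` and a dictionary in place of the POS queue and a Cassandra table are all assumptions made for illustration.

```python
import queue


def apply_pos_messages(pos_queue, inventory):
    """Drain pending POS messages and update stock counts immediately.

    pos_queue: stand-in for the POS message queue.
    inventory: dict keyed by (store, sku), stand-in for a NoSQL table.
    Returns the number of messages processed.
    """
    processed = 0
    while True:
        try:
            msg = pos_queue.get_nowait()
        except queue.Empty:
            break
        # A sale reduces stock; a return adds it back.
        delta = -msg["qty"] if msg["type"] == "sale" else msg["qty"]
        key = (msg["store"], msg["sku"])
        inventory[key] = inventory.get(key, 0) + delta
        processed += 1
    return processed


if __name__ == "__main__":
    q = queue.Queue()
    q.put({"type": "sale", "store": "s1", "sku": "tv-42", "qty": 2})
    q.put({"type": "return", "store": "s1", "sku": "tv-42", "qty": 1})
    stock = {("s1", "tv-42"): 5}
    apply_pos_messages(q, stock)
    print(stock)
```

Because each message is applied as it arrives, a report that reads the inventory table sees stock levels that are seconds old rather than up to a day old.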
Securing Your High-Speed Big Data Environment
One of the benefits of using Hadoop as an enterprise data hub is that it gives business users faster, unfettered access to more data. Enterprises, though, need assurance that the data being stored and processed internally is protected through strong access control, auditing and governance.
When developing a big data strategy, organizations need to consider a comprehensive solution for data security and data governance for their enterprise Hadoop implementation. Data security and data governance can be achieved by an optimum combination of appropriate security tools with customized configuration, clear policy definition and adherence to best practices.
Enterprises should adopt an approach that secures their data and complies with regulatory requirements by encrypting data that is stored and processed by Hadoop systems, centralizing key management, enforcing access control policies and gathering security intelligence on data access. By taking this approach, enterprises should be able to ensure a secure big data environment that does not hamper the user’s ability to access, analyze and report on data at the speed that modern business demands.
Maximize the Value in Your Data
By establishing an enterprise data hub with an enterprise reporting framework on Hadoop, integrating the right NoSQL databases for real-time analytics requirements, and providing a secure environment with minimal impact on necessary user access, IT organizations will be able to provide a highly flexible, scalable and user-friendly platform that can add value to the business.
Whether you are just exploring big data capabilities or you are an advanced Hadoop shop, these frameworks, in combination with enterprise data governance best practices, will help ensure that the data is clean and that business users can access more of the data faster than ever.