Kartic's Musings on Corporate Information and Knowledge Management

August 10, 2015

Big Data Analytics with Hadoop – Case Study

Filed under: SharePoint — kartickapur @ 1:51 am

In my previous post I talked about Big Data at a high level and the merits of extending your BI strategy into a Big Data Analytics roadmap. I thought it would be useful to deep-dive into a case study of the Hadoop platform to understand the concept and its capabilities better.

I found a handy, detailed case study by Hortonworks for the finance sector.

What’s Hadoop?


Hadoop is an open-source platform to which key technology powerhouses contribute enhancements, each bringing its own use cases.

The following companies contribute to the Hadoop open-source project:

  • Microsoft
  • SAP
  • Teradata
  • Yahoo
  • Facebook
  • Twitter
  • LinkedIn
  • Many more

Use cases and data types across industries


Given below are the use cases and data types that can be captured in the big data landscape:


Data architecture with Apache Hadoop on Windows

Reference: Hortonworks 2014, ‘Modern Data Architecture for financial services with Apache Hadoop on Windows’, The journey to a financial services data lake.

Business case

High-level business case categories:

  • Maximise opportunity
  • Minimise risk
  • Better serve customers
  • Enhance financial management
  • Develop innovative new business models
  • Keep pace with competition

Data sources:

  • Web and connected devices
  • Social media
  • Partners
  • CRM systems
  • Marketing and advertising databases
  • Order management systems

Challenges with the existing data warehouse and BI architecture:

  • Exponential growth – from 2.8 ZB in 2012 to an estimated 40 ZB by 2020.
  • Varied nature – incoming data can have little or no structure, or structure that changes too frequently for reliable schema creation at the time of ingest.
  • Value at high volumes – incoming data can have little or no value as individual or small groups of records, but at high volumes and with a longer historical perspective the data can be inspected for patterns and used in advanced analytic applications.

What gaps does Hadoop fill?

Technology (high level)

  • Apache Hadoop collects and manages diverse volumes of unstructured and semi-structured data alongside traditional repositories such as the enterprise data warehouse.
  • Hadoop also fulfils the vision of an enterprise-wide repository for big data, frequently known as a ‘data lake’. This provides a scalable and flexible storage system that can accept data in any format.
  • It provides an application framework that allows different types of processing workloads to interact with a common pool of stored data.

Business (high level)

  • New efficiencies: through significantly lower cost of storage and the optimisation of data processing workloads such as data transformation and integration.
  • New Opportunities: through accelerated analytical applications, able to access all enterprise data in both batch and real-time modes.
  • New insights: through allowing data from traditional and emerging data sources to be retained, combined and mined in new and unforeseen ways.

Modern Architecture with Apache Hadoop Integrated into existing data systems:


New opportunity for Analytics

  • Schema on read – unlike a data warehouse, where data is transformed into a specified schema when it is loaded into the warehouse (schema on write), Hadoop empowers users to store data in its raw format. Analysts can then create a schema to suit the needs of their application or analysis at the time of use.
    • Example: combine CRM data with clickstream data (server logs from the web site, sentiment data from social media, etc.). It is hard to structure such data at the time of entry; with Hadoop, structure is applied and sense is made when the data is read.
  • Multi-use, multi-workload data processing – multiple access methods (batch, real-time, streaming, in-memory, etc.) allow analysts to transform and view data in multiple ways, across various schemas.
    • For example, a credit issuer may choose to run an advanced fraud-prevention application against incoming transactions in real time, and run a series of batch reporting and analysis processes overnight – both can happen on a single cluster of shared resources against a single version of the data using Hadoop.
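As a minimal sketch of the schema-on-read idea (the log format and field names below are invented for illustration, not taken from the case study), raw clickstream lines can be stored untouched, with a schema applied only when an analyst reads them:

```python
# Schema-on-read sketch: raw log lines are stored as-is;
# each analysis chooses its own schema at read time.

RAW_CLICKSTREAM = [
    "2015-08-10T01:51:00 user=42 page=/home referrer=google",
    "2015-08-10T01:52:10 user=42 page=/products referrer=/home",
]

def read_with_schema(lines, fields):
    """Parse raw lines into dicts, keeping only the fields this analysis needs."""
    records = []
    for line in lines:
        parts = line.split()
        row = {"timestamp": parts[0]}
        for kv in parts[1:]:
            key, _, value = kv.partition("=")
            if key in fields:
                row[key] = value
        records.append(row)
    return records

# Two analyses, two schemas, one copy of the raw data:
pages = read_with_schema(RAW_CLICKSTREAM, {"user", "page"})
traffic = read_with_schema(RAW_CLICKSTREAM, {"referrer"})
```

The point of the sketch is that nothing was decided at write time: the same raw lines serve a page-view analysis and a referrer analysis with different "schemas".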

New Opportunities for Data Architecture

  • Lower cost of storage – compared to high-end storage area networks (SANs), Hadoop allows users to reduce capital expenditure (CAPEX) because it runs on commodity hardware, and because it allows users to invest in “just enough” hardware to meet immediate needs and easily expand later as needs grow.
  • Data warehouse workload optimisation – compared to a traditional enterprise data warehouse (EDW), the ETL function (a lower-value computing workload) can be offloaded to Hadoop: data is extracted and transformed on the Hadoop cluster and the results are loaded into the data warehouse.
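The offload pattern above can be illustrated with a toy transform (the order-record format is invented for the example): the heavy cleaning and aggregation happens on the cluster side, and only the small summarised result is loaded into the warehouse.

```python
# ETL-offload sketch: raw records are cleaned and aggregated
# before the load step, so the warehouse receives only a summary.

raw_orders = [
    "ord-1,ACME, 100.50",
    "ord-2,acme,49.50",
    "ord-3,Globex,200.00",
]

def transform(records):
    """Clean raw order lines and aggregate them into per-customer totals."""
    totals = {}
    for rec in records:
        order_id, customer, amount = [field.strip() for field in rec.split(",")]
        customer = customer.upper()  # normalise inconsistent casing
        totals[customer] = totals.get(customer, 0.0) + float(amount)
    return totals

warehouse_table = transform(raw_orders)  # only this summary is "loaded"
```

In a real deployment the transform would run as a distributed job (e.g. MapReduce or Hive) over far larger inputs, but the division of labour is the same: expensive transformation on cheap cluster storage, compact results in the EDW.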

Hadoop Enterprise Capabilities


  • Data management – store and process vast quantities of data in a scale-out storage layer.
    • HDFS provides Hadoop’s efficient scale-out storage layer. YARN enables Hadoop to serve broad enterprise use cases, allowing a wide variety of data access methods to operate on data stored in Hadoop.
  • Data access – access and interact with data in a wide variety of ways, spanning batch, interactive, streaming and real-time use cases.
    • Apache Hive offers direct data connections to Microsoft Excel and Power BI.
  • Data governance and integration – quickly and easily load data, and manage it according to policy.
    • Apache Falcon provides policy-based workflows for governance.
    • Apache Flume and Sqoop enable easy data ingestion, as do the NFS and WebHDFS interfaces to HDFS.
  • Security – address requirements for authentication, authorisation, accounting and data protection.
    • Security is provided at every layer of the Hadoop stack, from HDFS to YARN to Hive.
  • Operations – provision, manage, monitor and operate Hadoop clusters at scale.
    • Apache Ambari offers the necessary interface and APIs to provision, manage and monitor Hadoop clusters and integrate with other management console software.
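To make the WebHDFS interface mentioned above a little more concrete, here is a sketch of how a client addresses a file over WebHDFS's REST API. The host name, port and file path are placeholders, and no request is actually sent; the sketch only builds the URL a real client would call.

```python
# Sketch of addressing HDFS via the WebHDFS REST interface.
# namenode.example.com, the port and the path are hypothetical.
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS v1 REST URL, e.g. for an OPEN or LISTSTATUS operation."""
    query = urlencode(dict(params, op=op))
    return "http://{}:{}/webhdfs/v1{}?{}".format(host, port, path, query)

url = webhdfs_url("namenode.example.com", 50070, "/data/logs/day1.log",
                  "OPEN", **{"user.name": "analyst"})
```

Because WebHDFS is plain HTTP, any tool that can issue REST calls (curl, Excel data connections, custom scripts) can read from or write to the data lake without a Hadoop client installed.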


Tying it back to the finance sector example


Hadoop architecture with Microsoft Windows


Benefit Realisation in the Finance Sector

  • Improving underwriting efficiency for usage-based insurance – advanced GPS and telemetry technologies have reduced the cost of capturing driving data for insurance companies issuing pay-as-you-drive (PAYD) policies, but for one company streaming vehicle data was cost-prohibitive. It is now possible for this company to retain 100% of policy holders’ geolocation data. This has allowed the company to align premiums with empirical risk and in turn reward safer drivers.
  • Screening new account applications for risk – managing the risk of fraud by identifying patterns of fraud.
  • Achieving sub-second SLAs with a Hadoop “ticker plant” – a ticker plant collects and processes massive data streams, displaying prices for traders and feeding computerised trading systems fast enough to capture opportunities in seconds. For one customer, gigabytes of data flow in from thousands of server logs per day. This data is queried more than 30,000 times per second, and Apache HBase enables the super-fast queries that meet their client SLAs.

Reference: Hortonworks 2014, ‘Modern Data Architecture for financial services with Apache Hadoop on Windows’, The journey to a financial services data lake.

Nine Vendors offering Hadoop Services


  1. Amazon Web Services (AWS) – the company’s Hadoop product is named Elastic MapReduce (EMR), which AWS says uses Hadoop to offer big data management services. It is not pure open-source Hadoop, though; it has been tinkered with to run specifically on AWS’s cloud.
  2. Microsoft and its partners – Windows Azure’s HDInsight product is a Hadoop-as-a-service offering based on Hortonworks’ distribution of the platform but specifically designed to run on Azure. Microsoft has some other nifty projects too, including a production-ready feature named PolyBase that allows information in SQL Server to also be searched during Hadoop queries. “Microsoft’s significant presence in the database, data warehouse, cloud, OLAP, BI, spreadsheet (PowerPivot), collaboration, and development tools markets offers an advantage when it comes to delivering a growing Hadoop stack to Microsoft customers.”
  3. Cloudera – Cloudera uses open-source Hadoop as the basis of its distribution, but it is not a pure open-source product. When Cloudera’s customers need something that open-source Hadoop doesn’t have, Cloudera builds it or finds a partner who has it.
  4. Hortonworks – unlike Cloudera, Hortonworks sticks to the open-source Hadoop code more strictly than perhaps any other vendor. Hortonworks’ goal is to build up the Hadoop ecosystem and Hadoop user base, and to advance the open-source code.
  5. IBM
  6. Intel
  7. MapR Technologies
  8. Pivotal Software
  9. Teradata
