HEP 6 - Cascading Integration

General Proposal Information

HEP: 6
Title: Cascading Integration
HEP Shortname: cascading-integration
Author: Alexander Toktarev (toktarev_at_hazelcast_dot_com)
Sponsor: Hazelcast, Inc., Concurrent, Inc.
Signed-Of: Christoph Engelbert
Lead: Alexander Toktarev
Created: 2015/06/05 
Status: Draft
Type: Feature
Component: Integration
Discussion: https://gitter.im/hazelcast-incubator/cascading-integration
Specification: 
Project: https://github.com/hazelcast-incubator/cascading-integration

Process Information

Start: 2015/Q2 
Depends:
Effort: XL
Duration: XL 
Release:

Summary

Cascading is a data processing API and processing query planner used for defining, sharing, and executing data-processing workflows on a single computing node or distributed computing cluster. Per default Cascading runs on either in single-node (local-mode) setup or on top of hadoop using the hadoop mapreduce framework or Apache tez (in upcoming version 3). Hazelcast would be a good fit to act as an underlying execution framework alongside the already existing ones.

 

Goals

 

The goal is to implement an execution layer underneath the Cascading SPI. This execution layer will translate the Cascading DAG (Directed Acyclic Graph) into an execution plan to execute on Hazelcast itself and eventually execute it. The actual execution plan can be implemented in multiple ways, just to name two of them - the Hazelcast mapreduce framework or implementing an graph-execution model - and it is part of the HEP to figure out the best option.

Additionally Hazelcast datasources and -sinks will be implemented to read data directly from Hazelcast and also save them back into Hazelcast data structures.

Hazelcast also has to provide a framework for to collect data in counters that are available across the cluster to distribute performance and runtime information.

Non-Goals

Not yet defined

Motivation

Cascading is a known framework for simplifying complex data processing on top of hadoop, however the speed is limited by the way hadoop stores data (on disk). Hazelcast, in contrast, stores data in-memory for highest speed and lowest access latency.

In addition to that Hazelcast is easy to setup, compared to hadoop, and offers more people (especially smaller users of Cascading) a fast option to get up and running.

Success Metrics

The Hazelcast-Cascading integration must be able to execute Cascading execution flows on top of Hazelcast, as well as to define a clear deployment schema. It also needs to automatically translate the Cascading DAG into a set of Hazelcast execution patterns and be able to read and store data using Cascading provided Taps (data sinks or sources).

Description

Cascading is a Processing API set normally based upon Hadoop but in general is able to use other systems as the underlying operation layer. Hazelcast is expected to implement all aspects of the Cascading processing, which includes, but is not limited to, data sources and targets as well as mapping the processes (pipes / filters) to some Hazelcast API to execute in a running Hazelcast cluster.

In general the Cascading Adapter needs to map a Cascading execution graph to some underlying Hazelcast execution model. This is done by implementing a rule engine (the query planner) which analyzes the graph and combines or splits graph-nodes in Hazelcast operations. 
For a Processing API (graph based execution model) based approach it seems like the Cascading design is pretty nice and could probably taken almost like 1:1 into our own one. Mapping would than be pretty easy and more like wrapping one element into another. 

For a Map-Reduce based approach we can have a look at the old hadoop based implementation and would rebuild it on top of our Map-Reduce API. As it seems this approach has disadvantages and might not be worth the time to implement. Chris W. (Cascasding, Inc.) advised against it. 
 

Apart from the general execution model abstraction an implementation of the Cascading Tap adapters to Hazelcast data structures is necessary to benefit from the in-memory speed and those Taps need to interoperate with existing data supply mechanisms supported by Cascading.

Last but not least Hazelcast needs to provide a framework for distributed counters to enable Cascading to report status and performance data back to their framework. Those counters will also be very beneficial to other data structures and users.

 

Testing

 

Not yet defined

Risks

Not yet defined

Dependencies