HEP # - Hazelcast Indexing SPI

General Proposal Information

HEP: #
Title: Hazelcast SPI for Indexing Services
HEP Shortname: Hazelcast Indexing SPI
Author: Christoph Engelbert (@noctarius2k)
Sponsor: Hazelcast, Inc.
Signed-Off: Christoph Engelbert
Lead: Christoph Engelbert
Created: 2016/05/04 
Status: Completed
Type: Feature
Component: Integration
Discussion: https://gitter.im/hazelcast-incubator/hazelcast-indexing-spi
Specification: https://docs.google.com/document/d/1VLMEm_ifaIp1yAMgv5s7IJ9kMqvgB0fGPge_0PkQizM
Project: https://github.com/hazelcast-incubator/hazelcast-indexing-spi

Process Information

Start: 2016/Q2 
Depends:
Effort: M
Duration: M
Release: not yet defined

Summary

The Hazelcast Indexing SPI will provide an extension point for implementing arbitrary, user-defined types of indexes. So far Hazelcast provides an indexing system for simple key/value indexed attributes, but users commonly ask for other kinds, such as geospatial or partial indexes. This SPI will provide all interfaces and subsystems necessary to implement index containers, indexers, index accessors and query optimizers.

Goals

  • Collect and define a list of necessary information and requirements for different indexing types
  • Create a set of interfaces to be implemented by external indexing strategies
  • Create a specification document
  • Create documentation of the SPI
  • Create reference implementations: at least the currently bundled key-value attribute index plus one additional index type, such as geospatial

Non-Goals

  • Do not create additional, native and bundled indexing systems

Motivation

As the Hazelcast user base grows, more data storage scenarios come into play. Whereas Hazelcast itself has always been seen as a key-value optimized in-memory data store, people are starting to store types of data, such as geospatial information, that would benefit from different types of indexes. To support those use cases and give people the chance to optimize for their special needs, the indexing system has to be revamped and made public.

Additionally, certain types of indexes might be able to optimize index querying by providing optimized accessors, such as bytestream comparison that avoids additional serialization / deserialization, or might be able to work on raw data directly (like Hypercast). Apart from that, some indexes might be implemented using external services like Lucene to provide high-speed full-text search, or might simply provide optimized indexes for special value types like UUID.
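As a rough illustration of the bytestream comparison idea, the following sketch orders UUIDs by comparing their serialized bytes directly, without creating UUID objects on the query path. The class name RawUuidComparison and the 16-byte big-endian layout are assumptions made for this example only; they are not part of Hazelcast or of the proposed SPI, and the resulting order is an unsigned byte order, which is not necessarily the order of UUID.compareTo.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper for illustration only; not part of Hazelcast.
final class RawUuidComparison {

    // Serializes a UUID as 16 big-endian bytes (assumed layout for this sketch).
    static byte[] toBytes(UUID uuid) {
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    // Unsigned lexicographic comparison performed directly on the raw bytes,
    // so no value has to be deserialized to decide the ordering.
    static int compareRaw(byte[] left, byte[] right) {
        int length = Math.min(left.length, right.length);
        for (int i = 0; i < length; i++) {
            int diff = (left[i] & 0xFF) - (right[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return left.length - right.length;
    }

    public static void main(String[] args) {
        byte[] a = toBytes(UUID.fromString("00000000-0000-0000-0000-000000000001"));
        byte[] b = toBytes(UUID.fromString("00000000-0000-0000-0000-000000000002"));
        System.out.println(compareRaw(a, b) < 0);  // a sorts before b
        System.out.println(compareRaw(a, a) == 0); // identical bytes compare equal
    }
}
```

An index built this way could keep its keys in serialized form and only deserialize the entries that are actually returned to the caller.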

Success Metrics

The current key-value implementation needs to be reimplemented on top of the SPI and at least one other type of index must be implemented as well. The system needs to be generic enough to support more kinds of indexes; therefore all kinds of index logic (how to select, query, ...) need to be encapsulated in the Indexing SPI.

The defined SPI has to fulfil current requirements as well as requests for additional indexing purposes like partial indexes (hazelcast/pull/4255) or geospatial indexes (https://github.com/mraad/HZSpatial). It also has to support optimized memory access for indexes that do not have to store deserialized values but can access raw data directly.

Last but not least, the internal part of the SPI needs to be defined in a way that makes it easy to adapt to further data structures like List, Set and JCache.

Description

Currently, Hazelcast has a key-value based indexing system that stores predefined attributes to speed up queries and avoid deserialization when searching for entries based on certain criteria. Not every type of data has the same requirements or is satisfied with a pure key-value index. Additional information like relations, geospatial information or just useful information to query the object faster can be interesting too. These use cases become more important for Hazelcast as the user base grows and new business areas start to utilize Hazelcast for their needs.

The external SPI needs to fulfil current requirements as well as requests for additional indexing purposes like partial indexes (hazelcast/pull/4255) or geospatial indexes (https://github.com/mraad/HZSpatial). It also has to support optimized memory access for indexes that do not have to store deserialized values but can access raw data directly.
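To make the partial index requirement more concrete, here is a minimal sketch of what a user-supplied index could look like. The interface CustomIndex, its methods, and the class PartialAttributeIndex are purely hypothetical names chosen for this example; the actual SPI is still to be defined by the specification document and may look entirely different.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical SPI contract; all names are illustrative assumptions.
interface CustomIndex<K, V, A> {
    void onInsert(K key, V value);
    void onRemove(K key, V value);
    Set<K> lookup(A attributeValue);
}

// A partial index: only values accepted by the filter are indexed at all.
final class PartialAttributeIndex<K, V, A> implements CustomIndex<K, V, A> {
    private final Predicate<V> filter;
    private final Function<V, A> extractor;
    private final Map<A, Set<K>> entries = new HashMap<>();

    PartialAttributeIndex(Predicate<V> filter, Function<V, A> extractor) {
        this.filter = filter;
        this.extractor = extractor;
    }

    @Override
    public void onInsert(K key, V value) {
        if (filter.test(value)) {
            entries.computeIfAbsent(extractor.apply(value), a -> new HashSet<>()).add(key);
        }
    }

    @Override
    public void onRemove(K key, V value) {
        if (!filter.test(value)) {
            return; // the value was never indexed
        }
        A attribute = extractor.apply(value);
        Set<K> keys = entries.get(attribute);
        if (keys != null) {
            keys.remove(key);
            if (keys.isEmpty()) {
                entries.remove(attribute);
            }
        }
    }

    @Override
    public Set<K> lookup(A attributeValue) {
        Set<K> keys = entries.get(attributeValue);
        return keys == null ? Collections.<K>emptySet() : Collections.unmodifiableSet(keys);
    }
}

final class PartialIndexDemo {
    public static void main(String[] args) {
        // Index ages, but only for adults; minors never enter the index.
        PartialAttributeIndex<String, Integer, Integer> index =
                new PartialAttributeIndex<String, Integer, Integer>(age -> age >= 18, age -> age);
        index.onInsert("alice", 30);
        index.onInsert("bob", 12);
        System.out.println(index.lookup(30));
        System.out.println(index.lookup(12).isEmpty());
    }
}
```

In this sketch the filter keeps the index small by skipping entries that queries are not expected to target, which is the core idea behind a partial index. A geospatial index could implement the same hypothetical contract with a different internal structure.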

Expected changes will mostly affect internal implementations; nevertheless, the public SPI has to be defined.

 

Testing

The new implementation needs to maintain backward compatibility with currently used configurations. If additional reference implementations are built, they need the typical unit tests as well as a plan for testing them automatically. Otherwise they have to be kept as unsupported third-party providers.

Risks

Backward compatibility could break with careless changes. Care needs to be taken, and possibly an additional set of unit tests must be implemented first to prevent this situation from happening.

In addition, the current indexing system is tightly coupled to the internal query system. Caution has to be applied so as not to break anything or introduce regressions; however, test coverage should be good enough to prevent most problems in the first place.

Dependencies