How to democratize HBase coprocessor usage?

HBase

HBase is a distributed, scalable, reliable, and versioned column-oriented database, which can provide random read/write access with very low latency. HBase was modeled after Google’s Bigtable concept. Bigtable is the column-oriented database built on top of the Google File System (GFS). In the same way, HBase runs on top of the Hadoop Distributed File System (HDFS).

HBase was designed to handle very large tables with billions of rows and millions of columns, with an average response-time of just a few milliseconds.

I am going to assume that the reader is already familiar with HBase architecture and HBase coprocessors.

If that is not the case, I recommend reading this good blog post which describes the HBase Architecture and this slideshow which explains HBase coprocessors.

In this article, I will first give an overview of HBase coprocessors and then try to explain why, and how we can use them in the industrial sector.

Coprocessors in one word

Coprocessors are a very powerful mechanism for calling “Map-Reduce” tasks for “low-latency” applications.

In short, coprocessors are an API that offers the ability to deploy custom code on HBase servers (masters and region servers).

This API provides two kinds of coprocessors: Observers & Endpoints or a combination of the two.

Observers

Observer coprocessors are the HBase equivalent of RDBMS triggers. They offer the possibility to trigger custom code in response to HBase life cycle events.

This kind of coprocessor comes in different flavours, for each kind of HBase component.

The three main observers are:

  • MasterObserver: provides hooks for DDL-like operations (table/namespace lifecycle),
  • RegionServerObserver: provides hooks related to region server operations like stop/start, split, merge, etc.
  • RegionObserver: provides hooks for common operations like Get, Put, Scan …
  • And some others like BulkLoadObserver, WALObserver and EndpointObserver.
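
To make the hook idea concrete, here is a toy model of the observer pattern in plain Java. This is not the real HBase API (real observers implement interfaces such as RegionObserver, with hooks like preGetOp/postGetOp); all names below are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, illustrative model of the observer idea: hooks fire before
// and after an operation. The names are invented for this sketch.
interface ToyRegionObserver {
    default void preGet(String row, List<String> events) { }
    default void postGet(String row, List<String> events) { }
}

class AuditObserver implements ToyRegionObserver {
    @Override
    public void preGet(String row, List<String> events) {
        events.add("preGet:" + row);   // e.g. check ACLs, rewrite the query...
    }
    @Override
    public void postGet(String row, List<String> events) {
        events.add("postGet:" + row);  // e.g. filter or enrich the result
    }
}

class ToyRegion {
    private final ToyRegionObserver observer;
    ToyRegion(ToyRegionObserver observer) { this.observer = observer; }

    // The host calls the hooks around the actual operation, like
    // HBase's coprocessor host does around a Get.
    List<String> get(String row) {
        List<String> events = new ArrayList<>();
        observer.preGet(row, events);
        events.add("get:" + row);      // the real read would happen here
        observer.postGet(row, events);
        return events;
    }
}
```

ToyRegion plays the role of the coprocessor host here: it invokes the observer’s pre-hook before the operation and its post-hook after.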

Endpoints

Endpoint coprocessors are the RDBMS-like stored procedures.

They have been designed to extend HBase functionalities.

Please note that all exchanges are made through Protocol Buffers, just like many other HBase exchanges.

Use cases

Coprocessors have been designed by and for HBase developers. But this functionality is so powerful that many people want to use it, and not only to extend HBase features.

In my opinion, the following table summarizes the situations where you should and should not use coprocessors:

do | don’t
Simple computation | Complex data analysis
Reduce the amount of data sent to the client | Long-running processes
Take advantage of the cluster (load balancing) | No need for low-latency map-reduce jobs (have a look at Apache Spark)
Need low latency |

Good coprocessor use cases include:

  • Apache Phoenix, which provides a SQL-like layer on HBase. It uses coprocessors for filtering (like scans), sorting, paging, aggregating and transforming data,
  • Apache Ranger, which provides authorization for HBase.

Why do coprocessors need to be secured?

As we can see, HBase coprocessors seem to be powerful, so why are they not more widely used?

As Spider-Man’s uncle says: “With great power comes great responsibility”.

More seriously, let me show you what can happen with coprocessors in the following scenario:

I want to use a coprocessor to implement business logic on top of HBase, but I haven’t taken the time to learn about HBase’s low-level inner workings. To add another layer of complexity, as is often the case, we are in a multi-tenant environment.

In this case, HBase coprocessors suffer from the following issues:

  1. Can crash servers (region & masters)

    Only IOException is allowed to propagate; any other exception crashes the host process.

  2. Can break down the cluster in case of bad request

    The client retry mechanism (which is normally a good thing) can propagate a failure and consume resources for nothing.

  3. Can slow down the cluster (memory/cpu hogging)

    Long-running tasks and heavy memory consumption in the HBase JVM can slow down other HBase features, or co-located services.

  4. Comes without any instrumentation (logging & metrics)

    Logging is trivial to implement but still not provided, and there are no metrics on coprocessors in the HBase API.

    Metrics are now implemented in HBase coprocessors starting with version 2.0: more details are available in HBASE-9774, but without any documentation (HBASE-17536).

  5. Can crash the host JVM

    There is no process isolation between coprocessors because coprocessors are executed in the HBase JVM (Master, Region).

    Moreover, in a multi-tenant environment, the Java Reflection API can be an issue too, because a coprocessor may gain access to other coprocessors’ in-memory data.

  6. Can break security

    By using the #bypass and #complete API on the HBase observer context.

  7. Can break down the host in case of load failure

    When the coprocessor JAR is missing.

  8. Can cause unavailability

    Tables must be disabled to set up coprocessors and then re-enabled.

Depending on your use case, some of these issues may not actually apply, so we also need flexibility.

How to secure coprocessors?

Now let’s have a look at how we can solve these issues:

  1. Can crash servers (region & masters)

    You must catch all exceptions other than IOException (and rethrow them wrapped in an IOException).

  2. Can break down the cluster in case of bad request

    You must implement server-side retry logic, or throw DoNotRetryIOException when you don’t want the client to retry.

  3. Can slow down the cluster (memory/cpu hogging)

    You must limit execution time and memory consumption.

  4. Comes without any instrumentation (logging & metrics)

    You must implement your own metrics API and reporting.

    For logging, use a logger with MDC or a conventional name.

  5. Can crash the host JVM

    You must daemonize the coprocessor instance in a separate process.

  6. Can break security

    You should tell your devs not to do that, and verify it through code review.

    And cross your fingers…

  7. Can break down the host in case of load failure

    You can minimize this by setting hbase.coprocessor.abortonerror to false (available since version 0.98.0, see HBASE-6873).

  8. Can cause unavailability

    You must take it into account before using coprocessors.

    You should probably prefer non-dynamic deployment and use white-listing of authorized implementations.

    Note that coprocessor white-listing is part of HBase 2.0.0, see HBASE-16700.
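
As a sketch of points 1 and 2 above: wrap every hook body so that only IOException can escape, and mark non-retriable failures. Everything here is illustrative and JDK-only (SafeCall is an invented name, and the local DoNotRetryIOException is a stand-in for HBase’s real class of the same name):

```java
import java.io.IOException;

// Local stand-in for HBase's DoNotRetryIOException, so the sketch
// compiles without HBase on the classpath.
class DoNotRetryIOException extends IOException {
    DoNotRetryIOException(String msg, Throwable cause) { super(msg, cause); }
}

final class SafeCall {
    interface Body { void run() throws Throwable; }

    // Wrap a hook body so that RuntimeException, Error, etc. never reach
    // the region server as-is.
    static void guard(Body body) throws IOException {
        try {
            body.run();
        } catch (IOException ioe) {
            throw ioe;                 // already legal to propagate
        } catch (IllegalArgumentException bad) {
            // A bad request will fail on every retry: tell the client to stop.
            throw new DoNotRetryIOException("bad request", bad);
        } catch (Throwable t) {
            throw new IOException("coprocessor failure", t);
        }
    }
}
```

SafeCall.guard(() -> businessLogic()) would then be the only call pattern used inside each coprocessor hook.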

To summarize these solutions, you must:

  • Write defensive code: your devs must write defensive code to ensure performance and stability,
  • Set code conventions: your devs must follow code conventions to avoid security issues,
  • Implement the same functionality for each coprocessor: common functionality such as logging, metrics and timeouts has to be re-implemented every time,
  • Implement a code review process to make sure that all conventions are applied: check the defensive code, check your code conventions.

As you can see, this is a heavy process to set up, especially in the industrial sector, and you still need experts to conduct code reviews.

Moreover, having to manually re-implement common functionality in every coprocessor is both inefficient and error-prone, as it is a human process.

How to industrialize it?

What we want here is to apply a set of custom policies to the HBase coprocessor life cycle.

Before implementing policies, we must define an interface for them. I suggest the following: Source code on GitHub

Note: I used Lombok to reduce the amount of Java boilerplate code.
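
For readers who do not want to open the repository, here is a hedged sketch of what such a policy interface might look like (the authoritative version is in the linked source; the method names here are my assumptions):

```java
import java.io.IOException;
import java.lang.reflect.Method;

// Hypothetical shape for the policy interface: a policy gets a chance to
// act before and after each coprocessor method call, and to translate
// failures. All names are assumptions, not the actual source.
interface Policy {
    default void before(Object coprocessor, Method method, Object[] args) throws IOException { }
    default void after(Object coprocessor, Method method, Object result) throws IOException { }
    default IOException onError(Object coprocessor, Method method, Throwable t) {
        // Default behaviour addresses issue #1: never let a
        // non-IOException escape a coprocessor hook.
        return (t instanceof IOException) ? (IOException) t
                                          : new IOException(t);
    }
}
```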

Which policies to implement?

Let’s see which kinds of policies could be implemented:

  1. Can crash servers (region & masters)

    Create a policy that catches Throwable and re-throws it as an IOException (or a derived one).

  2. Can break down the cluster in case of bad request

    Create a policy that implements a retry limit (region server side) based on input queries,

    You can also create the same policy cluster-wide with an HBase table as a sync point.

  3. Can slow down the cluster (memory/cpu hogging)

    Create a policy that implements a timeout logic,

    Create a policy that profiles the memory used by an execution at runtime.

  4. Comes without any instrumentation (logging & metrics)

    Create a logger policy,

    Create a metrics policy based on hadoop metrics2.

  5. Can crash the host JVM

    Create a policy which starts a separate process and communicates through a pipe with a daemonized instance of the coprocessor wrapped in another PolicyVerifier,

    This solution introduces a high overhead (complexity, performance?) but brings a real security enhancement,

    I think this kind of policy is needed when, in a multi-tenant environment, Java reflection is considered an issue with coprocessors. Indeed, all HBase coprocessors share the same JVM, so a coprocessor may be able to access/modify other coprocessors’ in-memory data,

    This policy will only be compatible with the CoprocessorService interface; other interfaces contain non-serializable fields.

  6. Can break security

    Create a policy that wraps ObserverContext and throws an Exception when the bypass and/or complete methods are called.

  7. Can break down the host in case of load failure

    This can be minimized by setting hbase.coprocessor.abortonerror to false (since version 0.98.0, see HBASE-6873).

  8. Can cause unavailability

    Per set of coprocessors, use more or less restrictive policies, even white-listing (e.g. for HBase’s own coprocessors).
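
As an example, the timeout logic from point 3 could be sketched with plain JDK concurrency utilities. TimeoutPolicy is an invented name; a production version would also need the process isolation discussed in point 5, since a stuck task cannot always be interrupted:

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative timeout policy: run the coprocessor body on a separate
// executor and abort the wait after a deadline.
final class TimeoutPolicy {
    private final ExecutorService pool = Executors.newCachedThreadPool();
    private final long timeoutMillis;

    TimeoutPolicy(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    <T> T call(Callable<T> body) throws IOException {
        Future<T> f = pool.submit(body);
        try {
            return f.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);   // best-effort interruption of the task
            throw new IOException("coprocessor exceeded " + timeoutMillis + " ms");
        } catch (ExecutionException e) {
            // Issue #1 again: only IOException may escape.
            throw new IOException(e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException(e);
        }
    }
}
```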

These solutions are certainly not perfect, but they try to give a pragmatic solution to HBase coprocessor issues.

At this point, we have a set of flexible policies aiming to solve most coprocessor issues, but how do we apply them?

First of all, we need to define a class which is responsible for calling a set of policies when a method is called. This class can be implemented like this:

Source code on GitHub
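
For illustration, a minimal PolicyVerifier could look like the following. This is a simplified, JDK-only sketch of the linked source, with an inlined Policy interface whose method names are assumptions:

```java
import java.io.IOException;
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;

// Illustrative PolicyVerifier: it chains all registered policies around a
// single method invocation on the wrapped coprocessor instance.
class PolicyVerifier {
    interface Policy {
        default void before(Method m, Object[] args) throws IOException { }
        default void after(Method m, Object result) throws IOException { }
    }

    private final List<Policy> policies;
    PolicyVerifier(Policy... policies) { this.policies = Arrays.asList(policies); }

    Object invoke(Object target, Method m, Object[] args) throws IOException {
        for (Policy p : policies) p.before(m, args);   // a policy may veto here
        Object result;
        try {
            result = m.invoke(target, args);
        } catch (Throwable t) {
            // Issue #1: only IOException may escape a coprocessor call.
            throw new IOException("coprocessor failure", t);
        }
        for (Policy p : policies) p.after(m, result);
        return result;
    }
}
```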

How to integrate that PolicyVerifier?

I see two solutions: one at compile time and the other at runtime.

How to apply policies at compile time?

This is the simplest way. First, take your favorite design pattern: the Proxy.

For each HBase coprocessor interface (and each combination of them), wrap all methods to apply policies (not very interesting work).

Once all calls are wrapped, you can easily apply a combination of policies.

It’s not possible to use JDK dynamic proxies here, because we need a static class reference to put in a configuration file or to share across multiple JVMs (when used with dynamic activation).

Here is an example of concrete proxy implementation for the Coprocessor base class:

Source code on GitHub

All other proxies are built on the same model. The deployment is the same as the legacy one.

The table below tries to summarize the pros and cons of this solution:

Pros | Cons
Does the job | Intrusive (needs a new dependency)
Easy to set up | Need to make sure the Proxy is applied (review?)
Low overhead (no policy overhead) | Cannot update policies without updating the coprocessor
Transparent deployment | Cannot be applied to an existing coprocessor library
| Not so easy for multi-interface coprocessors (no dynamic proxies: a “static” class definition is needed)

How to apply policies at runtime?

The second solution is based on Aspect Oriented Programming.

First, we need modularity.

Secondly, policies are additional behaviors that need to be integrated without modifying the HBase or coprocessor codebase itself.

To apply policies as an aspect, we need to find a point-cut.

This is easy since all coprocessors are instantiated from the CoprocessorHost classes (one subclass per kind of coprocessor).

At this point, we need to override the CoprocessorHost method that instantiates coprocessors, so as to apply a JDK dynamic proxy (our advice) on each coprocessor instance.
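
Here is what that wrapping step could look like with a JDK dynamic proxy. The Coprocessor interface below is a stand-in invented for the sketch; in the real CoprocessorHost, the proxy would be built over the interfaces returned by instance.getClass().getInterfaces():

```java
import java.io.IOException;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

// Stand-in for the real HBase coprocessor interfaces.
interface Coprocessor {
    String process(String input) throws IOException;
}

final class Policies {
    // Wrap a freshly created coprocessor instance in a dynamic proxy.
    static Coprocessor wrap(Coprocessor instance) {
        InvocationHandler handler = (proxy, method, args) -> {
            try {
                // Policies would run before/after this reflective call.
                return method.invoke(instance, args);
            } catch (Throwable t) {
                // Unwrap InvocationTargetException, then enforce issue #1.
                Throwable cause = t.getCause() != null ? t.getCause() : t;
                throw cause instanceof IOException ? cause : new IOException(cause);
            }
        };
        return (Coprocessor) Proxy.newProxyInstance(
                instance.getClass().getClassLoader(),
                instance.getClass().getInterfaces(),
                handler);
    }
}
```

Because the proxy implements the same interfaces as the wrapped instance, the rest of the host code keeps using it transparently.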

Now there are two ways to do that:

Modifying HBase codebase

Why not? Let’s have a look at the timeline.

First, we need to integrate a major change on HBase codebase (if HBase team thinks this is relevant).

Since this change is a major one, we need to wait for the next major release.

The industry rarely uses “raw” packages from the Hadoop ecosystem; companies generally use packaged distributions (HDP, Cloudera, MapR).

So, we need to wait for the new HBase version to be integrated into those distributions.

After that, we need to wait for this new version to be deployed in our infrastructure.

You understand that we could be waiting a (very) long time.

Modifying HBase bytecode

The aspect paradigm has a major advantage here: it is binary-compatible.

We are able to weave HBase bytecode directly.

The advantage is that this can be done on an old HBase version and should remain compatible with multiple HBase versions.

Java agent for bytecode weaving

So bytecode modification seems to be a quick and appropriate solution.

But this solution requires weaving the HBase bytecode (which some may object to).

To achieve this, I need two things:

  • First, intercepting the class-loading events of all CoprocessorHost subclasses to override their behaviour; this is called load-time weaving,
  • Secondly, changing the behaviour of coprocessor creation; for this, I need to generate and integrate bytecode at runtime.

A Java agent can achieve the first point.

Indeed, a Java agent allows you to register a ClassFileTransformer that runs before each class is loaded by the JVM.

This is a powerful mechanism, but it could cause performance issues if applied to a huge number of classes.

Here, this is not the case: we are only interested in 4 classes.
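
The agent skeleton below shows that filtering. The class names in TARGETS are an assumption about which CoprocessorHost subclasses to weave (the exact list depends on the HBase version), and the actual bytecode rewriting is left as a comment:

```java
import java.lang.instrument.ClassFileTransformer;
import java.security.ProtectionDomain;
import java.util.Arrays;
import java.util.List;

// Load-time-weaving sketch: every class not in TARGETS is left untouched
// (returning null means "no transformation").
class CoprocessorHostTransformer implements ClassFileTransformer {

    // Internal (slash-separated) names of the classes to weave;
    // an assumption to adapt to your HBase version.
    static final List<String> TARGETS = Arrays.asList(
            "org/apache/hadoop/hbase/regionserver/RegionCoprocessorHost",
            "org/apache/hadoop/hbase/regionserver/RegionServerCoprocessorHost",
            "org/apache/hadoop/hbase/master/MasterCoprocessorHost",
            "org/apache/hadoop/hbase/regionserver/wal/WALCoprocessorHost");

    @Override
    public byte[] transform(ClassLoader loader, String className,
                            Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain,
                            byte[] classfileBuffer) {
        if (className == null || !TARGETS.contains(className)) {
            return null;                 // not one of ours: load unchanged
        }
        // Here a bytecode library (Javassist, Byte Buddy, ...) would rewrite
        // the coprocessor-creating method to wrap instances in a proxy.
        return classfileBuffer;
    }
}

// The agent entry point, referenced by the JAR manifest's Premain-Class
// attribute, would simply register the transformer:
//
//   public static void premain(String args, java.lang.instrument.Instrumentation inst) {
//       inst.addTransformer(new CoprocessorHostTransformer());
//   }
```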

If you need to learn more about Java agents, I recommend reading the following:

The second point is achieved with a library able to create/modify Java bytecode, such as Javassist.

Java agents are difficult to test; you can find a good agent toolbox library here.

For example, this library allows you to dynamically create and load an agent from an existing class in your classpath.

To deploy this agent, you need to add the agent JAR to the HBase classpath (masters & region servers).

You need to update the HBase command-line arguments to add -javaagent:/path/to/your/agent.jar, and to put your agent configuration in the command-line arguments or, better, in the HBase configuration object.
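
For example, the hbase-env.sh change could look like this (the JAR path is the placeholder from above; HBASE_MASTER_OPTS and HBASE_REGIONSERVER_OPTS are the standard HBase environment variables):

```shell
# Add the weaving agent to both masters and region servers.
export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -javaagent:/path/to/your/agent.jar"
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -javaagent:/path/to/your/agent.jar"
```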

When the configuration is put into HBase, it can be reloaded at runtime, since HBase listens for the SIGHUP signal to handle this; see HBASE-14529 for more details.

This kind of modification is easier with a deployment tool like Ambari.

The table below tries to summarize the pros and cons of this solution:

Pros | Cons
Transparent and not bypassable | HBase class bytecode weaving
Easy to update (update one JAR) | Harder to implement (but only one time)
Adaptive for multi-interface coprocessors | Class load time impact?
Works on existing coprocessors | Cluster modification (deploy JAR and add configuration)

Implementation

Now let’s have a look at a concrete implementation of these concepts.

First, create an Agent class which registers the class transformer:

Source code on GitHub

And the class transformer, which is responsible for weaving the HBase CoprocessorHost bytecode:

Source code on GitHub

To avoid generating a large amount of bytecode, which could be tricky to debug, I created a static wrapper that is responsible for instantiating policies and creating a dynamic proxy around an existing coprocessor instance.

Source on GitHub

JDK dynamic proxies need an InvocationHandler; here is a sample implementation:

Source code on GitHub

Conclusion

As you can see, HBase coprocessors are very interesting for “low-latency” Map-Reduce-like use cases, but they suffer from issues.

In this article, I tried to demonstrate how it is possible to :

  • Apply custom policies (on execution flow, parameters) to keep HBase cluster safe;
  • in an industrial way (no huge review process);
  • with a quite low overhead (at first glance).

But you should keep in mind that :

  • Coprocessors are an advanced feature of HBase and are intended to be used by system developers only. HBase Book

and they are still in an evolving state.

All codebase is available on github : https://github.com/jBouyoud/poc-hbase-coprocessor/tree/blog-tech-ssg.
