Tuesday, 4 November 2014

Why doesn't Map.size return the size of the entire cluster?

Many people may have been surprised the first time they used Map.size method on a distributed Infinispan cluster.  As was later deduced only the local node size is returned.

Infinispan had taken this approach to limit the chance that instead of getting the full cluster size you would receive an OutOfMemoryError.  This seems fair to return the local answer only but you secretly always wanted the entire cluster size.

For the Infinispan 7.0.0.Final release forget what you know when using the Map interface with Infinispan.

Enter Distributed Entry Iterator

We already announced this feature a while back at http://blog.infinispan.org/2014/05/iterate-all-entries-in-cache.html.  You can check it out for more details but it is essentially a memory efficient way of retrieving all the entries in the cache by iterating over them.

This has opened the way to implementing the various bulk methods on the Map interface that we could never do efficiently in the past (ie. Map.size, Map.keySet, Map.entrySet & Map.values).

Map size

Okay I admit, size could have been done more efficiently before, but the answer would have contained a very high margin of error.  Now size will give you a size value with consistency semantics just slightly less than ConcurrentHashMap does, but for the entire cluster.  Warning should be given that size may be slower for larger clusters or ones with a lot of data in a stored cache loader.

The size method behavior can be controlled by using a supplied Flag such as SKIP_CACHE_LOAD to not count any configured cache loaders or CACHE_MODE_LOCAL if you want the local count only.  These flags are not exclusive and can be both passed if desired as well.

Map Collection Views

In the past the Map.values, Map.keySet & Map.entrySet methods were only ever in memory copies of the local data at the time they were invoked, similar to Map.size.

Now these collection views will be cluster wide and an additionally will show updated contents when the cache is updated and your writes to the collection will be reflected in the Cache itself!  The only operations you can't do on these collections are adding values to either the keySet or values collections as they aren't key/value pairs.

If your cache was configured with a Flag such as SKIP_CACHE_LOAD or CACHE_MODE_LOCAL it will also be reflected in the collection view for both reads and writes.

Some caution is advised when using toArray, retainAll, or size methods as they will require full iteration to complete.

KeySet Optimization

The key set collection also has an optimization so it will never pull down the values so it has a lower network and serialization/deserialization overhead (unlike entrySet and values).

Transactionality

All of the aforementioned methods still support transactions in a way that you would expect.  There is one guarantee we don't provide and that is when using REPEATABLE_READ isolation.  We will not store entries read from an iterator in the transactional context as this could very easily run your local node out of memory with a large enough data set.

For reference methods that use an iterator internally are toArray, retainAll, isEmpty & size on the various collections as well as contains and containsAll on the values collection.

Other API Changes

These changes have also loosened some restrictions on other methods as well.

Map.isEmpty

This method before was only used locally as it used to calculate the size to determine if it was empty.  This method will now use the entry iterator and returns as soon as it finds that even a single value exists.  This is an important change as the old implementation would have to query any configured Cache Loader's complete size before returning.

Map.containsValue

We never supported this method before (not even locally).  This method will now use the iterator though and if it finds the value at any point point will return immediately so it doesn't have to iterate over the entire contents unless they don't exist.  However if you really want to do this operation often you should really use Indexing to make this faster.

Code Examples

I could put an example here, but I think some could take it as insult.  You have already seen 100's of examples as to how to use the Map interface and now in Infinispan you can use those in the exact same way and they will work just how you would expect them to.

No comments:

Post a Comment