Tuesday 28 April 2009

Infinispan: the Start of a New Era in Open Source Data Grids

Over the past few months we've been flying under the radar preparing for the launch of a new, open source, highly scalable distributed data grid platform. We've finally got to a stage where we can announce it publicly and I would like to say that Infinispan is now ready to take on the world!

The way we write computer software is changing. The demise of the Quake Rule has made hardware manufacturers cram more cores on a CPU, more CPUs in a server. To achieve the levels of throughput and resilience that modern applications demand, compute grids are becoming increasingly popular. All this serves to exacerbate existing database bottlenecks; hence the need for a data grid platform.

So why is Infinispan sexy?

1. Massive heap - If you have 100 blade servers, and each node has 2 GB of space to dedicate to a replicated cache, you end up with 2 GB of total data. Every server is just a copy. On the other hand, with a distributed grid - assuming you want 1 backup copy per data item - you get a 100 GB memory-backed virtual heap that is efficiently accessible from anywhere in the grid. Session affinity is not required, so you don't need fancy load balancing policies. Of course you can still use them for further optimisation. If a server fails, the grid simply creates new copies of the lost data, and puts them on other servers. This means that applications looking for ultimate performance are no longer forced to delegate the majority of their data lookups to a large single database server - that massive bottleneck that exists in over 80% of enterprise applications!

2. Extreme scalability - Since data is evenly distributed, there is essentially no major limit to the size of the grid, except group communication on the network - which is minimised to just discovery of new nodes. All data access patterns use peer-to-peer communication where nodes directly speak to each other, which scales very well.

3. Very fast and lightweight core - The internal data structures of Infinispan are simple, very lightweight and heavily optimised for high concurrency. Early benchmarks have indicated 3-5 times less memory usage, and around 50% better CPU performance than the latest and greatest JBoss Cache release. Unlike other popular, competing commercial software, Infinispan scales when there are many local threads accessing the grid at the same time. Even though non-clustered caching (LOCAL mode) is not its primary goal, Infinispan still is very competitive here.

4. Not Just for Java (PHP, Python, Ruby, C, etc.) - The roadmap has a plan for a language-independent server module. This will support both the popular memcached protocol - with existing clients for almost every popular programming language - as well as an optimised Infinispan-specific protocol. This means that Infinispan is not just useful to Java. Any major website or application that wants to take advantage of a fast data grid will be able to do so.

5. Support for Compute Grids - Also on the roadmap is the ability to pass a Runnable around the grid. You will be able to push complex processing towards the server where data is local, and pull back results using a Future. This map/reduce style paradigm is common in applications where a large amount of data is needed to compute relatively small results.

6. Management is key! - When you start thinking about running a grid on several hundred servers, management is no longer an extra, it becomes a necessity. This is on Infinispan's roadmap. We aim to provide rich tooling in this area, with many integration opportunities.

7. Competition is Proprietary - All of the major, viable competitors in the space are not open-source, and are very expensive. Enough said. :-)
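
To make the compute grid idea in point 5 a little more concrete, here is a minimal sketch of the "push the work to the data, pull back a small result through a Future" pattern. It uses only java.util.concurrent on a single JVM: each inner list stands in for the data held by one grid node, and the class and method names are hypothetical. It is not Infinispan API, which is still on the roadmap. It also uses a Callable rather than a Runnable, since a Runnable cannot return a value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ComputeGridSketch {

    // Each inner list is a hypothetical stand-in for the data local to one node.
    // A task is submitted per partition ("map"), and the small partial results
    // are collected through Futures and combined on the caller ("reduce").
    static long sumAcrossPartitions(List<List<Integer>> partitions) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<Integer> partition : partitions) {
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (int value : partition) {
                        sum += value;
                    }
                    return sum;
                };
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // blocks until that partition's result is ready
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("partition task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2, 3),
                Arrays.asList(4, 5),
                Arrays.asList(6));
        System.out.println(sumAcrossPartitions(partitions)); // prints 21
    }
}
```

Only the partial sums travel back to the caller, which is the point of the paradigm: a large amount of data is read where it lives, and a relatively small result is returned.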

What are data grids?

Data grids are, to put it simply, highly concurrent distributed data structures. Data grids typically allow you to address a large amount of memory and store data in a way that it is quick to access. They also tend to feature low latency retrieval, and maintain adequate copies across a network to provide resilience to server failure.

As such, at its core, Infinispan presents a humble data structure. But this is also a highly specialised data structure, tuned to and geared for a great degree of concurrency - especially on multi-CPU/multi-core architectures. Most of the internals are essentially lock- and synchronization-free, favouring state-of-the-art non-blocking algorithms and techniques wherever possible. This translates to a data structure that is extremely quick even when it deals with a large number of concurrent accesses.

Beyond this, Infinispan is also a distributed data structure. It farms data out across a cluster of in-memory containers. It does so with a configurable degree of redundancy and various parameters to tune the performance-versus-resilience trade-off. Local "L1" caches are also maintained for quick reads of frequently accessed data.
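
The effect of the redundancy setting on usable capacity is simple arithmetic, sketched below. The class and method names are hypothetical, and the model ignores per-entry overhead and uneven distribution. It reads the single extra copy in the 100-blade example above as one backup alongside the primary, i.e. two copies of each entry in total, which is the reading under which the 100 GB figure works out.

```java
public class GridCapacity {

    // Replicated mode: every node holds a full copy of the data,
    // so usable capacity is the heap of a single node.
    static long replicatedCapacityGb(int nodes, long heapPerNodeGb) {
        return nodes > 0 ? heapPerNodeGb : 0;
    }

    // Distributed mode: the combined heap is shared out, and each entry
    // occupies space once per copy kept (primary plus backups).
    static long distributedCapacityGb(int nodes, long heapPerNodeGb, int copiesPerEntry) {
        if (copiesPerEntry < 1) {
            throw new IllegalArgumentException("copiesPerEntry must be at least 1");
        }
        return nodes * heapPerNodeGb / copiesPerEntry;
    }

    public static void main(String[] args) {
        System.out.println(replicatedCapacityGb(100, 2));      // prints 2
        System.out.println(distributedCapacityGb(100, 2, 2));  // prints 100
    }
}
```

Raising the number of copies buys resilience at the cost of capacity; lowering it does the opposite.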

Further, Infinispan supports JTA transactions. It also offers eviction strategies to ensure individual nodes do not run out of memory, as well as passivation/overflow to disk. Warm-starts using preloads are also supported.
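
As an illustration of what an eviction strategy does, here is a minimal least-recently-used sketch built on java.util.LinkedHashMap. It is a generic single-JVM example, not Infinispan's implementation or configuration; the class name and the maxEntries parameter are hypothetical, and the class is not thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruEvictionSketch<K, V> extends LinkedHashMap<K, V> {

    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    public LruEvictionSketch(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently used
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true
    // evicts the least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruEvictionSketch<String, Integer> cache = new LruEvictionSketch<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");     // touch "a", so "b" is now least recently used
        cache.put("c", 3);  // exceeds the limit and evicts "b"
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```

With passivation, the evicted entry would be written to a store rather than dropped; that step is left out here.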

JBoss Cache and Infinispan

So where does Infinispan stand against the competition? Let's start with JBoss Cache. It is no surprise that there are many similarities between JBoss Cache and Infinispan, given that they share the same minds! Infinispan is an evolution of JBoss Cache in that it borrows ideas, designs and some code, but for all practical purposes it is a brand new project and a new, much more streamlined codebase.

JBoss Cache has evolved from a basic replicated tree structure to include custom, high performance marshalling (in version 1.4), Buddy Replication (1.4), a new simplified API (2.X), high concurrency MVCC locking (3.0.X) and a new non-blocking state transfer mechanism (3.1.X). These were all incremental steps, but it is time for a quantum leap.

Hence Infinispan. Infinispan is a whole new project - not just JBoss Cache 4.0! - because it is far wider in scope and goals - not to mention target audience. Here are a few points summarising the differences:
  • JBoss Cache is a clustering library. Infinispan's goal is to be a data grid platform, complete with management and migration tooling.
  • JBoss Cache's focus has been on clustering, using replication. This has allowed it to scale to several tens of nodes (occasionally even over 100). Infinispan's goals are far greater - to scale to grids of several hundred nodes, eventually exceeding thousands of nodes. This is achieved using consistent hash based data distribution.
  • Infinispan's data structure design is significantly different to that of JBoss Cache. This is to help achieve the target CPU and memory performance. Internally, data is stored in a flat, map-like container rather than a tree. That said, a tree-like compatibility layer - implemented on top of the flat container - is provided to aid migration from JBoss Cache.
  • JBoss Cache traditionally competed against other frameworks like EHCache and Terracotta. Infinispan, on the other hand, goes head to head against Oracle's Coherence, Gemfire and Gigaspaces.
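
For readers unfamiliar with consistent hash based distribution, mentioned in the list above, the following is a minimal sketch of the general technique. It is not Infinispan's implementation: the class name, the use of virtual nodes and the FNV-1a hash are all assumptions made for illustration. Nodes are placed at points on a ring, a key belongs to the first node point at or after the key's hash, and removing a node only moves the keys that node owned.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

public class ConsistentHashSketch {

    // Ring position -> node name, kept sorted so the next point is cheap to find.
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashSketch(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    // 32-bit FNV-1a hash; chosen only because it is short and deterministic.
    static int hash(String s) {
        int h = 0x811c9dc5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Each node is placed at several points on the ring to even out the spread.
    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            // Two-argument remove: only drop the point if this node still holds it.
            ring.remove(hash(node + "#" + i), node);
        }
    }

    // The owner is the first node point at or after the key's hash,
    // wrapping around to the start of the ring if there is none.
    String ownerOf(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes in the ring");
        }
        Map.Entry<Integer, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        ConsistentHashSketch grid = new ConsistentHashSketch(64);
        grid.addNode("node-a");
        grid.addNode("node-b");
        grid.addNode("node-c");
        System.out.println("session-42 is owned by " + grid.ownerOf("session-42"));
    }
}
```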

I have put up some FAQs on the project. A project roadmap is also available, as well as a 5-minute guide to using Infinispan.

Have a look at JIRA or grab the code from our Subversion repository to see where we are with things. If you are interested in participating in Infinispan, be sure to read our community page.

I look forward to your feedback!

Cheers
Manik

31 comments:

  1. Yeah! Finally officially out :)

  2. "Unlike other popular, competing commercial software, Infinispan scales when there are many local threads accessing the grid at the same time"

    Can you please be more specific. Products? How have you determined this and what results do you have to validate this off the cuff statement.

    I am really curious about the "many local threads" issue. Honestly.

    William

  3. I'd love to be more specific but sadly I cannot due to these products not being open source.

    If you want to try it out yourself, a simple benchmark would be to run the cache in LOCAL mode (and any other configurations you'd consider typical such as an eviction strategy, etc) and unleash 200 threads with a mix of reads and writes and micro-bench how each performs.

    The CacheBenchFwk I created for JBoss Cache could be used for this.

  4. "All of the major, viable competitors in the space are not open-source" So you never heard about http://www.hazelcast.com/ (Apache License)?

  5. Alex: I have heard of Hazelcast, but have yet to hear of it in large scale production deployments. Perhaps you could enlighten me. :-)

  6. Nice work - good to see this finally out in the open.

  7. GREAT! I am truly excited by this project, congrats guys :) :) :) I've been waiting for this for quite a bit of time now.

    Oh, and I like the brand :)

  8. Sounds great :)

    Look forward to taking it for a test drive.

  9. Will object searching capabilities be implemented?

  10. Denis, we are discussing that at the moment :)

  11. Denis - the simple answer is, yes - see ISPN-32. As Emmanuel mentioned though, we are still discussing details and designs. If you are interested in being a part of the process, be sure to join the infinispan-dev mail list.

  12. Manik, good luck!
    Conquer the cloud.

  13. How does your solution compare to Ehcache? I am doing research into data mining (NLP based) of Wikipedia content, 22GB+ for the initial sample, and I am currently looking into distributed caches such as Ehcache.

  14. @dfisla

    AFAIK, EHCache isn't distributed. It's just replicated, which means every node has exactly the same stuff that every other node has. So to cache 22GB+ of stuff, you'd need 22GB+ (plus overhead) of heap on each node, if you want to use replication. :-)

  15. How does it compare to Terracotta?

  16. What happens if you want the best of both worlds: in-process caching as well as distributed caching (e.g. Ehcache)? Is this possible?

  17. @Hucmuc Not sure I understand what you mean. Infinispan is (optionally) distributed in that it distributes state across a cluster. And it also runs in your VM. Your local in-VM cache is a member of the distribution network.

    AFAIK, EHCache only supports replication, which maintains a copy on every instance. EHCache is in-VM as well, although they do have a server module for remote connections.

    Infinispan has a server module on its roadmap as well.

  18. @popo apples to oranges. Terracotta is not peer-to-peer. Its central server approach provides different scalability characteristics. Also, in API terms, they're completely different - Terracotta is (AFAIK) XML-heavy, employing bytecode manipulation and JVM hooks to intercept objects. Infinispan is more explicit - put an object in a cache, and it will be available everywhere in your grid. Take it out (or let it expire), and it won't be available. Simple.

  19. The way Ehcache works is that an app can cache locally for reads. Not all apps need to have the same entity in the in-memory cache. If an entity is updated it can propagate an invalidate message (not necessarily an object update) to other nodes. What's the advantage of this? Each app can cache locally without requiring a "central" server to get the entity. This is always faster than interprocess communication.

  20. This comment has been removed by the author.

  21. @Hucmuc Infinispan supports 3 clustered modes - replication, invalidation and distribution. Invalidation does pretty much what you describe above. And none of the 3 modes has anything to do with a central server. Infinispan is a p2p system.

  22. Infinispan also has a local mode, which works like a non-replicated ehcache.

    Further, the replication and invalidation modes always do reads locally in-vm (writes have to update or invalidate peers). In the distributed case the request is sent to whoever owns the data, however, you can front the distributed cache with a local L1 cache.

    So basically you can do whatever you want.

  23. Good news - this product has the potential to make a significant shift in the market. Keep up the good work!!!

    One question - are you planning to add a pojo layer (similar to PojoCache) on top of Infinispan?

  24. @kostaa

    Yes, but by employing a different approach to the problem. See http://www.jboss.org/community/wiki/NewFineGrainedReplicationAPIDesign for details.

  25. This is great news. Congratulations.

  26. I am very excited about this. Terracotta has always been worthless for my needs because it has no JTA semantics. I'm really looking forward to this!

  27. This comment has been removed by the author.

  28. I just found this blog and Infinispan sounds quite interesting. Is it also possible to use it to cache a database without writing a new framework for synchronising the cache and the database? For sure I could initialize the cache with the whole database (although it is not necessary), but at some point I have to write the cache back into the database. Or should I use the cache as a kind of database?
