Tuesday, April 19, 2011

Building High-Speed Cache


Recently, while developing a caching tool capable of providing access to millions of records in sub-second time, we analysed some of the existing best practices.

(1) One choice is the MySQL native memory cache, which avoids JVM GC pauses entirely.
The MySQL InnoDB storage engine's buffer pool provides blazing-fast data access. On demand, data blocks are read from disk and stored in memory using LRU chains or other advanced mechanisms (like MySQL's midpoint insertion strategy).

When should we use the MySQL query cache?
      Identical queries are issued by the same or multiple clients on a repetitive basis.
      The underlying data being accessed is static or semi-static in nature.
      Queries have the potential to be resource-intensive and/or build small but complexly computed result sets.
Data warehouse/business intelligence situations, web-based applications, and traditional OLTP systems all qualify on the surface. The query cache only matches byte-for-byte identical queries (spacing and upper/lower case all come into play), hence the first point above, as the sketch below illustrates.
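For instance, a minimal JDBC sketch of that matching rule (the connection URL, table, and credentials below are hypothetical, and it assumes the server has the query cache enabled via query_cache_type):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class QueryCacheDemo {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/shop", "user", "pass"); // hypothetical DSN
        Statement stmt = conn.createStatement();

        // Byte-for-byte identical text: the second execution can be answered
        // straight from the query cache without touching the storage engine.
        String sql = "SELECT id, name FROM products WHERE category = 'books'";
        runQuery(stmt, sql);
        runQuery(stmt, sql);

        // Differs only in case, so it is cached as a separate entry.
        runQuery(stmt, "select id, name from products where category = 'books'");

        stmt.close();
        conn.close();
    }

    private static void runQuery(Statement stmt, String sql) throws Exception {
        ResultSet rs = stmt.executeQuery(sql);
        while (rs.next()) { /* consume rows */ }
        rs.close();
    }
}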

For the internals of the InnoDB cache, refer to the MySQL tutorial.
In general, the MySQL memory cache performs best when the majority of your database queries are simply SELECT statements.

Here the speed gain comes from the absence of JVM GC pauses and from serving data blocks out of native memory rather than disk. This caching strategy is popular for eCommerce and bidding platforms where data queries are far more frequent than data updates.

(2) An alternative solution is Memcached, which lives in a separate process and manages its own garbage collection (optimized for caching). To look up the value for a key, Memcached clients issue multiple parallel requests to the memcached servers across which the keys are sharded.

The value items returned by Memcached are directly usable objects, while rows returned from the MySQL cache must first be converted into objects.

As queries get more complex, Memcached delivers increasingly better relative performance.
If an application (using the MySQL cache) is load-balanced over multiple servers, then each web process will maintain its own duplicate cache, unless specifically customized and optimized otherwise.

But the advantage of using Memcached is that it creates a single global cache available to all web processes. Individual web processes and threads do not maintain their own caches in this case, but rather use the single global cache. When one web process puts data in the cache, it becomes available for all others to access. It leverages fast network connections.
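As a sketch of what a client looks like, assuming the spymemcached Java client (the server host names here are hypothetical):

import java.net.InetSocketAddress;
import java.util.Arrays;
import net.spy.memcached.MemcachedClient;

public class MemcachedDemo {
    public static void main(String[] args) throws Exception {
        // One client, several servers: keys are hashed (sharded) across them,
        // so every web process talks to the same single global cache.
        MemcachedClient client = new MemcachedClient(Arrays.asList(
                new InetSocketAddress("cache1.example.com", 11211),
                new InetSocketAddress("cache2.example.com", 11211)));

        client.set("user:42", 3600, "serialized-user-record"); // TTL in seconds
        Object value = client.get("user:42"); // comes back as a ready-to-use object
        System.out.println(value);
        client.shutdown();
    }
}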

The detailed architecture of Memcached is out of the scope of this article.

The memcached server and clients work together to implement one global cache across as many machines as you have. In fact, it's recommended you run both web nodes (which are typically memory-lite and CPU-hungry)
and memcached processes (which are memory-hungry and CPU-lite) on the same machines.

For a frequently changing data set, Memcached is surely better than the DB cache, as the bulk of the data is read from memory rather than disk. But a point to note: it still suffers from GC pauses, however minimal they may be!

(3) Now let's see how Ehcache (leveraging Terracotta BigMemory) has really got the basics right!
By adopting a very well-known disk I/O pattern, accessing native memory directly from Java, it eliminates the GC factor altogether and proves to be the fastest caching option!

Now the choice is ours: (a) adopt a caching strategy that requires building a bigger computer with huge amounts of RAM, or (b) use the faster cache on a machine with less RAM.

The bottom line is that the cache must live outside the GC's control. No GC mechanism available today (unless we leverage JDK 7's 'Garbage-First' collector) provides relief from 'stop-the-world' pauses.

The trick is to implement a smart serialization technique that stores key/value pairs in a direct ByteBuffer.

Direct ByteBuffers (allocated using the java.nio.ByteBuffer.allocateDirect() method) are backed by native memory rather than the Java heap. Direct ByteBuffers can be passed directly to native OS library functions for performing I/O, making them significantly faster in some scenarios because they avoid copying data between the Java heap and the native heap.

It's easy to become confused about where direct ByteBuffer data is stored. The application still uses an object on the Java heap to orchestrate I/O operations, but the buffer that holds the data lives in native memory; the Java heap object only contains a reference to the native buffer.

 A non-direct ByteBuffer holds its data in a byte[] array on the Java heap.
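A quick illustration of the distinction:

import java.nio.ByteBuffer;

public class BufferPlacement {
    public static void main(String[] args) {
        ByteBuffer heap   = ByteBuffer.allocate(1024);       // byte[] on the Java heap
        ByteBuffer direct = ByteBuffer.allocateDirect(1024); // native memory, off-heap

        System.out.println(heap.isDirect());    // false
        System.out.println(direct.isDirect());  // true
        System.out.println(heap.hasArray());    // true: backed by an accessible byte[]
        System.out.println(direct.hasArray());  // false: no heap array behind it
    }
}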

A small prototype using ByteBuffer really did yield high performance.
Here is sample code for creating a direct-buffer-based cache (of course, it is just for learning, nowhere close to the actual Ehcache implementation).
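(A minimal sketch under simplifying assumptions: String keys and values, one fixed-size buffer, no eviction; all class and method names are illustrative.)

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DirectBufferCache {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Value bytes live here, in native memory outside the GC-managed heap.
    private final ByteBuffer store;
    // Only tiny (offset, length) index entries live on the Java heap.
    private final Map<String, int[]> index = new ConcurrentHashMap<String, int[]>();

    public DirectBufferCache(int capacityBytes) {
        store = ByteBuffer.allocateDirect(capacityBytes);
    }

    public synchronized void put(String key, String value) {
        byte[] bytes = value.getBytes(UTF8);
        if (store.remaining() < bytes.length) {
            throw new IllegalStateException("cache full"); // a real cache would evict
        }
        int offset = store.position();
        store.put(bytes);                        // copy the value into native memory
        index.put(key, new int[] { offset, bytes.length });
    }

    public String get(String key) {
        int[] slot = index.get(key);
        if (slot == null) {
            return null;                         // cache miss
        }
        ByteBuffer view = store.duplicate();     // independent cursor, same memory
        view.position(slot[0]);
        byte[] bytes = new byte[slot[1]];
        view.get(bytes);                         // copy the value back out
        return new String(bytes, UTF8);
    }

    public static void main(String[] args) {
        DirectBufferCache cache = new DirectBufferCache(1024 * 1024); // 1 MB off-heap
        cache.put("user:1", "Alice");
        System.out.println(cache.get("user:1")); // prints Alice
    }
}

Note that the GC never walks the value bytes; they are reclaimed only when an entry is explicitly removed or the whole buffer is released, which is exactly the point the BigMemory quote below makes.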


This is how Terracotta BigMemory got the basics of caching right, building on ByteBuffers and bypassing GC pauses!
“ConcurrentMarkSweep collectors can run for hours w/ no full pauses and then suddenly stop the world for 30 minutes trying to catch up to the mess they have made. .... if (k1, v1) is in cache in Ehcache BigMemory, it has no references to (k2, v2). They cannot refer to each other.

Which means I don't have to walk the cache looking for cross-references in the data like a GC does. No mark-and-sweep. I merely wait for the dev to call cache.remove( element ) or for the evictor to evict stale data. The evictor and the remove() API are signaling my MemoryManager explicitly just like new() and delete() or malloc() and free(). There is no notion of garbage or collection. Only memory re-use and fragmentation over time.”

If we want to enjoy the fruits of the fastest cache, we have to perform some simple steps:

(i) Update the Java classpath to include the new ehcache jar.
(ii) Modify the application's JVM settings to allocate enough direct-buffer memory: -XX:MaxDirectMemorySize caps the memory available to java.nio.ByteBuffer.allocateDirect(int), e.g. java -XX:MaxDirectMemorySize=4G ...
(iii) Update ehcache.xml to set the size of the BigMemory off-heap store, along the lines of
<cache name="bigMemCache" maxElementsInMemory="10000" overflowToOffHeap="true" maxMemoryOffHeap="4G"/>
(a usage sketch follows below).
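To see it in action, here is a minimal usage sketch assuming the Ehcache 2.x API and a cache named "bigMemCache" configured as above (the key and value are illustrative):

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class BigMemoryDemo {
    public static void main(String[] args) {
        // Loads ehcache.xml from the classpath; "bigMemCache" is assumed to be
        // configured there with overflowToOffHeap="true" as shown above.
        CacheManager manager = CacheManager.create();
        Cache cache = manager.getCache("bigMemCache");

        cache.put(new Element("user:42", "serialized-user-record"));
        Element hit = cache.get("user:42"); // may be served from the off-heap store
        if (hit != null) {
            System.out.println(hit.getObjectValue());
        }
        manager.shutdown();
    }
}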

In this link - http://code.google.com/p/lucene-bytebuffer/ - we can see how Lucene byte buffers bypass the memory-intensive, slow Java garbage collector. Objects whose life cycle is known beforehand can be cleaned up effectively through a simple manual process instead of relying on the GC.

The basic advantages of a direct-buffer-based cache (like BigMemory) are:
1. BigMemory moves all cache data out of GC-managed memory into its own ByteBuffer-managed world!
2. No need to tune the GC.
3. Low hardware cost and operational simplicity.

A user can have a single machine with 16 GB of RAM and make use of all that memory with only one JVM. This saves the user the huge effort of maintaining multiple JVMs to create a distributed cache. The user can still go for it and utilize every byte of the RAM without waiting for GC pauses.

On a different note, though anyone can implement a fast cache (as discussed above) using the NIO package in JDK 1.6, if the app involves heavy file handling (index files, web resources, or jars) then for a faster experience we should use NIO.2 in JDK 7.

Prior to the JDK 7 release, the java.io.File class was the mechanism used for file I/O, but it had several drawbacks.
      Many methods didn't throw exceptions when they failed, so it was impossible to obtain a useful error message. For example, if a file deletion failed, the program would only learn that the delete failed but wouldn't know if it was because the file didn't exist, the user didn't have permissions, or there was some other problem (contrast with the NIO.2 sketch after this list).
      The rename method didn't work consistently across platforms.
      There was no real support for symbolic links.
      More support for metadata was desired, such as file permissions, file owner, and other security attributes.
      Accessing file metadata was inefficient.
      Many of the File methods didn't scale. Requesting a large directory listing over a server could result in a hang. Large directories could also cause memory resource problems, resulting in a denial of service.
      It was not possible to write reliable code that could recursively walk a file tree and respond appropriately if there were circular symbolic links.
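For contrast, a small NIO.2 sketch (the path is hypothetical): Files.delete() surfaces the exact reason for a failure as a typed exception instead of a bare boolean:

import java.io.IOException;
import java.nio.file.AccessDeniedException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Nio2DeleteDemo {
    public static void main(String[] args) {
        Path path = Paths.get("/tmp/example.txt"); // hypothetical file
        try {
            Files.delete(path); // NIO.2: failures throw, they don't return false
        } catch (NoSuchFileException e) {
            System.err.println("No such file: " + e.getFile());
        } catch (AccessDeniedException e) {
            System.err.println("No permission: " + e.getFile());
        } catch (IOException e) {
            System.err.println("Delete failed: " + e);
        }
    }
}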

JDK 7 has a great many changes to improve developer productivity and application performance (see http://tech.puredanger.com/java7/ ).

Last but not least, since we are discussing the handling of direct byte buffers, it's worth keeping in mind the recommendations (for an I/O-bound cache) posted at http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly (a combined sketch follows the list):

(a) Minimize I/O operations by reading an array at a time, not a byte at a time. An 8 KB array is a good size.
(b) Minimize method calls by getting data an array at a time, not a byte at a time. Use array indexing to get at bytes in the array.
(c) Minimize thread-synchronization locks if you don't need thread safety. Either make fewer method calls to a thread-safe class, or use non-thread-safe classes like FileChannel and MappedByteBuffer.
(d) Minimize data copying between the JVM/OS, internal buffers, and application arrays. Use FileChannel with memory mapping, or a direct or wrapped-array ByteBuffer.
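Putting (a)-(d) together, a minimal JDK 7 sketch that reads a (hypothetical) file in 8 KB chunks through a FileChannel and a direct ByteBuffer:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class FastFileRead {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(
                Paths.get("data.bin"), StandardOpenOption.READ)) { // hypothetical file
            ByteBuffer buffer = ByteBuffer.allocateDirect(8 * 1024); // (a) 8 KB chunks
            long total = 0;
            while (channel.read(buffer) != -1) { // (d) direct buffer avoids an extra copy
                buffer.flip();
                total += buffer.remaining();     // (b) consume the chunk in bulk
                buffer.clear();
            }
            System.out.println("Read " + total + " bytes");
        }
    }
}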
