[Repost] Garbage Collection in Java


Heap Overview

 

This is the first in a series of posts about Garbage Collection (GC). I hope to be able to cover a bit of theory and all the major collectors in the hotspot [1] virtual machine over the course of the series. This post just explains what garbage collection is and elements common to different collectors.

Why should I care?

Your Java virtual machine manages memory for you - which is highly convenient - but it might not be optimally tuned by default. By understanding some of the theory behind garbage collection you can more easily tune your collector. A common concern is collector efficiency, that is to say how much time your program spends executing program code rather than collecting garbage. Another common concern is how long the application pauses for.

There's also a lot of hearsay and folklore out there about garbage collection and so understanding the algorithms in a bit more detail really helps avoid falling into common pitfalls and traps. Besides - for anyone interested in how computer science principles are applied and used, JVM internals are a great thing to look at.

What does stop-the-world mean?

Your program (or mutator in GC-Speak) will be allocating objects as it runs. At some point your heap needs to be collected and all of the collectors in hotspot pause your application. The term 'stop-the-world' is used to mean that all of the mutator's threads are paused.

It's possible to implement a garbage collector that doesn't need to pause. Azul have implemented an effectively pauseless collector in their Zing virtual machine. I won't be covering how it works, but there's a really interesting whitepaper if you want to know more.

The Young/Weak Generational Hypothesis

Simply stated: Most allocated objects die young [2]. This concept was demonstrated by empirically analysing the memory allocation and liveness patterns of a large group of programs during the 1980s. What researchers found was that not only do most objects die young, but once they live past a certain age they tend to live for a long time. The graph below is taken from a SUN/Oracle study looking at the lifespan of objects as a histogram.



 

How is the heap organised?

The young generational hypothesis has given rise to the idea of generational garbage collection, in which the heap is split up into several regions and the placement of objects within each region corresponds to their age. One element that is common to these garbage collectors (other than G1 [3]) is the way that the heap is organised into different spaces.



 

When objects are initially allocated, if they fit, they are stored in the Eden space. If the object survives a collection then it ends up in a survivor space. If it survives a few times (your tenuring threshold) then the object ends up in the tenured space. The specifics of the algorithms for collecting these spaces differ by collector, so I'll be covering them separately in a future blog post.

This split is beneficial because it allows you to use different algorithms on different spaces. Some GC algorithms are more efficient if most of your objects are dead and some are more efficient if most of your objects are alive. Due to the generational hypothesis usually when it comes time to collect most objects in Eden and survivor spaces are dead, and most objects in tenured are alive.

There is also the permgen - or permanent generation. This is a special generation that holds objects that are related to the Java language itself. For example information about loaded classes is held here. Historically Strings that were interned or were constants were also held here. The permanent generation is being removed in favour of metaspace.

Multiple Collectors

The hotspot virtual machine actually has a variety of different Garbage Collectors. Each has a different set of performance characteristics and is more (or less) suited for different tasks. The key Garbage Collectors that I'll be looking at are:

  • Parallel Scavenge (PS): the default collector in recently released JVMs. This stops the world in order to collect, but collects in parallel (i.e. using multiple threads).
  • Concurrent Mark Sweep (CMS): this collector has several phases, some of which stop the world, but runs concurrently with the program for several of its phases as well.
  • Incremental Concurrent Mark Sweep (iCMS): a variant of CMS designed for lower pauses. It sometimes achieves this!
  • Garbage First (G1): a newish collector that's recently become more stable and is in slowly increasing usage.
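For reference, each of these collectors can be requested explicitly on the command line. A minimal sketch - the flag names are the standard hotspot ones of the Java 7/8 era, and MyApp.jar is just a placeholder:

    java -XX:+UseParallelOldGC -jar MyApp.jar                              # Parallel Scavenge + parallel tenured collection
    java -XX:+UseConcMarkSweepGC -jar MyApp.jar                            # CMS
    java -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -jar MyApp.jar    # iCMS
    java -XX:+UseG1GC -jar MyApp.jar                                       # G1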

Conclusions

I've given a few introductory points of thought about garbage collection; in the next post I'll be covering the Parallel Scavenge collector - which is currently the default collector. I'd also like to provide a link to my employer, who have a GC log analyser which we think is pretty useful.

  1. "hotspot" is the name given to the common codebase behind OpenJDK and the official Oracle JVM. As of Java 7, OpenJDK is the reference implementation for Java SE.
  2. Technically what I described above is the 'weak generational hypothesis', which has empirical validation. There's also a strong variant, which can be stated as: the mean lifetime of a heap-allocated object is equal to the mean amount of reachable storage. This is actually mathematically provable by taking Little's Law, L = λW (mean reachable storage = allocation rate × mean lifetime), and setting the rate λ to 1. Simple proof!
  3. I'll cover the way the heap is organised within G1 in a G1-specific blog post.

Parallel GC

 

Parallel Scavenge

Today we cover how Parallel GC works. Specifically this is the combination of running the Parallel Scavenge collector over Eden and the Parallel Mark and Sweep collector over the tenured generation. You can get this option by passing in -XX:+UseParallelOldGC, though it's the default on certain machine types.

You may want to read my first blog post on Garbage Collection if you haven't already, since it gives a general overview.

Eden and Survivor Spaces

In the parallel scavenge collector the Eden and survivor spaces are collected using an approach known as Hemispheric GC. Objects are initially allocated in Eden; once Eden is close to full [1] a GC of the Eden space is triggered. This identifies live objects and copies them to the active Survivor Space [2]. It then treats the whole Eden space as a free, contiguous block of memory which it can allocate into again.

In this case the allocation process ends up being like cutting a piece of cheddar. Each chunk gets sliced off contiguously and the slice next to it is the next to be 'eaten'. This has the upside that allocation merely requires pointer addition.



 

A slab of cheddar, ready to be allocated.

In order to identify live objects a search of the object graph is undertaken. The search starts from a set of 'root' objects, which are objects that are guaranteed to be live - for example every thread is a root object. The search then finds objects which are pointed to by the root set, and expands outwards until it has found all live objects. Here's a really nice pictorial representation, courtesy of Michael Triana.



 

Parallel in the context of parallel scavenge means the collection is done by multiple threads running at the same time. This shouldn't be confused with ConcurrentGC, where the collector runs at the same time as, or interleaved with, the program. Parallel collection improves overall GC throughput by better using modern multicore CPUs. The parallelism is achieved by giving each thread a set of the roots to mark and a segment of the table of objects.
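As a concrete illustration of that trace, here is a minimal, single-threaded sketch in Java. The HeapObject type is hypothetical - the real collector works over raw object tables and splits the work across threads as described above:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    interface HeapObject {
        List<HeapObject> references(); // outgoing references held by this object
    }

    class LivenessTrace {
        // Returns the set of objects reachable from the GC roots.
        static Set<HeapObject> markLive(List<HeapObject> roots) {
            Set<HeapObject> live = new HashSet<>();
            Deque<HeapObject> pending = new ArrayDeque<>(roots);
            while (!pending.isEmpty()) {
                HeapObject current = pending.pop();
                if (live.add(current)) {                  // first time we've reached this object
                    pending.addAll(current.references()); // expand the search outwards
                }
            }
            return live; // anything not in this set is garbage
        }
    }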

There are two survivor spaces, but only one of them is active at any point in time. They are collected in the same way as Eden. The idea is that objects get copied into the active survivor space when they are promoted from Eden. Then when it's time to evacuate the space they are copied into the inactive survivor space. Once the active survivor space is completely evacuated the inactive space becomes active, and the active space becomes inactive. This is achieved by flipping the pointer to the beginning of the survivor space and means that all the dead objects in the survivor space can be freed at the cost of assigning to a single pointer.
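A minimal sketch of that flip, assuming a hypothetical Space type whose allocation pointer can be reset - the point being that freeing every dead object in the evacuated space costs only a couple of pointer assignments:

    class SurvivorPair {
        private Space active;   // survivors are copied into this space
        private Space inactive; // empty, waiting to receive the next evacuation

        SurvivorPair(Space a, Space b) { this.active = a; this.inactive = b; }

        // Called once all live objects have been copied out of the active space.
        void flip() {
            Space evacuated = active;
            active = inactive;                 // the empty space becomes the new copy target
            inactive = evacuated;
            inactive.resetAllocationPointer(); // everything left behind is now free
        }
    }

    class Space {
        private int top = 0; // offset of the next free byte within this space
        void resetAllocationPointer() { top = 0; }
    }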

Young Gen design and time tradeoffs

Since this involves only copying live objects and pointer changes, the time taken to collect the Eden and survivor spaces is proportional to the number of live objects. This is quite important since, due to the generational hypothesis, we know that most objects die young, and there's consequently no GC cost to freeing the memory associated with them.

The design of the survivor spaces is motivated by the idea that collecting objects when they are young is cheaper than doing a collection of the tenured space. Having objects continue to be collected in a hemispheric fashion for a few GC runs is helpful to the overall throughput.

Finally, the fact that Eden is organised into a single contiguous space makes object allocation cheap. A C program might back onto the 'malloc' function in order to allocate a block of memory, which involves traversing a list of free spaces in memory trying to find something that's big enough. When you use an arena allocator and allocate consecutively, all you need to do is check there is enough free space and then increment a pointer by the size of the object.
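Here is a minimal sketch of that bump-pointer (arena) allocation, with Eden modelled as a plain byte array. The names are hypothetical; the shape of the fast path is the point:

    class EdenSpace {
        private final byte[] memory;
        private int top = 0; // offset of the next free byte

        EdenSpace(int sizeInBytes) { this.memory = new byte[sizeInBytes]; }

        // Returns the start offset of the new object, or -1 if Eden is too full -
        // in which case the real JVM would trigger a young collection instead.
        int allocate(int objectSize) {
            if (top + objectSize > memory.length) {
                return -1;          // not enough contiguous space left
            }
            int objectStart = top;
            top += objectSize;      // allocation is just a pointer bump
            return objectStart;
        }
    }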

Parallel Mark and Sweep

Objects that have survived a certain number of collections make their way into the tenured space. The number of times that they need to survive is referred to as the 'tenuring threshold'. Tenured Collections work somewhat differently to Eden, using an algorithm called mark and sweep. Each object has a mark bit associated with it. The marks are initially all set to false and as the object is reached during the graph search they're set to true.

The graph search that identifies live objects is similar to the search described for the young generation. The difference is that instead of copying live objects, it simply marks them. After this it can go through the object table and free any object that isn't live. This process is done in parallel by several threads, each searching a region of the heap.
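A minimal sketch of the sweep over a hypothetical object table, assuming the mark bits were set by a trace like the one described earlier. Freed memory simply goes back onto a free list rather than being zeroed:

    import java.util.List;

    interface TenuredObject {
        boolean isMarked(); // set by the marking trace if the object is live
        void clearMark();   // reset ready for the next collection cycle
        long address();
        long size();
    }

    interface FreeList {
        void add(long address, long size); // make this range available for allocation again
    }

    class Sweeper {
        static void sweep(List<TenuredObject> objectTable, FreeList freeList) {
            for (TenuredObject obj : objectTable) {
                if (obj.isMarked()) {
                    obj.clearMark();                         // live: keep it, reset its mark bit
                } else {
                    freeList.add(obj.address(), obj.size()); // dead: return its memory
                }
            }
        }
    }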

Unfortunately this process of deleting objects that aren't live leaves the tenured space looking like Swiss Cheese. You get some used memory where objects live, and gaps in between where objects used to live. This kind of fragmentation isn't helpful for application performance because it makes it impossible to allocate objects that are bigger than the size of the holes.



 

Cheese after Mark and Sweep.

In order to reduce the Swiss Cheese problem the Parallel Mark/Sweep compacts the heap down to try and make live objects contiguously allocated at the start of the tenured space. After deletion it searches areas of the tenured space in order to identify which have low occupancy and which have high occupancy. The live objects from lower occupancy regions are moved down towards regions that have higher occupancy, which are naturally at the lower end of memory from the previous compacting phase. The moving of objects in this phase is actually performed by the thread allocated to the destination region, rather than the source region.



 

Low occupancy cheese.

Summary

  • Parallel Scavenge splits the heap up into four spaces: Eden, two survivor spaces and tenured.
  • Parallel Scavenge uses a parallel, copying collector to collect the Eden and survivor spaces.
  • A different algorithm is used for the tenured space. This marks all live objects, deletes the dead objects and then compacts the space.
  • Parallel Scavenge has good throughput but it pauses the whole program when it runs.

In part three I'll look at how the CMS, or Concurrent-Mark-Sweep, collector works. Hopefully this post will be easier for those with dairy allergies to read.

  1. Technically there is an 'occupancy threshold' for each heap space - which defines how full the space is allowed to get before collection occurs.
  2. This copying algorithm is based on Cheney's algorithm
 

Concurrent Mark Sweep

This follows on from my previous two garbage collection blog posts:

  1. Overview of GC in Hotspot.
  2. Parallel Garbage Collectors.

Concurrent Mark Sweep

The parallel garbage collectors in Hotspot are designed to minimise the total amount of time that the application spends undertaking garbage collection - in other words, to maximise throughput. This isn't an appropriate tradeoff for all applications - some require individual pauses to be short as well, which is known as a latency requirement.

The Concurrent Mark Sweep (CMS) collector is designed to be a lower latency collector than the parallel collectors. The key part of this design is trying to do part of the garbage collection at the same time as the application is running. This means that when the collector needs to pause the application's execution it doesn't need to pause for as long.

At this point you're probably thinking 'don't parallel and concurrent mean something fairly similar?' Well, in the context of GC, parallel means "uses multiple threads to perform GC at the same time" and concurrent means "the GC runs at the same time as the application is executing".

Young Generational Collection

The young gen collector in CMS is called ParNew and it actually uses the same basic algorithm as the Parallel Scavenge collector in the parallel collectors, which I described previously.

This is still a different collector from Parallel Scavenge in terms of the hotspot codebase though, because it needs to interleave its execution with the rest of CMS, and it also implements a different internal API to Parallel Scavenge. Parallel Scavenge makes assumptions about which tenured collectors it works with - specifically ParOld and SerialOld. Bear in mind this also means that the young generational collector is stop-the-world.

Tenured Collection

As with the ParOld collector, the CMS tenured collector uses a mark and sweep algorithm, in which live objects are marked and then dead objects are deleted. 'Deleted' is really a strange term when it comes to memory management. The collector isn't actually deleting objects in the sense of blanking memory; it's merely returning the memory associated with that object to the space that the memory system can allocate from - the freelist. Even though it's termed a concurrent mark and sweep collector, not all phases run concurrently with the application's execution: two of them stop the world and four run concurrently.

How is GC triggered?

In ParOld, garbage collection is triggered when you run out of space in the tenured heap. This approach works because ParOld simply pauses the application to collect. In order for the application to continue operating during a tenured collection, the CMS collector needs to start collecting while there is still enough working space left in tenured.

So CMS starts based upon how full tenured is - the idea is that the amount of free space left is your window of opportunity to run GC. This is known as the initiating occupancy fraction and is described in terms of how full the heap is, so a fraction of 0.7 gives you a window of 30% of your heap in which to run the CMS GC before you run out of heap.
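That threshold can be set explicitly. A minimal sketch using real hotspot flags but an illustrative value - here CMS starts once tenured is 70% full, leaving a 30% window:

    # Without the 'Only' flag, hotspot may replace the threshold with its own
    # heuristic after the first CMS cycle.
    java -XX:+UseConcMarkSweepGC \
         -XX:CMSInitiatingOccupancyFraction=70 \
         -XX:+UseCMSInitiatingOccupancyOnly \
         -jar MyApp.jar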

Phases

Once the GC is triggered, the CMS algorithm consists of a series of phases run in sequence.

  1. Initial Mark - Pauses all application threads and marks all objects directly reachable from root objects as live. This phase stops the world.
  2. Concurrent Mark - Application threads are restarted. All live objects are transitively marked as reachable by following references from the objects marked in the initial mark.
  3. Concurrent Preclean - This phase looks at objects which have been updated or promoted during the concurrent mark or new objects that have been allocated during the concurrent mark. It updates the mark bit to denote whether these objects are live or dead. This phase may be run repeatedly until there is a specified occupancy ratio in Eden.
  4. Remark - Since some objects may have been updated during the preclean phase, it's still necessary to stop the world in order to process the residual objects. This phase does a retrace from the roots. It also processes reference objects, such as soft and weak references. This phase stops the world.
  5. Concurrent Sweep - This looks through the Ordinary Object Pointer (OOP) Table, which references all objects in the heap, and finds the dead objects. It then re-adds the memory allocated to those objects to its freelist. This is the list of spaces from which an object can be allocated.
  6. Concurrent Reset - Reset all internal data structures in order to be able to run CMS again in future.

Theoretically the objects marked during the preclean phase would get looked at during the next phase - remark - but the remark phase is stop the world, so the preclean phase exists to try and reduce remark pauses by doing part of the remark work concurrently. When CMS was originally added to HotSpot this phase didn't exist at all. It was added in Java 1.5 in order to address the scenario where a young generation scavenging collection causes a pause and is immediately followed by a remark, which also causes a pause - the two combining into one longer, more painful pause. This is why remarks are triggered by an occupancy threshold in Eden - the goal is to schedule the remark phase roughly halfway between young gen pauses.

The remark phase also pauses, whilst the preclean doesn't, which means that having precleans reduces the amount of time spent paused in GC.

Concurrent Mode Failures

Sometimes CMS is unable to meet the needs of the application and a stop-the-world Full GC needs to be run. This is called a concurrent mode failure, and usually results in a long pause. A concurrent mode failure happens when there isn't enough space in tenured to promote an object. There are two causes for this:

  • An object is promoted that is too large to fit into any contiguous space in memory.
  • There isn't enough space in tenured to account for the rate of live objects being promoted.

This might happen because the concurrent collection is unable to free space fast enough given the object promotion rates or because the continued use of the CMS collector has resulted in a fragmented heap and there's no individual space large enough to promote an object into. In order to properly 'defrag' the tenured heap space a full GC is required.

Permgen

CMS doesn't collect permgen spaces by default, and requires the -XX:+CMSClassUnloadingEnabled flag to be enabled in order to do so. If, whilst using CMS, you run out of permgen space without this flag switched on, it will trigger a Full GC. Furthermore permgen space can hold references into the normal heap via things like classloaders, which means that until you collect permgen you may be leaking memory in the regular heap. In Java 7 String constants from class files are also allocated in the regular heap, instead of permgen, which reduces permgen consumption, but also adds to the set of object references coming into the regular heap from permgen.

Floating Garbage

At the end of a CMS collection it's possible for some objects to not have been deleted - this is called Floating Garbage. This happens when objects become unreachable after the initial mark. The concurrent preclean and the remark phase ensure that all live objects are marked by looking at objects which have been created, mutated or promoted. If an object has become unreferenced between the initial mark and the remark phase, then finding it would require a complete retrace of the entire object graph in order to identify all dead objects. This is obviously very expensive, and the remark phase must be kept short since it's a pausing phase.

This isn't necessarily a problem for users of CMS since the next run of the CMS collector will clean up this garbage.

Summary

Concurrent Mark and Sweep reduces the pause times observed in the parallel collector by performing some of the GC work at the same time as the application runs. It doesn't entirely remove the pauses, since part of its algorithm needs to pause the application in order to execute.

It took me a little longer than I had hoped to get round to writing this blog post.

 

G1: Garbage First

The G1 collector is the latest collector to be implemented in the hotspot JVM. It's been a supported collector ever since Java 7 Update 4. It's also been publicly stated by the Oracle GC Team that their hope for low pause GC is a fully realised G1. This post follows on from my previous garbage collection blog posts:

  1. Overview of GC in Hotspot.
  2. Parallel Garbage Collectors.
  3. Concurrent Mark Sweep.

The Problem: Large heaps mean Large Pause Times

The Concurrent Mark and Sweep (CMS) collector is the currently recommended low pause collector, but unfortunately its pause times scale with the amount of live objects in its tenured region. This means that whilst it's relatively easy to get short GC pauses with smaller heaps, once you start using heaps in the tens or hundreds of gigabytes the times start to ramp up.

CMS also doesn't "defrag" its heap, so at some point in time you'll get a concurrent mode failure (CMF), triggering a full GC. Once you get into this full GC scenario you can expect a pause in the timeframe of roughly 1 second per gigabyte of live objects. With CMS your 100GB heap can be a 1.5 minute GC pause ticking time bomb waiting to happen ...



 

Good GC Tuning can address this problem, but sometimes it just pushes the problem down the road. A Concurrent Mode Failure and therefore a Full GC is inevitable on a long enough timeline unless you're in the tiny niche of people who deliberately avoid filling their tenured space.

G1 Heap Layout

The G1 Collector tries to separate the pause time of an individual collection from the overall size of the heap by splitting up the heap into different regions. Each region is of a fixed size, between 1MB and 32MB, and the JVM aims to create about 2000 regions in total.
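To put rough numbers on that (illustrative arithmetic only): a 16GB heap split into roughly 2000 regions works out at about 8MB per region, and the chosen size is rounded to a power of two within the 1MB-32MB range. If you prefer, the region size can be fixed explicitly with the -XX:G1HeapRegionSize flag.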


You may recall from previous articles that the other collectors split the heap up into Eden, Survivor Space and Tenured memory pools. G1 retains the same categories of pools but instead of these being contiguous blocks of memory, each region is logically categorised into one of these pools.

There is also another type of region - the humongous region. These are designed to store objects which are bigger in size than most objects - for example a very long array. Any object which is bigger than 50% of the size of a region is stored in a humongous region. They work by taking multiple normal regions which are contiguously located in memory and treating them as a single logical region.




 
 

Remembered Sets

Of course there's little point in splitting the heap into regions if you are going to have to scan the entire heap to figure out which objects are marked as live. The first step in avoiding that is breaking regions down into 512 byte segments called cards. Each card has a 1 byte entry in the card marking table.

Each region has an associated remembered set or RSet - which is the set of cards that have been written to. A card is in the remembered set if an object from another region stored within the card points to an object within this region.

Whenever the mutator writes to an object reference, a write barrier is used to update the remembered set. Under the hood the remembered set is split up into different collections so that different threads can operate without contention, but conceptually all collections are part of the same remembered set.
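A minimal sketch of that card-marking write barrier, with the card table modelled as one byte of state per 512 byte card. The addresses and layout here are hypothetical simplifications:

    class CardTable {
        private static final int CARD_SHIFT = 9; // 2^9 = 512 bytes per card
        private final byte[] cards;              // one byte of state per card

        CardTable(long heapSizeInBytes) {
            this.cards = new byte[(int) (heapSizeInBytes >>> CARD_SHIFT)];
        }

        // Called by the write barrier after the mutator stores a reference into the
        // field at fieldOffset (an offset from the start of the heap). The containing
        // card is recorded as dirty so the owning region's remembered set can be updated.
        void markDirty(long fieldOffset) {
            cards[(int) (fieldOffset >>> CARD_SHIFT)] = 1;
        }
    }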

Concurrent Marking

In order to identify which heap objects are live G1 performs a mostly concurrent mark of live objects.

  • Marking Phase - The goal of the marking phase is to figure out which objects within the heap are live. In order to store which objects are live, G1 uses a marking bitmap, which stores a single bit for every 64 bits of heap. All objects are traced from their roots, marking areas with live objects in the marking bitmap. This is mostly concurrent, but there is an Initial Marking Pause, similar to CMS, where the application is paused and the first level of children from the root objects are traced. After this completes the mutator threads restart. G1 needs to keep an up to date understanding of what is live in the heap, since the heap isn't being cleaned up in the same pause as the marking phase.
  • Remarking Phase - The goal of the remarking phase is to bring the information from the marking phase about live objects up to date. The first thing to decide is when to remark. It's triggered by a percentage of the heap being full, which is calculated from information gathered in the marking phase plus the number of allocations since then; this tells G1 whether it's over the required percentage. G1 uses the aforementioned write barrier to take note of changes to the heap and store them in a series of change buffers. The objects in the change buffers are marked in the marking bitmap concurrently. When the fill percentage is reached the mutator threads are paused again and the change buffers are processed, marking the objects in them as live.
  • Cleanup Phase - At this point G1 knows which objects are live. Since G1 focusses on regions which have the most free space available, its next step is to work out the free space in a given region by counting the live objects. This is calculated from the marking bitmap, and regions are sorted according to which are most likely to be beneficial to collect (see the sketch after this list). Regions which are to be collected are stored in what's known as a collection set or CSet.
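A minimal sketch of that ordering step, assuming a hypothetical Region type that knows how many live bytes the marking found in it. Regions with the least live data are the cheapest to evacuate and reclaim the most space:

    import java.util.Comparator;
    import java.util.List;

    interface Region {
        long liveBytes(); // live data in this region, counted from the marking bitmap
    }

    class CollectionSetChooser {
        // Orders candidate regions so the most profitable ones to collect come first.
        static void orderGarbageFirst(List<Region> candidates) {
            candidates.sort(Comparator.comparingLong(Region::liveBytes));
        }
    }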

Evacuation

Similar to the approach taken by the hemispheric young generation in the Parallel GC and CMS collectors, dead objects aren't individually collected. Instead live objects get evacuated from a region and the entire region is then considered free.

G1 is intelligent about how it evacuates live objects - it doesn't try to evacuate every live object in a given cycle. It targets regions which are likely to reclaim as much space as possible and only evacuates those. It works out its target regions by calculating the proportion of live objects within each region and picking the regions with the lowest proportion of live objects.

Objects are evacuated into free regions, from multiple other regions. This means that G1 compacts the data when performing GC. This is operated on in parallel by multiple threads. The traditional 'Parallel GC' does this but CMS doesn't.

Similar to CMS and Parallel GC there is a concept of tenuring. That is to say that young objects become 'old' if they survive enough collections; this number is called the tenuring threshold. If a young generational region survives to the tenuring threshold and retains enough live objects to avoid being evacuated, then the region is promoted - first to being a survivor region and eventually a tenured region. It is never evacuated.

Evacuation Failure

Unfortunately, G1 can still encounter a scenario similar to a concurrent mode failure, in which it falls back to a stop-the-world Full GC. This is called an evacuation failure and happens when there aren't any free regions - no free regions means nowhere to evacuate objects to.

Theoretically evacuation failures are less likely to happen in G1 than Concurrent Mode Failures are in CMS. This is because G1 compacts its regions on the fly rather than just waiting for a failure for compaction to occur.

Conclusions

Despite the compaction and efforts at low pauses G1 isn't a guaranteed win and any attempt to adopt it should be accompanied by objective and measurable performance targets and GC Log analysis. The methodology required is out of the scope of this blog post, but hopefully I will cover it in a future post.

Algorithmically there are overheads that G1 encounters that other Hotspot collectors don't. Notably the cost of maintaining remembered sets. Parallel GC is still the recommended throughput collector, and in many circumstances CMS copes better than G1.

It's too early to tell if G1 will be a big win over the CMS collector, but in some situations it's already providing benefits for developers who use it. Over time we'll see if the performance limitations of G1 are really G1 limits or whether the development team just needs more engineering effort to solve the problems that are there.

Thanks to John Oliver, Tim Monks and Martijn Verburg for reviewing drafts of this and previous GC articles.
