Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/2511315/

Method for finding memory leak in large Java heap dumps

Tags: java, methodology, enterprise, legacy-code, memory-leaks

Asked by Rickard von Essen

I have to find a memory leak in a Java application. I have some experience with this but would like advice on a methodology/strategy for this. Any reference and advice is welcome.

About our situation:

  1. Heap dumps are larger than 1 GB.
  2. We have heap dumps from 5 occasions.
  3. We don't have any test case to provoke this. It only happens in the (massive) system test environment after at least a week's usage.
  4. The system is built on an internally developed legacy framework with so many design flaws that it is impossible to count them all.
  5. Nobody understands the framework in depth. It has been transferred to one guy in India who barely keeps up with answering e-mails.
  6. We have taken snapshot heap dumps over time and concluded that there is no single component increasing over time. Everything grows slowly.
  7. The above points us toward the framework's homegrown ORM system, which increases its usage without limits. (This system maps objects to files?! So not really an ORM.)

Question: What is the methodology that helped you succeed in hunting down leaks in an enterprise-scale application?

Accepted answer by Will Hartung

It's almost impossible without some understanding of the underlying code. If you understand the underlying code, then you can better separate the wheat from the chaff among the zillion bits of information you are getting in your heap dumps.

Also, you can't know whether something is a leak without knowing why the class is there in the first place.

I just spent the past couple of weeks doing exactly this, and I used an iterative process.

First, I found the heap profilers basically useless. They can't analyze the enormous heaps efficiently.

Rather, I relied almost solely on jmap histograms.

I imagine you're familiar with these, but for those not:

jmap -histo:live <pid> > dump.out

creates a histogram of the live heap. In a nutshell, it tells you the class names, and how many instances of each class are in the heap.

I was dumping out heap regularly, every 5 minutes, 24hrs a day. That may well be too granular for you, but the gist is the same.
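
A periodic capture like this could be scripted roughly as follows. This is a minimal sketch, not the author's actual setup: the function names, the output directory, and the file naming scheme are assumptions made for illustration, and only the `jmap -histo:live <pid>` invocation comes from the answer.

```python
import datetime
import pathlib
import subprocess
import time


def histogram_command(pid):
    """Build the jmap invocation quoted in the answer for one JVM process id."""
    return ["jmap", "-histo:live", str(pid)]


def snapshot_path(out_dir, when):
    """Name each file after its capture time so snapshots sort chronologically."""
    return pathlib.Path(out_dir) / when.strftime("heap.%Y.%m.%d.%H.%M.%S.txt")


def capture_forever(pid, out_dir, interval_seconds=300):
    """Write one histogram file per interval until interrupted."""
    pathlib.Path(out_dir).mkdir(parents=True, exist_ok=True)
    while True:
        target = snapshot_path(out_dir, datetime.datetime.now())
        with open(target, "w") as out:
            # jmap prints the histogram on stdout; check=False keeps the loop
            # alive if a single capture fails (for example during a restart).
            subprocess.run(histogram_command(pid), stdout=out, check=False)
        time.sleep(interval_seconds)
```

Calling `capture_forever(<pid>, "histograms")` would then produce one timestamped file every 5 minutes. Keep in mind that each `-histo:live` run forces a full GC, so a longer interval may suit a busier system.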

I ran several different analyses on this data.

I wrote a script to take two histograms, and dump out the difference between them. So, if java.lang.String was 10 in the first dump, and 15 in the second, my script would spit out "5 java.lang.String", telling me it went up by 5. If it had gone down, the number would be negative.
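
The author's script is not shown, so the following is only a sketch of what such a diff could look like. The names `parse_histogram` and `diff_histograms` and the sample rows are made up for illustration; the row layout follows the usual `num: #instances #bytes class name` columns of `jmap -histo`.

```python
import re

# Matches histogram rows such as "   1:       4632416      392305928  [C".
ROW = re.compile(r"^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S+)")


def parse_histogram(text):
    """Return {class name: instance count} for one jmap -histo output."""
    counts = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            counts[match.group(3)] = int(match.group(1))
    return counts


def diff_histograms(before, after):
    """Return {class name: change in instance count}, omitting unchanged classes."""
    delta = {}
    for name in set(before) | set(after):
        change = after.get(name, 0) - before.get(name, 0)
        if change != 0:
            delta[name] = change
    return delta


first = parse_histogram("   1:  10  400  java.lang.String\n   2:  7  168  java.lang.Integer\n")
second = parse_histogram("   1:  15  600  java.lang.String\n   2:  4  96  java.lang.Integer\n")
for name, change in sorted(diff_histograms(first, second).items()):
    print(change, name)
```

With the sample data, java.lang.String is reported as 5 and java.lang.Integer as -3, matching the convention described above where a shrinking class gets a negative number.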

I would then take several of these differences, strip out all classes that went down from run to run, and take a union of the result. At the end, I'd have a list of classes that continually grew over a specific time span. Obviously, these are prime candidates for leaking classes.
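
One way to read this step is: keep only the classes whose count rose at least once and never dropped between consecutive snapshots. The sketch below implements that reading; it is an interpretation rather than the author's script, and the class names in the sample data are hypothetical.

```python
def continually_growing(snapshots):
    """Return classes whose count never dropped between consecutive snapshots
    and rose at least once.

    snapshots is a chronological list of {class name: instance count} dicts.
    """
    growing = set()
    dropped = set()
    for before, after in zip(snapshots, snapshots[1:]):
        for name in set(before) | set(after):
            change = after.get(name, 0) - before.get(name, 0)
            if change > 0:
                growing.add(name)
            elif change < 0:
                dropped.add(name)
    return sorted(growing - dropped)


snapshots = [
    {"com.app.Session": 10, "java.lang.String": 500, "com.app.Request": 3},
    {"com.app.Session": 14, "java.lang.String": 450, "com.app.Request": 3},
    {"com.app.Session": 21, "java.lang.String": 620, "com.app.Request": 3},
]
print(continually_growing(snapshots))
```

Here only com.app.Session survives: java.lang.String dropped once and com.app.Request never grew.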

However, some classes have some instances preserved while others are GC'd. The counts for these classes can easily go up and down overall while the class still leaks. So they can fall out of the "always rising" category of classes.

To find these, I converted the data into a time series and loaded it into a database, Postgres specifically. Postgres is handy because it offers statistical aggregate functions, so you can do simple linear regression analysis on the data and find classes that trend up, even if they aren't always on top of the charts. I used the regr_slope function, looking for classes with a positive slope.
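
To make the idea concrete without a database, here is a small sketch that computes an ordinary least-squares slope per class in plain Python. It illustrates the same kind of trend test, not the Postgres implementation; the function names and sample series are assumptions.

```python
def slope(points):
    """Least-squares slope of y over x for a list of (x, y) pairs.

    Returns None when the slope is undefined (fewer than two distinct x values).
    """
    n = len(points)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def trending_up(series):
    """series maps class name -> [(seconds, instance count), ...]."""
    slopes = {name: slope(points) for name, points in series.items()}
    rising = {name: s for name, s in slopes.items() if s is not None and s > 0}
    return sorted(rising.items(), key=lambda item: item[1], reverse=True)


series = {
    # Sawtooth that still climbs: it drops at some samples, so a
    # "never went down" filter would discard it.
    "com.app.CacheEntry": [(0, 100), (300, 160), (600, 120), (900, 200), (1200, 170)],
    # Fluctuates around a constant level.
    "java.lang.String": [(0, 500), (300, 520), (600, 480), (900, 510), (1200, 490)],
}
print(trending_up(series))
```

The hypothetical com.app.CacheEntry series goes up and down but has a positive slope, so it is reported, while the java.lang.String series does not trend upward and is filtered out.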

I found this process very successful, and really efficient. The histogram files aren't insanely large, and it was easy to download them from the hosts. They weren't super expensive to run on the production system (they do force a large GC, and may block the VM for a bit). I was running this on a system with a 2G Java heap.

Now, all this can do is identify potentially leaking classes.

This is where understanding how the classes are used, and whether they should or should not be there, comes into play.

For example, you may find that you have a lot of Map.Entry classes, or some other system class.

Unless you're simply caching String, the fact is these system classes, while perhaps the "offenders", are not the "problem". If you're caching some application class, THAT class is a better indicator of where your problem lies. If you don't cache com.app.yourbean, then you won't have the associated Map.Entry tied to it.

Once you have some classes, you can start crawling the code base looking for instances and references. Since you have your own ORM layer (for good or ill), you can at least readily look at its source code. If your ORM is caching stuff, it's likely caching ORM classes wrapping your application classes.

Finally, once you know the classes, you can start up a local instance of the server with a much smaller heap and a smaller dataset, and use one of the profilers against that.

In this case, you can run a unit test that affects only 1 (or a small number) of the things you think may be leaking. For example, you could start up the server, run a histogram, perform a single action, and run the histogram again. Your leaking class should have increased by 1 (or whatever your unit of work is).

A profiler may be able to help you track the owners of that "now leaked" class.

But, in the end, you're going to have to have some understanding of your code base to better understand what's a leak, and what's not, and why an object exists in the heap at all, much less why it may be being retained as a leak in your heap.

Answer by matt b

Take a look at Eclipse Memory Analyzer. It's a great tool (and self-contained; it does not require Eclipse itself to be installed) which 1) can open up very large heaps very fast and 2) has some pretty good automatic detection tools. The latter isn't perfect, but EMA provides a lot of really nice ways to navigate through and query the objects in the dump to find any possible leaks.

I've used it in the past to help hunt down suspicious leaks.

Answer by Brian Agnew

If it's happening after a week's usage, and your application is as byzantine as you describe, perhaps you're better off restarting it every week?

I know it's not fixing the problem, but it may be a time-effective solution. Are there time windows when you can have outages? Can you load balance and fail over one instance whilst keeping the second up? Perhaps you can trigger a restart when memory consumption breaches a certain limit (perhaps monitoring via JMX or similar).

Answer by Fortyrunner

Can you accelerate time? That is, can you write a dummy test client that forces it to do a week's worth of calls/requests etc. in a few minutes or hours? These are your biggest friend, and if you don't have one, write one.
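
A dummy client of this kind can be very small. The sketch below is one possible shape, assuming the system's calls can be wrapped in a Python callable; `replay` and `fake_request` are hypothetical names, and `fake_request` merely stands in for a real request against the system under test.

```python
from concurrent.futures import ThreadPoolExecutor


def replay(action, total_calls, workers=8):
    """Invoke action(i) total_calls times on a thread pool.

    Returns (succeeded, failed) counts; an exception in one call is
    counted as a failure rather than aborting the whole run.
    """
    def attempt(i):
        try:
            action(i)
            return True
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(attempt, range(total_calls)))
    succeeded = sum(results)
    return succeeded, total_calls - succeeded


def fake_request(i):
    """Stand-in for one real call against the system under test."""
    if i % 100 == 99:
        raise RuntimeError("simulated failure")


print(replay(fake_request, 1000))
```

Replacing `fake_request` with a function that issues one real request, and raising `total_calls` to roughly a week's volume, would compress the leak's timeline into a single run.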

We used Netbeans a while ago to analyse heap dumps. It can be a bit slow but it was effective. Eclipse just crashed and the 32bit Windows tools did as well.

If you have access to a 64bit system or a Linux system with 3GB or more you will find it easier to analyse the heap dumps.

Do you have access to change logs and incident reports? Large scale enterprises will normally have change management and incident management teams and this may be useful in tracking down when problems started happening.

When did it start going wrong? Talk to people and try and get some history. You may get someone saying, "Yeah, it was after they fixed XYZ in patch 6.43 that we got weird stuff happening".

Answer by Drew Johnson

I've had success with IBM Heap Analyzer. It offers several views of the heap, including largest drop-off in object size, most frequently occurring objects, and objects sorted by size.

Answer by LB40

I've used jhat. It is a bit rough, but it depends on the kind of framework you have.

Answer by joseph

This answer expands upon @Will-Hartung's. I applied the same process to diagnose one of my memory leaks and thought that sharing the details would save other people time.

The idea is to have postgres 'plot' time vs. memory usage of each class, draw a line that summarizes the growth and identify the objects that are growing the fastest:

    ^
    |
s   |  Legend:
i   |  *  - data point
z   |  -- - trend
e   |
(   |
b   |                 *
y   |                     --
t   |                  --
e   |             * --    *
s   |           --
)   |       *--      *
    |     --    *
    |  -- *
   --------------------------------------->
                      time

Convert your heap dumps (you need multiple) into a format that is convenient for consumption by postgres. From the heap dump format:

 num     #instances         #bytes  class name 
----------------------------------------------
   1:       4632416      392305928  [C
   2:       6509258      208296256  java.util.HashMap$Node
   3:       4615599      110774376  java.lang.String
   5:         16856       68812488  [B
   6:        278914       67329632  [Ljava.util.HashMap$Node;
   7:       1297968       62302464  
...

To a CSV file with the datetime of each heap dump:

2016.09.20 17:33:40,[C,4632416,392305928
2016.09.20 17:33:40,java.util.HashMap$Node,6509258,208296256
2016.09.20 17:33:40,java.lang.String,4615599,110774376
2016.09.20 17:33:40,[B,16856,68812488
...

Using this script:

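# Example invocation: convert.heap.hist.to.csv.pl -f heap.2016.09.20.17.33.40.txt -dt "2016.09.20 17:33:40"  >> heap.csv

use strict;
use warnings;
use Getopt::Long;

my $file;
my $dt;
# GetOptions needs references to the variables it should fill in.
GetOptions(
    "f=s"  => \$file,
    "dt=s" => \$dt,
) or die "Error in command line arguments\n";
die "Usage: $0 -f <histogram file> -dt <datetime>\n"
    unless defined $file && defined $dt;

open my $fh, '<', $file or die "Cannot open $file: $!";

while (my $line = <$fh>) {
    $line =~ s/\R//g;    # remove newlines
    # Sample row:    1:       4442084      369475664  [C
    # The class pattern also accepts ';' so array types such as
    # [Ljava.util.HashMap$Node; are not silently skipped.
    my ($instances, $size, $class) =
        ($line =~ /^\s*\d+:\s+(\d+)\s+(\d+)\s+([\$\[\w.;]+)\s*$/);
    if ($instances) {
        print "$dt,$class,$instances,$size\n";
    }
}

close($fh);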

Create a table to put the data in

CREATE TABLE heap_histogram (
    histwhen timestamp without time zone NOT NULL,
    class character varying NOT NULL,
    instances integer NOT NULL,
    bytes bigint NOT NULL
);

Copy the data into your new table

\COPY heap_histogram FROM 'heap.csv'  WITH DELIMITER ',' CSV ;

Run the slope query against size (number of bytes):

SELECT class, REGR_SLOPE(bytes,extract(epoch from histwhen)) as slope
    FROM public.heap_histogram
    GROUP BY class
    HAVING REGR_SLOPE(bytes,extract(epoch from histwhen)) > 0
    ORDER BY slope DESC
    ;

Interpret the results:

         class             |        slope         
---------------------------+----------------------
 java.util.ArrayList       |     71.7993806279174
 java.util.HashMap         |     49.0324576155785
 java.lang.String          |     31.7770770326123
 joe.schmoe.BusinessObject |     23.2036817108056
 java.lang.ThreadLocal     |     20.9013528767851

The slope is bytes added per second (since the unit of epoch is in seconds). If you use instances instead of size, then that's the number of instances added per second.
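
Bytes per second are hard to judge at a glance, so it can help to rescale the slope. The sketch below converts a slope to bytes per day and estimates how long a given amount of free heap would last at that rate; the helper names are made up for illustration, and a constant growth rate is an assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_day(slope_bytes_per_second):
    """Scale a regression slope in bytes/second to bytes/day."""
    return slope_bytes_per_second * SECONDS_PER_DAY


def days_until_full(free_bytes, slope_bytes_per_second):
    """Days until free_bytes is consumed at a constant growth rate."""
    return free_bytes / bytes_per_day(slope_bytes_per_second)


arraylist_slope = 71.7993806279174  # java.util.ArrayList row from the results above
print(round(bytes_per_day(arraylist_slope) / 1e6, 1), "MB per day")
```

For the java.util.ArrayList row this works out to roughly 6.2 MB per day, which is consistent with a leak that only becomes visible after days of uptime.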

One of my lines of code creating this joe.schmoe.BusinessObject was responsible for the memory leak. It was creating the object and appending it to an array without checking whether it already existed. The other objects were also created along with the BusinessObject near the leaking code.

Answer by Jim T

There are great tools like Eclipse MAT and Heap Hero to analyze heap dumps. However, you need to provide these tools with heap dumps captured in the correct format and at the correct point in time.

This article gives you multiple options to capture heap dumps. In my opinion, the first 3 are effective options to use, and the others are good options to be aware of:

  1. jmap
  2. HeapDumpOnOutOfMemoryError
  3. jcmd
  4. JVisualVM
  5. JMX
  6. Programmatic Approach
  7. IBM Administrative Console

7 Options to capture Java Heap dumps
