问题出现

线上报警,看监控图，第一个是cpu使用率，第二个是load

排查步骤

查看cpu使用率高的几个线程
查看gc情况
dump内存分析
根据dump分析结果排查具体原因

排查过程

查看cpu使用率高的几个线程

结果如下：

The stack of busy(83.4%) thread(1549/0x60d) of java pid(1546) all times():
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f98a801e800 nid=0x60d runnable
 
The stack of busy(81.4%) thread(1550/0x60e) of java pid(1546) all times():
"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f98a8020000 nid=0x60e runnable
 
The stack of busy(81.4%) thread(1551/0x60f) of java pid(1546) all times():
"GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007f98a8022000 nid=0x60f runnable
 
The stack of busy(77.4%) thread(1552/0x610) of java pid(1546) all times():
"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007f98a8023800 nid=0x610 runnable
 
The stack of busy(17.9%) thread(1553/0x611) of java pid(1546) all times():
"VM Thread" os_prio=0 tid=0x00007f98a8077000 nid=0x611 runnable
 
The stack of busy(0.0%) thread(1546/0x60a) of java pid(1546) all times():
The stack of busy(0.0%) thread(1548/0x60c) of java pid(1546) all times():
"main" #1 prio=5 os_prio=0 tid=0x00007f98a8009800 nid=0x60c runnable [0x00007f98af608000]
   java.lang.Thread.State: RUNNABLE
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
    at java.net.ServerSocket.implAccept(ServerSocket.java:545)
    at java.net.ServerSocket.accept(ServerSocket.java:513)
    at org.apache.catalina.core.StandardServer.await(StandardServer.java:452)
    at org.apache.catalina.startup.Catalina.await(Catalina.java:779)
    at org.apache.catalina.startup.Catalina.start(Catalina.java:725)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:322)
    at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:456)
 
The stack of busy(0.0%) thread(1554/0x612) of java pid(1546) all times():
"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007f98a807e800 nid=0x612 in Object.wait() [0x00007f9898dfc000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
    - locked <0x0000000080782fb0> (a java.lang.ref.Reference$Lock)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)

很明显，cpu使用率较高的几个线程都是gc线程，ParallelGC并行回收四个gc线程，有三个在80%以上，一个77.4%

结论：内存有问题

查看gc和内存情况

2019-02-27T20:53:35.407+0800: 183330.132: [Full GC (Ergonomics) [PSYoungGen: 276991K->276991K(475136K)] [ParOldGen: 1398225K->1398224K(1398272K)] 1675217K->1675215K(1873408K), [Metaspace: 91794K->91794K(1134592K)], 1.4637406 secs] [Times: user=5.55 sys=0.00, real=1.46 secs]
2019-02-27T20:53:36.872+0800: 183331.597: [Full GC (Ergonomics) [PSYoungGen: 276992K->276991K(475136K)] [ParOldGen: 1398224K->1398224K(1398272K)] 1675216K->1675216K(1873408K), [Metaspace: 91794K->91794K(1134592K)], 1.4722518 secs] [Times: user=5.63 sys=0.00, real=1.48 secs]
2019-02-27T20:53:38.345+0800: 183333.070: [Full GC (Ergonomics) [PSYoungGen: 276991K->276991K(475136K)] [ParOldGen: 1398226K->1398226K(1398272K)] 1675218K->1675218K(1873408K), [Metaspace: 91794K->91794K(1134592K)], 1.4388783 secs] [Times: user=5.50 sys=0.00, real=1.44 secs]
2019-02-27T20:53:39.785+0800: 183334.510: [Full GC (Ergonomics) [PSYoungGen: 276991K->276991K(475136K)] [ParOldGen: 1398226K->1398226K(1398272K)] 1675218K->1675218K(1873408K), [Metaspace: 91794K->91794K(1134592K)], 1.4565019 secs] [Times: user=5.51 sys=0.00, real=1.45 secs]
2019-02-27T20:53:41.242+0800: 183335.967: [Full GC (Ergonomics) [PSYoungGen: 276991K->276991K(475136K)] [ParOldGen: 1398229K->1398229K(1398272K)] 1675221K->1675221K(1873408K), [Metaspace: 91794K->91794K(1134592K)], 1.4389919 secs] [Times: user=5.50 sys=0.00, real=1.44 secs]
2019-02-27T20:53:42.682+0800: 183337.407: [Full GC (Ergonomics) [PSYoungGen: 276991K->276991K(475136K)] [ParOldGen: 1398231K->1398231K(1398272K)] 1675223K->1675223K(1873408K), [Metaspace: 91794K->91794K(1134592K)], 1.4568581 secs] [Times: user=5.53 sys=0.00, real=1.46 secs]
2019-02-27T20:53:44.140+0800: 183338.865: [Full GC (Ergonomics) [PSYoungGen: 276991K->276991K(475136K)] [ParOldGen: 1398233K->1398231K(1398272K)] 1675225K->1675223K(1873408K), [Metaspace: 91794K->91794K(1134592K)], 1.5388401 secs] [Times: user=5.62 sys=0.00, real=1.54 secs]

一直在full gc，YoungGen 和 OldGen 都gc不掉，每次stw 1.5s左右,不停的在stw,系统无法提供服务

看一下堆内存的使用情况：

Attaching to process ID 1546, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.91-b14
 
using thread-local object allocation.
Parallel GC with 4 thread(s)
 
Heap Configuration:
   MinHeapFreeRatio         = 0
   MaxHeapFreeRatio         = 100
   MaxHeapSize              = 2147483648 (2048.0MB)
   NewSize                  = 715653120 (682.5MB)
   MaxNewSize               = 715653120 (682.5MB)
   OldSize                  = 1431830528 (1365.5MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)
 
Heap Usage:
PS Young Generation
Eden Space:
   capacity = 283639808 (270.5MB)
   used     = 283639808 (270.5MB)
   free     = 0 (0.0MB)
   100.0% used
From Space:
   capacity = 202899456 (193.5MB)
   used     = 0 (0.0MB)
   free     = 202899456 (193.5MB)
   0.0% used
To Space:
   capacity = 218103808 (208.0MB)
   used     = 0 (0.0MB)
   free     = 218103808 (208.0MB)
   0.0% used
PS Old Generation
   capacity = 1431830528 (1365.5MB)
   used     = 1431780696 (1365.4524765014648MB)
   free     = 49832 (0.04752349853515625MB)
   99.99651969985096% used
 
43030 interned Strings occupying 4910192 bytes.

eden使用了100.0% ，old使用了99.99651969985096%，已经基本用完了

所以现在应该看一下内存使用情况，分析一下内存使用哪里出问题了

dump内存分析

dump之后用mat分析的结果：

http-bio-8080-exec-58
  at java.util.Arrays.copyOf([Ljava/lang/Object;I)[Ljava/lang/Object; (Arrays.java:3181)
  at java.util.ArrayList.grow(I)V (ArrayList.java:261)
  at java.util.ArrayList.ensureExplicitCapacity(I)V (ArrayList.java:235)
  at java.util.ArrayList.ensureCapacityInternal(I)V (ArrayList.java:227)
  at java.util.ArrayList.add(Ljava/lang/Object;)Z (ArrayList.java:458)
  at org.apache.xmlbeans.impl.values.NamespaceContext$NamespaceContextStack.push(Lorg/apache/xmlbeans/impl/values/NamespaceContext;)V (NamespaceContext.java:81)
  at org.apache.xmlbeans.impl.values.NamespaceContext.push(Lorg/apache/xmlbeans/impl/values/NamespaceContext;)V (NamespaceContext.java:106)
  at org.apache.xmlbeans.impl.values.XmlObjectBase.check_dated()V (XmlObjectBase.java:1273)
  at org.apache.xmlbeans.impl.values.XmlObjectBase.stringValue()Ljava/lang/String; (XmlObjectBase.java:1484)
  at org.apache.xmlbeans.impl.values.XmlObjectBase.getStringValue()Ljava/lang/String; (XmlObjectBase.java:1492)
  at org.apache.poi.xssf.usermodel.XSSFRichTextString.preserveSpaces(Lorg/openxmlformats/schemas/spreadsheetml/x2006/main/STXstring;)V (XSSFRichTextString.java:491)
  at org.apache.poi.xssf.usermodel.XSSFRichTextString.<init>(Ljava/lang/String;)V (XSSFRichTextString.java:91)
  at org.apache.poi.xssf.usermodel.XSSFCell.setCellValue(Ljava/lang/String;)V (XSSFCell.java:315)
  at com.xxx.ExportUtil.export(Ljavax/servlet/http/HttpServletResponse;[Ljava/lang/String;Ljava/lang/String;Ljava/util/List;)V (ExportUtil.java:53)
  at com.xxx.ApplyServiceImpl.export(Ljavax/servlet/http/HttpServletResponse;[Ljava/lang/String;Ljava/lang/String;Lcom/xxx/ApplyRequest;)Lcom/xxx/ApplyResultEnum; (ApplyServiceImpl.java:321)
  at com.xxx.UpDownController.export(Ljavax/servlet/http/HttpServletResponse;Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;Ljava/lang/Integer;ILjava/lang/String;Ljava/lang/Integer;IILjava/lang/String;Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;ILjava/lang/String;Ljava/lang/String;Ljava/lang/String;Z)Lcom/xxx/APIResponse; (UpDownController.java:192)
  at com.xxx.UpDownController$$FastClassBySpringCGLIB$$8f2b3db8.invoke(ILjava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (Unknown Source)
  at org.springframework.cglib.proxy.MethodProxy.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (MethodProxy.java:204)
  at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint()Ljava/lang/Object; (CglibAopProxy.java:738)
  at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed()Ljava/lang/Object; (ReflectiveMethodInvocation.java:157)
  at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed()Ljava/lang/Object; (MethodInvocationProceedingJoinPoint.java:85)

分析结果表示，有一个线程里的本地变量占了总内存的94.91%，将近1.5g大小

根据线程栈看出来，是ArrayList扩容的时候占的内存相当多，这个ArrayList的数据占了1.5个G，调用位置在ExportUtil.java:53,这是一个导出excel的工具类，在这里创建单元格，并设置值。

问题根源

问题根源在这里就查到了，同时根据access log能查出来，这是用户的一个导出数据操作，数据量上百万，于是就出问题了

问题根源有两点：

大量数据的情况下没有增加数量限制
堆内存设置的比较小,从配置信息可以看出来,heap只配了2g,但我们的机器内存有8G