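Notes on debugging a memory leak

I recently wrote a crawler project that uses Django as its ORM. When I ran the program on the server, its memory usage kept growing until the process was eventually killed by the system. The first step with a problem like this is to find out what types of data are actually sitting in memory. I mainly used three tools for this: objgraph, pympler, and guppy.

Observing with objgraph

The code is simplified here: the function dosomething stands for one run of the crawl task, and it is executed once every minute.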

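import time
import objgraph
import schedule
def dosomething():
    objgraph.show_growth()  # print the types whose object counts grew since the last call
    # ........ crawl task
schedule.every(1).minutes.do(dosomething)
while True:  # schedule only runs due jobs when run_pending() is called
    schedule.run_pending()
    time.sleep(1)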

The output looks like this:

objects growth
list                          25705    +25705
function                      16416    +16416
tuple                          7941     +7941
dict                           5802     +5802
weakref                        3779     +3779
builtin_function_or_method     2471     +2471
cell                           2303     +2303
type                           2141     +2141
getset_descriptor              1845     +1845
wrapper_descriptor             1542     +1542
objects growth
list                        26225      +520
ObservationList               329       +67
IdentityPartitionCluster      329       +67
SplitResult                    14        +5
objects growth
ObservationList               337        +8
IdentityPartitionCluster      337        +8
SplitResult                    19        +5
objects growth
ObservationList               349        +4
IdentityPartitionCluster      349        +4

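The large increments in the first round appear because the first run loads classes, functions, and other objects. In the later rounds, ObservationList and IdentityPartitionCluster keep growing, but objgraph only shows object counts; it shows neither how much memory the data occupies nor what it contains.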
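
To make the growth report less of a black box, here is a minimal sketch of roughly what such a report computes, using only the standard library. It is not objgraph's implementation (objgraph compares against peak counts that it keeps between calls); type_counts, growth, and the Leaky class are names made up for this illustration.

```python
import gc
from collections import Counter


def type_counts():
    """Count the objects tracked by the garbage collector, grouped by type name."""
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def growth(before, after, limit=10):
    """Return (type name, current count, increase) rows, largest increase first."""
    rows = [(name, count, count - before[name])
            for name, count in after.items()
            if count > before[name]]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    class Leaky(object):  # hypothetical class standing in for leaked crawl data
        pass

    before = type_counts()
    kept = [Leaky() for _ in range(500)]  # simulate objects that are never released
    after = type_counts()
    for name, count, delta in growth(before, after):
        print("%-30s %8d %+9d" % (name, count, delta))
```

Like objgraph, this only sees objects that the garbage collector tracks, so plain strings never show up in it.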

Using pympler

pympler makes it easy to see how memory is being used. Usage is as follows:

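import time
import schedule
from pympler import tracker, muppy, summary
tr = tracker.SummaryTracker()  # remembers the previous summary so it can print a diff
def dosomething():
    print("memory total")
    all_objects = muppy.get_objects()      # the objects currently known to the interpreter
    sum1 = summary.summarize(all_objects)  # rows of (type, object count, total size)
    summary.print_(sum1)
    print("memory difference")
    tr.print_diff()                        # change since the previous print_diff() call
    # ........ crawl task
schedule.every(1).minutes.do(dosomething)
while True:  # schedule only runs due jobs when run_pending() is called
    schedule.run_pending()
    time.sleep(1)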

The output looks like this:

memory total
                                     types |   # objects |   total size
========================================== | =========== | ============
                                       str |       94461 |      7.19 MB
                                      dict |        5802 |      6.54 MB
                                   unicode |        9876 |      4.73 MB
                                      type |        2078 |      1.79 MB
                                      code |       14614 |      1.78 MB
                                      list |        9981 |      1.02 MB
  <class 'guppy.heapy.View.ObservationList |         262 |    854.59 KB
                                     tuple |        7941 |    581.80 KB
                                   weakref |        3779 |    324.76 KB
                                       set |         653 |    183.45 KB
                builtin_function_or_method |        2471 |    173.74 KB
                                       int |        6355 |    148.95 KB
                       function (__init__) |        1119 |    131.13 KB
                         getset_descriptor |        1845 |    129.73 KB
                                      cell |        2303 |    125.95 KB
memory difference
                                              types |   # objects |   total size
=================================================== | =========== | ============
                                                str |       72157 |      4.96 MB
                                            unicode |        6943 |      3.18 MB
                                               list |       15319 |      3.13 MB
           <class 'guppy.heapy.View.ObservationList |         262 |    854.59 KB
                                               dict |         590 |    474.83 KB
                                               code |        1991 |    248.88 KB
                                                int |        5817 |    136.34 KB
                                              tuple |         821 |     55.34 KB
                                            weakref |         576 |     49.50 KB
                                               type |          45 |     39.73 KB
  <class 'guppy.heapy.Part.IdentityPartitionCluster |         262 |     24.56 KB
                                 wrapper_descriptor |         164 |     12.81 KB
                                function (__init__) |          95 |     11.13 KB
                                           classobj |         105 |     10.66 KB
                                           instance |         117 |      8.23 KB
memory total
                       types |   # objects |   total size
============================ | =========== | ============
                     unicode |        9890 |      9.76 MB
                         str |       95090 |      7.22 MB
                        dict |        5802 |      6.54 MB
                        type |        2078 |      1.79 MB
                        code |       14614 |      1.78 MB
                        list |       10501 |      1.08 MB
                       tuple |        7940 |    581.73 KB
                     weakref |        3778 |    324.67 KB
                         set |         653 |    183.45 KB
  builtin_function_or_method |        2471 |    173.74 KB
                         int |        6654 |    155.95 KB
         function (__init__) |        1119 |    131.13 KB
           getset_descriptor |        1845 |    129.73 KB
                        cell |        2302 |    125.89 KB
          wrapper_descriptor |        1542 |    120.47 KB
memory difference
                                              types |   # objects |      total size
=================================================== | =========== | ===============
                                            unicode |          14 |         5.04 MB
                                               list |         520 |        56.93 KB
                                                str |         629 |        35.84 KB
                                                int |         299 |         7.01 KB
  <class 'guppy.heapy.Part.IdentityPartitionCluster |          67 |         6.28 KB
                       <class 'urlparse.SplitResult |           5 |       480     B
                                   _sre.SRE_Pattern |           2 |       176     B
                                  datetime.datetime |           1 |        48     B
                                               cell |          -1 |       -56     B
                                              tuple |          -1 |       -64     B
                                            weakref |          -1 |       -88     B
                                  function (remove) |          -1 |      -120     B
           <class 'guppy.heapy.View.ObservationList |          67 |   -753144     B
 .......................
 ...................  (several intermediate results omitted)
 .....................
memory total
                                     types |   # objects |   total size
========================================== | =========== | ============
                                   unicode |       11050 |    385.95 MB
                                       str |       95073 |      7.22 MB
                                      dict |        5802 |      6.54 MB
                                      type |        2078 |      1.79 MB
                                      code |       14614 |      1.78 MB
                                      list |       10501 |      1.08 MB
                                     tuple |        7940 |    581.73 KB
                                   weakref |        3778 |    324.67 KB
                                       int |        8762 |    205.36 KB
                                       set |         653 |    183.45 KB
  <class 'guppy.heapy.View.ObservationList |         897 |    181.98 KB
                builtin_function_or_method |        2471 |    173.74 KB
                       function (__init__) |        1119 |    131.13 KB
                         getset_descriptor |        1845 |    129.73 KB
                                      cell |        2302 |    125.89 KB
memory difference
                                              types |   # objects |   total size
=================================================== | =========== | ============
                                            unicode |           6 |      2.18 MB
                                                str |          29 |      2.14 KB
                                               dict |           0 |    768     B
                       <class 'urlparse.SplitResult |           7 |    672     B
                                  collections.deque |           0 |    512     B
           <class 'guppy.heapy.View.ObservationList |           3 |    336     B
  <class 'guppy.heapy.Part.IdentityPartitionCluster |           3 |    288     B
                                                int |          12 |    288     B

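The unicode type keeps growing, from a little over 4 MB at the start all the way to 385 MB. Now that unicode is known to be the problem, the next step is to see which objects reference these unicode strings.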
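
Strings are not tracked by the garbage collector, so a plain gc.get_objects() scan misses exactly the kind of object that is growing here. The sketch below is one way to approximate a per-type size summary with the standard library: it starts from the GC-tracked objects and follows gc.get_referents() to reach everything they hold. size_by_type is a hypothetical helper written for this note, not part of pympler, and its totals are approximate.

```python
import gc
import sys
from collections import defaultdict


def size_by_type():
    """Approximate total size in bytes per type name.

    Starts from every GC-tracked object and follows references, so that
    untracked objects such as strings stored in containers are counted too.
    """
    stack = gc.get_objects()
    totals = defaultdict(int)
    seen = set()
    # Skip this function's own bookkeeping containers.
    seen.update((id(stack), id(totals), id(seen)))
    while stack:
        obj = stack.pop()
        if id(obj) in seen:
            continue
        seen.add(id(obj))
        # The second argument is a fallback for objects that cannot report a size.
        totals[type(obj).__name__] += sys.getsizeof(obj, 0)
        stack.extend(gc.get_referents(obj))
    return totals


if __name__ == "__main__":
    log = [{"sql": "x" * 1000000}]  # one large string held by a long-lived list
    totals = size_by_type()
    top = sorted(totals.items(), key=lambda item: item[1], reverse=True)[:5]
    for name, size in top:
        print("%-20s %12d bytes" % (name, size))
```

Calling it once per job and comparing the totals between runs gives a rough equivalent of the "memory difference" tables above.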

Using guppy

guppy shows how heap memory is actually being used and how much memory each kind of object occupies.

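import time
import schedule
import guppy
hp = guppy.hpy()  # creates the session context used to access heap information
def dosomething():
    print("heap total")
    heap = hp.heap()  # details of the objects currently on the heap
    # heap[0] is the row that uses the most memory (unicode in this run);
    # byvia partitions it by the reference through which each object is held
    references = heap[0].byvia
    print(references)
    # ........ crawl task
schedule.every(1).minutes.do(dosomething)
while True:  # schedule only runs due jobs when run_pending() is called
    schedule.run_pending()
    time.sleep(1)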

The code above, for example, reports through which references the unicode objects are held.

Partition of a set of 10151 objects. Total size = 116935280 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0    155   2 114401312  98 114401312  98 "[u'sql']"
     1   1245  12  1036064   1 115437376  99 '.func_doc', '[0]'
     2    384   4   282360   0 115719736  99 "['__doc__']"
     3   1456  14   200344   0 115920080  99 '[1]'
     4    144   1    89824   0 116009904  99 '.__doc__', '.func_doc', '[0]'
     5    719   7    89336   0 116099240  99 '[2]'
     6      1   0    72784   0 116172024  99 "['TECHNICAL_500_TEMPLATE']"
     7    565   6    70256   0 116242280  99 '[0]'
     8    532   5    61408   0 116303688  99 '[3]'
     9    448   4    50464   0 116354152 100 '[4]'
<1359 more rows. Type e.g. '_.more' to view.>

The key sql accounts for 98% of all the memory used by unicode objects. The next step is to follow the reference chain of this sql entry.

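import time
import schedule
import guppy
hp = guppy.hpy()
def dosomething():
    print("heap total")
    heap = hp.heap()
    references = heap[0].byvia
    print(references[0].kind)     # the partition itself, here <via "[u'sql']">
    print(references[0].shpaths)  # shortest reference paths from the root to these objects
    print(references[0].rp)       # reference pattern: which kinds of objects refer to them
    # ........ crawl task
schedule.every(1).minutes.do(dosomething)
while True:  # schedule only runs due jobs when run_pending() is called
    schedule.run_pending()
    time.sleep(1)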

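shpaths returns the shortest reference paths from the root down to these objects. rp returns the reference pattern, that is, which kinds of objects refer to them.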

<via "[u'sql']">
 0: hp.Root.i0_modules['django.contr....models.query'].__dict__['connections'].__dict__['_connections'].??[<weakref...bc7d6830>]['default'].__dict__['queries_log'].??[u'sql']
 1: hp.Root.i0_modules['django.contr....models.query'].__dict__['connections'].__dict__['_connections'].??[<weakref...bc7d6830>]['default'].__dict__['queries_log'].??[u'sql']
 2: hp.Root.i0_modules['django.contr....models.query'].__dict__['connections'].__dict__['_connections'].??[<weakref...bc7d6830>]['default'].__dict__['queries_log'].??[u'sql']
 3: hp.Root.i0_modules['django.contr....models.query'].__dict__['connections'].__dict__['_connections'].??[<weakref...bc7d6830>]['default'].__dict__['queries_log'].??[u'sql']
 4: hp.Root.i0_modules['django.contr....models.query'].__dict__['connections'].__dict__['_connections'].??[<weakref...bc7d6830>]['default'].__dict__['queries_log'].??[u'sql']
 5: hp.Root.i0_modules['django.contr....models.query'].__dict__['connections'].__dict__['_connections'].??[<weakref...bc7d6830>]['default'].__dict__['queries_log'].??[u'sql']
 6: hp.Root.i0_modules['django.contr....models.query'].__dict__['connections'].__dict__['_connections'].??[<weakref...bc7d6830>]['default'].__dict__['queries_log'].??[u'sql']
 7: hp.Root.i0_modules['django.contr....models.query'].__dict__['connections'].__dict__['_connections'].??[<weakref...bc7d6830>]['default'].__dict__['queries_log'].??[u'sql']
 8: hp.Root.i0_modules['django.contr....models.query'].__dict__['connections'].__dict__['_connections'].??[<weakref...bc7d6830>]['default'].__dict__['queries_log'].??[u'sql']
 9: hp.Root.i0_modules['django.contr....models.query'].__dict__['connections'].__dict__['_connections'].??[<weakref...bc7d6830>]['default'].__dict__['queries_log'].??[u'sql']
<... 250 more paths ...>
Reference Pattern by <[dict of] class>.
 0: _ --- [-] 20 <via "[u'sql']">: 0x7f439c90c810, 0x7f439c90c840...
 1: a      [-] 20 dict (no owner): 0x7f439d065b40*2, 0x7f439d54bc58*2...
 2: aa ---- [-] 1 collections.deque: 0x7f43a051d360
 3: a3       [-] 1 dict of django.contrib.gis.db.backends.postgis.base.Databa...
 4: a4 ------ [-] 1 django.contrib.gis.db.backends.postgis.base.DatabaseWrapp...
 5: a5         [-] 1 dict of django.contrib.gis.db.backends.postgis.operation...
 6: a6 -------- [-] 1 django.contrib.gis.db.backends.postgis.operations.PostG...
 7: a7           [^ 3] 1 dict of django.contrib.gis.db.backends.postgis.base....
 8: a4b ------ [-] 1 dict (no owner): 0x7f43bbe00050*1
 9: a4ba        [-] 1 dict (no owner): 0x7f43bbe0de88*1
<Type e.g. '_.more' for more.>
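
The reference pattern answers the question "what is holding these objects?". For a single object, the standard library can answer a rough version of the same question. This is a minimal sketch built on gc.get_referrers, not how guppy computes its reference pattern; referrer_types and the sample record are hypothetical, and only containers tracked by the garbage collector are reported.

```python
import gc
import types
from collections import deque


def referrer_types(obj):
    """Return the sorted type names of the objects that directly refer to obj."""
    names = set()
    for ref in gc.get_referrers(obj):
        if isinstance(ref, types.FrameType):
            continue  # ignore stack frames of the code doing the inspection
        names.add(type(ref).__name__)
    return sorted(names)


if __name__ == "__main__":
    record = {"sql": "SELECT 1", "time": "0.001"}  # made-up log entry
    queries_log = deque([record])                     # long-lived container holding it
    print(referrer_types(record))
```

Applying the same function to each referrer in turn walks the chain upward one level at a time, which is the manual version of what shpaths prints in one go.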

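This clearly points at Django. A search online for Django memory leaks turned up the cause: in debug mode, Django keeps a record of every SQL statement it executes. With the cause finally understood, the fix is to set DEBUG to False in Django's settings.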
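
Why the debug flag matters for a long-running job can be shown with a toy model. The class below is a simplified stand-in written for this note, not Django's actual connection wrapper: it appends one dict per statement to a queries_log deque only when debug is on. ToyConnection, logged_chars, and the max_logged limit are all assumptions made for the illustration.

```python
from collections import deque


class ToyConnection(object):
    """Simplified stand-in for a database connection; not Django's real class."""

    def __init__(self, debug, max_logged=9000):
        self.debug = debug
        # Bounded log of executed statements; max_logged is an assumed limit.
        self.queries_log = deque(maxlen=max_logged)

    def execute(self, sql):
        if self.debug:
            # In debug mode the full SQL text stays referenced by the log.
            self.queries_log.append({"sql": sql, "time": "0.000"})

    def reset_queries(self):
        """Drop everything logged so far."""
        self.queries_log.clear()


def logged_chars(conn):
    """Total number of characters of SQL text currently held by the log."""
    return sum(len(entry["sql"]) for entry in conn.queries_log)


if __name__ == "__main__":
    for debug in (True, False):
        conn = ToyConnection(debug=debug)
        for page_id in range(50):
            body = "x" * 100000  # stands in for a crawled page stored via SQL
            conn.execute("INSERT INTO page VALUES (%d, '%s')" % (page_id, body))
        print("debug=%s logged=%d chars" % (debug, logged_chars(conn)))
```

If debug mode has to stay on for some reason, Django also provides django.db.reset_queries(), which could be called at the end of each job to clear the log by hand; I have not measured that variant in this project.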