cpu满时出现奇怪的lua tcp socket read timed out问题

link管理

链接快照平台

输入网页链接，自动生成快照
标签化管理网页链接

heeroz

1.7.2.1 系统 ubuntu14

代码如下：

--测试redis timeout

local redis = require("resty.redis")

function do_redis_read()

local red = redis:new()

red:set_timeout(2000)

local ok, err = red:connect("127.0.0.1", 6379)

if not ok then

error("failed to connect:"..err)

return

end

local ok, err = red:auth("111111")

if not ok then

error("failed to authenticate:"..err)

return

end

local ok, err = red:set_keepalive(10000, 100)

if not ok then

ngx.say("failed to set keepalive: ", err)

return

end

do_redis_read()

local test = function(n)

local t1 = os.time()

for i=1,n do end

print("time ", os.time() - t1)

end

test(10000000000)

--ngx.sleep(10) 用sleep就无法还原出此问题

do_redis_read()

结果：

2014/09/01 19:52:00 [error] 11936#0: *798 lua tcp socket read timed out

2014/09/01 19:52:00 [error] 11936#0: *798 lua entry thread aborted: runtime error: /home/test.lua:16: failed to authenticate:timeout

大致情况就是，第二次读取redis数据时，如果之前有cpu用满的情况，第二次读写就会失败，我哪里用法有问题？

感谢。

On Mon, Sep 1, 2014 at 5:11 AM, Heero Zhang wrote: > 2014/09/01 19:52:00 [error] 11936#0: *798 lua tcp socket read timed out > 2014/09/01 19:52:00 [error] 11936#0: *798 lua entry thread aborted: runtime > error: /home/test.lua:16: failed to authenticate:timeout > 大致情况就是，第二次读取redis数据时，如果之前有cpu用满的情况，第二次读写就会失败，我哪里用法有问题？我使用严格的串行请求无法复现你看到的问题。你看到这样的超时错误应当是你使用了并发请求吧？当有并发请求时，当请求 A 忙于 CPU 计算时，该计算会阻塞对应的 worker 进程的事件循环较长的时间，从而有可能让正在等待 IO 事件的请求 B 发生超时（毕竟你的超时设置只是 2 秒）。你的 socket 超时设置应长于最久可能的阻塞时间。 Regards, -agentzh

heeroz

我使用的是串行，同时只有一个链接请求操作

当时我在ubuntu或windows下都复现出来了，明天我试试看用干净的系统看能否复现出来

在 2014年9月2日上午2:16，Yichun Zhang (agentzh) < [email protected] > 写道：

Hello!
On Mon, Sep 1, 2014 at 5:11 AM, Heero Zhang wrote:
> 2014/09/01 19:52:00 [error] 11936#0: *798 lua tcp socket read timed out
> 2014/09/01 19:52:00 [error] 11936#0: *798 lua entry thread aborted: runtime
> error: /home/test.lua:16: failed to authenticate:timeout
> 大致情况就是，第二次读取redis数据时，如果之前有cpu用满的情况，第二次读写就会失败，我哪里用法有问题？

我使用严格的串行请求无法复现你看到的问题。你看到这样的超时错误应当是你使用了并发请求吧？当有并发请求时，当请求 A 忙于 CPU
计算时，该计算会阻塞对应的 worker 进程的事件循环较长的时间，从而有可能让正在等待 IO 事件的请求 B
发生超时（毕竟你的超时设置只是 2 秒）。你的 socket 超时设置应长于最久可能的阻塞时间。
Regards,
-agentzh

heeroz

我用我们的生产环境能50%概率能试出来这个情况，但是我新开个机器就不行了，试了半天，发现需要这个配置

init_by_lua '

require "resty.core"

春哥加上这个配置再试试看？

./configure --with-debug 后lua模块没了，没继续研究了，先暂时不试了

在 2014年9月2日上午2:45，Yichun Zhang (agentzh) < [email protected] > 写道：

Hello!
2014-09-01 11:42 GMT-07:00 Heero Zhang:
> 我使用的是串行，同时只有一个链接请求操作
> 当时我在ubuntu或windows下都复现出来了，明天我试试看用干净的系统看能否复现出来
一个建议是启用 nginx 调试日志以排查问题：
http://nginx.org/en/docs/debugging_log.html
这样可以很方便地确定对应超时 timer 的开启和生效过程。
Regards,
-agentzh
2014-09-01 23:48 GMT-07:00 Heero Zhang: > 我用我们的生产环境能50%概率能试出来这个情况，但是我新开个机器就不行了，试了半天，发现需要这个配置 > init_by_lua ' > require "resty.core" > 春哥加上这个配置再试试看？加上了这个配置在我这里也无法复现。从逻辑上说，它也不应该会影响到超时定时器。 > ./configure --with-debug 后lua模块没了，没继续研究了，先暂时不试了我没看出来 --with-debug 和启用 lua 模块有什么关系。如我之前所说，如果你看到超时，则极有可能是你测试时请求之间的时序并没有做到*严格串行*。只要请求之间在时序上有微小的重叠，就有可能导致 redis 超时，毕竟你的 redis 超时限制只有 2 秒，而你的那个热循环在我的本上至少需要 10 秒。确认时序上是否存在问题的最简单的做法是检查 nginx 调试日志。我不明白为什么你需要在生产上使用这种代码，因为以单个操作就阻塞 CPU 10 秒的计算本身就是不太适合在线应用的。 Regards, -agentzh

heeroz

这种代码主要用在我们对内服务的应用上，而不是对外，我们会有些定时的大量数据整合计算的内部操作会跑，而openresty的环境写起来方便所以使用。等以后主线版本有了openresty的控制台就方便多了。

设2秒短于处理时间只是为了复现这个错误。

我刚刚严格检查了下，保证同时只有一个请求，来保证*严格串行*。我测试时的做法是，重启nginx，等待重启完成，发起一个请求，一般第一次请求就报这个错了，access日志里并无第二个请求。

另外debug日志开启成功了，附件是debug日志。可以看到日志里只有一个请求。

我看到454行23秒时开始发送数据lua tcp socket sent all the data，然后11秒后，到34秒lua tcp socket read timed out，看上去是发送数据后，11秒没收到数据所以超时了。

但是不应该啊，redis是本地的，也没做什么让redis会卡那么久的操作，cpu也是多核的，不应该会11秒read timed out的？

在 2014年9月3日上午12:59，Yichun Zhang (agentzh) < [email protected] > 写道：

Hello!
2014-09-01 23:48 GMT-07:00 Heero Zhang:

> 我用我们的生产环境能50%概率能试出来这个情况，但是我新开个机器就不行了，试了半天，发现需要这个配置
> init_by_lua '
> require "resty.core"
> 春哥加上这个配置再试试看？

加上了这个配置在我这里也无法复现。从逻辑上说，它也不应该会影响到超时定时器。
> ./configure --with-debug 后lua模块没了，没继续研究了，先暂时不试了

Attachment: socket_timeout.log
Description: Binary data

heeroz

明白了，这11秒就是for循环消耗的11秒

444行是打印for循环时间的信息，打印时间写着13:31:23点，而23秒是开始请求的时间，其实当时应该13:31:34点了，估计内部计时错误导致他错误的认为timeout了？

在 2014年9月3日下午1:51，Heero Zhang < [email protected] > 写道：

设2秒短于处理时间只是为了复现这个错误。

另外debug日志开启成功了，附件是debug日志。可以看到日志里只有一个请求。

但是不应该啊，redis是本地的，也没做什么让redis会卡那么久的操作，cpu也是多核的，不应该会11秒read timed out的？

在 2014年9月3日上午12:59，Yichun Zhang (agentzh) < [email protected] > 写道：

Hello!
2014-09-01 23:48 GMT-07:00 Heero Zhang:

> 我用我们的生产环境能50%概率能试出来这个情况，但是我新开个机器就不行了，试了半天，发现需要这个配置
> init_by_lua '
> require "resty.core"
> 春哥加上这个配置再试试看？

加上了这个配置在我这里也无法复现。从逻辑上说，它也不应该会影响到超时定时器。
> ./configure --with-debug 后lua模块没了，没继续研究了，先暂时不试了

我没看出来 --with-debug 和启用 lua 模块有什么关系。
如我之前所说，如果你看到超时，则极有可能是你测试时请求之间的时序并没有做到*严格串行*。只要请求之间在时序上有微小的重叠，就有可能导致
redis 超时，毕竟你的 redis 超时限制只有 2 秒，而你的那个热循环在我的本上至少需要 10
秒。确认时序上是否存在问题的最简单的做法是检查 nginx 调试日志。
我不明白为什么你需要在生产上使用这种代码，因为以单个操作就阻塞 CPU 10 秒的计算本身就是不太适合在线应用的。
Regards,
-agentzh
2014-09-02 22:51 GMT-07:00 Heero Zhang: > 我看到454行23秒时开始发送数据lua tcp socket sent all the data，然后11秒后，到34秒lua tcp socket > read timed out，看上去是发送数据后，11秒没收到数据所以超时了。 > 明白了，这11秒就是for循环消耗的11秒 > 444行是打印for循环时间的信息，打印时间写着13:31:23点，而23秒是开始请求的时间， > 其实当时应该13:31:34点了，估计内部计时错误导致他错误的认为timeout了？从调试日志看，显然两次请求对应的处理程序在执行时间上发生了叠加。当你的前一个请求的 do_redis_read() 函数尚未返回之前，后一个请求的 test(10000000000) 调试已经开始执行了。从日志消息上看，你确实使用的是同一个下游连接。此时只有一种解释，你的 Lua 代码执行在某个 ngx.timer.at 的回调中。由于定时器异步的本质（即脱离并异步于下游请求来执行的本质），你很难通过串行的下游请求来实现这些定时器的严格串行。解决的办法是使用一种每 worker 的“锁”来确保定时器操作的原子性。你可以通过 lua-resty-lrucache 库来实现这种每 worker 的锁： https://github.com/openresty/lua-resty-lrucache 建议给锁对应的 key 设置上一个（比较长的）过期时间，以避免因为持锁者的（万一）意外中止而导致的死锁。 Regards, -agentzh

heeroz

感谢详细的解说。

抱歉，是有其他站点的 timer.at 和access_by_lua干扰了，不过那些调用那个和test.lua没关系。所以实际上 test(10000000000) 还是只执行了一次。

这次我把其他站点用的 timer.at 的代码和access_by_lua都注掉了，还是报这个错误。我看到日志里面还是有event timer add，应该都是内部的timer了。

附上这次的日志

在 2014年9月4日上午2:42，Yichun Zhang (agentzh) < [email protected] > 写道：

Hello!
2014-09-02 22:51 GMT-07:00 Heero Zhang:

> 我看到454行23秒时开始发送数据lua tcp socket sent all the data，然后11秒后，到34秒lua tcp socket
> read timed out，看上去是发送数据后，11秒没收到数据所以超时了。

> 明白了，这11秒就是for循环消耗的11秒
> 444行是打印for循环时间的信息，打印时间写着13:31:23点，而23秒是开始请求的时间，
> 其实当时应该13:31:34点了，估计内部计时错误导致他错误的认为timeout了？

从调试日志看，显然两次请求对应的处理程序在执行时间上发生了叠加。当你的前一个请求的 do_redis_read()
函数尚未返回之前，后一个请求的 test(10000000000) 调试已经开始执行了。
从日志消息上看，你确实使用的是同一个下游连接。此时只有一种解释，你的 Lua 代码执行在某个 ngx.timer.at
的回调中。由于定时器异步的本质（即脱离并异步于下游请求来执行的本质），你很难通过串行的下游请求来实现这些定时器的严格串行。解决的办法是使用一种每
worker 的“锁”来确保定时器操作的原子性。你可以通过 lua-resty-lrucache 库来实现这种每 worker 的锁：
https://github.com/openresty/lua-resty-lrucache
建议给锁对应的 key 设置上一个（比较长的）过期时间，以避免因为持锁者的（万一）意外中止而导致的死锁。
Regards,
-agentzh

==> /site/error.log <==
2014/09/04 10:58:03 [debug] 5576#0: *25 http cl:-1 max:1048576
2014/09/04 10:58:03 [debug] 5576#0: *25 rewrite phase: 3
2014/09/04 10:58:03 [debug] 5576#0: *25 rewrite phase: 4
2014/09/04 10:58:03 [debug] 5576#0: *25 post rewrite phase: 5
2014/09/04 10:58:03 [debug] 5576#0: *25 generic phase: 6
2014/09/04 10:58:03 [debug] 5576#0: *25 generic phase: 7
2014/09/04 10:58:03 [debug] 5576#0: *25 access phase: 8
2014/09/04 10:58:03 [debug] 5576#0: *25 access phase: 9
2014/09/04 10:58:03 [debug] 5576#0: *25 post access phase: 10
2014/09/04 10:58:03 [debug] 5576#0: *25 lua content handler, uri:"/api/1.0/test" c:1
2014/09/04 10:58:03 [debug] 5576#0: *25 add cleanup: 8005D6B0
2014/09/04 10:58:03 [debug] 5576#0: *25 lua creating new vm state
2014/09/04 10:58:03 [debug] 5576#0: *25 lua initializing lua registry
2014/09/04 10:58:03 [debug] 5576#0: *25 lua initializing lua globals
2014/09/04 10:58:03 [debug] 5576#0: *25 lua initialize the global Lua VM FFF301C0
2014/09/04 10:58:03 [debug] 5576#0: *25 malloc: 800678F8:8
2014/09/04 10:58:03 [debug] 5576#0: *25 http script var: "/site"
2014/09/04 10:58:03 [debug] 5576#0: *25 http script copy: "/api/1.0/"
2014/09/04 10:58:03 [debug] 5576#0: *25 http script capture: "test"
2014/09/04 10:58:03 [debug] 5576#0: *25 http script copy: ".lua"
2014/09/04 10:58:03 [debug] 5576#0: *25 lua reset ctx
2014/09/04 10:58:03 [debug] 5576#0: *25 lua creating new thread
2014/09/04 10:58:03 [debug] 5576#0: *25 http cleanup add: 8005D704
2014/09/04 10:58:03 [debug] 5576#0: *25 lua run thread, top:0 c:1
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket pool get keepalive peer
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket keepalive connection pool not found
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket connect timeout: 2000
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket network address given directly
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket resolve retval handler
2014/09/04 10:58:03 [debug] 5576#0: *25 socket 6
2014/09/04 10:58:03 [debug] 5576#0: *25 connect to 127.0.0.1:6379, fd:6 #27
2014/09/04 10:58:03 [debug] 5576#0: *25 select add event fd:6 ev:1
2014/09/04 10:58:03 [debug] 5576#0: *25 select add event fd:6 ev:4
2014/09/04 10:58:03 [debug] 5576#0: *25 http cleanup add: 8005D770
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket connect: -2
2014/09/04 10:58:03 [debug] 5576#0: *25 event timer add: 6: 2000:1050212201
2014/09/04 10:58:03 [debug] 5576#0: *25 lua resume returned 1
2014/09/04 10:58:03 [debug] 5576#0: *25 lua thread yielded
2014/09/04 10:58:03 [debug] 5576#0: *25 http finalize request: -4, "/api/1.0/test?" a:1, c:2
2014/09/04 10:58:03 [debug] 5576#0: *25 http request count:2 blk:0
2014/09/04 10:58:03 [debug] 5576#0: *25 post event 8005C0E0
2014/09/04 10:58:03 [debug] 5576#0: *25 delete posted event 8005C0E0
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket handler for "/api/1.0/test?", wev 1
2014/09/04 10:58:03 [debug] 5576#0: *25 event timer del: 6: 1050212201
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket connected
2014/09/04 10:58:03 [debug] 5576#0: *25 select del event fd:6 ev:4
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket waking up the current request (conn)
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp operation done, resuming lua thread
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket calling prepare retvals handler 004A6213, u:FFF18408
2014/09/04 10:58:03 [debug] 5576#0: *25 lua run thread, top:0 c:1
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket send timeout: 2000
2014/09/04 10:58:03 [debug] 5576#0: *25 lua allocate new chainlink and new buf of size 39, cl:8005D77C
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket send data
2014/09/04 10:58:03 [debug] 5576#0: *25 send: fd:6 39 of 39
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket sent all the data
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket calling receive() method
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket read timeout: 2000
2014/09/04 10:58:03 [debug] 5576#0: *25 lua allocate new chainlink and new buf of size 65536, cl:8005D7E4
2014/09/04 10:58:03 [debug] 5576#0: *25 malloc: 800B1920:65536
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket read data: wait:0
2014/09/04 10:58:03 [debug] 5576#0: *25 event timer add: 6: 2000:1050212218
2014/09/04 10:58:03 [debug] 5576#0: *25 lua resume returned 1
2014/09/04 10:58:03 [debug] 5576#0: *25 lua thread yielded
2014/09/04 10:58:03 [debug] 5576#0: *25 lua run thread returned -2
2014/09/04 10:58:03 [debug] 5576#0: *25 post event 800581F8
2014/09/04 10:58:03 [debug] 5576#0: *25 delete posted event 800581F8
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket handler for "/api/1.0/test?", wev 0
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket read handler
2014/09/04 10:58:03 [debug] 5576#0: *25 event timer del: 6: 1050212218
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket read data: wait:1
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket try to recv data 65536: "/api/1.0/test?"
2014/09/04 10:58:03 [debug] 5576#0: *25 recv: fd:6 5 of 65536
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket recv returned 5: "/api/1.0/test?"
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket read line
"014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket read the final line part: "+OK
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket receive done: wait:1, eof:0, uri:"/api/1.0/test?"
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket waking up the current request (read)
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp operation done, resuming lua thread
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket calling prepare retvals handler 004A79B5, u:FFF18408
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket receive return value handler
2014/09/04 10:58:03 [debug] 5576#0: *25 lua run thread, top:0 c:1
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket set keepalive: saving connection 8005A470
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket connection pool size: 100
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket keepalive create connection pool for key "127.0.0.1:6379"
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket clear current socket connection
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket keepalive timeout: 10000 ms
2014/09/04 10:58:03 [debug] 5576#0: *25 event timer add: 6: 10000:1050220219
2014/09/04 10:58:03 [debug] 5576#0: *25 lua finalize socket
==> /site/access.log <==
127.0.0.1 - - [04/Sep/2014:10:58:14 +0800] "GET /api/1.0/test HTTP/1.1" "-" 500 600 "c1f4a1fa86eb930053f96f7a97a091c7" "http://localhost/test/index.html&quot; "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK
it/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"
==> /site/error.log <==
2014/09/04 10:58:03 [notice] 5576#0: *25 [lua] test.lua:150: test(): time 11, client: 127.0.0.1, server: localhost, request: "GET /api/1.0/test HTTP/1.1", host: "localhost", referrer: "http://
localhost/test/index.html"
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket pool get keepalive peer
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket get keepalive peer: using connection 8005A470, fd:6
2014/09/04 10:58:03 [debug] 5576#0: *25 event timer del: 6: 1050220219
2014/09/04 10:58:03 [debug] 5576#0: *25 http cleanup add: 8005D838
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket send timeout: 2000
2014/09/04 10:58:03 [debug] 5576#0: *25 lua reuse free buf memory 39 >= 39, cl:8005D77C, p:8005D7BC
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket send data
2014/09/04 10:58:03 [debug] 5576#0: *25 send: fd:6 39 of 39
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket sent all the data
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket calling receive() method
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket read timeout: 2000
2014/09/04 10:58:03 [debug] 5576#0: *25 lua reuse free buf memory 65536 >= 65536, cl:8005D7E4, p:800B1920
2014/09/04 10:58:03 [debug] 5576#0: *25 lua tcp socket read data: wait:0
2014/09/04 10:58:03 [debug] 5576#0: *25 event timer add: 6: 2000:1050212219
2014/09/04 10:58:03 [debug] 5576#0: *25 lua resume returned 1
2014/09/04 10:58:03 [debug] 5576#0: *25 lua thread yielded
2014/09/04 10:58:03 [debug] 5576#0: *25 lua run thread returned -2
2014/09/04 10:58:14 [debug] 5576#0: *25 post event 800581F8
2014/09/04 10:58:14 [debug] 5576#0: *25 event timer del: 6: 1050212219
2014/09/04 10:58:14 [debug] 5576#0: *25 lua tcp socket handler for "/api/1.0/test?", wev 0
2014/09/04 10:58:14 [debug] 5576#0: *25 lua tcp socket read handler
2014/09/04 10:58:14 [error] 5576#0: *25 lua tcp socket read timed out, client: 127.0.0.1, server: localhost, request: "GET /api/1.0/test HTTP/1.1", host: "localhost", referrer: "http://localho
st/test/index.html"
2014/09/04 10:58:14 [debug] 5576#0: *25 lua tcp socket handle read error
2014/09/04 10:58:14 [debug] 5576#0: *25 lua tcp socket waking up the current request
2014/09/04 10:58:14 [debug] 5576#0: *25 lua tcp operation done, resuming lua thread
2014/09/04 10:58:14 [debug] 5576#0: *25 lua tcp socket calling prepare retvals handler 004A79B5, u:FFF3E990
2014/09/04 10:58:14 [debug] 5576#0: *25 lua tcp socket receive return value handler
2014/09/04 10:58:14 [debug] 5576#0: *25 lua run thread, top:0 c:1
2014/09/04 10:58:14 [debug] 5576#0: *25 lua finalize socket
2014/09/04 10:58:14 [debug] 5576#0: *25 select del event fd:6 ev:1
2014/09/04 10:58:14 [debug] 5576#0: *25 delete posted event 800581F8
2014/09/04 10:58:14 [debug] 5576#0: *25 lua close socket connection
2014/09/04 10:58:14 [debug] 5576#0: *25 reusable connection: 0
2014/09/04 10:58:14 [debug] 5576#0: *25 lua tcp socket keepalive: free connection pool for "127.0.0.1:6379"
2014/09/04 10:58:14 [debug] 5576#0: *25 lua resume returned 2
2014/09/04 10:58:14 [error] 5576#0: *25 lua entry thread aborted: runtime error: /site/api/1.0/test.lua:132: failed to authenticate:timeout
stack traceback:
coroutine 0:
        [C]: in function 'error'
        /site/api/1.0/test.lua:3: in function 'do_redis_read'
        /site/api/1.0/test.lua:37: in function </site/api/1.0/test.lua:1>, client: 127.0.0.1, server: localhost, request: "GET /api/1.0/test HTTP/1.1", host: "localhos
t", referrer: "http://localhost/test/index.html&quot;
2014/09/04 10:58:14 [debug] 5576#0: *25 lua request cleanup: forcible=0
2014/09/04 10:58:14 [debug] 5576#0: *25 lua run thread returned 500
2014/09/04 10:58:14 [debug] 5576#0: *25 http finalize request: 500, "/api/1.0/test?" a:1, c:1
2014/09/04 10:58:14 [debug] 5576#0: *25 http special response: 500, "/api/1.0/test?"
2014/09/04 10:58:14 [debug] 5576#0: *25 http set discard body
2014/09/04 10:58:14 [debug] 5576#0: *25 headers more header filter, uri "/api/1.0/test"
2014/09/04 10:58:14 [debug] 5576#0: *25 lua header filter for user lua code, uri "/api/1.0/test"
2014/09/04 10:58:14 [debug] 5576#0: *25 lua capture header filter, uri "/api/1.0/test"
2014/09/04 10:58:14 [debug] 5576#0: *25 charset: "" > "utf-8"
2014/09/04 10:58:14 [debug] 5576#0: *25 HTTP/1.1 500 Internal Server Error
Date: Thu, 04 Sep 2014 02:58:14 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 600
Connection: close
Server: GameServer_openresty
2014/09/04 10:58:14 [debug] 5576#0: *25 write new buf t:1 f:0 8005D88C, pos 8005D88C, size: 181 file: 0, size: 0
2014/09/04 10:58:14 [debug] 5576#0: *25 http write filter: l:0 f:0 s:181
2014/09/04 10:58:14 [debug] 5576#0: *25 http output filter "/api/1.0/test?"
2014/09/04 10:58:14 [debug] 5576#0: *25 http copy filter: "/api/1.0/test?"
2014/09/04 10:58:14 [debug] 5576#0: *25 lua body filter for user lua code, uri "/api/1.0/test"
2014/09/04 10:58:14 [debug] 5576#0: *25 http cleanup add: 8005DA5C
2014/09/04 10:58:14 [debug] 5576#0: *25 lua capture body filter, uri "/api/1.0/test"
2014/09/04 10:58:14 [debug] 5576#0: *25 http postpone filter "/api/1.0/test?" 8005DA44
2014/09/04 10:58:14 [debug] 5576#0: *25 write old buf t:1 f:0 8005D88C, pos 8005D88C, size: 181 file: 0, size: 0
2014/09/04 10:58:14 [debug] 5576#0: *25 write new buf t:0 f:0 00000000, pos 004CB4A0, size: 140 file: 0, size: 0
2014/09/04 10:58:14 [debug] 5576#0: *25 write new buf t:0 f:0 00000000, pos 004CC660, size: 58 file: 0, size: 0
2014/09/04 10:58:14 [debug] 5576#0: *25 write new buf t:0 f:0 00000000, pos 004CC480, size: 402 file: 0, size: 0
2014/09/04 10:58:14 [debug] 5576#0: *25 http write filter: l:1 f:0 s:781
2014/09/04 10:58:14 [debug] 5576#0: *25 http write filter limit 0
2014/09/04 10:58:14 [debug] 5576#0: *25 writev: 781
2014/09/04 10:58:14 [debug] 5576#0: *25 http write filter 00000000
2014/09/04 10:58:14 [debug] 5576#0: *25 http copy filter: 0 "/api/1.0/test?"
2014/09/04 10:58:14 [debug] 5576#0: *25 http finalize request: 0, "/api/1.0/test?" a:1, c:1
2014/09/04 10:58:14 [debug] 5576#0: *25 http request count:1 blk:0
2014/09/04 10:58:14 [debug] 5576#0: *25 http close request
2014/09/04 10:58:14 [debug] 5576#0: *25 lua request cleanup: forcible=0
2014/09/04 10:58:14 [debug] 5576#0: *25 http log handler
2014/09/04 10:58:14 [debug] 5576#0: *25 parse header: "Cookie: JSESSIONID=215D25F47F6D74E6D1F787BD52AC17A6; session=c1f4a1fa86eb930053f96f7a97a091c7"
2014/09/04 10:58:14 [debug] 5576#0: *25 run cleanup: 8005D6B0
2014/09/04 10:58:14 [debug] 5576#0: *25 free: 800B1920
2014/09/04 10:58:14 [debug] 5576#0: *25 free: 8005CCB0, unused: 269
2014/09/04 10:58:14 [debug] 5576#0: *25 close http connection: 4
2014/09/04 10:58:14 [debug] 5576#0: *25 select del event fd:4 ev:1
2014/09/04 10:58:14 [debug] 5576#0: *25 reusable connection: 0
2014/09/04 10:58:14 [debug] 5576#0: *25 free: 80059268
2014/09/04 10:58:14 [debug] 5576#0: *25 free: 80067910, unused: 36
                                doujiang24
                    
建议你搞一个干净的环境，我这里用你的用例，无法复现




    

而且你提供的 log 日志，貌似有点乱且不完整，疑似经过修改，应该是还存在有其他应用干扰


姑且从日志分析看，你日志里的执行时序，跟我复现的很像，但是却有错误产生：



1. test(10000000000) 这个大 job 确实没有在 timer.at 执行，如果在 timer 中，日志会有这样：context: ngx.timer

2.   runtime error: /site/api/1.0/test.lua:132: failed to authenticate:timeout，从这看，貌似你的 test.lua 已经产生了一些变化，第一个日志里 /test.lua:16



所以，从目前的情况看，建议你搞一个干净的环境，最简洁的用例，详细的配置




最后谈一下，长时间 cpu 阻塞 （大 job） 的问题：

对于长时间的 cpu 阻塞应该是禁止出现的，test(10000000000) 这么用很容易出问题

ngx_lua 在底层处理网络 io 时是异步的，很有可能会在 send 和 receive 之间 插入别的计算任务，而且可以理解为不可控的插入，很有可能会造成 io 的超时


有两种解决方式：

1. 把这种 大 job 和有超时要求的任务，放在不同的服务里，比如再建立一个后端更新服务

2. 如果非要放一起，可以把 一个大job 打散，大job中定期调用 ngx.sleep







在 2014年9月4日 上午11:06，Heero Zhang <[email protected]>写道：

感谢详细的解说。

抱歉，是有其他站点的timer.at和access_by_lua干扰了，不过那些调用那个和test.lua没关系。所以实际上test(10000000000) 还是只执行了一次。

这次我把其他站点用的timer.at的代码和access_by_lua都注掉了，还是报这个错误。我看到日志里面还是有event timer add，应该都是内部的timer了。
附上这次的日志




在 2014年9月4日 上午2:42，Yichun Zhang (agentzh) <[email protected]>写道：

Hello!

2014-09-02 22:51 GMT-07:00 Heero Zhang:

> 我看到454行23秒时开始发送数据lua tcp socket sent all the data，然后11秒后，到34秒lua tcp socket

> read timed out，看上去是发送数据后，11秒没收到数据所以超时了。

> 明白了，这11秒就是for循环消耗的11秒

> 444行是打印for循环时间的信息，打印时间写着13:31:23点，而23秒是开始请求的时间，

> 其实当时应该13:31:34点了，估计内部计时错误导致他错误的认为timeout了？

从调试日志看，显然两次请求对应的处理程序在执行时间上发生了叠加。当你的前一个请求的 do_redis_read()

函数尚未返回之前，后一个请求的 test(10000000000) 调试已经开始执行了。

从日志消息上看，你确实使用的是同一个下游连接。此时只有一种解释，你的 Lua 代码执行在某个 ngx.timer.at

的回调中。由于定时器异步的本质（即脱离并异步于下游请求来执行的本质），你很难通过串行的下游请求来实现这些定时器的严格串行。解决的办法是使用一种每

worker 的“锁”来确保定时器操作的原子性。你可以通过 lua-resty-lrucache 库来实现这种每 worker 的锁：

    https://github.com/openresty/lua-resty-lrucache

建议给锁对应的 key 设置上一个（比较长的）过期时间，以避免因为持锁者的（万一）意外中止而导致的死锁。

Regards,

-agentzh

                                heeroz
                    我自己也是试了重开台干净的服务器无法复现，试了很多遍，尝试把生产环境服务器的东西一个个加进去，还是不行，实在是找不到到底哪里有区别。所以我现在的日志是直接在自己会出错的windows机器和ubuntu上试验，提供的log是windows的。 
我的test.lua上面有很多反复测试时的代码被注掉了，执行的代码和邮件1的一样，但行数有100多行的误差，所以为了他人能看懂，我offset了行数，log的条目和时序没有修改




在 2014年9月4日 下午1:11，DeJiang Zhu <[email protected]>写道：


建议你搞一个干净的环境，我这里用你的用例，无法复现

而且你提供的 log 日志，貌似有点乱且不完整，疑似经过修改，应该是还存在有其他应用干扰


姑且从日志分析看，你日志里的执行时序，跟我复现的很像，但是却有错误产生：



1. test(10000000000) 这个大 job 确实没有在 timer.at 执行，如果在 timer 中，日志会有这样：context: ngx.timer

2.   runtime error: /site/api/1.0/test.lua:132: failed to authenticate:timeout，从这看，貌似你的 test.lua 已经产生了一些变化，第一个日志里 /test.lua:16



所以，从目前的情况看，建议你搞一个干净的环境，最简洁的用例，详细的配置




最后谈一下，长时间 cpu 阻塞 （大 job） 的问题：

对于长时间的 cpu 阻塞应该是禁止出现的，test(10000000000) 这么用很容易出问题

ngx_lua 在底层处理网络 io 时是异步的，很有可能会在 send 和 receive 之间 插入别的计算任务，而且可以理解为不可控的插入，很有可能会造成 io 的超时


有两种解决方式：

1. 把这种 大 job 和有超时要求的任务，放在不同的服务里，比如再建立一个后端更新服务

2. 如果非要放一起，可以把 一个大job 打散，大job中定期调用 ngx.sleep







在 2014年9月4日 上午11:06，Heero Zhang <[email protected]>写道：
感谢详细的解说。

抱歉，是有其他站点的timer.at和access_by_lua干扰了，不过那些调用那个和test.lua没关系。所以实际上test(10000000000) 还是只执行了一次。
2014-09-02 22:51 GMT-07:00 Heero Zhang:

> 我看到454行23秒时开始发送数据lua tcp socket sent all the data，然后11秒后，到34秒lua tcp socket

> read timed out，看上去是发送数据后，11秒没收到数据所以超时了。

> 明白了，这11秒就是for循环消耗的11秒

> 444行是打印for循环时间的信息，打印时间写着13:31:23点，而23秒是开始请求的时间，

> 其实当时应该13:31:34点了，估计内部计时错误导致他错误的认为timeout了？

从调试日志看，显然两次请求对应的处理程序在执行时间上发生了叠加。当你的前一个请求的 do_redis_read()

函数尚未返回之前，后一个请求的 test(10000000000) 调试已经开始执行了。

从日志消息上看，你确实使用的是同一个下游连接。此时只有一种解释，你的 Lua 代码执行在某个 ngx.timer.at

的回调中。由于定时器异步的本质（即脱离并异步于下游请求来执行的本质），你很难通过串行的下游请求来实现这些定时器的严格串行。解决的办法是使用一种每

worker 的“锁”来确保定时器操作的原子性。你可以通过 lua-resty-lrucache 库来实现这种每 worker 的锁：

    https://github.com/openresty/lua-resty-lrucache

建议给锁对应的 key 设置上一个（比较长的）过期时间，以避免因为持锁者的（万一）意外中止而导致的死锁。

Regards,

-agentzh

                                heeroz
                    好吧，我找到问题了，的确和我猜的一样。
在test(10000000000)行后加入一句ngx.update_time()，就不会再出现timeout的报错了。

感觉这个挺陷阱的，不知道命令行模式是否有这个问题，估计很多人会用命令行直接来做一些需要长时间运行的工具吧，毕竟openresty环境实在是好用多了。





在 2014年9月4日 下午1:35，Heero Zhang <[email protected]>写道：

我自己也是试了重开台干净的服务器无法复现，试了很多遍，尝试把生产环境服务器的东西一个个加进去，还是不行，实在是找不到到底哪里有区别。所以我现在的日志是直接在自己会出错的windows机器和ubuntu上试验，提供的log是windows的。 

我的test.lua上面有很多反复测试时的代码被注掉了，执行的代码和邮件1的一样，但行数有100多行的误差，所以为了他人能看懂，我offset了行数，log的条目和时序没有修改




在 2014年9月4日 下午1:11，DeJiang Zhu <[email protected]>写道：

建议你搞一个干净的环境，我这里用你的用例，无法复现

而且你提供的 log 日志，貌似有点乱且不完整，疑似经过修改，应该是还存在有其他应用干扰


姑且从日志分析看，你日志里的执行时序，跟我复现的很像，但是却有错误产生：



1. test(10000000000) 这个大 job 确实没有在 timer.at 执行，如果在 timer 中，日志会有这样：context: ngx.timer

2.   runtime error: /site/api/1.0/test.lua:132: failed to authenticate:timeout，从这看，貌似你的 test.lua 已经产生了一些变化，第一个日志里 /test.lua:16



所以，从目前的情况看，建议你搞一个干净的环境，最简洁的用例，详细的配置




最后谈一下，长时间 cpu 阻塞 （大 job） 的问题：

对于长时间的 cpu 阻塞应该是禁止出现的，test(10000000000) 这么用很容易出问题

ngx_lua 在底层处理网络 io 时是异步的，很有可能会在 send 和 receive 之间 插入别的计算任务，而且可以理解为不可控的插入，很有可能会造成 io 的超时


有两种解决方式：

1. 把这种 大 job 和有超时要求的任务，放在不同的服务里，比如再建立一个后端更新服务

2. 如果非要放一起，可以把 一个大job 打散，大job中定期调用 ngx.sleep







在 2014年9月4日 上午11:06，Heero Zhang <[email protected]>写道：
感谢详细的解说。

抱歉，是有其他站点的timer.at和access_by_lua干扰了，不过那些调用那个和test.lua没关系。所以实际上test(10000000000) 还是只执行了一次。
2014-09-02 22:51 GMT-07:00 Heero Zhang:

> 我看到454行23秒时开始发送数据lua tcp socket sent all the data，然后11秒后，到34秒lua tcp socket

> read timed out，看上去是发送数据后，11秒没收到数据所以超时了。

> 明白了，这11秒就是for循环消耗的11秒

> 444行是打印for循环时间的信息，打印时间写着13:31:23点，而23秒是开始请求的时间，

> 其实当时应该13:31:34点了，估计内部计时错误导致他错误的认为timeout了？

从调试日志看，显然两次请求对应的处理程序在执行时间上发生了叠加。当你的前一个请求的 do_redis_read()

函数尚未返回之前，后一个请求的 test(10000000000) 调试已经开始执行了。

从日志消息上看，你确实使用的是同一个下游连接。此时只有一种解释，你的 Lua 代码执行在某个 ngx.timer.at

的回调中。由于定时器异步的本质（即脱离并异步于下游请求来执行的本质），你很难通过串行的下游请求来实现这些定时器的严格串行。解决的办法是使用一种每

worker 的“锁”来确保定时器操作的原子性。你可以通过 lua-resty-lrucache 库来实现这种每 worker 的锁：

    https://github.com/openresty/lua-resty-lrucache

建议给锁对应的 key 设置上一个（比较长的）过期时间，以避免因为持锁者的（万一）意外中止而导致的死锁。

Regards,

-agentzh

                                lyj.william
                    
  这个问题我也遇到了，一个服务在redis 之前调用了切割大文件命令，耗费大量时间，导致部分redis 操作报read timed out.

分析原因是跟 nginx 对时间的缓存和时间更新机制有关
执行流程如下：

    1.  在epoll 回调中执行任务 开始时间T1
    2.  这个阻塞 CPU 的任务耗时太大，假设耗时100s，执行完时真实时间是 T2 = T1+100s
    3.  此时操作redis ,设置超时2秒，此时timer 的基准时间是T1，即超时时间是T3=T1+2s
    4.  如果redis在当前回调中能完成，则不会报错。
    5.  若由于某种原因回调返回（read 无数据返回），导致nginx 时间被更新,更新为T2
 
 800     events = epoll_wait(ep, event_list, (int) nevents, timer);

 804     if (flags & NGX_UPDATE_TIME || ngx_event_timer_alarm) {
 805         ngx_time_update();
 806     }

    6.    ngx_event_expire_timers被调用，redis 操作被当做超时事件处理 (T3<T2)

 250     delta = ngx_current_msec;
 251 
 252     (void) ngx_process_events(cycle, timer, flags); 
 253 
 254     delta = ngx_current_msec - delta;
 255 

 265     if (delta) {
 266         ngx_event_expire_timers();
 267     }

            
  而ngx.update_time() 调用了ngx_time_update，更新了nginx 缓存的时间，redis 的操作就不会超时。
  
code：
    split 文件，耗时几分

    local rds = redis.new()
    rds:set_timeout(6000)

    local ok, err = rds.connect(rds, s3conf.get_conf("redis_ip"), s3conf.get_conf("redis_port"))

    if not ok then  
        ngx.log(ngx.ERR, "failed to connect redis: ", err)
        return
    end

    local res, err = rds:auth(s3conf.get_conf("redis_auth"))
    if not res then
        ngx.log(ngx.ERR, "failed to auth redis: ", err)
        return
    end

日志：
2018/08/07 17:29:35 [info] 28331#0:   *56332 [lua] multi_upload.lua:297: main(): split ret: true ,nparts: 44, client: 127.0.0.1, server: localhost, request: "POST /multi HTTP/1.1"
2018/08/07 17:29:35 [error] 28331#0: *56332 lua tcp socket read timed out, client: 127.0.0.1, server: localhost, request: "POST /multi HTTP/1.1", host: "127.0.0.1:9000"                                     
2018/08/07 17:29:35 [error] 28331#0: *56332 [lua] multi_upload.lua:334: main(): failed to auth redis: timeout, client: 127.0.0.1, server: localhost, request: "POST /multi HTTP/1.1"

这错误有一定的概率出现。



在 2014年9月4日星期四 UTC+8下午2:10:51，Heero Zhang写道：好吧，我找到问题了，的确和我猜的一样。
在test(10000000000)行后加入一句ngx.update_time()，就不会再出现timeout的报错了。

感觉这个挺陷阱的，不知道命令行模式是否有这个问题，估计很多人会用命令行直接来做一些需要长时间运行的工具吧，毕竟openresty环境实在是好用多了。





在 2014年9月4日 下午1:35，Heero Zhang <[email protected]>写道：

我自己也是试了重开台干净的服务器无法复现，试了很多遍，尝试把生产环境服务器的东西一个个加进去，还是不行，实在是找不到到底哪里有区别。所以我现在的日志是直接在自己会出错的windows机器和ubuntu上试验，提供的log是windows的。 

我的test.lua上面有很多反复测试时的代码被注掉了，执行的代码和邮件1的一样，但行数有100多行的误差，所以为了他人能看懂，我offset了行数，log的条目和时序没有修改




在 2014年9月4日 下午1:11，DeJiang Zhu <[email protected]>写道：

建议你搞一个干净的环境，我这里用你的用例，无法复现

而且你提供的 log 日志，貌似有点乱且不完整，疑似经过修改，应该是还存在有其他应用干扰


姑且从日志分析看，你日志里的执行时序，跟我复现的很像，但是却有错误产生：



1. test(10000000000) 这个大 job 确实没有在 timer.at 执行，如果在 timer 中，日志会有这样：context: ngx.timer

2.   runtime error: /site/api/1.0/test.lua:132: failed to authenticate:timeout，从这看，貌似你的 test.lua 已经产生了一些变化，第一个日志里 /test.lua:16



所以，从目前的情况看，建议你搞一个干净的环境，最简洁的用例，详细的配置




最后谈一下，长时间 cpu 阻塞 （大 job） 的问题：

对于长时间的 cpu 阻塞应该是禁止出现的，test(10000000000) 这么用很容易出问题

ngx_lua 在底层处理网络 io 时是异步的，很有可能会在 send 和 receive 之间 插入别的计算任务，而且可以理解为不可控的插入，很有可能会造成 io 的超时


有两种解决方式：

1. 把这种 大 job 和有超时要求的任务，放在不同的服务里，比如再建立一个后端更新服务

2. 如果非要放一起，可以把 一个大job 打散，大job中定期调用 ngx.sleep







在 2014年9月4日 上午11:06，Heero Zhang <[email protected]>写道：
感谢详细的解说。

抱歉，是有其他站点的timer.at和access_by_lua干扰了，不过那些调用那个和test.lua没关系。所以实际上test(10000000000) 还是只执行了一次。
2014-09-02 22:51 GMT-07:00 Heero Zhang:

> 我看到454行23秒时开始发送数据lua tcp socket sent all the data，然后11秒后，到34秒lua tcp socket

> read timed out，看上去是发送数据后，11秒没收到数据所以超时了。

> 明白了，这11秒就是for循环消耗的11秒

> 444行是打印for循环时间的信息，打印时间写着13:31:23点，而23秒是开始请求的时间，

> 其实当时应该13:31:34点了，估计内部计时错误导致他错误的认为timeout了？

从调试日志看，显然两次请求对应的处理程序在执行时间上发生了叠加。当你的前一个请求的 do_redis_read()

函数尚未返回之前，后一个请求的 test(10000000000) 调试已经开始执行了。

从日志消息上看，你确实使用的是同一个下游连接。此时只有一种解释，你的 Lua 代码执行在某个 ngx.timer.at

的回调中。由于定时器异步的本质（即脱离并异步于下游请求来执行的本质），你很难通过串行的下游请求来实现这些定时器的严格串行。解决的办法是使用一种每

worker 的“锁”来确保定时器操作的原子性。你可以通过 lua-resty-lrucache 库来实现这种每 worker 的锁：

    https://github.com/openresty/lua-resty-lrucache

建议给锁对应的 key 设置上一个（比较长的）过期时间，以避免因为持锁者的（万一）意外中止而导致的死锁。

Regards,

-agentzh