
fix: resolve blocking issue in redis_writer when off_reply=false #906

Merged 4 commits into tair-opensource:v4 on Dec 17, 2024

Conversation

EquentR (Collaborator) commented Dec 16, 2024

fix: resolve blocking issue in redis_writer when off_reply=false

  • Improved the write handling logic in redis_standalone_writer, adding a sent-bytes counter and refining the ticker logic.
  • Aligned the redis_standalone_writer buffer size with the reply buffer size for consistent behavior.
  • Increased the write buffer size of the Redis client's underlying TCP connection to 128KiB to minimize short write errors (a rough sketch of this kind of tweak follows below).


issue:#905

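For context, a rough sketch of what bumping the TCP write buffer can look like in Go; `setWriteConnBuff` here is a hypothetical stand-in for this kind of helper, not necessarily the PR's exact code:

```go
package client

import (
	"fmt"
	"net"
)

// setWriteConnBuff is a hypothetical illustration (not this PR's exact code):
// it enlarges the kernel-side write buffer of the TCP connection backing the
// Redis client, e.g. to 128*1024 bytes, to reduce the chance of short writes
// under bursty traffic.
func setWriteConnBuff(conn net.Conn, bytes int) error {
	tcpConn, ok := conn.(*net.TCPConn)
	if !ok {
		return fmt.Errorf("not a TCP connection: %T", conn)
	}
	return tcpConn.SetWriteBuffer(bytes)
}
```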
suxb201 (Member) commented Dec 16, 2024

What was the problem with the previous implementation? 🤔

EquentR (Collaborator, Author) commented Dec 16, 2024

What was the problem with the previous implementation? 🤔

I annotated it over there; the core issue is that the reply channel fills up, so the send goroutine's select blocks on pushing into the reply channel and can never reach the other cases: #903
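A stripped-down, runnable repro of that shape (simplified names, not the actual redis_standalone_writer code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Commands are written into an application-side buffer and only hit the
	// wire on Flush; replies therefore cannot arrive before that flush.
	const queueSize = 4
	chWaitReply := make(chan int, queueSize)
	flushed := make(chan struct{})

	// Reply reader: can only start draining chWaitReply after the flush.
	go func() {
		<-flushed
		for cmd := range chWaitReply {
			fmt.Println("got reply for command", cmd)
		}
	}()

	// Sender: pushes each buffered command into chWaitReply and plans to
	// flush afterwards, but blocks as soon as the bounded queue is full.
	for cmd := 0; cmd < 10; cmd++ {
		select {
		case chWaitReply <- cmd:
		case <-time.After(2 * time.Second):
			fmt.Println("stuck: reply queue full before flush, sender blocked")
			return
		}
	}
	close(flushed) // never reached once the queue fills up
}
```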

suxb201 (Member) commented Dec 16, 2024

Your change is fairly large; it doesn't seem necessary to introduce a lock:

                                log.Debugf("[%s] send cmd. cmd=[%s]", w.stat.Name, e.String())
                                if !w.offReply {
-                                       w.chWaitReply <- e
+                                       select {
+                                       case w.chWaitReply <- e:
+                                       default:
+                                               w.client.Flush()
+                                               w.chWaitReply <- e
+                                       }
                                        atomic.AddInt64(&w.stat.UnansweredBytes, e.SerializedSize)
                                        atomic.AddInt64(&w.stat.UnansweredEntries, 1)

Also:

--- a/internal/client/redis.go
+++ b/internal/client/redis.go
@@ -47,7 +47,7 @@ func NewRedisClient(ctx context.Context, address string, username string, passwo
 
        r.conn = conn
        r.reader = bufio.NewReader(conn)
-       r.writer = bufio.NewWriterSize(conn, 16*1024*1024) // size is 16MB
+       r.writer = bufio.NewWriterSize(conn, 16*1024) // size is 16KB
        r.protoReader = proto.NewReader(r.reader)
        r.protoWriter = proto.NewWriter(r.writer)

EquentR (Collaborator, Author) commented Dec 16, 2024

Your suggested version looks fine, but at the network level it is about 50 Mbps slower:
with it, bandwidth holds at around 380 Mbps and ops hover around 400k;
with my version, bandwidth holds at around 430 Mbps and ops hover around 450k.

I'm not sure whether that already hits the limit of a standalone Redis itself. My version admittedly looks a bit uglier, but it does perform a bit better.

suxb201 (Member) commented Dec 16, 2024

In theory there shouldn't be a performance difference; it's probably caused by the buffer sizing, mainly the waitReply chan's buffer size and the bufio buffer size.

EquentR (Collaborator, Author) commented Dec 16, 2024

In theory there shouldn't be a performance difference; it's probably caused by the buffer sizing, mainly the waitReply chan's buffer size and the bufio buffer size.

Using whether the push into the reply channel succeeds to decide when to flush feels just like the earlier counted-send approach. In my tests the metered approach does perform a bit better, and as you mentioned before, it also avoids the OOM problem caused by big keys 🤣

suxb201 (Member) commented Dec 16, 2024

It's different from counting: it doesn't wait for a count to be reached before sending; bufio automatically writes out to the connection once its buffer is full, so there is no OOM problem.
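A small runnable illustration of that behavior (not code from this repo): bufio.Writer hands data to the underlying writer on its own once its buffer fills, so the buffered data never grows past the configured size.

```go
package main

import (
	"bufio"
	"fmt"
)

// countingWriter stands in for the TCP connection and records how often
// bufio pushes data down to it.
type countingWriter struct {
	writes, bytes int
}

func (c *countingWriter) Write(p []byte) (int, error) {
	c.writes++
	c.bytes += len(p)
	return len(p), nil
}

func main() {
	under := &countingWriter{}
	w := bufio.NewWriterSize(under, 4*1024) // 4 KiB buffer, chosen for the demo

	// Write 10 KiB in 1 KiB chunks without ever calling Flush: bufio flushes
	// to the underlying writer by itself whenever the 4 KiB buffer is full,
	// so memory stays bounded no matter how much is written.
	chunk := make([]byte, 1024)
	for i := 0; i < 10; i++ {
		if _, err := w.Write(chunk); err != nil {
			panic(err)
		}
	}
	fmt.Printf("underlying writes: %d (%d bytes), still buffered: %d bytes\n",
		under.writes, under.bytes, w.Buffered())
	w.Flush() // only the remainder still needs an explicit flush
}
```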

EquentR (Collaborator, Author) commented Dec 16, 2024

Got it, I'll adjust the code 😂

EquentR (Collaborator, Author) commented Dec 16, 2024

I still feel the lock is needed: bufio itself has no locking, and I'm worried that flush and write racing between two goroutines could corrupt the data in the buffer.
[screenshot omitted]
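A minimal hypothetical sketch of what such a guard would look like if the writer really were shared between goroutines (bufio.Writer has no internal locking); this is not code from this PR:

```go
package client

import (
	"bufio"
	"sync"
)

// lockedWriter is a hypothetical wrapper, not code from this PR: bufio.Writer
// has no internal locking, so if Write and Flush ever ran on different
// goroutines they would need to be serialized like this to keep the buffer
// consistent.
type lockedWriter struct {
	mu sync.Mutex
	w  *bufio.Writer
}

func (l *lockedWriter) Write(p []byte) (int, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.w.Write(p)
}

func (l *lockedWriter) Flush() error {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.w.Flush()
}
```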

suxb201 (Member) commented Dec 16, 2024

That looks like the destination side closing the connection; it's probably not caused by a race.
There shouldn't be any other goroutine competing, should there? Flush and write run in the same goroutine.

EquentR (Collaborator, Author) commented Dec 16, 2024

That looks like the destination side closing the connection; it's probably not caused by a race. There shouldn't be any other goroutine competing, should there? Flush and write run in the same goroutine.

I had originally split the two into separate goroutines 🤣; now that they're combined it looks fine. I set the replyChan size to twice the send chan's, so the metered flush is rarely triggered, and performance is back to normal (see the sizing sketch below).
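Roughly the sizing relationship described above, with placeholder names and an assumed capacity rather than the PR's actual identifiers:

```go
package writer

// pendingEntry stands in for the real entry type; sendChanSize is an assumed
// value, not taken from the PR.
type pendingEntry struct{}

const sendChanSize = 1024

// Give the reply queue twice the capacity of the send queue so the
// non-blocking push into chWaitReply almost always succeeds and the fallback
// flush-then-block path is rarely taken.
var (
	chSend      = make(chan *pendingEntry, sendChanSize)
	chWaitReply = make(chan *pendingEntry, 2*sendChanSize)
)
```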

suxb201 (Member) commented Dec 16, 2024

  1. SetWriteConnBuff(r.conn, 128*1024) — I don't think this is needed; the default should be fine. In practice this setting doesn't usually get changed, does it?
  2. r.writer = bufio.NewWriterSize(conn, 32*1024) // size is 32KiB — I honestly didn't think much when I originally wrote 16MB. Looking at it now, maybe keeping the default 4KB would work too? 4KB may be on the small side for throughput-focused applications, but for Redis it is already large enough.
    Do you have an environment on hand to run a benchmark?

EquentR (Collaborator, Author) commented Dec 16, 2024

On the first point, I have actually hit the short write problem. The common fixes found online are to enlarge the system's TCP buffer, or to override Flush so that the data left unwritten after a short write is pushed back into the TCP connection (a generic sketch of that idea follows below);
On the second point, I tried 4KB before making any changes: ops were at 200-300k; with 16KB around 400k, with 32KB around 450k, and at 64KB ops started dropping again. I'll run another benchmark when I have time.
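A generic sketch of the "push the remainder back" idea (not code from this PR): keep writing the unwritten tail until it is gone instead of surfacing io.ErrShortWrite.

```go
package client

import "io"

// writeFull illustrates the retry-on-short-write idea generically: if the
// underlying writer accepts only part of the buffer, keep writing the
// unwritten tail rather than giving up with io.ErrShortWrite.
func writeFull(w io.Writer, p []byte) error {
	for len(p) > 0 {
		n, err := w.Write(p)
		if err != nil {
			return err
		}
		if n == 0 {
			return io.ErrShortWrite // no progress; bail out instead of spinning
		}
		p = p[n:]
	}
	return nil
}
```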

suxb201 (Member) commented Dec 16, 2024

  1. bufio's short write error only shows up when there is a race; in principle the current implementation shouldn't have that problem.
  2. 32KB is reasonable.

EquentR (Collaborator, Author) commented Dec 16, 2024

I just ran a stress test: 16KB and 32KB don't differ much (380-480k ops), but 4KB does have a noticeable performance impact (370-420k ops), roughly 10%.
I asked GPT: the bufio short write should only happen at the level of the underlying writer, so let's leave this extremely rare short write issue unhandled for now.
[screenshot omitted]

Attached benchmark results:
32k: [screenshot omitted]
16k: [screenshot omitted]
4k: [screenshot omitted]

suxb201 merged commit a8c472e into tair-opensource:v4 on Dec 17, 2024
7 checks passed
EquentR deleted the fix-sendblock branch on December 25, 2024 at 07:02