A Few Notes on Building Retry Capabilities for FaaS Asynchronous Queues

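Anyone who uses FaaS regularly has run into event retries. As of now, apart from the two common egress triggers (APIGW and CLB) and one streaming Kafka synchronous trigger, most SCF triggers are asynchronous. How to make sure asynchronous triggers are retried and consumed sensibly when all kinds of errors occur is the subject of this post. Looking for the answer, I spent quite a while in a state of anxiety: challenges from engineering, challenges from customers, cost pressure, and the completely contradictory usage habits of customers who are not quite large accounts. We waded through nearly every point that could be challenged. The feature is about to launch, so I'll use this post to share how I thought about and built async queue retry in a FaaS setting.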

 

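To do a job well, you first need to know your tools. Let's start with the common errors Tencent Cloud SCF returns in asynchronous invocation scenarios: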

Category 1, 4xx: this includes 400 (invalid request), 430 (user code execution error), 432 (concurrency limit exceeded), 438 (function stopped), and so on.

Category 2, 5xx: mainly 500 (internal error) and 532 (insufficient compute resources).

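Viewed this way it is quite clear: there are just two categories of errors. After comparing against and borrowing from other cloud vendors, the first version of the retry policy was as follows: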

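4xx errors: the platform retries twice on the user's behalf. If a DLQ (dead letter queue) is configured, the failed message is then delivered to it; if no DLQ is configured, the event is dropped.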

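5xx errors: since 5xx errors are generally caused by the system, these events are retried with exponential backoff for 24 hours. If the error has not recovered after 24 hours, the event is lost. The DLQ applies to 5xx errors as well.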
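The first-version policy above can be sketched as a small decision function. This is only an illustration of the rules as stated in the text, not the platform's implementation; the function name, the parameters, and the returned strings are hypothetical.

```python
# Hypothetical sketch of the first-version policy; names are illustrative only.
MAX_4XX_RETRIES = 2             # "retries twice" for 4xx errors
MAX_5XX_WINDOW_S = 24 * 3600    # 5xx errors are retried for 24 hours


def first_version_decision(status: int, retries_done: int,
                           elapsed_s: float, has_dlq: bool) -> str:
    """Return 'retry', 'dlq', or 'drop' for a failed async event."""
    if 400 <= status < 500:
        exhausted = retries_done >= MAX_4XX_RETRIES
    elif 500 <= status < 600:
        exhausted = elapsed_s >= MAX_5XX_WINDOW_S
    else:
        raise ValueError(f"not an error status: {status}")
    if not exhausted:
        return "retry"
    # Once retries are exhausted the event leaves the queue for good.
    return "dlq" if has_dlq else "drop"
```

Note that under these rules a 432 (concurrency limit exceeded) is treated like any other 4xx: after two retries, `first_version_decision(432, 2, 120.0, False)` yields `"drop"`.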

 

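The first version of the retry policy stayed in use for a long time, until SCF moved concurrency up to a user-configurable setting. Then the conflicts erupted.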

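First came the challenge to the default of two retries for 4xx errors: some users stated explicitly that they did not want any retries. This was easy to solve. We handed the 4xx retry count over to the user, who can choose any value from 0 to 2.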

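Second came the challenge to the SLA. The retry queue previously had no eviction mechanism at all. For example, if a user set reserved concurrency to 0 and pushed 100 events into the async queue, under the old policy those 100 events could never be consumed and would become permanent dirty data. So we needed a uniform eviction mechanism to keep the queue from being overwhelmed by user behavior. We gave the async queue a time limit, capped at 6 hours: events older than 6 hours are dropped, or delivered to the DLQ if one is configured.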
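The two changes described above (a user-chosen retry count from 0 to 2, and age-based eviction with a 6-hour cap) might look roughly like this. It is a minimal sketch under assumptions: the `Event` type, the field names, and both function names are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

MAX_EVENT_AGE_S = 6 * 3600  # 6-hour cap on how long an event may stay queued


@dataclass
class Event:
    event_id: str
    enqueued_at_s: float  # hypothetical field: enqueue time in seconds


def validate_retry_count(requested: int) -> int:
    """Accept only the user-selectable range 0-2 for 4xx retries."""
    if not 0 <= requested <= 2:
        raise ValueError("retry count must be between 0 and 2")
    return requested


def evict_expired(queue: List[Event], now_s: float) -> Tuple[List[Event], List[Event]]:
    """Split the queue into (kept, expired) by event age.

    Expired events would go to the DLQ if one is configured, else be dropped.
    """
    kept: List[Event] = []
    expired: List[Event] = []
    for event in queue:
        if now_s - event.enqueued_at_s > MAX_EVENT_AGE_S:
            expired.append(event)
        else:
            kept.append(event)
    return kept, expired
```

With this in place, the 100 events stuck behind a reserved concurrency of 0 leave the queue once they pass the 6-hour mark instead of lingering forever.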

These two policies looked perfect; at least they solved the user problems we had at the time. But that was only the easiest part.

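We all know that the main job of an async queue is to buffer bursts of highly concurrent data for the user. Previously, concurrency matched the number of workers. After concurrency moved up to the user level, how many workers to start became something of a dark art. The simplest example is the concurrency over-limit situation we often run into, shown in the figure below. Under the first version of the retry policy, over-limit errors followed the rule for all 4xx errors: retried twice and then dropped.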

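What? Data loss? Around the time concurrency quotas launched, someone complained about lost data almost every day. This must never happen in a fully managed async queue. The ideal state is to start as many workers as the user has concurrency, so that over-limit errors never occur, as in the figure below.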

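But, and it is a big but, this is basically impossible to achieve, or at least very hard, because it demands too much of resource scheduling. So what else could guarantee that user data is not lost? I tried nearly everything I could think of, and in the end it was AWS that gave me the inspiration. Below is an AWS monitoring chart of Throttles when concurrency is exceeded: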

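Then it hit me: this is also a job for the retry policy. Why not put over-limit errors back into the queue and retry them? Wouldn't that neatly solve the loss of over-limit data? No complicated algorithm is needed to deal with over-limit errors at all. So the latest retry policy looks like this: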

 

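Runtime errors (including user code errors and Runtime errors): when this type of error occurs, the function platform retries twice by default, or as many times as configured, at a fixed interval of 1 minute. New trigger events continue to be processed normally while automatic retries are in progress. If you have configured a dead letter queue, events that still fail after the retries are sent to it; otherwise the platform drops them.
System errors: when this type of error occurs, the function platform keeps retrying for the maximum wait time you configured (6 hours by default), with the retry interval growing by exponential backoff up to 5 minutes. If you have configured a dead letter queue, events that still fail after the maximum wait time are sent to it for you to handle further; otherwise the platform drops them.
Over-limit errors: when this type of error occurs, the function platform keeps retrying for the maximum wait time you configured (6 hours by default), at an interval of 1 minute. If you have configured a dead letter queue, events that still fail after the maximum wait time are sent to it for you to handle further; otherwise the platform drops them.
Invocation request errors and caller errors: when this type of error occurs, the platform does not retry, because such a request would not succeed even if retried. The over-limit error (432) is the exception.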
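The four rules above can be condensed into one scheduling function. The following is a sketch under stated assumptions rather than the platform's actual code: the mapping of status codes to categories (430 as a runtime error, 432 as over-limit, any 5xx as a system error, every other code as a request error), the 1-second backoff base, and all names are assumptions made for illustration. Only the intervals, the 6-hour default, and the 5-minute cap come from the text.

```python
from dataclasses import dataclass
from typing import Optional

FIXED_INTERVAL_S = 60.0   # runtime and over-limit errors retry every minute
BACKOFF_CAP_S = 300.0     # system-error backoff grows up to 5 minutes
BACKOFF_BASE_S = 1.0      # assumed starting interval; not given in the text


@dataclass(frozen=True)
class RetryConfig:
    max_retries: int = 2              # user-configurable, 0-2
    max_event_age_s: int = 6 * 3600   # maximum wait time, default 6 hours


def classify(status: int) -> str:
    """Assumed mapping from status code to the four error categories."""
    if status == 430:
        return "runtime"
    if status == 432:
        return "over_limit"
    if 500 <= status < 600:
        return "system"
    return "request"


def next_delay_s(status: int, retries_done: int, age_s: float,
                 cfg: RetryConfig = RetryConfig()) -> Optional[float]:
    """Seconds until the next retry, or None when retrying stops."""
    kind = classify(status)
    if kind == "request":
        return None  # retrying would fail again, so never retry
    if kind == "runtime":
        return FIXED_INTERVAL_S if retries_done < cfg.max_retries else None
    if age_s >= cfg.max_event_age_s:
        return None  # maximum wait time reached: DLQ if configured, else dropped
    if kind == "over_limit":
        return FIXED_INTERVAL_S
    return min(BACKOFF_CAP_S, BACKOFF_BASE_S * 2 ** retries_done)
```

The key difference from the first version is that a 432 is no longer bounded by a retry count: it keeps going back into the queue every minute until the maximum wait time runs out.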

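The new retry policy removes the risk of data loss that concurrency over-limit errors posed to users, and lets the async queue deliver the value it should. Users no longer need to do anything about over-limit errors on asynchronous invocations: within the configured maximum wait time, the function platform retries them automatically.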

 

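The exploration was quite painful, and I went through countless rounds of debate with the engineering team. Fortunately we reached the goal in the end, though only after tearing down and redoing the design twice. We are also exploring a more open retry policy based on error codes, but it still feels premature and needs more validation and research, so I won't cover it here.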

 

These are a few thoughts summed up in a few words; please forgive any bias.


