视频1 视频21 视频41 视频61 视频文章1 视频文章21 视频文章41 视频文章61 推荐1 推荐3 推荐5 推荐7 推荐9 推荐11 推荐13 推荐15 推荐17 推荐19 推荐21 推荐23 推荐25 推荐27 推荐29 推荐31 推荐33 推荐35 推荐37 推荐39 推荐41 推荐43 推荐45 推荐47 推荐49 关键词1 关键词101 关键词201 关键词301 关键词401 关键词501 关键词601 关键词701 关键词801 关键词901 关键词1001 关键词1101 关键词1201 关键词1301 关键词1401 关键词1501 关键词1601 关键词1701 关键词1801 关键词1901 视频扩展1 视频扩展6 视频扩展11 视频扩展16 文章1 文章201 文章401 文章601 文章801 文章1001 资讯1 资讯501 资讯1001 资讯1501 标签1 标签501 标签1001 关键词1 关键词501 关键词1001 关键词1501 专题2001
TwilioincidentandRedis
2020-11-09 13:10:44 责编:小采
文档


Twilio just released a post mortem about an incident that caused issues with the billing system: http://www.twilio.com/blog/2013/07/billing-incident-post-mortem.html The problem was about a Redis server, since Twilio is using Redis to stor

Twilio just released a post mortem about an incident that caused issues with the billing system:

http://www.twilio.com/blog/2013/07/billing-incident-post-mortem.html

The problem was about a Redis server, since Twilio is using Redis to store the in-flight account balances, in a master-slaves setup, with multiple slaves in different data centers for obvious availability and data safety concerns.

This is a short analysis of the incident, what Twilio can do and what Redis can do to avoid this kind of issues.

The first observation is that Twilio uses Redis, an in memory system, in order to save balances, so everybody will say "WTF Twilio! Are you serious with your data?". Actually Redis uses memory to serve data and to internally manipulate its data structures, but the incident has *nothing to do* with the durability of Redis as a DB. In fact Twilio stated that they are using the append only file that can be a very durable solution as explained here: http://oldblog.antirez.com/post/redis-persistence-demystified.html

The incident is actually centered around two main aspects of Redis:

1) The replication system.
2) The configuration.

I'll address they two things respectively.

Analysis of the replication issue
===

Redis 2.6 always needs a full resynchronization between a master and a slave after a connection issue between the two.
Redis 2.8 addressed this problem, but is currently a release candidate, so Twilio had no way to use the new feature called "partial resynchronization".

Apparently the master became unavailable because many slaves tried to resynchronize at the same time.

Actually for the way Redis works a single slave or multiple slaves trying to resynchronize should not make a huge difference, since just a single RDB is created. As soon as the second slave attaches and there is already a background save in progress in order to create the first RDB (used for the bulk data transfer), it is put in a queue with the previous slave, and so forth for all the other slaves attaching. Redis will just produce a single RDB file.

However what is true is that Redis may use additional memory with many slaves attaching at the same time, since there are multiple output buffers to "record" to transfer when the RDB file is ready. This is true especially in the case of replication over WAN. In the Twilio blog post I read "multiple data centers" so it is possible that the replication process may be slow in some case.

The bottom line is, Redis normally does not need to go slow when multiple slaves are resynchronizing at the same time, unless something strange happens like hitting the memory limit of the server, with the master starting to swap and/or problems with very slow disks (probably EC2?) so that creating an RDB starts to mess with the ability to write to the AOF file.

However issues writing to the AOF are a bit unlikely to be the cause, since during the AOF rewrite there is the same kind of disk i/o stress, with one thread writing a lot of data to the new AOF, and the other (main) thread logging every new write to the AOF. Everything considered memory pressure seems more probable, but Twilio engineers can just comment with details about what happened, this will be an useful real-world data point for sure.

From the Twilio side, what is possible to do to minimize incidents, is to understand exactly why the master is not able, with the current architecture, to survive without serious loss of performance to many slaves resynchronizing.

From the Redis side, well, we had to do our homework and provide partial resynchronization *long time ago* probably, we finally have it in Redis 2.8, and it is very good that a few days ago I pushed forward the 2.8 release skipping all the other pending features for this release that will be postponed for the next release. Now we have the first release candidate, in a few weeks this should be a release in the hands of users.

The configuration
===

The other obvious problem, probably the biggest one, was restarting the master with the wrong configuration.

Again I think here there was an human error that was "helped" by a Redis non perfect mechanism.

Basically up to Redis 2.6 you had CONFIG SET to change the configuration by hand, so it was possible for example to switch the system from RDB to AOF for more data safety with just:

redis-cli CONFIG SET appendonly yes

However you had to change the configuration file manually in order to ensure that the change will affect the instance after the next restart. Otherwise the change is only in the current in memory configuration and a restart will bring you back to the old config.

Maybe this was not the case, but it is not unlikely that Twilio engineers modified the wrong redis.conf file or forgot to do it in some way.

Fortunately Redis 2.8 provides a better workflow for on-the-fly configuration changes, that is:

redis-cli CONFIG SET appendonly yes
redis-cli CONFIG REWRITE

Basically the config rewriting feature will make sure to change the currently used configuration file, in order to contain the configuration changes operated by CONFIG SET, which is definitely safer.

In the end
===

I'll be happy to work with the Twilio engineers in the next weeks in order to understand the details and their requests and see how Redis can be improved to make incidents like this less likely to happen.

A real world test
===

I just tried to setup a master with AOF enabled, rotating disks, and a huge write load. Only trick is, it is bare metal entry-level hardware.

Then I put a steady load on it of 70k writes per second across 10 millions of keys.

Finally I tried to mass-resync four slaves form scratch multiple times.

Results:

$ redis-cli -h 192.168.1.10 --latency-history
min: 0, max: 26, avg: 0.97 (1254 samples) -- 15.00 seconds range
min: 0, max: 5, avg: 0.66 (1287 samples) -- 15.00 seconds range
min: 0, max: 2, avg: 0.62 (1290 samples) -- 15.00 seconds range
min: 0, max: 1, avg: 0.47 (1307 samples) -- 15.01 seconds range
min: 0, max: 10, avg: 0.48 (1306 samples) -- 15.00 seconds range
min: 0, max: 1, avg: 0.47 (1310 samples) -- 15.01 seconds range
min: 0, max: 3, avg: 0.45 (1311 samples) -- 15.01 seconds range
min: 0, max: 10, avg: 0.48 (1305 samples) -- 15.01 seconds range
min: 0, max: 23, avg: 0.49 (1306 samples) -- 15.01 seconds range
min: 0, max: 3, avg: 0.47 (1307 samples) -- 15.01 seconds range
min: 0, max: 36, avg: 0.86 (1255 samples) -- 15.00 seconds range
min: 0, max: 6, avg: 1.05 (1246 samples) -- 15.01 seconds range
min: 0, max: 21, avg: 0.52 (619 samples)^C

As you can see there is no moment in which the server struggles with this load. During the test the load continued to be accepted at the rate of 70k writes/sec.

This test is in no way able to simulate the Twilio architecture, but the bottom line here is, Redis is supposed to handle this well with minimally capable hardware so something odd happened, or there was a low memory condition, or there was the "EC2 effect", that is, some very poor disk performance allowed for memory pressure. Comments

下载本文
显示全文
专题