Friday, February 22, 2019

Redis Dead Redemption: Redis Cache Timeouts with Sitecore Azure PaaS


Our production CD environments (v9.0 Initial Release) began experiencing significant performance issues, to the point where the application kept restarting and our site would hang for several minutes at a time.  When the site finally came to a few minutes later, it'd go right back down.

With the site having been stable for months, it was odd that this issue surfaced out of the blue.

Our logs revealed the following error (one of many):

What is Redis? 

In a nutshell, Azure Redis Cache is the default, out-of-the-box (OOB) session provider for Sitecore 9.0+ PaaS instances.
I think George Tucker's StackOverflow answer summed it up best:

Sitecore uses this session provider as a means of managing out of process (ie. distributed) session state to share contact data across browser sessions and devices. This is required to support base functionality in Sitecore XP (Analytics, XDB etc). Even if you are using a custom session provider for other purposes, you will likely still need the Sitecore sessions configured to get full value out of the solution.  
Redis is designed as a means for fast, distributed access to a key/value store. If you're digging into the redis instance directly, you're likely to see all of the keys of your current sessions (private and shared) with non-readable values (usually stored as binary vals). These values are usually managed and accessed via the Sitecore Session API abstractions. It's rare you would need to access Redis directly.

Initial Troubleshooting

We tried a few things to re-stabilize the environment:
  1. Standard procedures to restart applications: generally no effect
  2. Ran 'flushdb' command against Redis Console: errors cleared initially, but the issue resurfaced soon after.
  3. Reboot Redis: Site restores, goes down soon after
  4. Scaled up Redis: Site restores - crashes soon after with the same Redis errors
  5. Applied Support fix outlined in KB464570: Redis driver timeout issues to both CM and CD environments.
With none of these providing a lasting fix, we decided to take it up with the Sitecore Support team.

Sitecore Support's Take

Based on the reported errors, Sitecore Support identified the following:
  • The "in: 7923" parameter means that there is data in the Redis queue that still needs to be processed.
  • The "WORKER: (Busy=85,Free=32682,Min=150,Max=32767)" value shows that there is some pressure on the Redis side (even if it is not visible, as the CPU is not stressed).
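If you want to pull these two values out of your own logs, here is a minimal Python sketch. It assumes the fields appear exactly in the form quoted above; the sample string is stitched together from those two quoted values only (the full log line isn't reproduced here), and the function name is my own.

```python
import re

def parse_timeout_diagnostics(message):
    """Extract the 'in' count and the WORKER counters from a Redis timeout message."""
    result = {}
    # "in: 7923" -> data still waiting to be processed
    in_match = re.search(r"\bin: (\d+)", message)
    if in_match:
        result["in"] = int(in_match.group(1))
    # "WORKER: (Busy=85,Free=32682,Min=150,Max=32767)" -> worker thread counters
    worker_match = re.search(
        r"WORKER: \(Busy=(\d+),Free=(\d+),Min=(\d+),Max=(\d+)\)", message
    )
    if worker_match:
        busy, free, minimum, maximum = map(int, worker_match.groups())
        result["worker"] = {"busy": busy, "free": free, "min": minimum, "max": maximum}
    return result

# Assumption: a fragment built from the two quoted values, not a complete log line.
sample = "in: 7923, WORKER: (Busy=85,Free=32682,Min=150,Max=32767)"

if __name__ == "__main__":
    print(parse_timeout_diagnostics(sample))
```

Tracking the "in" value over time is one way to see whether the backlog is growing or draining after a change.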

You can read more about what your specific Redis errors may mean in this handy Microsoft blog post.

Support sanctioned our use of KB464570: Redis driver timeout issues and also recommended applying solutions outlined in a separate KB article: KB858026: Excessive load on ASP.NET Session State store.

Additional suggestions included:
- Scale up the Redis instance.  "You may need extra resources to process all the items ("in: 7923")."
- Increase the timeout until you can see the application running.

Among the several solutions offered in KB858026: Excessive load on ASP.NET Session State store (mostly configuration changes to the Web.config and Sitecore.Analytics.Tracking.config), one was to split the Sitecore.Sessions databases into dedicated instances: one for private and another for shared.



Splitting the two databases meant creating a new Redis instance in Azure and using different connection strings for the Shared and Private session definitions.

The documentation articles for adjusting the configuration can be found in two separate documents:

Diagnosis

Since the site was stable for months before this began occurring, Support concluded that the most likely cause was that the amount of data in the Redis queue slowly built up until the CDs could no longer keep up within the timeouts.
Splitting the sessions into Private and Shared should allow more control over which Redis instance needs more or less processing power, and avoid obscure locks.  Also, increasing the timeout values gives the queue more time to process.

The Split

After creating a new Redis instance in Azure:

1) Create a new redis.sessions.shared entry in ConnectionStrings.config under the existing redis.sessions entry, pointing at the new Redis instance.
2) Leave the original (Private) <sessionState> Redis provider as-is in the Web.config.
3) In the Shared session definition, change the connectionString parameter from redis.sessions to redis.sessions.shared.
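
As a quick sanity check on the split, a small script can confirm that both entries exist and point at different Redis endpoints. This is only a sketch: the host names and passwords below are placeholders, the function names are hypothetical, and it assumes the usual host:port-first connection string layout.

```python
import xml.etree.ElementTree as ET

# Assumption: placeholder entries standing in for a real ConnectionStrings.config.
SAMPLE_CONFIG = """
<connectionStrings>
  <add name="redis.sessions" connectionString="private-example.redis.cache.windows.net:6380,password=PLACEHOLDER,ssl=True,abortConnect=False" />
  <add name="redis.sessions.shared" connectionString="shared-example.redis.cache.windows.net:6380,password=PLACEHOLDER,ssl=True,abortConnect=False" />
</connectionStrings>
"""

def redis_hosts(config_xml):
    """Map each redis.sessions* entry name to the host:port part of its connection string."""
    root = ET.fromstring(config_xml.strip())
    hosts = {}
    for entry in root.findall("add"):
        name = entry.get("name", "")
        if name.startswith("redis.sessions"):
            # The endpoint is the first comma-separated token.
            hosts[name] = entry.get("connectionString", "").split(",")[0].strip()
    return hosts

def sessions_are_split(config_xml):
    """True when the private and shared entries both exist and use different endpoints."""
    hosts = redis_hosts(config_xml)
    private = hosts.get("redis.sessions")
    shared = hosts.get("redis.sessions.shared")
    return bool(private and shared and private != shared)

if __name__ == "__main__":
    print(redis_hosts(SAMPLE_CONFIG))
    print(sessions_are_split(SAMPLE_CONFIG))
```

The script only looks at connection strings. It does not verify that the Shared session definition actually references redis.sessions.shared, so step 3 still needs a manual look.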

A Note on Timeout Values

Regarding the timeout parameters (connectionTimeoutInMilliseconds, operationTimeoutInMilliseconds, retryTimeoutInMilliseconds), there is no exact value to set.  Like Sitecore caches, they must be tuned until your instances do not experience timeouts.

Sitecore's rule of thumb here is to:
- double the timeout until the environment stabilizes
- set a high value (30 minutes) and then adjust it if it needs further tuning
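
To make the doubling rule concrete, here is a rough Python sketch that lists the values you would step through, using the 30-minute figure above as the ceiling. The 5000 ms starting point is purely an assumption for illustration; start from whatever your config currently uses.

```python
THIRTY_MINUTES_MS = 30 * 60 * 1000

def doubling_schedule(start_ms, cap_ms=THIRTY_MINUTES_MS):
    """Return the timeouts to try, doubling from start_ms and ending at cap_ms."""
    if start_ms <= 0:
        raise ValueError("start_ms must be positive")
    values = []
    current = start_ms
    while current < cap_ms:
        values.append(current)
        current *= 2
    # Finish on the ceiling itself rather than overshooting it.
    values.append(cap_ms)
    return values

if __name__ == "__main__":
    # Assumption: 5000 ms is a hypothetical starting value.
    for timeout in doubling_schedule(5000):
        print(timeout)
```

Each step is a value to deploy and observe for a while before moving to the next one, not something to apply all at once.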

You can find out what each timeout is designed for here: Redis Provider Settings 


After making these changes, the site appears to have stabilized. Of course, we plan to keep monitoring the logs for Redis errors, but for now, it seems we're in the clear.

Happy Redis-ing!


5 comments:

  1. Thank you so much for sharing this great article. We are experiencing the same issue and did all kinds of tuning but still no luck. We will definitely try this approach, and please keep sharing any updates.

    1. It certainly took us for a ride. So far so good, though! I'll definitely update the post if anything changes or I discover anything new.

  2. Thanks for the information Gabe, very helpful. I'm curious what Redis Azure tier you ended up on for both instances? Could you share?

    1. Currently scaled to C2 for both Redis instances. We're looking to scale down to C1 if possible for both after giving it some time to ensure the issue doesn't return. For now, stable on C2 😊👍

  3. Did you apply KB464570? If so, did you take the values in the supplied config or did you adjust them for your environment? If adjusted, did you have guidance on how to calculate for your environment or was it trial and error?