Highly available Wowza cluster

Hi,

I’m gathering information on how to implement a highly available Wowza cluster.

We are currently using Wowza on-premises for both live video streaming and creating m3u8 playlists from existing video files.

How could we handle the loss of a Wowza node seamlessly, as we are behind an Application Load Balancer?

Maybe there are some temporary files that could be shared between the nodes through a common file system and picked up if one instance is lost?

Or is there any internal feature that could help us?

Thank you!

I don’t know of any communication mechanism among nodes in a cluster. Isn’t that what health checks are for?

Ideally, on AWS, the load balancer will identify a bad node and terminate it, and there will be a policy in place to replace it and keep the minimum number of nodes running. If you are building on premises, you have to build this yourself.

As for detecting a node crash, you can have a script on the instance that checks the availability of the service by pinging it over HTTP. If it detects a failure, it must notify something that can act on it.
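Just to sketch what such a watchdog could look like (Node 18+/TypeScript; the port, path and webhook endpoint are assumptions, so adjust them to your setup — a default WSE install answers HTTP on 1935):

```typescript
// Minimal watchdog sketch. All URLs below are placeholders.
const WSE_URL = "http://localhost:1935/";          // assumed: default WSE streaming port
const WEBHOOK_URL = "https://example.com/alert";   // hypothetical notification endpoint
const INTERVAL_MS = 10_000;

async function checkOnce(): Promise<boolean> {
  try {
    // Any HTTP response means the service is at least accepting connections.
    const res = await fetch(WSE_URL, { signal: AbortSignal.timeout(5_000) });
    return res.status < 500;
  } catch {
    return false;
  }
}

async function notifyFailure(): Promise<void> {
  // Replace with whatever your recovery system listens to (SNS, PagerDuty, ...).
  await fetch(WEBHOOK_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ host: process.env.HOSTNAME, status: "wowza-down" }),
  });
}

setInterval(async () => {
  if (!(await checkOnce())) {
    await notifyFailure().catch(() => { /* don't let the watchdog itself crash */ });
  }
}, INTERVAL_MS);
```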

For high availability, one crucial point is to create duplicate setups in different regions or locations. This is separate from the failure detection and recovery system; with both combined you get true high availability.


If your desired output is HTTP-based protocols only (HLS / MPEG-DASH) and nothing else, then I’d probably build a K8s cluster, put an Nginx Ingress in front of a WSE Deployment, and of course define a health check.

You can set up a single replica, which will make sure one Pod is running at any given time, but that will cause a few seconds of downtime if WSE crashes, because a new Pod must be launched. Alternatively, you can run 2 replicas and have a Primary/Backup configuration defined in Nginx, which basically functions as a reverse proxy.

You can also do this without K8s, of course: simply run 2 machines side by side with an Nginx proxy in front, but then the proxy itself becomes the SPOF.
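For what it’s worth, the Primary/Backup idea in Nginx (whether as an Ingress or as a standalone proxy) can be sketched roughly like this; hostnames and ports are placeholders, and Nginx only sends traffic to the `backup` server when the primary is considered down:

```nginx
# Hypothetical nginx.conf fragment: wse-1 is the primary, wse-2 only receives
# traffic when wse-1 is marked unavailable. Hostnames/ports are placeholders.
upstream wowza_backend {
    server wse-1.internal:1935 max_fails=2 fail_timeout=10s;
    server wse-2.internal:1935 backup;
}

server {
    listen 80;

    location / {
        proxy_pass http://wowza_backend;
        # Retry on the backup if the primary errors out or times out.
        proxy_next_upstream error timeout http_502 http_503 http_504;
    }
}
```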

Regardless of the approach, you will always have the challenge that there are two separate packagers. Both WSEs will create HLS chunks and/or MPEG-DASH segments independently, and there’s no guarantee that the chunks will be in sync between the two machines. That means you can’t switch seamlessly from Primary to Backup, as that would confuse the player. Even if the outputs stay in sync for a while, you should assume in your design that a time difference will eventually appear. You could test how long the sync lasts on average and then decide whether that’s an acceptable failover solution for your case.

If guaranteed sync is a requirement, then you may need to reconsider your choice of WSE, and you may also need to replace your encoders.


If it’s on AWS, then there’s no need for K8s, but you could still use the proxy model; or you can have a health check running in AWS that triggers a Lambda function to control how many EC2 instances with WSE should be running, and to point your LB at one of them as the Primary.
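As a rough sketch of that Lambda idea (not a drop-in solution): assuming something like a CloudWatch alarm invokes it when the primary fails its health check, it could swap the LB’s target group over to the standby instance with the AWS SDK. The target group ARN and instance IDs are placeholders.

```typescript
// Hypothetical failover Lambda (Node.js runtime, AWS SDK v3).
// Assumes the target group normally contains only the primary instance.
import {
  ElasticLoadBalancingV2Client,
  DeregisterTargetsCommand,
  RegisterTargetsCommand,
} from "@aws-sdk/client-elastic-load-balancing-v2";

const elb = new ElasticLoadBalancingV2Client({});

// Placeholders: use your own target group ARN and instance IDs.
const TARGET_GROUP_ARN = process.env.TARGET_GROUP_ARN!;
const PRIMARY_INSTANCE_ID = process.env.PRIMARY_INSTANCE_ID!;
const BACKUP_INSTANCE_ID = process.env.BACKUP_INSTANCE_ID!;

export const handler = async (): Promise<void> => {
  // Take the (presumed unhealthy) primary out of the target group...
  await elb.send(new DeregisterTargetsCommand({
    TargetGroupArn: TARGET_GROUP_ARN,
    Targets: [{ Id: PRIMARY_INSTANCE_ID }],
  }));

  // ...and point the LB at the backup WSE instead.
  await elb.send(new RegisterTargetsCommand({
    TargetGroupArn: TARGET_GROUP_ARN,
    Targets: [{ Id: BACKUP_INSTANCE_ID }],
  }));
};
```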

I’d probably never recommend an NLB/ALB with Round-Robin pointing directly to your WSEs for the sake of fail-over/high-availability.


It is indeed on AWS.

So if I have 2 Wowza instances on 2 separate EC2 instances behind an ALB or NLB, I’m streaming RTMP to one instance, and that instance crashes for some reason, there is no built-in way to rebalance to the second instance?

From the consumer’s POV, they will be watching an HLS stream; will they just stop seeing the stream?

Or, thanks to the ALB, will they just see a cut in the stream (during the rebalancing) and then start seeing it again?

Sorry for the late response. Remember that RTMP is a session-based protocol, so when you send your stream to one of the two EC2 instances and that instance crashes, your encoder will have to reconnect, even when you use an NLB between your encoder and the instances.

If you use an ALB or NLB for the output, when server #1 crashes, the LB will logically route HTTP requests to server #2, but there’s no guarantee that it will be in sync with #1, so the player may get an error. It’s up to the player - or how you have configured the player - to decide what happens in that case. Some players reconnect automatically, others will simply stop.

Best practice is probably to write some JavaScript against the player’s API so that the player automatically restarts when it stops. But then, if you’re live streaming an event, you’ll have to notify your front end when the event ends, i.e. when the stream stops, or your player may try to restart indefinitely.
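With hls.js, for example (just one possible player; the stream URL below is a placeholder), the usual pattern is to listen for fatal errors and either retry loading or recover the media element:

```typescript
import Hls from "hls.js";

// Placeholder URL; in a real setup this would point at your LB / proxy.
const SRC = "https://streaming.example.com/live/myStream/playlist.m3u8";
const video = document.querySelector("video") as HTMLVideoElement;

if (Hls.isSupported()) {
  const hls = new Hls();
  hls.loadSource(SRC);
  hls.attachMedia(video);

  hls.on(Hls.Events.ERROR, (_event, data) => {
    if (!data.fatal) return;            // non-fatal errors are handled internally
    switch (data.type) {
      case Hls.ErrorTypes.NETWORK_ERROR:
        hls.startLoad();                // e.g. origin failover: retry loading
        break;
      case Hls.ErrorTypes.MEDIA_ERROR:
        hls.recoverMediaError();        // e.g. decoder confusion after a switch
        break;
      default:
        hls.destroy();                  // unrecoverable: give up (or rebuild the player)
        break;
    }
  });
}
```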

Back to syncing server #1 and #2: there may be some way to sync if you can use an external clock (which would require customizing the packager in Wowza), or if you use an HTTP proxy with some intelligence between the LB and the EC2 instances, e.g. rewriting the chunk names and/or inserting EXT-X-DISCONTINUITY in the chunklist. That’s just thinking out loud though; it would require some research and experimenting.
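Purely to illustrate that last thought (untested, all names assumed): an intelligent proxy could fetch the chunklist from whichever origin is currently active and inject an EXT-X-DISCONTINUITY tag in front of the first segment produced after a failover, so the player resets its decoder instead of throwing an error.

```typescript
// Illustrative only: inject EXT-X-DISCONTINUITY into an HLS chunklist at the
// point where playback switched from server #1 to server #2. The active origin
// and the failover media-sequence number are assumptions the proxy would track.
const ACTIVE_ORIGIN = "http://wse-2.internal:1935";  // placeholder
const FAILOVER_SEQUENCE = 1042;                      // assumed: first sequence number served by the backup

export async function rewriteChunklist(path: string): Promise<string> {
  const res = await fetch(`${ACTIVE_ORIGIN}${path}`);
  const lines = (await res.text()).split("\n");

  const out: string[] = [];
  let seq: number | null = null;

  for (const line of lines) {
    if (line.startsWith("#EXT-X-MEDIA-SEQUENCE:")) {
      seq = parseInt(line.split(":")[1], 10);   // sequence number of the first segment below
    }
    if (seq !== null && line.trim() !== "" && !line.startsWith("#")) {
      if (seq === FAILOVER_SEQUENCE) {
        out.push("#EXT-X-DISCONTINUITY");       // tells the player a timeline break follows
      }
      seq += 1;                                 // each URI line advances the media sequence
    }
    out.push(line);
  }
  return out.join("\n");
}
```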