TEST: Why does EC2 CPU capacity hit 100% on first viewer connect and then stabilize?

We are having some issues with users getting "livestream not found" errors.

(See this forum thread: "Live Wirecast to AWS streams fine to some but others get dropped every 5 seconds".)

We decided to give the small Wowza EC2 instance a thorough test using CloudWatch and the Cacti reporting tool. We found that with 10 devices connected, the average CPU capacity was usually at 15% in CloudWatch and around 3% in Cacti. We connected one device every 5 minutes or so and monitored the CPU capacity.

When we connected the stream using the Wirecast Flash-Main profile, the CPU monitor was at 10%. But when we connected the first viewer, the CPU capacity jumped and hovered at 100%. Why is that? What would cause that 100% CPU load on the first viewer connect?

Attached are the results.

http://www.stagedive.com/images/files/2011-06-08-AWS-Wowza-StreamingTest.xlsx

Try monitoring with JMX/JConsole for a more accurate view of what Wowza/Java is doing:

https://www.wowza.com/docs/quick-start-guides
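Remote JMX access is configured in [install-dir]/conf/Server.xml. As a minimal sketch (based on a default Wowza 2 Server.xml; the element names, ports, and file paths below are the usual defaults and should be verified against your own copy, and [ec2-public-dns] is just a placeholder):

<!-- conf/Server.xml (abridged); ports and paths are the common defaults, adjust to your install -->
<JMXRemoteConfiguration>
	<Enable>true</Enable>
	<IpAddress>localhost</IpAddress>
	<RMIServerHostName>[ec2-public-dns]</RMIServerHostName>
	<RMIConnectionPort>8084</RMIConnectionPort>
	<RMIRegistryPort>8085</RMIRegistryPort>
	<Authenticate>true</Authenticate>
	<PasswordFile>${com.wowza.wms.ConfigHome}/conf/jmxremote.password</PasswordFile>
	<AccessFile>${com.wowza.wms.ConfigHome}/conf/jmxremote.access</AccessFile>
	<SSLSecure>false</SSLSecure>
</JMXRemoteConfiguration>

With that enabled, point JConsole at a service URL along the lines of service:jmx:rmi://[ec2-public-dns]:8084/jndi/rmi://[ec2-public-dns]:8085/jmxrmi (substitute your own host and ports).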

Richard

I’m not sure. The EC2 small instance is, per the name, small. The pre-built AMIs are tuned. They should handle about 150 Mbps of output, or 300 concurrent clients each streaming at 500 Kbps (300 × 500 Kbps = 150 Mbps).

Richard

I would use something much larger than an m1.xlarge for this. You can run up to an m2.4xlarge with a DevPay or LicKey (license key) license.

https://www.wowza.com/docs/pre-built-amis-amazon-machine-images

And you can install Wowza on one of the EC2 super-duper quad-core or dual quad-core cluster instance types with a 10 Gbps NIC (roughly 5 Gbps of which is available to one instance of Wowza/Java, but that’s a lot):

http://aws.amazon.com/ec2/instance-types/

Cluster Compute Quadruple Extra Large Instance

  • 23 GB of memory
  • 33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core “Nehalem” architecture)
  • 1690 GB of instance storage
  • 64-bit platform
  • I/O Performance: Very High (10 Gigabit Ethernet)
  • API name: cc1.4xlarge

Cluster Compute Eight Extra Large Instance

  • 60.5 GB of memory
  • 88 EC2 Compute Units (2 x Intel Xeon E5-2670, eight-core “Sandy Bridge” architecture)
  • 3370 GB of instance storage
  • 64-bit platform
  • I/O Performance: Very High (10 Gigabit Ethernet)
  • API name: cc2.8xlarge

These instance types cost more, but for a short-duration live event that only runs a few hours, it’s worth it.

Richard

The 10 source streams might be cutting into your potential output. In any case, if you see problems consistently correlating with levels of use on a pre-built/tuned EC2 instance, you have reached its load limit. It is also possible that the m1.small is not as consistent in resource allocation; there has been some indication of that, though I am not certain.

The most cost-effective way to add capacity is to add more m1.small instances to a load-balanced cluster, using LiveRepeater for live streaming or MediaCache for VOD.

live:

https://www.wowza.com/docs/how-to-configure-a-live-stream-repeater

vod:

https://www.wowza.com/docs/how-to-configure-mediacache-for-wowza-ndvr

Load balancer for either of the above:

https://www.wowza.com/docs/how-to-get-dynamic-load-balancing-addon
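As a rough sketch of the LiveRepeater piece (the application names and hostnames below are just placeholder examples, and a real Application.xml contains many more elements than shown): the origin application uses a liverepeater-origin stream type, and each edge application uses liverepeater-edge with a Repeater/OriginURL that points back at the origin:

<!-- origin instance: conf/liveorigin/Application.xml (abridged; names are examples only) -->
<Streams>
	<StreamType>liverepeater-origin</StreamType>
</Streams>

<!-- edge instances: conf/liveedge/Application.xml (abridged; names are examples only) -->
<Streams>
	<StreamType>liverepeater-edge</StreamType>
</Streams>
<Repeater>
	<OriginURL>rtmp://origin.website.com/liveorigin</OriginURL>
</Repeater>

Encoders publish only to the origin application; players connect to the edges (via the load balancer).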

Richard

If you have an instance that is a problem, just get rid of it. It is not worth the time to figure out what went wrong.

Regarding your origin/edge architecture, this is what I do: configure the 24/7 instance with an origin application to publish to, one edge application to re-stream from the origin, and the Load Balancer Listener. Then you will have a cluster of one. Your clients will be redirected by the load balancer to the least-loaded server, which will be itself.

Next, configure a startup package that starts an instance as a LiveRepeater edge and a Load Balancer Sender that reports to your Load Balancer Listener. As soon as you start an instance with that startup package, if it is configured correctly, it will join the cluster. You can start 1 or 20. To wind down, you can use JConsole to pause a Load Balancer Sender so that clients are no longer redirected to it, then terminate the instance when its edge connection count drops to zero.

Richard

You’re welcome. One edit: you pause a Load Balancer Sender (not the Listener).

R

You can change the StreamType dynamically per RTMP client only. For a Flash RTMP client you can call “setStreamType” over the NetConnection:

// switch this connection's stream type on the server, then publish
netconnection.call("setStreamType", null, "liverepeater-origin-record");
var netstreamLive:NetStream = new NetStream(netconnection);
netstreamLive.publish("myStream");

For an RTMP encoder, you can do it server-side in the onConnect handler, using the client object.

// In a Wowza server-side module (a class that extends ModuleBase):
public void onConnect(IClient client, RequestFunction function, AMFDataList params)
{
	// Force this client's stream type before it starts publishing
	client.setStreamType("liverepeater-origin-record");
}

Richard

Richard – I am running an EC2 Wowza instance (the latest version 2 of Wowza) on a “small” EC2 instance. When we hit about 120-150 viewers, people start reporting issues with the video playback.

The video and audio combined average around 350 Kbps. We have about 10 video feeds running into the server at any given time, and people can connect to whichever one they like. We also have text chat modules running through the Wowza server for each of these. The streams are recorded to the content folder, and are also available on iOS devices as well as a private Roku channel.

Using Cacti, I can see that the bandwidth has never been close to 150 Mbps. Unfortunately, we’ve also been unable to get anywhere close to 300 concurrent clients either.

Is there anything off the top of your head, based on what I’ve stated, that seems out of place to you? Is pushing 10 source streams (all from Flash Media Live Encoder) to the server an issue? Is allowing iOS/Roku streaming an issue? Do you think the stalling/buffering/jittery video people are reporting is caused by slow disk I/O? Or is there just too much going on with the server?

Any pointers you could provide would be very helpful.

Richard - as always, I appreciate the fast response. I was looking at edge/origin earlier today, but I’m not sure it’s what I originally thought it would be. For instance, I have the small Wowza EC2 instance running 24/7 for a church group. They have a dozen congregations that push live streams to the server (there are a few hours over the weekend where many of them overlap, which is when we see the problems most). Each live stream is recorded to the content directory and is made available for VOD throughout the week (which is extremely low usage).

Leaving the server on 24/7 also allows any of the congregations to broadcast a live event (a Bible study, a wedding, or whatever) at any time. If I choose to go the edge/origin configuration route, I’m essentially going to have to run BOTH of those 24/7, correct? Since the incoming live streams would have to point to an “edge1.website.com” server or an “edge2.website.com” server (if I have multiple edges), it means that I cannot shut the edge servers off whenever I want (to save money). If I shut those off during the week, the congregations who are set for “edge1” or “edge2” webcasts will not be able to webcast until they notify me and ask me to launch an edge server again. Is that correct, or am I totally off base here?

One more note: I just looked at Cacti on my “problem” instance, and the Localhost - Memory Usage graph (under “system”) shows just a hair-line of “Free”, with most of the graph showing “Swap” memory usage. This doesn’t seem right to me. I launched another EC2 instance and saw pretty much the reverse (it was mostly “Free” memory usage with only a small bit listed as “Swap”). That small instance only has a gig of memory, so does this sound like a problem?

Also, back to the Cacti graphs: the “problem” instance never seems to run below 0.49 for the Localhost - Load Average, and this is with NOTHING happening (no streams going to or coming from the server). The “new” instance I launched showed an average of 0.08 for the load average, and that’s while I was pushing a 350 Kbps stream to it and also watching it in Flash as well as on an iPad.

There has to be something wrong with the “problem” instance, so I’m going to start the long process of downloading everything from it and uploading it to this new instance to see if things improve. The only real pain in doing this will be manually rebuilding all the Cacti user accounts and graphs. BLAH!

I’ll report my findings if things improve. I sure hope they do.

Let me know if my Edge/Origin assumptions are correct! Thanks!!

Brilliant!! Thanks a million!

R - You had one edit; I have one more question. :)

If I tell people to send streams to: rtmp://www.website.com/live

Can I just change the StreamType in the Application.xml of my “live” application from “live-record” to “liverepeater-origin-record” without them having to change the stream URL?
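To be clear, the change I’m describing is just the single StreamType element in conf/live/Application.xml, something like this (abridged sketch; the rest of the file stays as it is):

<!-- conf/live/Application.xml (abridged) -->
<Streams>
	<StreamType>liverepeater-origin-record</StreamType> <!-- was: live-record -->
</Streams>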

Hi Richard,

Thanks for the JConsole tip. Clearly a better method for checking CPU load status.

We retested and the CPU capacity on a small instance topped out at 30% but stayed closer to 5% with 10 concurrent viewers.

We did experience a sort of irregularity. On the first viewer connect, the CPU capacity jumped up to about 45%. Why does this tend to happen on the first connect? Is Wowza pushing variables to the heap, RAM, memory, etc.? Is the first connect instantiating something that causes the CPU capacity to jump?

Thanks!

Yeah. Thanks for the response.

My question really doesn’t have to do with the EC2 small instance, but rather with why we are getting the spike on any EC2 instance on the first viewer connect. It seems to happen each time, no matter the size of the instance. I’m wondering more about what Wowza is doing that creates that spike…

Thanks!

Was there ever any resolution to this thread?

We just concluded a live event with 2 origins, 2 edge-origins, 20 edge servers, and 1 load balancer (Wowza) server.

We were using the Transcoder (only on the origin) to split the streams into 4 different bitrates for different devices.

Here is the use case:

  • The event was a whole-day event (we were expecting only around 1,500 to 2,000 viewers)

  • The event had some breaks (coffee, lunch, etc.)

  • Every time a break ended and viewers came back to watch the live streams, the origin CPU would spike to 85-95% and then fall back to 50-55% after 5-7 minutes. During this time viewers would get a stuttering feed and frequent freezing.

We even thought of moving the origin instance from an m1.xlarge to something larger, but quite frankly we were not sure the problem wouldn’t recur there as well.

Any thoughts?