More CPU cores give worse performance than fewer CPU cores

Hi,

I’m currently tuning the performance of Wowza 4.0.3 with the following hardware and test setup.

2*Intel® Xeon® CPU E5-2650 0 @ 2.60GHz (32 logical processors in total)

48G RAM

Only one vod file, 5Mbps, about 5 minutes, codec:H264, profile:High, level:3.1, frameSize:1280x720, displaySize:1280x720, frameRate:23.980000

2*10GbE NIC

Clients use HLS to access the VOD video file, half through NIC 1 and half through NIC 2, so the NIC is not the bottleneck.

The CPU frequency governor is set to “performance” with the command “for CPUFREQ in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do [ -f $CPUFREQ ] || continue; echo -n performance > $CPUFREQ; done”
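To double-check that the governor change took effect on every core, a small sketch like the following could be used. The function name count_governors and its optional root-directory argument are illustrative assumptions, not part of the original test setup.

```shell
#!/bin/sh
# Count how many CPUs use each cpufreq governor.
# The optional first argument (the sysfs CPU root) is only there to make
# the function easy to try against a test directory.
count_governors() {
    local root="${1:-/sys/devices/system/cpu}"
    local f gov
    for f in "$root"/cpu*/cpufreq/scaling_governor; do
        [ -f "$f" ] || continue          # skip entries without a governor file
        read -r gov < "$f"
        printf '%s\n' "$gov"
    done | sort | uniq -c | awk '{ print $2 "=" $1 }'
}
```

Calling count_governors with no argument reads the live sysfs tree and prints one "governor=count" line per governor in use; if the change applied everywhere, only the performance governor should be listed.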

We have tested the following cases:

1、When we use all 32 CPU cores, Wowza can only support about 1600 connections. If we add more connections, clients time out, that is, they cannot download a ts file within 10 seconds (each ts file is 10 seconds long). The symptom is the same as in “http://community.wowza.com/t/-/43086()-and-seek()-generate-performance-bottleneck”. VisualVM shows that DirectRandomAccessReader.read() and seek() use the most CPU time.

2、When we use only 16 CPU cores, Wowza can support about 1700 connections. When we add more connections, clients time out, but unlike case 1, VisualVM shows that nio.SocketIoProcessor$Worker.run() uses the most CPU time. We looked into the GC log and found that with 1700 connections, the objects surviving each Eden GC need about 50M-100M in the survivor area (per the “age 1” keyword in the GC log), but with 1800 connections, far more objects survive each Eden GC and they need 500M-2G in the survivor area.

3、When we use only 8 CPU cores, Wowza can support about 1900 connections. When we add more connections, clients time out, and the VisualVM output and GC log are similar to the 16-core case.

4、Based on the results of case 1 and case 3, we started 2 Wowza instances on this server, each using 8 CPU cores. We first opened 1800 connections to the first instance, which worked well. We then opened about 200 connections to the second instance, and the clients connected to the first instance started timing out. The VisualVM output and GC log are similar to case 2.
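For anyone who wants to pull the “age 1” survivor figures mentioned in case 2 out of a GC log, here is a minimal sketch. It assumes the tenuring lines written by -XX:+PrintTenuringDistribution have the usual HotSpot shape shown in the comment; the function name age1_mb is made up for illustration.

```shell
#!/bin/sh
# Print the "age 1" survivor size of every young collection, in MB.
# Assumes tenuring-distribution lines shaped like:
#   - age   1:   52428800 bytes,   52428800 total
age1_mb() {
    awk '$1 == "-" && $2 == "age" && $3 == "1:" {
        printf "%.1f\n", $4 / 1048576   # bytes -> MB
    }' "$1"
}
```

It would be run against the log named in the startup command, for example age1_mb /home/lid/cms_gc1935.log, which makes the jump between 1700 and 1800 connections easier to see as a column of numbers.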

Based on the test cases above, a few things look strange:

1、Why can 8 CPU cores handle more connections than 16 or 32 CPU cores? They have the same memory and the same configuration.

2、Why do far more objects survive each Eden GC once the number of connections reaches the limit? We are sure that clients create connections and fetch ts files evenly.

3、In case 4, why does the second Wowza instance affect the first one? They use different CPU cores.

When testing with a 1Mbps VOD file, Wowza can reach 4500 connections using 8 CPU cores, so it seems that the configuration is not the problem.

The following is one of our startup commands. We use the “taskset” command to restrict Wowza to a subset of CPU cores; for example, "taskset 0x0000FFFF java " restricts java to CPU cores 0 to 15, so it cannot use cores 16 to 31.

java -server -Xms20g -Xmx20g -XX:PermSize=512m -XX:MaxPermSize=512m -XX:NewSize=12g -XX:MaxNewSize=12g -XX:+UseParNewGC -XX:MaxTenuringThreshold=15 -XX:SurvivorRatio=3 -XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=32768 -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:+CMSClassUnloadingEnabled -XX:CMSInitiatingOccupancyFraction=20 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintHeapAtGC -XX:+PrintGCApplicationStoppedTime -XX:-OmitStackTraceInFastThrow -Xloggc:/home/lid/cms_gc1935.log -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.port=1099 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.wowza.wms.runmode=standalone -Dcom.wowza.wms.native.base=linux -Dcom.wowza.wms.AppHome=/usr/local/WowzaStreamingEngine -Dcom.wowza.wms.ConfigURL= -Dcom.wowza.wms.ConfigHome=/usr/local/WowzaStreamingEngine -cp /usr/local/WowzaStreamingEngine/bin/wms-bootstrap.jar com.wowza.wms.bootstrap.Bootstrap start
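As a side note on the taskset masks mentioned above: each bit of the mask stands for one logical CPU, so the mask for a contiguous range of cores can be computed rather than written by hand. The helper below is only a sketch; core_mask is a hypothetical name, and it assumes a shell with 64-bit arithmetic.

```shell
#!/bin/sh
# Print the taskset affinity mask covering logical CPUs FIRST..LAST (inclusive).
core_mask() {
    local first="$1" last="$2"
    # (last - first + 1) one-bits, shifted up to start at bit FIRST
    printf '0x%08X\n' $(( ((1 << (last - first + 1)) - 1) << first ))
}
```

For example, core_mask 0 15 gives the 0x0000FFFF mask used above, and core_mask 16 31 gives the mask for the other 16 cores. taskset also accepts a CPU list directly with its -c option (for example taskset -c 0-15), which avoids the hex mask altogether.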

We want to reach 20Gbps per server, and we believe we can get there if you can help us solve the problems above. Thanks.

If necessary, we can provide the GC log and Wowza’s access logs.

Try using Wowza in HTTP origin mode with an nginx front end; that should give you the best performance (only for HLS and HDS streaming…)

TIP: you need to tune nginx for caching ts files

Hi,

I assume you are running on Linux? What does the output of the top command show after pressing shift-i? Or what does htop show?

That will show the usage of each cpu core and thread.

You need to tune based on this guide and follow the section under “To tune your server based on the available CPU resources of your server” to account for the total number of CPU cores/threads in use.

Daren

Hi,

This is now being handled in a support ticket (96994).

Jason

Hi,

From what you are describing, it may be the network adaptors that are causing the bottleneck. It looks like the outgoing streams may be only using one of the adaptors, even though the requests are coming in on both. The numbers you are quoting support this.

5Mbps video file.

1600 x 5Mbps = 8Gbps

1700 x 5Mbps = 8.5Gbps

1900 x 5Mbps = 9.5Gbps

4500 x 1Mbps = 4.5Gbps
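The same arithmetic as a small helper, in case anyone wants to plug in other connection counts or bitrates. This is only a sketch: aggregate_gbps is an illustrative name, and it assumes decimal units (1Gbps = 1000Mbps) and ignores protocol overhead.

```shell
#!/bin/sh
# Aggregate outgoing bandwidth in Gbps for N connections at a given bitrate in Mbps.
aggregate_gbps() {
    awk -v conns="$1" -v mbps="$2" 'BEGIN { printf "%.1f\n", conns * mbps / 1000 }'
}
```

For instance, aggregate_gbps 1900 5 reproduces the 9.5Gbps figure, which is close to the line rate of a single 10GbE adaptor.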

This would normally be the case if you haven’t taken steps to utilize both adaptors for outgoing traffic. The best way is to use bonding which will split the load evenly between both adaptors.

There are different bonding modes. Some are for failover and some for load balancing. In this case, you need to use a load balancing mode.

Roger.

Hi,

If it isn’t your network configuration which is causing the issues then I am not sure. We don’t have any way of reliably generating that amount of traffic to be able to test properly.

I’m not trying to shift blame, but can you guarantee that the issue isn’t caused by the way you are generating the traffic, or by some other part of the network, rather than by the actual server you are testing?

What I don’t understand is why people insist on using a single huge machine for their streaming when multiple smaller machines would be more cost-effective and provide automatic redundancy. If your one 20Gb machine goes down, you lose 20Gb of traffic. If you have five 4Gb machines and one goes down, you only lose 4Gb of traffic.

An i7 or E3 based server with 16GB RAM and quad 1Gb NICs bonded will reach its network limit very easily. A pool of these and associated switches would be a lot more cost-effective than a single larger machine and 10Gb switches.

Roger.

Hi,

We do have customers that are using hardware similar to yours with single 10Gb NICs, but I’m not aware of anyone using multiple 10Gb NICs.

That being said, it may be possible but you will have to test different configurations and find one that works. As I mentioned already, we don’t currently have the resources to generate the amount of traffic to test this type of configuration.

I would suggest the following. You may have already done some of this.

Confirm that it isn’t the player side causing the issues. If generating test type connections, make sure the testing servers aren’t getting overloaded. Remember that real connections don’t all come from a handful of locations. Try to replicate real world situations.

Make sure your OS is tuned properly to work with the nics. Most OS’s aren’t specifically tuned to utilise these. Monitor the actual traffic at the nic level to make sure it isn’t something there that is causing the issues.

On the Wowza side, you may need to look at different garbage collectors. The alternative for the Oracle JVM is the G1 garbage collector. It does improve the pausing issues seen with the concurrent garbage collector. You may also need to look at a commercial JVM such as Azul Zing. If garbage collection pauses are the issue, then they say their JVM will work a lot better.

Also look at the thread pool sizes. You should monitor the server thread pools with visualVM to make sure there are not too many threads in monitor state. If so, increasing the thread pool and processor counts may help. The handler pools are used for internal processing and the transport and processor pools are used to handle the actual network side of the streaming.

There is already a ticket open for this, so if you need any further assistance or want someone to look at test results, please use this ticket.

Roger.

Hello

Just to elaborate some on the “small and cheap hardware” statement: we do support single instances and expect high quality from them. I think, in this case, Roger was suggesting that you may find some benefit in the redundancy achieved by multiple instances versus the all-or-nothing approach. As mentioned in ticket #96994, we would be interested in seeing results from a “live”-only test. It appears that your bottleneck may be due to the disk trying to seek more data than it can handle at a given time, which would be independent of the media server you choose to use.

Please let us know within the ticket any live performance benchmarks that you find by comparison to your VOD.

Thanks,

Matt

Thanks, Roger, but I am sure that the network adapters are not the bottleneck, because I used the command "sar -n DEV " to monitor the network traffic, and it shows that outgoing traffic is distributed across the two network adapters.
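As a cross-check independent of sar, the per-interface transmit counters under /sys/class/net can be sampled directly. The sketch below is only an illustration: the function names are made up, and the interface names passed in (for example eth0 and eth1) are assumptions about how the two 10GbE ports are named.

```shell
#!/bin/sh
# Convert two tx_bytes readings taken SECS seconds apart into Mbps.
tx_mbps() {
    awk -v b="$1" -v a="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (a - b) * 8 / s / 1000000 }'
}

# Sample the kernel's transmit byte counter for one interface over SECS seconds.
sample_tx_mbps() {
    local iface="$1" secs="${2:-1}" before after
    read -r before < "/sys/class/net/$iface/statistics/tx_bytes"
    sleep "$secs"
    read -r after < "/sys/class/net/$iface/statistics/tx_bytes"
    tx_mbps "$before" "$after" "$secs"
}
```

Running sample_tx_mbps once per interface during a test (for example sample_tx_mbps eth0 5 and sample_tx_mbps eth1 5) shows whether both adapters really carry a similar share of the outgoing traffic.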

After I started this thread, I ran another test case and found that, with a 1Mbps file and 16 CPU cores, Wowza can only support 2500 connections. That is strange.

Hello, Roger, are you still tracking this issue? We don’t have much time. Thanks.

Our goal is 20Gbps or more per server. We have a high-performance server with 32 CPU cores and we can add more NICs, but we really need your help to overcome the current limits of the software. Thanks.

Hi, Roger, our competitor’s streaming product can stream 30Gbps per server, which is why we need at least 20Gbps per server; in fact, we hope Wowza can support much more than 20Gbps. With higher streaming density, the operator needs less power and less rack space.

From your reply, I guess that Wowza was designed for “small and cheap hardware, acceptable but not excellent performance”. Is that right? Do you plan to provide a version of Wowza that supports higher streaming capacity (for example, 30Gbps) in the near future? And what is the maximum streaming capacity at present, and is there a suggested hardware configuration for reaching it?

Thanks.

I’m not sure whether you know that our company has recently ordered some Wowza licenses. If Wowza can reach a streaming capacity of 20Gbps-30Gbps per server, I think our company will need many more licenses.

Thanks.

Hello, I recently got a dedicated server with 2xE5-2640 CPUs.

I’m seeing much higher CPU load than on my previous server with 2xE5620. The old server runs at about 340% java CPU load, but the new one is almost double that and peaks at 1200%.

I have reviewed the tuning files but I didn’t find any solution.

My Wowza is the 3.6.3 perpetual version. Both servers are with the same hosting company; the only difference is the CPU. Both have 32GB RAM and a 1Gb NIC.

Can anyone guide me on how to solve this issue?