We have been profiling OpenSim to find and understand some of the limitations to scaling the number of avatars in a region/scene. When the number of active avatars increases, the TX bandwidth from the server grows in an n^2 curve as each client's updates are broadcast to all other clients. Recently, some patches were added to OpenSim trunk to aggregate multiple updates into a single packet for both avatar and object updates. Here we investigate the savings to bandwidth, packet count, and CPU utilization from the patch. The results presented here compare OpenSim r9115 prior to the patch with r9391 after the patch.
The following options were set from the opensim.ini.example with r9391. By default, these options are commented out.
[LLClient]
; Resend packets marked as reliable until they are received
ReliableIsImportant = false
; Configures how ObjectUpdates are compressed.
TerseUpdatesPerPacket=10
FullUpdatesPerPacket=14
TerseUpdateRate=10
FullUpdateRate=14
PacketMTU = 1400
The following results are based on running the event_04 workload against release builds of r9115 and r9391.
The comparison of r9115 vs r9391 (with 100 avatars moving) showed:
If the updates were aggregated 10 or more per packet, then we would expect the packet rate to be reduced by 90% to around 7600 pkts/s. We are still measuring about 3x the expected number of packets (22,863 pkts/s).
If we use packet size as a guide, then we would like to see ~1400 bytes per packet but so far have only measured an average size of 394 bytes. This substantiates that we could potentially reduce the packet count by an additional 2/3.
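As a rough cross-check of the two estimates above, the measured packet rate and average packet size can be combined. This is a back-of-envelope sketch only: it ignores per-packet header overhead and assumes the 394-byte average applies to the same traffic as the 22,863 pkts/s figure.

```latex
\begin{align*}
\text{measured vs. expected rate} &\approx \frac{22{,}863}{7{,}600} \approx 3.0 \\
\text{measured throughput} &\approx 22{,}863\ \text{pkts/s} \times 394\ \text{bytes} \approx 9.0\ \text{MB/s} \\
\text{packet rate at 1400 bytes per packet} &\approx \frac{9.0 \times 10^{6}\ \text{bytes/s}}{1400\ \text{bytes}} \approx 6{,}430\ \text{pkts/s} \\
\text{remaining reduction} &\approx 1 - \frac{394}{1400} \approx 72\%
\end{align*}
```

Under these assumptions, completely filling each packet would cut the packet count by a bit more than two thirds, which is in line with the estimate above.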
The results of the patch look very good, especially the reduction of CPU load by 30%. Almost all cycles are still consumed processing network data. The time spent in non-network code is less than 3% of total cycles with 100 avatars. The OpenSim network code is extremely processor intensive as we would not expect 1.5 cores to be consumed to generate only 9MB/s of network data. By comparison, a modern Linux kernel can output more than 3GB/s of reliable TCP packets using a single comparable core. If we could obtain that level of networking efficiency in OpenSim, we should expect to support at least 1000 and up to 5000 simple moving avatars per simulator.
Network processing is a critical bottleneck to avatar scaling in OpenSim.
In this experiment, we enabled the new packet pool code in OpenSim to measure any reduction in CPU utilization due to network processing.
We added the following lines to OpenSim.ini as shown in the OpenSim.ini.example file from version control. All other setup is identical to the experiment shown above (r9115 vs. r9391 packet aggregation).
[PacketPool]
; Enables the experimental packet pool. Yes, we've been here before.
RecyclePackets = true;
RecycleDataBlocks = true;
With a small number of avatars (fewer than 40), we measured a large reduction in CPU utilization due to network processing. When the number of avatars grew to 100, the CPU utilization due to network processing was actually higher with the packet pool feature enabled. Perhaps the maximum number of packets in the pool needs to be increased, or pool management becomes as costly as the actual processing of the packets. More investigation is needed, but the results with 40 or fewer avatars are very promising.
We took the profiling results from r9395 with the packet pooling and update aggregation and tested out some optimizations on the Event 04 workload with 10-100 avatars. These results are based on OpenSim r9449 with a patch to make it compatible with libomv r2714. There were some changes to packet header manipulation in libomv that we wanted to test out.
We added code to LLClientView.cs to allow us to control avatar terse update aggregation through opensim.ini options. In opensim.ini, we set the following options:
[LLClient]
; Resend packets marked as reliable until they are received
ReliableIsImportant = false
; Configures how ObjectUpdates are compressed.
AvatarTerseUpdatesPerPacket=20
AvatarTerseUpdateRate=50
TerseUpdatesPerPacket=20
FullUpdatesPerPacket=20
TerseUpdateRate=25
FullUpdateRate=25
PacketMTU = 1400
In PacketPool.cs, we increased the Packet pool size from 50 to 500 and removed the limit on the size of the DataBlock pool. With 100 clients each sending 5 updates per second to 100 clients, the server handles 50,000 ImprovedTerseObjectUpdatePacket ObjectDataBlocks per second. With the default pool of 500 blocks, almost all of them were allocated and deallocated and never made it to the pool.
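To put the pool size in proportion to the load, the figures above can be worked through as follows. This is a sketch that assumes one ObjectDataBlock per avatar update per recipient; the real count depends on how updates are batched.

```latex
\begin{align*}
\text{blocks per second} &= 100\ \text{senders} \times 5\ \tfrac{\text{updates}}{\text{s}} \times 100\ \text{recipients} = 50{,}000 \\
\text{share covered by a 500-block pool} &= \frac{500}{50{,}000} = 1\%\ \text{of one second of demand} \approx 10\ \text{ms}
\end{align*}
```

A pool that holds only about 10 ms worth of blocks can absorb very little of the churn, which fits the observation that most blocks bypassed the pool.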
In this workload, all packets are created or pulled from the Packet pool in the Heartbeat thread as the scene is updated with avatar movements. The packets are returned to the pool by the 100 individual Client threads. This means that returning a packet is subject to serious contention, while allocating packets sees no contention at all. Our profiling confirmed this: allocating packets to support 100 avatars took < 1% of the cycles, whereas returning them to the pool took up to 30% of the cycles. This looks like a contention effect between threads freeing packets. I changed both the packet and data block pools to C# Stacks, so now the only thing the client threads need to do is lock the stack, push, and return.
After this patch, ReturnPacket and ReturnDataBlock combined were reduced to 10-20% of cycles. I think it may be a more efficient implementation to have a PacketPool and DataBlock pool for each client thread which would grow and shrink over time as needed.
public static T GetDataBlock<T>() where T: new()
{
    lock (dataBlockPool)
    {
        Stack<Object> s;
        // The common case is that the data block type already exists in the pool
        if (dataBlockPool.TryGetValue(typeof(T), out s))
        {
            if (s.Count > 0)
            {
                return (T)s.Pop();
            }
        }
        // Add it now to avoid checking from every thread on every returned block
        else
        {
            dataBlockPool.Add(typeof(T), new Stack<Object>());
        }
        return new T();
    }
}

public static void ReturnDataBlock<T>(T block) where T: new()
{
    lock (dataBlockPool)
    {
        dataBlockPool[typeof(T)].Push(block);
    }
}
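The per-client pool idea mentioned above could look roughly like the following. This is a minimal sketch, not code from OpenSim or libomv: the class name `ClientBlockPool`, its `maxSize` cap, and the `Get`/`Return` methods are all hypothetical. It assumes each client owns one pool instance, so a returning client thread only ever competes with the producer for that one client's lock rather than with every other client thread.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: one small pool per client, guarded by its own lock.
// None of these names exist in OpenSim or libomv.
public sealed class ClientBlockPool<T> where T : class, new()
{
    private readonly Stack<T> free = new Stack<T>();
    private readonly object sync = new object();
    private readonly int maxSize; // assumed cap so the pool cannot grow without bound

    public ClientBlockPool(int maxSize)
    {
        this.maxSize = maxSize;
    }

    // Number of blocks currently held in the pool.
    public int Count
    {
        get { lock (sync) { return free.Count; } }
    }

    // Called by the producer (for example the Heartbeat thread)
    // when it builds an update destined for this client.
    public T Get()
    {
        lock (sync)
        {
            if (free.Count > 0)
            {
                return free.Pop();
            }
        }
        // Pool is empty: allocate outside the lock.
        return new T();
    }

    // Called by the owning client thread once the packet has been sent.
    public void Return(T block)
    {
        lock (sync)
        {
            // Blocks beyond the cap are dropped and left to the garbage collector.
            if (free.Count < maxSize)
            {
                free.Push(block);
            }
        }
    }
}
```

In use, each client would hold its own instance (for example `new ClientBlockPool<SomeBlockType>(64)`, where the type and the cap are placeholders), with the producer calling `Get()` and the client thread calling `Return()`. Whether this actually beats a single shared stack would need to be confirmed by profiling.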
Most noticeable is the reduction in CPU cycles from 250% to 150% at the 100 avatar part of the test.
Looking at the data point where 100 users are moving, we notice that in previous tests the Tx bandwidth from the server began to saturate at about 8MB/s, but we now see the bandwidth continue to increase at 100 avatars. The Tx packet rate is reduced by 60% and the average packet size has increased from an almost constant 380 bytes to about 1000 bytes.
For further optimizations, LLClientView.cs:ProcessOutPacket is consuming 35-50% of the cycles due to the cost of checking and manipulating packet headers, sequence numbers and flags. These virtual functions in OpenMetaverse.dll and OpenMetaverseTypes.dll cannot be inlined, but maybe we will find a way to make fewer of these calls.
The current Packet Pool implementation only pools 3 kinds of packets. I noticed that in some workloads, 80% of the packets are not coming from the pool. Further analysis may determine if more packet types should be included in the pooling.