iPhone OpenGL ES performance tuning
I've been trying for quite a while now to optimize the framerate of my game without really making progress. I'm running the newest iPhone SDK and testing on an iPhone 3G with OS 3.1.2.
I issue around 150 draw calls, rendering about 1900 triangles in total (all objects are textured with two texture layers via multitexturing; most textures come from the same texture atlas, stored as a PVRTC 2bpp compressed texture). This renders on my phone at around 30 fps, which seems way too low for only 1900 triangles.
I've tried many things to optimize the performance, including batching the objects together, transforming the vertices on the CPU and rendering them in a single draw call. This yields 8 draw calls (as opposed to 150), but performance is about the same (fps actually drops to around 26).
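The batched path works roughly like this (a condensed sketch of my code; the object, buffer and function names here are just placeholders):

// Hypothetical sketch of the CPU-side batching: every object's vertices are
// transformed on the CPU and appended to one shared buffer, then the whole
// batch goes out in a single draw call.
int batchedCount = 0;
for (int i = 0; i < objectCount; ++i)
{
    const Object& obj = objects[i];
    for (int v = 0; v < obj.vertexCount; ++v)
    {
        // Applies obj's model matrix to position and normal on the CPU.
        batchedVertices[batchedCount++] = TransformVertex(obj.modelMatrix,
                                                          obj.vertices[v]);
    }
}
glDrawArrays(GL_TRIANGLES, 0, batchedCount);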
I'm using 32-byte vertices stored in an interleaved array (12 bytes position, 12 bytes normals, 8 bytes UV). I'm rendering triangle lists, and the vertices are ordered in tri-strip order.
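Concretely, the vertex layout and submission look roughly like this (a simplified sketch; the struct and variable names are placeholders):

// 32-byte interleaved vertex: 12 bytes position, 12 bytes normal, 8 bytes UV.
struct Vertex
{
    float pos[3];    // 12 bytes
    float normal[3]; // 12 bytes
    float uv[2];     //  8 bytes
};

// 'vertices' is the interleaved client-side array, 'indices' the
// triangle-list index buffer.
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_NORMAL_ARRAY);
glEnableClientState(GL_TEXTURE_COORD_ARRAY);

glVertexPointer(3, GL_FLOAT, sizeof(Vertex), &vertices[0].pos);
glNormalPointer(GL_FLOAT, sizeof(Vertex), &vertices[0].normal);
glTexCoordPointer(2, GL_FLOAT, sizeof(Vertex), &vertices[0].uv);

glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, indices);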
I did some profiling, but I don't really know how to interpret it.
Instruments (Sampler): sampling with Instruments yields this result: http://neo.cycovery.com/instruments_sampling.gif It tells me that a lot of time is spent in "mach_msg_trap". I googled it, and it seems this function is called to wait for something else. But wait for what?
Instruments (OpenGL ES): Instruments with the OpenGL ES module yields this result: http://neo.cycovery.com/intstruments_openglES_debug.gif but here I really have no idea what those numbers are telling me.
Shark: profiling with Shark didn't tell me much either: http://neo.cycovery.com/shark_profile_release.gif The largest entry is 10%, spent in DrawTriangles, and the whole rest is spread across functions with very small percentages.
Can anyone tell me what else I could do to figure out the bottleneck, and help me interpret this profiling information?
Thanks a lot!
You’re probably CPU-bound. The tiler/renderer utilization statistics in the OpenGL ES instrument show that the duty cycle of the GPU is between 20% and 30% while rendering at 20-30 fps, which suggests that the GPU could run at 60 fps if fed fast enough. It looks like there are a few things you could do to get more information out of Instruments and Shark about what to pursue:
By default, Sampler shows every sample from every thread, which means that mostly-idle helper threads created by system frameworks will dominate your view. To get a better idea of what the CPU is actually doing, make sure the Detail View is showing (third button from the left in the lower left corner) and change Sample Perspective to Running Sample Times to exclude samples where a thread is idle/blocked.
I don’t see any samples in the Shark trace from your app itself. That may well be because your code is fast enough that it doesn’t appear anywhere in the list of hot functions, but it might also be because Shark can’t find symbols for your application. You might need to configure the search paths in its preferences or manually point Shark at your app binary. Also, Shark defaults to showing a list of functions ordered by how much CPU time is spent in them. It may be useful to change the view to something more like a regular call tree, so you can visualize how your overall render loop spends its time. To do this, change the View option in the lower-right corner to “Tree (Top-Down).” (If you don’t see your app name or functions here either, then Shark is definitely missing your symbols.)
I am unfortunately not well versed in OpenGL, but here are some things that stand out to me from the three results:
1) From the Sampling instrument, it looks like you might have some kind of background web connection going?
2) The renderer utilization percentages seem low to me (though I don't know how to improve them).
3) Even though 10% seems low, DrawTriangles looks like a good attack point. However, it's almost equally suspicious that so much time is spent in memcpy. ValidateState is also a fairly large chunk and might be holding you back.
Tool-wise, I think you are using the right tools to examine performance; you just need to think more about what the results mean for your application.
Without the full source, it's difficult to tell exactly what's happening. The Instruments trace shows 20% Renderer Utilization, which is a bit low. This probably means you're CPU-bound. However, if that were the case I would expect to see more application-specific sample points in your first trace.
My advice is to roll your own timing class. Something like this (C++):
#include <sys/time.h>

class Timer
{
public:
    Timer()
    {
        gettimeofday(&m_time, NULL);
    }

    void Reset()
    {
        gettimeofday(&m_time, NULL);
    }

    // Returns time since construction or Reset() in microseconds.
    unsigned long GetTime() const
    {
        timeval now;
        gettimeofday(&now, NULL);
        unsigned long micros = (now.tv_sec - m_time.tv_sec) * 1000000 +
                               (now.tv_usec - m_time.tv_usec);
        return micros;
    }

protected:
    timeval m_time;
};
Time your sections of code to know exactly where your time is being spent.
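For example, you could wrap the suspect parts of your frame like this (a rough sketch; the two functions are placeholders for your actual CPU-side batching and GL submission):

// Hypothetical usage: measure CPU-side batching separately from GL submission.
Timer timer;
TransformAndBatchVertices();             // placeholder for your CPU work
unsigned long batchMicros = timer.GetTime();

timer.Reset();
SubmitDrawCalls();                       // placeholder for your GL calls
unsigned long drawMicros = timer.GetTime();

printf("batch: %lu us, draw: %lu us\n", batchMicros, drawMicros);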
Another quick fix is to disable the Thumb instruction set (in Xcode, uncheck "Compile for Thumb" in your target's build settings). This could improve your floating-point performance by 20% or more, at the expense of a larger executable.
If you are using glFlush or glFinish anywhere, remove those calls; presenting the renderbuffer at the end of the frame already submits the queued commands, so explicit flushes only add synchronization stalls.