#573 posted by mh on 2016/06/29 23:30:30
You should probably also define "lots".
The animated lightstyle code in stock GLQuake and derivatives, even with my fix, just doesn't scale. All that my fix does is batch up some uploads and provide the right hints to the GPU that it doesn't need to do a format conversion, but it's still possible to construct scenes that run in the 20/30/40 milliseconds per frame range.
The real fix is to move lightstyle animation entirely to the GPU. This is actually quite simple (you can do it in GL 1.1 even) and you end up trading off texture uploads (and pipeline stalls) versus extra blend passes and draw calls (and increasing the video RAM requirements for lightmaps). The devil is in the dynamic lights, but you can do these with more blend passes and attenuation maps (if you don't have shaders), at the expense (I think) of not being able to take the surface normal into account.
The whole thing becomes significantly simpler (not to mention much faster) if you're prepared to say "screw GL 1.1" and require shaders.
End result either way is that the frame rate is levelled: scenes with lots of animated lightstyles run at much the same speed as scenes with none, and lightstyle interpolation becomes possible. Dynamic lights run roughly the same as the old code, but with much higher quality, and proper dynamic lighting of large MDLs and BSP models becomes possible.
#574 posted by Spike on 2016/06/30 01:25:13
dynamic lights then need to become (shadowless) realtime lights, otherwise you're stuck with just flashblends/coronas.
still, if you're willing to ditch dynamic lights and non-glsl to optimise light styles, you can create a non-lit pathway through your code for almost no cost at all.
throw in texture arrays and remove the whole view-frustum-recalculation-every-frame thing, and your bsp rendering drops to almost the cost of just a pointcontents and a glMultiDrawIndirect call.
this is what I did with my 'onedraw' engine, its performance humiliates FTE on most/nearly all maps, but is also far more limited (being from-scratch doesn't help), with no dlights, no lits, no skyboxes, etc.
alternatively, you can move the lightmap updates onto a different thread - fte's 'r_dynamic -1' setting does this.
Unfortunately GL still basically requires the GL api to all be done on a single thread (as does d3d9), though presumably this could be accelerated a little with pbos.
#575 posted by mh on 2016/06/30 09:50:37
Shadowless real time is what I mean, yes.
As for lightstyles, when you break down the R_BuildLightmap code it clearly just becomes: texture * lightmap0 * style0 + texture * lightmap1 * style1 + texture * lightmap2 * style2 + texture * lightmap3 * style3. That's trivially easy to express in GLSL with or without additive blend passes, and not much more difficult to express in fixed pipeline GL.
End result is animated lightstyles without new texture uploads.
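A minimal CPU-side sketch of the sum above, for one colour channel of one texel. The function name, the float inputs and the `MAX_STYLES` constant are assumptions for illustration, not engine code; a GLSL or blend-pass version would evaluate the same expression per fragment, and overbright scaling and clamping are left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_STYLES 4   /* the formula above has four lightmap/style terms */

/* Combine one colour channel of one texel.
   texture and lightmap samples are assumed to be in [0,1];
   style[i] is the current animated scale of lightstyle i. */
float combine_styles(float texture,
                     const float lightmap[MAX_STYLES],
                     const float style[MAX_STYLES],
                     size_t num_styles)
{
    float sum = 0.0f;

    assert(num_styles <= MAX_STYLES);
    for (size_t i = 0; i < num_styles; i++)
        sum += texture * lightmap[i] * style[i];   /* one term per style */
    return sum;
}
```

Only the `style` values change from frame to frame, which is why nothing has to be re-uploaded: the lightmap samples stay as they were when the map loaded.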
Lightmap Updates (1)
#576 posted by mh on 2016/06/30 16:34:17
I'm going to rabbit on about this for a while cos it's something that I've done a lot of research on, and the end results were a little counter-intuitive. It's also useful to provide the background and reasoning for what I'm advocating; it demonstrates that I have done the research and that I have the figures to prove it.
So, I'd always been aware that lightmap updates were a bottleneck in GLQuake, and it really hit bad on some RMQ (and other) maps. There are basically two steps to a lightmap update: R_BuildLightmap and glTexSubImage2D.
Contrary to what I expected, when I benchmarked I found that R_BuildLightmap was actually not the bottleneck. You can comment out the calls to R_BuildLightmap so that it still does the glTexSubImage upload, and you get the same bad performance. You can comment out the calls to glTexSubImage but leave R_BuildLightmap in and it runs fast.
With glTexSubImage there are 3 factors that influence performance: size of data being uploaded, whether or not the driver needs to do a format conversion, and whether or not the driver needs to stall the pipeline.
Lightmap Updates (2)
#577 posted by mh on 2016/06/30 16:34:39
For typical Quake lightmap usage, data size is hugely unimportant. Unless you consider extreme cases (e.g. update nothing versus update everything) data size just doesn't matter.
Format conversions can kill performance, and the stock FitzQuake code will always need to do a format conversion.
Stock FitzQuake requests lightmaps in GL_RGB format on the GPU, but (unless you're using very weird hardware) there's no such thing as a 24-bit GPU texture format. It's powers of 2 all the way: 8-bit, 16-bit, 32-bit, etc. GL_RGB is also an unsized format, so the driver is free to give you 5/6/5, 10/10/10/2, 10/11/11, 8/8/8/8 or whatever. What you most likely will get is 32-bit, 8/8/8/8 with the last 8 unused, but you're not actually 100% guaranteed that unless you ask for it (i.e. GL_RGB8).
At this point some people will kick and scream about "wasting memory". I loathe that term; it's not "wasting" it, it's using it. There may be nothing stored in the memory, it may never be read from or written to, but the extra byte per texel is being used to increase performance. This is the very same principle as cache-aligning data, aligning allocations on page boundaries, aligning (or padding) to 16-bytes for SIMD, etc etc etc on the CPU. Why are people so fucking surprised that similar principles apply on GPUs? Get over it.
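To put a number on the "extra byte per texel", here is a rough sketch of the arithmetic. The helper names are made up for illustration, and the 128x128 lightmap size used in the note below is an assumption about a typical GLQuake-style atlas rather than something stated in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of storage for one lightmap texture at a given texel size. */
size_t lightmap_bytes(size_t width, size_t height, size_t bytes_per_texel)
{
    assert(bytes_per_texel > 0);
    return width * height * bytes_per_texel;
}

/* Extra bytes spent by padding 24-bit RGB texels out to 32 bits. */
size_t padding_cost(size_t width, size_t height)
{
    return lightmap_bytes(width, height, 4) - lightmap_bytes(width, height, 3);
}
```

For a 128x128 lightmap that comes to 16384 extra bytes (16 KB) on top of 48 KB of colour data.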
Lightmap Updates (3)
#578 posted by mh on 2016/06/30 16:34:57
So stock FitzQuake code has (most likely) a 32-bit texture on the GPU, but supplies 24-bit data on the CPU. That's a format conversion, right there, and you're totally at the mercy of the driver for how efficient (or not) it is.
NVIDIA always seemed to take the fastest path for the type of conversion needed, so a simple conversion might take 1ms but the most complex might take 40ms.
AMD/ATI always seemed to take a similar path irrespective, so conversion times might bob around 10ms to 20ms.
Intel was batshit insane. I can't remember the times but they were off the chart. The only explanation I could think of was that the driver downloaded the entire texture from GPU to CPU, did the conversion, then re-uploaded the entire texture back to the GPU. I don't know if that's actually true or not.
Lightmap Updates (4)
#579 posted by mh on 2016/06/30 16:38:55
Stock FitzQuake also did a glTexSubImage upload per-surface rather than per-lightmap. With a typical ID1-sized map, that could be several hundred uploads instead of maybe a maximum of 10. And each one of those uploads is a pipeline stall, so it totally breaks CPU/GPU asynchronous operation. Instead of the CPU handing a bunch of work off to the GPU and being able to continue itself, it instead handed a small bit of work off, waited for it to finish, handed another small bit off, waited, etc, several hundred times per frame.
That's the same codepath that stock GLQuake with R_DrawSequentialPoly took, and the same performance characteristics are observable there. But stock GLQuake didn't have GL_RGB lightmaps, so it didn't do any format conversions; stock FitzQuake just got the worst of every possible world.
This wasn't such a huge problem in the old 3DFX days because the graphics pipeline (or at least the part of it that was implemented by the GPU) wasn't as deep back then. Only per-fragment and framebuffer ops happened on the GPU, so the effects of a stall were significantly reduced. On modern hardware it's catastrophic. It's the equivalent of putting a glFinish call after drawing each msurface_t.
Lightmap Updates (5)
#580 posted by mh on 2016/06/30 16:39:21
Stock GLQuake with multitexture disabled batches up glTexSubImage calls so that they only occur once per lightmap rather than once per surface. It also doesn't do any format conversions.
Stock FitzQuake with multitexture disabled also batches up glTexSubImage calls so that they only occur once per lightmap rather than once per surface. It still does format conversions, but the impact is reduced. It's still horrible on Intel though.
That's why "disable multitexture" was the standard advice when using stock FitzQuake on maps with lots of lightstyle animations. It wasn't that the single-texture path was any faster in the general sense, it was that lightmap uploads were implemented in a more performant manner.
But it's possible to write a multitexture path that also batch-uploads all lightmaps before drawing anything, and in that case disabling multitexture will just make everything slower.
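The batching idea can be sketched roughly like this: each modified surface only grows a dirty region on its lightmap, and the upload happens at most once per lightmap before anything is drawn. All names here (`dirty_rows_t`, `mark_dirty`, `flush_lightmaps`, `upload_fn`) are hypothetical, and the callback merely stands in for a glTexSubImage2D call; this is not the code from any of the engines mentioned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-lightmap dirty region, tracked as a range of rows. */
typedef struct {
    bool dirty;
    int  top, bottom;          /* half-open row range [top, bottom) */
} dirty_rows_t;

/* Called once per modified surface: grow the region, upload nothing. */
void mark_dirty(dirty_rows_t *d, int y, int height)
{
    assert(d != NULL && height > 0);
    if (!d->dirty) {
        d->dirty  = true;
        d->top    = y;
        d->bottom = y + height;
        return;
    }
    if (y < d->top)             d->top    = y;
    if (y + height > d->bottom) d->bottom = y + height;
}

/* Stand-in for one glTexSubImage2D call covering 'rows' rows from 'top'. */
typedef void (*upload_fn)(int lightmap, int top, int rows);

/* Called once per frame before drawing: at most one upload per lightmap.
   Returns the number of uploads issued. */
int flush_lightmaps(dirty_rows_t *maps, int count, upload_fn upload)
{
    int uploads = 0;

    for (int i = 0; i < count; i++) {
        if (!maps[i].dirty)
            continue;
        if (upload != NULL)
            upload(i, maps[i].top, maps[i].bottom - maps[i].top);
        maps[i].dirty = false;
        uploads++;
    }
    return uploads;
}
```

With this shape, three hundred modified surfaces spread over ten lightmaps cost ten uploads rather than three hundred, at the price of sometimes re-sending rows that did not change.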
Lightmap Updates (6, And Last)
#581 posted by mh on 2016/06/30 16:39:44
So, current QuakeSpasm fixes all of this with the exception that it uses GL_RGBA/GL_UNSIGNED_BYTE on the CPU. That will still trigger a format conversion on (?some?) Intel drivers. On those drivers, the only combination that doesn't trigger a format conversion is GL_BGRA/GL_UNSIGNED_INT_8_8_8_8_REV. I believe that on D3D10+ class hardware it's not a problem, however.
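For reference, this is roughly what building data for the GL_BGRA/GL_UNSIGNED_INT_8_8_8_8_REV combination looks like on the CPU side. With that format and type each texel is one 32-bit word with blue in the lowest byte, then green, then red, with alpha on top. The function names are hypothetical, and filling the unused alpha byte with 255 is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack one texel as a 32-bit word laid out 0xAARRGGBB, which is what
   GL_BGRA with GL_UNSIGNED_INT_8_8_8_8_REV reads. Alpha is unused here. */
uint32_t pack_bgra8(uint8_t r, uint8_t g, uint8_t b)
{
    return 0xFF000000u | ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}

/* Expand tightly packed 24-bit RGB texels into 32-bit words. */
void expand_rgb_to_bgra(const uint8_t *rgb, uint32_t *out, size_t texels)
{
    assert(rgb != NULL && out != NULL);
    for (size_t i = 0; i < texels; i++)
        out[i] = pack_bgra8(rgb[3 * i], rgb[3 * i + 1], rgb[3 * i + 2]);
}
```

In practice an engine would write the packed words directly when building the lightmap rather than expanding afterwards; the second function is only there to show the 24-bit to 32-bit step.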
None of this scales. It runs fine on ID1 maps, it runs fine on medium-sized maps, it runs fine on small maps with many light updates, it runs fine on large maps with few light updates. It will still excessively slow down on large maps with many light updates. Again, the number of lightmaps grows, the number of glTexSubImage calls increases, factor in brush models and add lots of them, and you're back to hundreds of uploads per frame and all the herkie-jerkies that come from that.
Brute-forcing it on the CPU isn't the answer. With that kind of surface count you'll probably get something worthwhile by threading R_BuildLightmap (which will be called so many times as to make this worthwhile) but you'll still hit a single-thread ceiling with your glTexSubImage calls.
You need to solve it more creatively, and moving the updates entirely to the GPU is the answer. You never need to check if a lightstyle is modified, you never need to run R_BuildLightmap, you never need to call glTexSubImage2D.
And here's something else where parts of the Quake community need to pull a collective stick out of their collective arse. The best-case will run a little slower with this kind of set up. So instead of getting 4000 FPS in DM3 you might get 2500. That doesn't matter a bollocks. You don't optimize the best case, you optimize the worst. Sacrificing a tiny piece of performance (0.15ms) in the best case in exchange for a huge performance increase in the worst (to the extent that there can potentially be no difference between them) is the right thing.
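The 0.15ms figure falls straight out of converting frame rates to frame times, which is also why FPS is a misleading way to compare costs. A small sketch (function names are just for illustration):

```c
#include <assert.h>

/* Time spent on one frame, in milliseconds, at a given frame rate. */
double frame_ms(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}

/* Extra milliseconds per frame when dropping from fps_before to fps_after. */
double frame_cost_ms(double fps_before, double fps_after)
{
    return frame_ms(fps_after) - frame_ms(fps_before);
}
```

Going from 4000 FPS to 2500 FPS is 0.25ms versus 0.4ms per frame, a cost of about 0.15ms. By the same arithmetic, dropping from 60 FPS to 59 FPS costs about 0.28ms, nearly twice as much.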
MH
#582 posted by mankrip on 2016/06/30 18:40:17
At this point some people will kick and scream about "wasting memory". I loathe that term; it's not "wasting" it, it's using it. There may be nothing stored in the memory, it may never be read from or written to, but the extra byte per texel is being used to increase performance. This is the very same principle as cache-aligning data, aligning allocations on page boundaries, aligning (or padding) to 16-bytes for SIMD, etc etc etc on the CPU.
[...]
The best-case will run a little slower with this kind of set up. So instead of getting 4000 FPS in DM3 you might get 2500. That doesn't matter a bollocks. You don't optimize the best case, you optimize the worst. Sacrificing a tiny piece of performance (0.15ms) in the best case in exchange for a huge performance increase in the worst (to the extent that there can potentially be no difference between them) is the right thing.
I wholeheartedly agree.
_o_ Here's a hug.
#583 posted by Izhido on 2016/06/30 22:05:14
Also, I'd love to see any hardware, ever, where you can run at 2500 FPS. Optimizing is good; obsessing over it is not.
Thanks Mh
#584 posted by Preach on 2016/06/30 22:08:45
Nice writeup, it all seems to make sense even though I don't have much to do with rendering stuff. For reference, it was fitzquake085 on the last map of The Five Rivers Land. Big wide arena mostly covered with torchlight and often a half-dozen dynamic-lit projectiles crossing it. Stuttering went away just by facing outwards, so I thought it must be a rendering thing. Quakespasm copes with it without the adverse reaction.
The Tragedy Of The Commons Takes Many Forms
#585 posted by Baker on 2016/06/30 22:13:36
Sometimes some newbie shows up and asks why "the Quake community" doesn't do X, Y, or Z like [insert other game, for example "like the Doom community" or "Quake 3"].
But there is no such thing as "the Quake community". Just different individuals of different interests doing stuff they want to do for free.
It's not like someone is demanding Barnak do this or that some Steam user must do this.
No --- because no one makes demands of freeloaders and newbies and insists they must do things.
#586 posted by Baker on 2016/06/30 22:19:40
Ah, I clicked submit instead of preview.
What I was trying to say is that even a well-intentioned, highly skilled engine coder's diatribe about how things should work is still basically saying that someone else has to spend their time a certain way for free.
That point of view has merit.
And it is Tragedy of the Commons because such demands are only directed at the "top producers".
No one makes demands of noobs.
@mh -- I Still Haven't Communicated Perfectly ...
#587 posted by Baker on 2016/06/30 22:38:16
In a commercial project that would be sold and generates sales, you'd be absolutely right.
In a free project that is a pastime, I don't know if you are right.
And one of the most important rules for a pastime is knowing it is a pastime.
And the Tragedy of the Commons very much applies: you made an exit back in 2013 for this reason. I was already making an exit, and once I saw you declare yours, I left too. It was too much of an out-of-control zoo. 18 months later I'd come back -- sort of -- with firm boundaries in my head, and I'd communicate clearly the limits of what I was willing to (not) do and when I was willing to (not) do it.
@Izhido
#588 posted by Baker on 2016/06/30 22:44:55
MH's now dead engine, DirectQ, could easily clock 5000+ fps on an id1 map.
@Baker
#589 posted by Izhido on 2016/07/01 00:01:12
I'm pretty sure that's true, no doubt about it.
Do we have, however, some kind of display that can actually show those 5K FPS?
That was really kind of my point. :)
#590 posted by Baker on 2016/07/01 00:17:23
Here is a partial list of "blockbuster" map releases in the last few years:
a) A Roman Wilderness of Pain [aka ARWOP]
b) The Altar of Storms [aka ne_ruins]
c) Rubicon Rumble
d) Remake Quake
that will grind down to single-digit frames per second in traditional GLQuake-style engines. But DirectQ or FTEQW, instead of getting 12 fps, will get 150 fps or 300 fps.
And these kinds of mega-maps have been "the norm" in the last 5-6 years. There are plenty more where those came from.
#591 posted by Izhido on 2016/07/01 00:27:05
Whoa. Can they run on vanilla? I'd love to test them on an iPhone ;)
#592 posted by Baker on 2016/07/01 00:43:56
a) requires enhanced network protocol (FitzQuake 666 or similar)
b) requires enhanced network protocol (FitzQuake 666 or similar)
c) requires BSP2 support, enhanced network protocol (FitzQuake 666 or similar) and should have alpha masked texture support.
d) requires the Remake Quake engine bundled in the download (BSP2, alpha masked texture support, Remake Quake's protocol 999, hurt blurs, other 2D effects)
#593 posted by Joel B on 2016/07/01 01:09:40
What kind of hardware is actually limited to low FPS in those maps? Laptops with Intel GPUs? No system I've played them on with Nvidia GPUs has had issues.
#594 posted by Baker on 2016/07/01 01:12:49
With what engine? Quakespasm since late 2014 or thereabout has used vertex arrays so the performance is about triple the norm on maps with tons and tons of monsters.
#595 posted by Joel B on 2016/07/01 01:18:43
Usually Quakespasm.
@Izhido
#596 posted by Spike on 2016/07/01 02:12:22
try them. fix stuff that doesn't work. job done.
it's not like engines that support this stuff are closed source, so it shouldn't be that hard considering the stuff you've previously managed.
@Izhido
#597 posted by Baker on 2016/07/01 02:37:45
Sample source code implementing protocol 666 download source
Every single protocol 666 change is marked with:
#ifdef FITZQUAKE_PROTOCOL
Yeah, every last change is marked.
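To give an idea of what a change marked that way looks like, here is a hypothetical sketch; it is not taken from the linked source. The `FITZQUAKE_PROTOCOL` guard is the one named above, the two constants follow the usual FitzQuake naming as far as I know, and `protocol_supported` is an invented helper.

```c
#include <assert.h>
#include <stdbool.h>

#define FITZQUAKE_PROTOCOL            /* remove to build without protocol 666 */

#define PROTOCOL_NETQUAKE   15        /* original Quake protocol */
#ifdef FITZQUAKE_PROTOCOL
#define PROTOCOL_FITZQUAKE  666       /* enhanced FitzQuake protocol */
#endif

/* Hypothetical check a client might run on the server's protocol version. */
bool protocol_supported(int version)
{
    assert(version >= 0);
    if (version == PROTOCOL_NETQUAKE)
        return true;
#ifdef FITZQUAKE_PROTOCOL
    if (version == PROTOCOL_FITZQUAKE)
        return true;
#endif
    return false;
}
```

Searching a port's source for the guard then lists every place that needs touching.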