(not sure into which subtopic this goes, feel free to move around).
We quite frequently discuss how to make FG faster. That's not as straightforward as it looks, so let's take a step back and investigate a situation which is perhaps more familiar. Suppose we want to manufacture a car and optimize the assembly process.
The situation
We're a modern company, which means we have 'just in time' management for parts: we are unable to store half-assembled parts, for instance. Everything arrives at the processing plants when it's needed, and everything gets picked up when finished.
We have two plants - plant 1 and plant 2. Plant 1 is a traditional assembly-line setup - it builds the car without the engine and then installs the engine once it is delivered. The tasks it has to do are:
a) assemble frame
b) attach wheels
c) attach seats
d) attach doors
e) paint
f) put engine in as soon as it arrives
Plant 2 is a very modern setup in which the mechanical parts of the engine and the electronics are modules which can be put together more or less in parallel. Team A assembles the mechanical parts: any member of the team just grabs two non-attached modules, connects them and puts the finished part into a basket, to be picked up by the next person. Team B takes any of the (half-assembled) structural parts and attaches electronics and sensors. Team A does mechanical parts only and team B electronics only, but each team member does whatever work needs to be done at any given point. However, electronics can only be attached once the structure for that part of the engine is done - there's no attaching to a structure which isn't built yet.
The bottleneck
Suppose plant 1 takes the following times to assemble the car:
assemble frame: 24 minutes / attach wheels: 4 minutes / attach seats: 4 minutes / attach doors: 2 minutes / paint: 10 minutes / engine: 2 minutes
Team A in plant 2 takes 58 minutes to assemble a complete engine, team B takes 20 minutes to assemble the electronics.
When will a car be finished?
It's fairly straightforward to get the answer for the assembly line in plant 1: the tasks are done one after the other, so we can just add the numbers and find 46 minutes.
In plant 2, adding the numbers would be wrong. Team B starts working almost as soon as team A has assembled the first modules, so by and large the two teams work in parallel, and team B has only very little work left once team A has finished. The engine therefore takes about 58 minutes.
Yet building the car takes 60 minutes: after 44 minutes the assembly line in plant 1 has to stop and wait for the engine, which arrives at 58 minutes, and putting the engine in takes the last two minutes.
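The arithmetic above can be written down as a small Python sketch. The function name `car_ready_time` and the constants are made up for illustration, and the model assumes that teams A and B overlap completely, so the engine takes as long as the slower team:

```python
# Task durations in minutes, taken from the text above.
PLANT1_TASKS = {
    "assemble frame": 24,
    "attach wheels": 4,
    "attach seats": 4,
    "attach doors": 2,
    "paint": 10,
}
ENGINE_INSTALL = 2
TEAM_A = 58  # engine structure
TEAM_B = 20  # electronics


def car_ready_time(plant1_tasks, engine_install, team_a, team_b):
    """Minutes until one car is finished (hypothetical helper)."""
    body_done = sum(plant1_tasks.values())  # serial assembly line
    engine_done = max(team_a, team_b)       # assumption: the teams overlap fully
    # The engine can only go in once both the body and the engine exist.
    return max(body_done, engine_done) + engine_install


print(car_ready_time(PLANT1_TASKS, ENGINE_INSTALL, TEAM_A, TEAM_B))  # 60
```

The `max()` is the whole point: the slower of the two plants decides when the car is done, no matter how fast the other one is.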
Making a car is obviously an analogy for computing an FG frame: plant 1 is the CPU and plant 2 is the graphics card. While the CPU does tasks in sequence, the GPU internally does things in parallel, but the fragment shader (team B) needs to wait for the vertex shader (team A) to be finished, and the frame needs to wait for all tasks to be completed.
Some things that do not work
A consultant might have the bright idea that while we can't paint the car before the doors are on, we can attach wheels and seats simultaneously. Each task takes 4 minutes, but we're not doing them in sequence now, we're doing them in parallel, so we save 4 minutes of the total.
Does that mean we can now roll out cars every 56 minutes? No - it means we finish the assembly line work after 40 minutes, then wait 18 minutes for the engine to arrive, and roll the car out after 60 minutes.
That's utilizing multiple CPUs better in a situation where the GPU is the limiting factor.
***
Another consultant might have the idea to hire more people for team B to get the engine done faster. So team B now only takes 10 minutes to attach the electronics. Yet - team B can't finish before team A does, so it still takes 58 minutes to do the engine and an hour to make a car.
That's streamlining fragment shader operations or getting more fragment processing cores in a situation that's vertex-shader limited.
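Both consultants' proposals can be checked with a rough Python sketch. Here plant 1 is described as a list of stages, where tasks inside one stage run side by side; this representation and the function name are assumptions made for the example, and the two engine teams are again taken to overlap fully:

```python
def car_ready_time(stages, engine_install, team_a, team_b):
    """Minutes per car; tasks in the same inner list run in parallel."""
    body_done = sum(max(stage) for stage in stages)
    engine_done = max(team_a, team_b)  # assumption: the teams overlap fully
    return max(body_done, engine_done) + engine_install


# frame, wheels, seats, doors, paint, one after the other
baseline = [[24], [4], [4], [2], [10]]
# first consultant: wheels and seats at the same time
wheels_and_seats_parallel = [[24], [4, 4], [2], [10]]

original = car_ready_time(baseline, 2, team_a=58, team_b=20)
consultant_1 = car_ready_time(wheels_and_seats_parallel, 2, team_a=58, team_b=20)
# second consultant: team B twice as fast
consultant_2 = car_ready_time(baseline, 2, team_a=58, team_b=10)

print(original, consultant_1, consultant_2)  # 60 60 60
```

Neither change touches the 58 minutes team A needs, so the result stays at 60 minutes in all three cases.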
What does work
In fact, the only thing that will get cars out faster in this setup is to hire more people for team A. If team A is able to assemble the engine structure in 15 minutes while team B, working in parallel, still takes 20 minutes, we're going to see an engine after 20 minutes. Plant 2 can then enjoy a break while the engine gets delivered to plant 1, waits there for 24 minutes and is put in for two minutes, getting the car ready after 46 minutes. After that, plant 2 can start to produce a new engine.
That's an upgrade of the graphics card to give it better vertex processing capability.
Now the situation is different. Hiring more people for team B now does something: with team B down to 10 minutes, a whole engine can be assembled in 15 minutes. But this doesn't influence the speed at which we roll out cars. All it does is change the time the engine waits in plant 1 from 24 to 29 minutes - the bottleneck is still in plant 1.
A GPU upgrade helps only if you're rendering-limited.
But the other solution, i.e. attaching wheels and seats in parallel, now actually has an effect, and we really save four minutes: the car is ready after 42 minutes.
Parallelizing helps if you don't need to wait for a task that is longer or needs to be serial.
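Which plant is the bottleneck in each of these setups can be worked out mechanically. The following Python sketch uses a made-up helper `analyse` and the same simplifying assumption as the text, namely that the engine takes as long as the slower of the two teams:

```python
def analyse(body_minutes, team_a, team_b, engine_install=2):
    """Return (minutes per car, limiting plant) for one setup."""
    engine_minutes = max(team_a, team_b)  # assumption: the teams overlap fully
    total = max(body_minutes, engine_minutes) + engine_install
    limiter = "plant 2" if engine_minutes > body_minutes else "plant 1"
    return total, limiter


print(analyse(44, team_a=58, team_b=20))  # (60, 'plant 2')  original setup
print(analyse(44, team_a=15, team_b=20))  # (46, 'plant 1')  bigger team A
print(analyse(44, team_a=15, team_b=10))  # (46, 'plant 1')  bigger team B as well
print(analyse(40, team_a=15, team_b=20))  # (42, 'plant 1')  wheels and seats in parallel
```

Once the limiter has switched to plant 1, only changes to plant 1 show up in the total.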
What to target
The next vastly overpaid consultant has a look at how things in plant 1 are done and finds that attaching doors is really slow, and by investing a million, the car company can make it four times faster than it is now.
His colleague looks at assembling the frame, and finds that well, there's a modest optimization potential - it can be done 20% faster by investing a million.
We only have a million to spare, so what should we do?
In the first case, we're saving 75% of 2 minutes, or 1:30. In the second case, we're saving 20% of 24 minutes - or almost 5 minutes. So despite the relative increase being vastly larger in the first case, we should go for the second case.
A modest performance increase in a relevant subsystem that actually takes lots of time usually by far outweighs a dramatic performance increase in an irrelevant subsystem that hardly takes time to execute anyway - thus valuable developer time is better spent slightly improving the things that take lots of time than dramatically improving things that don't take measurable time anyway.
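The comparison of the two investments can be reproduced with a few lines of Python. The helper names are hypothetical, and "20% faster" is read the way the text reads it, as the task taking 20% less time:

```python
def seconds_saved(task_minutes, percent_of_time_saved):
    """Seconds saved on one task; integer maths avoids float rounding."""
    return task_minutes * 60 * percent_of_time_saved // 100


def as_min_sec(seconds):
    """Format a number of seconds as m:ss."""
    return f"{seconds // 60}:{seconds % 60:02d}"


# Four times faster means the task needs 25% of its old time, so 75% is saved.
doors = seconds_saved(2, 75)
# Assumption: "20% faster" means the frame takes 20% less time.
frame = seconds_saved(24, 20)

print(as_min_sec(doors))  # 1:30
print(as_min_sec(frame))  # 4:48
```

What counts is the absolute time saved, which is the percentage multiplied by how long the task takes in the first place.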
Final words
So, next time you feel like pointing at a random FG subsystem and claiming we absolutely need to make it faster because it drags down everyone's framerate, think of the car manufacturing process: how non-trivial it is to optimize, how many optimizations don't actually help, and how what helps may differ from one setup to the next.