The only guesses I have are that it pertains to scheduling and addressing complexity, or to power-consumption requirements. I would think that for any multithreaded application, having 1 core with 8 ALUs/AGUs would be superior to 2 cores with 4 ALUs/AGUs each, since you can share the cache addressing and have 1 scheduler parse through all the micro-ops. The problem is that most non-server applications are still single threaded, where you would be consuming significantly more power for each core.
I still think that a lot of this is just laziness on the part of the programmers. Multi-tasking can be very easy for a lot of applications if you have a good kernel and know what the independent steps are - the challenge is just that it takes much more time up front.
As an example, I'm doing some work with A* graph search right now. It would be very natural to assign maintenance of the minheap as a task, and create separate tasks for processing each edge of a given node. It would easily cut my processing time in half, if not better. The problem, however, is that my search takes on the order of seconds to complete. It doesn't really make sense to spend hours creating a real-time system when I could easily run 100 more tests in the same time span. And yes, this is just complete laziness on my behalf lol - but there are lots of systems that could very naturally be made multi-threaded with significant time benefits; they just aren't, because of the cost of redoing the architecture. I know first hand this is a real problem in automotive, where half the "programmers" are really just mechanical engineers autocoding Simulink models.
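To make the decomposition concrete, here's a minimal sketch of the idea: the main thread owns the minheap, and relaxing the edges of the expanded node is farmed out as independent tasks. All names (`graph`, `heuristic`, the worker count) are hypothetical; the parallelism only pays off if the per-edge work is expensive, which it isn't in this toy version.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def parallel_astar(graph, start, goal, heuristic, workers=4):
    """A* where each outgoing edge of the expanded node is relaxed as a task.
    `graph` maps node -> list of (neighbor, edge_cost); `heuristic` must be
    admissible. Returns the cost of the shortest path, or None."""
    open_heap = [(heuristic(start), 0, start)]  # (f, g, node)
    g_score = {start: 0}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while open_heap:
            f, g, node = heapq.heappop(open_heap)
            if node == goal:
                return g
            if g > g_score.get(node, float("inf")):
                continue  # stale heap entry, skip

            # Each edge relaxation is an independent task on the pool.
            def relax(edge, g=g):
                nbr, cost = edge
                return nbr, g + cost

            # Only the main thread pushes, so the heap stays single-owner
            # (the "maintenance of the minheap as a task" part).
            for nbr, tentative in pool.map(relax, graph.get(node, [])):
                if tentative < g_score.get(nbr, float("inf")):
                    g_score[nbr] = tentative
                    heapq.heappush(
                        open_heap, (tentative + heuristic(nbr), tentative, nbr)
                    )
    return None
```

Keeping the heap owned by one thread is the key design choice: the edge tasks are pure computation with no shared writes, so no locks are needed.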
The challenge is that you need to keep your working set in the cache local to each die; if you go over and start hitting the next die's cache, huge latency is the result.
At least that is what I have been told.
That's true to a point, but you need to take multiple things into account. For example, data sharing in general is frowned upon. There are many ways around it, like using semaphores and mailboxes, but even if cached data needs to be shared, you can weigh the cache latency cost against the task turnaround time. If the latency cost is less than the turnaround time (in many cases this is true), just spread it across multiple cores. But even then, it might be computationally faster to just keep local copies of the data for each core and then have a task for updating the values in a common fashion.
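The per-core-copy-plus-update-task pattern above can be sketched roughly like this. This is a hypothetical illustration, not any particular RTOS API: each worker reads its own private copy without locks, and a common "publish" task pushes changes through per-worker mailboxes, so synchronization cost is only paid at the mailbox crossing.

```python
import queue

class WorkerState:
    """One worker's private copy of shared parameters plus a mailbox.
    In a real system each WorkerState would be pinned to its own
    core/thread; here it's just a sketch of the data flow."""
    def __init__(self, initial):
        self.local = dict(initial)    # private copy, read with no locking
        self.mailbox = queue.Queue()  # updates are delivered here

    def drain_mailbox(self):
        # Apply pending updates at a task boundary (a safe point).
        while True:
            try:
                key, value = self.mailbox.get_nowait()
            except queue.Empty:
                break
            self.local[key] = value

def publish(workers, key, value):
    """The common update task: broadcast one change to every worker."""
    for w in workers:
        w.mailbox.put((key, value))
```

The trade-off is staleness: each worker sees the new value only when it drains its mailbox, which is exactly the "turnaround time vs. latency" calculation described above.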
This stuff is why I got into the embedded space. Hardware and software each have their own cool parts, but the integration of them brings a lot of creativity that most people don't know about.
Apple does now support heterogeneous multiprocessing and can use the big and little cores at the same time. Every Qualcomm part with big.LITTLE has to hand off between clusters, so the 8xx parts that claim eight cores can only use four at a time. I would rather have fewer, better cores, but phones now don't have too many, at least the Qualcomm ones.