here's the RDNA1 whitepaper:
https://www.amd.com/system/files/doc...whitepaper.pdf
On page 6 you can see that each 10-CU module has its own memory controller, L1/L2 cache and render backends. Compared to CPU cores, I don't think these modules really need to talk to each other through an L3 cache or anything like that; graphics workloads are all about parallel throughput. (There is some talk of RDNA2 adding an L3 cache to reduce its dependence on memory bandwidth, but that's all rumour right now.)

I think if you split those shader engines into chiplets, yes, there would be latency penalties in the render pipeline, but you could offset those by increasing the CU count. We'd still be talking microseconds, which matters for memory/cache access but probably not so much in the render pipeline. Just a guess.
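To put some rough numbers on the "still microseconds" guess, here's a quick back-of-envelope in Python. Every figure is an assumption I made up for illustration: a ~1 µs worst-case cross-chiplet hop, a 60 fps frame budget, and 1,000 serialized pipeline crossings per frame.

```python
# Back-of-envelope: how much would a hypothetical cross-chiplet latency
# penalty eat into a frame budget? All numbers below are assumptions.

FRAME_BUDGET_US = 1_000_000 / 60   # ~16,667 us per frame at 60 fps
CHIPLET_HOP_US = 1.0               # assumed worst-case cross-chiplet hop
HOPS_PER_FRAME = 1_000             # assumed serialized crossings per frame

penalty_us = CHIPLET_HOP_US * HOPS_PER_FRAME
pct = 100 * penalty_us / FRAME_BUDGET_US

print(f"frame budget: {FRAME_BUDGET_US:.0f} us")
print(f"penalty:      {penalty_us:.0f} us ({pct:.1f}% of budget)")
```

Even with those deliberately pessimistic inputs the penalty is a single-digit percentage of the frame budget, which is the kind of thing you could plausibly buy back with more CUs.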