jsheard 3 months ago

In true Vulkan fashion, this feature, which was announced for DirectX a year ago and is now shipping, hasn't even been publicly acknowledged by Khronos yet :(

I'm sure they're working on it, but the slower pace imposed by the much bigger committee, which includes the mobile GPU vendors, is a drag for those who only care about high-performance hardware, and in the end the Khronos version usually ends up mirroring the design of the DirectX one anyway.

  • pjmlp 3 months ago

    Vulkan hasn't changed anything in the way Khronos works.

    It is still extension spaghetti: multiple code paths working across extensions and around driver bugs.

    "Make your own SDK" kind of approach, which isn't as bad as OpenGL, only because they at least pay LunarG to make one.

    But good luck with Android tooling for Vulkan on the NDK, other than a GitHub repo full of samples.

    A stagnant GLSL, hence why HLSL is now the de facto shading language for most studios (not only due to DirectX), with NVidia's Slang in a close second place.

    • dagmx 3 months ago

      Are game studios actually using Slang? I haven’t really seen anyone at GDC or SIGGRAPH mention it outside of a couple of projects like Autodesk’s Aurora, which itself is very nascent.

      • pjmlp 3 months ago

        Most game studios are naturally using HLSL, or variations thereof.

        However, games aren't the only reason for Vulkan's existence, and NVidia was proposing Slang at Vulkanised 2024.

        • dagmx 3 months ago

          Ah yeah, but NVIDIA have been pushing Slang as a solution for a bit. It’ll be interesting to see if there will be any takers for production use.

          Even within NVIDIA, they seem to have a constant battle between MDL, Slang and MaterialX.

  • mananaysiempre 3 months ago

    At the very least AMD made an extension pretty much immediately[1,2]. NVidia... well I don’t really expect graphics-programming-friendly initiatives from NVidia at this point.

    [1] https://gpuopen.com/gpu-work-graphs-in-vulkan/

    [2] https://github.com/KhronosGroup/Vulkan-Docs/blob/main/propos...

    • my123 3 months ago

      There's also VK_NV_device_generated_commands(_compute), which is already implemented by RADV in Mesa and is much older than VK_AMDX_shader_enqueue.

      The X suffix means that it's not a stable API and should not be used by shipped code.

      • creata 3 months ago

        So are the concerns regarding VK_NV_device_generated_commands (mainly, unnecessary latency) raised in the VK_AMDX_shader_enqueue proposal overstated?

        • my123 3 months ago

          It comes with a lot of flexibility that the equivalent AMDX extension doesn't have, including being able to queue graphics shaders, not only compute ones.

          As far as I can see, unnecessary latency isn't actually a problem.

    • pjmlp 3 months ago

      Wrong there, as Slang is the only sane alternative to a stagnant GLSL, unless you favour using DirectX HLSL.

  • Lichtso 3 months ago

    It is true that Vulkan is usually two to three years behind DirectX, but it gets there eventually. Keep in mind that Vulkan tries to cover a wider range of devices and platforms. Also, in my experience as a contributor to GPU HALs, Vulkan is far more thought-out while DirectX is usually a ship-quickly mess. Vulkan obviously benefits from hindsight here.

  • DonHopkins 3 months ago

    Sounds like a modern more general version of ActiveMovie aka DirectShow Filter Graphs, from 1996.

    https://learn.microsoft.com/en-us/windows/win32/directshow/t...

    The whole point of filter graphs was that they would cleverly try to choose filter implementations based on their wishy-washily defined "merits", depending on which filters and codecs were currently installed, in a way that minimized copying data between the CPU and the video card (this was before modern programmable GPUs - mainly video cards with multimedia stream splitting/combining/compressing/decompressing/filtering/rendering and other forms of acceleration).

    It introduced the term "Codec Hell", an even deeper layer of "DLL Hell":

    https://en.wikipedia.org/wiki/DirectShow#Codec_hell

    >Codec hell (a term derived from DLL hell) is when multiple DirectShow filters conflict for performing the same task. A large number of companies now develop codecs in the form of DirectShow filters, resulting in the presence of several filters that can decode the same media type. This issue is further exacerbated by DirectShow's merit system, where filter implementations end up competing with one another by registering themselves with increasingly elevated priority.

    >Microsoft's Ted Youmans explained that "DirectShow was based on the merit system, with the idea being that, using a combination of the filter’s merit and how specific the media type/sub type is, one could reasonably pick the right codec every time. It wasn't really designed for a competing merit nuclear arms race."

    >A tool to help in the troubleshooting of "codec hell" issues usually referenced is the GSpot Codec Information Appliance, which can be useful in determining what codec is used to render video files in AVI and other containers. GraphEdit can also help understanding the sequence of filters that DirectShow is using to render the media file. Codec hell can be resolved by manually building filter graphs, using a media player that supports ignoring or overriding filter merits, or by using a filter manager that changes filter merits in the Windows Registry.

    • ack_complete 3 months ago

      It is a lot more constrained than DirectShow graph building.

      DirectShow has many characteristics that make its graph building fragile:

      - It allows for fuzzy rules to choose filters, which often overlap and require prioritization.

      - It runs code in the filters themselves as part of the decision process. This leads to apps breaking when filters are installed that have broken input logic and inject themselves everywhere or reject connections for obscure reasons (e.g. allocator requirements).

      - It often needs to build a telescope of multiple filters to complete a requested connection, such as a decompression codec followed by a Color Space Converter. This makes it vulnerable to filter loops and horribly inefficient conversion paths (like converting color spaces by compression followed by decompression).

      - Filters are provided by multiple third parties and installed system-wide.

      DirectX work graphs work with a private node/shader collection provided by the application and connect inputs to outputs by matching node IDs, so it's more like serialization/deserialization of a predefined graph than constructing one by depth-first search, and it's not pulling unknown external nodes of ambiguous quality.
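
      To make the node ID matching concrete, here's a rough, untested HLSL sketch (the node names and record type are made up for illustration): the producer's output parameter carries the consumer's node ID, so the whole topology is known when the application creates the state object, with nothing negotiated at runtime.

          // Hypothetical record type shared by producer and consumer.
          struct WorkItem
          {
              uint itemIndex;
          };

          [Shader("node")]
          [NodeLaunch("broadcasting")]
          [NodeDispatchGrid(1, 1, 1)]
          [NumThreads(64, 1, 1)]
          void ProduceWork(
              uint threadIndex : SV_GroupIndex,
              // The [NodeID] on this output ties it to the ConsumeWork node
              // below -- resolved at state object creation, not at runtime.
              [MaxRecords(64)] [NodeID("ConsumeWork")] NodeOutput<WorkItem> workOut)
          {
              ThreadNodeOutputRecords<WorkItem> rec =
                  workOut.GetThreadNodeOutputRecords(1);
              rec.Get().itemIndex = threadIndex;
              rec.OutputComplete();
          }

          // Consumer: a thread-launch node invoked once per record.
          // Its node ID defaults to the function name, "ConsumeWork".
          [Shader("node")]
          [NodeLaunch("thread")]
          void ConsumeWork(ThreadNodeInputRecord<WorkItem> workItem)
          {
              uint itemIndex = workItem.Get().itemIndex;
              // ... per-item work would go here ...
          }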

      • jms55 3 months ago

        Something I want to point out is _why_ you would want a graph for GPU work.

        The existing programming model is based on passes. You have one dispatch spin up thousands of workgroups, each of which does some computation. Then you issue a barrier, change your shader, and issue another dispatch doing more work, often reading the previous pass's result.

        At first glance this seems fine, but it has some inefficiencies.

        1. Often the size of the second pass depends on the output of the first. The first pass might determine that some work is unneeded, and therefore the second pass would have less to do. But from the CPU side, you still need to allocate memory for the worst-case amount of output from the first pass.

        2. Towards the end of the first pass, you'll have only a couple of workgroups left running. But you can't start running the second pass yet - you need to wait for _all_ workgroups in the first pass to finish, and then issue a barrier for memory synchronization.

        Workgraphs let you fix these problems. You define two nodes A->B that would correspond to your old passes. Then, you only need to allocate a reasonable amount of scratch memory for the first pass's output - you don't need to allocate for the worst case. While A is running, B can immediately run and pick up the results from A via the scratch buffer as A's workgroups complete.

        Workgraphs also give you a lot more flexibility with branching based on work type. A common technique is to split the screen into fixed-size tiles, and classify each tile as "cheap" or "expensive" to shade. Then you can run a cheap shading pass only on the cheap tiles, and an expensive shading pass only on the expensive tiles. The problem here is the same as we saw earlier - you need a full barrier and to wait for the previous pass to complete before the next can be issued, even though they're independent. And that's with only 2 possible shaders. With more passes, or different-length chains of work (e.g. this tile just needs some quick shadows, this other one needs shadows and then 3 passes of denoising), it gets more and more inefficient. Workgraphs let you turn all that into nodes, and schedule them much more efficiently.
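
        Here's a rough, untested HLSL sketch of that tile example (the node names, record layout, and the placeholder IsTileExpensive check are all made up): the classifier emits one record per tile into whichever shading node applies, only the records that are actually produced take up backing memory, and the shading nodes can start consuming records while classification is still running.

            // Hypothetical per-tile record.
            struct TileRecord
            {
                uint2 tileCoord;    // which 8x8-pixel tile to shade
            };

            // Hypothetical classification criterion; a real one would look at
            // G-buffer contents, material flags, etc.
            bool IsTileExpensive(uint2 tileCoord)
            {
                return ((tileCoord.x ^ tileCoord.y) & 1) != 0;   // placeholder pattern
            }

            // Classifier: one thread per tile decides cheap vs expensive and
            // emits a record to the matching downstream node. The fixed grid
            // assumes the target divides evenly into tiles (e.g. 1280x768).
            [Shader("node")]
            [NodeLaunch("broadcasting")]
            [NodeDispatchGrid(20, 12, 1)]
            [NumThreads(8, 8, 1)]
            void ClassifyTiles(
                uint3 tile : SV_DispatchThreadID,
                [MaxRecords(64)] [NodeID("ShadeCheap")]     NodeOutput<TileRecord> cheapOut,
                [MaxRecords(64)] [NodeID("ShadeExpensive")] NodeOutput<TileRecord> expensiveOut)
            {
                bool expensive = IsTileExpensive(tile.xy);

                // Per-thread allocation: each thread asks for 0 or 1 records,
                // so memory is only consumed for tiles that actually land in
                // each bucket, not for the worst case on both paths.
                ThreadNodeOutputRecords<TileRecord> cheapRec =
                    cheapOut.GetThreadNodeOutputRecords(expensive ? 0 : 1);
                ThreadNodeOutputRecords<TileRecord> expensiveRec =
                    expensiveOut.GetThreadNodeOutputRecords(expensive ? 1 : 0);

                if (expensive)
                    expensiveRec.Get().tileCoord = tile.xy;
                else
                    cheapRec.Get().tileCoord = tile.xy;

                cheapRec.OutputComplete();
                expensiveRec.OutputComplete();
            }

            // Cheap path: one 8x8 thread group per tile record. The expensive
            // node would look the same but run the heavier shading code.
            [Shader("node")]
            [NodeLaunch("broadcasting")]
            [NodeDispatchGrid(1, 1, 1)]
            [NumThreads(8, 8, 1)]
            void ShadeCheap(
                DispatchNodeInputRecord<TileRecord> tile,
                uint3 gtid : SV_GroupThreadID)
            {
                uint2 pixel = tile.Get().tileCoord * 8 + gtid.xy;
                // ... cheap shading for this pixel ...
            }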

        • flohofwoe 3 months ago

          To me the downside is that it's basically building an AST with a cumbersome 'AST builder API' instead of designing a new specialized programming language where the dependency tree and 'node payload' are all integrated into a single language (instead of having the traditional split between CPU- and GPU-side languages, with a 3D API as glue sitting in between - as more and more work moves to the GPU, that traditional split makes less and less sense IMHO).

          The idea of building an AST with an API instead of a dedicated language feels a lot like glTF's new interactivity extension, which describes an AST in JSON (blech): https://github.com/KhronosGroup/glTF/blob/interactivity/exte...

          PS: also, it would be nice to get an idea of what the debugging situation is for workgraphs.

CooCooCaCha 3 months ago

I’m curious what the endgame is for GPUs. It seems like over time they’ve been going down the road of becoming generally programmable parallel processors.

Will there come a time when, as with CPUs, we land on a more general programming model and don’t need graphics APIs to introduce new features? Instead, if there’s a new graphics technique, we wouldn’t have to wait for a new DirectX feature - we could just code it ourselves.

  • pjmlp 3 months ago

    Yes, that is why OTOY is now using CUDA for their rendering engine, and why Unreal's Nanite is based on compute shaders.

    Metal is based on C++14, CUDA can nowadays use most of C++20, and HLSL is becoming more C++-like with HLSL 202x. Not sure about the current state of NVN and PSSL.

    Regarding waiting for features, it depends: many graphics techniques can already be done entirely in shaders, and when that isn't the case it is because some primitive is missing, which means there will always be cases where you have to wait for such primitives to be made available.

    • CooCooCaCha 3 months ago

      > there will always be cases where you have to wait for such primitives to be made available.

      About the features thing: when it comes to CPUs you don't really need to wait for features unless it's something like SIMD support; otherwise the programming model is general enough that I don't need to think about CPU features too much.

      What I'm wondering is whether, as GPUs become more general, the situation will become more like CPUs. For example, I saw someone write a shader, using work graphs, to rasterize triangles. Once you can do things like that, it seems like the need for new primitives is far smaller.

    • animal531 3 months ago

      I'm working on an indie game that's a bit like They Are Billions, with a ton of enemies that need to move together nicely, find targets, etc.

      So I've been going down quite a rabbit-hole of nearest neighbour and spatial partitioning algorithms.

      I started with CPU multithreading first and then moved to full GPU algorithms. While the speed can be really great, they suffer from too much parallelism and a lack of recursion/communication, etc. I still want to try CUDA/OpenCL versions as part of the test.