> How can gradient descent work on compute graphs when the space of compute graphs is discrete?
You can un-discretize the space of compute graphs by interpolating its points by simplices. More precisely, each graph is a subgraph of the complete graph, and the subgraph is identified by the indicator function of its edges whose values are either 0 or 1. By using weighted edges with values between 0 and 1, the space of all graphs (with the same number of vertices) becomes continuous and connected, and you can gradient move around it in small steps.
Of course, "compute graphs" are more general beasts than "graphs", but it is likely that the same idea will apply. At least, for a reasonably large class of compute graphs.
You can un-discretize the space of compute graphs by interpolating its points by simplices. More precisely, each graph is a subgraph of the complete graph, and the subgraph is identified by the indicator function of its edges whose values are either 0 or 1. By using weighted edges with values between 0 and 1, the space of all graphs (with the same number of vertices) becomes continuous and connected, and you can gradient move around it in small steps.
Of course, "compute graphs" are more general beasts than "graphs", but it is likely that the same idea will apply. At least, for a reasonably large class of compute graphs.