This is a classic knowledge distillation pattern in ML: "teacher" models (AlphaFold with its MSA-based architecture, ESMFold with its protein-language-model backbone) generate training data for a simpler "student" model. What's particularly interesting is how well the simplified architecture generalizes despite losing the explicit evolutionary signal from MSAs. The performance suggests that much of what the MSA machinery captures can be learned more directly from structure data. This could be huge for real-time applications where MSA search is the bottleneck. Has anyone benchmarked inference speed against the original AlphaFold pipeline?
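
For anyone who wants the shape of the idea in code, here's a minimal sketch of such a distillation loop (PyTorch; the tiny `StudentModel`, the synthetic batch, and the plain MSE loss are all illustrative placeholders, not the actual training setup, which would use a structure-aware loss like FAPE over teacher-predicted coordinates):

```python
import torch
import torch.nn as nn

# Synthetic stand-in for (sequence, teacher-predicted structure) pairs.
# In a real pipeline the coordinates come from running the teacher
# (e.g., AlphaFold) offline over a large sequence database.
batch, seq_len = 8, 64
seq_tokens = torch.randint(0, 21, (batch, seq_len))   # 20 amino acids + pad
teacher_coords = torch.randn(batch, seq_len, 3)       # per-residue (x, y, z)

# Hypothetical single-sequence student: no MSA input, just token embeddings.
class StudentModel(nn.Module):
    def __init__(self, vocab_size=21, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.to_coords = nn.Linear(dim, 3)

    def forward(self, tokens):
        return self.to_coords(self.encoder(self.embed(tokens)))

student = StudentModel()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

# One distillation step: the teacher's predictions are the labels.
pred_coords = student(seq_tokens)
loss = nn.functional.mse_loss(pred_coords, teacher_coords)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"distillation loss: {loss.item():.4f}")
```

The speed question mostly comes down to what this loop omits: at inference there's no MSA search at all, so the cost is a single forward pass rather than a database query plus the full Evoformer stack.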