> You make an LLM decision tree, one LLM call per policy section, and aggregate the results.
I can never understand why people jump to these weird direct calls to the LLM rather than working with embeddings for classification tasks.
I have a hard time believing that
- the context text embedding
- the image vector representation
- the policy text embedding(s)
cannot be combined into a classification model that would likely be several orders of magnitude faster than chaining calls to an LLM, and I wouldn't be remotely surprised to see it perform notably better on the task described. A rough sketch of what that could look like follows.
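As a minimal illustration (not the original poster's pipeline), here is one way the three embeddings could be concatenated into a single feature vector and fed to a lightweight classifier. The random vectors stand in for real encoder outputs (e.g. a sentence-embedding model for the context/policy text and something like CLIP for the image); all names and dimensions here are assumptions for the sketch.

```python
# Sketch: combine precomputed embeddings into one feature vector and fit
# a cheap classifier. Random arrays below are placeholders for real
# encoder outputs; dimensions are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n_examples = 1000

context_emb = rng.normal(size=(n_examples, 384))  # context text embedding
image_emb = rng.normal(size=(n_examples, 512))    # image vector representation
policy_emb = rng.normal(size=(n_examples, 384))   # policy text embedding(s)

# Concatenate into a single feature vector per example.
X = np.concatenate([context_emb, image_emb, policy_emb], axis=1)
y = rng.integers(0, 2, size=n_examples)           # 0/1 policy-violation labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```

With real embeddings this is one matrix multiply per prediction instead of a chain of LLM round-trips, which is where the orders-of-magnitude speedup comes from.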
I have used LLMs as classifiers, and it does make sense in cases of extremely limited data (though they rarely work well enough), but if you're going to be calling the LLM in such complex ways, it's better to stop thinking of this as a classic ML problem and instead treat it as an agentic content moderator.
In this case you can ignore the train/test split in favor of evals, which you would create as you would for any other LLM agent workflow.
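Concretely, "evals" here could be as simple as a small labeled set of cases run through the moderation pipeline and scored for agreement; this is just one hedged sketch, and `moderate` below is a hypothetical stand-in for the per-policy-section LLM decision tree, not a real API.

```python
# Sketch: a tiny eval harness for an agentic moderator. `moderate` is a
# placeholder for the chained LLM calls; cases and labels are illustrative.
from dataclasses import dataclass

@dataclass
class EvalCase:
    content: str
    expected: str  # e.g. "allow" or "remove"

def moderate(content: str) -> str:
    # Placeholder: in practice this would chain the LLM calls per policy section.
    return "allow"

eval_set = [
    EvalCase("harmless vacation photo caption", "allow"),
    EvalCase("text that clearly violates the harassment policy", "remove"),
]

results = [(case, moderate(case.content)) for case in eval_set]
accuracy = sum(got == case.expected for case, got in results) / len(results)
print(f"agreement with labels: {accuracy:.0%}")
for case, got in results:
    if got != case.expected:
        print(f"MISS: expected {case.expected}, got {got}: {case.content!r}")
```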