I can corroborate that text-davinci gives much better results for tasks involving summarization or extraction of key sentences from a large corpus. I wonder what empirical metrics OpenAI uses to benchmark performance on practical tasks like these. You can see the model in action analyzing reviews here: https://show.nnext.ai/
[Disclaimer - I work at nnext.ai]