We’ve been evaluating text-to-video tools for internal content production, specifically for thought leadership and training material. I want to share some actual results because the gap between how these tools are marketed and what they produce in production conditions is significant enough to be worth naming directly.
The demos are designed to show you best-case outputs on prompts that have been refined over hours. When you sit down with an actual content brief from an actual stakeholder and try to generate something usable, the experience is different.
Specific observations across three tools we tested over six weeks:
Consistency across a video sequence is the biggest unsolved problem. Characters, settings, and objects change between clips in ways that break narrative coherence. Fine for abstract or atmospheric content. Not fine for anything that needs to tell a story with visual continuity.
Text rendering inside video is still unreliable. If your brief involves visible text (signs, labels, any readable words), the output is often gibberish or inconsistent across frames. This is a known limitation, but it’s not prominently disclosed.
Prompt interpretation is much more literal than the demos suggest and much less robust to ambiguity. Vague briefs that a human video editor would interpret intelligently produce outputs that don’t resemble the intent.
What actually worked: abstract background visuals, atmospheric transitions, short social content where continuity doesn’t matter, and anything where motion and mood matter more than narrative specificity.
Recommendation: useful now for specific, narrow use cases. Not a replacement for production-level video work. Evaluate against your actual use case, not the vendor’s demo reel.