Transformers can attend to a fairly long context, on the order of 1,000 tokens. For text that is roughly 100 sentences, which is a lot for written data. There are also implementations with longer context windows, but in general long context is hard to achieve.
Context becomes a real problem for video, though, because video needs far more of it: even 1 minute of video is typically about 1,500 frames.
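A quick sanity check of that frame count (assuming a common frame rate of 25 fps; at 30 fps the number is higher still):

```python
# Back-of-the-envelope: frames in one minute of video.
# 25 fps is an assumption (common for PAL video); adjust for other frame rates.
fps = 25
seconds_per_minute = 60
frames_per_minute = fps * seconds_per_minute
print(frames_per_minute)  # 1500
```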
The next problem is the amount of data in even a single video frame. For text, Transformers use embeddings of length about 1,000 floats. A single HD video frame has about 2 million pixels, each with 3 color channels.
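To put those two numbers side by side, here is a rough comparison of one raw HD frame against one text-token embedding (the 1,000-float embedding length and 1920x1080 HD resolution are assumptions taken from the figures above):

```python
# Rough size comparison: raw values in one HD frame vs one text embedding.
embedding_floats = 1000                   # typical text embedding length (assumed)
width, height, channels = 1920, 1080, 3   # HD resolution, RGB
frame_values = width * height * channels
print(frame_values)                       # 6220800 raw values per frame
print(frame_values / embedding_floats)    # ~6220x more values than one embedding
```

So even before considering context length, a single raw frame carries thousands of times more values than a text-token embedding, which is why video models compress frames into much smaller representations first.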
That said, as far as I remember, there have been attempts at such implementations.