
I tested whether Gemini, ChatGPT, and Claude can analyze videos - this one wins

Can AI really watch video, or does it just fake it? I tested my favorite AI tools on YouTube clips and local files to find the winner.

Written by David Gewirtz, Senior Contributing Editor
May 11, 2026 at 5:34 a.m. PT
David Gewirtz / Elyse Betters Picaro / ZDNET


ZDNET's key takeaways

  • Gemini can watch YouTube, MP4, and MOV files.
  • Claude still can't process video directly.
  • ChatGPT needs Codex help for deeper video work.

AIs do a fine job understanding text from prompts and documents. Most do quite well interpreting images, but what about video? Can your favorite AI actually understand what's in a video? If it does understand, what can you do with it?

Also: I tested ChatGPT Plus vs. Gemini Pro to see which is better - and if it's worth switching

In this article, we test ChatGPT, Claude, and Gemini to see how well they grok the contents of videos, both from YouTube and local files. The results are surprising.

The tests

I fed each AI a set of three videos. One is a YouTube video I published last year about the scientific process of annealing (yes, I am as exciting on video as I am on ZDNET). I tested the AIs to see if they can understand what's in the video. Then, I tried to see if they could create a better thumbnail than I used on my YouTube channel.

The second video is a motion test for the DJI Neo 2 drone. It's just a video of me standing in front of the drone, using gestures to control how the drone flies. No audio. I wanted to see if the AIs understand what's happening there. That's in MP4 format.

Also: I tested ChatGPT vs. Claude to see which is better - and if it's worth switching

Finally, I have the original MOV file of a walk-and-talk about my YouTube posting strategy, which I published last year. For this test, I used the local file rather than the uploaded version, because I wanted to see how well the AIs could ascertain what I'm talking about without any metadata, transcripts, or hints provided by YouTube. It's just the video itself. If you want to see the uploaded version, here's a link.

I tested the latest models on each service: the $20-per-month ChatGPT Plus plan, the $20-per-month Gemini Pro plan, and the $100-per-month Claude Max plan, which I use for Claude Code.

My prompt was simple: "Can you watch this video?" I found that "watch this video" worked better than "understand" or "summarize," because both of those sent the AIs looking for metadata. By contrast, when I asked the AIs to "watch" the videos, all three seemed to know that I wanted them to actually watch the videos themselves and demonstrate an understanding of what was in them.

And with that, let's dig in.

Claude: nope

Sometimes prolonging the pain is just cruel. Rip that Band-Aid off and get it over with. So, Claude is a big no. Neither the app nor the web interface can process video. But rather than take my word for it, let's have Claude tell us, shall we?

Also: How I used Claude AI to plan an entire hiking trip to the Adirondacks in 30 minutes - for free

Claude says, "I can't watch video content directly. I can't watch videos. I don't have the ability to process video or audio content from YouTube links. I don't have the ability to process the visual or audio frames of an MP4 file. Same answer here -- I can't directly watch or listen to a .mov file. I don't process video or audio streams."

And so it goes.

Gemini: yep

Gemini does have a standalone app, but I didn't need to run it. Gemini's web interface did a great job understanding all the video formats.

It didn't matter whether I fed Gemini a YouTube URL, a 625MB MP4 file, or a whopping 1.65GB MOV file. Gemini, right in a browser tab, could handle it.

Also: This powerful Gemini setting made my AI results way more personal and accurate

The most interesting one is my drone test. That's because the video contains no audio (not even background noise) and no context other than me standing there and moving my arms.

Drone shot by David Gewirtz/ZDNET

And yet, Gemini was able to ascertain exactly what I was doing:

In the video, you're testing out some hand gestures -- raising your palm to the camera as if signaling it to stop or move. The camera follows your lead, changing its angle and distance as you guide it through the yard and eventually back toward the house.

Looks like a successful test of those drone gestures! Is there something specific about the footage or the camera's response you wanted to go over?

I mean, like, wow. Let's be clear. The drone itself was not visible in the video. It was acting as the camera. I'm betting there are a lot of humans who wouldn't understand what was happening there (I'm looking at you, my neighbors!), let alone an AI.

It did successfully understand my annealing video. It was able to identify sections, report on specific points I made verbally, and otherwise demonstrate its understanding.

It also understood the uploaded walk-and-talk video, not only identifying the location, but the various aspects of my commentary throughout the video.

Also: I tested ChatGPT Images 2.0 vs. Gemini Nano Banana to see which is better - this model wins

The one place Gemini fell down was in the transition from Gemini's understanding mode to Nano Banana's image mode. Despite Nano Banana's ability to make awesome images, it doesn't understand life and the world the way the new ChatGPT Images 2.0 does, and it shows.

I fed Gemini the original thumbnail for the video and told it, "Choose a single frame for the maximum impact as a YouTube thumbnail, then, based on context of the video and my existing YouTube thumbnail style, use Nano Banana to create a high-click-value thumbnail."

The image on the left is my original thumbnail. The next two were Gemini's attempts. They're certainly vivid and might attract more clicks, but Gemini decided to make up a dude with a beard and place him in the image rather than use my suave and sophisticated visage. And it spelled "FIRE" as "FCIRE." And so it goes.

Screenshot by David Gewirtz/ZDNET and Gemini

If you want to quickly get an understanding of a video or ask questions about it, Gemini is definitely the best way to go. If you want to create thoughtful imagery from the video, Gemini is not yet up to the task.

ChatGPT: sorta, with a bit more work

Then there's ChatGPT, which is more like a bad news, good news story.

First, the bad news. ChatGPT itself failed my tests. It couldn't read the YouTube link. And while it theoretically can understand videos, they need to be under 500MB in size. Mine were not.
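If you want to avoid a failed upload, the size ceiling is easy to check ahead of time. Here's a minimal Python sketch; the 500MB figure is the limit described above, and the byte math uses the two local files from this test:

```python
# Hypothetical pre-upload check. The 500 MB ceiling is the limit cited above;
# everything else here is illustrative.
LIMIT_MB = 500

def fits_upload_limit(size_bytes: int, limit_mb: int = LIMIT_MB) -> bool:
    """Return True if a file of size_bytes is small enough to upload directly."""
    return size_bytes <= limit_mb * 1024 * 1024

# The two local test files from this article:
drone_mp4 = 625 * 1024 * 1024          # 625 MB MP4
walk_and_talk = int(1.65 * 1024 ** 3)  # 1.65 GB MOV

print(fits_upload_limit(drone_mp4))      # False
print(fits_upload_limit(walk_and_talk))  # False
```

Both files blow past the limit, which is why the direct upload route was a dead end here.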

Then, the good news. If you combine the OpenAI Codex app with ChatGPT, it can do some very cool stuff. Codex is the agentic workhorse of OpenAI's offerings. I kind of liken Codex to that handy neighbor we all know who can build almost anything out of almost anything.

Also: I stopped using ChatGPT for everything: These AI models beat it at research, coding, and more

Codex was able to read both local files and understand their meaning. For the drone test, Codex reported:

It looks like a backyard drone test shot.

What's happening:

  1. A person stands in a residential backyard and faces the camera/drone.
  2. They gesture a few times (including a hand raise/wave-like motion).
  3. The camera viewpoint moves around them over time, changing angle and distance while keeping them mostly centered.
  4. The sequence ends still in the same yard area, with no major scene change or additional people/activities.

Yep. That's what it's about.

Codex was also able to watch and understand the walk-and-talk MOV file I uploaded. Initially, it couldn't process the content, so Codex asked permission to install some Python code and libraries to allow it to do audio transcription. Once it built all that out, it was able to not only view the video, but also understand its context.

But then Codex couldn't watch the YouTube stream. OK, fine. This is Codex. So, I asked, "Can you download the full video and then work on it locally?"

That worked. It automagically wrote a Python script, installed some libraries, helped invent video-downloading technology on the fly, and then watched my YouTube video.
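The article doesn't show the script Codex generated, but the core of a download-then-analyze step can be sketched in a few lines. This sketch assumes the widely used yt-dlp downloader; whether Codex actually reached for yt-dlp is my guess, and the URL is a placeholder:

```python
import shlex

def build_download_cmd(url: str, out_path: str = "video.mp4") -> list[str]:
    """Assemble a yt-dlp invocation that saves a single MP4 for local analysis.
    yt-dlp and these flags are real; that Codex used this tool is an assumption."""
    return [
        "yt-dlp",
        "-f", "mp4",     # prefer a single MP4 stream
        "-o", out_path,  # fixed output name the analysis step can rely on
        url,
    ]

cmd = build_download_cmd("https://www.youtube.com/watch?v=PLACEHOLDER")
print(shlex.join(cmd))
# To actually run it (requires yt-dlp on PATH):
#   subprocess.run(cmd, check=True)
```

Once the file is local, it's just another MP4, which is exactly why this workaround sidesteps the streaming problem.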

But then I wanted it to create a thumbnail. I first asked if it had access to ChatGPT Images 2.0 (remember, they're both OpenAI tools). It responded, "I have access to image generation tools in this session, but I don't have a tool explicitly labeled Images 2.0 exposed to me."

Also: I tested ChatGPT and Perplexity AI as my CarPlay voice assistants - both made Siri look bad

Let's just ignore the unfortunate "exposed to me" phrasing. I had to explain to Codex that Images 2.0 was a thing, and point it to OpenAI's site for it to understand. At that point, the agentic tool was aware of the images tool, but still couldn't do much with it.

So, that's when I acted as the conduit between Codex and ChatGPT. I told Codex, "Choose a single frame for the maximum impact as a YouTube thumbnail, export that thumbnail somewhere so ChatGPT can get to it, or so I can upload it to ChatGPT, and then, based on context of the video and my existing YouTube thumbnail style, write a prompt for ChatGPT to create a high-click-value thumbnail."

Then, in ChatGPT, I uploaded the original thumbnail image I showed you earlier, and the frame that Codex chose. I then pasted in the prompt Codex created. As you can see, Codex/ChatGPT got more right than Gemini did.

Screenshot via ChatGPT and Codex by David Gewirtz/ZDNET

It picked up on the white, yellow, and black color scheme for my lettering. It didn't include my logo, and it didn't include the yellow stripe I use for my titles, but I actually quite like the juxtaposition of my picture over the torch flame. ChatGPT and Codex actually used my image, unlike Gemini. But I do take issue with the aluminum bar. I used flat material. For some reason, the OpenAI tools decided to make it into square tubing.

Also: I tried ChatGPT Images 2.0: A fun, huge leap - and surprisingly useful for real work

Here's where ChatGPT's better image knowledge comes into play. I corrected it on the tubing vs. bar situation and asked it to regenerate. I prompted, "That aluminum is flat bar material about 1/8-inch thick, not square tubing. Keep everything else, but please fix the aluminum."

Screenshot via ChatGPT and Codex by David Gewirtz/ZDNET

We were close. I didn't like how it placed the Sharpie marks (which are used to tell when the metal is hot enough to bend), and the actual bend was far too sharp. One more prompt: "Good, but the bend is too sharp. It's not a perfect right angle. There's a curve because the aluminum needs to flex as it bends. Please revise. Also, the sharpie marks are perpendicular to the edge of the bar, not on an angle. They indicate where to bend."

Screenshot via ChatGPT and Codex by David Gewirtz/ZDNET

That's good enough. I think it's possible to feed Codex and ChatGPT a video with no additional context and get out a YouTube thumbnail. You could probably use it to analyze other types of videos and produce images from those as well.

It's not super-convenient, but it does work pretty well.

AI can indeed watch video

There are a few things to note. First, the AIs were able to fully interpret the videos in much less time than their actual play time. Both the science video and the walk-and-talk are about 15 minutes long, but Gemini and ChatGPT were each able to "watch" and parse them for understanding in roughly two or three minutes.

Second, both show fairly powerful interpretation skills. I found their ability to understand that the silent video I gave them was a drone test to be rather impressive. The drone mostly stayed at human height, yet they were both able to ascertain context from the frames in the video.

There are certainly some practical uses. I gave Gemini a YouTube video of a CBS report on the OpenAI trial and asked it to provide me details about what was discussed. I can definitely see using it to scan through security camera video to find a specific type of action quickly.

I can also definitely see giving the AI a longer video and having it pull out the major points. What was particularly useful is that Gemini time-stamped each of the key thoughts, so I could just click the time stamps and drop into the video at that point.
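Those time stamps map directly onto YouTube's `t` URL parameter, so a summary like Gemini's can be turned into jump links. A small sketch (the video ID is a placeholder, not one of the videos from this article):

```python
def hms_to_seconds(stamp: str) -> int:
    """Convert a 'MM:SS' or 'HH:MM:SS' time stamp to total seconds."""
    secs = 0
    for part in stamp.split(":"):
        secs = secs * 60 + int(part)
    return secs

def timestamp_link(video_id: str, seconds: int) -> str:
    """Build a YouTube URL that starts playback at the given offset.
    The t= parameter is YouTube's standard start-time query."""
    return f"https://www.youtube.com/watch?v={video_id}&t={seconds}s"

print(timestamp_link("PLACEHOLDER", hms_to_seconds("12:34")))
# https://www.youtube.com/watch?v=PLACEHOLDER&t=754s
```

That's essentially what Gemini's clickable stamps do for you automatically.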

Then, of course, there's the actual use of these tools to create YouTube thumbnails. I still prefer to do it by hand. But the fact that these AIs can extract usable frames and construct thumbnails means that creators have a new tool at their disposal.

Also: I used Claude Code to vibe code a Mac app in 8 hours, but it was more work than magic

Overall, I'm impressed with Gemini and the pairing of ChatGPT and Codex for video-watching ability. Isn't it interesting that Gemini doesn't need two tools (after all, it is called "Gemini"), but ChatGPT needs Codex? Things like this amuse me.

Even though Claude bombed at this test, it still has value. Claude is one of my favorites for vibe coding.

What productivity benefits can you see getting from the video-watching capabilities of these AIs? Let us know in the comments below.

You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.


