Training a Neural Net to Find Puss in Boots

Close-up of Puss in Boots drinking milk from a shot glass, wearing his signature hat

I want to fine-tune an image generation model on Puss in Boots. That means I need 50 to 100 good stills of the character. The movie is 98 minutes long. I am not going to sit there and screenshot by hand.

So I trained a binary classifier to do it for me, wired it up to OBS, and let it watch the movie while I did other things. Here’s how that went.

Step 1: Get some frames to label

First problem: you need labeled data to train a classifier, but the whole point of the classifier is to avoid labeling by hand. Chicken McCrispy, meet Egg McGriddle. The answer is bootstrapping: label a few images by hand, train, then use the model itself to find the frames worth labeling next. I started by extracting 500 random frames from the movie and manually sorting them into puss/not-puss folders.

import random

import cv2

# _video_info opens the capture and returns (cap, sample_aspect_ratio, fps,
# duration); _correct_frame undoes non-square pixels so frames aren't stretched.
def random_sample(video_path, output_dir, n=500, seed=42):
    random.seed(seed)
    cap, sar, fps, duration_sec = _video_info(video_path)
    timestamps = sorted(random.uniform(0, duration_sec) for _ in range(n))

    for ts in timestamps:
        cap.set(cv2.CAP_PROP_POS_MSEC, ts * 1000)
        ret, frame = cap.read()
        if not ret:
            continue
        frame = _correct_frame(frame, *sar)
        filename = f"frame_{ts:08.2f}s.jpg"
        cv2.imwrite(str(output_dir / filename), frame,
                    [cv2.IMWRITE_JPEG_QUALITY, 95])

    cap.release()

After that, I trained the model (more on that below), fetched 5,000 more frames, scored them, and went back through the ones it got wrong with high confidence. A correct 95% call on yet another image of Puss in Boots tells me nothing; the model clearly already has that information. Marking a single shot of Perrito at 95% puss? That’s new information.
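The bookkeeping for that relabeling pass can be sketched like this. The scoring loop itself is elided, and `review_name` / `flag_for_review` are my names for the idea, not code from the project — the point is the filename scheme (confidence first, then frame number), which is what makes the confident mistakes jump out in a file manager:

```python
def review_name(prob: float, frame_idx: int) -> str:
    """Confidence-first filename so a file manager groups the worst offenders."""
    return f"{prob:.4f}_frame_{frame_idx:06d}.jpg"

def flag_for_review(scores, threshold=0.90):
    """scores: (frame_idx, prob) pairs from the classifier.
    Returns filenames for high-confidence 'puss' calls, most confident first.
    Anything in this list that *isn't* Puss is a valuable training example."""
    hits = [(p, i) for i, p in scores if p >= threshold]
    return [review_name(p, i) for p, i in sorted(hits, reverse=True)]
```

Sorting the folder by name then puts the 0.98-something frames at the top, exactly the ones worth a second look.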

File manager grid showing high-confidence false positives. Each filename starts with the classifier confidence score followed by a frame number.
High-confidence false positives. Each filename starts with the model’s confidence score (e.g. 0.9872), followed by a frame number. These frames scored above 90% “puss” but contain no Puss in Boots — dark scenes, other characters, extreme close-ups of eyes. Good candidates for relabeling into the training set.

I moved borderline cases and confident mistakes into the training folders and retrained. After a few rounds: 1,265 labeled frames total, 901 puss and 364 not-puss.

Step 2: What the classifier has to learn

This isn’t as simple as “find the orange cat.” The movie has other cat characters. Kitty Softpaws is also orange-ish and shows up in many of the same scenes. The classifier has to distinguish Puss specifically, across different lighting, angles, and scales (sometimes he’s a tiny figure in a wide shot, sometimes it’s an extreme close-up).

Puss in Boots standing on a table with sword drawn, used as positive training example for the classifier

puss = 1
A nobleman character from the movie — clearly not Puss in Boots

puss = 0

Step 3: Training

I went with ResNet18 and the standard fine-tuning workflow: take a pretrained model, freeze most of it, unfreeze the last residual block, and swap in a new classification head.

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1)
  (layer1): Sequential(...)  # frozen
  (layer2): Sequential(...)  # frozen
  (layer3): Sequential(...)  # frozen
  (layer4): Sequential(       # unfrozen
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (bn1): BatchNorm2d(512)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn2): BatchNorm2d(512)
    )
    (1): BasicBlock(...)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=1, bias=True)  # replaced
)

One output neuron, BCEWithLogitsLoss, 15 epochs on CPU. I used weighted random sampling because my classes were imbalanced (more puss than not-puss, which, fair enough, he is the main character).

import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze everything except the last residual block and the head.
for name, param in model.named_parameters():
    if not name.startswith("layer4") and not name.startswith("fc"):
        param.requires_grad = False

# Replace the 1000-way ImageNet head with a single logit: puss or not.
model.fc = nn.Linear(model.fc.in_features, 1)
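The class-imbalance fix works by giving each sample a weight inversely proportional to its class frequency; those weights then feed `torch.utils.data.WeightedRandomSampler` so each batch draws both classes roughly equally. A minimal sketch of the weight computation (the helper name is mine; the counts are the ones from my dataset):

```python
from collections import Counter

def sample_weights(labels):
    """Inverse-frequency weight per sample, so a weighted sampler
    draws both classes roughly equally despite the imbalance."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

# 901 puss (1) vs 364 not-puss (0): each not-puss frame
# is drawn about 2.5x as often as each puss frame.
labels = [1] * 901 + [0] * 364
weights = sample_weights(labels)
# sampler = torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights))
```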

Out of 11.2 million parameters, 8.4 million were trainable. Validation accuracy hit 92.1% at best, though that’s partly on me: I went a little overboard with how difficult some of the labeled images were.

Here’s what 92% looks like in practice. Both of these were labeled puss = 1 in the training set:

Kitty Softpaws and Puss in Boots together in a fire scene

puss = 1 (he’s in there, behind Kitty)
Wide shot of a room with Puss in Boots barely visible at the left edge

puss = 1 (hat visible at the left edge)

The movie is 2.39:1 widescreen, but ResNet takes 224×224 square inputs. Every frame gets resized to fit that square, so widescreen shots get squeezed horizontally. In a wide shot where Puss is a small figure at the edge of the frame, he might occupy 20 pixels of the input tensor. The model still has to learn that those 20 pixels count. These borderline cases are part of why accuracy plateaus at 92% instead of 99%, and also why 92% is fine for my purposes. The hard cases are genuinely hard.
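To put numbers on that squeeze (the 1920×804 source resolution is an assumption for illustration, but it is a common 2.39:1 encode): width gets compressed by a factor of about 8.6 while height only shrinks by about 3.6, so a character 170 source-pixels wide ends up around 20 pixels in the input:

```python
src_w, src_h = 1920, 804   # a typical 2.39:1 encode (assumed for illustration)
net_w = net_h = 224        # ResNet input size

# Non-uniform resize: width is squeezed much harder than height.
scale_w = net_w / src_w    # ~0.117, i.e. an 8.6x shrink
scale_h = net_h / src_h    # ~0.279, i.e. a 3.6x shrink

char_src_px = 170          # a small figure at the edge of a wide shot
char_net_px = round(char_src_px * scale_w)
print(char_net_px)         # ~20 of the 224 input columns
```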

That’s not going to win any competitions, but I don’t need it to. I just need it to catch most frames of my dear Puss in Boots so I can sort through a smaller pile by hand instead of watching the whole movie frame by frame.

Step 4: The source quality question

Before building the live capture I got sidetracked wondering whether my video source was high enough quality for LoRA training. I spent a while comparing different copies and resolutions, checking codecs, obsessively alt-tabbing between frame grabs. At one point I was pricing USB Blu-ray drives.

Puss in Boots frame playing in a video player, used as source material for the frame extraction pipeline
Frame grab from the source video. Good enough?

Then I stopped and thought about it for a second. LoRA training data doesn’t need to be 4K. It needs variety: different poses, angles, lighting. A 98-minute movie has plenty of that regardless of resolution. I was solving the wrong problem.

Step 5: Live capture

This part I’m proud of. The beauty of PyTorch is flexibility: you can implement exotic logic and keep everything editable. Give that flexibility up at inference time and you can run something much lighter: export the trained model to ONNX and you don’t need PyTorch at runtime at all, just onnxruntime and OpenCV. A future project: see how light a system I can get a useful ONNX model running on.

Open a Jupyter notebook. Point it at the OBS virtual camera. Every frame gets run through the model. Anything above 85% confidence gets saved to disk, with a one-second cooldown to avoid saving the same frame fifty times.

import time
from datetime import datetime

import cv2

# VIDEO_SOURCE (the OBS virtual camera index) and SAVE_DIR (an output Path)
# are defined earlier; predict() runs the classifier on a single frame.
last_save_time = 0.0

cap = cv2.VideoCapture(VIDEO_SOURCE)

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        continue

    prob = predict(frame)
    now = time.time()

    # Save confident hits, with a one-second cooldown between saves.
    if prob >= 0.85 and (now - last_save_time) >= 1.0:
        ts = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
        filename = SAVE_DIR / f"puss_{ts}_{prob:.3f}.png"
        cv2.imwrite(str(filename), frame)
        last_save_time = now

I added an overlay that shows up in the notebook cell: green text with the confidence percentage when Puss is on screen, red when he’s not. There’s a “[SAVED]” flash and a running count. So you can watch it work. Hit play in one window. Let the notebook chug in another. Go make coffee. Come back to 166 screenshots of a small orange cat in a hat.

Puss in Boots standing with Kitty Softpaws and Perrito, captured by the classifier at 94.3% confidence
The classifier handles group scenes. 94.3% with two other characters in frame.

Results

166 captures from one sitting. Confidence scores between 0.851 and 0.998. Most of them look good. The ones that don’t are motion blur: the classifier sees enough orange to think “that’s him” but the frame is a smear. Fair enough.

A blurry frame with motion blur showing mostly furniture, incorrectly captured by the classifier at 86.1% confidence
86.1% confidence. I think that’s a boot? The classifier is being generous.

I ran perceptual hashing over the keepers to drop near-duplicates (distance threshold of 10), and ended up with about 70 distinct frames. That’s the LoRA dataset. Next I need to caption them and train the image model. But that’s a different project and a different post.
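The dedup idea can be sketched with a minimal average hash: shrink each frame to 8×8 blocks, threshold at the mean, and drop any frame within Hamming distance 10 of one already kept. This is an illustration of the technique, not necessarily the exact hash I ran (a library like imagehash implements pHash and friends properly):

```python
import numpy as np

def average_hash(gray, hash_size=8):
    """64-bit average hash of a 2D grayscale array: downsample to
    hash_size x hash_size block means, threshold at the overall mean."""
    h, w = gray.shape
    gray = gray[:h - h % hash_size, :w - w % hash_size]
    blocks = gray.reshape(hash_size, gray.shape[0] // hash_size,
                          hash_size, gray.shape[1] // hash_size).mean(axis=(1, 3))
    return (blocks > blocks.mean()).flatten()

def hamming(a, b):
    return int(np.count_nonzero(a != b))

def dedupe(hashes, threshold=10):
    """Keep a hash only if it differs from everything kept so far."""
    kept = []
    for h in hashes:
        if all(hamming(h, k) > threshold for k in kept):
            kept.append(h)
    return kept
```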

Puss in Boots standing in the rain wearing his hat, captured automatically by the classifier with 97.6% confidence
97.6% confidence. The hat helps.
