Sometimes you stare at two datasets long enough that you convince yourself there’s a connection between them. Not because there is, but because the question needing an answer is important enough that the data “should” be connected. That’s a dangerous place to start a modeling project.
This is one such story. Enter multiple-instance learning, and how I failed spectacularly even on simulated data.
Business Context
Where I work, there’s a hodge-podge of overlapping systems, and gaps between systems, for operations, but one thing we measure obsessively is everything about incoming deliveries. Duration from arrival to dock. Time from dock to unload complete. Wait time before anyone touches the truck. Eight duration metrics on every single delivery, thousands of deliveries a day.
We also run satisfaction surveys. Happy drivers delivering to your sites isn’t just a people thing. If you have a reputation for detaining drivers for hours without so much as a portable restroom, you’re going to quickly find yourself paying higher rates to bring freight in. Making your company a place people want to come is as much good people sense as it is good business.
Not everyone takes the survey. It’s voluntary, and the surveys come in throughout the day, timestamped but not tied to any specific delivery. That’s by design (there are reasons, which surfaced in conversations with a few drivers, that I can’t go into). But it also means driver A fills one in right after unloading, while driver B does it two hours later at a rest stop, or the next morning. We don’t know which delivery prompted the response.
We don’t want to know which delivery prompted the response, and that isn’t the important part. We would like to know what conditions prompted it. If it turned out that, for whatever reason, blue paint in the waiting room made drivers happy? You’d find a line from Home Depot for blue paint on my next corporate card statement.
The question I wanted to answer: can we link those two data sources? If we could add contextual information to the survey, we could identify the operational metrics that actually matter to drivers. Instead of guessing that long wait times hurt satisfaction, we’d have data. We could then focus on the metrics that matter, call them out in the Power BI dashboards for the receiving process (“truck waiting time went up”), and fold them into a balanced scorecard.
I decided to throw a neural network at it. This is the story of why that was the wrong tool for the job.
The Architecture
The problem breaks down into two pieces you have to solve at once. You get a survey with a timestamp and a score, and somewhere in the hours before that survey, there’s a set of deliveries that could have caused it. Which one was it? And what about that delivery made the driver rate it the way they did? You can’t answer one without the other. To learn what drives bad scores you need to know which delivery conditions provoked it, but to know which delivery to look at you need to know what bad scores look like. Chicken and egg.
I went at this two ways. First attempt was two networks trained together. One network looks at all the candidate deliveries in a time window and assigns probability weights to each one, like saying “I think it was 60% likely to be delivery #47 and 25% likely to be delivery #52.” The other network takes a delivery’s three duration metrics and tries to predict the survey score. They share a loss function, so when the score predictor gets it wrong, that error signal also teaches the matcher to pick better candidates next time.
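Stripped of the training loop, the shared-loss idea looks roughly like this. This is a NumPy sketch, not my actual model: the weights are random placeholders, and names like `W_match` and `W_score` are mine for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy untrained parameters standing in for the two networks.
W_match = rng.normal(size=3)   # matcher: 3 duration metrics -> relevance logit
W_score = rng.normal(size=3)   # predictor: 3 duration metrics -> score offset
b_score = 3.0                  # bias near the middle of the 1-5 scale

def forward(candidates, true_score):
    """One survey's window: candidates is an (n, 3) array of duration metrics."""
    weights = softmax(candidates @ W_match)        # "60% delivery A, 25% B, ..."
    per_delivery = candidates @ W_score + b_score  # score each candidate implies
    pred = weights @ per_delivery                  # expected score under matcher
    loss = (pred - true_score) ** 2                # one loss trains both networks
    return weights, pred, loss

candidates = rng.uniform(5, 60, size=(4, 3))
weights, pred, loss = forward(candidates, true_score=2.0)
```

Because the predicted score is an expectation over the matcher’s probability weights, the error gradient flows through both sets of parameters at once, which is the whole point of training them together.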
Second attempt used something called Multiple Instance Learning, where you treat all the candidate deliveries as a bag. Instead of picking one candidate, the model weighs the whole set, builds a blended representation, and predicts the score from that. More mathematically principled for this kind of “I don’t know which item in the group is the important one” problem.
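The bag-pooling step can be sketched in a few lines. Again a hedged toy: the tanh-attention form is one standard MIL pooling choice, the hidden size `H` is arbitrary, and the projection matrices are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(1)

D, H = 3, 8                          # 3 duration metrics; hidden size is my choice
V = rng.normal(size=(H, D)) * 0.1    # attention projection (untrained placeholder)
w = rng.normal(size=H) * 0.1         # attention query vector (placeholder)

def mil_pool(bag):
    """bag: (n, D) array, one row per candidate delivery in the window."""
    scores = np.tanh(bag @ V.T) @ w   # one attention score per instance
    a = np.exp(scores - scores.max())
    a /= a.sum()                      # softmax over the whole bag
    blended = a @ bag                 # weighted blend the score head consumes
    return a, blended

bag = rng.uniform(5, 60, size=(40, D))
a, blended = mil_pool(bag)
```

No single delivery is ever “picked”; the score head only ever sees the blended row, and the attention weights are the model’s implicit answer to “which instance mattered.”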
Both are reasonable approaches. Neither was why things went sideways.
The Synthetic Proof-of-Concept
The exercise was built around proving whether this approach could extract the signal from the noise when I knew there was a signal. I built a synthetic dataset with a known ground truth: 1,000 deliveries, 200 surveys, a 30-minute candidate window. The scoring rule was deterministic: start at score 5, subtract 2 if duration_1 exceeds 45 minutes, subtract 1 if duration_2 exceeds 35 minutes, subtract 1 if duration_3 exceeds 10 minutes, floor at 1.
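The whole ground-truth rule fits in a few lines of plain Python:

```python
def survey_score(duration_1, duration_2, duration_3):
    """Deterministic rule used to label the synthetic surveys (durations in minutes)."""
    score = 5
    if duration_1 > 45:
        score -= 2
    if duration_2 > 35:
        score -= 1
    if duration_3 > 10:
        score -= 1
    return max(score, 1)   # floor at 1
```

That’s the entire function the model was being asked to discover.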
Each survey was generated by randomly selecting a delivery and adding 2-30 minutes of delay. So I knew exactly which delivery caused each survey and I knew the exact formula that produced each score. No noise. No ambiguity. A few candidates per survey because the window was tight. If the model couldn’t crack this, it couldn’t crack anything.
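The generation step is equally simple. A sketch of the idea, with the data shapes (`id`, minute-of-day finish times) being my illustration rather than the real schema:

```python
import random

rng = random.Random(42)

# Hypothetical shape: each delivery is (id, minute of day it finished unloading).
deliveries = [(i, rng.uniform(0, 24 * 60)) for i in range(1000)]

def make_survey(deliveries):
    """Pick a delivery at random; stamp the survey 2-30 minutes after it finished."""
    delivery_id, finished = rng.choice(deliveries)
    return {"cause": delivery_id, "timestamp": finished + rng.uniform(2, 30)}

surveys = [make_survey(deliveries) for _ in range(200)]
```

Because the generator records `cause`, every survey carries its own answer key, which is exactly what makes this a fair test of the matcher.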
It got 85% of scores right and matched the correct delivery 80% of the time. Sounds decent until you remember this is a cheat sheet test. The formula is deterministic and there are maybe 4 candidates to pick from. Missing 15% of scores on that is not great. I looked at the training curves and it was classic overfitting. Train set accuracy going up, test set stuck and jittering around 75%. The model was fitting to the training examples rather than learning the pattern.

That was the first red flag and I mostly ignored it.
Scaling Up and Falling Apart
I then tried to make the data look more like what we’d actually face. Scaled to 10,000 deliveries and 5,000 surveys. Widened the candidate window to 600 minutes. In practice, drivers don’t fill in surveys within 30 minutes. They do it hours later, sometimes the next day. A 600-minute window gave me about 40 candidates per survey instead of 4.
Results: 35.9% score prediction accuracy, 2.7% delivery matching.
Five score categories means random guessing gets you 20%. We barely beat random on scores. And 2.7% matching against 40 candidates is literally random guessing (1 in 40 is 2.5%). The model trained for 200 epochs and came out the other side knowing nothing it didn’t know before epoch 1.
I switched to the MIL architecture. Loss went from 2.3 down to 1.6 over 65 epochs. Looks like progress on paper, but it’s a trap we face at work too: staring at the loss curve instead of asking what the model’s predictions, good and bad alike, will mean to the people in operations who have to act on them. Trust is gained by the drop, lost by the bucket. I pulled out the deliveries the attention mechanism focused on most for each test survey and grouped them by score level.
Score 1 deliveries had average durations of 10.5, 7.6, 27.1. Score 5 deliveries had average durations of 11.7, 7.5, 26.5. Basically the same numbers. The attention wasn’t locking onto anything meaningful. It picked whoever was convenient, and the score predictor just learned to always say “about 3.5” because that minimizes your loss when you have no real information.
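The diagnostic itself is nothing fancy. With toy records shaped like mine (score, plus the duration metrics of the delivery the attention weighted highest; values here are just the averages quoted above, repeated):

```python
from collections import defaultdict
from statistics import mean

# Toy records: (survey score, top-attention delivery's three duration metrics).
attended = [
    (1, (10.5, 7.6, 27.1)),
    (1, (10.5, 7.6, 27.1)),
    (5, (11.7, 7.5, 26.5)),
    (5, (11.7, 7.5, 26.5)),
]

by_score = defaultdict(list)
for score, durations in attended:
    by_score[score].append(durations)

# Average each duration metric within a score level.
profiles = {s: tuple(mean(col) for col in zip(*rows))
            for s, rows in by_score.items()}
```

If the attention were latching onto anything real, the score-1 and score-5 profiles would separate. Mine didn’t.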
What I Learned
Three problems killed this, but any one of them would have sufficed.
The mechanics. The matching is looking for a needle in a haystack where all the hay looks exactly like the needle. Forty candidates in a window, all with three duration metrics drawn from the same distributions. The correct delivery has no distinguishing mark. The only thing that makes it “correct” is that its durations happen to match the scoring formula, but the model doesn’t know the formula yet because that’s what it’s trying to learn. It’s stuck in a loop. You’d need something like a delivery ID on the survey, and if you had that, you wouldn’t need a model at all.
The data collection. This one took me too long to see. A driver filling out a survey isn’t reacting to one delivery. They’re reacting to their morning. Their week. How things have been going in general at your site. The whole premise of “which delivery caused this score” assumes a one-to-one link that doesn’t exist. From reading the comments on the surveys, I noticed that 1s and 5s were reserved for drivers who either had been consistently failed by the site or, more commonly, were saying “this is my first experience of the site.” The 2–4 range was more nuanced. The survey is a thermometer, not a receipt.
The business context. Even if you could match perfectly, three duration numbers aren’t enough to explain why someone rates a 3 versus a 4. Driver experience depends on how the staff treated them, dock conditions, whether their paperwork was ready, the weather. Duration is a proxy for some of that (long waits often signal a disorganized operation), but a rough one. Predicting 5-level satisfaction from three timing features was always going to cap out.
The Actual Answer
The hypothesis behind all of this was something like: “if we reduce receiving time, driver satisfaction goes up.” That’s a perfectly testable idea. But not with a model.
Building a neural network to reverse-engineer causality from observational data is the hard way to answer this. The easy way: pick a set of sites, implement process changes to cut receiving time at half of them, leave the rest as controls, compare survey scores three months later. If cutting time from 45 minutes to 25 moves the average score from 3.2 to 3.8, there’s your answer. If it doesn’t budge, that’s also useful, and a lot cheaper than training models that converge to random.
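Even the analysis for that experiment is small. One simple way to check whether the pilot sites’ scores actually moved is a permutation test, stdlib only; the scores below are made up purely to show the shape of the comparison:

```python
import random
from statistics import mean

def permutation_test(treatment, control, n_iter=10_000, seed=0):
    """One-sided permutation test: how often does a random relabeling of the
    pooled scores produce a treatment-minus-control gap at least as large
    as the one we observed?"""
    rng = random.Random(seed)
    observed = mean(treatment) - mean(control)
    pooled = list(treatment) + list(control)
    k = len(treatment)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        if mean(pooled[:k]) - mean(pooled[k:]) >= observed:
            hits += 1
    return observed, hits / n_iter

# Made-up monthly survey scores for pilot vs. control sites.
pilot = [4, 4, 3, 5, 4, 3, 4, 5, 3, 4]
control = [3, 3, 4, 2, 3, 3, 4, 3, 2, 3]
diff, p = permutation_test(pilot, control)
```

A spreadsheet would do the same job; the point is that the causal question gets answered by the experimental design, not by the sophistication of the analysis.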
That’s the unglamorous conclusion. I spent time on attention mechanisms and MIL architectures when the right approach was a spreadsheet and a pilot program. I was trying to shortcut around the hard part (actually changing operations and measuring the result) by mining historical data for patterns that would predict the outcome. But the signal was never in the data because nobody designed the data collection to put it there. Surveys and deliveries are two streams that happen to coexist in time. No amount of matrix multiplication will manufacture a causal link the measurement system never established.
Sometimes you just have to run the experiment. Rein in receiving time and see what happens. No model required.