Data: My Unexpected Nemesis
When we started building a machine learning model for fraud detection, I assumed the hard part would be the model.
It turns out the real challenge was something much less glamorous: getting the data.
Building machine learning systems in the real world is messy. What surprised me most after working on fraud ML systems is that the hardest problems often have nothing to do with the model itself.
More often, they come down to figuring out what data you need, whether you can trust it, and how to actually get it.
Getting the data caused more delays in our project than almost anything else.
Training data is everything, but it's messy
If you're building a supervised machine learning model, you obviously need training data.
Simple enough in theory.
In practice, fraud data is messy.
Every bank defines fraud a little differently. What one institution labels as confirmed fraud might be categorized differently somewhere else. Even when you get labels, you still have to take them with a grain of salt.
So right away you're dealing with:
- inconsistent definitions
- imperfect labels
- data that often arrives weeks or months after the actual activity
Not exactly the clean dataset you see in ML tutorials.
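To make the labeling problem concrete, here is a minimal sketch of one way to handle it in Python. Everything specific in it is an assumption for illustration: the bank names, the label strings, and the 90-day maturity window are made up, and real mappings would have to come from each institution. The idea is to map each bank's vocabulary onto one shared label, drop labels that are ambiguous, and hold back recent "not fraud" labels that may simply be fraud nobody has reported yet.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Hypothetical per-bank label vocabularies mapped to a shared scheme.
# 1 = fraud, 0 = not fraud, None = too ambiguous to train on.
LABEL_MAP = {
    "bank_a": {"confirmed_fraud": 1, "legit": 0, "suspected": None},
    "bank_b": {"chargeback": 1, "cleared": 0, "under_review": None},
}

# Assumed maturity window: fraud reports can lag the transaction,
# so a recent "not fraud" label may just be unreported fraud.
LABEL_MATURITY = timedelta(days=90)


@dataclass
class Transaction:
    bank: str
    txn_date: date
    raw_label: str


def training_label(txn: Transaction, as_of: date) -> Optional[int]:
    """Return 1, 0, or None if the row should be left out of training."""
    mapping = LABEL_MAP.get(txn.bank, {})
    label = mapping.get(txn.raw_label)
    if label is None:
        # Unknown bank, unknown label, or a label marked as ambiguous.
        return None
    if label == 0 and as_of - txn.txn_date < LABEL_MATURITY:
        # Negative labels are only trusted once the window has passed.
        return None
    return label
```

Positive labels are kept immediately in this sketch, while negatives have to age first. That asymmetry is a design choice, not a rule, and the right window depends on how long reports take to arrive at a given bank.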
The chicken-and-egg problem of data fields
Another challenge we ran into was figuring out which data fields we actually needed.
To build a strong model, you want access to lots of signals.
But when you're working with banks, you can't just say:
"Send us everything."
You have to ask for specific fields.
The problem is that early on you don’t yet know which fields will matter most for the model.
So you end up in a chicken-and-egg situation.
You need data to figure out what features matter.
But you need to know what features matter before asking for the data.
This alone caused a lot of delays early on.
Eventually we realized we just had to start somewhere. The model didn’t need to be perfect from day one. We just needed enough data to begin learning.
Why we deployed the model in shadow mode
One thing that helped was introducing an intermediate step.
Instead of waiting until the model was perfect before deployment, we worked with a few clients who were willing to act as design partners.
We deployed the model in shadow mode. That means it ran in the background without affecting real decisions.
This gave us a way to:
- test the model in real environments
- collect better data
- iterate quickly with clients
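For readers who haven't seen shadow mode before, here is a rough sketch of the pattern in Python. The function and parameter names (decide_with_shadow, production_decision, shadow_model, shadow_log) are hypothetical, and an in-memory list stands in for whatever storage a real deployment would use. The existing decision logic stays in charge, and the model's score is only recorded alongside it for later comparison.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger("shadow")


def decide_with_shadow(
    txn: Dict[str, Any],
    production_decision: Callable[[Dict[str, Any]], str],
    shadow_model: Callable[[Dict[str, Any]], float],
    shadow_log: List[Dict[str, Any]],
) -> str:
    """Return the production decision; record the shadow score on the side."""
    # The existing system makes the real call, exactly as before.
    decision = production_decision(txn)

    try:
        # The model scores the same transaction, but the score is only logged.
        score = shadow_model(txn)
        shadow_log.append(
            {"txn_id": txn.get("id"), "decision": decision, "shadow_score": score}
        )
    except Exception:
        # A failing shadow model must never change or block the real decision.
        logger.exception("shadow model failed; production decision unaffected")

    return decision
```

The logged pairs of production decision and shadow score are what make it possible to see where the model would have disagreed with the current system before it is allowed to affect anything.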
Getting data from banks takes time
One thing you learn quickly working with financial institutions is that there’s no universal system for how data is stored.
Different banks:
- store data differently
- have different internal teams responsible for different systems
- have different levels of technical maturity
So even if a client wants to help, it can still take time to actually get the data.
Sometimes a lot of time.
Clients need a reason to send you data
Another thing we learned: clients won’t send you data just because you ask for it.
You have to give them a clear reason.
You can’t say:
"Hey, could you send us this data so we can train our model?"
Instead, you have to paint a picture of what the model will eventually do for them, even if the model doesn’t exist yet.
You’re selling the vision.
How will this help them detect fraud earlier?
Reduce false positives?
Protect their business?
Once clients believe in that vision, they are much more willing to work with you.
TL;DR (if you take nothing else away)
Ask for data earlier than you think you need it.
Getting the right data can take weeks or even months, especially inside large financial institutions.
Make sure you're talking to the people who actually own the data.
Banks are big organizations, and the person evaluating your product is often not the person responsible for the systems you need data from.
Start with imperfect data.
Your model doesn’t need to be perfect to begin learning. Sometimes you just need to pick a few clients and start.
Design partners are invaluable.
Running the model in shadow mode with a few cooperative clients helped us iterate much faster.
Sell the vision.
Clients are much more willing to send data when they understand how the model will ultimately benefit them.
Hopefully some of this is helpful if you're building machine learning systems with real-world data!
And once data stops being your nemesis, it’s actually pretty cool what you can build with it.