The Data Mill

Data company Versium says it can bust fraudsters one email address at a time

Digital thieves can be difficult to spot before they strike. That's partly because traditional security methods haven't kept pace with technology, giving fraudsters a chance to exploit holes in the wall of fraud detection systems, said Chris Matty, CEO of Seattle-based Versium Inc., a predictive analytics startup. "I had my own identity stolen and someone tried to open up a credit card just a couple of weeks ago. Somehow, they had my Social Security number, date of birth and address," Matty said.

And, in fact, opening new accounts is the modus operandi for many a fraudster, according to Matty, because a fledgling account flies under the radar of traditional fraud detection tools. Brand-new customers haven't generated enough behavioral data for banks or businesses to spot a deviation from the norm, he said. But he's convinced Versium can change that.

The startup, founded in 2012, has amassed a database of what it calls "LifeData," an accumulation of 300 billion attributes (!) on everything from social, ecommerce and communication activity, to demographic and sociographic data. Today, Versium is applying this vast collection of data to fraud detection. All the company needs from a financial institution or an ecommerce business to trace fraud? "An email address," Matty said.

The email address, a common request when opening up a new account of any kind, becomes the foundation for a profile. It's assessed against the four billion email addresses Versium has collected -- 500 million in the United States alone -- and then cross-referenced with the 300 billion LifeData attributes. After the profile is built, it's fed into a machine-learning algorithm, which produces a risk score. "The higher the score, the lower the risk," Matty said.
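To picture the flow Matty describes, here's a minimal Python sketch of an email-keyed scoring step. Everything in it -- the attribute names, the lookup store and the toy model -- is an illustrative assumption, not Versium's actual LifeData schema or algorithm.

```python
# Hypothetical sketch of the email-to-risk-score flow described above.
# Attribute names and the toy linear "model" are stand-ins, not Versium's
# actual data schema or machine-learning algorithm.

def build_profile(email, attribute_store):
    """Look up stored attributes keyed by a normalized email address."""
    return attribute_store.get(email.strip().lower(), {})

def risk_score(profile, model):
    """Turn a profile into features and score it; higher score = lower risk."""
    features = [
        profile.get("social_activity", 0.0),
        profile.get("ecommerce_activity", 0.0),
        profile.get("email_age_years", 0.0),
        profile.get("address_consistency", 0.0),
    ]
    return model(features)

# Usage with a toy linear model standing in for a trained classifier:
store = {"jane@example.com": {"social_activity": 0.8, "email_age_years": 6.0}}
toy_model = lambda f: min(100, 40 + 30 * f[0] + 5 * f[2])
print(risk_score(build_profile(" Jane@Example.com ", store), toy_model))  # 94.0
```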

Based on that volume of data -- much of it behavior-oriented -- Versium can figure out what "normal" and "risky" look like on a scale that exceeds what many businesses and financial institutions can achieve with the systems they currently use.

The four C's of data quality

When it comes to data quality, "one man's gold is another man's garbage," said Ken Gleason, director of electronic trading product development at New York City-based Deutsche Bank Securities.

Data quality lives on a continuum, and businesses need to define what that continuum is, Gleason said: their requirements and criteria for the data, and what happens if the data doesn't meet them. "There's no single right answer for how to do this," Gleason said during the O'Reilly Media Inc. webinar "Data Quality Demystified: Knowing When Your Data is Good Enough." To help CIOs and IT leaders figure out where to draw the line between garbage and gold, Gleason created a framework called the four C's of data quality.

Complete: Is the data you're using for a particular report complete and, more importantly, do you have everything you need? If data or data fields are missing, ask yourself how relevant they are to the report and if the data or data elements need to be reacquired or can be omitted. This might be a place to build in rules that will help determine a threshold for how complete the data needs to be, Gleason said.
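As a rough illustration of what such a completeness rule might look like, here's a short Python sketch; the required fields and the 95% threshold are assumptions chosen for the example, not Gleason's prescription.

```python
# Sketch of a completeness rule: the required fields and the 0.95 threshold
# are illustrative assumptions.

REQUIRED_FIELDS = ["account_id", "amount", "trade_date"]

def completeness(records, required=REQUIRED_FIELDS):
    """Fraction of records carrying a non-empty value for every required field."""
    if not records:
        return 0.0
    complete = sum(all(r.get(f) not in (None, "") for f in required) for r in records)
    return complete / len(records)

rows = [{"account_id": 1, "amount": 10.0, "trade_date": "2014-03-01"},
        {"account_id": 2, "amount": None, "trade_date": "2014-03-01"}]
print(completeness(rows))          # 0.5
print(completeness(rows) >= 0.95)  # False -> reacquire the data or omit the field
```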

Coherent: Is the data consistent? Does it make sense when it all hangs together? If you detect a problem with data integration, determine whether it needs to be fixed or whether the affected data can be omitted. One way to make sure figures are solid is to check the data's "value integrity" -- that is, that "internal totals are consistent," Gleason said. "It's a simple check that we tend not to do until we see the finished totals."
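Here's what that kind of value-integrity check might look like as a few lines of Python; the tolerance value is an assumption added to absorb rounding.

```python
# Sketch of a "value integrity" check: do the line items still sum to the
# reported total? The tolerance is an assumption to absorb rounding.

def totals_consistent(line_items, reported_total, tolerance=0.01):
    """True if internal figures add up to the reported total."""
    return abs(sum(line_items) - reported_total) <= tolerance

print(totals_consistent([125.50, 74.25, 300.00], 499.75))  # True
print(totals_consistent([125.50, 74.25, 300.00], 510.00))  # False -> investigate
```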

Correct: Are the data values right? If the data values are invalid or out of sequence, figure out if they need to be corrected or if they can be omitted or simply flagged, Gleason said. "Correctness is going to be very domain-specific," he said. "Knowing the data, knowing the domain and understanding what is acceptable and what is not, you can get a long way in starting to define and measure this."
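A correctness check, then, is mostly a matter of encoding domain rules. The Python sketch below assumes a trading-style feed where prices must fall in a valid range and timestamps must arrive in order; both rules are illustrative, not drawn from Gleason's talk.

```python
# Sketch of domain-specific correctness rules: the price range and the
# in-order timestamp rule are illustrative assumptions for a trading feed.

def check_correctness(records):
    """Flag records whose values are out of range or out of sequence."""
    flagged = []
    last_ts = None
    for r in records:
        if not (0 < r["price"] < 1_000_000):
            flagged.append((r, "price out of valid range"))
        if last_ts is not None and r["timestamp"] < last_ts:
            flagged.append((r, "timestamp out of sequence"))
        last_ts = r["timestamp"]
    return flagged

feed = [{"price": 101.5, "timestamp": 1},
        {"price": -3.0, "timestamp": 3},
        {"price": 102.0, "timestamp": 2}]
print(check_correctness(feed))  # two records flagged
```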

Accountable: Where does the data come from, and who is responsible for keeping it in a good state? Gleason recommended creating an "intersystem map," a diagram that shows where the data is coming from, who owns it and when it's updated. Also, build a data validation step into the process. "Part of this is about making sure that over time your data stays fresh," Gleason said. Perform audits to stay on top of any data processing changes. "Go back, pick up the phone, send an email and ask if the data is still working the way you thought it was," he said.
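One way to keep that audit from becoming an afterthought is to script it against the intersystem map itself. The Python sketch below assumes a map of sources, owners and expected update intervals; all of the names and intervals are made up for illustration.

```python
# Sketch of a freshness audit driven by a simple "intersystem map".
# Source names, owners and update intervals are hypothetical examples.

from datetime import datetime, timedelta

INTERSYSTEM_MAP = {
    "trades_feed": {"owner": "ops@example.com",     "max_age": timedelta(days=1)},
    "ref_data":    {"owner": "refdata@example.com", "max_age": timedelta(days=30)},
}

def stale_sources(last_updated, now=None):
    """Return (source, owner) pairs overdue for an update."""
    now = now or datetime.now()
    return [(name, INTERSYSTEM_MAP[name]["owner"])
            for name, updated in last_updated.items()
            if now - updated > INTERSYSTEM_MAP[name]["max_age"]]

print(stale_sources({"trades_feed": datetime(2014, 3, 1),
                     "ref_data":    datetime(2014, 2, 20)},
                    now=datetime(2014, 3, 10)))  # [('trades_feed', 'ops@example.com')]
```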

"The simple act of defining the quality requirements that you have is enough to provide a baseline that should provide you with more consistent thoughts about your data and should save time and money, which is what this is all about," Gleason said. "Well that and doing less rework, which is not fun."  

Welcome to The Data Mill, a weekly column devoted to all things data. Heard something newsy (or gossipy)? Email me or find me on Twitter at @TT_Nicole.

This was first published in March 2014