How does your phone know which calls are spam?

If you have a phone in the US, you have probably received a call from Susie about your car’s extended warranty. Or Carol, who needs to tell you about the latest changes that will affect your student loans. Or maybe even the guy who calls to talk to Fredrick (or Carl or Santiago) about donating to a police (or firefighter) charity, but maybe you can help him instead.

These recorded robocall voices and the scams they pitch are not alone in their attacks on our phones. Scam calls are ubiquitous.

By 2020, one in five U.S. mobile phone users received three or more scam calls a day, and an estimated 3 to 5 billion robocalls are made each month. As a result, nearly 90 percent of calls from unknown numbers go unanswered because users do not trust them.

How is machine learning used to combat spam calls?

Classification and clustering algorithms, in both supervised and unsupervised machine learning, can be used to identify probable spam indicators and ongoing scam calls. Mobile carriers, device makers, and third-party apps use them to generate “spam likely” warnings.

The increasingly common “spam likely” message that pops up on your phone when it rings is part of the ongoing battle against such calls. These warnings are the result of machine learning efforts by voice service providers, device companies, and third-party app makers. Not only can they alert users before they answer a call, but they can also help catch the scammers.

Using machine learning to generate ‘Spam Likely’ warnings

When your phone’s caller ID says “spam likely,” the label comes from an analysis engine used by the carrier, said Mike Rudolph, CTO of YouMail, a third-party call protection provider that tracks and addresses robocalls. The three major carriers each work with a different analysis engine supplier: AT&T with Hiya, Verizon with TNS, and T-Mobile with First Orion.

“All three of these guys use machine learning, based on the dataset they operate from, to give you the ‘spam likely’ indication on the three mobile carriers,” Rudolph said.

The data sets that carriers use for this process come from call detail records. Calls made over the telephone network or via voice over internet protocol systems generate call detail records, which are logged by voice service providers (also called carriers) and telephone exchanges (also known as switches). These records contain basic metadata about the call, such as its origin and destination, the type of media (audio, SMS, and so on), the call duration, and whether or not the call connected.

Analysis engine vendors typically use behavioral analysis to identify suspicious callers. It generally examines how widely a number calls, meaning how many different people it reaches, and how frequently it calls over a period of time.

Rudolph gave the example of a new number that suddenly makes tens of thousands of calls within a network at around 9 a.m. on a given day.

“The behavioral analysis has been trained that a number it has not seen before making 50,000 calls at 9 a.m. on a Monday is suspicious,” he said. “It will be marked as ‘spam likely’.”
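The kind of behavioral check Rudolph describes can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's engine: the `CallRecord` fields, the `flag_suspicious` helper, and the thresholds are all assumptions made for the example, and a production system would learn its thresholds from data rather than hard-code them.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CallRecord:
    """A simplified call detail record (field names are illustrative)."""
    origin: str
    destination: str
    start: datetime
    duration_s: int
    connected: bool


def hourly_behavior(records):
    """Count calls and distinct destinations per (origin, hour) bucket."""
    stats = defaultdict(lambda: {"calls": 0, "destinations": set()})
    for record in records:
        hour = record.start.replace(minute=0, second=0, microsecond=0)
        bucket = stats[(record.origin, hour)]
        bucket["calls"] += 1
        bucket["destinations"].add(record.destination)
    return stats


def flag_suspicious(records, known_numbers, max_calls=1000, max_destinations=500):
    """Flag unseen origins that exceed both hourly thresholds (hypothetical rule)."""
    flagged = set()
    for (origin, _hour), bucket in hourly_behavior(records).items():
        is_new = origin not in known_numbers
        too_many_calls = bucket["calls"] > max_calls
        too_wide = len(bucket["destinations"]) > max_destinations
        if is_new and too_many_calls and too_wide:
            flagged.add(origin)
    return flagged


# Toy data: a never-before-seen number calls five different people
# within the 9 a.m. hour on a Monday; a known number makes one call.
monday_9am = datetime(2022, 1, 3, 9, 0)
records = [
    CallRecord("+15550100", f"+1555020{i}", monday_9am + timedelta(seconds=i), 0, False)
    for i in range(5)
]
records.append(CallRecord("+15550199", "+15550200", monday_9am, 120, True))

flagged = flag_suspicious(
    records, known_numbers={"+15550199"}, max_calls=3, max_destinations=3
)
```

The thresholds are set very low here only so the toy data trips them. The structure, grouping call records by originating number and time window and then comparing volume and breadth against what is normal, is the part that mirrors the description above.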

In addition to the data in call detail records, phones offer built-in tools to help identify and mark spam calls, providing another data stream that machine learning processes can use to identify potential spam calls. Apple, for example, offers its “Silence Unknown Callers” feature on phones running iOS 13 and newer operating systems. Google’s Phone app for Android similarly includes caller ID and spam protection options that allow users to mark calls as spam.

Carriers have their own systems as well: T-Mobile has ScamShield powered by First Orion, Verizon has Call Filter powered by TNS’s Call Guardian, and AT&T has Call Protect powered by Hiya. Third-party apps like YouMail, RoboKiller, CallApp, and those published by Hiya, TNS, and First Orion also allow users to mark calls as spam.

“An entry like this is added to the database as a spam call entry along with other regular calls,” said Albar Wahab, a computer science student at Data Science Dojo. He said feature engineering can be used to select the best indicators of spam calls. Then, traditional machine learning classification algorithms, such as support vector machines, can be used to predict whether a future incoming call is potential spam. Deep learning algorithms such as convolutional neural networks and long short-term memory (LSTM) networks can also be used to effectively automate the feature engineering step.
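To make the classification step concrete, here is a minimal sketch of a linear support vector machine trained with plain subgradient descent on the hinge loss. Everything specific in it is an assumption for illustration: the two engineered features (calls per hour in thousands, and the fraction of calls that were answered), the toy labels, and the hyperparameters. A real pipeline would more likely rely on an established library than on a hand-rolled trainer like this one.

```python
def score(weights, bias, features):
    """Signed distance-like score; positive means 'spam' in this sketch."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def train_linear_svm(samples, labels, epochs=200, learning_rate=0.1, reg=0.01):
    """Fit a linear SVM by subgradient descent on regularized hinge loss.

    labels must be +1 (spam) or -1 (not spam).
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            margin = label * score(weights, bias, features)
            # L2 regularization step: shrink the weights slightly.
            weights = [w - learning_rate * reg * w for w in weights]
            # Hinge-loss step: only samples inside the margin push the boundary.
            if margin < 1:
                weights = [
                    w + learning_rate * label * x
                    for w, x in zip(weights, features)
                ]
                bias += learning_rate * label
    return weights, bias


def predict(weights, bias, features):
    """Return +1 for likely spam, -1 otherwise."""
    return 1 if score(weights, bias, features) >= 0 else -1


# Hypothetical engineered features: [calls per hour / 1000, answer rate].
samples = [
    [5.0, 0.05], [3.0, 0.10], [4.0, 0.02],     # user-reported spam numbers
    [0.01, 0.80], [0.02, 0.90], [0.005, 0.70], # ordinary numbers
]
labels = [1, 1, 1, -1, -1, -1]

weights, bias = train_linear_svm(samples, labels)
heavy_caller = predict(weights, bias, [4.5, 0.03])
ordinary_caller = predict(weights, bias, [0.015, 0.85])
```

The user reports Wahab mentions play the role of `labels` here; feature engineering is the work of deciding what goes into each row of `samples`.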

Other ways to identify spam calls and those who enable them

While privacy laws limit voice service providers to using the data in call detail records to identify potential spam calls, third-party apps that users opt into can access more information about calls. YouMail, for example, uses an audio fingerprint system to analyze the content of a call and identify known and potential scam robocalls without anyone actually listening to the call.

“We are 100 percent based on the sound of calls, and we do nothing related to the range or frequency of calls,” Rudolph said. “For us, because we are an over the top information service, we can train machine learning based on what the call said. It’s a completely different machine learning.”

YouMail, Rudolph explained, takes the audio of calls and turns it into images using the fast Fourier transform (FFT) and the constant-Q transform (CQT). The resulting image is the audio fingerprint of a call. Using both supervised and unsupervised machine learning algorithms, YouMail plots the auditory differences between fingerprints from sample calls and those of known scam calls. The smaller the auditory difference between the fingerprint of a sample or ongoing call and that of a known scam, the more likely it is that the call is a scam.
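The idea of comparing audio fingerprints can be illustrated with a small, self-contained sketch. It is not YouMail's system: it uses a naive discrete Fourier transform (the quantity an FFT computes, just more slowly), skips the constant-Q transform entirely, stands in pure tones for real call audio, and measures "auditory difference" as a plain Euclidean distance. All function names are hypothetical.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitude spectrum of one frame via a naive DFT (an FFT gives the same values)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]


def fingerprint(signal, frame_size=64):
    """Split audio into frames and stack their spectra: a tiny spectrogram 'image'."""
    starts = range(0, len(signal) - frame_size + 1, frame_size)
    return [dft_magnitudes(signal[s:s + frame_size]) for s in starts]


def distance(fp_a, fp_b):
    """Euclidean distance between two equally sized fingerprints."""
    return math.sqrt(
        sum((a - b) ** 2 for row_a, row_b in zip(fp_a, fp_b) for a, b in zip(row_a, row_b))
    )


def tone(freq_bin, amplitude=1.0, length=256, frame_size=64):
    """Synthetic stand-in for call audio: a sine wave centered on one DFT bin."""
    return [
        amplitude * math.sin(2 * math.pi * freq_bin * t / frame_size)
        for t in range(length)
    ]


known_scam = fingerprint(tone(5))
replayed_scam = fingerprint(tone(5, amplitude=0.9))  # same recording, slightly quieter
other_call = fingerprint(tone(12))                   # unrelated audio

close = distance(known_scam, replayed_scam)
far = distance(known_scam, other_call)
```

In this toy setup `close` comes out much smaller than `far`, which is the property the article describes: the nearer a call's fingerprint sits to a known scam's fingerprint, the more likely it is the same recording.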

The audio fingerprints can also be used to identify potential new scams as they happen, either based on a new cluster of very similar content or because of the content itself.

“For example, our machine learning knows some things that are binary,” Rudolph said. “If you get a call that says it is [the Internal Revenue Service] or the [Social Security Administration], it’s definitely going to be a scammer calling you.”

The ability to identify ongoing fraudulent calls using sound fingerprints also allows for faster reporting to potentially identify bad actors or at least the voice service provider that made the call.

When YouMail encounters a call that matches the audio fingerprint of a known scam call, it can be sent to the Industry Traceback Group within seconds of being identified, Rudolph said. The Industry Traceback Group can then trace the fraudulent call back to the provider that originated it. Under the TRACED Act, which was signed into law in 2019, voice service providers are required to close accounts that send illegal calls.

Cut off the flow of data to spammers

Just as those who fight fraudulent calls with machine learning thrive on data, so do scammers. Although scam call reduction is generally not the purpose of most data protection apps or services, a reduction in the publicly available data that scammers access can also have the side effect of reducing scam calls.

“One thing we tried to do at Kanary is identify the data sources that spammers use, and then remove the data from there,” said Rachel Vrabec, founder and CEO of Kanary, a data protection service. Removing customers’ phone numbers and other personal information from public sources makes them less searchable. Being less searchable makes it harder for robocalling scammers to identify live numbers to call, Vrabec said.

“When you look at the supply chain for these phone numbers and how they end up in the spammers’ arsenals to use, then you do not want to be like the first number on all their lists,” she said. “The goal is to help you keep your phone number more private.”

Spoofing presents challenges for data collection

Although not all spam calls are robocalls, and not all robocalls are spam, there is a lot of overlap. Increasingly, robocalls are spoofed, which means that the number displayed on your caller ID is not the actual number from which the call originated. Call spoofing can be done for legitimate reasons, like when a doctor calls back from a personal phone but the office number appears on your caller ID to protect the doctor’s privacy. When scammers spoof robocalls, however, it is to avoid being detected and tracked down.

“Offshore, and even onshore, less desirable companies you do not want to work with do not want you to find out who they are, and they do not want you to call them back on their real phone number, so they spoof a telephone number,” said Brian Podalak, CEO of Vocodia, an AI sales and customer service platform.

Spoofing is a growing part of scammers’ arsenal, and it can dull the edge that machine learning gives scam detection efforts. The short version, as Omer Khan, CTO at Vocodia, put it, is that machine learning suffers from the problem of “garbage in, garbage out.”

Spoofed numbers can introduce a lot of noise into a machine learning model for spam detection, Vrabec said, and that noise can produce false signals.

“I could use your phone number and start spamming people with it,” she said. “If you started collecting the wrong data about that number, you could easily ruin someone’s landline connection and the delivery capability of those calls.”

A ‘broken environment’

Complicating the use of data to identify spam calls is the current nature of the telephony landscape. There are so many databases of call information collected by different groups – voice service providers, device companies, third-party apps and even countries (think National Do Not Call Registry) and states – all with slightly different spam detection systems.

“Everyone does it their own way and thinks they have the better mousetrap,” Podalak said.

Although there are some public registries with information on numbers, most of the registries at the mobile carrier or device level do not interact. Both Khan and Podalak said that needs to change in order to make greater progress in the use of machine learning against spam calls.

“The data, in my opinion, needs to be centralized somewhere in order to be successful,” Podalak said. “Otherwise, you get what you have right now, which is this broken environment.” If he wanted to register his number as legitimate so that it did not appear as “spam likely,” he said, he would currently have to go to about a dozen different groups to do so.

Khan said he believed such a combined registry could not be run by a traditional carrier or device company. Instead, ownership of a centralized registry would have to fall to a body like the Federal Communications Commission. Aside from that, he noted that “there are already initiatives and for-profit ventures that are very interested in standardizing this.”

It would be possible for spam detection efforts using machine learning to be more accurate, he added, but it would require collaboration across the fragmented landscape.

“Businesses and private entities need to be able to talk to each other to share that data and enforce that standard,” Khan said.

Maybe if that happened, we would all get fewer calls from Susie about our extended warranty.

