Fraud detection is a top priority for companies like ours. Beyond risks such as identity theft, account theft, money laundering, and merchant fraud, another type of fraud has challenged us in recent years. Fraudsters use certain techniques to register large numbers of user accounts on our platform and manipulate those accounts to make us believe they belong to authentic users. Each year around dates such as November 11, June 18, and May 25, companies like Bestpay, Alipay, and JD launch big promotions, releasing large volumes of coupons online to boost user activity and attract new users. The fraudsters then use their fake accounts to collect huge volumes of coupons and work with certain merchants to cash them out through fake transactions. Clearly, these fake users are not our target users, and the coupons were never meant for them. We have a long history of fighting this kind of fraud: by analyzing and mining the data behind it, we have shut down a large number of fraudsters’ accounts, and we put suspicious accounts on a watch list every day. The patterns we have found include high similarity in a certain section of the phone number digits across a group of users, the number of different IP addresses used by the same device ID, and the distribution of login times, to name a few. But as fraudulent behavior becomes more deceptive and complex, we have to adapt quickly to new fraud patterns to protect our assets. We have over 200 million registered users and dozens of petabytes of data comprising transaction logs, device information, account information, and behavior logs. This results in a highly sparse feature space with up to tens of billions of dimensions, which makes traditional statistical analysis and most machine learning models hard to apply.
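Two of the hand-crafted patterns mentioned above (several IP addresses behind one device ID, and groups of phone numbers sharing a run of digits) can be illustrated with a minimal sketch. The function names, the toy records, and the thresholds `prefix_len=8` and `min_size=3` are all hypothetical and chosen only for illustration; they are not the rules used in production.

```python
from collections import defaultdict


def distinct_ips_per_device(logins):
    """Map each device ID to the number of distinct IP addresses seen with it."""
    ips = defaultdict(set)
    for device_id, ip in logins:
        ips[device_id].add(ip)
    return {device: len(addresses) for device, addresses in ips.items()}


def phone_prefix_groups(phones, prefix_len=8, min_size=3):
    """Group phone numbers sharing their first `prefix_len` digits.

    Only groups with at least `min_size` members are returned. Both
    thresholds are illustrative assumptions.
    """
    groups = defaultdict(list)
    for phone in phones:
        groups[phone[:prefix_len]].append(phone)
    return {
        prefix: sorted(members)
        for prefix, members in groups.items()
        if len(members) >= min_size
    }


# Hypothetical toy records: (device ID, IP address) login pairs.
logins = [
    ("dev-a", "10.0.0.1"),
    ("dev-a", "10.0.0.2"),
    ("dev-a", "10.0.0.3"),
    ("dev-b", "10.0.0.9"),
    ("dev-b", "10.0.0.9"),
]
# Hypothetical phone numbers; the first three differ only in the last digit.
phones = ["13800001001", "13800001002", "13800001003", "13912345678"]

ip_counts = distinct_ips_per_device(logins)
suspicious_groups = phone_prefix_groups(phones)
```

Rules of this kind are easy to explain, but each one has to be discovered and maintained by hand, which is part of why they struggle to keep up with new fraud patterns at this data scale.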
We found that an adversarial autoencoder (AAE) can be used for nonlinear representation learning on high-dimensional data. We train an AAE model (three hidden layers for the encoder/decoder and the discriminator) on our unlabeled data, extract the latent vectors from the encoder, and evaluate the representation with t-SNE; we then feed the latent vectors to a Gaussian mixture model (GMM) to produce a number of clusters. The data within each cluster turns out to have a strong latent connection. This method is an effective and efficient step for us: the AAE captures the intrinsic features of the complex data, i.e., a good representation for modeling the risk factors, and clustering narrows a huge volume of intricate data down to a limited number of groups.
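The clustering step of this pipeline can be sketched as follows. This is only an illustration under stated assumptions: the AAE encoder is not implemented here, so synthetic 4-dimensional vectors stand in for its latent output, and the number of mixture components (3) is fixed by hand rather than taken from the actual system. The sketch uses scikit-learn's `GaussianMixture`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Assumption: stand-in for the encoder's latent vectors. Three well-separated
# synthetic groups of 100 points each in a 4-dimensional latent space.
rng = np.random.default_rng(0)
centers = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [10.0, 10.0, 10.0, 10.0],
    [-10.0, 10.0, -10.0, 10.0],
])
true_group = np.repeat(np.arange(3), 100)
latent = centers[true_group] + rng.normal(scale=0.5, size=(300, 4))

# Fit a Gaussian mixture on the latent vectors. n_components=3 is an
# illustrative choice, not a value reported in the talk.
gmm = GaussianMixture(
    n_components=3, covariance_type="full", n_init=3, random_state=0
)
labels = gmm.fit_predict(latent)        # hard cluster assignment per account
membership = gmm.predict_proba(latent)  # soft membership probabilities
cluster_sizes = np.bincount(labels, minlength=3)
```

Each resulting cluster is then a candidate group for manual review. With real data the number of components would have to be chosen, for example by comparing a model-selection score such as `gmm.bic(latent)` across several candidate values.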
[Results-expect more to come]
In a quick test on data gathered from one of our promotions in 2018, we examined 37 of those groups and found 609 accounts that were already on our blacklist and another 3,100 accounts on our watch list. More valuable to us, we also found a previously unseen fraud pattern. One problem with AAE, though (as with most other deep learning networks), is that its results remain unexplainable, which means we cannot use it directly to make decisions. But the test shows that AAE produces a good representation of high-dimensional data, which we exploit to obtain good clusters, analyze the data further, and discover previously unknown fraud patterns. Moreover, in the test it also helped cut the false alarm rate by 6%.
Vincent Xie (谢巍盛) is the Chief Scientist and Director of China Telecom BestPay Co., Ltd. He built the company’s Artificial Intelligence Group and leads the team’s research on big data and AI. Previously, he worked for Intel, leading an engineering team working on machine learning- and big data-related open source technologies.
©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org