Fraud detection is one of the top priorities for companies like China Telecom. Along with risks such as identity theft, account theft, money laundering, and merchant fraud, another type of fraud has become a challenge in recent years: fraudsters register a large number of user accounts on a platform and manipulate those accounts so that the company believes they belong to authentic users.
Each year, companies like Bestpay, Alipay, and JD.com launch big promotions (like those on November 11, June 18, and May 25), releasing online coupons to boost user activity and attract new users. Fraudsters use their fake accounts to get hold of a huge volume of coupons and work with some merchants to cash them in through fake transactions. Those fake accounts are not the target users, and the coupons were never meant to be within their reach. China Telecom has a long history of fighting this kind of fraud. By analyzing and mining the data behind it, the company has shut down a large number of fraudsters' accounts, and it puts suspicious accounts on watch lists every day. The patterns found include high similarity in a certain section of the phone numbers within a group of users, the number of different IP addresses used by the same device ID, and the distribution of login times. But as fraudulent behavior becomes more deceptive and complex, the company has to adapt quickly to new fraud patterns to protect its assets.
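One of the patterns mentioned above, the number of different IP addresses used by the same device ID, can be illustrated with a minimal sketch. The event layout, the sample records, and the `max_ips` threshold are all hypothetical; the abstract does not describe how the company actually computes or thresholds this signal.

```python
from collections import defaultdict

# Hypothetical login events: (account_id, device_id, ip_address).
LOGINS = [
    ("u1", "dev-A", "10.0.0.1"),
    ("u2", "dev-A", "10.0.0.2"),
    ("u3", "dev-A", "10.0.0.3"),
    ("u4", "dev-B", "10.0.1.1"),
    ("u4", "dev-B", "10.0.1.1"),
]


def distinct_ips_per_device(logins):
    """Map each device ID to the set of IP addresses seen with it."""
    ips = defaultdict(set)
    for _account, device, ip in logins:
        ips[device].add(ip)
    return ips


def flag_devices(logins, max_ips=2):
    """Return device IDs seen with more than `max_ips` distinct IPs.

    The threshold is illustrative only, not a value from the talk.
    """
    per_device = distinct_ips_per_device(logins)
    return sorted(d for d, ips in per_device.items() if len(ips) > max_ips)


suspicious = flag_devices(LOGINS)
```

In this toy data, `dev-A` appears with three distinct IP addresses and is flagged, while `dev-B` appears with only one. A hand-written rule like this is exactly the kind of pattern that stops working once fraudsters change their behavior.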
China Telecom has over 200 million registered users and dozens of petabytes of data comprising transaction logs, device information, account information, and behavior logs, which amounts to a highly sparse feature space with tens of billions of dimensions. That makes traditional statistical analysis and most machine learning models hard to apply. The company found that adversarial autoencoders (AAEs) can be used for nonlinear representation learning on high-dimensional data.
Weisheng Xie dives deep into how the company trained an AAE model (three hidden layers for the encoder, decoder, and discriminator) on the unlabeled data, extracted the latent vectors from the encoder, and evaluated the representation with t-SNE. The latent vectors were then fed to a Gaussian mixture model (GMM) to produce a number of clusters. The data within each cluster shows a strong latent connection. The method is an effective and efficient step: the AAE captures the intrinsic features of the complex data (i.e., a good representation) for modeling the risk factors, and clustering narrows a huge volume of intricate data down to a limited number of groups. In a quick test on data gathered from one of the promotions in 2018, out of 37 groups examined, the company found 609 accounts that were already on its blacklist, another 3,100 accounts on its watch list, and, more valuably, a previously unseen fraud pattern.
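The clustering step can be sketched with scikit-learn's `GaussianMixture`. This is only an illustration under stated assumptions: the latent vectors here are synthetic stand-ins for encoder output, and the latent dimension (8), the number of components (3), and the helper name `cluster_latent` are hypothetical choices, not details from the talk. The AAE training and the t-SNE evaluation are not shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for the encoder's latent vectors: 300 points in an
# assumed 8-dimensional latent space, drawn around 3 arbitrary centers.
centers = rng.normal(scale=5.0, size=(3, 8))
latent = np.vstack([c + rng.normal(size=(100, 8)) for c in centers])


def cluster_latent(z, n_components, seed=0):
    """Fit a GMM to latent vectors and return the model and hard labels."""
    gmm = GaussianMixture(
        n_components=n_components,
        covariance_type="full",
        random_state=seed,
    )
    labels = gmm.fit_predict(z)
    return gmm, labels


gmm, labels = cluster_latent(latent, n_components=3)

# Soft assignments: one row per account, one column per cluster.
responsibilities = gmm.predict_proba(latent)
```

A GMM yields a probability of membership for every cluster in addition to a hard label, so accounts that sit between groups can be identified from `responsibilities` for closer inspection.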
One problem with AAEs (and with most other deep learning networks) is that their results remain unexplainable, which means the company cannot use an AAE directly to make decisions. However, an AAE produces a good representation of high-dimensional data, which can be exploited to create a good clustering, analyze the data further, and discover previously unknown fraud patterns. In the test, the approach also helped cut the false alarm rate by 6%.
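Because the clusters support analysis rather than automatic decisions, one plausible way to use them is to rank groups by how many of their accounts are already known to be risky, so that analysts look at the densest groups first. The sketch below assumes cluster assignments are available as a plain mapping; the function name, the sample accounts, and the ranking criterion are hypothetical and not taken from the talk.

```python
from collections import Counter


def rank_clusters(assignments, blacklist, watch_list):
    """Rank clusters by the share of accounts already known to be risky.

    assignments: dict mapping account ID to cluster ID.
    Returns rows of (cluster, size, n_blacklisted, n_watched, known_share),
    sorted by known_share in descending order.
    """
    size = Counter(assignments.values())
    black = Counter(c for a, c in assignments.items() if a in blacklist)
    # Count watch-list accounts only if they are not already blacklisted.
    watch = Counter(
        c for a, c in assignments.items()
        if a in watch_list and a not in blacklist
    )
    rows = [
        (c, n, black[c], watch[c], (black[c] + watch[c]) / n)
        for c, n in size.items()
    ]
    return sorted(rows, key=lambda row: row[4], reverse=True)


# Hypothetical toy data.
assignments = {"a1": 0, "a2": 0, "a3": 0, "a4": 1, "a5": 1, "a6": 2}
blacklist = {"a1"}
watch_list = {"a2", "a4"}

ranking = rank_clusters(assignments, blacklist, watch_list)
```

Here cluster 0 ranks first because two of its three accounts are already on a list. Its remaining unlisted account, `a3`, is the kind of candidate an analyst would review by hand.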
Vincent Xie (谢巍盛) is the chief data scientist and senior director at Orange Financial. As head of the AI Lab, he built the Big Data & Artificial Intelligence team from scratch, established the company's big data and AI infrastructure, and launched many business applications on top of it. This thorough data-driven transformation strategy increased the company's total revenue many times over. Previously, he worked at Intel for about eight years, mainly on machine learning and big data-related open source technologies and products.
©2019, O'Reilly Media, Inc.