Natural language is highly varied and nuanced. An SMS message, an academic paper, a patent filing, and an online news article all use different grammar, jargon, and implied semantics. As a result, most real-world systems must train domain-specific models tailored to the kind of text they process and the inferences they need to make. Separate models are also required for each human language the system must understand. Most state-of-the-art algorithms in natural language processing are based on deep learning: word embeddings, bidirectional LSTMs, and hybrid neural network architectures that together achieve high-accuracy results.
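To make the case for domain-specific models concrete, here is a minimal sketch in plain Python. It is not the deep-learning approach or the Spark-based library covered in the talk. It assumes a simple count-based co-occurrence representation in place of trained word embeddings, and the two tiny corpora and all function names are hypothetical, invented for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def nearest(word, vectors, k=1):
    """Return the k words whose context vectors are most similar to `word`."""
    scores = [
        (cosine(vectors[word], vector), other)
        for other, vector in vectors.items()
        if other != word
    ]
    return [other for _, other in sorted(scores, reverse=True)[:k]]


# Hypothetical toy corpora standing in for two different text domains.
clinical_notes = [
    "patient reports a cold and a cough",
    "patient reports a fever and a cough",
]
sms_messages = [
    "it is cold outside today",
    "it is hot outside today",
]

clinical_vectors = cooccurrence_vectors(clinical_notes)
sms_vectors = cooccurrence_vectors(sms_messages)

print(nearest("cold", clinical_vectors))
print(nearest("cold", sms_vectors))
```

In this toy setup, "cold" sits closest to "fever" in the clinical text and closest to "hot" in the SMS text. The same surface word carries different meaning depending on the domain, which is why a representation learned on one kind of text may transfer poorly to another.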
David Talby explains how to train custom word embeddings, named entity recognition, and question-answering models with the NLP library for Apache Spark, which provides distributed implementations of these tasks as a native extension of Spark ML and takes advantage of Spark’s runtime performance optimization at scale.
This talk is intended to be an immediate follow-up to Introducing Spark NLP. David uses sample PySpark notebooks, which will be made publicly available after the talk.
David Talby is the chief technology officer at Pacific AI, where he helps fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe. Earlier, he worked at Amazon in both Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.
©2018, O'Reilly Media, Inc.