Techniques and tools for the evolving NLP landscape in Fintech

Traditional Natural Language Processing (NLP) has long relied on powerful Python libraries such as spaCy and NLTK, which have proven effective for a wide range of text-processing tasks. However, these libraries are primarily designed for single-node compute environments, which becomes a significant limitation once a dataset outgrows a single machine. In this session, we will explore how distributed platforms like Apache Spark, and specifically its Python API, PySpark, are changing the way we approach NLP by enabling parallelized, distributed processing.

We will delve into the use of PySpark libraries such as Spark NLP, which distribute NLP tasks across multiple nodes so that even very large datasets can be processed efficiently. The session will also cover practical techniques for distributing Python-based NLP workloads over clusters, including how to use non-Spark NLP libraries like spaCy and NLTK within a Spark environment through pandas UDFs (User Defined Functions). Additionally, we will discuss the use of libraries such as MLlib for scalable machine learning, Koalas (now part of PySpark as the pandas API on Spark) for simplifying the transition from pandas to PySpark, and Delta Lake for handling large-scale data lakes.
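As a rough illustration of the pandas UDF approach mentioned above, here is a minimal sketch. It is not taken from the session itself: the names `count_tokens`, `add_token_counts`, and the `text` and `n_tokens` columns are hypothetical, and a simple regular-expression tokenizer stands in for a spaCy or NLTK call. It assumes `pyspark`, `pandas`, and `pyarrow` are installed.

```python
import re

import pandas as pd

# Stand-in tokenizer (assumption): a real pipeline would call spaCy or NLTK here.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+")


def count_tokens(texts: pd.Series) -> pd.Series:
    """Batch-level function: takes a pandas Series of strings, returns token counts.

    Spark calls this once per Arrow batch, so any expensive setup
    (for example, loading an NLP model) is paid per batch rather than per row.
    """
    return texts.fillna("").map(lambda t: len(TOKEN_RE.findall(t))).astype("int64")


def add_token_counts(df, text_col="text"):
    """Wrap count_tokens as a Series-to-Series pandas UDF and apply it to a Spark DataFrame."""
    # Imported lazily so count_tokens can be tried out with pandas alone.
    from pyspark.sql.functions import col, pandas_udf

    token_count_udf = pandas_udf(count_tokens, returnType="long")
    return df.withColumn("n_tokens", token_count_udf(col(text_col)))
```

With an active `SparkSession` and a DataFrame that has a string column named `text`, `add_token_counts(df)` would return the same DataFrame with an added `n_tokens` column. Keeping the batch function separate from the Spark wrapper means the NLP logic can be unit-tested on a plain pandas Series before it is run on a cluster.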

Building on this foundation, we will then venture into the integration of Generative AI (GenAI) frameworks into these NLP pipelines. We will explore how tools like Hugging Face’s Transformers (BERT and its variants) and DeepSpeed can be used to scale deep learning models across distributed environments, highlighting their applications in tasks such as text classification, sentiment analysis, and named entity recognition, particularly within the fintech sector.

By the end of this session, participants will have a clear understanding of how to evolve traditional NLP practices by incorporating distributed computing and GenAI, ensuring they can handle the growing demands of big data in a scalable and efficient manner.

About the speaker

Sumedha Rai

AI Researcher at NYU Langone Health

Sumedha is a Senior Data Scientist and AI researcher based out of New York with extensive experience in both academia and industry.

She is currently working with a micro-investments firm, where she leads the Natural Language Processing efforts and a suite of anti-fraud solutions. She is a graduate of New York University (NYU), where she studied machine learning and AI, and she continues to conduct research at the NYU Center for Data Science with researchers across departments. Currently, her work at NYU involves collaborating with the Langone medical school to explore large language models and AI in healthcare. Her professional experience spans diverse domains, including finance, medicine, SaaS, and policy. As a dedicated researcher, she has worked with distinguished academics, product companies, hedge funds, and even central banks. She has authored articles on AI in major media and frequently appears at NYU as a panelist on AI and tech. She has also been a speaker on national radio shows, at tech conferences, and on podcasts. Sumedha stays informed about the latest tech developments and is always eager to dissect them and glean insights.

When

Online Event: September 24, 2024

Contact

nlpsummit@johnsnowlabs.com

Presented by