Advanced Natural Language Processing with Spark NLP
David Talby
CTO, John Snow Labs
Agenda
1. Introducing Spark NLP
2. State-of-the-art Accuracy
3. Speed & Scalability
4. Ease of Use
5. Examples
Introducing Spark NLP
▪ Most popular NLP library in the enterprise (O’Reilly Media)
▪ 54% share of healthcare AI teams use Spark NLP (Gradient Flow)
▪ 16x growth in downloads of the library since Jan 2020 (PyPI Download Stats)
What is Spark NLP?
▪ State-of-the-art Natural Language Processing
▪ Production-grade, trainable, and scalable
▪ Open-Source Python, Java & Scala libraries
▪ 1,400+ Pre-trained models & pipelines
▪ Active: 26+ new releases/year since 2017!
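For orientation, here is a minimal sketch of bringing the library up from Python. It assumes the spark-nlp and pyspark packages are installed from PyPI; sparknlp.start() and sparknlp.version() are part of the public API, and the gpu flag selects the GPU-enabled build.

```python
# pip install spark-nlp pyspark   (exact versions depend on your environment)
import sparknlp

# Start a local SparkSession preconfigured for Spark NLP.
# Pass gpu=True to use the GPU-enabled build instead.
spark = sparknlp.start()

print("Spark NLP version:", sparknlp.version())
print("Apache Spark version:", spark.version)
```

The same `spark` session is reused by the sketches later in this deck.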
Spark NLP in Industry
NLP Industry Survey by Gradient Flow, an independent data science research & insights company, September 2020: "Which NLP libraries does your organization use?"
Trusted By
There’s a world of difference between an academic result and a production system:
▪ Trainable & Tunable
▪ 100% Private
▪ Explainable
▪ Reproducible
▪ Hardware Optimized
▪ Scalable
▪ Community & Education
Spark NLP
Introducing Spark NLP 3
• Massive speedups (Databricks 7.2 ML GPU on 10 AWS f4dn.large):
  ● 7.9 times faster in calculating BERT-Large
  ● 6.5 times faster in calculating BERT-Base
  ● 3.0 times faster in calculating NER DL
• The latest compute platforms:
  ● Spark 3.1, 3.0, 2.4, 2.3
  ● Databricks 8.x, 7.x, 6.x – CPU and GPU
  ● Linux, Mac, Windows – local development
  ● Docker – with & without Kubernetes
  ● Hadoop 2.7 and 3.x
  ● Cloudera & Hortonworks
  ● AWS, Azure, and GCP
Agenda
1. Introducing Spark NLP
2. State-of-the-art Accuracy
3. Speed & Scalability
4. Ease of Use
5. Examples
On Accuracy

Biomedical Named Entity Recognition at Scale
• Obtains new state-of-the-art results on seven public biomedical benchmarks without using heavy contextual embeddings, including:
  • BC4CHEMD to 93.72% (4.1% gain)
  • Species800 to 80.91% (4.6% gain)
  • JNLPBA to 81.29% (5.2% gain)
• Production-grade codebase on top of the Spark NLP library; can scale up for training and inference in any Spark cluster; GPU support; polyglot API
• Presented at CADL 2020 (International Workshop on Computational Aspects of Deep Learning), in conjunction with ICPR 2020

Improving Clinical Document Understanding on COVID-19 Research with Spark NLP
• Improves on the previous best accuracy benchmarks for assertion status detection
• Recognizes 100+ entity types including social determinants of health, anatomy, risk factors, and adverse events, in addition to other commonly used clinical and biomedical entities
• Extracts trends and insights: most frequent disorders & symptoms and most common vital signs and EKG findings from CORD-19
• Presented at the SDU (Scientific Document Understanding) workshop at AAAI 2021

Accurate Clinical Named Entity Recognition at Scale
• Establishes new state-of-the-art accuracy on 3 clinical concept extraction challenges:
  • 2010 i2b2/VA clinical concept extraction
  • 2014 n2c2 de-identification
  • 2018 n2c2 medication extraction
• Outperforms the accuracy of Amazon Comprehend Medical and the Google Cloud Healthcare API by a large margin (8.9% and 6.7%, respectively)
• Outperforms a plain Keras implementation
• Under review
Accuracy: State-of-the-art Models
Named Entity Recognition
● “State of the art” means the best peer-reviewed academic results
● For example: the best F1 score on the CoNLL-2003 NER benchmark for a system in production
● Spark NLP uses a custom model based on Bi-LSTM + Char-CNN + CRF + Word Embeddings
Accuracy: State-of-the-art Models
Named Entity Recognition
● The best F1 score on the CoNLL-2003 NER benchmark for a system in production, achieved with Spark NLP
● A BERT Large model was used to train our Bi-LSTM + Char-CNN + CRF model
Accuracy: State-of-the-art Models
Named Entity Recognition
● Everything must work right out of the box
● All parameters are left at their defaults
● The CoNLL 2003 dataset is used in this benchmark: eng.train for training and eng.testa for evaluating the model
Transformers & Embeddings
Spark NLP: 100+ Word Embeddings
● BERT
● Small BERT
● BioBERT
● CovidBERT
● ALBERT
● ELECTRA
● XLNet
● ELMo
● GloVe
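As a sketch of how the embeddings listed above plug into a pipeline: they all share the same annotator interface, so one can be swapped for another without changing downstream stages. The model names below ("glove_100d", "bert_base_cased") are examples from the public models hub; availability may vary by release.

```python
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel, BertEmbeddings

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# Classic GloVe word embeddings...
glove = WordEmbeddingsModel.pretrained("glove_100d", "en") \
    .setInputCols(["document", "token"]).setOutputCol("embeddings")

# ...or a transformer-based alternative; only one of the two would be used
# in a given pipeline, since both write to the same "embeddings" column.
bert = BertEmbeddings.pretrained("bert_base_cased", "en") \
    .setInputCols(["document", "token"]).setOutputCol("embeddings")
```

Downstream annotators such as an NER model only see the `embeddings` annotation type, which is what makes the 100+ variants interchangeable.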
Accuracy: State-of-the-art Models
Multi-class & Multi-label Text Classification
● Multi-class text classification to detect emotions, cyberbullying, fake news, spam, etc.
● Multi-label text classification to detect toxic comments, movie genres, etc.
● Hundreds of pre-trained Word and Sentence Embeddings
● Language-Agnostic BERT Sentence Embedding (LaBSE)
● Universal Sentence Encoder as an input for text classification
Accuracy: State-of-the-art Models
SentimentDL, ClassifierDL, and MultiClassifierDL
● Embeddings: BERT, Small BERT, BioBERT, CovidBERT, LaBSE, ALBERT, ELECTRA, XLNet, ELMo, Universal Sentence Encoder, and GloVe
● Embedding dimensions: 100, 128, 200, 256, 300, 512, 768, and 1024
● Example models: tfhub_use, tfhub_use_lg, glove_6B_100, glove_6B_300, glove_840B_300, bert_base_cased, bert_base_uncased, bert_large_cased, bert_large_uncased, bert_multi_uncased, electra_small_uncased, elmo, ...
● Class counts: 2 classes (positive/negative), 3 classes (0, 1, 2), 4 classes (Sports, Business, etc.), 5 classes (1.0, 2.0, 3.0, 4.0, 5.0), ... up to 100 classes!
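To make the classifier family concrete, here is a hedged sketch of scoring text with a pretrained SentimentDL model on top of Universal Sentence Encoder embeddings. It reuses the `spark` session from the earlier sketch; the model names "tfhub_use" and "sentimentdl_use_twitter" are examples from the models hub and may differ in your release.

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, SentimentDLModel

document = DocumentAssembler().setInputCol("text").setOutputCol("document")

# Sentence embeddings feed the deep-learning classifier.
use = UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols(["document"]).setOutputCol("sentence_embeddings")

sentiment = SentimentDLModel.pretrained("sentimentdl_use_twitter", "en") \
    .setInputCols(["sentence_embeddings"]).setOutputCol("sentiment")

pipeline = Pipeline(stages=[document, use, sentiment])

df = spark.createDataFrame([["Spark NLP 3 is impressively fast."]]).toDF("text")
pipeline.fit(df).transform(df).select("sentiment.result").show(truncate=False)
```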
Accuracy: State-of-the-art Models
Language Detection & Identification
● LanguageDetectorDL is a state-of-the-art TensorFlow/Keras model
● Uses the positions of the characters
● The model is only around 3 MB to 5 MB
● Trained on over 8 million Wikipedia pages
● Achieves 97% to 99% accuracy for text longer than 140 characters
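A hedged usage sketch for LanguageDetectorDL, assuming the `spark` session from earlier; "ld_wiki_tatoeba_cnn_21" is an example multilingual model name from the models hub.

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import LanguageDetectorDL

document = DocumentAssembler().setInputCol("text").setOutputCol("document")

# Multilingual ("xx") language-identification model.
lang_detector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_21", "xx") \
    .setInputCols(["document"]).setOutputCol("language")

pipeline = Pipeline(stages=[document, lang_detector])

df = spark.createDataFrame(
    [["Spark NLP est une bibliothèque de traitement automatique du langage naturel."]]
).toDF("text")
pipeline.fit(df).transform(df).select("language.result").show(truncate=False)
```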
Accuracy: State-of-the-art Models
Context Spell Checker
● Ability to consider OCR-specific error patterns
● Ability to leverage the context
● Ability to preserve and even correct custom patterns
● Flexibility to incorporate your own custom patterns
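A minimal sketch of applying the pretrained context-aware spell checker; "spellcheck_dl" is an example model name from the models hub, and the `spark` session is assumed from earlier.

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, ContextSpellCheckerModel

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# Context-aware corrections operate on tokens.
spell_checker = ContextSpellCheckerModel.pretrained("spellcheck_dl", "en") \
    .setInputCols(["token"]).setOutputCol("corrected")

pipeline = Pipeline(stages=[document, tokenizer, spell_checker])

df = spark.createDataFrame([["Plese chek this sentense for speling erors."]]).toDF("text")
pipeline.fit(df).transform(df).select("corrected.result").show(truncate=False)
```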
Agenda
1. Introducing Spark NLP
2. State-of-the-art Accuracy
3. Speed & Scalability
4. Ease of Use
5. Examples
Optimizing Performance
BERT Embeddings
● Transformers are slow!
● They need GPUs
● Performance depends heavily on the max sequence length
Spark NLP 2.6 optimizations:
● Reduced memory consumption by 30%
● Improved performance by more than 70% with dynamic shapes
Performance
BERT Embeddings
● Smaller BERT variants trade off size, memory, and accuracy:
● Tiny BERT
● Mini BERT
● Small BERT
● Medium BERT
● Others…
Example:
● BERT-Tiny is 24x smaller and 28x faster than BERT-Base
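Choosing a smaller BERT variant is a one-line change in a pipeline. The sketch below assumes the model name "small_bert_L2_128" (2 layers, 128 dimensions) from the models hub; the max-sentence-length and case-sensitivity knobs shown are the main levers for the speed trade-offs discussed above.

```python
from sparknlp.annotator import BertEmbeddings

# Swap BERT-Base for a compact variant to trade some accuracy for speed and memory.
small_bert = BertEmbeddings.pretrained("small_bert_L2_128", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(128) \
    .setCaseSensitive(False)
```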
Performance: Hardware
● Optimized builds of Spark NLP
for both Intel and Nvidia
● Out-of-the-box optimizations for
Intel (MKL, etc.) and Nvidia
(Spark 3, etc.)
● Ongoing profiling with
engineering teams at both
companies
Scale: Distribution & Parallelism
● Zero code changes to scale a pipeline to any Spark cluster
● The only natively distributed open-source NLP library
● Spark provides execution planning, caching, serialization, and shuffling
● Caveats:
  ● Speedup depends on what you actually do
  ● Spark configurations matter (see the session-configuration sketch below)
  ● Cluster tuning based on your data is advised
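Because scaling is a configuration concern rather than a code change, the main thing that differs between a laptop and a cluster is how the Spark session is created. The values below are illustrative starting points drawn from the Spark NLP documentation; the package coordinate and version are assumptions to be matched to your release, and memory and parallelism should be tuned to your data.

```python
from pyspark.sql import SparkSession

# The pipeline code itself stays identical; only the session/cluster config changes.
spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.0") \
    .getOrCreate()
```

On Databricks or YARN the same settings are supplied through the cluster configuration instead of the builder.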
Scale: Distribution & Parallelism
Recognize Entity DL Pipeline
● Amazon full reviews: 15 million sentences, 255 million tokens
● Single node with 32G memory & 32 cores
● 10 workers with 32G memory & 16 cores each
● The pipeline includes sentence detection, tokenization, word embeddings, and NER
Setup:
● The single node is a dedicated Dell server
● The 10 nodes run in Databricks on AWS
Scale: Distribution & Parallelism
BERT Embeddings
● Amazon full reviews: 15 million sentences, 255 million tokens
● Single node with 64G memory & 32 cores
● 10 workers with 32G memory & 16 cores each
● 128 max sequence length
Setup:
● The single node is a dedicated Dell server
● The 10 nodes run in Databricks on AWS
Agenda
1. Introducing Spark NLP
2. State-of-the-art Accuracy
3. Speed & Scalability
4. Ease of Use
5. Examples
Easy to Use
Python, Scala, and Java
● Pretrained pipelines
● Pretrained models
● Training your own models
Easy to Use
Pretrained Pipelines
● 100+ pretrained pipelines
● Full support for 13 languages
● Simple and easy to use
● Works online and offline
● Preconfigured
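A minimal sketch of a pretrained pipeline in action; "explain_document_dl" is one of the preconfigured English pipelines, downloaded and cached on first use so it also works offline afterwards.

```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("explain_document_dl", lang="en")

# annotate() runs the whole preconfigured pipeline on a plain string.
result = pipeline.annotate("Spark NLP ships pretrained pipelines that work out of the box.")
print(result["entities"])
print(result["pos"])
```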
Easy to Use
Pretrained Models
● Hundreds of pretrained models
● Support for 46 languages
● Works online and offline
● Flexible & customized pipelines
● Caveat: some models depend on each
other
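In contrast to pretrained pipelines, pretrained models are assembled into your own pipeline, which is where the dependency caveat shows up: for example, the public "ner_dl" model was trained on "glove_100d" vectors, so it must be paired with those embeddings. A sketch, assuming both model names are available in your release:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (SentenceDetector, Tokenizer, WordEmbeddingsModel,
                                NerDLModel, NerConverter)

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

# The NER model depends on the embeddings it was trained with.
embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en") \
    .setInputCols(["sentence", "token"]).setOutputCol("embeddings")
ner = NerDLModel.pretrained("ner_dl", "en") \
    .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")
chunks = NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document, sentence, token, embeddings, ner, chunks])
```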
Easy to Use
Train your own POS tagging models
● POS() accepts token-tag format
● The POS tagger is based on the Averaged Perceptron algorithm
● Language-agnostic: supports any language
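A hedged training sketch for a POS tagger: POS().readDataset reads a delimited token|tag corpus (the path and delimiter below are placeholders) and produces the document column consumed by the tokenizer and the averaged-perceptron trainer.

```python
from pyspark.ml import Pipeline
from sparknlp.annotator import Tokenizer, PerceptronApproach
from sparknlp.training import POS

# Placeholder corpus path; each line holds token|tag pairs.
train_df = POS().readDataset(spark, "path/to/pos_corpus.txt",
                             delimiter="|", outputPosCol="tags")

token = Tokenizer().setInputCols(["document"]).setOutputCol("token")
pos_tagger = PerceptronApproach() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos") \
    .setPosColumn("tags") \
    .setNIterations(5)

pos_model = Pipeline(stages=[token, pos_tagger]).fit(train_df)
```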
Easy to Use
Train your own NER models
● CoNLL 2003 format as input
● Accepts 50+ Word Embeddings
models
● Train on CPU or GPU
● Extended metrics and evaluation
● Built-in validation split with metrics
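A sketch of the NER training flow described above, reusing the `spark` session from earlier. CoNLL().readDataset parses CoNLL 2003 files into document, sentence, token, and label columns; "small_bert_L2_768" (2 layers, 768 dimensions, matching the setup on the next slide) is an example embeddings model name from the models hub.

```python
from pyspark.ml import Pipeline
from sparknlp.annotator import BertEmbeddings, NerDLApproach
from sparknlp.training import CoNLL

# Placeholder path to the CoNLL 2003 training file.
training_data = CoNLL().readDataset(spark, "path/to/eng.train")

embeddings = BertEmbeddings.pretrained("small_bert_L2_768", "en") \
    .setInputCols(["sentence", "token"]).setOutputCol("embeddings")

ner_approach = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(10) \
    .setValidationSplit(0.2) \
    .setEnableOutputLogs(True)   # per-epoch metrics written to the training logs

ner_model = Pipeline(stages=[embeddings, ner_approach]).fit(training_data)
```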
Easy to Use
Train your own NER models
● BERT with 2 layers & 768 dimensions
● 16 minutes training
● 91% Micro F1 on Dev
● 90% conll_eval on Dev
● Full CoNLL 2003 training dataset
● Google Colab with GPU
Easy to Use
Train your own multi-class classifiers
● Supports up to 100 classes
● Accepts 90+ Word & Sentence Embeddings
models
● Train on CPU or GPU
● Extended metrics and evaluation
● Built-in validation split with metrics
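A hedged sketch of training a multi-class classifier with ClassifierDLApproach on Universal Sentence Encoder embeddings. It assumes a DataFrame with "text" and "label" columns (the CSV path is a placeholder) and the `spark` session from earlier.

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, ClassifierDLApproach

# Placeholder training data with "text" and "label" columns.
train_df = spark.read.csv("path/to/train.csv", header=True)

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
use = UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols(["document"]).setOutputCol("sentence_embeddings")

classifier = ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("label") \
    .setMaxEpochs(10) \
    .setEnableOutputLogs(True)

clf_model = Pipeline(stages=[document, use, classifier]).fit(train_df)
```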
Agenda
1. Introducing Spark NLP
2. State-of-the-art Accuracy
3. Speed & Scalability
4. Ease of Use
5. Examples
Spark NLP for Healthcare
Spark OCR
The Annotation Lab
● Project creation
● Team setup
● Task creation
● Labeling
Learn More
● Using Spark NLP to build a drug discovery knowledge graph for Covid-19
  Vishnu Vettrivel & Alexander Thomas, Founder & Principal Data Scientist at Wisecube
● NLP in Healthcare: Challenges & Opportunities
  Ganesh Thodikulam, Executive Director, Kaiser Permanente
● A Unified CV, OCR, and NLP for Scalable Document Understanding
  Patrick Beukema, Ph.D., Senior ML Engineer, DocuSign
● Text Analytics and its Applications in the Pharma Industry
  Harsha Gurulingappa, Ph.D., Text Analytics Product Owner at Merck
● NLP in Oncology Real World Data: Opportunities to develop a true learning healthcare system
  George A. Komatsoulis, Ph.D., Chief of Bioinformatics at CancerLinQ
● Automated & Explainable Deep Learning for Clinical Language Understanding at Roche
  Vishakha Sharma, Ph.D., Principal Data Scientist, Roche
Thank you!
© 2015-2021 John Snow Labs Inc. All rights reserved. The John Snow Labs logo is a trademark of John Snow Labs Inc. The included information is for informational purposes only and represents the current view of John Snow Labs as of the date of this presentation. Since John Snow Labs must respond to changing market conditions, it should not be interpreted to be a commitment on its part, and John Snow Labs cannot guarantee the accuracy of any information provided after the date of this presentation. John Snow Labs makes no warranties, express or statutory, as to the information in this presentation.
Live demos: demo.johnsnowlabs.com
Get started: nlp.johnsnowlabs.com
