NLTK

NLTK is an open-source suite of Python modules, datasets, and tutorials designed for natural language processing research and development. It includes over 50,000 lines of code, more than 30 annotated corpora, and extensive documentation, including a 400-page book. The toolkit is widely adopted in various NLP courses globally and encourages community contributions.

An overview of the Natural Language Toolkit


Steven Bird, Ewan Klein, Edward Loper

nltk.org
Summary

• NLTK is a suite of open source Python modules, data sets and tutorials supporting research and development in natural language processing
• Download NLTK from nltk.org
Components of NLTK
1. Code: corpus readers, tokenizers, stemmers, taggers, chunkers, parsers, wordnet, ... (50k lines of code)
2. Corpora: >30 annotated data sets widely used in natural language processing (>300 MB of data)
3. Documentation: a 400-page book, articles, reviews, API documentation
1. Code
• corpus readers
• tokenizers
• stemmers
• taggers
• parsers
• wordnet
• semantic interpretation
• clusterers
• evaluation metrics
• …
2. Corpora
• Brown Corpus
• Carnegie Mellon Pronouncing Dictionary
• CoNLL 2000 Chunking Corpus
• Project Gutenberg Selections
• NIST 1999 Information Extraction: Entity Recognition Corpus
• US Presidential Inaugural Address Corpus
• Indian Language POS-Tagged Corpus
• Floresta Portuguese Treebank
• Prepositional Phrase Attachment Corpus
• SENSEVAL 2 Corpus
• Sinica Treebank Corpus Sample
• Universal Declaration of Human Rights Corpus
• Stopwords Corpus
• TIMIT Corpus Sample
• Treebank Corpus Sample
• …
3. Documentation
• a 400-page book about natural language processing in Python and NLTK
• teaches Python and NLP
• provides numerous examples and exercises
• installation instructions
• presentation slides for some of the book chapters
• API Documentation: describes every module, interface, class, and method
Adoption in NLP courses
Amsterdam, Ben-Gurion, Brown, Bryn Mawr, CDAC-Mumbai, Coruña, Edinburgh, Erlangen, Georgetown, Helsinki, IIT-Bombay, Iowa State, Konstanz, MIT, Macquarie, Magdeburg, Malta, Marquette, Melbourne, Nancy, Naval Postgraduate School, Northeastern, Ohio State, Pitt, San Diego State, Simon Fraser, Stanford, Syracuse University, Tsuda College, U Colorado, UC Berkeley, UMass Amherst, UNAM, U Penn, UT Austin, Warsaw
Contribute…
• NLTK is an open source project
• all code, data, and documentation are free
• dozens of people have contributed over the past 6 years
• please visit the website for project ideas
• sign up on the NLTK-Announce mailing list to hear about new releases
