Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof.
CoNLL
CoNLL, the Conference on Natural Language Learning, is SIGNLL’s(Special Interest Group on Natural Language Learning) yearly meeting. More information available here http://www.signll.org/conll/
Lemmatizing
A very similar operation to stemming is called lemmatizing. The major difference between these is, as you saw earlier, stemming can often create non-existent words. So, your root stem, meaning the word you end up with, is not something you can just look up in a dictionary. A root lemma, on the other hand, is a real word. Many times, you will wind up with a very similar word, but sometimes, you will wind up with a completely different word.
NLTK Corpora
NLTK is a massive toolkit for you. part of what they give you is a ton of highly valuable corpora to learn with, train against, and some of them are even capable of using in production.
Part-Of-Speech Tagger(POS) tags
A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc., although generally computational applications use more fine-grained POS tags like ‘noun-plural’.
Stemming
Stemming is the process of reducing a word to its word stem that affixes to suffixes and prefixes or to the roots of words known as a lemma. Stemming is important in natural language understanding (NLU) and natural language processing (NLP).
Stop Words
Text may contain stop words like ‘the’, ‘is’, ‘are’. Stop words can be filtered from the text to be processed. There is no universal list of stop words in nlp research, however the nltk module contains a list of stop words.
Tokenizing words and Sentences
Tokenization is the process by which big quantity of text is divided into smaller parts called tokens.
Natural language processing is used for building applications such as Text classification, intelligent chatbot, sentimental analysis, language translation, etc. It becomes vital to understand the pattern in the text to achieve the above-stated purpose. These tokens are very useful for finding such patterns as well as is considered as a base step for stemming and lemmatization.
Wordnet
Part of the NLTK Corpora is WordNet. I wouldn’t totally classify WordNet as a Corpora, if anything it is really a giant Lexicon, but, either way, it is super useful. With WordNet we can do things like look up words and their meaning according to their parts of speech, we can find synonyms, antonyms, and even examples of the word in use.