Blogs about Deep Learning, Machine Learning, AI, NLP, Security, Oracle Traffic Director, Oracle iPlanet WebServer

spaCy - Named Entity and Dependency Parsing Visualizers

I was looking for pre-trained models that read text and automatically extract entities such as cities, places, times, and dates. Training a model from scratch is time-consuming and needs a lot of data, so if somebody has already done the work, why not reuse it?

Named-entity recognition (NER), also known as entity identification, entity chunking, and entity extraction, is a sub-task of information extraction that seeks to locate and classify named entities in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

spaCy is a leading open-source library for advanced NLP, and several of its models ship with excellent pre-trained named-entity recognizers. Note that I used the "en_core_web_sm" model here. I have read that some spaCy models are case-sensitive.
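As a rough illustration of how the pre-trained recognizer can be called from Python, here is a minimal sketch. The function names `extract_entities` and `group_by_label` are my own and not part of spaCy; the sketch assumes spaCy and the "en_core_web_sm" model package are installed.

```python
from collections import defaultdict


def extract_entities(text, model_name="en_core_web_sm"):
    """Return (entity text, label) pairs found by a pre-trained spaCy pipeline."""
    # Imported lazily so the grouping helper below works without spaCy installed.
    import spacy

    # Assumes the model was installed first, e.g.:
    #   python -m spacy download en_core_web_sm
    nlp = spacy.load(model_name)
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


def group_by_label(pairs):
    """Group entity strings under their label, keeping first-seen order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:  # skip repeated mentions
            grouped[label].append(text)
    return dict(grouped)
```

Calling `group_by_label(extract_entities(article_text))` on your own text gives a dictionary keyed by entity label, which makes it easier to review what the model got right and wrong.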

I ran the text of a random news article through the "displaCy Named Entity Visualizer" to extract its named entities. You can look at the results in the link here.

Here is the output for the paragraph I entered in the tool:

NamedEntityExtraction

The spaCy documentation explains these entity types:

  • PERSON (People, including fictional): It incorrectly classified "AI", "CAGR", and "Tencent" as persons in our context.
  • NORP (Nationalities or religious or political groups): It classified "Asian" and "Chinese" correctly as nationalities.
  • GPE (Countries, cities, states): It classified country "U.S." correctly but misclassified "Alibaba" and "AI" in our context.
  • ORG (Companies, agencies, institutions etc): It classified "Baidu", "Google", "IBM", and "Microsoft" correctly.
  • CARDINAL (Numerals that do not fall under another type): It classified "one" and "three" correctly.
  • PERCENT (Percentage, including "%"): "45%", "50%" and "65%" were classified correctly.
  • DATE (Absolute or relative dates or periods): "2017" was classified correctly.

For the entity types that were classified incorrectly, we need to retrain the model using our own contextual data as the training set.
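To give an idea of what such training data can look like, here is a minimal sketch. spaCy's NER training annotations mark each entity by character offsets, in the form `(text, {"entities": [(start, end, label)]})`. The helper `make_ner_example` and the sample sentence are hypothetical, and the sketch only prepares the data; it does not run the training itself.

```python
def make_ner_example(text, labeled_spans):
    """Build a (text, {"entities": [...]}) pair using character offsets.

    labeled_spans is a list of (phrase, label) pairs. Only the first
    occurrence of each phrase in the text is annotated.
    """
    entities = []
    for phrase, label in labeled_spans:
        start = text.find(phrase)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in text")
        entities.append((start, start + len(phrase), label))
    return text, {"entities": sorted(entities)}


# Hypothetical sentence; the labels correct the kind of mistakes noted above
# (companies that were tagged as PERSON or GPE).
example = make_ner_example(
    "Tencent and Alibaba are investing heavily in AI.",
    [("Tencent", "ORG"), ("Alibaba", "ORG")],
)
```

A real training set would need many such examples drawn from the target domain, mixed with examples the model already gets right so that it does not forget them.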

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and other tokens).

A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between "head" words and words which modify those heads.
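The tagger and parser output can also be read programmatically. The sketch below is only an illustration: `parse_rows` and `modifiers_by_head` are names I made up, the sample rows are hand-annotated rather than real spaCy output, and the first function assumes spaCy and the "en_core_web_sm" model are installed.

```python
def parse_rows(text, model_name="en_core_web_sm"):
    """Return (token, part of speech, dependency label, head token) per token."""
    import spacy  # lazy import; needs spaCy and the model package installed

    nlp = spacy.load(model_name)
    return [(tok.text, tok.pos_, tok.dep_, tok.head.text) for tok in nlp(text)]


def modifiers_by_head(rows):
    """Map each head word to the words attached to it, skipping the root.

    Words are keyed by their text, so repeated words in one sentence
    would be merged; that is acceptable for a small illustration.
    """
    heads = {}
    for text, _pos, dep, head in rows:
        if dep != "ROOT":  # spaCy labels the sentence root "ROOT"
            heads.setdefault(head, []).append(text)
    return heads


# Hand-annotated rows for "Google builds fast chips" (illustrative only).
sample_rows = [
    ("Google", "PROPN", "nsubj", "builds"),
    ("builds", "VERB", "ROOT", "builds"),
    ("fast", "ADJ", "amod", "chips"),
    ("chips", "NOUN", "dobj", "builds"),
]
sample_heads = modifiers_by_head(sample_rows)
```

Here "builds" is the head of "Google" and "chips", and "chips" is in turn the head of "fast", which is the head/modifier relationship described above.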

The figure below shows a snapshot of the dependency parse of the paragraph above. The full image can be viewed in the dependency visualizer here.

DependencyVisualizer

Dependency Parsers can read various forms of plain text input and can output various analysis formats, including part-of-speech tagged text, phrase structure trees, and a grammatical relations (typed dependency) format.

Dependency parsing can be used to solve various complex NLP (Natural Language Processing) problems such as named entity recognition, relation extraction, and translation. For more details on dependency parsing, watch this Stanford video.

Read about Parsey McParseface (and SyntaxNet), an open-source dependency parser, here.

This blog is posted here.
