Dig Out Relevant Text Elements with Entity Extraction API

Ankit Singh
January 13, 2017

Named Entity Recognition, also known as entity extraction, classifies the named entities in a text into predefined categories such as individuals, companies, places, organizations and cities. It is a subtask of information extraction and one of the basic starting points for using natural language processing to augment your content. Extracting key entities such as person names, locations, dates, specialized terms and product terminology from raw text enables organizations not only to improve keyword search but also to pave the way for semantic search, targeted search and document repurposing. Named Entity Recognition adds semantic knowledge to your content and helps you quickly understand the subject of any given text.
Our Named Entity Recognition API uses deep learning to learn representations of character groupings and identifies the most relevant entities in your text with high accuracy. Try our Named Entity Recognition demo.

How our Named Entity Recognition API works

Our API is built on deep learning. Here is a brief description of the technology behind it:

  • Word embeddings are trained on a large text corpus that our crawling infrastructure collects from the open web. Such embeddings can be trained with either the GloVe or the Word2Vec algorithm; we use GloVe embeddings in production. Each word is converted into a dense 100-dimensional vector, and the neural network we train takes these embeddings as input instead of the raw words.
  • Our internal data tagging team annotated a large dataset of the entities present in the crawled data. For example, the sentence "This is a house that Jack built" is annotated with (Jack, Person), and "Ram and Shyam are going to Delhi" is annotated with (Ram, Person), (Shyam, Person) and (Delhi, Place). Our internal dataset has over 200,000 such annotated sentences.
  • We then train a bidirectional LSTM for sequence labeling on the tagged dataset to predict whether each word in a sentence is an entity or not. An LSTM (Long Short-Term Memory network) is a type of RNN that mitigates the vanishing gradient problem by updating its cell state additively instead of through the repeated multiplication of a plain recurrence.
  • We also tried adding an attention layer to the LSTM to see whether it helps identify the properties of a sentence that mark a word as an entity. We are still refining the attention model; the model in production is an LSTM without attention.
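
The data preparation and the recurrent update described in the bullets above can be sketched in a few lines of Python. This is a minimal, standard-library-only illustration and not the production code: `label_tokens`, `toy_embedding` and `lstm_step` are hypothetical names, the embedding is a seeded random stand-in for a trained GloVe vector, entities are assumed to be single tokens, and the LSTM step works on scalars to keep the arithmetic visible.

```python
import math
import random

EMBEDDING_DIM = 100  # vector size stated in the article


def label_tokens(sentence, annotations):
    """Give every token a label; tokens without an annotation get "O" (outside)."""
    entity_types = dict(annotations)
    return [(token, entity_types.get(token, "O")) for token in sentence.split()]


def toy_embedding(word, dim=EMBEDDING_DIM):
    """Hypothetical stand-in for a trained GloVe vector: seeded random values."""
    rng = random.Random(word)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, weights):
    """One step of a scalar LSTM cell. weights: gate name -> (w_x, w_h, bias)."""

    def gate(name, squash):
        w_x, w_h, bias = weights[name]
        return squash(w_x * x + w_h * h_prev + bias)

    forget = gate("forget", sigmoid)
    write = gate("input", sigmoid)
    read = gate("output", sigmoid)
    candidate = gate("candidate", math.tanh)
    # The old cell state is carried forward by addition instead of being pushed
    # through another squashing function, which eases vanishing gradients.
    c = forget * c_prev + write * candidate
    h = read * math.tanh(c)
    return h, c


labels = label_tokens(
    "Ram and Shyam are going to Delhi",
    [("Ram", "Person"), ("Shyam", "Person"), ("Delhi", "Place")],
)
vectors = [toy_embedding(token) for token, _ in labels]
```

A bidirectional model would run such a step once left to right and once right to left over the embedded sentence, then combine both hidden states for each word before predicting its label.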

Of the annotated data, 10% was held out for testing and the remaining 90% was used for training. Our neural network model attains over 90% accuracy in extracting entities.
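
A 90/10 split like the one described above might look as follows. The shuffle seed and the function names are assumptions, and because the article does not say how accuracy is measured, `token_accuracy` assumes the simplest reading: the share of tokens whose predicted label matches the annotation.

```python
import random


def split_train_test(examples, test_fraction=0.10, seed=0):
    """Shuffle a copy of the examples and hold out test_fraction of them."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted label equals the annotated label."""
    pairs = [
        (g, p)
        for gold_sentence, predicted_sentence in zip(gold, predicted)
        for g, p in zip(gold_sentence, predicted_sentence)
    ]
    return sum(g == p for g, p in pairs) / len(pairs)


# Sentence indices stand in for the 200,000 annotated sentences.
train, test = split_train_test(range(200_000))
print(len(train), len(test))
```
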
For a better understanding of how the entities are extracted from a piece of text, here is an example:

Example

Input

In 2015 Harry Styles tweeted nonchalantly about Monopoly and we noticed the RRP of One Direction’s official Monopoly game sky-rocket by 125%.
Forbes estimates that Kim Kardashian West has made $51m from her enormous social media following through sponsorship deals. She is quoted: “There’s a lot of value in social media, and people really get that.” We tend to agree, Kim. The “all publicity is good publicity” mantra probably doesn’t work when the President-Elect of the United States is tweeting about canceling an order from your company worth millions of dollars.
We’re living in unprecedented times with an unprecedented President-Elect.

Output
{
  "entities": [
    ["United States", 1.0, ["place"], "http://dbpedia.org/resource/United_States"],
    ["Monopoly", 0.9965510937353969, "", ""],
    ["Harry Styles", 0.9800905556827882, ["person"], ""],
    ["Kim Kardashian West", 0.9309083455558312, ["person"], ""],
    ["Forbes", 0.6556073703283326, ["creative work", "written work"], "http://dbpedia.org/resource/Forbes"]
  ]
}
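
Each entry in the output is a positional array, so a client will probably want to unpack it into named fields. The sketch below does that for part of the sample response. The field names (`text`, `confidence`, `types`, `link`) are assumptions inferred from the sample; the article does not name them, and the second value is only presumed to be a confidence or relevance score.

```python
import json
from typing import NamedTuple


class Entity(NamedTuple):
    text: str
    confidence: float  # assumed meaning of the second array element
    types: list
    link: str


def parse_entities(raw):
    """Convert the positional arrays in the response into named records."""
    entities = []
    for text, confidence, types, link in json.loads(raw)["entities"]:
        # Untyped entities carry "" instead of a list in the sample output.
        entities.append(Entity(text, confidence, list(types) if types else [], link))
    return entities


sample = """{"entities": [
  ["United States", 1.0, ["place"], "http://dbpedia.org/resource/United_States"],
  ["Monopoly", 0.9965510937353969, "", ""],
  ["Forbes", 0.6556073703283326, ["creative work", "written work"], "http://dbpedia.org/resource/Forbes"]
]}"""

for entity in parse_entities(sample):
    print(entity.text, round(entity.confidence, 2), entity.types or "untyped")
```

Normalizing the empty string to an empty list means downstream code can always iterate over `types` without a special case.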

If you work with a large corpus every day, Named Entity Recognition can save you considerable effort. Entity extraction can help with content-related problems in several ways:

  • Use automatically generated metadata for your content to improve SEO.
  • Identify the trends associated with your brand, product or service and group them by person or place to improve your overall social listening.
  • Extract key entities from user queries, such as product names and service requests, to analyze the most frequently used terms. This is a form of intent analysis.
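
As a rough illustration of the frequency analysis in the last point, the sketch below counts how often each entity appears across several responses shaped like the sample output. The responses are hand-written for the example, and the function name and the 0.9 confidence cut-off are arbitrary assumptions.

```python
from collections import Counter


def most_frequent_entities(responses, min_confidence=0.9, top_n=3):
    """Count entity mentions across responses, ignoring low-confidence ones."""
    counts = Counter()
    for response in responses:
        for text, confidence, _types, _link in response["entities"]:
            if confidence >= min_confidence:
                counts[text] += 1
    return counts.most_common(top_n)


# Hypothetical responses, one per analyzed document or user query.
responses = [
    {"entities": [["Monopoly", 0.99, "", ""], ["Forbes", 0.65, ["written work"], ""]]},
    {"entities": [["Monopoly", 0.97, "", ""], ["Harry Styles", 0.98, ["person"], ""]]},
]

print(most_frequent_entities(responses))
# [('Monopoly', 2), ('Harry Styles', 1)]
```

The same counts could feed metadata tags for SEO or a per-entity trend report, depending on which of the use cases above applies.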

Publishing organizations make the most significant use of Named Entity Recognition, as the media industry is moving quickly to semantic publishing. Learn more about semantic publishing here.

Feedback

Tell us what you think of our Named Entity Recognition. We would love to have your feedback.
Leave a comment and share your thoughts.
