Amazon Comprehend: Custom Entity Recognition

To use Amazon Comprehend’s custom entity recognition:

  • Provide a data set for model training: either a set of annotated documents, or a list of entities with their type labels (such as PRODUCT_CODES) plus a set of documents containing those entities.
  • You can train a model on up to 12 custom entities at once.
  • Once the model is trained, each entity detection job can search for up to 12 of the custom entities you trained for.
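A minimal sketch of how such a training request might be assembled in Python. It assumes the boto3 Comprehend client and its create_entity_recognizer call; the recognizer name, role ARN and S3 URIs are hypothetical placeholders, and the limit of 12 entity types is taken from the notes above, not checked against the live service. The builder only constructs the request dictionary, so it runs without AWS access.

```python
MAX_ENTITY_TYPES = 12  # limit as stated in these notes (assumption)


def build_training_request(name, role_arn, entity_types, documents_uri,
                           entity_list_uri=None, annotations_uri=None,
                           language_code="en"):
    """Build keyword arguments for Comprehend's create_entity_recognizer."""
    if not 1 <= len(entity_types) <= MAX_ENTITY_TYPES:
        raise ValueError("expected between 1 and 12 entity types")
    # The notes describe an either/or choice: annotations or an entity list.
    if (entity_list_uri is None) == (annotations_uri is None):
        raise ValueError("provide exactly one of entity list or annotations")
    input_config = {
        "EntityTypes": [{"Type": t} for t in entity_types],
        "Documents": {"S3Uri": documents_uri},
    }
    if entity_list_uri is not None:
        input_config["EntityList"] = {"S3Uri": entity_list_uri}
    else:
        input_config["Annotations"] = {"S3Uri": annotations_uri}
    return {
        "RecognizerName": name,
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": input_config,
        "LanguageCode": language_code,
    }


def start_training(request):
    """Submit the request; needs boto3 and AWS credentials (not run here)."""
    import boto3  # imported lazily so the builder works without boto3
    client = boto3.client("comprehend")
    return client.create_entity_recognizer(**request)["EntityRecognizerArn"]


request = build_training_request(
    name="product-codes-recognizer",                        # hypothetical
    role_arn="arn:aws:iam::123456789012:role/Example",      # hypothetical
    entity_types=["PRODUCT_CODES"],
    documents_uri="s3://example-bucket/documents.txt",      # hypothetical
    entity_list_uri="s3://example-bucket/entity_list.csv",  # hypothetical
)
```

Calling start_training(request) would submit the job; it is kept separate so the request can be inspected or validated before anything is sent to AWS.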

Training Custom Entity Recognizers

  • Custom entity recognition lets you analyze your documents to find entities specific to your needs, rather than being limited to the preset entity types already available.
  • You can identify almost any kind of entity by providing enough detail to train your model effectively.

Two Ways to Provide Data to Amazon Comprehend

 

  • Annotations
    • Uses an annotation list that gives the location of your entities in a large number of documents, so Amazon Comprehend can train on both the entity and its context.
    • Increases accuracy, because the annotations give more precise context for the custom entity you are seeking.
    • Useful when the meaning of an entity could be ambiguous and context-dependent.
      • For example, "Amazon" could refer either to the river in Brazil or to the online retailer Amazon.com.
  • Entity Lists
    • Provides only limited context: Amazon Comprehend trains on just a list of the specific entities in order to identify the custom entity.
    • Supplied as a comma-separated values (CSV) file with two columns:
      • Text – The text of an entry example exactly as seen in the accompanying document corpus
        • e.g. Brno
      • Type – The customer-defined entity type.
        • An uppercase, underscore-separated string such as MANAGER or SENIOR_MANAGER
        • Up to 12 entity types can be trained per model.
        • A minimum of 200 entity matches are needed per entity
        • e.g. CITY
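The entity list format described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Comprehend's own validation: the Text,Type header row follows the column names in the notes, the regular expression is one reading of "uppercase, underscore separated", and the match counter uses plain substring counting, which may differ from how the service counts matches. The sample documents are made up.

```python
import csv
import io
import re
from collections import Counter

MAX_ENTITY_TYPES = 12        # limit as stated in these notes
MIN_MATCHES_PER_TYPE = 200   # minimum as stated in these notes
# Assumed reading of "uppercase, underscore separated" (e.g. SENIOR_MANAGER).
TYPE_PATTERN = re.compile(r"^[A-Z]+(_[A-Z]+)*$")


def write_entity_list(rows):
    """Return CSV text with a Text,Type header for (text, type) pairs."""
    types = {entity_type for _, entity_type in rows}
    if len(types) > MAX_ENTITY_TYPES:
        raise ValueError(f"too many entity types: {len(types)}")
    for entity_type in types:
        if not TYPE_PATTERN.match(entity_type):
            raise ValueError(f"invalid entity type: {entity_type!r}")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Text", "Type"])
    writer.writerows(rows)
    return buffer.getvalue()


def count_matches(rows, documents):
    """Count exact substring occurrences of each entry, summed per type."""
    counts = Counter()
    for text, entity_type in rows:
        counts[entity_type] += sum(doc.count(text) for doc in documents)
    return counts


rows = [("Brno", "CITY"), ("Praha", "CITY")]
documents = [  # made-up sample corpus
    "The train runs from Brno to Praha.",
    "Brno is the second largest city.",
]
csv_text = write_entity_list(rows)
matches = count_matches(rows, documents)
# Types that fall short of the minimum number of matches in the corpus.
too_few = [t for t, n in matches.items() if n < MIN_MATCHES_PER_TYPE]
```

With this tiny corpus the CITY type is reported in too_few, which is the kind of check worth running before uploading the list and documents for training.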

Sources

  • https://docs.aws.amazon.com/comprehend/latest/dg/training-recognizers.html

NLP: List of Language Corpora

CNEC – Czech Named Entity Corpus

  • Project website | Corpus download
  • Institute of Formal and Applied Linguistics (ÚFAL), Charles University
  • A corpus of 8,993 Czech sentences containing 35,220 manually annotated named entities, classified according to a two-level hierarchy of 46 named entity types
  • ??? information extraction: word roles in a sentence (subject and predicate)

CZES corpus