
spaCy: New language & training a model

Statistical models

  • predict linguistic attributes in context:
    • Part-of-speech tags
    • Syntactic dependencies
    • Named entities

Model Packages

Download:

python -m spacy download en_core_web_sm

Use:

import spacy
nlp = spacy.load("en_core_web_sm")
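
With the model loaded, a quick check of the attributes listed above (the example sentence comes from the spaCy docs):

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.pos_, token.dep_)  # part-of-speech tag, dependency label

for ent in doc.ents:
    print(ent.text, ent.label_)  # named entities, e.g. "Apple" -> ORG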

Training

https://spacy.io/usage/training

Language data

Every language is different, full of exceptions and special cases.
Some exceptions are shared across languages, while others are specific to a given language.

Language modules

github.com: spaCy/spacy/lang

  • Python files
  • contain all the data specific to a given language
    • easy to update and extend
  • directory root – shared language data: rules that can be generalized across languages (basic punctuation, emoji, emoticons, single-letter abbreviations) and norms for equivalent tokens with different spellings, like “ and ”
  • /en, /cs, … – language data relevant to a particular language
from spacy.lang.en import English
from spacy.lang.de import German
nlp_en = English() # Includes English data
nlp_de = German() # Includes German data

Files

  • __init__.py – a language subclass
  • stop_words.py – Stop words: list of the most common words of a language that are often useful to filter out (and, I)
    • is_stop = True
  • tokenizer_exceptions.py – Tokenizer exceptions: special-case rules for the tokenizer (can’t, U.K.)
  • norm_exceptions.py – Norm exceptions: rules for normalizing tokens to improve the model’s predictions (American vs. British spelling)
  • punctuation.py – Punctuation rules: regular expressions for splitting tokens
  • char_classes.py – Character classes: character classes to be used in regular expressions
  • lex_attrs.py – Lexical attributes: custom functions for setting lexical attributes on tokens (ten, hundred)
  • syntax_iterators.py – Syntax iterators: functions that compute views of a Doc object based on its syntax, e.g. noun chunks
  • tag_map.py – Tag map: dictionary mapping strings in your tag set to Universal Dependencies tags
  • morph_rules.py – Morph rules: exception rules for morphological analysis of irregular words like personal pronouns
  • examples.py – Example sentences: used to test spaCy and its language models
  • spacy-lookups-data – Lemmatizer: lemmatization rules or a lookup-based lemmatization table to assign base forms (be, was)

(Table: availability of the files above in the shared global data and per language – en, cs, sk.)

Adding Languages

  • all language data is stored in regular Python files
  • you’ll need to modify the library’s code
  • create a Language subclass
  • define custom language data (a stop list and tokenizer exceptions) and test the new tokenizer
  • build the vocabulary (including word frequencies, Brown clusters and word vectors)
  • train the tagger and parser and save the model to a directory
  • For some languages, you may also want to develop a solution for lemmatization and morphological analysis.

Creating a language subclass

  • file: __init__.py
  • folder: spacy/lang/cs
  • import: spacy.lang.cs
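
A minimal sketch of the subclass, following the spaCy v2 pattern (the STOP_WORDS import assumes the stop_words.py file described below):

# file: spacy/lang/cs/__init__.py
from ...attrs import LANG
from ...language import Language
from .stop_words import STOP_WORDS

class CzechDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: "cs"  # ISO code of the new language
    stop_words = STOP_WORDS

class Czech(Language):
    lang = "cs"
    Defaults = CzechDefaults

__all__ = ["Czech"]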

Stop words

# file stop_words.py
STOP_WORDS = set("""
a about above across after afterwards again against all almost alone along ...
""".split())
  • separated by spaces and newlines, and added as a multiline string

Tokenizer exceptions

from spacy.symbols import ORTH, NORM

TOKENIZER_EXCEPTIONS = {
  "don't": [
    {ORTH: "do"},
    {ORTH: "n't", NORM: "not"}]
}

See spaCy: Tokenizer exceptions

Norm exceptions

See spaCy: Norm exceptions
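
In v2 these live in a plain dict mapping a token’s text to its norm; a short sketch (the pairs are illustrative):

# file: norm_exceptions.py
NORM_EXCEPTIONS = {
    "cos": "because",
    "colour": "color",
    "behaviour": "behavior",
}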

Lexical attributes

  • file: lex_attrs.py
  • is_lower, like_url, like_num

https://spacy.io/usage/adding-languages#lex-attrs
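
A sketch of a custom like_num, modeled on the pattern in the spaCy docs (the word list is truncated):

# file: lex_attrs.py
from spacy.attrs import LIKE_NUM

_num_words = ["zero", "one", "two", "three", "ten", "hundred", "thousand"]

def like_num(text):
    # strip thousands separators and decimal points before the digit check
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:  # fractions like 1/2
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    if text.lower() in _num_words:  # number words
        return True
    return False

LEX_ATTRS = {LIKE_NUM: like_num}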

Syntax iterators

https://spacy.io/usage/adding-languages#syntax-iterators
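
A simplified sketch of the v2 pattern (the real English noun_chunks implementation handles many more cases; the dependency labels below are an illustrative subset):

# file: syntax_iterators.py
from spacy.symbols import NOUN, PROPN, PRON

def noun_chunks(doclike):
    # yields (start, end, label) triples over a dependency-parsed Doc or Span
    np_deps = ("nsubj", "dobj", "pobj", "conj")
    for word in doclike:
        if word.pos in (NOUN, PROPN, PRON) and word.dep_ in np_deps:
            yield word.left_edge.i, word.i + 1, NOUN

SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}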

Lemmatizer

  • conversion of words to their base form
  • The data is stored in a dictionary mapping a string to its lemma. To determine a token’s lemma, spaCy simply looks it up in the table.

file en_lemma_lookup.json:

{
  …
  "cars": "car",
  …
  "horses": "horse",
  "horseshoes": "horseshoe",
  …
}
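
The lookup itself is just a dictionary access; a minimal sketch (the fallback to the surface form mirrors what a lookup lemmatizer does for unknown words):

LOOKUP = {"cars": "car", "horses": "horse", "horseshoes": "horseshoe"}

def lemmatize(text):
    # unknown words fall back to their surface form
    return LOOKUP.get(text, text)

assert lemmatize("horses") == "horse"
assert lemmatize("unicorns") == "unicorns"
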
Adding JSON resources
  • resources for the lemmatizer are stored as JSON
    • a separate repository and package: github.com: explosion/spacy-lookups-data
      • exposes the data files via language-specific entry points that spaCy reads when constructing the Vocab and Lookups
        • If you want to use the lookup tables without a pretrained model, you have to explicitly install spaCy with lookups via pip install spacy[lookups] or by installing spacy-lookups-data in the same environment as spaCy.

  • lemma_rules
  • lemma_index
  • lemma_exc
  • lemma_lookup

More at: https://spacy.io/usage/adding-languages#lemmatizer

Tag map

Most treebanks define a custom part-of-speech tag scheme, striking a balance between level of detail and ease of prediction. While it’s useful to have custom tagging schemes, it’s also useful to have a common scheme, to which the more specific tags can be related. The tagger can learn a tag scheme with any arbitrary symbols. However, you need to define how those symbols map down to the Universal Dependencies tag set. This is done by providing a tag map.
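
A small sketch of such a mapping, following the v2 docs (the tags and features shown are illustrative):

# file: tag_map.py
from spacy.symbols import POS, NOUN, VERB

TAG_MAP = {
    "NNS": {POS: NOUN, "Number": "plur"},  # plural noun -> UD NOUN
    "VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
}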

https://spacy.io/usage/adding-languages#tag-map

Morph rules

https://spacy.io/usage/adding-languages#morph-rules
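
A sketch of the v2 format for personal pronouns (in v2, English pronouns were lemmatized to the shared -PRON- lemma; the features follow the pattern in the docs):

# file: morph_rules.py
from spacy.symbols import LEMMA

MORPH_RULES = {
    "PRP": {
        "I": {LEMMA: "-PRON-", "PronType": "Prs", "Person": "One",
              "Number": "Sing", "Case": "Nom"},
    }
}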

Language-specific tests

  • directory: tests/lang
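
A minimal sketch of such a test; the cs_tokenizer fixture name follows the convention of spaCy’s test suite and is assumed to be provided by a conftest.py:

# file: tests/lang/cs/test_text.py (hypothetical)
import pytest

@pytest.mark.parametrize("text,length", [("Ahoj, světe!", 4)])
def test_cs_tokenizer_handles_punct(cs_tokenizer, text, length):
    tokens = cs_tokenizer(text)
    assert len(tokens) == length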

Training a language model

  • Much of spaCy’s functionality requires models to be trained from labeled data
    • named entity recognizer – train a model on text annotated with examples of the entities you want to recognize
    • part-of-speech tagger and text categorizer – require models to be trained from labeled examples
    • word vectors, word probabilities and word clusters – also require training, but can be trained from unlabeled text, which tends to be much easier to collect

Creating a vocabulary file

  • spaCy expects that common words will be cached in a Vocab instance.
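
In spaCy v2 the vocabulary (and optionally vectors) can be built with the init-model command; the paths and file names below are placeholders:

python -m spacy init-model cs /output/cs_vocab --jsonl-loc vocab.jsonl --vectors-loc vectors.txt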

Models and training data

JSON input format for training

  • convert the .conllu format (universaldependencies.org) to spaCy’s JSON training format – the convert command (see the example below)
  • convert Doc objects to spaCy’s JSON format – the helper gold.docs_to_json
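
For example (file names are placeholders):

python -m spacy convert train.conllu ./train --converter conllu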

Structure:

  • id (int) – ID of the document within the corpus
  • paragraphs (array) – list of paragraphs in the corpus
    • raw (string) – raw text of the paragraph
    • sentences (array) – list of sentences in the paragraph
      • tokens (array) – list of tokens in the sentence
        • id (int) – index of the token in the document
        • dep (string) – dependency label
        • head (int) – offset of token head relative to token index
        • tag (string) – part-of-speech tag
        • orth (string) – verbatim text of the token
        • ner (string) – BILUO label (e.g. “O” or “B-ORG”)
      • brackets (array) – phrase structure (NOT USED by current models)
        • first (int) – index of first token
        • last (int) – index of last token
        • label (string) – phrase label
    • cats (array) – categories for text classifier
      • label (string) – text category label
      • value (float / bool) – label applies (1.0/true) or not (0.0/false)

Example: https://github.com/explosion/spaCy/blob/master/examples/training/training-data.json
