Statistical models
- predict linguistic attributes in context:
- Part-of-speech tags
- Syntactic dependencies
- Named entities
Model Packages
Download:
python -m spacy download en_core_web_sm
Use:
nlp = spacy.load("en_core_web_sm")
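Once a model package is loaded, the pipeline predicts the attributes listed above. A small usage sketch (the example sentence is only illustrative):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)  # part-of-speech tag and dependency label
for ent in doc.ents:
    print(ent.text, ent.label_)                # named entities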
Training
https://spacy.io/usage/training
Language data
Every language is different and full of exceptions and special cases.
Some exceptions are shared across languages, while others are specific to a particular language.
Language modules
- Python files
- contain all data specific to a given language
- easy to update and extend
- / – general language data – rules that can be generalized across languages (basic punctuation, emoji, emoticons, single-letter abbreviations and norms for equivalent tokens with different spellings, like “ and ”)
- /en, /cs, … – language data relevant to a particular language
from spacy.lang.en import English
from spacy.lang.de import German

nlp_en = English()  # Includes English data
nlp_de = German()   # Includes German data
Files
- __init__.py – a language subclass
- stop_words.py – Stop words – list of the most common words of a language that are often useful to filter out (and, i); matching tokens get is_stop = True
- tokenizer_exceptions.py – Tokenizer exceptions – special-case rules for the tokenizer (can’t, U.K.)
- norm_exceptions.py – Norm exceptions – rules for normalizing tokens to improve the model’s predictions (American vs. British spelling)
- punctuation.py – Punctuation rules – regular expressions for splitting tokens
- char_classes.py – Character classes – character classes to be used in regular expressions
- lex_attrs.py – Lexical attributes – custom functions for setting lexical attributes on tokens (ten, hundred)
- syntax_iterators.py – Syntax iterators – functions that compute views of a Doc object based on its syntax, e.g. noun chunks
- tag_map.py – Tag map – dictionary mapping strings in your tag set to Universal Dependencies tags
- morph_rules.py – Morph rules – exception rules for morphological analysis of irregular words like personal pronouns
- examples.py – Example sentences – to test spaCy and its language models
|                         | global | en | cs | sk |
|-------------------------|--------|----|----|----|
| __init__.py             | –      | ✓  | ✓  | ✓  |
| stop_words.py           | –      | ✓  | ✓  | ✓  |
| tokenizer_exceptions.py | ✓      | ✓  | –  | –  |
| norm_exceptions.py      | ✓      | –  | –  | –  |
| punctuation.py          | ✓      | –  | –  | –  |
| char_classes.py         | ✓      | –  | –  | –  |
| lex_attrs.py            | ✓      | ✓  | ✓  | ✓  |
| syntax_iterators.py     | –      | ✓  | –  |    |
| tag_map.py              | ✓      | ✓  | –  | ✓  |
| morph_rules.py          | –      | ✓  | –  | –  |
| examples.py             | –      | ✓  | ✓  | ✓  |
- spacy-lookups-data – Lemmatizer – lemmatization rules or a lookup-based lemmatization table to assign base forms (be, was)
Adding Languages
- all language data is stored in regular Python files
- you’ll need to modify the library’s code
- clone the repository and build spaCy from source
- create a Language subclass
- define custom language data (a stop list and tokenizer exceptions) and test the new tokenizer
- build the vocabulary (including word frequencies, Brown clusters and word vectors)
- train the tagger and parser and save the model to a directory
- For some languages, you may also want to develop a solution for lemmatization and morphological analysis.
Creating a language subclass
- file: __init__.py (see the sketch below)
- folder: spacy/lang/cs
- import: spacy.lang.cs
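A minimal sketch of such a subclass, following the spaCy v2 adding-languages guide (the CzechDefaults name and the exact contents are assumptions):
# spacy/lang/cs/__init__.py (sketch)
from ...attrs import LANG
from ...language import Language
from .stop_words import STOP_WORDS

class CzechDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: "cs"  # language ID stored on the vocab
    stop_words = STOP_WORDS

class Czech(Language):
    lang = "cs"
    Defaults = CzechDefaults

__all__ = ["Czech"]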
Stop words
# file stop_words.py
STOP_WORDS = set("""
a about above across after afterwards again against all almost alone along
...
""".split())
- separated by spaces and newlines, and added as a multiline string
Tokenizer exceptions
- define special-case rules
from ...symbols import ORTH, NORM

TOKENIZER_EXCEPTIONS = {
    "don't": [
        {ORTH: "do"},
        {ORTH: "n't", NORM: "not"},
    ]
}
See spaCy: Tokenizer exceptions
Norm exceptions
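A sketch of what norm_exceptions.py can contain, modeled on the adding-languages guide (mapping variant spellings to a normalized form):
# norm_exceptions.py (sketch)
NORM_EXCEPTIONS = {
    "cos": "because",
    "fave": "favorite",
    "accessorise": "accessorize",
    "accessorised": "accessorized",
}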
Lexical attributes
- file: lex_attrs.py
- is_lower, like_url, like_num
https://spacy.io/usage/adding-languages#lex-attrs
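A sketch of a like_num override in lex_attrs.py, following the pattern from the guide linked above (the Czech number-word list is partial and only illustrative):
# lex_attrs.py (sketch)
from ...attrs import LIKE_NUM

_num_words = ["nula", "jedna", "dva", "tři", "čtyři", "pět",
              "šest", "sedm", "osm", "devět", "deset"]

def like_num(text):
    # strip common thousands/decimal separators before checking for digits
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text.lower() in _num_words

LEX_ATTRS = {LIKE_NUM: like_num}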
Syntax iterators
https://spacy.io/usage/adding-languages#syntax-iterators
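For reference, a simplified noun_chunks iterator in the style of spaCy’s English syntax_iterators.py (the dependency labels are an assumed subset, and the real implementation also avoids yielding overlapping spans):
# syntax_iterators.py (simplified sketch)
from ...symbols import NOUN, PROPN, PRON

def noun_chunks(doclike):
    labels = ["nsubj", "dobj", "obj", "ROOT"]  # assumed subset of labels that can head a noun chunk
    doc = doclike.doc
    np_deps = [doc.vocab.strings.add(label) for label in labels]
    np_label = doc.vocab.strings.add("NP")
    for word in doclike:
        if word.pos in (NOUN, PROPN, PRON) and word.dep in np_deps:
            yield word.left_edge.i, word.i + 1, np_label

SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}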
Lemmatizer
- converting words to their base form
- The data is stored in a dictionary mapping a string to its lemma. To determine a token’s lemma, spaCy simply looks it up in the table.
File en_lemma_lookup.json:
{
    …
    "cars": "car",
    …
    "horses": "horse",
    "horseshoes": "horseshoe",
    …
}
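Conceptually, the lookup behaves like a plain dictionary access (a toy illustration, not spaCy’s actual API):
lemma_lookup = {"cars": "car", "horses": "horse", "horseshoes": "horseshoe"}

def lemmatize(token_text):
    # look the token up in the table; fall back to the token text itself
    return lemma_lookup.get(token_text, token_text)

print(lemmatize("horses"))  # horse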
Adding JSON resources
- resources for the lemmatizer are stored as JSON
- a separate repository and package: github.com: explosion/spacy-lookups-data
- exposes the data files via language-specific entry points that spaCy reads when constructing the Vocab and Lookups
- If you want to use the lookup tables without a pretrained model, you have to explicitly install spaCy with lookups via pip install spacy[lookups] or by installing spacy-lookups-data in the same environment as spaCy.
- tables provided for the lemmatizer:
  - lemma_rules
  - lemma_index
  - lemma_exc
  - lemma_lookup
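To check which tables were actually loaded from the entry points, you can inspect the Lookups object (a sketch, assuming spaCy v2.2+ with spacy-lookups-data installed):
import spacy

nlp = spacy.blank("en")
print(nlp.vocab.lookups.tables)  # should include "lemma_lookup" if the data package is installed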
Tag map
Most treebanks define a custom part-of-speech tag scheme, striking a balance between level of detail and ease of prediction. While it’s useful to have custom tagging schemes, it’s also useful to have a common scheme, to which the more specific tags can be related. The tagger can learn a tag scheme with any arbitrary symbols. However, you need to define how those symbols map down to the Universal Dependencies tag set. This is done by providing a tag map.
https://spacy.io/usage/adding-languages#tag-map
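A sketch of a tag map, following the format described in the guide above (the treebank tags shown are illustrative):
# tag_map.py (sketch)
from ...symbols import POS, NOUN, VERB, ADJ, PUNCT

TAG_MAP = {
    "NN": {POS: NOUN},   # treebank tag "NN" maps to the universal NOUN tag
    "VB": {POS: VERB},
    "JJ": {POS: ADJ},
    ".": {POS: PUNCT},
}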
Morph rules
https://spacy.io/usage/adding-languages#morph-rules
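A sketch of morph_rules.py in the format used by the English data (outer key = fine-grained tag, inner key = token text, value = morphological features; the features shown are illustrative):
# morph_rules.py (sketch)
MORPH_RULES = {
    "PRP": {
        "I": {"PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Nom"},
        "me": {"PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc"},
    }
}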
Language-specific tests
- directory: tests/lang
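A possible shape of such a test, assuming a cs_tokenizer fixture is registered in the test suite’s conftest.py in the same way as the existing language fixtures:
# tests/lang/cs/test_tokenizer.py (hypothetical)
import pytest

@pytest.mark.parametrize("text,expected", [("Ahoj, světe!", ["Ahoj", ",", "světe", "!"])])
def test_cs_tokenizer_splits_punctuation(cs_tokenizer, text, expected):
    tokens = cs_tokenizer(text)
    assert [t.text for t in tokens] == expected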
Training a language model
- Much of spaCy’s functionality requires models to be trained from labeled data
- named entity recognizer – train a model on text annotated with examples of the entities you want to recognize
- part-of-speech tagger and text categorizer – require models to be trained from labeled examples
- word vectors, word probabilities and word clusters – also require training, but can be trained from unlabeled text, which tends to be much easier to collect
Creating a vocabulary file
- spaCy expects that common words will be cached in a Vocab instance.
Models and training data
JSON input format for training
- convert the .conllu format (universaldependencies.org) to spaCy’s JSON training format – the convert command
- convert Doc objects to spaCy’s JSON format – the gold.docs_to_json helper (see the sketches below)
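Sketches of both routes (file names and paths are hypothetical; the spaCy v2 API is assumed):
# CLI: convert a CoNLL-U corpus to spaCy's JSON training format
# python -m spacy convert ./cs-ud-train.conllu ./converted/

# Python: convert Doc objects with gold.docs_to_json
import spacy
import srsly
from spacy.gold import docs_to_json

nlp = spacy.load("en_core_web_sm")
docs = [nlp("Apple is looking at buying U.K. startup.")]
srsly.write_json("./training-data.json", [docs_to_json(docs)])  # the training file is a list of documents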
Structure:
- id (int) – ID of the document within the corpus
- paragraphs (array) – list of paragraphs in the corpus
  - raw (string) – raw text of the paragraph
  - sentences (array) – list of sentences in the paragraph
    - tokens (array) – list of tokens in the sentence
      - id (int) – index of the token in the document
      - dep (string) – dependency label
      - head (int) – offset of token head relative to token index
      - tag (string) – part-of-speech tag
      - orth (string) – verbatim text of the token
      - ner (string) – BILUO label (e.g. "O" or "B-ORG")
    - brackets (array) – phrase structure (NOT USED by current models)
      - first (int) – index of first token
      - last (int) – index of last token
      - label (string) – phrase label
  - cats (array) – categories for the text classifier
    - label (string) – text category label
    - value (float / bool) – label applies (1.0/true) or not (0.0/false)
Example: https://github.com/explosion/spaCy/blob/master/examples/training/training-data.json
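A minimal, hand-written illustration of the structure (values are made up and only the first token is shown):
[{
    "id": 0,
    "paragraphs": [{
        "raw": "Apple is looking at buying U.K. startup.",
        "sentences": [{
            "tokens": [
                {"id": 0, "orth": "Apple", "tag": "NNP", "head": 2, "dep": "nsubj", "ner": "U-ORG"}
            ],
            "brackets": []
        }],
        "cats": [{"label": "NEWS", "value": 1.0}]
    }]
}]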
Sources
- spaCy: Training
- spaCy: Language data
- spaCy: Adding Languages
- spaCy: Annotation Specifications: Models and training data