Milestones are typically location designators (e.g. chapter numbers, line numbers, or groups of sentences) that help you identify structural divisions within documents.
Note
The milestones submodule provides useful functionality for other features of Lexos and will eventually be made into a separate module. For now, it can be accessed as a submodule of Rolling Windows.
Since milestones can span multiple tokens, each token in the document is classified using the IOB method (also used by spaCy's named entity recognition component). The value I assigned to milestone_iob indicates that a token is "inside" (part of) a milestone. The value O indicates that the token is "outside" of (not part of) a milestone. The value B indicates that the token is the "beginning" of (the first token in) a milestone. The milestone_label attribute provides a text representation of the combined tokens. Note, however, that by default it is truncated after twenty characters. Its main function is thus as a point of reference for the user.
Note
Custom attributes in spaCy are accessed with the ._. prefix, so the milestone_iob and milestone_label values for the first token in the document would be accessed with ms.doc[0]._.milestone_iob and ms.doc[0]._.milestone_label.
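The IOB scheme described above can be illustrated without spaCy. The following is a minimal sketch, not the Lexos implementation: the helper name iob_tags, the token list, and the span offsets are all hypothetical, and spans are assumed to be half-open (start, end) token ranges.

```python
from typing import List, Tuple


def iob_tags(n_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Return one IOB tag per token for half-open (start, end) token spans."""
    tags = ["O"] * n_tokens  # every token starts "outside" a milestone
    for start, end in spans:
        tags[start] = "B"  # first token of the milestone
        for i in range(start + 1, end):
            tags[i] = "I"  # remaining tokens are "inside" the milestone
    return tags


tokens = ["Chapter", "1", "It", "is", "a", "truth", "Chapter", "2", "However"]
milestone_spans = [(0, 2), (6, 8)]  # hypothetical "Chapter N" milestones
tags = iob_tags(len(tokens), milestone_spans)

for token, tag in zip(tokens, tags):
    print(f"{token:<8} {tag}")
```

Each two-token milestone is tagged B then I, and every other token is tagged O.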
If you already have a doc with milestone attributes, you can simply initialise a Milestones object using that doc, and the milestone_iob and milestone_label attributes will be available. If you have used the Milestones class to create these attributes, in most cases you will want to replace your original doc with the one in your Milestones object with doc = ms.doc.
lexos.milestones.helpers mostly consists of deprecated functions. The only one currently used is lexos.milestones.helpers.ensure_list. The deprecated functions are not documented below.
Wraps any input in a list if it is not already a list.
def ensure_list(input: Any) -> list

| Parameter | Description | Required |
|---|---|---|
| input: Any | An input variable. | Yes |
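The behaviour described for ensure_list can be sketched in a few lines. This is an illustration of the described behaviour, not the library's source; the name ensure_list_sketch is hypothetical, and it assumes that only genuine list instances are passed through unchanged.

```python
from typing import Any


def ensure_list_sketch(item: Any) -> list:
    """Wrap item in a list unless it is already a list."""
    return item if isinstance(item, list) else [item]


print(ensure_list_sketch("chapter"))
print(ensure_list_sketch(["chapter", "book"]))
```

Under this assumption a tuple or a string is wrapped as a single element rather than converted item by item.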
Get a list of Milestones objects from a list of docs. This function may be deprecated.

def get_multiple_milestones(docs: List[spacy.tokens.doc.Doc], nlp: str = "xx_sent_ud_sm", patterns: Any = None, case_sensitive: bool = True, mode: str = None, skip_token: bool = False, remove_token: bool = False, split_lines: bool = False, split_sentences: bool = False, step: int = None, remove_milestone: bool = True) -> List[Milestones]

| Parameter | Description | Required |
|---|---|---|
| docs: List[spacy.tokens.doc.Doc] | A list of spaCy Doc objects. | Yes |
| nlp: str | The name of a spaCy language model. Default is xx_sent_ud_sm. | No |
| patterns: Any | The list of patterns to match milestone spans or line breaks. If nothing is supplied, get_line_spans() will use the default pattern for line breaks. Default is None. | No |
| case_sensitive: bool | Whether to use case sensitive matching. Default is True. | No |
| mode: str | The mode to use for token matching. Default is None. | No |
| skip_token: bool | Set milestone start to the token following the milestone span. Default is False. | No |
| remove_token: bool | Set milestone start to the token following the milestone span and remove the milestone span. Default is False. | No |
| split_lines: bool | Use set_line_spans() instead of set_milestones(). Default is False. | No |
| split_sentences: bool | Use set_sentence_spans() instead of set_milestones(). Default is False. | No |
| step: int | The number of lines or sentences to include in the spans. By default, all are included. Default is None. | No |
| remove_milestone: bool | Whether or not to remove the linebreak using split_lines. Default is True. | No |
Creates a Milestones object. The object has the property spans, which returns the value of Milestones.doc.spans["milestones"].

class Milestones(doc: spacy.tokens.doc.Doc, *, nlp: str = "xx_sent_ud_sm", patterns: Any = None, case_sensitive: bool = True)

| Attribute | Description | Required |
|---|---|---|
| doc: spacy.tokens.doc.Doc | A spaCy Doc object. | Yes |
| nlp: str | The name of a spaCy language model. Default is xx_sent_ud_sm. | No |
| patterns: Any | A pattern or list of patterns to match to milestones. Default is None. | No |
| case_sensitive: bool | Whether to use case sensitive matching. Default is True. | No |
Returns a generator over Milestones.spans.

def __iter__(self)

Assign token attributes in the doc based on spans.

def _assign_token_attributes(self, spans: List[spacy.tokens.span.Span])

| Parameter | Description | Required |
|---|---|---|
| spans: List[spacy.tokens.span.Span] | A list of spaCy Span objects. | Yes |
Autodetect mode for matching milestones if not supplied (experimental). Returns a string to supply to the mode parameter of lexos.milestones.Milestones.get_matches.
def _autodetect_mode(self, patterns: Any) -> str

| Parameter | Description | Required |
|---|---|---|
| patterns: Any | The pattern(s) to match. | Yes |
Get matches to milestone patterns in strings. Returns a list of spaCy spans matching the pattern.
def _get_string_matches(self, patterns: Any, flags: Enum) -> List[spacy.tokens.Span]

| Parameter | Description | Required |
|---|---|---|
| patterns: Any | The pattern(s) to match. | Yes |
| flags: Enum | An enum containing Python re flags. | Yes |
Get matches to milestone patterns in phrases. Returns a list of spaCy spans matching the pattern.
def _get_phrase_matches(self, patterns: Any, attr: str = "ORTH") -> List[spacy.tokens.Span]

| Parameter | Description | Required |
|---|---|---|
| patterns: Any | The pattern(s) to match. | Yes |
| attr: str | A string indicating the spaCy token attribute to match. Default is ORTH. | No |
Get matches to milestone patterns with spaCy rules. Returns a list of spaCy spans matching the pattern.

def _get_rule_matches(self, patterns: Any) -> List[spacy.tokens.Span]

| Parameter | Description | Required |
|---|---|---|
| patterns: Any | The pattern(s) to match. | Yes |
Remove duplicate spans, generally created when a pattern is added.
def _remove_duplicate_spans(self, spans: List[spacy.tokens.Span]) -> List[spacy.tokens.Span]

| Parameter | Description | Required |
|---|---|---|
| spans: List[spacy.tokens.Span] | A list of spaCy Span objects. | Yes |
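Duplicate removal of this kind can be sketched with plain tuples. How Lexos decides that two spans are duplicates is not stated here, so this sketch assumes that spans with the same start and end offsets are duplicates and that the original order should be kept; the function name and the (start, end) tuples are hypothetical stand-ins for spaCy Span objects.

```python
from typing import List, Tuple


def remove_duplicate_spans_sketch(
    spans: List[Tuple[int, int]]
) -> List[Tuple[int, int]]:
    """Drop repeated (start, end) spans, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for span in spans:
        if span not in seen:
            seen.add(span)
            unique.append(span)
    return unique


spans = [(0, 2), (6, 8), (0, 2)]  # the first span was matched twice
unique_spans = remove_duplicate_spans_sketch(spans)
print(unique_spans)
```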
Set the object's case sensitivity.
def _set_case_sensitivity(self, case_sensitive: bool = True)

| Parameter | Description | Required |
|---|---|---|
| case_sensitive: bool | Whether or not to perform case-sensitive searching. Default is True. | No |
Convert a re.Match object to a spaCy Span object.

def _to_spacy_span(self, match: Match) -> spacy.tokens.Span

| Parameter | Description | Required |
|---|---|---|
| match: re.Match | A re.Match object. | Yes |
Add patterns. Note that the resulting patterns are unsorted. Depending on what you are doing, you may need to call ms.patterns = sorted(ms.patterns).
def add(self, patterns: Any, mode: str = "string") -> None

| Parameter | Description | Required |
|---|---|---|
| patterns: Any | The pattern(s) to match. | Yes |
| mode: str | The mode to use for matching. Default is string. | No |
Get matches to milestone patterns. Returns a list of spaCy spans matching the pattern.
def get_matches(self, patterns: Any = None, mode: str = None, case_sensitive: bool = True)

| Parameter | Description | Required |
|---|---|---|
| patterns: Any | The pattern(s) to match. Default is None. | No |
| mode: str | The mode to use for matching: string (match milestone patterns in the document text), phrase (match milestone patterns in phrases), rule (match milestone patterns with spaCy rules), or sentence (match milestone patterns in sentences). Default is None. | No |
| case_sensitive: bool | Whether to use case sensitive matching. Default is True. | No |
The mode parameter identifies the function to use for matching patterns. The string mode matches character sequences in the document's text. The phrase mode matches token sequences in the document using spaCy's Phrase Matcher. The rule mode matches a spaCy Rule Matcher pattern. The sentence mode works somewhat differently: it returns a list of the sentences in the document. Since it uses spaCy's sentence detection component, it will only work if that component is available in the selected language model. If no mode is provided, Lexos will attempt to auto-detect the most appropriate mode based on the pattern.
Pattern matching may not work as desired in RTL languages like Arabic and Hebrew. Some functions to handle RTL languages have been prototyped but are not part of this version of Milestones.
Tip
The string mode matches patterns using regular expressions, which may occasionally cause mismatches. For instance, matching "Mr. Darcy" will return matches to "Mrs Darcy" since "." indicates any single character in regular expressions. Typically, this problem can be avoided by selecting the phrase mode.
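The mismatch described in this tip can be reproduced with Python's re module alone. The sample text is invented for illustration. Escaping the pattern with re.escape is shown only as a general regular-expression technique; whether it suits a given Lexos workflow is an assumption, and the phrase mode remains the approach recommended above.

```python
import re

text = "Mr. Darcy bowed. Mrs Darcy smiled."

# Unescaped: "." matches any single character, so "Mrs Darcy" also matches.
unescaped = re.findall(r"Mr. Darcy", text)

# Escaped: re.escape() makes "." match only a literal full stop.
escaped = re.findall(re.escape("Mr. Darcy"), text)

print(unescaped)
print(escaped)
```

The unescaped pattern returns both names, while the escaped pattern returns only the literal "Mr. Darcy".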
Caution
Calling Milestones.get_matches() will overwrite any pre-existing patterns. If you wish to add patterns to existing ones, use the Milestones.add() method, which updates the list of patterns and sets the milestones matching both the previous and the new patterns. You can also remove patterns with the Milestones.remove() method. Both methods accept the mode parameter. Finally, you can clear the pattern list by calling the Milestones.reset() method. This will also reset all milestone_iob values to "O" and all milestone_label values to empty strings.
Remove patterns.
def remove(self, patterns: Any, mode: str = "string") -> None

| Parameter | Description | Required |
|---|---|---|
| patterns: Any | The pattern(s) to match. | Yes |
| mode: str | The mode to use for matching. Default is string. | No |
Reset all milestone values to defaults. Does not modify patterns or any other settings.
def reset(self)

Generate spans based on a custom list. Returns a list of spaCy spans.

def set_custom_spans(self, spans: List[spacy.tokens.Span], step: int = None, type: str = "custom") -> List[spacy.tokens.Span]

| Parameter | Description | Required |
|---|---|---|
| spans: List[spacy.tokens.Span] | A list of spaCy Span objects to group into milestones. | Yes |
| step: int | The number of spans to group into each milestone span. By default, all spans are included. Default is None. | No |
| type: str | The type of span used. Default is custom. | No |
Generate spans based on line breaks. Returns a list of spaCy spans.
def set_line_spans(self, pattern: str = r".+?\n", step: int = None, remove_milestone: bool = True) -> List[spacy.tokens.Span]

| Parameter | Description | Required |
|---|---|---|
| pattern: str | The string or regex pattern to use to identify the milestone. Default is r".+?\n". | No |
| step: int | The number of spans to group into each milestone span. By default, all lines are included. Default is None. | No |
| remove_milestone: bool | Whether or not to remove the line break character. Default is True. | No |
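The default pattern and the step parameter can be explored with a plain-Python sketch that works on strings rather than spaCy spans. The function group_lines and its remove_linebreak argument are hypothetical, and the sketch assumes that an unset step means one line per group, which may differ from what Lexos does.

```python
import re
from typing import List, Optional


def group_lines(
    text: str,
    pattern: str = r".+?\n",
    step: Optional[int] = None,
    remove_linebreak: bool = True,
) -> List[List[str]]:
    """Match lines with the pattern, then group them step lines at a time."""
    lines = [m.group() for m in re.finditer(pattern, text)]
    if remove_linebreak:
        lines = [line.rstrip("\n") for line in lines]
    size = step or 1  # assumption: no step means one line per group
    return [lines[i : i + size] for i in range(0, len(lines), size)]


text = "one\ntwo\nthree\nfour\nfive\n"
groups = group_lines(text, step=2)
print(groups)
```

Because the pattern requires a trailing line break, a final line without one is not matched by this regular expression, and empty lines are skipped.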
Commit milestones to the object instance.
def set_milestones(self, spans: List[spacy.tokens.span.Span], skip_token: bool = False, remove_token: bool = False) -> None

| Parameter | Description | Required |
|---|---|---|
| spans: List[spacy.tokens.span.Span] | The span(s) to use for identifying token attributes. | Yes |
| skip_token: bool | Set milestone start to the token following the milestone span. Default is False. | No |
| remove_token: bool | Set milestone start to the token following the milestone span and remove the milestone span. Default is False. | No |
Generate spans with n sentences per span. Returns a list of spaCy spans.
def set_sentence_spans(self, step: int = None) -> List[spacy.tokens.Span]

| Parameter | Description | Required |
|---|---|---|
| step: int | The number of sentences to group into each milestone span. By default, all sentences are included. Default is None. | No |
Get a list of milestone dictionaries. Some language models include a final punctuation mark in the token string, particularly at the end of a sentence. The strip_punct argument is a somewhat hacky convenience for removing it. However, users may prefer to post-process the output themselves for their own purposes.

def to_list(self, strip_punct: bool = True) -> List[dict]

| Parameter | Description | Required |
|---|---|---|
| strip_punct: bool | Strip single punctuation mark at the end of the character string. Default is True. | No |
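Users who prefer to post-process the output themselves could strip a single trailing punctuation mark along these lines. This is a sketch under assumptions: the helper strip_final_punct is hypothetical, and it treats only the ASCII characters in string.punctuation as punctuation, which may be narrower than what Lexos strips.

```python
import string


def strip_final_punct(label: str) -> str:
    """Remove one trailing punctuation character, if there is one."""
    if label and label[-1] in string.punctuation:
        return label[:-1]
    return label


print(strip_final_punct("Chapter 1."))
print(strip_final_punct("Chapter 1"))
```

Only one character is removed, so a label ending in several punctuation marks keeps all but the last.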
Generate a characters to tokens mapping. Returns a dictionary mapping character indexes to token indexes.
def chars_to_tokens(doc: spacy.tokens.doc.Doc) -> Dict[int, int]

| Parameter | Description | Required |
|---|---|---|
| doc: spacy.tokens.doc.Doc | A spaCy Doc object. | Yes |
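A character-to-token mapping of this kind can be sketched without spaCy by representing each token as a (start character, text) pair, which corresponds to a spaCy token's idx and text attributes. The function name and the sample tokens are hypothetical, and the sketch assumes that characters between tokens, such as spaces, are left out of the mapping.

```python
from typing import Dict, List, Tuple


def chars_to_tokens_sketch(tokens: List[Tuple[int, str]]) -> Dict[int, int]:
    """Map every character index covered by a token to that token's index."""
    mapping: Dict[int, int] = {}
    for token_index, (start_char, text) in enumerate(tokens):
        for char_index in range(start_char, start_char + len(text)):
            mapping[char_index] = token_index
    return mapping


# "Hi there" tokenised as "Hi" (characters 0-1) and "there" (characters 3-7)
tokens = [(0, "Hi"), (3, "there")]
mapping = chars_to_tokens_sketch(tokens)
print(mapping)
```

A mapping like this makes it possible to turn the character offsets of a regular-expression match into token offsets.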
Converts a spaCy Matcher rule to lower case. Performs the same function as rollingwindows.calculators.spacy_rule_to_lower.
def spacy_rule_to_lower(patterns: Union[Dict, List[Dict]], old_key: Union[List[str], str] = ["TEXT", "ORTH"], new_key: str = "LOWER") -> list

| Parameter | Description | Required |
|---|---|---|
| patterns: Union[Dict, List[Dict]] | A spaCy Matcher pattern: a token dictionary or a list of token dictionaries. | Yes |
| old_key: Union[List[str], str] | A dictionary key or list of keys to rename. Default is ["TEXT", "ORTH"]. | No |
| new_key: str | The new key name. Default is LOWER. | No |
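The key renaming can be sketched on plain dictionaries. This is not the Lexos source: the name rule_to_lower_sketch is hypothetical, and the sketch assumes that string values are lower-cased when their key is renamed, while non-string values (such as nested dictionaries) are left unchanged.

```python
from typing import Dict, List, Sequence, Union


def rule_to_lower_sketch(
    patterns: Union[Dict, List[Dict]],
    old_key: Union[Sequence[str], str] = ("TEXT", "ORTH"),
    new_key: str = "LOWER",
) -> list:
    """Rename old_key entries to new_key in each token dictionary."""
    if isinstance(patterns, dict):
        patterns = [patterns]
    old_keys = [old_key] if isinstance(old_key, str) else list(old_key)
    result = []
    for token_pattern in patterns:
        converted = {}
        for key, value in token_pattern.items():
            if key in old_keys:
                # Assumption: lower-case string values so they match LOWER.
                converted[new_key] = value.lower() if isinstance(value, str) else value
            else:
                converted[key] = value
        result.append(converted)
    return result


rule = [{"TEXT": "Chapter"}, {"IS_DIGIT": True}]
lowered = rule_to_lower_sketch(rule)
print(lowered)
```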
Applies a filter to a document and returns a new document. This function is a duplicate of rollingwindows.filters.filter_doc.
def filter_doc(input: Union[List[spacy.tokens.span.Span], spacy.tokens.doc.Doc], n: int = 1000, window_units: str = "characters", alignment_mode: str = "strict") -> Iterator

| Parameter | Description | Required |
|---|---|---|
| doc: spacy.tokens.doc.Doc | A spaCy Doc object. | Yes |
| keep_ids: int | A list of spaCy Token ids to keep in the filtered Doc. | Yes |
| spacy_attrs: List[str] | A list of spaCy Token attributes to keep in the filtered Doc. Default is the SPACY_ATTRS list imported with util.* | No |
| force_ws: bool | Force a whitespace at the end of every token except the last. Default is True. | No |

* The default list of spaCy token attributes can be inspected by calling util.SPACY_ATTRS.
Converts a spaCy Doc object into a numpy array. This function is a duplicate of rollingwindows.filters.get_doc_array.
def get_doc_array(doc: spacy.tokens.doc.Doc, spacy_attrs: List[str] = SPACY_ATTRS, force_ws: bool = True) -> np.ndarray

| Parameter | Description | Required |
|---|---|---|
| doc: spacy.tokens.doc.Doc | A spaCy Doc object. | Yes |
| spacy_attrs: List[str] | A list of spaCy Token attributes to keep in the filtered Doc. Default is the SPACY_ATTRS list imported with util.* | No |
| force_ws: bool | Force a whitespace at the end of every token except the last. Default is True. | No |

* The default list of spaCy token attributes can be inspected by calling util.SPACY_ATTRS.
The following options are available for handling whitespace:
- force_ws=True ensures that token_with_ws and whitespace_ attributes are preserved, but all tokens will be separated by whitespaces in the text of a doc created from the array.
- force_ws=False with SPACY in spacy_attrs preserves the token_with_ws and whitespace_ attributes and their original values. This may cause tokens to be merged if subsequent processing operates on the doc.text.
- force_ws=False without SPACY in spacy_attrs does not preserve the token_with_ws and whitespace_ attributes or their values. By default, doc.text displays a single space between each token.
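The effect of forced whitespace on the resulting text can be illustrated with a small sketch. It models only the text reconstruction, not the numpy array or the spaCy attributes, and the function rebuild_text and the (text, has trailing space) token pairs are hypothetical.

```python
from typing import List, Tuple


def rebuild_text(tokens: List[Tuple[str, bool]], force_ws: bool) -> str:
    """Join tokens, either forcing a space after each one or keeping the original spacing."""
    parts = []
    last = len(tokens) - 1
    for i, (text, has_space) in enumerate(tokens):
        # Forced: a space after every token except the last.
        # Otherwise: only where the original token had trailing whitespace.
        space = (i != last) if force_ws else has_space
        parts.append(text + (" " if space else ""))
    return "".join(parts)


tokens = [("Do", False), ("n't", True), ("go", False), (".", False)]
forced = rebuild_text(tokens, force_ws=True)
original = rebuild_text(tokens, force_ws=False)
print(forced)
print(original)
```

With forced whitespace every token is separated from the next, whereas the original spacing lets adjacent tokens such as "Do" and "n't" run together in the text.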