Token Classification

Token classification is a task in natural language processing (NLP) where individual tokens (words or subwords) within a text are each assigned a label. It helps in understanding the structure and meaning of sentences. Common examples include Named Entity Recognition (NER), where the goal is to find and label entities like dates, names, and locations in text, and Part-of-Speech (PoS) tagging, which involves identifying whether words are nouns, verbs, adjectives, etc.

Task ID

  • token-classification
  • ner

Default Model

  • Xenova/bert-base-multilingual-cased-ner-hrl

Use Cases

Token classification can be applied in various scenarios, including but not limited to:

  • Information Extraction from Invoices: Extracting specific entities like dates, company names, and amounts from scanned invoice documents.
  • Content Organization and Discovery: Enhancing search functionality by tagging content with entities like locations, person names, or dates.
  • Automated Content Tagging: Assigning relevant tags to articles or products for better categorization or recommendation.
  • Language Learning Tools: Developing educational software that helps learners understand sentence structure and word usage.

Running a Pipeline Session

To use the Token Classification pipeline, you'll need to provide a piece of text. Here's an example:

php
use function Codewithkyrian\Transformers\Pipelines\pipeline;

$ner = pipeline('token-classification', 'Xenova/bert-base-NER');

$output = $ner('My name is Kyrian and I live in Onitsha');

Pipeline Input Options

When running the token-classification pipeline, you can pass the following options:

  • texts (string|array)

    The sentence(s) to classify. It's the first argument so there's no need to pass it as a named argument. You can pass a single string or an array of strings. When passing an array, the pipeline will return predictions for each sentence in the array.

    php
    $output = $ner(['My name is Kyrian and I live in Onitsha', 'I am a developer']);
  • ignoreLabels (string[])

    [Optional] An array of labels to ignore. This is useful when you want to exclude certain labels from the model's predictions. The default value is ['O'], which means tokens labeled 'O' (i.e., tokens that are not part of any named entity) are left out of the output. You can pass an empty array to include all labels.

    php
    $output = $ner('My name is Kyrian and I live in Onitsha', ignoreLabels: []);
  • aggregationStrategy (string)

    [Optional] Determines how tokens that have been split (subword tokenization) and belong to the same entity are grouped in the output. The default strategy is NONE, which means no aggregation is performed. It is passed as a PHP enum and the possible values are AggregationStrategy::NONE, AggregationStrategy::FIRST, AggregationStrategy::AVERAGE, and AggregationStrategy::MAX. You can also pass the string representation of the enum value.

    php
    $output = $ner('My name is Kyrian and I live in Onitsha', aggregationStrategy: 'none');

    Because tokenization can split words into subwords or characters, aggregation strategies help to group these tokens meaningfully. Take the word "Onitsha" for example. It might be split into "On", "##its", "##ha" by the tokenizer. Also take the phrase "United States of America". It might be split into "United", "States", "of", "America". Here's how each strategy would handle these tokens:

    • AggregationStrategy::NONE: No aggregation is performed. The output will contain the individual tokens and their labels.
      php
      // For "Onitsha"
      [
          // ...
          ['entity' => 'B-LOC', 'word' => 'On', 'score' => 0.9980088367015,],
          ['entity' => 'I-LOC', 'word' => '##its', 'score' => 0.57264213459144,],
          ['entity' => 'I-LOC', 'word' => '##ha', 'score' => 0.99585163659008,]
      ]

      // For "United States of America"
      [
          // ...
          ['entity' => 'B-LOC', 'word' => 'United', 'score' => 0.99959621338567,],
          ['entity' => 'I-LOC', 'word' => 'States', 'score' => 0.99930135657091,],
          ['entity' => 'I-LOC', 'word' => 'of', 'score' => 0.99910850633584,],
          ['entity' => 'I-LOC', 'word' => 'America', 'score' => 0.99851260224595,]
      ]
    • AggregationStrategy::FIRST: Subwords are merged back into words, and consecutive tokens that belong to the same entity (the first carrying a B- tag and the rest I- tags) are grouped together. The score assigned to the group is the score of its first token. Note that the entity key changes to entity_group.
      php
      // For "Onitsha"
      [
          // ...
          ['entity_group' => 'LOC', 'word' => 'Onitsha', 'score' => 0.9980088367015,],
      ]
      
      // For "United States of America"
      [
          // ...
          ['entity_group' => 'LOC', 'word' => 'United States of America', 'score' => 0.99959621338567,],
      ]
    • AggregationStrategy::AVERAGE: Similar to AggregationStrategy::FIRST, but the score assigned to the grouped token will be the average of the scores of the tokens in the group.
      php
      // For "Onitsha"
      [
          // ...
          ['entity_group' => 'LOC', 'word' => 'Onitsha', 'score' => 0.85550086929434,],
      ]
      
      // For "United States of America"
      [
          // ...
          ['entity_group' => 'LOC', 'word' => 'United States of America', 'score' => 0.99912966963459,],
      ]
    • AggregationStrategy::MAX: Similar to AggregationStrategy::FIRST, but the score assigned to the grouped token will be the maximum of the scores of the tokens in the group.
      php
      // For "Onitsha"
      [
          // ...
          ['entity_group' => 'LOC', 'word' => 'Onitsha', 'score' => 0.9980088367015,],
      ]
      
      // For "United States of America"
      [
          // ...
          ['entity_group' => 'LOC', 'word' => 'United States of America', 'score' => 0.99959621338567,],
      ]
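
The grouping and scoring described above can be sketched in plain PHP. This is a minimal illustration and not the library's implementation: the helper names dropIgnored and aggregateTokens are hypothetical, the strategies are passed as the lowercase strings 'first', 'average', and 'max', and the grouping rule (merge "##" subword pieces and I- tags of the same type into the preceding group) is an assumption based on the descriptions in this section.

```php
<?php

// Hypothetical helper: mimics ignoreLabels by dropping tokens with an ignored label.
function dropIgnored(array $tokens, array $ignoreLabels = ['O']): array
{
    return array_values(array_filter(
        $tokens,
        fn (array $token): bool => !in_array($token['entity'], $ignoreLabels, true)
    ));
}

// Hypothetical helper: merges B-/I- tagged tokens and scores each group.
// Assumes every remaining label has the form "B-XXX" or "I-XXX".
function aggregateTokens(array $tokens, string $strategy): array
{
    $groups = [];

    foreach ($tokens as $token) {
        // Split "B-LOC" into the prefix "B" and the entity type "LOC".
        [$prefix, $type] = explode('-', $token['entity'], 2);
        $isSubword = str_starts_with($token['word'], '##');
        $last = count($groups) - 1;

        // A token continues the current group when it has the same entity
        // type and is either a subword piece or carries an I- tag.
        $continues = $last >= 0
            && $groups[$last]['entity_group'] === $type
            && ($isSubword || $prefix === 'I');

        if ($continues) {
            $groups[$last]['word'] .= $isSubword ? substr($token['word'], 2) : ' ' . $token['word'];
            $groups[$last]['scores'][] = $token['score'];
        } else {
            $groups[] = ['entity_group' => $type, 'word' => $token['word'], 'scores' => [$token['score']]];
        }
    }

    // Collapse the collected scores into one score per group.
    return array_map(
        fn (array $group): array => [
            'entity_group' => $group['entity_group'],
            'word' => $group['word'],
            'score' => match ($strategy) {
                'first' => $group['scores'][0],
                'average' => array_sum($group['scores']) / count($group['scores']),
                'max' => max($group['scores']),
            },
        ],
        $groups
    );
}

// Per-token results for "Onitsha" from the NONE example above.
// The leading 'O' token and its score are made up to show the filtering step.
$tokens = [
    ['entity' => 'O', 'word' => 'in', 'score' => 0.99],
    ['entity' => 'B-LOC', 'word' => 'On', 'score' => 0.9980088367015],
    ['entity' => 'I-LOC', 'word' => '##its', 'score' => 0.57264213459144],
    ['entity' => 'I-LOC', 'word' => '##ha', 'score' => 0.99585163659008],
];

$entities = aggregateTokens(dropIgnored($tokens), 'average');
echo $entities[0]['word'], PHP_EOL;
```

Passing an unknown strategy string makes the match expression throw an UnhandledMatchError, which keeps the sketch short; a real implementation would more likely validate the value up front.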

Pipeline Outputs

The output of the pipeline is an array of predictions, each containing the predicted entity, the word, the confidence score, and optionally the index of the word. The entity labels themselves vary depending on the model used.
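
As a rough sketch of how such an array might be consumed, the snippet below formats each prediction as a line of text. The describePrediction helper is hypothetical, and the sample data is hard-coded from the results shown on this page rather than produced by a live pipeline call. It reads entity_group when present and falls back to entity, since the key depends on whether an aggregation strategy was used.

```php
<?php

// Hypothetical helper: formats one prediction as "word (label): score".
function describePrediction(array $item): string
{
    // Aggregated results use 'entity_group'; unaggregated ones use 'entity'.
    $label = $item['entity_group'] ?? $item['entity'];

    return sprintf('%s (%s): %.4f', $item['word'], $label, $item['score']);
}

// Sample predictions copied from this page; in practice this would be
// the return value of the pipeline call.
$output = [
    ['entity_group' => 'PER', 'score' => 0.99431570686513, 'word' => 'Kyrian'],
    ['entity_group' => 'LOC', 'score' => 0.9980088367015, 'word' => 'Onitsha'],
];

foreach ($output as $item) {
    echo describePrediction($item), PHP_EOL;
}
```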

Named Entity Recognition (NER)

NER can be used to identify things like names of people, locations, organizations, dates, and more. The typical labels include:

| Abbreviation | Description |
|--------------|-------------|
| O | Outside of a named entity |
| B-MISC | Beginning of a miscellaneous entity right after another entity |
| I-MISC | Miscellaneous entity within the same entity group |
| B-PER | Beginning of a person’s name right after another person’s name |
| I-PER | Person’s name within the same entity group |
| B-ORG | Beginning of an organization right after another organization |
| I-ORG | Organization name within the same entity group |
| B-LOC | Beginning of a location right after another location |
| I-LOC | Location name within the same entity group |

php
$ner = pipeline('token-classification', 'Xenova/bert-base-NER');

$output = $ner('My name is Kyrian and I live in Onitsha', aggregationStrategy: 'max');
Output:
php
[
  ["entity_group" => "PER", "score" => 0.99431570686513, "word" => "Kyrian"],
  ["entity_group" => "LOC", "score" => 0.9980088367015, "word" => "Onitsha"],
]
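
A common next step with aggregated NER results is to collect the recognized words by entity type. The following is a small sketch under that assumption; groupByEntity is a hypothetical helper, and the input is the aggregated result shown above, hard-coded so the snippet runs on its own.

```php
<?php

// Hypothetical helper: maps each entity type to the list of words found for it.
function groupByEntity(array $predictions): array
{
    $grouped = [];

    foreach ($predictions as $prediction) {
        // Appending to a missing key creates the inner array automatically.
        $grouped[$prediction['entity_group']][] = $prediction['word'];
    }

    return $grouped;
}

// Aggregated NER results copied from the example above.
$predictions = [
    ['entity_group' => 'PER', 'score' => 0.99431570686513, 'word' => 'Kyrian'],
    ['entity_group' => 'LOC', 'score' => 0.9980088367015, 'word' => 'Onitsha'],
];

$byType = groupByEntity($predictions);

echo implode(', ', $byType['LOC']), PHP_EOL;
```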

Part-of-Speech (PoS) Tagging

PoS models are trained to identify parts of speech, such as nouns, pronouns, verbs, adjectives, etc., in a given text. The typical labels include:

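| Abbreviation | Description |
|--------------|-------------|
| NOUN | Noun |
| AUX | Auxiliary verb |
| PROPN | Proper noun |
| PRON | Pronoun |
| VERB | Verb |
| ADP | Adposition (preposition and postposition) |
| CCONJ | Coordinating conjunction |
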
php
$pos = pipeline('token-classification', 'codewithkyrian/bert-english-uncased-finetuned-pos');

$output = $pos('My name is Kyrian and I live in Onitsha', aggregationStrategy: 'max');
Output:
php
[
    ["entity_group" => "PRON", "word" => "my", "score" => 0.99482086393966],
    ["entity_group" => "NOUN", "word" => "name", "score" => 0.95769686675798],
    ["entity_group" => "AUX", "word" => "is", "score" => 0.97602109098715],
    ["entity_group" => "PROPN", "word" => "kyrian", "score" => 0.96583783664597],
    ["entity_group" => "CCONJ", "word" => "and", "score" => 0.98444884455349],
    ["entity_group" => "PRON", "word" => "i", "score" => 0.99566682068677],
    ["entity_group" => "VERB", "word" => "live", "score" => 0.98391136480035],
    ["entity_group" => "ADP", "word" => "in", "score" => 0.99580186695928],
    ["entity_group" => "PROPN", "word" => "onitsha", "score" => 0.91250281394515],
]
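
PoS results like these are often filtered down to the tags of interest. As a minimal sketch, the hypothetical wordsWithTags helper below picks out the words whose tag is in a given list; the input is a subset of the output above, hard-coded so the snippet does not depend on a model download.

```php
<?php

// Hypothetical helper: returns the words whose PoS tag is in $tags, in order.
function wordsWithTags(array $predictions, array $tags): array
{
    $matches = array_filter(
        $predictions,
        fn (array $prediction): bool => in_array($prediction['entity_group'], $tags, true)
    );

    // array_column returns a freshly indexed list of the 'word' values.
    return array_column($matches, 'word');
}

// A subset of the PoS output shown above.
$predictions = [
    ['entity_group' => 'PRON', 'word' => 'my', 'score' => 0.99482086393966],
    ['entity_group' => 'NOUN', 'word' => 'name', 'score' => 0.95769686675798],
    ['entity_group' => 'AUX', 'word' => 'is', 'score' => 0.97602109098715],
    ['entity_group' => 'PROPN', 'word' => 'kyrian', 'score' => 0.96583783664597],
    ['entity_group' => 'VERB', 'word' => 'live', 'score' => 0.98391136480035],
    ['entity_group' => 'PROPN', 'word' => 'onitsha', 'score' => 0.91250281394515],
];

$nouns = wordsWithTags($predictions, ['NOUN', 'PROPN']);

echo implode(', ', $nouns), PHP_EOL;
```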