<!--- Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->
Token classification
Fine-tuning the library models for token classification tasks such as Named Entity Recognition (NER) or Part-of-speech tagging (POS). The main script run_ner.py leverages the 🤗 Datasets library and the Trainer API. You can easily customize it if you need extra processing on your datasets.
It will run either on a dataset hosted on our hub or on your own text files for training and validation.
The following example fine-tunes BERT on CoNLL-2003:
```bash
python run_ner.py \
  --model_name_or_path bert-base-uncased \
  --dataset_name conll2003 \
  --output_dir /tmp/test-ner \
  --do_train \
  --do_eval
```
Alternatively, you can just run the bash script run.sh.
To run on your own training and validation files, use the following command:
```bash
python run_ner.py \
  --model_name_or_path bert-base-uncased \
  --train_file path_to_train_file \
  --validation_file path_to_validation_file \
  --output_dir /tmp/test-ner \
  --do_train \
  --do_eval
```
Note: This script only works with models that have a fast tokenizer (backed by the 🤗 Tokenizers library), as it uses special features of those tokenizers. You can check if your favorite model has a fast tokenizer in this table; if it doesn't, you can still use the old version of the script.
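To illustrate why a fast tokenizer matters here: labels are given per word, but the model sees sub-word tokens, so each token has to be traced back to the word it came from. Fast tokenizers expose that mapping. The sketch below shows the kind of alignment this makes possible, using a hand-written `word_ids` list in place of real tokenizer output. The function name `align_labels`, the example values and the choice of `-100` as the ignored label are assumptions for illustration and are not taken from run_ner.py.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # assumption: label value that the loss function skips


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    label_all_tokens: bool = False,
) -> List[int]:
    """Map one label per word onto one label per sub-word token."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens (e.g. [CLS], [SEP]) belong to no word.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word:
            # The first sub-word of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later sub-words either repeat the label or are ignored.
            aligned.append(word_labels[word_id] if label_all_tokens else IGNORE_INDEX)
        previous_word = word_id
    return aligned


# Hypothetical input: three words, the second split into two sub-words,
# wrapped in two special tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 3, 0]
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Without the word mapping, the position of each word's first sub-word would have to be reconstructed by re-tokenizing word by word, which is slower and more error-prone.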
Old version of the script
You can find the old version of the PyTorch script here.
TensorFlow version
The following examples are covered in this section:
- NER on the GermEval 2014 (German NER) dataset
- Emerging and Rare Entities task: WNUT’17 (English NER) dataset
Details and results for the fine-tuning provided by @stefan-it.
GermEval 2014 (German NER) dataset
Data (Download and pre-processing steps)
Data can be obtained from the GermEval 2014 shared task page.
Here are the commands for downloading and pre-processing the train, dev and test datasets. The original data format has four (tab-separated) columns; in a pre-processing step, only the two relevant columns (token and outer span NER annotation) are extracted:
```bash
curl -L 'https://drive.google.com/uc?export=download&id=1Jjhbal535VVz2ap4v4r_rN1UEHTdLK5P' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
curl -L 'https://drive.google.com/uc?export=download&id=1ZfRcQThdtAR5PPRjIDtrVP7BtXSCUBbm' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
curl -L 'https://drive.google.com/uc?export=download&id=1u9mb7kNJHWQCWyweMDRMuTFoOHOfeBTH' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
```
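For readers less familiar with the shell tools, here is a rough Python equivalent of the `grep -v "^#" | cut -f 2,3 | tr '\t' ' '` part of the pipeline. It is only a sketch: the function name `extract_token_and_label` and the sample rows are made up for illustration, and the download step is left out.

```python
from typing import Iterable, Iterator


def extract_token_and_label(lines: Iterable[str]) -> Iterator[str]:
    """Drop comment lines and keep columns 2 and 3, joined by a space."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            continue  # grep -v "^#"
        fields = line.split("\t")
        if len(fields) == 1:
            # Like cut, pass through lines without a tab (e.g. the blank
            # lines that separate sentences) unchanged.
            yield line
        else:
            yield " ".join(fields[1:3])  # cut -f 2,3 | tr '\t' ' '


# Made-up sample in the four-column, tab-separated layout described above.
sample = [
    "# hypothetical comment line\n",
    "1\tSchartau\tB-PER\tO\n",
    "2\tsagte\tO\tO\n",
    "\n",
]
converted = list(extract_token_and_label(sample))
print(converted)
```

The blank lines matter: they are the sentence boundaries that the later steps rely on.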
The GermEval 2014 dataset contains some strange "control character" tokens like '\x96', '\u200e', '\x95', '\xad' or '\x80'. One problem with these tokens is that BertTokenizer returns an empty token for them, resulting in misaligned InputExamples. The preprocess.py script located in the scripts folder a) filters these tokens and b) splits longer sentences into smaller ones (once the maximum subtoken length is reached).
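The two steps can be sketched as follows. This is not the preprocess.py script itself, only an illustration of the idea. The real script uses the model's tokenizer, whereas the sketch takes any `count_subwords` callable, and the toy counter, function names and sample lines below are assumptions.

```python
from typing import Callable, Iterable, List


def split_and_filter(
    lines: Iterable[str],
    count_subwords: Callable[[str], int],
    max_subwords: int,
) -> List[str]:
    """Drop tokens that yield no sub-words and split over-long sentences."""
    out = []
    used = 0  # sub-words emitted so far in the current sentence
    for line in lines:
        line = line.rstrip()
        if not line:
            # Blank line: existing sentence boundary, reset the budget.
            out.append("")
            used = 0
            continue
        token = line.split()[0]
        n = count_subwords(token)
        if n == 0:
            continue  # a) the tokenizer returns nothing: filter the line
        if used + n > max_subwords:
            out.append("")  # b) budget exceeded: start a new sentence
            used = 0
        out.append(line)
        used += n
    return out


def toy_count(token: str) -> int:
    """Hypothetical stand-in for a tokenizer: one sub-word per 4 characters."""
    if token == "\u200e":
        return 0
    return (len(token) + 3) // 4


sample = ["Das O", "\u200e O", "Bundesministerium B-ORG", "tagt O", "", "Ende O"]
processed = split_and_filter(sample, toy_count, max_subwords=5)
print(processed)
```

In this toy run the control-character line disappears and a blank line is inserted before "Bundesministerium", because adding its sub-words would exceed the budget of 5.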
Let's define some variables that we need for further pre-processing steps and training the model:
```bash
export MAX_LENGTH=128
export BERT_MODEL=bert-base-multilingual-cased
```
Run the pre-processing script on the training, dev and test datasets:
```bash
python3 scripts/preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
python3 scripts/preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
python3 scripts/preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
```
The GermEval 2014 dataset has many more labels than the CoNLL-2002/2003 datasets, so its own set of labels must be used:
```bash
cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$" | sort | uniq > labels.txt
```
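The label extraction above can also be expressed in Python, which may help when adapting it to other data. This is a minimal sketch that assumes every non-blank line has the form "token label". The function name `collect_labels` and the sample lines are illustrative, and Python sorts by code point, which can differ from the locale-dependent order of `sort`.

```python
from typing import Iterable, List


def collect_labels(lines: Iterable[str]) -> List[str]:
    """Return the sorted set of labels found in the second column."""
    labels = set()
    for line in lines:
        fields = line.rstrip("\n").split(" ")
        if len(fields) < 2 or not fields[1]:
            continue  # blank sentence separators carry no label
        labels.add(fields[1])
    return sorted(labels)


# Made-up sample in the "token label" layout produced by the steps above.
sample = ["Das O\n", "Berlin B-LOC\n", "\n", "Merkel B-PER\n", "sagt O\n"]
labels = collect_labels(sample)
print("\n".join(labels))
```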
Prepare the run
Additional environment variables must be set:
```bash
export OUTPUT_DIR=germeval-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=750
export SEED=1
```