## Token classification

Fine-tuning the library models for token classification tasks such as Named Entity Recognition (NER) or part-of-speech (POS) tagging. The main script `run_ner.py` leverages the 🤗 Datasets library and the Trainer API. You can easily customize it to your needs if you need extra processing on your datasets.

It will either run on a dataset hosted on our [hub](https://huggingface.co/datasets) or on your own text files for training and validation.

The following example fine-tunes BERT on CoNLL-2003:

```bash
python run_ner.py \
  --model_name_or_path bert-base-uncased \
  --dataset_name conll2003 \
  --output_dir /tmp/test-ner \
  --do_train \
  --do_eval
```

or you can just run the bash script `run.sh`.

To run on your own training and validation files, use the following command:

```bash
python run_ner.py \
  --model_name_or_path bert-base-uncased \
  --train_file path_to_train_file \
  --validation_file path_to_validation_file \
  --output_dir /tmp/test-ner \
  --do_train \
  --do_eval
```

**Note:** This script only works with models that have a fast tokenizer (backed by the 🤗 Tokenizers library), as it uses special features of those tokenizers. You can check if your favorite model has a fast tokenizer in [this table](https://huggingface.co/transformers/index.html#bigtable); if it doesn't, you can still use the old version of the script.

## Old version of the script

You can find the old version of the PyTorch script [here](https://github.com/huggingface/transformers/blob/master/examples/legacy/token-classification/run_ner.py).

### TensorFlow version

The following examples are covered in this section:

* NER on the GermEval 2014 (German NER) dataset
* Emerging and Rare Entities task: WNUT’17 (English NER) dataset

Details and results for the fine-tuning were provided by @stefan-it.

### GermEval 2014 (German NER) dataset

#### Data (Download and pre-processing steps)

Data can be obtained from the [GermEval 2014](https://sites.google.com/site/germeval2014ner/data) shared task page.

Here are the commands for downloading and pre-processing the train, dev and test datasets. The original data format has four (tab-separated) columns; in a pre-processing step, only the two relevant columns (token and outer span NER annotation) are extracted:

```bash
curl -L 'https://drive.google.com/uc?export=download&id=1Jjhbal535VVz2ap4v4r_rN1UEHTdLK5P' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
curl -L 'https://drive.google.com/uc?export=download&id=1ZfRcQThdtAR5PPRjIDtrVP7BtXSCUBbm' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
curl -L 'https://drive.google.com/uc?export=download&id=1u9mb7kNJHWQCWyweMDRMuTFoOHOfeBTH' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
```

The GermEval 2014 dataset contains some strange "control character" tokens like `'\x96', '\u200e', '\x95', '\xad' or '\x80'`. One problem with these tokens is that `BertTokenizer` returns an empty token for them, resulting in misaligned `InputExample`s. The `preprocess.py` script located in the `scripts` folder a) filters these tokens and b) splits longer sentences into smaller ones (once the maximum subtoken length is reached).
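To make that description concrete, here is a minimal, illustrative Python sketch of the same idea, **not** the actual `preprocess.py` shipped in `scripts/`: it drops tokens that the tokenizer maps to nothing and starts a new sentence once the subtoken budget would be exceeded. The hard-coded file name and model checkpoint mirror the `$MAX_LENGTH`/`$BERT_MODEL` values defined below.

```python
# Illustrative sketch only (assumes `pip install transformers` and an existing
# train.txt.tmp); the real pre-processing logic lives in scripts/preprocess.py.
from transformers import AutoTokenizer

model_name = "bert-base-multilingual-cased"  # mirrors $BERT_MODEL below
max_len = 128                                # mirrors $MAX_LENGTH below

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Leave room for the special tokens ([CLS], [SEP]) added at training time.
max_len -= tokenizer.num_special_tokens_to_add()

subword_len_counter = 0
with open("train.txt.tmp", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip()
        if not line:
            # Empty line = sentence boundary: reset the subtoken counter.
            print(line)
            subword_len_counter = 0
            continue

        token = line.split()[0]
        current_subwords_len = len(tokenizer.tokenize(token))

        # a) Filter "control character" tokens that the tokenizer maps to nothing.
        if current_subwords_len == 0:
            continue

        # b) Split the sentence once the maximum subtoken length would be exceeded.
        if (subword_len_counter + current_subwords_len) > max_len:
            print("")
            subword_len_counter = current_subwords_len
            print(line)
            continue

        subword_len_counter += current_subwords_len
        print(line)
```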
Let's define some variables that we need for further pre-processing steps and training the model:

```bash
export MAX_LENGTH=128
export BERT_MODEL=bert-base-multilingual-cased
```

Run the pre-processing script on the training, dev and test datasets:

```bash
python3 scripts/preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
python3 scripts/preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
python3 scripts/preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
```

The GermEval 2014 dataset has many more labels than the CoNLL-2002/2003 datasets, so its own set of labels must be used:

```bash
cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$" | sort | uniq > labels.txt
```

#### Prepare the run

Additional environment variables must be set:

```bash
export OUTPUT_DIR=germeval-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=750
export SEED=1
```
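The `labels.txt` file produced above is what ultimately tells the model how many classes to predict. As a hedged illustration (not part of the example scripts), the sketch below shows roughly how such a file can be turned into the label mappings used when instantiating a token-classification model; the file name and checkpoint mirror the values set above.

```python
# Illustrative sketch only: turn the labels.txt generated above into the
# label <-> id mappings needed by a token-classification model.
from transformers import AutoConfig, AutoModelForTokenClassification

with open("labels.txt", encoding="utf-8") as f:
    labels = [line.strip() for line in f if line.strip()]

label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

config = AutoConfig.from_pretrained(
    "bert-base-multilingual-cased",  # same checkpoint as $BERT_MODEL
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", config=config
)
```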