
Semi-Supervised Finetuning of NLP Transformers

Ankush

--

Semi-supervised learning (SSL) has been found to be very useful in computer vision and natural language processing. This article, for example, points to the increasing number of SSL publications over the last decade.

SSL models like the Pi model, temporal ensembling, and mean teacher have been used extensively in computer vision, but, to my knowledge, they have rarely been used in NLP. There are some mentions in the literature of these models not being very useful for NLP. I was curious to see for myself how they would work when a pre-trained transformer is finetuned. The recently published Unsupervised Data Augmentation paper uses pre-trained BERT models, but there the authors only check the performance with respect to advanced data augmentation techniques. In this paper, the authors use ULMFiT with finetuning and a mixture of freezing and unfreezing of hidden layers, but it has not attracted much attention since it was published.

Interestingly, I did find some hyperparameter settings for which SSL helped with an ALBERT model on the CoLA dataset, without any direct data augmentation. In the end, I realised that one might need a much broader hyperparameter search to find consistent results.

Since I already had the code, I decided to convert it into a Python package, ssfinetuning, hoping that it might be useful to someone. In this article, I describe how to use the package and share some of the results I obtained with these models.

Using the ssfinetuning package

This article is also available as a Colab notebook here.

The package supports several semi-supervised models (visit the links for more information).

In the following, we will see how the first three of these models can be used in ssfinetuning.

Installation

!pip3 install ssfinetuning

Downloading the CoLA dataset via Hugging Face's datasets library.

from datasets import load_dataset, load_metric

dataset = load_dataset("glue", "cola")
metric = load_metric('glue', "cola")
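
As a quick sanity check (not part of the original walkthrough), you can inspect the splits and try out the metric. GLUE's CoLA comes with train, validation, and test splits, and its metric is the Matthews correlation coefficient:

# each CoLA example has a 'sentence', a 'label' (1 = grammatically
# acceptable, 0 = unacceptable) and an 'idx'
print(dataset)
print(dataset['train'][0])

# the GLUE metric for CoLA is the Matthews correlation coefficient
print(metric.compute(predictions=[0, 1, 1], references=[0, 1, 0]))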

Finetuning with PiModel
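
As a quick refresher, the Pi model (Laine & Aila, 2017) adds a consistency loss that penalises disagreement between two stochastic forward passes (for example, with dropout active) over the same unlabeled input, weighted by an unsupervised weight that is ramped up during training. Below is a minimal sketch of that idea; it is not the package's actual implementation, and model is assumed to return logits:

import torch
import torch.nn.functional as F

def pi_consistency_loss(model, x_unlabeled):
    # with dropout active, two forward passes over the same batch give
    # slightly different predictions; the Pi model penalises the mean
    # squared difference between them
    probs_1 = torch.softmax(model(x_unlabeled), dim=-1)
    probs_2 = torch.softmax(model(x_unlabeled), dim=-1)
    return F.mse_loss(probs_1, probs_2)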

Options chosen for training in this example.

  1. labeled_fraction: A list that divides the dataset into the labeled fractions given as its elements. For example, if labeled_fraction is set as below, the function train_with_ssl will first keep 15% of the data as labeled, treat the rest as unlabeled, and apply all combinations of hyperparameters to obtain the results. It will then do the same keeping 35% of the labels, and so on.
  2. use_sup: Whether the trainer should first train a supervised model and then use that trained model as a starting point for the ssl training.
  3. dataset: Huggingface dataset.
  4. model_name: NLP transformers model name as used by huggingface library. See this page for more information.
  5. args_ta: transformers.TrainingArguments; check this page for more information. If using train_with_ssl, there are default values set up in the default_args module. These values are similar to this notebook, since the CoLA dataset is used here. If you would like to keep all the other arguments the same as in that notebook but change only a few, you can pass args_ta as a dictionary that overrides just those arguments, as in the sketch after this list. Otherwise, you can also prepare a whole new TrainingArguments object.
  6. args_ta_sup: Similar to args_ta, but for the supervised model, if use_sup or run_sup has been enabled.
  7. ssl_model_type: Semisupervised model type.
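
For illustration, here are the two ways of passing args_ta mentioned in point 5. The specific values below are only placeholders:

from transformers import TrainingArguments

# Option 1: a dictionary that overrides only the named defaults
args_ta = {'no_cuda': False, 'learning_rate': 1e-5}

# Option 2: a complete TrainingArguments object of your own
# (output_dir is required by transformers; the other values are placeholders)
args_ta = TrainingArguments(output_dir='./results',
                            num_train_epochs=10,
                            learning_rate=2e-5,
                            no_cuda=False)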

unsup_hp:

A dictionary of the hyperparameters that you want to test for your SSL model. For example, in the case of PiModel the choices could be:

  • w_ramprate: the linear rate at which the unsupervised weight is increased from its initial value.
  • update_weights_steps: the interval, in steps, after which the unsupervised weight is updated by w_ramprate.

ssfinetuning will test all combinations of the values in the dictionary. Please look at PiModel’s or TrainerWithUWScheduler’s documentation for more information.
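
To make the sweep concrete, the loop below shows roughly what "all combinations" means; it is only an illustration, not ssfinetuning's internals:

from itertools import product

unsup_hp = {'w_ramprate': [0.01, 0.1], 'alpha': [0.3, 0.6, 0.9]}

# the trainer sweeps the Cartesian product of the listed values,
# i.e. 2 x 3 = 6 hyperparameter combinations in this example
for combo in product(*unsup_hp.values()):
    print(dict(zip(unsup_hp.keys(), combo)))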

In this example, we only vary w_ramprate.

import torch
from ssfinetuning import train_with_ssl

torch.manual_seed(970)
torch.cuda.manual_seed_all(970)

sup_stats, stats_pi = train_with_ssl(labeled_fraction=[0.15, 0.35, 0.5],
                                     use_sup=True,
                                     dataset=dataset,
                                     model_name="albert-base-v2",
                                     args_ta={'no_cuda': False},
                                     args_ta_sup={'no_cuda': False},
                                     ssl_model_type="PiModel",
                                     unsup_hp={'w_ramprate': [0.01, 0.1]})

Here, the validation set provided by the datasets library is used as the test set, after training both the supervised and unsupervised models for 10 epochs. To plot and inspect the results, you can use the TensorBoard extension, though I found that it sometimes shows weird plots; in any case, there is also a built-in plotter in ssfinetuning, used below.
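
If you still want to peek at the training curves with TensorBoard in Colab, something like the following should work; the runs log directory is an assumption based on transformers' default logging location:

# load the notebook extension and point it at the (assumed) default log directory
%load_ext tensorboard
%tensorboard --logdir runs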

from ssfinetuning import plotting_utils
from ssfinetuning.dataset_utils import dic_to_pandas

plotting_utils.sort_and_plot(dis_col='l_fr',
                             data=dic_to_pandas(stats_pi),
                             data_to_compare=dic_to_pandas(sup_stats),
                             y_axis_col='eval_matthews_correlation',
                             cols_to_find=["w_ramprate"])

You can read more about plotting_utils module here.

Here are the results.

sup_stats refers to the accuracy of the purely supervised model. For the most part, the Pi model did not beat it; it only worked better at a labeled data fraction of 0.35 and with w_ramprate=0.1.

Finetuning with Temporal Ensemble

Here, we check only one labeled fraction, while also trying out another hyperparameter important for this SSL model: alpha.

You can read more about alpha in the documentation or in the original paper.
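
To give a feel for what alpha does, here is the core update from the original temporal ensembling paper (Laine & Aila, 2017) in isolation; this is an illustration, not the package's code:

import torch

alpha = 0.6
Z = torch.zeros(5, 2)  # running ensemble of predictions, one row per example
for t in range(1, 11):  # epochs
    z = torch.softmax(torch.randn(5, 2), dim=1)  # stand-in for model outputs
    # exponential moving average: larger alpha = longer memory of past epochs
    Z = alpha * Z + (1 - alpha) * z
    # startup bias correction, as in the paper
    z_tilde = Z / (1 - alpha ** t)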

sup_stats, stats_te = train_with_ssl([0.1],
                                     use_sup=True,
                                     dataset=dataset,
                                     model_name="albert-base-v2",
                                     args_ta={'no_cuda': False, 'learning_rate': 5e-6},
                                     args_ta_sup={'no_cuda': False},
                                     ssl_model_type="TemporalEnsemble",
                                     unsup_hp={'w_ramprate': [0.01, 0.1],
                                               'alpha': [0.3, 0.6, 0.9]})

Plotting the results.

plotting_utils.sort_and_plot(dis_col='l_fr',
                             data=dic_to_pandas(stats_te),
                             data_to_compare=dic_to_pandas(sup_stats),
                             y_axis_col='eval_matthews_correlation',
                             cols_to_find=["w_ramprate", 'alpha'])

For the temporal ensemble in our example, SSL is always better than the supervised results.

Finetuning with Co-Training

Changed options:

  • num_train_epochs: Here, we change the number of training epochs to 5.
  • learning_rate: The learning rate is set to 1e-5 instead of the default 2e-5.

unsup_hp:

  • p_threshold: Threshold probability for considering an exchange between the models.
  • epoch_per_cotrain: Number of epochs to pass through training data while going through one iteration of cotraining.

For more information, please refer to TrainerForCoTraining or CoTrain documentation.
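
To make p_threshold concrete, here is a sketch of the exchange step in co-training: each model pseudo-labels the unlabeled examples it is confident about, and those examples are handed to the other model. This illustrates the idea only and is not TrainerForCoTraining's internals:

import torch

p_threshold = 0.85
logits_a = torch.randn(8, 2)  # stand-in for model A's outputs on unlabeled data

probs = torch.softmax(logits_a, dim=1)
confidence, pseudo_labels = probs.max(dim=1)

# examples where model A is confident enough are exchanged: they are added,
# with their pseudo-labels, to model B's training data (and vice versa)
exchange_mask = confidence > p_threshold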

sup_stats, stats_co = train_with_ssl([0.1, 0.5],
                                     model_name="albert-base-v2",
                                     use_sup=True,
                                     dataset=dataset,
                                     args_ta_sup={'no_cuda': False,
                                                  'num_train_epochs': 5,
                                                  'learning_rate': 1e-5},
                                     args_ta={'no_cuda': False, 'learning_rate': 1e-5},
                                     ssl_model_type="CoTrain",
                                     unsup_hp={'p_threshold': [0.2, 0.4, 0.85],
                                               'epoch_per_cotrain': [5]})

Plots

plotting_utils.sort_and_plot(dis_col='l_fr',
                             data=dic_to_pandas(stats_co),
                             data_to_compare=dic_to_pandas(sup_stats),
                             y_axis_col='eval_matthews_correlation',
                             cols_to_find=['epoch_per_cotrain',
                                           'p_threshold'])

In a future article, I will show the results from the remaining models.
