Fine-tune

To fine-tune the plant DNA LLMs, please first download the desired models from HuggingFace or ModelScope to a local directory. You can use git clone (which may require git-lfs to be installed) to retrieve the model, or download it directly from the website.

In the activated llms Python environment, use the model_finetune.py script to fine-tune a model for a downstream task.

Our script accepts data in .csv format (separated by ,) as input. When preparing the training data, please make sure the file contains a header and at least the following two columns:

sequence,label

Here, sequence is the input sequence and label is the corresponding label for that sequence.
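
As an illustration, the following minimal Python sketch (not part of the repository) uses pandas to verify that a prepared CSV file contains the required columns; the file path is just an example:

# verify that a training CSV has the required 'sequence' and 'label' columns
import pandas as pd

df = pd.read_csv("train.csv")  # example path; replace with your own file
missing = {"sequence", "label"} - set(df.columns)
assert not missing, f"Missing required column(s): {missing}"
print(df.head())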

We also provide several plant genomic datasets for fine-tuning on HuggingFace and ModelScope.

Below, we use the Plant DNAGPT model as an example to fine-tune a model for active core promoter prediction.

First, download a pretrained model and the corresponding dataset from HuggingFace or ModelScope:

# prepare an output directory
mkdir finetune
# download pretrain model
git clone https://huggingface.co/zhangtaolab/plant-dnagpt-BPE models/plant-dnagpt-BPE
# download train dataset
git clone https://huggingface.co/datasets/zhangtaolab/plant-multi-species-core-promoters data/plant-multi-species-core-promoters
  • Note: If downloading from HuggingFace encounters network errors, please try downloading the model/dataset from ModelScope, or switch to the accelerated mirror (hf-mirror.com) as shown below before downloading.
# Download with git
git clone https://hf-mirror.com/[organization_name/repo_name]
# Download with huggingface-cli
export HF_ENDPOINT="https://hf-mirror.com"
huggingface-cli download [organization_name/repo_name]
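
If you prefer a programmatic download instead of git or the CLI, the following Python sketch (assuming the huggingface_hub package is installed) fetches the same model and dataset used in this example; setting HF_ENDPOINT before the import switches to the mirror:

# optionally point huggingface_hub at the mirror; must be set before the import
import os
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

from huggingface_hub import snapshot_download

# download the pretrained model and the fine-tuning dataset used in this example
snapshot_download("zhangtaolab/plant-dnagpt-BPE",
                  local_dir="models/plant-dnagpt-BPE")
snapshot_download("zhangtaolab/plant-multi-species-core-promoters",
                  repo_type="dataset",
                  local_dir="data/plant-multi-species-core-promoters")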

After preparing the model and dataset, use the following command to fine-tune the model (here is a promoter prediction example):

python model_finetune.py \
    --model_name_or_path models/plant-dnagpt-BPE \
    --train_data data/plant-multi-species-core-promoters/train.csv \
    --test_data data/plant-multi-species-core-promoters/test.csv \
    --eval_data data/plant-multi-species-core-promoters/dev.csv \
    --train_task classification \
    --labels 'Not promoter;Core promoter' \
    --run_name plant_dnagpt_BPE_promoter \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 8 \
    --learning_rate 1e-5 \
    --num_train_epochs 5 \
    --load_best_model_at_end \
    --metric_for_best_model 'f1' \
    --save_strategy epoch \
    --logging_strategy epoch \
    --evaluation_strategy epoch \
    --output_dir finetune/plant-dnagpt-BPE-promoter

In this script:
1. --model_name_or_path: Path to the foundation model you downloaded
2. --train_data: Path to the training dataset
3. --test_data: Path to the test dataset; omit it if no test data is available
4. --eval_data: Path to the validation dataset; omit it if no validation data is available
5. --train_task: Task type; should be classification, multi-classification or regression
6. --labels: Labels for the classification task, separated by ;
7. --run_name: Name of the fine-tuned model
8. --per_device_train_batch_size: Batch size for training the model
9. --per_device_eval_batch_size: Batch size for evaluating the model
10. --learning_rate: Learning rate for training the model
11. --num_train_epochs: Number of epochs for training the model (you can also train by steps; in that case, change the save, logging and evaluation strategies accordingly)
12. --load_best_model_at_end: Whether to load the model with the best performance on the evaluation data; default is True
13. --metric_for_best_model: Metric used to determine the best model; default is loss, and can be accuracy, precision, recall, f1 or matthews_correlation for classification tasks, and r2 or spearmanr for regression tasks
14. --save_strategy: Strategy for saving the model; can be epoch or steps
15. --logging_strategy: Strategy for logging training information; can be epoch or steps
16. --evaluation_strategy: Strategy for evaluating the model; can be epoch or steps
17. --output_dir: Where to save the fine-tuned model

Detailed descriptions of the arguments can be found here.

Finally, wait for the progress bar to complete; the fine-tuned model will be saved in the finetune/plant-dnagpt-BPE-promoter directory. This directory will contain a checkpoint directory, a runs directory, and the saved fine-tuned model.
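
To sanity-check the result, the fine-tuned classifier can be loaded with the standard Transformers API. This is a minimal sketch (not part of the repository), assuming the model was saved in standard Transformers format; the input sequence is hypothetical:

# load the fine-tuned classifier and run a quick prediction
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_dir = "finetune/plant-dnagpt-BPE-promoter"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("ATCGTACGATCGGCTAAGCTT"))  # hypothetical DNA sequence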