policy

gpt2-instruction-tuning-rlhf-pytorch

devjwsong · PyTorch

Overview

Name

gpt2-instruction-tuning-rlhf-pytorch

Author

devjwsong

Framework

PyTorch

License

MIT

Skill type

other

Evidence level

untested

Task description

Practice code for training a GPT-based model with PPO (Proximal Policy Optimization) and DPO (Direct Preference Optimization) so that it generates a response to a given task-specific instruction.

Spaces

Action space

other · 0-dim · 0Hz

Observation space

type: other

Links

HuggingFace repo

null

Paper (arXiv)

null

Compatible environments

No environments list gpt2-instruction-tuning-rlhf-pytorch yet.

Datasets that reference this policy

No datasets reference gpt2-instruction-tuning-rlhf-pytorch yet.
