Making NanoChatGPT, NanoGPT, Chat Oriented

You can find the project at https://github.com/VatsaDev/nanoChatGPT

NanoChatGPT is based on NanoGPT, but finetuned to behave more like chat models such as ChatGPT.

History of this project

When I first used ChatGPT back in January, I instantly wanted to build a model that could talk like an LLM but sounded more like humans texting or talking, rather than an essay builder or story writer. A great idea, but unfortunately I had next to no experience with ML whatsoever; my best work was basic Python, and I had no idea where to start.

Around March I stumbled upon Andrej Karpathy's amazing video, "Let's build GPT: from scratch, in code, spelled out." It immediately piqued my interest, and I dove into the video and its code, experimenting with it for a little while and mostly learning about tokenizers. I experimented with the GPT-2 and bigram models, but I couldn't see any way to turn a text guesser into a chatbot.

From April to June I had to deal with end-of-the-year work, AP exams, and the like, and put this project on hold. I looked at it again in May, hooked it up to Google Colab for the GPU, looked into finetuning for the first time, and realized that a play script couldn't be that different from a chat between two people. After that, I discovered HuggingChat and Hugging Face, and realized there were so many models out there that did the things I was looking for, like actual chats, short-term memory, and so on.

Over the next couple of months, I looked at many models, including the Pythia family from EleutherAI and the Falcon family from TII. The model that finally made the data format click for me, though, was RedPajama-3B, with the chat format shown on its model card:

[Human]: <utterance>
[Bot]: <utterance>
[Human]: <utterance>
[Bot]: <utterance>
...

Now, before I tried this, I was looking at one other model type, the question-answer model. After a while I realized this wouldn't work, but I did spend some time seeing whether the model could invent new answers from its context, something outside the scope of this model.

After this was when I finally implemented NanoChatGPT...

Building your LLM

Building NanoChatGPT involves:

  • Choosing a baseline GPT model or training your own

  • Finetuning on a chat dataset

  • ????

  • Profit

Choosing a baseline model or training your own

Start with a general text-prediction/autocomplete model like GPT-2. NanoChatGPT was made by finetuning on top of GPT-2 (124 million parameters). I wouldn't recommend training a model on chat messages from scratch: when topics fall outside your chat data, I've noticed that GPT-2's pretraining knowledge comes in handy and helps it fill in its sentences.
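
If you go the nanoGPT route, loading the pretrained GPT-2 124M weights is a one-liner. Here's a minimal sketch, assuming you're running from the nanoGPT/nanoChatGPT repo root so that model.py is importable (the device handling is my own addition):

    import torch
    from model import GPT  # nanoGPT's model.py

    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # Initialize from OpenAI's pretrained GPT-2 124M weights; this is the same idea
    # as setting init_from = 'gpt2' in a nanoGPT training config.
    model = GPT.from_pretrained('gpt2')
    model.to(device)
    model.eval()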

Finetuning on a Chat Dataset

Get a chat dataset of your choice; mine was the PersonaChat dataset. For NanoChatGPT, I converted the data into the format above and ran it through the model's finetune script. After finetuning, with proper prompting, your model should at least produce English-looking, readable messages, though of course there's a lot more between you and GPT-3.5.
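
As a rough sketch of what that conversion and tokenization can look like (the conversations list, file names, and 90/10 split here are my own illustration, not the project's actual prepare script): render each conversation in the Human:/Bot: format and encode it with the GPT-2 BPE, the way nanoGPT's prepare.py scripts do.

    import numpy as np
    import tiktoken  # GPT-2 BPE tokenizer, same as nanoGPT's GPT-2 prepare scripts

    # Hypothetical input: each conversation is a list of (speaker, utterance) turns.
    conversations = [
        [("Human", "Hi, how are you?"), ("Bot", "i'm good, how are you?")],
        # ... the rest of your chat dataset
    ]

    def to_chat_format(turns):
        # Render one conversation in the Human:/Bot: format shown earlier.
        return "\n".join(f"{speaker}: {utterance}" for speaker, utterance in turns) + "\n\n"

    text = "".join(to_chat_format(c) for c in conversations)

    # 90/10 train/val split, encoded with the GPT-2 BPE and written as uint16 bins,
    # the on-disk format nanoGPT's train.py reads from data/<dataset>/.
    enc = tiktoken.get_encoding("gpt2")
    split = int(len(text) * 0.9)
    train_ids = enc.encode_ordinary(text[:split])
    val_ids = enc.encode_ordinary(text[split:])
    np.array(train_ids, dtype=np.uint16).tofile("train.bin")
    np.array(val_ids, dtype=np.uint16).tofile("val.bin")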

????

The newly finetuned model works, but it needs a lot more, including:

More Data

I cannot stress this enough: ADD MORE DATA. The more data you have, the better your model gets, especially with more parameters. If you want a better production model of NanoChatGPT, finetune GPT-2 XL instead of the 124M model.
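
In nanoGPT terms that swap is just a config change. Below is a hypothetical finetuning config in the style of nanoGPT's config/finetune_shakespeare.py; the file name and hyperparameters are illustrative, not the project's actual settings.

    # hypothetical config/finetune_chat.py
    out_dir = 'out-chat'
    dataset = 'chat'              # expects data/chat/train.bin and val.bin
    init_from = 'gpt2-xl'         # 'gpt2' (124M) for experiments, 'gpt2-xl' (1.5B) for a better model

    # finetuning: low learning rate, relatively few iterations
    batch_size = 1
    gradient_accumulation_steps = 32
    learning_rate = 3e-5
    decay_lr = False
    max_iters = 2000
    eval_interval = 50
    eval_iters = 40
    always_save_checkpoint = False

You'd then kick it off with python train.py config/finetune_chat.py.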

There are many other issues with the model, including:

  • Very small dataset

    • Right now, the dataset is very small; the current data was converted by hand with copy-paste, and that process definitely needs to be automated
  • Very small conversations

    • Right now, the dataset used is made up of very short conversations, at most 8-10 sentences long, so including more medium-sized and large conversations is a necessity
  • Repetitive with larger numbers of tokens

    • While this effect is greatly reduced with prompting, the model turns extremely repetitive when asked for more than 15-30 tokens. It can also start repeating the bot prompt or replacing it with human names.

      example:

        Human: Hi, how are you?
        Bot: i'm good, how are you?
        Human: I'm good: where are you from?
        Bot: i live in seattle, seattle, seattle, seattle

        Bot: Hi, how are you?
        Bot: i'm good, how are you?
        Human: i live in seattle, seattle, seattle, seattle, seattle

        Bot: Hi, how are you?
        Bot: i'm good, how are you?
        Human: i live in seattle, seattle, seattle, seattle, seattle

        Bot: Yo,
  • No clear end to an answer, and continuing to generate past it

    • While this effect is greatly reduced with more and better data, the model probably needs padding to set outputs to a certain size, and also a stop-at-a-certain-word function (see the sketch after this list).
  • No memory/recall

    • With many models, you can ask what you were just talking about or to summarize the conversation above. When that is attempted with this model:

        Human: Dogecoin is cool 
        Bot: indeed, very shibe
        Human: what were we just talking about?
        Bot: me and a friend gave up on mining, but now I can
      

      as we can see, it continues with a sentence on mining, confirming that it understood the context (GPT-2 info) but cannot recall the conversation. I suspect that has to do with the model's data, and that if I were to feed it data like short-context recall and summarization examples, it would gain those abilities.
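
For the stop-at-a-word idea above, here's a minimal sketch of what I mean, built on nanoGPT's model.generate; the truncate_reply helper, the sampling settings, and the prompt are my own illustration, not code that ships with the repo.

    import torch
    import tiktoken
    from model import GPT  # nanoGPT's model.py

    enc = tiktoken.get_encoding("gpt2")
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    # Pretrained GPT-2 here for illustration; point this at your finetuned checkpoint instead.
    model = GPT.from_pretrained('gpt2').to(device).eval()

    def truncate_reply(text, stop_words=("Human:", "Bot:")):
        # Keep only what comes before the next speaker marker, so the bot stops
        # "talking" instead of generating both sides of the conversation.
        cut = min((text.find(w) for w in stop_words if w in text), default=len(text))
        return text[:cut].strip()

    prompt = "Human: Hi, how are you?\nBot:"
    idx = torch.tensor([enc.encode_ordinary(prompt)], dtype=torch.long, device=device)
    with torch.no_grad():
        out = model.generate(idx, max_new_tokens=40, temperature=0.8, top_k=200)
    reply = enc.decode(out[0].tolist())[len(prompt):]
    print(truncate_reply(reply))

Sampling with a temperature below 1 and a top_k cap also tends to take some of the edge off the repetition issue, though it's no substitute for better data.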

Profit

Once you have a competent model, set up a web UI and a dev API, and profit.
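
As one possible shape for that dev API, here's a minimal Flask sketch; generate_reply and the chat_model module are hypothetical stand-ins for whatever wraps your finetuned model's sampling code (for example, the truncate_reply sketch above).

    from flask import Flask, request, jsonify

    # generate_reply(prompt) and chat_model are hypothetical stand-ins for whatever
    # wraps your finetuned model's sampling code (e.g. the sketch above).
    from chat_model import generate_reply

    app = Flask(__name__)

    @app.route("/chat", methods=["POST"])
    def chat():
        history = request.json.get("history", "")  # prior "Human:/Bot:" turns, if any
        message = request.json["message"]
        prompt = f"{history}Human: {message}\nBot:"
        return jsonify({"reply": generate_reply(prompt)})

    if __name__ == "__main__":
        app.run(port=5000)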