Dragged into ML

Last semester, I took a distributed systems class. They had made some changes to the syllabus from previous semesters: we would cover transformers and how to distribute them, as a look at some modern applications of what we had studied so far, and there were new third and fourth projects in which we would implement k-means and distribute it using Ray. During one of the (two) lectures on transformers¹, the professor mentioned that a student was working with him trying to "improve" the architecture, and the way he said it made me kinda jelly (i felt like i couldnt sit around doing nothing while some sophomore was working on "research"). So, after class i asked him if i could join the project. He agreed, and asked me to attend their next weekly meeting and get started on the deep learning specialization on coursera (by Andrew Ng). I was pretty excited, but i had no idea what i was getting myself into. I got started on the specialization (though I knew most of the stuff in there, so i kinda found it boring), and attended their next weekly meeting. They were talking about some... "attention" stuff, and i was completely clueless. At that time, I'd heard of "LLM", knew that it was some probabilistic model that "predicted" the next word in a sequence or something like that, and that it used "neural nets". I had studied MLPs and backpropagation and basic ML stuff, but I had no idea about anything beyond that.

The professor was the kind of guy who mentored insanely successful students, ran two companies, etc., and his expectations were really high. I was still under the impression that this was some research project that they were working on with a clear goal. So, I asked him what i shud do, and just got a... "make it better". I realized that at this point it was pretty much on me to figure out what to do, and later learned that the other sophomore guy with us was actually just working on his senior thesis. This wasn't really one of those "research" projects with a clear goal (the kind that I'd worked on before), and the question was no longer "what do I need to learn to get up to speed with the project", but "what do I need to learn to get up to speed with the SOTA in this field and beat it". Starting from almost zero, I didnt have any idea what the journey would look like, or if it was even possible. It was also a busy time in the semester, so i didn't really spend much time on it and made almost no progress (just tried reading some papers here and there). At the beginning, though, I did read LLM in a Flash (the paper had just been published at the time), understood it as much as i could, and made a presentation. The discussion in the weekly meeting was fun. But after that, the meetings were pretty much like,, "ugh yea i was thinking i will work on...", and i wasnt really getting anything done.

In fact, for a long time from feb to apr 2024, we didnt even have our meetings cuz everyone was very busy. I was still not the most interested in ML. BUT, everything changed this month (april). A fren added me to an ML nerds gc on X. I started engaging there a bit, and everyone seemed like normal cool guys and gals UNTIL i also got added to some ML research related gc (with a lot of the same ppl), and they started talking about technical stuff. Eventually i realized everyone in there is extremely cracked. There was a guy who was trying to explain to me why GPU mem is a bottleneck, and I was being a complete muppet (if u r reading this, im sorry). It hurt even more, later, when I realized that I considered myself a "systems" guy and that this should have been a painfully obvious fact. After lurking a bit more, I realized that if i really wanna "beat SOTA", i'm up against cracked ppl like them, and i basically stand zero chance. Initially, the "im in a group of ppl who did some cool stuff and i also want to do the same cool stuff cuz i want to fit in" energy got me started on a gpt2 (inference) implementation (this was like,, two weeks ago?). As always, my first idea was to implement it completely from scratch in C without using any libraries. I got started, and stopped at implementing (any-dimensional) tensor multiplication because I had no idea how that worked. I realized that i could probably get away with implementing just matrix mul, but i figured that its not the best idea to spend time implementing a "tensor library" if what i wanted was to learn how inference worked. So I decided to make a python implementation but "only" use a tensor library (i settled on numpy). I would still be loading the weights from disk and writing the entire inference code out, but i'd have numpy to lean on for tensor operations. I had to spend quite a bit of time trying to figure out what tensors to use where and what to do with them, and at one point i thought i'd finally figured it out².
So, i implemented it exactly as i understood it. yayyyy!!! of course it didnt work. It was outputting gibberish and, even worse, there was some floating point error after generating a few (garbage) tokens. At that point, i semi-gave up, and started looking at other stuff in the meantime. Karpathy's tokenization video was one of those things, and im glad i watched it, it was very fun and insightful. I came across picogpt, and tried to study it, but the code was kinda weird. A week later, i decided that i really needed to figure this out, so i heavily referenced the ggml gpt2 example and picogpt (even stole some functions from it), got rid of my kv cache implementation, and switched to using tiktoken instead of the hacky tokenizer that i had built. I compared my code with the references thoroughly, and I was pretty confident this time around. aaaaaaannnnnnnd,,,,,???? IT DIDNT WORKKKK al;ksdjf;aslkdfj. Well, at least it didnt give me any weird floating point warnings, and seemed to be "working" somehow (though not outputting the expected tokens)³. Now, I really didnt know what to do. An amazing person suggested that i examine the intermediate calculations from a reference implementation and compare them with mine; i havent gotten to it yet but i might try that. Also, one thing that's different in my implementation compared to my references is the source of the model weights, which could be an issue (I downloaded the gpt2 safetensors from huggingface). I'm thinking it might be easier to first load the weights used in one of the reference implementations and plug them into mine to see if it makes any difference ("openai" GPT2 weights shud be the same everywhere,, right?????).
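For the record, here's roughly what that "compare the intermediate calculations" idea could look like. This is only a minimal sketch, assuming both implementations can dump their intermediate results into a dict of numpy arrays in the order they were computed. The function name `first_mismatch`, the layer names, the toy values, and the tolerance are all made up for illustration and aren't taken from picogpt, ggml, or my own code.

```python
import numpy as np

def first_mismatch(reference, mine, atol=1e-4):
    """Return (name, reason) for the first entry that differs, or None.

    Both arguments are dicts mapping a name to an array, inserted in the
    order the values were produced (dicts keep insertion order).
    """
    for name, ref in reference.items():
        if name not in mine:
            return name, "missing"
        ref = np.asarray(ref)
        got = np.asarray(mine[name])
        if got.shape != ref.shape:
            return name, f"shape {got.shape} != {ref.shape}"
        max_err = float(np.max(np.abs(got - ref))) if ref.size else 0.0
        # "not <=" instead of ">" so that a NaN error is also reported
        if not max_err <= atol:
            return name, f"max abs error {max_err:.3g}"
    return None

# Toy traces: the hypothetical "attn_0" output is off, everything else matches.
ref_trace = {
    "embed": np.array([[0.1, 0.2], [0.3, 0.4]]),
    "attn_0": np.array([[1.0, 2.0], [3.0, 4.0]]),
    "mlp_0": np.array([[0.5, 0.5], [0.5, 0.5]]),
}
my_trace = {
    "embed": ref_trace["embed"].copy(),
    "attn_0": ref_trace["attn_0"] + np.array([[0.0, 0.0], [0.0, 0.25]]),
    "mlp_0": ref_trace["mlp_0"].copy(),
}

print(first_mismatch(ref_trace, my_trace))
```

The point of returning only the first mismatch is that everything after a wrong step is wrong too, so the earliest divergence is the interesting one. The same function could in principle be pointed at two sets of weights instead of activations, assuming the names and layouts line up between the two sources, which isn't guaranteed.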

At this point, I've seen ppl talking about recently published papers and how they themselves are working on implementations that would be better, and i feel even more useless. Meanwhile, I'm seeing papers that beat the previous SOTA come out every few months, or even weeks. i really wanna re-learn AI and ML from scratch, but a part of me cant help but think that i will never catch up if i try to do that. At the same time, it only makes sense that if i don't, i won't stand any chance whatsoever. I also need to learn practical skills, probably mostly the current frameworks used in industry to implement this kind of stuff, or im practically useless (which i am at this point) even if i understand all the science behind it. First, though, i need to make sure that I actually graduate this sem, but thats a talk for another day.

1) It was really high level (not an ML class), and i was left with more questions than answers, but it gave us a basic idea
2) heres my scratchwork if u want to see it, DO NOT try to understand it, its not meant to be understood
3) heres a snapshot of what that code looked like, for eternity

"Dragged into ML", 18 Apr 2024