

Photo by author | Canva
# Introduction
Personally, I find it amazing that computers can process language at all. It’s like a child learning to talk, but in code and algorithms. It feels strange sometimes, but that’s exactly what makes natural language processing (NLP) so interesting. Can you really make a computer understand your language? That’s the fun part. If this is your first time reading my fun project series, I just want to make it clear that the goal here is to promote project-based learning by highlighting some of the best hands-on projects you can try, from the simple to the slightly advanced. In this article, I have chosen five projects from the major areas of NLP to give you a better feel for how things work, from the basics to more applied concepts. Some of these projects use specific architectures or models, and it helps if you understand their structure. So if you feel you need to brush up on some concepts first, don’t worry, I’ve added some extra learning resources at the end 🙂
# 1. Creating a tokenizer from scratch
Project 1: How to Build a BERT WordPiece Tokenizer in Python and Hugging Face
Project 2: Let’s Build the GPT Tokenizer
Text preprocessing is the first and most essential part of any NLP task. It is what converts raw text into something a machine can actually process by breaking it down into smaller units such as words, sub-words, or even bytes. To get a good feel for how it works, I recommend checking out these two awesome projects. The first one walks you through creating a BERT WordPiece tokenizer in Python with Hugging Face. It shows how words are broken down into smaller sub-word units, such as adding “##” to mark parts of a word, which helps models like BERT handle rare or misspelled words by breaking them into familiar chunks. The second video, “Let’s Build the GPT Tokenizer” by Andrej Karpathy, is a bit long but well worth it. It walks through how GPT uses byte-level byte-pair encoding (BPE) to combine common byte sequences and handle text more flexibly, including spaces, punctuation, and even emojis. I really recommend watching it if you want to see what’s actually going on when text is converted to tokens. Once you’re comfortable with tokenization, everything else in NLP becomes much clearer.
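To make the BPE idea concrete, here is a minimal pure-Python sketch of the merge loop at the heart of byte-pair encoding. This is a toy illustration, not Karpathy’s or Hugging Face’s implementation: it starts from characters instead of bytes, and the helper names are my own.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from individual characters; byte-level BPE starts from UTF-8 bytes.
tokens = list("low lower lowest")
for _ in range(3):  # three merge rounds
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)  # frequent sequences like "low" end up as single tokens
```

Real byte-level BPE learns thousands of merges from a large corpus, then stores the merge rules and applies them in order to new text.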
# 2. NER in Action: Recognizing Names, Dates, and Organizations
Project 1: Named Entity Recognition (NER) in Python: Pre-trained and Custom Models
Project 2: Creating an Entity Extraction Model Using BERT
Once you understand how text is represented, the next step is learning how to actually extract meaning from it. A good place to start is named entity recognition (NER), which teaches a model to spot entities in a sentence. For example, in “Apple reached an all-time high stock price of $143 this January,” a good NER system should tag “Apple” as an organization, “$143” as a monetary amount, and “this January” as a date. The first video shows how to use pre-trained NER models with libraries like spaCy and Hugging Face Transformers. You’ll see how to input text, get entity predictions, and even visualize them. The second video goes a step further, walking you through building an entity extraction system by fine-tuning BERT yourself. Instead of relying on a ready-made library, you code the pipeline: tokenize text, align tokens with entity labels, train the model in PyTorch or TensorFlow, and then use it to tag new text. I would recommend this as your second project because NER is one of those tasks that really makes NLP feel practical. You begin to see how machines can understand “who did what, when, and where.”
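To see what “aligning tokens with entity labels” means in practice, here is a small pure-Python sketch of BIO tagging, the labeling scheme typically used when fine-tuning BERT for NER. The `bio_tags` helper and the example entity list are illustrative, not code from either tutorial.

```python
def bio_tags(tokens, entities):
    """Assign B-/I-/O tags to whitespace tokens given (phrase, label) pairs.

    B- marks the beginning of an entity, I- its continuation, O everything else.
    """
    tags = ["O"] * len(tokens)
    for phrase, label in entities:
        words = phrase.split()
        for i in range(len(tokens) - len(words) + 1):
            if tokens[i:i + len(words)] == words:
                tags[i] = f"B-{label}"
                for j in range(1, len(words)):
                    tags[i + j] = f"I-{label}"
    return tags

sentence = "Apple reached an all-time high stock price of $143 this January"
tokens = sentence.split()
entities = [("Apple", "ORG"), ("$143", "MONEY"), ("this January", "DATE")]
tags = bio_tags(tokens, entities)
print(list(zip(tokens, tags)))
```

During fine-tuning, these per-token tags become the training targets; the extra wrinkle in the BERT tutorial is that sub-word tokenization splits words further, so the labels have to be propagated to the sub-word pieces as well.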
# 3. Text Classification: Predicting Sentiment with BERT
Project: Text Classification | Sentiment Analysis with BERT using Hugging Face
After learning how to represent text and extract entities, the next step is teaching a model to assign labels to text, with sentiment analysis being the classic example. This is a fairly old project, and you may need a few modifications to get it working (check the comments on the video), but I still recommend it because it also explains how BERT works. If you’re not familiar with Transformers yet, this is a good place to start. The project has you categorize text such as movie reviews, tweets, or product reviews using a Hugging Face model. In the video, you see how to load a labeled dataset, preprocess the text, and predict whether each example is positive, negative, or neutral. It’s a clear way to see how tokenization, model training, and evaluation all come together in a single workflow.
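As a stand-in for the fine-tuned BERT classifier, here is a tiny lexicon-based sketch that shows the input/output shape of sentiment classification: text in, a positive/negative/neutral label out. BERT learns these associations from labeled data instead of a hand-built word list; the lexicons below are made up for illustration.

```python
# Toy sentiment lexicons (illustrative only; BERT learns this from data).
POSITIVE = {"great", "love", "amazing", "good"}
NEGATIVE = {"bad", "terrible", "boring", "hate"}

def classify(review):
    """Score a review by counting lexicon hits and map the score to a label."""
    words = review.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("I love this movie, the acting was great"))  # positive
```

The real project replaces this rule with a learned one: the same `text -> label` interface, but the decision comes from a fine-tuned Transformer evaluated on a held-out test set.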
# 4. Text Generation: Model Building with RNNs and LSTMs
Project 1: Text Generation AI – Next Word Prediction in Python
Project 2: Text Generation with LSTM and Spell with Nabil Hashim
Sequence modeling covers tasks where the output is a sequence of text and is a huge part of how modern language models work. These projects focus on text generation and next-word prediction, showing how a machine can learn to continue text one word at a time. The first video walks you through building a simple recurrent neural network (RNN)-based language model that predicts the next word in a sequence. This is a classic exercise that really shows how a model picks up patterns, grammar, and structure in text, which is essentially what models like GPT do at a much larger scale. The second video uses a long short-term memory (LSTM) network to generate coherent text from prose or code. You’ll see how the model is fed one word or character at a time, how to sample its predictions, and even how tricks like temperature and beam search control the creativity of the generated text. These projects make it really clear that text generation isn’t magic; it’s all about cleverly sampling from predicted probabilities.
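Here is a toy next-word predictor that captures the core idea in pure Python: count which words follow which, then sample the next word, with a temperature knob to control randomness. A real RNN or LSTM learns these statistics with a neural network rather than a bigram table; `train_bigrams` and `sample_next` are illustrative names of my own.

```python
import random
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count word bigrams: how often each word follows another."""
    words = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def sample_next(counts, word, temperature=1.0):
    """Sample the next word; lower temperature sharpens the distribution."""
    candidates = counts[word]
    weights = [c ** (1.0 / temperature) for c in candidates.values()]
    return random.choices(list(candidates), weights=weights)[0]

model = train_bigrams("the cat sat on the mat and the cat ran")
print(sample_next(model, "the", temperature=0.5))
```

With `temperature=0.5` the frequent follower (“cat”) is sampled even more often than its raw counts suggest; raising the temperature flattens the distribution and makes the output more varied, which is exactly the knob the tutorial demonstrates.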
# 5. Building a Seq2Seq Machine Translation Model
Project: PyTorch Seq2Seq Tutorial for Machine Translation
The final project takes NLP beyond English and into real-world tasks with a focus on machine translation. In this project, you build an encoder-decoder network where one network reads and encodes the source sentence and the other decodes it into the target language. This is basically what Google Translate and other translation services do. The tutorial also demonstrates the attention mechanism, which lets the decoder focus on the right parts of the input, and explains how to train on parallel texts and evaluate translations with metrics such as the BLEU (Bilingual Evaluation Understudy) score. This project brings together practically everything covered so far in the series. Even if you’ve used translation apps before, building a toy translator gives you a sense of how these systems actually work behind the scenes.
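To demystify the BLEU metric mentioned above, here is a pure-Python sketch of its basic building block, clipped n-gram precision: how many of the candidate’s n-grams also appear in the reference, with repeats capped at the reference count. Full BLEU combines precisions for n = 1 through 4 with a brevity penalty; this simplified version is for illustration only.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision, the building block of the BLEU score."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Clip each candidate n-gram count by its count in the reference.
    clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return clipped / total if total else 0.0

# 5 of the 6 candidate words appear in the reference ("is" does not).
print(ngram_precision("the cat is on the mat", "the cat sat on the mat"))
```

The clipping step is what stops a degenerate output like “the the the the” from scoring well: “the” only counts as many times as it occurs in the reference.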
# Wrapping Up
Which brings us to the end of the list. Each project covers one of five major areas of NLP: tokenization, information extraction, text classification, sequence modeling, and applied multilingual NLP. By trying them out, you’ll get a better sense of how NLP pipelines work from start to finish. If you found these projects helpful, give the tutorial creators a thumbs up and share what you’ve created.
To learn more, Stanford’s course CS224N: Natural Language Processing with Deep Learning is an excellent resource. And if you like learning through projects, you can also check out the other articles in the “5 Fun Projects” series:
Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of AI with medicine. She co-authored the eBook “Maximizing Productivity with ChatGPT.” As a 2022 Google Generation Scholar for APAC, she champions diversity and academic excellence. She has also been recognized as a Teradata Diversity in Tech Scholar, a MITACS GlobalLink Research Scholar, and a Harvard WeCode Scholar. Kanwal is a passionate advocate for change, having founded FEMCodes to empower women in STEM fields.