As a Data Scientist, You Need to Stay Relevant

by SkillAiNest

Photo by Author | Canva

If you work in a data-related field, you should update yourself regularly. Data scientists use many different tools for tasks such as data visualization, data modeling, and even warehouse systems.

On top of that, AI has transformed data science from A to Z. If you are on the path to a data science job, you have probably heard the term RAG.

In this article, we will break down RAG, starting with the academic paper that introduced it and moving on to how it is now used to reduce costs when working with large language models (LLMs). But first, let’s cover the basics.

What is retrieval-augmented generation (RAG)?
Patrick Lewis first introduced RAG in this academic paper in 2020. It combines two key components: a retriever and a generator.

The idea behind it is simple: instead of producing answers purely from its parameters, the model can retrieve relevant information from documents.

What is the retriever?

A retriever is used to collect relevant information from documents. But how?

Let’s look at an example. You have a massive Excel sheet, say 20 MB with thousands of rows, and you want to find the call_date for user_id = 10234.

Thanks to the retriever, instead of scanning the entire document, RAG will find only the relevant part.
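
As a rough analogy, here is a minimal sketch in Python of what “fetch only the relevant part” means (the file name and data are hypothetical):

```python
import pandas as pd

# Hypothetical 20 MB sheet with thousands of rows
df = pd.read_excel("calls.xlsx")

# Instead of handing the whole sheet to an LLM, pull out
# only the row(s) that answer the question
relevant = df.loc[df["user_id"] == 10234, "call_date"]
print(relevant.tolist())
```

A real retriever works on unstructured text rather than a keyed lookup, but the principle is the same: isolate the small slice of data the question actually needs.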

Why is retrieval in RAG useful?

But how is this helpful for us? If you send the entire document, you will spend a lot of tokens. As you probably know, LLM API usage is billed by the token.

Let’s see how this calculation works. For example, if you paste this paragraph into a tokenizer, it costs 123 tokens.

You should also check the per-token cost of the LLM API you use. For example, a 10 MB Word document can amount to thousands of tokens. Each time you upload this document through the LLM API, the cost adds up.

By using RAG, you select only the relevant part of the document, reducing the number of tokens so that you pay a lower price. It’s that simple.
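
Here is a minimal sketch of that calculation using OpenAI’s tiktoken library (the document, the retrieved chunk, and the price per token are placeholder assumptions; check your provider’s current pricing):

```python
import tiktoken

# Encoding used by recent OpenAI models
enc = tiktoken.get_encoding("cl100k_base")

full_doc = open("company_doc.txt").read()  # hypothetical large document
retrieved_chunk = "call_date for user_id 10234: 2024-03-12"  # what RAG sends instead

price_per_1k_tokens = 0.0025  # assumed input price in USD; varies by model

for name, text in [("full document", full_doc), ("retrieved chunk", retrieved_chunk)]:
    n_tokens = len(enc.encode(text))
    cost = n_tokens / 1000 * price_per_1k_tokens
    print(f"{name}: {n_tokens} tokens, ~${cost:.4f} per request")
```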

How does the retriever do this?

Before retrieval begins, the documents are divided into smaller chunks, such as paragraphs. Each chunk is converted into a dense vector using an embedding model (OpenAI embeddings, Sentence-BERT, etc.).

So when a user sends a query, such as asking what the call date is, the retriever compares the query’s vector with the vectors of all the chunks and picks the most similar ones. Pretty neat, right?
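
Here is a minimal sketch of that flow, assuming the sentence-transformers library and a toy set of chunks:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Small open-source embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "user_id 10234 had a call on 2024-03-12",
    "refund policy: 30 days after purchase",
    "office hours are 9am to 5pm CET",
]
query = "What is the call date for user 10234?"

chunk_vecs = model.encode(chunks)     # one dense vector per chunk
query_vec = model.encode([query])[0]  # dense vector for the query

# Cosine similarity between the query and every chunk
sims = chunk_vecs @ query_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
)
print(chunks[int(np.argmax(sims))])  # -> the call-date chunk
```

In production you would store the chunk vectors in a vector database instead of recomputing them per query, but the comparison step is the same.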

What is the generator?

As we mentioned above, the generator takes over after receiving the most relevant documents from the retriever. It produces a response using both the user’s query and the retrieved documents.

Using this method, you also reduce the risk of hallucination. Instead of creating a response purely from the data the AI was trained on, the model grounds its answer in the original documents you provided.
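
A minimal sketch of the generator step, assuming the OpenAI Python SDK; the model name, prompt wording, and retrieved chunk are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

retrieved = "user_id 10234 had a call on 2024-03-12"
query = "What is the call date for user 10234?"

# Ground the model in the retrieved chunk instead of its training data
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "Answer using ONLY the provided context. "
                    "If the answer is not in the context, say so."},
        {"role": "user",
         "content": f"Context:\n{retrieved}\n\nQuestion: {query}"},
    ],
)
print(response.choices[0].message.content)
```

The system message does the anti-hallucination work: it tells the model to stay inside the retrieved context rather than improvise.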

Context window evolution

Early models, such as GPT-2, had small context windows of around 1,024 tokens. This is why those models did not offer file uploading. If you remember, ChatGPT only introduced its file-upload feature once context windows had grown.

Advanced models such as GPT-4o have 128K-token context windows, which support the file-upload feature and, in terms of context size alone, can make RAG look unnecessary. But this is where the demand for cost reduction comes in.

So now, one of the reasons users adopt RAG is to reduce cost, but that is not the only one. Even as the cost of using LLMs decreases and GPT-4.1 introduces a context window of up to 1 million tokens, an amazing increase, RAG keeps evolving too.

Industry-related practice

Now, LLMs are being built into agents. Instead of just generating answers, they are expected to complete your tasks automatically. Some companies are even developing models that control your keyboard and mouse.

For these use cases, you cannot take chances on hallucination. This is where RAG comes into the picture. In this section, we will dig into a real-world example.

Companies are looking for people with the skills to build these agents. And it is not just big companies; medium-sized and small companies and startups are exploring their options too. You can find these jobs on freelance websites such as Upwork and Fiverr.

Marketing agent

Let’s say a medium-sized company in Europe wants you to build an agent that produces marketing content for its clients using company documents.

On top of that, this agent should enrich the content by adding relevant hotel information to proposals for business trips or campaigns.

But there is a problem: the agent often hallucinates. Why does this happen? Because instead of relying on the company’s documents, the model draws information from its original training data. That training data may be outdated because, as you know, these LLMs are not updated continuously.

As a result, the AI produces incorrect hotel names or simply unrelated information. Now you can pinpoint the root cause of the problem: a lack of reliable, grounded information.

This is where RAG comes in. Using a web-browsing API, companies have the LLM retrieve reliable information from the web and reference it while generating answers. Let’s look at an example prompt.

“Prepare a proposal based on the company’s tone of voice and company information, and use web search to find hotel names.”

Here, the web-search step acts as the retrieval part of RAG.
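
A minimal sketch of that pattern (search_web is a hypothetical stand-in for whatever web-browsing API you use; the file and model names are assumptions):

```python
from openai import OpenAI

client = OpenAI()

def search_web(query: str) -> str:
    # Hypothetical helper: swap in your web-search API of choice.
    # A canned snippet is returned here so the sketch runs end to end.
    return "Hotel Example Mitte: 4 stars, conference rooms, central Berlin"

# Retrieval step: fetch fresh, reliable hotel information
hotel_info = search_web("business-friendly hotels in Berlin city center")
company_docs = open("tone_of_voice.txt").read()  # assumed company document

# Generation step: ground the proposal in the retrieved facts
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": (
            "Prepare a proposal based on the company's tone of voice and "
            "company information, and use ONLY the hotel names below.\n\n"
            f"Company information:\n{company_docs}\n\n"
            f"Hotel search results:\n{hotel_info}"
        ),
    }],
)
print(response.choices[0].message.content)
```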

Final thoughts

In this article, we explored the evolution of AI models and why RAG is used. As you can see, the reason has shifted over time, but the underlying concern remains the same: efficiency.

Whether the reason is cost or speed, this method will continue to be used in AI-related tasks. And by “AI-related”, I am not excluding data science because, as you may know, with the current AI summer, data science has already been deeply affected by AI.

If you want more articles like this, 700+ data science interview questions, and 50+ solved data projects, visit my platform.

Nate Rosidi is a data scientist and works in product strategy. He is also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
