From hallucinations to hardware: Lessons from a real-world computer vision project

Computer vision projects rarely go exactly as planned, and this one was no exception. The idea was simple: build a model that could look at a photo of a laptop and identify any physical damage, such as cracked screens, missing keys or broken hinges. It seemed like a straightforward use case for image models and large language models (LLMs), but it quickly turned into something more complicated.

Along the way, we ran into issues with hallucinations, unreliable outputs and images that were not even laptops. To solve these, we ended up applying an agentic framework in an atypical way: not for task automation, but to improve the model's performance.

In this post, we will walk through what we tried, what didn't work and how a combination of approaches eventually helped us build something reliable.

Where we started: Monolithic prompting

Our initial approach was fairly standard for a multimodal model. We used a single, large prompt to pass an image to an image-capable LLM and asked it to identify visible damage. This monolithic prompting strategy is simple to implement and works decently for clean, well-defined tasks. But real-world data rarely plays along.
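
To make the idea concrete, here is a minimal Python sketch of monolithic prompting. It is an illustration, not the prompt or client we used: the prompt text, the JSON reply format and the `call_vision_llm` callable are all assumptions, and `fake_llm` only stands in for a real model so the example runs.

```python
import json
from typing import Callable

# Hypothetical prompt text; the real prompt is not shown in this post.
MONOLITHIC_PROMPT = (
    "You are inspecting a photo of a laptop. List every visible physical "
    "defect (cracked screen, missing keys, broken hinges, and so on). "
    'Reply with JSON only: {"damage": ["..."]}'
)


def detect_damage(image_bytes: bytes,
                  call_vision_llm: Callable[[str, bytes], str]) -> list[str]:
    """Send one large prompt plus the image and parse the model's reply."""
    raw = call_vision_llm(MONOLITHIC_PROMPT, image_bytes)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return []  # unparseable reply: treat as "no findings"
    if not isinstance(parsed, dict):
        return []
    damage = parsed.get("damage", [])
    return [str(item) for item in damage] if isinstance(damage, list) else []


# Stand-in for a real model client, present only to make the sketch runnable.
def fake_llm(prompt: str, image: bytes) -> str:
    return '{"damage": ["cracked screen"]}'


findings = detect_damage(b"", fake_llm)
```

Note that nothing in this design checks whether the image is a laptop at all, or whether a reported defect is real. Everything rests on one prompt and one reply.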

We initially ran into three major problems:

Hallucinations: The model would sometimes invent damage that did not exist or mislabel what it was seeing.
Junk image detection: It had no reliable way to flag images that were not even laptops. Pictures of desks, walls or people occasionally slipped through and received nonsensical damage reports.
Inconsistent accuracy: The combination of these problems made the model too unreliable for operational use.

This was the point when it became clear that we would need to iterate.

First fix: Mixing image resolutions

One thing we noticed was how much image quality affected the model's output. Users upload all kinds of images, ranging from sharp and high-resolution to blurry. This led us to research highlighting how image resolution affects deep learning models.

We trained and tested the model using a mix of high- and low-resolution images. The idea was to make the model more resilient to the wide range of image qualities it would encounter in practice. This helped improve consistency, but the core issues of hallucination and junk image handling persisted.
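
One way to build such a mix is to degrade a share of sharp images synthetically. The sketch below shows that idea in plain Python and is an assumption on our part, not a description of the actual data pipeline: images are modeled as small grayscale grids, and `degrade`, `mix_resolutions` and their parameters are hypothetical names. A real project would more likely resize with an imaging library.

```python
import random

Image = list[list[float]]  # grayscale pixels in row-major order


def degrade(image: Image, factor: int) -> Image:
    """Simulate a low-resolution upload.

    Each factor x factor block is replaced by its mean value, so the output
    keeps the original size but loses fine detail. factor must be >= 1.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for top in range(0, height, factor):
        for left in range(0, width, factor):
            rows = range(top, min(top + factor, height))
            cols = range(left, min(left + factor, width))
            total = sum(image[r][c] for r in rows for c in cols)
            mean = total / (len(rows) * len(cols))
            for r in rows:
                for c in cols:
                    out[r][c] = mean
    return out


def mix_resolutions(images: list[Image], low_res_fraction: float = 0.5,
                    factor: int = 2, seed: int = 0) -> list[Image]:
    """Return the images with roughly low_res_fraction of them degraded."""
    rng = random.Random(seed)  # seeded so the mix is reproducible
    return [degrade(img, factor) if rng.random() < low_res_fraction else img
            for img in images]


sharp = [[0.0, 2.0], [4.0, 6.0]]
blurry = degrade(sharp, factor=2)
```

Applying the same mix to both the training and the test set matters here, because a model evaluated only on sharp images can look more consistent than it will be on real uploads.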

The multimodal detour: Text-only LLM goes multimodal

Encouraged by recent experiments in combining image captioning with text-only LLMs, where captions are generated from images and then interpreted by a language model, we decided to give it a try.

Here is how it works:

The LLM begins by generating multiple possible captions for an image.
Another model, called a multimodal embedding model, checks how well each caption fits the image. In this case, we used SigLIP to score the similarity between the image and the text.
The system keeps the top few captions based on these scores.
The LLM uses those top captions to write new ones, trying to get closer to what the image actually shows.
It repeats this process until the captions stop improving, or it hits a set limit.
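
The loop described in these steps can be sketched as follows. This is a simplified illustration under stated assumptions: `propose` stands in for the LLM that writes captions, `score` stands in for the embedding model's image-text similarity, and the toy versions below (a set of words as the "image", word overlap as the score) exist only so the example runs without any model.

```python
from typing import Callable


def refine_captions(
    image: object,
    propose: Callable[[object, list[str]], list[str]],
    score: Callable[[object, str], float],
    keep: int = 3,
    max_rounds: int = 5,
) -> list[str]:
    """Generate, score, keep the best and rewrite until scores stop improving."""
    best: list[str] = []
    best_score = float("-inf")
    for _ in range(max_rounds):
        # New captions are written from the current best ones (none at first).
        candidates = set(best) | set(propose(image, best))
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(candidates, key=lambda c: (-score(image, c), c))[:keep]
        if not ranked:
            break
        top_score = score(image, ranked[0])
        if top_score <= best_score:
            break  # captions stopped improving
        best, best_score = ranked, top_score
    return best


# Toy stand-ins: the "image" is the set of words that truly describe it.
def toy_score(image: set, caption: str) -> float:
    words = set(caption.split())
    return len(words & image) - 0.5 * len(words - image)


def toy_propose(image: set, seeds: list[str]) -> list[str]:
    if not seeds:
        return ["laptop", "desk"]
    return [f"{s} {w}" for s in seeds for w in ("cracked", "screen", "keyboard")]


captions = refine_captions({"laptop", "cracked", "screen"},
                           toy_propose, toy_score, keep=2, max_rounds=10)
```

The loop can only be as good as its two stand-ins. If the caption writer invents damage and the scorer does not penalize it strongly enough, the invented detail survives every round.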

While clever in theory, this approach introduced new problems for our use case:

Persistent hallucinations: The captions themselves sometimes included imaginary damage, which the LLM then confidently reported.
Incomplete coverage: Even with multiple captions, some issues were missed entirely.
Increased complexity, little benefit: The added steps made the system more complicated without reliably outperforming the previous setup.

It was an interesting experiment, but ultimately not a solution.

A creative use of agentic frameworks

This was the turning point. While agentic frameworks are usually used for orchestrating task flows (think agents coordinating calendar invites or customer service actions), we wondered whether breaking the image interpretation task into smaller, specialized agents might help.

We built an agentic framework structured like this:

Orchestrator agent: It checked the image and identified which laptop components were visible (screen, keyboard, chassis, ports).
Component agents: Dedicated agents inspected each component for specific damage types. For example, one for cracked screens, another for missing keys.
Junk detection agent: A separate agent flagged whether the image was even a laptop in the first place.
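
The structure above can be sketched in a few lines of Python. This is an illustrative skeleton, not our production code: each agent is modeled as a plain callable that would wrap a narrowly scoped model call, running the junk check first is an assumption, and the dictionary-based toy image and the names `Report` and `inspect_laptop` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Report:
    is_laptop: bool
    findings: dict[str, list[str]] = field(default_factory=dict)


def inspect_laptop(
    image: object,
    is_laptop: Callable[[object], bool],
    find_components: Callable[[object], list[str]],
    component_agents: dict[str, Callable[[object], list[str]]],
) -> Report:
    """Junk check, then orchestration, then one specialist per component."""
    if not is_laptop(image):  # junk detection agent
        return Report(is_laptop=False)
    report = Report(is_laptop=True)
    for component in find_components(image):  # orchestrator agent
        agent = component_agents.get(component)
        if agent is None:
            continue  # no specialist for this component, so nothing is reported
        damage = agent(image)  # component agent
        if damage:
            report.findings[component] = damage
    return report


# Toy stand-ins so the sketch runs without any model.
photo = {"kind": "laptop", "visible": ["screen", "keyboard", "ports"],
         "screen": ["crack"], "keyboard": []}
agents = {
    "screen": lambda img: img.get("screen", []),
    "keyboard": lambda img: img.get("keyboard", []),
}
result = inspect_laptop(photo,
                        is_laptop=lambda img: img.get("kind") == "laptop",
                        find_components=lambda img: img.get("visible", []),
                        component_agents=agents)
```

Because every finding is filed under the component and the agent that produced it, a wrong result can be traced to one narrow task rather than to a single opaque prompt.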

This modular, task-driven approach produced much more precise and explainable results. Hallucinations dropped dramatically, junk images were reliably flagged and each agent's task was simple and focused enough to control quality well.

The blind spots: Trade-offs of an agentic approach

As effective as this was, it was not perfect. Two main limitations showed up:

Increased latency: Running multiple agents in sequence added to the total inference time.
Coverage gaps: Agents could only detect issues they were explicitly programmed to look for. If an image showed something unexpected that no agent was tasked with identifying, it would go unnoticed.

We needed a way to balance precision with coverage.

The hybrid solution: Combining agentic and monolithic approaches

To bridge the gaps, we created a hybrid system:

The agentic framework ran first, handling precise detection of known damage types and junk images. We limited the number of agents to the most essential ones to improve latency.
Then, a monolithic image LLM prompt scanned the image for anything else the agents might have missed.
Finally, we fine-tuned the model using a curated set of images for high-priority use cases, such as frequently reported damage scenarios, to further improve accuracy and reliability.
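
The first two stages can be sketched as a small pipeline. Fine-tuning happens offline, so it is not shown. As before, this is a hedged illustration rather than our implementation: `run_agents` and `monolithic_scan` are hypothetical callables, and returning `None` for a junk image is an assumed convention.

```python
from typing import Callable, Optional


def hybrid_inspect(
    image: object,
    run_agents: Callable[[object], Optional[dict[str, list[str]]]],
    monolithic_scan: Callable[[object], list[str]],
) -> dict:
    """Run the agents first, then a catch-all prompt for anything they missed."""
    agent_findings = run_agents(image)
    if agent_findings is None:  # junk image: skip the second stage entirely
        return {"is_laptop": False, "agent": [], "catch_all": []}
    known = sorted({d for items in agent_findings.values() for d in items})
    seen = {d.lower() for d in known}
    extras = []
    for damage in monolithic_scan(image):
        if damage.lower() not in seen:  # keep only what the agents did not report
            seen.add(damage.lower())
            extras.append(damage)
    return {"is_laptop": True, "agent": known, "catch_all": extras}


# Toy stand-ins so the sketch runs without any model.
outcome = hybrid_inspect(
    "photo.jpg",
    run_agents=lambda img: {"screen": ["cracked screen"]},
    monolithic_scan=lambda img: ["Cracked screen", "dented lid"],
)
```

Keeping the two sources separate in the result is a deliberate choice in this sketch. Findings from the catch-all prompt come from the approach that was more prone to hallucination, so a downstream reviewer may want to treat them with more caution.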

This combination gave us the precision and explainability of the agentic setup, the broad coverage of monolithic prompting and the confidence boost of targeted fine-tuning.

What we learned

A few things became clear by the time we wrapped up this project:

Agentic frameworks are more versatile than they get credit for: While they are usually associated with workflow management, we found they could meaningfully boost model performance when applied in a structured, modular way.
Blending different approaches beats relying on just one: The combination of precise, agent-based detection alongside the broad coverage of LLMs, plus a bit of fine-tuning where it mattered most, gave us far more reliable outcomes than any single method on its own.
Visual models are prone to hallucinations: Even the more advanced setups can jump to conclusions or see things that are not there. It takes a thoughtful system design to keep those mistakes in check.
Image quality variety makes a difference: Training and testing with both clear, high-resolution images and everyday, lower-quality ones helped the model stay resilient when facing unpredictable, real-world photos.
You need a way to catch junk images: A dedicated check for junk or unrelated pictures was one of the simplest changes we made, and it had an outsized impact on overall system reliability.

Final thoughts

What started as a simple idea, using an LLM prompt to detect physical damage in laptop images, quickly turned into a much deeper experiment in combining different AI techniques to tackle unpredictable, real-world problems. Along the way, we realized that some of the most useful tools were ones not originally designed for this type of work.

Agentic frameworks, often seen as workflow utilities, proved surprisingly effective when repurposed for tasks like structured damage detection and image filtering. With a bit of creativity, they helped us build a system that was not just more accurate, but also easier to understand and manage in practice.

Shruti Tiwari is an AI product manager at Dell Technologies.

Vadiraj Kulkarni is a data scientist at Dell Technologies.
