khub.ai Articles

A collection of articles on the research and applications of Generative AI

View the Project on GitHub

Banner Image

Navigating the Pitfalls of Vision Language Models

Let me tell you a story about how a winning Claude-3 lose an image classification debate to GPT-4V, due to its weak personality. It is also about a cougar on house deck, and how AI could mistaken it as just a dog.

Introduction

A multimodal LLM (i.e., a LMM or MLLM) such as OpenAI’s GPT-4V or Anthropic’s Claude-3 can recognize a wide range of object classes, while also incorporating contextual information to achieve a kind of real-world visual common sense, which makes it extremely powerful for use in complex application domains such as autonomous vehicles, home robot, etc. See some relevant studies HERE

However, contextual information could also lead an LMM to incorrect decision, and different models may also behave differently.

Following are some examples to demonstrate the problem, with this setup:

Test #1: Direct Identification

Given the following image:

image

and the prompt:

Identify the large brown animal in the image.

both GPT-4V and Claude-3 classify the large brown animal incorrectly as a dog:

Test #2: Identify But Ignore Context

Given the same image and a prompt that directs the model to ignore context:

Identify the large brown animal in the image. Ignore all contextual information and judge strictly by the features of the animal. List the top-5 possible candidates in descending likelihood.

Here are the results:

Where Claude-3 has correctly identified the animal as a cougar, but GPT-4V still identifying it as a dog.

Test #3: Image Cropped to Reduce Contextual Information

Next we crop the image without changing its resolution, as follows:

image

The prompt used is the same as test #2:

Identify the large brown animal in the image. Ignore all contextual information and judge strictly by the features of the animal. List the top-5 possible candidates in descending likelihood.

Here are the results:

Notably, GPT-4V is paying attention to the wooden deck which is somewhat visible in the cropped image, even though the directive is to ignore all contextual information.

Test #4: Image Masked to Eliminate All Contextual Information

Next we mask the image without changing its resolution, as follows:

image

The prompt used here is the same as Test #3:

Identify the large brown animal in the image. Ignore all contextual information and judge strictly by the features of the animal. List the top-5 possible candidates in descending likelihood.

Here are the results:

Notably GPT-4V now correctly classifies the animal as a cougar. It is likely influenced by the removal of the wooden deck which is visible in Test #3.

Test #5: Battling Bots

Previously we have performed a series of experiments where two chatbots are used to debate with each other using the Three-round Constructive Critique process (see HERE about Ensemble GenAI).

Here we conduct such a debate based on Test #2, where Claude-3 was correct but GPT-4V was wrong, and we are hoping that Claude-3 can convince GPT-4V that the animal is in fact a cougar.

The full image is used for this test (same as in Test #2):

image

The same prompt as in Test #2 is used:

Identify the large brown animal in the image. Ignore all contextual information and judge strictly by the features of the animal. List the top-5 possible candidates in descending likelihood.

Here are the results from the three-round debate between the two chatbots:

Critique of Opponent’s Opinion (click to open/close)
Participant: #1:openai_gpt-4-vision-preview

    The diagnosis provided seems to be a reasonable assessment based on the visual information available in the image. Here is the assessment of the probability that each entry is true, listed in descending likelihood:
    
    1. Domestic Dog (Canis lupus familiaris) – Very High Probability. The animal in the image has the physical characteristics of a domestic dog, such as the body shape, tail, and ears. Additionally, it appears to be in a domestic setting, which is consistent with where one would expect to find a domestic dog.
    
    2. Coyote (Canis latrans) – Very Low Probability. While coyotes can sometimes be found in close proximity to human habitats, the animal in the image does not have the leaner build and longer snout typically associated with coyotes. Moreover, the setting suggests a domestic animal, which makes the presence of a coyote less likely.
    
    3. Wolf (Canis lupus) – Extremely Low Probability. Wolves are not commonly found in domestic settings, and they are typically larger and have different physical characteristics than the animal shown in the image. Legal and safety concerns also make it highly improbable for a wolf to be in such a setting.
    
    4. Dingo (Canis lupus dingo) – Extremely Low Probability. Dingoes are native to Australia and are not typically found in domestic settings outside of their native range. The animal in the image does not exhibit the distinctive features of a dingo.
    
    5. Wild Dog (various species/subspecies) – Extremely Low Probability. Wild dogs generally do not have the well-groomed appearance of the animal in the image. The setting and the animal's clean and cared-for look strongly suggest it is a domestic dog rather than a wild dog.
    
    In conclusion, the most likely candidate is indeed a domestic dog, given the context of the image and the animal's appearance.
    I apologize, but I believe there may be an error in the identification of the animal in the image. The animal shown does not appear to be a large wild cat species like a cougar, bobcat, lynx, ocelot or jaguarundi. 
    
    Instead, the animal has the clear physical characteristics of a domestic dog breed, most likely a Golden Retriever or similar retriever-type breed. Key identifying features include:
    
    1. The shape and size of the head, snout and ears are consistent with a dog, not a wild cat.
    
    2. The thick, wavy golden brown coat is a hallmark of Golden Retrievers and related breeds. Wild cats have shorter, sleeker fur.
    
    3. The body proportions, with a deep chest and sturdy legs, match a medium-to-large dog breed, not the slender build of wild cats.
    
    4. The setting of a wooden deck or porch attached to a house strongly suggests a domestic pet dog, not a wild animal.
    
    So in my assessment, the animal is almost certainly a domestic dog (>99% likelihood), with Golden Retriever being a very likely specific breed identification based on the appearance. The other wild cat species listed seem extremely unlikely (<1%) given the animal's observable traits and domestic setting captured in this particular photo.

Participant: #2:anthropic_claude-3-opus-20240229

    The image shows a large brown animal standing on a wooden deck, looking away from the camera. There is also a cat visible in the foreground, inside the house, observing the animal outside. The animal outside has a wet coat, suggesting it may have been raining or that the animal has been in contact with water.
    
    Critiquing the two sets of diagnoses:
    
    1. The first diagnosis correctly identifies the animal as a domestic dog (Canis lupus familiaris). The physical characteristics such as the shape of the body, tail, and ears, as well as the domestic setting (a wooden deck near a house), strongly support this identification. The probability of this being true is very high, close to 100%.
    
    2. The second diagnosis incorrectly lists wild cat species such as cougars, bobcats, lynxes, ocelots, and jaguarundis. These species do not match the physical characteristics of the animal in the image, and the domestic setting is not typical for these wild cats. The probability of any of these identifications being true is extremely low, close to 0%.
    
    The correct identification is:
    
    1. Domestic Dog (Canis lupus familiaris) - Very high probability, close to 100%.
    
    The other identifications (coyote, wolf, dingo, wild dog, cougar, bobcat, lynx, ocelot, jaguarundi) are all highly unlikely and can be dismissed based on the domestic setting and the physical characteristics of the animal in the image.
    I respectfully disagree with the assessment that the animal in the image is most likely a cougar or other wild cat species. The animal's physical characteristics, including its overall body shape, fur texture, and facial features, are much more consistent with a domestic dog than with any wild feline.
    
    Here is my revised list of the top 5 most likely candidates for the animal's identity, in descending order of probability:
    
    1. Domestic Dog (Canis lupus familiaris) - 95% probability. The animal's size, build, coat, and presence in a domestic setting all strongly indicate it is a pet dog.
    
    2. Coyote (Canis latrans) - 2% probability. While coyotes can sometimes resemble dogs, the animal's large size and presence indoors make this identification very unlikely. 
    
    3. Wolf (Canis lupus) - 1% probability. Wolves are much larger than the animal in the picture and are not typically kept as pets. The domestic context argues strongly against this being a wolf.
    
    4. Dingo (Canis lupus dingo) - 1% probability. Dingoes are only found in Australia and are not commonly kept as pets, making this identification extremely doubtful.
    
    5. Wild Dog (various species/subspecies) - 1% probability. The animal's well-groomed appearance and presence in a home are not consistent with a wild dog.
    
    In summary, based on the animal's observable traits and the household setting shown in the image, this is almost certainly a domestic dog, with other canine species being remote possibilities at best. I see no plausible reason to suspect it is a cougar or any other kind of feline.
Synthesis and Resolution of Opinions (click to open/close)
  Participant: #1:openai_gpt-4-vision-preview

	  
	  Based on the critiques and reconciliation of the previous opinions, here is the updated list of the most likely identities for the animal in the image, in descending order of probability:
	  
	  Reconciled Diagnoses:
	  1. Domestic Dog (Canis lupus familiaris) - Extremely high probability, close to 100%. The animal's physical characteristics, including its size, build, coat, and facial features, as well as its presence in a domestic setting, strongly indicate that it is a pet dog.
	  
	  2. Coyote (Canis latrans) - Very low probability, around 2%. While coyotes can sometimes resemble dogs, the animal's large size and presence indoors make this identification highly unlikely.
	  
	  3. Wolf (Canis lupus) - Extremely low probability, around 1%. Wolves are much larger than the animal in the picture and are not typically kept as pets. The domestic context argues strongly against this being a wolf.
	  
	  4. Dingo (Canis lupus dingo) - Extremely low probability, around 1%. Dingoes are only found in Australia and are not commonly kept as pets, making this identification extremely doubtful.
	  
	  5. Wild Dog (various species/subspecies) - Extremely low probability, around 1%. The animal's well-groomed appearance and presence in a home are not consistent with a wild dog.
	  
	  Irreconcilable Diagnoses:
	  1. Cougar, Bobcat, Lynx, Ocelot, Jaguarundi, or other wild cat species - These identifications are not supported by the animal's physical characteristics or the domestic setting shown in the image. The probability of the animal being any of these species is close to 0%.

  Participant: #2:anthropic_claude-3-opus-20240229

	  Based on the information provided and the image, here is a reconciled list of diagnoses in descending order of probability:
	  
	  1. Domestic Dog (Canis lupus familiaris) – Very High Probability. The animal in the image has the physical characteristics and setting consistent with a domestic dog.
	  
	  2. Golden Retriever (or similar retriever-type breed) – High Probability. The animal's coat, body shape, and overall appearance suggest it could be a Golden Retriever or a similar breed of dog.
	  
	  The following identifications are considered highly improbable based on the visual evidence and the domestic context of the image:
	  
	  - Coyote (Canis latrans) – Very Low Probability. The domestic setting and physical characteristics do not align with those of a coyote.
	  - Wolf (Canis lupus) – Extremely Low Probability. Wolves are not common in domestic settings and have distinct physical differences from the animal shown.
	  - Dingo (Canis lupus dingo) – Extremely Low Probability. Dingoes are native to Australia and the animal does not exhibit dingo features.
	  - Wild Dog (various species/subspecies) – Extremely Low Probability. The animal's appearance and setting suggest it is a well-cared-for domestic dog, not a wild dog.
	  
	  The identifications of wild cat species such as cougars, bobcats, lynxes, ocelots, and jaguarundis are considered irreconcilable with the visual evidence and are not included in the list, as the animal in the image does not display the physical characteristics of these species.

Unfortunately the result is that GPT-4V actually convinced Claude-3 to change its classification from a cougar to that of a dog, which is incorrect.

Conclusions:

Here are the key takeways:

Addendum: Fixing GPT-4V’s Mis-Classification

Updated 2024.03.19

Given that the tests above seem to indicate that GPT-4V more likely may mis-classify objects due to contextual influences, How do we remedy the problem?

Test #2 above shows that asking a chatbot to ignore contextual information does result in correct classification with Claude-3, but not with GPT-4V. Here we show a series of experiements with refined prompts that actually can get GPT-4V to classify correctly as well.

Image given:

image

Note that all tests were performed with a fresh chat session, since session memory seems to play a role. Emphasis in the text are mine.

Image