AI in Fashion: Can GPT-4o Mini and Gemini 2.0 Flash Predict Fine-Grained Product Attributes? A Zero-Shot Analysis

Exploring whether GPT-4o Mini and Gemini 2.0 Flash can predict fine-grained fashion product attributes in a zero-shot setting, reshaping e-commerce catalogs

September 7, 2025

Fashion e-commerce thrives on detailed product attribution, but manual labeling is unsustainable at scale. This article explores a zero-shot study comparing GPT-4o Mini and Gemini 2.0 Flash across 18 fashion attribute categories using the DeepFashion-MultiModal dataset. Results show Gemini 2.0 Flash leading with a macro F1 score of 56.79%, while GPT-4o Mini trails at 43.28%. Through error analysis, case studies, and industry insights, we highlight how LLMs are reshaping cataloging, personalization, and future retail AI strategies.

Why Fashion Attribution is the Backbone of Retail

In the digital-first retail world, product attribution—the labeling of characteristics like fabric type, neckline, sleeve length, or style—is what powers product discovery. Without accurate attributes, customers searching for “silk red maxi dress” may face irrelevant or empty results, leading to frustration and abandoned carts.

For e-commerce giants like Amazon, ASOS, or Zalando, accurate attribution means the difference between frictionless browsing and catalog chaos. But with millions of new SKUs every season, manual labeling is impossible. Traditional computer vision models helped, but they struggled with fine-grained differences like “midi vs. maxi dress” or “crew neck vs. round neck.”

This is where multimodal large language models (LLMs) like GPT-4o Mini and Gemini 2.0 Flash step in.

The Promise of Multimodal AI in Fashion

LLMs were first praised for text-based tasks, but recent generations integrate vision + language understanding, enabling models to “see” an image and describe it in natural language. For fashion, this translates into:

  • Recognizing fabrics, cuts, and patterns.
  • Understanding subtle differences between similar attributes.
  • Scaling catalog organization without manual annotation.

The challenge? Can they achieve this in a zero-shot setting—without fine-tuning—when presented with raw fashion images?

Zero-Shot Learning: The E-Commerce Challenge

Zero-shot learning (ZSL) is the holy grail for scaling fashion AI. Instead of retraining models for every new collection or brand, ZSL allows models to generalize from prior knowledge and classify unseen categories.

Imagine:

  • A retailer uploads 50,000 new images of summer wear.
  • The model, without explicit training, recognizes “halter neck,” “linen fabric,” or “floral print” directly.

This is the promise tested in the comparison of GPT-4o Mini and Gemini 2.0 Flash.
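In practice, a zero-shot attribute request boils down to a carefully structured prompt plus strict parsing of the model's reply. The sketch below shows one way this could look; the attribute subset, prompt wording, and JSON reply format are illustrative assumptions, not the study's actual protocol (the real benchmark spans all 18 categories).

```python
import json

# Hypothetical subset of the 18 DeepFashion-MultiModal attribute categories.
ATTRIBUTES = {
    "sleeve_length": ["sleeveless", "short", "three-quarter", "long"],
    "neckline": ["crew", "round", "v-neck", "square", "halter"],
    "fabric": ["cotton", "denim", "silk", "linen", "leather"],
    "pattern": ["solid", "floral", "striped", "polka dot"],
}

def build_prompt(attributes: dict) -> str:
    """Assemble a zero-shot instruction asking the model to return strict JSON."""
    lines = ["Classify the garment in the image. For each attribute, pick exactly one value:"]
    for name, values in attributes.items():
        lines.append(f"- {name}: {', '.join(values)}")
    lines.append('Reply with only a JSON object, e.g. {"neckline": "v-neck"}.')
    return "\n".join(lines)

def parse_response(raw: str, attributes: dict) -> dict:
    """Parse the model reply, keeping only values from the allowed label sets."""
    data = json.loads(raw)
    return {k: v for k, v in data.items() if k in attributes and v in attributes[k]}

# A mocked model reply; note the out-of-schema "cut" key gets dropped.
reply = '{"neckline": "halter", "fabric": "linen", "pattern": "floral", "cut": "bias"}'
print(parse_response(reply, ATTRIBUTES))
```

Constraining the reply to a closed label vocabulary is what makes zero-shot outputs comparable against ground-truth annotations; free-form descriptions would need fuzzy matching instead.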

The Dataset: DeepFashion-MultiModal

To evaluate, researchers used the DeepFashion-MultiModal dataset, one of the most robust fashion AI datasets.

Key Features:

  • 18 attribute categories (sleeve length, neckline, color, fabric, pattern, etc.).
  • Diverse product images from e-commerce sites.
  • Ground truth annotations for benchmark testing.
  • Multimodal design (images + metadata), though the experiment constrained input to images only for realism.

This constraint mirrored real-world retail pipelines, where images often arrive before structured metadata.
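Evaluation like this typically starts by loading the ground-truth annotations into a per-image attribute map. A minimal sketch, assuming a simple filename-to-attributes JSON layout (the real DeepFashion-MultiModal annotation files are organized differently):

```python
import json

# Hypothetical annotation layout for illustration only.
raw = """
{
  "img_001.jpg": {"sleeve_length": "long", "fabric": "denim"},
  "img_002.jpg": {"sleeve_length": "short", "fabric": "silk"}
}
"""

def load_ground_truth(text: str) -> dict:
    """Map image filename -> attribute dict, dropping images with no labels."""
    return {img: attrs for img, attrs in json.loads(text).items() if attrs}

labels = load_ground_truth(raw)
print(len(labels), "annotated images")
```

With predictions and ground truth keyed by the same filenames, per-category scoring reduces to a join on image IDs.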

Meet the Contenders

GPT-4o Mini

  • Built for speed and cost efficiency.
  • Lightweight version of GPT-4o.
  • Performs well in multimodal reasoning but prioritizes efficiency over accuracy.

Gemini 2.0 Flash

  • Google DeepMind’s fast, cost-efficient multimodal model.
  • Balances high performance and inference speed.
  • Trained extensively on vision-language alignment tasks, making it better suited for attribute extraction.

Results: Performance Metrics

| Model            | Macro F1 Score | Strengths                                                    | Weaknesses                                                |
|------------------|----------------|--------------------------------------------------------------|-----------------------------------------------------------|
| GPT-4o Mini      | 43.28%         | Fast, cost-effective                                         | Confuses similar categories (e.g., crew vs. round neck)   |
| Gemini 2.0 Flash | 56.79%         | Strong attribute recognition, robust to patterns & materials | Higher cost, slower at scale                              |

Takeaway: Gemini 2.0 Flash significantly outperformed GPT-4o Mini, especially in pattern recognition (floral, striped, polka dots), fabric classification (denim, silk, cotton), and shape-related features (V-neck vs. square neck).
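The headline numbers above are macro-averaged F1 scores, which weight every class equally rather than favoring frequent attributes. For readers who want to reproduce the metric, here is a small dependency-free implementation (scikit-learn's `f1_score` with `average="macro"` gives the same result):

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then average with equal weight."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Toy neckline example: one crew/round confusion out of four predictions.
truth = ["crew", "v-neck", "crew", "halter"]
pred  = ["round", "v-neck", "crew", "halter"]
print(f"macro F1 = {macro_f1(truth, pred):.4f}")
```

Note how the single crew-vs-round confusion drags the macro average down twice (once for each affected class), which is exactly why fine-grained neckline errors hurt GPT-4o Mini's score so much.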

Error Analysis

  1. Lighting & Pose Variations
    • Both models struggled with images under poor lighting or non-standard poses.
  2. Ambiguity in Attributes
    • Distinguishing “midi” vs. “maxi” dresses caused misclassification.
  3. Cultural Clothing Bias
    • Traditional wear like sarees or qipaos were often mislabeled, highlighting the dataset’s Western bias.

Practical Applications in E-Commerce

  1. Catalog Automation
    • Gemini’s higher accuracy could help auto-tag millions of products, reducing reliance on manual teams.
  2. Search & Discovery Optimization
    • Fine-grained attributes improve relevance in search queries, reducing bounce rates.
  3. Personalized Recommendations
    • Customers get style-matched recommendations (e.g., “similar maxi dresses in silk fabric”).
  4. Fraud Detection
    • Misattributed listings (common in counterfeit markets) could be flagged more easily.

Case Studies

Case 1: Zara’s Dynamic Cataloging

Zara’s fast fashion model requires weekly catalog updates. Automating attributes with Gemini could cut turnaround time by 50%, keeping pace with trends.

Case 2: Amazon’s Scale

Amazon lists millions of items daily. A two-tiered pipeline could use GPT-4o Mini for bulk processing and Gemini for refinement in high-value categories.
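A two-tiered pipeline of this kind is essentially a cost-aware router: send everything through the cheap model first, then escalate high-value categories or low-confidence results to the stronger model. The sketch below is a hypothetical illustration with both model calls stubbed out; the threshold, category names, and confidence fields are assumptions, not values from the study.

```python
# Hypothetical cost-aware router; real deployments would call the
# OpenAI and Vertex AI APIs where these stubs return canned results.

HIGH_VALUE = {"designer", "luxury"}

def cheap_model(item):   # stand-in for a GPT-4o Mini call
    return {"attrs": {"neckline": "round"}, "confidence": item.get("conf", 0.9)}

def strong_model(item):  # stand-in for a Gemini 2.0 Flash call
    return {"attrs": {"neckline": "crew"}, "confidence": 0.97}

def tag(item, threshold=0.8):
    """Route an item: escalate high-value categories or low-confidence results."""
    if item["category"] in HIGH_VALUE:
        return strong_model(item)
    first = cheap_model(item)
    return first if first["confidence"] >= threshold else strong_model(item)

print(tag({"category": "basics", "conf": 0.9}))  # stays on the cheap tier
print(tag({"category": "luxury"}))               # escalated to the strong tier
```

The design choice here is that the expensive model only sees the slice of traffic where its accuracy advantage actually pays for itself.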

Tools for Practitioners

  • Datasets: DeepFashion-MultiModal, ModaNet.
  • APIs: OpenAI (GPT-4o Mini), Google Vertex AI (Gemini 2.0).
  • Evaluation: Hugging Face for benchmarking, F1 score comparisons.
  • Hybrid Models: Combine zero-shot LLMs with fine-tuned fashion-specific models for best results.

Future Research Directions

  1. Domain-Specific Fine-Tuning
    • Training on fashion datasets could push F1 scores past 80%.
  2. Multimodal Fusion
    • Combining images + text metadata for more robust performance.
  3. Bias Mitigation
    • Expanding training sets to cover diverse global fashion.
  4. Real-Time Deployment
    • Using Gemini for real-time catalog tagging at e-commerce scale.

Conclusion

Fine-grained product attribution is the invisible engine of fashion e-commerce. While GPT-4o Mini offers speed and efficiency, Gemini 2.0 Flash currently outperforms in accuracy and robustness.

For businesses, the optimal strategy is hybrid pipelines—low-cost bulk tagging with GPT-4o Mini, refined by Gemini for critical categories.

This study is not the finish line but the starting point: with domain fine-tuning, multimodal fusion, and expanded datasets, AI can truly revolutionize the fashion discovery experience.
