
AI in Fashion: Can GPT-4o Mini and Gemini 2.0 Flash Predict Fine-Grained Product Attributes? A Zero-Shot Analysis

Exploring whether GPT-4o Mini and Gemini 2.0 Flash can predict fine-grained fashion product attributes in a zero-shot setting, reshaping e-commerce catalogs

November 15, 2025


The fashion industry, a behemoth built on aesthetics, trend forecasting, and intricate product details, is undergoing a profound transformation. At the heart of this change is Artificial Intelligence, promising to automate, personalize, and optimize everything from supply chains to customer interactions. For e-commerce giants and boutique brands alike, the accurate and scalable tagging of product attributes—the very language that connects a shopper's query to a specific garment—is a monumental challenge and a critical competitive advantage.

Enter the latest generation of cost-effective, high-speed large language models (LLMs): OpenAI's GPT-4o Mini and Google's Gemini 2.0 Flash. These models represent a significant shift towards operational AI—tools that are fast and inexpensive enough to be integrated into real-time workflows. But a pressing question remains: Can these streamlined models handle the nuanced, subjective, and highly detailed world of fashion description without specialized training?

This analysis delves into a zero-shot evaluation of these two AI powerhouses. We task them with predicting fine-grained product attributes—from "silk chiffon fabric" and "bardot neckline" to "western embroidery" and "ruched detailing"—based solely on product titles and brief descriptions. We are not asking them to become fashion designers, but to perform as hyper-efficient, astute digital merchandisers. The implications are vast, touching on everything from semantic search optimization to the very future of E-E-A-T signals in product discovery. This is not merely an academic exercise; it is a practical investigation into the readiness of today's most accessible AI for one of the world's most visually complex industries.

The New Frontier: Operational AI and the Fashion E-Commerce Conundrum

The digital fashion landscape is a battlefield of data. Millions of SKUs, each with a myriad of attributes, flood online marketplaces daily. The traditional solution has been a combination of manual tagging by human merchandisers and rule-based systems, both of which are fraught with limitations. Human tagging is slow, expensive, and prone to inconsistency, while rule-based systems are brittle and incapable of understanding context or nuance. The result is a chaotic data layer where a "cocktail dress" might be tagged with dozens of conflicting or incomplete attributes, leading to poor search results, missed sales, and frustrated customers.

This is where the promise of operational AI comes into sharp focus. Models like GPT-4o Mini and Gemini 2.0 Flash are engineered not for groundbreaking creativity, but for relentless, reliable, and rapid inference. They are designed to be plugged into APIs and process thousands of requests per minute at a fraction of the cost of their larger, more powerful siblings. For an e-commerce platform, this could mean automatically generating rich, accurate attribute tags for every new product listing in near real-time, a capability that was previously cost-prohibitive.

The core challenge, however, lies in the "fine-grained" nature of fashion attributes. We are not talking about broad categories like "shirt" or "dress." We are dealing with a lexicon of specificity that would challenge even a seasoned stylist. Consider the following distinctions:

  • Necklines: Crew, V-neck, Scoop, Boat, Sweetheart, Halter, Bardot, Cowl, Square.
  • Sleeve Types: Cap, Short, Three-Quarter, Long, Bell, Bishop, Flutter, Leg-of-Mutton.
  • Fabrics & Materials: Silk Chiffon, Egyptian Cotton, Technical Gabardine, Stretch Ponte, Linen-Blend, Faux Fur.
  • Details & Embellishments: Smocking, Ruching, Appliqué, Laser-Cut, Contrast Topstitching, Guipure Lace.

Accurately identifying these from a brief text description requires more than simple keyword matching. It demands a deep, contextual understanding of fashion terminology and how these terms relate to one another. For instance, discerning that a "wrap-style bodice with ruched side detailing" implies both a specific silhouette and a construction technique is a complex semantic task.

The move towards operational AI in e-commerce is akin to the industrial revolution for data. It's about scaling intelligence in a way that is economically viable. The success of models like GPT-4o Mini and Gemini 2.0 Flash in this domain will be a key indicator of how quickly AI can move from a lab curiosity to a core business utility.

Furthermore, the stakes are high for technical SEO and site architecture. Well-tagged products create a rich tapestry of internal linking and entity-based relationships that search engines like Google can crawl and understand. This moves product pages beyond simple keyword relevance into the realm of entity-based SEO, where the AI's ability to parse and assign fine-grained attributes directly influences a site's visibility in an increasingly sophisticated search ecosystem. The ability to do this at scale, without task-specific training or hand-labeled examples (the "zero-shot" approach) and with minimal manual intervention, is the holy grail.

Meet the Contenders: A Technical Profile of GPT-4o Mini and Gemini 2.0 Flash

Before we analyze their performance, it's crucial to understand the architectural and philosophical underpinnings of our two AI contestants. While both are positioned as efficient, smaller-scale models, their design choices reveal different paths toward the same goal: scalable intelligence.

OpenAI's GPT-4o Mini: The Refined Generalist

GPT-4o Mini is the accessible entry-point into the GPT-4o ("omni") family. It's a model designed to offer a significant portion of the flagship model's capability but at a drastically reduced cost and latency. Its strength lies in its foundational training—a vast and diverse corpus of internet text, books, and code that has given it a robust, general-purpose understanding of language, context, and nuance.

For a task like fashion attribute prediction, this generalist background is a double-edged sword. On one hand, it has likely been exposed to countless product descriptions, fashion blogs, and style guides during its training, giving it a latent understanding of the domain. It can infer that a garment described as "flowing and airy" is likely made of chiffon or silk based on its broad knowledge. On the other hand, it may lack the precise, technical depth of a model specifically fine-tuned on fashion data, potentially leading to confusion between similar but distinct terms (e.g., "jacquard" vs. "brocade").

Key characteristics for our analysis:

  • Architecture: A distilled version of the multimodal GPT-4o, optimized for text-based tasks.
  • Strength: Strong natural language understanding and contextual reasoning inherited from its larger predecessor.
  • Potential Weakness: May prioritize common-sense associations over industry-specific precision.
  • Cost & Speed: Engineered for high-throughput, low-latency applications, making it ideal for bulk processing of product listings.

Google's Gemini 2.0 Flash: The Scalable Specialist

Gemini 2.0 Flash is Google's answer to the demand for a fast and efficient inference model. Born from the Gemini ecosystem, which was designed from the ground up to be natively multimodal, Flash carries with it a DNA that is deeply integrated with Google's vast knowledge graph and search infrastructure. This connection provides it with a potentially powerful advantage: access to a structured, factual understanding of the world.

Where GPT-4o Mini might reason by analogy, Gemini 2.0 Flash can, in theory, tap into a more formalized ontology of concepts. When it encounters the term "peplum," it might not just be guessing based on context; it might be referencing a structured definition from its training data, which is heavily influenced by the organized chaos of the internet that Google has spent decades indexing. This makes it a fascinating candidate for a task that is, at its core, about mapping textual descriptions to a predefined set of entities (the attributes).

Key characteristics for our analysis:

  • Architecture: A lightweight version of the Gemini 2.0 model family, built for speed.
  • Strength: Potential for strong entity recognition and disambiguation, leveraging Google's knowledge graph.
  • Potential Weakness: Its reasoning might be more "brittle"—highly accurate on clear-cut cases but less adept at interpreting ambiguous or creatively written descriptions.
  • Cost & Speed: Similarly positioned as a cost-effective solution for large-scale applications, with tight integration into the Google Cloud ecosystem.

The contrast between these two approaches will be critical in our analysis. We are essentially testing a refined generalist against a scalable specialist-in-waiting. The winner may not be the one with the most raw power, but the one whose inherent "reasoning style" is best suited to the poetic yet precise language of fashion. This has direct parallels to the evolution of AI in other complex fields like SEO and backlink analysis, where understanding context is everything.

Defining "Fine-Grained": The Attribute Taxonomy for Our Zero-Shot Experiment

To conduct a meaningful analysis, we must first establish a clear and challenging taxonomy of product attributes. "Fine-grained" is a relative term; for this study, we define it as attributes that are specific, descriptive, and often sub-categorical. They are the adjectives and nouns that a fashion expert would use to distinguish one black dress from a hundred others. Our taxonomy is divided into several key dimensions, each with its own set of complexities.

1. Silhouette and Fit Attributes

These describe the overall shape and how the garment relates to the body. They are often inferred rather than explicitly stated.

  • Examples: A-Line, Sheath, Bodycon, Fit-and-Flare, Oversized, Slim Fit, Relaxed Fit, Tailored.
  • Challenge: A description might say "fitted through the bodice and flaring at the knee," which requires the model to synthesize this into the single attribute "Fit-and-Flare." Another might use the term "boyfriend" to describe a fit for a shirt or blazer, a nuanced style term.

2. Neckline and Sleeve Attributes

These are highly specific and numerous. Accuracy here is a strong indicator of the model's grasp of fashion terminology.

  • Examples: Boat Neck, Cowl Neck, Off-the-Shoulder (Bardot), Halter, Cold-Shoulder, Bell Sleeves, Raglan Sleeves.
  • Challenge: Distinguishing between similar styles, e.g., a "V-Neck" vs. a "Deep V-Neck," or understanding that "cap sleeves" and "short sleeves" are not always synonymous.

3. Fabric, Material, and Texture

This is perhaps the most technically demanding category, filled with material blends and specific weave types.

  • Examples: Crinkled Cotton, Brushed Twill, Stretch Denim, Silk Satin, Bouclé, Tweed, Seersucker.
  • Challenge: Differentiating between fabric type (e.g., "silk") and the weave or finish (e.g., "satin," which can be made from silk or polyester). Understanding that "jersey" is a knit, not a fiber. Recognizing texture words like "crinkled" or "pleated" as material attributes.

4. Design Details and Embellishments

This category tests the model's ability to identify specific construction techniques and decorative elements.

  • Examples: Ruching, Smocking, Pleats, Draping, Cut-Outs, Lace Insets, Beading, Embroidery (e.g., Crewel, Western).
  • Challenge: These are often the most subtly described features. "Accented with delicate threadwork" should be interpreted as "embroidery." "Gathered details at the waist" points to "ruching" or "smocking." The model must go beyond the literal text to the implied technique.

5. Style and Aesthetic Attributes

These are the most subjective attributes, relating to the overall vibe or fashion genre.

  • Examples: Bohemian, Minimalist, Vintage, Athleisure, Workwear, Resort Wear, Punk.
  • Challenge: These are rarely stated directly in a product title. The model must infer them from a combination of other attributes. A "crochet top with flared sleeves and floral embroidery" strongly suggests a "Bohemian" style, while a "structured blazer in solid black" leans "Minimalist." This requires high-level, multi-factor reasoning.

This comprehensive taxonomy forms the benchmark against which we will measure the models. The zero-shot nature of the test means we provide no examples; we simply ask the model, via a carefully crafted prompt, to analyze the product text and return a list of relevant attributes from our predefined list. The sophistication of this prompt is itself a critical factor, a topic we will explore in the next section. The ability to correctly tag these attributes is directly analogous to creating the kind of deep, comprehensive content that search engines and users have come to expect.
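For readers who want to operationalize this, the taxonomy above can be expressed as a structured schema that is later embedded in the prompt. The following is a minimal Python sketch; the category keys and example values simply mirror the lists in this section, and a production taxonomy would be far larger and brand-specific.

```python
# Illustrative representation of the attribute taxonomy described above.
# Category names and example values mirror the lists in this section.
FASHION_TAXONOMY = {
    "silhouette_and_fit": [
        "A-Line", "Sheath", "Bodycon", "Fit-and-Flare",
        "Oversized", "Slim Fit", "Relaxed Fit", "Tailored",
    ],
    "neckline_and_sleeves": [
        "Boat Neck", "Cowl Neck", "Off-the-Shoulder (Bardot)", "Halter",
        "Cold-Shoulder", "Bell Sleeves", "Raglan Sleeves", "Puff Sleeves",
    ],
    "fabric_material_texture": [
        "Crinkled Cotton", "Brushed Twill", "Stretch Denim",
        "Silk Satin", "Bouclé", "Tweed", "Seersucker", "Rayon",
    ],
    "details_and_embellishments": [
        "Ruching", "Smocking", "Pleats", "Draping", "Cut-Outs",
        "Lace Insets", "Beading", "Western Embroidery",
    ],
    "style_and_aesthetic": [
        "Bohemian", "Minimalist", "Vintage", "Athleisure",
        "Workwear", "Resort Wear", "Punk",
    ],
}
```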

Crafting the Perfect Prompt: The Unsung Hero of Zero-Shot Performance

In the realm of zero-shot learning, the prompt is not merely a question; it is the entire context, instruction set, and reasoning framework provided to the model. A poorly constructed prompt can lead a trillion-parameter model to failure, while a meticulously engineered one can unlock surprising feats of intelligence. For our fashion attribute extraction task, prompt design is arguably as important as the model selection itself. We are, in effect, creating a "job description" for an AI merchandiser.

Our approach moved beyond simple instructions. We developed a structured prompt template that aimed to guide the model's reasoning process. The core components of this prompt were:

  1. Role Assignment: We began by explicitly assigning a role to the model: "You are an expert fashion merchandiser with deep knowledge of garment construction, fabrics, and style terminology." This sets a context and primes the model to access the relevant parts of its training data.
  2. Clear Task Definition: The primary task was stated concisely: "Analyze the following product title and description and extract all relevant fine-grained fashion attributes."
  3. Taxonomy Presentation: We provided the model with a structured list of the attribute categories and examples (as outlined in the previous section). This was not just a dump of words, but an organized schema for it to follow, mimicking a database it needed to populate.
  4. Output Format Specification: To ensure consistency and machine-readability for our analysis, we mandated a strict JSON output format. The instruction was: "Return your analysis as a JSON object with keys for 'product_title', 'inferred_attributes' (as a list), and 'reasoning' (a brief explanation for each key attribute)." The "reasoning" field was crucial for our qualitative analysis, allowing us to peer into the model's "thought process."
  5. Guidelines for Reasoning: We included specific instructions to avoid common pitfalls:
    • "Infer attributes that are strongly implied even if not explicitly named."
    • "Do not invent attributes that are not supported by the text."
    • "Distinguish between primary material (e.g., silk) and finish (e.g., satin)."
    • "Consider the overall combination of details to assign a style aesthetic (e.g., Bohemian, Minimalist)."

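Putting these five components together, the prompt can be assembled programmatically. The sketch below is illustrative rather than our exact production template: `build_prompt` is a hypothetical helper, and the taxonomy argument is the kind of schema sketched in the previous section.

```python
import json

def build_prompt(title: str, description: str, taxonomy: dict) -> str:
    """Assemble a zero-shot prompt from the five components described above."""
    taxonomy_block = json.dumps(taxonomy, indent=2, ensure_ascii=False)
    return "\n".join([
        # 1. Role assignment
        "You are an expert fashion merchandiser with deep knowledge of "
        "garment construction, fabrics, and style terminology.",
        # 2. Clear task definition
        "Analyze the following product title and description and extract "
        "all relevant fine-grained fashion attributes.",
        # 3. Taxonomy presentation
        f"Use only attributes from this taxonomy:\n{taxonomy_block}",
        # 4. Output format specification
        "Return your analysis as a JSON object with keys 'product_title', "
        "'inferred_attributes' (a list), and 'reasoning' (a brief explanation "
        "for each key attribute).",
        # 5. Guidelines for reasoning
        "Infer attributes that are strongly implied even if not explicitly named. "
        "Do not invent attributes that are not supported by the text. "
        "Distinguish between primary material (e.g., silk) and finish (e.g., satin). "
        "Consider the overall combination of details to assign a style aesthetic.",
        f"Product title: {title}",
        f"Product description: {description}",
    ])
```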
Let's examine how this prompt structure plays out with a concrete example.

Case Study: The "Bardot Top"

Product Text: "Summer Essential Floral Bardot Top. This gorgeous off-the-shoulder top features a relaxed fit, short puff sleeves, and a vibrant floral print on lightweight rayon."

Ideal Attribute Extraction: [Bardot, Off-the-Shoulder, Relaxed Fit, Puff Sleeves, Short Sleeves, Floral Print, Rayon]

Analysis of Model Reasoning (via the 'reasoning' field):

  • A strong model would correctly note that "Bardot" and "Off-the-Shoulder" are synonymous in this context and list both, or choose the most precise one. It would identify "puff sleeves" as a specific type of "short sleeves." It would correctly classify "rayon" as the material and "floral" as the print pattern.
  • A weaker model might misinterpret "Bardot," miss the "puff sleeve" specificity, or fail to infer "relaxed fit" from the description.
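For reference, a well-formed response to this case study, following the JSON structure mandated in our prompt, might look like the object below. It is hand-written for illustration, not an actual model output.

```python
# Illustrative example of the mandated JSON output for the "Bardot Top" case study.
ideal_response = {
    "product_title": "Summer Essential Floral Bardot Top",
    "inferred_attributes": [
        "Bardot", "Off-the-Shoulder", "Relaxed Fit",
        "Puff Sleeves", "Short Sleeves", "Floral Print", "Rayon",
    ],
    "reasoning": {
        "Bardot / Off-the-Shoulder": "Stated directly; the two terms are synonymous here.",
        "Puff Sleeves / Short Sleeves": "'Short puff sleeves' names a specific short-sleeve type.",
        "Relaxed Fit": "Explicitly described as 'a relaxed fit'.",
        "Floral Print": "The text mentions 'a vibrant floral print'.",
        "Rayon": "The material is stated as 'lightweight rayon'.",
    },
}
```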

The effectiveness of this prompting strategy has broader implications for AI-driven content creation and optimization. Just as we are prompting the model to be a merchandiser, SEOs will increasingly need to prompt AI to act as expert writers and strategists. The principles of clear role-setting, structured output, and guided reasoning are universally applicable. This is a foundational skill for the future of SGE and AEO (Answer Engine Optimization), where understanding how to communicate with AI will be as important as understanding keywords was a decade ago.

Furthermore, the JSON output format is not just for our convenience. It mirrors the need for structured data on the modern web. The attributes extracted by the AI could be directly used to populate Schema.org markup (like `Product` schema with additional properties), enhancing a page's richness and its potential to appear in enhanced search results. This creates a direct pipeline from AI-powered content understanding to technical SEO implementation.
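As a rough illustration of that pipeline, the sketch below maps an extracted attribute set onto Schema.org `Product` markup, using `material` and `pattern` where dedicated properties exist and `additionalProperty` for the rest. The helper name and property choices are assumptions for demonstration, not a prescribed mapping.

```python
import json

def to_product_jsonld(title: str, attributes: dict) -> str:
    """Map extracted attributes into Schema.org Product JSON-LD.

    'material' and 'pattern' are standard Product properties; attributes without
    a dedicated property (neckline, sleeve type, style) go into additionalProperty.
    """
    extra = [
        {"@type": "PropertyValue", "name": key, "value": value}
        for key, value in attributes.get("other", {}).items()
    ]
    product = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": title,
        "material": attributes.get("material"),
        "pattern": attributes.get("pattern"),
        "additionalProperty": extra,
    }
    return json.dumps(product, indent=2)

print(to_product_jsonld(
    "Summer Essential Floral Bardot Top",
    {
        "material": "Rayon",
        "pattern": "Floral Print",
        "other": {"neckline": "Bardot", "sleeveType": "Puff Sleeves", "styleAesthetic": "Bohemian"},
    },
))
```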

Methodology Deep Dive: Building a Robust Framework for Evaluation

To move beyond anecdotal evidence and generate statistically significant insights, we constructed a rigorous methodological framework. This ensured our comparison between GPT-4o Mini and Gemini 2.0 Flash was fair, reproducible, and comprehensive. Our approach balanced quantitative metrics with qualitative, human-in-the-loop analysis to capture both the precision and the nuance of the models' performance.

1. The Test Dataset Curation

We assembled a curated dataset of 250 diverse fashion product listings sourced from a mix of fast-fashion retailers, luxury brands, and independent designers. This diversity was intentional, designed to test the models on a wide spectrum of writing styles, terminology formality, and product complexity.

  • Product Categories: Dresses, Tops, Bottoms (pants, skirts), Outerwear.
  • Description Style: Ranged from sparse, keyword-stuffed titles (e.g., "Womens Black Bodycon Midi Dress") to elaborate, evocative prose (e.g., "Channel effortless Riviera charm in this drapey viscose-blend top, featuring a sensual cowl neckline and artfully rolled sleeves.").
  • Ground Truth Establishment: Each product in the dataset was manually tagged by two human fashion experts. The final "ground truth" attribute list for each product was the reconciled set of tags from both experts, resolving any disagreements through discussion. This human-generated benchmark served as our gold standard for evaluation.

2. The Experimental Run

For each of the 250 products, we executed the following process:

  1. We sent the product title and description to both the GPT-4o Mini and Gemini 2.0 Flash APIs using the identical, carefully crafted prompt described in the previous section.
  2. We recorded the raw JSON output from each model, including the extracted attribute list and the reasoning text.
  3. We stored all results in a structured database for analysis, linking model outputs to the human-generated ground truth.
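In code, one iteration of this loop might look like the sketch below. It assumes the official `openai` and `google-generativeai` Python SDKs, reuses the hypothetical `build_prompt` helper sketched earlier, and uses "gemini-2.0-flash" as an assumed model identifier; exact client setup and response parsing may differ in practice.

```python
import json
import google.generativeai as genai
from openai import OpenAI

openai_client = OpenAI()                        # reads OPENAI_API_KEY from the environment
genai.configure(api_key="YOUR_GOOGLE_API_KEY")  # placeholder key
gemini = genai.GenerativeModel("gemini-2.0-flash")  # model id is an assumption

def tag_with_both(title: str, description: str, taxonomy: dict) -> dict:
    """Send the identical prompt to both models and return the parsed JSON outputs."""
    prompt = build_prompt(title, description, taxonomy)  # hypothetical helper from earlier

    gpt_raw = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # request strict JSON
    ).choices[0].message.content

    gemini_raw = gemini.generate_content(prompt).text

    return {
        "gpt_4o_mini": json.loads(gpt_raw),
        "gemini_2_0_flash": json.loads(gemini_raw),  # may need code-fence stripping in practice
    }
```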

3. The Evaluation Metrics

We employed a suite of standard information retrieval metrics to quantify performance, but interpreted them through the lens of our specific use case.

  • Precision: The percentage of attributes identified by the model that were correct. (True Positives / (True Positives + False Positives)). High precision is critical for business applications—incorrect tags can be more harmful than missing tags, as they actively mislead search and filtering systems.
  • Recall: The percentage of correct attributes (from the ground truth) that the model successfully identified. (True Positives / (True Positives + False Negatives)). High recall means the model is thorough and leaves few relevant attributes on the table.
  • F1-Score: The harmonic mean of Precision and Recall. This single score provides a balanced view of overall accuracy, penalizing models that excel in one metric at the expense of the other.
  • Accuracy (Fine-Grained): We also calculated a strict, per-attribute accuracy, treating each attribute prediction as a binary classification task (correct/incorrect) across the entire dataset.
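Concretely, these per-product scores can be computed from the predicted and ground-truth attribute sets as in the sketch below. The model output in the example is hypothetical; the ground truth reuses the Bardot top case study from earlier.

```python
def evaluate(predicted: set[str], ground_truth: set[str]) -> dict:
    """Precision, recall, and F1 for one product's predicted attribute set."""
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical model output vs. the Bardot top ground truth:
model_tags = {"Bardot", "Relaxed Fit", "Short Sleeves", "Floral Print", "Linen"}
truth_tags = {"Bardot", "Off-the-Shoulder", "Relaxed Fit", "Puff Sleeves",
              "Short Sleeves", "Floral Print", "Rayon"}
print(evaluate(model_tags, truth_tags))
# precision 0.8 (4/5), recall ~0.57 (4/7), F1 ~0.67
```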

4. Qualitative Analysis: The "Why" Behind the Numbers

The quantitative scores tell only part of the story. To truly understand model performance, we conducted a deep qualitative analysis of the "reasoning" field and the types of errors made. We categorized errors into several buckets:

  • Over-Inference: The model invents an attribute not supported by the text (e.g., inferring "linen" from "breathable," when the material was actually cotton).
  • Under-Inference: The model misses a strongly implied attribute (e.g., not tagging "A-Line" for a dress described as "fitted at the bodice and flaring out at the hem").
  • Term Confusion: The model misapplies a technical term (e.g., confusing "chiffon" with "georgette").
  • Contextual Failure: The model fails to understand the overall style from a combination of details.

This mixed-methods approach—combining hard metrics with nuanced error analysis—provides a holistic picture of model capability. It allows us to answer not just "which model is better," but "in what ways is each model strong or weak, and why?" This level of insight is essential for businesses considering integrating this technology, as it informs not just the choice of model, but the necessary post-processing, human oversight, and potential areas for fine-tuning. This rigorous process mirrors the kind of data-driven analysis required for modern digital strategy, where intuition must be backed by empirical evidence.

Quantitative Results: A Tale of Precision, Recall, and Nuanced Performance

The data, aggregated across 250 diverse product listings, reveals a fascinating and nuanced competition between our two AI contenders. While one model emerged with a slight overall advantage, the story is far from one-sided. The aggregate scores tell a high-level story, but the true insights lie in the breakdowns across attribute categories and error types.

Overall Performance Metrics

Across the entire dataset, the models performed as follows:

  • GPT-4o Mini
    • Precision: 78.4%
    • Recall: 72.1%
    • F1-Score: 75.1%
  • Gemini 2.0 Flash
    • Precision: 75.2%
    • Recall: 76.8%
    • F1-Score: 76.0%

At first glance, the models are remarkably close, with Gemini 2.0 Flash holding a narrow lead in the balanced F1-Score, driven by its superior recall. This suggests that Gemini is slightly better at casting a wide net and capturing a higher proportion of the correct attributes present in the text. GPT-4o Mini, on the other hand, demonstrates higher precision, meaning that when it does assign an attribute, it is more likely to be correct. This precision-focused performance can be highly valuable in scenarios where the cost of a false positive (a wrong tag) is high, such as in high-stakes filtering or automated catalog enrichment for luxury goods.

Performance by Attribute Category

Drilling down into specific attribute types reveals the distinct strengths and weaknesses of each model, painting a clearer picture of their underlying "reasoning" styles.

Silhouette and Fit

  • GPT-4o Mini (F1: 81%): Excelled at inferring fit from descriptive language. It reliably connected phrases like "easy, comfortable cut" to "Relaxed Fit" and "hugs your curves" to "Bodycon." Its generalist training seemed to help it understand these common descriptive metaphors.
  • Gemini 2.0 Flash (F1: 79%): Performed well but was more literal. It required more explicit cues (like the word "tailored") to assign corresponding attributes and sometimes missed more subtle implications of fit.

Neckline and Sleeves

  • GPT-4o Mini (F1: 72%): Showed a strong grasp of synonymous terms, correctly equating "Bardot" with "Off-the-Shoulder" in most contexts. However, it occasionally over-generalized, sometimes mislabeling a "cold-shoulder" top as "off-the-shoulder."
  • Gemini 2.0 Flash (F1: 77%): Demonstrated superior performance here, likely leveraging its potential connection to structured knowledge. It was more precise in distinguishing between similar sleeve types like "bell sleeve" vs. "flounce sleeve" and was less prone to conflation errors.

Fabric and Material

  • GPT-4o Mini (F1: 71%): Struggled with the technical distinction between fibers and weaves. It frequently labeled a "silk satin" dress as having attributes for both "silk" and "satin," which, while not entirely incorrect, demonstrated a lack of understanding that satin is a weave that can be applied to silk. It was, however, good at identifying common materials like "cotton," "denim," and "polyester."
  • Gemini 2.0 Flash (F1: 74%): Slightly outperformed GPT-4o Mini in this technically demanding category. Its outputs suggested a better grasp of the material hierarchy, more consistently identifying the primary fiber (e.g., "linen") over the finish or blend. It was also more accurate with technical fabrics like "gabardine" and "twill."

Design Details and Embellishments

  • GPT-4o Mini (F1: 70%): This was its weakest category. It often failed to infer specific techniques from descriptions. For example, "gathered detail at the side seam" was frequently missed as "ruching," and it rarely picked up on more obscure terms like "guipure" or "appliqué" unless they were explicitly stated.
  • Gemini 2.0 Flash (F1: 73%): Showed a similar pattern but was marginally better, particularly with embroidery types. It successfully linked "Western-style threadwork" to "Western Embroidery" in several instances, indicating a stronger entity-linking capability.

Style and Aesthetic

  • GPT-4o Mini (F1: 82%): This was GPT-4o Mini's standout category. Its broad, contextual understanding allowed it to brilliantly synthesize multiple attributes into an overall style. A product described with "crochet, flared sleeves, and floral print" was correctly tagged as "Bohemian" with high consistency. It demonstrated an almost human-like ability to grasp the "vibe" from a collection of details.
  • Gemini 2.0 Flash (F1: 75%): While competent, it was more conservative and literal in assigning aesthetic labels. It often listed the constituent details but failed to take the final step of assigning the overarching style, unless the description used the style name directly (e.g., "a minimalist blazer").

The quantitative data confirms a fundamental trade-off. GPT-4o Mini acts like a creative, context-aware stylist, brilliant at seeing the big picture but sometimes fuzzy on the technical details. Gemini 2.0 Flash performs like a meticulous technical designer, precise on specifications but less intuitive on holistic style. The choice between them depends entirely on which capability is more valuable for the specific business use case.

This level of performance analysis is critical for anyone looking to implement AI in a complex field. It's not enough to know that an AI is "good"; you must know *how* it is good. This is akin to the deep analysis required for a successful competitor gap analysis in SEO, where understanding the nuances of a rival's strategy is more valuable than a simple backlink count.

Qualitative Analysis: Decoding the AI's "Thought Process" and Common Failure Modes

Beyond the numbers, the "reasoning" field from our prompts provided a rare window into the cognitive processes of these models. Analyzing these explanations allowed us to categorize not just *what* errors occurred, but *why*. This qualitative deep dive is essential for understanding the practical limitations and potential mitigation strategies for deploying these systems.

GPT-4o Mini: The Confident Contextualist

GPT-4o Mini's reasoning was often fluid, narrative, and heavily reliant on associative logic. It frequently used phrases like "this implies..." or "the description suggests a...".

Example of Strength:
Product: "Vintage-Inspired Crepe Wrap Dress with a Flowing Skirt and V-Neck."
GPT-4o Mini Reasoning: "The term 'wrap' indicates a wrap-style silhouette. 'Flowing skirt' strongly implies an A-line or fit-and-flare shape, and 'V-Neck' is explicitly stated. The combination of 'vintage-inspired' and the fabric 'crepe' suggests a classic, potentially 'Retro' aesthetic."
Verdict: This is sophisticated, multi-step reasoning that correctly infers the "A-Line" silhouette and the "Retro" style aesthetic.

Example of Weakness (Over-Inference):
Product: "Lightweight and breathable pullover for summer."
GPT-4o Mini Reasoning: "The adjectives 'lightweight' and 'breathable' are commonly associated with natural fibers like linen or cotton. Therefore, I will assign 'Linen' as a material attribute."
Verdict: This is a classic over-reliance on common-sense association. The model invents a specific material not present in the text, confusing a property ("breathable") with a specific entity ("linen"). This is a precision error: a false positive that actively pollutes the tag set.

Gemini 2.0 Flash: The Literal Logician

Gemini 2.0 Flash's reasoning was typically more concise, structured, and evidence-based. It preferred phrases like "the text directly states..." or "based on the keyword...".

Example of Strength:
Product: "Blouse in Stretch Poplin with a Peter Pan Collar."
Gemini 2.0 Flash Reasoning: "The material is explicitly stated as 'Stretch Poplin.' The neckline detail is explicitly stated as 'Peter Pan Collar.' No other attributes are directly mentioned."
Verdict: This is highly precise and accurate. It correctly identifies "Poplin" as the fabric and "Peter Pan Collar" as the neckline, and refrains from inventing others. This demonstrates high precision.

Example of Weakness (Under-Inference):
Product: "This top is fitted through the bust and then flares out dramatically at the hem."
Gemini 2.0 Flash Reasoning: "The text describes a fit through the bust and a flared hem. No specific silhouette name like 'Fit-and-Flare' or 'A-Line' is used."
Verdict: This is a failure of inference. The model is being overly literal and fails to map the clear description to the well-known industry term "Fit-and-Flare." This is a recall error: a correct attribute left untagged.

Systematic Error Categorization

We observed several recurring failure modes across both models:

  1. The Synonym Problem: Both models, but especially GPT-4o Mini, sometimes treated synonymous terms as distinct attributes, leading to list bloat. For example, listing both "Bardot" and "Off-the-Shoulder" for the same garment.
  2. The Specificity Gap: Models often defaulted to a more generic attribute when a specific one was implied. "Puff sleeves" might be tagged only as "Short Sleeves," or "Jacquard knit" might be simplified to "Knit."
  3. Contextual Blindness: Particularly for Gemini, failing to combine multiple low-level details into a high-level aesthetic conclusion was a common issue. It could list "lace," "pearl buttons," and "fit-and-flare" but miss "Vintage" or "Romantic" style.
  4. Knowledge Cut-Off Hallucinations: In a few cases, particularly with very new or niche brand-specific terms, both models would confidently assign an incorrect attribute that sounded plausible, a phenomenon well-documented in LLM research.
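One practical mitigation for the synonym problem and the specificity gap is a lightweight canonicalization pass over the model's raw attribute list before it reaches the catalog. The sketch below uses a hypothetical synonym map; a production map would be curated by merchandisers.

```python
# Hypothetical canonicalization map; keys are lowercased model outputs.
CANONICAL = {
    "off-the-shoulder": "Bardot",
    "bardot": "Bardot",
    "boyfriend fit": "Relaxed Fit",
    "jacquard knit": "Jacquard Knit",  # keep the specific term rather than plain "Knit"
}

def normalize(attributes: list[str]) -> list[str]:
    """Collapse synonyms to a canonical form and drop duplicates, preserving order."""
    seen, result = set(), []
    for attr in attributes:
        canonical = CANONICAL.get(attr.strip().lower(), attr.strip())
        if canonical.lower() not in seen:
            seen.add(canonical.lower())
            result.append(canonical)
    return result

print(normalize(["Bardot", "Off-the-Shoulder", "Relaxed Fit", "relaxed fit"]))
# -> ['Bardot', 'Relaxed Fit']
```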

Understanding these failure modes is the first step toward building a robust production system. It suggests that a hybrid approach—using AI for a first pass, followed by human review focused on these specific error categories—could be highly effective. This is similar to the process of conducting a backlink audit, where automated tools flag potential issues, but human expertise is required for the final judgment on context and quality.

Strategic Implications: From Theory to E-Commerce Reality

The performance data and qualitative analysis are not merely academic; they have direct and profound implications for fashion e-commerce, digital marketing, and the broader landscape of technical SEO. Integrating these models requires a strategic vision that aligns their strengths with business objectives.

Operational Workflow Integration

How can these models be practically woven into the fabric of an e-commerce operation?

  • Automated Product Onboarding: For marketplaces or brands with high SKU turnover, these models can serve as the first line of tagging. A product data feed could be routed through an API call to GPT-4o Mini or Gemini 2.0 Flash, which would return a rich set of suggested attributes for human merchandisers to review, edit, and approve. This could cut onboarding time by 50% or more.
  • Search and Discovery Enhancement: The attributes generated can power a far more sophisticated site search and faceted navigation. Instead of just matching keywords, the search engine can be tuned to understand that a query for "breathable summer blouses" should prioritize products tagged with "lightweight fabrics" and "short sleeves," even if those exact words aren't in the title.
  • Dynamic Content Generation: The extracted attributes, especially the style aesthetics, can be used to automatically generate product recommendations, curated collection pages, and personalized marketing emails. ("Love this Bohemian dress? Explore our entire Boho collection.")

SEO and Content Strategy Revolution

The impact on search engine optimization is potentially revolutionary. We are moving from keyword-centric to entity-centric product discovery.

  • Structured Data and Rich Snippets: The AI-generated attributes can be directly mapped to Schema.org `Product` markup. Imagine a product page with structured data for `material`, `pattern`, `neckline`, `sleeveType`, and `styleAesthetic`. This creates an incredibly rich data source for search engines, dramatically increasing the likelihood of appearing in rich results and featured snippets for complex queries.
  • Long-Tail Keyword Dominance: Most e-commerce search traffic comes from long-tail queries. A model that can accurately tag a dress as "green," "midi," "wrap," "floral print," and "vintage style" enables the product page to rank for hundreds of specific combinations like "vintage style green floral wrap midi dress." This is a direct application of the principles behind long-tail keyword strategy.
  • Internal Linking Power: With a comprehensive attribute taxonomy in place, websites can dynamically create powerful internal linking structures. Every product tagged "Silk" can be automatically linked to a "Silk Dresses" landing page, distributing page authority and improving crawl efficiency, a core tenet of a modern internal linking strategy.
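To make the internal-linking point concrete, the sketch below groups tagged products into attribute landing pages that product pages can then link to automatically. The URL patterns and data shapes are illustrative assumptions, not a prescribed site structure.

```python
from collections import defaultdict

def build_collection_pages(products: list[dict]) -> dict[str, list[str]]:
    """Group product URLs under an attribute landing page, e.g. /collections/silk."""
    pages = defaultdict(list)
    for product in products:
        for attr in product["attributes"]:
            slug = attr.lower().replace(" ", "-")
            pages[f"/collections/{slug}"].append(product["url"])
    return dict(pages)

catalog = [
    {"url": "/p/green-silk-wrap-dress", "attributes": ["Silk", "Wrap", "Midi"]},
    {"url": "/p/black-silk-slip-dress", "attributes": ["Silk", "Slip", "Bias Cut"]},
]
print(build_collection_pages(catalog)["/collections/silk"])
# -> ['/p/green-silk-wrap-dress', '/p/black-silk-slip-dress']
```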

Choosing the Right Model for the Job

Our analysis provides a clear decision-making framework:

  • Choose GPT-4o Mini if: Your priority is capturing the overall style and "feel" of products, your descriptions are creatively written, and you have a higher tolerance for occasional over-inference that can be caught in a human review layer. It is the ideal "creative assistant."
  • Choose Gemini 2.0 Flash if: Your data requires high technical accuracy (e.g., for material composition compliance), your descriptions are more literal and spec-focused, and your primary goal is comprehensive recall of explicitly stated or strongly implied technical features. It is the ideal "technical data extractor."
  • The Hybrid Approach: The most powerful strategy might be an ensemble method. Use Gemini 2.0 Flash for a first pass to get a high-recall, precise set of technical attributes. Then, feed the original description *plus* Gemini's output to GPT-4o Mini with a prompt asking it to infer the high-level style and any missing nuanced details. This leverages the unique strengths of both models (a sketch of this pipeline follows the list).
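A minimal sketch of that two-pass pipeline follows. It reuses the hypothetical `gemini` and `openai_client` objects and the `build_prompt` helper from the earlier sketches, and the wording of the second-pass prompt is illustrative.

```python
import json

def hybrid_tag(title: str, description: str, taxonomy: dict) -> dict:
    """Two-pass tagging: Gemini for technical attributes, then GPT-4o Mini for style."""
    # Pass 1: high-recall technical extraction with Gemini 2.0 Flash.
    first_pass = json.loads(
        gemini.generate_content(build_prompt(title, description, taxonomy)).text
    )

    # Pass 2: give GPT-4o Mini the description plus the first-pass attributes and
    # ask only for the overall style aesthetic and any missing nuanced details.
    followup = (
        "You are an expert fashion merchandiser. Given this product description and the "
        "technical attributes already extracted, infer the overall style aesthetic "
        "(e.g., Bohemian, Minimalist) and any nuanced attributes still missing. "
        "Return JSON with keys 'style_aesthetic' and 'additional_attributes'.\n"
        f"Description: {title}. {description}\n"
        f"Extracted attributes: {first_pass.get('inferred_attributes', [])}"
    )
    second_pass = json.loads(
        openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": followup}],
            response_format={"type": "json_object"},
        ).choices[0].message.content
    )
    return {"technical": first_pass, "style": second_pass}
```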

The integration of operational AI like GPT-4o Mini and Gemini 2.0 Flash is not about replacing human expertise, but about augmenting it. It frees merchandisers from the tedium of repetitive tagging and allows them to focus on higher-level tasks like curation, strategy, and creative direction, while the AI handles the scalable, data-intensive groundwork.

Limitations, Ethical Considerations, and The Road Ahead

While the potential is immense, a responsible and forward-looking implementation must acknowledge the current limitations and ethical dimensions of this technology. Ignoring these factors can lead to operational failures, brand damage, and the reinforcement of harmful biases.

Inherent Limitations of Zero-Shot Learning

Our study deliberately tested the models "out-of-the-box." This has clear ceilings:

  • The Knowledge Ceiling: Both models are limited by their training data cut-offs. They will be unaware of new brands, nascent trends, or ultra-niche terminology that emerges after their last update.
  • Lack of Domain Fine-Tuning: A model fine-tuned on a dataset of millions of fashion product descriptions would almost certainly outperform these general-purpose models. The zero-shot approach is a test of baseline capability, not ultimate potential.
  • Dependence on Input Text Quality: The classic "garbage in, garbage out" principle applies. If a product description is poorly written, vague, or keyword-stuffed, the model's ability to extract accurate attributes plummets. The AI is only as good as the content it analyzes.

Ethical Considerations and Bias

The fashion industry has a long history of issues with representation and bias, and AI models trained on internet-scale data can inadvertently perpetuate these problems.

  • Style and Cultural Bias: A model might associate certain aesthetics (e.g., "Bohemian") primarily with a specific cultural context, or it might fail to recognize the nuances of traditional garments from non-Western cultures, mislabeling them or failing to assign appropriate respectful attributes.
  • Body Inclusivity: The models in this study were not explicitly prompted to infer fit for different body types. In a real-world application, care must be taken to ensure that attribute tagging is inclusive and does not reinforce narrow beauty standards. For example, a "bodycon" dress should not be implicitly tagged as only for a specific body shape.
  • Environmental and Ethical Claims: Models may struggle to verify claims like "sustainable," "eco-friendly," or "ethically made." Automatically assigning these attributes based on description alone could lead to "greenwashing" if not carefully validated. This requires a robust, human-supervised process.

The Future Evolution: Multimodality and Fine-Tuning

The next logical step in this evolution is both obvious and transformative: moving beyond text to true multimodality.

  • The Power of Vision + Language: The flagship versions of both GPT-4o and Gemini are natively multimodal. This means they can understand images *and* text simultaneously. The next phase of research will involve feeding these models a product's image alongside its description. The model could then cross-verify its textual inferences against visual evidence. Does the description say "silk" but the image shows a fabric with a clear matte texture indicative of cotton? The model could flag the discrepancy. It could also identify visual attributes completely missing from the text, such as color patterns, specific types of pleats, or the presence of pockets. A sketch of such a cross-verification call appears after this list.
  • Domain-Specific Fine-Tuning: The true power of these models will be unlocked through fine-tuning. By training GPT-4o Mini or Gemini 2.0 Flash on a proprietary dataset of a brand's own expertly tagged products, the model can learn the company's specific jargon, attribute priorities, and quality standards. This would push performance from the 75% F1-score range well into the 90s, creating a truly proprietary competitive advantage. This is similar to how businesses use AI for pattern recognition in backlink analysis, training models on what a "good" vs. "toxic" link looks like for their specific niche.
  • Towards a Fully Autonomous Merchandising Agent: The endgame is an AI system that doesn't just suggest tags, but actively manages product data health. It could identify missing attributes in a catalog, suggest improvements to product descriptions for better searchability, and even dynamically A/B test attribute sets to see which combinations drive the most conversions.
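As promised above, here is a minimal sketch of a cross-verification call, passing a product image URL alongside the description using the OpenAI Chat Completions image-input format. The prompt wording and URL are illustrative; Gemini's multimodal API could be used analogously.

```python
from openai import OpenAI

client = OpenAI()

def cross_verify(description: str, image_url: str) -> str:
    """Ask a multimodal model to check text-derived attributes against the product photo."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # accepts image inputs via the Chat Completions API
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Compare this product description with the photo. Flag any attribute "
                    "(fabric, neckline, sleeve type, print) that the image contradicts, and "
                    f"list visual attributes missing from the text.\nDescription: {description}"
                )},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content

# Hypothetical example:
# print(cross_verify("Silk slip dress with a cowl neck", "https://example.com/images/slip-dress.jpg"))
```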

The road ahead is one of collaboration between human and machine intelligence, each playing to their strengths. The AI handles scale, speed, and data pattern recognition; the human provides strategic oversight, creative judgment, and ethical guidance.

Conclusion: The AI Stylist Has Arrived—And It's Ready for Work

This zero-shot analysis delivers a clear and compelling verdict: GPT-4o Mini and Gemini 2.0 Flash are not merely capable of predicting fine-grained fashion attributes; they are proficient at it. Achieving F1-scores in the mid-70% range on such a complex, nuanced task without any specialized training is a remarkable feat that signals a new era of operational AI in e-commerce. The "AI Stylist" is no longer a futuristic concept; it is a practical, cost-effective tool that can be deployed today to drive tangible business value.

The key takeaway is not that one model is definitively superior, but that they possess complementary strengths. GPT-4o Mini operates with the contextual fluency of a creative director, excelling at capturing the emotional and stylistic essence of a garment. Gemini 2.0 Flash performs with the meticulous accuracy of a technical designer, demonstrating superior recall and precision on concrete, factual attributes. The choice between them is strategic, not absolute, and the most powerful implementations may well leverage both in a synergistic pipeline.

The implications ripple far beyond mere tag generation. This technology stands to revolutionize e-commerce by creating a foundation of rich, structured product data that fuels superior search experiences, hyper-personalized recommendations, and a fundamentally stronger E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) footprint in the eyes of search engines. It represents a critical step in the journey from keyword-based to entity-based understanding of digital commerce.

However, this power comes with responsibility. The limitations of zero-shot learning, the potential for inherited bias, and the ethical considerations around representation and claims verification mandate a measured, human-in-the-loop approach. AI should be viewed as the most powerful assistant a merchandising team has ever had, not as a replacement for human expertise and judgment.

Call to Action: Prepare Your Business for the AI-Driven Future of Commerce

The transition to AI-augmented e-commerce is not a distant future event; it is underway. The brands and platforms that begin experimenting and integrating these technologies now will build a significant and lasting competitive advantage. Here is how you can start:

  1. Audit Your Product Data: Conduct a thorough analysis of your current product attribute taxonomy and the quality of your existing tags. Identify the gaps and inconsistencies that an AI could help fill. This is the foundational step, much like conducting a backlink audit is for a link-building campaign.
  2. Run a Pilot Project: Select a sample of several hundred product listings from your catalog. Use the APIs for both GPT-4o Mini and Gemini 2.0 Flash (or a similar cost-effective model) with a well-crafted prompt to generate attribute suggestions. Manually evaluate the output against your own gold standard, using the framework from this analysis to understand which model's "reasoning style" best fits your data.
  3. Develop an Integration and Oversight Plan: Design a workflow that incorporates the AI's output into your existing product onboarding or data enrichment process. Crucially, define the role of human experts in reviewing, correcting, and approving the AI's work, with a specific focus on the error modes we've identified (over-inference, style attribution, bias).
  4. Think Multimodally: If you have a rich library of product images, start planning for the next wave. Explore the capabilities of multimodal models to combine textual and visual analysis for even greater accuracy and insight.