Scaling E-Commerce: How AI-Driven Pipelines Maintain Consistent Product Attributes
In e-commerce, major technical challenges such as distributed search queries, real-time inventory management, and recommendation systems are often discussed. But behind the scenes lies a stubborn, systemic problem that concerns merchants worldwide: the management and normalization of product attribute values. These values form the foundation of product discovery. They directly influence filters, comparison functions, search rankings, and recommendation logic. Yet in real catalogs, such values are rarely consistent: duplicates, formatting errors, and semantic ambiguities are common.
A simple example illustrates the extent: for a size attribute, you might find “XL”, “Small”, “12cm”, “Large”, “M”, and “S” side by side. For colors, values like “RAL 3020”, “Crimson”, “Red”, and “Dark Red” are mixed together, blending formal standards such as RAL 3020 with free-form descriptions. Multiply these inconsistencies across several million SKUs and the depth of the problem becomes clear: filters become unreliable, search engines lose precision, manual data cleaning becomes a Sisyphean task, and customers experience frustrating product discovery.
The core strategy: Intelligence with guardrails
A pure black-box AI solution was out of the question. Such systems are difficult to interpret, debug, and control at millions of SKUs. Instead, the goal was a predictable, explainable, and human-controlled pipeline—AI that acts intelligently without losing oversight.
The answer lay in a hybrid architecture that combines contextual LLM intelligence with deterministic rules and merchant controls. The system had to meet three criteria:
Transparency in decision-making
Predictability in process flows
Human intervention options for critical data
Offline processing instead of real-time pipelines
A key architectural step was choosing offline background jobs over real-time pipelines. This may initially seem like a step backward, but it is strategically sound:
Real-time systems lead to unpredictable latencies, fragile dependencies, costly peaks in computation, and higher operational vulnerability. Offline jobs, on the other hand, offer:
Throughput efficiency: Massive data volumes are processed without burdening live systems
Robustness: Processing errors never impact customer traffic
Cost optimization: Calculations can be scheduled during low-traffic times
Isolation: LLM latency does not affect product page performance
Predictability: Updates are atomic and reproducible
With millions of product entries, this decoupling of customer-facing and data processing systems is indispensable.
Data cleaning as a foundation
Before deploying AI, an essential preprocessing step was performed to eliminate noise. The model received only clean, clear inputs:
Whitespace normalization (leading and trailing spaces)
Removal of empty values
Deduplication of values
Simplification of category context (convert breadcrumbs into structured strings)
This seemingly simple step significantly improved the accuracy of the language model. The principle is universal: at this data volume, even small input errors cascade into larger problems downstream.
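As a rough illustration, the cleaning stage might look like the sketch below. The function name, the case-insensitive deduplication, and the " > " breadcrumb separator are assumptions; the article does not specify the exact rules.

```python
import re

def clean_attribute_values(raw_values, breadcrumbs):
    """Normalize raw attribute values before they reach the LLM."""
    cleaned = []
    seen = set()
    for value in raw_values:
        v = re.sub(r"\s+", " ", value.strip())  # whitespace normalization
        if not v:                               # drop empty values
            continue
        if v.lower() in seen:                   # deduplicate values
            continue
        seen.add(v.lower())
        cleaned.append(v)
    # Flatten the breadcrumb trail into one structured context string,
    # e.g. ["Tools", "Power Tools", "Drills"] -> "Tools > Power Tools > Drills"
    category_context = " > ".join(b.strip() for b in breadcrumbs)
    return cleaned, category_context
```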
Contextual LLM processing
The language model did not perform mechanical sorting. Given the attribute name, its cleaned values, and the category context, it could apply semantic reasoning, understanding:
That “Voltage” in power tools should be sorted numerically
That “Size” in clothing follows an established progression (S, M, L, XL)
That “Color” in certain categories respects standards like RAL 3020
That “Material” exhibits semantic hierarchies
The model returned:
an ordered list of values
refined attribute descriptions
a classification of the attribute as deterministically or contextually sortable
This enabled the pipeline to handle different attribute types flexibly, without hardcoding rules for each category.
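A minimal sketch of this contract might look as follows. The JSON schema, the prompt wording, and the injected call_llm function are assumptions standing in for whatever model client the team actually used:

```python
import json
from dataclasses import dataclass

@dataclass
class SortResult:
    ordered_values: list      # values in the order the model proposes
    refined_description: str  # cleaned-up attribute description
    sort_type: str            # "deterministic" or "contextual"

PROMPT_TEMPLATE = """You are ordering e-commerce attribute values.
Category: {category}
Attribute: {attribute}
Values: {values}
Respond with JSON containing: ordered_values, refined_description,
and sort_type ("deterministic" or "contextual")."""

def sort_with_llm(attribute, values, category_context, call_llm):
    # call_llm: any function taking a prompt string and returning the
    # model's raw text response; the concrete client is not specified.
    prompt = PROMPT_TEMPLATE.format(
        category=category_context,
        attribute=attribute,
        values=", ".join(values),
    )
    payload = json.loads(call_llm(prompt))
    return SortResult(
        ordered_values=payload["ordered_values"],
        refined_description=payload["refined_description"],
        sort_type=payload["sort_type"],
    )
```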
Deterministic fallback logic
Not every attribute required the language model. For numeric ranges, unit-based sizes, and simple quantities, a deterministic path offers:
faster processing
guaranteed predictability
lower costs
elimination of ambiguity
The pipeline automatically recognized such cases and applied deterministic sorting logic. The system remained efficient and avoided unnecessary LLM calls.
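The dispatch between the two paths can be sketched as a simple parse-or-bail check; the regex below is a deliberately simplified assumption (a real catalog would need a richer unit table):

```python
import re

# Matches values like "12cm", "5 kg", or "230V": a number plus optional unit.
NUMERIC_PATTERN = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*([a-zA-Z%]*)\s*$")

def try_deterministic_sort(values):
    """Return values sorted numerically if every value parses as
    number + unit; otherwise return None to signal the LLM path."""
    parsed = []
    for value in values:
        match = NUMERIC_PATTERN.match(value)
        if not match:
            return None                 # mixed content -> contextual path
        number = float(match.group(1).replace(",", "."))
        parsed.append((number, value))
    return [v for _, v in sorted(parsed)]

# try_deterministic_sort(["5cm", "12cm", "2cm", "20cm"])
# -> ["2cm", "5cm", "12cm", "20cm"], with no LLM call
```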
Human control via tagging systems
For business-critical attributes, merchants needed final decision authority. Each category could therefore carry one of two tags:
LLM_SORT: language model decides the order
MANUAL_SORT: merchants explicitly define the sequence
This split proved effective on both fronts: AI handled routine tasks while humans retained control. It built trust and allowed merchants to override model decisions when needed, without disrupting the processing pipeline.
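In code, the override can be as small as a guard clause; the tag names come from the article, while the function shape is an assumption:

```python
LLM_SORT = "LLM_SORT"        # language model decides the order
MANUAL_SORT = "MANUAL_SORT"  # merchants explicitly define the sequence

def resolve_order(category_tag, llm_order, manual_order):
    """Merchants always win: a MANUAL_SORT tag overrides whatever the
    model proposed, without touching the rest of the pipeline."""
    if category_tag == MANUAL_SORT and manual_order:
        return manual_order
    return llm_order
```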
Persistence in a centralized database
All results were persisted directly in MongoDB, which kept the architecture simple and maintainable. MongoDB served as the operational store for:
ordered attribute values
refined attribute names
category-specific sort tags
product-related sort field metadata
This enabled easy review, targeted overwriting, reprocessing of categories, and seamless synchronization with external systems.
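A persistence sketch using pymongo might look like this; the connection string, collection, and field names are illustrative, since the article only lists what MongoDB stores:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed connection
collection = client["catalog"]["attribute_orders"]  # assumed names

def persist_sort_result(category_id, attribute, result, tag):
    collection.update_one(
        {"category_id": category_id, "attribute": attribute},
        {"$set": {
            "ordered_values": result.ordered_values,
            "refined_name": result.refined_description,
            "sort_tag": tag,                 # LLM_SORT or MANUAL_SORT
            "sort_type": result.sort_type,   # deterministic / contextual
        }},
        upsert=True,  # reprocessing a category simply overwrites the doc
    )
```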
Integration with search infrastructure
After normalization, values flowed into two search systems:
Elasticsearch: for keyword-driven filtering and faceted search
Vespa: for semantic, vector-based product matching
This duality ensured:
filters appear in logical, expected order
product pages display consistent attributes
search engines rank products more precisely
customer experience is more intuitive
The search layer is where attribute consistency is most visible and commercially valuable.
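One plausible way to make both engines honor the same order is to index an explicit rank per attribute value; the article does not describe the actual index schema, so the sketch below is an assumption:

```python
def enrich_for_search(product_doc, ordered_values_by_attr):
    """Attach a sort rank to each attribute value so Elasticsearch and
    Vespa can order facets consistently from the same source of truth."""
    ranks = {}
    for attr, value in product_doc.get("attributes", {}).items():
        order = ordered_values_by_attr.get(attr, [])
        # unknown values sort last rather than breaking the facet
        ranks[attr] = order.index(value) if value in order else len(order)
    product_doc["attribute_ranks"] = ranks
    return product_doc
```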
Practical results of the transformation
The pipeline transformed chaotic raw values into structured outputs:
| Attribute | Raw Values | Normalized Output |
| --- | --- | --- |
| Size | XL, Small, 12cm, Large, M, S | Small, M, Large, XL, 12cm |
| Color | RAL 3020, Crimson, Red, Dark Red | Red, Dark Red, Crimson, RAL 3020 |
| Material | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel |
| Numeric | 5cm, 12cm, 2cm, 20cm | 2cm, 5cm, 12cm, 20cm |
Especially for color attributes, the importance of contextualization became clear: the system recognized that RAL 3020 is a color standard and placed it meaningfully among semantically similar values.
Architecture overview of the entire system
The modular pipeline orchestrated the following steps:
Extract product data from the Product Information Management (PIM) system
Isolate attribute values and category context via the attribute extraction job
Pass cleaned data to the AI sorting service
Write updated product documents to MongoDB
Sync updates back to the source PIM system via the outbound sync job
Synchronize sorted data into the Elasticsearch and Vespa indexes via their respective sync jobs
Connect the search systems to client applications through API layers
This workflow ensured that every normalized attribute value—whether sorted by AI or manually set—was consistently reflected in search, merchandising, and customer experience.
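Composed from the sketches above, a single batch run per category could be orchestrated roughly like this; the jobs bundle of stage functions is an assumption standing in for the real job implementations:

```python
def run_attribute_pipeline(category_id, jobs):
    raw = jobs.extract_from_pim(category_id)                 # step 1
    for attribute, values, breadcrumbs in jobs.isolate_attributes(raw):
        cleaned, context = clean_attribute_values(values, breadcrumbs)
        order = try_deterministic_sort(cleaned)              # cheap path first
        if order is None:                                    # contextual case
            order = sort_with_llm(attribute, cleaned, context,
                                  jobs.call_llm).ordered_values
        jobs.write_to_mongo(category_id, attribute, order)   # persistence
    jobs.sync_to_pim(category_id)                            # outbound sync
    jobs.sync_search_indexes(category_id)                    # ES + Vespa
```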
Why offline processing was the right choice
Real-time pipelines would have introduced latency unpredictability, higher compute costs, and fragile dependency networks. Offline jobs instead enabled:
Efficient batch processing
Asynchronous LLM calls without real-time pressure
Robust retry mechanisms and error queues (sketched after this section)
Time windows for human validation
Predictable, controllable compute costs
The trade-off was a slight delay between data ingestion and display, but the benefit of reliability at scale was well worth it.
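The retry behavior referenced in the list above is trivial to implement offline, where no customer is waiting on the response. A minimal sketch, with assumed job and queue shapes:

```python
import time

def with_retries(job, batch, max_attempts=3, error_queue=None):
    """Retry each item with exponential backoff; park persistent
    failures in an error queue for human review instead of failing
    the whole batch run."""
    for item in batch:
        for attempt in range(1, max_attempts + 1):
            try:
                job(item)
                break
            except Exception:
                if attempt == max_attempts:
                    if error_queue is not None:
                        error_queue.append(item)
                else:
                    time.sleep(2 ** attempt)  # backoff before retrying
```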
Business and technical impact
The solution achieved measurable results:
Consistent attribute sorting across 3+ million SKUs
Predictable sorting of numeric values via deterministic fallbacks
Decentralized merchant control through manual tagging
Cleaner product pages and more intuitive filters
Improved search relevance and ranking accuracy
Increased customer trust and conversion rate
This was not just a technical project; it was an immediately measurable lever for user experience and revenue growth.
Key takeaways for product scale
Hybrid systems outperform pure AI at scale. Guardrails and control mechanisms are essential.
Context is the multiplier for LLM accuracy. Clean, category-relevant inputs lead to reliable outputs.
Offline processing is not a compromise but an architectural necessity for throughput and resilience.
Human override options build trust. Systems controllable by humans are adopted faster.
Input data quality determines output reliability. Cleaning is not overhead but the foundation.
Final reflection
Normalizing attribute values may seem like a simple problem—until you have to solve it for millions of product variants. By combining language model intelligence with deterministic rules and merchant controls, a hidden, stubborn problem was transformed into an elegant, maintainable system.
It reminds us: some of the most valuable technical wins do not come from shiny innovations but from systematically solving unseen problems—those that operate daily on every product page but rarely receive attention.