Scaling E-Commerce: How AI-Driven Pipelines Maintain Consistent Product Attributes
In e-commerce, major technical challenges such as distributed search queries, real-time inventory management, and recommendation systems are often discussed. But behind the scenes lies a stubborn, systemic problem that concerns merchants worldwide: the management and normalization of product attribute values. These values form the foundation of product discovery. They directly influence filters, comparison functions, search rankings, and recommendation logic. Yet in real catalogs, such values are rarely consistent: duplicates, formatting errors, and semantic ambiguities are common.
A simple example illustrates the extent: for a size attribute, you might find “XL”, “Small”, “12cm”, “Large”, “M”, and “S” side by side. For colors, values like “RAL 3020”, “Crimson”, “Red”, and “Dark Red” are mixed together, blending formal standards such as RAL 3020 with free-form descriptions. Multiply these inconsistencies across several million SKUs and the depth of the problem becomes clear: filters become unreliable, search engines lose precision, manual data cleaning becomes a Sisyphean task, and customers experience frustrating product discovery.
The core strategy: Intelligence with guardrails
A pure black-box AI solution was out of the question. Such systems are difficult to interpret, debug, and control at millions of SKUs. Instead, the goal was a predictable, explainable, and human-controlled pipeline—AI that acts intelligently without losing oversight.
The answer lay in a hybrid architecture that combines contextual LLM intelligence with deterministic rules and merchant controls. The system had to meet three criteria:
Transparency in decision-making
Predictability in process flows
Human intervention options for critical data
Offline processing instead of real-time pipelines
A key architectural step was choosing offline background jobs over real-time pipelines. This may initially seem like a step backward, but it is strategically sound:
Real-time systems lead to unpredictable latencies, fragile dependencies, costly peaks in computation, and higher operational vulnerability. Offline jobs, on the other hand, offer:
Throughput efficiency: Massive data volumes are processed without burdening live systems
Robustness: Processing errors never impact customer traffic
Cost optimization: Calculations can be scheduled during low-traffic times
Isolation: LLM latency does not affect product page performance
Predictability: Updates are atomic and reproducible
With millions of product entries, this decoupling of customer-facing and data processing systems is indispensable.
Data cleaning as a foundation
Before deploying AI, an essential preprocessing step was performed to eliminate noise. The model received only clean, clear inputs:
Whitespace normalization (leading and trailing spaces)
Removal of empty values
Deduplication of values
Simplification of category context (convert breadcrumbs into structured strings)
This seemingly simple step significantly improved the accuracy of the language model. The principle is universal: at this data volume, even small input errors cascade into larger problems downstream.
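As a rough illustration, the cleaning stage might look like the sketch below. The function name, the case-insensitive deduplication, and the " > " breadcrumb separator are assumptions; the article does not specify the exact rules.

```python
import re

def clean_attribute_values(raw_values, breadcrumbs):
    """Normalize raw attribute values before they reach the LLM."""
    cleaned = []
    seen = set()
    for value in raw_values:
        v = re.sub(r"\s+", " ", value.strip())  # whitespace normalization
        if not v:                               # drop empty values
            continue
        if v.lower() in seen:                   # deduplicate values
            continue
        seen.add(v.lower())
        cleaned.append(v)
    # Flatten the breadcrumb trail into one structured context string,
    # e.g. ["Tools", "Power Tools", "Drills"] -> "Tools > Power Tools > Drills"
    category_context = " > ".join(b.strip() for b in breadcrumbs)
    return cleaned, category_context
```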
Contextual LLM processing
The language model did not perform mechanical sorting. Given the attribute name, its cleaned values, and the category context, it could apply semantic reasoning, understanding:
That “Voltage” in power tools should be sorted numerically
That “Size” in clothing follows an established progression (S, M, L, XL)
That “Color” in certain categories respects standards like RAL 3020
That “Material” exhibits semantic hierarchies
The model returned:
an ordered list of values
refined attribute descriptions
a classification of the attribute as deterministically or contextually sortable
This enabled the pipeline to handle different attribute types flexibly, without hardcoding rules for each category.
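A minimal sketch of this contract might look as follows. The JSON schema, the prompt wording, and the injected call_llm function are assumptions standing in for whatever model client the team actually used:

```python
import json
from dataclasses import dataclass

@dataclass
class SortResult:
    ordered_values: list      # values in the order the model proposes
    refined_description: str  # cleaned-up attribute description
    sort_type: str            # "deterministic" or "contextual"

PROMPT_TEMPLATE = """You are ordering e-commerce attribute values.
Category: {category}
Attribute: {attribute}
Values: {values}
Respond with JSON containing: ordered_values, refined_description,
and sort_type ("deterministic" or "contextual")."""

def sort_with_llm(attribute, values, category_context, call_llm):
    # call_llm: any function taking a prompt string and returning the
    # model's raw text response; the concrete client is not specified.
    prompt = PROMPT_TEMPLATE.format(
        category=category_context,
        attribute=attribute,
        values=", ".join(values),
    )
    payload = json.loads(call_llm(prompt))
    return SortResult(
        ordered_values=payload["ordered_values"],
        refined_description=payload["refined_description"],
        sort_type=payload["sort_type"],
    )
```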
Deterministic fallback logic
Not every attribute required the language model. For numeric ranges, unit-based sizes, and simple quantities, a deterministic path offers:
faster processing
guaranteed predictability
lower costs
elimination of ambiguity
The pipeline automatically recognized such cases and applied deterministic sorting logic. The system remained efficient and avoided unnecessary LLM calls.
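The dispatch between the two paths can be sketched as a simple parse-or-bail check; the regex below is a deliberately simplified assumption (a real catalog would need a richer unit table):

```python
import re

# Matches values like "12cm", "5 kg", or "230V": a number plus optional unit.
NUMERIC_PATTERN = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*([a-zA-Z%]*)\s*$")

def try_deterministic_sort(values):
    """Return values sorted numerically if every value parses as
    number + unit; otherwise return None to signal the LLM path."""
    parsed = []
    for value in values:
        match = NUMERIC_PATTERN.match(value)
        if not match:
            return None                 # mixed content -> contextual path
        number = float(match.group(1).replace(",", "."))
        parsed.append((number, value))
    return [v for _, v in sorted(parsed)]

# try_deterministic_sort(["5cm", "12cm", "2cm", "20cm"])
# -> ["2cm", "5cm", "12cm", "20cm"], with no LLM call
```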
Human control via tagging systems
For business-critical attributes, merchants needed final decision authority. Each category could therefore carry one of two tags:
LLM_SORT: language model decides the order
MANUAL_SORT: merchants explicitly define the sequence
This split proved effective on both fronts: AI handled routine tasks while humans retained control. It built trust and allowed merchants to override model decisions when needed, without disrupting the processing pipeline.
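In code, the override can be as small as a guard clause; the tag names come from the article, while the function shape is an assumption:

```python
LLM_SORT = "LLM_SORT"        # language model decides the order
MANUAL_SORT = "MANUAL_SORT"  # merchants explicitly define the sequence

def resolve_order(category_tag, llm_order, manual_order):
    """Merchants always win: a MANUAL_SORT tag overrides whatever the
    model proposed, without touching the rest of the pipeline."""
    if category_tag == MANUAL_SORT and manual_order:
        return manual_order
    return llm_order
```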
Persistence in a centralized database
All results were persisted directly in MongoDB, which kept the architecture simple and maintainable. MongoDB served as the operational store for:
ordered attribute values
refined attribute names
category-specific sort tags
product-related sort field metadata
This enabled easy review, targeted overwriting, reprocessing of categories, and seamless synchronization with external systems.
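A persistence sketch using pymongo might look like this; the connection string, collection, and field names are illustrative, since the article only lists what MongoDB stores:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed connection
collection = client["catalog"]["attribute_orders"]  # assumed names

def persist_sort_result(category_id, attribute, result, tag):
    collection.update_one(
        {"category_id": category_id, "attribute": attribute},
        {"$set": {
            "ordered_values": result.ordered_values,
            "refined_name": result.refined_description,
            "sort_tag": tag,                 # LLM_SORT or MANUAL_SORT
            "sort_type": result.sort_type,   # deterministic / contextual
        }},
        upsert=True,  # reprocessing a category simply overwrites the doc
    )
```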
Integration with search infrastructure
After normalization, values flowed into two search systems:
Elasticsearch: for keyword-driven filtering and faceted search
Vespa: for semantic, vector-based product matching
This duality ensured:
filters appear in logical, expected order
product pages display consistent attributes
search engines rank products more precisely
customer experience is more intuitive
The search layer is where attribute consistency is most visible and commercially valuable.
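One plausible way to make both engines honor the same order is to index an explicit rank per attribute value; the article does not describe the actual index schema, so the sketch below is an assumption:

```python
def enrich_for_search(product_doc, ordered_values_by_attr):
    """Attach a sort rank to each attribute value so Elasticsearch and
    Vespa can order facets consistently from the same source of truth."""
    ranks = {}
    for attr, value in product_doc.get("attributes", {}).items():
        order = ordered_values_by_attr.get(attr, [])
        # unknown values sort last rather than breaking the facet
        ranks[attr] = order.index(value) if value in order else len(order)
    product_doc["attribute_ranks"] = ranks
    return product_doc
```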
Practical results of the transformation
The pipeline transformed chaotic raw values into structured outputs:
| Attribute | Raw Values | Normalized Output |
| --- | --- | --- |
| Size | XL, Small, 12cm, Large, M, S | Small, M, Large, XL, 12cm |
| Color | RAL 3020, Crimson, Red, Dark Red | Red, Dark Red, Crimson, RAL 3020 |
| Material | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel |
| Numeric | 5cm, 12cm, 2cm, 20cm | 2cm, 5cm, 12cm, 20cm |
Especially for color attributes, the importance of contextualization became clear: the system recognized that RAL 3020 is a color standard and placed it meaningfully among semantically similar values.
Architecture overview of the entire system
The modular pipeline orchestrated the following steps:
Extract product data from the Product Information Management (PIM) system
Isolate attribute values and category context via the attribute extraction job
Pass cleaned data to the AI sorting service
Write updated product documents to MongoDB
Sync updates back to the source PIM system via the outbound sync job
Synchronize sorted data into the Elasticsearch and Vespa indexes via their respective sync jobs
Connect the search systems to client applications through API layers
This workflow ensured that every normalized attribute value—whether sorted by AI or manually set—was consistently reflected in search, merchandising, and customer experience.
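Composed from the sketches above, a single batch run per category could be orchestrated roughly like this; the jobs bundle of stage functions is an assumption standing in for the real job implementations:

```python
def run_attribute_pipeline(category_id, jobs):
    raw = jobs.extract_from_pim(category_id)                 # step 1
    for attribute, values, breadcrumbs in jobs.isolate_attributes(raw):
        cleaned, context = clean_attribute_values(values, breadcrumbs)
        order = try_deterministic_sort(cleaned)              # cheap path first
        if order is None:                                    # contextual case
            order = sort_with_llm(attribute, cleaned, context,
                                  jobs.call_llm).ordered_values
        jobs.write_to_mongo(category_id, attribute, order)   # persistence
    jobs.sync_to_pim(category_id)                            # outbound sync
    jobs.sync_search_indexes(category_id)                    # ES + Vespa
```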
Why offline processing was the right choice
Real-time pipelines would have introduced latency unpredictability, higher compute costs, and fragile dependency networks. Offline jobs instead enabled:
Efficient batch processing
Asynchronous LLM calls without real-time pressure
Robust retry mechanisms and error queues (sketched after this section)
Time windows for human validation
Predictable, controllable compute costs
The trade-off was a slight delay between data ingestion and display, but the benefit of reliability at scale was well worth it.
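The retry behavior referenced in the list above is trivial to implement offline, where no customer is waiting on the response. A minimal sketch, with assumed job and queue shapes:

```python
import time

def with_retries(job, batch, max_attempts=3, error_queue=None):
    """Retry each item with exponential backoff; park persistent
    failures in an error queue for human review instead of failing
    the whole batch run."""
    for item in batch:
        for attempt in range(1, max_attempts + 1):
            try:
                job(item)
                break
            except Exception:
                if attempt == max_attempts:
                    if error_queue is not None:
                        error_queue.append(item)
                else:
                    time.sleep(2 ** attempt)  # backoff before retrying
```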
Business and technical impact
The solution achieved measurable results:
Consistent attribute sorting across 3+ million SKUs
Predictable sorting of numeric values via deterministic fallbacks
Decentralized merchant control through manual tagging
Cleaner product pages and more intuitive filters
Improved search relevance and ranking accuracy
Increased customer trust and conversion rate
This was not just a technical project; it was an immediately measurable lever for user experience and revenue growth.
Key takeaways for product scale
Hybrid systems outperform pure AI at scale. Guardrails and control mechanisms are essential.
Context is the multiplier for LLM accuracy. Clean, category-relevant inputs lead to reliable outputs.
Offline processing is not a compromise but an architectural necessity for throughput and resilience.
Human override options build trust. Systems controllable by humans are adopted faster.
Input data quality determines output reliability. Cleaning is not overhead but the foundation.
Final reflection
Normalizing attribute values may seem like a simple problem—until you have to solve it for millions of product variants. By combining language model intelligence with deterministic rules and merchant controls, a hidden, stubborn problem was transformed into an elegant, maintainable system.
It reminds us: some of the most valuable technical wins do not come from shiny innovations but from systematically solving unseen problems—those that operate daily on every product page but rarely receive attention.