UNSPSC Classification Accuracy: What 90–95% Actually Means

Procurement | 28 April 2026 | 6 min read

The number you keep seeing

If you have researched automated UNSPSC classification, you have probably encountered a claim around 90–95% automatic classification. Pearstop publishes this number. So do several competitors.

But what does it actually mean? How is it measured? And what happens to the remaining 5–10%?

These are the questions procurement teams ask before committing to an automated classification system. This article answers them plainly.

How accuracy is defined

"90–95% automatic classification" means that 90–95% of procurement line items in a given dataset are assigned a UNSPSC commodity code by the automated engine, without requiring any human input.

The remaining 5–10% are items where the engine's confidence falls below a defined threshold. These items are flagged for human review — a buyer or category manager looks at each one and confirms or corrects the suggested code.
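
As a rough illustration of the definition above, the auto-classification rate is simply the share of lines whose confidence clears the threshold. The 0.85 threshold, the field names, and the sample codes and scores below are assumptions made for this sketch, not Pearstop's actual values.

```python
from dataclasses import dataclass

# Hypothetical threshold; in practice this is a configuration choice.
CONFIDENCE_THRESHOLD = 0.85


@dataclass
class Prediction:
    description: str
    suggested_code: str  # placeholder 8-digit code, not a verified UNSPSC assignment
    confidence: float    # engine confidence between 0.0 and 1.0


def split_by_confidence(predictions, threshold=CONFIDENCE_THRESHOLD):
    """Return (auto_classified, review_queue) for a list of predictions."""
    auto = [p for p in predictions if p.confidence >= threshold]
    review = [p for p in predictions if p.confidence < threshold]
    return auto, review


def auto_classification_rate(predictions, threshold=CONFIDENCE_THRESHOLD):
    """Share of line items that are coded without human input."""
    if not predictions:
        return 0.0
    auto, _ = split_by_confidence(predictions, threshold)
    return len(auto) / len(predictions)


# Made-up sample: three confident predictions and one abbreviated description.
sample = [
    Prediction("Terminal block connector, 2-conductor", "99999901", 0.97),
    Prediction("Office cleaning, weekly", "99999902", 0.91),
    Prediction("Lift maintenance visit", "99999903", 0.88),
    Prediction("WK elec H3 Q2-26", "99999904", 0.42),
]

auto, review = split_by_confidence(sample)
print(f"Auto-classified: {auto_classification_rate(sample):.0%}")
print(f"Sent to review: {[p.description for p in review]}")
```

Raising the threshold shrinks the auto-classified share and lengthens the review queue; lowering it does the opposite, at the cost of more wrong codes slipping through unreviewed.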

Accuracy is typically measured at the commodity level — the most specific level of the UNSPSC hierarchy (an 8-digit code). Measuring at segment or family level would produce higher-sounding numbers but much less useful classification.
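
To make the hierarchy concrete, an 8-digit code can be read as four nested levels: the first two digits are the segment, the first four the family, the first six the class, and all eight the commodity. The helper name and the sample code below are illustrative only.

```python
def unspsc_levels(code: str) -> dict:
    """Split an 8-digit UNSPSC code into its four hierarchy levels."""
    if len(code) != 8 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected an 8-digit code, got {code!r}")
    return {
        "segment": code[:2],    # broadest level
        "family": code[:4],
        "class": code[:6],
        "commodity": code,      # most specific level
    }


# Placeholder code used only to show the slicing.
levels = unspsc_levels("72101507")
print(levels)
```

Two codes that share a segment can still differ at family, class, or commodity level, which is why a segment-level match is a much weaker result than a commodity-level one.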

What makes an item hard to classify automatically?

The items that fall into the review queue tend to share a few characteristics:

Highly abbreviated descriptions. "WK elec H3 Q2-26" is meaningful to the engineer who wrote it but tells an algorithm very little. Without context from the supplier, cost element, and site, confident classification is not possible.

Brand names and part numbers without descriptions. "Wago 221-412" is a specific terminal block connector, but without the product description, the engine relies on supplier and GL context alone. If those signals are weak, confidence drops.

Genuinely ambiguous spend. Some procurement lines sit at the boundary between two UNSPSC categories. A maintenance visit that includes both labour and materials might be Segment 72 (Building and Facility Construction and Maintenance Services) or Segment 78 (Transportation and Storage and Mail Services), depending on how the work was invoiced.

First-time suppliers. The ML layer learns from patterns across your supplier base. A brand-new supplier with no purchase history produces lower confidence scores until the engine has seen enough examples to establish patterns.

Accuracy at different stages

A good automated classification system does not stay at the same accuracy level over time. It improves.

Stage	Auto-classification rate
Initial baseline (first run)	80–90%
After 3 months of operation	90–95%
After 12 months of operation	95%+

The improvement comes from the feedback loop. Every item that a human reviewer classifies is fed back into the ML model. The next time a similar description appears — from the same or a different supplier — the engine classifies it automatically with high confidence.

This is why the review queue matters. It is not a failure of the system; it is the system learning.
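
The feedback loop can be sketched in a few lines. This is a deliberately simplified stand-in: instead of retraining an ML model, it stores reviewer decisions in an exact-match lookup on a normalised description. The class and method names are hypothetical, and the code value is a placeholder.

```python
def normalize(description: str) -> str:
    """Lowercase and collapse whitespace so trivial variations match."""
    return " ".join(description.lower().split())


class FeedbackMemory:
    """Remembers codes confirmed by human reviewers (toy stand-in for retraining)."""

    def __init__(self):
        self._confirmed = {}

    def record_review(self, description: str, code: str) -> None:
        # Called when a reviewer confirms or corrects a suggested code.
        self._confirmed[normalize(description)] = code

    def classify(self, description: str):
        # Returns the remembered code, or None if this item still needs review.
        return self._confirmed.get(normalize(description))


memory = FeedbackMemory()
first_attempt = memory.classify("Wago 221-412")     # unknown, goes to review
memory.record_review("Wago 221-412", "99999901")    # reviewer decides once
second_attempt = memory.classify("WAGO  221-412")   # same item, different formatting
print(first_attempt, second_attempt)
```

A real model generalises to similar descriptions rather than only identical ones, but the principle is the same: each review decision removes a class of items from future queues.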

How Pearstop's four-layer engine achieves this

Layer 1 — Rules Engine. User-defined rules and automatically loaded patterns handle high-confidence classifications immediately. If you have told the system that all purchases from Supplier X under GL account 6400 are Segment 72, Class 721010, every matching line item is classified straight away, without being passed to the ML or LLM layers.

Layer 2 — Machine Learning. A proprietary ML layer is trained on your historical spend data and on a broad corpus of procurement transactions across industries. It replicates the classification logic your most experienced category managers would apply — handling the common cases automatically.

Layer 3 — LLM Layer. Ambiguous or unusual line items are processed by a large language model that brings broad product and industry knowledge. The LLM handles descriptions that the ML layer has never seen before, including foreign-language descriptions and highly technical terminology.

Layer 4 — Human Review. Items below the confidence threshold are surfaced to your team in a review interface. Each decision feeds back into layers 1–3. Over time, the rules and ML layers expand to cover items that initially required human input.
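
The layered approach described above can be pictured as a cascade: each layer either returns a confident answer or passes the item on, and anything still below the threshold lands in the review queue with the best suggestion attached. The sketch below is an assumption about the general shape, not Pearstop's implementation; the layer functions are stubs, the 0.85 threshold is hypothetical, and the "999999xx" codes are placeholders.

```python
from typing import Callable, List, NamedTuple, Optional


class Result(NamedTuple):
    code: Optional[str]  # suggested code, or None if no layer had a suggestion
    confidence: float
    layer: str           # layer that produced the final outcome


Layer = Callable[[dict], Optional[Result]]


def rules_layer(item: dict) -> Optional[Result]:
    # Mirrors the rule from the text: Supplier X under GL account 6400.
    if item.get("supplier") == "Supplier X" and item.get("gl_account") == "6400":
        return Result("721010", 1.0, "rules")
    return None


def ml_layer(item: dict) -> Optional[Result]:
    # Stub for a trained model: a tiny lookup with a made-up score.
    known = {"office cleaning": ("99999901", 0.93)}
    hit = known.get(item.get("description", "").lower())
    return Result(hit[0], hit[1], "ml") if hit else None


def llm_layer(item: dict) -> Optional[Result]:
    # Stub for an LLM call: always returns a low-confidence guess.
    return Result("99999902", 0.40, "llm")


def classify(item: dict, layers: List[Layer], threshold: float = 0.85) -> Result:
    """Try each layer in order; fall back to human review below the threshold."""
    best: Optional[Result] = None
    for layer in layers:
        result = layer(item)
        if result is None:
            continue
        if result.confidence >= threshold:
            return result
        if best is None or result.confidence > best.confidence:
            best = result
    # No layer was confident: queue for review, keeping the best suggestion.
    if best is None:
        return Result(None, 0.0, "human_review")
    return Result(best.code, best.confidence, "human_review")


LAYERS = [rules_layer, ml_layer, llm_layer]
by_rule = classify({"supplier": "Supplier X", "gl_account": "6400"}, LAYERS)
by_ml = classify({"description": "Office cleaning"}, LAYERS)
to_review = classify({"description": "WK elec H3 Q2-26"}, LAYERS)
print(by_rule.layer, by_ml.layer, to_review.layer)
```

Ordering the layers from cheapest to most expensive means most lines never reach the slower steps, and the reviewer always sees a suggested code rather than a blank field.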

What happens to unclassified items

Some procurement teams worry about the 5–10% that goes to review. The realistic picture:

In the first month, a team might review 500–1,000 items from a dataset of 10,000 lines.
By month three, the same volume of new invoices produces 200–300 items for review.
By month twelve, it is often fewer than 100.

The review interface is designed for speed — a category manager can process 100 items in 20–30 minutes with clear suggested codes and confidence indicators. It is a fundamentally different workload from manual classification of the full dataset.
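
Putting the queue sizes and the review pace together gives a rough monthly time budget. The calculation below assumes the 20–30 minutes per 100 items holds at any queue size, which is a simplification; the function name is illustrative.

```python
def review_minutes(items_low: int, items_high: int,
                   minutes_per_100=(20, 30)) -> tuple:
    """Best-case and worst-case review time in minutes for a queue-size range."""
    fast, slow = minutes_per_100
    return items_low * fast / 100, items_high * slow / 100


# Queue sizes quoted in the text for a 10,000-line monthly dataset.
month_1 = review_minutes(500, 1000)
month_3 = review_minutes(200, 300)
print(f"Month 1: {month_1[0]:.0f}-{month_1[1]:.0f} minutes")
print(f"Month 3: {month_3[0]:.0f}-{month_3[1]:.0f} minutes")
```

On these figures the first month costs somewhere between under two hours and five hours of review, and month three between 40 and 90 minutes.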

A note on how competitors measure accuracy

Not all accuracy claims are equivalent. Watch for:

Accuracy at segment level vs. commodity level. Segment-level classification (the first 2 digits of the UNSPSC code) is much easier than commodity-level (all 8 digits). A tool that achieves 95% at segment level might achieve only 60% at commodity level.
Accuracy on clean data vs. real-world data. Some vendors test against datasets where descriptions are already standardised. Real procurement data from SAP or Oracle is messier, and accuracy figures should reflect that.
Review queue vs. unclassified. A system that flags items for review is different from one that leaves them unclassified. Flagged items get classified — eventually. Unclassified items stay unclassified.
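
The gap between segment-level and commodity-level accuracy is easy to check on your own validation data. The sketch below compares predicted and reviewer-confirmed codes on the first 2 digits and on all 8; the function name and the sample codes are made up for illustration.

```python
def accuracy_at_level(predicted, actual, digits: int) -> float:
    """Share of predictions that match the true code on the first `digits` digits."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    matches = sum(p[:digits] == a[:digits] for p, a in zip(predicted, actual))
    return matches / len(actual)


# Placeholder codes: every prediction lands in the right segment,
# but only two of the four match the full commodity code.
predicted = ["72101507", "72101509", "72151200", "39121409"]
actual = ["72101507", "72101507", "72101507", "39121409"]

segment_accuracy = accuracy_at_level(predicted, actual, digits=2)
commodity_accuracy = accuracy_at_level(predicted, actual, digits=8)
print(f"Segment level: {segment_accuracy:.0%}")
print(f"Commodity level: {commodity_accuracy:.0%}")
```

Asking a vendor to report both numbers on a sample of your own data is a quick way to see which level their headline figure refers to.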

Pearstop's figures are measured at commodity level on real client datasets, including Dutch infrastructure and facilities management (FM) spend, where descriptions mix Dutch and English and vary significantly across sites.

What accuracy actually buys you

A 90–95% auto-classification rate means that a procurement team handling 10,000 invoice lines per month cuts its manual classification effort from 10,000 decisions to 500–1,000, a reduction of 90–95%.

That reduction does not just save time. It makes spend analysis possible in the first place. A manually classified dataset where one person is working through 10,000 lines is always weeks or months behind. An automated system produces a classified dataset within days of the invoice data arriving.

For category management, supplier benchmarking, and margin analysis, timeliness matters as much as accuracy. A spend baseline that is three months old is far less useful than one that reflects last month's purchases.

Pearstop Team

Pearstop

Pearstop helps procurement and operations teams in hard services, FM, construction, and manufacturing turn messy data into a reliable foundation for decisions, AI, and category management.

