ChaoBro

Invoices, Structured Data, Complex Instructions: A Real-Task Exam for Domestic Models — Who Fabricates Data?


Beyond Benchmarks: Real Tasks Are the True Test

In AI model evaluation, standardized benchmarks such as SWE-bench, MMLU-Pro, and HumanEval have become the industry consensus. But one fact is increasingly clear: there is a significant gap between benchmark scores and real-task performance.

Recently, a community developer ran the same invoice-processing task across multiple mainstream models, and the results diverged sharply:

In actual invoice task testing, only DeepSeek V4 Flash, GPT-5.5, and GLM-5.1 reliably completed the task. MIMO V2.5 Pro and MiniMax M2.7 fabricated data in their results.

This is not an edge case. Invoice processing, structured data extraction, complex instruction execution — these seemingly “simple” tasks are precisely the workloads AI Agents encounter most frequently in real business scenarios.

Test Task: Why Invoice Processing Exposes Model Differences

Invoice processing is a good evaluation dimension because it simultaneously tests multiple core capabilities:

  1. OCR + Semantic Understanding: Extracting key fields (amount, date, tax ID, vendor) from unstructured text
  2. Data Validation: Verifying extracted data is logical (reasonable amounts, correct tax ID format)
  3. Rejecting Hallucination: When information is incomplete, the model should honestly report “cannot confirm” rather than fabricating data

The third point is the most critical differentiator. A good model, when facing uncertain information, should prefer to say “I don’t know” rather than fabricate an answer.
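To make points 2 and 3 concrete, here is a minimal validation sketch in Python. The field names and the 18-character tax-ID pattern are illustrative assumptions for this example, not part of the original test; the key idea is that "cannot confirm" (represented as None) is an acceptable, checkable answer.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Illustrative sketch: check fields extracted by a model before trusting them.
# Field names and the tax-ID pattern are assumptions, not the original test's spec.

@dataclass
class InvoiceFields:
    amount: Optional[float]   # None means the model marked it "cannot confirm"
    tax_id: Optional[str]
    vendor: Optional[str]

def validate(fields: InvoiceFields) -> list[str]:
    """Return a list of problems; an empty list means the extraction looks plausible."""
    problems = []
    if fields.amount is None:
        problems.append("amount: cannot confirm")
    elif fields.amount <= 0:
        problems.append("amount: must be positive")
    if fields.tax_id is None:
        problems.append("tax_id: cannot confirm")
    elif not re.fullmatch(r"[0-9A-Z]{18}", fields.tax_id):
        problems.append("tax_id: unexpected format")
    if not fields.vendor:
        problems.append("vendor: cannot confirm")
    return problems
```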

Test Results Summary

| Model | Task Complete | Data Accuracy | Hallucination Risk | Overall |
| --- | --- | --- | --- | --- |
| DeepSeek V4 Flash | Yes | High | Low | Reliable |
| GPT-5.5 | Yes | High | Low | Reliable |
| GLM-5.1 | Yes | High | Low | Reliable |
| MIMO V2.5 Pro | No | Fabricated | High | Hallucination |
| MiniMax M2.7 | No | Fabricated | High | Hallucination |

Note: Results from a single community test with limited sample size, but consistent with trends from other independent evaluations.

Why This Divergence Occurs

DeepSeek V4 Flash: The Pragmatist’s Victory

DeepSeek’s reliable performance on real tasks aligns with its design philosophy: not over-pursuing benchmark scores, but emphasizing practical usability. The V4 Flash version keeps costs down while retaining sufficient reasoning capability. In tasks requiring precise information extraction, such as invoice processing, it shows more “restraint” than some flagship models — it does not fabricate data just to deliver a “complete answer.”

GLM-5.1: Zhipu’s Engineering Accumulation

GLM-5.1’s stability in programming and structured tasks has been validated by the community. In invoice processing, it extends this characteristic: for tasks requiring high precision, GLM-5.1’s hallucination rate is significantly lower than models at the same level. This may relate to Zhipu’s technical accumulation in knowledge graphs and structured understanding.

GPT-5.5: The Closed-Source Baseline

As the closed-source representative, GPT-5.5 performs as expected in this task — reliable, accurate, low hallucination. What is noteworthy is that domestic open-source models (DeepSeek V4 Flash, GLM-5.1) have caught up to this closed-source baseline on this dimension.

MIMO V2.5 Pro and MiniMax M2.7: The Cost of Overconfidence

Both models exhibited “data fabrication” issues. This reflects a common vulnerability: when models are trained to “always give complete answers,” they are more prone to hallucination in scenarios with uncertain information.

This phenomenon may be less fatal in programming (code execution failure exposes errors), but is catastrophic in data processing — fabricated invoice amounts or incorrect tax IDs can lead directly to financial errors.

Broader Implications: Model Selection Cannot Rely Solely on Benchmarks

This test reveals an important selection principle:

Different task types have vastly different requirements for model capabilities.

| Task Type | Key Capability | Recommended Model |
| --- | --- | --- |
| Code Generation | Syntax correctness, context understanding | GLM-5.1, Kimi K2.6 |
| Code Debugging | Reasoning chains, root cause analysis | DeepSeek V4 Pro |
| Data Extraction | Precision, hallucination rejection | DeepSeek V4 Flash, GLM-5.1 |
| Creative Writing | Diversity, fluency | Qwen 3.6 Max |
| Multimodal Understanding | Image + text joint reasoning | Kimi K2.6 |
| High-Frequency Agent Calls | Cost, speed | Qwen 3.6 Plus, MiniMax M2.7 |

Anti-Hallucination: Model Selection and Prompt Strategies

If your task has high data accuracy requirements, beyond choosing the right model, you can also adopt these strategies:

1. Explicitly Tell the Model “You Can Say No”

```text
Please extract key fields from the following invoice text. If any field's
information is incomplete or uncertain, clearly mark "cannot confirm" —
do not guess or fabricate.
```
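A minimal sketch of wiring this instruction into an OpenAI-compatible chat API follows. The model name, base URL, and API key are placeholders (assumptions, not an endorsement of any particular endpoint); the point is simply to pin the "you can say no" rule into the system message.

```python
from openai import OpenAI

# Sketch only: base_url, api_key, and model name are placeholders for
# whichever OpenAI-compatible endpoint you actually use.
client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_KEY")

SYSTEM = (
    "Extract key fields from the invoice text. If any field is incomplete "
    "or uncertain, clearly mark it 'cannot confirm'. Do not guess or fabricate."
)

def extract_fields(invoice_text: str) -> str:
    response = client.chat.completions.create(
        model="your-model-name",   # placeholder
        temperature=0,             # deterministic output helps reproducibility
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": invoice_text},
        ],
    )
    return response.choices[0].message.content
```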

2. Use Structured Output Format

Require JSON output with schema annotations for required vs. optional fields:

```json
{
  "invoice_number": "required",
  "amount": "required, must be numeric",
  "tax_id": "optional, null if uncertain",
  "vendor_name": "required"
}
```
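One way to enforce this contract on the consuming side is to validate the model's raw output against an actual JSON Schema, for example with the `jsonschema` package. The schema below mirrors the field list above and is only a sketch; adjust field names and types to your own invoice layout.

```python
import json
from jsonschema import validate, ValidationError

# Required fields must be present; tax_id may be null when the model cannot confirm it.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "amount": {"type": "number"},
        "tax_id": {"type": ["string", "null"]},
        "vendor_name": {"type": "string"},
    },
    "required": ["invoice_number", "amount", "vendor_name"],
    "additionalProperties": False,
}

def parse_and_check(raw_output: str) -> dict | None:
    """Return parsed fields, or None if the output is not valid JSON or violates the schema."""
    try:
        data = json.loads(raw_output)
        validate(instance=data, schema=INVOICE_SCHEMA)
        return data
    except (json.JSONDecodeError, ValidationError):
        return None
```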

3. Cross-Validation

For critical data, use two different models to extract independently and compare consistency. If one model gives a definite value while another marks “cannot confirm,” use the conservative value.
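A minimal sketch of that comparison logic, assuming both extractions arrive as dicts like the schema above, with None standing for "cannot confirm":

```python
# Sketch: merge two independent extractions conservatively.
# A field is trusted only when both models agree on a definite value;
# any disagreement or "cannot confirm" (None) is escalated for review.
def conservative_merge(a: dict, b: dict) -> tuple[dict, list[str]]:
    merged, needs_review = {}, []
    for key in a.keys() | b.keys():
        va, vb = a.get(key), b.get(key)
        if va is not None and va == vb:
            merged[key] = va
        else:
            merged[key] = None          # conservative: treat as unconfirmed
            needs_review.append(key)
    return merged, needs_review
```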

Where Domestic Models Really Stand: Progress and Challenges Coexist

This test has both positive signals and warnings:

Positive: DeepSeek V4 Flash and GLM-5.1 demonstrate reliability in real data processing tasks comparable to GPT-5.5. This shows domestic models have reached usable levels in scenarios requiring high precision.

Challenge: Some models still carry hallucination risks in structured tasks. This means model selection cannot rely solely on benchmark rankings — real business data validation is essential.

Action Recommendations

For Developers

  • Build your own test suite: Run candidate models on real business data (invoices, contracts, reports), recording accuracy, hallucination rate, and first-attempt completion rate (see the sketch after this list)
  • Don’t just look at SWE-bench scores: Code generation ability ≠ data processing ability. Choose models based on your actual task types
  • Focus on models’ “rejection ability”: A good model should know when to say “I don’t know”
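A minimal sketch of such a test harness, following the first bullet above. The sample format and the `run_extraction(model, text)` call are hypothetical stand-ins for your own data and extraction pipeline; only the bookkeeping is shown.

```python
# Sketch of a tiny evaluation harness: run one candidate model over labeled
# real-business samples and record accuracy and hallucination rate.
# run_extraction(model, text) is a placeholder for your own extraction call,
# returning a dict with None for fields the model marked "cannot confirm".
def evaluate(model: str, samples: list[dict], run_extraction) -> dict:
    correct = hallucinated = 0
    for sample in samples:                       # each sample: {"text": ..., "expected": {...}}
        fields = run_extraction(model, sample["text"])
        for key, expected in sample["expected"].items():
            got = fields.get(key)
            if got == expected:
                correct += 1
            elif got is not None:                # a definite but wrong value counts as hallucination
                hallucinated += 1
    total = sum(len(s["expected"]) for s in samples)
    return {
        "model": model,
        "accuracy": correct / total,
        "hallucination_rate": hallucinated / total,
    }
```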

For Enterprises

  • PoC phase must include hallucination testing: During model selection, specifically design test cases with incomplete information to evaluate hallucination tendencies
  • Use dual-model cross-validation for critical tasks: For high-risk scenarios like finance and legal, use two independent models to cross-validate results

For Model Vendors

  • Add “uncertainty expression” training: Include “should refuse when information is insufficient” preference signals in RLHF
  • Provide structured output guarantees: Support JSON Schema validation with automatic correction for non-compliant output
