Adversarial Test Report: Legal Citation Extraction System

Date: 2026-06-25
Tester: Phase3-Adversarial
Methodology: Five adversarial test cases, each designed to expose a failure mode in the contextual-distillation citation extraction pipeline. Each case targets a specific weakness in how the LLM parses legal text, maps it to CountryProfile level hierarchies, and produces reference_id values, which must be verbatim substrings of the chunk.

Ambiguous Numbering
Cross-Reference Confusion
Multi-Document Chunk
Amendment Text
Code-Switching
Robustness Rating
Systemic Findings

1. Ambiguous Numbering

Scenario: French law text where "2" could be a point marker (2°), a back-reference to that point in running prose, a number spelled out in prose, or a digit inside an article number.

Query (FR)

Quelles sont les conditions d'ancienneté pour le préavis de licenciement ?

Chunk (FR)

Article L1234-1

Lorsque le licenciement n'est pas motivé par une faute grave, le salarié a droit :

1° S'il justifie chez le même employeur d'une ancienneté de services continus inférieure à six mois, à un préavis dont la durée est déterminée par la loi, la convention ou l'accord collectif de travail ou, à défaut, par les usages pratiqués dans la localité et la profession ;

2° S'il justifie chez le même employeur d'une ancienneté de services continus comprise entre six mois et moins de deux ans, à un préavis d'un mois ;

3° S'il justifie chez le même employeur d'une ancienneté de services continus d'au moins deux ans, à un préavis de deux mois.

Toutefois, les dispositions des 2° et 3° ne sont applicables que si la loi, la convention ou l'accord collectif de travail, le contrat de travail ou les usages ne prévoient pas un préavis ou une condition d'ancienneté de services plus favorable pour le salarié.

Why This Confuses the LLM

The number "2" appears in four structurally different roles:

Occurrence	Actual Role	Risk
`2°` (line 2 of enumeration)	Point (level: `point`)	Correct
`2°` in "des 2° et 3°" (last paragraph)	Back-reference to point, not a new point	LLM may extract it as a new citation
`deux ans`	Prose number — not a structural reference at all	LLM may hallucinate `reference_id: "deux ans"`
`L1234-1`	Article number containing "2" as one digit among several	LLM may misparse the hyphenated structure

The last paragraph's 2° et 3° is the real trap. It is a back-reference embedded in running text, not a structural subdivision. The LLM must recognize that 2° here cites the points above and does not open a new structural unit. The few-shot schema asks for reference_id as a verbatim substring, and 2° does appear verbatim in that sentence, so the LLM will be tempted to extract it again.

Correct Extraction

relation	reference_id	section_name
`direct`	`Article L1234-1`	Préavis en cas de licenciement
`direct`	`3° S'il justifie chez le même employeur d'une ancienneté de services continus d'au moins deux ans`	Préavis de deux mois (ancienneté ≥ 2 ans)
`indirect`	`2° S'il justifie chez le même employeur d'une ancienneté de services continus comprise entre six mois et moins de deux ans`	Préavis d'un mois (ancienneté 6 mois–2 ans)

The back-reference des 2° et 3° in the last paragraph should not produce a separate extraction — it's a pointer to the already-extracted points, not a structural unit itself.

Likely LLM Failure

Duplicate extraction: 2° extracted twice (once from the enumeration, once from the back-reference)
Relation confusion: The back-reference 2° classified as direct when it's really just prose
reference_id truncation: 3° S'il justifie chez le même employeur d'une ancienneté de services continus d'au moins deux ans is very long — LLMs tend to truncate to 3° or 3° S'il justifie…
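
One way to catch the duplicate-extraction failure is a post-processing pass that separates structural point markers from back-references. The sketch below is not part of the tested pipeline. It assumes a marker such as 2° is structural only when it opens a line, and the helper name classify_point_markers is hypothetical.

```python
import re

# Assumption: a French point marker ("1°", "2°", ...) is structural only
# when it opens a line; anywhere else it is a back-reference in prose.
POINT_MARKER = re.compile(r"\d+°")


def classify_point_markers(chunk: str) -> list[tuple[str, str]]:
    """Label every point marker as 'structural' or 'back_reference'."""
    roles = []
    for line in chunk.splitlines():
        text = line.lstrip()
        for match in POINT_MARKER.finditer(text):
            role = "structural" if match.start() == 0 else "back_reference"
            roles.append((match.group(), role))
    return roles


# Three lines taken from the Article L1234-1 chunk above.
chunk = (
    "2° S'il justifie chez le même employeur d'une ancienneté de services "
    "continus comprise entre six mois et moins de deux ans, à un préavis d'un mois ;\n"
    "3° S'il justifie chez le même employeur d'une ancienneté de services "
    "continus d'au moins deux ans, à un préavis de deux mois.\n"
    "Toutefois, les dispositions des 2° et 3° ne sont applicables que si la loi, "
    "la convention ou l'accord collectif de travail ne prévoient pas un préavis "
    "plus favorable pour le salarié.\n"
)
roles = classify_point_markers(chunk)
```

An extraction whose reference_id matches only back_reference occurrences could then be dropped or flagged for review. The line-start rule is a heuristic and would need checking against real chunking output.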

2. Cross-Reference Confusion

Scenario: Korean law text referencing a section in a different law, where the cited law's section number happens to collide with the current law's structure.

Query (KR)

개인정보 유출 시 통지 의무와 처벌 규정은 어떻게 되나요?

Chunk (KR)

제34조의2(유출 등의 통지) ① 개인정보처리자는 개인정보가 분실·도난·유출된 사실을 알게 된 때에는 지체 없이 해당 정보주체에게 다음 각 호의 사항을 알려야 한다.
1. 유출된 개인정보의 항목
2. 유출된 시점
3. 대응조치 및 피해구제 절차

② 개인정보처리자는 제1항에 따른 통지를 받은 날부터 30일 이내에 「정보통신망 이용촉진 및 정보보호 등에 관한 법률」 제48조의2에 따른 개인정보 침해 사실 통지와 「신용정보의 이용 및 보호에 관한 법률」 제34조에 따른 유출 통지를 하여야 한다.

Why This Confuses the LLM

This chunk contains three different laws' section numbers in close proximity:

Reference	Law	Section	Current Law?
`제34조의2`	개인정보 보호법	—	✅ Yes (this chunk's article)
`제1항`	Same law, self-reference	—	✅ Yes
`제48조의2`	정보통신망법	Art. 48-2	❌ No — different law
`제34조`	신용정보법	Art. 34	❌ No — different law

The critical collision: 제34조의2 (current law, inserted article 34-2) and 제34조 (completely different law, article 34). They share the same structural pattern (제N조) and nearly the same number. The LLM must:

Recognize that 제48조의2 belongs to 정보통신망법, not the current law
Recognize that 제34조 belongs to 신용정보법, not the current law
NOT confuse 제34조의2 (current, inserted) with 제34조 (other law)
Handle the self-reference 제1항 pointing to paragraph 1 of the current article

Correct Extraction

relation	reference_id	section_name
`direct`	`제34조의2`	유출 등의 통지
`direct`	`①`	통지 의무
`direct`	`1.`	유출된 개인정보의 항목
`indirect`	`②`	통지 방법 (타 법률 준용)

제48조의2 and 제34조 from other laws should not be extracted as structural citations of this document — they're cross-references to external statutes. However, the system has no explicit mechanism to distinguish "this law's §34" from "that law's §34" when both appear in the same chunk.

Likely LLM Failure

Cross-law extraction: 제48조의2 and 제34조 extracted as direct references of the current document
Number collision: 제34조 (신용정보법) confused with 제34조의2 (current law)
Relation inflation: External cross-references classified as direct instead of indirect or excluded
Missing context: The LLM has no way to know which law the chunk belongs to — the user message template ([Question]\n{query}\n\n[Document Chunk]\n{chunk}) doesn't include the source document name
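
The collision between 제34조의2 and 제34조 can be made concrete with a small pre-filter. The sketch below assumes external statutes are always cited as a title in 「」 brackets followed by an article number; the function name split_article_refs is hypothetical and the pattern covers only 제N조 and 제N조의M.

```python
import re

# Assumption: an external statute is cited as 「title」 followed by 제N조 or 제N조의M.
ARTICLE = re.compile(r"제\d+조(?:의\d+)?")
EXTERNAL_CITATION = re.compile(r"「[^」]+」\s*(제\d+조(?:의\d+)?)")


def split_article_refs(chunk: str) -> tuple[list[str], list[str]]:
    """Split article numbers into (internal, external) by citation context."""
    external_spans = [m.span(1) for m in EXTERNAL_CITATION.finditer(chunk)]
    internal, external = [], []
    for m in ARTICLE.finditer(chunk):
        inside = any(s <= m.start() and m.end() <= e for s, e in external_spans)
        (external if inside else internal).append(m.group())
    return internal, external


# Article heading plus paragraph ② from the chunk above.
chunk = (
    "제34조의2(유출 등의 통지)\n"
    "② 개인정보처리자는 제1항에 따른 통지를 받은 날부터 30일 이내에 "
    "「정보통신망 이용촉진 및 정보보호 등에 관한 법률」 제48조의2에 따른 "
    "개인정보 침해 사실 통지와 「신용정보의 이용 및 보호에 관한 법률」 "
    "제34조에 따른 유출 통지를 하여야 한다."
)
internal, external = split_article_refs(chunk)
```

Matching whole article tokens matters here: a plain substring search for 제34조 would also hit the first four characters of 제34조의2 and blur the two laws together.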

3. Multi-Document Chunk

Scenario: A chunk that concatenates two unrelated articles (e.g., a PDF extraction artifact, or a legislative amendment that embeds another law's text). In this test both articles come from the same code but from unrelated parts of it.

Query (FR)

Quelles sont les obligations de l'employeur en matière de sécurité et les droits du salarié en cas de harcèlement ?

Chunk (FR)

Article L4121-1

L'employeur prend les mesures nécessaires pour assurer la sécurité et protéger la santé physique et mentale des travailleurs.

Ces mesures comprennent :

1° Des actions de prévention des risques professionnels, y compris ceux mentionnés à l'article L. 4161-1 ;

2° Des actions d'information et de formation ;

3° La mise en place d'une organisation et de moyens adaptés.

Article L1152-2

Aucun salarié ne doit subir les agissements répétés de harcèlement moral qui ont pour objet ou pour effet une dégradation de ses conditions de travail susceptible de porter atteinte à ses droits et à sa dignité, d'altérer sa santé physique ou mentale ou de compromettre son avenir professionnel.

Why This Confuses the LLM

Two articles from the same code (Code du travail) but addressing completely different topics (workplace safety vs. moral harassment) are concatenated into one chunk. The LLM faces:

Query relevance mismatch: The query asks about both topics, so both articles seem relevant — but they're from different parts of the code with no logical connection
No document boundary marker: There's no separator, header, or metadata indicating that the text has moved to an unrelated article
Level confusion: Article L4121-1 and Article L1152-2 both use the L prefix (legislative part), so they share a level and a prefix scheme despite their unrelated content
Cross-reference within chunk: L. 4161-1 appears inside the first article's text — is it a new extraction or a cross-reference?

Correct Extraction

relation	reference_id	section_name
`direct`	`Article L4121-1`	Obligation générale de sécurité
`direct`	`Article L1152-2`	Interdiction du harcèlement moral
`indirect`	`1° Des actions de prévention des risques professionnels`	Actions de prévention

L. 4161-1 in the first article's text is a cross-reference, not an article under discussion. It is omitted from the table above; if it is extracted at all, it should be classified as indirect.

Likely LLM Failure

False coherence: LLM treats the two unrelated articles as if they're part of the same logical unit, producing a narrative connecting safety obligations to harassment prevention
reference_id collision: Both Article L4121-1 and Article L1152-2 are correct, but the LLM may invent section_name values that falsely link them
Cross-reference extraction: L. 4161-1 extracted as direct when it's just a mention inside another article's text
Missing the second article entirely: If the query only asked about safety, the LLM might stop reading after Article L4121-1 and miss L1152-2 completely
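
A cheap signal for this case is the number of article headings inside a single chunk. The sketch below is an assumption about how such a check could look, not an existing pipeline component; the pattern handles only standalone heading lines of the form used in this chunk.

```python
import re

# Assumption: an article heading stands alone on its line, e.g. "Article L4121-1".
ARTICLE_HEADING = re.compile(
    r"^Article[ \t]+L\.?[ \t]?\d+(?:-\d+)+[ \t]*$", re.MULTILINE
)


def article_headings(chunk: str) -> list[str]:
    """Return every standalone article heading found in the chunk."""
    return [m.group().strip() for m in ARTICLE_HEADING.finditer(chunk)]


# Lines taken from the chunk above.
chunk = (
    "Article L4121-1\n\n"
    "L'employeur prend les mesures nécessaires pour assurer la sécurité et "
    "protéger la santé physique et mentale des travailleurs.\n\n"
    "1° Des actions de prévention des risques professionnels, y compris ceux "
    "mentionnés à l'article L. 4161-1 ;\n\n"
    "Article L1152-2\n\n"
    "Aucun salarié ne doit subir les agissements répétés de harcèlement moral.\n"
)
headings = article_headings(chunk)
spans_multiple_articles = len(headings) > 1
```

The inline mention of l'article L. 4161-1 is not counted because it does not start a line. A chunk flagged this way could be split before extraction or passed to the LLM with an explicit boundary note.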

4. Amendment Text

Scenario: A chunk that is an amending law inserting a new section into an existing code, where the amendment text describes the new section in meta-language rather than presenting it as operative text.

Query (FR)

Quelle est la nouvelle disposition relative aux lanceurs d'alerte dans le Code du travail ?

Chunk (FR)

Article 9

I. - Le titre IV du livre Ier de la première partie du code du travail est complété par un chapitre V ainsi rédigé :

" Chapitre V

" Dispositions relatives aux lanceurs d'alerte en matière sociale

" Art. L. 1312-1. - Un lanceur d'alerte au sens de l'article 6 de la loi n° 2016-1691 du 9 décembre 2016 relative à la transparence, à la lutte contre la corruption et à la modernisation de la vie économique bénéficie, dans les conditions prévues par ladite loi, de la protection contre les mesures de représailles mentionnées à l'article 12 de ladite loi.

" Art. L. 1312-2. - Les représentants du personnel sont informés et consultés sur les procédures de recueil des signalements établies par l'employeur."

II. - Le présent article entre en vigueur le 1er janvier 2023.

Why This Confuses the LLM

This is a meta-legislative chunk — it's an amending law that inserts new articles into the Code du travail. The structural references are nested:

Reference	What It Is	Depth
`Article 9`	The amending law's own article	Level 1 — the "real" article
`I.` / `II.`	Paragraphs of Article 9 (Roman numeral)	Level 2
`Chapitre V`	Chapter being inserted into the Code	Level 3 (quoted/amended text)
`Art. L. 1312-1`	Article being created inside the Code	Level 4 — double-nested
`Art. L. 1312-2`	Another created article	Level 4
`article 6 de la loi n° 2016-1691`	Reference to a third law	Cross-reference
`article 12 de ladite loi`	Reference to the same third law	Cross-reference

The LLM must decide: what is the "document" here? Is it the amending law (Article 9), or the Code du travail articles being inserted (L. 1312-1, L. 1312-2)? The query asks about the new provisions, but the chunk's structure is the amending law's article.

Correct Extraction

relation	reference_id	section_name
`direct`	`Art. L. 1312-1`	Protection des lanceurs d'alerte en matière sociale
`direct`	`Art. L. 1312-2`	Information des représentants du personnel
`indirect`	`Article 9`	Article d'insertion (loi modificative)
`indirect`	`Chapitre V`	Dispositions relatives aux lanceurs d'alerte

The query asks about the new provisions, so L. 1312-1 and L. 1312-2 are direct. The amending Article 9 is indirect — it's the vehicle, not the content.

Likely LLM Failure

Level inversion: Article 9 extracted as direct (it's the "real" article in the chunk) while L. 1312-1 is missed or classified as indirect
Quoted text ignored: The leading quotation mark on each inserted line (" Art. L. 1312-1.) signals quoted/amended text, and LLMs often skip or deprioritize quoted content
Cross-reference explosion: article 6 de la loi n° 2016-1691 and article 12 de ladite loi extracted as citations — they're cross-references to a third law, not structural elements of either the amending law or the Code
"ladite loi" resolution failure: ladite loi (the aforementioned law) requires resolving the anaphoric reference to loi n° 2016-1691 — LLMs may fail this co-reference
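reference_id for quoted text: In the chunk the marker reads " Art. L. 1312-1. with a leading quote, a space, and a trailing period. The bare Art. L. 1312-1 is still a verbatim substring, but if the LLM keeps the quote and drops the space ("Art. L. 1312-1) or expands the abbreviation (Article L. 1312-1), the reference_id no longer matches the chunk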

5. Code-Switching

Scenario: A Belgian or Canadian legal text mixing French and Dutch (Belgium) or French and English (Canada), where structural markers change language mid-sentence.

Query (BE — Belgium, bilingual FR/NL)

Quelles sont les conditions pour la détention provisoire ?

Chunk (BE)

Artikel 16 § 1. De onderzoeksrechter kan, op vordering van het openbaar ministerie, de aanhouding bevelen van een verdachte wanneer er ernstige aanwijzingen van schuld bestaan en hetzij de feiten een misdaad of wanbedrijf betreffen waarvoor de wet een gevangenisstraf van meer dan één jaar vaststelt, hetzij de verdachte gevaar oplevert voor de openbare veiligheid.

§ 2. Le juge d'instruction peut, à la réquisition du ministère public, ordonner l'arrestation d'un prévenu lorsqu'il existe des indices graves de culpabilité et que soit les faits constituent un crime ou un délit puni d'une peine d'emprisonnement de plus d'un an, soit le prévenu constitue un danger pour la sécurité publique.

§ 3. Le mandat d'arrêt précise les faits qui en motivent la délivrance et la qualification légale.

Artikel 16bis. In afwachting van de beslissing van de raadkamer over de verlenging van de aanhouding, kan de onderzoeksrechter de gevangenhouding met ten hoogste vijftien dagen verlengen.

§ 2. En attendant la décision de la chambre du conseil sur la prolongation de la détention, le juge d'instruction peut prolonger la détention provisoire pour une durée maximale de quinze jours.

Why This Confuses the LLM

Belgian legislation is officially bilingual (French/Dutch), and consolidated texts often alternate languages at the article or paragraph level:

Line	Language	Structure
`Artikel 16 § 1.`	Dutch	Article 16, § 1
`§ 2.`	French	Same article, § 2 — language switched!
`§ 3.`	French	Same article, § 3
`Artikel 16bis.`	Dutch	Inserted article (Latin suffix)
`§ 2.`	French	Same article's § 2 — language switched again!

The LLM faces:

Numbering format shift: Artikel 16 (Dutch) vs. implicit Article 16 (French) — same concept, different token
§ symbol parsing: § 1 / § 2 / § 3 are paragraph markers — but which CountryProfile level do they map to? Belgium may not have a profile yet
Insertion pattern: Artikel 16bis uses the Latin suffix pattern (bis) — same as French Article 16 bis, but concatenated without space in Dutch convention
Duplicate content: § 1 (Dutch) and § 2 (French) are the same provision in two languages — the LLM may extract them as two separate citations
Language-agnostic structural markers: § 2 appears under both Artikel 16 and Artikel 16bis, each time in French after a Dutch heading, so the marker alone identifies neither the article nor the language

Correct Extraction

If the system is configured for French (BE-FR):

relation	reference_id	section_name
`direct`	`§ 2` (first occurrence, in Article 16)	Conditions d'arrestation
`direct`	`§ 2` (second occurrence, in Article 16bis)	Prolongation de détention
`indirect`	`Artikel 16`	Mandat d'arrêt (version néerlandaise)
`indirect`	`Artikel 16bis`	Prolongation de détention (version néerlandaise)

The Dutch Artikel 16 § 1 and French § 2 are the same provision in two languages. The system should ideally extract only the language-relevant version.

Likely LLM Failure

Duplicate extraction: § 1 (Dutch) and § 2 (French) of Article 16 extracted as if they were separate provisions, although they are translations of each other
reference_id collision: § 2 appears twice in the chunk (once under Artikel 16, once under Artikel 16bis), and a bare reference_id of § 2 cannot indicate which occurrence is meant
Language detection failure: No CountryProfile exists for Belgium (BE.yaml is absent from the repo), so the system has no guidance on bilingual handling
Artikel vs Article: The Dutch Artikel won't match French-level few-shot patterns; the LLM may ignore it entirely
bis spacing: Artikel 16bis (no space) vs. Article 16 bis (with space) — reference_id must be verbatim, so if the LLM produces Article 16 bis but the chunk has Artikel 16bis, the substring check fails
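
The collision and spacing failures above can both be detected by counting where a reference_id occurs. This is a minimal sketch with hypothetical helper names; the chunk is a shortened paraphrase of the Belgian text, used for illustration only.

```python
def find_occurrences(reference_id: str, chunk: str) -> list[int]:
    """Return the start offset of every occurrence of reference_id."""
    positions = []
    start = chunk.find(reference_id)
    while start != -1:
        positions.append(start)
        start = chunk.find(reference_id, start + 1)
    return positions


def match_status(reference_id: str, chunk: str) -> str:
    """Classify a reference_id as 'missing', 'unique', or 'ambiguous'."""
    count = len(find_occurrences(reference_id, chunk))
    if count == 0:
        return "missing"
    return "unique" if count == 1 else "ambiguous"


# Shortened stand-in for the chunk above (illustration only).
chunk = (
    "Artikel 16 § 1. De onderzoeksrechter kan de aanhouding bevelen.\n"
    "§ 2. Le juge d'instruction peut ordonner l'arrestation d'un prévenu.\n"
    "§ 3. Le mandat d'arrêt précise les faits qui en motivent la délivrance "
    "et la qualification légale.\n"
    "Artikel 16bis. De onderzoeksrechter kan de gevangenhouding verlengen.\n"
    "§ 2. Le juge d'instruction peut prolonger la détention provisoire.\n"
)
statuses = {
    ref: match_status(ref, chunk)
    for ref in ("§ 2", "Artikel 16", "Artikel 16bis", "Article 16 bis")
}
```

Note that Artikel 16 is itself ambiguous under a plain substring check, because it is also a prefix of Artikel 16bis. Carrying a character offset alongside reference_id would be one way to disambiguate, though that is a schema change the report does not evaluate.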

Robustness Rating

Overall: Fragile — would fail on edge cases

Dimension	Rating	Evidence
Ambiguous numbering	Fragile	The system relies on the LLM to disambiguate structural numbers from back-references in prose. No post-processing validates that a `reference_id` like `2°` isn't a back-reference to an already-extracted point. The few-shot examples show clean cases; adversarial ones with repeated numbers are untested.
Cross-reference confusion	Fragile	The user message template (`[Question]\n{query}\n\n[Document Chunk]\n{chunk}`) carries no source document metadata. The LLM cannot distinguish "this law's §34" from "that law's §34" without knowing which law the chunk belongs to. CountryProfile `cross_references` sections document the pattern but don't help the LLM resolve it.
Multi-document chunks	Brittle	No chunk boundary detection exists. The system assumes each chunk is from a single document. When chunking (e.g., PDF extraction) concatenates unrelated articles, the LLM has no signal to detect the boundary. This is a structural failure, not just an LLM reasoning failure.
Amendment text	Fragile	Quoted/amended text uses non-standard formatting (`" Art. L. 1312-1.`), so `reference_id` values are sensitive to how the LLM reproduces the leading quote and spacing. The bare `Art. L. 1312-1` is still a verbatim substring, but the check fails if the LLM keeps the quote and drops the space after it or expands `Art.` to `Article`. The meta-legislative structure (amending law inserting into another code) has no representation in the CountryProfile hierarchy.
Code-switching	Brittle	Bilingual legal systems (Belgium, Canada, Switzerland, Luxembourg, South Africa) have no CountryProfile support for dual-language handling. The system treats each chunk as monolingual. When a chunk switches language mid-article, the LLM will either ignore the non-configured language or produce duplicate extractions.

Summary Table

Test Case	Failure Mode	Severity	Likelihood
Ambiguous numbering (FR)	Duplicate extraction of back-references	Medium	High
Cross-reference confusion (KR)	External citations extracted as direct	High	High
Multi-document chunk (FR)	False coherence between unrelated articles	High	Medium
Amendment text (FR)	Level inversion; quoted text reference_id mismatch	High	High
Code-switching (BE)	Duplicate extraction; reference_id collision	Critical	High

Systemic Findings

Finding 1: No Source Document Context in User Message

The user message template is:

[Question]
{query}

[Document Chunk]
{chunk}

There is no field for the source document name, law title, or country code, so the LLM must infer the document's identity from the chunk content alone. Cross-reference disambiguation (Test Case 2) and multi-document detection (Test Case 3) then come down to guesswork.

Recommendation: Add {source} or {document_title} to the user message template.
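
A minimal sketch of the extended template, assuming the new field is called source and is rendered as its own [Source] block ahead of the question. The block label, the default value, and the function name build_user_message are illustrative choices, not existing pipeline code.

```python
# Assumption: the source label is placed before the question in its own block.
USER_TEMPLATE = (
    "[Source]\n{source}\n\n"
    "[Question]\n{query}\n\n"
    "[Document Chunk]\n{chunk}"
)


def build_user_message(query: str, chunk: str, source: str = "unknown") -> str:
    """Render the user message with an explicit source document label."""
    return USER_TEMPLATE.format(source=source, query=query, chunk=chunk)


message = build_user_message(
    query="개인정보 유출 시 통지 의무와 처벌 규정은 어떻게 되나요?",
    chunk="제34조의2(유출 등의 통지) ① 개인정보처리자는 ...",
    source="개인정보 보호법",
)
```

With the law title in the prompt, the LLM has at least one explicit signal that 제34조 under another statute's title is not part of the current document.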

Finding 2: reference_id Verbatim Constraint vs. Quoted/Amended Text

The reference_id must be a verbatim substring of the chunk. This works for clean legislative text but is fragile for amendment text, where structural markers are wrapped in quotation marks (" Art. L. 1312-1.). The bare marker Art. L. 1312-1 still passes the check, but a variant that reproduces the quote with different spacing ("Art. L. 1312-1) or expands Art. to Article fails validation.

Recommendation: Allow reference_id normalization that strips leading/trailing punctuation and quotation marks during validation.
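
The recommended normalization could look like the sketch below, which strips quotes, whitespace, and edge punctuation from the reference_id before the substring test. The character set and both function names are assumptions, not the validator currently in use.

```python
# Assumption: these characters carry no structural meaning at the edges
# of a reference_id and can be stripped before the substring test.
STRIP_CHARS = "\"'«»“”‘’ \t.,;:"


def strict_check(reference_id: str, chunk: str) -> bool:
    """Current rule as described: reference_id must be a verbatim substring."""
    return reference_id in chunk


def normalized_check(reference_id: str, chunk: str) -> bool:
    """Relaxed rule: compare after stripping quotes and edge punctuation."""
    core = reference_id.strip(STRIP_CHARS)
    return bool(core) and core in chunk


# Line taken from the amendment chunk above.
chunk = (
    '" Art. L. 1312-2. - Les représentants du personnel sont informés et '
    "consultés sur les procédures de recueil des signalements établies par "
    "l'employeur.\""
)
results = {
    "strict_bare": strict_check("Art. L. 1312-2", chunk),
    "strict_quote_no_space": strict_check('"Art. L. 1312-2', chunk),
    "normalized_quote_no_space": normalized_check('"Art. L. 1312-2.', chunk),
    "normalized_expanded": normalized_check("Article L. 1312-2", chunk),
}
```

Stripping edge punctuation has a cost: a short marker such as 1. collapses to 1 and matches far more loosely, so a production rule would probably need a minimum length or a per-level exception.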

Finding 3: No Chunk Boundary Metadata

The distillation pipeline processes chunks independently. When a chunk spans multiple documents (PDF artifact, amendment with embedded text), there's no mechanism to detect or handle the boundary. Each chunk is assumed to be a coherent unit from a single document.

Recommendation: Add optional source_document metadata per chunk, or implement a chunk-boundary heuristic in the chunker.

Finding 4: Monolingual Assumption

The entire system (CountryProfiles, few-shots, level hierarchies) assumes one language per chunk. Bilingual legal systems produce chunks where structural markers switch language within a single article. The system will then do one of the following:

Extract only the configured-language markers (missing half the legal structure)
Produce duplicate extractions for the same provision in two languages
Fail the reference_id substring check when the LLM normalizes Artikel to Article

Recommendation: Add a secondary_language field to CountryProfile, with language-specific label aliases per level. For bilingual chunks, either split by language before extraction or configure the prompt to handle both.
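
To illustrate the label-alias idea, the sketch below maps surface labels in two languages to one canonical level. The real CountryProfile schema is not shown in this report, so the LevelLabels class, the BE_LEVELS data, and identify_marker are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LevelLabels:
    level: str                     # canonical level name, e.g. "article"
    labels: dict[str, list[str]]   # language code -> surface labels


# Hypothetical Belgian profile fragment (no BE profile exists in the repo).
BE_LEVELS = [
    LevelLabels("article", {"fr": ["Article"], "nl": ["Artikel"]}),
    LevelLabels("paragraph", {"fr": ["§"], "nl": ["§"]}),
]


def identify_marker(
    marker: str, levels: list[LevelLabels]
) -> Optional[tuple[str, list[str]]]:
    """Return (level, languages) for the first level whose label opens the marker."""
    for entry in levels:
        languages = sorted(
            lang
            for lang, labels in entry.labels.items()
            if any(marker.startswith(label) for label in labels)
        )
        if languages:
            return entry.level, languages
    return None


dutch = identify_marker("Artikel 16bis", BE_LEVELS)
french = identify_marker("Article 16 bis", BE_LEVELS)
shared = identify_marker("§ 2", BE_LEVELS)
```

Markers such as § that are shared by both languages come back with two language codes, which signals that the language has to be decided from the surrounding text rather than from the marker.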

Finding 5: No Anaphoric Reference Resolution

Legal text is full of anaphoric references: ladite loi (the aforementioned law), the same section, 위 법률 (the above law). The LLM must resolve these to their antecedents, but the extraction schema has no field for resolved references. The reference_id captures the mention, not the meaning.

Recommendation: Add an optional resolved_to field in the extraction schema for anaphoric/cross-references, or exclude anaphoric mentions from extraction entirely.
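
A sketch of the schema extension, assuming resolved_to is an optional string next to the three fields used in the tables above. The resolver is a deliberately naive stand-in that handles only ladite loi by picking the nearest preceding "loi n° ..." mention; the section_name value is illustrative.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: statutes are mentioned in the form "loi n° YYYY-NNNN".
LAW_MENTION = re.compile(r"loi\s+n°\s*\d{4}-\d+")


@dataclass
class Citation:
    relation: str
    reference_id: str
    section_name: str
    resolved_to: Optional[str] = None  # proposed field, not in the current schema


def resolve_anaphor(chunk: str, anaphor: str = "ladite loi") -> Optional[str]:
    """Return the last law mention that precedes the first anaphor, if any."""
    position = chunk.find(anaphor)
    if position == -1:
        return None
    antecedent = None
    for match in LAW_MENTION.finditer(chunk, 0, position):
        antecedent = match.group()
    return antecedent


# Sentence taken from Art. L. 1312-1 in Test Case 4.
chunk = (
    "Un lanceur d'alerte au sens de l'article 6 de la loi n° 2016-1691 du "
    "9 décembre 2016 relative à la transparence, à la lutte contre la corruption "
    "et à la modernisation de la vie économique bénéficie, dans les conditions "
    "prévues par ladite loi, de la protection contre les mesures de représailles "
    "mentionnées à l'article 12 de ladite loi."
)
citation = Citation(
    relation="indirect",
    reference_id="article 12 de ladite loi",
    section_name="Mesures de représailles",  # illustrative label
    resolved_to=resolve_anaphor(chunk),
)
```

The reference_id stays a verbatim substring of the chunk, while resolved_to carries the meaning. Nearest-antecedent matching will misfire when several laws are named before the anaphor, so it should be read as a placeholder for a real resolver.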

End of adversarial test report.
