IXP - Best practices

General recommendations for your taxonomy

Clarity and simplicity - Use clear, direct, and unambiguous language. Avoid overcomplicating instructions that could confuse the model. Use plain language and keep sentences short.
Consistency - Maintain consistent terminology across fields, field groups, and instructions to avoid confusion.
Provide context - Give the model the context it needs to understand the general scope of the task, such as industry information, document type, or overall data format. The more relevant context the prompt provides, the more likely the model is to predict the field correctly and consistently.
Iterate - Refining prompts is an iterative process: write a prompt, test it, and edit it, then repeat until you get the desired extraction. Keep a record of your drafts and their results, as it can provide valuable insights for future adjustments and improvements.
Avoid negative instructions - Phrase instructions positively. For example, instead of: do not leave out any sections of the document, write: ensure all key sections of the document, such as x, y, z, are covered.
Avoid repetitive language - Repetitive language can lead to redundancy, confusion, and unclear instructions for the model.
Watch out for contradictory information - Make sure that your project, field group, and field-level instructions do not contradict one another in terms of the information to extract, the format of the extraction, and where the information can be found. Contradictions confuse the model and lead to inconsistent results.
Example reinforcement - Whenever possible, reinforce the prompt instruction with examples of correct responses. These instances can guide the model towards the expected outcome.

Figure 1. Taxonomy example

Project (overall extraction) level

Best practice	Details	Importance	Correct example	Incorrect example
Define the industry and the document type	Briefly describe the industry and the document type from which information is being extracted. Then, specify key characteristics and the expected structure of the document type to guide the extraction.	This provides important context for the data extraction process.	Instruction: Extract information from a brokerage statement, which is commonly found in the Financial Services Industry. Brokerage statements typically consist of a few sections: account overview, account summary, account holdings, and account transaction activity.	Instruction: Extract the fields below from the document. Explanation: This project instruction example does not benefit the model. It does not provide any important context or key characteristics that would help guide the model.
Specify if you expect multiple occurrences of the document within one file.	Indicate if the document contains multiple instances of identical data, and provide guidance for each extraction instance. In use cases that may have multiple documents within a single file, identify a unique identifier and include it as a field in each field group.	This will facilitate post-processing, allowing for more efficient automation.	Instruction: There may be multiple brokerage accounts within a single document file. A brokerage account can be identified via a unique account number field present in each field group. Extract the account information, account holdings, and account activity field groups for each account.	Instruction: Extract all instances of data from each account document. Explanation: This instruction example is poor as it fails to specify how to determine if there are multiple occurrences of a document type within the file.
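
As a rough illustration of why a shared key field helps post-processing, the sketch below regroups extracted rows by account number. The flat row layout and the `field_group` key are assumptions made for this example only; they do not represent the actual export format of the product.

```python
from collections import defaultdict

# Hypothetical flat extraction output: one dict per extracted row.
# The "field_group" key and the row layout are assumptions for illustration.
rows = [
    {"field_group": "Account Information", "Account Number": "A-100", "Account Owner": "J. Doe"},
    {"field_group": "Account Holdings", "Account Number": "A-100", "Equity Name": "ACME"},
    {"field_group": "Account Information", "Account Number": "B-200", "Account Owner": "R. Roe"},
    {"field_group": "Account Holdings", "Account Number": "B-200", "Equity Name": "Globex"},
    {"field_group": "Account Holdings", "Account Number": "A-100", "Equity Name": "Initech"},
]


def split_by_key(rows, key_field):
    """Regroup extracted rows into one record per key value."""
    accounts = defaultdict(lambda: defaultdict(list))
    for row in rows:
        key = row[key_field]
        # Keep every field except the bookkeeping ones.
        data = {k: v for k, v in row.items() if k not in ("field_group", key_field)}
        accounts[key][row["field_group"]].append(data)
    return {key: dict(groups) for key, groups in accounts.items()}


accounts = split_by_key(rows, "Account Number")
for number, groups in accounts.items():
    print(number, {name: len(items) for name, items in groups.items()})
```

Because every field group carries the same key, rows from interleaved accounts can be separated without relying on their order in the file.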

Field group-level

Best practice	Details	Importance	Correct example	Incorrect example
Group similar data points that you want to be extracted together into field groups.	Organize related fields into logical groups.	This helps to streamline extraction and minimize errors.	The name, address, and marital status of the account owner can all be grouped under an Account Owner Information field group.	Field group: Account Information Fields: Account Holdings, Transaction Date, Account Owner Explanation: This grouping might work in a situation where a user only wants to extract those three fields. However, if there are other fields like the holding ticker symbol and cost basis, the design or structure of this group will not be the most effective.
Field group context	Explain how each field group contributes to the overall meaning and purpose of the document.	This helps the model understand the context of the extraction.	Instruction: This section outlines key brokerage statement account holding details, including the equity name, purchase date, quantity purchased, cost basis, and total price paid. These details help determine the current holdings in a brokerage statement.	Instruction: Extract the fields below from the document. Explanation: The prompt instructions lack context and detailed instructions for the model. It neither explains the type of information that requires extraction nor highlights its importance.
Leverage the location and structure of information in the document within your field group prompts	Indicate likely locations for the data of each field, for example, table, header, body, to guide extraction. Note: If you are working on a document where information appears in the same section, state the section in the prompt.	This helps the model focus on the correct part of the document for each field.	Instruction: The field-level data for this section will most likely be found in the header of the report on the first page under the document title.	Instruction: Extract the information from the beginning of the document. Explanation: The prompt is vague and does not provide the model with enough detail on where specifically to look within the document.
Model tables using field groups with fields	Treat a field group as a table, with each column acting as a unique field within that group. This approach is key to effective data modeling as it ensures clear differentiation, minimizes data duplication, and increases data consistency.	This method enables a logically structured and systematic arrangement of data, which subsequently leads to enhanced efficiency during data queries and analysis.	Field group: Customers Fields: Name, Address, Phone Number	Field groups: Customer Name, Customer Address, Customer Phone Number Fields: Name, Address, Phone Number Explanation: This example unnecessarily separates each customer detail into its own field group, making data management complex and prone to inconsistencies.
Create parent and child field groups	Relationships are denoted with a greater-than `>` sign. A parent field group can have multiple child field groups.	Leveraging field groups to show relationships between data within the documents is a great way of maintaining hierarchical data organization.	Field group: Brokerage Statement Fields: Account Owner, Account Type Field group: Brokerage Statement > Asset Allocation Fields: Asset Type (for example, Stocks, Bonds, Cash), Percentage of Total Assets Field group: Brokerage Statement > Investments Fields: Investment Name, Quantity Owned, Price per Share, Total Value of Investment	Field group: Account Owner Fields: Name, Investment Name, Type of Account, Number of Shares, Stocks, Bonds Field group: Account Owner > Address Fields: Street, City, State, ZIP Code Field group: Account Owner > Contact Info Fields: Phone Number, Email Explanation: This is a poorly structured hierarchy because it combines unrelated fields under the same parent, and the child field groups (Address and Contact Info) do not logically relate to the fields of the parent (Investment Name, Number of Shares, Stocks, Bonds). This could confuse the AI model as it does not reflect the natural organization of the data within the document.
Use a key field for files that contain multiple documents within them	Select a unique identifier in the document that will allow you to differentiate the data. Include this field in every field group. You do not need to alter the instruction for this field from one field group to another.	Including this key field allows for the separation of information within the document and removes confusion when processing the extracted data.	Field: Account Number, Social Security Number, Policy Number	Field: Date, Name Explanation: The field names listed would not make good key fields as they are not unique. Dates and names can both be repeated.
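
The parent and child naming convention can be pictured as a small tree. The following is a minimal sketch, assuming field group names are available as plain strings; `build_hierarchy` is a hypothetical helper written for this example, not a product function.

```python
# Field group names taken from the brokerage statement example,
# using ">" to denote the parent > child relationship.
field_groups = {
    "Brokerage Statement": ["Account Owner", "Account Type"],
    "Brokerage Statement > Asset Allocation": ["Asset Type", "Percentage of Total Assets"],
    "Brokerage Statement > Investments": [
        "Investment Name",
        "Quantity Owned",
        "Price per Share",
        "Total Value of Investment",
    ],
}


def build_hierarchy(field_groups):
    """Nest field groups by splitting their names on the '>' separator."""
    tree = {}
    for name, fields in field_groups.items():
        node = tree
        for part in (p.strip() for p in name.split(">")):
            # Create the level if it does not exist yet, then descend into it.
            current = node.setdefault(part, {"fields": [], "children": {}})
            node = current["children"]
        current["fields"] = list(fields)
    return tree


tree = build_hierarchy(field_groups)
print(sorted(tree["Brokerage Statement"]["children"]))
```

Laying the taxonomy out this way makes it easy to check that each child only holds fields that belong under its parent.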

Field-level

Best practice	Details	Importance	Correct example	Incorrect example
Pick field names carefully	Choose clear, recognizable names for fields that align with the expectations of the user. If there is a universal name that is used in all document variations, make sure to include it.	Precise field names ensure accurate extraction and reduce ambiguity.	Field: Date of Accident	Field: Date Explanation: Date is a generic term and does not provide any context about what the date refers to. This can lead to inaccurate data extraction, as the AI model might pick up any date that appears in the document.
Be explicit and detailed with instructions	Kickstart the model by explicitly stating what you want the model to extract. Specify the exact format and structure of the data to be extracted.	Clear, detailed prompts guide the model to extract exactly what you need, in the format you expect.	Instruction: Extract the list of all the advisors from the document, format them into a comma-separated list, and arrange them in alphabetical order.	Instruction: Get all of the advisors Explanation: The prompt is vague and does not provide the model with clear instructions about the desired outcome and how it should be formatted. This can lead to inconsistencies in the extracted information, making it more difficult to process the results.
Provide examples within the instructions	Provide example inputs and corresponding expected outputs to clarify the expected outcomes.	This helps the model understand exactly what you are looking for.	Instruction: Extract the transaction dates from the document. The dates should be in `MM/DD/YYYY` format. For example, if the document states that the transaction was completed on January 1, 2021, the extracted date should be 01/01/2021. If the transaction date is stated in the `MM/YYYY` format then extract it as the first day of that month. For example, if the date is presented as 05/2021, extract it as 05/01/2021.	Instruction: Get the transaction dates from the document. Explanation: The prompt above is not as effective because it does not provide explicit instructions on how to handle different date formats found in the document. This lack of clarity can lead to inconsistent extraction of dates, making the task of interpreting and analyzing data more complicated.
Stick to one main idea per field instruction	Avoid overloading the prompt by trying to extract large, sequential amounts of data in a single field to improve accuracy. Each field level should focus on extracting one piece of data.	This will also make post-processing easier.	Field 1: Extract the Account Number. Field 2: Extract the Transaction Date. Field 3: Extract the Account Balance.	Instruction: Extract the account number, transaction date, and account balance together. Explanation: The prompt is overloaded with multiple instructions directing the model to extract different types of data simultaneously. This approach could create messy extraction outcomes and make post-processing difficult.
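
The formatting rules in the correct examples above are precise enough to express as code, which can be handy for checking extracted values during testing. This is only a sketch: the function names are hypothetical, and the month-name parsing assumes an English locale.

```python
from datetime import datetime


def expected_transaction_date(raw):
    """Return the MM/DD/YYYY value that the example instruction asks for."""
    for fmt in ("%B %d, %Y", "%m/%d/%Y"):
        try:
            return datetime.strptime(raw, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    # MM/YYYY input: strptime fills in the first day of the month.
    return datetime.strptime(raw, "%m/%Y").strftime("%m/%d/%Y")


def expected_advisor_list(names):
    """Return the advisors as a comma-separated, alphabetically ordered list."""
    return ", ".join(sorted(names, key=str.casefold))


print(expected_transaction_date("January 1, 2021"))
print(expected_transaction_date("05/2021"))
print(expected_advisor_list(["Zed Young", "Amy Adams", "Bob Brown"]))
```

If a rule cannot be written down this unambiguously, the instruction probably needs more detail or another example.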

Field type-level

Best practice	Details	Importance	Correct example	Incorrect example
Choose data types with purpose	Consider how you want the extracted data formatted and ensure it aligns with downstream use cases to optimize extraction for automation. Date - use this to represent dates in text. Dates will be normalized as UTC with a `YYYY-MM-DD HH:MM:SS` format. Exact Text - use this for text that appears verbatim in the document. Inferred Text - use this for text that might not appear verbatim in the document, but can be inferred from other identifiers present in it. Monetary Quantity - use this to represent monetary values in text. Monetary Quantities are normalized in the following example formats: `$00.00` or `00.00 USD`. Number - use this to represent amounts or quantities in text. Numbers are inferred from the document; users can also input values and, optionally, annotate evidence. The value will be formatted as a decimal value, `00.00`.	Selecting the appropriate data type enables accurate formatting and easier downstream processing.	Field name: Transaction Volume Data type: Number	Field name: Phone Number Data type: Number Explanation: Using the Number data type for a phone number is not beneficial. Although a phone number is composed of digits, it is not a numerical value, meaning that you do not perform arithmetic with it; it is better described as a string of digits. Therefore, using an Exact Text data type would be the appropriate choice.
Only include field type-specific instructions in the field type.	When providing instructions for data extraction, it is crucial to keep them specific to each field type. If there are general instructions that apply to all fields of a certain type, a user can provide them at the field type level to avoid repetition. For example, if all Monetary Quantity fields need to be in USD, specify this at the field type level. However, some datasets may require unique fields not covered by existing field types (Date, Text, Monetary Quantity, and so on). In these cases, you can create a new, customized field type. When writing instructions for these new fields, specify how the data should be formatted to ensure the extracted data meets its intended purpose. These practices enhance the precision and consistency of your extracted data.	Keeping instructions specific to each field type ensures clarity, reusability, and consistency in data extraction.	Field type: Date Instruction: Extract all the dates associated with transactions from the document. Dates should be normalized to the format `YYYY-MM-DD`.	Field type: Monetary Quantity Instruction: Extract the item price from the Price column under the invoice line items table. Explanation: The instruction is relevant specifically to extracting a Monetary Quantity from a certain field (the Price column), not to any other Monetary Quantity-based field.
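
The phone number example can be made concrete with ordinary numeric parsing. The sketch below uses Python's `Decimal` purely as a stand-in for a numeric type; it is not a description of how the product normalizes Number fields, and the phone number is made up.

```python
from decimal import Decimal

# A made-up phone number with an international "00" prefix.
phone_as_text = "0044 20 7946 0958"
digits = phone_as_text.replace(" ", "")

# Treating the digits as a number silently drops the leading zeros.
phone_as_number = Decimal(digits)
print(digits)
print(phone_as_number)

# A genuine quantity, by contrast, benefits from being numeric.
transaction_volume = Decimal("1250.00")
print(transaction_volume * 2)
```

The numeric version no longer matches what appears in the document, which is why a text type is the safer choice for identifiers made of digits.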

Examples of fields and field types

Signatures

When your documents include signatures, make sure you apply the following best practices:

Use a Boolean data type for a "Signed by X?" field (that is, "Is it signed by this individual?"), as well as a text field for the name of the individual, which is usually printed.
If you can typically find signatures in a table or table-like format, use the Table model pre-processing option.
Failures are most common in a document with multiple signatories, including both the named individual in the document and their witness.
Be clear and descriptive about the following:
- What constitutes a signature?
- What does not constitute a signature?
- Who needs to sign the document?
- How to detect the person who needs to sign the document?
Account for potential failure cases in your documents and include them in the instructions, as described in the following example:

Instruction example for a Signed by Signatory field

Determine whether or not the signatory, not the witness, signed the document.

Only return true if the document is signed by this signatory. Return false if it is not signed by them.

Signatures may not look like the printed name, so just look for a signature or handwritten signature-like addition to the document in the space for the signature near the specific signatory's name.

If a name is amended with a handwritten addition, do not treat this as a signature. Only count explicit signatures.

Signatures will generally be close to the words "Signed By" or a variation such as "Signed as a deed", "In the presence of", and so on.

A dotted line does not constitute a signature.

For a general Signatures field group that combines a Signed field with the Name or Position of the signatory, or both, you can add the following to the instructions: Ensure you are attributing signatures to the correct individual.
An example instruction for a broader Signatures field group is the following:

Instruction example for a Signatures field group, which contains fields relating to signatories

Information on the people signing and the status of the document.

If there are multiple signature blocks and multiple people present in the document, extract all of them.

There may not be an explicit signature block. Agreements and letters may have been signed by the person sending them, with a signature block for the person accepting. In this case, extract both sets of signatures.

Note: If performance is still not satisfactory, even after you have made persistent efforts to improve it through instruction tuning, contact your Account Manager. They can check whether any preview processing features that could help are available in your region.

Regional differences

Monetary quantities and comma separators

An example of a regional difference that can require prompting to correct the default LLM behavior is the use of commas as decimal separators in certain countries, such as Germany.

The following example for a German receipts use case shows how you can account for the presence of values in an unexpected format:

Instruction example

You are extracting data from German receipts. Monetary amounts are all in €, although the € sign might be missing. "," is the typical decimal separator for all numbers, while "." is used as the thousands separator in larger values.

To determine if this format is being used, check for a comma as the final separator in the value. If not, the number is likely in the alternative format, which uses "," as the thousands separator and "." for decimal places.

Amounts typically have two decimal digits (e.g. 8,58 is 8.58€ and 9.115,00 is 9115.00€). Expect single line items in grocery receipts to be below 100€.
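
The "final separator" rule from this instruction can also be applied when validating or post-processing extracted amounts. The following is a minimal sketch of that heuristic; `parse_amount` is a hypothetical helper. A value with a single separator and no decimals, such as 9.115, remains ambiguous and needs extra context, for example the hint that single line items stay below 100€.

```python
from decimal import Decimal


def parse_amount(text):
    """Parse an amount, using the last separator to infer the decimal mark."""
    cleaned = text.replace("€", "").strip()
    last_comma = cleaned.rfind(",")
    last_dot = cleaned.rfind(".")
    if last_comma > last_dot:
        # German style: "." groups thousands, "," marks the decimals.
        normalized = cleaned.replace(".", "").replace(",", ".")
    else:
        # Alternative style: "," groups thousands, "." marks the decimals.
        normalized = cleaned.replace(",", "")
    return Decimal(normalized)


for raw in ("8,58", "9.115,00 €", "9,115.00"):
    print(raw, "->", parse_amount(raw))
```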

Testing and iterating

Create a field for each piece of information you want extracted, but do not include any instructions yet.
Select a sample of 2 to 3 documents and run predictions on each one. These documents should reflect the variation present in the documents that you are building the model for.
Compare the extractions of the model to what you expected. For the fields that did not perform well, draft a prompt using the best practices listed earlier. This serves as your baseline prompt.
Rerun the predictions using the same 2 to 3 sample documents you tested earlier and check whether the extraction performance has improved.
If the predictions are incorrect or incomplete, refine the prompts, adding the details needed to improve the extraction performance of the model. If the predictions align with your expectations, widen your sample of documents gradually: move from 2 to 3 documents to 10, then to 20, 30, and so on. Continue until you are confident that the predictions of the model are correct.
If the instructions have changed, reevaluate previously viewed documents to ensure predictions remain accurate.
Once you are satisfied with the performance of the model, revisit the first document and start annotating. Annotate at least 10 documents to gain valuable field performance metrics through the Measure tab. This feature allows you to evaluate the extraction performance at both the overall project and field levels.
Monitor performance metrics to inform your large-scale prompt refinement. Prompt iteration should occur primarily at the field level, where adjustments have a targeted, direct impact on the specific fields that are not performing well. If an entire field group is not performing well, adjusting your project and field group instructions may be more impactful, as they affect several fields.
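
During the early rounds, before annotating, a simple per-field comparison against hand-checked values can show which fields need prompt work. The sketch below uses exact-match accuracy on made-up sample data; it is an illustration only and is not how the Measure tab computes its metrics.

```python
def field_accuracy(expected_docs, predicted_docs):
    """Return the share of documents where each field matched exactly."""
    totals, hits = {}, {}
    for expected, predicted in zip(expected_docs, predicted_docs):
        for field, value in expected.items():
            totals[field] = totals.get(field, 0) + 1
            if predicted.get(field) == value:
                hits[field] = hits.get(field, 0) + 1
    return {field: hits.get(field, 0) / totals[field] for field in totals}


# Made-up values for a sample of three documents.
expected_docs = [
    {"Account Number": "A-100", "Transaction Date": "01/01/2021"},
    {"Account Number": "B-200", "Transaction Date": "05/01/2021"},
    {"Account Number": "C-300", "Transaction Date": "07/15/2021"},
]
predicted_docs = [
    {"Account Number": "A-100", "Transaction Date": "01/01/2021"},
    {"Account Number": "B-200", "Transaction Date": "05/2021"},
    {"Account Number": "C-300", "Transaction Date": "07/15/2021"},
]

scores = field_accuracy(expected_docs, predicted_docs)
# Fields below a perfect score are candidates for field-level prompt refinement.
needs_work = [field for field, score in scores.items() if score < 1.0]
print(scores)
print(needs_work)
```

Rerunning the same comparison after each prompt change makes it clear whether an edit helped the targeted field or caused a regression elsewhere.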
