Unstructured and complex documents user guide
This section contains best practices on how to write good prompt instructions at the project (that is, overall extraction) level, the field group level, and the individual field level.
- Clarity and simplicity - Use clear, direct, and unambiguous language. Avoid overcomplicating instructions that could confuse the model. Use plain language and keep sentences short.
- Consistency - Maintain consistent terminology across fields, field groups, and instructions to avoid confusion.
- Provide context - Give the model the context it needs to understand the general scope of the task, such as industry information, document type, or overall data format. The more relevant context the prompt provides, the more likely the model is to predict the field correctly and consistently.
- Iterate - As refining prompts is an iterative process, maintaining a record of your drafts and their corresponding results can provide valuable insights for future adjustments and improvements. Write a prompt, test, and edit. Repeat this process until you get your desired extraction.
- Avoid negative instructions - State what the model should do rather than what it should not do. For example, instead of "Do not leave out any sections of the document", write "Ensure all key sections of the document, such as X, Y, and Z, are covered".
- Avoid repetitive language - Repetitive language can lead to redundancy, confusion, and unclear instructions for the model.
- Watch out for contradictory information - Make sure that your project, field group, and field-level instructions do not contradict one another in terms of the information to extract, the format of the extraction, and where the information can be found. Contradictions confuse the model and lead to inconsistent results.
- Example reinforcement - Whenever possible, reinforce the prompt instruction with examples of correct responses. These instances can guide the model towards the expected outcome.
| Best practice | Details | Importance | Correct example | Incorrect example |
|---|---|---|---|---|
| Define the industry and the document type | Briefly describe the industry and the document type from which information is being extracted. Then, specify key characteristics and the expected structure of the document type to guide the extraction. | This provides important context for the data extraction process. | Instruction: Extract information from a brokerage statement, which is commonly found in the Financial Services Industry. Brokerage statements typically consist of a few sections: account overview, account summary, account holdings, and account transaction activity. | Instruction: Extract the fields below from the document. Explanation: This project instruction example does not benefit the model. It does not provide any important context or key characteristics that would help guide the model. |
| Specify if you expect multiple occurrences of the document within one file | Indicate if the document contains multiple instances of identical data, and provide guidance for each extraction instance. In use cases that may have multiple documents within a single file, identify a unique identifier and include it as a field in each field group. | This will facilitate post-processing, allowing for more efficient automation. | Instruction: There may be multiple brokerage accounts within a single document file. A brokerage account can be identified via a unique account number field present in each field group. Extract the account information, account holdings, and account activity field groups for each account. | Instruction: Extract all instances of data from each account document. Explanation: This instruction example is poor as it fails to specify how to determine if there are multiple occurrences of a document type within the file. |
| Best practice | Details | Importance | Correct example | Incorrect example |
|---|---|---|---|---|
| Group similar data points that you want to be extracted together into field groups | Organize related fields into logical groups. | This helps to streamline extraction and minimize errors. | The name, address, and marital status of the account owner can all be grouped under an Account Owner Information field group. | Field group: Account Information Fields: Account Holdings, Transaction Date, Account Owner Explanation: This grouping might work in a situation where a user only wants to extract those three fields. However, if there are other fields like the holding ticker symbol and cost basis, the design or structure of this group will not be the most effective. |
| Field group context | Explain how each field group contributes to the overall meaning and purpose of the document. | This helps the model understand the context of the extraction. | Instruction: This section outlines key brokerage statement account holding details, including the equity name, purchase date, quantity purchased, cost basis, and total price paid. These details help determine the current holdings in a brokerage statement. | Instruction: Extract the fields below from the document. Explanation: The prompt instructions lack context and detailed instructions for the model. It neither explains the type of information that requires extraction nor highlights its importance. |
| Leverage the location and structure of information in the document within your field group prompts | Indicate likely locations for the data of each field, for example, table, header, body, to guide extraction. Note: If you are working on a document where information appears in the same section, state the section in the prompt. | This helps the model focus on the correct part of the document for each field. | Instruction: The field-level data for this section will most likely be found in the header of the report on the first page under the document title. | Instruction: Extract the information from the beginning of the document. Explanation: The prompt is vague and does not provide the model with enough detail on where specifically to look within the document. |
| Model tables using field groups with fields | Treat a field group as a table, with each column acting as a unique field within that group. This approach is key to effective data modeling as it ensures clear differentiation, minimizes data duplication, and increases data consistency. | This method enables a logically structured and systematic arrangement of data, which subsequently leads to enhanced efficiency during data queries and analysis. | Field group: Customers Fields: Name, Address, Phone Number | Field groups: Customer Name, Customer Address, Customer Phone Number Fields: Name, Address, Phone Number Explanation: This example unnecessarily separates each customer detail into its own field group, making data management complex and prone to inconsistencies. |
| Create parent and child field groups | Relationships are denoted with a greater-than (>) sign. A parent field group can have multiple child field groups. | Leveraging field groups to show relationships between data within the documents is a great way of maintaining hierarchical data organization. | Field group: Brokerage Statement Fields: Account Owner, Account Type Field group: Brokerage Statement > Asset Allocation Fields: Asset Type (for example, Stocks, Bonds, Cash), Percentage of Total Assets Field group: Brokerage Statement > Investments Fields: Investment Name, Quantity Owned, Price per Share, Total Value of Investment | Field group: Account Owner Fields: Name, Investment Name, Type of Account, Number of Shares, Stocks, Bonds Field group: Account Owner > Address Fields: Street, City, State, ZIP Code Field group: Account Owner > Contact Info Fields: Phone Number, Email Explanation: This is a poorly structured hierarchy because it combines unrelated fields under the same parent, and the child field groups (Address and Contact Info) do not logically relate to the fields of the parent (Investment Name, Number of Shares, Stocks, Bonds). This could confuse the AI model as it does not reflect the natural organization of the data within the document. |
| Use a key field for files that contain multiple documents within them | Select a unique identifier in the document that will allow you to differentiate the data. Include this field in every field group. You do not need to alter the instruction for this field from one field group to another. | Including this key field allows for the separation of information within the document and removes confusion when processing the extracted data. | Field: Account Number, Social Security Number, Policy Number | Field: Date, Name Explanation: The field names listed would not make good key fields as they are not unique. Dates and names can both be repeated. |
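
A minimal Python sketch of how a key field can simplify post-processing, assuming the extraction results have been exported as a flat list of rows. The row layout, the `field_group` key, the sample values, and the `split_by_key` helper are all hypothetical and not part of the product; only the idea of repeating a unique Account Number in every field group comes from the table above.

```python
from collections import defaultdict

# Hypothetical export: one row per extracted field group instance.
# Every row carries the key field (Account Number), as recommended above.
rows = [
    {"field_group": "Account Information", "Account Number": "A-100", "Account Owner": "Jane Doe"},
    {"field_group": "Account Holdings", "Account Number": "A-100", "Equity Name": "ACME"},
    {"field_group": "Account Information", "Account Number": "B-200", "Account Owner": "John Roe"},
    {"field_group": "Account Holdings", "Account Number": "B-200", "Equity Name": "Globex"},
]


def split_by_key(rows, key_field="Account Number"):
    """Group rows into one entry per key value, then per field group."""
    documents = defaultdict(lambda: defaultdict(list))
    for row in rows:
        # Keep only the data fields; the key and group name become dictionary keys.
        fields = {k: v for k, v in row.items() if k not in ("field_group", key_field)}
        documents[row[key_field]][row["field_group"]].append(fields)
    # Convert to plain dictionaries so the result is easy to inspect or serialize.
    return {key: dict(groups) for key, groups in documents.items()}


accounts = split_by_key(rows)
```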
| Best practice | Details | Importance | Correct example | Incorrect example |
|---|---|---|---|---|
| Pick field names carefully | Choose clear, recognizable names for fields that align with the expectations of the user. If there is a universal name that is used in all document variations, make sure to include it. | Precise field names ensure accurate extraction and reduce ambiguity. | Field: Date of Accident | Field: Date Explanation: Date is a generic term and does not provide any context about what the date refers to. This can lead to inaccurate data extraction, as the AI model might pick up any date that appears in the document. |
| Be explicit and detailed with instructions | Kickstart the model by explicitly stating what you want the model to extract. Specify the exact format and structure of the data to be extracted. | Clear, detailed prompts guide the model to extract exactly what you need, in the format you expect. | Instruction: Extract the list of all the advisors from the document, format them into a comma-separated list, and arrange them in alphabetical order. | Instruction: Get all of the advisors Explanation: The prompt is vague and does not provide the model with clear instructions about the desired outcome and how it should be formatted. This can lead to inconsistencies in the extracted information, making it more difficult to process the results. |
| Provide examples within the instructions | Provide example inputs and corresponding expected outputs to clarify the expected outcomes. | This helps the model understand exactly what you are looking for. | Instruction: Extract the transaction dates from the document. The dates should be in MM/DD/YYYY format. For example, if the document states that the transaction was completed on January 1, 2021, the extracted date should be 01/01/2021. If the transaction date is stated in the MM/YYYY format then extract it as the first day of that month. For example, if the date is presented as 05/2021, extract it as 05/01/2021. | Instruction: Get the transaction dates from the document. Explanation: The prompt above is not as effective because it does not provide explicit instructions on how to handle different date formats found in the document. This lack of clarity can lead to inconsistent extraction of dates, making the task of interpreting and analyzing data more complicated. |
| Stick to one main idea per field instruction | To improve accuracy, avoid overloading the prompt by trying to extract large, sequential amounts of data in a single field. Each field should focus on extracting one piece of data. | This will also make post-processing easier. | Field 1: Extract the Account Number. Field 2: Extract the Transaction Date. Field 3: Extract the Account Balance. | Instruction: Extract the account number, transaction date, and account balance together. Explanation: The prompt is overloaded with multiple instructions directing the model to extract different types of data simultaneously. This approach could create messy extraction outcomes and make post-processing difficult. |
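
The date rule in the "Provide examples within the instructions" row can also be written as a small Python sketch, which may help when checking extracted values against the format the prompt asks for. The `normalize_transaction_date` helper is hypothetical, covers only the three input formats mentioned in that example, and assumes English month names.

```python
from datetime import datetime

# Input formats from the example instruction, tried in order.
# "%m/%Y" has no day, so strptime defaults to the first day of the month.
_INPUT_FORMATS = ("%B %d, %Y", "%m/%d/%Y", "%m/%Y")


def normalize_transaction_date(raw: str) -> str:
    """Return the date in MM/DD/YYYY format, as the instruction requires."""
    raw = raw.strip()
    for fmt in _INPUT_FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue  # Not this format; try the next one.
        return parsed.strftime("%m/%d/%Y")
    raise ValueError(f"Unrecognized date format: {raw!r}")


print(normalize_transaction_date("January 1, 2021"))  # 01/01/2021
print(normalize_transaction_date("05/2021"))          # 05/01/2021
```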
| Best practice | Details | Importance | Correct example | Incorrect example |
|---|---|---|---|---|
| Choose data types with purpose | Consider how you want the extracted data formatted and ensure it aligns with downstream use cases to optimize extraction for automation. | Selecting the appropriate data type enables accurate formatting and easier downstream processing. | Field name: Transaction Volume Data type: Number | Field name: Phone Number Data type: Number Explanation: Using the Number data type for a phone number is not beneficial. Although a phone number is composed of digits, it is not a numerical value, meaning that you do not perform arithmetic with it; it is better described as a string of digits. Therefore, using an Exact Text data type would be the appropriate choice. |
| Only include field type-specific instructions in the field type | When providing instructions for data extraction, it is crucial to keep them specific to each field type. If there are general instructions that apply to all fields of a certain type, a user can provide them at the field type level to avoid repetition. For example, if all Monetary Quantity fields need to be in USD, specify this at the field type level. However, some datasets may require unique fields not covered by existing field types (Date, Text, Monetary Quantity, and so on). In these cases, you can create a new, customized field type. When writing instructions for these new fields, specify how the data should be formatted to ensure the extracted data meets its intended purpose. These practices enhance the precision and consistency of your extracted data. | Keeping instructions specific to each field type ensures clarity, reusability, and consistency in data extraction. | Field type: Date Instruction: Extract all the dates associated with transactions from the document. Dates should be normalized to the format YYYY-MM-DD. | Field type: Monetary Quantity Instruction: Extract the item price from the Price column under the invoice line items table. Explanation: The instruction is relevant specifically to extracting a Monetary Quantity from a certain field (the Price column), not to any other Monetary Quantity-based field. |
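
A short Python sketch of why a numeric type suits Transaction Volume but not Phone Number. The sample values and the `as_number` and `as_exact_text` helpers are illustrative assumptions; they only mimic the general difference between numeric and text handling and do not describe how the product implements its data types.

```python
def as_number(value: str) -> int:
    """Treat the digits as a quantity, dropping thousands separators and spaces."""
    return int(value.replace(",", "").replace(" ", ""))


def as_exact_text(value: str) -> str:
    """Keep the characters exactly as they appear in the document."""
    return value


# Transaction Volume is a quantity, so arithmetic on it is meaningful.
total_volume = as_number("1,250") + as_number("750")

# A phone number is an identifier: converting it to a number drops the leading zero.
phone = "0301234567"
phone_as_number = as_number(phone)    # 301234567, no longer a valid phone number
phone_as_text = as_exact_text(phone)  # "0301234567", every digit preserved
```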
Signatures
- Use a Boolean data type for a Signed by X? field, that is, Is it signed by this individual?, as well as a text field for the name of the individual, which is usually printed.
- If you can typically find signatures in a table or table-like format, use the Table model pre-processing option.
- Failures are most common in a document with multiple signatories, including both the named individual in the document and their witness.
- Be clear and descriptive about the following:
- What constitutes a signature?
- What does not constitute a signature?
- Who needs to sign the document?
- How to detect the person who needs to sign the document?
- Account for potential failure cases in your documents and include them in the instructions, as described in the following example:
Instruction example for a Signed by Signatory field
Determine whether or not the signatory, not the witness, signed the document.
Only return true if the document is signed by this signatory. Return false if it is not signed by them.
Signatures may not look like the printed name, so look for a signature or a handwritten, signature-like addition in the signature space near the specific signatory's name.
If a name is amended with a handwritten addition, do not treat this as a signature. Only count explicit signatures.
Signatures will generally be close to the words "Signed By" or a variation such as "Signed as a deed", "In the presence of", and so on.
A dotted line does not constitute a signature.
- For a general Signatures field group that combines a Signed field with the Name or Position of the signatory, or both, you can add the following to the instructions: Ensure you are attributing signatures to the correct individual.
- An example instruction for a broader Signatures field group is the following:
Instruction example for a Signatures field group, which contains fields relating to signatories
Information on the people signing and the status of the document.
If there are multiple signature blocks and multiple people present in the document, extract all of them.
There may not be an explicit signature block. Agreements and letters may have been signed by the person sending them, with a signature block for the person accepting. In this case, extract both sets of signatures.
Regional differences
Monetary quantities and comma separators
An example of a regional difference that can require prompting to correct the default LLM behaviour is the use of commas as decimal separators in certain countries, such as Germany.
The following example for a German receipts use case shows how you can account for the presence of values in an unexpected format:
Instruction example
You are extracting data from German receipts. Monetary amounts are all in €, although the € sign might be missing. "," is the typical decimal separator for all numbers, while "." is used for formatting larger values.
To determine if this format is being used, check for a comma as the final separator in the value. If not, the number is likely formatted in the alternative format of using "," for formatting and "." for decimal places.
Amounts typically have two decimal digits (e.g. 8,58 is 8.58€ and 9.115,00 is 9115.00€). Expect single line items in grocery receipts to be below 100€.
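
The "final separator" rule in the instruction above can be sketched in Python, for example to validate extracted amounts in post-processing. The `parse_amount` helper is hypothetical and not part of the product. It assumes the value contains only digits, separators, spaces, and an optional € sign.

```python
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Parse an amount that may use German or English-style separators."""
    cleaned = text.replace("€", "").replace(" ", "").strip()
    last_comma = cleaned.rfind(",")
    last_dot = cleaned.rfind(".")
    if last_comma > last_dot:
        # Comma is the final separator: German style, "." groups thousands.
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        # Dot is the final separator, or there is none: "," groups thousands.
        # A value such as "1.234" is ambiguous and is read as 1.234 here.
        cleaned = cleaned.replace(",", "")
    return Decimal(cleaned)


print(parse_amount("8,58"))      # 8.58
print(parse_amount("9.115,00"))  # 9115.00
print(parse_amount("9,115.00"))  # 9115.00
```

The ambiguous case noted in the comment is one reason the instruction also gives the model a plausibility hint, such as expecting single line items to be below 100€.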
- Create a field for all of the information you want extracted but do not include any instructions.
- Select a sample of 2 to 3 documents and run predictions on each one. These documents should reflect the variation present in the documents that you are building the model for.
- Compare the extractions of the model to what you expected. For the fields that did not perform well, draft a prompt using the previously listed best practices. This draft serves as your baseline prompt.
- Rerun the predictions using the same 2 to 3 sample documents you tested earlier and check whether the extraction performance has improved.
- If the predictions are incorrect or incomplete, refine the prompts to add the necessary details to enhance the extraction performance of the model. If the predictions align with your expectations, widen your sample size of documents. It is crucial to gradually increase these numbers. Move from 2 to 3 to 10, then to 20, 30, and so on. Continue until you feel confident that the predictions of the model are correct.
- If the instructions have changed, reevaluate previously viewed documents to ensure predictions remain accurate.
- Once you are satisfied with the performance of the model, revisit the first document and start annotating. Annotate at least 10 documents to gain valuable field performance metrics through the Measure tab. This feature allows you to evaluate the extraction performance at both the overall project and field levels.
- Monitor performance metrics to inform your large-scale prompt refinement. Prompt iteration should primarily occur at the field level, where adjustments have more targeted and direct impacts on the specific fields that are not performing well. If an entire field group is not performing well, then adjusting your project and field group instructions may be more impactful, as they affect several fields.
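
The Measure tab reports these metrics in the product. As a rough offline illustration only, the following Python sketch computes a simple exact-match rate per field from annotated and predicted values, which shows how field-level scores point to the prompts that need refinement. The `field_accuracy` helper, the dictionary layout, and the sample values are assumptions, and this is not how the product calculates its metrics.

```python
def field_accuracy(annotations, predictions):
    """Return, per field, the share of annotated documents predicted exactly right."""
    totals, correct = {}, {}
    for doc_id, expected in annotations.items():
        predicted = predictions.get(doc_id, {})
        for field, value in expected.items():
            totals[field] = totals.get(field, 0) + 1
            if predicted.get(field) == value:
                correct[field] = correct.get(field, 0) + 1
    return {field: correct.get(field, 0) / totals[field] for field in totals}


# Hypothetical annotated values and model predictions for two sample documents.
annotations = {
    "doc1": {"Account Number": "A-100", "Transaction Date": "01/01/2021"},
    "doc2": {"Account Number": "B-200", "Transaction Date": "05/01/2021"},
}
predictions = {
    "doc1": {"Account Number": "A-100", "Transaction Date": "01/01/2021"},
    "doc2": {"Account Number": "B-200", "Transaction Date": "05/2021"},
}

scores = field_accuracy(annotations, predictions)
# Fields with the lowest scores are the first candidates for prompt refinement.
needs_work = sorted(scores, key=scores.get)
```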