How to make text extraction more accurate

Use the information detailed in this article to improve the accuracy of your Drawing text extraction.

Quick Navigation:

Uploading new Drawings using text extraction can save hours of time by eliminating the need to manually upload single files and update each file's form fields. In order to improve the accuracy of your extraction, it is recommended that you follow the guidelines below:

Vector-based PDF Format

In order to use the ProjectTeam text extraction tool to its full potential, you will need to use vector-based PDFs with text. One easy way to find out if your PDF is suitable for text extraction is to open your PDF and click "Ctrl + A" (or "⌘ + A" on Mac) to see if you have readable text. In the example below, you will see that the text gets highlighted in blue meaning it can be found by the extraction tool. However, notice the Architects information in the top right corner is not highlighted meaning it will not be found during text extraction.

It is recommended that you request vector-based PDFs from the project design team instead of PDFs with raster content.

Text Locations

The ProjectTeam text extraction tool allows you to identify areas within your Drawings to extract text. You have the ability to map text from your PDF to any single-line or multi-line text fields on your Drawing form.

For the most accurate extraction, it is best to upload a PDF that has consistent Drawing information on every page. To account for small differences from page to page, make sure the extraction region defined is large enough to capture all appropriate text. The extractor tool will capture everything that is completely enclosed in the region and leave out any overlapping text. For example, below you will notice I create a region much larger than just the characters on this page.

AutoCAD Users

To create text that can be found by the ProjectTeam text extraction tool, follow the guidelines below:

Common Issues

Text is extractor out of order

Text extraction reading order is not defined in the ISO PDF standard. In fact, there is no concept of the sentence, paragraph, tables, or anything similar in a typical PDF file. Therefore, the reading order is not guaranteed to match the order that a typical user reading the document would follow. The reading order of a magazine, newspaper article, and academic article are all quite different due to the lack of semantic information in a PDF and the placement/ordering of text in the document.

An example of this is when you have two different blocks of text where the second line actually shows first when "reading" left to right.

 

If you edit the PDF document you will notice that each line is its own block of text. Since the "R" from "R450" is more left than the "S" in "SRA" the extraction will read it as "R450 SRA".

To fix the issue, you can either left-align the text or make sure "SRA R450" is all included in the same block of text.