Overview
Motivation
As technology advances and data becomes digital, drawing on past projects that exist only in analog form becomes more difficult. For example, the United States is grappling with aging infrastructure, which must be maintained and will take decades to replace. Safety reviews and seismic retrofits for these aging structures rely heavily on geotechnical reports and borehole logs, many of which predate digital records. Given the high costs of subsurface investigations and engineering supervision, access to historical borehole records is essential for ensuring safety and stability.
Traditionally, these records are scanned into PDF files and stored for future access. This produces a digital copy of the logs that can be printed or attached to new reports or construction drawings. Although this is a great step towards preserving historic borehole logs, it does not necessarily lead to cost savings when planning new subsurface investigations or engineering designs.
The new RSLog OCR feature remedies this problem by utilizing Artificial Intelligence to turn historic paper borehole logs into easily accessible database records in your RSLog account. This enables consulting engineers, Departments of Transportation, and infrastructure owners to convert tens of thousands of borehole records to structured borehole data in a short period of time.
Introduction
The new Optical Character Recognition (OCR) feature in RSLog utilizes Artificial Intelligence to turn the paper copies of your old borehole logs into easily accessible database records in your RSLog account. The process involves uploading your logs and creating log extraction templates that can be reused for multiple files. If you would like more information, see the latest OCR article. Below is an overview of the steps required to create a template and extract boreholes from your logs:
- To access the OCR page, go to the left-hand navigation menu and select OCR > Scan Pages. Click Import Borehole Logs to upload and OCR your borehole logs.
- Once the status changes to Unknown Template, action is required. Click the Create and Update Templates icon button to create a template.
- Follow the steps under the Create a Template documentation page to formulate a template.
- The new template will be applied to all files and will be assigned to all pages that match the template. Continue to create new templates until every page in the file is matched to a template.
- Once a file is completely matched to templates with a status of Templates Matched, you are ready to extract the boreholes. Click the Create Boreholes icon button.
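The status rule in the steps above can be sketched in a few lines of Python. This is only an illustration, not RSLog's implementation: the function name `file_status` and the list-of-template-names input are hypothetical, and only the two statuses named above are modeled (the product may show others, for example while a file is still being processed).

```python
from typing import List, Optional


def file_status(page_templates: List[Optional[str]]) -> str:
    """Return the file status from the template matched to each page.

    page_templates holds one entry per page: the name of the matched
    template, or None if no template matches that page yet.
    (Hypothetical data shape, used for illustration only.)
    """
    # A file is ready for extraction only when every page has a template.
    if page_templates and all(name is not None for name in page_templates):
        return "Templates Matched"
    # Otherwise action is required: create or update a template.
    return "Unknown Template"


# Page 2 has no template yet, so the file still needs attention.
print(file_status(["Cover sheet", None, "Log page"]))
# After a template is created for page 2, boreholes can be extracted.
print(file_status(["Cover sheet", "Log page", "Log page"]))
```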
Trials
Every RSLog trial now includes an OCR trial. Trial users have access to 10 log pages that can be scanned and extracted to create boreholes. Once a file is imported and the OCR process has begun, its pages are deducted from the 10-page limit. Once a template is assigned to the file and the boreholes are extracted, the pages are permanently deducted. Before this point, if you remove a file from your OCR list, its pages are returned to your page limit and become available for additional use. Please note that once the trial ends, the projects and boreholes created will no longer be accessible. If you renew your subscription within 30 days, the projects and boreholes will be restored; otherwise, they will be deleted.
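The page accounting described above can be illustrated with a small Python model. It is a sketch under stated assumptions rather than RSLog's actual logic: the class `TrialPageQuota` and its method names are hypothetical, and rejecting an import that exceeds the remaining allowance is an assumption not stated in this article.

```python
from dataclasses import dataclass, field


@dataclass
class TrialPageQuota:
    """Hypothetical model of the 10-page trial allowance."""

    limit: int = 10
    permanently_used: int = 0
    # file name -> pages held while the file is imported but not extracted
    pending: dict = field(default_factory=dict)

    @property
    def remaining(self) -> int:
        return self.limit - self.permanently_used - sum(self.pending.values())

    def import_file(self, name: str, pages: int) -> None:
        # Assumption: an import larger than the remaining allowance is refused.
        if pages > self.remaining:
            raise ValueError("not enough trial pages remaining")
        self.pending[name] = pages

    def remove_file(self, name: str) -> None:
        # Removing a file before extraction returns its pages.
        self.pending.pop(name, None)

    def extract_boreholes(self, name: str) -> None:
        # Extraction makes the deduction permanent.
        self.permanently_used += self.pending.pop(name)


quota = TrialPageQuota()
quota.import_file("site_a.pdf", 4)
print(quota.remaining)   # 6 pages left while the file is in the OCR list
quota.remove_file("site_a.pdf")
print(quota.remaining)   # 10, the pages were returned
quota.import_file("site_b.pdf", 3)
quota.extract_boreholes("site_b.pdf")
quota.remove_file("site_b.pdf")
print(quota.remaining)   # 7, extraction made the deduction permanent
```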
Limitations
While RSLog’s OCR tool is a powerful way of extracting log data, it is still limited by current optical character recognition technology and by the inconsistencies of loosely formatted borehole logs. The major restrictions and limitations of RSLog’s OCR feature are outlined below:
- Handwritten text
- Handwritten forms
- Language support
- Graphical log columns
- Missing, broken, or curved lines
- Erroneous rotation
- Image resolution and text legibility
- Form layout consistency
Handwritten Text
Handwritten text is supported by RSLog’s OCR, subject to language and legibility constraints. The languages supported for handwritten text are listed in the Language Support section below. Absolute consistency is not required for text to be recognized; however, keeping text as legible as possible gives the OCR engine the best chance of recognizing it. Handwritten text is subject to the same general constraints covered in the Image Resolution and Text Legibility section.
Handwritten Forms
Handwritten forms are supported, as RSLog’s OCR does not require lines to be perfectly straight or computer generated. RSLog uses the position along the page together with the depth measurements in your depth column to determine the depth of text results and layer transitions. Because text placement and layer transitions are more likely to be distorted relative to the depth column in a handwritten form, you may see distortions in the computed depths of these items more frequently on handwritten logs than on computer-generated PDFs.
Handwritten forms may also degrade the form structure, particularly column bounding lines and layer transitions. These cases can also occur in computer-generated forms and are covered later in this document under Missing, Broken, or Curved Lines. Please also see the Form Layout Consistency section, which covers an issue that can be exacerbated by handwritten forms.
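To see why misplaced text shifts computed depths, consider one simple way of mapping a position on the page to a depth: linear interpolation between labeled depth marks. The sketch below is illustrative only. RSLog's actual method is not documented here, and the function `depth_at`, the pixel coordinates, and the sample values are all hypothetical.

```python
from bisect import bisect_right


def depth_at(y, depth_marks):
    """Estimate the depth at vertical pixel position y.

    depth_marks is a list of (y_pixel, depth) pairs read from the depth
    column, with distinct y_pixel values. Depth is interpolated linearly
    between the two nearest marks, and extrapolated from the first or
    last pair when y falls outside the marked range.
    """
    marks = sorted(depth_marks)
    if len(marks) < 2:
        raise ValueError("at least two depth marks are required")
    ys = [mark_y for mark_y, _ in marks]
    # Index of the upper mark of the segment containing y, clamped to range.
    i = min(max(bisect_right(ys, y), 1), len(marks) - 1)
    (y0, d0), (y1, d1) = marks[i - 1], marks[i]
    return d0 + (y - y0) * (d1 - d0) / (y1 - y0)


# Hypothetical depth column: 0 at y=100 px, 5 at y=300 px, 10 at y=500 px.
marks = [(100, 0.0), (300, 5.0), (500, 10.0)]
print(depth_at(200, marks))  # 2.5
# A layer line drawn 20 px too low shifts its computed depth by 0.5.
print(depth_at(220, marks))  # 3.0
```

Under this model, every pixel of drift in a hand-drawn layer line or depth label translates directly into depth error, which is consistent with the behavior described above.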
Language Support
For an enhanced user experience, the RSLog OCR system predicts which words on your PDF may represent common data fields in RSLog, such as borehole ID, coordinates, date, and drilling method. RSLog also predicts which column headers represent supported RSLog columns and depth-dependent information, including custom field tests and lab tests. The OCR engine itself can read many different languages, but the seed data behind these predictions is only available in the following languages:
- English
If you are using a language not currently supported in our seed data, you may still use the tool to set up templates, but the guesses are likely to be missing for your borehole, resulting in slightly more setup time when creating a template. Once a template has been created, however, the extraction is based on the OCR engine and text positions, so the tool can still detect and record different languages. We are continually working on expanding the OCR tool to different languages.
Graphical Log Columns
Extraction of graphical log columns that contain non-textual data is not currently supported. Image-based or symbol-based test results are also not extractable. For example, hatched strata layers cannot currently be extracted.
Missing, Broken, or Curved Lines
Lines on the page are very important in RSLog’s OCR, since they play a key role in guessing data for each field and in recognizing log columns and headers. RSLog’s OCR is designed to work with straight vertical lines in the column information section and straight horizontal lines that divide layers. If lines are curved or broken, the system may struggle to identify them accurately.
If you suspect line issues in part of your borehole log, you can use the custom column line separator feature during template creation to define columns yourself. This feature allows you to add back broken or missing vertical lines. It may also allow you to separate information within a graphical column according to its position on the page, so you may be able to split one graphical column into multiple sub-columns.
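The effect of a missing vertical line, and of adding it back with a custom separator, can be illustrated with a short Python sketch. It assumes, purely for illustration, that each recognized word is assigned to a column by comparing its horizontal position against the separator positions. The function `assign_columns` and the sample coordinates are hypothetical and do not describe RSLog's internals.

```python
from bisect import bisect_right


def assign_columns(words, separators):
    """Group words into columns by horizontal position.

    words is a list of (text, x_center) pairs; separators is a list of
    x positions of vertical column lines. N separators give N + 1 columns.
    """
    seps = sorted(separators)
    columns = [[] for _ in range(len(seps) + 1)]
    for text, x in words:
        # bisect_right counts how many separators lie at or left of x.
        columns[bisect_right(seps, x)].append(text)
    return columns


words = [("0.5", 40), ("SAND", 120), ("dense", 180), ("12", 310)]

# Only the line at x=80 was detected; the line at x=250 is broken,
# so the description and the blow count are merged into one column.
print(assign_columns(words, [80]))
# [['0.5'], ['SAND', 'dense', '12']]

# Adding a custom separator at x=250 restores the third column.
print(assign_columns(words, [80, 250]))
# [['0.5'], ['SAND', 'dense'], ['12']]
```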
Non-uniform shaded regions present on the page may also occasionally interfere with accurate detection of vertical lines on the page, which could affect column selection.
Erroneous Rotation
The current system will not work on borehole log pages that are upside down or rotated 90 degrees, because straight lines are relied upon to correctly display and organize borehole log columns. Since line detection is done at a pixel-by-pixel scale, a reasonably large margin is available for pages that are slightly skewed.
Image Resolution and Text Legibility
The extraction ability of RSLog’s OCR depends on the readability of the PDF files: words must be legible enough to be picked up by our OCR engine.
Form Layout Consistency
RSLog’s OCR is built on a positional template system that learns the positional format of a form from only one page of a log supplied by the user (the source page). Pages that have information in a different relative location than the source page will not be classified as belonging to that template. In this case, you must either manually apply the template and correct any errors, or create a new template. You can create as many templates as you need to cover the variations in form structure within your company. Due to the positional nature of the template system, RSLog’s OCR works best on forms that are uniformly scanned, which prevents translation or rotation of form information across the page.
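A rough way to picture positional matching is to compare where known labels sit on a page against where they sat on the source page. The Python sketch below is a simplified illustration, not RSLog's classifier: the function `matches_template`, the use of page-relative coordinates, and the tolerance value are all assumptions made for the example.

```python
def matches_template(template, page_words, tolerance=0.02):
    """Check whether a page's labels sit where the template expects them.

    template and page_words both map label text to (x, y) positions given
    as fractions of page width and height (hypothetical representation).
    tolerance is the largest allowed offset on either axis.
    """
    for label, (tx, ty) in template.items():
        position = page_words.get(label)
        if position is None:
            return False  # an expected label is missing from the page
        px, py = position
        if abs(px - tx) > tolerance or abs(py - ty) > tolerance:
            return False  # the label has moved too far from the source page
    return True


# Label positions learned from the source page.
template = {"BOREHOLE NO.": (0.10, 0.05), "DEPTH": (0.08, 0.20)}

# A uniformly scanned page: labels are almost exactly where expected.
clean_scan = {"BOREHOLE NO.": (0.105, 0.052), "DEPTH": (0.082, 0.201)}
print(matches_template(template, clean_scan))     # True

# A page scanned with a vertical shift: the labels no longer line up.
shifted_scan = {"BOREHOLE NO.": (0.10, 0.12), "DEPTH": (0.08, 0.27)}
print(matches_template(template, shifted_scan))   # False
```

In this simplified picture, a shifted or rotated scan fails the comparison even though the form is the same, which is why uniform scanning matters for a position-based approach.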