Overview
The data table extraction feature of Scribe OCR allows for .pdf files containing tables to be converted to .xlsx.
Enabling Table Extraction
Select Info -> Optional Features -> Extract Tables. This will enable the Layout tab and add the Data Tables pane, which can be used to add tables.
Edit Table Layout
Adding and Modifying Tables
- Add table.
- Add a table by clicking the Add Data Table button and selecting the area where the table should be inserted.
- Delete table.
- Delete a table by selecting the entire table (including all columns), right clicking, and selecting Delete Table.
- Resize table.
- A table can be resized by dragging the controls that appear when the table is selected.
- When the table is resized to exclude a column, that column is automatically deleted.
Adding and Modifying Columns
- Adjusting column bounds.
- Column boundaries can be adjusted by clicking the column separator and dragging it to the left or right.
- Combining columns.
- Neighboring columns can be combined by selecting the columns, right clicking, and selecting Combine Columns from the context menu.
- Splitting columns.
- A single column can be split into multiple columns by selecting the column, right clicking, and selecting Split Column from the context menu.
- Adding/deleting columns.
- There are no “add column” or “delete column” buttons.
- Columns can be added/deleted through a combination of resizing the table, and splitting/combining existing columns.
- When the table is resized to exclude a column, that column is automatically deleted.
Set Default Page Layout
It is possible to set a default layout, which is applied to every page whose tables have not been edited manually. Setting a default makes it easy to process documents such as invoices or reports, where 50 pages may share the same layout.
- To make the layout from the current page the default, click Save As Default.
- To discard all edits made to an individual page, reverting it to the default, click Revert To Default.
Text Assignment to Columns
By default, individual words are assigned to the column they overlap the most with. While this behavior is generally correct, users can modify how words are assigned to columns by right clicking column(s) and selecting options in the Overlap Rules drop-down menu. Specifically, the following properties can be modified.
- Is text assigned to columns on a word-by-word basis, or should entire lines be assigned to columns?
- Select “word” to assign text to columns by word; select “line” to assign entire lines to the same column.
- Is text assigned to columns based on where the text starts, or based on where the majority of the text is found?
- Select “left” to assign text to the column where the text starts; select “majority” to assign text to the column it overlaps the most with.
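The overlap rules described above can be illustrated with a minimal Python sketch. This is not Scribe OCR's actual implementation: the Box class, the assign_column and assign_line functions, and the unit and rule parameter names are all hypothetical, and the sketch assumes words and columns are described only by their left and right edges.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    """Horizontal extent of a word, line, or column (hypothetical model)."""
    left: float
    right: float


def overlap(a: Box, b: Box) -> float:
    """Width of the horizontal overlap between two boxes (0 if disjoint)."""
    return max(0.0, min(a.right, b.right) - max(a.left, b.left))


def assign_column(text: Box, columns: List[Box], rule: str = "majority") -> Optional[int]:
    """Return the index of the column the text belongs to, or None."""
    if rule == "left":
        # Assign to the column that contains the left edge of the text.
        for i, col in enumerate(columns):
            if col.left <= text.left < col.right:
                return i
        return None
    if rule == "majority":
        # Assign to the column with the largest horizontal overlap.
        widths = [overlap(text, col) for col in columns]
        best = max(range(len(columns)), key=widths.__getitem__, default=None)
        return best if best is not None and widths[best] > 0 else None
    raise ValueError(f"unknown rule: {rule}")


def assign_line(words: List[Box], columns: List[Box],
                unit: str = "word", rule: str = "majority") -> List[Optional[int]]:
    """Assign every word of one text line to a column index."""
    if unit == "line":
        # Treat the whole line as one box, so all words share a column.
        line_box = Box(min(w.left for w in words), max(w.right for w in words))
        return [assign_column(line_box, columns, rule)] * len(words)
    return [assign_column(w, columns, rule) for w in words]


columns = [Box(0, 100), Box(100, 200)]
# The first word straddles the column separator at x = 100.
words = [Box(80, 130), Box(135, 180)]

print(assign_line(words, columns, unit="word", rule="majority"))  # [1, 1]
print(assign_line(words, columns, unit="word", rule="left"))      # [0, 1]
print(assign_line(words, columns, unit="line", rule="left"))      # [0, 0]
```

In this example the first word overlaps the second column more than the first, so the default word and majority settings place it in the second column, while the left setting keeps it in the column where it starts.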
Downloading Data
To download tables in a tabular format, navigate to the Download tab, and then set the format to .xlsx. Excel (.xlsx) is currently the only supported format for writing tabular data.
FAQ
Can rows be adjusted manually?
No, rows currently cannot be adjusted manually. If lines are incorrectly split across different rows, or incorrectly combined into the same row, this needs to be fixed in the .xlsx data afterwards.
Can tables be detected automatically?
When uploading from Abbyy .xml, tables in the OCR data will be parsed and inserted automatically. Automatic table recognition is not supported for uploaded Tesseract .hocr data, or for the built-in recognition engine.