WARNING: THIS SITE IS A MIRROR OF GITHUB.COM / IT CANNOT LOGIN OR REGISTER ACCOUNTS / THE CONTENTS ARE PROVIDED AS-IS / THIS SITE ASSUMES NO RESPONSIBILITY FOR ANY DISPLAYED CONTENT OR LINKS / IF YOU FOUND SOMETHING MAY NOT GOOD FOR EVERYONE, CONTACT ADMIN AT ilovescratch@foxmail.com
Skip to content

GateNLP/atrium-text_extraction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ATRIUM PDF Text Extraction

A script for extracting text frm PDF files and storing it into a simple JSON structure which can also store metadata and section (or annotation) information.

Installation

The script uses a number of Python packages which can be installed from the requirements.txt file. We recommend using a virtual enviroment for this to avoid clashes with other projects.

python -m venv venv
./venv/bin pip install -r requirements.txt

Note that if you wish to run OCR on the PDF before extracting the text then this requires OCRmYPDF to also be installed so that the platform native tools are available. Instructions for how to do this for your platform are available here.

Usage

Once the dependencies are installed you can run the script

./venv/bin/python pdf_to_json.py

which will display this helpful usage message:

usage: ATRIUM PDF Text Extraction [-h] [--ocr] pdf json

positional arguments:
  pdf         PDF to extract text from
  json        JSON file to write output into

options:
  -h, --help  show this help message and exit
  --ocr       Run OCR before extracting text

As you can see for basic operation simply provide the path to the PDF file as the first argument and the path to where to store the JSON outpu tas the second argument. If you also want to perform OCR then simply add the --ocr flag.

Note that for PDF files which were born digital it is unlikely you will want to run the OCR step. On PDF files which were created by scanning and OCRing paper documents sometimes it is useful to re-do the OCR as the performance may well have improved since they were originally produced.

Json Output Format

The output JSON follows the following simple format, where the text is stored as a single top level field, and then section information is provided as offset level information.

{
	"meta": {
		(any metadata about the source document)
	},
	"text": (full text contents of the document),
	"sections": [
		{ 
			"start": (number),	<-- character position
			"end": (number),	<-- character position
			"type": (string) 	<-- e.g. "title", "subtitle", "abstract", "conclusions", "references" etc.
		}
	]
	
}

About

ATRIUM: PDF Text Extraction

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages