CERN document conversion service

The conversion service intends to automatically convert a number of application formats, input ones, to a different format, output ones, usually PDF or PDF/A. Extension to provide more output/input formats is possible as adding new "converters" is part of the design. Conversions are done asynchronously using a REST API to communicate among actors.

Workflow and syntax

The communication to the converter is done via a REST API with no authentication. The sequence is basically as:

While a new document is uploaded to the doconverter server, several parameters need to be specified:

uploadedfile: Use multipart/form-data to upload your file.
converter: format we would like to convert to, e.g. pdf, pdfa, ps, thumb*...
dirresponse: where you expect to store the converted file, it will be use when sending back your new converted document
urlresponse: callback that we will use to contact you e.g. https://mydomain.ch/api/receivefiles
options: special options depending on type of document conversion

Input File	Conversion output	Options	Expected result	Comments
Office 2019, Autocad(cdx), OpenOffice, Coreldraw(cdr),rtf,htm,bmp,jpg,tif	converter=[pdf,dfa,ps]	N/A	pdf file	N/A
Special case for Word documents(e.g. doc,docx)	converter=[pdf,pdfa]	hidedocumentrevisions=[false,true]	pdf file	You may want to have comments visible on your pdf, by default they are not
PDF file	converter=toimg	typeofimg=[jpeg,bmp,tiff,png]:imgresh=200:imgresv=200*	zip or tif file	Depending on the format of file chosen a tif file or zip file containing all the numbered pages of the document will be sent back. E.g. original file mypresentation.pdf -> zip: mypresentation.zip or tif: mypresentation.tif
PDF file	converter=thumb	imgresh=200:imgresv=200:imgheight=300:imgwidth=300**	png file	if original file was called mydocument.pdf -> mydocument1.png
PLT file	converter=hpgl	color=[true,false]	pdf file	N/A
tif,png,jgp	converter=tesocr	N/A	pdf searchable file	It uses tesseract-ocr as engine that just support image files
pdf, tif	converter=modiocr	language=[english,french]	pdf searchable file	N/A

*imgresh,imgresv: should be one of '72x72','100x100','150x150','200x200','300x300','400x400','600x600','1200x1200' **imgresh,imgresv,imgheight,imgwidth:should be one of '72x72','100x100','150x150','200x200','300x300','400x400','600x600','1200x1200' and imgheight_imgwidth should be an integer in pixels e.g.: thumb_200_200_150_150

As result of a submission, you should get a result of your submitted job, either a success in that case you get a file with the expected format or a error in that case you get an error message:

directory: where you would like to place your document
status: 1 if success, different number otherwise
filename: name of the file provided in the content
error_message: in case of an error, an explanation why that happened
content: expected file coded in a multipart/form-data section

An example can be found within doconverter project functional test module, see Github

Architecture

The two main components of the framework are a web fronted and worker nodes. Worker nodes are running on Windows as most of the functionality of the document conversion service deals with office applications. Web server and worker nodes don't run in the same server and communication about conversion jobs is done using a common file system. EOS is a CERN developed file system that can be accessed from Linux like and Windows operating systems.

Both components have been developed using Python3. A PostgreSQL database is used to keep accounting of jobs requested and results.

Setup

The application expects this structure on a windows server:

c:\doconverter
              \logs
              \config\
                      doconverter.ini
                      logging.conf
              \files
              \cert
              \doconverter
              \testing

A sample of doconverter.ini and logging.conf can be found in the project.

Possible setup on a Windows machine:

# Activate appropiate environment
workon doconverter
cd c:\doconverter
git clone https://github.com/CERNCDAIC/doconverter.git
cd doconverter
pip install -r requirement.txt
# This is required to run COM. Cant be added as it will fail image build on Openshift (Linux)
pip install pypiwin32

# Database setup, PostgreSQL user and schema should be available
python
from doconverter.models.extensions import db
from doconverter.models.Taskdb import Taskdb
from doconverter.models.Result_Conversion import Result_Conversion
db.drop_all()
db.create_all()

# Run converter
(doconverter) C:\doconverter\doconverter\doconverter\engines>python converter_daemon.py --help
usage: converter_daemon.py [-h] [--n NPROCESSES] [--t TIMETOSLEEP]
                           [--send SENDTASKID] [--s] [--r] [--m] [--a ARCHIVE]
                           [--c COMPUTER]
Processing of tasks
optional arguments:
  -h, --help         show this help message and exit
  --n NPROCESSES     number of documents to treat in parallel
  --t TIMETOSLEEP    time to wait for new tasks in secs
  --send SENDTASKID  Get the task information and try to send it, it is
                     supposed it was properly converted
  --s                creates a flag in the local file system for clean stop
  --r                deletes flag for clean shutdown in the local file system,
                     if present
  --m                Performs basic monitoring checks
  --a ARCHIVE        items older than this number of days will be moved to
                     archive
  --c COMPUTER       queue to work tasks from. Usually same as the server were
                     the program runs.

# Run a worker with two processes
python .\engines\converter_daemon.py --n 2

# Run a test web server on a public IP
(doconverter) C:\doconverter\doconverter\doconverter\api>python handlers.py
 * Restarting with stat
 * Debugger is active!
 * Debugger PIN: 294-034-801
 * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)

Software

The project relies in several packages for conversion:

Neevia DocPro converter a license is required, though a test software is available. Most office like conversions are done using this software. It requires also that those applications be installed on the server e.g. Microsoft Office, Open Office, Autocad, etc..
Tesseract-OCR it converts image files to a searchable PDF.
SharepointDesigner 2007 it converts images and PDF files to a searchable PDF. It uses Neevia COM.
HPGL-Viewer it converts PLT files to PDF.

Name		Name	Last commit message	Last commit date
Latest commit History 121 Commits
doconverter		doconverter
docs		docs
.gitignore		.gitignore
.travis.yml		.travis.yml
LICENSE		LICENSE
README.md		README.md
launch_gunicorn.sh		launch_gunicorn.sh
logrotation_doconverter.cmd		logrotation_doconverter.cmd
requirements.txt		requirements.txt
restart_doconverter.cmd		restart_doconverter.cmd
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CERN document conversion service

Workflow and syntax

Architecture

Setup

Software

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CERN document conversion service

Workflow and syntax

Architecture

Setup

Software

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages