
websubmit-file-converter.webdoc
Created
Wed, May 8, 03:49

## -*- mode: html; coding: utf-8; -*-
## This file is part of CDS Invenio.
## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 CERN.
##
## CDS Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## CDS Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with CDS Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
<!-- WebDoc-Page-Title: Conversion tools -->
<!-- WebDoc-Page-Navtrail: <a class="navtrail" href="<CFG_SITE_URL>/help/hacking">Hacking CDS Invenio</a> &gt; <a class="navtrail" href="websubmit-internals">WebSubmit Internals</a> -->
<p>The WebSubmit Conversion Tools library (<tt>websubmit_file_converter.py</tt>) lets you convert a fulltext file from one format into another and perform OCR.</p>
<h2>Python API</h2>
<pre>
def get_best_format_to_extract_text_from(filelist, best_formats=CFG_WEBSUBMIT_BEST_FORMATS_TO_EXTRACT_TEXT_FROM):
    """
    Return the file in filelist whose format is best suited for
    extracting text.
    """
def get_missing_formats(filelist, desired_conversion=CFG_WEBSUBMIT_DESIRED_CONVERSIONS):
    """Given a list of files, return a dictionary of the form:
    file1 : missing formats to generate from it...
    """
def can_convert(input_format, output_format, max_intermediate_conversions=2):
    """Return the chain of conversions that transforms input_format into output_format, if any."""
def can_pdfopt():
    """Return True if it's possible to optimize PDFs."""
def can_pdfa():
    """Return True if it's possible to generate PDF/As."""
def can_perform_ocr():
    """Return True if it's possible to perform OCR."""
def can_spell_check(ln='en'):
    """Return True if it's possible to perform spell checking."""
def guess_is_OCR_needed(input_file, ln='en'):
    """
    Tries to see if enough text is retrievable from input_file.
    Return True if OCR is needed, False if it's already
    possible to retrieve information from the document.
    """
    output_file = convert_file(input_file, format='.txt', perform_ocr=False)
def convert_file(input_file, output_file=None, format=None, **params):
    """
    Convert files from one format to another.
    @param input_file [string] the path to an existing file
    @param output_file [string] the path to the desired output (if None, a
        temporary file is generated)
    @param format [string] the desired format (if None, it is taken from
        output_file)
    @param params other parameters to pass to the particular converter
    @return [string] the final output_file
    """
def pdf2hocr2pdf(input_file, output_file=None, font="Courier", author=None, keywords=None, subject=None, title=None, draft=False, ln='en', pdfopt=True, **args):
    """
    Transform a scanned PDF into a PDF with OCRed text.
    @param font the default font (e.g. Courier, Times-Roman).
    @param author the author name.
    @param subject the subject of the document.
    @param title the title of the document.
    @param draft whether to enable debug information in the output.
    @param ln is a two letter language code to give the OCR tool a hint.
    """
    input_file, output_hocr_file, dummy = prepare_io(input_file, output_ext='.hocr', need_working_dir=False)
    output_hocr_file, working_dir = pdf2hocr(input_file, output_file=output_hocr_file, ln=ln, return_working_dir=True)
    output_file = hocr2pdf(output_hocr_file, output_file, working_dir, font=font, author=author, keywords=keywords, subject=subject, title=title, draft=draft)
    clean_working_dir(working_dir)
    return output_file
</pre>
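<p>As a rough illustration of how the functions listed above might be combined, the following sketch extracts plain text from a PDF and runs OCR first when needed. The helper <tt>extract_text</tt> and the <tt>FakeConverter</tt> stand-in are hypothetical names introduced only for this example; in a real installation the <tt>converter</tt> argument would presumably be the <tt>invenio.websubmit_file_converter</tt> module itself. The sketch assumes the input is a PDF, since <tt>pdf2hocr2pdf</tt> is documented for scanned PDFs only.</p>

```python
def extract_text(input_file, converter, ln="en"):
    """Return the path of a .txt rendition of input_file (hypothetical helper).

    `converter` is any object exposing guess_is_OCR_needed, can_perform_ocr,
    pdf2hocr2pdf and convert_file with the signatures listed above.
    """
    if converter.guess_is_OCR_needed(input_file, ln=ln):
        if not converter.can_perform_ocr():
            raise RuntimeError("OCR is needed for %s but is not available" % input_file)
        # Assumption: the input is a scanned PDF, so add an OCRed text layer first.
        input_file = converter.pdf2hocr2pdf(input_file, ln=ln)
    # Same call that guess_is_OCR_needed itself uses in the listing above.
    return converter.convert_file(input_file, format=".txt", perform_ocr=False)


class FakeConverter:
    """Stand-in so the sketch runs without Invenio; it only records calls."""

    def __init__(self, ocr_needed, ocr_available=True):
        self.ocr_needed = ocr_needed
        self.ocr_available = ocr_available
        self.calls = []

    def guess_is_OCR_needed(self, input_file, ln="en"):
        return self.ocr_needed

    def can_perform_ocr(self):
        return self.ocr_available

    def pdf2hocr2pdf(self, input_file, output_file=None, ln="en", **args):
        self.calls.append("pdf2hocr2pdf")
        return input_file.replace(".pdf", "-ocr.pdf")

    def convert_file(self, input_file, output_file=None, format=None, **params):
        self.calls.append("convert_file")
        return input_file.rsplit(".", 1)[0] + format
```

<p>Passing the converter in as an argument is only a convenience for trying the flow out without the external OCR and conversion tools installed; calling the module functions directly would work just as well.</p>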
<p>See <a href="http://cdsware.cern.ch/invenio/code-browser/invenio.websubmit_file_converter-module.html">websubmit_file_converter API</a> for a complete API description.</p>
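<p>To make the idea of a "chain of conversions" with a bounded number of intermediate steps more concrete, here is a minimal, self-contained sketch based on a breadth-first search. It is not the code used by <tt>can_convert</tt>: the function name <tt>find_conversion_chain</tt>, the <tt>conversions</tt> dictionary and the reading of <tt>max_intermediate_conversions</tt> as "number of formats strictly between input and output" are all assumptions made for the example.</p>

```python
from collections import deque


def find_conversion_chain(conversions, input_format, output_format,
                          max_intermediate_conversions=2):
    """Return the shortest list of formats from input to output, or None.

    `conversions` maps a format to the formats it can be converted to
    directly (a hypothetical table, not Invenio's real configuration).
    """
    if input_format == output_format:
        return [input_format]
    queue = deque([[input_format]])  # each entry is the chain built so far
    seen = {input_format}
    while queue:
        chain = queue.popleft()
        for next_format in conversions.get(chain[-1], ()):
            if next_format == output_format:
                return chain + [next_format]
            # Extending the chain turns next_format into an intermediate
            # format; the extended chain would then hold len(chain) of them.
            if next_format not in seen and len(chain) <= max_intermediate_conversions:
                seen.add(next_format)
                queue.append(chain + [next_format])
    return None


# Illustrative conversion table (made up for this example).
EXAMPLE_CONVERSIONS = {
    ".doc": [".pdf"],
    ".tif": [".pdf"],
    ".pdf": [".ps", ".txt"],
    ".ps": [".pdf"],
}
```

<p>With this table, <tt>find_conversion_chain(EXAMPLE_CONVERSIONS, ".doc", ".txt")</tt> goes through <tt>.pdf</tt> as the single intermediate format, while the same call with <tt>max_intermediate_conversions=0</tt> finds no chain and returns <tt>None</tt>.</p>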
