Diffusion invenio-infoscience 31f485a1e848

Fixed a bug relating to the indexation of fulltexts: When the contents of a PDF…

Authored by Nicholas Robinson <nicholas.robinson@cern.ch> on Aug 9 2006, 15:26.

Description

Fixed a bug relating to the indexing of fulltexts: When the contents of a PDF fulltext are to be indexed, the tool "pdftotext" is used to convert the PDF to plain text. The plain text should be UTF-8 so that search_engine (strip_accents) can replace accented letters with their non-accented counterparts. However, pdftotext outputs Latin-1 by default, so accented letters were not replaced and were stored as-is in the fulltext word index. As a result, searching within a fulltext for a word containing accents never returned any results unless the non-accented form of that word also existed in the document. [E.g. searching for "später" would only return results for documents containing "spater", because the search engine strips the accent from the search query, so the query can never match the accented word in the fulltext word index.] The problem was fixed by calling pdftotext with its "-enc UTF-8" argument.
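
The mismatch described above can be reproduced with a short sketch. This is not the code from bibindex_engine.py: it is a modern Python 3 illustration, and build_pdftotext_command, pdf_to_utf8_text, and index_term are hypothetical helper names. The strip_accents shown here is a simplified stand-in, built on unicodedata, for the function of the same name in search_engine. The only detail taken from the commit message is the "-enc UTF-8" argument to pdftotext.

```python
import subprocess
import unicodedata


def build_pdftotext_command(pdf_path, txt_path):
    """Return the pdftotext argument list, requesting UTF-8 output.

    Hypothetical helper; only the "-enc UTF-8" option comes from the commit.
    """
    return ["pdftotext", "-enc", "UTF-8", pdf_path, txt_path]


def pdf_to_utf8_text(pdf_path, txt_path):
    """Convert a PDF to a UTF-8 text file (needs pdftotext on the PATH)."""
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    with open(txt_path, encoding="utf-8") as handle:
        return handle.read()


def strip_accents(text):
    """Simplified stand-in for search_engine's strip_accents (assumption)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks left behind by the decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_term(raw_bytes):
    """Turn raw extractor output into an index term, assuming UTF-8 input."""
    # Bytes that are not valid UTF-8 become U+FFFD instead of raising.
    text = raw_bytes.decode("utf-8", errors="replace")
    return strip_accents(text)


query_term = strip_accents("später")                  # what the search side looks up
fixed_term = index_term("später".encode("utf-8"))     # extractor output with -enc UTF-8
broken_term = index_term("später".encode("latin-1"))  # extractor output by default
```

With UTF-8 output, the indexed term and the stripped query term agree. With Latin-1 bytes, the accented letter is not recognised by the accent stripper, so the indexed term differs from the query term and the search finds nothing.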

Details

Committed

Nicholas Robinson <nicholas.robinson@cern.ch>

Aug 9 2006, 15:26

Parents

R3600:090c843e4f30: Added 999C6a subfield containing information about status of extracted…

Event Timeline

Nicholas Robinson <nicholas.robinson@cern.ch> committed R3600:31f485a1e848: Fixed a bug relating to the indexation of fulltexts: When the contents of a PDF… (authored by Nicholas Robinson <nicholas.robinson@cern.ch>). Aug 9 2006, 15:26

Changes (1)

Path
M	modules/bibindex/lib/bibindex_engine.py
