
Fixed a bug relating to the indexation of fulltexts: When the contents of a PDF…

Authored by Nicholas Robinson <nicholas.robinson@cern.ch> on Aug 9 2006, 15:26.

Description

Fixed a bug relating to the indexation of fulltexts: When the contents of a PDF fulltext are to be indexed, the tool "pdftotext" is used to convert the PDF to plain text. The plain text should be UTF-8 so that search_engine (strip_accents) can replace accented letters with their non-accented cousins. However, pdftotext outputs latin-1 by default, so accented letters could not be replaced; they were kept and used in the fulltext word index. As a result, if you searched within a fulltext for a word containing accents, you would never get any results unless the non-accented "version" of that word also existed in the document. [E.g. searching for "später" would only return results for documents containing "spater", because search_engine strips the accent in the search query, meaning that the query can never match the accented word in the fulltext word index.] The problem was fixed by calling pdftotext with its "-enc UTF-8" argument.
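The following is a minimal sketch of the fix described above, not the code from this commit. It assumes pdftotext is installed and on the PATH. The helper names (pdftotext_command, extract_text) are hypothetical, and strip_accents here is a simplified stand-in based on Unicode decomposition rather than the actual search_engine implementation.

```python
import subprocess
import unicodedata


def pdftotext_command(pdf_path, txt_path):
    """Build a pdftotext command line that forces UTF-8 output (hypothetical helper)."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, txt_path]


def extract_text(pdf_path, txt_path):
    """Run pdftotext (must be installed) and read the result back as UTF-8."""
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    with open(txt_path, encoding="utf-8") as handle:
        return handle.read()


def strip_accents(text):
    """Simplified stand-in for search_engine's strip_accents.

    Decomposes each character (NFKD) and drops the combining marks,
    so that 'ä' becomes 'a'.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_term(raw_bytes):
    """Turn one extracted word (as bytes) into its index form.

    Assumes UTF-8 input; latin-1 bytes for accented letters will
    typically fail to decode, which mirrors the original bug.
    """
    return strip_accents(raw_bytes.decode("utf-8"))
```

With UTF-8 input, index_term("später".encode("utf-8")) and the accent-stripped query both reduce to the same unaccented word, so they can match. The latin-1 encoding of the same word is not valid UTF-8, which is why the accents could not be stripped before the "-enc UTF-8" argument was added.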

Event Timeline

Nicholas Robinson <nicholas.robinson@cern.ch> committed R3600:31f485a1e848: Fixed a bug relating to the indexation of fulltexts: When the contents of a PDF… (authored by Nicholas Robinson <nicholas.robinson@cern.ch>). Aug 9 2006, 15:26