How To Implement Text Search for Non-Linear Texts
Abstract
A digital edition should provide full-text search. This poses a challenge for digital manuscript editions that cannot edit the content of the manuscripts as linear text. Because the word order of non-linear texts is not reliable, full-text search may either miss many relevant results or return so many irrelevant ones that the relevant passages become hard to find.
The digital manuscript edition "Der späte Nietzsche" (https://nietzsche.philhist.unibas.ch) faces this problem: it does not aim to construct a linear text by interpreting the content of the manuscripts, but presents that content as topological transcriptions. Nevertheless, users should be able to search these manuscripts for phrases that they might know from Nietzsche's published work, so the edition has to provide a solution.
We propose a three-factor solution that relies on multiple pseudo-linear texts for each manuscript page, a simple text search, and a ranking strategy based on topological proximity.
Three-factor solution
The solution:
- Preparation: For each manuscript page, create several pseudo-linear texts as part of the data set that reflect the different textual possibilities (using machine reasoning).
- Search: For each search term, add a FILTER regex(?pseudo_text, <search_term>) to the SPARQL query that selects all pseudo-linear texts of all (selected) manuscript pages. Return all pages for which at least one pseudo-linear text contains all the search terms. (Alternatively, use Jena Full Text Search.)
- Ranking: Find the words that correspond to the search terms on the returned pages and rank the pages according to the topological proximity of these words, i.e. the proximity between their transcription positions.
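
The search and ranking steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the edition's actual implementation: the `ex:` namespace, the property `ex:hasPseudoLinearText`, all function names, and the representation of a page as a list of `(word, (x, y))` pairs are hypothetical. Proximity is measured here as the smallest diameter (largest pairwise Euclidean distance) over all selections of one matching word per search term, which is only one possible reading of "topological proximity".

```python
from itertools import combinations, product
from math import dist

# Characters with special meaning in SPARQL (XPath-style) regular expressions.
REGEX_META = set("\\.?*+{}()[]|^$-")


def regex_literal(term):
    """Return the term as a quoted SPARQL string that matches it literally."""
    pattern = "".join("\\" + ch if ch in REGEX_META else ch for ch in term)
    # Backslashes and quotes must be escaped again inside a string literal.
    return '"' + pattern.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_search_query(terms):
    """Select pages with at least one pseudo-linear text containing all terms."""
    filters = "\n".join(
        f"  FILTER regex(?pseudo_text, {regex_literal(term)})" for term in terms
    )
    return (
        "PREFIX ex: <http://example.org/edition#>\n"  # hypothetical namespace
        "SELECT DISTINCT ?page WHERE {\n"
        "  ?page ex:hasPseudoLinearText ?pseudo_text .\n"  # hypothetical property
        f"{filters}\n"
        "}"
    )


def proximity_score(words, terms):
    """Smallest diameter of any selection of one matching word per term.

    `words` is a list of (text, (x, y)) pairs (assumed position format).
    Returns None if some term matches no word on the page.
    """
    candidates = []
    for term in terms:
        hits = [pos for text, pos in words if term in text]
        if not hits:
            return None
        candidates.append(hits)
    return min(
        max((dist(a, b) for a, b in combinations(choice, 2)), default=0.0)
        for choice in product(*candidates)
    )


def rank_pages(pages, terms):
    """Return page ids ordered by ascending proximity score (closest first)."""
    scored = [(proximity_score(words, terms), page) for page, words in pages.items()]
    return [page for score, page in sorted(s for s in scored if s[0] is not None)]
```

In this sketch, `build_search_query(["ewige", "Wiederkunft"])` yields a query with one `FILTER` per term, and `rank_pages` would then be applied to the word positions of the pages that the query returns. Trying every combination of matching words is exponential in the number of search terms, so a real implementation would likely need a cheaper heuristic for long queries or very frequent words.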
Last Author: steinech
Last Edited: Feb 2 2022, 10:13