
How To Implement Text Search for Non-Linear Texts

Abstract

A digital edition should provide full-text search. This poses a challenge for digital manuscript editions that cannot edit the content of the manuscripts as linear text. Because the word order of non-linear texts is not reliable, full-text search may either miss many relevant results or return many irrelevant ones, making it hard to find what is relevant.
The digital manuscript edition "Der späte Nietzsche" (https://nietzsche.philhist.unibas.ch) faces this problem: it does not aim to construct a linear text by interpreting the content of the manuscripts, but presents that content as topological transcriptions. Nevertheless, users should be able to search these manuscripts for phrases that they might know from Nietzsche's published work. The digital edition therefore has to provide a solution to this problem.

We propose a three-factor solution that relies on multiple pseudo-linear texts for each manuscript page, a simple text search, and a ranking strategy based on topological proximity.

Three-factor solution

The solution:

  • Preparation: For each manuscript page, create several pseudo-linear texts as part of the data set that reflect the different textual possibilities (use machine reasoning).
  • Search: For each search term, add a FILTER regex(?pseudo_text, <search_term>) to the SPARQL query that selects all pseudo-linear texts of all (selected) manuscript pages. Return every page for which at least one pseudo-linear text contains all the search terms. (Alternatively, use Jena Full Text Search.)
  • Ranking: Find the words that correspond to the search terms on the returned pages and rank the pages according to the topological proximity of these words, i.e. the proximity between their transcription positions.
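
The search and ranking steps can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the edition's actual implementation: the prefix ex:, the property ex:hasPseudoText, the case-insensitive "i" flag, the Word class, and the function names are all hypothetical. The text above does not specify how topological proximity is measured; the sketch assumes the largest pairwise Euclidean distance between the matched words, so a smaller score means the words lie closer together on the page.

```python
import math
from dataclasses import dataclass
from itertools import combinations, product

# Characters with a special meaning in regular expressions.
REGEX_META = set(r"\.^$*+?()[]{}|")


def to_sparql_regex_literal(term):
    """Escape a search term for use as a literal pattern inside a SPARQL string."""
    pattern = "".join("\\" + ch if ch in REGEX_META else ch for ch in term)
    # Backslashes and double quotes must be escaped again inside a SPARQL string literal.
    return pattern.replace("\\", "\\\\").replace('"', '\\"')


def build_search_query(terms):
    """Build a query with one FILTER regex per search term (hypothetical vocabulary)."""
    filters = "\n".join(
        f'  FILTER regex(?pseudo_text, "{to_sparql_regex_literal(t)}", "i")'
        for t in terms
    )
    return (
        "PREFIX ex: <http://example.org/edition#>\n"  # assumed namespace
        "SELECT DISTINCT ?page WHERE {\n"
        "  ?page ex:hasPseudoText ?pseudo_text .\n"   # assumed property
        f"{filters}\n"
        "}"
    )


@dataclass(frozen=True)
class Word:
    """A transcribed word with its position on the page (hypothetical model)."""
    text: str
    x: float
    y: float


def spread(words):
    """Largest pairwise distance between the given words (0.0 for fewer than two)."""
    return max(
        (math.dist((a.x, a.y), (b.x, b.y)) for a, b in combinations(words, 2)),
        default=0.0,
    )


def page_score(words, terms):
    """Smallest spread over all ways to pick one matching word per search term."""
    candidates = [[w for w in words if t.lower() in w.text.lower()] for t in terms]
    if any(not c for c in candidates):
        return None  # at least one term has no matching word on this page
    return min(spread(combo) for combo in product(*candidates))


def rank_pages(pages, terms):
    """Return (page, score) pairs, closest matches first; pages without a match are dropped."""
    scored = ((name, page_score(words, terms)) for name, words in pages.items())
    return sorted(((n, s) for n, s in scored if s is not None), key=lambda p: p[1])
```

For example, rank_pages({"page_A": [Word("Gott", 10, 10), Word("todt", 50, 10)], "page_B": [Word("Gott", 10, 10), Word("todt", 200, 300)]}, ["Gott", "todt"]) places page_A before page_B, because the two matched words are closer together on page_A. Trying every combination of matching words is exponential in the number of search terms, which is acceptable only for short queries.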

Last Author: steinech
Last Edited: Feb 2 2022, 10:13

Event Timeline

steinech created this document. Oct 23 2020, 17:25
steinech edited the content of this document.
steinech edited the content of this document. Oct 23 2020, 17:27
steinech edited the content of this document. Oct 23 2020, 17:55
steinech edited the content of this document. Dec 11 2020, 09:13
steinech changed the visibility from "Restricted Project (Project)" to "Public (No Login Required)". Feb 2 2022, 10:12
steinech edited the content of this document.