# Abstract
A digital edition should provide a full text search. However, this poses a challenge for digital manuscript editions that cannot edit the content of the manuscripts as linear text. Since the word ordering of non-linear texts is not reliable, a full text search might either miss many relevant results or return many irrelevant ones, making it hard to find what is relevant.
The digital manuscript edition "Der späte Nietzsche" ([[ https://nietzsche.philhist.unibas.ch ]]) faces this problem because the edition does not aim to construct a linear text by interpreting the content of the manuscripts; rather, it presents the content of the manuscripts as topological transcriptions. Nevertheless, users should be able to search these manuscripts for phrases that they might know from Nietzsche's published works. The digital edition therefore has to provide a solution to this problem.
We propose a three-factor solution that relies on multiple pseudo-linear texts for each manuscript page, a simple text search, and a ranking strategy based on topological proximity.
# Three-factor solution
The solution consists of three steps:
- **Preparation** For each manuscript page, create several pseudo-linear texts as part of the data set that reflect the different textual possibilities (using machine reasoning); a sketch follows this list.
- **Search** For each search term, add a `FILTER regex(?pseudo_text, <search_term>)` clause to the SPARQL query that selects all pseudo-linear texts of all (selected) manuscript pages. Return all pages for which at least one pseudo-linear text contains all of the search terms. (Or use [[https://jena.apache.org/documentation/query/text-query.html|Jena Full Text Search]].) A sketch of such a query follows this list.
- **Ranking** Find the words on the returned pages that correspond to the search terms and rank the pages according to the topological proximity of those words, i.e. the proximity between their transcription positions; a sketch follows this list.
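The following is only a minimal sketch of the **Preparation** step, not the edition's actual machine reasoning: it assumes that every word carries an (x, y) transcription position and derives just two naive reading orders (line by line and column by column) per page; the `Word` class and its attribute names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x: float  # horizontal coordinate of the word's transcription position
    y: float  # vertical coordinate of the word's transcription position

def pseudo_linear_texts(words: list[Word]) -> list[str]:
    """Return alternative linearisations of the words on one manuscript page.

    Only two naive reading orders are produced here (line by line and column
    by column); the real preparation step would derive the candidate orders
    from the topological data of the transcription.
    """
    line_order = sorted(words, key=lambda w: (round(w.y), w.x))
    column_order = sorted(words, key=lambda w: (round(w.x), w.y))
    texts = {" ".join(w.text for w in order) for order in (line_order, column_order)}
    return sorted(texts)
```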
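A sketch of the **Search** step, assuming the data set is served from a SPARQL endpoint; the endpoint URL, the `ex:` namespace and the property `ex:hasPseudoLinearText` are placeholders, not the edition's actual vocabulary. One `FILTER regex(...)` is added per search term, so a page is returned only if at least one of its pseudo-linear texts contains every term.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://example.org/sparql"  # placeholder endpoint URL

def build_query(search_terms: list[str]) -> str:
    # One case-insensitive FILTER regex per search term; all filters apply to
    # the same ?pseudo_text, so every term must occur in one pseudo-linear text.
    # (Real user input should be regex-escaped before being inserted here.)
    filters = "\n".join(
        f'  FILTER regex(?pseudo_text, "{term}", "i")' for term in search_terms
    )
    return f"""
PREFIX ex: <http://example.org/ontology#>
SELECT DISTINCT ?page WHERE {{
  ?page ex:hasPseudoLinearText ?pseudo_text .
{filters}
}}
"""

def search_pages(search_terms: list[str]) -> list[str]:
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(build_query(search_terms))
    sparql.setReturnFormat(JSON)
    result = sparql.query().convert()
    return [b["page"]["value"] for b in result["results"]["bindings"]]
```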
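A sketch of the **Ranking** step, assuming the matching words and their transcription positions have already been retrieved for each returned page; the scoring function and the shape of the `matches` argument are illustrative assumptions, not the edition's implementation.

```python
from itertools import product
from math import hypot

def proximity_score(matches: dict[str, list[tuple[float, float]]]) -> float:
    """Score one page by the topological proximity of its matching words.

    `matches` maps each search term to the (x, y) transcription positions of
    the words on the page that match it. The score is the smallest total
    pairwise distance over all ways of picking one matching word per term;
    lower scores mean the terms lie closer together on the page.
    """
    best = float("inf")
    for combo in product(*matches.values()):
        total = sum(
            hypot(a[0] - b[0], a[1] - b[1])
            for i, a in enumerate(combo)
            for b in combo[i + 1:]
        )
        best = min(best, total)
    return best
```

The returned pages would then be presented in ascending order of this score, so that pages on which the search terms sit closest together appear first.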