It is of great interest to researchers and scholars in many disciplines (particularly those working on cultural heritage projects) to study parallel passages (i.e., identical or similar pieces of text describing the same thing) in digital text archives. Although a few software tools exist for this purpose, they are restricted to a specific domain (e.g., the Bible) or a specific language (e.g., Hebrew). In this paper, we present in detail how we build a digital infrastructure that can facilitate the search and discovery of parallel passages for any domain in any language. It is at the core of our Samtla (Search And Mining Tools with Linguistic Analysis) system, designed in collaboration with historians and linguists. The system has already been used to support research on five large text corpora that span a number of different domains and languages. The key to such a domain-independent and language-independent digital infrastructure is a novel combination of a character-based n-gram language model, a space-optimised suffix tree, and a generalised edit distance. A comprehensive evaluation through crowd-sourcing shows that the effectiveness of our system's search functionality is on par with human-level performance.
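To make one of the components named above more concrete, here is a minimal sketch of a generalised edit distance in Python. It is not the Samtla implementation: the abstract does not define the measure, so this assumes the common reading in which substitution costs depend on the pair of characters involved. The names `generalised_edit_distance`, `sub_cost`, `indel_cost`, and `case_insensitive_cost` are hypothetical and chosen for illustration only.

```python
from typing import Callable, Optional


def generalised_edit_distance(
    a: str,
    b: str,
    sub_cost: Optional[Callable[[str, str], float]] = None,
    indel_cost: float = 1.0,
) -> float:
    """Edit distance with a pluggable substitution cost (illustrative sketch)."""
    if sub_cost is None:
        # Default to the classic Levenshtein cost: 0 for a match, 1 otherwise.
        sub_cost = lambda x, y: 0.0 if x == y else 1.0

    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + indel_cost,          # delete ca from a
                    curr[j - 1] + indel_cost,      # insert cb into a
                    prev[j - 1] + sub_cost(ca, cb) # substitute ca with cb
                )
            )
        prev = curr
    return prev[-1]


def case_insensitive_cost(x: str, y: str) -> float:
    """Assumed example cost: case-only differences are nearly free."""
    if x == y:
        return 0.0
    if x.lower() == y.lower():
        return 0.1
    return 1.0


if __name__ == "__main__":
    print(generalised_edit_distance("kitten", "sitting"))
    print(generalised_edit_distance("God", "god", case_insensitive_cost))
```

Passing a different `sub_cost` is what makes the distance "generalised" in this sketch: a corpus-specific cost function could treat spelling variants or related characters as cheap substitutions, so that near-identical passages score as close matches.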