This talk explores how natural language processing can connect diverse scholarly objects at two levels of scientific publication: the content level of scholarly documents and the knowledge level, which includes metadata and databases of scholarly entities. First, at the content level, the talk illustrates how natural language processing can capture the semantics of non-linguistic objects in scholarly documents. The most important step at this level is recognizing the document structure in order to associate natural language descriptions with non-linguistic elements such as tables, figures, and mathematical formulae. In a scholarly document, the layout, logical, and semantic structures are inseparably interlinked. This entanglement makes structure analysis challenging, whether it is treated as a preprocessing step, as in conventional approaches, or as a vision-language understanding task, as in some recent studies. As an illustrative example, the talk describes our attempt to extract and analyze mathematical formulae in documents. Second, at the knowledge level, the talk gives a brief overview of information linkage. Since the early work on record linkage, integrating distributed bibliographic catalogs has been a central issue in the operation of digital libraries. In recent years, knowledge databases have also grown rapidly across scientific disciplines. Natural language processing is used to identify tuples of scientific entities and their relations based on their semantic interpretation. In particular, coreference resolution and entity linking serve as key techniques for connecting unstructured text to existing knowledge resources. Despite this long history, several issues remain unresolved. Lastly, the talk introduces recent advances toward a national-level research data infrastructure in Japan.
Akiko AIZAWA is a professor at the National Institute of Informatics (NII) and currently serves as Vice Director-General of NII. Aizawa is also an adjunct professor at the University of Tokyo and at the Graduate University for Advanced Studies. Aizawa’s research interests include natural language understanding, dialogue systems, text-based content and media processing, and information retrieval. Aizawa has served as an organizer and Program Committee member of related conferences and workshops, and organized the mathematical formula retrieval tasks at NTCIR-10, 11, and 12.
FAIR (Findable, Accessible, Interoperable, Reusable) Data and, increasingly, FAIR Software are key concepts being adopted by the scientific community to support the validation of results and the reuse of digital objects. FAIR principles and roadmaps provide explicit models of how sharing is to be implemented across the data and software lifecycle, including publishing.
Our team has developed FAIRSCAPE, a reusable cloud-based framework for digital commons environments and reproducibility in biomedicine, to provide support for FAIR methods from the inception of research projects, deep provenance of results, and robust, configurable security for protected health information. Our implementation at the University of Virginia has been used to run thousands of analyses on biomedical datasets. This talk will describe the use cases, implementation, and application of our reusable framework.
It will also present the Evidence Graph Ontology, a vocabulary for computational reproducibility that can be used independently of FAIRSCAPE, and will discuss various problems and developments relating to FAIRness in science communication.
Timothy Clark, Ph.D., is an Associate Professor of Public Health Sciences and Data Science at the University of Virginia. He has over three decades of experience in leading, developing, and contributing to large-scale bioinformatics platforms, including significant work on interoperability and FAIRness approaches. Prof. Clark was a Founding Director of the FORCE11 consortium, led development of the FORCE11 Data Citation Roadmaps, and co-authored the FAIR Data Principles. His current research focuses on cloud-based systems for FAIR data, software and computation in biomedical research. Prof. Clark holds a Ph.D. in Computer Science from the University of Manchester.