In the era of big data, data science research focuses on the collection, processing, and interpretation of large datasets to produce knowledge for decision-making in different application domains and contexts. The legal domain is one such domain where data science approaches can be applied. Indeed, thousands of legal documents are constantly produced by institutional bodies, such as Parliaments and Courts, which publish laws and court decisions (CDs) daily. Both law and CDs constitute prominent sources of knowledge. Law is general by definition and adopts abstract terminology that creates areas of uncertainty, whereas CDs provide concrete information about how the law is applied. For this reason, legal interpreters, such as judges, public prosecutors, and lawyers, analyze and evaluate CDs daily, a task that is currently performed mostly by hand by domain experts. Knowledge extraction approaches have been applied to legal texts, but the literature confirms that applying such techniques to CDs requires unique capabilities, due to the domain language and practice. More recently, many machine learning approaches applied to CD datasets have required manual annotation, which is expensive in terms of time, cost, expertise, and agreement, and which also raises several ethical issues. In this thesis, we address these issues by proposing CRIKE (CRIme Knowledge Extraction), a data science approach conceived to support legal knowledge extraction from a corpus of CDs, based on a reference legal domain ontology (LATO). First, we introduce the legal knowledge model of LATO, which captures and conceptually formalizes the features and nature of the terminology used in law and CDs as entities and relationships, and which is implemented in LATO using SKOS.
Then, we present CRIKE, which aims to progressively enrich the knowledge specified in a reference LATO ontology by extracting new concrete terminology associated with legal ontology concepts, as it occurs in the corpus. Knowledge extraction in CRIKE is based on multi-label annotation techniques, where the corpus of annotated CDs is built by relying on the ontology contents, without the need for manual annotation. Information retrieval techniques are applied to discover new terms that populate the term-sets of legal concepts recognized in the CD texts. To evaluate the CRIKE approach, we discuss experimental results from applying it to a dataset of 180,000 CDs of the State of Illinois taken from the Caselaw Access Project (CAP), which provides public access to U.S. CDs digitized from the collection of the Harvard Law Library. Finally, we discuss the applicability conditions and the ethical issues related to the access, storage, and processing tasks in CRIKE.
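
The two steps named in the abstract (ontology-driven multi-label annotation, followed by retrieval of new terms for the recognized concepts) can be illustrated with a minimal sketch. This is not the CRIKE implementation: the toy ontology, the miniature corpus, the function names, and the document-frequency ratio used for scoring are all assumptions made for illustration, and the thesis may use different annotation and ranking techniques.

```python
import re
from collections import Counter

# Hypothetical toy ontology: concept -> seed term-set (not taken from LATO).
ONTOLOGY = {
    "theft": {"stolen", "larceny"},
    "assault": {"battery", "struck"},
}

# Hypothetical miniature corpus standing in for court decision texts.
CORPUS = [
    "the defendant sold the stolen vehicle after the burglary",
    "larceny charge followed the burglary of the store",
    "the victim was struck during the fight",
    "battery conviction affirmed after the fight",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def annotate(text, ontology):
    """Multi-label annotation: every concept whose seed terms occur in the text."""
    tokens = set(tokenize(text))
    return {concept for concept, terms in ontology.items() if tokens & terms}


def candidate_terms(corpus, ontology, concept, top_k=3):
    """Rank unseen tokens by document frequency inside vs. outside the concept.

    The score (an assumed, simplistic one) is the number of concept-labeled
    documents containing the token divided by one plus the number of other
    documents containing it. Tokens seen in only one labeled document are skipped.
    """
    inside, outside = Counter(), Counter()
    for text in corpus:
        target = inside if concept in annotate(text, ontology) else outside
        target.update(set(tokenize(text)))
    known = ontology[concept]
    scores = {
        term: count / (1 + outside[term])
        for term, count in inside.items()
        if term not in known and count > 1
    }
    return sorted(scores, key=lambda term: (-scores[term], term))[:top_k]


if __name__ == "__main__":
    for concept in ONTOLOGY:
        print(concept, candidate_terms(CORPUS, ONTOLOGY, concept))
```

In this toy setting no document is labeled by hand: the labels come only from the seed terms, and the top-ranked candidate for each concept ("burglary" for the hypothetical "theft" concept) is a term that could be proposed for addition to its term-set. A realistic pipeline would at least filter stopwords and handle multi-word terms.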

LAW AND DATA SCIENCE: KNOWLEDGE MODELING AND EXTRACTION FROM COURT DECISIONS / M. Falduti ; supervisor: S. Castano; assistant supervisor: A. Ferrara ; headmaster of the phd. school: P. Boldi. Dipartimento di Informatica Giovanni Degli Antoni, 2021 Jan 15. 33. ciclo, Anno Accademico 2020. [10.13130/falduti-mattia_phd2021-01-15].

LAW AND DATA SCIENCE: KNOWLEDGE MODELING AND EXTRACTION FROM COURT DECISIONS

M. Falduti
2021

15 Jan 2021
Sector INF/01 - Computer Science
CASTANO, SILVANA
BOLDI, PAOLO
FERRARA, ALFIO
CASTANO, SILVANA
Doctoral Thesis
File: phd_unimi_R11975.pdf (open access; complete doctoral thesis; Adobe PDF, 1.14 MB)

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/799875