Improving the analysis of legal texts with data
Using the example of the Cyber Resilience Act, and manipulating the raw data from EUR-Lex, let's see what we can do to improve the reading experience.
Each time a new draft law is published, the whole Brussels bubble rushes to the European Commission's press release and the machine is set in motion: highlighters scramble over PDF documents, consultants go back and forth between the article containing the definitions and the publication, lobbyists get impatient, and all the articles have to be copied and pasted one by one into the four-column document...
Now, what if I told you that the cost of repeating the same actions over and over and compiling information could be greatly reduced? For public affairs consultants - given the huge investment in lobbying, I'm sure there's a business to grab - but also for citizens and SMEs struggling to keep up with the multitude of draft laws the Commission has on the table.
To be fair, European public institutions are trying to make laws and the legislative process easier to understand. Initiatives such as OEIL or the Legislative Train are excellent examples in this respect. But neither can cover every use case. That is why it is so important to give developers and entrepreneurs the opportunity to do something meaningful with the data.
In this blog post, we're going to dive together into the mind of a developer, step by step. Starting from a concrete use case - automatically improving the reading of a piece of European legislation, the brand new Cyber Resilience Act - I will try to define the main concepts of data analysis while underlining the limits of the current setup.
🚀 See the result of the project.
A small project
Among the files I am responsible for is the European digital identity, whose ambitions the Commission wants to raise with a new legislative proposal, eIDAS 2.0. In my extensive reading on this subject, I came across a Council document containing a link to an annotated version of the legislative file.
The document was created by Tim Speelman, a developer and public servant at the Dutch Ministry of the Interior. Tim took the text and added definitions of terms, a table of contents, and cross-references between articles.
You have no idea how much this site has made my work easier.
The question I asked myself right away was: would it be possible to automate this process for any piece of legislation? And if so, what requirements should I address?
Objectives
I decided to focus on three main objectives:
- To improve the reading of the document by minimising the number of steps the reader has to perform. This means: displaying the definitions of the terms referred to throughout the text on mouse-over; providing the names of the referenced texts to avoid having to look up each document one by one; and offering the option of sharing an article directly with someone, i.e. creating anchors that can be used in the text.
- To reduce the number of repetitive tasks, or at least optimise them. To do this, I decided to automatically generate a four-column document that can be updated with Parliament's amendments (thanks to Parltrack's excellent data-mining work), and to make it possible to copy and paste articles simply, without formatting.
- To create intelligence from patterns that can be found in the text, for example by automatically extracting references to other legislative files or implementing acts.
Make it usable for every text
To avoid duplicating work as much as possible, I also set myself the goal of making this application as agnostic as possible. While the Cyber Resilience Act was chosen as the example, the algorithmic design should work for any European legislative file.
In search of a machine-readable format
Many projects use data to respond to specific and pressing issues. Using the list of vaccination centres in France, Guillaume Rozier, a 25-year-old French engineer, has created an effective and simple application for getting vaccinated against COVID-19. Behind this initiative lies Open Data: publicly available and usable data in a format that can be easily read by a computer.
What is a machine-readable format?
To perform algorithmic operations on a dataset, it is important to properly categorise the data, i.e. to associate a label with each piece of information, which can then be used to compare similar data.
While completing an Excel file is relatively easy for a human being, it is usually much more difficult for a machine. In a long paragraph, how do you know which label to put on which word? Some information is easy to extract, such as a date, while other information is much less explicit and may require complex algorithms.
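For instance, a date can often be caught with a simple pattern, while labelling the rest of a paragraph is much harder. A quick sketch, assuming `$paragraph` holds the text to analyse:

```php
// Easy: pull a date like "4.7.2022" out of a free-text paragraph.
preg_match('/\d{1,2}\.\d{1,2}\.\d{4}/', $paragraph, $match);
$date = $match[0] ?? null;
```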
To minimise the number of operations required, developers store data in tables (which make up databases) and can exchange this data through APIs. One of the most popular exchange formats is json, which keeps labels attached to data sets so that a machine can browse them quickly.
Let's put a picture on these tables with an example I recently worked on: parliamentary questions. The questions are accessible from the European Parliament's website in HTML, DOC and PDF format, three formats that are intended for a human reader. In an ideal world, the Parliament would provide another format that might look like this:
Table format:

| Question | |
|---|---|
| Title | Meeting between Danish authorities and the Commission in connection with the Recommendation of 14 July 2020 |
| Date | 04 July 2022 |
| Number | E-002421/2022 |
| Type | Written |
| Rule | 138 |
| From.id | 197571 |
json format:
```json
{
    "title": "Meeting between Danish authorities and the Commission in connection with the Recommendation of 14 July 2020",
    "date": "4.7.2022",
    "number": "E-002421/2022",
    "type": "Written",
    "rule": 138,
    "from": [
        {
            "id": "197571",
            "first_name": "Nikolaj",
            "last_name": "Villumsen",
            "country": "DK",
            "group": "4277"
        }
    ]
}
```
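Once the data is labelled like this, a machine no longer has to guess anything. As a minimal sketch (assuming the json above is saved as question.json, a hypothetical file name), a few lines of PHP are enough to read it back:

```php
<?php
// question.json is a hypothetical file holding the json document above.
$question = json_decode(file_get_contents('question.json'), true);

echo $question['title'];         // the full question title
echo $question['from'][0]['id']; // "197571"
```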
Linking data together
If an MEP has his full name ("Predrag Fred Matić") displayed on parliamentary questions but his short name ("Predrag Matić") on another part of the site, a machine might think that they are two different people. Conversely, if two MEPs share the same name, an algorithm could easily assume that they are the same person.
To avoid duplication and overlap of data, engineers use identifiers, abbreviated as ids. These identifiers are unique numbers associated with a single dataset. An MEP's id can be found in the URL of the EP website: MEP Predrag Fred Matić was assigned the id 197441.
🤔 Try changing the name of the MEP in the URL: the page won't change, because the EP website bases its lookup only on the id!
In short, machine-readable and usable data is the combination of these two elements: well-labelled and well-identified data.
The EUR-Lex API
Now back to our original project, improving the text of the Cyber Resilience Act. Does EUR-Lex have an easily accessible API for this data? Well, the answer is not so obvious.
EUR-Lex has an API - though not always an easy one to use - for its metadata, i.e. the set of data used to describe its legislative texts. It is thus possible, for example, to browse its huge database to identify texts concerning a particular country or theme, or to retrieve all the identifiers of texts containing "cybersecurity" in their title.
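As an illustration, here is a minimal sketch of such a metadata query against the public Cellar SPARQL endpoint. The cdm: property names are assumptions on my side and should be checked against the official CDM documentation:

```php
<?php
// A sketch only: the cdm: property names are assumptions and may need
// to be verified against the Cellar/CDM ontology documentation.
$query = <<<'SPARQL'
PREFIX cdm: <http://publications.europa.eu/ontology/cdm#>
SELECT DISTINCT ?work ?title WHERE {
    ?expression cdm:expression_belongs_to_work ?work ;
                cdm:expression_title ?title .
    FILTER (CONTAINS(LCASE(?title), "cybersecurity"))
}
LIMIT 20
SPARQL;

$url = 'http://publications.europa.eu/webapi/rdf/sparql'
    . '?format=' . urlencode('application/sparql-results+json')
    . '&query=' . urlencode($query);

$results = json_decode(file_get_contents($url), true);

foreach ($results['results']['bindings'] as $row) {
    echo $row['title']['value'] . PHP_EOL;
}
```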
However, the text itself remains accessible only in HTML, DOC and PDF format, as for our parliamentary questions. Unfortunately, we will have to make do with that and use data mining techniques in conjunction with the API.
💡 In 2017, the EUR-Lex Cellar contained:

- 246 million identifiers
- 127 million files

for a total of:

- 27.5 TB of static files
- 4.1 TB of database
Data mining to the rescue
In the absence of machine-readable data, it is possible to manually teach a machine to separate the data through data mining techniques. Often very specific, these algorithms can rarely be reused outside their original purpose.
The principle is relatively simple: from the HTML code of a page, it is technically possible to identify particular areas and associate a label with each of them.
HTML consists of a series of elements. Elements are defined by an opening tag (`<p>` for paragraph) and a closing tag (`</p>`), wrapping a piece of text.
```
<p>This is a paragraph</p>
 ↑          ↑            ↑
opening   content     closing
  tag                    tag
```
It is possible to pass parameters to a tag this way: `<p parameter="value">`.
Let's take the example of three HTML paragraphs (`<p>`) and a list (`<ol>`):
<p class="title">Cyber Resilience Act</p>
<p>Amending Regulation (EU) 2019/1020</p>
<p class="article">
<span>Article 1</span>
<ol>
<li>First point.</li>
<li>Second point.</li>
</ol>
</p>
Attached to the first paragraph is a `title` class. It is possible to target this class specifically and associate the title label with it. From there, it is also possible to target the element that follows it and get the amending regulation.
Let's see what this could look like with Symfony's DomCrawler component:
```php
<?php
use Symfony\Component\DomCrawler\Crawler;

// $html contains the snippet above.
$crawler = new Crawler($html);

// Will return: title = "Cyber Resilience Act"
$title = $crawler->filter('.title')->text('');

// Will return: regulation = "Amending Regulation (EU) 2019/1020"
$regulation = $crawler->filter('.title')->nextAll()->first()->text('');

// Will return: articleParagraphs = ["First point.", "Second point."]
$articleParagraphs = $crawler->filter('.article ol li')->each(function (Crawler $node, $i) {
    return $node->text();
});
```
ℹ️ The purpose of this article is not to give a lecture on data mining, but simply to give a few keys to understanding what newspaper articles refer to.
Mining the EUR-Lex HTML page
Depending on the structure of an HTML page, targeting these elements can be more or less simple. Let's look at the EUR-Lex source code.
EUR-Lex uses a specific HTML structure that makes it possible to tell a data mining algorithm which part to look at first. For example, all headings have the class `Heading0`, and articles are located in paragraphs with the class `Normal`.
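Listing all the headings of the document then comes down to a single selector. A small sketch, assuming `$crawler` wraps the EUR-Lex page as in the previous example:

```php
// Collect the text of every heading through its EUR-Lex class
// (assumes the Crawler import from the earlier example).
$headings = $crawler->filter('.Heading0')->each(
    fn (Crawler $node) => $node->text('')
);
```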
ℹ️ To make sure an algorithm can be reused as much as possible, and to avoid significant corrections whenever the website is updated, it is important to rely on as few specifics as possible.
The structure of the page can be summarised as follows:
<div class="contentWrapper">
<!-- First references -->
<p class="Emission">Brussels, 15.9.2022</p>
<p class="Rfrenceinterinstitutionnelle">2022/0272(COD)</p>
<p class="Titreobjet_cp">on horizontal cybersecurity requirements for products with digital elements and amending Regulation (EU) 2019/1020</p>
<!-- Start of recital -->
<p class="Institutionquiagit">...</p>
<p class="Normal">Treaty...</p>
<!-- ... -->
<p class="Normal">Whereas...</p>
<p class="li ManualConsidrant">Recital</p>
<!-- ... -->
<p class="li ManualConsidrant">Recital</p>
<!-- Start of Articles -->
<p class="Formuledadoption">HAVE ADOPTED THIS REGULATION:</p>
<!-- First chapter -->
<div class="Titrearticle0">CHAPTER I</div>
<div class="Titrearticle0">GENERAL PROVISIONS</div>
<!-- First article -->
<div class="Titrearticle0">Article 1</div>
<div class="Titrearticle0">Subject matter</div>
<p class="Normal">This Regulation lays down:</p>
<p class="li Point0">
<span class="num">(a)</span>
First point
</p>
</div>
Unfortunately, the document is not very well structured: the elements simply follow one another at the same level. This means that a nested search approach will not be possible; we will have to rely on a linear one.
For example, to retrieve chapters and articles, the rules can be summarised as follows:
- The first chapter starts after the `Formuledadoption` class.
- If a `Titrearticle0` element contains the text "CHAPTER [IVX]+", then a chapter is starting. The next element is the chapter title, and the two after that are an article's number and title.
- All elements between two `Titrearticle0` elements are an article's content.
- The paragraph number is stored in the `num` class.
With a little work, it is possible to mine the articles. Here is a very short sketch - far from the full algorithm - of what a DomCrawler-based linear pass in PHP might look like, following the rules above:
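```php
<?php
use Symfony\Component\DomCrawler\Crawler;

// A sketch of the linear pass: the class names come from the EUR-Lex
// source above, but the logic is simplified compared to the real project.
// $crawler wraps the EUR-Lex page (see the previous examples).
$articles = [];
$current = null;
$started = false;

foreach ($crawler->filter('.contentWrapper > *') as $element) {
    $node = new Crawler($element);
    $class = $node->attr('class') ?? '';
    $text = trim($node->text(''));

    // Rule 1: articles only start after the adoption formula.
    if (str_contains($class, 'Formuledadoption')) {
        $started = true;
        continue;
    }
    if (!$started) {
        continue;
    }

    if (str_contains($class, 'Titrearticle0')) {
        // Rule 2: a new article starts; flush the previous one.
        if (preg_match('/^Article \d+$/', $text)) {
            if ($current !== null) {
                $articles[] = $current;
            }
            $current = ['number' => $text, 'content' => []];
        }
        // "CHAPTER [IVX]+" headings and article titles would be handled here too.
        continue;
    }

    // Rule 3: everything between two Titrearticle0 elements is article content.
    if ($current !== null) {
        $current['content'][] = $text;
    }
}

if ($current !== null) {
    $articles[] = $current; // don't forget the last article
}
```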
Inconsistencies that make the data harder to read
Unfortunately, some inconsistencies in the EUR-Lex source code made the project more complex. For example, all paragraphs in the recitals have the `ManualConsidrant` class, except for one.
Another example: NIS2, the directive on cybersecurity of critical entities, is mentioned 23 times in the text with 5 different spellings, making it even more complex to parse the text.
The text is also randomly wrapped in `<div class="content">` elements, making it impossible to parse the articles correctly in one go.
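One workaround, sketched below with PHP's DOM API, is to flatten these wrappers before parsing, lifting their children up one level:

```php
// Lift the children of each <div class="content"> out of the wrapper,
// then drop the wrapper itself, so every paragraph ends up on one level.
foreach ($crawler->filter('div.content') as $div) {
    while ($div->firstChild) {
        $div->parentNode->insertBefore($div->firstChild, $div);
    }
    $div->parentNode->removeChild($div);
}
```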
Building the project
After about 500 lines of code and a weekend of work, I was able to develop an algorithm to transform a text from EUR-Lex into a json format.
I am making this export available for free on the project mini-site under the CC BY-NC-SA 4.0 license.
Based on this json format, it is pretty easy to parse the text and extract relevant information. You will find a few example features in the next chapter, and a small sketch of one such operation below.
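Here, the sketch extracts references to other legislation from the export. The file and field names ('articles', 'content') are hypothetical; the real export's schema may differ:

```php
<?php
// Hypothetical file and field names; adapt them to the real json export.
$act = json_decode(file_get_contents('cyber-resilience-act.json'), true);

$references = [];
foreach ($act['articles'] as $article) {
    foreach ($article['content'] as $paragraph) {
        // Match patterns like "Regulation (EU) 2019/1020".
        preg_match_all('/(?:Regulation|Directive|Decision) \(EU\) \d{4}\/\d+/', $paragraph, $matches);
        $references = array_merge($references, $matches[0]);
    }
}

$references = array_unique($references);
```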
List of features
🚀 See the result of the project.
- Annotated text with references to definitions, footnotes and other legislation
- List of delegated and implementing acts
- Export in json format
Next steps
This project was always intended to be simple and limited in time: a form of experimentation and a small personal challenge.
However, if I get good feedback from you, I will be very happy to update the project and possibly extend it to other legislation I am working on.
On my to-do list:
- [Bug fix] Some definitions are not parsed correctly because of overlaps, for example "vulnerability" and "actively exploited vulnerability".
- Add cross-references to articles to see where they are referenced in the text.
- List article numbers in the interplay.
- Add the recitals.