Centre for European Integration Research, IPW, Uni Vienna
When you want to…
import tweepy

# fill in your own API credentials first
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret, access_token, access_token_secret
)
api = tweepy.API(auth)
public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(tweet.text)
(Not today’s topic.)
Things to consider:
robots.txt
More information: monashdatafluency.github.io
Time to get our hands dirty.
Please go to the Kaggle notebook now.
Notebooks (e.g. Jupyter) are great…
They are not so great…
We’ll use a notebook for convenience, but you might want to install a “full-grown” Python environment later on. (Not today’s topic.)
Try a simple calculation, using a variable.
Just enter some code in the first cell of your notebook and hit Ctrl+Enter on your keyboard to evaluate it.
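For example, assigning numbers to variables and doing some arithmetic (the variable names here are made up):

apples = 3
oranges = 4
apples + oranges  # evaluates to 7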
There are more, but these are often used.
"The Answer" # this is a String
"""
The Answer to the
Ultimate Question of
Life, the Universe,
and Everything
""" # this is also a String
42 # this is an Integer number
1298.423 # this is a floating point number
True
False # these are Boolean values
["cat", "mouse"] # this is a List
{"name": "Barkley",
"species": "dog"} # this is a dictionary (Dict)
Slicing allows you to access a part of certain data types.
First element:
Returns a string (note the single quotes '…').
Third (!) element:
Last element:
Second to third element:
Returns a list (note the square brackets)!
Strings can be sliced, too.
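Putting the above together, a sketch of these indexing and slicing operations (the example values are made up; the original notebook cells are not reproduced here):

animals = ["cat", "mouse", "dog", "frog"]
animals[0]    # first element: 'cat' (a string, note the single quotes)
animals[2]    # third (!) element, since counting starts at 0: 'dog'
animals[-1]   # last element: 'frog'
animals[1:3]  # second to third element: ['mouse', 'dog'] (a list, note the brackets)
"Henning"[0]  # strings can be sliced, too: 'H'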
Dictionaries
'fox terrier'
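Dictionary values are accessed by their key. A sketch that would return the output above (the "breed" key is an assumption for illustration):

dog = {"name": "Barkley",
       "species": "dog",
       "breed": "fox terrier"}  # the "breed" entry is made up for this example
dog["breed"]  # returns 'fox terrier'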
Combining data
Different data types
Slice the first letter out of your name.
Just enter the code in a new cell of your notebook and hit Ctrl+Enter on your keyboard to evaluate it.
Use loops to repeat operations. We say, we “iterate over” data.
Strings:
Lists:
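A sketch of the kind of loop that produces the output below (the original notebook cell is not reproduced here):

for animal in ["cat", "mouse", "dog", "frog"]:
    print("I love my " + animal)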
I love my cat
I love my mouse
I love my dog
I love my frog
Other iterables:
Note the white-space! Everything inside a loop is indented. Spaces and new lines inside brackets are ignored.
Write a loop that prints each letter in your name on a separate line.
Use if (elif and else) to make code execution conditional.
certainly
surely not
certainly not
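A sketch of a conditional that would print one of the answers above (the variable and the conditions are made up):

answer = 42
if answer == 42:
    print("certainly")
elif answer > 0:
    print("surely not")
else:
    print("certainly not")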
Everything after the if statement is indented.
Logical operators are useful for writing conditions:
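For example (a sketch; the values are made up):

answer = 42
answer > 0 and answer < 100  # True: both conditions hold
answer < 0 or answer == 42   # True: at least one condition holds
not answer == 42             # False: negation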
Everything in Python is an object. Depending on their class, objects have different attributes and methods.
For example, a cup of tea is an object.
You can access the attributes and methods of a Python object using the dot (.) operator. Think of it as right-clicking on a file in your computer and selecting a command from the context menu.
For example, strings have a .capitalize() method that capitalizes the first letter of the string.
Or a .count() method that counts substring occurrences.
Or .split() and .join() methods.
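A sketch of these string methods in action (the example string is made up):

s = "the answer to the ultimate question"
s.capitalize()        # 'The answer to the ultimate question'
s.count("the")        # 2
words = s.split(" ")  # ['the', 'answer', 'to', 'the', 'ultimate', 'question']
" ".join(words)       # puts the pieces back together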
Modules are ready-made Lego™ bricks you can use to build something new.
There are currently 417,030 modules on the main repository pypi.org.
Here we import the requests module, then the BeautifulSoup class from the bs4 module (one particular brick from a Lego™ pack).
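In code (these two lines reappear in the compact script at the end):

import requests                # for downloading web pages
from bs4 import BeautifulSoup  # for parsing html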
There are many Python modules you can use to scrape websites. A popular selection:
| Module | Focus |
|---|---|
| Scrapy | complex web scraping/crawling framework |
| Selenium | remote-control for a web-browser, used when content is added by JavaScript |
| BeautifulSoup | simple but powerful parsing library for html content |
R also has web scraping modules. (Not today’s topic, but see Web Scraping with R.)
Eur-Lex is the main legal database of the European Union. It includes, for example, the EU treaties, legislation, and case law.
Data can be accessed by various search interfaces. Some parts of the data are standardized and can be exported for download, obviating the need for web-scraping. An R package (by Michal Ovádek) for accessing standardized Eur-Lex data is available.
Eur-Lex’s terms of service are permissive. While they don’t explicitly mention web-scraping, they don’t exclude it either.
Please go to eur-lex.europa.eu.
Voilà! These are all judgments ever issued by the European Court of Justice.
But that’s a lot.
Let’s try some scraping.
Imagine we’re interested in the European Court of Justice (ECJ). We want a list of all the litigants that were ever involved in an ECJ proceeding.
There is no standardized litigant information on Eur-Lex that we could download using the export function. But litigants are mentioned in the text of each court ruling. This is a good use-case for web-scraping.
So we have to (1) retrieve the search results, (2) extract the URL of each judgment, (3) download each judgment, and (4) extract the litigants from its text.
The following code examples are available in the presentation notebook.
Go ahead and follow along.
How do we get hold of the web-page with the search results?
That’s the job of the requests module.
The search query is encoded in the URL (link) of the search results page.
Copy and paste the URL into the script and assign it to a variable, e.g. query_url.
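For example (this is the same URL used in the compact script at the end):

query_url = "https://eur-lex.europa.eu/search.html?SUBDOM_INIT=EU_CASE_LAW&DB_TYPE_OF_ACT=judgment&DTS_SUBDOM=EU_CASE_LAW&typeOfCourtStatus=COURT_JUSTICE&DTS_DOM=EU_LAW&typeOfActStatus=JUDGMENT&page=1&lang=en&CASE_LAW_SUMMARY=false&type=advanced&DB_TYPE_COURT=COURT_JUSTICE"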
Next, we pass the URL to requests’ .get() function, immediately assigning the output to a variable (r).
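In code:

r = requests.get(query_url)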
r now contains the html of the search results page and some other information. We can access the html part using the .text attribute. For the first 250 letters:
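r.text[:250]  # the first 250 characters of the html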
Here are the next 750 letters:
' <meta name="viewport" content="width=device-width, initial-scale=1">\n \n \n \n <script type="text/javascript" src="./revamp/components/vendor/modernizr/modernizr.js?v=2.10.6"></script>\n \n \r\n\r\n\r \r \r \r \r \r \r\n\r\n\r\n\r \r \r \r \r \r\n\n \n \n <title>Search results - EUR-Lex</title> \n \r\n\r\n\r\n\n <meta name="WT.cg_n" content="Search"/><meta name="WT.cg_s" content="Search results"/><meta name="WT.pi" content="Search result page"/><meta name="DCSext.w_oss_metadata_field" content="Document reference"/><meta name="DCSext.w_oss_metadata_collection_type" content="Single Collection"/><meta name="DCSext.w_oss_metadata_collection_option" content="EU case law"/><meta name="WT.z_usr_lan" content="en"/><meta name="WT.seg_1" content="'
But this is just garbage!
Technically, it’s a tag soup.
BeautifulSoup makes that tag soup readable.
We use it to parse the tag soup and navigate the resulting html tree.
“What html tree,” you ask?
<html>
  <head>
    <title>A Web Page</title>
  </head>
  <body>
    <p id="author">Henning Deters</p>
    <p id="subject">A Web-Scraping Primer with Python</p>
    <a href="https://en.wikipedia.org/">A link to Wikipedia</a>
  </body>
</html>
The Eur-Lex website uses the same html syntax. It’s just a “bit” more complicated – and messy!
- Tags mark up parts of the page: <p> … </p> for paragraphs, <a> … </a> for links etc.
- Tags can carry attributes, such as id=... and href=...
- Tags are nested: the <p> sits between <body> tags, which are in <html>.
Please go to the search results, then hit F12 (Firefox) or Shift-Ctrl-J (Chrome) to open the developer tools.
It can be useful to examine the full html source. In Firefox, right-click anywhere and select “View Page Source”.
The snippet we are looking for is located
- in an a tag (hyperlink),
- nested within an h2 tag (heading level 2),
- within a div tag with the class attribute “SearchResult”,
- within even more levels …
The URL of the result appears twice. We want the second appearance, assigned to the name attribute.
We create a soup object by feeding the stuff we downloaded to BeautifulSoup. The second argument ("html5lib") tells BeautifulSoup that we’re feeding it html.
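In code (as in the compact script at the end):

soup = BeautifulSoup(r.text, "html5lib")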
The soup object comes with useful methods. .find_all() finds all occurrences of a search term. Remember: we’re looking for something that’s nested within a div tag with the class “SearchResult”.
.find_all() expects the first argument to be a string with the tag, and the optional second argument to be a dictionary.
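In our case:

search_results = soup.find_all("div", {"class": "SearchResult"})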
.find_all() returns an object similar to a list. Thus we can access the first result by attaching an index.
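For example:

search_results[0]  # the first search result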
Here is the relevant part of the output.
<div class="SearchResult" xmlns="http://www.w3.org/1999/xhtml"><h2><a class="title" href="./legal-content/AUTO/?uri=CELEX:62020TJ0279&qid=1669282930007&rid=1" id="cellar_96f68f89-6b2b-11ed-9887-01aa75ed71a1" name="https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62020TJ0279"> Judgment of the General Court ... Manifest errors of assessment.<br/>Case T-279/20.</a></h2>
As discovered earlier, the URL of the result sits within <a> tags, which sit within <h2> tags.
Since there is only one URL per result, we can ignore the <h2> tags and look just for the <a>:
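search_results[0].find("a")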
<a class="title" href="./legal-content/AUTO/?uri=CELEX:62020TJ0279&qid=1669282930007&rid=1" id="cellar_96f68f89-6b2b-11ed-9887-01aa75ed71a1" name="https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62020TJ0279"> Judgment of the General Court ... Manifest errors of assessment.<br/>Case T-279/20.</a>
The URL after href is what your browser opens when you click on a link, but it’s truncated. Since we’re too lazy to add the missing part, we use the URL after the name attribute.
Attributes can be accessed like a dictionary. We pass "name" as key and are served the corresponding value (the URL).
This gets us the URL of the first search result.
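In code:

search_results[0].find("a")["name"]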
'https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62021CJ0370'
To extract all relevant URLs, we iterate over all search results. Instead of printing them out, we append them to an empty list for later use.
urls = []
for s in search_results:
    urls.append(s.find("a")["name"])
urls[0:5]  # take a look at the first 5
['https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62021CJ0370',
'https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62020CJ0269',
'https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62021CJ0512',
'https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62020CJ0653',
'https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62021TJ0611']
Now that we have the URLs for all judgments, we can download each one. This might take a minute or two.
We’re just recycling code from earlier and placing it in a loop. Not terribly elegant, but fairly straightforward by now.
contents = []
for u in urls:
    print("Scraping", u)
    r = requests.get(u)
    soup = BeautifulSoup(r.text, "html5lib")
    contents.append(soup)
Scraping https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62020CJ0638
Scraping https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62021CJ0658
etc.
Please open one of the search results (for example this one) and inspect the text of the first judgment, using the developer tools.
The snippet we are looking for is located within a div tag with the id attribute “text”. It corresponds to the box that holds the text of the ruling.
Now we work with the html of all judgments, which we stored in the variable contents. We narrow down the search to the main text, delimited by <div id="text">.
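A sketch of this step, printing the type of each result as in the output below (the original loop is not reproduced in this extract; the variable names follow the compact script at the end):

ruling_texts = []
for c in contents:
    ruling_texts.append(c.find("div", {"id": "text"}))
    print(type(ruling_texts[-1]))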
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'NoneType'>
<class 'NoneType'>
The type() function returns the type of BeautifulSoup’s output. Some results are of NoneType, which means these particular pages did not contain the div we were looking for.
Let’s open the corresponding URL from our list of search results to look at the offending page.
The ruling has not (yet) been translated into English, therefore the page does not have its text on it!
We’ll keep this in mind for the methodology section, and just ignore all pages that lack the text of the ruling.
The if statement means we include only those results in our rulings list that are not None.
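A sketch of that filtering step (the rulings name comes from the text; the loop itself is an assumption):

rulings = []
for t in ruling_texts:
    if t is not None:  # keep only pages that contain the ruling text
        rulings.append(t)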
Now we have a list containing the text of all the rulings. But we’re really only interested in the litigants.
Inspecting one of the search results once more, we find that the litigants are located
- within <b> tags (for bold print), which are
- within <p> tags with the class “C02AlineaAltA”.
This is not always true, but for simplicity we pretend it is.
The .find_all_next() method of BeautifulSoup finds all elements that match a search criterion and appear after the current element.
Here we tell Python to give us a list of all <p> tags with the class “C02AlineaAltA” after the <div> that marks the text of the ruling on the web page. We repeat this for all ruling texts, using a for loop.
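A sketch of that loop (variable names follow the earlier steps):

paragraphs = []
for t in rulings:
    paragraphs.append(t.find_all_next("p", {"class": "C02AlineaAltA"}))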
We further narrow down the relevant paragraphs to bold text (e.g. in <b> tags).
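The numbered loop that the “Line 4” to “Line 6” notes below refer to is not reproduced in this extract; a sketch whose line numbers match the notes:

bold = []                        # line 1
for p in paragraphs:             # line 2: one list of <p> tags per ruling
    for x in p:                  # line 3: each paragraph in turn
        b = x.find("b")          # line 4: the bold part, or None
        if b is not None:        # line 5: skip paragraphs without bold text
            bold.append(b.text)  # line 6: keep just the text
bold[0:5]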
['DOMUS-Software-AG',
'Marc Braschoß Immobilien GmbH,',
'Finanzamt T',
'S,',
'Aquila Part Prod Com SA']
Line 4: Again, the .find() method returns None if the current soup object does not contain a <b> tag.
Line 5: We use if to ensure that those don’t end up on our list of bold paragraphs.
Line 6: The .text attribute returns just the text between the tags, omitting the <b> … </b>.
.strip() gets rid of spaces left and right. .rstrip(",") removes the commas on the right. We can chain one after the other.
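Chained together (assuming bold holds the text strings from the step above):

litigants = [x.strip().rstrip(",") for x in bold]
litigants[0:5]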
['DOMUS-Software-AG',
'Marc Braschoß Immobilien GmbH',
'Finanzamt T',
'S',
'Aquila Part Prod Com SA']
This is a “list comprehension”, often an elegant alternative to a for loop.
We could have done all of this in a few lines.
import requests
from bs4 import BeautifulSoup

# download and parse the search results page
query_url = "https://eur-lex.europa.eu/search.html?SUBDOM_INIT=EU_CASE_LAW&DB_TYPE_OF_ACT=judgment&DTS_SUBDOM=EU_CASE_LAW&typeOfCourtStatus=COURT_JUSTICE&DTS_DOM=EU_LAW&typeOfActStatus=JUDGMENT&page=1&lang=en&CASE_LAW_SUMMARY=false&type=advanced&DB_TYPE_COURT=COURT_JUSTICE"
soup = BeautifulSoup(requests.get(query_url).text, "html5lib")

# extract the URL of each search result
search_results = soup.find_all("div", {"class": "SearchResult"})
search_results = [x.find("a", {"class": "title"}) for x in search_results]
urls = [x["name"] for x in search_results]

# download each judgment and narrow down to the text of the ruling
contents = [BeautifulSoup(requests.get(u).text, "html5lib") for u in urls]
ruling_texts = [x.find("div", {"id": "text"}) for x in contents]

# extract the bold parts of the relevant paragraphs and clean them up
paragraphs = [x.find_all_next("p", {"class": "C02AlineaAltA"}) for x in ruling_texts if x is not None]
bold = [x.find("b") for y in paragraphs for x in y]
litigants = [x.text.strip().rstrip(",") for x in bold if x is not None]
Very compact, but much harder to understand. When writing code, make sure you’ll be able to go back to it months later.
If we ran our unmodified script over the 400 most recent search results, we could visualize the five most frequent litigants like this.
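A sketch of how such a chart could be produced with the standard library’s Counter and matplotlib (the original plotting code is not reproduced here, so this is an assumption):

from collections import Counter
import matplotlib.pyplot as plt

top5 = Counter(litigants).most_common(5)  # the five most frequent litigants
names, counts = zip(*top5)
plt.barh(names, counts)
plt.xlabel("Number of proceedings")
plt.show()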
A few loose ends remain:
1. We only scraped the first page of results. The second page uses &page=2 instead of &page=1. By now you can probably guess how to scrape all results. (Hint: it involves a loop.)
2. The results only live in the notebook. You can save them to a file (using the open() command).
3. The litigant names could be cleaned up further (e.g. with regular expressions, using the re module).
The final section of the presentation notebook contains improvements on points 1 – 3.
For convenient analysis, you’ll want to combine different variables and store those that belong to the same search result in a dictionary (easily exported as a versatile json file) or in a rectangular shape (e.g. a Pandas DataFrame).
But that’s a different topic to explore.
All images under Unsplash license