A Crash-Course on Web-Scraping with Python

Henning Deters | henning.deters.me

Centre for European Integration Research, IPW, Uni Vienna

What Is Web-Scraping?
What Is It (Not) Good For? How Does It Work?

What It Is


   


  • Whatever a web-browser can display, you should be able to download to your computer.
  • Right?

When to Use It

When you want to…

  • download many pages.
  • download partial information from many pages.
  • download data in a useful format for further processing.

When Not to Use It (Alternatives)

  • when the data is otherwise available. (ask!)
  • when there is an API for pain-free access.
import tweepy

# authenticate with your own Twitter API credentials
auth = tweepy.OAuth1UserHandler(
   consumer_key, consumer_secret, access_token, access_token_secret
)

api = tweepy.API(auth)

# fetch the tweets from your home timeline and print them
public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(tweet.text)

(Not today’s topic.)

  • also: When you really shouldn’t. (ask!)

Web-scraping Ethics

Things to consider:

  • Use the API, if there is one
  • Respect the robots.txt
  • Don’t flood the server with requests (a short sketch for these two points follows this list)
  • Is the data copyrighted?
  • What are the privacy implications?
  • Read the terms of service
  • When in doubt: ask!
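
A minimal sketch of those two points, checking robots.txt and pausing between requests, using Python’s built-in urllib.robotparser and time modules (the URLs are placeholders, and every site’s rules are different):

import time
from urllib import robotparser

import requests

# read the site's robots.txt once (placeholder URL)
rp = robotparser.RobotFileParser()
rp.set_url("https://example.org/robots.txt")
rp.read()

pages = ["https://example.org/page1", "https://example.org/page2"]  # placeholder URLs
for url in pages:
    if rp.can_fetch("*", url):   # are we allowed to request this page?
        r = requests.get(url)
        # ... process r.text ...
    time.sleep(2)                # pause, so we don't flood the server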

More information: monashdatafluency.github.io

How to Do It

Time to get our hands dirty.



Please go to the Kaggle notebook now.

www.kaggle.com

How to Do It Part 1: A 10-Minute Python Primer

Notebooks (e.g. Jupyter) are great

  • for combining text and code
  • for trying out stuff
  • for quick and somewhat reproducible analysis

They are not so great…

  • for complex projects
  • for debugging
  • for version control (but it depends)
  • for programs that run mostly “behind the scenes” (e.g. scraping)

We’ll use a notebook for convenience, but you might want to install a “full-grown” Python environment later on. (Not today’s topic.)

Hello World!

print("Hello World!")
Hello World!
# whatever succeeds a hash sign (#)
# is not evaluated.
# useful for comments
# or to (temporarily) disable code

# print("Hello World!") # skipped
print("Hello Earthling!") # evaluated
Hello Earthling!

Calculation


4 + 4
8
4 - 4
0
3 * 3
9
10 / 2
5.0
2 ** 10
1024
5 % 2
1


etc.

Variables

eggs = 4
4 * eggs
16
dog_name = "Barkley"
cat_name = "Cheeto"
print("My dog is", dog_name,
      "and my cat is", cat_name)
My dog is Barkley and my cat is Cheeto

Try a simple calculation, using a variable.

Just enter some code in the first cell of your notebook and hit Ctrl+Enter on your keyboard to evaluate it.
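
One possible solution (any calculation with a variable will do):

apples = 3
apples * 7
21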

Data Types

There are more, but these are often used.

"The Answer"         # this is a String
"""
The Answer to the
Ultimate Question of
Life, the Universe,
and Everything
"""                  # this is also a String
42                   # this is an Integer number
1298.423             # this is a floating point number
True
False                # these are Boolean values
["cat", "mouse"]     # this is a List
{"name": "Barkley",    
 "species": "dog"}   # this is a dictionary (Dict)

Slicing

Allows you to access a part
of certain data types.

First element:

["cat", "mouse", "dog", "frog"][0]
'cat'

Returns a string (note the single quotes '…').

Third (!) element:

["cat", "mouse", "dog", "frog"][2]
'dog'

Last element:

["cat", "mouse", "dog", "frog"][-1]
'frog'

Second to third element:

["cat", "mouse", "dog", "frog"][1:3]
['mouse', 'dog']

Returns a list (note the square brackets)!

Strings can be sliced, too.

"mouse"[-1]
'e'

Dictionaries

barkley = {"species": "dog",
           "breed": "fox terrier",
           "favorite food": "socks",
           "name": "barkley"}
barkley["breed"]
'fox terrier'

Combining data

# What is grey and jumps? 
"mouse" + "-" + "frog"
'mouse-frog'
["mouse", "cat"] + ["dog", "frog"]
['mouse', 'cat', 'dog', 'frog']

Adding to a list

zoo = ["mouse", "cat"]
zoo.append("dog")
print(zoo)
['mouse', 'cat', 'dog']

Slice the first letter out of your name.

Just enter the code in a new cell of your notebook and hit Ctrl+Enter on your keyboard to evaluate it.

"Henning"[0]
'H'

Loops

Use loops to repeat operations. We say, we “iterate over” data.

Strings:

for character in "dog":
    print("-=-", character, "-=-")
-=- d -=-
-=- o -=-
-=- g -=-

Lists:

zoo = ["cat", "mouse",
       "dog", "frog"]
for animal in zoo:
    print("I love my", animal)
I love my cat
I love my mouse
I love my dog
I love my frog

Other iterables:

for i in range(4,8):
    print(i)
4
5
6
7

Note the white-space! Everything inside a loop is indented. Spaces and new lines inside brackets are ignored.

Write a loop that prints each letter in your name on a separate line.

for letter in "henning":
    print(letter)
h
e
n
n
i
n
g

Flow Control

Use if (elif and else) to make code execution conditional.

if 1 == 0:
    print("yes")
if "zoo" in "zoology":
    print("certainly")
if "mouse" in ["dog", "cat", "frog"]:
    print("of course")
else:
    print("surely not")
if "mouse" not in ["dog", "cat"]:
    print("certainly not")
certainly
surely not
certainly not

Everything inside the if block is indented.

Logical operators are useful for writing conditions:

1 == 1
True
1 != 1
False
10 < 5
False
1 == 1 and ("m" in "mouse" or
            "x" in "dog")
True

Objects

Everything in Python is an object. Depending on their class,

  • objects have certain attributes, but not others
  • you can do certain things with them, but not others.

For example, a cup of tea is an object.

  • It has a color, but not a favorite movie.
  • You can drink it, but not read it.

You can access the attributes and methods of a python object using the dot (.) operator. Think of it as right-clicking on a file in your computer and selecting a command from the context menu.

For example, strings have a .capitalize() method that capitalizes the first letter of the string.

"capital".capitalize()
'Capital'

Or a .count() method that counts substring occurrences.

"acetylcholinesterase".count("e")
4

Or .split() and .join() methods.

" eats ".join(["cat", "mouse"])
'cat eats mouse'
"cat eats mouse".split(" eats ")
['cat', 'mouse']

Modules

Modules are ready-made Lego™ bricks you can use to build something new.

There are currently 417,030 modules on the main repository pypi.org.

import requests
from bs4 import BeautifulSoup

Here we import the requests module, then the BeautifulSoup function from the bs4 module (one particular brick from a Lego™ pack).

How to Do It Part 2:
A Tiny Scraping Project

There are many Python modules you can use to scrape websites. A popular selection:

  • Scrapy: a complex web scraping/crawling framework
  • Selenium: remote control for a web-browser, used when content is added by JavaScript
  • BeautifulSoup: a simple but powerful parsing library for html content
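
We’ll use requests and BeautifulSoup below. Just for orientation, a minimal Selenium sketch could look like this (assuming the selenium package and a browser driver such as geckodriver are installed; not today’s topic):

from selenium import webdriver

driver = webdriver.Firefox()        # opens a remote-controlled Firefox window
driver.get("https://example.org")   # placeholder URL
html = driver.page_source           # the html after JavaScript has run
driver.quit()                       # close the browser again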

R also has web scraping modules. (Not today’s topic, but see Web Scraping with R.)

The Eur-Lex Database

Eur-Lex is the main legal database of the European Union. It includes, for example

  • the Official Journal
  • legislative and preparatory documents
  • rulings of the EU courts
  • procedural information on legislation

Data can be accessed by various search interfaces. Some parts of the data are standardized and can be exported for download, obviating the need for web-scraping. An R package (by Michal Ovádek) for accessing standardized Eur-Lex data is available.

Eur-Lex’s terms of service are permissive. While they don’t explicitly mention web-scraping, they don’t exclude it either.

The Eur-Lex Database


Please go to eur-lex.europa.eu.

  • Proceed to “Advanced Search” (below the big search bar)
  • Select the “Case-law” collection
  • Select “Judgment” in “Document reference”
  • Select “Court of Justice” in “Author of the document”
  • Hit “Search”

Voilà! These are all judgments ever issued by the European Court of Justice.

But that’s a lot.

Let’s try some scraping.

The Quest

Imagine we’re interested in the European Court of Justice (ECJ). We want a list of all the litigants that were ever involved in an ECJ proceeding.

There is no standardized litigant information on Eur-Lex that we could download using the export function. But litigants are mentioned in the text of each court ruling. This is a good use-case for web-scraping.

So we have to

  1. download the search results page
  2. save the URL for each judgment
  3. download each judgment via its URL
  4. extract the litigants from each judgment

Downloading the Search Results Page

The following code examples are available in the presentation notebook.


Go ahead and follow along.

How do we get hold of the web-page with the search results?

That’s the job of the requests module.

The search query is encoded in the URL (link) of the search results page.

Copy and paste the URL into the script and assign it to a variable, e.g. query_url.
query_url = "https://eur-lex.europa.eu/search.html?SUBDOM_INIT=EU_CASE_LAW&DB_TYPE_OF_ACT=judgment&DTS_SUBDOM=EU_CASE_LAW&typeOfCourtStatus=COURT_JUSTICE&DTS_DOM=EU_LAW&typeOfActStatus=JUDGMENT&page=1&lang=en&CASE_LAW_SUMMARY=false&type=advanced&DB_TYPE_COURquT=COURT_JUSTICE"

Next, we pass the URL to the requests.get() method, immediately assigning the output to a variable (r).

r = requests.get(query_url)


r now contains the html of the search results page and some other information. We can access the html part using the .text attribute. Here are the first 250 characters:

html = r.text
html[0:250]
'  \n \n \n \n \n \n \n \n \n \n \n \n \n \n        <!DOCTYPE html>\n        <html lang="en" class="no-js"\n        xml:lang="en"  >\n        <head>\n        <meta charset="utf-8">\n        \n        <meta http-equiv="X-UA-Compatible" content="IE=edge"/>\n        \n \n \n \n '

Here are the next 750 characters:

html[251:1000]
'                <meta name="viewport" content="width=device-width, initial-scale=1">\n            \n \n \n        <script type="text/javascript" src="./revamp/components/vendor/modernizr/modernizr.js?v=2.10.6"></script>\n        \n \r\n\r\n\r \r \r \r \r \r \r\n\r\n\r\n\r \r \r \r \r \r\n\n \n \n <title>Search results - EUR-Lex</title> \n \r\n\r\n\r\n\n <meta name="WT.cg_n" content="Search"/><meta name="WT.cg_s" content="Search results"/><meta name="WT.pi" content="Search result page"/><meta name="DCSext.w_oss_metadata_field" content="Document reference"/><meta name="DCSext.w_oss_metadata_collection_type" content="Single Collection"/><meta name="DCSext.w_oss_metadata_collection_option" content="EU case law"/><meta name="WT.z_usr_lan" content="en"/><meta name="WT.seg_1" content="'


  But this is just garbage!

Technically, it’s a tag soup.

Make It a Beautiful Soup

BeautifulSoup makes that tag soup readable.

We use it to

  • traverse the branches of the html code until we arrive at useful information
  • or search the entire html tree for the information we need
  • and eventually extract it

“What html tree,” you ask?

A Website Under the Hood

<html>
  <head>
    <title>A Web Page</title>
  </head>
  <body>
    <p id="author">Henning Deters</p>
    <p id="subject">A Web-Scraping Primer with Python</p>
    <a href="https://en.wikipedia.org/">A link to Wikipedia</a>
  </body>
</html>

The Eur-Lex website uses the same html syntax. It’s just a “bit” more complicated – and messy!

  • Text is embedded in “tags”: <p></p> for paragraphs, <a></a> for links etc.
  • Some tags carry additional attributes: id=..., href=...
  • Nesting: <p> sits between <body>, which is in <html>

Inspecting the html code on Eur-Lex


Please go to the search results, then hit F12 (Firefox) or Shift-Ctrl-J (Chrome) to open the developer tools.

  • click on the “element picker”
  • select the first search result
  • the tool shows you the html snippet that corresponds to the search result

It can be useful to examine the full html source. In Firefox, right-click anywhere and select “View Page Source”.

The snippet we are looking for is located


  • in an a tag (hyperlink),

  • nested within an h2 tag (heading level 2),

  • within a div tag with the class attribute “SearchResult”,

  • within even more levels …

The URL of the result appears twice. We want the second appearance, assigned to the name attribute.

Extracting the Information

We create a soup object by feeding the stuff we downloaded to BeautifulSoup. The second argument ("html5lib") tells BeautifulSoup which parser to use to make sense of the html.

soup = BeautifulSoup(html, "html5lib")

The soup object comes with useful methods. .find_all() finds all occurrences of a search term. Remember: We’re looking for something that’s nested within a div tag with the class “SearchResult”.

soup.find_all("div", {"class": "SearchResult"})

.find_all() expects the first argument to be a string with the tag name, and the optional second argument to be a dictionary of attributes and values.

.find_all() returns an object similar to a list. Thus we can access the first result by attaching an index.

search_results = soup.find_all("div", {"class": "SearchResult"})
search_results[0]

Here is the relevant part of the output.

<div class="SearchResult" xmlns="http://www.w3.org/1999/xhtml"><h2><a class="title" href="./legal-content/AUTO/?uri=CELEX:62020TJ0279&amp;qid=1669282930007&amp;rid=1" id="cellar_96f68f89-6b2b-11ed-9887-01aa75ed71a1" name="https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62020TJ0279"> Judgment of the General Court ... Manifest errors of assessment.<br/>Case T-279/20.</a></h2>

As discovered earlier, the URL of the result sits in an <a> tag, which is nested within <h2> tags.

Since there is only one URL per result, we can ignore the <h2> tags and look just for the <a>:

search_results[0].find("a")
<a class="title" href="./legal-content/AUTO/?uri=CELEX:62020TJ0279&amp;qid=1669282930007&amp;rid=1" id="cellar_96f68f89-6b2b-11ed-9887-01aa75ed71a1" name="https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62020TJ0279"> Judgment of the General Court ... Manifest errors of assessment.<br/>Case T-279/20.</a>

The URL after href is what your browser opens when you click on a link, but it’s a relative link with the domain missing. Since we’re too lazy to add the missing part, we use the full URL after the name attribute.

Attributes can be accessed like a dictionary. We pass "name" as key and are served the corresponding value (the URL).

search_results[0].find("a")["name"]
'https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62021CJ0370'

Over and over and over

This gets us the URL of the first search result.

search_results[0].find("a")["name"]
'https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62021CJ0370'

To extract all relevant URLs, we iterate over all search results. Instead of printing them out, we append them to an empty list for later use.

urls = []
for s in search_results:
    urls.append(s.find("a")["name"])
urls[0:5] # take a look at the first 5
['https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62021CJ0370',
 'https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62020CJ0269',
 'https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62021CJ0512',
 'https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62020CJ0653',
 'https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62021TJ0611']

Downloading Each Judgment

Now that we have the URLs for all judgments, we can download each one. This might take a minute or two.

We’re just recycling code from earlier and placing it in a loop. Not terribly elegant, but fairly straightforward by now.

contents = []
for u in urls:
    print("Scraping", u)
    r = requests.get(u)
    soup = BeautifulSoup(r.text, "html5lib")
    contents.append(soup)
Scraping https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62020CJ0638
Scraping https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62021CJ0658
etc.

Inspecting the html of a Judgment



Please open one of the search results and inspect the text of the judgment, using the developer tools.

The snippet we are looking for is located within a div tag with the id attribute “text”. It corresponds to the box that holds the text of the ruling.

Extracting the Information

Now we work with the html of all judgments, which we stored in the variable contents. We narrow down the search to the main text, delimited by <div id="text">.

# narrow down to text of the ruling
for c in contents:
    text = c.find("div",
                  {"id": "text"})
    print(type(text))
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'NoneType'>
<class 'NoneType'>

The type() function returns the type of BeautifulSoup’s output. Some results are of NoneType, which means these particular pages did not contain the div we were looking for.

Let’s open the corresponding URL from our list of search results to look at one of the offending pages.

The ruling has not (yet) been translated into English, therefore the page does not have its text on it!

We’ll keep this in mind for the methodology section, and just ignore all pages that lack the text of the ruling.

#  narrow down to the text of the ruling
ruling_texts = []
for c in contents:
    text = c.find("div", {"id": "text"})
    if text is not None:
        ruling_texts.append(text)    

The if statement means we include only those results in our ruling_texts list that are not None.

Now we have a list containing the text of all the rulings. But we’re really only interested in the litigants.

Inspecting one of the search results once more, we find that the litigants are located

  • within <b> tags (for bold print), which are

  • within <p> tags with the class “C02AlineaAltA”

This is not always true, but for simplicity we pretend it is.

The .find_all_next() method of BeautifulSoup finds all elements matching a search criterion that appear after the current element.

# narrow down to relevant paragraphs
paragraphs = []
for r in ruling_texts:
    relevant_p = r.find_all_next("p", {"class": "C02AlineaAltA"})
    paragraphs.extend(relevant_p)

Here we tell Python to give us a list of all <p> tags with the class “C02AlineaAltA” after the <div> that marks the text of the ruling on the web page. We repeat this for all ruling texts, using a for loop.

We further narrow down the relevant paragraphs to bold text (i.e. text within <b> tags).

# narrow down to bold parts
bold = []
for p in paragraphs:
    x = p.find("b")
    if x is not None:
        bold.append(x.text)
bold[0:5]
['DOMUS-Software-AG',
 'Marc Braschoß Immobilien GmbH,',
 'Finanzamt T',
 'S,',
 'Aquila Part Prod Com SA']

Line 4: Again, the .find() method returns None if the current soup object does not contain a <b> tag.

Line 5: We use if to ensure that those don’t end up on our list of bold paragraphs.

Line 6: The .text attribute returns just the text between the tags, omitting the <b> ... </b>.

Final Polish

  • .strip() gets rid of spaces left and right
  • .rstrip(",") removes the commas on the right

We can chain one after the other.

# stripping commas and spaces
litigants = [x.strip().rstrip(",") for x in bold]
litigants[0:5]
['DOMUS-Software-AG',
 'Marc Braschoß Immobilien GmbH',
 'Finanzamt T',
 'S',
 'Aquila Part Prod Com SA']

This is a “list comprehension”, often an elegant alternative to a for loop.
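
Written as a plain for loop, the same operation looks like this:

litigants = []
for x in bold:
    litigants.append(x.strip().rstrip(","))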

Could You Be Less Verbose, Please?

We could have done all of this in a few lines.

import requests
from bs4 import BeautifulSoup

query_url = "https://eur-lex.europa.eu/search.html?SUBDOM_INIT=EU_CASE_LAW&DB_TYPE_OF_ACT=judgment&DTS_SUBDOM=EU_CASE_LAW&typeOfCourtStatus=COURT_JUSTICE&DTS_DOM=EU_LAW&typeOfActStatus=JUDGMENT&page=1&lang=en&CASE_LAW_SUMMARY=false&type=advanced&DB_TYPE_COURquT=COURT_JUSTICE"
soup = BeautifulSoup(requests.get(query_url).text, "html5lib")
search_results = soup.find_all("div", {"class": "SearchResult"})
search_results = [x.find("a", {"class": "title"}) for x in search_results]
urls = [x["name"] for x in search_results]
contents = [BeautifulSoup(requests.get(u).text, "html5lib") for u in urls]
ruling_texts = [x.find("div", {"id": "text"}) for x in contents]
paragraphs = [x.find_all_next("p", {"class": "C02AlineaAltA"}) for x in ruling_texts if x is not None]
bold = [x.find("b") for y in paragraphs for x in y]
litigants = [x.text.strip().rstrip(",") for x in bold if x is not None]

Very compact, but much harder to understand. When writing code, make sure you’ll be able to go back to it months later.

Show Me A Picture

If we ran our unmodified script over the 400 most recent search results, we could visualize the five most frequent litigants like this.




import seaborn as sns
import pandas as pd

# turn the list of litigants into a Pandas Series, so we can count occurrences
litigants_series = pd.Series(litigants)

sns.set_style("darkgrid")
# bar chart of the five most frequent litigants
plot = sns.countplot(x=litigants_series.values,
                     order=litigants_series.value_counts().iloc[:5].index)
# rotate the labels so long litigant names remain readable
plot.set_xticklabels(plot.get_xticklabels(), rotation=45, ha="right")

Caveats and Extensions

  1. The script only downloads the first ten search results. The URL for the subsequent ten results contains the term &page=2 instead of &page=1. By now you can probably guess how to scrape all results. (Hint: it involves a loop.)
  2. The script downloads the html and extracts information in one go. It’s much more efficient (and puts less load on the server) to first download all results and then extract the information from local files. This way you’ll only ever have to download the results once. (Writing to and reading from files involves the open() command. A rough sketch covering points 1 to 3 follows this list.)
  3. There is always more polishing to do. For example, if a litigant name contains weird strings like “\xa0”, we should transform these “non-breaking spaces” into regular ones.
  4. Eur-Lex is not very consistent in how it encodes information. Sometimes you have to search for text patterns instead of html tags. (This involves “regular expressions”, available in the re module.)
  5. Always make sure to verify your results on a sufficiently large and diverse sample. In our example, we should at least double-check that the script works on results from different time periods, as the format of the html files might have changed.
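
To illustrate points 1 to 3, here is a rough sketch (reusing query_url and requests from earlier; the page range and file names are made up, and this is not the polished version from the notebook):

# points 1 and 2: loop over several result pages and save each page locally
base_url = query_url.replace("&page=1", "")   # query_url without the page number
for page in range(1, 4):                      # pages 1 to 3, as an example
    r = requests.get(base_url + "&page=" + str(page))
    with open("results_page_" + str(page) + ".html", "w", encoding="utf-8") as f:
        f.write(r.text)                       # download once, work from the local copy later

# point 3: when reading a saved page back in, replace non-breaking spaces
with open("results_page_1.html", "r", encoding="utf-8") as f:
    html = f.read().replace("\xa0", " ")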

The final section of the presentation notebook contains improvements on points 1 – 3.

For convenient analysis, you’ll want to combine different variables and store those that belong to the same search result in a dictionary (easily exported as a versatile json file) or in a rectangular shape (e.g. a Pandas DataFrame).
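
A minimal sketch of what that could look like (the case_data records are invented for illustration; a real script would fill them inside the scraping loop):

import json
import pandas as pd

# one dictionary per judgment (hypothetical example data)
case_data = [
    {"url": "https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62021CJ0370",
     "litigants": ["DOMUS-Software-AG", "Marc Braschoß Immobilien GmbH"]},
]

# export as json ...
with open("cases.json", "w", encoding="utf-8") as f:
    json.dump(case_data, f, ensure_ascii=False, indent=2)

# ... or as a rectangular Pandas DataFrame
df = pd.DataFrame(case_data)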

But that’s a different topic to explore.

Resources

Gentle Python Introductions

  • Freeman, Eric. Head First Learn to Code. Beijing: O’Reilly, 2018.
  • Sweigart, Al. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. 2nd edition. San Francisco: No Starch Press, 2020, full text.
  • W3 Schools Python Tutorial

Web-Scraping

  • A Collection of Tools for Various Languages
  • Feiks, Markus. Empirische Sozialforschung mit Python: Daten Automatisiert Sammeln, Auswerten, Aufbereiten. Wiesbaden: Springer VS, 2019 (Chapter 4, in German), full text.
  • Mitchell, Ryan E. Web Scraping with Python: Collecting More Data from the Modern Web. Second edition. Sebastopol, CA: O’Reilly Media, 2018.
  • W3 Schools HTML Tutorial

Background and Depth

  • Atteveldt, Wouter van, Damian Trilling, and Carlos Arcíla. Computational Analysis of Communication: A Practical Introduction to the Analysis of Texts, Networks, and Images with Code Examples in Python and R. Hoboken, NJ: John Wiley & Sons, 2021, full text.
  • McLevey, John. Doing Computational Social Science: A Practical Introduction. Thousand Oaks: SAGE Publications, 2021.

Image Sources

  • Javier Allegue Barros https://unsplash.com/photos/0nOP5iHVaZ8
  • Lesley Davidson https://unsplash.com/photos/FYMY-DJPLGo
  • Kier… in Sight https://unsplash.com/photos/2TwvNp2kw78
  • David Clode https://unsplash.com/photos/4_LTR48NBYQ
  • Trình Minh Thư https://unsplash.com/photos/e2CPaDz3Pjo
  • Terry Jaskiw https://unsplash.com/photos/EWfwYi-qcNw
  • Egor Lyfar https://unsplash.com/photos/jHMJrp33sUg
  • Lina Verovaya https://unsplash.com/photos/4kBsVsiFozc

All images under Unsplash license