Centre for European Integration Research, IPW, Uni Vienna
When you want to…
import tweepy

# fill in your own API credentials first
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret, access_token, access_token_secret
)
api = tweepy.API(auth)
public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(tweet.text)
(Not today’s topic.)
Things to consider:
robots.txt
More information: monashdatafluency.github.io
Time to get our hands dirty.
Please go to the Kaggle notebook now.
Notebooks (e.g. Jupyter) are great…
They are not so great…
We’ll use a notebook for convenience, but you might want to install a “full-grown” Python environment later on. (Not today’s topic.)
Try a simple calculation, using a variable.
Just enter some code in the first cell of your notebook and hit Ctrl+Enter on your keyboard to evaluate it.
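For example, assigning numbers to variables and doing some arithmetic (the variable names here are made up):

apples = 3
oranges = 4
apples + oranges  # evaluates to 7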
There are more, but these are often used.
"The Answer" # this is a String
"""
The Answer to the
Ultimate Question of
Life, the Universe,
and Everything
""" # this is also a String
42 # this is an Integer number
1298.423 # this is a floating point number
True
False # these are Boolean values
["cat", "mouse"] # this is a List
{"name": "Barkley",
"species": "dog"} # this is a dictionary (Dict)
Slicing allows you to access a part of certain data types.
First element:
Returns a string (note the single quotes '…').
Third (!) element:
Last element:
Second to third element:
Returns a list (note the square brackets)!
Strings can be sliced, too.
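Putting the above together, a sketch of these indexing and slicing operations (the example values are made up; the original notebook cells are not reproduced here):

animals = ["cat", "mouse", "dog", "frog"]
animals[0]    # first element: 'cat' (a string, note the single quotes)
animals[2]    # third (!) element, since counting starts at 0: 'dog'
animals[-1]   # last element: 'frog'
animals[1:3]  # second to third element: ['mouse', 'dog'] (a list, note the brackets)
"Henning"[0]  # strings can be sliced, too: 'H'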
Dictionaries
'fox terrier'
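Dictionary values are accessed by their key. A sketch that would return the output above (the "breed" key is an assumption for illustration):

dog = {"name": "Barkley",
       "species": "dog",
       "breed": "fox terrier"}  # the "breed" entry is made up for this example
dog["breed"]  # returns 'fox terrier'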
Combining data
Different data types
Slice the first letter out of your name.
Just enter the code in a new cell of your notebook and hit Ctrl+Enter on your keyboard to evaluate it.
Use loops to repeat operations. We say, we “iterate over” data.
Strings:
Lists:
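A sketch of the kind of loop that produces the output below (the original notebook cell is not reproduced here):

for animal in ["cat", "mouse", "dog", "frog"]:
    print("I love my " + animal)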
I love my cat
I love my mouse
I love my dog
I love my frog
Other iterables:
Note the white-space! Everything inside a loop is indented. Spaces and new lines inside brackets are ignored.
Write a loop that prints each letter in your name on a separate line.
Use if (elif and else) to make code execution conditional.
certainly
surely not
certainly not
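A sketch of a conditional that would print one of the answers above (the variable and the conditions are made up):

answer = 42
if answer == 42:
    print("certainly")
elif answer > 0:
    print("surely not")
else:
    print("certainly not")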
Everything after the if statement is indented.
Logical operators are useful for writing conditions:
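For example (a sketch; the values are made up):

answer = 42
answer > 0 and answer < 100  # True: both conditions hold
answer < 0 or answer == 42   # True: at least one condition holds
not answer == 42             # False: negation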
Everything in Python is an object. Depending on their class, objects have different attributes and methods.
For example, a cup of tea is an object.
You can access the attributes and methods of a Python object using the dot (.) operator. Think of it as right-clicking on a file in your computer and selecting a command from the context menu.
For example, strings have a .capitalize() method that capitalizes the first letter of the string.
Or a .count() method that counts substring occurrences.
Or .split() and .join() methods.
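A sketch of these string methods in action (the example string is made up):

s = "the answer to the ultimate question"
s.capitalize()        # 'The answer to the ultimate question'
s.count("the")        # 2
words = s.split(" ")  # ['the', 'answer', 'to', 'the', 'ultimate', 'question']
" ".join(words)       # puts the pieces back together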
Modules are ready-made Lego™ bricks you can use to build something new.
There are currently 417,030 modules on the main repository pypi.org.
Here we import the requests module, then the BeautifulSoup class from the bs4 module (one particular brick from a Lego™ pack).
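In code (these two lines reappear in the compact script at the end):

import requests                # for downloading web pages
from bs4 import BeautifulSoup  # for parsing html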
There are many Python modules you can use to scrape websites. A popular selection:
| Module | Focus |
|---|---|
| Scrapy | complex web scraping/crawling framework |
| Selenium | remote-control for a web-browser, used when content is added by JavaScript |
| BeautifulSoup | simple but powerful parsing library for html content |
R also has web scraping modules. (Not today’s topic, but see Web Scraping with R.)
Eur-Lex is the main legal database of the European Union. It includes, for example, the EU treaties, legislation, and case law.
Data can be accessed by various search interfaces. Some parts of the data are standardized and can be exported for download, obviating the need for web-scraping. An R package (by Michal Ovádek) for accessing standardized Eur-Lex data is available.
Eur-Lex’s terms of service are permissive. While they don’t explicitly mention web-scraping, they don’t exclude it either.
Please go to eur-lex.europa.eu.
Voilà! These are all judgments ever issued by the European Court of Justice.
But that’s a lot.
Let’s try some scraping.
Imagine we’re interested in the European Court of Justice (ECJ). We want a list of all the litigants that were ever involved in an ECJ proceeding.
There is no standardized litigant information on Eur-Lex that we could download using the export function. But litigants are mentioned in the text of each court ruling. This is a good use-case for web-scraping.
So we have to (1) retrieve the search results, (2) extract the URL of each judgment, (3) download each judgment, and (4) extract the litigants from its text.
The following code examples are available in the presentation notebook.
Go ahead and follow along.
How do we get hold of the web-page with the search results?
That’s the job of the requests module.
The search query is encoded in the URL (link) of the search results page.
Copy and paste the URL into the script and assign it to a variable, e.g. query_url.
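For example (this is the same URL used in the compact script at the end):

query_url = "https://eur-lex.europa.eu/search.html?SUBDOM_INIT=EU_CASE_LAW&DB_TYPE_OF_ACT=judgment&DTS_SUBDOM=EU_CASE_LAW&typeOfCourtStatus=COURT_JUSTICE&DTS_DOM=EU_LAW&typeOfActStatus=JUDGMENT&page=1&lang=en&CASE_LAW_SUMMARY=false&type=advanced&DB_TYPE_COURT=COURT_JUSTICE"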
Next, we pass the URL to requests’ .get() function, immediately assigning the output to a variable (r).
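In code:

r = requests.get(query_url)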
r now contains the html of the search results page and some other information. We can access the html part using the .text attribute. For the first 250 letters:
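r.text[:250]  # the first 250 characters of the html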
Here are the next 750 letters:
' <meta name="viewport" content="width=device-width, initial-scale=1">\n \n \n \n <script type="text/javascript" src="./revamp/components/vendor/modernizr/modernizr.js?v=2.10.6"></script>\n \n \r\n\r\n\r \r \r \r \r \r \r\n\r\n\r\n\r \r \r \r \r \r\n\n \n \n <title>Search results - EUR-Lex</title> \n \r\n\r\n\r\n\n <meta name="WT.cg_n" content="Search"/><meta name="WT.cg_s" content="Search results"/><meta name="WT.pi" content="Search result page"/><meta name="DCSext.w_oss_metadata_field" content="Document reference"/><meta name="DCSext.w_oss_metadata_collection_type" content="Single Collection"/><meta name="DCSext.w_oss_metadata_collection_option" content="EU case law"/><meta name="WT.z_usr_lan" content="en"/><meta name="WT.seg_1" content="'
But this is just garbage!
Technically, it’s a tag soup.
BeautifulSoup makes that tag soup readable.
We use it to parse the tag soup and navigate the resulting html tree.
“What html tree,” you ask?
<html>
  <head>
    <title>A Web Page</title>
  </head>
  <body>
    <p id="author">Henning Deters</p>
    <p id="subject">A Web-Scraping Primer with Python</p>
    <a href="https://en.wikipedia.org/">A link to Wikipedia</a>
  </body>
</html>
The Eur-Lex website uses the same html syntax. It’s just a “bit” more complicated – and messy!
- Tags mark up parts of the page: <p> … </p> for paragraphs, <a> … </a> for links etc.
- Tags can carry attributes, such as id=... and href=...
- Tags are nested: the <p> sits between <body> tags, which are in <html>.
Please go to the search results, then hit F12 (Firefox) or Shift-Ctrl-J (Chrome) to open the developer tools.
It can be useful to examine the full html source. In Firefox, right-click anywhere and select “View Page Source”.
The snippet we are looking for is located
- in an a tag (hyperlink),
- nested within an h2 tag (heading level 2),
- within a div tag with the class attribute “SearchResult”,
- within even more levels …
The URL of the result appears twice. We want the second appearance, assigned to the name attribute.
We create a soup object by feeding the stuff we downloaded to BeautifulSoup. The second argument ("html5lib") tells BeautifulSoup that we’re feeding it html.
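In code (as in the compact script at the end):

soup = BeautifulSoup(r.text, "html5lib")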
The soup object comes with useful methods. .find_all() finds all occurrences of a search term. Remember: we’re looking for something that’s nested within a div tag with the class “SearchResult”.
.find_all() expects the first argument to be a string with the tag, and the optional second argument to be a dictionary.
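In our case:

search_results = soup.find_all("div", {"class": "SearchResult"})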
.find_all() returns an object similar to a list. Thus we can access the first result by attaching an index.
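For example:

search_results[0]  # the first search result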
Here is the relevant part of the output.
<div class="SearchResult" xmlns="http://www.w3.org/1999/xhtml"><h2><a class="title" href="./legal-content/AUTO/?uri=CELEX:62020TJ0279&qid=1669282930007&rid=1" id="cellar_96f68f89-6b2b-11ed-9887-01aa75ed71a1" name="https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62020TJ0279"> Judgment of the General Court ... Manifest errors of assessment.<br/>Case T-279/20.</a></h2>
As discovered earlier, the URL of the result sits within <a> tags, which sit within <h2> tags.
Since there is only one URL per result, we can ignore the <h2> tags and look just for the <a>:
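search_results[0].find("a")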
<a class="title" href="./legal-content/AUTO/?uri=CELEX:62020TJ0279&qid=1669282930007&rid=1" id="cellar_96f68f89-6b2b-11ed-9887-01aa75ed71a1" name="https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62020TJ0279"> Judgment of the General Court ... Manifest errors of assessment.<br/>Case T-279/20.</a>
The URL after href is what your browser opens when you click on a link, but it’s truncated. Since we’re too lazy to add the missing part, we use the URL after the name attribute.
Attributes can be accessed like a dictionary. We pass "name" as key and are served the corresponding value (the URL).
This gets us the URL of the first search result.
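In code:

search_results[0].find("a")["name"]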
'https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62021CJ0370'
To extract all relevant URLs, we iterate over all search results. Instead of printing them out, we append them to an empty list for later use.
urls = []
for s in search_results:
    urls.append(s.find("a")["name"])
urls[0:5]  # take a look at the first 5
['https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62021CJ0370',
'https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62020CJ0269',
'https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62021CJ0512',
'https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62020CJ0653',
'https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62021TJ0611']
Now that we have the URLs for all judgments, we can download each one. This might take a minute or two.
We’re just recycling code from earlier and placing it in a loop. Not terribly elegant, but fairly straightforward by now.
contents = []
for u in urls:
    print("Scraping", u)
    r = requests.get(u)
    soup = BeautifulSoup(r.text, "html5lib")
    contents.append(soup)
Scraping https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62020CJ0638
Scraping https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62021CJ0658
etc.
Please open one of the search results (for example this one) and inspect the text of the first judgment, using the developer tools.
The snippet we are looking for is located within a div tag with the id attribute “text”. It corresponds to the box that holds the text of the ruling.
Now we work with the html of all judgments, which we stored in the variable contents. We narrow down the search to the main text, delimited by <div id="text">.
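A sketch of this step, printing the type of each result as in the output below (the original loop is not reproduced in this extract; the variable names follow the compact script at the end):

ruling_texts = []
for c in contents:
    ruling_texts.append(c.find("div", {"id": "text"}))
    print(type(ruling_texts[-1]))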
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'NoneType'>
<class 'NoneType'>
The type() function returns the type of BeautifulSoup’s output. Some results are of NoneType, which means these particular pages did not contain the div we were looking for.
Let’s open the corresponding URL from our list of search results to look at the offending page.
The ruling has not (yet) been translated into English, therefore the page does not have its text on it!
We’ll keep this in mind for the methodology section, and just ignore all pages that lack the text of the ruling.
The if statement means we include only those results in our rulings list that are not None.
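A sketch of that filtering step (the rulings name comes from the text; the loop itself is an assumption):

rulings = []
for t in ruling_texts:
    if t is not None:  # keep only pages that contain the ruling text
        rulings.append(t)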
Now we have a list containing the text of all the rulings. But we’re really only interested in the litigants.
Inspecting one of the search results once more, we find that the litigants are located
- within <b> tags (for bold print), which are
- within <p> tags with the class “C02AlineaAltA”.
This is not always true, but for simplicity we pretend it is.
The .find_all_next() method of BeautifulSoup finds all elements that match a search criterion and appear after the current element.
Here we tell Python to give us a list of all <p> tags with the class “C02AlineaAltA” after the <div> that marks the text of the ruling on the web page. We repeat this for all ruling texts, using a for loop.
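A sketch of that loop (variable names follow the earlier steps):

paragraphs = []
for t in rulings:
    paragraphs.append(t.find_all_next("p", {"class": "C02AlineaAltA"}))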
We further narrow down the relevant paragraphs to bold text (e.g. in <b> tags).
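The numbered loop that the “Line 4” to “Line 6” notes below refer to is not reproduced in this extract; a sketch whose line numbers match the notes:

bold = []                        # line 1
for p in paragraphs:             # line 2: one list of <p> tags per ruling
    for x in p:                  # line 3: each paragraph in turn
        b = x.find("b")          # line 4: the bold part, or None
        if b is not None:        # line 5: skip paragraphs without bold text
            bold.append(b.text)  # line 6: keep just the text
bold[0:5]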
['DOMUS-Software-AG',
'Marc Braschoß Immobilien GmbH,',
'Finanzamt T',
'S,',
'Aquila Part Prod Com SA']
Line 4: Again, the .find() method returns None if the current soup object does not contain a <b> tag.
Line 5: We use if to ensure that those don’t end up on our list of bold paragraphs.
Line 6: The .text attribute returns just the text between the tags, omitting the <b> … </b>.
.strip() gets rid of spaces left and right. .rstrip(",") removes the commas on the right. We can chain one after the other.
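Chained together (assuming bold holds the text strings from the step above):

litigants = [x.strip().rstrip(",") for x in bold]
litigants[0:5]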
['DOMUS-Software-AG',
'Marc Braschoß Immobilien GmbH',
'Finanzamt T',
'S',
'Aquila Part Prod Com SA']
This is a “list comprehension”, often an elegant alternative to a for loop.
We could have done all of this in a few lines.
import requests
from bs4 import BeautifulSoup

# download and parse the search results page
query_url = "https://eur-lex.europa.eu/search.html?SUBDOM_INIT=EU_CASE_LAW&DB_TYPE_OF_ACT=judgment&DTS_SUBDOM=EU_CASE_LAW&typeOfCourtStatus=COURT_JUSTICE&DTS_DOM=EU_LAW&typeOfActStatus=JUDGMENT&page=1&lang=en&CASE_LAW_SUMMARY=false&type=advanced&DB_TYPE_COURT=COURT_JUSTICE"
soup = BeautifulSoup(requests.get(query_url).text, "html5lib")

# extract the URL of each search result
search_results = soup.find_all("div", {"class": "SearchResult"})
search_results = [x.find("a", {"class": "title"}) for x in search_results]
urls = [x["name"] for x in search_results]

# download each judgment and narrow down to the text of the ruling
contents = [BeautifulSoup(requests.get(u).text, "html5lib") for u in urls]
ruling_texts = [x.find("div", {"id": "text"}) for x in contents]

# extract the bold parts of the relevant paragraphs and clean them up
paragraphs = [x.find_all_next("p", {"class": "C02AlineaAltA"}) for x in ruling_texts if x is not None]
bold = [x.find("b") for y in paragraphs for x in y]
litigants = [x.text.strip().rstrip(",") for x in bold if x is not None]
Very compact, but much harder to understand. When writing code, make sure you’ll be able to go back to it months later.
If we ran our unmodified script over the 400 most recent search results, we could visualize the five most frequent litigants like this.
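A sketch of how such a chart could be produced with the standard library’s Counter and matplotlib (the original plotting code is not reproduced here, so this is an assumption):

from collections import Counter
import matplotlib.pyplot as plt

top5 = Counter(litigants).most_common(5)  # the five most frequent litigants
names, counts = zip(*top5)
plt.barh(names, counts)
plt.xlabel("Number of proceedings")
plt.show()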
A few loose ends remain:
1. We only scraped the first page of results. The second page uses &page=2 instead of &page=1. By now you can probably guess how to scrape all results. (Hint: it involves a loop.)
2. The results only live in the notebook. You can save them to a file (using the open() command).
3. The litigant names could be cleaned up further (e.g. with regular expressions, using the re module).
The final section of the presentation notebook contains improvements on points 1 – 3.
For convenient analysis, you’ll want to combine different variables and store those that belong to the same search result in a dictionary (easily exported as a versatile json file) or in a rectangular shape (e.g. a Pandas DataFrame).
But that’s a different topic to explore.
All images under Unsplash license