Scraping Wikipedia with BeautifulSoup, pt. 1

17 Jul 2020

category is ~ web scraping ~ berlin ~ python ~ data science ~ beautifulsoup ~

About time I published a technical post again, isn't it? I have quite a few drafts stacking up back there, but I've been waiting for things to be perfect or finished. Screw perfectionism; a ship in harbour is safe, etc.

So, I've had an idea rolling around for about four years that I was originally going to self-publish as a book or zine. But now that I have coding skills, I may as well make it virtual, right? Especially since there are so many tools out there that will make it a much more exciting venture.

Idea: make an encyclopaedia of Berlin streets and squares named after women, sorted by Bezirk (borough).

My idea requires sucking up a lot of open data — I'm talking about Wikipedia — and displaying it. Sure, the "quick" way to grab a lot of website data for your development purposes is via an API; most of the popular apps out there have these available. But what I actually want to do right now is experiment with web scraping. (We will look at APIs in another post!)

BeautifulSoup is a well-known Python library that's excellent for pulling information out of a web page's HTML. Pair it with a way to fetch the page and save it locally, and you have thousands of lines of data at your disposal.

First, make sure you've installed beautifulsoup4, lxml and requests into your Python 3 project environment. Visit the desired Wikipedia page — let's start with Mitte — and copy the URL, then open a new Python file:

# import BeautifulSoup and the requests HTTP library
from bs4 import BeautifulSoup
import requests

# define the URL we want to scrape
my_url = "https://de.wikipedia.org/wiki/Liste_der_Stra%C3%9Fen_und_Pl%C3%A4tze_in_Berlin-Mitte"

# fetch the page and keep the response body as text
req = requests.get(my_url).text

# parse the page with the lxml parser and print the soup to the console
soup = BeautifulSoup(req, "lxml")
print(soup)

# create a writable file called mitte.html and write the contents of soup to it
with open("mitte.html", "w", encoding="utf-8") as file:
    file.write(str(soup))

The entire HTML source of the page will be printed in your console and saved into a file of your choosing. Hooray, the request was successful! But also, we can't really do much with that. For one thing, it would be useful to nose around in the file and zero in on certain elements. So, in the same Python file:

with open("mitte.html", encoding="utf-8") as mitte_file:  # open your saved file
    mitte_read = BeautifulSoup(mitte_file.read(), features="lxml")  # parse it with lxml

elems = mitte_read.select("td")  # "elems" is a list of all "td" elements in the file
print(len(elems))  # check the number of "td" elements in "elems"
# output: 1976!

Here, we've picked out the HTML element td, which indicates "table data" and will give us the most salient info from the page. Also, whoa! There are 1,976 td elements in this list, so now we need to find a way to narrow them down. To grab a certain element, we can use indices. You can see below how I experimented with it in the Python shell to get a feel for how the information in each td is presented.

>>> len(elems)
1976
>>> type(elems[0])
<class 'bs4.element.Tag'>
>>> elems[0].getText()
'Ackerstraße\n(Lage)\n\n'
>>> elems[1].getText()
'0950(im Ortsteil)\n'
>>> elems[94].getText() # from here, I'm putting in random indices to see if I strike gold
'Die Straße beginnt als Fußweg an der Stralauer Straße und endet an der Neuen Jüdenstraße. Der Name der Privatstraße wurde von einer Gasse übernommen, die im 16.\xa0Jahrhundert als Zugang zur Spree zum Wasserholen angelegt worden war.[6] Die Gasse existierte bis um 1937. Bei der Neubebauung nach 1990 wurde der Straßenname entsprechend dem historischen Verlauf neu vergeben.\n'
>>> elems[342].getText()
'Caroline-Michaelis-Straße\n(Lage)\n\n'
>>> elems[343].getText()
'0510\n'

For the street info I'm displaying in this project, I will be focusing only on certain table headings. Way down the line, these will become the variables for an element object: name, name_origin, name_details. They respectively correspond to the Wikipedia table headings Name/Lage, Namensherkunft, and Anmerkungen.

[Image: the Wikipedia table, showing the Name/Lage, Namensherkunft and Anmerkungen columns]

For each object, we are going to assign numbers to these variables to signify their position in the document. In this case, the td elements follow a pattern of strides of 2, starting from 0.

To get the page's first occurrence of name, name_origin, then name_details, see the code below.

>>> elems[0].getText() # name
>>> elems[2].getText() # name_origin
>>> elems[4].getText() # name_details

The next occurrence of name, seeing as this is a sequence where 2 is added each time, would be at index 6.

Therefore: name will always sit at a multiple of 6 (6n + 0), name_origin at a multiple of 6 plus 2 (6n + 2), and name_details at a multiple of 6 plus 4 (6n + 4).

We can condense this into a list comprehension:

>>> [6*n for n in range(1,10+1)]
[6, 12, 18, 24, 30, 36, 42, 48, 54, 60]

If you're not familiar with list comprehensions, you can read up about them here. They are an efficient way to sift through great globs of data and apply a certain action to many elements at a time.

In the above example, 6*n is the "action", i.e. what we are applying to each n in the range specified. Note that ranges are not inclusive of their end value; with range(1,10), n only goes up to and including 9. That's why in this case I've added +1 to the range, so that 10 is included as well.
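A quick check in the shell makes the difference clear; nothing here depends on the scraping code, it's just Python's built-in range:

>>> list(range(1, 10))
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(range(1, 10 + 1))
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]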

Running this list comprehension will return a list of numbers that we can then use as indices. So, let's grab all the indices that are divisible by 6. Since I've already come this far in my findings on this page, though, I don't want to start all the way from 0 again, so I'm going to define a range. If I want to get 100 index numbers, I can run [6*n for n in range(1,100+1)]. And knowing that there are 1,976 elements, we can already figure out the indices we'll need.

>>> round(1976 / 6) # find the end number for our range
329
>>> [6*n for n in range(1,329+1)]
[6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72, 78, 84, 90, 96, 102, 108, 114, 120, 126, 132, 138, 144, 150, 156, 162, 168, 174, 180, 186, 192, 198, 204, 210, 216, 222, 228, 234, 240, 246, 252, 258, 264, 270, 276, 282, 288, 294, 300, 306, 312, 318, 324, 330, 336, 342, 348, 354, 360, 366, 372, 378, 384, 390, 396, 402, 408, 414, 420, 426, 432, 438, 444, 450, 456, 462, 468, 474, 480, 486, 492, 498, 504, 510, 516, 522, 528, 534, 540, 546, 552, 558, 564, 570, 576, 582, 588, 594, 600, 606, 612, 618, 624, 630, 636, 642, 648, 654, 660, 666, 672, 678, 684, 690, 696, 702, 708, 714, 720, 726, 732, 738, 744, 750, 756, 762, 768, 774, 780, 786, 792, 798, 804, 810, 816, 822, 828, 834, 840, 846, 852, 858, 864, 870, 876, 882, 888, 894, 900, 906, 912, 918, 924, 930, 936, 942, 948, 954, 960, 966, 972, 978, 984, 990, 996, 1002, 1008, 1014, 1020, 1026, 1032, 1038, 1044, 1050, 1056, 1062, 1068, 1074, 1080, 1086, 1092, 1098, 1104, 1110, 1116, 1122, 1128, 1134, 1140, 1146, 1152, 1158, 1164, 1170, 1176, 1182, 1188, 1194, 1200, 1206, 1212, 1218, 1224, 1230, 1236, 1242, 1248, 1254, 1260, 1266, 1272, 1278, 1284, 1290, 1296, 1302, 1308, 1314, 1320, 1326, 1332, 1338, 1344, 1350, 1356, 1362, 1368, 1374, 1380, 1386, 1392, 1398, 1404, 1410, 1416, 1422, 1428, 1434, 1440, 1446, 1452, 1458, 1464, 1470, 1476, 1482, 1488, 1494, 1500, 1506, 1512, 1518, 1524, 1530, 1536, 1542, 1548, 1554, 1560, 1566, 1572, 1578, 1584, 1590, 1596, 1602, 1608, 1614, 1620, 1626, 1632, 1638, 1644, 1650, 1656, 1662, 1668, 1674, 1680, 1686, 1692, 1698, 1704, 1710, 1716, 1722, 1728, 1734, 1740, 1746, 1752, 1758, 1764, 1770, 1776, 1782, 1788, 1794, 1800, 1806, 1812, 1818, 1824, 1830, 1836, 1842, 1848, 1854, 1860, 1866, 1872, 1878, 1884, 1890, 1896, 1902, 1908, 1914, 1920, 1926, 1932, 1938, 1944, 1950, 1956, 1962, 1968, 1974]

A note on indexing for this project

As you can see, this has returned rather a lot of numbers, and not all of them will be relevant. Even though at first glance it seems efficient, it's actually counterproductive to grab a range of 329 numbers all in one go; I started running into trouble once the table didn't behave the way I'd expected it to. So for now, let's keep it to 100 index places.

Case in point: sometimes the cells of the Wikipedia table are empty, which sabotages my neat little equation. When this happens, find another point to go off from, then make use of the modulo operator!

The modulo operator looks like this: %. It's not a percentage sign here; you use it much like the division operator, except it returns the remainder rather than the quotient. (46656 % 6 will return 0, because 46656 is an exact multiple of 6.) So if you need to start again, this is a good way to find your feet.
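As a minimal sketch, using the index 342 for Caroline-Michaelis-Straße that we found earlier: if a name cell you track down by eye no longer sits on a clean multiple of 6, the remainder tells you how far the pattern has drifted.

>>> 342 % 6   # Caroline-Michaelis-Straße's name cell: remainder 0, so the 6n pattern holds here
0
>>> 344 % 6   # two cells on, i.e. the Namensherkunft cell for the same street
2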

Don't automate all the boring stuff

Python is good for a lot of things, but a couple of steps here still have to be done by hand.

So, we can do that for a while: run elems[0].getText(), then keep swapping the index number for the next one from [6*n for n in range(1,100+1)], noting down the result if it's relevant.
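If typing out each of those calls gets tedious, a small loop can take over the typing while you still do the judging. This is just a sketch of the same manual process, printing each candidate index next to its text so you can scan the output and note down the relevant ones:

# print the first 100 candidate "name" cells next to their indices
for i in [6*n for n in range(1, 100+1)]:
    print(i, elems[i].getText().strip())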

Once you've done that for a few items, it's time to create a dictionary to store the information you've found. A Python dictionary has nothing to do with vocab, necessarily; it's a data type made up of key-value pairs. Here, the key will be the street name, and the value will be the index at which that street name occurs.

Knowing the index position is useful because it helps us find whereabouts the required info is on the page. So if you needed to check something about Claire-Waldoff-Straße and knew the index was 384, you'd run elems[384].getText() to get to the relevant point in the file. Going the other way, from a street name back to its index, is less direct: elems is a list of Tag objects rather than strings, so elems.index("Claire-Waldoff-Straße") won't find anything; you have to search the elements' text instead, as in the sketch below.
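A rough sketch of that reverse lookup (the helper name find_street_index is my own, not something from BeautifulSoup):

def find_street_index(elems, street_name):
    """Return the index of the first td whose text starts with street_name."""
    for i, elem in enumerate(elems):
        if elem.getText().startswith(street_name):
            return i
    return None  # nothing matched

# find_street_index(elems, "Claire-Waldoff-Straße") should come back as 384 on this page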

Note that in the dictionary, I am logging these numbers as integers, not strings.

names = {
  "Adele-Schreiber-Krieger-Straße": 12,
  "Alexandrinenstraße": 42,
  "Anna-Louisa-Karsch-Straße": 180,
  "Annenstraße": 186,
  "Bona-Peiser-Weg": 288,
  "Caroline-Michaelis-Straße": 342,
  "Caroline-von-Humboldt-Weg": 348,
  "Claire-Waldoff-Straße": 384,
  "Cora-Berliner-Straße": 390,
  "Dorothea-Schlegel-Platz": 402,
  "Dorotheenstraße": 408,
  "Elisabeth-Mara-Straße": 438
  ...
}
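Once the dictionary exists, pulling the three fields for any street is just the index arithmetic from earlier. Here's a sketch of how I'd expect that to look, using the +2 and +4 offsets for name_origin and name_details (assuming the pattern hasn't been thrown off by empty cells for that row):

# look up a street's name cell, then step forward to its related cells
idx = names["Caroline-Michaelis-Straße"]    # 342
name = elems[idx].getText()                 # the Name/Lage cell
name_origin = elems[idx + 2].getText()      # the Namensherkunft cell
name_details = elems[idx + 4].getText()     # the Anmerkungen cell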

Later on we can put this info into a database, but for the time being, a dictionary will do the trick. I'll be continuing this series as I play around a bit more, so keep an eye out for future developments!

