
Web Scraping with Python

Overview

Teaching: 0 min
Exercises: 30 min
Questions
  • How do I web scrape with Python?

Objectives
  • Read an html page with Python

  • Parse an html page with Python

  • Identify all csv file links in an html page

  • Save all csv files linked from an html page

Web Scraping with Python

import requests #http://docs.python-requests.org/en/master/
url = "https://www.sec.gov/foia/docs/foia-logs.htm"
webpage = requests.get(url)   # download the page at that URL
print(webpage.content)        # the raw HTML of the page, as bytes

What can you do with this?

filename = "SEC_FOIALogs.html"
file = open(filename, 'wb')

In Jupyter Notebook you can type a question mark after a function name to open a help menu. Viewing the help after executing open? will show that the 'wb' in open(filename, 'wb') opens a new, writable file named "SEC_FOIALogs.html" in binary mode.
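For example, running the following by itself in a notebook cell brings up that help text (in a plain Python interpreter, help(open) shows the same information):

open?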

file.write(webpage.content)
file.close()

Using a library to parse the page

from bs4 import BeautifulSoup #https://www.crummy.com/software/BeautifulSoup/ 
# Beautiful Soup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html?highlight=href 

soup = BeautifulSoup(webpage.content, 'html.parser')
print(soup.prettify())  # the HTML, indented to show how the tags are nested
print(soup.get_text())  # only the text of the page, with all HTML tags stripped out
soup.find_all('a')      # every <a> (link) tag on the page

Finding all of the links

for link in soup.find_all('a'):
    print(link.get('href'))

Finding all of the links to csv files

for link in soup.find_all('a'):
    print(link.get('href').find('csv'))

Oops. It seems the .find('csv') call returns an error. When we compare with the earlier list of links, we'll see that some entries were blank. This indicates that some links do not contain URLs, for whatever reason: for those, link.get('href') returns None, and calling .find() on None is what raises the error.

To avoid the errors, we first need to test the link to make sure it is not None.

for link in soup.find_all('a'):
    testlink = link.get('href')
    if testlink is not None:
        print(testlink.find('csv'))

And now, when we print testlink.find('csv'), there are no errors. The find function returns either the starting index where the search pattern is located, or -1 to indicate that the pattern does not occur in the string, as the short example below shows. Now that we understand how the find function works, we can add a few lines to our for loop so it only prints the URLs that contain 'csv'.
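A quick illustration with a made-up string (the real hrefs on the page will differ):

example = "foia/docs/foialog.csv"
print(example.find('csv'))   # 18, the index where 'csv' starts
print(example.find('pdf'))   # -1, because 'pdf' does not occur in the string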

for link in soup.find_all('a'):
    testlink = link.get('href')
    if testlink is not None:
        if(testlink.find('csv') != -1):
            print(testlink)

Now we want to save all of the csv files, and we want to use their original file names. To do that, we need to extract the file name from the testlink string. If we split the testlink string we are left with after the for loop has finished running, we'll see that it isn't a link we can work with.

for link in soup.find_all('a'):
    testlink = link.get('href')
    if testlink is not None:
        if(testlink.find('csv') != -1):
            print(testlink)

testlink.split('/')[-1]
print(testlink)

So instead, we create a temporary variable, keeptestlink, that saves the last good (i.e. .csv) link name inside the loop.

for link in soup.find_all('a'):
    testlink = link.get('href')
    if testlink is not None:
        if(testlink.find('csv') != -1):
            keeptestlink = testlink
            print(testlink)
keeptestlink.split('/')[-1]
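Splitting on '/' breaks the path into its pieces, and [-1] selects the last piece, which is the file name. A quick illustration with a made-up path (the real links on the page will differ):

"foia/docs/foialog2024.csv".split('/')        # ['foia', 'docs', 'foialog2024.csv']
"foia/docs/foialog2024.csv".split('/')[-1]    # 'foialog2024.csv'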

From this, we can create a header string, then print the two strings concatenated to each other to test if the string is a valid URL. Remember, we are trying to reconstruct the URLs, so we can save the csv files linked from this page.

headerurl = 'https://www.sec.gov'
print(headerurl+keeptestlink)
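If you want to confirm that the reconstructed string really is a valid URL, one option is to request it and check the HTTP status code, as in this small sketch (a status code of 200 means the request succeeded):

response = requests.get(headerurl + keeptestlink)
print(response.status_code)   # 200 means the reconstructed URL points at a real file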

Make sure you are a good netizen

import time
for link in soup.find_all('a'):
    testlink = link.get('href')
    if testlink is not None:
        if(testlink.find('csv') != -1):
            print(testlink)
            webpage = requests.get(headerurl+testlink)   # rebuild the full URL and download the csv file
            filename = testlink.split('/')[-1]           # name the local copy after the file in the link
            file = open(filename, 'wb')
            file.write(webpage.content)
            file.close()
            time.sleep(5)                                # wait 5 seconds between requests so we do not overload the server
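Some sites also ask automated clients to identify themselves. One optional refinement, sketched below with an example value that you should replace with your own details, is to send a descriptive User-Agent header with each request:

import requests

# Example header only: substitute a string that identifies your script and gives your contact details.
headers = {'User-Agent': 'web-scraping-lesson-example learner@example.org'}
webpage = requests.get('https://www.sec.gov/foia/docs/foia-logs.htm', headers=headers)
print(webpage.status_code)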

Web Scraping with Bash

curl --limit-rate 10240 https://www.sec.gov/foia/docs/foia-logs.htm --output "test_foia.txt"

Key Points

  • Web scraping can be used to save all of the csv files linked from an html page