In an earlier post I promised to show how I use Python with the BeautifulSoup and Mechanize modules to scrape information from different websites. As a fun exercise, and something that should interest the readers of R-bloggers, I thought it would be interesting to scrape the R-bloggers site. This will form the basis for a later post on how to write the perfect R-bloggers post.
The first step in the procedure is to write a small script that downloads the pages we want to scrape into a directory on the hard drive. First open up a new file in your favorite Python editor, and name it download.py.
In our script we first import the modules that we need, namely the BeautifulSoup, mechanize and time modules. The BeautifulSoup module provides a very nice set of functions to parse HTML code, and the mechanize module allows us to emulate a browser in our script, which is handy when looping through the pages of a website. Finally, the time module is used to introduce pauses in our script so that we do not hit a website too hard.
from BeautifulSoup import BeautifulSoup
import mechanize
import time
The next step is to define the url from which we want to start the scraping, and initialize our browser:
url = "http://www.r-bloggers.com/page/157/"
br = mechanize.Browser()
The next step is to tell our simulated browser to open the starting page:
page = br.open(url)
Since it is likely that we will find some malformed HTML or missing pages that our script cannot handle, we set up a text file that will tell us which pages our script failed to download:
errorlog = open("errorlog.txt","w")
errorlog.write("Pages not downloaded:\n")
errorlog.close()
Finally we create a list to hold the next link we want to visit, and a counter that will be part of our unique id for every post:
count = 0
nextLink = list()
The main part of the script will be a while loop that will terminate when our nextLink variable no longer holds any information. The first part of the loop looks like this:
while nextLink != None:
    time.sleep(1)
    links = list()
    soup = BeautifulSoup(page)
    excerpts = soup.findAll("p",{"class":"excerpt"})
    for excerpt in excerpts:
        link = excerpt.findNext("a", {"class":"more-link"})["href"]
        links.append(link)
The time.sleep() function tells the script to pause for one second every time we go to a new page. The links list contains all the links on a given page that we want to visit. To begin the scraping we first run the raw HTML of the page through the BeautifulSoup module. This creates an object of class BeautifulSoup, which we can search and parse information from. Since all the links we want to collect on a given page are below the post excerpts, we first locate the excerpts and then select the next link that points to the whole post. All the links are stored in the links list we created above:
    for excerpt in excerpts:
        link = excerpt.findNext("a", {"class":"more-link"})["href"]
        links.append(link)
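For readers who want to experiment with this step without BeautifulSoup installed, the same idea ("find an excerpt, then take the next more-link") can be sketched with the html.parser module from Python 3's standard library. This is a simplified stand-in and not the script used in this post: the class name MoreLinkParser, the helper extract_post_links and the sample markup are all made up for illustration.

```python
from html.parser import HTMLParser


class MoreLinkParser(HTMLParser):
    """Collect the first a.more-link href that follows each p.excerpt."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._after_excerpt = False  # True once an excerpt has been seen

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "p" and "excerpt" in classes:
            self._after_excerpt = True
        elif tag == "a" and "more-link" in classes and self._after_excerpt:
            href = attrs.get("href")
            if href:
                self.links.append(href)
                # Wait for the next excerpt before collecting another link.
                self._after_excerpt = False


def extract_post_links(html):
    """Return the more-link URLs found in an HTML string."""
    parser = MoreLinkParser()
    parser.feed(html)
    return parser.links


# Made-up sample markup, only loosely modelled on a listing page.
sample = """
<p class="excerpt">First post...</p>
<a class="more-link" href="http://example.com/post-1/">Read more</a>
<a class="other" href="http://example.com/ignored/">Ignore me</a>
<p class="excerpt">Second post...</p>
<a class="more-link" href="http://example.com/post-2/">Read more</a>
"""

post_links = extract_post_links(sample)
```

Here post_links ends up holding the two more-link URLs, and the link with a different class is skipped.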
The next step is to loop over the collected links and download the page that contains the whole post. Basically we follow the link with our simulated browser, open an HTML file in our chosen directory, save the content of the page in that file, close it, and add 1 to our counter. The print statement prints the filename of the current page to the console, so that we can see how far the loop has gone.
This loop is contained within a try-except structure. In case we hit upon a missing page, or any other error, the link that we followed will be written to the errorlog.txt file we created earlier.
    for link in links:
        try:
            # follow the link and read the raw html of the full post
            site = br.open(str(link)).read()
            filename = "/Users/thomasjensen/Documents/RBloggersScrape/download/post" + str(count) + ".html"
            print filename
            html = open(filename,"wb")
            html.write(site)
            html.close()
            count += 1
        except Exception:
            # record the link we could not download, then carry on
            error = open("errorlog.txt","a")
            text = str(link) + "\n"
            error.write(text)
            error.close()
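The same save-and-log pattern can also be wrapped in a small function, which makes it easy to try out without touching the network. The sketch below is written for Python 3 and uses only the standard library. The names save_pages and fake_fetch are hypothetical, and the fetch argument is an assumption of mine: any callable that takes a URL and returns the page as bytes would do, for instance a wrapper around br.open(link).read().

```python
import os
import tempfile


def save_pages(links, fetch, out_dir, error_log, start=0):
    """Save each page as out_dir/post<N>.html and log the links that fail.

    Returns the counter value after the last successful download.
    """
    count = start
    for link in links:
        try:
            content = fetch(link)
            filename = os.path.join(out_dir, "post%d.html" % count)
            with open(filename, "wb") as html:
                html.write(content)
            count += 1
        except Exception:
            # Record the link so it can be inspected or retried later.
            with open(error_log, "a") as log:
                log.write(str(link) + "\n")
    return count


def fake_fetch(url):
    """Stand-in for a real download; fails for one made-up URL."""
    if "missing" in url:
        raise IOError("page not found")
    return b"<html>" + url.encode("utf-8") + b"</html>"


out_dir = tempfile.mkdtemp()
error_log = os.path.join(out_dir, "errorlog.txt")
test_links = [
    "http://example.com/a",
    "http://example.com/missing",
    "http://example.com/b",
]
saved = save_pages(test_links, fake_fetch, out_dir, error_log)
```

The with blocks close the files even when a write fails, and catching Exception rather than using a bare except still lets you stop a long run with Ctrl-C.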
The final step is to see whether there is a link to an older page that contains more posts, and then follow that link if it exists. This is done by first locating the current page number and adding one to it. If we can find a link to a page with that number, we try to follow it. If no such link is found, the while loop will terminate.
The script in its entirety is below. The script will take about two hours to run and will collect more than 5900 posts, so if you decide to run it, go make a cup of tea and find an interesting book to read.
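The page-counter logic described above can be sketched roughly as follows. This is a minimal illustration in Python 3, not the code from the script itself: the function names next_page_url and find_next_link are hypothetical, the /page/<number>/ URL pattern is assumed from the starting url used earlier, and the lists of links are made up.

```python
import re


def next_page_url(url):
    """Return the URL of the following listing page,
    or None if `url` does not end in /page/<number>/."""
    match = re.search(r"/page/(\d+)/?$", url)
    if match is None:
        return None
    next_number = int(match.group(1)) + 1
    return url[:match.start()] + "/page/%d/" % next_number


def find_next_link(hrefs, current_url):
    """Return the next-page URL if the page links to it, else None."""
    candidate = next_page_url(current_url)
    if candidate is not None and candidate in hrefs:
        return candidate
    return None


current = "http://www.r-bloggers.com/page/157/"

# Made-up links, standing in for the hrefs found on the current page.
hrefs_on_page = [
    "http://www.r-bloggers.com/page/156/",
    "http://www.r-bloggers.com/page/158/",
]
next_link = find_next_link(hrefs_on_page, current)

# On the last page there is no link to a higher page number.
last_page = find_next_link(["http://www.r-bloggers.com/page/156/"], current)
```

A None result plays the role of the empty nextLink in the loop condition: once no link to the next page number is found, there is nothing left to follow.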
Stay tuned, tomorrow I will cover how to extract information from the pages we have downloaded, and put the information into a .csv file that we can read into R for further analysis.