Website Scraping With Python and Beautiful Soup
So a buddy of mine asked me to help him write a shell script that could scrape content from a website and put it into a MongoDB database. I didn’t really feel like writing a shell script to do that since I figured it would be a huge pain in the a**, so I decided to try it with Python. After some research I stumbled upon Beautiful Soup. This actually turned out to be pretty easy, and in short order I had a script that could scrape the MegaMillions website, grab the date, winning numbers, and mega number from every drawing, and put that info into a MongoDB database.
Grab The Website
So the first thing that needs to be done is a simple urlopen on the website in question:
soup = BeautifulSoup(urllib2.urlopen('http://www.usamega.com/mega-millions-history.asp?p=1').read())
In our case we are going to pull down the first page of the MegaMillions winning number history and assign it to the variable soup. If you were to do a simple
print(soup.prettify())
you would see a nicely indented dump of the HTML behind the URL posted above.
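On Python 3 with the bs4 package (the package name that comes up in the comments below), the same step might look like this minimal sketch. The helper names fetch_html and make_soup and the SAMPLE_HTML snippet are made up for illustration, and the demo parses the inline snippet instead of hitting the live site.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

URL = "http://www.usamega.com/mega-millions-history.asp?p=1"

# Hypothetical stand-in for the downloaded page, so the demo runs offline.
SAMPLE_HTML = "<html><body><table><tr><td>05</td></tr></table></body></html>"


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def make_soup(html):
    """Parse HTML with the parser that ships with Python."""
    return BeautifulSoup(html, "html.parser")


# For the live page this would be: make_soup(fetch_html(URL))
soup = make_soup(SAMPLE_HTML)
print(soup.prettify())
```

Naming the parser explicitly ("html.parser") keeps bs4 from guessing one, which can otherwise give different trees on different machines.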
Parsing HTML Table Content With Beautiful Soup
I had to actually read the HTML code to determine that the fifth ‘table’ on the page (index 4, since counting starts at 0) was the one that contained the winning lottery numbers that I wanted to parse out. A simple
print soup('table')[4].prettify()
would output the full table content I was looking for. This is a good start; as you can see, I simply needed to supply the index of the table I wanted. I found other examples that would let you search for a table by its CSS class, etc., but in my case that wasn’t an option.
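Here is a small sketch of both ways to pick a table, written for Python 3 and bs4. The five-table page and the "results" class are invented for the example; the real page offered no such class, which is why the index was used.

```python
from bs4 import BeautifulSoup

# Hypothetical page with five tables; only the last one holds the data.
SAMPLE_HTML = (
    "<table><tr><td>nav</td></tr></table>" * 4
    + '<table class="results"><tr><td>05 09 22</td></tr></table>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Calling the soup like a function is shorthand for find_all().
tables = soup("table")
by_index = tables[4]  # index 4 is the fifth table

# When the markup offers a class (assumed here), it is a sturdier hook
# than a position that shifts whenever the layout changes.
by_class = soup.find("table", class_="results")

print(len(tables))
print(by_index is by_class)
```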
Iterating Through Table Rows
Now that I knew which table I wanted, I simply needed to iterate through the table rows, so I created a simple loop:
for row in soup('table')[4].findAll('tr'):
    tds = row('td')
    print tds
This would allow me to print each set of td tags contained in a tr tag individually:
[, Tuesday, September 18, 2012 ,, 05 · 09 · 22 · 36 · 49 + 36 ,, 3 ,, $15 Million ,]
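The same loop in Python 3 and bs4 might look like the sketch below. The table markup is a simplified guess at the page layout (date in cell 1, numbers in cell 3, as in the walkthrough), and skipping rows without td cells is my own addition for header rows.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified version of the results table.
SAMPLE_HTML = """
<table>
  <tr><th></th><th>Date</th><th></th><th>Numbers</th></tr>
  <tr><td></td><td><a href="#">Tuesday, September 18, 2012</a></td><td></td>
      <td>05 · 09 · 22 · 36 · 49 + 36</td></tr>
  <tr><td></td><td><a href="#">Friday, November 30, 2012</a></td><td></td>
      <td>11 · 22 · 24 · 28 · 31 + 46</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup("table")[0]

rows = []
for row in table.find_all("tr"):
    tds = row("td")
    if not tds:  # header rows contain only <th> cells
        continue
    # get_text() drops the tags and keeps only the text of each cell.
    rows.append([td.get_text(strip=True) for td in tds])

for cells in rows:
    print(cells)
```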
Print Strings From Specific Cells In HTML Table Row
Here is where Beautiful Soup really shines. I wanted to take only specific cells from the row and add them to a dictionary, but I only wanted the actual content string, not the HTML tags. First let’s isolate one of the cells in one of the rows:
print soup('table')[4].findAll('tr')[1].findAll('td')[1]
This gives us the following output:
Friday, November 30, 2012
But like I said, I wanted only the string information, and simply adding .string was not sufficient. I needed to tell Beautiful Soup that I wanted the string inside the link tag, like so:
print soup('table')[4].findAll('tr')[1].findAll('td')[1].a.string
Which gives me the following output:
Friday, November 30, 2012
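One likely reason a bare .string comes back empty is that the cell holds more than one child. The sketch below (Python 3, bs4) uses a made-up cell where whitespace surrounds the link; I am assuming the real cell looked something like this.

```python
from bs4 import BeautifulSoup

# Hypothetical cell: the whitespace around the link counts as extra text nodes.
CELL_HTML = '<td> <a href="#">Friday, November 30, 2012</a> </td>'

td = BeautifulSoup(CELL_HTML, "html.parser").td

# .string is None when a tag has more than one child.
print(td.string)

# Stepping into the link leaves a single text child, so .string works.
print(td.a.string)

# get_text() gathers all text in the cell regardless of nesting.
print(td.get_text(strip=True))
```

If you don't know in advance which tag wraps the text, get_text(strip=True) is the more forgiving choice.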
Perfect. Next I wanted the actual winning numbers, which in this case are in the 4th cell of the row. So, again, simply printing index [3] would return the 4th cell:
print soup('table')[4].findAll('tr')[1].findAll('td')[3]
Which looks like this:
11 · 22 · 24 · 28 · 31 + 46
But, again, I only wanted the winning numbers, and in this case the winning numbers and the mega ball were both combined in the same cell. The good thing here is that the winning numbers and the mega ball are wrapped in different tags (which makes our life easy). To get just the winning numbers, we do this:
print soup('table')[4].findAll('tr')[1].findAll('td')[3].b.string
Which gives us our winning numbers separated by the · character (we can parse that out later):
11 · 22 · 24 · 28 · 31
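Parsing that string out later can be as small as the sketch below. The function name parse_numbers is made up, and it assumes the separator arrives as the literal middle dot character once the HTML has been parsed.

```python
def parse_numbers(text, separator="·"):
    """Split a string like '11 · 22 · 24' into a list of ints."""
    # int() tolerates the spaces left around each number after the split.
    return [int(part) for part in text.split(separator) if part.strip()]


winning = parse_numbers("11 · 22 · 24 · 28 · 31")
print(winning)
```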
To get the mega ball number, all we needed to do was pull out the string inside the ‘strong’ tag instead of the ‘b’ tag:
print soup('table')[4].findAll('tr')[1].findAll('td')[3].strong.string
And this gives us our mega ball number:
46
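Chaining .b.string or .strong.string blows up with an AttributeError on any row that lacks the tag, so a small guard can help. This is a sketch in Python 3 and bs4; the helper tag_text and the cell markup are assumptions based on the b/strong split described above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for the numbers cell, inferred from the b/strong split.
CELL_HTML = "<td><b>11 · 22 · 24 · 28 · 31</b> + <strong>46</strong></td>"

td = BeautifulSoup(CELL_HTML, "html.parser").td


def tag_text(cell, name):
    """Return the stripped text of the first <name> tag in cell, or None."""
    tag = cell.find(name)
    return tag.get_text(strip=True) if tag is not None else None


print(tag_text(td, "b"))       # the winning numbers
print(tag_text(td, "strong"))  # the mega ball
print(tag_text(td, "a"))       # no link in this cell, so None
```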
Final Script With MongoDB Integration
I won’t bore you with all the details on how to insert the data into MongoDB, or explain all the small Python details (there are websites way better than mine which can explain all that). I just wanted to give people a brief overview of the Beautiful Soup Python module and how they might better use it in their day-to-day coding. Here is the code all tied together:
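What follows is not the original script, just a minimal Python 3 sketch of how the steps above might fit together. The function names, the sample table, the connection URI, and the "lottery" and "megamillions" names are all placeholders. It uses MongoClient, which current pymongo provides in place of the older Connection class.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the results table; the cell
# positions (date in cell 1, numbers in cell 3) follow the walkthrough.
SAMPLE_HTML = """
<table>
  <tr><th></th><th>Date</th><th></th><th>Numbers</th></tr>
  <tr><td></td><td><a href="#">Friday, November 30, 2012</a></td><td></td>
      <td><b>11 · 22 · 24 · 28 · 31</b> + <strong>46</strong></td></tr>
  <tr><td></td><td><a href="#">Tuesday, September 18, 2012</a></td><td></td>
      <td><b>05 · 09 · 22 · 36 · 49</b> + <strong>36</strong></td></tr>
</table>
"""


def parse_drawings(html, table_index=0):
    """Turn each data row of the chosen table into a plain dict."""
    soup = BeautifulSoup(html, "html.parser")
    drawings = []
    for row in soup("table")[table_index].find_all("tr"):
        tds = row("td")
        if len(tds) < 4 or tds[1].a is None:
            continue  # header or spacer row
        date = datetime.strptime(tds[1].a.string.strip(), "%A, %B %d, %Y")
        numbers = [int(n) for n in tds[3].b.string.split("·")]
        drawings.append({
            "date": date,
            "numbers": numbers,
            "mega": int(tds[3].strong.string),
        })
    return drawings


def save_drawings(drawings, uri="mongodb://localhost:27017"):
    """Insert the dicts into MongoDB (needs pymongo and a running server)."""
    from pymongo import MongoClient  # imported lazily: only needed here

    client = MongoClient(uri)
    # Database and collection names are placeholders.
    client["lottery"]["megamillions"].insert_many(drawings)


drawings = parse_drawings(SAMPLE_HTML)
for drawing in drawings:
    print(drawing["date"].date(), drawing["numbers"], drawing["mega"])
```

Against the live page you would pass the downloaded HTML and table_index=4 to parse_drawings, then hand the result to save_drawings. Keeping the parsing separate from the database call means the parsing can be checked without a MongoDB server running.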
Thanks for the post, it got me going quickly.
Great tutorial!
Awesome tutorial!
Thank you!
Awesome! but have a question.
from BeautifulSoup import BeautifulSoup
from pymongo import Connection
The two lines above don’t work for me.
I’m assuming the first line needs to be “from bs4 import BeautifulSoup”
I installed “pymongo” but I can’t seem to find “Connection”
I’m new to this. Please help.
Thanks
You need to make sure to “sudo pip install BeautifulSoup”. For pymongo you are correct, the syntax has changed. I would suggest looking here for the updated info:
https://api.mongodb.org/python/current/tutorial.html