screeley.com

A Faster Python Script for Extracting Excerpts from Articles

July 1

A couple of weeks ago David Ziegler posted an article on how to extract excerpts from articles using Python and BeautifulSoup. It works well, but I would like to suggest some improvements by using lxml instead.

It's a fairly simple problem: get the title and the description out of the head, and if there is no description, try to pull some content out of the body. The first two are easy and the last one sucks, but Python has tools that make our lives easier. BeautifulSoup is the go-to for web scraping in Python, but it suffers when it comes to performance. lxml is definitely faster, in this case roughly 3 to 4 times so.

When coding this I used pretty much the exact same method as David, just with lxml's functions instead. To retrieve the page we use cookielib and urllib2, like so:

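    import urllib2
    import cookielib

    # Python 2 modules. This snippet sits inside a function that
    # receives `url`, hence the return statement below.
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

    try:
        response = opener.open(url).read()
    except urllib2.URLError:
        # Could not fetch the page, so there is nothing to extract.
        return (None, None)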
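
The snippet above targets Python 2. On Python 3 the same modules live at urllib.request and http.cookiejar, so a rough equivalent could look like the sketch below. The function name fetch_html and the timeout parameter are assumptions for illustration, not part of the original script.

```python
import urllib.error
import urllib.request
from http.cookiejar import CookieJar


def fetch_html(url, timeout=10):
    """Return the raw response body for url, or None if the request fails.

    fetch_html and timeout are hypothetical names used for this sketch.
    """
    cj = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    try:
        return opener.open(url, timeout=timeout).read()
    except (urllib.error.URLError, ValueError):
        # URLError covers unreachable hosts and HTTP errors (HTTPError is
        # a subclass); ValueError is raised for a URL without a scheme.
        return None
```

Returning None on failure plays the same role as the (None, None) early return in the original function: the caller can bail out before any parsing happens.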

From there we can use lxml's fromstring method to create an HtmlElement like so:

    from lxml.html import fromstring
    doc = fromstring(response)
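
To get a feel for what fromstring hands back, here is a small sketch that parses an in-memory page instead of a downloaded one. The sample HTML is made up for illustration, and the sketch assumes lxml is installed.

```python
from lxml.html import fromstring

# Hypothetical sample page, standing in for the downloaded response.
sample = (
    "<html><head><title> Example page </title>"
    '<meta name="description" content="A short summary."></head>'
    "<body><p>Hello</p></body></html>"
)

doc = fromstring(sample)

# For a full document, fromstring returns the root <html> element.
root_tag = doc.tag
# text_content() returns the element's text with the markup removed.
title = doc.xpath("/html/head/title")[0].text_content().strip()
# get() reads an attribute, here the content of the description meta tag.
description = doc.xpath('/html/head/meta[@name="description"]')[0].get("content")
```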

Now the fun part, cleaning the HTML. BeautifulSoup makes you work a little for this, but lxml comes with this functionality built in. By default the clean_html method strips out almost everything, including the page structure, leaving you with just the content. In this case we leave the page structure and meta tags intact and strip the heading tags (h1 through h6). Note that remove_tags drops only the tags themselves and pulls their text up into the parent; kill_tags is the option that discards the text as well. safe_attrs_only needs to be False as well, otherwise the content attribute is stripped from the meta tags.

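    from lxml.html.clean import Cleaner
    cleaner = Cleaner(
        meta=False,             # keep <meta> tags
        safe_attrs_only=False,  # do not limit attributes to the safe whitelist
        page_structure=False,   # keep <html>, <head> and <title>
        remove_tags=['h1', 'h2', 'h3', 'h4', 'h5', 'h6'],  # strip heading tags
    )

    doc = cleaner.clean_html(doc)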

This removes all script tags, style tags, comments and everything else nasty from the page. Once that is done, we can get the title and description from the head element. If there is no description, we drop the head and fall back to the body text.

    description = None
    try:
        path = '/html/head/meta[@content][@name="description"]'
        description = doc.xpath(path)[0].get("content")
    except IndexError:
        # No meta description on the page
        pass

    title = None
    try:
        title = doc.xpath('/html/head/title')[0].text_content().strip()
    except IndexError:
        # No title element
        pass

    if not description:
        # Get rid of the head element
        doc.head.drop_tree()
        # Taken from http://bit.ly/zsXZt
        # Use the longest line of text as the description, cut to 255 characters
        p_texts = [p.strip() for p in doc.text_content().split('\n')]
        description = max((len(p), p) for p in p_texts)[1].strip()[0:255]
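
To see the fallback on its own, here is a sketch of the longest-line heuristic applied to plain text, with no lxml involved. The function name longest_paragraph and the sample text are assumptions for illustration; the 255-character limit matches the code above.

```python
def longest_paragraph(text, limit=255):
    """Return the longest stripped line of text, cut to limit characters."""
    lines = [line.strip() for line in text.split("\n")]
    # Pair each line with its length so max() picks the longest one.
    return max((len(line), line) for line in lines)[1][:limit]


# Hypothetical page text: navigation, a real paragraph, and a share link.
sample_text = (
    "Home | About\n"
    "\n"
    "  This is the first real paragraph of the article.  \n"
    "Share"
)
excerpt = longest_paragraph(sample_text)
```

The heuristic assumes the article body is the longest run of text on the page, which is why the head element is dropped first and the cleaner removes scripts and styles.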

The last part is taken directly from David's post. Now for the results. Note that I excluded the link retrieval time when clocking this, so the numbers cover only the time it took to parse the HTML. Processing takes about 1.32 seconds using BeautifulSoup, while lxml takes around 0.33 seconds. Not the most scientific study, but it validated the performance increase for me.
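
If you want to repeat the comparison, a harness along these lines keeps the download out of the measurement. It is only a sketch: time_parser and the stand-in parser are hypothetical, and you would pass in your own BeautifulSoup-based and lxml-based functions along with HTML you have already fetched.

```python
import timeit


def time_parser(parse, html, repeat=5, number=10):
    """Return the best average time in seconds for one parse(html) call."""
    timer = timeit.Timer(lambda: parse(html))
    # repeat() returns one total per run; the minimum is the least noisy.
    best_total = min(timer.repeat(repeat=repeat, number=number))
    return best_total / number


# Stand-in parser so the sketch runs on its own; swap in the real functions.
def dummy_parse(html):
    return html.split("\n")


seconds = time_parser(dummy_parse, "<p>one</p>\n<p>two</p>")
```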

You can find the full code on GitHub here: http://gist.github.com/138642

Comments

lxml is great! thanks for sharing this script.

This is cool. A company I was consulting for insisted on using the newest version of Beautiful Soup instead of the one that actually works, so instead I switched it out for lxml. If you can actually get lxml to compile, it's awesome, but installing all the dependencies can be a pain.
