screeley.com

A Faster Python Script for Extracting Excerpts from Articles

July 1

A couple of weeks ago David Ziegler posted an article on how to extract excerpts from articles using Python and BeautifulSoup. It works well, but I would like to suggest some improvements by using lxml instead. It's a fairly simple problem: get the title and the description out of the head, and if there is no description, try to pull some content out of the body. The first two are easy and the last one is a pain, but Python has tools that make our lives easier. BeautifulSoup is the go-to for web scraping in Python, but it suffers when it comes to performance. lxml is definitely faster, and in this case about 3 times so.

When coding this I used pretty much the same method as David, just with lxml's functions instead. To retrieve the page we use cookielib and urllib2, like so:

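    import urllib2
    import cookielib

    # Build an opener that stores and resends cookies.
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

    # Excerpt from inside a function; url is the address of the article.
    try:
        response = opener.open(url).read()
    except urllib2.URLError:
        # The page could not be fetched, so give up early.
        return (None, None)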
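The snippet above is Python 2. For readers on Python 3, here is a rough equivalent built on the standard library's urllib.request and http.cookiejar. The function name fetch_html and the timeout value are my own placeholders, not part of the original script, so treat it as a sketch rather than a drop-in replacement.

```python
import http.cookiejar
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the raw response body as bytes, or None if the request fails.

    fetch_html and the timeout default are illustrative assumptions.
    """
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError:
        # Covers HTTP errors as well, since HTTPError subclasses URLError.
        return None
```

Returning None instead of (None, None) keeps the fetch step separate from the title and description logic; the caller can decide what an empty result should look like.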

From there we can use lxml's fromstring function to create an HtmlElement, like so:

    from lxml.html import fromstring
    doc = fromstring(response)
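To get a feel for what fromstring hands back, here is a small self-contained sketch that parses an inline page instead of a downloaded one. The sample markup and variable names are made up for illustration.

```python
from lxml.html import fromstring

# Hypothetical sample page; a downloaded response would be used the same way.
sample_html = """
<html>
  <head>
    <title> Example page </title>
    <meta name="description" content="A short summary.">
  </head>
  <body><p>First paragraph.</p></body>
</html>
"""

doc = fromstring(sample_html)

# For a full document, the returned element is the <html> root.
root_tag = doc.tag
# findtext follows a simple path relative to the root element.
page_title = doc.findtext("head/title").strip()
# xpath returns a list of matching elements.
paragraphs = [p.text_content() for p in doc.xpath("//p")]
```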

Now the fun part: cleaning the HTML. BeautifulSoup makes you work a little for this, but lxml comes with this functionality built in. By default the clean_html method strips out almost everything, including the page structure, leaving you with just the content. In this case we leave the page structure and meta tags intact and strip the heading tags. safe_attrs_only needs to be False as well, otherwise the content attribute will be stripped from the meta tags.

    from lxml.html.clean import Cleaner
    cleaner = Cleaner(
        meta=False,
        safe_attrs_only=False,
        page_structure=False,
        remove_tags=['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

    doc = cleaner.clean_html(doc)
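One detail worth knowing: as the Cleaner documentation describes it, remove_tags strips only the tags and pulls their text up into the parent, while kill_tags drops the content too. The sketch below illustrates that difference using the element methods drop_tag() and drop_tree() directly, not Cleaner itself, on made-up markup.

```python
from lxml.html import fromstring

# Hypothetical fragment with one heading and one paragraph.
fragment = "<div><h1>Heading</h1><p>Body text.</p></div>"

# drop_tag() removes the element but keeps its text in the parent.
tags_removed = fromstring(fragment)
for heading in tags_removed.xpath(".//h1"):
    heading.drop_tag()
text_after_drop_tag = tags_removed.text_content()

# drop_tree() removes the element together with everything inside it.
tags_killed = fromstring(fragment)
for heading in tags_killed.xpath(".//h1"):
    heading.drop_tree()
text_after_drop_tree = tags_killed.text_content()
```

If the heading text should not compete with body paragraphs for the excerpt, kill_tags may be the better fit; that is a judgment call rather than something the original script does.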

This removes all script tags, style tags, comments and everything else nasty from the page. Once that is done we can get the title and description from the head element, and then drop the head before falling back on the body text.

    description = None
    try:
        path = '/html/head/meta[@content][@name="description"]'
        description = doc.xpath(path)[0].get("content")
    except IndexError:
        pass
    
    title = None
    try:
        title = doc.xpath('/html/head/title')[0].text_content().strip()
    except IndexError:
        pass
    
    if not description:
        # Get rid of the head element
        doc.head.drop_tree()
        # Taken from http://bit.ly/zsXZt
        p_texts = [p.strip() for p in doc.text_content().split('\n')]
        description = max((len(p), p) for p in p_texts)[1].strip()[0:255]
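The fallback is just "take the longest line of text and truncate it", so it can be pulled out and tried on its own. Below is a minimal sketch of that heuristic as a standalone function; the name longest_line_excerpt and the limit parameter are mine. It differs slightly from the code above: it picks with max(..., key=len), so ties go to the first line rather than the lexicographically greatest one, and it returns None when there is no text at all.

```python
def longest_line_excerpt(text, limit=255):
    """Return the longest non-empty line of text, cut to limit characters.

    Returns None when the text contains no non-blank lines.
    """
    lines = [line.strip() for line in text.split("\n")]
    lines = [line for line in lines if line]
    if not lines:
        return None
    return max(lines, key=len)[:limit]


# Example input standing in for doc.text_content().
page_text = "Nav\n  A much longer paragraph of body text.  \nFooter"
excerpt = longest_line_excerpt(page_text)
```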

The last part is taken directly from David's post. Now the results. Note that I left out the link retrieval time when clocking this, so it's only the time it took to parse the HTML. It takes about 1.32 seconds to process using BeautifulSoup, while lxml takes around 0.33 seconds. Not the most scientific study, but it validated the performance increase for me.
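If you want to repeat the comparison, one way to time only the parsing step is to download the page once and then time the parse function on the saved string. This is a sketch of that idea using the standard timeit module, not the harness I used for the numbers above. time_parse and count_tags are placeholder names, and count_tags is a stand-in so the sketch runs without third-party packages.

```python
import timeit


def time_parse(parse, html, repeat=5):
    """Return the best wall-clock time in seconds for one parse(html) call."""
    timings = timeit.repeat(lambda: parse(html), number=1, repeat=repeat)
    return min(timings)


def count_tags(html):
    # Stand-in parser; swap in the lxml or BeautifulSoup based function.
    return html.count("<")


# The HTML is prepared up front, so fetching is not part of the measurement.
sample = "<p>hello</p>" * 1000
best = time_parse(count_tags, sample)
```

Taking the minimum over several runs reduces noise from whatever else the machine is doing at the time.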

You can find the full code on GitHub here: http://gist.github.com/138642

Comments

lxml is great! thanks for sharing this script.

This is cool. A company I was consulting for insisted on using the newest version of Beautiful Soup instead of the one that actually works, so instead I switched it out for lxml. If you can actually get lxml to compile, it's awesome, but installing all the dependencies can be a pain.

So your RSS feed doesn't offer the full text of posts?

I like the logo :)

Your blog takes a long time to load; apparently the hosting isn't great.

One of my LiveJournal friends already wrote about this :(

Sorry for the off-topic. Do you sell site-wide links on your site? If so, please get in touch with me!

By the way, today is Archivist's Day. Does your site have an "Archive"? Then you can celebrate! :))


I'm a developer out of Boston, MA, and I work for a consulting firm specializing in open source technologies.

This space will deal with the work I've participated in using the Django framework to build applications for enterprise clients.

Finally, I hate the word blog and Drupal.
