screeley

Co-Founder of Embed.ly

Change.org is changing responses based on User-Agent.

I really hate it when companies only value traffic from certain sites. Change.org is changing responses based on user agent and is pretty much prohibiting smaller sites from hitting them. Only the big guys like Facebook and Google get valid html responses. Here is an example:

Acing the YC Interview

Y Combinator interviews start in about 3 hours. Being on the East coast we can't really show up and give advice to the teams as they arrive, so here it is. From Boston with love.

Everyone has a different experience during their Y Combinator interview, but there are some common threads. I'm going to separate them into things you need to manage and things you need to do to ace the interview.

Manage

Fear. If you walk in nervous you are going to waste the first 5 minutes calming down and hitting stride. The interview will be pretty much over by the time you catch on fire. Throw a stiff drink down before you walk in that room or walk around the building a couple of times.

Bullshit. Don't just make things up because you think it's what they want to hear. It's probably not. Be completely honest, thoughtful and brief. The more questions you can answer in that 10 minutes the better.

Acing

Get to the demo. What did you build, why is it awesome and where are you going with it. Walk into that room with a laptop open to your demo. Introduce yourself and then get right into it.

On the day of our interview we asked the team before us how it went. They said: "I think well, but we never got to the demo. We talked theory and market the whole time." I _never_ saw those guys again. This is not a normal pitch where you have a deck, 30 minutes and an agenda. The YC team will most likely steer the conversation in any way they see fit. Your goal is to get the demo in and show that you are passionate, honest and will get shit done.

Disagree, but don't get stuck. Everyone wants to be right, but if you waste all your time fighting on details you will never get through it all. A tactic to move on is pretty simple. Acknowledge the difference in opinion, state your thesis and back it up.

"Ok, I understand. Our thesis is X and we have seen this through this data Y."

Ever get in an argument with someone that is devoutly religious? The conversation always ends with: "Because God said so". In our world data is God.

You are going to have to steer the conversation back to your startup at some point. They will want to talk amongst themselves and do a little debating. You can do this by bringing them back into the demo, by saying "hey, look at this." Remember that they are raccoons and you have something shiny.

Lastly, pay attention to Jessica. While she is not on the cover of Inc., Jessica reads people better than anyone else in that room. Her vote will go a long way towards getting you a phone call later that night.

Didn't get into YC? Up your funding odds 1600% with the MassChallenge.

Y Combinator interview decisions went out late last night. I've already heard back from a few teams that made it, but many more that didn't. From what I heard it was the most competitive YC application round ever. Thousands of companies have applications and ideas for startups that need a home or funding.

While many of you will apply to TechStars, Seed Camp or any of the other incubators in the world, I'd like to add another one to the list. The MassChallenge is out of Boston and the odds that you will get a significant amount of funding are much better than YC. Note that you don't have to be based in Boston or even plan to stay in Boston after. You just need to show up here for the pitches.

Last year the MassChallenge received 450 applications and gave $50k+ to 16 of them. Right off the bat about 3.6% of applicants get money, but the odds get better from there. "Top investors, lawyers and entrepreneurs" judge the initial applications down to 300. As long as you are even remotely serious you will get into this pool, so the odds of taking home money are more like 5.3%.

There is then a second round that takes that number from 300 to 100. This is an actual pitching round. You need to get in a room with 3 people and tell them about your startup. While this is a little tougher, I still feel that if you have been working on the project for any solid amount of time you are golden. The odds then go up to 16%.
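
The funnel arithmetic is easy to check. A minimal sketch, using only the numbers quoted above (the stage labels are mine):

```python
# Figures quoted in the post: 450 applications, cut to 300, then to 100,
# with 16 teams receiving money at the end.
applicants_by_stage = {
    "initial application": 450,
    "after judges' first cut": 300,
    "after pitch round": 100,
}
winners = 16

# Chance of being one of the winners, given you are still in the pool.
odds = {stage: winners / pool for stage, pool in applicants_by_stage.items()}

for stage, chance in odds.items():
    print(f"{stage}: {chance:.1%}")
```

Each cut you survive shrinks the pool while the number of checks stays at 16, which is why the odds climb at every stage.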

When it comes to initial odds, there is no better incubator than the MassChallenge. I think Y Combinator's acceptance numbers this year will be something like .01%. Just by taking the application you sent in to Y Combinator, revising it slightly and throwing it on the MassChallenge, your odds of getting funding are significantly increased.

Easily the most deterring factor to the MassChallenge is the application fee. To even get your application looked at you need to cough up $200, but you can get endorsements to lower this fee to nothing. Hit me up, I'm happy to endorse YC apps up to my limit and find others to endorse you as well. Each endorsement is worth $50 off your application fee.

While I don't agree with pay-to-play models and hope the MassChallenge changes it in the future, it is one of those necessary evils for a non-profit startup incubator.

So in summary, take your YC app, apply to the MassChallenge, increase your odds by 1600% and hit me up for an endorsement so you can recoup some of the application fee.

This is a limited time offer. The application deadline is Monday, April 11th.

How not to promote your Startup Accelerator on Hacker News

There was a thread on Hacker News yesterday that absolutely fascinated me. It shows a back and forth between John Harthorne of the MassChallenge and Paul Graham.

First off, I have a lot of respect for both John and Paul. John has done some amazing things getting the MassChallenge off the ground, funding 16 startups and giving a home to many more. I think Paul's merits need no explaining.

We were a MassChallenge finalist and a Y Combinator company, so this was equivalent to being a 13 year old girl and watching mommy and daddy fight center stage at a Justin Bieber concert.

It all started when the MassChallenge decided that it wanted to get a little press on Hacker News, hoping to draw in some applications for its upcoming deadline. The blog post was fine. It was on startupamericapartnership.org and had a glowing picture of the Relay Rides team accepting their $50k check, but it had a superficial tone that didn't make it sound authentic. Nonetheless, with their awesome post in hand they set out into the Hacker News water and things got ugly.

You can read all the comments here and the MassChallenge even asked for upvotes via their Twitter account. In the end I blame this all on rhizome. The post was simple enough:

It's still creepy.

It however elicited a response from John Harthorne the CEO and founder of the MassChallenge. He took a snarky comment from a random HN user and turned around and punched Y Combinator in the face with it:

Is it as creepy as taking 6% equity from a company that is still young and strategically weak?

To me that is molestation.

So obviously PG had to defend his honor:

I wouldn't be surprised to see this sort of comment from a random troll on HN, but I'm surprised to see it from one of the organizers of the program.

It suggests that as well as having rather bad judgment, you don't understand the math of equity. A "free" alternative is no bargain if you end up net worse off.

http://paulgraham.com/equity.html

John rebuts with math and economics:

But if we offer a free alternative with equivalent benefits, then 6% is way overpriced. And that's what we do.

Paul omits a discussion of opportunity cost. Sure, buying a coke for 50 cents is great ... but getting it for free is much better.

Remember that at 6.4% improvement, you end up even with ycombinator. At 6.4% improvement with MassChallenge, you end up 6.4% ahead.
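
For anyone checking the 6.4% figure: it follows from the break-even rule in the essay linked above. Giving up a fraction n of your company only pays off if the deal makes the company worth more than 1/(1 - n) times what it was. A quick sketch, with a function name of my own choosing:

```python
def break_even_multiplier(equity_given_up):
    """Factor by which the company's value must grow so that the founders'
    remaining share is worth what the whole company was worth before."""
    return 1.0 / (1.0 - equity_given_up)


multiplier = break_even_multiplier(0.06)  # 6% equity
improvement = multiplier - 1.0
print(f"{improvement:.1%}")
```

With 6% given up, the remaining 94% has to be worth at least as much as 100% was, hence the 6.4% threshold both sides are arguing around.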

and staunch ends it:

Do yourself a favor: wait until you have a track record and reputation that's at least 5% the quality of YC's before you start bragging about how much better you are.

Now it's very clear that these two have very different opinions on how a startup should be funded. Totally fine, but using the word `molestation` to describe another's methods is a little much.

There is a very clear difference between these two programs. Y Combinator is a business and its goal is to make money. The MassChallenge is a non-profit and its goal is to 'Foster Innovation'. They both have the same mission to create more startups, but at this point you would think it's the Sea Shepherd Conservation Society vs Japanese Whalers. (Yes, that was a Whale Wars reference.)

I'm not going to comment on the merits of either program. We have enjoyed the benefits of both. I'm just amazed by this whole back and forth. John's comments were downvoted to the point where you can barely read them, whereas PG's post garnered 10 points. It just shows that if you are going to swim in someone else's pond, you need to play nice.

 

Writing

I used to write significantly more than I do now. I would like to change that and start posting thoughts on startups, technology and Boston. They will probably turn into rants more than posts, but this is how it will be. The thoughts here are mine and not Embedly's unless the rest of the team agrees with me.

django-oembed and Embedly

I've forked the django-oembed project and added 46 more providers to the default install. oEmbed is only as good as the number of providers you can use with it and going from 15 to 61 providers is a step in the right direction.

Get embedding, people.

 

A Faster Python Script for Extracting Excerpts from Articles

om articles using Python and BeautifulSoup. It works well, but I would like to suggest some improvements by using lxml instead. It's a fairly simple problem: get the title and the description out of the head, and if there is no description, try to pull some content out of the body. The first two are easy and the last one sucks, but Python has tools that make our lives easier. BeautifulSoup is the go-to for web scraping in Python, but it suffers when it comes to performance. lxml is definitely faster, in this case about 3 times so.

When coding this I pretty much used the exact same method as David, just with lxml's functions instead. To retrieve the link we use cookielib and urllib2 like so:

 

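    import urllib2
    import cookielib

    # Build an opener that keeps cookies across requests.
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

    # This runs inside a function that returns (title, description).
    try:
        response = opener.open(url).read()
    except urllib2.URLError:
        return (None, None)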
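
The snippet above is Python 2. A rough Python 3 equivalent might look like the following; the `fetch` wrapper, its `timeout` default and the choice to return `None` on failure are my assumptions, not part of the original script:

```python
import http.cookiejar
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the URL can't be opened."""
    # Same idea as above: an opener that keeps cookies across requests.
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError:
        return None
```

In Python 3, `cookielib` became `http.cookiejar` and `urllib2` was split into `urllib.request` and `urllib.error`; the rest carries over unchanged.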

 

From there we can use lxml's fromstring method to create an HtmlElement like so:

 

    from lxml.html import fromstring
    doc = fromstring(response)

 

Now the fun part, cleaning the HTML. BeautifulSoup makes you work a little for this, but lxml comes with this functionality built in. By default the clean_html method strips out most everything, including the page structure, leaving you with just content. In this case we leave the page structure and meta intact and remove all the headers. safe_attrs_only needs to be False as well, otherwise it will strip the content attribute from the meta tags.

 

    from lxml.html.clean import Cleaner
    cleaner = Cleaner(
        meta=False,
        safe_attrs_only=False,
        page_structure=False,
        remove_tags=['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

    doc = cleaner.clean_html(doc)

 

This has removed all script tags, style tags, comments and everything else nasty from the page. Once that is complete we can then get the title and description from the head element then remove it.

 

    description = None
    try:
        path = '/html/head/meta[@content][@name="description"]'
        description = doc.xpath(path)[0].get("content")
    except IndexError:
        pass
    
    title = None
    try:
        title = doc.xpath('/html/head/title')[0].text_content().strip()
    except IndexError:
        pass
    
    if not description:
        #Get rid of the head element
        doc.head.drop_tree()
        # Taken from http://bit.ly/zsXZt
        p_texts = [p.strip() for p in doc.text_content().split('\n')]
        description = max((len(p), p) for p in p_texts)[1].strip()[0:255]
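
The fallback at the end is just "take the longest line of text on the page". Here is that heuristic on its own, as a small sketch that needs no HTML parser. The function name and the sample text are made up for illustration, and `max(..., key=len)` is swapped in for the tuple trick, which behaves the same except for how ties between equally long lines are broken:

```python
def longest_line_excerpt(text, limit=255):
    """Return the longest line of text, stripped and cut to limit characters."""
    lines = [line.strip() for line in text.split("\n")]
    longest = max(lines, key=len, default="")
    return longest[:limit]


page_text = (
    "Home | About\n"
    "A long opening paragraph that actually says something about the article.\n"
    "Share"
)
print(longest_line_excerpt(page_text))
```

Navigation links and button labels tend to be short, so the longest line is usually real body copy. It's crude, but it works surprisingly often.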

 

The last part is taken directly from David's post. Now the results. Note that I took out the link retrieval time when clocking this, so it's only the time it took to parse the HTML. It takes about 1.32 seconds to process using BeautifulSoup, while lxml takes around 0.33 seconds. Not the most scientific study, but it validated the performance increase for me.
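
If you want to repeat the comparison, one way to time only the parsing step is the standard library's timeit module. This is a sketch, not the harness I used: `time_parser` is a hypothetical helper and `count_paragraphs` is a stand-in so the example runs without either library installed.

```python
import timeit


def time_parser(parse, html, repeat=5):
    """Return the best wall-clock time, in seconds, of parse(html) over several runs."""
    return min(timeit.repeat(lambda: parse(html), number=1, repeat=repeat))


# Stand-in parser so the sketch runs on its own; swap in the real
# BeautifulSoup- and lxml-based functions to compare them.
def count_paragraphs(html):
    return html.count("<p>")


sample = "<html><body>" + "<p>text</p>" * 1000 + "</body></html>"
elapsed = time_parser(count_paragraphs, sample)
print(f"{elapsed:.6f} seconds")
```

Fetch the page once, keep the HTML in a string, and pass the same string to both parsers so that network time never enters the measurement.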

You can find the full code on GitHub here: 

Django Alfresco

I have a wheelhouse, and it's integrating Django with Java open source projects. Today I get to announce the next one, Django Alfresco. We combined Alfresco's document management capabilities with Django's web tier components. I get mixed reactions when I tell people about this project, anywhere from "Why did you go and mess up a good thing?" to "This is amazing." The former more than the latter, but I'm going to try to convince you that it is a really good idea to use this project. Jeff Potts, who is the ECM lead at Optaros and got the project to a place where it could be released, has a post on it and a screencast.

Why

About 9 months ago I got shipped off to work for a client in Dallas, Texas. They were implementing Alfresco to handle the content of their intranet newspaper, mostly for the workflow process: getting a story from draft, to pending, and then through approval. They needed a web tier to display this content and that's where the issues started. Alfresco has a few of its own solutions: WCM and Surf. WCM has notoriously had stability issues, but the blocker was that you couldn't search across sites. Surf was very new at the time and it takes about 5 XML files to display an image. I'm probably exaggerating, but that's what it felt like. A portal was the next thought, Liferay or JBoss, but why all the weight of a portal when all we need to do is pull HTML files out of Alfresco?

So we created a simple POC for them using Alfresco's REST interface and Django as the web tier. Simple, easy and it took only a couple days to build. After getting over the initial anxiety of Python, Django and Apache we were on our way.

Content

There are two main functions a news site needs to perform: get a listing of documents based on a category and display a detailed view of that content. We have a simple hierarchy structure which maps a space id to a category, so in a sense a folder in Alfresco becomes a category and every piece of content in that folder now belongs to it. Django has deserializers which allow an XML or JSON file to be converted into Django Model objects seamlessly. Using an Alfresco Webscript we format the response with FreeMarker into a Django-friendly XML document. With a Space model we run the following code and like magic we have a list of Python objects:

 

In [1]: from alfresco.models import Space

In [2]: space = Space.objects.all()[0]

In [3]: space.contents.all()

Out[3]: 
[Content: 1e66b2b3-2dba-4a5f-9527-de754c3a983e - test-1.html,
 Content: 0599e21d-d078-4911-a4a6-8ad5a7ae7f1d - test-2.html]

 

The responses from Alfresco are cached using whatever Django caching backend you choose. We recommend switching away from the default local memory setting to file based caching or memcached.
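
As a rough idea of what that switch looks like, here is a settings fragment assuming a current Django release, which configures caching through a `CACHES` dict (older releases used a single `CACHE_BACKEND` string instead). The directory path is a placeholder:

```python
# settings.py fragment: file-based caching instead of local memory.
# The LOCATION path is a placeholder; point it at a writable directory.
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.filebased.FileBasedCache",
        "LOCATION": "/var/tmp/django_cache",
    }
}

# For memcached on recent Django versions, the backend would instead be
# "django.core.cache.backends.memcached.PyMemcacheCache" with a LOCATION
# such as "127.0.0.1:11211".
```

Local memory caches are per-process, so each server process would keep its own copy of every Alfresco response. A file or memcached backend is shared between processes.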

Authentication

To get access to Alfresco's content repository the user needs to be authenticated. Because of this there is no such thing as an anonymous user, but instead a default user. If a user visits the site for the first time they are automatically logged in as the default user, which gives them access to basic content. Users that then log into the system with advanced privileges will get access to more content in searches and category displays. Through Django's Authentication Backends we completely circumvent all of Django's authentication and let Alfresco handle it. The Webscripts use an alf_ticket to authenticate users, therefore we had to save that ticket somewhere. Here's the AlfrescoUser model:

 

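from django.contrib.auth.models import User, UserManager
from django.db import models

class AlfrescoUser(User):
    """
    Alfresco User.
        Extends the User model to apply the ticket.
    """
    # The alf_ticket that the Webscripts use to authenticate this user.
    ticket = models.CharField(max_length=50, blank=True, null=True)
    objects = UserManager()

By extending User we get the built-in user management functions, but now we have a ticket as well.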

 

Sample Site

Django Alfresco ships with a sample site which makes it really easy for users to evaluate the project before jumping in. You can find the install instructions in the code under docs. Unfortunately we don't have these hosted yet, as they aren't complete.

Thanks

Big thanks go to Justin Luzier, Ron Bostic and Kris McCuller, who co-developed this with me, and to Jeff Potts, who took it from a piece of code to something worth sharing.

For more information on it follow @jeffpotts01 or me (screeley) on Twitter.

Django Daemon Command Extension

Recently I started to move Cubby Scott away from a cron and towards a queue. It's hard to be real time when you wake up a cron job once every 3 minutes. Lame. I'm also in the process of adding screenshots and content retrieval. Both take a good amount of time to process. The queue part was easy after reading Rabbits and Warrens and Working with Python and RabbitMQ. The problem came when I started working on the consumer. No one ever talks about the consumer. Well, I'm going to give the consumer some love.

The consumer should be a daemon, but what's the best way to do that? It would be nice if I could just use Django's built-in management commands rather than having one-off scripts, i.e.:

 

python manage.py linkconsumer

 

I would run that once when I started up the server and be good to go. It turned out to be pretty easy with python-daemon. I threw together a quick class to handle it and you can get a copy of the DaemonCommand here. All it really does is create an interface for a daemon context and open it. When subclassing the DaemonCommand, instead of overriding handle use handle_daemon.

Carrot is the open source project that ties the two together, but it was a little too complex. So for the purpose of keeping it simple, stupid, I used Nathan Borror's Flopsy. Dead simple way to communicate with a queue.

If we put that together we accomplish our goal in 20 lines of code.

 

from daemonextension import DaemonCommand
from django.conf import settings
import os

class Command(DaemonCommand):
    #Declare Daemon std.
    stdout = os.path.join(settings.DIRNAME, "log/cubbyscott.out")
    stderr = os.path.join(settings.DIRNAME, "log/cubbyscott.err")
    pidfile = os.path.join(settings.DIRNAME, "pid/cb_link.err")
    
    def handle_daemon(self, *args, **options):
        from flopsy import Connection, Consumer
        consumer = Consumer(connection=Connection())
        consumer.declare(queue='links', 
                         exchange='cubbyscott', 
                         routing_key='importer', auto_delete=False)
        
        def message_callback(message):
            print 'Received: ' + message.body
            consumer.channel.basic_ack(message.delivery_tag)
        
        consumer.register(message_callback)
        
        consumer.wait()

CubbyScott.com | An experiment in 140 character requirements

Is there a 3rd party twitter app that builds a link page based on my follows? If not, someone should build it. It would be my start page.

Fred Wilson posted this tweet a few days ago, a pretty simple requirement. Get all users that Fred is following, parse, get the links and display them for Fred's viewing pleasure. Personally I really like this idea. The problem with an asymmetrical relationship is that you really only follow that person for the interesting links they post. I follow mostly tech people and honestly, their personal comments don't really do much for me. It would be great if I could get all those links into one feed and filter out all the noise.

So in the last 4 days I put together an application to do this. Personal web developer to Fred Wilson and hopefully a few others out there.

First off, a few requirements that I added:

* I'd rather not make the user authenticate, but it turns out that's not such a good idea. Say you are Fred Wilson and you follow 370 people. This means I have to make 370 calls to the Twitter APIs to get all your friends' feeds, a problem when I'm rate limited to 100 an hour. This means you need to authenticate so I can use the friends timeline method. I don't want your password and you don't want to give it to me. Hence we used OAuth.
* Get the real link and title of the page, not just the shortened URL.
* Group by URLs to get rid of RTs.
* Atom feeds. I hate leaving Google Reader if I don't have to.
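
The grouping requirement is the easiest to show in code. This is a simplified sketch rather than the code running on the site: `group_links` and the sample tweets are made up, and it groups on the URL exactly as it appears in the tweet, skipping the step that resolves shortened links to the real one.

```python
import re
from collections import defaultdict

URL_PATTERN = re.compile(r"https?://\S+")


def group_links(tweets):
    """Map each URL to the list of users who posted it, in first-seen order.

    tweets is an iterable of (username, text) pairs.
    """
    posters_by_url = defaultdict(list)
    for username, text in tweets:
        for url in URL_PATTERN.findall(text):
            url = url.rstrip(".,!?)")  # drop trailing punctuation
            if username not in posters_by_url[url]:
                posters_by_url[url].append(username)
    return dict(posters_by_url)


tweets = [
    ("alice", "Great read: http://example.com/post"),
    ("bob", "RT @alice: Great read: http://example.com/post"),
    ("carol", "Unrelated http://example.org/other."),
]
links = group_links(tweets)
```

Here the retweet collapses into the original: `links` has one entry per URL, and the entry for the shared link lists both alice and bob. The number of posters per link also makes a cheap popularity score for ordering the page.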

The main issue I have with sites like this is privacy. http://tweetlnks.com/ does sort of the same thing, but look at them. I don't want to give my password to you, ever. This is the reason why I went with OAuth. The other thing is that if your friend's tweets are protected I don't save them. They may be the most interesting person in the world, but if I can't display them in an open feed, they are no good to anyone.

Try it out at http://www.cubbyscott.com/. After you log in, take note of your Atom feed, because after you authenticate you may never have to come back to the site again. Twitter says that your access token is good forever, so once every 5 minutes or so we grab it and parse your feed again. I'm really hoping that no one who has 300,000 friends logs in, because I'll need a fail whale. Please note that Cubby Scott is definitely very alpha and if it breaks I blame you.

The last interesting part of this app is that a Cubby Scott user's link feed isn't protected in any way. I'd be interested in looking at a link feed from @biz, @fredwilson or @parislemon to see what they are reading. Privacy issue? You tell me.

The site is built using Django and it's being served off a Linode instance. I'd like to open source the OAuth code at some point, but for now it's just a mess. There is a fair amount of caching happening in the backend as well. Not ideal for a real-time system, but it's being powered by gerbils at this point.

Feedback is welcome. I'm not a designer so, "Your site looks like crap" won't help me much.

And if you're interested, Cubby Scott is named after a road in Peterborough, NH.