Every investor wants to find the hidden unicorn in a sea of potential investments. Identifying founders with unicorn potential, especially early in their careers, is extremely difficult. Even so, there are many attempts to predict the success of a company; I found more than 20 in a quick Google search. Some are tricks of the trade from investors sharing their perspective on what matters most. Others come from machine learning engineers searching for predictive insights in big data.
Unfortunately, these methods ultimately fall short, each for its own reasons. The insights investors provide are valuable perspectives into what they consider important, but they are effectively impossible to replicate. One investor’s internal scale for measuring the “focus” of a founder isn’t something that can be easily calibrated in another investor’s process. While I’m a firm believer that humans are excellent pattern recognizers, we can only hold onto so much information. These investors’ “rules” may capture only part of a much larger picture and are likely biased in some non-obvious way, which means the rules may not apply to all founders.
I’m also a firm believer that data, combined with machine learning, data science, artificial intelligence, and pattern recognition techniques, can predict startup success (with some degree of accuracy). However, the methods I’ve seen typically rely on publicly available data, which brings its own set of assumptions. I don’t reference these works here because I don’t want to diminish the work they’re doing. I think it’s valuable, but the problem is surrounded by a lot of uncertainty.
In the rest of this post, I will focus on the data-driven approach to predicting startup success. I will give a high-level overview of the data, standard AI practices, and what we did to deliver a solution developed specifically for this purpose.
Ask any data scientist, and they’ll likely tell you one of the most frustrating parts of the job is cleaning the data. I have been working with data for 10+ years, and not once have I received clean data. The question isn’t whether the data is dirty, but how dirty it is. The first two things I look for are whether the data is incomplete or incorrect. Data is incomplete if any data points are missing, and incorrect if the data points are flat-out wrong. To mitigate incomplete data, I typically will look for…