
This project identifies companies that will be considered successful based on information sourced from Crunchbase. The results are recommended for use by VC firms as a filter for potential investments.


mwirtz946/startup_success_classifier

 
 


Predicting Startup Acquisition

startup.jpeg

Authors: Brendan Ferris, Michael Wirtz

Overview

This project analyzes the needs of Butterfly Ventures, a micro venture capital fund seeking a machine learning model that will precisely classify companies that will be acquired. To model this problem, we collected a dataset of startups that each fell into one of three categories: closed, operating, or acquired. To minimize false positives (companies predicted to be acquired that are not), we chose precision as our target metric. Our baseline Logistic Regression model had a precision score ranging from 13% to 30%, exhibiting low predictive power. Our final and best model was a Random Forest with a precision score of 28%.

Business Problem

Butterfly Ventures is a small VC fund that is low on capital. Because of its limited funds, it is looking for a way to better filter companies in the hope of making the most of its investments. The fund is aware of the following statistics: 75% of venture-backed startups fail; under 50% of businesses make it to their fifth year; 33% of startups make it to the 10-year mark; and only 40% of startups actually turn a profit. Given this knowledge, Butterfly Ventures is targeting startups that it believes have the best chance of being acquired, which it views as a sure-fire route to investment profits. For this purpose, it has hired a group of data scientists to create a model that classifies whether or not a startup will be acquired.

Data

category_frequency.jpeg

To help Butterfly Ventures, we trained our model on a Kaggle dataset with information on 54,000 companies sourced from Crunchbase. The dataset originally had three classes: "closed," "operating," and "acquired." We grouped operating and closed into one category (not acquired) in order to predict the "acquired" class. The feature definitions for the dataset can be found below or on the Crunchbase website.

Original Features
  • funding_total_usd: total amount of money raised by the company.
  • seed: Seed rounds are among the first rounds of funding a company will receive, generally while the company is young and working to gain traction. Round sizes range between $10k–$2M, though larger seed rounds have become more common in recent years. A seed round typically comes after an angel round (if applicable) and before a company’s Series A round.
  • venture: money raised through venture capital.
  • equity_crowdfunding: Equity crowdfunding platforms allow individual users to invest in companies in exchange for equity. Typically on these platforms the investors invest small amounts of money, though syndicates are formed to allow an individual to take a lead on evaluating an investment and pooling funding from a group of individual investors.
  • undisclosed: money raised through undisclosed means.
  • convertible_note: A convertible note is an ‘in-between’ round funding to help companies hold over until they want to raise their next round of funding. When they raise the next round, this note ‘converts’ with a discount at the price of the new round. You will typically see convertible notes after a company raises, for example, a Series A round but does not yet want to raise a Series B round.
  • debt_financing: In a debt round, an investor lends money to a company, and the company promises to repay the debt with added interest.
  • angel: An angel round is typically a small round designed to get a new company off the ground. Investors in an angel round include individual angel investors, angel investor groups, friends, and family.
  • grant: A grant is when a company, investor, or government agency provides capital to a company without taking an equity stake in the company.
  • private_equity: A private equity round is led by a private equity firm or a hedge fund and is a late stage round. It is a less risky investment because the company is more firmly established, and the rounds are typically upwards of $50M.
  • round_A: a funding round for earlier-stage companies, ranging on average between $1M and $30M. The value of the company is usually determined during the initial round of funding.
  • round_B: a funding round for earlier-stage companies, ranging on average between $1M and $30M.
  • product_crowdfunding: In a product crowdfunding round, a company will provide its product, which is often still in development, in exchange for capital. This kind of round is also typically completed on a funding platform.
Engineered Features
  • days_from_founding_to_funding: the number of days between the company's founding and the date it first secured funding.
  • time_between_first_and_last_funding: the number of days between the first and the last time the company received funding.
  • month_X: the month that the company was founded.
  • founded_quarter_X: the quarter (Q2, Q3, Q4) in which the company was founded. Q1 was omitted because most companies are founded in January for tax purposes.
  • state_code_X: the state the company was founded in.
  • founded_year_X: the year the company was founded.
  • url_ending_X: the domain name ending of the company website (.com, .org, etc.)
  • country_code_USA: if the company was founded in the USA.
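The two duration features above could be derived roughly as follows. This is a minimal sketch on toy rows, not the notebook's actual code: the date column names (`founded_at`, `first_funding_at`, `last_funding_at`) are assumptions that this README does not confirm.

```python
import pandas as pd

# Toy rows; the three date column names are hypothetical.
df = pd.DataFrame({
    "founded_at": ["2010-01-15", "2012-06-01"],
    "first_funding_at": ["2010-03-16", "2013-06-01"],
    "last_funding_at": ["2011-03-16", "2013-06-01"],
})

# Parse the date strings so they can be subtracted.
date_cols = ["founded_at", "first_funding_at", "last_funding_at"]
for col in date_cols:
    df[col] = pd.to_datetime(df[col])

# Days from founding to the first funding event.
df["days_from_founding_to_funding"] = (
    df["first_funding_at"] - df["founded_at"]
).dt.days

# Days between the first and last funding events.
df["time_between_first_and_last_funding"] = (
    df["last_funding_at"] - df["first_funding_at"]
).dt.days

# Calendar quarter of founding (1 to 4), before any dummy encoding.
df["founded_quarter"] = df["founded_at"].dt.quarter
```

Real data would likely contain missing or malformed dates, which this sketch does not handle.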
Target
  • acquired (1): companies that were acquired.
  • not acquired (0): companies that have either closed or are still operating.
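Collapsing the three original statuses into this binary target might look like the sketch below. The column name `status` is an assumption; only the three status values come from the dataset description above.

```python
import pandas as pd

# Toy status values; "status" as the column name is hypothetical.
statuses = pd.Series(
    ["acquired", "operating", "closed", "acquired"], name="status"
)

# 1 for acquired companies, 0 for both operating and closed companies.
acquired = (statuses == "acquired").astype(int)
```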

Because certain columns were overly predictive of the target, they were dropped from the models. Those columns are as follows:

Removed From Original Dataset
  1. post_ipo_equity: A post-IPO equity round takes place when firms invest in a company after the company has already gone public.
  2. post_ipo_debt: A post-IPO debt round takes place when firms loan a company money after the company has already gone public. Similar to debt financing, a company will promise to repay the principal as well as added interest on the debt.
  3. round_C through round_F: funding rounds for companies that move forward to more advanced stages of funding.
  4. funding_rounds: the number of funding rounds a company participated in.

Methods

category_frequency.jpeg

Overall, this project uses the dataset described above to maximize the precision of our models.

To get the most out of our features, we one-hot encoded ("dummied") all of the categorical columns. We presumed that the category list column would be the most beneficial to our model, since it classifies each startup into a specific business-type category.
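As an illustration of dummying, the sketch below one-hot encodes two categorical columns with `pandas.get_dummies`. The toy values and the base column names `state_code` and `url_ending` are assumptions inferred from the engineered feature names listed earlier, not taken from the notebooks.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column.
df = pd.DataFrame({
    "state_code": ["CA", "NY", "CA"],
    "url_ending": ["com", "org", "com"],
    "seed": [100000.0, 0.0, 250000.0],
})

# Each category value becomes its own 0/1 column, e.g. state_code_CA.
dummied = pd.get_dummies(df, columns=["state_code", "url_ending"], dtype=int)
```

Passing `drop_first=True` would omit one level per column, which is one common way to avoid perfectly collinear dummies in a linear model.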

Because of the high class imbalance, we used a mixture of upsampling and downsampling techniques to balance the acquired (1) and not acquired (0) classes.
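One way to combine the two techniques is sketched below with `sklearn.utils.resample`. The class sizes, the target size of 40 rows per class, and the toy `venture` values are all hypothetical; the README does not state the ratios actually used.

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced data: 10 acquired rows and 90 not-acquired rows.
df = pd.DataFrame({"venture": range(100), "acquired": [1] * 10 + [0] * 90})

minority = df[df["acquired"] == 1]
majority = df[df["acquired"] == 0]

target_size = 40  # hypothetical size for each class after resampling

# Upsample the minority class by sampling with replacement.
minority_up = resample(
    minority, replace=True, n_samples=target_size, random_state=42
)
# Downsample the majority class by sampling without replacement.
majority_down = resample(
    majority, replace=False, n_samples=target_size, random_state=42
)

# Recombine and shuffle the balanced rows.
balanced = (
    pd.concat([minority_up, majority_down])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
```

Resampling is normally applied to the training split only, so that the test set keeps the original class distribution.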

For our logistic regression models, continuous variables with large values caused problems with training time and model performance, so we standardized the continuous features.
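Standardization can be sketched with scikit-learn's `StandardScaler`, as below. The numbers are made-up funding amounts, and fitting the scaler on the training split alone is a general practice assumed here rather than something this README states.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up funding amounts in USD; two continuous features per row.
X_train = np.array([[1e4, 0.0], [2e6, 5e5], [5e7, 1e6]])
X_test = np.array([[1e6, 2e5]])

scaler = StandardScaler()
# Learn the per-column mean and standard deviation from training data only.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training statistics so no test information leaks into the scaler.
X_test_scaled = scaler.transform(X_test)
```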

We ran multiple iterations of both logistic regression and random forest models in order to maximize the precision score.
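A comparison of the two model families on precision might be set up as in the sketch below. It runs on synthetic data from `make_classification`, not the Crunchbase dataset, and the hyperparameters are illustrative defaults rather than the settings used in the modeling notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (about 10% positives).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# Illustrative settings only.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    # zero_division=0 avoids a warning if a model predicts no positives.
    scores[name] = precision_score(y_test, preds, zero_division=0)
```

The scores on synthetic data say nothing about the project's results; the sketch only shows the shape of the comparison.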

Results

Our random forest model yielded the best precision, at 28%. Although logistic regression models initially produced high precision scores, the values fluctuated between iterations, indicating low predictive power.
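Precision is the share of companies flagged as acquired that really were acquired: TP / (TP + FP). The toy labels below are invented purely to show the calculation and are unrelated to the project's data.

```python
from sklearn.metrics import confusion_matrix, precision_score

# Invented labels: 1 = acquired, 0 = not acquired.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() returns counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision computed by hand and via scikit-learn should agree.
manual_precision = tp / (tp + fp)
library_precision = precision_score(y_true, y_pred)
```

Here two of the four flagged companies were truly acquired, so precision is 0.5. In the same terms, a precision of 28% means that roughly 28 of every 100 companies the model flags would actually be acquired.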

Conclusion

The conclusions that can be drawn given our results include:

  • Predicting whether a company would be acquired is a complex problem, and expanded data collection would greatly benefit the precision of the model.
  • A combination of upsampling and downsampling drastically reduced false positives while preserving precision.
  • Random Forest models yielded the highest precision figures.

Next Steps

  • If Butterfly Ventures does not have the resources to collect more data, it may be beneficial to pivot into developing a more interpretable model, then drawing insights from it to guide investment decisions.
  • Scrape data on startup management to get an indication of how it can affect acquisition.
  • Expand data collection more broadly: because predicting acquisition is a complex problem, additional data would greatly benefit the precision of the model.

For More Information

See the full modeling process in the modeling notebook or review this presentation.

For additional info, contact Brendan Ferris or Michael Wirtz at [email protected] and [email protected], respectively.

Repository Structure

├── README.md
├── archive
│   ├── EDA_notebook.ipynb
│   ├── cleaning_notebook.ipynb
│   └── modeling_notebook.ipynb
├── data
│   ├── cleaned_investments_VC.csv
│   └── investments_VC.csv
├── images
│   ├── cat_frequency_graph.png
│   ├── class_imbalance_graph.png
│   ├── external-content.duckduckgo.com.jpg
│   ├── funding.png
│   ├── startup_acquisitions_blue.jpeg
│   └── startup_acquisitions_red.jpeg
└── slide_deck.pdf
