Bored of Kaggle and FiveThirtyEight? Here are the alternative strategies I use for getting high-quality and unique datasets

The key to a great data science project is a great dataset, but finding great data is much easier said than done.
I remember back when I was studying for my master’s in Data Science, a little over a year ago. Throughout the course, I found that coming up with project ideas was the easy part; it was finding good datasets that I struggled with the most. I would spend hours scouring the internet, pulling my hair out trying to find juicy data sources and getting nowhere.
Since then, I’ve come a long way in my approach, and in this article I want to share with you the 5 strategies that I use to find datasets. If you’re bored of standard sources like Kaggle and FiveThirtyEight, these strategies will help you get data that are unique and much more tailored to the specific use cases you have in mind.
Yep, believe it or not, this is actually a legit strategy. It’s even got a fancy technical name (“synthetic data generation”).
If you’re trying out a new idea or have very specific data requirements, making synthetic data is a fantastic way to get original and tailored datasets.
For instance, let’s say that you’re trying to build a churn prediction model, a model that can predict how likely a customer is to leave a company. Churn is a fairly common “operational problem” faced by many companies, and tackling a problem like this is a great way to show recruiters that you can use ML to solve commercially relevant problems, as I’ve argued previously:
However, if you search online for “churn datasets,” you’ll find that there are (at the time of writing) only two main datasets readily available to the public: the Bank Customer Churn Dataset and the Telecom Churn Dataset. These datasets are a fantastic place to start, but might not reflect the kind of data required for modelling churn in other industries.
Instead, you can try creating synthetic data that’s more tailored to your requirements.
If this sounds too good to be true, here’s an example dataset which I created with just a short prompt to that old chestnut, ChatGPT:
Of course, ChatGPT is limited in the speed and size of the datasets it can create, so if you want to scale up this approach I’d recommend using either the Python library faker or scikit-learn’s sklearn.datasets.make_classification and sklearn.datasets.make_regression functions. These tools are a fantastic way to programmatically generate huge datasets in the blink of an eye, and ideal for building proof-of-concept models without having to spend ages searching for the perfect dataset.
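To give a flavour of what this looks like, here’s a minimal sketch combining faker (for customer attributes) with make_classification (for features and a churn label). The column names and class balance are my own illustrative assumptions, not taken from any particular dataset:

```python
import pandas as pd
from faker import Faker
from sklearn.datasets import make_classification

fake = Faker()
n_customers = 1_000

# Generate numeric features plus a binary "churned" label (~20% churners by assumption)
X, y = make_classification(
    n_samples=n_customers,
    n_features=4,
    n_informative=3,
    n_redundant=1,
    weights=[0.8, 0.2],
    random_state=42,
)

df = pd.DataFrame(
    X, columns=["tenure_score", "usage_score", "support_calls_score", "spend_score"]
)
# Add some fake (but realistic-looking) customer attributes
df["customer_name"] = [fake.name() for _ in range(n_customers)]
df["signup_date"] = [fake.date_between(start_date="-3y", end_date="today") for _ in range(n_customers)]
df["churned"] = y

print(df.head())
```

From here you can treat the DataFrame exactly as you would a real churn dataset: split it, train a model, and iterate on the proof of concept.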
In practice, I have rarely needed to use synthetic data creation techniques to generate entire datasets (and, as I’ll explain later, you’d be wise to exercise caution if you intend to do this). Instead, I find it is a really neat technique for generating adversarial examples or adding noise to your datasets, enabling me to test my models’ weaknesses and build more robust versions. But, regardless of how you use this technique, it’s an incredibly useful tool to have at your disposal.
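As a rough illustration of the noise idea (a minimal sketch of my own, not a recipe from any specific project), you can perturb a held-out test set and watch how quickly a trained model’s accuracy degrades:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Build a simple model on clean synthetic data
X, y = make_classification(n_samples=2_000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("Clean accuracy:", model.score(X_test, y_test))

# Stress-test it by adding Gaussian noise of increasing magnitude to the test features
rng = np.random.default_rng(0)
for noise_level in [0.1, 0.5, 1.0]:
    X_noisy = X_test + rng.normal(scale=noise_level, size=X_test.shape)
    print(f"Accuracy with noise scale {noise_level}:", model.score(X_noisy, y_test))
```

If accuracy collapses at small noise levels, that is a useful signal that the model (or its features) needs to be made more robust.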
Creating synthetic data is a nice workaround for situations when you can’t find the kind of data you’re looking for, but the obvious problem is that you’ve got no guarantee that the data are good representations of real-life populations.
If you want to guarantee that your data are realistic, the best way to do that is, surprise surprise…
… to actually go and find some real data.
One way of doing this is to reach out to companies that might hold such data and ask if they’d be interested in sharing some with you. At the risk of stating the obvious, no company is going to give you data that are highly sensitive or if you are planning to use them for commercial or unethical purposes. That would just be plain silly.
However, if you intend to use the data for research (e.g., for a university project), you may well find that companies are open to providing data if it’s in the context of a quid pro quo joint research agreement.
What do I mean by this? It’s actually pretty simple: an arrangement whereby they provide you with some (anonymised/de-sensitised) data and you use the data to conduct research which is of some benefit to them. For instance, if you’re interested in studying churn modelling, you could put together a proposal for comparing different churn prediction techniques. Then, share the proposal with some companies and ask whether there’s potential to work together. If you’re persistent and cast a wide net, you’ll likely find a company that’s willing to provide data for your project as long as you share your findings with them so that they can get a benefit out of the research.
If that sounds too good to be true, you might be surprised to hear that this is exactly what I did during my master’s degree. I reached out to a couple of companies with a proposal for how I could use their data for research that would benefit them, signed some paperwork to confirm that I wouldn’t use the data for any other purpose, and conducted a really fun project using some real-world data. It really can be done.
The other thing I particularly like about this strategy is that it provides a way to exercise and develop quite a broad set of skills which are essential in Data Science. You have to communicate well, show commercial awareness, and become a pro at managing stakeholder expectations, all of which are essential skills in the day-to-day life of a Data Scientist.
Lots of datasets used in academic studies aren’t published on platforms like Kaggle, but are still publicly available for use by other researchers.
One of the best ways to find datasets like these is by looking in the repositories associated with academic journal articles. Why? Because lots of journals require their contributors to make the underlying data publicly available. For instance, two of the data sources I used during my master’s degree (the Fragile Families dataset and the Hate Speech Data website) weren’t available on Kaggle; I found them through academic papers and their associated code repositories.
How can you find these repositories? It’s actually surprisingly easy: I start by opening up paperswithcode.com, search for papers in the area I’m interested in, and look at the available datasets until I find something that looks interesting. In my experience, this is a really neat way to find datasets which haven’t been done to death by the masses on Kaggle.
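If you prefer to script that browsing step, Papers with Code also exposes a public REST API. The sketch below assumes the /api/v1/datasets/ endpoint accepts a q search parameter and returns paginated JSON with a "results" list; that is my assumption rather than part of the workflow described above, so check the current API docs before relying on it:

```python
import requests

# Assumption: the Papers with Code public API supports dataset search via
# GET https://paperswithcode.com/api/v1/datasets/?q=<term>
response = requests.get(
    "https://paperswithcode.com/api/v1/datasets/",
    params={"q": "churn"},
    timeout=10,
)
response.raise_for_status()

# Print the name and URL of each matching dataset
for dataset in response.json().get("results", []):
    print(dataset.get("name"), "-", dataset.get("url"))
```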
Honestly, I have no idea why more people don’t make use of BigQuery Public Datasets. There are literally hundreds of datasets covering everything from Google Search Trends to London Bicycle Hires to Genomic Sequencing of Cannabis.
One of the things I especially like about this source is that lots of these datasets are incredibly commercially relevant. You can kiss goodbye to niche academic topics like flower classification and digit prediction; in BigQuery, there are datasets on real-world business issues like ad performance, website visits and economic forecasts.
Lots of people shy away from these datasets because they require SQL skills to load them. But even if you don’t know SQL and only know a language like Python or R, I’d still encourage you to take an hour or two to learn some basic SQL and then start querying these datasets. It doesn’t take long to get up and running, and this truly is a treasure trove of high-value data assets.
To use the datasets in BigQuery Public Datasets, you can sign up for a completely free account and create a sandbox project by following the instructions here. You don’t need to enter your credit card details or anything like that; just your name, your email, a bit of information about the project, and you’re good to go. If you need more computing power at a later date, you can upgrade the project to a paid one and access GCP’s compute resources and advanced BigQuery features, but I’ve personally never needed to do this and have found the sandbox to be more than adequate.
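To show how little SQL you need to get going, here’s a minimal sketch using the google-cloud-bigquery Python client against the public London Bicycle Hires dataset. The project ID is a placeholder for your own sandbox project, and it assumes you have already authenticated (e.g. with gcloud application-default credentials):

```python
from google.cloud import bigquery

# Replace with your own sandbox project ID
client = bigquery.Client(project="your-sandbox-project-id")

# Top 10 start stations by number of hires in the public London bicycles dataset
query = """
    SELECT
        start_station_name,
        COUNT(*) AS num_hires
    FROM `bigquery-public-data.london_bicycles.cycle_hire`
    GROUP BY start_station_name
    ORDER BY num_hires DESC
    LIMIT 10
"""

# Run the query and pull the results into a pandas DataFrame
df = client.query(query).to_dataframe()
print(df)
```

Once the results are in a DataFrame, you can carry on in Python or R exactly as you would with any other dataset.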
My final tip is to try using a dataset search engine. These are incredibly useful tools which have only emerged in the last few years, and they make it very easy to quickly see what’s out there. Three of my favourites are:
In my experience, searching with these tools can be a much more effective strategy than using generic search engines, as you’re often provided with metadata about the datasets and you have the ability to rank them by how often they’ve been used and by publication date. Quite a nifty approach, if you ask me.