Top Machine Learning Datasets

A dataset is a collection of data organised in a specific format, such as text, video, or audio. Choosing a dataset is the very first step in any machine learning project, because the data it contains is fed to ML models to solve different types of problems.

Some of these problems are listed below:

Weather forecasting
Bird sound recognition
Face detection
Speech recognition
Age prediction
Stock analysis

Machine Learning Datasets

Below is a list of datasets that could help you get started with your machine learning project.

1. Internet Usage Dataset

Description: The dataset contains general statistical data about internet consumers. The data is in text format and includes 10,104 instances. It is mainly used for classification or clustering. It was created by D. Cook in 1997.

Dataset Link: Internet Usage Dataset

2. SMS Spam Collection Dataset

Description: The dataset contains SMS messages, labelled as spam or legitimate, gathered in 2011. The data is in text format, with 5,574 instances in total. The dataset applies to classification problems. It was created by T. Almeida et al.

Dataset Link: SMS Spam Collection Dataset
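
As a rough illustration of how a labelled text dataset like this one might be prepared for classification, the sketch below parses messages and counts the classes. It assumes each line holds a label (ham or spam), a tab character, and then the message text. The sample lines are made up for illustration and are not taken from the dataset, and parse_sms_lines is a hypothetical helper name.

```python
from collections import Counter

# Made-up sample lines in the assumed "label<TAB>message" layout.
SAMPLE_LINES = [
    "ham\tAre we still meeting for lunch today?",
    "spam\tYou have won a prize! Reply now to claim.",
    "ham\tCall me when you get home.",
]


def parse_sms_lines(lines):
    """Split each line into a (label, message) pair, skipping blank lines."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split only on the first tab so tabs inside the message are kept.
        label, message = line.split("\t", 1)
        pairs.append((label, message))
    return pairs


pairs = parse_sms_lines(SAMPLE_LINES)
label_counts = Counter(label for label, _ in pairs)
print(label_counts)
```

Checking the class balance this way is a common first step, since spam datasets usually contain far more legitimate messages than spam.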

3. Twitter100k

Description: This dataset was created by Y. Hu et al. in 2017. It includes images and tweets, totalling 100,000 instances. The data format consists primarily of pictures and text, which can be used for cross-media retrieval tasks.

Dataset Link: Twitter100K Dataset

4. Sentiment140

Description: The dataset contains tweets collected in 2009. It includes the tweet text, the users, the sentiment labels, and timestamps. The dataset comprises 1,578,627 instances, stored as tweet text in comma-separated values (CSV) format. It was developed by A. Go et al. in 2009 for sentiment analysis.

Dataset Link: Sentiment140 Dataset
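
To give a feel for working with CSV tweet data of this kind, here is a minimal sketch that reads rows and maps the numeric polarity to a readable label. The column order (polarity, tweet id, date, query, user, text) and the label encoding (0 for negative, 4 for positive) are assumptions to be checked against the downloaded file. The two rows are invented for illustration, and load_tweets is a hypothetical helper name.

```python
import csv
import io

# Made-up rows in the assumed column order:
# polarity, tweet id, date, query, user, text
SAMPLE_CSV = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","my phone broke, so sad"\n'
    '"4","2","Mon Apr 06 22:20:10 PDT 2009","NO_QUERY","user_b","loving the sunny weather, really"\n'
)

# Assumed label encoding: 0 = negative, 4 = positive.
POLARITY_NAMES = {"0": "negative", "4": "positive"}


def load_tweets(text):
    """Return a list of (sentiment, tweet_text) pairs from CSV text."""
    rows = csv.reader(io.StringIO(text))
    return [(POLARITY_NAMES.get(row[0], "unknown"), row[5]) for row in rows]


tweets = load_tweets(SAMPLE_CSV)
for sentiment, tweet_text in tweets:
    print(sentiment, "->", tweet_text)
```

Using the csv module rather than splitting on commas matters here, because tweet text often contains commas inside the quoted field.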

5. Web of Science dataset

Description: The dataset was created by K. Kowsari et al. in 2017. It is primarily used for hierarchical text classification. It contains 46,985 instances of text data, suitable for classification and categorisation.

Dataset Link: Web of Science Dataset

6. Social Structure of Facebook Networks

Description: This dataset is large, as it captures the Facebook social structure of 100 colleges. It was developed by A. Traud et al. in 2012 and is used for tasks such as network analysis and clustering. The data in this dataset is in text format.

Dataset Link: Social Structure of Facebook Networks Dataset

7. Tic-Tac-Toe Endgame Dataset

Description: The dataset consists of 958 instances and was created in 1991 by D. Aha. Its main goal is binary classification: deciding whether an endgame board configuration is a winning one. The dataset contains text data and is suitable for classification problems.

Dataset Link: Tic-Tac-Toe Endgame Dataset
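
The binary label in a dataset like this can be reproduced with a simple rule, which makes it a handy sanity check for a trained classifier. The sketch below assumes a board is stored as nine cells read row by row, each being "x", "o", or "b" for blank, and that the positive class means player x has three in a row. Both the encoding and the x_wins helper are assumptions for illustration.

```python
# The eight index triples that form a line on a 3x3 board stored row by row.
WINNING_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def x_wins(board):
    """Return True if player "x" holds all three cells of any line."""
    return any(all(board[i] == "x" for i in line) for line in WINNING_LINES)


# Two hypothetical endgame boards ("b" marks a blank cell).
board_win = ["x", "x", "x", "o", "o", "b", "b", "b", "b"]
board_draw = ["x", "o", "x", "x", "o", "o", "o", "x", "x"]

print(x_wins(board_win))   # True
print(x_wins(board_draw))  # False
```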

8. Online Retail Dataset

Description: This dataset was created in 2015 by D. Chen and records the transactions of a UK-based online retailer. It has 541,909 instances in total, in text format. The data applies to classification and clustering problems.

Dataset Link: Online Retail Dataset

9. Farm Ads Dataset

Description: This dataset was created using twelve different websites. The data was gathered from text advertisements covering various aspects of farm animals. The dataset has binary target values (labels) indicating whether the content owner approves of the ads. The whole dataset is in text format, with 4,143 instances. It was contributed by C. Mesterharm et al. in 2011.

Dataset Link: Farm Ads Dataset

10. Ozone Level Detection Dataset

Description: This dataset is a collection of two ground ozone level datasets created in 2008 by K. Zhang et al. It consists of text data with 2,536 instances and applies to classification problems.

Dataset Link: Ozone Level Detection Dataset

11. URL Dataset

Description: This dataset contains features from 120 days of URL data taken from a large conference. It has 2,396,130 instances in total, in text format. The default application area is classification. It was created by J. Ma in 2009.

Dataset Link: URL Dataset

12. University Dataset

Description: This dataset is provided in its original, LISP-readable form. It has 285 instances in text format. The associated task is classification. It was created by M. Lebowitz in 1988.

Dataset Link: University Dataset

13. Open University Learning Analytics Dataset

Description: This dataset was created by J. Kuzilek et al. in 2015. The dataset is about students involved in virtual environment-based learning. It contains data on them and their interactions. It has approximately 30,000 instances of data in text format. The tasks in this dataset are mostly classification, regression, and clustering.

Dataset Link: Open University Learning Analytics Dataset

14. Bank Marketing Dataset

Description: This dataset contains data gathered from a large bank’s marketing campaign. The default goal of this dataset is classification. It has data in text format and has 45,211 instances. It was created by S. Moro et al.

Dataset Link: Bank Marketing Dataset

15. Cloud Dataset

Description: This dataset was created by P. Collard in 1989 to perform classification and clustering tasks. It is a collection of data gathered from around 1024 different clouds. The data is in text format with 1024 instances.

Dataset Link: Cloud Dataset

16. Bike Sharing Dataset

Description: This dataset was created in 2013 by H. Fanaee-T. It contains data about rental bikes used on an hourly and a daily basis in a large city. It has 17,389 instances in total, with all data in text format. The default application of this dataset is regression.

Dataset Link: Bike Sharing Dataset

17. Chess (King-Rook King-Pawn) Dataset

Description: This dataset contains chess endgame positions of King+Rook vs. King+Pawn on a7. It has 3,196 instances in text format. It was developed by R. Holte in 1989 and is used for classification tasks.

Dataset Link: Chess (King-Rook King-Pawn) Dataset

18. Rice Leaf Diseases Dataset

Description: This dataset was created by Jitesh Shah et al. in 2019 to perform classification tasks. It contains 120 instances, and the data is in image format. These images of diseased leaves were taken in sunlight, with a white background.

Dataset Link: Rice Leaf Disease Dataset

19. TV Human Interaction Dataset

Description: This dataset consists of video clips from 20 different television shows. It has 6,766 instances and is in video format. It was created by A. Patron-Perez et al. in 2013 to predict human actions (handshakes, high-fives, kisses, hugs, and none) in society.

Dataset Link: TV Human Interaction Dataset

20. Free Music Archive Dataset

Description: This dataset contains raw audio and features. It includes songs from different genres, albums, artists, etc., and was created to perform classification and recommendation tasks. The data format is text and MP3, with 106,574 instances. It was created by M. Defferrard et al. in 2017.

Dataset Link: Free Music Archive Dataset

Conclusion

A lack of high-quality datasets is one of many reasons ML-based projects may fail to deliver effective results. This makes choosing an appropriate dataset the most important initial step before you start your project. Selecting a suitable dataset helps you successfully create and implement an ML algorithm for your project.

To help you begin your ML project, this article outlines the basics you need to know about these ML datasets, along with links where you can find them.