A Meta-index of Data Sets

I recently had to go hunting for some data to try some new ideas on. As handy as Google is, there’s still a fair bit of chaff from which to sort the wheat.

Fortunately, there is a lot of good stuff out there, including well-organised indexes of data sets for various purposes. For my future reference (and for anyone else who may be interested), here are some of the better data set lists I found.

  • UCI Repositories: No list of lists would be complete without this perennial collection of machine learning data sets hosted by the University of California, Irvine. They also have a repository of large data sets for knowledge discovery in databases (KDD).

  • The Info: This site “for people with large data sets” has a community editable list of data sets organised by topic. The collection here has a web/text focus.

  • Text Retrieval: This list kept by NIST has data sets for each of the various tracks at the Text Retrieval Conference, including data sets for spam detection, genomics, and a terabyte track (although the data sets aren’t quite up to a terabyte yet).

  • Time Series Data Library: This collection has a large number of time varying data sets from finance, demography, physics, sport and ecology.

  • DMOZ Directory of Data Sets: This is a good starting point for more lists of data sets for machine learning.

    Parts of DMOZ itself are available in RDF as a data set for researchers. There is also a processed version made available as part of the PASCAL Ontology Learning Challenge.

  • Royal Statistical Society: This collection contains data sets used in research published in the Journal of the Royal Statistical Society. This is an admirable idea that I wish more journals would take up.

As well as the institution- and community-organised lists above, I also came across some maintained by individuals.

A few specific data sets caught my eye, some new, and some I just hadn’t seen before.

  • Freebase Wikipedia Extraction: The Wikipedia WEX data set is essentially a large (57 GB) graph of articles from Wikipedia.

  • Enron Email: This collection of email (400 MB compressed) between Enron staff contains about half a million messages organised into folders. It was released publicly as part of the investigation into Enron and has been used by William Cohen and others as part of the CALO project.

  • Freeway Traffic Analysis: This fairly large data set records traffic flow on several lanes of the I-880 freeway in California. It was collected to study how effective roving tow trucks are at clearing traffic incidents and relieving the congestion they cause.

If all else fails and you still cannot find a suitable data set for your research, you can always invoke the social web and trawl through bookmarks on services like del.icio.us. The global data set tag occasionally throws up some interesting hits, but there might be a higher wheat-to-chaff ratio in particular users’ bookmarks, such as Peter Skomoroch’s. Mine are not nearly as comprehensive yet.

It would be interesting to do a meta-analysis of all these data sets to see how our ability as a discipline to deal with larger and more complex data sets has increased over time. As Daniel Lemire pointed out with some surprise recently, processing a terabyte of data isn’t that uncommon.

Comments (5)

  1. Daniel Lemire wrote:

    You forgot swivel. It has some pretty good data sets.

    Sunday, February 24, 2008 at 5:59 pm
  2. Vishal wrote:

    This is really precious. Thanks for posting this. I might actually need this.

    Tuesday, February 26, 2008 at 5:17 pm
  3. An addition: Many Eyes — a site where anyone can upload data sets. It also has visualizations people have done. http://www.many-eyes.com/ http://services.alphaworks.ibm.com/manyeyes/home

    Thursday, March 13, 2008 at 8:59 pm
  4. Rufus Pollock wrote:

    Have you seen http://www.ckan.net?

    CKAN is the Comprehensive Knowledge Archive Network, a registry of open knowledge packages and projects (and a few closed ones). CKAN is the place to search for open knowledge resources as well as register your own – be that a set of Shakespeare’s works, a global population density database, the voting records of MPs, or 30 years of US patents.

    Those familiar with freshmeat or CPAN can think of CKAN as providing an analogous service for open knowledge.

    Thursday, May 1, 2008 at 1:55 am
  5. Mark Reid wrote:

    Thanks everyone for the extra links.

    I’ve also recently discovered DataMob which appears to be a collection of datasets and interfaces.

    Wednesday, May 14, 2008 at 9:42 am

Trackbacks/Pingbacks (2)

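  1. [...] Previously, when I was at Stanford thinking about what kind of paper to write on machine learning, the conversation was always shaped by which data sets were available. We would take stock of the usable data that existed and then work out what we wanted to do with it. We spent an enormous amount of time discussing how to repurpose data that had been designed for some other purpose. I think the same thing happens in many fields that use data. [...]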

  2. [...] on the lookout for interesting data sets, I suggested that we apply some basic data analysis tools to the database to see what kind of [...]