Crowd Source Test Data for Atom Linguist


If you haven’t been following my blog, I’ve been publishing some articles on a project I’m calling Atom Linguist. Linguist is a Ruby library that GitHub created to detect the types of files in projects hosted on GitHub. I want to use Linguist to make file type detection in Atom better, so it needs to be converted to CoffeeScript first. I’m in the process of doing that.

As I mentioned in this blog post, I want to ensure that whatever I write actually improves things. So I need test data. I am hoping to crowdsource it with the help of the Atom community, starting with Discuss :grinning:

If you have any sample files you can contribute, I’d love it if you could submit pull requests! Just follow the convention mentioned in the Structure section of the README. Every little bit helps, and files that Atom does not currently detect properly will be a huge help.

Thanks in advance for everyone’s help! Crowdsourcing this test data will free me up to work on Atom Linguist and on the test harness for measuring accuracy sooner, and it will make the test data more representative of all of our needs.
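
For anyone wondering how the samples would be used, here is a rough Python sketch of an accuracy measurement. It is not the actual harness. It assumes a layout of one directory per language, named exactly after the language, with sample files directly inside. The function name `measure_accuracy` and the `detect` callback are hypothetical names made up for this illustration.

```python
import os


def measure_accuracy(samples_root, detect):
    """Score a detector against a tree laid out as <samples_root>/<Language>/<file>.

    `detect` is any callable that takes a file path and returns a language
    name or None. Both names are hypothetical, for illustration only.
    """
    total = 0
    correct = 0
    misses = []  # (path, expected language, detected language)
    for language in sorted(os.listdir(samples_root)):
        language_dir = os.path.join(samples_root, language)
        if not os.path.isdir(language_dir):
            continue
        for name in sorted(os.listdir(language_dir)):
            path = os.path.join(language_dir, name)
            if not os.path.isfile(path):
                continue
            total += 1
            guess = detect(path)
            # The directory name is the expected answer, compared case-sensitively.
            if guess == language:
                correct += 1
            else:
                misses.append((path, language, guess))
    accuracy = correct / total if total else 0.0
    return accuracy, misses
```

The list of misses is returned alongside the score so that the files a detector gets wrong can be inspected individually.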


So that’s detection not only on file extension or shebang line but on content as well?


The directory name must match identically and case-sensitively to the name of the language in the official language list.

Is there a link to this list?


Yes, earlier in that section where it says, “Atom Linguist languages file”.

Right now it is just file extension, full file name (for things like Rakefile) and shebang line. But the original Linguist library did some simple heuristics and Bayesian classification on file content in addition to the above. So, depending on accuracy, memory requirements and latency … I may skip the Bayesian portion. This is why I need the test data: so I can quantify the benefits and drawbacks of various implementations.
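
To make the three non-content checks concrete, here is a minimal Python sketch. It is not the Atom Linguist code (which is CoffeeScript). The lookup tables are tiny made-up stand-ins for the real languages file, the order in which the checks run is my assumption, and every name in it (`detect_language`, `interpreter_from_shebang`, the table names) is hypothetical.

```python
import os
import re

# Hypothetical, deliberately tiny tables for illustration only.
FILENAMES = {"Rakefile": "Ruby", "Makefile": "Makefile"}
EXTENSIONS = {".rb": "Ruby", ".py": "Python", ".coffee": "CoffeeScript"}
INTERPRETERS = {
    "ruby": "Ruby",
    "python": "Python",
    "python3": "Python",
    "node": "JavaScript",
    "sh": "Shell",
    "bash": "Shell",
}

# Captures the program after "#!" and, optionally, its first argument.
SHEBANG = re.compile(r"^#!\s*(\S+)(?:\s+(\S+))?")


def interpreter_from_shebang(first_line):
    """Return the interpreter named on a shebang line, or None."""
    match = SHEBANG.match(first_line)
    if not match:
        return None
    program = os.path.basename(match.group(1))
    # "#!/usr/bin/env python3" names the interpreter as env's argument.
    if program == "env" and match.group(2):
        program = os.path.basename(match.group(2))
    return program


def detect_language(path, first_line=""):
    """Try full file name, then extension, then shebang (assumed order)."""
    name = os.path.basename(path)
    if name in FILENAMES:
        return FILENAMES[name]
    extension = os.path.splitext(name)[1].lower()
    if extension in EXTENSIONS:
        return EXTENSIONS[extension]
    return INTERPRETERS.get(interpreter_from_shebang(first_line))
```

For example, `detect_language("bin/run", "#!/usr/bin/env python3")` falls through the first two checks and is resolved by the shebang line. A file with no known name, extension or interpreter returns `None`, which is where content-based heuristics or a classifier would take over.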