
Scraped Admissions Data



Hello, some of you might be interested in this: I have scraped the admissions data for statistics PhDs/master's programs from the Grad Cafe. The CSV file can be downloaded here. The Python code used to generate it is here.

 

The code could easily be modified to get data for biostatistics, math, or any other subject too.

 

The most annoying part is that schools are entered under many different names. I have collected a list of synonyms, but if you run the code again in the future, new cases will no doubt arise, and you will have to add them to the list in the code.
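A minimal sketch of how such a synonym list might be applied, assuming it is stored as a plain dictionary. The names `SCHOOL_SYNONYMS` and `normalise_school` and the example entries are made up for illustration and are not taken from grad_cafe.py.

```python
import re

# Hypothetical synonym table: each key is a lower-cased spelling as typed by
# a user, each value is the canonical name to store instead.
SCHOOL_SYNONYMS = {
    "cmu": "Carnegie Mellon University",
    "carnegie mellon": "Carnegie Mellon University",
    "uc berkeley": "University of California, Berkeley",
    "berkeley": "University of California, Berkeley",
}


def normalise_school(raw_name, synonyms=SCHOOL_SYNONYMS):
    """Return (name, known) for a free-text school name."""
    # Collapse runs of whitespace and trim the ends.
    cleaned = re.sub(r"\s+", " ", raw_name).strip()
    key = cleaned.lower()
    if key in synonyms:
        return synonyms[key], True
    # Unknown spelling: keep the cleaned text and flag it so that it can be
    # reviewed and added to the table by hand.
    return cleaned, False
```

Collecting the entries that come back with `known` set to `False` after each run gives the list of new spellings that still need to be added.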

 

If you run the code, it will take about 3 minutes to finish. This is because I have added delays between the page requests sent to the Grad Cafe, so as not to annoy them.
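The delay idea could be sketched roughly as follows. This is an illustration only, not the code in grad_cafe.py: `fetch_pages`, `fetch_page`, and the 2-second default are assumptions, and `fetch_page` stands in for whatever HTTP call the scraper actually makes.

```python
import time


def fetch_pages(page_numbers, fetch_page, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each page in turn, pausing between consecutive requests.

    fetch_page is any callable that takes a page number and returns the page
    content; it is a placeholder for the real download function.
    """
    pages = []
    for index, number in enumerate(page_numbers):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        pages.append(fetch_page(number))
    return pages
```

With N pages the run takes at least `delay_seconds * (N - 1)` seconds on top of the download time, which is why the total adds up to minutes. Passing `sleep` as a parameter makes it possible to try the function without actually waiting.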

 

In the code I have added a couple of plotting functions for the most basic tasks.

apps_by_year.png

 

Things that could be added

  • I have only collected records with an accepted/rejected result; wait-listed and interview results could be added.
  • Data about schools from the usa today rankings or the NRC could be added.
  • Standardising GRE data: scores are reported on different scales (e.g. 160 vs. 900), and people also report percentages, especially for the subject GRE.
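As a first step towards standardising the GRE numbers, one could guess which scale each reported value is on. The sketch below relies on the published score ranges (revised General Test 130-170, older General Test 200-800 in 10-point steps, Subject Tests 200-990 in 10-point steps); the function name and the returned labels are hypothetical, and converting between scales would still need an official concordance table, which is not attempted here.

```python
def classify_gre_value(value):
    """Guess which scale a self-reported GRE number is on."""
    if 130 <= value <= 170:
        # Revised General Test section score.
        return "general_new"
    if 200 <= value <= 800 and value % 10 == 0:
        # Could be an old General Test section or a Subject Test score.
        return "general_old_or_subject"
    if 800 < value <= 990 and value % 10 == 0:
        # Only Subject Test scores go above 800.
        return "subject"
    if 0 <= value <= 100:
        # Too small to be a score on any scale, so treat it as a percentage.
        return "percentage"
    return "unknown"
```

Values that come back as `"unknown"` are probably typos and would need to be checked by hand.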

I hope those of you who are data-curious will find some interesting things out and share them with us.

Edited by fuzzylogician
links are fixed!

I cannot edit my post, but for some reason the links I posted lack the : in http://...

 

Perhaps a moderator can correct it.

 

Data is here and code is here.

 

Hopefully these links will work.

 

Just in case:

data: http://sourceforge.net/projects/triangleinequal/files/Grad%20Cafe%20Data/gc_data.csv/download

code: http://sourceforge.net/projects/triangleinequal/files/Grad%20Cafe%20Data/grad_cafe.py/download

Edited by persistent_homology

One thing to keep in mind when looking at these data is that the GC population is a highly biased sample of applicants. For instance, it is much more heavily domestic (i.e., U.S.-based) than the overall group applying to stat & biostat programs. This is one of the reasons (among many) why you're seeing admit rates in the 40-50% range from the results page, while most top programs report rates under 20%. 


That is a good point, cyberwolf. What I hope is that although the numbers are biased, we might still be able to see trends, such as: is it getting harder to be accepted?
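A rough sketch of how such a trend could be computed, assuming each record has been reduced to a (year, decision) pair. The decision strings `"Accepted"` and `"Rejected"` and the function name are assumptions, not the actual values or column names in gc_data.csv.

```python
from collections import defaultdict


def acceptance_rate_by_year(records):
    """Map each year to accepted / (accepted + rejected)."""
    accepted = defaultdict(int)
    total = defaultdict(int)
    for year, decision in records:
        # Ignore anything that is not a final accept/reject decision.
        if decision not in ("Accepted", "Rejected"):
            continue
        total[year] += 1
        if decision == "Accepted":
            accepted[year] += 1
    return {year: accepted[year] / total[year] for year in sorted(total)}
```

The absolute rates will be inflated by the reporting bias; the year-to-year direction is only informative if that bias stays roughly constant over time.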

 

Once you have a specific question in mind to answer, perhaps the bias can be compensated for to some extent by using official data from some schools to inform a Bayesian prior.
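One simple way to do this, as a sketch only, is a Beta prior centred on a school's officially reported admit rate, updated with the Grad Cafe counts as binomial data. The `prior_strength` parameter (how many pseudo-observations the official figure is worth) and the function name are assumptions chosen for illustration.

```python
def posterior_admit_rate(official_rate, prior_strength, accepted, rejected):
    """Posterior mean admit rate under a Beta prior and binomial data."""
    # Beta(a, b) prior whose mean equals the officially reported rate.
    a = official_rate * prior_strength
    b = (1.0 - official_rate) * prior_strength
    # Conjugate update: acceptances add to a, rejections add to b.
    return (a + accepted) / (a + b + accepted + rejected)
```

For example, `posterior_admit_rate(0.2, 100, 45, 55)` lands between the official 20% and the observed 45%. This only shrinks the estimate towards the official figure; it does not model why Grad Cafe posters differ from the full applicant pool.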

Edited by persistent_homology

Investigating the difference between international and American applicants (where I group 'U' together with 'I' for this purpose), I made the following plots:

status_effect.png

 

 

The first is broken down by year and the second looks at all applications together.

 

Also here is a plot showing some of the difference between GPA reporters and non-reporters:

GPA_reporting.png


Really glad someone has done this, thanks persistent_homology. Something that would be a moderate-to-severe PITA, but could yield interesting results, is to try to match up records from the same user (as I suggested) to examine which universities tend to accept the same sets of applicants.
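A sketch of what that matching might look like, assuming records from the same person can be recognised by a "fingerprint" built from their self-reported stats (for instance GPA, GRE scores, and status). That assumption is shaky, since different people can share the same stats and many posts omit them; the fingerprint and all names below are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_acceptance_counts(records):
    """Count how often each pair of schools accepted the same applicant.

    records: iterable of (fingerprint, school, decision) tuples, where
    fingerprint is a hashable key standing in for one applicant.
    """
    accepted_by = defaultdict(set)
    for fingerprint, school, decision in records:
        if decision == "Accepted":
            accepted_by[fingerprint].add(school)
    pairs = Counter()
    for schools in accepted_by.values():
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(schools), 2):
            pairs[pair] += 1
    return pairs
```

Calling `pairs.most_common(10)` on the result would list the ten school pairs that most often admitted the same (presumed) applicant.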


Ooh, interesting idea, wine in coffee cups...

 

For me the results were most interesting for getting an idea of the timeline (when I would hear from different schools). I might play around with it later this week and see how consistent the schools are across the years. I think it would be helpful to see the overall pattern though... some schools seem to do it all at once, some spread it out, etc.
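A sketch of how the per-school pattern could be summarised, assuming each record has been reduced to a (school, decision date) pair with the date already parsed into a `datetime.date`. The function name is made up, and grouping by calendar year is an assumption that fits decisions sent out in the spring.

```python
from collections import defaultdict
from datetime import date


def decision_windows(records):
    """Map (school, year) to (first_date, last_date, span_in_days)."""
    dates = defaultdict(list)
    for school, decision_date in records:
        dates[(school, decision_date.year)].append(decision_date)
    windows = {}
    for key, values in dates.items():
        first, last = min(values), max(values)
        # A span of 0 means every reported decision arrived on one day.
        windows[key] = (first, last, (last - first).days)
    return windows
```

Schools that decide all at once would show a span close to zero, while schools that spread decisions out would show spans of several weeks. Comparing the windows for the same school across years gives a rough idea of how consistent it is.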


Yeah, same here: I use the results page mainly to know when results might come out, based on previous years, and to see whether schools have started sending offers for the current year.

Oh, and I'm definitely upvoting the OP for taking the time to make this.

Edited by StatPhD2014

I agree it would be very interesting to tie the different submissions to individuals, but it would be a PITA.

 

I think that thegradcafe could implement a submission system more orientated around collecting data, by linking to a profile, and also by standardizing things like name of school.

 

Then they could offer useful summary statistics themselves.
