
brewdata: Extracting Usable Data from the Grad Cafe Results Search



Hi All, 

 

I've seen some really nice scripts that scrape the Grad Cafe, but none had all the features I wanted. 

 

I wrote some functions of my own and put them into an R package called brewdata. If you're also interested in using R to parse Results Search data, you can find brewdata on CRAN (http://cran.r-project.org/web/packages/brewdata/).

 

Please email or PM me with any suggestions or bugs you find. I'd welcome the chance to work with anyone interested in making their own improvements.

 

Thanks!

NW


Awesome! My main suggestion is to have the data frame returned by brewdata() contain the original program name as a column. Setting map=TRUE lets you get the school name, but I think it makes sense to also return the program name. That way users can remove false positives, e.g. exclude programs like "Educational Psychology - Learning Sciences (Research, Measurement, And Statistics)" from statistics-related results. This seems really important for disciplines like math, where searching for "math*" gets you both pure and applied programs, which are impossible to disentangle without the program name.
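To make the false-positive point concrete, here is a rough sketch in Python (not R, and not brewdata's actual output). It assumes each result carries a hypothetical `program` field holding the original program name; the rows, the "Example University" name, and the exclusion pattern are all made up for illustration.

```python
import re

# Hypothetical rows: "program" stands in for the original program name
# that this post suggests returning. This is not brewdata's real format.
results = [
    {"school": "Duke University", "program": "Statistical Science"},
    {"school": "Example University",
     "program": "Educational Psychology - Learning Sciences "
                "(Research, Measurement, And Statistics)"},
    {"school": "Example University", "program": "Statistics"},
]

# Assumed exclusion pattern; a real analysis would tune this by hand.
EXCLUDE = re.compile(r"psychology|education", re.IGNORECASE)

def drop_false_positives(rows, exclude=EXCLUDE):
    """Keep only rows whose program name does not match the exclusion pattern."""
    return [row for row in rows if not exclude.search(row["program"])]

kept = drop_false_positives(results)
for row in kept:
    print(row["school"], "|", row["program"])
```

Without the program name column there is nothing to filter on, which is why the school name alone is not enough.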

 

I also suggest changing the default query to "(stat|stats|statis*)". You actually miss out on a decent number of Duke results because their program is formally called "Statistical Science", for example.
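A quick way to see the difference is to test both patterns against a few program names. This is only a sketch in Python: it reads the suggested query as a Python regular expression, which may not be how the Results Search or brewdata interprets `*`, and it assumes for comparison a query that looks for the literal word "statistics". Apart from "Statistical Science", the program names are illustrative.

```python
import re

# Illustrative program names; only "Statistical Science" comes from the thread.
programs = ["Statistics", "Statistical Science", "Applied Mathematics"]

def matching(pattern, names):
    """Return the names in which the pattern is found, ignoring case."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [name for name in names if rx.search(name)]

# Assumed narrow query: the literal word "statistics".
literal = matching(r"statistics", programs)

# The suggested query, read as a Python regular expression.
suggested = matching(r"(stat|stats|statis*)", programs)

print(literal)
print(suggested)
```

Under this reading the literal query misses "Statistical Science" while the suggested one catches it. Read as a Python regular expression, the suggested pattern finds any name containing "stat", since the first alternative already matches on its own.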


Thanks cyberwulf & wine in coffee cups!

 

@cyberwulf: I've never used Shiny, but I know some swear by it. The examples I saw were great. I'll see how far I can go with the R package. Is that how you use Shiny? 

 

@wine in coffee cups: I'll adjust the data frame returned and see what I can do about the default search. Certainly do not want to miss any records since many people opt not to share their 'metrics'. 

 

I'll roll these (and other fixes) into the next CRAN submission. Thanks again for the tips and feedback!


Yeah, but you would need a lot of dummies (say for 160, 161, 162, ...) since there are a lot of different scores :0! Way to go on the acceptances! 

 

I suppose if you believed the cutoff was x, you could just make one dummy whenever the variable is >= x?
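As a small sketch of the two options (Python, with made-up scores and an assumed cutoff of 161): one dummy per distinct score needs as many columns as there are distinct scores, while a believed cutoff needs just one.

```python
def per_score_dummies(scores):
    """One 0/1 column per distinct score value."""
    levels = sorted(set(scores))
    return {level: [1 if s == level else 0 for s in scores] for level in levels}

def cutoff_dummy(scores, cutoff):
    """A single 0/1 column: 1 when the score is at or above the cutoff."""
    return [1 if s >= cutoff else 0 for s in scores]

scores = [158, 160, 161, 165]      # made-up values for illustration
many = per_score_dummies(scores)   # four columns, one per distinct score
single = cutoff_dummy(scores, 161) # assumed cutoff of 161

print(len(many), "per-score columns")
print(single)
```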

 

Thanks a lot! I'm actually quite surprised at the outcome thus far. 


Ooh yeah, I gotcha. I think he's along the same line of reasoning as when he says: "I imagine that a cutoff model would be more appropriate"

 

I think one of the next interesting steps is comparing this data with the actual data that some schools publish (like Duke, UW, etc.). Then we can maybe get a better idea of how representative TGC data is. 


One of my friends wrote a post about this. Nice package!!! http://minimallysufficient.github.io/2015/02/08/gradcafe.html

That's great. Really enjoyed reading the post. The footnote about homework procrastination is the best part. 

 

 

Am I the only one that finds it hilarious that such a package even exists?

 

Glad to see it brighten your day! I had fun putting it together.  :)

