untzkatz
Big tech data scientist vs. big pharma biostatistician
untzkatz replied to lxzqw136's topic in Mathematics and Statistics
I am in pharma and I am leaving soon because of the things you listed; they weren't a good fit for me. I recently managed to find a more bioinformatics/DS, drug-discovery-focused position, and thankfully they are willing to train me on the domain knowledge, probably in part because my undergrad background is in a biomedical field, so I wasn't a pure biostat/stat person. The TLDR is: if you really, really want to do stats, data analysis, and modeling focused work, then Biostats in pharma will be disappointing. You don't have to go to big tech though; you can look for something like this under other job titles too. I had gotten feedback on a similar topic weeks ago and I meant to update that, but after a whole month of interviews I managed to find this as an MS. The interviews for the different DS positions were a mix of presenting data analyses I had already done (I used grad school stuff for these), take-home data analysis, data wrangling tests on CoderPad, and LeetCode-type questions. I bombed the LeetCode ones but I passed the ones that had the other 3. I then made the decision, hopefully the right one, to go with the one that seemed more analysis focused, on biochemical data. One of the others was in academia (which had lower pay), and for another I got the vibe during subsequent interviews that it was more DE focused despite claiming to do causal inference and ML. Ironically, Biostatistics is a good fit for people who want occasional simple statistics and more focus on writing, communication, and FDA/regulatory stuff. If you want to use more statistical methods and have the work focused mostly on programming, modeling, etc., then DS or ML engineer roles are better. For me, what drew me to the biostatistics major was the data analysis and modeling, so it turned out that being in pharma in a Biostat *title* was an extremely poor fit. I've hated writing ever since middle school, and the documentation was more painful and stressful for me than learning advanced programming, ML/DL, and data analysis.
I think there is a common misconception in school that STEM is "harder" than humanities, social science, writing, etc. There are definitely people for whom it's the opposite, and writing is, fortunately or unfortunately depending on the person, a major part of biostatistics-titled jobs. You have to want to get better at it and improve over time to be successful in Biostatistics, and it's something I have had near 0 interest in my entire life. It is true that you won't be competing with younger people who could be sharper technically when you get older, though I think most likely you will have to pick one and then see how you like it in the 1st year. This is how I have ruled out all Biostatistics jobs in the future for me.
-
Sending GRE scores before applying
untzkatz replied to csheehan10's topic in Mathematics and Statistics
Related to this, if your GRE is expiring this November, does sending it earlier make the scores valid, even if you turn in the app itself later but by the Dec 1 deadline?
-
Damn, so the job description really is deceiving. I guess that's not too surprising considering how hyped up DS is. In regards to transitioning out of biotech, one issue is that my undergrad was in BE. So compared to other Biostat people who came from stat/math, I feel a bit more "holed into" this industry. Like, my whole resume is projects that are biomedical stats related; even the research experience I got in grad school was doing stats for a lab in BE (hence the work with imaging data). This issue is kind of common to any Bio-X field though. I actually like the biomedical stuff, but in industry there can be too much red tape. Some good news (fingers crossed) is that I recently had a non-clinical Biostat interview at a biotech company (a recruiter had referred me; I hadn't applied, but I figured I'd just see what it's about), and I made it clear to the team that I didn't like the regulatory writing stuff and wanted algorithms. One of the interviewers was actually on the algs team, and I was able to answer the ML questions (they asked me weird stuff like what would you do if the data isn't labeled, and is accuracy always a good metric). I got an internal referral to interview for DS on the algs team, where they do seem to do predictive modeling (they themselves said I would be a better fit for this, and they want someone who is familiar with the stats aspects of it). Hopefully I get lucky and pass it.
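On the "is accuracy always a good metric" question, here is a minimal sketch of the usual answer. It assumes a made-up imbalanced dataset (5 positives out of 100) and a hypothetical classifier that always predicts the majority class; the function names and data are purely illustrative and not from the interview.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# Made-up imbalanced data: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# Hypothetical "model" that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy={acc:.2f}, recall={rec:.2f}")
```

Accuracy looks great here even though the classifier never finds a single positive case, which is why recall, precision, or similar metrics usually get checked as well when the classes are imbalanced.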
-
Well I know Python so it would probably be improving SQL. Are you basically saying that such job descriptions look like they have lots of cool modeling, but that reality is not the case and it just seems that way on the outside? I keep hearing that for statistical/ML modeling jobs these days you need a PhD, and even still it'll be competitive as you said. Modeling 20% of the time isn't too bad, but I'm afraid data analyst it'll be like <5% of the time and most you will be doing is basic univariate summary stats and visualizations. Sounds like what you are getting at is that the coursework in statistics/biostatistics departments is heavily foundational classical stats, but the research does more modern things and combines it with the inferential aspects? Whereas NYU DS for example seems to go right into the statistical ML/DL and bayesian network type stuff. I would be interested in Comp Neuro too since you brought that up. Did you do that stuff in undergrad-it seems very advanced for undergrad level. I agree the good thing about an Algorithms course is that even if I don't do a PhD, it can still help to get through interviews at tech companies since that stuff is tested in Leetcode and so on. And still improves general programming skills beyond just numerical computation. That is why I have been leaning towards doing it. As far as the frameworks though, I'm pretty sure most PhD students doing DL are using PyTorch and so aren't implementing various data structures or autograd from scratch. I've seen arxiv github code and it still often follows the formulaic subclassing nn.Module to make a layer, then having __init__ and forward() and so on. And making a Dataset class and Dataloader. Would you say something like UCSD EE with DS/ML may also be good? https://www.ece.ucsd.edu/index.php/faculty-research/ece-research-areas/machine-learning-data-science-impacted. Seems like they do stat learning and DL there too.
-
So you think that it's not worth doing a PhD in a DS/related field in order to eventually go for one of those more stat/ML modeling based jobs (despite how scarce they seem to be)? A lot of these, like for example the Harnham one posted earlier, seem to require a PhD. Would you say apply for that stuff even with an MS and try to demonstrate that you can do it on the resume/Github & interview? I understand data wrangling is the 80% of data related work, but still I'd like to get away from the regulatory writing aspects primarily. I'm not complaining about data wrangling as much as the biostatistics regulatory stuff. In that sense, it sounds like a DS job, even in biotech, with less of the regulatory writing being given to me (since Biostatisticians are given this) could be a better fit. Here is a more example of something I would eventually like to get into: https://verily.com/roles/job/?job_id=2059874. It's signal processing related, and looks like actually this particular one requires an MS but in CS, and otherwise it says preferred is PhD in BME/CS/App Math/related. A lot of the job posting seems to be about both classical (eg time series, linear models, multivariate analysis) and ML modeling so a good mix. It just seems like these jobs and similar ones vastly prefer PhD candidates. I don't know if I am interested in pipelines like the software/ML engineering sense of the term, but I like both methods+applications of ML/DL. Sometimes, people develop methods for specific applications. My undergrad btw actually was in BE which you listed, although it is far too broad and I wish I did something like applied math ugrad. My Biostat program covered those inferential things at MS level in the 1st year, and 2nd year we had stuff on Survival+GLM+GLMM (these were combined with PhD students, I got an A in survival/GLMM but a B+ in GLM) as well as the electives which were the ML/signals/time series classes I mentioned (those were all As). 
Yea, I would expect the higher ranked programs to be more modern. I'm not too interested in doing the Fisher/Neyman-Pearson inference stuff all over again, but the NYU DS inference & representation course on graphical models does look interesting, as it's a more modern spin on it. I didn't realize DS&Algs would be needed though. I'm not sure where things like heaps, linked lists, dynamic programming, etc. come up in ML at all, but I also took ML in a stat department, not a CS one. Technically I know NNs, for example, make use of dynamic programming when autograd caches the gradients, but that's an internal detail you don't need to worry about when using high-level frameworks like Julia's Flux or TF/PyTorch. The only computational complexity stuff we did was related to matrix decomps in computational stat. Between Real Analysis and an Algorithms class, what would be better? I don't think I can take both.
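To make the autograd point a bit more concrete, here is a toy reverse-mode autodiff sketch in plain Python. The `Value` class is hypothetical and only handles scalar add and multiply; real frameworks are far more involved. The dynamic-programming flavor is that each node's gradient is accumulated once, in reverse topological order, and then reused for everything feeding into it, instead of being recomputed along every path.

```python
class Value:
    """Scalar node in a computation graph (hypothetical toy class)."""

    def __init__(self, data, parents=(), backward_rule=None):
        self.data = data
        self.grad = 0.0
        self.parents = parents
        # Maps the gradient of this node to the gradients of its parents.
        self.backward_rule = backward_rule

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), lambda g: (g, g))

    def __mul__(self, other):
        a, b = self.data, other.data
        return Value(a * b, (self, other), lambda g: (g * b, g * a))

    def backward(self):
        # Build a topological order so each node is processed only once.
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0
        # By the time a node is reached, its gradient is fully accumulated
        # (the cached value), so it is pushed to the parents exactly once.
        for node in reversed(order):
            if node.backward_rule is None:
                continue
            for parent, g in zip(node.parents, node.backward_rule(node.grad)):
                parent.grad += g


x, y = Value(3.0), Value(2.0)
s = x * y        # shared intermediate, used three times below
z = s * s + s    # z = (x*y)**2 + x*y
z.backward()
print(x.grad, y.grad)  # 26.0 39.0
```

The shared node `s` is the interesting part: its gradient (2*s + 1 = 13) is computed once and reused for both `x` and `y`.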
-
I see, yea I am not interested in overall CS though. I feel like I only like this narrow ML/DL area and to me it seemed like stats. So seeing that NYU DS you can more or less just focus on that area looks appealing. They do a lot of MRI research in the biomedical track too, which seems to be more applied statistics based than CS. I do agree more math background would help but I was in a different biomedical field in my undergrad, so can’t do much now. I could potentially sign up for one of either Real Analysis or Data Structures&Algs for the summer though, which I am considering. Honestly I never really thought I would like ML/DL back in undergrad because I didn’t really know what it was and thought it was some insane CS thing but 2 stat ML (on supervised+unsupervised learning) classes in my MS I was wowed and I got convinced that ML is stats. And computational statistics (which had a bit of numerical analysis, but mostly matrix decomps/GD/MCMC/EM) had some as well. I also had a signal processing (special topics) stats course I really liked on FFTs, which was actually invented by Tukey, and I liked that too. Time series as well but my TS elective course was undergrad level. There was much less asymptotic stuff in these areas and it was more like “show gradient descent on convex functions converges” which is more optimization. So it seems weird to me that all this stuff isn’t considered statistics, perhaps I went to a more modern department after all. Deep Learning also we had a little bit on it from the GLM/GAM perspective. And I always liked GLMs and regression, so seeing that ML and DL boiled down to that got me interested in it. And regularization too, like how people nowadays are incorporating different kinds of penalties for domain specific problems is interesting. The whole double descent thing in DL to me seems like statistics, at least the way Dr Witten explained it with GAMs and regularization. 
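As a rough illustration of the "gradient descent on convex functions converges" kind of result, here is a sketch on a toy convex quadratic, f(x, y) = x^2 + 10*y^2. The function, starting point, and step size are all made up for illustration; the step size is chosen to be at most 1/L, where L = 20 is the largest curvature of this particular f, which is the usual sufficient condition for the objective to decrease at every step.

```python
def f(x, y):
    """Toy convex objective (made up for illustration)."""
    return x ** 2 + 10.0 * y ** 2


def grad_f(x, y):
    """Gradient of f."""
    return 2.0 * x, 20.0 * y


def gradient_descent(x, y, step=0.04, n_steps=200):
    """Plain fixed-step gradient descent; returns final point and f history."""
    history = [f(x, y)]
    for _ in range(n_steps):
        gx, gy = grad_f(x, y)
        x, y = x - step * gx, y - step * gy
        history.append(f(x, y))
    return (x, y), history


(x_min, y_min), history = gradient_descent(5.0, -3.0)
print(history[0], history[-1])
```

The objective values in `history` never increase, and the iterates approach the minimizer at (0, 0).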
VAEs, for example, seem to be heavily statistical and would fall under probabilistic modeling of multivariate distributions, letting you generate new data based on the latent space. I don’t think asymptotics, inference, and p values necessarily define the field. I see, yea, it seems to be a great idea, but maybe I need to think about it more. It is true that as of now work-life balance is not a problem, despite the job itself being a bore. Maybe things could get better with a new job + the end of covid lockdowns. That could be a contributor too. I hate remote work, and I hate how companies are pushing to be remote more and more. It makes you feel mostly like a corporate slave imo, and long term this model is not going to work. There is no social aspect/culture, and it makes the boring stuff more boring. As my first job ever, I have hated that (maybe I’d have felt differently otherwise). I could also consider joining some DS/ML meetup groups irl just to satisfy the intellectual curiosity aspect, as it sounds like jobs will not have as much of this (especially without a PhD). And also do other hobbies. I will admit, part of why I wanna do a PhD is to delay this chapter of life lol. I did like the school research environment more. It’s not only because of jobs.
-
Yea I get this perspective too, which brings me to the elephant in the room: why has DS/ML/AI been so hyped up? It’s certainly starting to sound like the “Instagram effect” but for jobs. You see the best, most cutting-edge stuff from the outside (analogous to seeing highlight reels and heavily curated/edited pics), but the reality isn’t like that, and it gives you a skewed view. It sounds like this stuff is really more in research labs and, if what you are saying is true, it actually does not pay well (since it’s in academia) except for the very few who are competitive enough to get a FAANG-like research scientist position. Or who are perhaps lucky enough to find some startup. Otherwise, statistical/ML/DL algs related stuff sounds like largely a hobby for personal projects. Guess it starts to make sense now why they call it “work” and not “fun”.
-
Data analyst traditionally is like Tableau and SQL, from my understanding. That probably doesn’t have much classical or ML analysis at all. I don’t think it’s necessary to do DA to go to DS coming from Biostat, is it? I’m currently actually talking to people about a biostat position related to imaging data analysis, though it’s in academia. I had actually landed it last year, but I chose industry due to the pay. It wasn’t directly an imaging position, but I had gone through the process, and the labs I would have worked with were radiomics and stat learning ones. I decided to contact the main person again last week, and they did get back to me, though it’s looking like I would probably have to go through the process again soon; they are finding that out for me. The good thing is, they have this thing where you can even take relevant classes on the side while you do research, so that could help too. Then there are some DS positions I have had phone calls for too, but I’ve heard nothing back yet, so it’s competitive. But I’m still hoping for those.
-
Looking at the PhD DS curriculum here https://cds.nyu.edu/phd-curriculum-info/ it looks like there is 1 probability course and 1 more modern inference (like graphical models) course. The probability course seems to have notes here https://cims.nyu.edu/~cfgranda/pages/stuff/probability_stats_for_DS.pdf and it looks pretty much like MS level probability (it looks like the MS students there also take this), which I have already done before. It’s not measure theoretic probability. The intro to DS course is programming based, and the ML class looks like it goes more into ESLR stuff, which I don’t mind. It is also MS level (from what I can tell, the MS students are taking the same class), and we had something similar to this too. The hardest class in terms of theoretical background seems to be the Inference & Representation one. But this looks to be about very modernized topics like DAG/Bayesian network models, which is very different from your usual math-stat asymptotic/Fisher/Neyman-Pearson stuff. They seem to discuss applications in ML as well, so it’s far from the typical math-stat inference class and looks like something I would potentially like, based on the notes: https://www.notion.so/Inference-and-Representation-623a215febc3461dbc004682484922ad
-
My stats/biostats program in grad school didn’t have this grade inflation. It was graded more like how undergrad courses would be, on curves. Actually, many Americans in particular got similar scores to mine; the international Chinese students (who were like 90+% of the dept, which I think isn’t uncommon) set the high bar. There were classes where I did decently well and then last minute got screwed by the Final Exam curve. Some of these international students had done things like Quadratic Forms way back in HS, and a lot of the MS math stat courses were just review for them. Ok, maybe I should rephrase. I like the computational aspects of all of these tools. I have always liked implementing things like GLMs, EM, gradient descent, doubly robust estimators, etc. in code. From a practical standpoint, when you, say, fit a GLM and do data analysis, are you *actively* thinking about asymptotics? Usually you do some visual checks of the deviance residuals, check assumptions like independence based on the design, and consider things like the bootstrap (or, if there is dependence, some clustered resampling) if various assumptions aren’t met (or sometimes even if they are). I guess this is indirectly related to asymptotics, but it’s not like a proof, more of a check. Idk, maybe I just like data analysis and implementing statistical algorithms, but it sounds like that isn’t really what a PhD is about either. If that is the case, maybe it’s better to just look for DS jobs that involve more of it and improve my programming skills so that I can write production code? It sounds like it’s hard to find something where you are just doing the data analysis component, except maybe in academia. But I am also considering just going back to academia as an MS level biostatistician, where it is more real biostats without the regulatory stuff. It doesn’t pay great, but I am considering just working at like a cancer center doing imaging data analysis.
That could also help with PhD apps, and I could also see if I like that stuff more.
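For the bootstrap / clustered resampling check mentioned above, a minimal sketch using only the standard library. The data are made up (think repeated measurements per patient), and the function names are hypothetical. The idea is to resample whole clusters with replacement rather than individual observations, so within-cluster dependence is preserved; if every observation is its own cluster, this reduces to the ordinary bootstrap.

```python
import random
import statistics


def percentile_ci(values, alpha=0.05):
    """Approximate percentile interval from a list of bootstrap statistics."""
    ordered = sorted(values)
    lo_idx = int((alpha / 2) * len(ordered))
    hi_idx = int((1 - alpha / 2) * len(ordered)) - 1
    return ordered[lo_idx], ordered[hi_idx]


def cluster_bootstrap_mean(clusters, n_boot=2000, seed=0):
    """Percentile CI for the pooled mean, resampling whole clusters."""
    rng = random.Random(seed)
    boot_means = []
    for _ in range(n_boot):
        # Draw clusters with replacement, keeping each cluster intact.
        resampled = rng.choices(clusters, k=len(clusters))
        pooled = [obs for cluster in resampled for obs in cluster]
        boot_means.append(statistics.fmean(pooled))
    return percentile_ci(boot_means)


# Made-up data: each inner list is one cluster (e.g. one patient).
clusters = [
    [4.1, 3.9, 4.4],
    [5.2, 5.0],
    [3.1, 3.3, 2.9, 3.0],
    [4.8, 5.1, 4.9],
    [3.7, 4.0],
    [4.5, 4.6, 4.2],
]

pooled_mean = statistics.fmean(obs for c in clusters for obs in c)
ci_low, ci_high = cluster_bootstrap_mean(clusters)
print(f"mean={pooled_mean:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

With only six clusters the interval is rough; the sketch is about the resampling unit, not about this being enough data.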
-
B/B+ was in graduate MS level math stat classes, not undergrad. Its the classes taken by MS students and the 1st year PhD students who need to review the MS level before doing PhD level inference courses. We used Casella and Berger. My undergrad was in a different biotech related field. The highest undergrad math course I have taken is upper division linear algebra but I also got a B+ there, never did real analysis. I did struggle in the MS level math stat asymptotic theory type proofs. I got As in the computational courses (comp stats and 2 ML classes) though. How important is the statistical inference asymptotic type proof stuff for going into ML/DL? I wonder if maybe a DS program would be better for this reason because it is more applied and would go straight into the more modern statistical areas and not have to bother with regular math stats again. I hated the asymptotic theory stuff as it had very little application (in the end you just throw it into a Wald Test or Bootstrap anyways).
-
Oh I see, well I did do medical imaging related biostat research in my MS. It was interdisciplinary and I got 1 applied paper in a well known MRI journal, although it was more in applied classical stats. And that is the sort of stuff I want to do, involving DL/ML and imaging data. I don’t want to do vanilla biostats stuff like survival analysis lol, even in survival nowadays people are analyzing full images and using the survival loss functions in DL.
-
26 itself isn’t old to be in the middle of a PhD already, but I see it as kind of old to start. Assuming it takes 6 years (and given I’ll have to apply this coming Fall), I would be around 33 after graduation. A lot of people are starting to settle down just about now. And yea, agreed, it is a big consideration. But it sounds like the research scientist jobs in FAANG need one. Though I probably wouldn’t want to work for FB, but that is more for my own reasons, like not being into social media lol. Wrangling data is tedious at times, but it’s still better than writing regulatory reports to the FDA and documentation imo. Tidyverse makes it a lot easier if it’s structured data. I think I would enjoy the PhD, provided it has a good mix of modern and computational topics. I wouldn’t want something where it’s mostly dry math-stats and asymptotics. NYU’s DS PhD program looks really interesting to me though. And they have cool research too, including a bunch of biomed imaging people. It’s probably still really hard to get in, but maybe it’s easier than top Stat programs.
-
Thanks for the links. The Cytel one looks like it uses SAS and nothing cutting edge is being done in there lol, even a log transform is insanely cumbersome vs R/Python/Julia. But these are interesting otherwise especially the Harnham one is right up my alley, though it says Senior DS and wants a PhD. Rest seems mostly director level. The PhD seems to be a big barrier and I am 26 so am getting older. I regret not doing it earlier, as it seems with an MS you mostly get all the boring work especially in biotech. Biotech seems to value the PhD status a ton. Maybe at the end of the day a job is a job and I will just have to do the advanced stat/ML/AI stuff as a side hobby if I never get a PhD. One of my biostat profs suggested to maybe get an MS in DS or ML from a reputable school and then see after but I suspect itll lead to the same problems, as even that field demands a PhD now as I don’t want to be doing ML Engineering either I want to be doing statistical ML. ISLR/ESLR is on ML though and these are written by statisticians. I don’t think uncertainty quantification is necessary for something to be statistics. If you have a complex observational dataset and you don’t approximate the function correctly (model misspecification) the 2nd order things like SEs/p values are not going to be accurate anyways. Predictive scores are important even for inferential purposes now according to some classical statisticians, like Max Kuhn the R tidymodels author who describes an example of inaccurate inferential results when this isn’t done: https://www.tmwr.org/performance.html. Nowadays people are even combining ML with classical statistics in the things like SuperLearner by Mark VDL and Doubly Robust methods. That is the sort of stuff I find really cool. Seems its all mostly academia though. But yea anyways my company doesn’t seem to encourage innovation. Its all about just moving the business forward. 
Its mid size trying to scale up further and its going to be even more regulatory work going forward. When I first started a year ago, I had more freedom and did more internal data analysis rather than for product submission/FDA but in the last 5 months it has changed a lot more and they even say “we are becoming more like biostatisticians in more established companies so more regulated”. I think I wouldn’t like MBA either since its again more business oriented, and I was never interested in management. I very much am interested primarily in data analysis and the stat/ML methods. But it seems I don’t have the PhD gold star for this work.
-
To me ML=statistics, and after taking 2 ML courses in the stats department during my MS I am convinced lol. The rigorous stats you are referring to, like missing data, doesn’t really come up in MS level biostat, I think. And people are conservative when it does anyways; they often just drop the missing data and don’t do fancy imputation. Power and sample size calcs come up quite a bit, but they are very straightforward simulations. Also, statistics is not just hypothesis testing and uncertainty quantification to me. I think this is a misconception. Or rather, the Biostat work does not really involve advanced methods for this. Causal inference, for example, is rarely done by industry pharma/med device biostatisticians, but it is done by data scientists in tech. Causal inference on observational data is a good example of rigorous stats that is missing in industry biostatistics. You mention RWE observational data, but I have not seen industry Biostat positions where one can focus solely on analyzing that data. I see RWE mentioned more in DS positions again. I never said anything about not checking assumptions and so on. In fact, there have been multiple occasions where an analysis I proposed was better on this basis, but they still wanted the simple one because it was in the FDA guideline and they don’t like going against the grain, regardless of the mathematical or statistical justification. For example, take a medical image or molecular level data. This is where the interesting work is, and you can go deep into the mathematics of, for example, statistical signal processing (if you don’t want to do ML and want to stay within classical) and extract features, but biostatisticians don’t do this either. They are much more product facing. Instead we have bioinformaticians and comp chemists, for example, doing that other, more discovery related stuff. And there is far more actual deep statistics in that than there is in a t test or ANOVA showing that A was better than B.
Sometimes I get lucky and I can bootstrap, but even that isn’t exactly exciting. Bootstrap is probably the most “complex” technique I have used. And then there is study design, where the majority of the work is the boring planning phase and writing even more documentation. Maybe there’s some simulated sample size calculation, but beyond that simulation there isn’t much. Or what about classical time series analysis on wearable device data? Nope, that doesn’t seem to happen in biostat either. Even within classical stats, I have not seen the advanced stuff come up, besides some designed experiment GLMMs. Within non-ML, things like mixture models, the EM algorithm, MCMC, Fourier transforms, and Gaussian processes come up more in DS than in stats. I did try to use Bayesian methods once, but they are opposed; they feel it complicates the documentation. As for ML, there is a lot more statistics here. Take a variational autoencoder, for example: the theory behind it is very statistical and involves probability theory, KL divergence, etc. You are right, I just wanna dig in and start analyzing the data and focus on the methods. But that to me is the statistics, not the documentation and regulatory aspects. A lot of people don’t like data cleaning, and while I am not a huge fan, I have even enjoyed the data wrangling/cleaning aspects far more than the regulatory documentation. Like, at least data cleaning has the programming aspect and can be like a puzzle to solve.
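For reference, the kind of simulated power / sample size calculation mentioned above can be sketched in a few lines. This is a minimal sketch under made-up assumptions: two equal-sized groups of normal data, a standardized effect of 0.5, and a large-sample z approximation for the critical value instead of the exact t distribution (the standard library has no t quantiles). The function name and defaults are hypothetical.

```python
import math
import random
from statistics import NormalDist


def simulated_power(n_per_group, effect, sd=1.0, alpha=0.05, n_sims=2000, seed=0):
    """Estimate two-sided power of a two-sample comparison by simulation."""
    rng = random.Random(seed)
    # Assumption: normal approximation to the t critical value.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        a = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        b = [rng.gauss(effect, sd) for _ in range(n_per_group)]
        mean_a = sum(a) / n_per_group
        mean_b = sum(b) / n_per_group
        var_a = sum((v - mean_a) ** 2 for v in a) / (n_per_group - 1)
        var_b = sum((v - mean_b) ** 2 for v in b) / (n_per_group - 1)
        se = math.sqrt(var_a / n_per_group + var_b / n_per_group)
        if abs(mean_b - mean_a) / se > z_crit:
            rejections += 1
    return rejections / n_sims


# Made-up design: 64 per group, effect of 0.5 SD.
power = simulated_power(n_per_group=64, effect=0.5)
# Same design with no true effect, as a sanity check on the type I error.
type_1 = simulated_power(n_per_group=64, effect=0.0)
print(f"power ~ {power:.2f}, type I error ~ {type_1:.2f}")
```

In practice you would loop over `n_per_group` until the estimated power clears the target (e.g. 80%), and swap the data-generating step for whatever the study design actually looks like.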