[
{
"id": "1c4cxoj",
"datetime": 1713153681.0,
"flair": null,
"title": "Weekly Entering & Transitioning - Thread 15 Apr, 2024 - 22 Apr, 2024",
"score": 2,
"comment counts": 12,
"content": " \n\nWelcome to this week's entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include:\n\n* Learning resources (e.g. books, tutorials, videos)\n* Traditional education (e.g. schools, degrees, electives)\n* Alternative education (e.g. online courses, bootcamps)\n* Job search questions (e.g. resumes, applying, career prospects)\n* Elementary questions (e.g. where to start, what next)\n\nWhile you wait for answers from the community, check out the [FAQ](https://www.reddit.com/r/datascience/wiki/frequently-asked-questions) and Resources pages on our wiki. You can also search for answers in [past weekly threads](https://www.reddit.com/r/datascience/search?q=weekly%20thread&restrict_sr=1&sort=new).",
"comments": [
"So I'm currently getting a Bachelor in Computer Science at WGU. Should I get master in Data Analytics after or find somewhere else that offers DS masters?",
"Hello! I have an interview coming up next week at a startup, where the role requires some expertise in causal analysis. This involves identifying issues, understanding their underlying causes, improving the product by addressing these problems, and then conducting tests and sensitivity analysis to verify the results.\n\nRegarding my background, I\u2019ve spent 1.5 years working as a data scientist, including 1 year as an intern and half a year in a full-time role. My experience has primarily focused on exploratory data analysis, ml modeling, and A/B testing, with less emphasis on causal analysis. Although I have a theoretical background in causal inference from my statistics coursework, I haven\u2019t had the opportunity to apply this knowledge to real-time data. Could anyone recommend resources or Kaggle competitions for practical experience in causal analysis? If you are a DS professional who does causal inference/modeling, could you share insights on how to effectively frame problems and set up hypotheses? Additionally, I would appreciate recommendations for widely-used causal analysis libraries in Python that are industry standard.\n\nThanks in advance!",
"Hi guys, I'm in need of good resources on statistics/probability geared towards machine learning to work through in the next few months. My knowledge of machine learning is quite limited currently, but I'm working through a python course on machine learning that I should be completing in the next month or so. My end goal is to either get a job as a data scientist (which would be very difficult given my current skillset) or to get admitted to a good masters program in data science in the next application cycle.\n\n\n\nI've had a hard time finding resources online as I'm specifically looking for courses/books that don't shy away from math and explain their methodology thoroughly. I have bachelors and masters degrees in math and a strong background in linear algebra and analysis. I've also taken a couple of intro courses in statistics, but I'm interested in building more foundational knowledge as I get more familiar with probability and machine learning.\n\n\n\nI would appreciate any guidance/advice or resources you guys would be willing to share!",
"Hi all, \n\nI am seeking advice on the best step I should take in my career. I am transitioning to data science from academia. I have been offered an entry-level position in a medium-sized company that is developing its relatively new data science team. My impression is that they are not entirely clear on what they want to do or where resources should be best focused. They seem to be in an exploratory phase, determining which avenues will provide the most value to the company. I do not know the competency level of members of the current team. The company is not high-profile, but it may provide a good opportunity for career progression if I can help them build a helpful data science program. The position is also 100% remote.\n\nI also have an offer for a \"fellowship\" where I would be paired with a higher-profile company, either in the public or private sector, and carry out a machine learning project with them (previous examples include NLP, predictive models, RAG, etc.). At the end of the fellowship, there is a high probability (\"95%\" according to them) that I will offered a position by the partner company. The fellowship is highly competitive, so should make me more attractive to recruiters. This fellowship also requires in person attendance and could provide some valuable networking opportunities.\n\nI am struggling to decide which option to take. Does anyone have advice on which option might be best?",
"Not sure if this is the best thread to ask, but does anyone have statista access? I want to use 2 of their datasets for a class project"
]
},
{
"id": "1c4nwy0",
"datetime": 1713192080.0,
"flair": "Discussion",
"title": "WTF? I'm tired of this crap",
"score": 549,
"comment counts": 161,
"content": "Yes, \"data professional\" means nothing so I shouldn't take this seriously.\n\nBut if by chance it means \"data scientist\"... why this people are purposely lying? You cannot be a data scientist \"without programming\". Plain and simple.\n\nProgramming is not something \"that helps\" or that \"makes you a nerd\" (sic), it's basically the core job of a data scientist. Without programming, what do you do? Stare at the data? Attempting linear regression in Excel? Creating pie charts?\n\nYes, the whole thing can be dismisses by the fact that \"data professional\" means nothing, so of course you don't need programming for a position that doesn't exists, but if she mean by chance \"data scientist\" than there's no way you can avoid programming.",
"comments": [
"Data professional could mean being a data entry clerk, or working as a data analyst using only Excel, and maybe a little bit of SQL. I wouldn't read too much into it.",
"I think this is just an accurate post. Data professional covers a broad range of jobs where the primary function is working with data. There are plenty of jobs under that umbrella that do not require programming. Analyst/entry/design. It doesn\u2019t mean top tier data jobs don\u2019t require programming. What a strange thing to rage about.",
"It feels like you posted this looking for a reason to be outraged.",
"Dude...data scientist doesn't actually mean anything either.\n\nCalm down. The phrase 'know programming' is so vague as to be nearly useless as well.\n\nI am a data science manager. I know some code, I can script pretty well, but I am terrible at actually programming anything. That's how I would characterize my skillset.\n\nOther people think I am a code wiz. They are very, very wrong. :)\n\nThese are subjective terms. Most of them are not codified the way doctor, lawyer, realtor, or engineer are codified.\n\nLet people believe what they want to believe.",
"you can get a long way with SQL"
]
},
{
"id": "1c5f11i",
"datetime": 1713270988.0,
"flair": "Projects",
"title": "Loading a trillion rows of weather data into TimescaleDB",
"score": 3,
"comment counts": 1,
"content": "",
"comments": [
"I posted a while back [asking for help](https://www.reddit.com/r/dataengineering/comments/16z8h6l/how_to_efficiently_load_20_tib_of_weather_data/) [on loading tons of data](https://www.reddit.com/r/PostgreSQL/comments/16z7vqs/how_to_quickly_load_20_tib_of_weather_data_into_a/) and got lots of great advice and feedback. I ended up doing some digging to answer my question and wrote a post benchmarking the fastest ways to insert data.\n\nI'm still learning Postgres so if anyone has any feedback or questions, I'd love to hear them!"
]
},
{
"id": "1c5fqyd",
"datetime": 1713273095.0,
"flair": "AI",
"title": "Rule based, Recommendation Based Embedding",
"score": 2,
"comment counts": 0,
"content": "Hello Coders\n\nI would like to share an experience and know your opinions. I embedded about 12K+ order lists from a takeaway order system. I used Cohere english v3 and openai text embeding v3 for the embed. I prepared questions for the embed I would like large pizza, green pepper and corn questions with semantic parser. The output answers of these questions vegan pizza, vegan burger added pepperoni topping coke side topping did not satisfy me. Complementary and suggestion answers gave one quality and one poor quality output. Of course, these embed algorithms are usually based on conise similar. I suddenly had the suspicion that I should use embed for this type of rule based, match based, recommended. I believe that I can do the attached data with my own nlp libraries with more enrichment metadata tags without embedding. I would be glad if you share your ideas, especially if I can use llm in Out of vocabulary (OOV) detection contexts.\n\nThank you.",
"comments": []
},
{
"id": "1c4oyx9",
"datetime": 1713194707.0,
"flair": "Career Discussion",
"title": "Excel Monkey",
"score": 74,
"comment counts": 51,
"content": "How much in your daily career life do you feel like an Excel Monkey where you spend most of your work load in Excel?\n\nI\u2019m currently in a modeling role in the insurance industry looking to see if it is time to branch out to other industries or if my expectations are too high. \n",
"comments": [
"What are you doing in Excel? Can you load the data elsewhere to model/analyze it more flexibly?\n\nI'd second the other post suggesting Pandas which can read/write Excel files and gives you access to the entire Python ecosystem. It's a pretty good start.",
"Hard recommendation Python \ud83d\udc0d learning. Especially pandas library you can use currently job",
"I think people are missing the point. If you are producing a product for somebody, you have to produce something that is useful for them. Often times this includes an excel spreadsheet because even most C suites can navigate Excel. There is nothing wrong with Excel when you are working with data that is < 100k observations. \n\nAlso, I\u2019m in the same industry and work with financial models, most of them are based in Excel and the primary reason is because Excel is very explainable.\n\nTo summarize, there is nothing wrong with Excel. You need to work within your company\u2019s tech stack and produce something that is useful for the people that need it. If you aren\u2019t happy with the rigor of the work (this is where I\u2019m at) look for opportunities and ask your boss for more challenging tasks where you will be forced to use additional tools besides Excel. Or, leave the company and go to a company that is a bit more mature in their tech stack choices and methodologies.",
"In my experience, so much depends on the corporation\u2019s IT and licensing agreements and appetite to support certain applications \u2026I\u2019ve been slowly getting my team to learn Python and using Anaconda, but IT sent out a memo saying we could no longer have Anaconda on our machines. So, we\u2019re looking at other solutions with IT, but we\u2019re mostly Excel and Oracle SQL developer at the moment while IT sorts out what we can and can\u2019t have. Excel and the Microsoft suite are safe and easy to maintain, but a lot of IT people are not familiar with R, Python, etc\u2026 from the standpoint of maintaining it and ensuring IT security.",
"I know some python. But the company I\u2019m currently with doesn\u2019t use Python in this role. I believe my predecessor is an actuary and what the role is meant for. What I\u2019m wondering is if it\u2019s time to look elsewhere or if I\u2019m likely going to just find myself in the same Excel position."
]
},
{
"id": "1c5b5xz",
"datetime": 1713256858.0,
"flair": "ML",
"title": "Interview Advice - Sales and Marketing Predictive Modelling ",
"score": 2,
"comment counts": 1,
"content": "Its hard as an international to get internships in this market but thankfully I had the fortune to interview for a few F250 companies. \n\nI seem to be missing out for fine margins. One company team technical lead said that i would be a good fit but since there was just 1 opening, I got referred to another team to apply . This happened quite a few times with others except i wasnt referred to other teams. I prepared for wrong things in that interview. I was able to answer all but it was thinking on spot and beating around the bush which definitely didn't help . Someone who knew it would sound more sure and knowledgeable and will get the edge .I know where i could have improved :( \n\nThis maybe my last opportunity to bag summer internship this year. I want to give my best and try to leave no stone unturned. \n\nIt would be great of someone with experience in predictive Modelling in sales and marketing can tell me about some work done and commonly used questions / techniques. I did google and chatgpt but some real world / production level insights and some commonly used models and methods MLOps of this domain would help me a lot. \n\nAppreciate your support in the above matter ",
"comments": [
"Hey FellowMates, I dont know If I would be asking this. But can you give me Some Karma(10 needed). I need to ask(post) something in this sub regarding my project."
]
},
{
"id": "1c4m8q6",
"datetime": 1713187760.0,
"flair": "Career Discussion",
"title": "How to negotiate salary when doing an internal move?",
"score": 22,
"comment counts": 17,
"content": "Hi all,\n\nBasically the title \u2014 any tips on negotiating the salary when doing an internal move, and the hiring manager / HR most certainly know at least my pay bracket, if not the exact salary I have right now?\n\nI only know some very rough numbers from colleagues and I tend to underestimate their budget / undersell when negotiating.\n\nThanks! \ud83d\ude4f ",
"comments": [
"If you're underpaid, companies typically will give you a token raise to keep you happy unless you come in with a competing offer. If you really like where you work but also want to get paid fairly then I'd suggest interviewing, getting an offer for a higher salary and using that as leverage to get the salary you deserve.",
"One aspect that makes this more difficult compared to a salary negotiation when being initially hired is you have less leverage. When being hired you can walk away and they have to keep recruiting. I don't know if your move is a promotion, a lateral move, or a completely different team / position but not only do they know your current salary, you don't have a lot to bargain with unless you plan to quit if you don't get the salary you want. \n\nI think the best you can do is show salary data from Glassdoor or some other website if you get lowballed. Otherwise I wouldn't try too hard.",
"I\u2019d recommend external interviews. \n\nLast I checked, life time earning studies show that that those who job hop more have approximately 50% greater life time earnings. **although, I\u2019m having a hard time finding the articles**, but there is an exception, those who job hop under a year are less likely to get more money, the optimal time to job hop is like 2-4 years",
"If you\u2019re underpaid and then use a competing offer, they will most likely lay you off after giving you a raise as the expectations most likely will double which wont be proportional to whatever raise they give you. \n\nUsing a competing offer as leverage will most likely strain your relationship with the employers 9/10. So just job hop.\n\nThere is nothing more an employer hates than an employee asking to be compensated fairly.\n\nThe only case of them increasing your salary is that if you are paid less than a new grad who makes 40k usd more than you and has less experience then maybe just maybe they will give you a 5000$ salary adjustment.\n\nIt has to be comically large for them do the bare minimum to adjust your salary to reflect experience and pay from colleagues. That is the expectation by law so your employer will do the bare minimum to retain you if there is no real risk of you leaving. Else that will be an easy case to prove in court for pay discrimination.\n\nIf anything when you ask for a competitive salary what they will do is to make sure to make you feel worthless and pathetic so that you do not value yourself. Very common strategy, the employer must first make you feel that you are worthless employee and that you work is crap regardless of what achievements you had in the past. The main arguments will be they care about current achievements, the budget is rough, and also be thankful you have a job in this economy. They have 1000-5000 candidates willing to do tour job for half the price.\n\nThis is a harsh reality that i saw first hand, the only leverage you have is when you are coming in a company. Afterwards you got no leverage, especially with how the economy is going. For sure dont expect this to get any better either because just like the dot com bubble in it took around 15 years for the job market to rebound. So do expect that to be the case for the next 15 years until we get a new recession 5-10 years later. In the meantime the economy will slowly rebound but not any time soon.",
"Key thing: you don't have the same leverage as an external hire, so it's hard to truly \"negotiate\" salary. \n\nHaving said that - you do have *some* leverage. The leverage is \"well, if you hire me and don't give me enough money I will start applying to other jobs behind your back and then leave in 6 months\".\n\nOne thing to keep in mind: HR doesn't like people negotiating salary increased internally for lateral moves. So if you're moving from Data Scientist 2 in one team to Data Scientist 2 in a different team, expect very little movement in salary. And if you are getting a promotion HR is likely going to cap that as well.\n\nNow, here's my advice:\n\nPull your comp history over the last 2-3 years (whatever paints the best picture), and look at what you average yearly comp increase has been. If it's been low (say, 4% or lower), then I think you have a really good argument to make that taking on a new role with new responsibilities is something that you think warrants a raise in compensation."
]
},
{
"id": "1c40xcl",
"datetime": 1713120639.0,
"flair": "Discussion",
"title": "If you mainly want to do Machine Learning, don't become a Data Scientist",
"score": 669,
"comment counts": 170,
"content": "I've been in this career for 6+ years and I can count on one hand the number of times that I have seriously considered building a machine learning model as a potential solution. And I'm far from the only one with a similar experience.\n\nMost \"data science\" problems don't require machine learning.\n\nYet, there is SO MUCH content out there making students believe that they need to focus heavily on building their Machine Learning skills.\n\nWhen instead, they should focus more on building a strong foundation in statistics and probability (making inferences, designing experiments, etc..)\n\nIf you are passionate about building and tuning machine learning models and want to do that for a living, then become a Machine Learning Engineer (or AI Engineer)\n\nOtherwise, make sure the Data Science jobs you are applying for explicitly state their need for building predictive models or similar, that way you avoid going in with unrealistic expectations.",
"comments": [
"The problem is that \"machine learning\" is the vaguest term in the world that encompasses everything from linear regression to ChatGPT.",
"My title is data scientist and honestly about 50-80% of my day is spent either using pytorch and prototyping, doing more large scale jobs on aws or preparing data so that I can then prototype on pytorch and then move toward a large scale job on hpc\u2026 however after joining this sub and reading the posts, i feel like im in a unique position.",
"Linear regressions are my bread and butter no matter how much I try to do something better. Interpretability and consistency are more important than accuracy in my field.",
"89% of my \u201cdata scientist\u201d role is making pretty charts to put in PowerPoint products. I don\u2019t have enough professional ML experience to get paid as much as I currently do anywhere else. Someone save me.",
"I\u2019m sure it\u2019s like this with most jobs but I think the data space has been seriously subject to a massive amount of hype and marketing. Everything has to be ML or Ai and 90% of companies are just suckered into buying services and platforms that just don\u2019t need. Our jobs also get hyper competitive. Need to know snowflake, docker, spark, Kafka, airflow, databricks, sql, nosql, and 10 billion other things that just don\u2019t make sense. It\u2019s getting tiring."
]
},
{
"id": "1c4r5sf",
"datetime": 1713200032.0,
"flair": "Discussion",
"title": "Does anyone use this potential alternative to gradient descent?",
"score": 13,
"comment counts": 8,
"content": "​\n\n[ Step 1: there is some loss\\/cost function but we don't know its optimal parameters ](https://preview.redd.it/vo2fb58taouc1.png?width=723&format=png&auto=webp&s=d700eadb8435238bcf549c71cf7974d0d1d27cc1)\n\n​\n\n[ Step 2: solve for the derivatives at random points for the parameters and obtain tangent vectors for those points. ](https://preview.redd.it/td8kuu9waouc1.png?width=619&format=png&auto=webp&s=5e0d879dda0ebb502895350fc23a302393268f74)\n\n​\n\n[ Step 3: Solve for where the vectors \\\\\"cross\\\\\" \\(when stretched\\) in terms of the parameters, and plug those parameters into the loss function. If it seems to be a good place, you could try gradient descent\\/back-prop starting from here. The vectors may not intersect at any point with respect to all parameters, but as long as they intersect with respect to enough of them you could try that coordinate. If not you can repeat the process until you find some tangent vectors that intersect with respect to enough parameters. ](https://preview.redd.it/ozkafxn0bouc1.png?width=616&format=png&auto=webp&s=81ba479a32c6572f6da6701d3d55d32e223d2507)\n\n I chose an example in 3 dimensions for obvious reasons. This might not run so fast or be so easy in higher dimensions. Curious to hear what people think though. And maybe this approach already exists? ",
"comments": [
"I have considered this, but not used it. There are a few potential problems that have so far dissuaded me:\n\n1. Since the vectors probably won't intersect, I thought you'd need to find the shortest line segment where the vectors most closely approach one another, and then take the midpoint of that segment. Seems complicated... I didn't think of finding an intersection in a subspace as you suggest here.\n\n2. If the loss is bumpy and weird and the probe points are too far away from another, they might be pointing to completely different local minima with nothing interesting happening between them.\n\n3. If the two points are both on the inside surface of a banana-shaped valley, the vectors will point *away* from one another- the only \"intersection\" or \"closest approach\" is *uphill* from them :-( .",
"It might work assuming the optimization problem is convex (as shown on your pictures). General optimization does not hold for that assumption, and deep neural network losses are way far from being convex. That means that the presumed \"intersection\" (or a middle point of the shortest distance line as pointed out in another comment) may point into a local maxima or whatever ultimately fucking everything up.\n\nApart from the above and the computational complexity, the approach introduces new hyperparameters:\n\n1. the number of points to sample\n\n2. variance of the sampling process\n\n3. how the above variance should decay over time (otherwise I don't expect this random walk to converge at all)\n\n4. distribution from which the points must be sampled\n\nAlso, the memory requirements scale linearly with the number of sampled points, meaning you'll have to kind of instantiate the network N times (where N is the number of points for gradient computation). And considering the scales in our modern deep learning era that's just stupidly expensive.",
"Rather than using vectors you could use tangent planes and pick new points on their intersections",
"[deleted]"
]
},
{
"id": "1c4uvfr",
"datetime": 1713208788.0,
"flair": "Statistics",
"title": "Real-time hypothesis testing, premature stopping ",
"score": 8,
"comment counts": 10,
"content": "Say I want to start offering a discount for shopping in my store. I want to run a test to see if it's a cost-effective idea. I demand an improvement of $d in average sale $s to compensate for the cost of the discount. I start offering the discount randomly to every second customer. Given the average traffic in my store, I determine I should be running the experiment for at least 4 months to determine the true effect equal to d at alpha 0.05 with 0.8 power. \n\n\n1. Should my hypothesis be: \n\nH0: s_exp - s_ctrl < d\n\nAnd then if I reject it means there's evidence the discount is cost effective (and so I start offering the discount to everyone) \n\nOr \n\nH0: s_exp - s_ctrl > d \n\nAnd then if I don't reject it means there's *no* evidence the discount is *not* cost effective (and so i keep offering the discount to everyone or at least to half of the clients to keep the test going)\n\n\n2. What should I do if after four months, my test is not conclusive? All in all, I don't want to miss the opportunity to increase the profit margin, even if true effect is 1.01*d, right above the cost-effectiveness threshold. As opposed to pharmacology, there's no point in being too conservative in making business right? Can I keep running the test and avoid p-hacking?\n\n\n3. I keep monitoring the average sales daily, to make sure the test is running well. When can I stop the experiment before preassumed amount of sample is collected, because the experimental group is performing very well or very bad and it seems I surely have enough evidence to decide now? How to avoid p-hacking with such early stopping?\n\n\nBonus 1: say I know a lot about my clients: salary, height, personality. How to keep refining what discount to offer based on individual characteristics? Maybe men taller than 2 meters should optimally receive two times higher discount for some unknown reasons?\n\n\nBonus 2: would bayesian hypothesis testing be better-suited in this setting? Why?\n",
"comments": [
"You can stop the experiment before the pre-assumed number of sample are collected if the results are very clear and statistically significant. However, you should be careful about p-hacking with such early stopping. To avoid this, you could use sequential analysis, which allows you to stop the experiment early if the results are clear, but adjusts the statistical significance level to account for the fact that you're looking at the data multiple times.",
"Checkout the \u201coptional stopping\u201d part of this paper\n\nhttps://arxiv.org/abs/2212.11366",
"Some bayesian approach is probably a valid way to approach this type of problem. I also want to say that if you run an experiment for 4-6 months to measure a small effect you should be careful about drift in your user population behavior. Effects can be seasonal or just have secular changes so keep that in mind",
"I wonder if the proper hypothesis is s\\_exp - s\\_ctrl = 0 and then you statistical test just measures if the difference is statistically significant. If it is and the difference is d, then you're good to go. But i think this is the same as what you're doing. Find the distribution of s\\_exp - s\\_cntrl and if $d falls in the <.05 left quantile then you can say s\\_exp is $d greater. \n\nI think you can just stop when the test returns something significant. This can happen if you have very few samples but the difference s\\_exp - s\\_cntrl is very large and/or the difference is small but you have many many samples."
]
},
{
"id": "1c59ru9",
"datetime": 1713251049.0,
"flair": "ML",
"title": "Help in creating a chatbot",
"score": 0,
"comment counts": 9,
"content": "I want to create a chatbot that can fetch data from database and answer questions.\n\nFor example, I have a database with details of employees. Now If i ask chatbot how many people join after January 2024 that chatbot will return answer based on data stored in database.\n\nHow to achieve this and what approch to use?",
"comments": [
"You want a RAG. Assuming it\u2019s a text DB, you need to chunk the DB into passages, and an embedding model to create a vector DB. Given a query, embed it (use the same model as before), return top N closest passages, and use them to give a QA model the necessary context to answer the query by engineering the ideal prompt. Tip: use instruct-type QA models like mistral 7b instruct.",
"Check out \"pandas ai\" package, basically does this"
]
},
{
"id": "1c4kstx",
"datetime": 1713183610.0,
"flair": "Tools",
"title": "Best framework for creating an ML based website/service for a data scientist",
"score": 5,
"comment counts": 7,
"content": "I'm a data scientist who doesn't really know web development. If I tune some models and create something that I want to surface to a user, what options do I have? Also, what if I'd like to charge for it?\n\nI'm already quite familiar with Streamlit. I've seen that there's a new framework called Taipy that looks interesting but I'm not sure if it can handle subscriptions.\n\nAny suggestions or personal experience with trying to do the same?",
"comments": [
"Let's put the question around, what are you missing in streamlit and what exactly do you want to do?\n\nPersonally I never do production grade interfaces, these are done by my devs. So my use cases are demos, pocs or internal small scale apps.\n\nFor all these use cases a combination of streamlit, shiny and flask is more than enough.",
"Flask, Fast API, Django?",
"I think, you should use Streamlit already familiar. Streamlit provides a simple API for building interactive web apps entirely in Python, making it accessible to data scientists quickly build and deploy. Next step Paypal or stripe implement.",
"check out Panel for python web development! it\u2019s pretty cool but can be hard for very large user bases. the docs are really good too.\n\non the payment side i would assume that\u2019s just an issue with whatever authentication you stand up in front of your app (panel has good docs on this as well). you have a list of users that are paying and you check on login if the current user is in that list.",
"Do check out Dash by plotly. More customisable than Streamlit. Optimization is good."
]
},
{
"id": "1c4uibm",
"datetime": 1713207939.0,
"flair": "Discussion",
"title": "Self serve dashboard adoption",
"score": 0,
"comment counts": 3,
"content": "I\u2019m trying to build a dashboard- PowerBI/Tableau which would serve as a one stop solution to most of the ad hoc analytics request that our stakeholders have. But in the past we observed that the adoption of such dashboards by them is not that great. Did anyone in the sub try this initiative, and what are the important factors to keep in mind before you go into such venture.",
"comments": [
"I\u2019ve done some work around this, I think my biggest suggestion would be a series of smaller more tailored 1 stop shops for each overarching initiative is more effective then one massive one for everything in my experience. For example we were building our ad hoc reports on customer engagement, call center volume, and push notification interactions for various alerts. It worked a lot better to create one dashboard for each of those topics than one huge one for everything because the metrics that are important for one aren\u2019t for the others and stakeholders got nervous when they saw metrics or filters they were unfamiliar with\u00a0\n\n\n\n\n\nEdit: just wanted to add that unless you have very dedicated and competent stakeholders these efforts are almost never worth it, it did very little if anything to reduce the ad hoc requests and the people that did try to use it often didn\u2019t understand how to use the filters/fields properly and ended up with some really strange takeaways from it",
"Tough ask. I built high level dashboards for some years. In my experience these dashboards disappoint. Worse, if you build in various customization options (filters, level of detail parameters, varying percentiles etc) data naive XFNs abuse and misunderstand and inevitably misuse the dashboard. End result: it's the builder's fault.\n\nPush back if you can and propose to deliver stand-alone tools for most needed use cases. Some sort of 80-20 rule applies to dashboarding. 80 percent of insights come from 20 percent of dashboards. \n\nI am not one to lecture but please keep in mind that one key tool in an analyst's / data scientist's skillset is managing stakeholders. We all must learn how to say no without saying no"
]
},
{
"id": "1c49ugr",
"datetime": 1713144012.0,
"flair": "Discussion",
"title": "If you work in a unique domain, what is it and how did you get into it?",
"score": 23,
"comment counts": 42,
"content": "Most of the domains I see for job listings are healthcare, products, retail, fraud, supply chain, etc. I'm still early career and my degree is in sociology and data, which has helped me work mostly with non-profits. I wanted to hear about what other domains that other DA/DS are working in which fall outside the majority.",
"comments": [
"I worked in nuclear power and transportation, pretty unique for DS imo but perhaps limited use cases as well.",
"Video games. It\u2019s an amazing use of DS",
"Air quality - job opportunity during my master's",
"I work in the specialty insurance field, where pretty much anything you can imagine is insured - from somebody\u2019s vocal chords to offshore energy operations to more standard properties along the coast of Florida. Very different to retail insurance (to the extent that it\u2019s actually hard to move between them the more senior you become). \n\nI got into this by first training as an actuary. I packed in that career path, but stayed in the same field to further deepen my domain knowledge. I now focus more on software development and aspects of data analytics. \n\nBecause the focus is on specialty risks, the data tends to be \u2018smaller\u2019 and there isn\u2019t much \u2018machine learning\u2019. Descriptive statistics and exploratory analysis is king. Optimising processes to be more efficient is also massively important here. If you are somebody who has these kind of skills, and can combine it with an expertise of specialty insurance, you\u2019ll be (very) hot in this market.",
"I specialize in recommendation and search. I got into it bc I was super interested in using graphs/ontologies. Ended up working for a research institute during my dissertation that was exploring the use of ontologies and named entity recognition for scientific paper recommendation. I\u2019ve since moved on from the research institute but still working in recommendation and search for a tech company now."
]
},
{
"id": "1c41y7n",
"datetime": 1713123133.0,
"flair": "Discussion",
"title": "Distraction caused by the Ai Hype",
"score": 68,
"comment counts": 24,
"content": "I noticed there's some disconnection between this recent AI Hype we constantly witness on Linkedin/Twitter, things like these new LLMs, the latest 3D models, the Cool Gen AI stuff ... and the industry requirements that actually matter for companies. Which is a bit confusing and can be distracting especially for juniors trying to upskill and learn the things that leads to get them jobs, this leaves you with the questions: Should you follow the hype and try to stay up to date by learning all these new things? or stick to what matters and can generates actual value and be good at it even if it seems \"outdated\" (things like traditional machine learning)? ",
"comments": [
"Unfortunately this is very hard for everyone. Everyone and their boss wants to jump on the GenAI train because of FOMO when they don\u2019t even have good quality data to begin with. Outside of tech, finance companies and a couple of others, data quality, its management and pipelines are extremely suspect. A lot of value in the field actually comes from doing simpler things.\u00a0\n\nThe only method to keep up with the madness is to keep reading about it to know where the field is moving. You don\u2019t have to be on top of it all everyday but its always good to be abreast of the various developments in the field even if you aren\u2019t necessarily implementing it in your day to day job. You\u2019d comes across as a genuinely curious person and even otherwise, its fun!\u00a0",
"I am frankly of the opinion that being strong in the fundamental stuff and then branching out to areas that interest you is the most valuable path of any career. There may be areas that pay more and/or have more hype around them, but do you actually see yourself doing one of those hyped up things in your career? If yes, that's fine. If no, find something else you like.\n\nI've had the opportunity to work with GPT-3 before Chat-GPT came out and back then I thought \"This is pretty cool but it has many limitations to overcome.\" I am still of the same impression when it comes to Chat-GPT and other LLM technology. Chat-GPT is like a calculator: it's useful, but to get the most of it, you need a decent-to-strong foundation in programming, math, and stats. While I like LLMs, I made the decision that they won't be the sole, new fangled toy that I would advertise my skills in to employers. I'll use them when I have to, but I'll do other things too.\n\nI personally prefer having a strong foundation in fundamental Statistics, Data Science, CS, and domain expertise. I don't mind some ML thrown in though (NLP, Regression, and Classification mostly).\n\nTLDR; get good at the basics and branch out to areas that interest you. If you like LLMs and want to learn more, definitely do so! Data Science is a wide enough field that you'll find relevant work in the areas that you'll like (with effort of course).",
"GenAI will fade, ML not.",
"I am literally feeling this one ,I used to learn one by one like ML,Dl,NLP And then wanted to learn computer vision but because of LLM and New models,new fine-tuning methods and models , I really don't know which one to learn and which one to leave and also Now a days everyone is asking for LLM in job description.\n\nI am really confused,this got to me a waste of my most time instead of learning.",
"Unlike traditional ML which is done in house, I don\u2019t LLM will follow this pattern. They will be built by a few companies then leveraged via API. If you want to add an AI chat bot to your product you just buy it. Building your own is a waste of time."
]
},
{
"id": "1c45zjn",
"datetime": 1713133176.0,
"flair": "Career Discussion",
"title": "Would really love opinions. ",
"score": 22,
"comment counts": 30,
"content": "So I\u2019m a hs math teacher. Over the past 4-5 years my school has started teaching data science. I\u2019ve been teaching that for about 3-4 years. Very basic stuff at the HS level, basically intro to R and some stats. \n\nI\u2019ve started to think about studying DS as a possible career path (and/or as a source of side income if that\u2019s possible). \n\nI was looking around and saw that Berkeley extension offers a Bootcamp in data science. It\u2019s not cheap and seems like it will take up a lot of my free time. \nBut I\u2019m having second thoughts about doing this. \nFor starters the application process was very easy. Which makes me wonder how effective this program is. \nAlso, this is just a certificate program. And it seems like most people on this subreddit have BS and or masters. \nDoes anyone have any experience with this particular program? Is there a better path to follow?\nThank you. ",
"comments": [
"Don't go to a bootcamp, those things are cash grabs. Everyone and their mom is trying to get into tech, the market is flooded with people with degrees. You'll need at least a masters if you want to get into data science at this point.",
"You probably won\u2019t break through into a career with that bootcamp. Do you have a math degree? \n\nIf I were you I\u2019d focus on tools. There\u2019s a lot of free materials to learn about tools and what they do. A lot of data scientists use SQL, Python(more) or R(less), a data vis tool like Tableau, have basic programming Know-how, like Git, basic bash/shell scripting and have some additional skills like AWS/Google cloud. They use those things to make projects that do Machine Learning, descriptive reporting or anything else that a business needs. You can read cases on the internet about what businesses commonly use data science for, how those projects were put together, and then learn those tools and methods. \n\nYou could do a bootcamp, or not. If you think it would help you gain the know-how, great. The credential itself probably won\u2019t be much help for landing a job.",
"I started as a high school teacher (AP stats and SAS) along with Math I/II etc.. cause not enough kids took stats. I'm now a Senior AI/ML architect. Here were my steps and why it was the best decision I ever made.\n\n1.) Knew SAS already and SQL so took a low entry level analyst role.\n2.) Got my masters at night while in the analyst role and as my \"internship\" I asked the data science team at my company to shadow them and got my first exposure to real data science.\n3.) Took Data Science job, contract role which I HATED cause of no PTO health insurance not great etc.. but don't regret it cause it was my foot in the door and full time exposure to data science.\n4.) Once graduated, I became a graduate professor myself. I missed teaching SO much so this was my way to keep doing it and keep up with my skills, doing something is one thing teaching it is another as you know.\n5.) Took entry level architect job learned MLOPs and with my data science skills over the last few years worked my way to senior. \n6.) BEST PART...My company just approved the curriculum and I will be volunteering teaching cloud computing in the fall to my old high school with support from tons of companies in the area that use Snowflake. I'm so excited and although it took awhile this journey was worth it.\n\nMy last piece of advice. You have the skill that 90% of other data scientists don't, and that's your teaching ability. You'll be able to learn to code, but the personality and ability to explain things to the business and other people will separate you SO much. Good luck!",
"I would recommend reading O\u2018Reilly books and actually doing some hands on projects. Uni certificates, in my experience, only give you a broad overview on a topic without actually teaching you how to do it.",
"As has already been stated, don't do a boot camp and expect to switch careers, especially in the current job market. But, depending on your situation, consider an online masters. I was a hs math teacher and got my masters in Statistics, studying on nights and weekends. Attending an online program allowed me the ability to keep working full-time and still make progress in my studies. I got an internship one summer, then a full-time offer, and I've been a data scientist for 5 years now.\n\nLike I said, the market isn't great now, so look into getting a masters, plan for a couple of years, look for internships over the summer, and hopefully when you're complete, the job market will be a bit better. Good luck!"
]
},
{
"id": "1c50iuu",
"datetime": 1713222361.0,
"flair": "Discussion",
"title": "Why are Data Scientists still needed when Machine Learning Engineers are a thing? ",
"score": 0,
"comment counts": 38,
"content": "I've worked in a company in which there were different teams for DS and MLE with very distinct responsibilities. DS were responsible for talking with business stakeholders, understanding the project goals, talking with data engineers and analysts to assess data availability, do the whole data science project cycle and then deploying the model as a flask API using infrastructure built by the MLE team.\n\nHowever, I have seen more and more MLE jobs in which their responsibilities are much broader than that. Some of them expect MLEs to actually build the model themselves. \n\nI might be wrong here but it seems like the average ML Engineer could do everything that is expected from an average data science position, but an average data Scientist would really struggle at an average ML Engineer position. ",
"comments": [
"I think the answer is that \"data scientist\" and \" machine learning engineer\" do not have specific definitions. They are simply labels that companies can use how they want. There will always be variation in job title as it relates to specific job duties across different companies.",
"you're gonna catch a lot of shit for this post, but in some ways, it (what is in your title) is a question worth asking. i'm going to post my thoughts.\n\n1. mles are, first and foremost, \\*software engineers\\*; they are not necessarily trained to have specific statistical or analytic backgrounds that are typical pre-requisites for data scientists. organizations do not necessarily need more engineers but may have a need for data scientists. note that this somewhat presupposes a pay discrepancy between the two (which i think does exist)\n\n \n2. related to the above: not all data science questions involve machine learning. in fact, most do not, at least any machine learning problem that would benefit from the expertise of an mle. at my prior job, i would say 80% of my time was spent with logistic regression as the typical model.\n\n \n3. you're right, there is definitely some blending of roles. but it's always been like that. i've definitely had to build my share of data pipelines and monitoring s/w that would be outside of what the typical ds from way back would have had to do.",
"Data scientists speced into math and being able to talk to people. If a MLE can do that on top of normal MLE stuff you don\u2019t need a DS. \u00a0But that\u2019s easier said than done so it makes sense to split the jobs.\u00a0",
"I know a guy who was hired as a data analyst and the company expected him to build and maintain machine learning pipelines. Titles are goofy.",
"The definition of Data Scientist has always been loose and the introduction of the term Machine Learning Engineer hasn\u2019t lead to more clarity. You basically need to create job descriptions one by one and match people\u2019s skills and experience to those jobs- knowing someone is/ was a data scientist or MLE just doesn\u2019t tell you anything useful.\u00a0"
]
},
{
"id": "1c3t6ww",
"datetime": 1713100049.0,
"flair": "Career Discussion",
"title": "Asking for a promotion too soon?",
"score": 2,
"comment counts": 36,
"content": "Hi all,\n\nSome context for the situation,\nI\u2019m a career switcher (sales-> tech staffing AM -> self employed personal trainer -> Data Scientist )\n\nI will be graduating with my MS-DS next month (May) and have been working full time as a DS since October 23. (With a 3 month internship at the same company prior to coming on full time).\n\nI was brought in as entry level (BS and no prior experience required)\n\nI was pulled into a project that ended up being multifaceted and find that I am collaborating with a principal on how to tackle the problems we face, solutions we build, and deadlines for each piece etc. We divide the work pretty evenly and are responsible for our respective deliverables.\n\nOur lead suggests that I need to make the case now for a promotion given what we\u2019ve done thus far in terms of work load, deliverables etc \nAnd while I \u201cfeel\u201d like that makes sense my brain says \u201cI\u2019ve only been here 6 mths\u201d.\n\nWhen is it too soon? I was planning on letting the project wrap up and presenting my progress, and value add to the company along with including that I am now an MS grad but wanted to see what the more experienced folks have to share\n\nTIA!",
"comments": [
"If your lead is suggesting that you ask for a promotion, then you don't need really any more hints from the internet. Go for it!!",
"Well: there\u2019s two schools of thought here: \n1. Your idea is great, I think that\u2019s how most people do it.\n2. Does your lead play any part in your promotion Process? If they do and they want you to make the case for a promotion, DO SO. They have been there probably longer than you and know how that company and their promotion processes work. Definitely still talk about value added and your improved education, but the timeline can be moved up. Ask your lead for advice if they\u2019re willing to give it.",
"I\u2019m assuming that your lead is of the opinion that what you are doing exceeds that of an \u201centry level\u201d data scientist. My advice: compare your work to that of what an entry level data scientist is and use that to motivate your promotion. If you find that your work is entry level, than you needn\u2019t push for a promotion, if after your investigation you find that your work is definitely not that of an entry level, then the doubt you have of whether to ask for a promotion should be gone.",
"what a amazing career line up",
"How long until the project finishes?"
]
},
{
"id": "1c3wgfp",
"datetime": 1713109251.0,
"flair": "Discussion",
"title": "Hiring managers - how do you measure research capabilities and coding good practices on an interview?",
"score": 0,
"comment counts": 5,
"content": "So in the next couple of weeks I am going to interview several candidates to our department. Our work is divided between ml development, feature engineering and ideation and customer facing. I don\u2019t need the candidates to excel in all of these, but if I need to choose the first two are the most important aspects. \nHow do you evaluate candidates in your organization for these traits? I am looking for a candidate that has some kind of a programing skills or with some demonstrated ability to write production level code, and with initiative and research capabilities, not necessarily a PhD but some past experience. ",
"comments": [
"Personally I tend to ask more theoretical type questions about coding. I.e why or when would you prefer X pattern over Y pattern, or give an example of a recent problem your team faced at work and get their thoughts on how they would've that same issue. If you're really just interested in coding practices, a question I tend to ask is \"what does good code look like to you?\" And based on their answers dig a little bit deeper. For feature engineering type questions you can give a quick rundown of a problem and some variables you have for the model already, and then ask them what other factors they'd consider. Usually you're looking for creativity and whether the candidate asks clarifying questions and tries to really understand what the problem is.",
"Use case analysis, business req and communication",
"I usually don\u2019t test on coding because its not a need in my organization all that much. Usually I ask for git repo to see what their style is. I am more interested in thinking process and usually ask them about how they\u2019d go about solving situational problems. I\u2019d also dig into why they would prefer one approach over another and what benefits they see in each of the approaches.\u00a0"
]
},
{
"id": "1c3jupb",
"datetime": 1713064988.0,
"flair": "Discussion",
"title": "T test vs z test",
"score": 9,
"comment counts": 11,
"content": "Why dont we use t tests for comparing proportions of two populations? I have tried digging into this but could not understand yet. ",
"comments": [
"Both are merely approximations, but the logic behind using the z-test is that, for proportions, the standard deviation is a function of the mean, and so the population standard deviation is known under the null hypothesis (i.e. it is not estimated from the sample).",
"A t test will work given sufficient sample by treating it as a comparison of means with a 0/1 variable. However, in this case the z test takes advantage of the fact that the SD is known in the proportion case and so is a higher powered test.",
"[deleted]",
"We aren't in the 1950's. No reason to use an approximate test when exact tests are available.",
"I just wrote a great article about data science and claritied two concepts T test and z-test . [HOCTHUE.NET: Khoa h\u1ecdc d\u1eef li\u1ec7u trong l\u0129nh v\u1ef1c kinh t\u1ebf.](https://www.hocthue.net/khoa-hoc-du-lieu)"
]
},
{
"id": "1c2yl0i",
"datetime": 1713003212.0,
"flair": "Discussion",
"title": "What field/skill in data science do you think cannot be replaced by AI?",
"score": 130,
"comment counts": 153,
"content": "Title.",
"comments": [
"Talking to people and figuring out what they actually want",
"The part that is related to responsibility and decision-making.",
"\u00a0Dealing with messy data.",
"Overseeing the whole process. Helping a stake holder understand what their problem is and what kind of solution could help them. Not just cleaning your data, but figuring of what data you could use or where it is even coming from. Helping with deployment so that the model is used in a way that helps the organization",
"I don't know if this falls under data science, but garbage and garbage out, so making sure\n\nthe AI is being trained on the right data,\n\nthe right data is being collected\n\nThe right data is available for the AI.... Managing the training loop...\n\nThat's where data roles should focus. Not trying to compete with AI....\n\nalso KNOWLEDGE GRAPHS, soon it'll be the hottest things in Data."
]
},
{
"id": "1c3pdcz",
"datetime": 1713086069.0,
"flair": "Challenges",
"title": "Looking for team memebers for CV kaggle challenge ",
"score": 1,
"comment counts": 2,
"content": "Hey! I am looking for teammates for image-matching-challenge-2024. Please do reach out if you have prior CV experience. \n\nMy Profile: Masters in data science, top kaggle achievement: finished top 8% in llm-detect-ai-generated-text challenge. I have NLP experience, want to build CV experience. Most comfortable in pytorch.",
"comments": [
"I've CV experience. We can connect. MSc student as well and decent kaggle experience."
]
},
{
"id": "1c30flh",
"datetime": 1713010100.0,
"flair": "Discussion",
"title": "Enhancing Weather Forecast Accuracy: Exploring Regression Models with Multi-source Data Integration",
"score": 23,
"comment counts": 18,
"content": "I am currently working as a data scientist at a new energy startup, mainly responsible for predicting photovoltaic power generation every 15 minutes for the next day. The key data relied upon are weather forecasts, especially the predicted solar irradiance values. Currently, we have data from five numerical weather forecasts, which include fields such as irradiance, temperature, and humidity. The accuracy of the forecasts varies among different data sources, and there are certain discrepancies with the actual weather. I am considering merging the five sets of data to obtain a more accurate weather forecast. Can I use a regression model to fit the actual weather using the five sets of weather forecast data? Is there a better method available?\n\nBtw, the weather forecast datasets appears as follows:\n\nhttps://preview.redd.it/pzvu110o0duc1.jpg?width=2298&format=pjpg&auto=webp&s=c5f45c31b5c683a09f8609de0077b0be06d1808a",
"comments": [
"you'd want to understand the shortcomings of existing modelling methods before going down this road. you're essentially trying to predict the types of mistakes those models make, rather than predict the weather. I'd guess regression models can be used effectively for at least a component of a good model, but you're going to have to do a lot of work to handle geo data and distance smoothings, and you'll probably need to classify the types of errors in existing models to define targets for individual regression models. Ultimately though you can define the problem space quite naturally in tabular datasets, so I'd expect regression or tree models to work well.",
"I think you should consider a hierarchical Bayesian model. You have spatio temporal data that needs to be handled and a granularity of space that requires aggregation (thinking square acre vs county what was happening at that moment and how should the estimate aggregate)\n\nAlso there may be room for a convolutional recurrent neural network. So convolutions on the vector of each dataset over the two 2d layer (thinking spherical coordinates with fixed radius) and then on top of this network the time axis through say a lstm (maybe encoder to decoder)\n\nI think I saw some cool papers about generative networks that has some promising weather predictions that looked better",
"You should fit each model separately prior to any ensemble averaging. Essentially, you are tackling model output statistics to develop corrections that are model-dependent. These approaches have been used extensively and nonlinear AI/ML based approaches are certainly capable of developing corrections as well and being implemented more often. At the hyper local scales you are likely interested in, there are likely many nonlinearities/state-dependencies that are important. Each model will have unique error characteristics tied to its own model physics and those inherited from boundaries conditions of driving global models (assuming you are using regional/limited-area models). Training using the ensemble mean is also possible but it\u2019s likely those unique aggregate errors from the ensemble able harder to disentangle from shorter hindcast records. And, if a single model ever changes/updates/drops out/erc., then the underlying distribution of errors of the ensemble mean also changes and would no longer be consistent with your calibrated model.",
"Interesting!",
"Most of the irradiance actuals from different weather providers is actually a simulated series (not actual observation). You want to be sure of this as blending them together will not lead to good results. Better way would be to build individual models and ensemble them together (like xgboost using each weather vendor and then ensemble using regression etc)."
]
},
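A minimal sketch (with hypothetical file and column names) of the blending idea raised in the post above: regress the observed irradiance on the five vendor forecasts with a time-aware cross-validation, then use the fitted weights to blend future forecasts. Per-vendor bias correction first, as one comment suggests, would be a reasonable refinement.

import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

df = pd.read_csv("forecasts_vs_actuals.csv")                 # hypothetical file
vendor_cols = ["ghi_vendor1", "ghi_vendor2", "ghi_vendor3",  # hypothetical column names
               "ghi_vendor4", "ghi_vendor5"]
X, y = df[vendor_cols], df["ghi_observed"]

model = Ridge(alpha=1.0)
cv = TimeSeriesSplit(n_splits=5)                 # respect time ordering when validating
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
print(-scores.mean())                            # blended-forecast error estimate

model.fit(X, y)
print(dict(zip(vendor_cols, model.coef_)))       # per-vendor blending weights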
{
"id": "1c35bvi",
"datetime": 1713024179.0,
"flair": "Discussion",
"title": "FNN to predict improper vouchers. ",
"score": 12,
"comment counts": 19,
"content": "I am an auditor for a state agency, we audit payments the state makes every day to find improper voucher. \n\nWe get about 30,000 vouchers a day so obviously we can\u2019t audit all of them. So we set up certain risks associated with vouchers to try and better find improper payments. And sometime we have filters for payments that meet certain criteria that must get audited. \n\nHowever, our risk based design doesn\u2019t really work, it\u2019s just a chance of whether or not the vouchers selected for audited are improper or not. I don\u2019t believe we have any better outcome that just randomly selecting a voucher everyday. \n\nIt just depends on the risks the auditors look for and how well they look at it. However, I am trying to create a statistical model to find these improper vouchers based on these risks. \n\nAs opposed to what some auditor thinks is the best risk, the model can look at all these risks and see how they interact and if there is some pattern. \n\nAdditionally, a lot of these risks have some arbitrary cut off date. For example, we might have a risk saying the specific vendor hasn\u2019t been audited in over a year. That\u2019s considered risky, however, a voucher that misses that by one year wouldn\u2019t be rated as risky. \n\nSo doing this we can turn some categorical variables into continuous variables. \n\nThe data set as of now is about 600,000 vouchers that have been audited over a ten year span. Currently about 8% of them have been rejected. But not all of the rejected ones were necessarily bad. We have two classes non compliance and saving. Savings are when the money is not due or at least some of it, bad math on the invoice, incorrect charges and so one. While non compliance don\u2019t really save any money it\u2019s just some account error, maybe they paid from the wrong funds, referenced the wrong contract or something. It\u2019s gonna mess up the accounting system but not really save any money. About 20% of rejected vouchers have saving and 80% are non compliant. \n\nObviously our goal is to identify vouchers that yield a saving. Even if we had a model that can predict all the improper ones, we just don\u2019t have the resource to audit all of them. \n\nSo my thoughts were to create a model fine tuned to have low false positive. Basically I would have a penalizing model for instances of an okay voucher being marked as improper. \n\nObviously we\u2019d miss some improper vouchers from that but we also don\u2019t have the resource to audit them all anyway so my thought is this would allow us to focus on those that might be improper. \n\nJust wondering if you guys have. Any thoughts on this. ",
"comments": [
"Have you considered framing as anomaly detection?",
"What was your idea behind choosing FNN?",
"Standard boosted decision tree methods should be more suitable instead of using neural networks.",
"What programing languages did you learn for your job?",
"What is your input data form? Text?\n\nHave you looked at huggingface models?"
]
},
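A minimal sketch (synthetic data, assumed precision target) of the "tune for low false positives" idea in the post above: train a gradient-boosted classifier on the historical audit outcomes, then pick a score threshold that meets a target precision so the limited audit capacity is spent mostly on vouchers likely to be improper.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the voucher data: ~8% positives, as in the post.
X, y = make_classification(n_samples=60_000, weights=[0.92], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = HistGradientBoostingClassifier().fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_te, scores)
target_precision = 0.90                     # assumed audit-capacity trade-off
ok = precision[:-1] >= target_precision
threshold = thresholds[ok][0] if ok.any() else 0.5
flag_for_audit = scores >= threshold        # small, high-precision work queue
print(threshold, flag_for_audit.sum())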
{
"id": "1c3vdii",
"datetime": 1713106354.0,
"flair": "Career Discussion",
"title": "What would you do in my shoes?",
"score": 0,
"comment counts": 2,
"content": "People of datascience, if you were in my shoes and the goal was to get a job asap in machine learning engineering role (preferably NLP), how would you plan a few months ahead and in what time frame would you meet that goal. \n\nMy shoes: [https://www.dropbox.com/scl/fi/k0ruhu6wnri4phfiniuri/Resume-Censored.pdf?rlkey=zq46ltuvu4xjxtrn68d8zyd01&dl=0](https://www.dropbox.com/scl/fi/k0ruhu6wnri4phfiniuri/Resume-Censored.pdf?rlkey=zq46ltuvu4xjxtrn68d8zyd01&dl=0) \n(this is enough to give you an idea of what my shoes look like) \n\nGoal is to get a job asap, as mentioned above. My university is online so I have to spend about to 2hrs daily to complete my lectures, and prepare for assignments and quizes. I spend about 8-12 hrs daily, divided for learning ML (currently taking CS224n lectures) and working at my internship (fully-remote). I live in Kuwait and the ML market here is almost non-existent. I can move back to Pakistan (home country) and get a full-time job there where the market is not very mature but there is work for engineering (R&D near non-existent). Best case scenario would be to get a fully-remote job in any part of the world possible. My salary expectations are not much, I can work full-time job if it can pay me $500/mo atleast because that would cover my needs and then with that I can focus on my studies and ML specifically as I want to get into masters or phd after bachelors. That's my long-term goal to get into academic research. \n\nSo for now, getting a job to cover my expenses is a priority and a short term goal. How would you then plan next few months to meet that goal, if possible?",
"comments": [
"Fully remote jobs are nearly impossible to get right now, especially junior roles. I wouldn\u2019t bank on it",
"Contact staffing agencies to get contract work. That can be a fast way to get into the right type of work and eventually the right company. They do all the work of finding jobs for you but you will not be able to be picky about the opportunities they give you. You probably wont have great benefits or make a ton of money. But, in the matter of months or a few years, you will have enough experience to get a great paying job at a great company, full time. Often big, attractive companies will hire contractors, so you will get real experience and, perhaps more importantly, the opportunity to build relationships with people at that company. It could lead to a full time job with luck.\n\nGood luck"
]
},
{
"id": "1c33azw",
"datetime": 1713018702.0,
"flair": "Career Discussion",
"title": "Where do you guys apply for jobs in uk?",
"score": 11,
"comment counts": 6,
"content": "I\u2019ve been using LinkedIn but haven\u2019t got much success, I\u2019m not sure if it\u2019s because I\u2019m unqualified (BSc from top 20 unis with 2ye), the markets tough or if I\u2019m on the wrong site. \n\nWhere do you guys apply for roles? Im based in London currently. \n\nI tried going to networking events, I attended big data London last year but the only people I met were trying to sell me storage solutions. Are there any networking events you\u2019d recommend? ",
"comments": [
"Indeed but there's alot of scams on there",
"I live in the UK. IT, software roles have been reduced by up to 40%. Junior roles difficult time.",
"Otta is great for jobs in tech, much better than Linkedin",
"Linkedin is not too bad. You might need to revamp your profile. Add the skills you've used in all your jobs and BS. Showcase a few personal projects if you've done (they could be from your BS, or ones you may have done in your free time). If you've contributed to open source, highlight that, and also provide a link to your github if you've got something to show there.\nRecruiters are suckers for keywords and jargons so make sure you use them effectively in your skills and work ex.",
"Yeah big data London is a place for listening to talks and listening to sales pitches\ud83d\ude05\n\nAre there any local meetups around data? I don't know where you are, but in my area these events are on meetup.com, or some monthly events organized by local data people."
]
},
{
"id": "1c2mqav",
"datetime": 1712962195.0,
"flair": "Discussion",
"title": "XGBoost Please help",
"score": 91,
"comment counts": 61,
"content": "\u2060I am trying to train an xgboost model that estimates stock price. I don't think there is anything wrong with the data format, and I am using gridsearch to find the likely hyperparameters.... But this is what I get for the estimation. except for a small area, the graph is flat...! Why is this happening?\n\nhttps://preview.redd.it/uzkt8fqxn4uc1.png?width=1954&format=png&auto=webp&s=b76f84d7910768b33692d0b9ee9d144f5e951984\n\nhttps://preview.redd.it/no2qddf0o4uc1.png?width=1267&format=png&auto=webp&s=dbd1e1b51b0b87949046e27f1976967e11caea06",
"comments": [
"The most plausible reason is that the max value of y\\_train is less than 42. Tree-based algorithms, like XGBoost, can only interpolate, not extrapolate.",
"Please tell us you're forecasting return and not stock price",
"I would try the skforecast library. It handles time series with regression techniques better. \n\nDo you have a GitHub link for this? It\u2019s tough to tell what the problem is from this. Seems like a data cleaning/structure issue from here, not an xgboost problem.",
"You\u2019ll want to use walk forward validation at the very least when forecasting time series. Not simple train/test.",
"This looks like an extrapolation problem. Tree based models are known to not be able to extrapolate on data outside the expected ranges. There are a ton of resources on this you can find. Here's an example that shows the problem and some other models you can use:\n\nhttps://www.kaggle.com/code/carlmcbrideellis/extrapolation-do-not-stray-out-of-the-forest\n\nNot sure what happened in your case exactly but it was probably something like that.\n\nBTW predicting stock prices is a difficult problem so you are likely going to struggle a bit. I think it's best to start with some time series forecasting libraries like skforecast or AWS forecast. \n\nThere's also libraries like this for more advanced models:\n\nhttps://unit8co.github.io/darts/\n\nhttps://nixtlaverse.nixtla.io/\n\nI remember reading this article which goes over the state of the art which I thought was pretty good too.\n\nhttps://mangodata.io/blog-post/forecasting"
]
},
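A minimal sketch (synthetic data, using scikit-learn's gradient boosting as a stand-in for XGBoost) of the extrapolation issue described in the top comments above: tree ensembles predict a roughly constant value outside the range of the training target, while a linear model continues the trend.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = np.arange(0, 100, 1.0).reshape(-1, 1)
y_train = 2.0 * X_train.ravel() + rng.normal(0, 5, size=100)   # upward trend

X_future = np.arange(100, 120, 1.0).reshape(-1, 1)             # outside training range

tree = GradientBoostingRegressor().fit(X_train, y_train)
line = LinearRegression().fit(X_train, y_train)

print(tree.predict(X_future)[:5])   # roughly flat, near max(y_train)
print(line.predict(X_future)[:5])   # continues the trend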
{
"id": "1c2tz99",
"datetime": 1712984546.0,
"flair": "ML",
"title": "Predicting successful pharma drug launch",
"score": 10,
"comment counts": 20,
"content": "I have a dataset with monthly metrics tracking the launch of various pharmaceutical drugs. There are several different drugs and treatment areas in the dataset, grouped by the lifecycle month. For example:\n\n\n\n|Drug|Treatment Area|Month|Drug Awareness (1-10)|Market Share (%)|\n|:-|:-|:-|:-|:-|\n|XYZ|Psoriasis|1|2|.05|\n|XYZ|Psoriasis|2|3|.07|\n|XYZ|Psoriasis|3|5|.12|\n|XYZ|Psoriasis|...|...|...|\n|XYZ|Psoriasis|18|6|.24|\n|ABC|Psoriasis|1|1|.02|\n|ABC|Psoriasis|2|3|.05|\n|ABC|Psoriasis|3|4|.09|\n|ABC|Psoriasis|...|...|...|\n|ABC|Psoriasis|18|5|.20|\n|ABC|Dermatitis|1|7|.20|\n|ABC|Dermatitis|2|7|.22|\n|ABC|Dermatitis|3|8|.24|\n\n* Drugs XYZ and ABC may have been launched years apart, but we are tracking the month relative to launch date. E.g. month 1 is always the first month after launch.\n* Drug XYZ might be prescribed for several treatment areas, so has different metric values for each treatment area (e.g. a drug might treat psoriasis & dermatitis)\n* A metric like \"Drug awareness\" is the to-date cumulative average rating based on a survey of doctors. There are several 10-point Likert scale metrics like this\n* The target variable is \"Market Share (%)\" which is the % of eligible patients using the drug\n* A full launch cycle is 18 months, so we have some drugs that have undergone the full 18-month cycle can that be used for training, and some drugs that are currently in launch that we are trying to predict success for.\n\nThus, a \"good\" launch is when a drug ultimately captures a significant portion of eligible market share. While this is somewhat subjective what \"significant\" means, let's assume I want to set thresholds like 50% of market share eventually captured.\n\nQuestions:\n\n1. Should I model a time-series and try to predict the future market share?\n2. Or should I use classification to predict the chance the drug will eventually reach a certain market share (e.g. 50%)?\n\nMy problem with classification is the difficulty in incorporating the evolution of the metrics over time, so I feel like time-series is perfect for this.\n\nHowever, my problem with time-series is that we aren't looking at a single entity's trend--it's a trend of several different drugs launched at different times that may have been successful or not. Maybe I can filter to only successful launches and train off that time-series trend, but I would probably significantly reduce my sample size.\n\nAny ideas would be greatly appreciated!\n\n",
"comments": [
"Why not make it more easy for you and for the world to understand by just predicting the absolute sales, and make a separate prediction by market? That seems like a way more solid approach than to have relative targets going in all directions over a time window that is also made relative.\n\nRegarding your question: \nIt seems you are trying to predict a Y variable here that is relative to the other candidates. I think there are some big challenges in setting this up as a time series like this if you don't have extra data to ungroup it into a 'normal' format where you know the start date. But it is definitely a problem that could hugely benefit from being a time series (including seasonality is one), so I would spend extra time data engineering to combat the problem: that your target is relative to other drugs but your time variable is also relative to some arbitrary beginning. \n\nAlso keep in mind: \nLooking at this problem with common sense I would say your problem is going to be very likely that the correlation might not be very strong. Or it might be strong because of a latent variable, which is quite dangerous. To give a small example: let's say the quality of the drug against a certain disease (not possible to predict/capture in data) leads to it bought a lot (market share), and the fact that it's bought a lot leads to \"market awareness\". Then the Marketing team will spend a lot of money on marketing while there's actually nobody looking at the ads.",
"I would use classification and model uplift + a cutoff value for what is deemed \u201csuccessful\u201d",
"Given all you have is the market share and this drug awareness score, just build a simple curve-fitting model that links the two and call it a day.",
"You could cluster all of the drugs you\u2019re looking at, then run a time series analysis on the cluster(s) that you\u2019ve determined represent a successful launch.",
"Are there drugs that succeded?\n\nIf yes, how about a multitimeseries Analysis?"
]
},
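A minimal sketch (toy numbers) of the curve-fitting suggestion in the comments above: fit a saturating growth curve to the months observed so far for an in-flight launch, then project the share at month 18 and compare it against the "successful launch" threshold.

import numpy as np
from scipy.optimize import curve_fit

def logistic(t, cap, rate, midpoint):
    # Saturating growth curve: share approaches `cap` as the launch matures.
    return cap / (1.0 + np.exp(-rate * (t - midpoint)))

months = np.array([1, 2, 3, 4, 5, 6], dtype=float)       # hypothetical observed months
share = np.array([0.05, 0.07, 0.12, 0.15, 0.18, 0.21])   # hypothetical market share

params, _ = curve_fit(logistic, months, share, p0=[0.5, 0.5, 6.0], maxfev=10000)
projected_month_18 = logistic(18.0, *params)
print(projected_month_18 >= 0.50)   # did the launch clear the 50% threshold?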
{
"id": "1c32kuk",
"datetime": 1713016671.0,
"flair": "Statistics",
"title": "Looking for a decision-making framework ",
"score": 2,
"comment counts": 16,
"content": "I'm a data analyst working for a loan lender/servicer startup. I'm the first statistician they hired for a loan servicing department and I think I might be reinventing a wheel here.\n\nThe most common problem at my work is asking \"we do X to make a borrower perform better. Should we be doing that?\"\n\nFor example when a borrower stops paying, we deliver a letter to their property. I performed a randomized A/B test and checked if such action significantly lowers a probability of a default using a two-sample binomial test. I also used Bayesian hypothesis testing for some similar problems.\n\nHowever, this problem gets more complicated. For example, say we have four different campaigns to prevent the default, happening at various stages of delinquency and we want to learn about the effectiveness of each of these four strategies. The effectiveness of the last (fourth) campaign could be underestimated, because the current effect is conditional on the previous three strategies not driving any payments.\n\nAdditionally, I think I'm asking a wrong question most of the time. I don't think it's essential to know if experimental group performs better than control at alpha=0.05. It's rather the opposite: we are 95% certain that a campaign is *not* cost-effective and should be retired? The rough prior here is \"doing something is very likely better than doing nothing \"\n\nAs another example, I tested gift cards in the past for some campaigns: \"if you take action A you will get a gift card for that.\" I run A/B testing again. I assumed that in order to increase the cost-effectives of such gift card campaign, it's essential to make this offer time-constrained, because the more time a client gets, the more likely they become to take a desired action spontaneously, independently from the gift card incentive. So we pay for something the clients would have done anyway. Is my thinking right? Should the campaign be introduced permanently only if the test shows that we are 95% certain that the experimental group is more cost-effective than the control? Or is it enough to be just 51% certain? In other words, isn't the classical frequentist 0.05 threshold too conservative for practical business decisions?\n\n\n1. Am I even asking the right questions here?\n2. Is there a widely used framework for such problem of testing sequential treatments and their cost-effectivess? How to randomize the groups, given that applying the next treatment depends on the previous treatment not being effective? Maybe I don't even need control groups, just a huge logistic regression model to eliminate the impact of the covariates?\n3. Should I be 95% certain we are doing good or 95% certain we are doing bad (smells frequentist) or just 51% certain (smells bayesian) to take an action?",
"comments": [
"I am just trying to decipher what exactly are you trying to do. Retain borrower? Reduce default rates?",
"My company build a custom deployment framework because there\u2019s just not bespoke stuff like that out there. I run this software not. You\u2019d create tremendous value if you can create this yourself. And you most certainly will have to do that. It sounds like a custom service.\n\nI don\u2019t know your framework to office advice so if you told me about your deployment environment and tools I might have better advice. Like, what software, how are you pulling data, what type of dbs, what do you currently do to handle the sequential testing?\n\nIn terms of models, it sounds like hierarchical or mixed models, but Bayesian seems most appropriate (I have no experience there unfortunately).",
"Sounds like causal analysis will help. Assuming there is some variation in the data. Some go thru treatment 1, some go through treatment 1 and 2 ....then you can compute the treatment effect, or the lack of.",
"Interesting",
"Dynamic programming If you can model the problem as a sequential decision-making process with known transition probabilities, dynamic programming can be used to find the optimal sequence of interventions that maximizes the expected cumulative reward or minimizes the expected cumulative cost."
]
},
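A minimal sketch (made-up counts and an assumed break-even lift) of one Bayesian reading of the gift-card question in the post above: Beta posteriors for each arm, the posterior probability that the treatment beats control at all (the "51%" framing), and the probability it beats control by enough to cover the incentive cost.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outcomes: conversions / contacts in each arm.
ctrl_conv, ctrl_n = 180, 2000
trt_conv, trt_n = 230, 2000
min_lift = 0.01            # assumed lift needed before the gift card pays for itself

# Beta(1, 1) prior updated with the observed counts, sampled for each arm.
ctrl = rng.beta(1 + ctrl_conv, 1 + ctrl_n - ctrl_conv, size=100_000)
trt = rng.beta(1 + trt_conv, 1 + trt_n - trt_conv, size=100_000)

print((trt > ctrl).mean())             # P(treatment better at all)
print((trt - ctrl > min_lift).mean())  # P(better by enough to be cost-effective)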
{
"id": "1c1vciv",
"datetime": 1712881995.0,
"flair": "Career Discussion",
"title": "What realistically will be automated in the next 5 years for data scientists / ML engineers? Plus would love some career advice",
"score": 174,
"comment counts": 138,
"content": "Recently I\u2019ve been job hunting and have hit the sad realization that I\u2019ll have to take a salary cut if I want to work for a company with good ML practices. I have a lot of student loans from master\u2019s program. \n\nI\u2019ve been trying to keep up with LLM coding automations and software automators. It\u2019s all beginning to seriously make me anxious but I think the probability I\u2019m overreacting is at least 50%.\n\nHow much of a data scientist\u2019s job do you think will be completely automated? Do you think we (recent master\u2019s graduates with lots of debt) made the wrong choice? What areas can I strengthen to begin to future proof myself? Should I just chill out and just be ready to learn and adapt continuously?\n\nMy thinking is that I want to do more ML engineering or ML infra engineering even though right now I\u2019m just a data scientist. It feels like this career path will pay off my loans, have some security, and also is better than dealing with business stakeholders sometimes. \n\nI am considering taking a bad pay cut to do more sophisticated ML where I\u2019ll be building more scalable models and dealing with models in production. My thought process is this is the path to ML engineer. However my anxiety is terrifying me. Should I just not take the pay cut and continue to pay off loans + wait for a new opportunity? I fear the longer I wait, the worse my skills at a bad company become. Also would rather take a pay hit now and not in 1 year.\n\n My fear with taking pay cut is that I\u2019ll be broke for a year and then in another year automations and coding bots might really become sophisticated. \n\nAnyways, if anyone\u2019s knowledgeable would love to chat. This market and my loans are the most depressing realization ever\n\n",
"comments": [
"Here's what skills I am trying to develop as an AI/ML engineer:\n\n- actually knowing the models well. I.e being able to pick the right model for the right job.\n\n- efficiently hosting models on the cloud.\n\n- trust, safety, and explainability in models.",
"Although a cloud practitioner is not a data scientist, and I am certainly not a cloud practitioner, the cloud practitioner stuff will be tough to automate. Knowing *what to do* on a more big-picture level will be tough to hand over to LLMs. Arguably point-and-click driven tools will continue to be popular, and they will continue to need a human in the loop.",
"I\u2019ve been in industry at this point for a long time. I\u2019ve seen many cycles where layoffs have occurred. I largely have been isolated from that and got my first layoff ever in my career after working nearly 15 years. \n\nI was working with as a subcontractor/AWS ProServe Partner and eventually all the low side work was completed and I was awaiting to onboard for new project. There were some issues with my clearance at that time, which was beyond my control. Ultimately, it lead the company to let me go without severance.\n\nIt actually fucked with my head a lot. I have basically always done great work delivering for my clients and always aimed to do the right thing. And now I was being faced with unemployment first the time in my life at 36 years old, expecting my first child in June\u201924. I couldn\u2019t even let go that I know it wasn\u2019t personal just business.\n\nIt felt very personal, largely cause it happened to me. So I can understand your fear and I can understand why you don\u2019t feel secure with the developments you are seeing through automation. \n\nI\u2019ve been automating things with code since 2004 and there may come a time when I am automated out of a job, but I don\u2019t think so.\n\nI have read a book by Cal Newport called [\u201cSo good they can\u2019t ignore you\u201d](https://amzn.to/3VZU91h). This book completely changed my life and is the single best book I recommend anybody starting out with data science. Is the book about data science, no? Will it teach you to be a data scientist, no? It however will teach you how to pursue mastery in your craft as a way to build autonomy and security. \n\nFor you, I recommend you do like I did. \n\nI started as a DBA - Database Analyst, I learned everything I could about SQL, data warehousing, ETL, etc. And every two to five years, I take on a new technology to go hard one. Doing this you will accumulate so much experience and your skill set will be broad, but you also want to go deep in one or two areas. The two areas that I went deep on was programming and cybersecurity to couple with my DBA and Data Science skills. Cybersecurity is obviously too broad, so I focused on learning all about Security, Pen Testing and Reverse Engineering/Malware Analysis. \n\nIf you start now, I Cant guarantee you won\u2019t be unemployed for some time. So I must recommend for you to also get on a plan for paying off debt, setting out an emergency fund and planning for retirement. I started reading books at debt free living and reverse budgets. I eventually found Dave Ramsey\u2019s [\u201cTotal Money Makeover\u201d](https://www.amazon.com/dp/1595555277/ref=cm_sw_r_as_gl_api_gl_i_DWRC5XQ3ZDJ0W7NF7B6N?linkCode=ml2&tag=insightsthrou-20). I don\u2019t agree with everything that Dave says but I agree that having no debt (except the house) and emergency fund feels great.\n\nI was able to start a new job within two weeks after being let go. Some would say that is luck, some say it was good timing, etc. All I know is that I am going to keep refining my skills throughout my entire career because there is value and compound interest in acquiring the skills. \n\nI recommend you couple your data science with programming and some other skill. You don\u2019t need to be the smartest person in the room. You just need to be able to bring value to a team.\n\nWhat are skills required for the future idk, I just know that Cloud Computing is super popular right now, and I was lucky to get into about 3-4 years ago. 
I completed almost all the AWS certs now. Does that mean I won\u2019t be let go, absolutely not! But it does provide me with a wide range of skills which I can bring to bear on solving problems in code and for clients. And if you can do that you\u2019ll fine.\n\nDon\u2019t let your thoughts get in the way of you taking actions, rather use them a barometer to understand yourself and during the process of inquiry. As yourself questions about why you feel this way\u2026 don\u2019t bullshit yourself be honest. If it\u2019s a gap in skills or lack of productivity because you\u2019re more junior then own that and make yourself better. Keep moving forward brother.",
"Soft skills + agency >>> general dev skills > ML knowledge > everything else\n\nBeing able to talk to stakeholders and deliver a project from start to finish is what matters to people. The technical details of how only matters to you (I still like communicating it to set expectations).\n\nOftentimes delivery speeds matter more because it lets you iterate over the whole problem space and get good back and forths about wtf the project actually needs to do (hint it's rarely what it starts off as). \n\nI'm a fan of just throwing the best language models at a problem first, then breaking it down and optimising individual steps with smaller simpler models if it's worth my time.\n\n(NLP domain, your mileage may vary)",
"Most of the current AI stuff is hype. LLMs are not all that useful on their own, the hype will blow over and most of these AI jobs will disappear once companies realize that there is no return on investment."
]
},
{
"id": "1c25koq",
"datetime": 1712917046.0,
"flair": "Discussion",
"title": "What's next for the quintessential DS role?",
"score": 25,
"comment counts": 14,
"content": "This post is multiple questions wrapped into a single topic kind of thing which is why I thought best to keep it as an open-ended discussion.\n\n \n**Q1.** When I see recent DS job postings a majority now have these two added requirements: 1. Some knowledge of LLMs. 2. Experience in NLP. I'm not sure if this is just biased based on what LinkedIn algorithm is showing me. But is this the direction that the average DS role is headed? I've always considered myself as a jack of all trades, flexible DS, but with no expertise is any technical vertical. Is the demand for the general data scientist role diminishing? \n \n**Q2.** In my 5 years of experience as a DS I've worked on descriptive analytics, predictive modelling, dash-boarding in consulting and product alike. Now, 5 years isn't that much time, but it's not too short either. I'm now finding myself working on similar types of problems (churn, risk, forecasting) and similar tools and workflows. This is not a complaint by any means, it is expected. But this got me thinking... Are there new tools and workflows out there that might enhance my current working setup? For example: I sometimes find myself struggling to manage code for different variations of datasets used for different model versions. After loads of experimentation my directory is a mess. I'd love to know tools and workflows you use for typical DS problems. \n \nHere's mine: \ncode/notebook editor: VScode \nversioning: git/github \narchiving & comparing models: MLFlow \\[local only within project context\\] \nhyperparameter optimisation: Optuna \ninference endpoint deployment: fastapi \nconvey results and progress: good ol' excel and powerpoint :p\n\n",
"comments": [
"For point 1. \nI think this is just what teams put into requirements as a lot of roles don\u2019t really require knowledge of LLMs or Neural Net in general. Most of the value is still derived from supervised learning. However having these on resume as skills does show that you are constantly learning and know about various advancements in the field you work in. \nAt least this is how I look at it. After all, Attention is all you need! \n\n2. For your second point - I think your working set up is very good. I don\u2019t use MLflow and Optuna",
"Optuna is the future also dude your work is top class",
"Data versioning is a thing, if it\u2019s the same dataset changing over time that is causing the problem. then you change the training code to select which commit essentially it should use. For example, DVC is one well established one, or delta tables are another option.\n\nSounds a good set up to me for the jobs you specify. I\u2019d suggest making sure you\u2019re getting the most out of each of them, possibly looking to customise. For example, do you have pre commits set up, do you need custom hooks for git, etc.\n\nAdditionally maybe looking at deployment pain points could find you ways to add new tools. Would containerising with docker help? Are you reusing software efficiently? I agree with the other point where engineering and deployment is a bigger part now.\n\n\nOn the LLM point I wouldn\u2019t be surprised if most companies have internal pressure to deploy something in this area based on the current hype cycle, so are including it in job adverts. Having somebody able to do this quickly is going to be a benefit - but I\u2019d naively imagine most are just using pre trained models or an API and still value flexibility (it is SOME experience in NLP they ask for, after all ).\n \nif someone wanted the LLM experience to tick a box it really wouldn\u2019t take long nowadays given how accessible they are and fits with the jack of all trades approach. I\u2019d be trying to get a project in the common deep learning areas - vision, NLP, decision making with RL, ++",
"Interesting question. Not necessarily NLP; that\u2019s just an easy thing to throw out there and they may never ever need you to actually leverage any kind of NLP. I would say something I\u2019ve noticed is that as a data scientist machine learning engineer (or whatever flavor of title your company has given you), expect to know data engineering methods and techniques and how to implement them more than you already know or have learned in school. I feel like the data scientist and data engineering roles have become more and more blended.\u00a0"
]
},
{
"id": "1c1y7yu",
"datetime": 1712890054.0,
"flair": "Discussion",
"title": "Am I glorifying ML research roles? ",
"score": 60,
"comment counts": 183,
"content": "I think it\u2019s safe to say I\u2019ve constantly been \u201cfantasizing\u201d about getting a job as a ML researcher. Not traditional data science. I\u2019m talking one of the jobs that involve cutting edge modeling like https://www.microsoft.com/en-us/research/careers/.\n\nTo me it sounds like the dream DS job. Read papers on ML, implement them in software, and get paid a lot to do it. \n\nThe thing is though, with an MS in Stats it\u2019s safe to say it\u2019s nearly impossible for me to land any kind of research data scientist job because almost all the job reqs require a PhD. I just wish there wasn\u2019t a stigma associated with MS graduates. We know stuff too. Just because we have an MS and not a PhD doesn\u2019t make us any less qualified. Is there anyone here who is in a data science research role without a PhD?\n\nAm I glorifying these roles? \n\n",
"comments": [
"> Just because we have an MS and not a PhD doesn\u2019t make us any less qualified\n\nunfortunately, it literally does; it doesn\u2019t make you intellectually less capable but it means you are much less likely to have demonstrated research experience, and that does matter",
"So a research heavy role requires a proven record of research. Big surprise.",
"A couple of thoughts on this - 1. These roles typically also require (or at least encourage) publication of research in some form or another - PhDs are typically going to have more experience with this 2. Even companies like Microsoft only need so many people to \"read papers on ML and implement them in software\", I think there is currently a lot more DS work to be done that involves figuring out profitable ways to leverage models that have already been implemented in software by someone else.",
"Hmm. Actually having a Ph.D. literally means you are more qualified than an M.S holder.\n\nThe gap between a Ph.D and M.Sc is actually greater than the gap between M.Sc and layman. \n\nExceptions do exist. A dedicated genius can learn to contribute to a field without a degree but these are exceptions. MS or IBM will not go fishing for an exceptional M.Sc holder or an uncut gem when they can fish in a much smaller pond of Ph.Ds with published research.\n\nNote that these research roles are not for using libraries, they are for actually tweaking the method itself, creating new estimators etc.",
"A job involving reading about what others have done and then implementing it doesn't really sound like cutting-edge research to me. It sounds like a big standard industry job.\n\nReal cutting-edge research is about discovering / making / doing things that nobody else has done before. That's hard and requires years of training to be good at, hence the requirements for a PhD.\n\nAnd yes, just because you have a Masters literally does make you less qualified than a PhD. It doesn't make you dumber or less competent. But the difference is literally that PhDs have more qualifications than you do."
]
},
{
"id": "1c29ri7",
"datetime": 1712930088.0,
"flair": "AI",
"title": "Retrieval-Augmented Language Modeling (REALM)",
"score": 7,
"comment counts": 9,
"content": "I just came upon (what I think is) the original REALM paper, [\u201cRetrieval-Augmented Language Model Pre-Training\u201d](https://arxiv.org/abs/2002.08909). Really interesting idea, but there are some key details that escaped me regarding the role of the retriever. I was hoping someone here could set me straight:\n\n1. **First and most critically, is retrieval-augmentation only relevant for generative models?** You hear a lot about RAG, but couldn\u2019t there also be like RAU? Like in encoding some piece of text X for a downstream non-generative task Y, the encoder has access to a knowledge store from which relevant information is identified, retrieved, and then included in the embedding process to refine the model\u2019s representation of the original text X? Conceptually this makes sense to me, and it seems to be what the REALM paper did (where the task Y was QA), but I can\u2019t find any other examples online of this kind of thing. Retrieval-augmentation only ever seems to be applied to generative tasks. So yeah, is that always the case, or can RAU also exist?\n\n2. **If a language model is trained using retrieval augmentation, that would mean the retriever is part of the model architecture, right?** In other words, come inference time, there must always be some retrieval going on, which further implies that the knowledge store from which documents are retrieved must also always exist, right? Or is all the machinery around the retrieval piece only an artifact of training and can be dropped after learning is done?\n\n3. **Is the primary benefit of REALM that it allows for smaller model?** The rationale behind this question: Without the retrieval step, the 100% of the model\u2019s latent knowledge must be contained within the weights of the attention mechanism (I think). For foundation models which are expected to know basically everything, that requires a huge number of weights. However if the model can inject context into the representation via some other mechanism, such as retrieval augmentation, the rest of the model after retrieval (e.g., the attention mechanism) has less work to do and can be smaller/simpler. Have I understand the big idea here?",
"comments": [
"Found [5 relevant code implementations](https://www.catalyzex.com/paper/arxiv:2002.08909/code) for \"REALM: Retrieval-Augmented Language Model Pre-Training\".\n\n[Ask the author(s) a question](https://www.catalyzex.com/paper/arxiv:2002.08909?autofocus=question) about the paper or code.\n\nIf you have code to share with the community, please add it [here](https://www.catalyzex.com/add_code?paper_url=https://arxiv.org/abs/2002.08909&title=REALM%3A+Retrieval-Augmented+Language+Model+Pre-Training) \ud83d\ude0a\ud83d\ude4f\n\n--\n\nTo opt out from receiving code links, DM me.",
"nice",
"Can people please up vote I need some advice and I don't have enough comment karma"
]
},
{
"id": "1c344vh",
"datetime": 1713020923.0,
"flair": "Discussion",
"title": "Feedback on response: What realistically will be automated in the next 5 years for data scientists/ML engineers?",
"score": 0,
"comment counts": 6,
"content": "I had responded to Reddit thread [here](https://www.reddit.com/r/datascience/s/lhe8RQK6Up) I was completely blown away with the traction my response received. \n\nI wanted to thank everyone who took the time to read and share there thoughts. I would also appreciate if folks could share constructive feedback for me on the writing.\n\nI have a very small tech blog that I\u2019ve been wanting to write on for a while now. I wasn\u2019t sure where to start or what topics I should focus on first. I decided that I with all the engagement of that I would try to unpack the advice on the blog, which can be found [here](https://insightsthroughdiscovery.com/what-will-be-automated-by-ai-ml-in-the-next-five-years/). \n\nThe website hasn\u2019t had much work on it, not really looking for feedback on the website itself, cause I know it needs work. I\u2019m looking for feedback about the blog post and about the content within it?\n\nI would also like to hear about what topics you as a reader might be interested in reading about. Thank you, in advance for your feedback and I hope you have a great weekend ahead.",
"comments": [
"Making a post to get engagement on reddit and drive traffic to your own website hmmm \ud83e\udd14",
"Will the ds people running ai ever be out of a job?",
"Potential topic - There seems to be a lot of different tech stacks out there but they are doing similar things under the hood. And new techs come out all the time. So what skills are common to most tech stacks and are transferable?",
"Great. Your experiences real world use case. \nI have review your blog post.\n\nWhile your predictions are interesting, you could consider providing more specific examples and use cases to illustrate your points. For instance, you mention \"data cleaning and preprocessing\" as a task that will be automated. You could elaborate on this by describing the current challenges in data cleaning and how automation can streamline the process, perhaps with a real-world example."
]
},
{
"id": "1c1vo54",
"datetime": 1712882866.0,
"flair": "ML",
"title": "How do you deal with predicting purchases where the purchases are extremely imbalanced and the data is extremely sparse. ",
"score": 22,
"comment counts": 31,
"content": "Dataset has 300 million rows. Only 1 million have purchases. So the dataset is extremely sparse. \n\nI\u2019m taking the one million purchases and taking a random sample of one million non purchases and training my model on that. \n\nIs this approach feasible? Are there any other approaches people would recommend. Any papers on this? \n\nTrying to predict conversions on an ads platform. ",
"comments": [
"This approach is on the track. Down sampling is good. You have lots of data so down sampling should not be a problem for you. And it's good for your efficiency since you have less data now. \n\nBut when you are evaluating your model on the testing dataset. Make sure the test data actually follows the real distribution which is imbalanced to reflect the reality. \n\nAlso make sure you add one more step of score calibration at the end of your model building to make the prediction score follow the statistical distribution of being positive. The approach can be as simple as bucketize the prediction scores and apply a multiplier for each bucket.",
"Your biggest issue as you\u2019ve correctly identified is scarcity. A third of one percent of people actually compete a purchase. That\u2019s an exceedingly rare event. I\u2019ve often solved similar problems using two different models. First, can you predict who is and isn\u2019t going to buy something? If you can, grab all of the records where you have a confidence they will purchase above some threshold and then try to predict what they will buy. You can create an extra class amongst this slimmed down group for those that are strong prospects but ultimately didn\u2019t complete a purchase. You may also want to cluster items into logical groups as you may have a better shot at predicting someone wants to buy tshirts vs wanting to buy a specific tshirt.",
"For xgboost there is a parameter called scale_pos_weight, which weighs the classes proportionally. That has worked best for me in the past. Better than manually down sampling in my experience",
"The problem really comes in when you have a really low raw count of positive cases. You do not. There\u2019s nothing wrong with modeling an imbalanced dataset. Other comments about downsampling affect the score calibration (you\u2019ll get a 50:50 distribution of scores if you weight or downsample), but really don\u2019t affect goodness of fit. \n\nI personally hate throwing out data. If you have compute just model the whole dang dataset.",
"If you can link purchases to individual customers, you could look at [buy till you die](https://en.wikipedia.org/wiki/Buy_Till_you_Die) models. But those are simple parameterized models, not machine learning. Also, not applicable if you are trying to forecast which particular items will be bought next."
]
},
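A minimal sketch (hypothetical variable names, functions named here for illustration) of the downsample-then-recalibrate workflow discussed in the comments above: keep every purchase, sample an equal number of non-purchases, train on the reduced set, then map the scores back to the true base rate so predicted probabilities are not stuck near 50/50.

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

def downsample(X, y, neg_per_pos=1.0, seed=0):
    """Keep every positive row and a random sample of negatives."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg_pool = np.flatnonzero(y == 0)
    neg = rng.choice(neg_pool, size=int(neg_per_pos * len(pos)), replace=False)
    idx = np.concatenate([pos, neg])
    return X[idx], y[idx]

def correct_prior(p_model, pi_true, pi_sampled):
    """Map scores learned under the downsampled prior back to the true prior."""
    num = p_model * pi_true / pi_sampled
    den = num + (1 - p_model) * (1 - pi_true) / (1 - pi_sampled)
    return num / den

# Hypothetical usage, assuming X, y, X_holdout are numpy arrays of the full data:
# X_small, y_small = downsample(X, y)
# clf = HistGradientBoostingClassifier().fit(X_small, y_small)
# raw = clf.predict_proba(X_holdout)[:, 1]                 # calibrated to ~50/50 data
# p = correct_prior(raw, pi_true=y.mean(), pi_sampled=y_small.mean())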
{
"id": "1c28knv",
"datetime": 1712926892.0,
"flair": "AI",
"title": "Advice and Resources Needed for Project on Auditing and Reversing LLMs employing coordinate ascent",
"score": 2,
"comment counts": 3,
"content": "This may not be the right place to ask but really need advice. \n\nI am a college student and I'm working on a project for Auditing LLMs by reversing an LLM and looking for prompt - output pairs. I want to know which model would suit my purpose . I wanted to evaluate pretrained models like LLaMA , Mistral etc . I found a research paper doing experiments on GPT -2 and Gpt-j. For the academic purposes i intend to extend the experiment to other llms like Mistral, LLaMA , somw suggestions are welcome .\n\nI am a beginner here and I have not worked on LLMs for prompting or optimization problems. I am really not sure how to progress and would appreciate any resources for performing experiments on LLMs. \n\nAlso any concepts that i should know of ? .\nAlso im curious how do you usually run and train such models . Especially when there are constraints in computational power. \n\n What do you usually when access to server / gpu is limited . Any resources where it is easy to get GPU for distribted parallel computing that are easy to obtain? Other than google colab. ",
"comments": [
"Does your school have computing resources for students in your program? Like a cluster or a supercomputer you can sign up for time on?",
"Can people please up vote I need some advice and I don't have enough comment karma"
]
},
{
"id": "1c1yc8i",
"datetime": 1712890389.0,
"flair": "Career Discussion",
"title": "Advice on what types of entry-level roles to seek",
"score": 7,
"comment counts": 19,
"content": "Hello! \n\nI'm looking for some career advice on what types of roles to seek, resume help, and how to get feedback on an NLP Developer technical project that I completed. \n\nA little about me: I have a B.Sc. in CS, a data science internship (data visualization, data analysis, and LLM experience), and research experience (published in JMIR). I've been looking for work, full-time, for about a month and a half now, 170 applications sent, and minimal responses so far.\u00a0**My end goal is to find a DS/ML role**. Something that will allow me to learn more technical skills in ML. I am primarily seeking remote roles in Canada and the United States, but am open to hybrid in my location. At some point, I plan on going back to school to receive an master's in CS specializing in ML, but at the moment, I'm looking to get more work experience.\n\nThe only response I've received so far is from an international company as NLP developer. I wrote a technical project for them, but sadly did not receive the role or any feedback on my submission. I have posted two examples of resumes that I might submit. I tailor my resume for each job, which usually means using an ATS checker and changing the summary/skills section to match the verbiage of the job. I'm looking for some advice:\n\n**1)**\u00a0At the moment, I'm casting a wide net in terms of roles. I've been applying to entry-level DA/DS/DE and ML related roles. Recently, I've focused a bit more on the DA roles, as I generally have all the skills they are asking for, and my data science internship involved a lot of dashboard building. That said, I am most interested in roles that involve ML (especially LLM and NLP).\n\nI sometimes feel as though I'm selling myself short by applying so much to data analytics roles when I really want ML.\u00a0**Should I continue to cast a wide net in terms of roles, or would it be better to focus more time on applying to only DS/ML roles?**\n\nI should mention, time is of the essence, and work experience is so valuable in this market, so I would rather start working as a data analyst now than hold out for another year to find an ML role.\n\n**2)**\u00a0Related to Q1. I'm currently working on the Google Data Analytics Profesional cert, which I'm finding boring and easy. At this point, I will finish it, but I'm wondering what certs to take next. I could do more data analytics certs, but I was considering doing Andrew Ng's ML and DL certs on Coursea.\u00a0**Would you recommend these certs, or any comments on other certs to take?**\n\n**3)** **Do you have any suggestions on my resumes?**\u00a0I know that it's a bit verbose in the skills section, but I'm trying to ensure that I make it through the initial screening if they are using AI.\n\n**4)** **Where can I post my technical project for the NLP developer position that I applied to?**\u00a0I would love to get some feedback on my submission, and I think it could be helpful for other Redditors looking for the same types of roles.\n\nThank you! ",
"comments": [
"[deleted]",
"ML Resume:\u00a0[https://www.zippyshare.day/5LtewD2od3hXKhE/file](https://www.zippyshare.day/5LtewD2od3hXKhE/file)\n\nDS Resume:\u00a0[https://www.zippyshare.day/5GjSSZwVVPKeKvb/file](https://www.zippyshare.day/5GjSSZwVVPKeKvb/file)\n\nI tried to add these resumes as images to the post, and moderators took that post down. If you don't want to click on these Zippyshare links, you can view another post I made [here](https://www.reddit.com/r/datasciencecareers/comments/1c1m99s/advice_on_types_of_entrylevel_roles_to_seek/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button).",
"Make a post on r/EngineeringResumes for resume review and follow their wiki if you haven't already. I didn't look at your resume. \n\nThe market is tough for entry level right now, especially with just a BS, only suggestion is to cast a wider net, maybe entertain data engineering or CS roles and then pivot later.",
"Its good to secure a job first.. although if it's DA. You can get a good pay scale plus you can gain some experience and contacts are bonus. From there you can plan how, when and from where you can get into ML or DS.",
"The situation what you\u2019re describing is like everyone else\u2019s situation. Your first job you\u2019re going to be overqualified for. Try to find any DA role (even if it\u2019s a consulting company) for the experience and leave after 1.5-2 years for a better role and work your way up in roles. I wish there was a short cut, and I\u2019ve tried everything, but there isn\u2019t.\n\nThe only real shortcut is go to top ranking school and networking. The rest of us kinda only have 1 path"
]
},
{
"id": "1c1owjv",
"datetime": 1712866253.0,
"flair": "Tools",
"title": "Ibis/dbplyr equivalent now on julia as TidierDB.jl",
"score": 19,
"comment counts": 8,
"content": "I know a lot of ppl here dont love/heavily use julia, but I thought I'd share this package i came across here incase some people find it interesting/useful. \n\n[TidierDB.jl](https://github.com/TidierOrg/TidierDB.jl) seems to be a reimplementation of dbplyr and inspired by ibis as well. It gives users the [TidierData.jl](https://github.com/TidierOrg/TidierData.jl) (aka dplyr/tidyr) syntax for 6 backends (duckdb is the default, but there are others ie mysql, mssql, postgres, clickhouse etc). \n\nInterestingly, it seems that julia is having [consistent growth](https://discourse.julialang.org/t/some-julia-growth-usage-stats/112547), and they have [native quarto](https://discourse.julialang.org/t/ann-native-julia-engine-for-quarto-using-quartonotebookrunner-jl/112753) support now. Who knows where julia will be in 10 yrs.. mb itll get to 1% on the tiobe index ",
"comments": [
"BTW, TidierDB is part of a larger \"100% Julia reimplementation of R's tidyverse\": https://github.com/TidierOrg. It covers plotting \u00e0 la ggplot, some web scraping and other things.",
"Stop trying to make Julia happen. Julia is not going to happen"
]
},
{
"id": "1c1y109",
"datetime": 1712889477.0,
"flair": "ML",
"title": "The Mechanisms of LLM Prompting and Next Word Prediction",
"score": 4,
"comment counts": 2,
"content": " Is a prompt always necessary for a large language model to generate a response? What processes occur behind the scenes when a prompt is given? How is prompting connected to the next word prediction in LLMs? ",
"comments": [
"No, a prompt is not necessary. In simple terms the generative process is generating a sequence of tokens (words) one at a time according to the probability of that assignment*. If there were no probabilities then it would be a purely random process that would generate random gibberish.\n\nOne mechanism to assess the probability of the next token is self attention (more specifically masked self attention in this case but keeping it simple). In this, we decide how probable the next token is based on the output itself ... i.e. if the output sequence already generated is \"How are ____\" then the model will learn that \"you\" is a very probable next token; \"how\" is a very improbable next token; and something like \"babies\" is moderately probable. This is self attention and can generate without any prompts.\n\nWhile this actually creates quite an effective generative process - it creates credible outputs/sentences - it is not controlled by a user and can't answer questions or be steered towards topics of interest. For this reason, we also use cross attention. This is basically the same thing but here we determine how probable the next token is based on a different input. Commonly this input will be a written user prompt but it could be an image or audio file or pretty much anything that produces a trainable pattern. This allows us to then direct the LLM to generate based on user requests. If we had only cross attention (no self attention) we would generate relevant words, but they would not be coherent sentences just a list of words associated with the input/prompt topic.\n\nSo in practice, LLMs generate the next token by considering two bits of evidence: (1) the previous words generated (self attention) and (2) the prompt/user input (cross attention). (1) Ensures that we generate coherent outputs and (2) makes generation controllable and linked to user input/prompts.\n\n*I appreciate RLHF and all that and the above probably better describes greedy generation, but for simplicity...",
"Tokenization, context, pattern recognition, probability estimation, response and output text."
]
},
{
"id": "1c1hqxo",
"datetime": 1712848824.0,
"flair": "Career Discussion",
"title": "Data science vs Consulting ",
"score": 19,
"comment counts": 22,
"content": "I went through a bunch of tech and operational roles for 5 years. For 1.5 years till 6 months ago, I was in an academia adjacent research role heavy on data analytics. Last 6 months I have moved to a full fledged data science role. Not much of neural networks/deep learning. Most work is tabulation and/or random forests, logistic regression and such.\n\nI might potentially get an offer to move into consulting (not MBB but globally known).\n\nFor many years, I was solely focussed on advancing my career in DS. But, hearing stories about how hard it is to even get interviews I am a but nervous about what the future holds after my current gig.\n\nI have a master's from an Ivy+ uni which is not a full fledged DS degree but involved a decent amount of DS coursework. I have about 8 years of work ex overall (But only <2 in DS). Currently working in the public health domain.\n\nDo you think it's worthwhile continuing the DS journey or should I switch? Any opinions or advice is helpful.",
"comments": [
"Ex consultant who went to FAANG DS. Consulting can be a great leaping off point for future roles, especially if you're doing DS consulting.",
"I went to the commercial side of a pharma, it\u2019s more money and less work. Different kind of work though. Base is $150k with good benefits and work hours are like 9-3.\n\nNow, I do work til 5 because I want growth but office is usually empty after 2pm.",
"Hey man don't let others scare you do what your passionate about if that's consulting or advancing career in ds that's where you will be most successful",
"This is not a great answer for you, but over a long analytics / data science career, I went where the work was. That meant I jumped between small and large firms, Fortune 500 firms, FAANGs, advertising agencies, market research firms, consulting firms, and even did some independent consulting. My work ranged from IC to managing small teams to managing departments. \n\nEach place had its pluses and minuses, but if pushed, I would say my work at a Fortune 500 firm was the most fulfilling for a very straightforward reason: they consistently applied my analytics work for decision-making. I only wish that had been the case elsewhere, as even at the FAANG it was an uphill battle to get them to actually act on clear findings.",
"I work in consulting as a lead customer facing DS. We're not a Big 4, so we actually do work. I can't speak for every org but here's my take:\n\nMy role is a mixture of technical hands on work and sales work, so I help with giving educational sessions to customers newer to \"AI\", work with a plethora of vendors/partners, scope out projects that range from basic planning to advanced cutting edge work, then get to help lead execution of the work. It's a nice balance with varying exposure.\n\nOne benefit is the variety of projects and industries you get to work on/with. Every customer has a different challenge and level of maturity, which means you'll likely have plenty of variety. \n\nAnother is typically the pay. Because you're essentially contracted out to clients to do work at a higher hourly rate and need to actually have consulting skills, you can usually demand higher than average comp. \n\nOne big downside is that it becomes difficult to form any true domain knowledge unless you bring it from previous experience. This is mostly due to the constant context switching between projects and clients. On one hand it's nice to lean in and have them educate you on their processes, but it can cause for some slow starts or frustration.\n\nAnother is typically how consultants are measured. Project utilization is a key metric for anyone that does customer facing work. The problem with this is many firms (maybe not all) will emphasize your utilization before your wants/career goals. I.e. - you may find yourself doing DS adjacent or random boring work just to keep your rate up. \n\nEach company and role is different. Try to get a feel for the consulting company itself and how they measure your value and where they see their DS capabilities going. You want to make sure the consulting firm sees a positive trajectory for DS otherwise there won't be investment and you'll likely become underutilized or burnt out."
]
},
{
"id": "1c1rlkp",
"datetime": 1712872626.0,
"flair": "Discussion",
"title": "Learning new skills Advice for down time?",
"score": 5,
"comment counts": 1,
"content": "I would really enjoy learning and picking up new skills. Typically I work in SQl, power bi , but I really want to work on my Python and data analysis skills on the data science level. I do use Python but just the basic pandas and group by. \n\n What can I do during working hours during my downtime for this? Typically I just maintain reports and that\u2019s about it but I want to be proactive and at least learn some new skills. Any advice appreciated!",
"comments": [
"If you want to do it during your downtime at work, then I recommend you come up with a relevant business question you can try to answer, something that you think could have a real impact on the business.\n\nYou can start with doing some basic exploratory data analysis (EDA) using Python/Pandas in a Jupyter notebook.\n\nSlice the data in different ways, create some useful visualisations, and write down what you notice and whatever you think would be good exploring further. Make lots of inferences and gain more domain knowledge.\n\nOnce you get a good understanding of the data, try to answer that business question you came up with at the beginning. This could potentially lead you to trying some more advanced analytical techniques (Linear regression, causal impact analysis, survival analysis, etc\u2026).\n\nThis is more of a learn by doing approach.\n\nYou might get stuck along the way, but that\u2019s the point. Do some research, try to overcome any hurdles and finish your analysis. \n\nAt the end, if everything went well, you can share it with your stakeholders and get their thoughts/feedback."
]
},
{
"id": "1c1i2i6",
"datetime": 1712849621.0,
"flair": "Tools",
"title": "Tech Stack Recommendations?",
"score": 16,
"comment counts": 9,
"content": "I'm going to start a data science group at a biotech company. Initially it will be just me, maybe over time it would grow to include a couple more people. \n\nWhat kind of tech stack would people recommend for protein/DNA centric machine learning applications in a small group. \n\nMostly what I've done for my own personal work has been cloning github repos, running things via command-line Linux (local or on GCP instances) and also in Jupyter notebooks. But that seems a little ad hoc for a real group. \n\nThanks!",
"comments": [
"GCP/AWS/Azure probably pretty standard. I do dev in notebooks in GCP's Vertex on most days. Vertex is... not great, aside from developing in notebooks. But, it allows scaling compute pretty effortlessly, and switching between no GPU and an A100 (or whatever I need) is a major time saver when I get past modeling on a small sample. If I need a model built or inferences run on a schedule, I just wrap it in a DAG and run it in airflow using GCP's dataproc (managed pyspark), which can easily scale and handle R and python to process 10s of TBs of data for ETL and modeling jobs nightly. Code is saved in GitHub at end of day. \n\nNear real-time inferences via an API can be done via your cloud host or using a 3rd party edge deployment service within your cloud provider depending on your needs and budget (if you do lots of just in time inferences - like $500k worth per year or more - a 3rd party vendor can save you tons of $$). We save models in GCS buckets and outputs are saved in BigQuery. We do model and dataset/artifact tracking with a third party service that is similar to ml flow.",
"Oooh this is fun: \n\n* **Secure Data Collection Tools**: RedCap for encrypted and secure data capture from medical devices and clinical trials.\n* **ETL/ELT Processors**: Stitch or Fivetran for HIPAA-compliant data ingestion.\n* **Data Storage**: AWS S3 or Google Cloud Storage, configured for HIPAA compliance with encryption and fine-grained access controls.\n* **Data Warehouses**: Google BigQuery or Snowflake, with strong security measures and PHI data isolation. I'd lean towards Snowflake unless my org were full of Google fans. \n* **Data Transformation**: dbt for transforming, modeling, and ensuring the quality of data in the warehouse.\n* **Compliance Management**: Datica or ClearDATA for continuous compliance monitoring with HIPAA and SOC II.\n* **Data Visualization**: Tableau for advanced data visualizations and dashboards, configured for healthcare data regulations.\n* **Report Automation**: Rollstack for automated, compliant reports for data consumers in decks and docs \n* **Security and Monitoring**: Vanta or Secureframe for continuous SOC 2 compliance monitoring and Keycloak or Okta for secure Identity and Access Management (IAM).\n* **Backup and Disaster Recovery**: Automated backups and a disaster recovery plan that meets HIPAA\u2019s contingency plan requirements.\n* **Data Team and Stakeholder Engagement**:\n * **Data Literacy Training for Stakeholders**: Implement regular training sessions for stakeholders on data literacy, ensuring they understand how to interpret data and use analytics tools effectively. This helps in making informed decisions and leveraging data insights across the organization.\n * **Embed a Data Consultancy Knowledgeable About Biotech**: Collaborate with a data consultancy that has familiarity of biotech to provide expert advice on managing and analyzing scientific data. Basically they act as another set of eyes, and an \"expert\" voice to help coax stakeholders to act. \n\nEpic project. Be sure to report back in a couple of years!",
"d"
]
},
{
"id": "1c1lf2t",
"datetime": 1712857891.0,
"flair": "Challenges",
"title": "Framework for Build vs. Buy Decisions in Data Science Tools?",
"score": 8,
"comment counts": 5,
"content": " Hi All,\n\nI'm a BI lead exploring the decision matrix for \"build vs. buy\" regarding data science & BI tools, especially for automating recurring reports. We have some budget this year, but leadership often questions the need for purchasing tools when we might build in-house solutions especially with the support of AI. \n\nimho, the major arguments against building internally include ongoing maintenance, scalability, and reliability issues. It seems to me that with the advancements in AI and increased competition, the cost of buying solutions is more justifiable than before.\n\nDoes anyone have a structured approach or a decision matrix for evaluating build vs. buy options? How do you weigh the pros and cons in terms of costs, time, and resource allocation? Am I overblowing the upkeep costs of buy? ",
"comments": [
"I\u2019ve definitely been in the same boat with the whole build vs. buy scenario. It's always a bit of a juggling act trying to figure out what's best for the team in the long run. One thing that's helped us is looking beyond just the cost. We consider things like how quickly we can get the tool up and running, and what we're really sacrificing in terms of time and manpower if we decide to build it ourselves.\n\nWe use Rollstack for some our report automation needs (dashboards to decks). It just ended up being easier and self serve for our non-python non-REST knowing users. \n\nAnd, don\u2019t forget about the support aspect. Buying usually gives you access to support further preserving your internal technical resources which can be a lifesaver, especially when you need to focus on other projects.",
"It depends on your use cases. We typically followed a process of:\n\nDoes our existing tools have our identified use cases on their website? If No, move to next question. If yes, we worked with partner.\n\nIs there an app which does generally what we are looking for and have our identified use case? If no, move to next question. If yes, we set up a call and discussed our use case with vendor and doing a Poc \n\nIs there on GitHub/Blog/. Other source discussing our use case or what we are trying to do. If no, move to next question. If yes, we took code adjusted to what we were trying to do into an identified pattern.\n\nIf all the questions were no\u2026 then we made a custom build.",
"One problem with building it yourself is you\u2019re probably understaffed so it\u2019s going to take a long time and be kind of shit.\n\nBut the BIGGEST problem by far is when people leave you can\u2019t replacement their knowledge by hiring new people since everything is custom.",
"Try something damn simple. Like a server plus cron job. \n\nSometimes 80% of your need can be met. Then try to decide what you would like to invest to close that 20% gap",
"Clearly can write the problem you are trying to solve and the specific requirements needed to address it.\n\nHow much technical complexity? Is the tool/solution technically complex? If so, building in house may require significant expertise and time.\n\nSummarise compare the total cost of ownership."
]
},
{
"id": "1c1v5vc",
"datetime": 1712881495.0,
"flair": "Career Discussion",
"title": "Career roadmap and advice please. Transitioning from Academia to DS.",
"score": 2,
"comment counts": 9,
"content": "Hello!\n\nI'm 46, Masters in Operations Research, BS in Pure and Applied Math. I have been focusing on an academic career for a while, but lost my full-time professorship because I had problems with grading and meeting deadlines (I suffer from clinical depression). I am currently an adjunct math professor and cannot pay my bills. I love Applied Mathematics:, MDP, Game Theory, Simulation and I want to transition into some kind of Data Science field where I can learn more and utilize knowledge I have from grad school and more. I am not particularly interested in Finance or Medicine, and would find it much more interesting to work for physicists, chemists, biologists, economists, or other scientists. I am also very concerned about being able to acquire a career of this sort (and keeping it considering my depression). \n\nMy current plan is to finish 4 Coursera specializations: Python, Data Science, Data Science with R, Intermediate Data Science; 3 textbooks: Probability and Statistics for Engineers (mostly a review, but my education heavily learns toward Modeling and Probability Theory rather than Statistics), Linear Models with R, and Acing the Data Science Interview. After doing a few projects and developing a portfolio, I hope that I can find an entry level position. Then, over the next 2-3 years I would like to study Machine Learning and finding the intersection with that and my OR background --Hopefully such a field exists. At this time I may get a second masters or maybe finish my PhD. Finally, I would hope to have a career when I turn 50.\n\nI understand my depression is a big problem, but I am not going to just go on welfare and eek out an existence for the rest of my life. If you will, please look past this issue for the moment.\n\nMy current questions are: \n1) What is your opinion of my plan? Is it viable? What roadblocks might I encounter?\n2) What job titles should I apply for as an entry-level data scientist? Data Analyst, Data Scientist, Data Engineer, MLOps, MLE.... I really don't know where I would fit.\n3) Although I want to work for scientists, is this domain viable? Should I apply for my common domains like FAANG, Finance, or Medicine?\n4) Ultimately, what job title should I be aiming for considering my interests.\n\nLastly, any other advice would be greatly appreciated!\n\nThank you!",
"comments": [
"I am not too knowledgeable about transitioning from academia to industry but I can probably comment on the intersection between operations research and ML. Its a very sought after field in supply chain planning companies. Its quite exciting to work in the field if you can solve problems that keep arising there.\u00a0\n\nThe major roadblock I see is the transition from academia to industry and the fact that you will have a gap of a few months/years before you intend to pick your career up again.\u00a0\n\nGood luck!\u00a0",
"In very brief, I would first suggest applying to IT/ ML/ AI consulting firms. Your skills will be useful on many projects, you can learn in a more structured way from those many projects. In consulting I met several consultants with a background similar to yours, and they were doing well.",
"Is his plan ok?",
"Best wishes with your career journey"
]
},
{
"id": "1c10pqe",
"datetime": 1712793645.0,
"flair": "Discussion",
"title": "Why is it so hard to get neural networks to beat XGBoost on most small-medium tabular datasets?",
"score": 141,
"comment counts": 89,
"content": "I get it; XGBoost is really potent and easy to use while with DL theres a lot more that can go wrong tuning hyper parameters wise.\n\nBut i always assumed that whatever an ML model can do, a DL model with proper settings and sufficient regularization can also do as well even in low-medium size datsets (~hundreds to thoussnds of examples range).\n\nI understand that DL models are more likely to overfit because theyre very very flexible espeically especially as width and parameter count goes up. Meanwhile, something like XGBoost tends to have just enough flexibioity to model complex patterns without overfitting.\n\nAt the same time, i thought that with sufficient regularization techniques like dropout, L1/L2, shrinking width etc, that a DL model should be able to generalize just as well even with small-medium datasets. \n\nIs it just that I havent found the correct combination of regularization and layer architecture? Or is it just that my assumption is wrong?",
"comments": [
"Neural networks and GBTs have different inductive biases which make them better suited to different tasks. Empirically, the inductive biases of XGBoost seem to make it more suitable for modelling relationships found in real word tabular data. Theoretically, neural networks (in the limit of large model complexity) are universal approximators, so you could always argue that in this limit some combination of weights could be chosen such that an NN will outperform XGBoost if the target function can't be perfectly represented by a GBT.",
"There's some talk on YouTube by Ga\u00ebl Varoquaux and associates as well,\nhttps://arxiv.org/abs/2207.08815",
"Because you are using an Ill advised algorithm for the task. NN in tabular data are a glorified linear regression. Xgboost will outperform just because of boosting and tree methods being superior.",
"From working with real data it is also because real data sux. Some random analysts might decide to set some random default columns to -1, cap some numbers to be maximum at 120 because of legacy code, or some rows might be screwed up in some ways. It's a pain to get an NN to deal with these problems, versus a lot of existing gradient boosting packages have decent defaults that takes cares of this. \n\nIntuitively decisions trees (say for classification) are just glorified ways of chopping up existing data into bins. Because of the way real data are screwed up (mostly due to human idiosyncrcies), they tend to get screwed up in a similar way and splitting data up in a decision trees seem to capture this aspect very well. Since if an engineer decides to default something to 42, it's not like we are gonna see 42.32, which is what NN is good for (extrapolating).",
"The biggest benefit of a DL model is it can create linearly separable vector representations of non-vector data (eg images, text). Tabular data is already in a vectorized form that\u2019s probably easily separable, so at that point DL is just logistic regression."
]
},
{
"id": "1c0ujfw",
"datetime": 1712778189.0,
"flair": "Career Discussion",
"title": "Starting as a Data Scientist",
"score": 248,
"comment counts": 114,
"content": "After being a data analyst in the navy and then a data engineer the last decade I start my new role as a Data Scientist working from home for my dream company on a project that I'm a subject matter expert on\n\nPretty stoked. Especially since I was laid off last month which got me to apply to a new role\n\nPretty nervous. Pretty excited. ",
"comments": [
"Congratulations man! \n\nNervousness is good.\n\nI'm sending my first ever model in production. Spent 2 months and 2.5KLOCS .",
"Wow! This is really awesome. It's a great feeling to work so hard and finally land the role you've been dreaming of. Congrats!!",
"Good stuff. Congratulations and I hope you do well! \ud83c\udf40",
"Congratulations, with your experience you are gonna do well!\n\nWhich were the requirements for the position?",
"Congrats! I\u2019m looking to transition soon as well. \n\nI\u2019ve worked as an analytics engineer for 2 years and a process engineer (same company) 3 years before that.\n\n\nWas contacted by a recruiter for a DS position at a competitor in my current industry. Interview is about a week out and I\u2019m feeling super nervous."
]
},
{
"id": "1c1uafg",
"datetime": 1712879222.0,
"flair": "Career Discussion",
"title": "Contracting while looking for FTE - bad idea?",
"score": 1,
"comment counts": 9,
"content": "I moved from Europe to the US, and am currently unemployed while looking for a job. I don't require sponsorship and have 5yoe as a data scientist, but it's still difficult in this market. \n\n\nI've been considering recently to take a lower-paying contract role (the ones I've been contacted about are typically about $50-$80/h) to have some income and give myself some more time to find a solid full time position. And then resign from the contract job when that happens, even if I've only been in that position for a couple of weeks. \n\n\nAside from burning bridges with the contract employer, is this a bad idea? ",
"comments": [
"No. At will employment goes both ways, especially with contracts.",
"Just make sure that the contract doesn't have any stipulations or penalties for jumping ship early. Other than that - yeah, that sounds like a perfectly fine plan.\n\nHaving said that - with 5 years of experience, I wonder if you need to work on your resume because while the market is bad, it's not that bad for people with meaningful experience.",
"I don't think there is any issue in that. If anything you get more experience. Do make sure to mark it as contracting on your LinkedIn and I would recommend a single experience bullet point and not one for each contract role. That just looks like you are leaving jobs after 1-2 months very time."
]
},
{
"id": "1c0x106",
"datetime": 1712784235.0,
"flair": "Discussion",
"title": "Shout-out to all you super competent Data Scientists out there!",
"score": 116,
"comment counts": 51,
"content": "In the past, I've often commented that my domain expertise and working experience (i.e. in organic chemistry) is likely the thing that is / has been most valuable to my employers so far - with my Data Science skills being a second place. So, there's likely tons of data scientists that will run circles around me with their expertise in ML, programming, maths or statistics.\n\nToday, I saw an (internal) job opening for a \"fully DS-centric\" Data Scientist position - the full shebang: multi-year-track record in diverse ML methods, experience in LLMs, fullly business-savy, proven track-record, excellent communicator, experienced in change management).\n\nThis is not the position for me. I'm happy and content with my job and with how things are going for me; I'm good at what I do. I like the niche that I have created for myself.\n\nStill, it made me realise *how effing competent* some of you guys out there really are in your respective domains! So, I just wanted to say, well done you! Keep at it! \ud83e\udde1\n\nEdit: \"proper\"",
"comments": [
">a \"proper\" Data Scientist position\n\nIt sounds to me like you have a proper DS position. Picking up ML techniques is a lot easier than learning the domain _you_ know.\n\nDon't sell yourself short.",
"Chemist to Data Science pipeline is so real tho",
"Domain knowledge is the most important DS skill",
"DS at this point is more and more like the BI 10 years ago. Anyone can write code, but not anyone can make sense of \u201cthis\u201d data and \u201cthis\u201d operation.",
"Disregarding, this message is not for me."
]
},
{
"id": "1c1lnz8",
"datetime": 1712858491.0,
"flair": "Discussion",
"title": "Webinar Calendar/Master List",
"score": 2,
"comment counts": 4,
"content": "There are a bunch of good free webinars, mostly by big names like Google, data bricks, neo4j, census.gov, etc. \n\nI'm trying to find a calendar that aggregates all these sorts of things. Has anyone seen something like that? I'm shocked that it isn't built.\n\nEdit --some webinars: \n\nJohn Snow Labs: https://www.johnsnowlabs.com/webinars/\n\nNeo4j: https://neo4j.com/webinars/ \n\nDatabricks: https://www.databricks.com/events?event_type=virtual-event-webinar®ion=all \n\nDatabricks also has a couple podcasts and blogs\nimage.png\n\nGoogle (Not DS Specific): https://cloudonair.withgoogle.com/ \n\nUS Census: https://www.census.gov/data/academy/webinars/upcoming.html",
"comments": [
"I was looking at this a while ago sadly I didn't find any for all of them but looking one by one still works ig",
"KDnuggets has a good collection.\n\n[https://www.kdnuggets.com/meetings/index.html](https://www.kdnuggets.com/meetings/index.html)\n\n[https://www.kdnuggets.com/2020/09/best-online-masters-data-science-analytics-online.html](https://www.kdnuggets.com/2020/09/best-online-masters-data-science-analytics-online.html)",
"can you share some of those here?"
]
},
{
"id": "1c0rkd3",
"datetime": 1712770936.0,
"flair": "Career Discussion",
"title": "What does a PIP look like for data scientists?",
"score": 174,
"comment counts": 122,
"content": "Im curious, for those who have been placed on a PIP, what does it look like generally and what metrics are typically measured to determine if you have met or failed to meet it?",
"comments": [
"pip install --upgrade pip",
"If you are placed on a PIP, 99% of the time you are going to be on the chopping block. For anyone on a PIP, just start looking aggressively for a new job",
"I would start looking for a new job",
"Hate to burst the bubble, but a PiP is just the final step in the \u201cquiet firing\u201d process. It\u2019s HR trying to cover the corporate six, especially if the victim is in a \u201cprotected\u201d group. The chance of surviving a PiP is basically zero. Use the PiP period to find another job.",
"Would it be Pip3 more stable?. Sometimes, pip causing some version conflict"
]
},
{
"id": "1c0zhd5",
"datetime": 1712790341.0,
"flair": "Career Discussion",
"title": "Capitulation | It's come to this",
"score": 33,
"comment counts": 40,
"content": "wayy TLDR: 2023 Statistics, Finance, and Math grad with only one internship in tangentially related role. Can't find a job going on well over a year. Located in the US and is mostly location agnostic with a few preferences. Just trying all my options.\n\nHello, all. I sit here writing this post in disbelief that I am actuall doing this. Generally, in any given subreddit, I am a lurker. A year ago I would have shuddered at the thought of making a post like this. It goes against my nature to ask (beg?) for help from stangers--or, even from anybody at times. I feel as if I've done everything I can and am going insane trying to figure out where I am going wrong. Confiding in my close friend, I spoke of my troubles and experience in my job search (how I had done everything possible and such), and he suggested I make a post to this subreddit. To his credit, I had not done this yet and decided to exhaust all options I know of. I don't know what to expect, but I hope at the very least a see a perspective that allows me to push past the feelings (of what I can't really pinpoint anymore but despair, desperation, and any others you can name are probably there too).\n\nThis is my first post of this nature; so, being not sure of how to start, I will begin with an overview of myself and then attempt to consruct a coherent description of my situation. Details of the biography aren't unique to me or necessarily important to understanding my struggles, but, for anyone interested I figured it would save a few additional comments that I may have to make to give them.\n\nI come from a very rural, impoverished area. Both my parents grew up even poorer than we are now, and neither went to college. My dad is a first-generation American so his life story has been quite rough as the family got here in the early 20th century (he is very old compared to my classmates fathers). I was lucky enough to have been born to two wonderful people who supported me in any way they could although not fully understanding the life I wanted--one away from poverty and where I could explore my interests with like-minded people. Luckily, I did exceptionally well in academics and found myself with an acceptance to a top ten undergrad program.\n\nMy thoughts (however misguided as there was no one I could have advise on this in highschool) were to go where the money was. Wallstreet. I had set out on studying finance my freshman year. My school sends countless kids to the top Investment Banks every year and I thought that I should do that, get paid, find myself, and then transition to what ever industry I could if I wanted out. The school paid for a trek out to NYC and we got to go and visit all the big banks and meet with emloyees it was really fun. That was until someone told me about the work culture. This was the 2nd or third month of my freshman year so I was pretty ignorant with most things and when told about the working hours my stomach dropped. It seemed inhuman to work that much and for the price it didnt seem worth it. I struggled with this for a while and really started to enjoy the statistics class I was in. I'm sure you can guess where the rest of that story goes. I ended up switching to statistics; however, I still greatly enoyed finance and the classes were super interesting, so I kept finance for my other major. I also took several math classes past whats required for the minor but not quite the major but I enjoyed them regardless. 
I got to intern at INSEAD in Paris for a summer as a research assistant in statistics, so I thought I had a good chance to get some sort of data analyst/scientist/etc. role going into my senior year. Being an IB feeder school many of my classmates and friends had fulltime offers already and many more were expecting theres upon completing their upcoming internship. I however got nothing. I couldnt even find a professor that I could work for over the summer.\n\nI went through junior summer jobless and scared. Everyone told me I would find a full time offer before graduating, but I didn't. I didn't care if it was for data/business analyst or data science or marketing data analyst etc. etc.\n\nI was then graduated living at home with no income. My student loans (though very low) still were there and I began having to pay on those. There is no job I can get near me. I never had a car growing up and still dont. We just never had the money for that.\n\nLuckily, there is an amazing alumni network that is always willing to help. Additonally, one of my professors I'm close with has made some good connections for me putting me in contact with some fantastic people who have helped me with interview prep, resume advice/editing, etc. The only thing they couldn't do is give me a referral as, upon looking there were no entry level roles for me that were open.\n\nAll of this was going on as I had life happening (as it does haha) This past summer I experienced the death of a close friend, a grandparent, and many other things. Job rejection and ghost after the other just hurt. I was lucky enough to a have a few go to an interview where I got to present my analysis to a team. Now they wont respond to my emails. It feels like some jobs just try to use applicants for free ideas without ever hiring them. Then, one day my girlfriend of 11 months facetimed me saying that she \"thinks it would be best if we didnt talk anymore\". I heard later through a friend that one of the reasons was my lack of ambition since i \"didn't even have a job yet and have the \\[prestige of my undergrad\\] to back me up\". Long story short, I did something very stupid and tried to take my life, but was stopped by cops who had been notified of a \"distressed person\". I was allowed to 201 myself and 2500 of my savings later, some therapy, and several months later, I feel much better and in control, but the frustration of getting a job still is there. I don't know what to do. I feel like I see everyone saying that referrals are the best way to get a job, which makes sense, but I have very excellent alumni helping and its still not doing it for me. Everyone I graduated with is on wallstreet now or at some consulting firm being overworked, but I just want to work. I like working I want to be sucessful. I sometimes wish I had just stayed with the crowd and not been such a baby about the working hours.\n\nLooking at what I've written I realize it has gotten quite long. I don't want to be annoying, but I promised my friend I would do this. The simple act of posting this brings me much peace as I have exhausted this option. I'm not a big reddit poster/user, so apologies if I broke some unspoken rule or something of this sub. I wont exlain anymore in this initial post but will be happy to answer any commens or dms. I hope that this is somewhat coherent but writing it brought back a lot of memories I don't like to think of, so I did my best to get past it and just have something written down.\n\nThank you for reading if you did :)",
"comments": [
"Just wanted to say look into getting your loans switched to the SAVE plan. It\u2019s income based so you won\u2019t have to pay anything until you get a job, and you don\u2019t even accumulate interest either.",
"If you're getting interviews but can't close on a job offer, it means you need to continue practicing your interviewing. Also, if you share your resume we can see what can be improved.",
"Hello Sir,\nI'm not based in the US (France here ;) ) so any advice I could provide might lack relevance.\nAlthough, I just wanted to give you and virtually send you my full support and compassion regarding your personal story.\n\nI also come from a poor family, had tough times during academics in order to find a place in a world that did not want me at first.\n\nIt's not easy, the market is complicated currently, but there's light at the end of the tunnel and based on what I understood from your experience and what others said, you will end up finding something good ! Plus, you seem like someone very resilient.\n\nKeep it up !!",
"I too didn\u2019t read through your entire post. But the general story is a good illustration of how people of your background have one hand tied behind their back in the current system, especially when it comes to getting into highly competitive fields like finance or academia. It\u2019s not knowing How The Game Is Played, or that The Game even exists. Middle/upper class kids are taught that almost from the moment they are born. Think of it as \u201cimplicit nepotism.\u201d The good news is that once you figure it out, you will go further than any of them, because you know how to overcome, rather than having everything handed to you.",
"Just try to get your foot in the door in an analyst role. Could be in operations, marketing, wherever the case is - even if it's not data analytics or data science, just go for that, so would broaden your search.\n\nHit up your alumni network, present a strong case but be humble, ask them for help. Meet folks in person - ask for an informational interview, tell them you'll treat them for coffee for some advice, and at the end of your 30 minutes, ask them if they would recommend you talk to anyone else.\n\nDo not pressure them, but ask very good questions, because that will show your potential. \n\nOnce you get that analyst job, master SQL, do all the dirty ad-hoc asks, and get really good. Be the go-to-person that handles those tasks - this is how you'll get noticed by the data team. \n\nThat'll get you your foot in the door."
]
},
{
"id": "1c18y1q",
"datetime": 1712819942.0,
"flair": "Discussion",
"title": "Tough call: How important is choosing MSc Dissertation Topic in Data Science",
"score": 6,
"comment counts": 25,
"content": " \n\nI'm 22-23 years old, currently at a crossroads in choosing my dissertation placement project for my MSc in Data Science and could really use your collective wisdom. One offers **stipend (3000 pounds**) (which I really want to take), others don't. Stipend provider project is about healthcare analytics with geospatial data (idk kind of not good career potential) and there is strong supervision, it will solve real life problem and might lead to **publications**. And then there is **offer from Bank of England** which offers no stipend, no supervision, not sufficient literature review (more work should be done on modelling) but aligns with my Economics bachelors, and would look good on CV. Since I want to do phd, I also have option of doing **internal dissertation project** which I can do on my interest field which is **Machine Learning** ( Developing algorithms that enable computers to learn from data)\n\nMain: One of the options I'm considering is a project focused on \"Visualizing Geospatial Trachoma Diagnostic Data and Risk Factors.\" This involves providing analytic and mapping support to a multi-country study on trachoma in countries like Zambia, Kenya, Ethiopia, Uganda, Cameroon, and Nigeria.\n\nHere's what I'd love to get your thoughts on:\n\n1. How easy is it to transition between sectors (e.g., from healthcare to finance, tech, etc.) in data science? Does specializing early help or hinder long-term career flexibility?\n2. Are the skills and experiences from a healthcare-focused project like this seen as valuable in other sectors within data science?\n3. Have any of you made a sector switch in your data science career? What challenges did you face, and what advice would you have for someone early in their career?\n4. For those of you who have worked on healthcare data or similar projects, what has been your experience? Would you recommend diving into such a niche, and why?",
"comments": [
"The dissertation you do would have a minor impact on job likelihood and usually PhD too. If you have an overall understanding of ML and do well that's usually more important.\n\nAfter my BSc and MSc, when applying for jobs nobody ever cared about my dissertation. Heck even after doing my PhD in biology and bioinformatics nobody ever even cares what my specific project is. Ever. They just see PhD in biology and that's it even though I am now in the data science field, the PhD itself gets me quite far.\n\nEspecially as you already have an undergrad in economics, that will get you further in getting economics PhD/jobs than a specific project will. I would just do some ML problems on the side to bulk up your ML and coding ability, which you will need anyway as an MSC project will only scratch the surface\n\n\nWhat will make a difference to any job or PhD place you apply to, even if it is in a distinct area as fashion or aerospace is having publications. They wouldn't read it but it would be like having extra certificates to add to your qualifications. That would certainly stand you out head and shoulders above other applicants and is what I would aim for in your dissertation if I were in your shoes.",
"that geospatial data science and drone technology intersection excite me too though",
"I\u2019d probably lean towards the healthcare topic because:\n\n* it pays you money today\n\n* It is a better posed/defined/documented problem\n\n* you\u2019re actually improving the human condition (as opposed to banking where you\u2019re just playing the arbitrage game sucking capital out of human life)",
"I would say that the thesis topic and publications might be relevant only in the case when applying for a phd. I had a similar case last year, where I wrote a thesis on 6D Pose estimation (computer-vision in a field of manufacturing), managed to publish a paper, but ended up working with LLM's in my full-time position :D. Of course the knowledge came in handy because the principles are somewhat similar, but the domain is different. Also, during the interview process nobody really paid much attention to my thesis work nor publication. \n\n \nSo go with the grant and in case you want switch industries this should be no problem as you will already be familiar with ML/AI.",
"Geospatial and healthcare are two huge industries?"
]
},
{
"id": "1c19npu",
"datetime": 1712822880.0,
"flair": "Challenges",
"title": "Coding Test (Data Science Framework) on CodeSignal",
"score": 6,
"comment counts": 3,
"content": "Hello! I have received an invitation to take a coding test (Data Science Framework) on CodeSignal. Do you know where I can find practice questions similar to the ones they might ask? Are there any previous sets of questions available? Also, what should I focus on to prepare for the test? I have four days before I need to take this test, so any help would be appreciated. Thank you!",
"comments": [
"Is it with BCG?",
"Kaggle"
]
},
{
"id": "1c0quez",
"datetime": 1712769122.0,
"flair": "Career Discussion",
"title": "What is a reasonable salary to ask for if you have a master's in data science/analytics and approx. two years of relevant experience?",
"score": 64,
"comment counts": 78,
"content": "With the title, I will be finishing my master's in DS this fall, and I've worked as a Data Analyst for a year (doing high level DS projects) and as a lead Clinical Data Manager for over a year before that. What salary should/could I ask for in a reasonably HCOL city for a Data Scientist position? I have a bit of imposter syndrome, and I want to make sure I don't sell myself short and ask for too little.",
"comments": [
"Depends on location",
"Really depends mate, looks like you might be in the six figure range.\n\n#1 rule for salary negotiations is to never give your number first. Instead of answering, ask what the salary range is for the position - and if the opportunity presents itself, then do this as early in the process as possible. \n\nThey might give you a solid number, or a range - just express agreement at first. If you want more, you can prepare how to present this depending on the medium - in person, phone, video call, email, etc.",
"What kind of DS and at what type of company? Check levels.fyi or job posts in states where it's required to post the pay band",
"You\u2019re not giving us much to work on here but I would do current salary times 1.2 at the minimum.",
"I'm DS with 4-5 yoe also in healthcare space. I've been stuck at 150k past 4 years. \nHCOL, mainly startups"
]
},
{
"id": "1c19o5p",
"datetime": 1712822936.0,
"flair": "Analysis",
"title": "Help to normalise 1NF to 2NF",
"score": 2,
"comment counts": 4,
"content": " Hullo i need help anyone can explain to me how to remove partial dependency to normalise 1NF to 2NF. I still dont understand after reading every source i can find ",
"comments": [
"Easiest to explain by example really - mock up a dummy dataset and you'll probably get more useful explanations",
"this [video helped me understanding it](https://www.youtube.com/watch?v=GFQaEYEc8_8)",
"if table is already in 1NF then see how can you decompose it further so that divided tables have one primary key with full dependency instead of partial dependency",
"a"
]
},
{
"id": "1c1mmtg",
"datetime": 1712860795.0,
"flair": "AI",
"title": "How to formally learn Gen AI? Kindly suggest.",
"score": 0,
"comment counts": 24,
"content": "Hey guys! Can someone experienced in using Gen AI techniques or have learnt it by themselves let me know the best way to start learning it? It is kind of too vague for me whenever I start to learn it formally. I have decent skills in python, Classical ML techniques and DL (high level understanding)\n\nI am expecting some sort of plan/map to learn and get hands on with Gen AI wihout getting overwhelmed midway.\n\nThanks!",
"comments": [
"Hey!\n\nI am a \u201cGen AI Engineer\u201d :\u2019) so i think i might be able to provide some guidance here. I\u2019ve only talked about text models here. So:\n\n- Learn about the attention mechanism. (No need to deep dive. Just understand what it does).\n\n- Transformers vs RNNs vs LSTM/GRU (Again a brief overview should suffice).\n\n- Different types of LLMs based on transformers. Encoder-Decoder, Decoder-Decoder, etc. Just skim through what types of architectures are popular LLMs such as GPT 3.5/4, Llama2, Mistral 7B or 8x7B based on.\n\n- Open Source vs Closed Source LLMs: Which ones are better at the moment? Different companies involved in the LLM rat race such as OpenAI, Google DeepMind, Mistral, Anthropic, etc. How to access these? For open source explore platforms such as Huggingface and Ollama. \n\n- Prompt Engineering: Get comfortable with writing prompts. I would suggest Andrew NGs short course on prompt engineering to understand methods such as few shot learning.\n\n- Learn about each of these: What are tokens? What are Vector Embeddings and what are some popular embedding model available today?Why do we need VectorDBs such as FAISS, Pinecone or ChromaDB etc? What does context length of an LLM mean? \n\n- What is Quantization of LLM weights? Difference between 4-bit, 8-bit, 16-bit LLMs. \n\n- Retrieval Augmented Generation or RAG: Understand how training data used for LLMs might not have all the info you need, RAG allows you to perform question answering on your personal documents. At this point, you might want to explore frameworks such as Langchain anf LlamaIndex. These provide one stop solution for all GenAI related requirements of your application.\n\n- Finetuning LLMs: Why do we need to finetune LLMs? How is it different from RAG? How much GPU memory/VRAM would I need to finetune a small LLM such as Llama2? Techniques such as LoRA, QLoRA, PEFT, DPO etc. Finetuning an LLM would require some understanding of frameworks such as Pytorch or tensorflow. \n\n- Advanced features such as Agents, Tool use, Funtion calling, Multimodal LLMs, etc. \n\n- Access various opensource models such from ollama or huggingface. Also get familiarized with using OpenAI\u2019s API. \n\n- I would also suggest try to work with streamlit. It\u2019s a very convenient way of creating a frontend for your application.\n\nThese were some points that i thought you might find useful. If you have any further questions, please feel free to reach out.",
"Ask ChatGPT",
"Try some of the free AI courses by Google. Here are some relevant ones I found:\n\n1) Introduction to Generative AI (45 mins): Learn what Generative AI is, how it is used, and how it differs from traditional machine learning methods.\nhttps://www.cloudskillsboost.google/course_templates/536\n\n2) Introduction to Large Language Models (30 mins): Explore what large language models (LLM) are, the use cases where they can be utilized, and how you can use prompt tuning to enhance LLM performance.\nhttps://www.cloudskillsboost.google/course_templates/539\n\n3) Encoder-Decoder Architecture (8 hours): Learn about the encoder-decoder architecture, a critical component of machine learning for sequence-to-sequence tasks.\nhttps://www.cloudskillsboost.google/course_templates/543\n\n4) Transformer Models and BERT Model (8 hours): Get a comprehensive introduction to the Transformer architecture and the Bidirectional Encoder Representations from the Transformers (BERT) model.\nhttps://www.cloudskillsboost.google/course_templates/538",
"Just pick a starting point and start running. It's a rabbit hole tbh.",
"I can suggest the \"LLM University\" by Cohere. Just searching in their website, there are several modules about LLM (starting from basic NLP concepts to more advanced topics)."
]
},
{
"id": "1c1i2hf",
"datetime": 1712849620.0,
"flair": "Career Discussion",
"title": "International remote work in the EU",
"score": 0,
"comment counts": 9,
"content": "Hello there. I'm in the early stages of my career\u2014half a year of experience as a junior analyst and starting my master's in data science this year.\n\nI wanted to ask about remote international employment in the EU (e.g. working in Germany remotely from Hungary) or\u2014if possible\u2014in the US. I would love to hear from someone with experience with such jobs; not necessarily in data, just in IT in general.\n\nMy primary motivation is money, but I also just don't really like the culture/work culture in my country, so I would like to apply for jobs like that in a year or two.\n\nIs it a very rare/difficult thing to do? How would I need to adjust my approach? What would I need to focus on in relation to my experience, education, and resume? Which fields and types of companies should I be aiming for? Is a fully remote job of that nature even possible, or is hybrid with weekly plane trips the best you can get? I would rather avoid freelance gigs, at least for the foreseeable future.\n\nIs it at all realistic for me to be thinking about it this early? Anything else I should consider? Thanks a lot in advance.",
"comments": [
"You actually can do it, but you should expect that a lot of companies (if not all) would consider your local job market when discussing the wage. Here i know some US-based companies who seriousluy offer something like 1700Eur/month for C++ developers, saying \"Hey, it's 2.5 times local average wage, you should be happy about it!\"\n\nAlso, taxation could be an issue too.",
"Wondering if you work for an U.S. and receive the U.S. wage, do you still need to pay EU income tax rate? That's gonna be damn high",
"Working for a company in Country X while actually living in Country Y is very uncommon, outside of going freelance or signing up with a 3rd party contracting agency. There's a handful of companies that hire globally remote that would be okay with it but given that you have basically no experience, those companies are probably out of your league. If you're in the EU why not just move to a different EU country and get a job there?",
"My company has offices all over and hires data/tech employees in the US, multiple European countries, and India. Most of us work remotely but the pay is competitive for the local market you live in. So the range for US employees is higher than the ranges in other countries for the same type of role.",
"Ive seen this setup before on a job posting:\n\n- Company is registered in many countries > You can only work full time in those countries.\n\n- Company also allows remote work in other countries they dont operate in, but only if the country offers some sort of freelancer/nomad visa for less than 1 year or 6 months. Also, you would need to prove to the country that you earn significantly higher.\n\nOR\n\n- Work anywhere but you will be a freelnce, contractor, or some sort of 3rd party in a legal sense. Essentially, in the eyes of your government, you will be a business.\n\n\nRe: US, you will need a working visa or a partner/spouse/parents\n\nRe: wage, companies typically pay you according to your location. Like, they have pay zones. If you notice a job posting for US companies, for example, they will mention that pay is x amount in Zone 1: NY, CO, CA while pay is y amount in Zone 2: TX, WA.... so on and so forth. \n\nRe: tax, it depends on the countries. Most likely, you will pay tax wherever you are legally cinsidered as a tax resident.\n\nEdit: they dont offer these to entry-level tho. Maybe look at Apprenticeship in EU."
]
},
{
"id": "1c0uyvo",
"datetime": 1712779218.0,
"flair": "Discussion",
"title": "The best place to network as a Data Scientist",
"score": 8,
"comment counts": 22,
"content": "Hi all, \nIn a week, I'll be flying to the States. I'm looking for nice places to network as a senior DS with over 10 yrs of experience in consulting, management but mostly in building AI products (I tend to favour Python over meetings). I'll be visiting Washington and New York and have 2 weeks time. Things on my list:\n\n- a DS conference (in Washington on AI in finance; will not post the link as this may seem like a hidden commercial, which it is not), \n- several meetups in NY as it seems to be the place with the most vacancies, \n- I want to go to workspots/techhubs/incubators to network and I also have to do some work. Any recommendations for places that actually stimulate networking (not WeWork and the likes) are highly appreciated! \n- I actually got the advice to go to a gym where a lot of rich tech people go called equinox. Not sure how serious I should take that. But I like weird stuff like that as it might also lead to some surprise connections. \n- Any good hostels where a lot of nomads/tech people live would also be nice.\n\nLet me know if this resonates with you and/or if you have any feedback.\n\nMuch appreciated!\n\nhttps://preview.redd.it/onsoekfi9ttc1.png?width=3362&format=png&auto=webp&s=5f4149437fcb0553255fdd081ffaed3bcd7e7692\n\n",
"comments": [
"Conferences have better ROI, followed by lunch invitations to companies that you are curious about (i.e. if you have friends that work there), and then meetups.",
"SF and the Bay Area seem to be the places where all the AI/ML/DS action is these days. Swing by if you get the chance. Ultimately, the concentration of DS-related everything is lackluster compared to SF. All the heavy hitters in the industry are concentrated in seven square blocks near Hayes Valley.",
"I\u2019ve had the same issue!\u00a0",
"co-working spaces",
"I'm supposed to be doing a lot of networking. You have a good list. \n\nI've done a meetup. \nThat gym is legendary. \n\nI'm thinking a career fair. There's one Friday. Please help make me go."
]
},
{
"id": "1c0bof0",
"datetime": 1712719321.0,
"flair": "Discussion",
"title": "why is all dev tool innovation in the AI/ML space focused on the least time consuming stuff?",
"score": 118,
"comment counts": 76,
"content": "Every DS, DE, and MLE I speak to (including myself) spends a majority of their time focused on data prep. Even though this is taking up 70-90% of everyone's time all the investment and innovation focus seems to be on training, fine-tuning, and on-demand inference. Why? Is data prep not sexy?\n\nIt is imperative that high quality samples are being used for training, shit data is going to create a shit model. I feel like there should be more focus on making feature selection, data cleansing, and preprocessing less cumbersome, any thoughts?",
"comments": [
"This is such an important point that I'm working on at work and on my personal projects.\n\nThe biggest reason is that most data analysts/scientists want final clean data to work with, and the considerations of exploring better sources isn't something that interests many people. Plus, it is usually out of the scope of many people.\n\nHigh-quality data is difficult to obtain. Id love to hear other people's comments on this matter.",
"The data is very specific to each company, so it\u2019s hard to develop useful tools in this space. I\u2019ve seen my fair share of things that try to automate data quality checks and feature engineering, and none of them impressed me. Lots of basic aggregations that any team can do in a week without having to rely on external things.",
"Is this not exactly what Data Engineering is?",
"The problem for me is the risk aversion of corporate. Projects are always conceptualized and delivered as MVP\u2019s and then Iterated upon instead of the infrastructure forward approach we need to not waste time cleaning, so I\u2019m always building MVP\u2019s out of others groups MVP\u2019s instead of clean and documented data sources maintained by expert data owners.",
"I went through this exercise of trying to build an internal tool for work to handle our data prep issues. It was hard to come up with anything that could be reusable and reliable. Too many quirks and random issues. This was just within 1 small company. if you scale that use case to an entire industry, i dont see it working out."
]
},
{
"id": "1c0rhdm",
"datetime": 1712770731.0,
"flair": "Discussion",
"title": "A Tale of Two Cultures: Integrating Data Science and MLOps to Build Successful ML Products",
"score": 5,
"comment counts": 5,
"content": " When the excitement about data science became widespread about 10 years ago, this spurred a lot of proof-of-concept ideas. However, most of these stayed confined in Jupyter notebooks and never made it into production. There are multiple reasons why it has been a lot harder than initially expected to productionize ML models, but the one I want to focus on in this blog post is one that has not been explored in as much depth. In order to create business value, we have to marry two very different approaches: The ML lifecycle starts out on the exploratory data science side, but we eventually have to transition towards an engineering-driven approach in order to achieve the quality attributes such as availability, reliability, scalability, and security typically expected of production systems. Thus, what it takes to do good work in data science is fundamentally opposed to what it takes to do good work in MLOps, giving rise to different best practices, skill sets, and even mentalities (ways of thinking about problems) on each side. As a result, a central challenge for creating successful ML products is to find a good process for making these two different cultures work well together. \n\nThis is very detailed article by *Thomas Loeber, Senior Machine Learning Engineer at* Logic20/20, Inc. \n\nSource here: [https://opendatascience.com/a-tale-of-two-cultures-integrating-data-science-and-mlops-to-build-successful-ml-products/](https://opendatascience.com/a-tale-of-two-cultures-integrating-data-science-and-mlops-to-build-successful-ml-products/)",
"comments": [
"What is this, a commercial? It feels like someone wrote a poorly done Python script: \"This is very detailed article by\u00a0*Thomas Loeber, Senior Machine Learning Engineer at*\u00a0Logic20/20, Inc.\"\n\nIn case it is not a bot: I don't think I agree with the message of the article. You did not ask for feedback so let me know if you are, because I do think it is a very interesting discussion.",
"various factors",
"wow",
"WOW"
]
},
{
"id": "1c0exqh",
"datetime": 1712730610.0,
"flair": "Discussion",
"title": "Time series train test split",
"score": 10,
"comment counts": 14,
"content": "Hi all,\nSuppose I have 6 months of data at weekly level (24 data points) and I want to forecast for the next two weeks.\n\nIf Im doing a train test split keeping let\u2019s say 4 data points for validation. With this setting suppose i get a good arima or ets model.\n\nMy question is, at the time of forecasting, I\u2019m actually predicting for the next 6 weeks then (4 in validation, 2 in prediction). Time series models become less reliable the further into the future you predict. \n\nIn such a scenario, is train test split for time series something we should be trying out? \n\nPS: the time horizon and data are just to explain my thoughts. ",
"comments": [
"Reading your question, I want to say: overfitting is a real risk. If you train your model on your test set and use your validation set for tweaking, you will be effectively predicting for two weeks ahead (you would have 3 evaluation periods using Rolling or Expanding Window Validation: week 18-19, 19-20 and 20-21) and you should try to limit yourself from 'cheating'. Not including your test data is really vital in this stage.\n\nIf you fear that you won't be able to predict your test set to the upmost degree, please realise that this is valuable information in itself. Communicate your fears to business and manage stakeholders based on your data. If you are desperate to improve performance, strategies would be to include more variables into this model as well. But this typically requires a lot of data engineering.\n\nFinally, **retrain the model on the full dataset (so you know you have the best model going forward)**. Calculate how much data you need to tell if the model performs well or worse (power) to some baseline. In the coming weeks, you can evaluate using that metric. Keep retraining :-)",
"When doing time-series analysis, you can't rely on model training setup alone. You must incorporate domain knowledge.\n\nThe first question I would ask here is what is the hypothesized seasonality? If this data set you are modeling has a strong annual seasonal component (e.g. patterns associated with holidays, weather, events, etc.), then you are absolutely wasting your time thinking about validation, test, or forecast, period. Your dataset are extremely insufficient to make future predictions. In fact, just use a simple linear model out-of-the-box and its guesses would be as good as any.\n\nAssuming seasonality is not an issue, then the second question is what is the hypothesized AR lag. In other words, how long do I imagine the influence of an observation this week to last in the future? Similar to the seasonality problem, if the hypothesis here is that a current observation may influence 12 or more observations in the future, then drop everything you are doing. It's useless... A linear model or a moving average are as good as any..\n\nAssuming lag is also not a problem, then I would ignore doing the test split and keep it at first to train and validation since 24 points aren't too much to play with. In general, the validation length shouldn't be far off the forecast length you are interested in. I would use the validation to choose between multiple and parameters, then once model/params are chosen, I would re-attach the validation to training and train on the entire dataset.",
"Train/valid/test split is to evaluate your model when building it. \n\nSo IMO after you use the valid set to hyperparams tuning and use the test set to estimate model performance, you should retrain the model with all the data before predicting 2 weeks into the future",
"If your forecast horizon is two weeks, best is to test on only two weeks.\n\nFirst train on the 20 first data points, and test on the 2 next.\nThen train on the 21 first points, and test on the two next.\nThen train on 22 points and test to the 2 last. \nThis should allow you to get a good idea of the performance for your true forecast that uses all 24 points to predict the 2 next weeks.\n\nFor instance see https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split",
"For the way your are making the question i think you are trying to make a model reliable over the long run even when new data comes and theorically the older ones should be less relevant. I tried to do something similar in the past. The answer i think it depends from what are you trying to estimate. which dynamics rely on the process that generate data?\n\nI do not know the context, but certainly is possible that train test split can be used. But with some conditions. I sincerely recommend you an adequate cross validation, so not the \"k-fold\" (that does not opportunely reply the estimation process) but the \"sliding window cross validation\". Choose the version that best suit your need.\n\nTo implement it, i suggest the library sktime in python.\n\nFurthermore, in order to make the older estimates less important you can see for some weighting approcah on errors or you simply keep only the latest timespan for training.\n\nRemember always to not insert in the model features that has information from the future or you will have overfitting.\n\n \nHope that helps."
]
},
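A minimal sketch of the expanding-window evaluation discussed in the comments above, using scikit-learn's TimeSeriesSplit (the class documented at the link in the thread); this is illustrative only, assumes scikit-learn >= 0.24 for the test_size parameter, and is not part of the original posts:

```python
# Sketch (illustration, not from the thread): expanding-window evaluation of a
# 2-week forecast horizon on 24 weekly observations with TimeSeriesSplit.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

y = np.arange(24)  # stand-in for the 24 weekly data points

# Three folds of 2 weeks each: train on the first 18/20/22 points,
# test on the 2 points that immediately follow each training window.
tscv = TimeSeriesSplit(n_splits=3, test_size=2)
for fold, (train_idx, test_idx) in enumerate(tscv.split(y)):
    print(f"fold {fold}: train weeks {train_idx[0]}-{train_idx[-1]}, "
          f"test weeks {test_idx[0]}-{test_idx[-1]}")

# After choosing a model and parameters on these folds, the advice repeated in
# the thread applies: refit on all 24 points before forecasting the next 2 weeks.
```

The same rolling/expanding-window idea is available in sktime (mentioned in the last comment), which offers dedicated splitters for forecasting, but the scikit-learn version above is enough to show the evaluation pattern.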
{
"id": "1c05xju",
"datetime": 1712703299.0,
"flair": "Career Discussion",
"title": "Is anyone familiar with the state of the academic job market for data science/ML/statistics?",
"score": 21,
"comment counts": 16,
"content": "I frequently peruse [r/AskAcademia](https://www.reddit.com/r/AskAcademia/) and they always talk about how getting a tenure-track job these days is a pipe dream. They frequently cite some statistic (not sure where it's from) that 2% of graduating PhD's get a tenure-track job. They also say that even when filtered to just STEM fields, this figure is somewhere around 9-10%.\n\nI'm in the early stages of a PhD in statistics with research focus in ML at a fairly reputable program. My professors have all told me that getting a tenure-track academic job should be very doable, since I have no restrictions as to where I can/can't live and have a good advisor who I'm doing productive research with. They always say that because so many students in the field take higher-paying jobs in industry, there's a little bit less competition for those who want to get into academia. Which does make sense. But then I see the doomsday advice in [r/AskAcademia](https://www.reddit.com/r/AskAcademia/), and wonder if my professors are out of touch and/or not being honest with me about the state of the academic job market.\n\nIf anyone has recently been on the academic job market in data science/ML/statistics/etc., I'd love to know what your experience was like.",
"comments": [
"Disclaimer: STEM background but pivoted to applied ML during post doc. Also a private institution so YMMV\n\nI was trying to get tenure track maybe 2-3 years ago. All I wanted to do was teach and for a long time I set my sights on that. \n\nAfter my post doc I was able to get an assistant prof job through a connection but it turned out to be horrible. I was discriminated for being younger-ish compared to all the other profs, lots of boomer-esque politics. I didn\u2019t get any opportunities to teach any good classes, just intros to whatever since no one else wanted to teach it.\n\nWhich all would have been fine if I felt like I was getting paid enough but I ended up having to get a second job to make ends meet \ud83d\udc80 which I always had during grad school and post doc but I thought I could stop once I became a professor. I asked around to see how I could get a raise and tldr was I couldn\u2019t unless I got consistent grant funding (which is nearly impossible unless you suck on a teat of a famous senior professor, it\u2019s all about who you know in small fields)\n\nAnyways life got better after I switched to industry. I miss teaching a lot but I am a lot happier, healthier, and more well off financially. If academia could compete with industry salaries I would think about coming back but IMO the institution as a whole needs a reformation. If you happen to have generational wealth and are comfortable with not making a lot of money then academia isn\u2019t a terrible idea.\n\nEdit: I also remember reading somewhere a long time ago that universities were looking to do away with tenure track positions due to the lack of accountability (read: potential for abuse) that tenured professors had. But I haven\u2019t kept up with that",
"We hired recently for someone with your profile at a r2/r3 that would have started you at 140k in a business school. A&S salaries at our institution are significantly lower if you stick with math & stats departments. There are a dearth of good candidates from what I can see.\n\nDo you have any applied experience beyond academia? Are you willing to work beyond an A&S department? Can you teach across tools and programming languages? Have you got significant undergrad teaching experience already? All will matter somewhat in institutions like mine.\n\nEdit: that said with FAFSA screw up and demographic cliff it\u2019s not looking good for hiring over the short term at my institution.",
"I think you will have a good chance of finding a TT job if you don\u2019t have restrictions on location. Also, ask to the graduating class of this semester. They all should have had offers by end of April. (A mathematician trying to get out of academia.)",
"It's hard to know how realistic the profs are. How many of their former students are profs? If you really want to be a professor, then certainly go for it, but understand there is a significant chance that you will fail, so have you backup plan well thought through",
"What journals are you targeting?"
]
},
{
"id": "1c00y65",
"datetime": 1712691171.0,
"flair": "Career Discussion",
"title": "Has anyone taken the Master of Applied Data Science from the University of Michigan on Coursera?",
"score": 32,
"comment counts": 28,
"content": "What kind of things can I do to prepare for it? Would you recommend it to someone wanting to enter the data science field? Any advice helps, thanks!",
"comments": [
"I believe a lot of it comes from these free courses: https://www.py4e.com/\n\nHe's one of the professors in that program, or at least they use his courses. There's also a Django, webapps, and sql course in that series.",
"Created a throw away as to not dox my main, but I'm currently about half way through the program.\n\nTo prepare for it, brush up on Python. The first couple classes will hold your hand a bit, but you've still got to be somewhere between a beginner and intermediate Python user or you'll spend hours on assignments. I'd also recommend brushing up on statistics, calculus, and linear algebra. The program has just two classes that dive deep into teaching you the math. The rest assume that you understand enough to at least apply the concepts. For example, the class on data mining will dive into matrix decomposition and expect you to already be familiar with the linear algebra behind it. You'll ultimately do the analysis with Python, so you don't need to be an expert on the math, but you'll need to know the how and why behind the steps you take.\n\nThe cons of the program. The program is quite expensive, even if you're in-state. I think there are other programs that can give you a degree and the same knowledge for a fraction of the price. The first three or four classes are pretty much a repeat of what's on Coursera. They're just auto graded and everything. For the price you pay, it feels like a rip-off. Despite what other commenters have said though, that changes after the first few classes. The content goes far beyond what is available on Coursera. Assignments are not just auto-graded and require you to write papers/reports in addition to providing your code and output files. The program includes two milestone projects and a capstone project that give you a lot of flexibility and will give you nice projects to include with your resume. My last gripe with the program is it is entirely run through Jupyter Notebooks. It leaves a big blindspot that you will need to learn independently on proper IDE's, CI/CD, and deploying apps/models into production.\n\nThe good of the program, it's setup entirely in an asymmetric learning environment, so there's a lot of flexibility for those working. I work full-time while pursuing the degree. There is easy access to the professors. They're all responsive to slack and emails. They hold regular office hours that can be attended virtually. As I'm local, I've had no trouble meeting in person with professors as well. Some of the content in the classes is older, but they update regularly as the field and best practices change. They do a pretty good job of promoting networking amongst the cohort for those who want to network. You also get access to the UMich careers page, network, and brand. The name and network have a lot of value.\n\nFrom a career perspective, my background is accounting and finance and I've spent ten years in large financial institutions in compliance/audit/risk management. My role was data analyst work, with Excel and VBA scripting. I self-learned Python and had begun using Python scripting to automate workflows and build dashboards. I was wanting to continue down that path and become an ML ops engineer. I think the program, when I'm done, will have more than given me the requisite skills for that, outside of the gaps noted in the cons section. I haven't changed jobs externally since starting the program, but I do apply regularly to keep my interview skills sharp, and see what's out there. I have not had trouble landing data science/engineer interviews/offers with my background and the degree partially completed at blue blood though non-FAANG companies. I have not landed a FAANG interview yet. 
In my current role, admittedly surrounded by non-data scientists, I've been able to apply what I've learned to update/improve processes. My bosses act like I'm learning magic. It gets me face time with executives and I got a promotion and 30% raise a couple months ago.\n\nIn summary, yes it will give you the skills and it comes with the Michigan brand, which carries a lot of weight. However, if you're paying for it, I'd think long and hard about the value proposition of the program vs some other well regarded universities like Georgia Tech's program. I'm in-state and my employer pays the tuition, so for me I haven't regretted my decision to go with U of M and this program.",
"I\u2019ve looked at the program in the past few months and I think there are better options to be honest. The school being on your resume is obviously a big plus but out of state tuition is almost 50k. I live in-state and couldn\u2019t justify paying 36k for it. The degree is also from the School of Information and not a computer science or engineering department. For preparing I believe all you need to know is basic Python and basic statistics. There is some sort of coding proficiency test you have to take in order to be admitted but if you take the Python 3 specialization on coursera from UofM, it gets waived. I believe they teach you SQL, and the more advanced math and stats are sprinkled in the curriculum. There are a couple GitHub repos out there that have a review of all the courses. Personally I have been looking at CU-Boulder\u2019s options on Coursera. Obviously it\u2019s not the same brand as Michigan but it\u2019s still a good school and much cheaper I think it\u2019s 15,700 ish. If money isn\u2019t a problem then I guess it\u2019s a pretty good choice.",
"I'm currently in the Colorado, Boulder MS program. I brushed up on calc to get prepared; that really helped.",
"I have been looking into this as well. What attracts me to this one over others is that it\u2019s project based as opposed to test based so they emphasize that you leave with a portfolio that\u2019s actively worked on as part of the curriculum. I\u2019m also located in a place where Michigan is very highly regarded - I have thought about the fact that the program is within the school of information has opposed to something more CS or Statistic related but I suppose the only thing that will appear on my resume is UMich, the degree, and the date so I\u2019m not too worried about that. \n\nInterested to hear others\u2019 ideas though."
]
},
{
"id": "1c0btox",
"datetime": 1712719784.0,
"flair": "Career Discussion",
"title": "(Deep Learning vs Data scientist) Need some advice",
"score": 7,
"comment counts": 21,
"content": "I'm an undergrad student (mechanical engg) learning data science through online courses and resources, I need some advice for my career. \n\nI'm able to see 2 types of job postings, one is a traditional data science role, others are also data science roles but requiring deep learning skills. \n\nI have seen, for data science roles requiring deep learning, usually prefer people from PhDs or higher educations. \n\nSo my question is should I even target or try to get into these roles requiring deep learning skills because I doubt if I can compete with PhD students for such roles. \n\nOr should I stick with traditional data science and data analyst roles. ",
"comments": [
"I am a chemical engineer turned data analyst/data scientist and a product manager of sorts. \n\nWithout formal education and/or internships to back up your resume, it would be challenging to break into the field\n\nIf you are a forever learner like myself. Joining on an analyst role might be a little of an academic downgrade, but the soft skills you'll learn are invaluable. It's much easier for me now to connect between data scientists and business propositions.",
"A PhD is for research - you\u2019ll be over qualified anywhere but itll help you Jump a few year in the corporate latter; not worth it\n\n\nIf you take maths and cs courses you\u2019ll be fine doing deep learning; take AI electives and do well. Go for entry level roles and in 3-5 years depending on your skills you can get deep learning roles. I have a degree in math and social science and did supervised machine learning; the jump to deep learning wasn\u2019t hard with 1 year of work experience",
"Both paths are very difficult to get in! I just graduated with a Master in Stats(Took all my electives in ML and DL) and about 1.5 yrs full time Analysts exp, as well as an internship from FANNG. To be honest, I have no idea what's wrong with my background and resume, I hardly hear back from DS/MLE role, and even Data Analyst in today's market. However, my two cents is to get any data-related job first, and then transfer on the way.",
"I'd go the data science route. Its the foundation and you want to build those skills first. There are actually more roles out there that need you to be solid in your basic skills and perhaps basic deep learning work (not gen AI) than the other way around. Get into a data role in a company first and then keep learning.",
"Given that you are still an undergrad, it depends on what you want for the coming years, really. \n \nIf you are for instance excited about a Master and have a pet project that allows for a year long deep diving into whatever deep learning thing you like, I would say you are very likely to get a far better chance at landing a high payed job there than someone that applies for generic DS roles. There are no guarantees... but you have time. Also, these skills will also nicely transition into the more 'generic' roles. There is one big catch: you might be a bit saddened by the lack of neural nets you'll find in a typical company, despite what they say in their vacancies. But I feel like everybody that leaves academia goes through that crisis, so why not make it a real one and study what you love :P. \n \nIf you don't really like to dive deep, or don't have the skills/patience, there's no shame in going for a bit more generic impact and going for a 'safer' career. I see a lot of people saying that it is a tough market, which is true in comparison to earlier times, but I also think that there are areas which are worse off. I sound like an old man but... follow your interests. I would not necessarily say the same to a political science student with a passion for... the roots of ancient marxism, but you are in a field where passion can take you far (also in salary terms) without a lot of risks if you are willing to work."
]
},
{
"id": "1c15fax",
"datetime": 1712807406.0,
"flair": "Career Discussion",
"title": "How much does your undergrad degree matter?",
"score": 0,
"comment counts": 26,
"content": "For multitude personal reasons, I am a \"late bloomer\", and wasted most of my potential in my twenties. Got paid more then, but it was terrible on my mental health, and I hated every minute of it. \nI got a business degree from a no-name school a decade ago. I am pretty old for starting out. \nI went back to school while working in unrelated field, which got me into a reasonably well-respected master's program in quant econ, and landed that first DS job two years ago. I am now a senior DS, but paid way under my demonstrated value. (I turned down a 200K job in my old field to chase DS, and now make like 110K... first kid on the way, and I am having a hard time justifying this to myself)\n\nBig companies don't take a second look at me. My target field is radio silence. Not even rejection emails most of the time. My resume is super well polished. I do get interviews for smaller companies in my current niche fairly easily.\n\n I have funded research in biomedical AI (independent). I have done some LLM work on the side for a startup, and also for a doctor's research. I started doing some hardware work too to pick up some C++, and expand my general knowledge.\n\nEven with warm introductions from people in my target area (biomed, pharma, medicine), I get no responses to my messages.\n\nAm I too old? (Early thirties)\n\nIs my bullshit B.S. holding me back?\n\nAm I still too green? \n\n\nI just want to work applying AI to improve patient outcomes, and I am not finding any way in. \n\n\nJust looking for some perspective. \n\n\nThanks!",
"comments": [
"For an insecure person you sure earn a lot of money",
"You ruined it, completely. I mean, you should\u2019ve made better decisions. Look where it got you. Only six figures and a graduate degree and a string of enviable titles. What are you even doing with your life, slacker.\n\nYou might as well quit now and go flip some burgers and get your waving arm in shape to be a Walmart greeter cause that\u2019s all you\u2019re really worth based on your meager efforts.\n\nThis fucking website here and you fucking people\u2026",
"You're acting like six figures is an average salary or something",
"You sound like a victim of the market. I think if I saw your business BS with Econ MS and data science experience, you would at least be a noticeable candidate. \n\nDepends on what the employer is looking for though. If they want someone more ML focused, they might go for math and CS candidates. If they want a generalist, they might mix all sorts of candidates for interviews. If they have econometric or statistical problems, they might go for stats or Econ candidates. If they want a Storytelling person, they might go for a business candidate. Perhaps more positions are leaning towards the technical candidates. There are at least some positions calling for the Econ/business skills, a matter of finding them.",
"Here's some perspective: the median household income in the US is approximately $75,000. I understand the insecurities, but regardless of the job market being ass, you are doing great compared to most people and you should take a little time to breathe and remind yourself of that.\n\nKeep working hard but don't forget to celebrate how far you have already come :)"
]
},
{
"id": "1c02l55",
"datetime": 1712695172.0,
"flair": "Career Discussion",
"title": "How much does degree title matter vs skills and classes taken for an MS?",
"score": 18,
"comment counts": 21,
"content": "I'm (an American) in a biostatistics MS program, but I have the opportunity to finish early by a summer and a fall semester (7 months) with my departments online \"applied statistics and data science\" MS. The research that I've been apart of has mostly been data cleaning and building an R package to submit to CRAN. I've basically finished the core classes for our PhD, but I'm more interested in math heavy software development than original research. Tech skills: Python, R, SAS, and I'm rusty on PHP, JS/React, SQL, which I used years ago for past projects.\n\nThe program isn't placing me in debt. I have research funding, my Post 9-11 GI Bill covers housing costs, and I'm still a reservist in the military, which offers a 401k (TSP) and heavily discounted insurance\n\nEdit: I'm pretty ignorant of how the titles would be viewed when screened by HR/software/and particular industries. Eg I've heard it's easier to land a traditional biostatistics role with a degree in (bio)statistics vs data science",
"comments": [
"Maybe I'm not quite following - is there a disadvantage to having your degree in \"applied statistics and data science\" as opposed to \"biostatistics\"? If not, why not finish up 7 months early and take the applied stats/DS degree?\n\nI do personally believe that your skills and coursework are more important than the title of your degree, especially when we're talking about two closely related fields like statistics and data science.",
"Answering the question from the title, degree title matters not. I have an MS in Wildlife and Fisheries Sciences and have worked as a Sr. DS in financial and management consulting, healthcare, and most recently in FAANG. Skills matter; not degree title. The latter might get your resume noticed in ATS, but if you can't do shit, you won't get a job.",
"I\u2019m also confused, I don\u2019t see any reason not to gain more education for free",
"Stay with the Biostatistics track. It will be much easier to land roles in pharma/biotech industry if you ever have interest in going that route.\n\nBiostats is a very well regarded degree in general for DS/applied science roles in industry. It will be assumed you have significant training in mathematical statistics, experimentation and causal inference, which will always be in high demand. If you have the software development skills you\u2019ll be viewed as a unicorn by hiring managers.",
"I'm a hiring manager, so a sample of one. :-) I care more about a certain level of training and experience than degree titles. That experience can come from a graduate research thesis, working as a researcher at the university, an internship, etc.\u00a0\n\n\nI personally would stay and get the extra coursework if you can afford it. You will already be well positioned with biostat, but this gives you a little extra time to have said you worked on your research. This time.is also great to sharpen your programming skills, work on some soft skills, etc.\u00a0"
]
},
{
"id": "1bzzal7",
"datetime": 1712687114.0,