ds_subreddit_hot_posts.json

[
    {
        "id": "1c4cxoj",
        "datetime": 1713153681.0,
        "flair": null,
        "title": "Weekly Entering & Transitioning - Thread 15 Apr, 2024 - 22 Apr, 2024",
        "score": 2,
        "comment counts": 12,
        "content": " \n\nWelcome to this week's entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include:\n\n* Learning resources (e.g. books, tutorials, videos)\n* Traditional education (e.g. schools, degrees, electives)\n* Alternative education (e.g. online courses, bootcamps)\n* Job search questions (e.g. resumes, applying, career prospects)\n* Elementary questions (e.g. where to start, what next)\n\nWhile you wait for answers from the community, check out the [FAQ](https://www.reddit.com/r/datascience/wiki/frequently-asked-questions) and Resources pages on our wiki. You can also search for answers in [past weekly threads](https://www.reddit.com/r/datascience/search?q=weekly%20thread&restrict_sr=1&sort=new).",
        "comments": [
            "So I'm currently getting a Bachelor in Computer Science at WGU. Should I get master in Data Analytics after or find somewhere else that offers DS masters?",
            "Hello! I have an interview coming up next week at a startup, where the role requires some expertise in causal analysis. This involves identifying issues, understanding their underlying causes, improving the product by addressing these problems, and then conducting tests and sensitivity analysis to verify the results.\n\nRegarding my background, I\u2019ve spent 1.5 years working as a data scientist, including 1 year as an intern and half a year in a full-time role. My experience has primarily focused on exploratory data analysis, ml modeling, and A/B testing, with less emphasis on causal analysis. Although I have a theoretical background in causal inference from my statistics coursework, I haven\u2019t had the opportunity to apply this knowledge to real-time data. Could anyone recommend resources or Kaggle competitions for practical experience in causal analysis? If you are a DS professional who does causal inference/modeling, could you share insights on how to effectively frame problems and set up hypotheses? Additionally, I would appreciate recommendations for widely-used causal analysis libraries in Python that are industry standard.\n\nThanks in advance!",
            "Hi guys, I'm in need of good resources on statistics/probability geared towards machine learning to work through in the next few months. My knowledge of machine learning is quite limited currently, but I'm working through a python course on machine learning that I should be completing in the next month or so. My end goal is to either get a job as a data scientist (which would be very difficult given my current skillset) or to get admitted to a good masters program in data science in the next application cycle.\n\n\n\nI've had a hard time finding resources online as I'm specifically looking for courses/books that don't shy away from math and explain their methodology thoroughly. I have bachelors and masters degrees in math and a strong background in linear algebra and analysis. I've also taken a couple of intro courses in statistics, but I'm interested in building more foundational knowledge as I get more familiar with probability and machine learning.\n\n\n\nI would appreciate any guidance/advice or resources you guys would be willing to share!",
            "Hi all, \n\nI am seeking advice on the best step I should take in my career. I am transitioning to data science from academia. I have been offered an entry-level position in a medium-sized company that is developing its relatively new data science team. My impression is that they are not entirely clear on what they want to do or where resources should be best focused. They seem to be in an exploratory phase, determining which avenues will provide the most value to the company. I do not know the competency level of members of the current team. The company is not high-profile, but it may provide a good opportunity for career progression if I can help them build a helpful data science program. The position is also 100% remote.\n\nI also have an offer for a \"fellowship\" where I would be paired with a higher-profile company, either in the public or private sector, and carry out a machine learning project with them (previous examples include NLP, predictive models, RAG, etc.). At the end of the fellowship, there is a high probability (\"95%\" according to them) that I will offered a position by the partner company. The fellowship is highly competitive, so should make me more attractive to recruiters. This fellowship also requires in person attendance and could provide some valuable networking opportunities.\n\nI am struggling to decide which option to take. Does anyone have advice on which option might be best?",
            "Not sure if this is the best thread to ask, but does anyone have statista access? I want to use 2 of their datasets for a class project"
        ]
    },
    {
        "id": "1c4nwy0",
        "datetime": 1713192080.0,
        "flair": "Discussion",
        "title": "WTF? I'm tired of this crap",
        "score": 549,
        "comment counts": 161,
        "content": "Yes, \"data professional\" means nothing so I shouldn't take this seriously.\n\nBut if by chance it means \"data scientist\"... why this people are purposely lying? You cannot be a data scientist \"without programming\". Plain and simple.\n\nProgramming is not something \"that helps\" or that \"makes you a nerd\" (sic), it's basically the core job of a data scientist. Without programming, what do you do? Stare at the data? Attempting linear regression in Excel? Creating pie charts?\n\nYes, the whole thing can be dismisses by the fact that \"data professional\" means nothing, so of course you don't need programming for a position that doesn't exists, but if she mean by chance \"data scientist\" than there's no way you can avoid programming.",
        "comments": [
            "Data professional could mean being a data entry clerk, or working as a data analyst using only Excel, and maybe a little bit of SQL. I wouldn't read too much into it.",
            "I think this is just an accurate post. Data professional covers a broad range of jobs where the primary function is working with data. There are plenty of jobs under that umbrella that do not require programming. Analyst/entry/design. It doesn\u2019t mean top tier data jobs don\u2019t require programming. What a strange thing to rage about.",
            "It feels like you posted this looking for a reason to be outraged.",
            "Dude...data scientist doesn't actually mean anything either.\n\nCalm down. The phrase 'know programming' is so vague as to be nearly useless as well.\n\nI am a data science manager. I know some code, I can script pretty well, but I am terrible at actually programming anything. That's how I would characterize my skillset.\n\nOther people think I am a code wiz. They are very, very wrong. :)\n\nThese are subjective terms. Most of them are not codified the way doctor, lawyer, realtor, or engineer are codified.\n\nLet people believe what they want to believe.",
            "you can get a long way with SQL"
        ]
    },
    {
        "id": "1c5f11i",
        "datetime": 1713270988.0,
        "flair": "Projects",
        "title": "Loading a trillion rows of weather data into TimescaleDB",
        "score": 3,
        "comment counts": 1,
        "content": "",
        "comments": [
            "I posted a while back [asking for help](https://www.reddit.com/r/dataengineering/comments/16z8h6l/how_to_efficiently_load_20_tib_of_weather_data/) [on loading tons of data](https://www.reddit.com/r/PostgreSQL/comments/16z7vqs/how_to_quickly_load_20_tib_of_weather_data_into_a/) and got lots of great advice and feedback. I ended up doing some digging to answer my question and wrote a post benchmarking the fastest ways to insert data.\n\nI'm still learning Postgres so if anyone has any feedback or questions, I'd love to hear them!"
        ]
    },
    {
        "id": "1c5fqyd",
        "datetime": 1713273095.0,
        "flair": "AI",
        "title": "Rule based, Recommendation Based Embedding",
        "score": 2,
        "comment counts": 0,
        "content": "Hello Coders\n\nI would like to share an experience and know your opinions. I embedded about 12K+ order lists from a takeaway order system. I used Cohere english v3 and openai text embeding v3 for the embed. I prepared questions for the embed I would like large pizza, green pepper and corn questions with semantic parser. The output answers of these questions vegan pizza, vegan burger added pepperoni topping coke side topping did not satisfy me. Complementary and suggestion answers gave one quality and one poor quality output. Of course, these embed algorithms are usually based on conise similar. I suddenly had the suspicion that I should use embed for this type of rule based, match based, recommended. I believe that I can do the attached data with my own nlp libraries with more enrichment metadata tags without embedding. I would be glad if you share your ideas, especially if I can use llm in Out of vocabulary (OOV) detection contexts.\n\nThank you.",
        "comments": []
    },
    {
        "id": "1c4oyx9",
        "datetime": 1713194707.0,
        "flair": "Career Discussion",
        "title": "Excel Monkey",
        "score": 74,
        "comment counts": 51,
        "content": "How much in your daily career life do you feel like an Excel Monkey where you spend most of your work load in Excel?\n\nI\u2019m currently in a modeling role in the insurance industry looking to see if it is time to branch out to other industries or if my expectations are too high. \n",
        "comments": [
            "What are you doing in Excel? Can you load the data elsewhere to model/analyze it more flexibly?\n\nI'd second the other post suggesting Pandas which can read/write Excel files and gives you access to the entire Python ecosystem. It's a pretty good start.",
            "Hard recommendation Python \ud83d\udc0d  learning. Especially pandas library you can use currently job",
            "I think people are missing the point. If you are producing a product for somebody, you have to produce something that is useful for them. Often times this includes an excel spreadsheet because even most C suites can navigate Excel. There is nothing wrong with Excel when you are working with data that is < 100k observations. \n\nAlso, I\u2019m in the same industry and work with financial models, most of them are based in Excel and the primary reason is because Excel is very explainable.\n\nTo summarize, there is nothing wrong with Excel. You need to work within your company\u2019s tech stack and produce something that is useful for the people that need it. If you aren\u2019t happy with the rigor of the work (this is where I\u2019m at) look for opportunities and ask your boss for more challenging tasks where you will be forced to use additional tools besides Excel. Or, leave the company and go to a company that is a bit more mature in their tech stack choices and methodologies.",
            "In my experience, so much depends on the corporation\u2019s IT and licensing agreements and appetite to support certain applications \u2026I\u2019ve been slowly getting my team to learn Python and using Anaconda, but IT sent out a memo saying we could no longer have Anaconda on our machines.  So, we\u2019re looking at other solutions with IT, but we\u2019re mostly Excel and Oracle SQL developer at the moment while IT sorts out what we can and can\u2019t have.  Excel and the Microsoft suite are safe and easy to maintain, but a lot of IT people are not familiar with R, Python, etc\u2026 from the standpoint of maintaining it and ensuring IT security.",
            "I know some python. But the company I\u2019m currently with doesn\u2019t use Python in this role. I believe my predecessor is an actuary and what the role is meant for.  What I\u2019m wondering is if it\u2019s time to look elsewhere or if I\u2019m likely going to just find myself in the same Excel position."
        ]
    },
    {
        "id": "1c5b5xz",
        "datetime": 1713256858.0,
        "flair": "ML",
        "title": "Interview Advice - Sales and Marketing Predictive Modelling ",
        "score": 2,
        "comment counts": 1,
        "content": "Its hard as an international to get internships in this market but thankfully I had the fortune to interview for a few F250 companies. \n\nI seem to be missing out for fine margins. One company team technical lead said that i would be a good fit but since there was just 1  opening, I got referred to another team to apply . This happened quite a few times with others except i wasnt referred to other teams. I prepared for wrong things in that interview. I was able to answer all but it was thinking on spot and  beating around the bush which definitely didn't help . Someone who knew it would sound more sure and knowledgeable and will get the edge .I know where i could have improved :( \n\nThis maybe my last opportunity to bag summer internship this year.   I want to give my best and try to leave no  stone unturned.  \n\nIt would be great of someone with experience in predictive Modelling in sales and marketing can tell me about some work done and commonly used questions / techniques. I did google and chatgpt but some real world / production level insights and some commonly used models and methods MLOps of this domain would help me a lot. \n\nAppreciate your support in the above matter ",
        "comments": [
            "Hey FellowMates, I dont know If I would be asking this. But can you give me Some Karma(10 needed). I need to ask(post) something in this sub regarding my project."
        ]
    },
    {
        "id": "1c4m8q6",
        "datetime": 1713187760.0,
        "flair": "Career Discussion",
        "title": "How to negotiate salary when doing an internal move?",
        "score": 22,
        "comment counts": 17,
        "content": "Hi all,\n\nBasically the title \u2014 any tips on negotiating the salary when doing an internal move, and the hiring manager / HR most certainly know at least my pay bracket, if not the exact salary I have right now?\n\nI only know some very rough numbers from colleagues and I tend to underestimate their budget / undersell when negotiating.\n\nThanks! \ud83d\ude4f ",
        "comments": [
            "If you're underpaid, companies typically will give you a token raise to keep you happy unless you come in with a competing offer. If you really like where you work but also want to get paid fairly then I'd suggest interviewing, getting an offer for a higher salary and using that as leverage to get the salary you deserve.",
            "One aspect that makes this more difficult compared to a salary negotiation when being initially hired is you have less leverage. When being hired you can walk away and they have to keep recruiting. I don't know if your move is a promotion, a lateral move, or a completely different team / position but not only do they know your current salary, you don't have a lot to bargain with unless you plan to quit if you don't get the salary you want. \n\nI think the best you can do is show salary data from Glassdoor or some other website if you get lowballed. Otherwise I wouldn't try too hard.",
            "I\u2019d recommend external interviews. \n\nLast I checked, life time earning studies show that that those who job hop more have approximately 50% greater life time earnings. **although, I\u2019m having a hard time finding the articles**, but there is an exception, those who job hop under a year are less likely to get more money, the optimal time to job hop is like 2-4 years",
            "If you\u2019re underpaid and then use a competing offer, they will most likely lay you off after giving you a raise as the expectations most likely will double which wont be proportional to whatever raise they give you. \n\nUsing a competing offer as leverage will most likely strain your relationship with the employers 9/10.  So just job hop.\n\nThere is nothing more an employer hates than an employee asking to be compensated fairly.\n\nThe only case of them increasing your salary is that if you are paid less than a new grad who makes 40k usd more than you and has less experience then maybe just maybe they will give you a 5000$ salary adjustment.\n\nIt has to be comically large for them do the bare minimum to adjust your salary to reflect experience and pay from colleagues. That is the expectation by law so your employer will do the bare minimum to retain you if there is no real risk of you leaving. Else that will be an easy case to prove in court for pay discrimination.\n\nIf anything when you ask for a competitive salary what they will do is to make sure to make you feel worthless and pathetic so that you do not value yourself. Very common strategy, the employer must first make you feel that you are worthless employee and that you work is crap regardless of what achievements you had in the past. The main arguments will be they care about current achievements, the budget is rough, and also be thankful you have a job in this economy. They have 1000-5000 candidates willing to do tour job for half the price.\n\nThis is a harsh reality that i saw first hand, the only leverage you have is when you are coming in a company. Afterwards you got no leverage, especially with how the economy is going. For sure dont expect this to get any better either because just like the dot com bubble in it took around 15 years for the job market to rebound. So do expect that to be the case for the next 15 years until we get a new recession 5-10 years later. In the meantime the economy will slowly rebound but not any time soon.",
            "Key thing: you don't have the same leverage as an external hire, so it's hard to truly \"negotiate\" salary. \n\nHaving said that - you do have *some* leverage. The leverage is \"well, if you hire me and don't give me enough money I will start applying to other jobs behind your back and then leave in 6 months\".\n\nOne thing to keep in mind: HR doesn't like people negotiating salary increased internally for lateral moves. So if you're moving from Data Scientist 2 in one team to Data Scientist 2 in a different team, expect very little movement in salary. And if you are getting a promotion HR is likely going to cap that as well.\n\nNow, here's my advice:\n\nPull your comp history over the last 2-3 years (whatever paints the best picture), and look at what you average yearly comp increase has been. If it's been low (say, 4% or lower), then I think you have a really good argument to make that taking on a new role with new responsibilities is something that you think warrants a raise in compensation."
        ]
    },
    {
        "id": "1c40xcl",
        "datetime": 1713120639.0,
        "flair": "Discussion",
        "title": "If you mainly want to do Machine Learning, don't become a Data Scientist",
        "score": 669,
        "comment counts": 170,
        "content": "I've been in this career for 6+ years and I can count on one hand the number of times that I have seriously considered building a machine learning model as a potential solution. And I'm far from the only one with a similar experience.\n\nMost \"data science\" problems don't require machine learning.\n\nYet, there is SO MUCH content out there making students believe that they need to focus heavily on building their Machine Learning skills.\n\nWhen instead, they should focus more on building a strong foundation in statistics and probability (making inferences, designing experiments, etc..)\n\nIf you are passionate about building and tuning machine learning models and want to do that for a living, then become a Machine Learning Engineer (or AI Engineer)\n\nOtherwise, make sure the Data Science jobs you are applying for explicitly state their need for building predictive models or similar, that way you avoid going in with unrealistic expectations.",
        "comments": [
            "The problem is that \"machine learning\" is the vaguest term in the world that encompasses everything from linear regression to ChatGPT.",
            "My title is data scientist and honestly about 50-80% of my day is spent either using pytorch and prototyping, doing more large scale jobs on aws or preparing data so that I can then prototype on pytorch and then move toward a large scale job on hpc\u2026 however after joining this sub and reading the posts, i feel like im in a unique position.",
            "Linear regressions are my bread and butter no matter how much I try to do something better. Interpretability and consistency are more important than accuracy in my field.",
            "89% of my \u201cdata scientist\u201d role is making pretty charts to put in PowerPoint products. I don\u2019t have enough professional ML experience to get paid as much as I currently do anywhere else. Someone save me.",
            "I\u2019m sure it\u2019s like this with most jobs but I think the data space has been seriously subject to a massive amount of hype and marketing. Everything has to be ML or Ai and 90% of companies are just suckered into buying services and platforms that just don\u2019t need. Our jobs also get hyper competitive. Need to know snowflake, docker, spark, Kafka, airflow, databricks, sql, nosql, and 10 billion other things that just don\u2019t make sense. It\u2019s getting tiring."
        ]
    },
    {
        "id": "1c4r5sf",
        "datetime": 1713200032.0,
        "flair": "Discussion",
        "title": "Does anyone use this potential alternative to gradient descent?",
        "score": 13,
        "comment counts": 8,
        "content": "&#x200B;\n\n[ Step 1: there is some loss\\/cost function but we don't know its optimal parameters ](https://preview.redd.it/vo2fb58taouc1.png?width=723&format=png&auto=webp&s=d700eadb8435238bcf549c71cf7974d0d1d27cc1)\n\n&#x200B;\n\n[ Step 2: solve for the derivatives at random points for the parameters and obtain tangent vectors for those points. ](https://preview.redd.it/td8kuu9waouc1.png?width=619&format=png&auto=webp&s=5e0d879dda0ebb502895350fc23a302393268f74)\n\n&#x200B;\n\n[ Step 3: Solve for where the vectors \\\\\"cross\\\\\" \\(when stretched\\) in terms of the parameters, and plug those parameters into the loss function. If it seems to be a good place, you could try gradient descent\\/back-prop starting from here. The vectors may not intersect at any point with respect to all parameters, but as long as they intersect with respect to enough of them you could try that coordinate. If not you can repeat the process until you find some tangent vectors that intersect with respect to enough parameters. ](https://preview.redd.it/ozkafxn0bouc1.png?width=616&format=png&auto=webp&s=81ba479a32c6572f6da6701d3d55d32e223d2507)\n\n I chose an example in 3 dimensions for obvious reasons. This might not run so fast or be so easy in higher dimensions. Curious to hear what people think though. And maybe this approach already exists? ",
        "comments": [
            "I have considered this, but not used it. There are a few potential problems that have so far dissuaded me:\n\n1. Since the vectors probably won't intersect, I thought you'd need to find the shortest line segment where the vectors most closely approach one another, and then take the midpoint of that segment. Seems complicated... I didn't think of finding an intersection in a subspace as you suggest here.\n\n2. If the loss is bumpy and weird and the probe points are too far away from another, they might be pointing to completely different local minima with nothing interesting happening between them.\n\n3. If the two points are both on the inside surface of a banana-shaped valley, the vectors will point *away* from one another- the only \"intersection\" or \"closest approach\" is *uphill* from them :-( .",
            "It might work assuming the optimization problem is convex (as shown on your pictures). General optimization does not hold for that assumption, and deep neural network losses are way far from being convex. That means that the presumed \"intersection\" (or a middle point of the shortest distance line as pointed out in another comment) may point into a local maxima or whatever ultimately fucking everything up.\n\nApart from the above and the computational complexity, the approach introduces new hyperparameters:\n\n1. the number of points to sample\n\n2. variance of the sampling process\n\n3. how the above variance should decay over time (otherwise I don't expect this random walk to converge at all)\n\n4. distribution from which the points must be sampled\n\nAlso, the memory requirements scale linearly with the number of sampled points, meaning you'll have to kind of instantiate the network N times (where N is the number of points for gradient computation). And considering the scales in our modern deep learning era that's just stupidly expensive.",
            "Rather than using vectors you could use tangent planes and pick new points on their intersections",
            "[deleted]"
        ]
    },
    {
        "id": "1c4uvfr",
        "datetime": 1713208788.0,
        "flair": "Statistics",
        "title": "Real-time hypothesis testing, premature stopping ",
        "score": 8,
        "comment counts": 10,
        "content": "Say I want to start offering a discount for shopping in my store. I want to run a test to see if it's a cost-effective idea. I demand an improvement of $d in average sale $s to compensate for the cost of the discount. I start offering the discount randomly to every second customer. Given the average traffic in my store, I determine I should be running the experiment for at least 4 months to determine the true effect equal to d at alpha 0.05 with 0.8 power. \n\n\n1. Should my hypothesis be: \n\nH0: s_exp - s_ctrl < d\n\nAnd then if I reject it means there's evidence the discount is cost effective (and so I start offering the discount to everyone) \n\nOr \n\nH0: s_exp - s_ctrl > d \n\nAnd then if I don't reject it means there's *no* evidence the discount is *not* cost effective (and so i keep offering the discount to everyone or at least to half of the clients to keep the test going)\n\n\n2. What should I do if after four months, my test is not conclusive? All in all, I don't want to miss the opportunity to increase the profit margin, even if true effect is 1.01*d, right above the cost-effectiveness threshold. As opposed to pharmacology, there's no point in being too conservative in making business right? Can I keep running the test and avoid p-hacking?\n\n\n3. I keep monitoring the average sales daily, to make sure the test is running well. When can I stop the experiment before preassumed amount of sample is collected, because the experimental group is performing very well or very bad and it seems I surely have enough evidence to decide now? How to avoid p-hacking with such early stopping?\n\n\nBonus 1: say I know a lot about my clients: salary, height, personality. How to keep refining what discount to offer based on individual characteristics? Maybe men taller than 2 meters should optimally receive two times higher discount for some unknown reasons?\n\n\nBonus 2: would bayesian hypothesis testing be better-suited in this setting? Why?\n",
        "comments": [
            "You can stop the experiment before the pre-assumed number of sample are collected if the results are very clear and statistically significant. However, you should be careful about p-hacking with such early stopping. To avoid this, you could use sequential analysis, which allows you to stop the experiment early if the results are clear, but adjusts the statistical significance level to account for the fact that you're looking at the data multiple times.",
            "Checkout the \u201coptional stopping\u201d part of this paper\n\nhttps://arxiv.org/abs/2212.11366",
            "Some bayesian approach is probably a valid way to approach this type of problem. I also want to say that if you run an experiment for 4-6 months to measure a small effect you should be careful about drift in your user population behavior. Effects can be seasonal or just have secular changes so keep that in mind",
            "I wonder if the proper hypothesis is s\\_exp - s\\_ctrl = 0 and then you statistical test just measures if the difference is statistically significant. If it is and the difference is d, then you're good to go. But i think this is the same as what you're doing. Find the distribution of s\\_exp - s\\_cntrl and if $d falls in the <.05 left quantile then you can say s\\_exp is $d greater. \n\nI think you can just stop when the test returns something significant. This can happen if you have very few samples but the difference s\\_exp - s\\_cntrl is very large and/or the difference is small but you have many many samples."
        ]
    },
    {
        "id": "1c59ru9",
        "datetime": 1713251049.0,
        "flair": "ML",
        "title": "Help in creating a chatbot",
        "score": 0,
        "comment counts": 9,
        "content": "I want to create a chatbot that can fetch data from database and answer questions.\n\nFor example, I have a database with details of employees. Now If i ask chatbot how many people join after January 2024 that chatbot will return answer based on data stored in database.\n\nHow to achieve this and what approch to use?",
        "comments": [
            "You want a RAG. Assuming it\u2019s a text DB, you need to chunk the DB into passages, and an embedding model to create a vector DB. Given a query, embed it (use the same model as before), return top N closest passages, and use them to give a QA model the necessary context to answer the query by engineering the ideal prompt. Tip: use instruct-type QA models like mistral 7b instruct.",
            "Check out \"pandas ai\" package, basically does this"
        ]
    },
    {
        "id": "1c4kstx",
        "datetime": 1713183610.0,
        "flair": "Tools",
        "title": "Best framework for creating an ML based website/service for a data scientist",
        "score": 5,
        "comment counts": 7,
        "content": "I'm a data scientist who doesn't really know web development. If I tune some models and create something that I want to surface to a user, what options do I have? Also, what if I'd like to charge for it?\n\nI'm already quite familiar with Streamlit. I've seen that there's a new framework called Taipy that looks interesting but I'm not sure if it can handle subscriptions.\n\nAny suggestions or personal experience with trying to do the same?",
        "comments": [
            "Let's put the question around, what are you missing in streamlit and what exactly do you want to do?\n\nPersonally I never do production grade interfaces, these are done by my devs. So my use cases are demos, pocs or internal small scale apps.\n\nFor all these use cases a combination of streamlit, shiny and flask is more than enough.",
            "Flask, Fast API, Django?",
            "I think, you should use Streamlit already familiar. Streamlit provides a simple API for building interactive web apps entirely in Python, making it accessible to data scientists quickly build and deploy. Next step Paypal or stripe implement.",
            "check out Panel for python web development! it\u2019s pretty cool but can be hard for very large user bases. the docs are really good too.\n\non the payment side i would assume that\u2019s just an issue with whatever authentication you stand up in front of your app (panel has good docs on this as well). you have a list of users that are paying and you check on login if the current user is in that list.",
            "Do check out Dash by plotly. More customisable than Streamlit. Optimization is good."
        ]
    },
    {
        "id": "1c4uibm",
        "datetime": 1713207939.0,
        "flair": "Discussion",
        "title": "Self serve dashboard adoption",
        "score": 0,
        "comment counts": 3,
        "content": "I\u2019m trying to build a dashboard- PowerBI/Tableau which would serve as a one stop solution to most of the ad hoc analytics request that our stakeholders have. But in the past we observed that the adoption of such dashboards by them is not that great. Did anyone in the sub try this initiative, and what are the important factors to keep in mind before you go into such venture.",
        "comments": [
            "I\u2019ve done some work around this, I think my biggest suggestion would be a series of smaller more tailored 1 stop shops for each overarching initiative is more effective then one massive one for everything in my experience. For example we were building our ad hoc reports on customer engagement, call center volume, and push notification interactions for various alerts. It worked a lot better to create one dashboard for each of those topics than one huge one for everything because the metrics that are important for one aren\u2019t for the others and stakeholders got nervous when they saw metrics or filters they were unfamiliar with\u00a0\n\n\n\n\n\nEdit: just wanted to add that unless you have very dedicated and competent stakeholders these efforts are almost never worth it, it did very little if anything to reduce the ad hoc requests and the people that did try to use it often didn\u2019t understand how to use the filters/fields properly and ended up with some really strange takeaways from it",
            "Tough ask. I built high level dashboards for some years.    In my experience these dashboards disappoint. Worse, if you build in various customization options (filters, level of detail parameters, varying percentiles etc) data naive XFNs abuse and misunderstand and inevitably misuse the dashboard. End result: it's the builder's fault.\n\nPush back if you can and propose to deliver stand-alone  tools for most needed use cases. Some sort of 80-20 rule applies to dashboarding. 80 percent of insights come from 20 percent of dashboards. \n\nI am not one to lecture but please keep in mind that one key tool in an analyst's / data scientist's skillset is managing stakeholders. We all must learn how to say no without saying no"
        ]
    },
    {
        "id": "1c49ugr",
        "datetime": 1713144012.0,
        "flair": "Discussion",
        "title": "If you work in a unique domain, what is it and how did you get into it?",
        "score": 23,
        "comment counts": 42,
        "content": "Most of the domains I see for job listings are healthcare, products, retail, fraud, supply chain, etc. I'm still early career and my degree is in sociology and data, which has helped me work mostly with non-profits. I wanted to hear about what other domains that other DA/DS are working in which fall outside the majority.",
        "comments": [
            "I worked in nuclear power and transportation, pretty unique for DS imo but perhaps limited use cases as well.",
            "Video games. It\u2019s an amazing use of DS",
            "Air quality - job opportunity during my master's",
            "I work in the specialty insurance field, where pretty much anything you can imagine is insured - from somebody\u2019s vocal chords to offshore energy operations to more standard properties along the coast of Florida. Very different to retail insurance (to the extent that it\u2019s actually hard to move between them the more senior you become). \n\nI got into this by first training as an actuary. I packed in that career path, but stayed in the same field to further deepen my domain knowledge. I now focus more on software development and aspects of data analytics. \n\nBecause the focus is on specialty risks, the data tends to be \u2018smaller\u2019 and there isn\u2019t much \u2018machine learning\u2019. Descriptive statistics and exploratory analysis is king. Optimising processes to be more efficient is also massively important here. If you are somebody who has these kind of skills, and can combine it with an expertise of specialty insurance, you\u2019ll be (very) hot in this market.",
            "I specialize in recommendation and search. I got into it bc I was super interested in using graphs/ontologies. Ended up working for a research institute during my dissertation that was exploring the use of ontologies and named entity recognition for scientific paper recommendation. I\u2019ve since moved on from the research institute but still working in recommendation and search for a tech company now."
        ]
    },
    {
        "id": "1c41y7n",
        "datetime": 1713123133.0,
        "flair": "Discussion",
        "title": "Distraction caused by the Ai Hype",
        "score": 68,
        "comment counts": 24,
        "content": "I noticed there's some disconnection between this recent AI Hype we constantly witness on Linkedin/Twitter, things like these new LLMs, the latest 3D models, the Cool Gen AI stuff ...  and the industry requirements that actually matter for companies. Which is a bit confusing and can be distracting especially for juniors trying to upskill and learn the things that leads to get them jobs, this leaves you with the questions: Should you follow the hype and try to stay up to date by learning all these new things? or stick to what matters and can generates actual value and be good at it even if it seems \"outdated\" (things like traditional machine learning)? ",
        "comments": [
            "Unfortunately this is very hard for everyone. Everyone and their boss wants to jump on the GenAI train because of FOMO when they don\u2019t even have good quality data to begin with. Outside of tech, finance companies and a couple of others, data quality, its management and pipelines are extremely suspect. A lot of value in the field actually comes from doing simpler things.\u00a0\n\nThe only method to keep up with the madness is to keep reading about it to know where the field is moving. You don\u2019t have to be on top of it all everyday but its always good to be abreast of the various developments in the field even if you aren\u2019t necessarily implementing it in your day to day job. You\u2019d comes across as a genuinely curious person and even otherwise, its fun!\u00a0",
            "I am frankly of the opinion that being strong in the fundamental stuff and then branching out to areas that interest you is the most valuable path of any career. There may be areas that pay more and/or have more hype around them, but do you actually see yourself doing one of those hyped up things in your career? If yes, that's fine. If no, find something else you like.\n\nI've had the opportunity to work with GPT-3 before Chat-GPT came out and back then I thought \"This is pretty cool but it has many limitations to overcome.\" I am still of the same impression when it comes to Chat-GPT and other LLM technology. Chat-GPT is like a calculator: it's useful, but to get the most of it, you need a decent-to-strong foundation in programming, math, and stats. While I like LLMs, I made the decision that they won't be the sole, new fangled toy that I would advertise my skills in to employers. I'll use them when I have to, but I'll do other things too.\n\nI personally prefer having a strong foundation in fundamental Statistics, Data Science, CS, and domain expertise. I don't mind some ML thrown in though (NLP, Regression, and Classification mostly).\n\nTLDR; get good at the basics and branch out to areas that interest you. If you like LLMs and want to learn more, definitely do so! Data Science is a wide enough field that you'll find relevant work in the areas that you'll like (with effort of course).",
            "GenAI will fade, ML not.",
            "I am literally feeling this one ,I used to learn one by one like ML,Dl,NLP And then wanted to learn computer vision but because of LLM and New models,new fine-tuning methods and models , I really don't know which one to learn and which one to leave and also Now a days everyone is asking for LLM in job description.\n\nI am really confused,this got to me a waste of my most time instead of learning.",
            "Unlike traditional ML which is done in house, I don\u2019t LLM will follow this pattern. They will be built by a few companies then leveraged via API. If you want to add an AI chat bot to your product you just buy it. Building your own is a waste of time."
        ]
    },
    {
        "id": "1c45zjn",
        "datetime": 1713133176.0,
        "flair": "Career Discussion",
        "title": "Would really love opinions. ",
        "score": 22,
        "comment counts": 30,
        "content": "So I\u2019m a hs math teacher. Over the past 4-5 years my school has started teaching data science. I\u2019ve been teaching that for about 3-4 years. Very basic stuff at the HS level, basically intro to R and some stats.  \n\nI\u2019ve started to think about studying DS as a possible career path (and/or as a source of side income if that\u2019s possible). \n\nI was looking around and saw that Berkeley extension offers a Bootcamp in data science. It\u2019s not cheap and seems like it will take up a lot of my free time. \nBut I\u2019m having second thoughts about doing this. \nFor starters the application process was very easy. Which makes me wonder how effective this program is. \nAlso, this is just a certificate program. And it seems like most people on this subreddit have BS and or masters. \nDoes anyone have any experience with this particular program? Is there a better path to follow?\nThank you.  ",
        "comments": [
            "Don't go to a bootcamp, those things are cash grabs. Everyone and their mom is trying to get into tech, the market is flooded with people with degrees. You'll need at least a masters if you want to get into data science at this point.",
            "You probably won\u2019t break through into a career with that bootcamp. Do you have a math degree? \n\nIf I were you I\u2019d focus on tools. There\u2019s a lot of free materials to learn about tools and what they do. A lot of data scientists use SQL, Python(more) or R(less), a data vis tool like Tableau, have basic programming Know-how, like Git, basic bash/shell scripting and have some additional skills like AWS/Google cloud. They use those things to make projects that do Machine Learning, descriptive reporting or anything else that a business needs. You can read cases on the internet about what businesses commonly use data science for, how those projects were put together, and then learn those tools and methods. \n\nYou could do a bootcamp, or not. If you think it would help you gain the know-how, great. The credential itself probably won\u2019t be much help for landing a job.",
            "I started as a high school teacher (AP stats and SAS) along with Math I/II etc.. cause not enough kids took stats.  I'm now a Senior AI/ML architect. Here were my steps and why it was the best decision I ever made.\n\n1.) Knew SAS already and SQL so took a low entry level analyst role.\n2.) Got my masters at night while in the analyst role and as my \"internship\" I asked the data science team at my company to shadow them and got my first exposure to real data science.\n3.) Took Data Science job, contract role which I HATED cause of no PTO health insurance not great etc.. but don't regret it cause it was my foot in the door and full time exposure to data science.\n4.) Once graduated, I became a graduate professor myself.  I missed teaching SO much so this was my way to keep doing it and keep up with my skills, doing something is one thing teaching it is another as you know.\n5.) Took entry level architect job learned MLOPs and with my data science skills over the last few years worked my way to senior. \n6.) BEST PART...My company just approved the curriculum and I will be volunteering teaching cloud computing in the fall to my old high school with support from tons of companies in the area that use Snowflake.  I'm so excited and although it took awhile this journey was worth it.\n\nMy last piece of advice.  You have the skill that 90% of other data scientists don't, and that's your teaching ability.  You'll be able to learn to code, but the personality and ability to explain things to the business and other people will separate you SO much.  Good luck!",
            "I would recommend reading O\u2018Reilly books and actually doing some hands on projects. Uni certificates, in my experience, only give you a broad overview on a topic without actually teaching you how to do it.",
            "As has already been stated, don't do a boot camp and expect to switch careers, especially in the current job market. But, depending on your situation, consider an online masters. I was a hs math teacher and got my masters in Statistics, studying on nights and weekends. Attending an online program allowed me the ability to keep working full-time and still make progress in my studies. I got an internship one summer, then a full-time offer, and I've been a data scientist for 5 years now.\n\nLike I said, the market isn't great now, so look into getting a masters, plan for a couple of years, look for internships over the summer, and hopefully when you're complete, the job market will be a bit better. Good luck!"
        ]
    },
    {
        "id": "1c50iuu",
        "datetime": 1713222361.0,
        "flair": "Discussion",
        "title": "Why are Data Scientists still needed when Machine Learning Engineers are a thing? ",
        "score": 0,
        "comment counts": 38,
        "content": "I've worked in a company in which there were different teams for DS and MLE with very distinct responsibilities. DS were responsible for talking with business stakeholders, understanding the project goals, talking with data engineers and analysts to assess data availability, do the whole data science project cycle and then deploying the model as a flask API using infrastructure built by the MLE team.\n\nHowever, I have seen more and more MLE jobs in which their responsibilities are much broader than that. Some of them expect MLEs to actually build the model themselves. \n\nI might be wrong here but it seems like the average ML Engineer could do everything that is expected from an average data science position, but an average data Scientist would really struggle at an average ML Engineer position. ",
        "comments": [
            "I think the answer is that \"data scientist\" and \" machine learning engineer\" do not have specific definitions. They are simply labels that companies can use how they want. There will always be variation in job title as it relates to specific job duties across different companies.",
            "you're gonna catch a lot of shit for this post, but in some ways, it (what is in your title) is a question worth asking. i'm going to post my thoughts.\n\n1. mles are, first and foremost, \\*software engineers\\*; they are not necessarily trained to have specific statistical or analytic backgrounds that are typical pre-requisites for data scientists. organizations do not necessarily need more engineers but may have a need for data scientists. note that this somewhat presupposes a pay discrepancy between the two (which i think does exist)\n\n  \n2. related to the above: not all data science questions involve machine learning. in fact, most do not, at least any machine learning problem that would benefit from the expertise of an mle. at my prior job, i would say 80% of my time was spent with logistic regression as the typical model.\n\n  \n3. you're right, there is definitely some blending of roles. but it's always been like that. i've definitely had to build my share of data pipelines and monitoring s/w that would be outside of what the typical ds from way back would have had to do.",
            "Data scientists speced into math and being able to talk to people. If a MLE can do that on top of normal MLE stuff you don\u2019t need a DS. \u00a0But that\u2019s easier said than done so it makes sense to split the jobs.\u00a0",
            "I know a guy who was hired as a data analyst and the company expected him to build and maintain machine learning pipelines. Titles are goofy.",
            "The definition of Data Scientist has always been loose and the introduction of the term Machine Learning Engineer hasn\u2019t lead to more clarity. You basically need to create job descriptions one by one and match people\u2019s skills and experience to those jobs- knowing someone is/ was a data scientist or MLE just doesn\u2019t tell you anything useful.\u00a0"
        ]
    },
    {
        "id": "1c3t6ww",
        "datetime": 1713100049.0,
        "flair": "Career Discussion",
        "title": "Asking for a promotion too soon?",
        "score": 2,
        "comment counts": 36,
        "content": "Hi all,\n\nSome context for the situation,\nI\u2019m a career switcher (sales-> tech staffing AM -> self employed personal trainer -> Data Scientist )\n\nI will be graduating with my MS-DS next month (May) and have been working full time as a DS since October 23. (With a 3 month internship at the same company prior to coming on full time).\n\nI was brought in as entry level (BS and no prior experience required)\n\nI was pulled into a project that ended up being multifaceted and find that I am collaborating with a principal on how to tackle the problems we face, solutions we build, and deadlines for each piece etc. We divide the work pretty evenly and are responsible for our respective deliverables.\n\nOur lead suggests that I need to make the case now for a promotion given what we\u2019ve done thus far in terms of work load, deliverables etc \nAnd while I \u201cfeel\u201d like that makes sense my brain says \u201cI\u2019ve only been here 6 mths\u201d.\n\nWhen is it too soon? I was planning on letting the project wrap up and presenting my progress, and value add to the company along with including that I am now an MS grad but wanted to see what the more experienced folks have to share\n\nTIA!",
        "comments": [
            "If your lead is suggesting that you ask for a promotion, then you don't need really any more hints from the internet. Go for it!!",
            "Well: there\u2019s two schools of thought here: \n1. Your idea is great, I think that\u2019s how most people do it.\n2. Does your lead play any part in your promotion Process? If they do and they want you to make the case for a promotion, DO SO. They have been there probably longer than you and know how that company and their promotion processes work. Definitely still talk about value added and your improved education, but the timeline can be moved up. Ask your lead for advice if they\u2019re willing to give it.",
            "I\u2019m assuming that your lead is of the opinion that what you are doing exceeds that of an \u201centry level\u201d data scientist. My advice: compare your work to that of what an entry level data scientist is and use that to motivate your promotion. If you find that your work is entry level, than you needn\u2019t push for a promotion, if after your investigation you find that your work is definitely not that of an entry level, then the doubt you have of whether to ask for a promotion should be gone.",
            "what a amazing career line up",
            "How long until the project finishes?"
        ]
    },
    {
        "id": "1c3wgfp",
        "datetime": 1713109251.0,
        "flair": "Discussion",
        "title": "Hiring managers - how do you measure research capabilities and coding good practices on an interview?",
        "score": 0,
        "comment counts": 5,
        "content": "So in the next couple of weeks I am going to interview several candidates to our department. Our work is divided between ml development, feature engineering and ideation and customer facing. I don\u2019t need the candidates to excel in all of these, but if I need to choose the first two are the most important aspects.  \nHow do you evaluate candidates in your organization for these traits? I am looking for a candidate that has some kind of a programing skills or with some demonstrated ability to write production level code, and with initiative and research capabilities, not necessarily a PhD but some past experience. ",
        "comments": [
            "Personally I tend to ask more theoretical type questions about coding. I.e why or when would you prefer X pattern over Y pattern, or give an example of a recent problem your team faced at work and get their thoughts on how they would've that same issue. If you're really just interested in coding practices, a question I tend to ask is \"what does good code look like to you?\" And based on their answers dig a little bit deeper. For feature engineering type questions you can give a quick rundown of a problem and some variables you have for the model already, and then ask them what other factors they'd consider. Usually you're looking for creativity and whether the candidate asks clarifying questions and tries to really understand what the problem is.",
            "Use case analysis, business req and communication",
            "I usually don\u2019t test on coding because its not a need in my organization all that much. Usually I ask for git repo to see what their style is. I am more interested in thinking process and usually ask them about how they\u2019d go about solving situational problems. I\u2019d also dig into why they would prefer one approach over another and what benefits they see in each of the approaches.\u00a0"
        ]
    },
    {
        "id": "1c3jupb",
        "datetime": 1713064988.0,
        "flair": "Discussion",
        "title": "T test vs z test",
        "score": 9,
        "comment counts": 11,
        "content": "Why dont we use t tests for comparing proportions of two populations? I have tried digging into this but could not understand yet. ",
        "comments": [
            "Both are merely approximations, but the logic behind using the z-test is that, for proportions, the standard deviation is a function of the mean, and so the population standard deviation is known under the null hypothesis (i.e. it is not estimated from the sample).",
            "A t test will work given sufficient sample by treating it as a comparison of means with a 0/1 variable. However, in this case the z test takes advantage of the fact that the SD is known in the proportion case and so is a higher powered test.",
            "[deleted]",
            "We aren't in the 1950's. No reason to use an approximate test when exact tests are available.",
            "I just wrote a great article about data science and claritied two concepts T test and z-test . [HOCTHUE.NET: Khoa h\u1ecdc d\u1eef li\u1ec7u trong l\u0129nh v\u1ef1c kinh t\u1ebf.](https://www.hocthue.net/khoa-hoc-du-lieu)"
        ]
    },
    {
        "id": "1c2yl0i",
        "datetime": 1713003212.0,
        "flair": "Discussion",
        "title": "What field/skill in data science do you think cannot be replaced by AI?",
        "score": 130,
        "comment counts": 153,
        "content": "Title.",
        "comments": [
            "Talking to people and figuring out what they actually want",
            "The part that is related to responsibility and decision-making.",
            "\u00a0Dealing with messy data.",
            "Overseeing the whole process. Helping a stake holder understand what their problem is and what kind of solution could help them. Not just cleaning your data, but figuring of what data you could use or where it is even coming from. Helping with deployment so that the model is used in a way that helps the organization",
            "I don't know if this falls under data science, but garbage and garbage out, so making sure\n\nthe AI is being trained on the right data,\n\nthe right data is being collected\n\nThe right data is available for the AI.... Managing the training loop...\n\nThat's where data roles should focus. Not trying to compete with AI....\n\nalso KNOWLEDGE GRAPHS, soon it'll be the hottest things in Data."
        ]
    },
    {
        "id": "1c3pdcz",
        "datetime": 1713086069.0,
        "flair": "Challenges",
        "title": "Looking for team memebers for CV kaggle challenge ",
        "score": 1,
        "comment counts": 2,
        "content": "Hey! I am looking for teammates for image-matching-challenge-2024. Please do reach out if you have prior CV experience. \n\nMy Profile: Masters in data science, top kaggle achievement: finished top 8% in llm-detect-ai-generated-text  challenge. I have NLP experience, want to build CV experience. Most comfortable in pytorch.",
        "comments": [
            "I've CV experience. We can connect. MSc student as well and decent kaggle experience."
        ]
    },
    {
        "id": "1c30flh",
        "datetime": 1713010100.0,
        "flair": "Discussion",
        "title": "Enhancing Weather Forecast Accuracy: Exploring Regression Models with Multi-source Data Integration",
        "score": 23,
        "comment counts": 18,
        "content": "I am currently working as a data scientist at a new energy startup, mainly responsible for predicting photovoltaic power generation every 15 minutes for the next day. The key data relied upon are weather forecasts, especially the predicted solar irradiance values. Currently, we have data from five numerical weather forecasts, which include fields such as irradiance, temperature, and humidity. The accuracy of the forecasts varies among different data sources, and there are certain discrepancies with the actual weather. I am considering merging the five sets of data to obtain a more accurate weather forecast. Can I use a regression model to fit the actual weather using the five sets of weather forecast data? Is there a better method available?\n\nBtw, the weather forecast datasets appears as follows:\n\nhttps://preview.redd.it/pzvu110o0duc1.jpg?width=2298&format=pjpg&auto=webp&s=c5f45c31b5c683a09f8609de0077b0be06d1808a",
        "comments": [
            "you'd want to understand the shortcomings of existing modelling methods before going down this road. you're essentially trying to predict the types of mistakes those models make, rather than predict the weather. I'd guess regression models can be used effectively for at least a component of a good model, but you're going to have to do a lot of work to handle geo data and distance smoothings, and you'll probably need to classify the types of errors in existing models to define targets for individual regression models. Ultimately though you can define the problem space quite naturally in tabular datasets, so I'd expect regression or tree models to work well.",
            "I think you should consider a hierarchical Bayesian model. You have spatio temporal data that needs to be handled and a granularity of space that requires aggregation (thinking square acre vs county what was happening at that moment and how should the estimate aggregate)\n\nAlso there may be room for a convolutional recurrent neural network. So convolutions on the vector of each dataset over the two 2d layer (thinking spherical coordinates with fixed radius) and then on top of this network the time axis through say a lstm (maybe encoder to decoder)\n\nI think I saw some cool papers about generative networks that has some promising weather predictions that looked better",
            "You should fit each model separately prior to any ensemble averaging.   Essentially, you are tackling  model output statistics to develop corrections that are model-dependent.   These approaches have been used extensively and nonlinear AI/ML based approaches are certainly capable of developing corrections as well and being implemented more often.  At the hyper local scales you are likely interested in, there are likely many nonlinearities/state-dependencies that are important.  Each model will have unique error characteristics tied to its own model physics and those inherited from boundaries conditions of driving global models (assuming you are using regional/limited-area models).  Training using the ensemble mean is also possible but it\u2019s likely those unique aggregate errors from the ensemble able harder to disentangle from shorter hindcast records.  And, if a single model ever changes/updates/drops out/erc., then the underlying distribution of errors of the ensemble mean also changes and would no longer be consistent with your calibrated model.",
            "Interesting!",
            "Most of the irradiance actuals from different weather providers is actually a simulated series (not actual observation). You want to be sure of this as blending them together will not lead to good results. Better way would be to build individual models and ensemble them together (like xgboost using each weather vendor and then ensemble using regression etc)."
        ]
    },
    {
        "id": "1c35bvi",
        "datetime": 1713024179.0,
        "flair": "Discussion",
        "title": "FNN to predict improper vouchers. ",
        "score": 12,
        "comment counts": 19,
        "content": "I am an auditor for a state agency, we audit payments the state makes every day to find improper voucher. \n\nWe get about 30,000 vouchers a day so obviously we can\u2019t audit all of them. So we set up certain risks associated with vouchers to try and better find improper payments. And sometime we have filters for payments that meet certain criteria that must get audited. \n\nHowever, our risk based design doesn\u2019t really work, it\u2019s just a chance of whether or not the vouchers selected for audited are improper or not. I don\u2019t believe we have any better outcome that just randomly selecting a voucher everyday. \n\nIt just depends on the risks the auditors look for and how well they look at it. However, I am trying to create a statistical model to find these improper vouchers based on these risks. \n\nAs opposed to what some auditor thinks is the best risk, the model can look at all these risks and see how they interact and if there is some pattern. \n\nAdditionally, a lot of these risks have some arbitrary cut off date. For example, we might have a risk saying the specific vendor hasn\u2019t been audited in over a year. That\u2019s considered risky, however, a voucher that misses that by one year wouldn\u2019t be rated as risky. \n\nSo doing this we can turn some categorical variables into continuous variables. \n\nThe data set as of now is about 600,000 vouchers that have been audited over a ten year span. Currently about 8% of them have been rejected. But not all of the rejected ones were necessarily bad. We have two classes non compliance and saving. Savings are when the money is not due or at least some of it, bad math on the invoice, incorrect charges and so one. While non compliance don\u2019t really save any money it\u2019s just some account error, maybe they paid from the wrong funds, referenced the wrong contract or something. It\u2019s gonna mess up the accounting system but not really save any money. About 20% of rejected vouchers have saving and 80% are non compliant. \n\nObviously our goal is to identify vouchers that yield a saving. Even if we had a model that can predict all the improper ones, we just don\u2019t have the resource to audit all of them. \n\nSo my thoughts were to create a model fine tuned to have low false positive. Basically I would have a penalizing model for instances of an okay voucher being marked as improper. \n\nObviously we\u2019d miss some improper vouchers from that but we also don\u2019t have the resource to audit them all anyway so my thought is this would allow us to focus on those that might be improper. \n\nJust wondering if you guys have. Any thoughts on this. ",
        "comments": [
            "Have you considered framing as anomaly detection?",
            "What was your idea behind choosing FNN?",
            "Standard boosted decision tree methods should be more suitable instead of using neural networks.",
            "What programing languages did you learn for your job?",
            "What is your input data form? Text?\n\nHave you looked at huggingface models?"
        ]
    },
    {
        "id": "1c3vdii",
        "datetime": 1713106354.0,
        "flair": "Career Discussion",
        "title": "What would you do in my shoes?",
        "score": 0,
        "comment counts": 2,
        "content": "People of datascience, if you were in my shoes and the goal was  to get a job asap in machine learning engineering role (preferably   NLP), how would you plan a few months ahead and in what time frame would  you meet that goal.   \n\nMy shoes: [https://www.dropbox.com/scl/fi/k0ruhu6wnri4phfiniuri/Resume-Censored.pdf?rlkey=zq46ltuvu4xjxtrn68d8zyd01&dl=0](https://www.dropbox.com/scl/fi/k0ruhu6wnri4phfiniuri/Resume-Censored.pdf?rlkey=zq46ltuvu4xjxtrn68d8zyd01&dl=0)  \n(this is enough to give you an idea of what my shoes look like)   \n\nGoal is to get a job asap, as mentioned above. My university is  online so I have to spend about to 2hrs daily to complete my lectures,  and prepare for assignments and quizes. I spend about 8-12 hrs daily,  divided for learning ML (currently taking CS224n lectures) and working  at my internship (fully-remote). I live in Kuwait and the ML market here  is almost non-existent. I can move back to Pakistan (home country) and  get a full-time job there where the market is not very mature but there  is work for engineering (R&D near non-existent). Best case scenario  would be to get a fully-remote job in any part of the world possible. My  salary expectations are not much, I can work full-time job if it can  pay me $500/mo atleast because that would cover my needs and then with  that I can focus on my studies and ML specifically as I want to get into  masters or phd after bachelors. That's my long-term goal to get into  academic research.   \n\nSo for now, getting a job to cover my expenses is a priority and a  short term goal. How would you then plan next few months to meet that  goal, if possible?",
        "comments": [
            "Fully remote jobs are nearly impossible to get right now, especially junior roles. I wouldn\u2019t bank on it",
            "Contact staffing agencies to get contract work. That can be a fast way to get into the right type of work and eventually the right company. They do all the work of finding jobs for you but you will not be able to be picky about the opportunities they give you. You probably wont have great benefits or make a ton of money. But, in the matter of months or a few years, you will have enough experience to get a great paying job at a great company, full time. Often big, attractive companies will hire contractors, so you will get real experience and, perhaps more importantly, the opportunity to build relationships with people at that company. It could lead to a full time job with luck.\n\nGood luck"
        ]
    },
    {
        "id": "1c33azw",
        "datetime": 1713018702.0,
        "flair": "Career Discussion",
        "title": "Where do you guys apply for jobs in uk?",
        "score": 11,
        "comment counts": 6,
        "content": "I\u2019ve been using LinkedIn but haven\u2019t got much success, I\u2019m not sure if it\u2019s because I\u2019m unqualified (BSc from top 20 unis with 2ye), the markets tough or if I\u2019m on the wrong site. \n\nWhere do you guys apply for roles? Im based in London currently. \n\nI tried going to networking events, I attended big data London last year but the only people I met were trying to sell me storage solutions. Are there any networking events you\u2019d recommend? ",
        "comments": [
            "Indeed but there's alot of scams on there",
            "I live in the UK.  IT, software roles have been reduced by up to 40%. Junior roles difficult time.",
            "Otta is great for jobs in tech, much better than Linkedin",
            "Linkedin is not too bad. You might need to revamp your profile. Add the skills you've used in all your jobs and BS. Showcase a few personal projects if you've done (they could be from your BS, or ones you may have done in your free time). If you've contributed to open source, highlight that, and also provide a link to your github if you've got something to show there.\nRecruiters are suckers for keywords and jargons so make sure you use them effectively in your skills and work ex.",
            "Yeah big data London is a place for listening to talks and listening to sales pitches\ud83d\ude05\n\nAre there any local meetups around data? I don't know where you are, but in my area these events are on meetup.com, or some monthly events organized by local data people."
        ]
    },
    {
        "id": "1c2mqav",
        "datetime": 1712962195.0,
        "flair": "Discussion",
        "title": "XGBoost Please help",
        "score": 91,
        "comment counts": 61,
        "content": "\u2060I am trying to train an xgboost model that estimates stock price. I don't think there is anything wrong with the data format, and I am using gridsearch to find the likely hyperparameters.... But this is what I get for the estimation. except for a small area, the graph is flat...! Why is this happening?\n\nhttps://preview.redd.it/uzkt8fqxn4uc1.png?width=1954&format=png&auto=webp&s=b76f84d7910768b33692d0b9ee9d144f5e951984\n\nhttps://preview.redd.it/no2qddf0o4uc1.png?width=1267&format=png&auto=webp&s=dbd1e1b51b0b87949046e27f1976967e11caea06",
        "comments": [
            "The most plausible reason is that the max value of y\\_train is less than 42. Tree-based algorithms, like XGBoost, can only interpolate, not extrapolate.",
            "Please tell us you're forecasting return and not stock price",
            "I would try the skforecast library. It handles time series with regression techniques better. \n\nDo you have a GitHub link for this? It\u2019s tough to tell what the problem is from this. Seems like a data cleaning/structure issue from here, not an xgboost problem.",
            "You\u2019ll want to use walk forward validation at the very least when forecasting time series. Not simple train/test.",
            "This looks like an extrapolation problem. Tree based models are known to not be able to extrapolate on data outside the expected ranges. There are a ton of resources on this you can find. Here's an example that shows the problem and some other models you can use:\n\nhttps://www.kaggle.com/code/carlmcbrideellis/extrapolation-do-not-stray-out-of-the-forest\n\nNot sure what happened in your case exactly but it was probably something like that.\n\nBTW predicting stock prices is a difficult problem so you are likely going to struggle a bit. I think it's best to start with some time series forecasting libraries like skforecast or AWS forecast. \n\nThere's also libraries like this for more advanced models:\n\nhttps://unit8co.github.io/darts/\n\nhttps://nixtlaverse.nixtla.io/\n\nI remember reading this article which goes over the state of the art which I thought was pretty good too.\n\nhttps://mangodata.io/blog-post/forecasting"
        ]
    },
    {
        "id": "1c2tz99",
        "datetime": 1712984546.0,
        "flair": "ML",
        "title": "Predicting successful pharma drug launch",
        "score": 10,
        "comment counts": 20,
        "content": "I have a dataset with monthly metrics tracking the launch of various pharmaceutical drugs.  There are several different drugs and treatment areas in the dataset, grouped by the lifecycle month.  For example:\n\n\n\n|Drug|Treatment Area|Month|Drug Awareness (1-10)|Market Share (%)|\n|:-|:-|:-|:-|:-|\n|XYZ|Psoriasis|1|2|.05|\n|XYZ|Psoriasis|2|3|.07|\n|XYZ|Psoriasis|3|5|.12|\n|XYZ|Psoriasis|...|...|...|\n|XYZ|Psoriasis|18|6|.24|\n|ABC|Psoriasis|1|1|.02|\n|ABC|Psoriasis|2|3|.05|\n|ABC|Psoriasis|3|4|.09|\n|ABC|Psoriasis|...|...|...|\n|ABC|Psoriasis|18|5|.20|\n|ABC|Dermatitis|1|7|.20|\n|ABC|Dermatitis|2|7|.22|\n|ABC|Dermatitis|3|8|.24|\n\n* Drugs XYZ and ABC may have been launched years apart, but we are tracking the month relative to launch date.  E.g. month 1 is always the first month after launch.\n* Drug XYZ might be prescribed for several treatment areas, so has different metric values for each treatment area (e.g. a drug might treat psoriasis & dermatitis)\n* A metric like \"Drug awareness\" is the to-date cumulative average rating based on a survey of doctors.  There are several 10-point Likert scale metrics like this\n* The target variable is \"Market Share (%)\" which is the % of eligible patients using the drug\n* A full launch cycle is 18 months, so we have some drugs that have undergone the full 18-month cycle can that be used for training, and some drugs that are currently in launch that we are trying to predict success for.\n\nThus, a \"good\" launch is when a drug ultimately captures a significant portion of eligible market share.  While this is somewhat subjective what \"significant\" means, let's assume I want to set thresholds like 50% of market share eventually captured.\n\nQuestions:\n\n1. Should I model a time-series and try to predict the future market share?\n2. Or should I use classification to predict the chance the drug will eventually reach a certain market share (e.g. 50%)?\n\nMy problem with classification is the difficulty in incorporating the evolution of the metrics over time, so  I feel like time-series is perfect for this.\n\nHowever, my problem with time-series is that we aren't looking at a single entity's trend--it's a trend of several different drugs launched at different times that may have been successful or not.  Maybe I can filter to only successful launches and train off that time-series trend, but I would probably significantly reduce my sample size.\n\nAny ideas would be greatly appreciated!\n\n",
        "comments": [
            "Why not make it more easy for you and for the world to understand by just predicting the absolute sales, and make a separate prediction by market? That seems like a way more solid approach than to have relative targets going in all directions over a time window that is also made relative.\n\nRegarding your question:  \nIt seems you are trying to predict a Y variable here that is relative to the other candidates. I think there are some big challenges in setting this up as a time series like this if you don't have extra data to ungroup it into a 'normal' format where you know the start date. But it is definitely a problem that could hugely benefit from being a time series (including seasonality is one), so I would spend extra time data engineering to combat the problem: that your target is relative to other drugs but your time variable is also relative to some arbitrary beginning. \n\nAlso keep in mind:  \nLooking at this problem with common sense I would say your problem is going to be very likely that the correlation might not be very strong. Or it might be strong because of a latent variable, which is quite dangerous. To give a small example: let's say the quality of the drug against a certain disease (not possible to predict/capture in data) leads to it bought a lot (market share), and the fact that it's bought a lot leads to \"market awareness\". Then the Marketing team will spend a lot of money on marketing while there's actually nobody looking at the ads.",
            "I would use classification and model uplift + a cutoff value for what is deemed \u201csuccessful\u201d",
            "Given all you have is the market share and this drug awareness score, just build a simple curve-fitting model that links the two and call it a day.",
            "You could cluster all of the drugs you\u2019re looking at, then run a time series analysis on the cluster(s) that you\u2019ve determined represent a successful launch.",
            "Are there drugs that succeded?\n\nIf yes, how about a multitimeseries Analysis?"
        ]
    },
    {
        "id": "1c32kuk",
        "datetime": 1713016671.0,
        "flair": "Statistics",
        "title": "Looking for a decision-making framework ",
        "score": 2,
        "comment counts": 16,
        "content": "I'm a data analyst working for a loan lender/servicer startup. I'm the first statistician they hired for a loan servicing department and I think I might be reinventing a wheel here.\n\nThe most common problem at my work is asking \"we do X to make a borrower perform better. Should we be doing that?\"\n\nFor example when a borrower stops paying, we deliver a letter to their property. I performed a randomized A/B test and checked if such action significantly lowers a probability of a default using a two-sample binomial test. I also used Bayesian hypothesis testing for some similar problems.\n\nHowever, this problem gets more complicated. For example, say we have four different campaigns to prevent the default, happening at various stages of delinquency and we want to learn about the effectiveness of each of these four strategies. The effectiveness of the last (fourth) campaign could be underestimated, because the current effect is conditional on the previous three strategies not driving any payments.\n\nAdditionally, I think I'm asking a wrong question most of the time. I don't think it's essential to know if experimental group performs better than control at alpha=0.05. It's rather the opposite: we are 95% certain that a campaign is *not* cost-effective and should be retired? The rough prior here is \"doing something is very likely better than doing nothing \"\n\nAs another example, I tested gift cards in the past for some campaigns: \"if you take action A you will get a gift card for that.\" I run A/B testing again. I assumed that in order to increase the cost-effectives of such gift card  campaign, it's essential to make this offer time-constrained, because the more time a client gets, the more likely they become to take a desired action spontaneously, independently from the gift card incentive. So we pay for something the clients would have done anyway. Is my thinking right? Should the campaign be introduced permanently only if the test shows that we are 95% certain that the experimental group is more cost-effective than the control? Or is it enough to be just 51% certain? In other words, isn't the classical frequentist 0.05 threshold too conservative for practical business decisions?\n\n\n1. Am I even asking the right questions here?\n2. Is there a widely used framework for such problem of testing sequential treatments and their cost-effectivess? How to randomize the groups, given that applying the next treatment depends on the previous treatment not being effective? Maybe I don't even need control groups, just a huge logistic regression model to eliminate the impact of the covariates?\n3. Should I be 95% certain we are doing good or 95% certain we are doing bad (smells frequentist) or just 51% certain (smells bayesian) to take an action?",
        "comments": [
            "I am just trying to decipher what exactly are you trying to do. Retain borrower? Reduce default rates?",
            "My company build a custom deployment framework because there\u2019s just not bespoke stuff like that out there. I run this software not. You\u2019d create tremendous value if you can create this yourself. And you most certainly will have to do that. It sounds like a custom service.\n\nI don\u2019t know your framework to office advice so if you told me about your deployment environment and tools I might have better advice. Like, what software, how are you pulling data, what type of dbs, what do you currently do to handle the sequential testing?\n\nIn terms of models, it sounds like hierarchical or mixed models, but Bayesian seems most appropriate (I have no experience there unfortunately).",
            "Sounds like causal analysis will help. Assuming there is some variation in the data. Some go thru treatment 1, some go through treatment 1 and 2 ....then you can compute the treatment effect, or the lack of.",
            "Interesting",
            "Dynamic programming If you can model the problem as a sequential decision-making process with known transition probabilities, dynamic programming can be used to find the optimal sequence of interventions that maximizes the expected cumulative reward or minimizes the expected cumulative cost."
        ]
    },
    {
        "id": "1c1vciv",
        "datetime": 1712881995.0,
        "flair": "Career Discussion",
        "title": "What realistically will be automated in the next 5 years for data scientists / ML engineers? Plus would love some career advice",
        "score": 174,
        "comment counts": 138,
        "content": "Recently I\u2019ve been job hunting and have hit the sad realization that I\u2019ll have to take a salary cut if I want to work for a company with good ML practices. I have a lot of student loans from master\u2019s program. \n\nI\u2019ve been trying to keep up with LLM coding automations and software automators. It\u2019s all beginning to seriously make me anxious but I think the probability I\u2019m overreacting is at least 50%.\n\nHow much of a data scientist\u2019s job do you think will be completely automated? Do you think we (recent master\u2019s graduates with lots of debt) made the wrong choice? What areas can I strengthen to begin to future proof myself? Should I just chill out and just be ready to learn and adapt continuously?\n\nMy thinking is that I want to do more ML engineering or ML infra engineering even though right now I\u2019m just a data scientist. It feels like this career path will pay off my loans, have some security, and also is better than dealing with business stakeholders sometimes. \n\nI am considering taking a bad pay cut to do more sophisticated ML where I\u2019ll be building more scalable models and dealing with models in production. My thought process is this is the path to ML engineer. However my anxiety is terrifying me. Should I just not take the pay cut and continue to pay off loans + wait for a new opportunity? I fear the longer I wait, the worse my skills at a bad company become. Also would rather take a pay hit now and not in 1 year.\n\n My fear with taking pay cut is that I\u2019ll be broke for a year and then in another year automations and coding bots might really become sophisticated. \n\nAnyways, if anyone\u2019s knowledgeable would love to chat. This market and my loans are the most depressing realization ever\n\n",
        "comments": [
            "Here's what skills I am trying to develop as an AI/ML engineer:\n\n- actually knowing the models well. I.e being able to pick the right model for the right job.\n\n- efficiently hosting models on the cloud.\n\n- trust, safety, and explainability in models.",
            "Although a cloud practitioner is not a data scientist, and I am certainly not a cloud practitioner, the cloud practitioner stuff will be tough to automate. Knowing *what to do* on a more big-picture level will be tough to hand over to LLMs. Arguably point-and-click driven tools will continue to be popular, and they will continue to need a human in the loop.",
            "I\u2019ve been in industry at this point for a long time. I\u2019ve seen many cycles where layoffs have occurred. I largely have been isolated from that and got my first layoff ever in my career after working nearly 15 years. \n\nI was working with as a subcontractor/AWS ProServe Partner and eventually all the low side work was completed and I was awaiting to onboard for new project. There were some issues with my clearance at that time, which was beyond my control. Ultimately, it lead the company to let me go without severance.\n\nIt actually fucked with my head a lot. I have basically always done great work delivering for my clients and always aimed to do the right thing. And now I was being faced with unemployment first the  time in my life at 36 years old, expecting my first child in June\u201924. I couldn\u2019t even let go that I know it wasn\u2019t personal just business.\n\nIt felt very personal, largely cause it happened to me. So I can understand your fear and I can understand why you don\u2019t feel secure with the developments you are seeing through automation. \n\nI\u2019ve been automating things with code since 2004 and there may come a time when I am automated out of a job, but I don\u2019t think so.\n\nI have read a book by Cal Newport called [\u201cSo good  they can\u2019t ignore you\u201d](https://amzn.to/3VZU91h). This book completely changed my life and is the single best book I recommend anybody starting out with data science. Is the book about data science, no? Will it teach you to be a data scientist, no? It however will teach you how to pursue mastery in your craft as a way to build autonomy and security. \n\nFor you, I recommend you do like I did. \n\nI started as a DBA - Database Analyst, I learned everything I could about SQL, data warehousing, ETL, etc. And every two to five years, I take on a new technology to go hard one. Doing this you will accumulate so much experience and your skill set will be broad, but you also want to go deep in one or two areas. The two areas that I went deep on was programming and cybersecurity to couple with my DBA and Data Science skills. Cybersecurity is obviously too broad, so I focused on learning all about Security, Pen Testing and Reverse Engineering/Malware Analysis. \n\nIf you start now, I Cant guarantee you won\u2019t be unemployed for some time. So I must recommend for you to also get on a plan for paying off debt, setting out an emergency fund and planning for retirement. I started reading books at debt free living and reverse budgets. I eventually found Dave Ramsey\u2019s [\u201cTotal Money Makeover\u201d](https://www.amazon.com/dp/1595555277/ref=cm_sw_r_as_gl_api_gl_i_DWRC5XQ3ZDJ0W7NF7B6N?linkCode=ml2&tag=insightsthrou-20). I don\u2019t agree with everything that Dave says but I agree that having no debt (except the house) and emergency fund feels great.\n\nI was able to start a new job within two weeks after being let go. Some would say that is luck, some say it was good timing, etc. All I know is that I am going to keep refining my skills throughout my entire career because there is value and compound interest in acquiring the skills. \n\nI recommend you couple your data science with programming and some other skill. You don\u2019t need to be the smartest person in the room. You just need to be able to bring value to a team.\n\nWhat are skills required for the future idk, I just know that Cloud Computing is super popular right now, and I was lucky to get into about 3-4 years ago. I completed almost all the AWS certs now. Does that mean I won\u2019t be let go, absolutely not! But it does provide me with a wide range of skills which I can bring to bear on solving problems in code and for clients. And if you can do that you\u2019ll fine.\n\nDon\u2019t let your thoughts get in the way of you taking actions, rather use them a barometer to understand yourself and during the process of inquiry. As yourself questions about why you feel this way\u2026 don\u2019t bullshit yourself be honest. If it\u2019s a gap in skills or lack of productivity because you\u2019re more junior then own that and make yourself better. Keep moving forward brother.",
            "Soft skills + agency >>> general dev skills > ML knowledge > everything else\n\nBeing able to talk to stakeholders and deliver a project from start to finish is what matters to people. The technical details of how only matters to you (I still like communicating it to set expectations).\n\nOftentimes delivery speeds matter more because it lets you iterate over the whole problem space and get good back and forths about wtf the project actually needs to do (hint it's rarely what it starts off as). \n\nI'm a fan of just throwing the best language models at a problem first, then breaking it down and optimising individual steps with smaller simpler models if it's worth my time.\n\n(NLP domain, your mileage may vary)",
            "Most of the current AI stuff is hype. LLMs are not all that useful on their own, the hype will blow over and most of these AI jobs will disappear once companies realize that there is no return on investment."
        ]
    },
    {
        "id": "1c25koq",
        "datetime": 1712917046.0,
        "flair": "Discussion",
        "title": "What's next for the quintessential DS role?",
        "score": 25,
        "comment counts": 14,
        "content": "This post is multiple questions wrapped into a single topic kind of thing which is why I thought best to keep it as an open-ended discussion.\n\n  \n**Q1.** When I see recent DS job postings a majority now have these two added requirements: 1. Some knowledge of LLMs. 2. Experience in NLP. I'm not sure if this is just biased based on what LinkedIn algorithm is showing me. But is this the direction that the average DS role is headed? I've always considered myself as a jack of all trades, flexible DS, but with no expertise is any technical vertical. Is the demand for the general data scientist role diminishing?  \n  \n**Q2.** In my 5 years of experience as a DS I've worked on descriptive analytics, predictive modelling, dash-boarding in consulting and product alike. Now, 5 years isn't that much time, but it's not too short either. I'm now finding myself working on similar types of problems (churn, risk, forecasting) and similar tools and workflows. This is not a complaint by any means, it is expected. But this got me thinking... Are there new tools and workflows out there that might enhance my current working setup?  For example: I sometimes find myself struggling to manage code for different variations of datasets used for different model versions. After loads of experimentation my directory is a mess. I'd love to know tools and workflows you use for typical DS problems.  \n  \nHere's mine:  \ncode/notebook editor: VScode  \nversioning: git/github  \narchiving & comparing models: MLFlow \\[local only within project context\\]  \nhyperparameter optimisation: Optuna  \ninference endpoint deployment: fastapi  \nconvey results and progress: good ol' excel and powerpoint :p\n\n",
        "comments": [
            "For point 1. \nI think this is just what teams put into requirements as a lot of roles don\u2019t really require knowledge of LLMs or Neural Net in general. Most of the value is still derived from supervised learning. However having these on resume as skills does show that you are constantly learning and know about various advancements in the field you work in. \nAt least this is how I look at it. After all, Attention is all you need! \n\n2. For your second point - I think your working set up is very good.  I don\u2019t use MLflow and Optuna",
            "Optuna is the future also dude your work is top class",
            "Data versioning is a thing, if it\u2019s the same dataset changing over time that is causing the problem. then you change the training code to select which commit essentially it should use. For example, DVC is one well established one, or delta tables are another option.\n\nSounds a good set up to me for the jobs you specify. I\u2019d suggest making sure you\u2019re getting the most out of each of them, possibly looking to customise. For example, do you have pre commits set up, do you need custom hooks for git, etc.\n\nAdditionally maybe looking at deployment pain points could find you ways to add new tools. Would containerising with docker help? Are you reusing software efficiently? I agree with the other point where engineering and deployment is a bigger part now.\n\n\nOn the LLM point I wouldn\u2019t be surprised if most companies have internal pressure to deploy something in this area based on the current hype cycle, so are including it in job adverts. Having somebody able to do this quickly is going to be a benefit - but I\u2019d naively imagine most are just using pre trained models or an API and still value flexibility (it is SOME experience in NLP they ask for, after all ).\n \nif someone wanted the LLM experience to tick a box it really wouldn\u2019t take long nowadays given how accessible they are and fits with the jack of all trades approach. I\u2019d be trying to get a project in the common deep learning areas - vision, NLP, decision making with RL, ++",
            "Interesting question. Not necessarily NLP; that\u2019s just an easy thing to throw out there and they may never ever need you to actually leverage any kind of NLP. I would say something I\u2019ve noticed is that as a data scientist machine learning engineer (or whatever flavor of title your company has given you), expect to know data engineering methods and techniques and how to implement them more than you already know or have learned in school. I feel like the data scientist and data engineering roles have become more and more blended.\u00a0"
        ]
    },
    {
        "id": "1c1y7yu",
        "datetime": 1712890054.0,
        "flair": "Discussion",
        "title": "Am I glorifying ML research roles? ",
        "score": 60,
        "comment counts": 183,
        "content": "I think it\u2019s safe to say I\u2019ve constantly been \u201cfantasizing\u201d about getting a job as a ML researcher. Not traditional data science. I\u2019m talking one of the jobs that involve cutting edge modeling like https://www.microsoft.com/en-us/research/careers/.\n\nTo me it sounds like the dream DS job. Read papers on ML, implement them in software, and get paid a lot to do it. \n\nThe thing is though, with an MS in Stats it\u2019s safe to say it\u2019s nearly impossible for me to land any kind of research data scientist job because almost all the job reqs require a PhD. I just wish there wasn\u2019t a stigma associated with MS graduates. We know stuff too. Just because we have an MS and not a PhD doesn\u2019t make us any less qualified. Is there anyone here who is in a data science research role without a PhD?\n\nAm I glorifying these roles? \n\n",
        "comments": [
            "> Just because we have an MS and not a PhD doesn\u2019t make us any less qualified\n\nunfortunately, it literally does; it doesn\u2019t make you intellectually less capable but it means you are much less likely to have demonstrated research experience, and that does matter",
            "So a research heavy role requires a proven record of research. Big surprise.",
            "A couple of thoughts on this - 1. These roles typically also require (or at least encourage) publication of research in some form or another - PhDs are typically going to have more experience with this 2. Even companies like Microsoft only need so many people to \"read papers on ML and implement them in software\", I think there is  currently a lot more DS work to be done that involves figuring out profitable ways to leverage models that have already been implemented in software by someone else.",
            "Hmm. Actually having a Ph.D. literally means you are more qualified than an M.S holder.\n\nThe gap between a Ph.D and M.Sc is actually greater than the gap between M.Sc and layman. \n\nExceptions do exist. A dedicated genius can learn to contribute to a field without a degree but these are exceptions. MS or IBM will not go fishing for an exceptional M.Sc holder or an uncut gem when they can fish in a much smaller pond of Ph.Ds with published research.\n\nNote that these research roles are not for using libraries, they are for actually tweaking the method itself, creating new estimators etc.",
            "A job involving reading about what others have done and then implementing it doesn't really sound like cutting-edge research to me. It sounds like a big standard industry job.\n\nReal cutting-edge research is about discovering / making / doing things that nobody else has done before. That's hard and requires years of training to be good at, hence the requirements for a PhD.\n\nAnd yes, just because you have a Masters literally does make you less qualified than a PhD. It doesn't make you dumber or less competent. But the difference is literally that PhDs have more qualifications than you do."
        ]
    },
    {
        "id": "1c29ri7",
        "datetime": 1712930088.0,
        "flair": "AI",
        "title": "Retrieval-Augmented Language Modeling (REALM)",
        "score": 7,
        "comment counts": 9,
        "content": "I just came upon (what I think is) the original REALM paper, [\u201cRetrieval-Augmented Language Model Pre-Training\u201d](https://arxiv.org/abs/2002.08909). Really interesting idea, but there are some key details that escaped me regarding the role of the retriever. I was hoping someone here could set me straight:\n\n1. **First and most critically, is retrieval-augmentation only relevant for generative models?** You hear a lot about RAG, but couldn\u2019t there also be like RAU? Like in encoding some piece of text X for a downstream non-generative task Y, the encoder has access to a knowledge store from which relevant information is identified, retrieved, and then included in the embedding process to refine the model\u2019s representation of the original text X? Conceptually this makes sense to me, and it seems to be what the REALM paper did (where the task Y was QA), but I can\u2019t find any other examples online of this kind of thing. Retrieval-augmentation only ever seems to be applied to generative tasks. So yeah, is that always the case, or can RAU also exist?\n\n2. **If a language model is trained using retrieval augmentation, that would mean the retriever is part of the model architecture, right?** In other words, come inference time, there must always be some retrieval going on, which further implies that the knowledge store from which documents are retrieved must also always exist, right? Or is all the machinery around the retrieval piece only an artifact of training and can be dropped after learning is done?\n\n3. **Is the primary benefit of REALM that it allows for smaller model?** The rationale behind this question: Without the retrieval step, the 100% of the model\u2019s latent knowledge must be contained within the weights of the attention mechanism (I think). For foundation models which are expected to know basically everything, that requires a huge number of weights. However if the model can inject context into the representation via some other mechanism, such as retrieval augmentation, the rest of the model after retrieval (e.g., the attention mechanism) has less work to do and can be smaller/simpler. Have I understand the big idea here?",
        "comments": [
            "Found [5 relevant code implementations](https://www.catalyzex.com/paper/arxiv:2002.08909/code) for \"REALM: Retrieval-Augmented Language Model Pre-Training\".\n\n[Ask the author(s) a question](https://www.catalyzex.com/paper/arxiv:2002.08909?autofocus=question) about the paper or code.\n\nIf you have code to share with the community, please add it [here](https://www.catalyzex.com/add_code?paper_url=https://arxiv.org/abs/2002.08909&title=REALM%3A+Retrieval-Augmented+Language+Model+Pre-Training) \ud83d\ude0a\ud83d\ude4f\n\n--\n\nTo opt out from receiving code links, DM me.",
            "nice",
            "Can people please up vote I need some advice and I don't have enough comment karma"
        ]
    },
    {
        "id": "1c344vh",
        "datetime": 1713020923.0,
        "flair": "Discussion",
        "title": "Feedback on response: What realistically will be automated in the next 5 years for data scientists/ML engineers?",
        "score": 0,
        "comment counts": 6,
        "content": "I had responded to Reddit thread [here](https://www.reddit.com/r/datascience/s/lhe8RQK6Up) I was completely blown away with the traction my response received. \n\nI wanted to thank everyone who took the time to read and share there thoughts. I would also appreciate if folks could share constructive feedback for me on the writing.\n\nI have a very small tech blog that I\u2019ve been wanting to write on for a while now. I wasn\u2019t sure where to start or what topics I should focus on first. I decided that I with all the engagement of that I would try to unpack the advice on the blog, which can be found [here](https://insightsthroughdiscovery.com/what-will-be-automated-by-ai-ml-in-the-next-five-years/). \n\nThe website hasn\u2019t had much work on it, not really looking for feedback on the website itself, cause I know it needs work. I\u2019m looking for feedback about the blog post and about the content within it?\n\nI would also like to hear about what topics you as a reader might be interested in reading about. Thank you, in advance for your feedback and I hope you have a great weekend ahead.",
        "comments": [
            "Making a post to get engagement on reddit and drive traffic to your own website hmmm \ud83e\udd14",
            "Will the ds people running ai ever be out of a job?",
            "Potential topic - There seems to be a lot of different tech stacks out there but they are doing similar things under the hood. And new techs come out all the time. So what skills are common to most tech stacks and are transferable?",
            "Great. Your experiences real world use case. \nI have review your blog post.\n\nWhile your predictions are interesting, you could consider providing more specific examples and use cases to illustrate your points. For instance, you mention \"data cleaning and preprocessing\" as a task that will be automated. You could elaborate on this by describing the current challenges in data cleaning and how automation can streamline the process, perhaps with a real-world example."
        ]
    },
    {
        "id": "1c1vo54",
        "datetime": 1712882866.0,
        "flair": "ML",
        "title": "How do you deal with predicting purchases where the purchases are extremely imbalanced and the data is extremely sparse. ",
        "score": 22,
        "comment counts": 31,
        "content": "Dataset has 300 million rows. Only 1 million have purchases. So the dataset is extremely sparse. \n\nI\u2019m taking the one million purchases and taking a random sample of one million non purchases and training my model on that. \n\nIs this approach feasible? Are there any other approaches people would recommend. Any papers on this? \n\nTrying to predict conversions on an ads platform. ",
        "comments": [
            "This approach is on the track. Down sampling is good. You have lots of data so down sampling should not be a problem for you. And it's good for your efficiency since you have less data now. \n\nBut when you are evaluating your model on the testing dataset. Make sure the test data actually follows the real distribution which is imbalanced to reflect the reality. \n\nAlso make sure you add one more step of score calibration at the end of your model building to make the prediction score follow the statistical distribution of being positive. The approach can be as simple as bucketize the prediction scores and apply a multiplier for each bucket.",
            "Your biggest issue as you\u2019ve correctly identified is scarcity. A third of one percent of people actually compete a purchase. That\u2019s an exceedingly rare event. I\u2019ve often solved similar problems using two different models. First, can you predict who is and isn\u2019t going to buy something? If you can, grab all of the records where you have a confidence they will purchase above some threshold and then try to predict what they will buy. You can create an extra class amongst this slimmed down group for those that are strong prospects but ultimately didn\u2019t complete a purchase. You may also want to cluster items into logical groups as you may have a better shot at predicting someone wants to buy tshirts vs wanting to buy a specific tshirt.",
            "For xgboost there is a parameter called scale_pos_weight, which weighs the classes proportionally. That has worked best for me in the past. Better than manually down sampling in my experience",
            "The problem really comes in when you have a really low raw count of positive cases. You do not. There\u2019s nothing wrong with modeling an imbalanced dataset. Other comments about downsampling affect the score calibration (you\u2019ll get a 50:50 distribution of scores if you weight or downsample), but really don\u2019t affect goodness of fit. \n\nI personally hate throwing out data. If you have compute just model the whole dang dataset.",
            "If you can link purchases to individual customers, you could look at [buy till you die](https://en.wikipedia.org/wiki/Buy_Till_you_Die) models. But those are simple parameterized models, not machine learning. Also, not applicable if you are trying to forecast which particular items will be bought next."
        ]
    },
    {
        "id": "1c28knv",
        "datetime": 1712926892.0,
        "flair": "AI",
        "title": "Advice and Resources Needed for Project on Auditing and Reversing LLMs employing coordinate ascent",
        "score": 2,
        "comment counts": 3,
        "content": "This may not be the right place to ask but really need advice.  \n\nI am a college  student and I'm working on a project for Auditing LLMs by reversing an LLM and looking for prompt - output pairs. I want to know which model would suit my purpose . I wanted to evaluate pretrained models like LLaMA , Mistral etc . I found a research paper doing experiments on GPT -2 and Gpt-j.  For the academic purposes i intend to extend the experiment to other llms like Mistral, LLaMA ,  somw suggestions are welcome .\n\nI am a beginner here and I have not worked on LLMs for prompting or optimization problems. I am really not sure how to progress and would appreciate any resources for performing experiments on LLMs.  \n\nAlso any concepts that i should know of ? .\nAlso im curious how do you usually run and train such models . Especially when there are constraints in computational power. \n\n What do you usually when access to server / gpu is limited . Any resources where it is easy to get GPU for distribted parallel computing that are easy to obtain? Other than google colab.  ",
        "comments": [
            "Does your school have computing resources for students in your program? Like a cluster or a supercomputer you can sign up for time on?",
            "Can people please up vote I need some advice and I don't have enough comment karma"
        ]
    },
    {
        "id": "1c1yc8i",
        "datetime": 1712890389.0,
        "flair": "Career Discussion",
        "title": "Advice on what types of entry-level roles to seek",
        "score": 7,
        "comment counts": 19,
        "content": "Hello! \n\nI'm looking for some career advice on what types of roles to seek, resume help, and how to get feedback on an NLP Developer technical project that I completed. \n\nA little about me: I have a B.Sc. in CS, a data science internship (data visualization, data analysis, and LLM experience), and research experience (published in JMIR). I've been looking for work, full-time, for about a month and a half now, 170 applications sent, and minimal responses so far.\u00a0**My end goal is to find a DS/ML role**. Something that will allow me to learn more technical skills in ML. I am primarily seeking remote roles in Canada and the United States, but am open to hybrid in my location. At some point, I plan on going back to school to receive an master's in CS specializing in ML, but at the moment, I'm looking to get more work experience.\n\nThe only response I've received so far is from an international company as NLP developer. I wrote a technical project for them, but sadly did not receive the role or any feedback on my submission. I have posted two examples of resumes that I might submit. I tailor my resume for each job, which usually means using an ATS checker and changing the summary/skills section to match the verbiage of the job. I'm looking for some advice:\n\n**1)**\u00a0At the moment, I'm casting a wide net in terms of roles. I've been applying to entry-level DA/DS/DE and ML related roles. Recently, I've focused a bit more on the DA roles, as I generally have all the skills they are asking for, and my data science internship involved a lot of dashboard building. That said, I am most interested in roles that involve ML (especially LLM and NLP).\n\nI sometimes feel as though I'm selling myself short by applying so much to data analytics roles when I really want ML.\u00a0**Should I continue to cast a wide net in terms of roles, or would it be better to focus more time on applying to only DS/ML roles?**\n\nI should mention, time is of the essence, and work experience is so valuable in this market, so I would rather start working as a data analyst now than hold out for another year to find an ML role.\n\n**2)**\u00a0Related to Q1. I'm currently working on the Google Data Analytics Profesional cert, which I'm finding boring and easy. At this point, I will finish it, but I'm wondering what certs to take next. I could do more data analytics certs, but I was considering doing Andrew Ng's ML and DL certs on Coursea.\u00a0**Would you recommend these certs, or any comments on other certs to take?**\n\n**3)** **Do you have any suggestions on my resumes?**\u00a0I know that it's a bit verbose in the skills section, but I'm trying to ensure that I make it through the initial screening if they are using AI.\n\n**4)** **Where can I post my technical project for the NLP developer position that I applied to?**\u00a0I would love to get some feedback on my submission, and I think it could be helpful for other Redditors looking for the same types of roles.\n\nThank you! ",
        "comments": [
            "[deleted]",
            "ML Resume:\u00a0[https://www.zippyshare.day/5LtewD2od3hXKhE/file](https://www.zippyshare.day/5LtewD2od3hXKhE/file)\n\nDS Resume:\u00a0[https://www.zippyshare.day/5GjSSZwVVPKeKvb/file](https://www.zippyshare.day/5GjSSZwVVPKeKvb/file)\n\nI tried to add these resumes as images to the post, and moderators took that post down. If you don't want to click on these Zippyshare links, you can view another post I made [here](https://www.reddit.com/r/datasciencecareers/comments/1c1m99s/advice_on_types_of_entrylevel_roles_to_seek/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button).",
            "Make a post on r/EngineeringResumes for resume review and follow their wiki if you haven't already. I didn't look at your resume. \n\nThe market is tough for entry level right now, especially with just a BS, only suggestion is to cast a wider net, maybe entertain data engineering or CS roles and then pivot later.",
            "Its good to secure a job first.. although if it's DA. You can get a good pay scale plus you can gain some experience and contacts are bonus. From there you can plan how, when and from where you can get into ML or DS.",
            "The situation what you\u2019re describing is like everyone else\u2019s situation. Your first job you\u2019re going to be overqualified for. Try to find any DA role (even if it\u2019s a consulting company) for the experience and leave after 1.5-2 years for a better role and work your way up in roles. I wish there was a short cut, and I\u2019ve tried everything, but there isn\u2019t.\n\nThe only real shortcut is go to top ranking school and networking. The rest of us kinda only have 1 path"
        ]
    },
    {
        "id": "1c1owjv",
        "datetime": 1712866253.0,
        "flair": "Tools",
        "title": "Ibis/dbplyr equivalent now on julia as TidierDB.jl",
        "score": 19,
        "comment counts": 8,
        "content": "I know a lot of ppl here dont love/heavily use julia, but I thought I'd share this package i came across here incase some people find it interesting/useful. \n\n[TidierDB.jl](https://github.com/TidierOrg/TidierDB.jl) seems to be a reimplementation of dbplyr and inspired by ibis as well. It gives users the [TidierData.jl](https://github.com/TidierOrg/TidierData.jl) (aka dplyr/tidyr) syntax for 6 backends (duckdb is the default, but there are others ie mysql, mssql, postgres, clickhouse etc). \n\nInterestingly, it seems that julia is having [consistent growth](https://discourse.julialang.org/t/some-julia-growth-usage-stats/112547), and they have  [native quarto](https://discourse.julialang.org/t/ann-native-julia-engine-for-quarto-using-quartonotebookrunner-jl/112753) support now. Who knows where julia will be in 10 yrs.. mb itll get to 1% on the tiobe index ",
        "comments": [
            "BTW, TidierDB is part of a larger \"100% Julia reimplementation of R's tidyverse\": https://github.com/TidierOrg. It covers plotting \u00e0 la ggplot, some web scraping and other things.",
            "Stop trying to make Julia happen. Julia is not going to happen"
        ]
    },
    {
        "id": "1c1y109",
        "datetime": 1712889477.0,
        "flair": "ML",
        "title": "The Mechanisms of LLM Prompting and Next Word Prediction",
        "score": 4,
        "comment counts": 2,
        "content": " Is a prompt always necessary for a large language model to generate a response? What processes occur behind the scenes when a prompt is given? How is prompting connected to the next word prediction in LLMs? ",
        "comments": [
            "No, a prompt is not necessary. In simple terms the generative process is generating a sequence of tokens (words) one at a time according to the probability of that assignment*. If there were no probabilities then it would be a purely random process that would generate random gibberish.\n\nOne mechanism to assess the probability of the next token is self attention (more specifically masked self attention in this case but keeping it simple). In this, we decide how probable the next token is based on the output itself ... i.e. if the output sequence already generated is \"How are ____\" then the model will learn that \"you\" is a very probable next token; \"how\" is a very improbable next token; and something like \"babies\" is moderately probable. This is self attention and can generate without any prompts.\n\nWhile this actually creates quite an effective generative process - it creates credible outputs/sentences - it is not controlled by a user and can't answer questions or be steered towards topics of interest. For this reason, we also use cross attention. This is basically the same thing but here we determine how probable the next token is based on a different input. Commonly this input will be a written user prompt but it could be an image or audio file or pretty much anything that produces a trainable pattern. This allows us to then direct the LLM to generate based on user requests. If we had only cross attention (no self attention) we would generate relevant words, but they would not be coherent sentences just a list of words associated with the input/prompt topic.\n\nSo in practice, LLMs generate the next token by considering two bits of evidence: (1) the previous words generated (self attention) and (2) the prompt/user input (cross attention). (1) Ensures that we generate coherent outputs and (2) makes generation controllable and linked to user input/prompts.\n\n*I appreciate RLHF and all that and the above probably better describes greedy generation, but for simplicity...",
            "Tokenization, context, pattern recognition, probability estimation, response and output text."
        ]
    },
    {
        "id": "1c1hqxo",
        "datetime": 1712848824.0,
        "flair": "Career Discussion",
        "title": "Data science vs Consulting ",
        "score": 19,
        "comment counts": 22,
        "content": "I went through a bunch of tech and operational roles for 5 years. For 1.5 years till 6 months ago, I was in an academia adjacent research role heavy on data analytics. Last 6 months I have moved to a full fledged data science role. Not much of neural networks/deep learning. Most work is tabulation and/or random forests, logistic regression and such.\n\nI might potentially get an offer to move into consulting (not MBB but globally known).\n\nFor many years, I was solely focussed on advancing my career in DS. But, hearing stories about how hard it is to even get interviews I am a but nervous about what the future holds after my current gig.\n\nI have a master's from an Ivy+ uni which is not a full fledged DS degree but involved a decent amount of DS coursework. I have about 8 years of work ex overall (But only <2 in DS). Currently working in the public health domain.\n\nDo you think it's worthwhile continuing the DS journey or should I switch? Any opinions or advice is helpful.",
        "comments": [
            "Ex consultant who went to FAANG DS. Consulting can be a great leaping off point for future roles, especially if you're doing DS consulting.",
            "I went to the commercial side of a pharma, it\u2019s more money and less work. Different kind of work though. Base is $150k with good benefits and work hours are like 9-3.\n\nNow, I do work til 5 because I want growth but office is usually empty after 2pm.",
            "Hey man don't let others scare you do what your passionate about if that's consulting or advancing career in ds that's where you will be most successful",
            "This is not a great answer for you, but over a long analytics / data science career, I went where the work was. That meant I jumped between small and large firms, Fortune 500 firms, FAANGs, advertising agencies, market research firms, consulting firms, and even did some independent consulting. My work ranged from IC to managing small teams to managing departments. \n\nEach place had its pluses and minuses, but if pushed, I would say my work at a Fortune 500 firm was the most fulfilling for a very straightforward reason: they consistently applied my analytics work for decision-making. I only wish that had been the case elsewhere, as even at the FAANG it was an uphill battle to get them to actually act on clear findings.",
            "I work in consulting as a lead customer facing DS. We're not a Big 4, so we actually do work. I can't speak for every org but here's my take:\n\nMy role is a mixture of technical hands on work and sales work, so I help with giving educational sessions to customers newer to \"AI\", work with a plethora of vendors/partners, scope out projects that range from basic planning to advanced cutting edge work, then get to help lead execution of the work. It's a nice balance with varying exposure.\n\nOne benefit is the variety of projects and industries you get to work on/with. Every customer has a different challenge and level of maturity, which means you'll likely have plenty of variety. \n\nAnother is typically the pay. Because you're essentially contracted out to clients to do work at a higher hourly rate and need to actually have consulting skills, you can usually demand higher than average comp. \n\nOne big downside is that it becomes difficult to form any true domain knowledge unless you bring it from previous experience. This is mostly due to the constant context switching between projects and clients. On one hand it's nice to lean in and have them educate you on their processes, but it can cause for some slow starts or frustration.\n\nAnother is typically how consultants are measured. Project utilization is a key metric for anyone that does customer facing work. The problem with this is many firms (maybe not all) will emphasize your utilization before your wants/career goals. I.e. - you may find yourself doing DS adjacent or random boring work just to keep your rate up. \n\nEach company and role is different. Try to get a feel for the consulting company itself and how they measure your value and where they see their DS capabilities going. You want to make sure the consulting firm sees a positive trajectory for DS otherwise there won't be investment and you'll likely become underutilized or burnt out."
        ]
    },
    {
        "id": "1c1rlkp",
        "datetime": 1712872626.0,
        "flair": "Discussion",
        "title": "Learning new skills Advice for down time?",
        "score": 5,
        "comment counts": 1,
        "content": "I would really enjoy learning and picking up new skills.  Typically I work in SQl, power bi , but I really want to work on my Python and data analysis skills on the data science level. I do use Python but just the basic pandas and group by. \n\n What can I do during working hours during my downtime for this? Typically I just maintain reports and that\u2019s about it but I want to be proactive and at least learn some new skills.  Any advice appreciated!",
        "comments": [
            "If you want to do it during your downtime at work, then I recommend you come up with a relevant business question you can try to answer, something that you think could have a real impact on the business.\n\nYou can start with doing some basic exploratory data analysis (EDA) using Python/Pandas in a Jupyter notebook.\n\nSlice the data in different ways, create some useful visualisations, and write down what you notice and whatever you think would be good exploring further. Make lots of inferences and gain more domain knowledge.\n\nOnce you get a good understanding of the data, try to answer that business question you came up with at the beginning. This could potentially lead you to trying some more advanced analytical techniques (Linear regression, causal impact analysis, survival analysis, etc\u2026).\n\nThis is more of a learn by doing approach.\n\nYou might get stuck along the way, but that\u2019s the point. Do some research, try to overcome any hurdles and finish your analysis. \n\nAt the end, if everything went well, you can share it with your stakeholders and get their thoughts/feedback."
        ]
    },
    {
        "id": "1c1i2i6",
        "datetime": 1712849621.0,
        "flair": "Tools",
        "title": "Tech Stack Recommendations?",
        "score": 16,
        "comment counts": 9,
        "content": "I'm going to start a data science group at a biotech company.  Initially it will be just me, maybe over time it would grow to include a couple more people. \n\nWhat kind of tech stack would people recommend for protein/DNA centric machine learning applications in a small group. \n\nMostly what I've done for my own personal work has been cloning github repos, running things via command-line Linux (local or on GCP instances) and also in Jupyter notebooks. But that seems a little ad hoc for a real group. \n\nThanks!",
        "comments": [
            "GCP/AWS/Azure probably pretty standard. I do dev in notebooks in GCP's Vertex on most days. Vertex is... not great, aside from developing in notebooks. But, it allows scaling compute pretty effortlessly, and switching between no GPU and an A100 (or whatever I need) is a major time saver when I get past modeling on a small sample. If I need a model built or inferences run on a schedule, I just wrap it in a DAG and run it in airflow using GCP's dataproc (managed pyspark), which can easily scale and handle R and python to process 10s of TBs of data for ETL and modeling jobs nightly. Code is saved in GitHub at end of day. \n\nNear real-time inferences via an API can be done via your cloud host or using a 3rd party edge deployment service within your cloud provider depending on your needs and budget (if you do lots of just in time inferences - like $500k worth per year or more - a 3rd party vendor can save you tons of $$). We save models in GCS buckets and outputs are saved in BigQuery. We do model and dataset/artifact tracking with a third party service that is similar to ml flow.",
            "Oooh this is fun: \n\n* **Secure Data Collection Tools**: RedCap for encrypted and secure data capture from medical devices and clinical trials.\n* **ETL/ELT Processors**: Stitch or Fivetran for HIPAA-compliant data ingestion.\n* **Data Storage**: AWS S3 or Google Cloud Storage, configured for HIPAA compliance with encryption and fine-grained access controls.\n* **Data Warehouses**: Google BigQuery or Snowflake, with strong security measures and PHI data isolation. I'd lean towards Snowflake unless my org were full of Google fans. \n* **Data Transformation**: dbt for transforming, modeling, and ensuring the quality of data in the warehouse.\n* **Compliance Management**: Datica or ClearDATA for continuous compliance monitoring with HIPAA and SOC II.\n* **Data Visualization**: Tableau for advanced data visualizations and dashboards, configured for healthcare data regulations.\n* **Report Automation**: Rollstack for automated, compliant reports for data consumers in decks and docs \n* **Security and Monitoring**: Vanta or Secureframe for continuous SOC 2 compliance monitoring and Keycloak or Okta for secure Identity and Access Management (IAM).\n* **Backup and Disaster Recovery**: Automated backups and a disaster recovery plan that meets HIPAA\u2019s contingency plan requirements.\n* **Data Team and Stakeholder Engagement**:\n   * **Data Literacy Training for Stakeholders**: Implement regular training sessions for stakeholders on data literacy, ensuring they understand how to interpret data and use analytics tools effectively. This helps in making informed decisions and leveraging data insights across the organization.\n   * **Embed a Data Consultancy Knowledgeable About Biotech**: Collaborate with a data consultancy that has familiarity of biotech to provide expert advice on managing and analyzing scientific data. Basically they act as another set of eyes, and an \"expert\" voice to help coax stakeholders to act. \n\nEpic project. Be sure to report back in a couple of years!",
            "d"
        ]
    },
    {
        "id": "1c1lf2t",
        "datetime": 1712857891.0,
        "flair": "Challenges",
        "title": "Framework for Build vs. Buy Decisions in Data Science Tools?",
        "score": 8,
        "comment counts": 5,
        "content": " Hi All,\n\nI'm a BI lead exploring the decision matrix for \"build vs. buy\" regarding data science & BI tools, especially for automating recurring reports. We have some budget this year, but leadership often questions the need for purchasing tools when we might build in-house solutions especially with the support of AI. \n\nimho, the major arguments against building internally include ongoing maintenance, scalability, and reliability issues. It seems to me that with the advancements in AI and increased competition, the cost of buying solutions is more justifiable than before.\n\nDoes anyone have a structured approach or a decision matrix for evaluating build vs. buy options? How do you weigh the pros and cons in terms of costs, time, and resource allocation? Am I overblowing the upkeep costs of buy? ",
        "comments": [
            "I\u2019ve definitely been in the same boat with the whole build vs. buy scenario. It's always a bit of a juggling act trying to figure out what's best for the team in the long run. One thing that's helped us is looking beyond just the cost. We consider things like how quickly we can get the tool up and running, and what we're really sacrificing in terms of time and manpower if we decide to build it ourselves.\n\nWe use Rollstack for some our report automation needs (dashboards to decks). It just ended up being easier and self serve for our non-python non-REST knowing users. \n\nAnd, don\u2019t forget about the support aspect. Buying usually gives you access to support further preserving your internal technical resources which can be a lifesaver, especially when you need to focus on other projects.",
            "It depends  on your use cases. We typically followed a process of:\n\nDoes our existing tools have our identified use cases on their website? If No, move to next question. If yes, we worked with partner.\n\nIs there an app which does generally what we are looking for and have our identified use case? If no, move to next question. If yes, we set up a call and discussed our use case with vendor and doing a Poc \n\nIs there on GitHub/Blog/. Other source discussing our use case or what we are trying to do. If no, move to next question. If yes, we took code adjusted to what we were trying to do into an identified pattern.\n\nIf all the questions were no\u2026 then we made a custom build.",
            "One problem with building it yourself is you\u2019re probably understaffed so it\u2019s going to take a long time and be kind of shit.\n\nBut the BIGGEST problem by far is when people leave you can\u2019t replacement their knowledge by hiring new people since everything is custom.",
            "Try something damn simple. Like a server plus cron job. \n\nSometimes 80% of your need can be met. Then try to decide what you would like to invest to close that 20% gap",
            "Clearly can write the problem you are trying to solve and the specific requirements needed to address it.\n\nHow much technical complexity? Is the tool/solution technically complex? If so, building in house may require significant expertise and time.\n\nSummarise compare the total cost of ownership."
        ]
    },
    {
        "id": "1c1v5vc",
        "datetime": 1712881495.0,
        "flair": "Career Discussion",
        "title": "Career roadmap and advice please. Transitioning from Academia to DS.",
        "score": 2,
        "comment counts": 9,
        "content": "Hello!\n\nI'm 46, Masters in Operations Research,  BS in Pure and Applied Math. I have been focusing on an academic career for a while, but lost my full-time professorship because I had problems with grading and meeting deadlines (I suffer from clinical depression). I am currently an adjunct math professor and cannot pay my bills. I love Applied Mathematics:, MDP, Game Theory, Simulation and I want to transition into some kind of Data Science field where I can learn more and utilize knowledge I have from grad school and more. I am not particularly interested in Finance or Medicine, and would find it much more interesting to work for physicists, chemists, biologists, economists, or other scientists. I am also very concerned about being able to acquire a career of this sort (and keeping it considering my depression). \n\nMy current plan is to finish 4 Coursera specializations: Python, Data Science, Data Science with R, Intermediate Data Science; 3 textbooks: Probability and Statistics for Engineers (mostly a review, but my education heavily learns toward Modeling and Probability Theory rather than Statistics), Linear Models with R, and Acing the Data Science Interview. After doing a few projects and developing a portfolio, I hope that I can find an entry level position. Then, over the next 2-3 years I would like to study Machine Learning and finding the intersection with that and my OR background --Hopefully such a field exists. At this time I may get a second masters or maybe finish my PhD. Finally, I would hope to have a career when I turn 50.\n\nI understand my depression is a big problem, but I am not going to just go on welfare and eek out an existence for the rest of my life. If you will, please look past this issue for the moment.\n\nMy current questions are: \n1) What is your opinion of my plan? Is it viable? What roadblocks might I encounter?\n2) What job titles should I apply for as an entry-level data scientist? Data Analyst, Data Scientist, Data Engineer, MLOps, MLE.... I really don't know where I would fit.\n3) Although I want to work for scientists, is this domain viable? Should I apply for my common domains like FAANG, Finance, or Medicine?\n4) Ultimately, what job title should I be aiming for considering my interests.\n\nLastly, any other advice would be greatly appreciated!\n\nThank you!",
        "comments": [
            "I am not too knowledgeable about transitioning from academia to industry but I can probably comment on the intersection between operations research and ML. Its a very sought after field in supply chain planning companies. Its quite exciting to work in the field if you can solve problems that keep arising there.\u00a0\n\nThe major roadblock I see is the transition from academia to industry and the fact that you will have a gap of a few months/years before you intend to pick your career up again.\u00a0\n\nGood luck!\u00a0",
            "In very brief, I would first suggest applying to IT/ ML/ AI consulting firms. Your skills will be useful on many projects, you can learn in a more structured way from those many projects. In consulting I met several consultants with a background similar to yours, and they were doing well.",
            "Is his plan ok?",
            "Best wishes with your career journey"
        ]
    },
    {
        "id": "1c10pqe",
        "datetime": 1712793645.0,
        "flair": "Discussion",
        "title": "Why is it so hard to get neural networks to beat XGBoost on most small-medium tabular datasets?",
        "score": 141,
        "comment counts": 89,
        "content": "I get it; XGBoost is really potent and easy to use while with DL theres a lot more that can go wrong tuning hyper parameters wise.\n\nBut i always assumed that whatever an ML model can do, a DL model with proper settings and sufficient regularization can also do as well even in low-medium size datsets (~hundreds to thoussnds of examples range).\n\nI understand that DL models are more likely to overfit because theyre very very flexible espeically especially as width and parameter count goes up. Meanwhile, something like XGBoost tends to have just enough flexibioity to model complex patterns without overfitting.\n\nAt the same time, i thought that with sufficient regularization techniques like dropout, L1/L2, shrinking width etc, that a DL model should be able to generalize just as well even with small-medium datasets. \n\nIs it just that I havent found the correct combination of regularization and layer architecture? Or is it just that my assumption is wrong?",
        "comments": [
            "Neural networks and GBTs have different inductive biases which make them better suited to different tasks. Empirically, the inductive biases of XGBoost seem to make it more suitable for modelling relationships found in real word tabular data. Theoretically, neural networks (in the limit of large model complexity) are universal approximators, so you could always argue that in this limit some combination of weights could be chosen such that an NN will outperform XGBoost if the target function can't be perfectly represented by a GBT.",
            "There's some talk on YouTube by Ga\u00ebl Varoquaux and associates as well,\nhttps://arxiv.org/abs/2207.08815",
            "Because you are using an Ill advised algorithm for the task. NN in tabular data are a glorified linear regression. Xgboost will outperform just because of boosting and tree methods being superior.",
            "From working with real data it is also because real data sux. Some random analysts might decide to set some random default columns to -1, cap some numbers to be maximum at 120 because of legacy code, or some rows might be screwed up in some ways. It's a pain to get an NN to deal with these problems, versus a lot of existing gradient boosting packages have decent defaults that takes cares of this. \n\nIntuitively decisions trees (say for classification) are just glorified ways of chopping up existing data into bins. Because of the way real data are screwed up (mostly due to human idiosyncrcies), they tend to get screwed up in a similar way and splitting data up in a decision trees seem to capture this aspect very well. Since if an engineer decides to default something to 42, it's not like we are gonna see 42.32, which is what NN is good for (extrapolating).",
            "The biggest benefit of a DL model is it can create linearly separable vector representations of non-vector data (eg images, text). Tabular data is already in a vectorized form that\u2019s probably easily separable, so at that point DL is just logistic regression."
        ]
    },
    {
        "id": "1c0ujfw",
        "datetime": 1712778189.0,
        "flair": "Career Discussion",
        "title": "Starting as a Data Scientist",
        "score": 248,
        "comment counts": 114,
        "content": "After being a data analyst in the navy and then a data engineer the last decade I start my new role as a Data Scientist working from home for my dream company on a project that I'm a subject matter expert on\n\nPretty stoked. Especially since I was laid off last month which got me to apply to a new role\n\nPretty nervous. Pretty excited. ",
        "comments": [
            "Congratulations man! \n\nNervousness is good.\n\nI'm sending my first ever model in production. Spent 2 months and 2.5KLOCS .",
            "Wow! This is really awesome. It's a great feeling to work so hard and finally land the role you've been dreaming of. Congrats!!",
            "Good stuff. Congratulations and I hope you do well! \ud83c\udf40",
            "Congratulations, with your experience you are gonna do well!\n\nWhich were the requirements for the position?",
            "Congrats! I\u2019m looking to transition soon as well. \n\nI\u2019ve worked as an analytics engineer for 2 years and a process engineer (same company) 3 years before that.\n\n\nWas contacted by a recruiter for a DS position at a competitor in my current industry. Interview is about a week out and I\u2019m feeling super nervous."
        ]
    },
    {
        "id": "1c1uafg",
        "datetime": 1712879222.0,
        "flair": "Career Discussion",
        "title": "Contracting while looking for FTE - bad idea?",
        "score": 1,
        "comment counts": 9,
        "content": "I moved from Europe to the US, and am currently unemployed while looking for a job. I don't require sponsorship and have 5yoe as a data scientist, but it's still difficult in this market.  \n\n\nI've been considering recently to take a lower-paying contract role (the ones I've been contacted about  are typically about $50-$80/h)  to have some income and give myself some more time to find a solid full time position. And then resign from the contract job when that happens, even if I've only been in that position for a couple of weeks.  \n\n\nAside from burning bridges with the contract employer, is this a bad idea? ",
        "comments": [
            "No. At will employment goes both ways, especially with contracts.",
            "Just make sure that the contract doesn't have any stipulations or penalties for jumping ship early. Other than that - yeah, that sounds like a perfectly fine plan.\n\nHaving said that - with 5 years of experience, I wonder if you need to work on your resume because while the market is bad, it's not that bad for people with meaningful experience.",
            "I don't think there is any issue in that. If anything you get more experience. Do make sure to mark it as contracting on your LinkedIn and I would recommend a single experience bullet point and not one for each contract role. That just looks like you are leaving jobs after 1-2 months very time."
        ]
    },
    {
        "id": "1c0x106",
        "datetime": 1712784235.0,
        "flair": "Discussion",
        "title": "Shout-out to all you super competent Data Scientists out there!",
        "score": 116,
        "comment counts": 51,
        "content": "In the past, I've often commented that my domain expertise and working experience (i.e. in organic chemistry) is likely the thing that is / has been most valuable to my employers so far - with my Data Science skills being a second place. So, there's likely tons of data scientists that will run circles around me with their expertise in ML, programming, maths or statistics.\n\nToday, I saw an (internal) job opening for a \"fully DS-centric\" Data Scientist position - the full shebang: multi-year-track record in diverse ML methods, experience in LLMs, fullly business-savy, proven track-record, excellent communicator, experienced in change management).\n\nThis is not the position for me. I'm happy and content with my job and with how things are going for me; I'm good at what I do. I like the niche that I have created for myself.\n\nStill, it made me realise *how effing competent* some of you guys out there really are in your respective domains! So, I just wanted to say, well done you! Keep at it! \ud83e\udde1\n\nEdit: \"proper\"",
        "comments": [
            ">a \"proper\" Data Scientist position\n\nIt sounds to me like you have a proper DS position. Picking up ML techniques is a lot easier than learning the domain _you_ know.\n\nDon't sell yourself short.",
            "Chemist to Data Science pipeline is so real tho",
            "Domain knowledge is the most important DS skill",
            "DS at this point is more and more like the BI 10 years ago. Anyone can write code, but not anyone can make sense of \u201cthis\u201d data and \u201cthis\u201d operation.",
            "Disregarding, this message is not for me."
        ]
    },
    {
        "id": "1c1lnz8",
        "datetime": 1712858491.0,
        "flair": "Discussion",
        "title": "Webinar Calendar/Master List",
        "score": 2,
        "comment counts": 4,
        "content": "There are a bunch of good free webinars, mostly by big names like Google, data bricks, neo4j, census.gov, etc. \n\nI'm trying to find a calendar that aggregates all these sorts of things. Has anyone seen something like that? I'm shocked that it isn't built.\n\nEdit --some webinars: \n\nJohn Snow Labs: https://www.johnsnowlabs.com/webinars/\n\nNeo4j: https://neo4j.com/webinars/ \n\nDatabricks: https://www.databricks.com/events?event_type=virtual-event-webinar&region=all \n\nDatabricks also has a couple podcasts and blogs\nimage.png\n\nGoogle (Not DS Specific): https://cloudonair.withgoogle.com/ \n\nUS Census: https://www.census.gov/data/academy/webinars/upcoming.html",
        "comments": [
            "I was looking at this a while ago sadly I didn't find any for all of them but looking one by one still works ig",
            "KDnuggets has a good collection.\n\n[https://www.kdnuggets.com/meetings/index.html](https://www.kdnuggets.com/meetings/index.html)\n\n[https://www.kdnuggets.com/2020/09/best-online-masters-data-science-analytics-online.html](https://www.kdnuggets.com/2020/09/best-online-masters-data-science-analytics-online.html)",
            "can you share some of those here?"
        ]
    },
    {
        "id": "1c0rkd3",
        "datetime": 1712770936.0,
        "flair": "Career Discussion",
        "title": "What does a PIP look like for data scientists?",
        "score": 174,
        "comment counts": 122,
        "content": "Im curious, for those who have been placed on a PIP, what does it look like generally and what metrics are typically measured to determine if you have met or failed to meet it?",
        "comments": [
            "pip install --upgrade pip",
            "If you are placed on a PIP, 99% of the time you are going to be on the chopping block.  For anyone on a PIP, just start looking aggressively for a new job",
            "I would start looking for a new job",
            "Hate to burst the bubble, but a PiP is just the final step in the \u201cquiet firing\u201d process. It\u2019s HR trying to cover the corporate six, especially if the victim is in a \u201cprotected\u201d group. The chance of surviving a PiP is basically zero. Use the PiP period to find another job.",
            "Would it be Pip3 more stable?. Sometimes, pip causing some version conflict"
        ]
    },
    {
        "id": "1c0zhd5",
        "datetime": 1712790341.0,
        "flair": "Career Discussion",
        "title": "Capitulation | It's come to this",
        "score": 33,
        "comment counts": 40,
        "content": "wayy TLDR: 2023 Statistics, Finance, and Math grad with only one internship in tangentially related role. Can't find a job going on well over a year. Located in the US and is mostly location agnostic with a few preferences. Just trying all my options.\n\nHello, all. I sit here writing this post in disbelief that I am actuall doing this. Generally, in any given subreddit, I am a lurker. A year ago I would have shuddered at the thought of making a post like this. It goes against my nature to ask (beg?) for help from stangers--or, even from anybody at times. I feel as if I've done everything I can and am going insane trying to figure out where I am going wrong. Confiding in my close friend, I spoke of my troubles and experience in my job search (how I had done everything possible and such), and he suggested I make a post to this subreddit. To his credit, I had not done this yet and decided to exhaust all options I know of. I don't know what to expect, but I hope at the very least a see a perspective that allows me to push past the feelings (of what I can't really pinpoint anymore but despair, desperation, and any others you can name are probably there too).\n\nThis is my first post of this nature; so, being not sure of how to start, I will begin with an overview of myself and then attempt to consruct a coherent description of my situation. Details of the biography aren't unique to me or necessarily important to understanding my struggles, but, for anyone interested I figured it would save a few additional comments that I may have to make to give them.\n\nI come from a very rural, impoverished area. Both my parents grew up even poorer than we are now, and neither went to college. My dad is a first-generation American so his life story has been quite rough as the family got here in the early 20th century (he is very old compared to my classmates fathers). I was lucky enough to have been born to two wonderful people who supported me in any way they could although not fully understanding the life I wanted--one away from poverty and where I could explore my interests with like-minded people. Luckily, I did exceptionally well in academics and found myself with an acceptance to a top ten undergrad program.\n\nMy thoughts (however misguided as there was no one I could have advise on this in highschool) were to go where the money was. Wallstreet. I had set out on studying finance my freshman year. My school  sends countless kids to the top Investment Banks every year and I thought that I should do that, get paid, find myself, and then transition to what ever industry I could if I wanted out. The school paid for a trek out to NYC and we got to go and visit all the big banks and meet with emloyees it was really fun. That was until someone told me about the work culture. This was the 2nd or third month of my freshman year so I was pretty ignorant with most things and when told about the working hours my stomach dropped. It seemed inhuman to work that much and for the price it didnt seem worth it. I struggled with this for a while and really started to enjoy the statistics class I was in. I'm sure you can guess where the rest of that story goes. I ended up switching to statistics; however, I still greatly enoyed finance and the classes were super interesting, so I kept finance for my other major. I also took several math classes past whats required for the minor but not quite the major but I enjoyed them regardless. I got to intern at INSEAD in Paris for a summer as a research assistant in statistics, so I thought I had a good chance to get some sort of data analyst/scientist/etc. role going into my senior year. Being an IB feeder school many of my classmates and friends had fulltime offers already and many more were expecting theres upon completing their upcoming internship. I however got nothing. I couldnt even find a professor that I could work for over the summer.\n\nI went through junior summer jobless and scared. Everyone told me I would find a full time offer before graduating, but I didn't. I didn't care if it was for data/business analyst or data science or marketing data analyst etc. etc.\n\nI was then graduated living at home with no income. My student loans (though very low) still were there and I began having to pay on those. There is no job I can get near me. I never had a car growing up and still dont. We just never had the money for that.\n\nLuckily, there is an amazing alumni network that is always willing to help. Additonally, one of my professors I'm close with has made some good connections for me putting me in contact with some fantastic people who have helped me with interview prep, resume advice/editing, etc. The only thing they couldn't do is give me a referral as, upon looking there were no entry level roles for me that were open.\n\nAll of this was going on as I had life happening (as it does haha) This past summer I experienced the death of a close friend, a grandparent, and many other things. Job rejection and ghost after the other just hurt. I was lucky enough to a have a few go to an interview where I got to present my analysis to a team. Now they wont respond to my emails. It feels like some jobs just try to use applicants for free ideas without ever hiring them. Then, one day my girlfriend of 11 months facetimed me saying that she \"thinks it would be best if we didnt talk anymore\". I heard later through a friend that one of the reasons was my lack of ambition since i \"didn't even have a job yet and have the \\[prestige of my undergrad\\] to back me up\". Long story short, I did something very stupid and tried to take my life, but was stopped by cops who had been notified of a \"distressed person\". I was allowed to 201 myself and 2500 of my savings later, some therapy, and several months later, I feel much better and in control, but the frustration of getting a job still is there. I don't know what to do. I feel like I see everyone saying that referrals are the best way to get a job, which makes sense, but I have very excellent alumni helping and its still not doing it for me. Everyone I graduated with is on wallstreet now or at some consulting firm being overworked, but I just want to work. I like working I want to be sucessful. I sometimes wish I had just stayed with the crowd and not been such a baby about the working hours.\n\nLooking at what I've written I realize it has gotten quite long. I don't want to be annoying, but I promised my friend I would do this. The simple act of posting this brings me much peace as I have exhausted this option. I'm not a big reddit poster/user, so apologies if I broke some unspoken rule or something of this sub. I wont exlain anymore in this initial post but will be happy to answer any commens or dms. I hope that this is somewhat coherent but writing it brought back a lot of memories I don't like to think of, so I did my best to get past it and just have something written down.\n\nThank you for reading if you did :)",
        "comments": [
            "Just wanted to say look into getting your loans switched to the SAVE plan. It\u2019s income based so you won\u2019t have to pay anything until you get a job, and you don\u2019t even accumulate interest either.",
            "If you're getting interviews but can't close on a job offer, it means you need to continue practicing your interviewing. Also, if you share your resume we can see what can be improved.",
            "Hello Sir,\nI'm not based in the US (France here ;) ) so any advice I could provide might lack relevance.\nAlthough, I just wanted to give you and virtually send you my full support and compassion regarding your personal story.\n\nI also come from a poor family, had tough times during academics in order to find a place in a world that did not want me at first.\n\nIt's not easy, the market is complicated currently, but there's light at the end of the tunnel and based on what I understood from your experience and what others said, you will end up finding something good ! Plus, you seem like someone very resilient.\n\nKeep it up !!",
            "I too didn\u2019t read through your entire post. But the general story is a good illustration of how people of your background have one hand tied behind their back in the current system, especially when it comes to getting into highly competitive fields like finance or academia. It\u2019s  not knowing How The Game Is Played, or that The Game even exists.  Middle/upper class kids are taught that almost from the moment they are born. Think of it as \u201cimplicit nepotism.\u201d  The good news is that once you figure it out, you will go further than any of them, because you know how to overcome, rather than having everything handed to you.",
            "Just try to get your foot in the door in an analyst role. Could be in operations, marketing, wherever the case is - even if it's not data analytics or data science, just go for that, so would broaden your search.\n\nHit up your alumni network, present a strong case but be humble, ask them for help. Meet folks in person - ask for an informational interview, tell them you'll treat them for coffee for some advice, and at the end of your 30 minutes, ask them if they would recommend you talk to anyone else.\n\nDo not pressure them, but ask very good questions, because that will show your potential. \n\nOnce you get that analyst job, master SQL, do all the dirty ad-hoc asks, and get really good. Be the go-to-person that handles those tasks - this is how you'll get noticed by the data team. \n\nThat'll get you your foot in the door."
        ]
    },
    {
        "id": "1c18y1q",
        "datetime": 1712819942.0,
        "flair": "Discussion",
        "title": "Tough call: How important is choosing MSc Dissertation Topic in Data Science",
        "score": 6,
        "comment counts": 25,
        "content": " \n\nI'm 22-23 years old, currently at a crossroads in choosing my dissertation placement project for my MSc in Data Science and could really use your collective wisdom. One offers **stipend (3000 pounds**) (which I really want to take), others don't. Stipend provider project is about healthcare analytics with geospatial data (idk kind of not good career potential) and there is strong supervision, it will solve real life problem and might lead to **publications**. And then there is **offer from Bank of England** which offers no stipend, no supervision, not sufficient literature review (more work should be done on modelling) but aligns with my Economics bachelors, and would look good on CV. Since I want to do phd, I also have option of doing **internal dissertation project** which I can do on my interest field which is **Machine Learning** ( Developing algorithms that enable computers to learn from data)\n\nMain: One of the options I'm considering is a project focused on \"Visualizing Geospatial Trachoma Diagnostic Data and Risk Factors.\" This involves providing analytic and mapping support to a multi-country study on trachoma in countries like Zambia, Kenya, Ethiopia, Uganda, Cameroon, and Nigeria.\n\nHere's what I'd love to get your thoughts on:\n\n1. How easy is it to transition between sectors (e.g., from healthcare to finance, tech, etc.) in data science? Does specializing early help or hinder long-term career flexibility?\n2. Are the skills and experiences from a healthcare-focused project like this seen as valuable in other sectors within data science?\n3. Have any of you made a sector switch in your data science career? What challenges did you face, and what advice would you have for someone early in their career?\n4. For those of you who have worked on healthcare data or similar projects, what has been your experience? Would you recommend diving into such a niche, and why?",
        "comments": [
            "The dissertation you do would have a minor impact on job likelihood and usually PhD too. If you have an overall understanding of ML and do well that's usually more important.\n\nAfter my BSc and MSc, when applying for jobs nobody ever cared about my dissertation. Heck even after doing my PhD in biology and bioinformatics nobody ever even cares what my specific project is. Ever. They just see PhD in biology and that's it even though I am now in the data science field, the PhD itself gets me quite far.\n\nEspecially as you already have an undergrad in economics, that will get you further in getting economics PhD/jobs than a specific project will. I would just do some ML problems on the side to bulk up your ML and coding ability, which you will need anyway as an MSC project will only scratch the surface\n\n\nWhat will make a difference to any job or PhD place you apply to, even if it is in a distinct area as fashion or aerospace is having publications. They wouldn't read it but it would be like having extra certificates to add to your qualifications. That would certainly stand you out head and shoulders above other applicants and is what I would aim for in your dissertation if I were in your shoes.",
            "that geospatial data science and drone technology intersection excite me too though",
            "I\u2019d probably lean towards the healthcare topic because:\n\n* it pays you money today\n\n* It is a better posed/defined/documented problem\n\n* you\u2019re actually improving the human condition (as opposed to banking where you\u2019re just playing the arbitrage game sucking capital out of human life)",
            "I would say that the thesis topic and publications might be relevant only in the case when applying for a phd. I had a similar case last year, where I wrote a thesis on 6D Pose estimation (computer-vision in a field of manufacturing), managed to publish a paper, but ended up working with LLM's in my full-time position :D. Of course the knowledge came in handy because the principles are somewhat similar, but the domain is different. Also, during the interview process nobody really paid much attention to my thesis work nor publication. \n\n  \nSo go with the grant and in case you want switch industries this should be no problem as you will already be familiar with ML/AI.",
            "Geospatial and healthcare are two huge industries?"
        ]
    },
    {
        "id": "1c19npu",
        "datetime": 1712822880.0,
        "flair": "Challenges",
        "title": "Coding Test (Data Science Framework) on CodeSignal",
        "score": 6,
        "comment counts": 3,
        "content": "Hello! I have received an invitation to take a coding test (Data Science Framework) on CodeSignal. Do you know where I can find practice questions similar to the ones they might ask? Are there any previous sets of questions available? Also, what should I focus on to prepare for the test? I have four days before I need to take this test, so any help would be appreciated. Thank you!",
        "comments": [
            "Is it with BCG?",
            "Kaggle"
        ]
    },
    {
        "id": "1c0quez",
        "datetime": 1712769122.0,
        "flair": "Career Discussion",
        "title": "What is a reasonable salary to ask for if you have a master's in data science/analytics and approx. two years of relevant experience?",
        "score": 64,
        "comment counts": 78,
        "content": "With the title, I will be finishing my master's in DS this fall, and I've worked as a Data Analyst for a year (doing high level DS projects) and as a lead Clinical Data Manager for over a year before that. What salary should/could I ask for in a reasonably HCOL city for a Data Scientist position? I have a bit of imposter syndrome, and I want to make sure I don't sell myself short and ask for too little.",
        "comments": [
            "Depends on location",
            "Really depends mate, looks like you might be in the six figure range.\n\n#1 rule for salary negotiations is to never give your number first. Instead of answering, ask what the salary range is for the position - and if the opportunity presents itself, then do this as early in the process as possible. \n\nThey might give you a solid number, or a range - just express agreement at first. If you want more, you can prepare how to present this depending on the medium - in person, phone, video call, email, etc.",
            "What kind of DS and at what type of company? Check levels.fyi or job posts in states where it's required to post the pay band",
            "You\u2019re not giving us much to work on here but I would do current salary times 1.2 at the minimum.",
            "I'm DS with 4-5 yoe also in healthcare space. I've been stuck at 150k past 4 years.  \nHCOL, mainly startups"
        ]
    },
    {
        "id": "1c19o5p",
        "datetime": 1712822936.0,
        "flair": "Analysis",
        "title": "Help to normalise 1NF to 2NF",
        "score": 2,
        "comment counts": 4,
        "content": " Hullo i need help anyone can explain to me how to remove partial dependency to normalise 1NF to 2NF. I still dont understand after reading every source i can find ",
        "comments": [
            "Easiest to explain by example really - mock up a dummy dataset and you'll probably get more useful explanations",
            "this [video helped me understanding it](https://www.youtube.com/watch?v=GFQaEYEc8_8)",
            "if table is already in 1NF then see how can you decompose it further so that divided tables have one primary key with full dependency instead of partial dependency",
            "a"
        ]
    },
    {
        "id": "1c1mmtg",
        "datetime": 1712860795.0,
        "flair": "AI",
        "title": "How to formally learn Gen AI? Kindly suggest.",
        "score": 0,
        "comment counts": 24,
        "content": "Hey guys! Can someone experienced in using Gen AI techniques or have learnt it by themselves let me know the best way to start learning it? It is kind of too vague for me whenever I start to learn it formally. I have decent skills in python, Classical ML techniques and DL (high level understanding)\n\nI am expecting some sort of plan/map to learn and get hands on with Gen AI wihout getting overwhelmed midway.\n\nThanks!",
        "comments": [
            "Hey!\n\nI am a \u201cGen AI Engineer\u201d :\u2019)  so i think i might be able to provide some guidance here. I\u2019ve only talked about text models here. So:\n\n- Learn about the attention mechanism. (No need to deep dive. Just understand what it does).\n\n- Transformers vs RNNs vs LSTM/GRU (Again a brief overview should suffice).\n\n- Different types of LLMs based on transformers. Encoder-Decoder, Decoder-Decoder, etc. Just skim through what types of architectures are popular LLMs such as GPT 3.5/4, Llama2, Mistral 7B or 8x7B based on.\n\n- Open Source vs Closed Source LLMs: Which ones are better at the moment? Different companies involved in the LLM rat race such as OpenAI, Google DeepMind, Mistral, Anthropic, etc. How to access these? For open source explore platforms such as Huggingface and Ollama. \n\n- Prompt Engineering: Get comfortable with writing prompts. I would suggest Andrew NGs short course on prompt engineering to understand methods such as few shot learning.\n\n- Learn about each of these: What are tokens? What are Vector Embeddings and what are some popular embedding model available today?Why do we need VectorDBs such as FAISS, Pinecone or ChromaDB etc? What does context length of an LLM mean? \n\n- What is Quantization of LLM weights? Difference between 4-bit, 8-bit, 16-bit LLMs. \n\n-  Retrieval Augmented Generation or RAG: Understand how training data used for LLMs might not have all the info you need, RAG allows you to perform question answering on your personal documents. At this point, you might want to explore frameworks such as Langchain anf LlamaIndex. These provide one stop solution for all GenAI related requirements of your application.\n\n- Finetuning LLMs: Why do we need to finetune LLMs? How is it different from RAG? How much GPU memory/VRAM would I need to finetune a small LLM such as Llama2? Techniques such as LoRA, QLoRA, PEFT, DPO etc. Finetuning an LLM would require some understanding of frameworks such as Pytorch or tensorflow. \n\n- Advanced features such as Agents, Tool use, Funtion calling, Multimodal LLMs, etc. \n\n- Access various opensource models such from ollama or huggingface. Also get familiarized with using OpenAI\u2019s API. \n\n- I would also suggest try to work with streamlit. It\u2019s a very convenient way of creating a frontend for your application.\n\nThese were some points that i thought you might find useful. If you have any further questions, please feel free to reach out.",
            "Ask ChatGPT",
            "Try some of the free AI courses by Google. Here are some relevant ones I found:\n\n1) Introduction to Generative AI (45 mins): Learn what Generative AI is, how it is used, and how it differs from traditional machine learning methods.\nhttps://www.cloudskillsboost.google/course_templates/536\n\n2) Introduction to Large Language Models (30 mins): Explore what large language models (LLM) are, the use cases where they can be utilized, and how you can use prompt tuning to enhance LLM performance.\nhttps://www.cloudskillsboost.google/course_templates/539\n\n3) Encoder-Decoder Architecture (8 hours): Learn about the encoder-decoder architecture, a critical component of machine learning for sequence-to-sequence tasks.\nhttps://www.cloudskillsboost.google/course_templates/543\n\n4) Transformer Models and BERT Model (8 hours): Get a comprehensive introduction to the Transformer architecture and the Bidirectional Encoder Representations from the Transformers (BERT) model.\nhttps://www.cloudskillsboost.google/course_templates/538",
            "Just pick a starting point and start running. It's a rabbit hole tbh.",
            "I can suggest the \"LLM University\" by Cohere. Just searching in their website, there are several modules about LLM (starting from basic NLP concepts to more advanced topics)."
        ]
    },
    {
        "id": "1c1i2hf",
        "datetime": 1712849620.0,
        "flair": "Career Discussion",
        "title": "International remote work in the EU",
        "score": 0,
        "comment counts": 9,
        "content": "Hello there. I'm in the early stages of my career\u2014half a year of experience as a junior analyst and starting my master's in data science this year.\n\nI wanted to ask about remote international employment in the EU (e.g. working in Germany remotely from Hungary) or\u2014if possible\u2014in the US. I would love to hear from someone with experience with such jobs; not necessarily in data, just in IT in general.\n\nMy primary motivation is money, but I also just don't really like the culture/work culture in my country, so I would like to apply for jobs like that in a year or two.\n\nIs it a very rare/difficult thing to do? How would I need to adjust my approach? What would I need to focus on in relation to my experience, education, and resume? Which fields and types of companies should I be aiming for? Is a fully remote job of that nature even possible, or is hybrid with weekly plane trips the best you can get? I would rather avoid freelance gigs, at least for the foreseeable future.\n\nIs it at all realistic for me to be thinking about it this early? Anything else I should consider? Thanks a lot in advance.",
        "comments": [
            "You actually can do it, but you should expect that a lot of companies (if not all) would consider your local job market when discussing the wage. Here i know some US-based companies who seriousluy offer something like 1700Eur/month for C++ developers, saying \"Hey, it's 2.5 times local average wage, you should be happy about it!\"\n\nAlso, taxation could be an issue too.",
            "Wondering if you work for an U.S. and receive the U.S. wage, do you still need to pay EU income tax rate? That's gonna be damn high",
            "Working for a company in Country X while actually living in Country Y is very uncommon, outside of going freelance or signing up with a 3rd party contracting agency.  There's a handful of companies that hire globally remote that would be okay with it but given that you have basically no experience, those companies are probably out of your league.  If you're in the EU why not just move to a different EU country and get a job there?",
            "My company has offices all over and hires data/tech employees in the US, multiple European countries, and India. Most of us work remotely but the pay is competitive for the local market you live in. So the range for US employees is higher than the ranges in other countries for the same type of role.",
            "Ive seen this setup before on a job posting:\n\n- Company is registered in many countries > You can only work full time in those countries.\n\n- Company also allows remote work in other countries they dont operate in, but only if the country offers some sort of freelancer/nomad visa for less than 1 year or 6 months. Also, you would need to prove to the country that you earn significantly higher.\n\nOR\n\n- Work anywhere but you will be a freelnce, contractor, or some sort of 3rd party in a legal sense. Essentially, in the eyes of your government, you will be a business.\n\n\nRe: US, you will need a working visa or a partner/spouse/parents\n\nRe: wage, companies typically pay you according to your location. Like, they have pay zones. If you notice a job posting for US companies, for example, they will mention that pay is x amount in Zone 1: NY, CO, CA while pay is y amount in Zone 2: TX, WA.... so on and so forth. \n\nRe: tax, it depends on the countries. Most likely, you will pay tax wherever you are legally cinsidered as a tax resident.\n\nEdit: they dont offer these to entry-level tho. Maybe look at Apprenticeship in EU."
        ]
    },
    {
        "id": "1c0uyvo",
        "datetime": 1712779218.0,
        "flair": "Discussion",
        "title": "The best place to network as a Data Scientist",
        "score": 8,
        "comment counts": 22,
        "content": "Hi all,  \nIn a week, I'll be flying to the States. I'm looking for nice places to network as a senior DS with over 10 yrs of experience in consulting, management but mostly in building AI products (I tend to favour Python over meetings). I'll be visiting Washington and New York and have 2 weeks time. Things on my list:\n\n- a DS conference (in Washington on AI in finance; will not post the link as this may seem like a hidden commercial, which it is not),  \n- several meetups in NY as it seems to be the place with the most vacancies,  \n- I want to go to workspots/techhubs/incubators to network and I also have to do some work. Any recommendations for places that actually stimulate networking (not WeWork and the likes) are highly appreciated!  \n- I actually got the advice to go to a gym where a lot of rich tech people go called equinox. Not sure how serious I should take that. But I like weird stuff like that as it might also lead to some surprise connections.  \n- Any good hostels where a lot of nomads/tech people live would also be nice.\n\nLet me know if this resonates with you and/or if you have any feedback.\n\nMuch appreciated!\n\nhttps://preview.redd.it/onsoekfi9ttc1.png?width=3362&format=png&auto=webp&s=5f4149437fcb0553255fdd081ffaed3bcd7e7692\n\n",
        "comments": [
            "Conferences have better ROI, followed by lunch invitations to companies that you are curious about (i.e. if you have friends that work there), and then meetups.",
            "SF and the Bay Area seem to be the places where all the AI/ML/DS action is these days. Swing by if you get the chance. Ultimately, the concentration of DS-related everything is lackluster compared to SF. All the heavy hitters in the industry are concentrated in seven square blocks near Hayes Valley.",
            "I\u2019ve had the same issue!\u00a0",
            "co-working spaces",
            "I'm supposed to be doing a lot of networking.  You have a good list.  \n\nI've done a meetup. \nThat gym is legendary. \n\nI'm thinking a career fair. There's one Friday. Please help make me go."
        ]
    },
    {
        "id": "1c0bof0",
        "datetime": 1712719321.0,
        "flair": "Discussion",
        "title": "why is all dev tool innovation in the AI/ML space focused on the least time consuming stuff?",
        "score": 118,
        "comment counts": 76,
        "content": "Every DS, DE, and MLE I speak to (including myself) spends a majority of their time focused on data prep. Even though this is taking up 70-90% of everyone's time all the investment and innovation focus seems to be on training, fine-tuning, and on-demand inference. Why? Is data prep not sexy?\n\nIt is imperative that high quality samples are being used for training, shit data is going to create a shit model. I feel like there should be more focus on making feature selection, data cleansing, and preprocessing less cumbersome, any thoughts?",
        "comments": [
            "This is such an important point that I'm working on at work and on my personal projects.\n\nThe biggest reason is that most data analysts/scientists want final clean data to work with, and the considerations of exploring better sources isn't something that interests many people. Plus, it is usually out of the scope of many people.\n\nHigh-quality data is difficult to obtain. Id love to hear other people's comments on this matter.",
            "The data is very specific to each company, so it\u2019s hard to develop useful tools in this space. I\u2019ve seen my fair share of things that try to automate data quality checks and feature engineering, and none of them impressed me. Lots of basic aggregations that any team can do in a week without having to rely on external things.",
            "Is this not exactly what Data Engineering is?",
            "The problem for me is the risk aversion of corporate. Projects are always conceptualized and delivered as MVP\u2019s and then Iterated upon instead of the infrastructure forward approach we need to not waste time cleaning, so I\u2019m always building MVP\u2019s out of others groups MVP\u2019s instead of clean and documented data sources maintained by expert data owners.",
            "I went through this exercise of trying to build an internal tool for work to handle our data prep issues. It was hard to come up with anything that could be reusable and reliable. Too many quirks and random issues. This was just within 1 small company. if you scale that use case to an entire industry, i dont see it working out."
        ]
    },
    {
        "id": "1c0rhdm",
        "datetime": 1712770731.0,
        "flair": "Discussion",
        "title": "A Tale of Two Cultures: Integrating Data Science and MLOps to Build Successful ML Products",
        "score": 5,
        "comment counts": 5,
        "content": " When the excitement about data science became widespread about 10 years ago, this spurred a lot of proof-of-concept ideas. However, most of these stayed confined in Jupyter notebooks and never made it into production. There are multiple reasons why it has been a lot harder than initially expected to productionize ML models, but the one I want to focus on in this blog post is one that has not been explored in as much depth. In order to create business value, we have to marry two very different approaches: The ML lifecycle starts out on the exploratory data science side, but we eventually have to transition towards an engineering-driven approach in order to achieve the quality attributes such as availability, reliability, scalability, and security typically expected of production systems. Thus, what it takes to do good work in data science is fundamentally opposed to what it takes to do good work in MLOps, giving rise to different best practices, skill sets, and even mentalities (ways of thinking about problems) on each side. As a result, a central challenge for creating successful ML products is to find a good process for making these two different cultures work well together. \n\nThis is very detailed article by  *Thomas Loeber, Senior Machine Learning Engineer at*  Logic20/20, Inc. \n\nSource here: [https://opendatascience.com/a-tale-of-two-cultures-integrating-data-science-and-mlops-to-build-successful-ml-products/](https://opendatascience.com/a-tale-of-two-cultures-integrating-data-science-and-mlops-to-build-successful-ml-products/)",
        "comments": [
            "What is this, a commercial? It feels like someone wrote a poorly done Python script: \"This is very detailed article by\u00a0*Thomas Loeber, Senior Machine Learning Engineer at*\u00a0Logic20/20, Inc.\"\n\nIn case it is not a bot: I don't think I agree with the message of the article. You did not ask for feedback so let me know if you are, because I do think it is a very interesting discussion.",
            "various factors",
            "wow",
            "WOW"
        ]
    },
    {
        "id": "1c0exqh",
        "datetime": 1712730610.0,
        "flair": "Discussion",
        "title": "Time series train test split",
        "score": 10,
        "comment counts": 14,
        "content": "Hi all,\nSuppose I have 6 months of data at weekly level (24 data points) and I want to forecast for the next two weeks.\n\nIf Im doing a train test split keeping let\u2019s say 4 data points for validation. With this setting suppose i get a good arima or ets model.\n\nMy question is, at the time of forecasting, I\u2019m actually predicting for the next 6 weeks then (4 in validation, 2 in prediction). Time series models become less reliable the further into the future you predict. \n\nIn such a scenario, is train test split for time series something we should be trying out? \n\nPS: the time horizon and data are just to explain my thoughts. ",
        "comments": [
            "Reading your question, I want to say: overfitting is a real risk. If you train your model on your test set and use your validation set for tweaking, you will be effectively predicting for two weeks ahead (you would have 3 evaluation periods using Rolling or Expanding Window Validation: week 18-19, 19-20 and 20-21) and you should try to limit yourself from 'cheating'. Not including your test data is really vital in this stage.\n\nIf you fear that you won't be able to predict your test set to the upmost degree, please realise that this is valuable information in itself. Communicate your fears to business and manage stakeholders based on your data. If you are desperate to improve performance, strategies would be to include more variables into this model as well. But this typically requires a lot of data engineering.\n\nFinally, **retrain the model on the full dataset (so you know you have the best model going forward)**. Calculate how much data you need to tell if the model performs well or worse (power) to some baseline. In the coming weeks, you can evaluate using that metric. Keep retraining :-)",
            "When doing time-series analysis, you can't rely on model training setup alone. You must incorporate domain knowledge.\n\nThe first question I would ask here is what is the hypothesized seasonality? If this data set you are modeling has a strong annual seasonal component (e.g. patterns associated with holidays, weather, events, etc.), then you are absolutely wasting your time thinking about validation, test, or forecast, period. Your dataset are extremely insufficient to make future predictions. In fact, just use a simple linear model out-of-the-box and its guesses would be as good as any.\n\nAssuming seasonality is not an issue, then the second question is what is the hypothesized AR lag. In other words, how long do I imagine the influence of an observation this week to last in the future? Similar to the seasonality problem, if the hypothesis here is that a current observation may influence 12 or more observations in the future, then drop everything you are doing. It's useless... A linear model or a moving average are as good as any..\n\nAssuming lag is also not a problem, then I would ignore doing the test split and keep it at first to train and validation since 24 points aren't too much to play with. In general, the validation length shouldn't be far off the forecast length you are interested in. I would use the validation to choose between multiple and parameters, then once model/params are chosen, I would re-attach the validation to training and train on the entire dataset.",
            "Train/valid/test split is to evaluate your model when building it. \n\nSo IMO after you use the valid set to hyperparams tuning and use the test set to estimate model performance, you should retrain the model with all the data before predicting 2 weeks into the future",
            "If your forecast horizon is two weeks, best is to test on only two weeks.\n\nFirst train on the 20 first data points, and test on the 2 next.\nThen train on the 21 first points, and test on the two next.\nThen train on 22 points and test to the 2 last. \nThis should allow you to get a good idea of the performance for your true forecast that uses all 24 points to predict the 2 next weeks.\n\nFor instance see https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split",
            "For the way your are making the question i think you are trying to make a model reliable over the long run even when new data comes and theorically the older ones should be less relevant. I tried to do something similar in the past. The answer i think it depends from what are you trying to estimate. which dynamics rely on the process that generate data?\n\nI do not know the context, but certainly is possible that train test split can be used. But with some conditions. I sincerely recommend you an adequate cross validation, so not the \"k-fold\" (that does not opportunely reply the estimation process) but the \"sliding window cross validation\". Choose the version that best suit your need.\n\nTo implement it, i suggest the library sktime in python.\n\nFurthermore, in order to make the older estimates less important you can see for some weighting approcah on errors or you simply keep only the latest timespan for training.\n\nRemember always to not insert in the model features that has information from the future or you will have overfitting.\n\n  \nHope that helps."
        ]
    },
    {
        "id": "1c05xju",
        "datetime": 1712703299.0,
        "flair": "Career Discussion",
        "title": "Is anyone familiar with the state of the academic job market for data science/ML/statistics?",
        "score": 21,
        "comment counts": 16,
        "content": "I frequently peruse [r/AskAcademia](https://www.reddit.com/r/AskAcademia/) and they always talk about how getting a tenure-track job these days is a pipe dream. They frequently cite some statistic (not sure where it's from) that 2% of graduating PhD's get a tenure-track job. They also say that even when filtered to just STEM fields, this figure is somewhere around 9-10%.\n\nI'm in the early stages of a PhD in statistics with research focus in ML at a fairly reputable program. My professors have all told me that getting a tenure-track academic job should be very doable, since I have no restrictions as to where I can/can't live and have a good advisor who I'm doing productive research with. They always say that because so many students in the field take higher-paying jobs in industry, there's a little bit less competition for those who want to get into academia. Which does make sense. But then I see the doomsday advice in [r/AskAcademia](https://www.reddit.com/r/AskAcademia/), and wonder if my professors are out of touch and/or not being honest with me about the state of the academic job market.\n\nIf anyone has recently been on the academic job market in data science/ML/statistics/etc., I'd love to know what your experience was like.",
        "comments": [
            "Disclaimer: STEM background but pivoted to applied ML during post doc. Also a private institution so YMMV\n\nI was trying to get tenure track maybe 2-3 years ago. All I wanted to do was teach and for a long time I set my sights on that. \n\nAfter my post doc I was able to get an assistant prof job through a connection but it turned out to be horrible. I was discriminated for being younger-ish compared to all the other profs, lots of boomer-esque politics. I didn\u2019t get any opportunities to teach any good classes, just intros to whatever since no one else wanted to teach it.\n\nWhich all would have been fine if I felt like I was getting paid enough but I ended up having to get a second job to make ends meet \ud83d\udc80 which I always had during grad school and post doc but I thought I could stop once I became a professor. I asked around to see how I could get a raise and tldr was I couldn\u2019t unless I got consistent grant funding (which is nearly impossible unless you suck on a teat of a famous senior professor, it\u2019s all about who you know in small fields)\n\nAnyways life got better after I switched to industry. I miss teaching a lot but I am a lot happier, healthier, and more well off financially. If academia could compete with industry salaries I would think about coming back but IMO the institution as a whole needs a reformation. If you happen to have generational wealth and are comfortable with not making a lot of money then academia isn\u2019t a terrible idea.\n\nEdit: I also remember reading somewhere a long time ago that universities were looking to do away with tenure track positions due to the lack of accountability (read: potential for abuse) that tenured professors had. But I haven\u2019t kept up with that",
            "We hired recently for someone with your profile at a r2/r3 that would have started you at 140k in a business school. A&S salaries at our institution are significantly lower if you stick with math & stats departments. There are a dearth of good candidates from what I can see.\n\nDo you have any applied experience beyond academia? Are you willing to work beyond an A&S department? Can you teach across tools and programming languages? Have you got significant undergrad teaching experience already? All will matter somewhat in institutions like mine.\n\nEdit: that said with FAFSA screw up and demographic cliff it\u2019s not looking good for hiring over the short term at my institution.",
            "I think you will have a good chance of finding a TT job if you don\u2019t have restrictions on location. Also, ask to the graduating class of this semester. They all should have had offers by end of April. (A mathematician trying to get out of academia.)",
            "It's hard to know how realistic the profs are. How many of their former students are profs? If you really want to be a professor, then certainly go for it, but understand there is a significant chance that you will fail, so have you backup plan well thought through",
            "What journals are you targeting?"
        ]
    },
    {
        "id": "1c00y65",
        "datetime": 1712691171.0,
        "flair": "Career Discussion",
        "title": "Has anyone taken the Master of Applied Data Science from the University of Michigan on Coursera?",
        "score": 32,
        "comment counts": 28,
        "content": "What kind of things can I do to prepare for it? Would you recommend it to someone wanting to enter the data science field? Any advice helps, thanks!",
        "comments": [
            "I believe a lot of it comes from these free courses: https://www.py4e.com/\n\nHe's one of the professors in that program, or at least they use his courses. There's also a Django, webapps, and sql course in that series.",
            "Created a throw away as to not dox my main, but I'm currently about half way through the program.\n\nTo prepare for it, brush up on Python. The first couple classes will hold your hand a bit, but you've still got to be somewhere between a beginner and intermediate Python user or you'll spend hours on assignments. I'd also recommend brushing up on statistics, calculus, and linear algebra. The program has just two classes that dive deep into teaching you the math. The rest assume that you understand enough to at least apply the concepts. For example, the class on data mining will dive into matrix decomposition and expect you to already be familiar with the linear algebra behind it. You'll ultimately do the analysis with Python, so you don't need to be an expert on the math, but you'll need to know the how and why behind the steps you take.\n\nThe cons of the program. The program is quite expensive, even if you're in-state. I think there are other programs that can give you a degree and the same knowledge for a fraction of the price. The first three or four classes are pretty much a repeat of what's on Coursera. They're just auto graded and everything. For the price you pay, it feels like a rip-off. Despite what other commenters have said though, that changes after the first few classes. The content goes far beyond what is available on Coursera. Assignments are not just auto-graded and require you to write papers/reports in addition to providing your code and output files. The program includes two milestone projects and a capstone project that give you a lot of flexibility and will give you nice projects to include with your resume. My last gripe with the program is it is entirely run through Jupyter Notebooks. It leaves a big blindspot that you will need to learn independently on proper IDE's, CI/CD, and deploying apps/models into production.\n\nThe good of the program, it's setup entirely in an asymmetric learning environment, so there's a lot of flexibility for those working. I work full-time while pursuing the degree. There is easy access to the professors. They're all responsive to slack and emails. They hold regular office hours that can be attended virtually. As I'm local, I've had no trouble meeting in person with professors as well. Some of the content in the classes is older, but they update regularly as the field and best practices change. They do a pretty good job of promoting networking amongst the cohort for those who want to network. You also get access to the UMich careers page, network, and brand. The name and network have a lot of value.\n\nFrom a career perspective, my background is accounting and finance and I've spent ten years in large financial institutions in compliance/audit/risk management. My role was data analyst work, with Excel and VBA scripting. I self-learned Python and had begun using Python scripting to automate workflows and build dashboards. I was wanting to continue down that path and become an ML ops engineer. I think the program, when I'm done, will have more than given me the requisite skills for that, outside of the gaps noted in the cons section. I haven't changed jobs externally since starting the program, but I do apply regularly to keep my interview skills sharp, and see what's out there. I have not had trouble landing data science/engineer interviews/offers with my background and the degree partially completed at blue blood though non-FAANG companies. I have not landed a FAANG interview yet. In my current role, admittedly surrounded by non-data scientists, I've been able to apply what I've learned to update/improve processes. My bosses act like I'm learning magic. It gets me face time with executives and I got a promotion and 30% raise a couple months ago.\n\nIn summary, yes it will give you the skills and it comes with the Michigan brand, which carries a lot of weight. However, if you're paying for it, I'd think long and hard about the value proposition of the program vs some other well regarded universities like Georgia Tech's program. I'm in-state and my employer pays the tuition, so for me I haven't regretted my decision to go with U of M and this program.",
            "I\u2019ve looked at the program in the past few months and I think there are better options to be honest. The school being on your resume is obviously a big plus but out of state tuition is almost 50k. I live in-state and couldn\u2019t justify paying 36k for it. The degree is also from the School of Information and not a computer science or engineering department. For preparing I believe all you need to know is basic Python and basic statistics. There is some sort of coding proficiency test you have to take in order to be admitted but if you take the Python 3 specialization on coursera from UofM, it gets waived. I believe they teach you SQL, and the more advanced math and stats are sprinkled in the curriculum. There are a couple GitHub repos out there that have a review of all the courses. Personally I have been looking at CU-Boulder\u2019s options on Coursera. Obviously it\u2019s not the same brand as Michigan but it\u2019s still a good school and much cheaper I think it\u2019s 15,700 ish. If money isn\u2019t a problem then I guess it\u2019s a pretty good choice.",
            "I'm currently in the Colorado, Boulder MS program. I brushed up on calc to get prepared; that really helped.",
            "I have been looking into this as well. What attracts me to this one over others is that it\u2019s project based as opposed to test based so they emphasize that you leave with a portfolio that\u2019s actively worked on as part of the curriculum. I\u2019m also located in a place where Michigan is very highly regarded - I have thought about the fact that the program is within the school of information has opposed to something more CS or Statistic related but I suppose the only thing that will appear on my resume is UMich, the degree, and the date so I\u2019m not too worried about that. \n\nInterested to hear others\u2019 ideas though."
        ]
    },
    {
        "id": "1c0btox",
        "datetime": 1712719784.0,
        "flair": "Career Discussion",
        "title": "(Deep Learning vs Data scientist) Need some advice",
        "score": 7,
        "comment counts": 21,
        "content": "I'm an undergrad student (mechanical engg) learning data science  through online courses and resources, I need some advice for my career.   \n\nI'm able to see 2 types of job postings, one is a traditional data  science role, others are also data science roles but requiring deep  learning skills.   \n\nI have seen, for data science roles requiring deep learning, usually prefer people from PhDs or higher educations.   \n\nSo my question is should I even target or try to get into these  roles requiring deep learning skills because I doubt if I can compete  with PhD students for such roles.   \n\nOr should I stick with traditional data science and data analyst roles.   ",
        "comments": [
            "I am a chemical engineer turned data analyst/data scientist and a product manager of sorts. \n\nWithout formal education and/or internships to back up your resume, it would be challenging to break into the field\n\nIf you are a forever learner like myself. Joining on an analyst role might be a little of an academic downgrade, but the soft skills you'll learn are invaluable. It's much easier for me now to connect between data scientists and business propositions.",
            "A PhD is for research - you\u2019ll be over qualified anywhere but itll help you Jump a few year in the corporate latter; not worth it\n\n\nIf you take maths and cs courses you\u2019ll be fine doing deep learning; take AI electives and do well. Go for entry level roles and in 3-5 years depending on your skills you can get deep learning roles. I have a degree in math and social science and did supervised machine learning; the jump to deep learning wasn\u2019t hard with 1 year of work experience",
            "Both paths are very difficult to get in! I just graduated with a Master in Stats(Took all my electives in ML and DL) and about 1.5 yrs full time Analysts exp, as well as an internship from FANNG. To be honest, I have no idea what's wrong with my background and resume, I hardly hear back from DS/MLE role, and even Data Analyst in today's market. However, my two cents is to get any data-related job first, and then transfer on the way.",
            "I'd go the data science route. Its the foundation and you want to build those skills first. There are actually more roles out there that need you to be solid in your basic skills and perhaps basic deep learning work (not gen AI) than the other way around. Get into a data role in a company first and then keep learning.",
            "Given that you are still an undergrad, it depends on what you want for the coming years, really.   \n  \nIf you are for instance excited about a Master and have a pet project that allows for a year long deep diving into whatever deep learning thing you like, I would say you are very likely to get a far better chance at landing a high payed job there than someone that applies for generic DS roles. There are no guarantees... but you have time. Also, these skills will also nicely transition into the more 'generic' roles. There is one big catch: you might be a bit saddened by the lack of neural nets you'll find in a typical company, despite what they say in their vacancies. But I feel like everybody that leaves academia goes through that crisis, so why not make it a real one and study what you love :P.  \n  \nIf you don't really like to dive deep, or don't have the skills/patience, there's no shame in going for a bit more generic impact and going for a 'safer' career. I see a lot of people saying that it is a tough market, which is true in comparison to earlier times, but I also think that there are areas which are worse off. I sound like an old man but... follow your interests. I would not necessarily say the same to a political science student with a passion for... the roots of ancient marxism, but you are in a field where passion can take you far (also in salary terms) without a lot of risks if you are willing to work."
        ]
    },
    {
        "id": "1c15fax",
        "datetime": 1712807406.0,
        "flair": "Career Discussion",
        "title": "How much does your undergrad degree matter?",
        "score": 0,
        "comment counts": 26,
        "content": "For multitude personal reasons, I am a \"late bloomer\", and wasted most of my potential in my twenties. Got paid more then, but it was terrible on my mental health, and I hated every minute of it.  \nI got a business degree from a no-name school a decade ago. I am pretty old for starting out.   \nI went back to school while working in unrelated field, which got me into a reasonably well-respected master's program in quant econ, and landed that first DS job two years ago.  I am now a senior DS, but paid way under my demonstrated value. (I turned down a 200K job in my old field to chase DS, and now make like 110K... first kid on the way, and I am having a hard time justifying this to myself)\n\nBig companies don't take a second look at me. My target field is radio silence. Not even rejection emails most of the time. My resume is super well polished. I do get interviews for smaller companies in my current niche fairly easily.\n\n I have funded research in biomedical AI (independent). I have done some LLM work on the side for a startup, and also for a doctor's research. I started doing some hardware work too to pick up some C++, and expand my general knowledge.\n\nEven with warm introductions from people in my target area (biomed, pharma, medicine), I get no responses to my messages.\n\nAm I too old? (Early thirties)\n\nIs my bullshit B.S. holding me back?\n\nAm I still too green?  \n\n\nI just want to work applying AI to improve patient outcomes, and I am not finding any way in.  \n\n\nJust looking for some perspective.   \n\n\nThanks!",
        "comments": [
            "For an insecure person you sure earn a lot of money",
            "You ruined it, completely. I mean, you should\u2019ve made better decisions. Look where it got you. Only six figures and a graduate degree and a string of enviable titles. What are you even doing with your life, slacker.\n\nYou might as well quit now and go flip some burgers and get your waving arm in shape to be a Walmart greeter cause that\u2019s all you\u2019re really worth based on your meager efforts.\n\nThis fucking website here and you fucking people\u2026",
            "You're acting like six figures is an average salary or something",
            "You sound like a victim of the market. I think if I saw your business BS with Econ MS and data science experience, you would at least be a noticeable candidate. \n\nDepends on what the employer is looking for though. If they want someone more ML focused, they might go for math and CS candidates. If they want a generalist, they might mix all sorts of candidates for interviews. If they have econometric or statistical problems, they might go for stats or Econ candidates. If they want a Storytelling person, they might go for a business candidate. Perhaps more positions are leaning towards the technical candidates. There are at least some positions calling for the Econ/business skills, a matter of finding them.",
            "Here's some perspective: the median household income in the US is approximately $75,000. I understand the insecurities, but regardless of the job market being ass, you are doing great compared to most people and you should take a little time to breathe and remind yourself of that.\n\nKeep working hard but don't forget to celebrate how far you have already come :)"
        ]
    },
    {
        "id": "1c02l55",
        "datetime": 1712695172.0,
        "flair": "Career Discussion",
        "title": "How much does degree title matter vs skills and classes taken for an MS?",
        "score": 18,
        "comment counts": 21,
        "content": "I'm (an American) in a biostatistics MS program, but I have the opportunity to finish early by a summer and a fall semester (7 months) with my departments online \"applied statistics and data science\" MS. The research that I've been apart of has mostly been data cleaning and building an R package to submit to CRAN. I've basically finished the core classes for our PhD, but I'm more interested in math heavy software development than original research. Tech skills: Python, R, SAS, and I'm rusty on PHP, JS/React, SQL, which I used years ago for past projects.\n\nThe program isn't placing me in debt. I have research funding, my Post 9-11 GI Bill covers housing costs, and I'm still a reservist in the military, which offers a 401k (TSP) and heavily discounted insurance\n\nEdit: I'm pretty ignorant of how the titles would be viewed when screened by HR/software/and particular industries. Eg I've heard it's easier to land a traditional biostatistics role with a degree in (bio)statistics vs data science",
        "comments": [
            "Maybe I'm not quite following - is there a disadvantage to having your degree in \"applied statistics and data science\" as opposed to \"biostatistics\"? If not, why not finish up 7 months early and take the applied stats/DS degree?\n\nI do personally believe that your skills and coursework are more important than the title of your degree, especially when we're talking about two closely related fields like statistics and data science.",
            "Answering the question from the title, degree title matters not. I have an MS in Wildlife and Fisheries Sciences and have worked as a Sr. DS in financial and management consulting, healthcare, and most recently in FAANG. Skills matter; not degree title. The latter might get your resume noticed in ATS, but if you can't do shit, you won't get a job.",
            "I\u2019m also confused, I don\u2019t see any reason not to gain more education for free",
            "Stay with the Biostatistics track.  It will be much easier to land roles in pharma/biotech industry if you ever have interest in going that route.\n\nBiostats is a very well regarded degree in general for DS/applied science roles in industry.  It will be assumed you have significant training in mathematical statistics, experimentation and causal inference, which will always be in high demand.  If you have the software development skills you\u2019ll be viewed as a unicorn by hiring managers.",
            "I'm a hiring manager, so a sample of one. :-) I care more about a certain level of training and experience than degree titles. That experience can come from a graduate research thesis, working as a researcher at the university, an internship, etc.\u00a0\n\n\nI personally would stay and get the extra coursework if you can afford it. You will already be well positioned with biostat, but this gives you a little extra time to have said you worked on your research. This time.is also great to sharpen your programming skills, work on some soft skills, etc.\u00a0"
        ]
    },
    {
        "id": "1bzzal7",
        "datetime": 1712687114.0,
        "flair": "ML",
        "title": "What kind of challenges are remaining in machine learning??",
        "score": 10,
        "comment counts": 25,
        "content": "To rephrase, I mean to ask that there are pretrained models for all the tasks like Computer Vision and Natural Language processing. With the advent of Generative AI I feel like most of the automation tasks have been solved. What other innovative uses cases can you guys think of? \n\nMaybe some help with some product combining these ML models?",
        "comments": [
            "What has Generative AI \u201csolved\u201d? The outputs from all these models are generally \u201cclose\u201d to something reasonable for their use case but often still flawed. These models are much more heavy on the \u201cGenerative\u201d than \u201cIntelligence\u201d.",
            "I don't think the solutions are the greatest, tho. Take medical claims data, for instance. There is still a lot of fraud that occurs.\n\nI was reading a few weeks ago that nearly 2 billion dollars worth of claims are fraudulent in the US. There is still a lot of systems that need to be built and fixed.",
            "Nothing is even close to being solved. Plus the tiny fraction that is \u201csolved\u201d we\u2019re still in the dark for interpretability. Any context in industry where one wants to implement theory or canonical tools requires an immense amount of finicky engineering. Everyone is still stuck on the most basic shit around connectivity and so on. And not a single inch has been gained in AGI.  Not one discovery. ML is in its infancy. If you want to know where to focus, learn pure math.",
            "I saw a thing a while ago where people are trying to use unsupervised language models to decipher whale language from sonar recordings. If I had a choice in what to work on it would probably be something like that, but I think the biggest hurdle there is getting hold of the data",
            "Hard to think of broad use models outside of comp vision, natural language processing, and audio recognition. Comp vision and audio recognition are two of the five senses lol.\n\nMaybe train a model to detect odors? Lol, not sure if there are sensors for that"
        ]
    },
    {
        "id": "1c07fg4",
        "datetime": 1712707191.0,
        "flair": "Career Discussion",
        "title": "Any leads on climate change specific/renewable or sustainable energy companies",
        "score": 1,
        "comment counts": 6,
        "content": "\nHey folks. I would want to apply my skillset in the domain of analytics/ data sciences in companies that are working in the field of climate change or other environment related domain. There is an unawareness about such companies. Can people comment here if you know of one? Thanks. \n\n#Climatechange #DataScience #Analytics",
        "comments": [
            "Here is a climate specific [job board](https://climatebase.org/jobs) that usually has data related jobs posted.",
            "To build a wind project, you need an energy production report. The industry realized that these reports were too rosy. NREL got involved, wrote a report of their own, and developed a Python project to do assessments of operational wind project. \n\n[Have a look. ](https://www.nrel.gov/wind/openoa.html)",
            "there was a company in India which was significantly working on this in collab with government but I haven't heard much since many months",
            "RemindMe!",
            "Try also 80,000 hours for some impactful jobs."
        ]
    },
    {
        "id": "1bz564f",
        "datetime": 1712600450.0,
        "flair": "AI",
        "title": "[Discussion] My boss asked me to give a presentation about - AI for data-science",
        "score": 93,
        "comment counts": 44,
        "content": "I'm a data-scientist at a small company (around 30 devs and 7 data-scientists, plus sales, marketing, management etc.). Our job is mainly classic tabular data-science stuff with a bit of geolocation data. Lots of statistics and some ML pipelines model training.\n\nAfter a little talk we had about using ChatGPT and Github Copilot my boss (the head of the data-science team) decided that in order to make sure that we are not missing useful tool and in order not to stay behind he wants me (as the one with a Ph.D. in the group I guess) to make a little research about what possibilities does AI tools bring to the data-science role and I should present my finding and insights in a month from now.\n\nFrom what I've seen in my field so far LLMs are way better at NLP tasks and when dealing with tabular data and plain statistics they tend to be less reliable to say the least. Still, on such a fast evolving area I might be missing something. Besides that, as I said, those gaps might get bridged sooner or later and so it feels like a good practice to stay updated even if the SOTA is still immature.\n\nSo - what is your take? What tools other than using ChatGPT and Copilot to generate python code should I look into? Are there any relevant talks, courses, notebooks, or projects that you would recommend? Additionally, if you have any hands-on project ideas that could help our team experience these tools firsthand, I'd love to hear them. \n\nAny idea, link, tip or resource will be helpful.  \nThanks :)",
        "comments": [
            "With such a broad subject I would recommend investigating the main abstracts and see what looks promising such as:\n\n1. Open source agents (e.g. SWE agent) [https://huyenchip.com/llama-police](https://huyenchip.com/llama-police)\n2. Frameworks (e.g. LangChain/DSPy) [https://spectrum.ieee.org/prompt-engineering-is-dead](https://spectrum.ieee.org/prompt-engineering-is-dead)\n3. Products (e.g. Claude Opus or even Devin)\n\nI'm not even junior level but looking at all these areas in my gap year has really made my views on this subject much more rock solid than just terse rhetoric that I've observed even some very smart/senior professionals have.\n\nEDIT: Some more resources:\n\n* [LangChain CEO on Agents](https://www.youtube.com/watch?v=pBBe1pk8hf4&ab_channel=SequoiaCapital)\n* [Overview of SWE-Agent](https://www.youtube.com/watch?v=9-JBHGlYEBI&t=10s&ab_channel=MatthewBerman)\n* [IBM, The most important AI trends in 2024](https://www.youtube.com/watch?v=sGZ6AlAnULc&ab_channel=IBMTechnology)\n* [DSPy from Stanford, does it live up to the hype](https://medium.com/emalpha/dspy-does-it-live-up-to-the-hype-6e56c2c6e7a0)\n\nTLDR; Even if you don't want to get into how foundation models could transform your pipelines, a solid understanding of the product space will have some overlap with those more technical patterns regardless.",
            "Beyond copilot, you can feed your entire project to an LLM with a big context and have it suggest refactors and write documentation. It can create tests and suggest design changes.\n\nIf you have lots of documents in the company, a RAG pipeline would be good. \n\nI think you\u2019ll do something more useful by structuring the use case and changes to the development process instead of looking for tools. At the end of the day it\u2019s all LLMs, prompt engineering and SFT.",
            "I\u2019d be careful not to go overboard on what A.I. can do, so it doesn\u2019t become the main go to \u201cperson \u201c for solutions later on.",
            "OP there\u2019s this course by Andrew Ng that\u2019s called AI for everyone or something\u00a0\n\nMy favorite takeaway is don\u2019t think of it as automating jobs. It\u2019s impossible. At least not now\u00a0\n\nRemember that a job is a series of tasks. AI can automate some tasks\u00a0\n\nDetermine a stakeholder, yourself included, then score the task if its feasible to be improved with the current tools today for the fastest, easy to build, high impact work.\u00a0\n\nLike marketing or PM work, you should see your DS work as a product. All products solve problems FOR a user. Making sure that it actually solves a user\u2019s problem is key for its adoption.\u00a0\n\nNo user -> undefined problem -> bad generic solution. You still need to push a solution for someone to adopt either way.\u00a0\n\nDon\u2019t get too hung up with the tools. They are useless if no one uses them. Or if you can\u2019t build a good one that actually solves anything.\u00a0",
            "Actually ChatGPT premoum has a built in python tool called \u201ecode interpreter\u201c. so it has all the librarys like pandas, matplotlib, etc\u2026\n\nThere is an awesome course on DataCamp on how to use ChatGPT as a tool for DS.\u00a0"
        ]
    },
    {
        "id": "1bzjlkf",
        "datetime": 1712637781.0,
        "flair": "Career Discussion",
        "title": "Help Deciding Between Two Graduate Schools",
        "score": 7,
        "comment counts": 66,
        "content": "Hey all, I have until this April 15th to decide between two graduate schools and I can't figure out which is best for a career in data science. I'd love to get some advice from some professional data scientists. The following are the two schools and programs:\n\n1. **Texas A&M's MSCS program**. 2 years long for a total cost of attendance  \\~60k.\n2. **North Carolina State's MS in Advanced Analytics program**. 10 months long for a total cost of attendance  \\~64k.\n\nHere are what i deem the pros and cons of each program:\n\n||Pros|Cons|\n|:-|:-|:-|\n|Texas A&M's MSCS|**Likely would get a research assistantship** as I am both a domestic student and have research experience. I estimate this would lower my total cost to \\~30k.|The **career path after graduation is not as clear**. Also I do not want to live in Texas upon graduation.|\n|North Carolina State's MSA|The MSA program is very well respected and all graduates are guaranteed a job. Last years class had a **median salary of $117,000** upon graduation (jobs typically are in NC. Huge alumni network consisting of data science professionals.|I will be taking out **$64,000 in loans** for 10 months of schooling.|\n\nAs an aspiring data scientist I'd appreciate it so much if you could let me know where you think I should go.",
        "comments": [
            "Texas seems to be a better choice as it gives you more flexibility.",
            "NC State Analytics pre-dates the data science boom by over a decade and has a great placement record.\n\nNorth Carolina has a lot of biostats in raleigh/durham and banking in Charlotte (Wells Fargo and Bank of America are head quartered there. Technically San Francisco is Wells Fargo's offical HQ, but Charlotte is their defacto HQ). Truist is also HQ there,  Ally's second biggest operations is there, Fifth Third has office there, Mitsubishi Group has office, regions and USAA also have offices. These are all fortune 500 banks and many are fortune 100).  You won't have any problem finding jobs.\n\nTexas A&M has dallas and austin and is generally considered a good school, and Texas of course has Austin, houston, dallas as job markets. There is a lot of banking in dallas, big tech in Austin.\n\nI personally don't think you can go too wrong with either, but given that Texas has other universities in its state to compete with and NC State mainly  has Duke, I'd probably take NC State if its me. Banking and Biostats are also not effected by lay offs the way tech is. Its more slowed hiring than anything else.  However, the upside potential straight out of grad school is probably a bit higher from texas A&M due to the big tech presence in Austin.",
            "If you're banking on an RA position at Texas I'd email the program coordinator and ask what proportion of the MSCS graduate students are able to secure those roles. At the university I went to it was much more common for labs to use that for their own PhD students and only will recruit from outside MSc programs if they have a skill gap that they can't train a PhD student for, or more funding than PhD students. Granted this wasn't a CS program, but both those situations were rare. The program should have stats on that though and should be able to provide that info. \n\nIgnore if that program recruits MSCS students directly into labs though, that's different",
            "I did the NC State IAA and graduated in 2021 despite job market worries. I got a job in December 2020 because companies come there specifically to recruit. They said \u201cwe know you don\u2019t know everything yet, but you will by the end of this program\u201d. That was their 5th year recruiting exclusively from the IAA.\n\nI don\u2019t know anything about Texas\u2019 program. But you would be taking out the same amount of loans over a two year period with the RA vs. a one year period at the IAA. So you would be taking a loss of six figure income for a year by going to Texas.",
            "There\u2019s one really big thing no one seems to be talking about. How can we compare a master\u2019s in computer science with an an analytics one? They\u2019re so different on so many things! \n\nThe main difference, I would say, is how analytics is better for data science specifically, but gives up a lot of versatility in terms of the roles you can get. If say, in 5 years you decide you want to give a different aspect of programming a shot, it will be a lot harder to land a job that uses it than if you had a CS program under your belt. \n\nObviously disregard that if you have an undergrad in CS or software engineering, or if you\u2019re dead set on data for life."
        ]
    },
    {
        "id": "1bzp8eg",
        "datetime": 1712660298.0,
        "flair": "Education",
        "title": "Syllabus for school",
        "score": 0,
        "comment counts": 18,
        "content": "I'm involved in developing a syllabus in data science for young people (aged around 16). It will be defined at three levels (let's call it levels 1, 2 and 3). I'm happy with the data science content but would like guidance about the statistical content.\n\nThe course will be short (40 hours) so there's not a great deal of time for statistics, given that the focus of the course is data science (tools, techniques, methods, processes, etc.). However, there is some time (5 hours?) for some stats at each level.\n\nAt this time my inclination is:\n\nLevel 1: simple descriptive statistics: mean, median, mode, max, min, range.\n\nLevel 2: Level 1 plus: percentiles, IQR.\n\nLevel 3: Level 2 plus: variance, standard deviation (z scores).\n\nI'm tempted to introduce probability because it's fundamental to data science. What do you think about that? Also correlation?\n\nI appreciate that this omits inferential statistics but given the time constraints I can't see how to fit that in. But I accept that linear regression would be nice at Level 3.",
        "comments": [
            "I think variance and standard deviation are important concepts, but given that the audience is 16-year olds, I might not spend too much time on z-scores. Some basic probability work might be more useful than diving into z-scores.",
            "Basics of hypothesis testing. Maybe just given this null hypothesis, we can use math (without derivation) to measure the likelihood of the data. It\u2019d be easy to slip into something deeper. But I found hypothesis testing to be the light bulb moment when I was starting.",
            "You might want to include something on graphical representation of data: Bar charts, histograms, and pie charts.",
            "Distributions, even if at a very basic level. I've never used IQR professionally but distributions are critical in almost every data set I look at.",
            "Correlations maybe?"
        ]
    },
    {
        "id": "1c02doz",
        "datetime": 1712694647.0,
        "flair": "Discussion",
        "title": "Desperately need feedback to land a job in ML/NLP asap",
        "score": 0,
        "comment counts": 16,
        "content": "People of datascience, I'm looking for some feedback on my  resume. I'm in the second year of my university (online university)  while I'm full-time self studying Machine Learning with a heavy focus on  natural language processing. My goal is to land a job as soon as  possible, ideally as an NLP engineer while I continue my studies and  then I plan on getting a masters degree in AI to get into academic  research.   \n\nI know that getting a job now might not be the best way to break  into academic research, rather focusing on my degree is a much better  way, however, given my life situation in general especially financial,  it's very important for me to get a full-time job and make money while I  spend next 3 years completing my bachelors degree along side.   \n\nThese next 3 years I have to settle my kitchen income first before I  can move towards academic research. In these 3 years, industry  experience might help me with immigration to move to a foreign country  where I may get a better and more sophisticated job and enroll in  masters degree in some good university, or if not, then moving abroad  for studies is the second option (student visa).   \n\nI need advice on how I can plan a few months ahead so that I can  land a full-time job as soon as possible, since my current internship is  unpaid, and also because it's a startup which is not generating much  revenue as of now, I don't really think they can afford to hire me  full-time. I'm 22 living in Kuwait (middle-east), originally from  Pakistan.   \n\nLink to my resume: [https://www.dropbox.com/scl/fi/mjib8hi6hmgvxzyk0r1sm/Resume-2.pdf?rlkey=uphpp6qralzknqqapmobmw9i2&dl=0](https://www.dropbox.com/scl/fi/mjib8hi6hmgvxzyk0r1sm/Resume-2.pdf?rlkey=uphpp6qralzknqqapmobmw9i2&dl=0)",
        "comments": [
            "I'm not a data science or ml expert but I have been in the job market and can share what my views are. Frankly, I'm not able to gauge how much your technical skills are. You have listed one internship experience and that is enough to get you a second internship but not a full time job. You have listed many skills at the bottom but I cannot tell at what level you are at each of them. Plus in a real company, you also need knowledge about version management, agile practices, cloud technologies, stakeholder management, presentation skills etc.\n\nIn your resume, I would remove the word certificate and make the names themselves as links and use the remaining space to give a one liner about the course's content.",
            "You might want to rethink your field of choice if you aim is to \u201cland a job asap\u201d while on the way to academic research. DS, ML, and NLP are specialized domains where you are competing against a plethora of qualified new graduates for entry level positions. Many of those new graduates already have advanced degrees. The field you\u2019re looking to get into isn\u2019t one that has \u201cstepping stone\u201d industry openings for would be researchers.",
            "If you want to do research get a PhD not a masters. A masters degree in AI (not sure what that means) is not the route. \n\nGet a job, that\u2019s great! But use all that experience to get into a financially supported PhD program if you really want to do research.",
            "Your bullet points need a bit of help. Here is what I would do:\n\n- Drove R&D efforts to advance a geopolitical analysis pipeline utilizing cutting-edge NLP and ML techniques, conceiving and implementing novel approaches to classification, information extraction, and model optimization. \n\n- Achieved a 26% gain in text classification accuracy by fine-tuning a BERT-based language model on specialized data, attaining 89% validation on political news articles using a customized hyperparameter scheme. \n\n- Build text analysis pipeline using the fine-tuned classifier, increasing processing speed 1400-fold from 0.274 to 1078 articles per second while preserving accuracy. This efficiently analyzes petabytes of geopolitical data driving the graph-based model.\n\n- Pioneered prompt engineering methods leveraging Constitutional AI techniques to refine a language model's understanding of geopolitical relationships, extracting 55 new nuanced qualitative indicators from diplomatic communiques with over 90% concordance to expert coders.\n\n- Developed rule-based extensions to an existing NER model identifying an additional 18 entity types critical to bilateral dynamics from texts in five languages, integrating billions of novel factoid relationships into the graph network. \n\n- Leveraged quantitative and qualitative insights generated through the optimized pipeline to iteratively update the predictive graph-based geopolitical analysis model, enhancing capabilities in dynamic tracking of international sentiment shifts and forecasting regional stability.",
            "we can see your full name in the dropbox link"
        ]
    },
    {
        "id": "1byvwwh",
        "datetime": 1712577122.0,
        "flair": "Education",
        "title": "Help in learning Linear programming for real time cases",
        "score": 22,
        "comment counts": 23,
        "content": "I got hired as a DS for the supply chain team and am pretty much a lone wolf here. I have zero knowledge of LPP and optimization and has somehow scraped by in the last couple of months.\n\nI have gone through many articles but they only use predefined equations. Most of my problems came from inability to convert the Excel solver to pulp format  equations.\n\nCan I please get some advice on learning about LP optimization?\n",
        "comments": [
            "H. Paul Williams: Model Building in Mathematical Programming  \nEven if you don't use MOSEK: [https://docs.mosek.com/modeling-cookbook/index.html](https://docs.mosek.com/modeling-cookbook/index.html)  \nEven if you don't use AIMMS: [https://download.aimms.com/aimms/download/manuals/AIMMS3OM\\_LinearProgrammingTricks.pdf](https://download.aimms.com/aimms/download/manuals/AIMMS3OM_LinearProgrammingTricks.pdf)\n\nPuLP is clumsy, but you can call it directly in Excel with: [https://solverstudio.org/](https://solverstudio.org/)  \nGurobipy or AMPL is better for general modeling and ChatGPT will get you started with description to model. \n\nExcel is great for small models and learning basic concepts. It's really hard to maintain and work with variables with more than 2 indices (e.g., truck-driver-route-day). No offense to r/datascience, but you're better off asking this from r/OperationsResearch",
            "A good introductory book on the subject is from Ronald Rardin Optimization in operations research for example\n\nI assume you know a decent amount of calculus, algebra, probabilities and statistics?",
            "You want a modeling in LP focused book, focus on what the equations mean to the problem you\u2019re trying to solve.\n\nLearn about sensitivity analysis and running lots versions of the data to see how the optimal point changes. Learn about shadow prices etc.",
            "I\u2019ve been in a similar boat and tried to teach myself from udemy or coursera courses. Instead of looking for lp articles look for supply chain optimization papers.",
            "I would look at Linear Programming: Foundations and Extensions by Robert J. Vanderbei for modern examples with code. Otherwise Gass's linear programming from 1958 is a bonafide classic."
        ]
    },
    {
        "id": "1bz20r9",
        "datetime": 1712593070.0,
        "flair": "Discussion",
        "title": "Three Practical Use Cases of Machine Learning and Digital Twins in Clinical Research and Care",
        "score": 5,
        "comment counts": 4,
        "content": "Hi all, thought my most recent Substack post would be of interest to those working in the healthcare/life sciences space. I talk about how we can use big data and machine learning to bring more personalized care through what\u2019s called \u201cdigital twins.\u201d Essentially, it uses historical data to look at the outcomes of people who share similar characteristics to you. Using Alzheimer\u2019s Disease as a motivating example, there\u2019s three use-cases for digital twins I discuss in my post:\n\n1. Reducing the size of randomized trial control arms through the prediction of treatment arm outcomes via their digital twins. The company with the most work on this space that I know of is [Unlearn.ai](https://www.unlearn.ai/). This would ideally save recruitment time and costs that scale per patient and per site.\n2. Using digital twins to calculate a prognostic score of disease progression and, in a trial, recruiting only those who are more likely to rapidly progress. In the literature, this is often called \u201cenrichment.\u201d When people progress at a faster rate, we can run the trial for less time while still having a good chance to observe a treatment effect if it\u2019s there. \n3. Using digital twins to help inform the delivery of precision medicine in routine care (e.g. at the doctor\u2019s office). Crucially, this system should be tested in randomized trials versus standard of care.\n\nIf any of these topics interest you, [check out the post here](https://open.substack.com/pub/mlinhealthcare/p/leveraging-machine-learning-using?r=7bxky&utm_campaign=post&utm_medium=web)! \n\nWhat promising use cases for digital twins and precision medicine have you found in your work? What other technology should we be using to improve clinical research and care? Would love to know in the comments below!",
        "comments": [
            "Following",
            "Wow cool!",
            "Following",
            "Thanks!"
        ]
    },
    {
        "id": "1bz2qxc",
        "datetime": 1712594717.0,
        "flair": "Career Discussion",
        "title": "Switching Domains",
        "score": 2,
        "comment counts": 6,
        "content": "Hey r/datascience! I am relatively new to the field and am loving my work so far, while my job title is not data scientist (Informatics engineer) I feel like I am exposed to a broad range of DS skills and knowledge (Neural Network training, ETL pipelines, AWS, Backend development , dev ops, and of course, analysis). The team I am is primarily concentrated on water recourses/ IOT, and while I love the actual data science work, I\u2019ve never cared much for water recourse engineering despite my civil engineering background having a heavy influence over me landing this job. Thinking about the rest of my career, I\u2019d like to learn as much as I can in my current role but want to work with different domain knowledge (healthcare, finance, etc). Just want to probe around and see if anyone else has switched it up! \n\ntl;dr not really interested in the domain I\u2019m in, what are my options in the future? ",
        "comments": [
            "Yes. My first job was in healthcare, next couple were supporting sales and marketing, now commercial real estate/private equity. The important thing is to learn how to learn quickly - pick up on the key concepts that drive value in the new domain and understand what sort of data is available related to those drivers. Hopefully the new job has someone who can be considered a domain expert to help guide you through the basics, but be careful not to rely on them too much in planning out a roadmap for your own work as they may not know what\u2019s possible from a quantitative perspective if they are not technical. \n\nOvertime you learn how some problems can be similar across domains which can help you hit the ground running.",
            "Options include getting another gig, even better if it\u2019s with a consulting group. You\u2019d be exposed to a lot of different industries in a short amount of time. \n\nI went from doing fisheries and environmental stats to financial and management consulting to healthcare to FAANG. Focus on upskilling and showing potential companies how quickly you learn.",
            "I think the best way to think about domain switching is to understand which concepts/skillsets/etc. are relevant to a broader area of application than just your current domain. Which things generalize.\n\nExample: you can work in pricing for B2B SaaS Healthcare products, and in theory that would limit you to a) B2B, b) SaaS, and c) healthcare. But it doesn't because the concepts that underpin pricing science (elasticity, cannibalization, customer lifecycle value, etc.) are common across basically every industry and product and audience that has to react to a price. So sure, the data may look different, the models may be slightly different, but if you have worked in one, you have a huge leg up over someone who has 0 pricing experience when approaching a new industry.\n\nThe same is true of a lot of broad concepts: risk, a/b testing, forecasting, network analysis, etc.\n\nSo the question to ask yourself is, in your current role, which skillsets generalize well.",
            "I am in a very similar connundrum, let me know what you decide and how it goes!",
            "Same here. Please share how it goes!"
        ]
    },
    {
        "id": "1byoga8",
        "datetime": 1712548884.0,
        "flair": null,
        "title": "Weekly Entering & Transitioning - Thread 08 Apr, 2024 - 15 Apr, 2024",
        "score": 6,
        "comment counts": 63,
        "content": " \n\nWelcome to this week's entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include:\n\n* Learning resources (e.g. books, tutorials, videos)\n* Traditional education (e.g. schools, degrees, electives)\n* Alternative education (e.g. online courses, bootcamps)\n* Job search questions (e.g. resumes, applying, career prospects)\n* Elementary questions (e.g. where to start, what next)\n\nWhile you wait for answers from the community, check out the [FAQ](https://www.reddit.com/r/datascience/wiki/frequently-asked-questions) and Resources pages on our wiki. You can also search for answers in [past weekly threads](https://www.reddit.com/r/datascience/search?q=weekly%20thread&restrict_sr=1&sort=new).",
        "comments": [
            "Hey guys, I just finished a DS internship where most of my work was basically data cleaning and prepping data for analysis. I received a FT offer and am returning and will definitely be put on modeling work. I come from a non traditional background so there\u2019s still a lot of holes for me to fill. I suspect they\u2019ll have me start it gradually and work my way up. What books or courses do you guys recommend to help me get ahead of the curve? I\u2019ll have about 4 months before I start if that matters. Anything is appreciated, thanks!",
            "# Are Azure certificates useful?\n\nIt seems that AWS is way more popular, but my uni offers training and free exams for Azure. Is it worth the time and effort? Will it help me land the\u00a0**first**\u00a0**coop**? I have not seen many coop/intern job posts that mention Azure.\n\nOr I could spend the summer preparing AWS certificate, which costs a couple hundred dollars. Are the knowledge and skills from Azure cert somewhat transferable to AWS?\n\nI know they don't mean much for industry veterans, but what about a student trying the land the first intern?",
            "Hello all,\n\nQuestion about school options.  I'm wanting to transition into Data Science from a completely unrelated background (I majored in English about 1000 years ago). I've taken some classes at my local community college (pre-calc and intro Java) and did very well, so I feel like I might be ok.  I'm looking at programs at a local university that offers an in person degree in Data Science or ASU's online program.  For those who specifically went to school for DS, is there something to be gained by in person classes?  My concern is that with my age, it will be awkward to be in a classroom, but I don't know if online learning will be a good fit for the major (mainly the maths).\n\n  \nAny insights or thoughts appreciated!",
            "Finishing my math phd thesis in manifold learning this Summer at a US university and I've been applying to industry jobs since January. **I will follow any advice I get.**\n\nI've got a 3 month internship at a decently known ds startup under my belt and some part time work data/software/llm work at my friend's startup. Despite my non-zero work experience, several side projects, and somewhat relevant skills. I'm getting automatically rejected everywhere. Even asking each of my friends for internal referrals still gets me automatically rejected.\n\nI've been casting a wide net and applying for analyst, ds, swe, mle, researcher, and even just basic consultant roles to no avail. I'm also open to anywhere location wise that's not the midwest (all love just prefer the coasts).\n\nThe one interview I got was at Capital One and I scored in the 900's on their code assessment when 700 was the cut off. They ghosted me and then sent an automatic rejection a month later.\n\nI really love working with data and will do whatever I need to in order to keep on doing it, but I'm not sure how much more one-off contracts and side projects I should be going for. I get that the theory side of ML is not that attractive to most, but I believe my job applications display that I can do front to back dev work too.  \nWhat should I be doing to improve my chances of being a good candidate? Kaggle? Online courses? I've considered extending my deadline and finding buzzword worthy applications of my work, but it feels like such a shot in the dark.",
            "Any advice for what to do over the summer for a data science major who just finished their freshman year?\nMaybe bootcamps, projects, any research?\nOr since I\u2019m low on money just getting a regular college summer job wouldn\u2019t be a bad idea?"
        ]
    },
    {
        "id": "1bya7j3",
        "datetime": 1712511466.0,
        "flair": "Career Discussion",
        "title": "Any marketing graduates who have switched to DA/DS?",
        "score": 29,
        "comment counts": 53,
        "content": "History about myself\ud83d\ude05 I\u2019m 27 and studied a bachelors degree in marketing with honours(From South Africa). Then I did another honours degree in financial planning and have been a Paraplanner/Digital Marketer the past 3 years. I got frustrated about a year ago as the job was really boring me, I end up working about 3 hours a day. I enjoy the free time though but decided after dabbling with some minor excel data analysis for my company to self teach myself python and SQL as I had made a decision to start a Masters in Applied Data Science(MADS) in 2024 at one of the top 5 universities in South Africa, which is a 2 year program. In my class, about 90 students I am the only one coming from a marketing degree, rest are from engineering and economics. I\u2019m guessing the Python entrance exam phased out a lot of people. I\u2019ve been enjoying the course so far and have learnt more about Python the last 3 months then I did last year self learning\ud83d\ude05 I am curious if there if there are others with my kind of background who have made it into the Data industry and any advice they can give?\n\n&#x200B;",
        "comments": [
            "I\u2019m 10+ years out of school so sort of a different ballgame, but my degree is in marketing and I work in data science. I started right out of school as a marketing analyst and upskilled the analysis side from there. \n\nThere\u2019s a lot of advantages to a marketing background in data science - there are a ton of applications for DS in the marketing realm, and marketers tend to be skilled at one of the most important (but often overlooked) aspects of the field - getting stakeholder buy-in and convincing them of value.",
            "22M here and I have the same query. Been doing SEO (digital marketing) since last 3 years and am now switching to DA/DS.",
            "Yes. Graduated with a BA in Communication about 20 years ago. Worked in public relations, marketing communications, branding, social media, and web publishing for 12 years. Dabbled in data analysis during that time, just Excel and web analytics tools like Google and Adobe Analytics. Because of that, I was moved into a marketing analytics role at my last company, after I had been with them ~3 years in a digital marketing role. I was working under someone with more experience (including an MS in stats) and was blown away by what you could do with the proper skills and knowledge, so I enrolled in an MS Data Science program part-time while continuing to work. Now I\u2019m a product analytics data scientist (reports and insights, experimentation, predictive modeling and clustering, etc).",
            "[removed]",
            "So cool that you found what you love! Did the exact same thing. When looking to 'make it' into the Data Industry you'll find that consultancies are easier targets than product companies to start and make a career, if you can stomach it. Having side projects helps. And having a network is even better (meetups, kaggle competitions, etc.).\n\nPerhaps unnecessary to say but finishing your Masters is I think more important if you are not that technical and will probably end up in a lead role rather than an Expert ML role, for which dropouts are more easily accepted I feel like.\n\nA nice final Masters project that makes use of your background in Marketing could be building on a DSP (https://en.wikipedia.org/wiki/Demand-side\\_platform), which requires recommender systems, search, auto-bidding strategies and some front end as well."
        ]
    },
    {
        "id": "1by0ijw",
        "datetime": 1712482712.0,
        "flair": "Career Discussion",
        "title": "From two competeing models in a team, how do i bring up data leakage in the other?",
        "score": 81,
        "comment counts": 35,
        "content": "For this project that I am working on we have been developing two competeing models. Having access to the codebase, I noticed the other model which has been accepted to be used in production for seemingly better results, has data leakage (using information during training from test data). Synthetic data generation done on the entire dataset and other feature engineering such as  standardising the values on the entire dataset. \n\nI brought this up in the group chat once, but it hasn't been paid attention that much. How do I assert myself and bring this up? Because my model is unfairly being put on a second place.",
        "comments": [
            "The fairest assessment is to test both models on a new, completely held out sample. Preferably one that mimics how it will be evaluated by the end user.",
            "Test both models on new test data, like u/Dramatic_Wolf_5233  said.\n\nAlso, bring this up in actual group meetings instead of just the chat.",
            "You might get a better outcome if you don't frame this as being concerned about your model being unfairly maligned, but as concern for the overall result and possible lost business value.",
            "Definitely don't frame it defensively or in a way that seems non-objective ('My model')",
            "As others have said, propose an A/B/C test where the two models are tested against what\u2019s currently in production. In production you cannot data leak without a Time Machine."
        ]
    },
    {
        "id": "1bxmy77",
        "datetime": 1712439348.0,
        "flair": "Projects",
        "title": "I made my very first python library! It converts reddit posts to text format for feeding to LLM's!",
        "score": 559,
        "comment counts": 67,
        "content": "Hello everyone, I've been programming for about 4 years now and this is my first ever library that I created!\n\n## What My Project Does\n\nIt's called Reddit2Text, and it converts a reddit post (and all its comments) into a single, clean, easy to copy/paste string.\n\nI often like to ask ChatGPT about reddit posts, but copying all the relevant information among a large amount of comments is difficult/impossible. I searched for a tool or library that would help me do this and was astonished to find no such thing! I took it into my own hands and decided to make it myself.\n\n## Target Audience\n\nThis project is useable in its current state, and always looking for more feedback/features from the community!\n\n## Comparison\n\nThere are no other similar alternatives AFAIK\n\nHere is the GitHub repo: [https://github.com/NFeruch/reddit2text](https://github.com/NFeruch/reddit2text)\n\nIt's also available to download through pip/pypi :D\n\nSome basic features:\n\n1. Gathers the authors, upvotes, and text for the OP and every single comment\n2. Specify the max depth for how many comments you want\n3. Change the delimiter for the comment nesting\n\nHere is an example truncated output: [https://pastebin.com/mmHFJtcc](https://pastebin.com/mmHFJtcc)\n\nUnder the hood, I relied heavily on the PRAW library (python reddit api wrapper) to do the actual interfacing with the Reddit API. I took it a step further though, by combining all these moving parts and raw outputs into something that's easily useable and very simple.\n\nCould you see yourself using something like this?",
        "comments": [
            "I think it's really cool that you did not stop after writing a script for yourself, but actually went through the trouble of turning it into a full blown library that's available on pip.\n\nThat kind of experience and commitment to see things through to the end is really valuable, and will greatly help you in your future endeavors!",
            "I've been coding close to 15 years now and never published a python library \ud83d\ude05\n\nReally cool project and a great milestone, be proud! \ud83d\udc4f",
            "Amazing work! And congrats on shipping your first python package. I feel like the best way for data scientists to learn software engineering stuff is doing exactly this.\n\nSome advice from a data scientist that has made many mistakes (and counting):\n\n\\- Great that you've used setuptools because it will teach you the fundamentals of packaging python code. For your next project look at tools like [Poetry](https://python-poetry.org/). Makes your life a lot easier!\n\n\\- [Pre-commit](https://pre-commit.com/) is your friend! It will help sense check your code for you whenever you make a commit. Here is a great tutorial: [https://www.youtube.com/watch?v=ObksvAZyWdo](https://www.youtube.com/watch?v=ObksvAZyWdo). I also highly recommend using mypy, a static type checker that will catch nasty bugs for you before they become a problem.\n\n\\- Think about how you could test the code with something like pytest. How could you mock up the Reddit API? And check out things like Github workflows, which will run the tests for you when you push a new release and even package it up and push it to pypi.\n\nThe above three are some of the first things I teach junior DS's and it usually results in cleaner code, less development time, and happier teams. \n\nKeep up the great work! I can't wait to see what you build next.",
            "Hey man this is sick!\n\nI am saving this post for later use. Is there any way I could credit you for the library if I ever use it for academic purposes?",
            "This is so cool!"
        ]
    },
    {
        "id": "1bx6dsp",
        "datetime": 1712391291.0,
        "flair": "Career Discussion",
        "title": "What's your way of upskilling and continuous learning in this field?",
        "score": 96,
        "comment counts": 48,
        "content": "As the title suggests. How do you think and go about long term learning and growth?",
        "comments": [
            "Keep questioning your approaches and find how you can make an existing solution better. This way you\u2019ll strengthen your current skills and learn new ones. This worked well for me",
            "I don\u2019t. I take my pay check and enjoy my free time engaging with my hobbies.",
            "Kaggle comps have been a really nice blend of practice and theory for me. I'm planning to build independent end to end projects on github next. Might not work but would be worth the try.",
            "Try to find a technical mentor. Start a data science reading group at work. Buy the latest book on the topic your current work project is on. Read more general books like statistics or how to code better python/r. As most people already commented, I do find paid courses motivate me to actually do them.\u00a0",
            "To me, following the right people on X and Linkedin has helped to point out to good research papers. Top of my mind: Jim Fan and Matteo Courthoud, both are sharing good stuffs.\n\nBeware of AI-fluencers though (those basically saying XX is dead, etc every week)"
        ]
    },
    {
        "id": "1bxjeol",
        "datetime": 1712430352.0,
        "flair": "ML",
        "title": "Looking for a kaggle Team....",
        "score": 8,
        "comment counts": 21,
        "content": "Looking for teammates who could take part in kaggle competitions with me, i have knowledge in Computer Vision, Artificial Neural networks, CNN and recommender systems.... ",
        "comments": [
            "In which time zone do you live? I feel like this matters.",
            "I have a statistics background and data science. Am I of any help?",
            "I am proficient in python and it's libraries. And have knowledge on stats and maths.Could he useful for ur work",
            "I\u2019m interested. Send me a dm. I have knowledge in NLP",
            "Which comp?"
        ]
    },
    {
        "id": "1bx9sz3",
        "datetime": 1712404734.0,
        "flair": "AI",
        "title": "Philly Data & AI - April Happy Hour",
        "score": 18,
        "comment counts": 3,
        "content": "If anyone is interested in meeting other data and AI folks in the Philly area, I run a monthly connect to make friends and build local industry connections. Our next connect is April 16th. See here for details: [Philly Data & AI - April Happy Hour](https://www.meetup.com/philly-data-and-ai/events/300140371)",
        "comments": [
            "\ud83d\udc4f\ud83c\udffb",
            "Since it's in the Irish pub, I would totally go if I lived in Philly. Nice hero image by the way."
        ]
    },
    {
        "id": "1bx5q8g",
        "datetime": 1712388682.0,
        "flair": "Discussion",
        "title": "I just can't fine tune BERT over 40% accuracy for text-classification task",
        "score": 29,
        "comment counts": 12,
        "content": "Hi everyone, this is the first time I'm fine tuning an LLM and I just can't get over 40% accuracy for the text-classification task.\n\nI'm using BERT from transformers library to load and train the model and peft for LoRA implementation. My data set contains English written summaries of news articles and with each article there is a label such as Economics, Politics, Science, Entertainment, etc... (14 unique labels). The maximum length of summaries can extend up to 250-300 tokens. My training set has 800 examples and validation set has 200 examples.\n\nAt first the training loss was reaching very low but the validation loss was not going too low with validation accuracy going maximum up to 45%. Since it was overfitting, I changed dropout rate form 0.1 to 0.5. After that the model is not overfitting now, but it is underfitting, with validation and training loss being almost the same and validation accuracy still reaching 45% max.\n\nI tried removing LoRA implementation but nothing changed, except for the training time. At this point I'm confused as to what should I do. I've tried tuning hyperparameters but nothing changes.\n\nCan anyone help me out in understanding what possibly could I be missing here. I can share stats and code implementation or I can even get on call if that's possible. Any help will be very much appreciated.\n\nEdit: Thank you everyone for the input. Most of all were right about insufficient data and class imbalance, since the classes were not even evenly distributed. So, I changed the dataset and used a much larger set of data with 70k examples approx. and turns out, I got an accuracy of 88.45% on validation set. Apparently, it was lack of data and class imbalance which was not training the model.",
        "comments": [
            "how big is your base BERT model so that you decided to use LoRA? \n\nIMO 800 examples seem to be on the lower end for \u201cmore traditional\u201d NLP tasks. Without further info I\u2019d guess that there\u2019s not enough pretraining contained in the model..",
            "Check for class imbalance, what is your loss metric?",
            "Not enough data",
            "I have trained using almost 1M records for over 14 classes and I get around 95-97% accuracy. Ig you are training with very less data that if you don't have access to much more, you will need to try a different strategy.",
            "Since your training set only has \\~50 samples per class (assuming classes are somewhat balanced), I would suggest looking into few-shot fine-tuning, particularly [SetFit](https://huggingface.co/docs/setfit/index), which doesn't require prompt engineering nor large-scale models to perform well."
        ]
    },
    {
        "id": "1bxfed8",
        "datetime": 1712420223.0,
        "flair": "Discussion",
        "title": "LLM APIs vs Hosting OSS/Fine-tuned models",
        "score": 2,
        "comment counts": 6,
        "content": "Hi guys, just want to check my line of thinking.\n\n\n\nI'm managing a DS/ML team in my company, and we've been picking up a couple of projects that uses LLM.\n\nTo date, I see that for the applications happening inside the company, using LLM APIs (OpenAI, Google, etc) and building systems around it (RAG, guardrails, prompting, you name it) is still the way to go because of:\n\n- Speed to iterate\n\n- Fine-tuning data not readily available\n\n- The current + foreseeable future use cases seems to be able to be solved using \"general knowledge\" contained in the big tech's pretraining + instruct-tuning\n\nI still see fine-tuning being thrown around by either the big tech sales people (I get it, they're sales function at the end of the day) or by senior leadership that knows a bit into the details behind these LLMs, but personally I don't see a specific value yet of doing fine-tuning at my company's scale.\n\n\n\nThe reasons I can think of on why someone in my position resorts to fine tuning is:\n\n- If there is an available infrastructure + team managing it already, and serving our own fine-tuned model is cheaper (economy of scale).\n\n- Compliance issues (eg. maybe Banks really don't want to risk their data being sent to other company's server)\n\n- Risk of the model's response stability being at the hand of the provider\n\n- If the task is proven to be too specific, and even GPT-4/Opus/Gemini-1.5 with RAG, etc can't solve (or the modifications around it becoming too expensive)\n\n\n\nBased on your experience, is there a major reason that I miss? Another recent data point is Cognition labs. If people at their caliber wrap their system around GPT-4, why should I bother with fine-tuning? (other than the reasons\u00a0stated\u00a0above)",
        "comments": [
            "Hello,\n\nYou are completely right about your reasons. But I just want to point some more stuff.\nThe fine-tuning is super powerful when you are already using a model and that your costs skyrockets. It allows you to lower the token usage but you need some logged prompts and output for the fine-tuning. \nAnd you can host the fine-tuned models on the big LLM services, you do not need to deploy it yourself.\nI think one of the biggest reasons as to why you would deploy your own LLM is if you want to test some super new model that is not available anywhere.\nYou can check this link for more information:\nhttps://www.metadocs.co/2024/03/22/never-use-local-llm-again-do-this-instead/",
            "it depends on your company's industry. If you find there are too many vertical knowledge in your company's dataset, the llm can't represent these knowledge in a right way. that's why we need to fine tuning our own model. GPT-4 and Gemini-1.5 are used to resolve the universal questions, but when it comes to a very vertical  industry like biology. these two models can't work due to the professional terms. Compared to RAG, fine tuning can perform bettter in some scenario because RAG only fecth the embedding of gpts to execute the retrival task while the fine-tuning involves parameter updating in the gpts.On the other hand, I don't think fine tuning will be cheaper due to the infrastructure.",
            "Using general models assumes the collective noise produced by a bunch of internet anons to make the dataset is sufficient in a niche field of knowledge (it\u2019s not). Fine tuning assumes those anons at least used written language to a sufficient proficiency level for the mode to parrot domain specific knowledge form your pseudo randomly chosen corpus of what you hope contains applicable information about said vertical (you can\u2019t guarantee this).\u00a0"
        ]
    },
    {
        "id": "1bwebry",
        "datetime": 1712311630.0,
        "flair": "Career Discussion",
        "title": "Why there is nope for Data Science Juniors",
        "score": 194,
        "comment counts": 215,
        "content": "Since the last year, I never seen anyone from a different field (not Computer Science, Statistics, DS grad) get an entry level job.\nEven if one complete many projects and courses, bootcamp, github etc. \n\nDo you think the market is dead for outcomers,\nActually do you have anybody got the entry levrl job[.](https://images.app.goo.gl/yjyRu3Hk2uTMTNCg7) without any related academical degree, in last 6 months?\nJust prove me wrong, I want to see real examples to not lose my hopes completely,\n\n-- Btw I am a 3 year+ python developer, with experience on deploying DS models on industry.\nI have applied more than 100 jobs and got no interview. I am in Turkey and appying mostly for foreign jobs.\n",
        "comments": [
            "it probably hurts your chances that you are not working from the EU as GDPR is a big deal in Europe and there is a similar act in the US",
            "I think you need to ask the question: is the problem that there are no entry level roles or is the problem that companies don't want to hire from abroad?\n\nAs a manager in the analytics space, we have largely stopped considering foreign applicants who don't already reside in the country and have a visa (previously not an issue). I also prefer someone with a few years of DA experience over someone who did a DS master's and have no idea of working in data in an actual company. \n\nThe market is tough so make a choice - if your objective is to work abroad, go for any data role and switch once you are already in the country. If your priority is being a DS, try to get the title in Turkey first.",
            "You make the point exactly: without any related academical degree\n\nI think with more and more AI tools etc. the demand for drawing simple graphs and data cleaning is becoming less and less and every Msc or Business graduate these days knows the basics of coding in python/R. \n\nWhat companies need is experts in their respective fields to use DS to solve problems and business cases OR experts to set up data infrastructures. Nothing you will learn in BootCamp or homecooked projects in my opinion.",
            "The market is brutal right now. US citizens who have computer engineering degrees and who send 1000s of applications are unable to find a job. \n\nBut the situation could improve a lot later this year if the interest rates come down.",
            "The reality is that many companies understand that having a degree or portfolio is not sufficient. Everybody can wrangle data, make a nice visual, analyze and make conclusions. The key thing here is that someone with experience, developed strong instincts on which data to wrangle and how to wrangle it in such a way that it becomes valuable, make visuals that focus on the insights and leave out the fluff, or better yet, the visual actually showing something useful. I'm in EU and most/all job openings here do not accept juniors, which also means that recent graduates have a rough time, regardless of the company size.\n\nWhat I would try is to connect with a fairly large charity in your country and commit at least 6 months of volunteering on data management, analysis, science, and don't wait until they ask you for help, be proactive by suggesting situations/scenarios for which you can create insights. The key here is that you explain why it's valuable for them."
        ]
    },
    {
        "id": "1bwno6f",
        "datetime": 1712337696.0,
        "flair": "Career Discussion",
        "title": "upskilling for ex-academic with skill gaps",
        "score": 36,
        "comment counts": 40,
        "content": "Hey folks, I\u2019m looking for advice on filling in some skill gaps. I\u2019m a social science academic with a highly quantitative background, left academia a couple years ago for a nonprofit role, and am now looking for my next thing.\n\nMy job search revealed that I have some noticeable skill gaps that affect interviewing and hiring. But typical data science training options are pitched too low \u2014 I\u2019m qualified/have been recruited to teach subjects like causal inference, experiment design, surveys, data viz, and R programming at the grad level. I\u2019d like to upskill on at least the following topics:\n\n* \u2060Python, but the intro stuff is just unbearably boring. Is there a Python transition course for R experts?\n\n* SQL, ditto. I fully understand most concepts around data manipulation \u2026. in R.\n\n * \u2060Forecasting and predictive analytics. Would be happy to read a book or take a class on this.\n\n* \u2060Product oriented analytics. I\u2019m solid on working with non-technical stakeholders but there seem to be some common issues (churn, pricing, auctions, marketing/attribution, risk, search) where specific knowledge of how people typically approach the problems would be helpful.\n\n* AI/ML basics and assessment. Again, looking for stuff for someone with minimal ML experience but a strong stats/quant background. \n\n\n\nAlso interested in anything you think would be a good direction to pursue. I\u2019m not currently in a hurry, plus the market is miserable, so I\u2019d like to set myself up for a big push next year. I have a substantial amount of PD money I can use as long as it\u2019s started in the next 6 months, so, happy to pay for courses if they\u2019re useful. ",
        "comments": [
            "Pick up \"Hands-On Machine Learning with Scikit-Learn, Keras, & TensforFlow\" by Aurelien Geron.  It's by far the most accessible tome on machine learning that I have come across.  By far.  \n\n\nYou are probably past (or have read) the famous \"An Introduction to Statistical Learning\" by James, Witten, Hastie, Tibshirani.  But now you can walk through an edition in R *and* Python.  That seems like it would make an excellent transition.  \n\n\nAnd, I am a newbie to this one, but I am impressed by INFORM'S Job Task Analysis.  That seems like an excellent breakdown of a problem-solving approach that could help you bridge your expertise to the needs and language of a business.  \n\n\n(I also have a soft spot for Kuhn and Johnson's \"Applied Predictive Modeling.\" However, Kuhn says \"tidymodels\" is his updated approach to \"caret\", or re-building it from the ground up.  So maybe this book is a little out-dated.)  \n\n\n(Lastly, I have had people swear to me that what they can do in dplyr would take a SQL expert a month.  So I'm not so sure it's necessary to learn that much SQL -- I guess it depends on your work environment.)",
            "Just want to say programming at the grad level in social sciences isn\u2019t even close to the level of rigor in terms of SWE standards that need to happen if you\u2019re a real DS shipping code to production. I\u2019d probably start there.",
            "Piping in regarding SQL, it\u2019s suuuuper valuable to know. R or Python might work for datasets that fit in memory, but SQL is going to let you work with way more data.\n\nA fun approach to learn might be [this SQL murder mystery game](https://mystery.knightlab.com). \n\nEach SQL database has its own set of management concerns (SQLite vs Postgres vs Snowflake vs BigQuery vs Redshift vs \u2026). But, SQL is the common query language, and knowing it will get you far.",
            "People seem to like the 100 days of Python course. I\u2019ve just started it and while the early days are pretty easy, it\u2019s also possible to skip videos and jump forward to the exercises when you want to. \n\nDatalemur has a lot of sql questions. If you want to practice in a way that\u2019s helpful for interviews, start with the easy ones and practice 1) talking through your solution cogently as you read the question and write your code, 2) answering the question correctly the first time without making any mistakes 3) working quickly. These assessments are timed and you don\u2019t want to waste time on an easy question that could be spent on a tougher one.\n\nSpeaking as a social scientist myself, I think you\u2019ll find your causal inference training isn\u2019t particularly valued in industry. There are definitely some companies out there that will value it and will give you space to do rigorous work in the space. But they\u2019re the exceptions. If you want to be competitive in the data science market, becoming expert in sql, python, and ML are much more important.",
            "I might be misinterpreting your post but to me it seems like you put too much emphasis on studying using courses.\n\nCourses can be good as a first step, but imho the only way you will actually close those gaps if you come up with a data science projects that require you to use those skills.\n\nBasically you need to use it what you have studied. When I left academia and other PhDs, you think you need to read a book, understand theory to close those gaps. In reality that is super inefficient way of learning. You need to build stuff, fail, build again, and repeat this iterative feedback process when you are actually doing stuff rather than learning how to do stuff from courses.\n\nMy suggestion to upskill your technical skills in sql, python, ML. come up with project where you need to scrape data, load it in some sort of database. then create data transformation scripts in sql (e.g. to create features for your ML model). finally build an ML model. this end-to-end project will teach you more than doing a course."
        ]
    },
    {
        "id": "1bwsylt",
        "datetime": 1712350820.0,
        "flair": "Discussion",
        "title": "Data Science Masters in the US or in Europe?",
        "score": 21,
        "comment counts": 86,
        "content": "Hi, I would like to pursue a career in Data Science/Analytics. I have been accepted to Tilburg's Data Science and Society program in the Netherlands for the upcoming Fall and am also considering applying to the Queens College Applied Data Analysis program in NY, USA.\n\nAs a non-EU and non-US citizen, and planning to stay in the country after completing both master's programs, which one do you recommend? From what I understand, Europe offers lower paychecks and finding a job can be challenging but possible. On the other hand, the US, with a master's degree, may offer better job prospects, although it can be challenging in general. I have no knowledge about visa requirements in the US as well. Can companies support visa for me? Any help is useful. ",
        "comments": [
            "Find a job first before masters, i wish I knew that before making the mistake of going for a masters then get thrown into an over saturated market.",
            "To contrast the sentiment of many commenters here, I had a good experience with doing a data science MSc (in my case, 1 year in the UK). I did my undergrad in economics/econometrics and felt that I needed to upskill (in terms of programming and machine learning theory depth/breadth) to move into data science. I figured: either find a data analytics job and learn that stuff in my free time, or do a data science masters - went for the latter as I figured I\u2019d learn more that way and it would set me up for the long run.\n\nI vetted course content very carefully as I was aware that many of these courses are just a cash grab that shove together a jumble of comp sci + stats modules and call it data science. Eventually I found one that I thought was worth doing (Python, advanced rather than basic statistics, heavy machine learning focus - including deep learning, computer vision, NLP, Bayesian methods, reinforcement learning, etc. - with emphasis on implementing ML algorithms and fundamentals from scratch, lecturers with highly cited publications in ML, etc.).\n\nIt was a lot of work (70-80 hours a week pretty consistently, typically deadlines every week) but l found a job immediately afterwards and have never had difficulty switching jobs since then. I learned loads, most of which continues to benefit me 5 years later, and have consistently had very positive feedback on technical skills (had multiple bits of feedback saying my programming and machine learning skills were on par with many DS seniors as a new grad - not to say I\u2019m so great, looking back there\u2019s a lot I didn\u2019t know, but just to say that clearly the MSc was relevant to industry vs most of my colleagues who were coming from mathematics, physics, etc.).\n\nNow, to caveat this, it was back in 2019 when I found my first job, so the job market was a lot better back then. With hindsight, would I have done the MSc in computer science instead? Maybe, but honestly I think the DS MSc was more relevant to my day-to-day work than a computer science degree would\u2019ve been, and I\u2019ve found computer science *much* easier to self-learn online than statistics, ML theory (at the level of depth where you can implement the algorithms from scratch anyway), etc., so I probably wouldn\u2019t change anything.",
            "This is my pov as an international student in the US. \nH1b is the work visa required to work in US after completing your masters. It will give you 3 years permit to work in the US and you can extend it for 3 more years. Right now the chances of getting H1b picked is 1 in 6. I don\u2019t see the number going down soon, they need to increase the H1b cap. \nI came to US in 2019, used my opt and stem opt and still I didn\u2019t get my H1b picked. That\u2019s 3 attempts plus master students have 20k more cap for h1b. Even when you get your h1b if you\u2019re laid off from the company you get 60 days to find another job, in the current market it\u2019s tough. \n\nIf you\u2019re very anxious person and/or you don\u2019t won\u2019t to gamble your future , or you are okay with uncertainty about your visa status you can consider US.\n\nEdit:- I don\u2019t know which country you\u2019re from, but if you\u2019re from India or china it will take years to get Green card based on I-140 approval after getting your h1b",
            "Tilburg DS&S is kind of mid in the context of the Netherlands, source: I supervised some students there as an industry advisor and have colleagues who finished it. The best data scientists I know have a master\u2019s in math, cs, or econometrics, never \u201cdata science\u201d. Also good luck finding housing in the Netherlands",
            "Probably depends on where you wanna get hired. Other than Oxford in Cambridge hiring managers probably aren\u2019t gonna be all that familiar off the top of their head with a lot of European schools. But if you\u2019re looking for a job in Europe, that would be more relevant."
        ]
    },
    {
        "id": "1bwsdgn",
        "datetime": 1712349376.0,
        "flair": "Education",
        "title": "Recommend good books/ courses",
        "score": 16,
        "comment counts": 20,
        "content": "Hi all. \n\nI\u2019m really free these days, unemployed and looking for employment, but the way the market is right now, I guess it\u2019ll take some time. So can anyone recommend me good data science books/ courses? \n\nWhat im looking for:\n- mlops,\n- docker, kubernetes in data science \n- tackling data science problems without business context\n- how to modularize code (not just Jupyter notebooks, but how to create entire pipelines on vscode/ pycharm. \n- create web dashboards \n\nLooking forward to the recommendations \n\nThanks",
        "comments": [
            "Is this intentional you want to tackle problems without business knowledge? Usually other way around - you incorporate as much domain expertise as you can to make your models more relevant",
            "[fast.ai](http://fast.ai) for MLops it's free online and is very comprehensive. It's structured in a way that you get out what you put in. Good luck in your job hunt!",
            "For mlops, there tons of resources - the question get posted and answered for example here repeatedly: https://www.reddit.com/r/mlops/s/zWhFGSEuyj \n\nDocker and k8s - there are ages before you might need k8s but learning about docker is a valid path, again there tons of resources available. \n\nFor \"pipelines in pycharm\" - is this a substitute question for \"learning about orchestaration\", pycharm is an editor, you can write some pipeline even in notepad, better path learn about Airflow, dbt, Prefect, Luigi, etc",
            "I am a book guy and here's my \"book stack\" for ML:\n\n1. Python Crash Course (Matthes) - Python 101. Knowing the material in part I is essential but part II (applications via a game) is probably optional.\n2. Data Structures and Algorithms (Wengrow) - Data Engineering 101. It covers the fundamental data structures (arrays, sets, hash tables, linked lists, trees, graphs), data structure operations, space and time complexity of algorithms. \n3. The Linux Command Line (Shotts) - Linux 101. This one is probably optional, but I've found it incredibly helpful for my career. The cloud is Linux after all... \n4. Intro to Machine Learning with Python (Muller & Guido) - ML 101. Supervised/unsupervised methods, scaling data, dimensionality reduction, kernel methods, cross validation. \n5. Hands on Machine Learning (Geron) - ML 201. Same material as 101, but from a new perspective and part II covers neural nets. It's just a fantastic book. \n6. Natural Language Processing with Transfomers (Tunstall) - ML 301. LLMs baby, so hot. Another amazing book, also a great introduction to hugging face. \n\n\nThese were the most important books in my journey. Obviously skip over whatever ones you already are strong in and in most cases the best material is in the first ~half of the book. I would also advise you to avoid anything in R, as it's much more limiting than Python. Take detours where needed - if you encounter a concept (eg. matrix lin alg) you are rusty or unprepared for, don't be afraid to spend some time on that!\n\nI've been reading these books for a few years now and it will take time and $ to buy them, but having a single coherent source for your information (rather than the incoherent mess that is the internet) is worth it.",
            "You should just try to build these things and learn from that. If you really want to read or watch there are plenty of free courses that get it done"
        ]
    },
    {
        "id": "1bwk2y4",
        "datetime": 1712328899.0,
        "flair": "Discussion",
        "title": "Data challenge take-homes. How are you setting up your repos?",
        "score": 30,
        "comment counts": 17,
        "content": "Howdy folks.\n\n&#x200B;\n\nI've been on the job market for a while now (7 YoE, laid off from one of the larger tech companies last year), and written a bunch of data challenges in that time. Mostly I've just been sending over a requirements file and a jupyter notebook with extensive discussion of what I did and why I did it, but this is not how I would actually code on the job -- just how I code given the constraints of having to do things quickly to turn around data challenges. I don't think it's a great way to showcase my actual coding habits which would typically involve being more thoughtful and including unit/integration tests, but when you're faced with a tight deadline to turn something around, you do what you have to.\n\n&#x200B;\n\nThat said -- some of you must have boilerplate that you copy from one project to the next to make the process less painful. Show me your repo structures! So that I can steal them and not have to think next time.",
        "comments": [
            "I typically use [cookiecutter](https://cookiecutter-data-science.drivendata.org), it\u2019s a decent starting point. There\u2019s also a good [cookiecutter](https://github.com/TezRomacH/python-package-template) for CLIs available too that I\u2019ve been using for developing scripts and helper packages. The latter has some good dev-ops features if you\u2019re using GitHub",
            "How long are your takehomes that you're setting up an entire repo to show your results? Maybe a hot take from me but any company asking you for that much is kind of a red flag.",
            "Just use cookiecutter if they really want a repo.\n\nIMO they're asking way too much from you if you need a fully organized repo for a 1hour take home.",
            "Take homes shouldn't be more than an hour, otherwise fuck that",
            "However I set it up, I'm using a license like GPL that can get them in legal trouble if they decide to use my work after not hiring me lol."
        ]
    },
    {
        "id": "1bwexfq",
        "datetime": 1712313913.0,
        "flair": "Projects",
        "title": "Opinions on a side project for a recent grad?",
        "score": 25,
        "comment counts": 35,
        "content": "Hey everyone \ud83d\udc4b \n\nI\u2019m graduating in 4 weeks. Got an analyst role, but I eventually want to land a DS role. Was thinking of taking a gap year and getting an MS degree in CS with an emphasis on ML from Georgia Tech.  In that year, I wanted to work on a side project. \n\nI was honestly thinking of teaching myself object oriented programming and making a video game without an engine, just using hard coding.  I know that\u2019s not DS related, but I\u2019ll be doing plenty of analytical stuff with SQL/Python/Tableau at my day job. And this felt like a project that would teach me more about the programming side of things, less of the basic scripting side that I do at work. \n\nI am wondering if anyone sees value in a side project like this, in regard to landing an actual DS role in the future? I really want to learn outside of work, but also want it to be something I\u2019m interested in. Thanks for the feedback!",
        "comments": [
            "If DS is your goal, I would recommend finding a subject you are interested in (sports, weather, politics) and analyze the heck out of it. I went from several years of DA to working on masters in DS. Your job will likely be lots of queries and plots. A project that utilizes statistical analysis and modeling will tell you a lot about if DS is right for you.",
            "I think solid programming skills and an understanding of object oriented design is always helpful for anyone working on software or the AI that is to be embedded in software - I say this as a ML engineering manager",
            "Can you explain what you mean with your gap year plans? I\u2019m looking into ML at Georgia tech as well, and iirc, the fastest the degree can be achieved is in 2 years. Maybe you meant, you would take a year to focus on the degree, and then finish the degree while working? \n\nI\u2019m somewhat in a similar boat to you. Recent CS grad with DS intern experience. I\u2019m looking for a DA, DS or DE role, and was planning on getting started with the masters once I have another year of industry experience under my belt. \n\nYou\u2019ll will definitely have some DS/ML projects to show from your degree. It sounds like the degree of difficulty of OMSCS is quite high, it may be ambitious to think that you will have time to work, do your masters, and be able to work on side projects. Assuming you do though, I would just delve into one of your interests. For me that\u2019s music generation with ML. That way it\u2019s a bit more fun.",
            "If you do any form of exercise or hobbies that you can data-fy I've always been impressed by that. I chart my rowing for example and chuck that on streamlit with some nice analytics over it and some API integrations to chart progress. Personally side projects that have a showable end point show me that you can finish a task - something many smart people don't do...",
            "Check out the big book of small python projects. It isn\u2019t specific to DS but it is fun and has oop"
        ]
    },
    {
        "id": "1bwlye4",
        "datetime": 1712333488.0,
        "flair": "Career Discussion",
        "title": "Need guidance for (lack of) career path",
        "score": 4,
        "comment counts": 17,
        "content": "I'm at a loss of where I stand in the Data Analyst career path. I did an econ MA in 2019 immediately after finishing my BA, which was a terrible idea because I was playing catchup on the maths and couldn't really properly learn any of econ models or causal inference/statistics.\n\nAfter graduating I struggled to find an \"Econ\" job while my peers got positions months before graduation. Thanks to Twitter hobby-posting during the start of COVID though, I got my first gig as a Data Analyst late 2020 with the Dept of Health. Thats when I started self teaching Python alongside PowerBI and Tableu. More recently I've picked up SQL and R...\n\nFast forward to now, I've been through about a job per year and I am once again not too happy with the position I am in. I'm a glorified data wrangler at my mental health research lab, which has a small 3 person data analyst team (4 if we cound the boss/director). I get barraged with so much ad-hoc stuff that I can't say no to that I don't have time to revisit all the modelling/causal inference stuff I didn't fully grasp during my MA... nor does anyone really care about my opinion in that topic. I've had countless instances of cases where, despite not know how to fix an issue, I call out an issue in an analysis that is egregious (ex: operating on a dataset for which, due to issues with my peer's R code, only 30% of observations had an IPTW and the rest where NULL, when none should be NULL). No one ever cares - they are in the well-known social sciences loop of \"shit out as many papers as possible, or perish due to lack of grants\".\n\nWhenever I do get the chance to go beyond data wrangling, I'm basically sent on fishing expeditions that we use to show some silly model in a silly one-time presentation never to be revisited. \n\nI have insisted at times for my name not to be included on a paper we submit to journals, but they always get me included because I can't get myself to say \"the reason is you have a lot of issues in there, which I pointed out and you chose to ignore. I don't wanna be victim to a replication crisis blogpost\". It's demoralizing and I can't continue this way for longer.\n\nIt seems all academic-ish jobs in social sciences are like this, from what I've read on forums. But I just don't have the skillset to make it as a \"Data Scientist\" in industry either...and I don't have the time to fill the gaps while I'm working because I'm always data monkeying away, and often times reading a shitton of documentation that wears me out from being able to get into my Statistics bookmarks after work...Right now I have been tasked with figuring out our datawarehouse, which is prepared in fucking SAS-SQL and has dozens of SAS programs each with copies like `code_v1` thru `code_v16_final_FINAL` - the person that did all that work for years, and was my mentor when I joined the lab, abruptly quit recently.\n\nWhat should I do? I have savings...My partner is OK with me quitting to figure things out. But I'm not sure I am. I need a plan, at the very least, before doing that... I've considered proposing they have me as a part time employee, or just returning to my previous job for which I had similar issues but they weren't in this magnitude...\n\nIf it matters below is my \"career path\" thus far. I've an Econ/IR double BA and an Applied Econ MA...\n\n\n1. COVID contact tracing team - ended after 1 year because politics\n2. Development NGO - quit after accepting job on #3 because my pay would be doubled, plus I was like 3 additional unpaid roles there on top of DA\n3. Govt. transparency, civic participation, econ development think tank - quit after getting told I couldn't work remotely from the state I wanted fo move to so I could move in with my long distance partner. However they did ask me to rejoin 1 month later and I said no...still in good terms\n4. Mental health research lab - current job...pays well enough but dreading it hence this post",
        "comments": [
            "I have a few more years of experience, but have a somewhat similar path into data analytics, but from public health. I'm currently a data scientist at a health department, and I don't do too much statistics these days, since most people I work with are dealing with data collection, cleaning, and visualization, and just aren't at the point to do serious analyses. \n\nI worked at an academic research lab before that, and actually got to utilize lots of interesting statistical analyses, though after a few years they just kept reusing the same methods for different projects, so it became less interesting. \n\nGenerally I feel like to do actually complex fancy stats, you either have to go a fully academic route, or find a relatively rare and high level industry or government job that actually needs it. \n\nI feel like most places need clean and accurate datasets, visualizations, and dashboards more than complex stats. \n\nHow much are they currently paying you? I was job searching this time a year ago, and it took a while. The market isn't great at the moment either.",
            ">I just don't have the skillset to make it as a \"Data Scientist\" in industry either\n\nI have a similar educational background. What you are doing is very similar to what most DS/Analysts do in industry. If you can do a group by, write a for loop, and know what a p-value is you are qualified. Most claims otherwise are posturing. \n\nThe problems you're describing (BS methodology, garbage code) are also way more common in industry than you would think. \n\nI recommend you check out a little and start looking for new jobs. The market is rough and it will take a long time. Why stress out so much getting things done for people that you don't respect? Worst thing that can happen is you lose the job, but I bet that's unlikely, and you want to leave anyway.",
            "This sounds less like a technical skill issue and more of a self advocacy issue. Learning how to advocate for yourself and affect change across an org is a major factor in how satisfied you will be in your job. I dont have any pointers on how to do that unfortunately but its a skill that ive seen a lot of my seniors/ team leads use and it mostly stems from confidence in ones skills/abilities.\n\nAlso, youre not being compensated enough for your 5 years of exp and tools that you use.\n\nJust my two cents, just throw out apps for data scientist roles. There are plenty in government and non profits that are geared more towards policy/ research and would pay you well. Keep the 80/20 rule in mind and understand that you dont need knowledge of the absolute cutting edge. Just having the basics in mind is enough. Hell if youre willing to move to nyc, there are a good amount of ds jobs in gov that deal with stats/modeling. Good luck man.",
            "The problems that you describe, do you think they do not exist in data science?  Looking at your other comments, it sounds like you have been brainwashed into thinking these problems do not exist in data science and industry.  \nThe problems exist there too.  The incentives are just different.  Instead of publications, it is whatever the business has decided to prioritize.  With research publications, it is at least somewhat out in the open, and if it really matters, people can at least look and critique the studies.  In business, the problems are instead hidden away and who knows what problems they are causing.  \nAnother big difference is the pay.",
            "[removed]"
        ]
    },
    {
        "id": "1bw2ppe",
        "datetime": 1712273696.0,
        "flair": "Discussion",
        "title": "You are under arrest, you can be released if and only if, you explain p-values the right way.",
        "score": 124,
        "comment counts": 102,
        "content": "So, I asked chatGPT this:  \n \"I know, And you know, that you have been trained on a lot of data from reddit. Pretend you are a tough critic on datascience sub and validate my statements:  I am studying p-value, in ab testing and in hypothesis testing in general, here's my interpretation to it: Given a chance model that embodies the null hypothesis, the p-value is the probability of obtaining results as unusual or extreme as the observed results.  How wrong am I ?\"\n\n&#x200B;\n\nAnd here's its reply:\n\n \n\nYour interpretation of the p-value is essentially correct and aligns well with standard statistical definitions. Let's break it down a bit more for clarity and completeness, especially from the perspective of a critical view one might encounter in discussions, say, on the data science subreddit:\n\n1. **Definition Accuracy**: The p-value measures the probability of observing results as extreme as, or more extreme than, what was actually observed, assuming that the null hypothesis is true. Your description captures this essence correctly. It's important to emphasize that this is under the assumption that the null hypothesis is true.\n2. **Context of 'Extreme'**: The term \"as unusual or extreme\" is accurately used here. In statistical testing, \"extreme\" is defined in the context of the null hypothesis. What constitutes \"extreme\" is determined by the test statistic used and its distribution under the null hypothesis. It's not just about being unusual; it's about being so far from what the null hypothesis predicts that it prompts a second look.\n3. **Not the Probability of the Null Hypothesis Being True**: A common misconception about p-values that's important to avoid is thinking that the p-value tells you the probability that the null hypothesis is true or false. Your statement avoids this pitfall by correctly not making claims about the truth of the null hypothesis itself. The p-value doesn't provide direct evidence about the truth of the null or alternative hypotheses; it's about the evidence against the null hypothesis given the data.\n4. **Implications for Decision Making**: While not directly stated in your interpretation, it's essential to recognize that a small p-value indicates that, if the null hypothesis were true, the observed result would be surprising. However, deciding on a threshold for \"small\" (like p < 0.05) is somewhat arbitrary and should be considered in the context of the field of study, the stakes of the decision, and the study design.\n5. **Misinterpretations and Limitations**: It's also worth noting, especially in a critical discussion, that p-values are widely misunderstood and misused. They should not be used in isolation for making decisions. Other factors, including the effect size, the study design, the data quality, and the pre-study probabilities of the hypotheses, are crucial for a comprehensive understanding of the study's results.\n\nYour interpretation of the p-value as the probability of obtaining results as unusual or more extreme than the observed results, under the null hypothesis, is a concise and correct starting point. Expanding upon this with an awareness of common misunderstandings and the context of its application will deepen the understanding and communication of statistical results.\n\n \n\nREALLY GUYS, HOW WRONG AM I?",
        "comments": [
            "its how big your hog is, next question",
            "The correct answer is, I don\u2019t use p-values",
            "It\u2019s the probability that you\u2019re wrong and yes I will fight anyone who disagrees including the police.",
            "'It's about the evidence against the null hypothesis given the data'\n\nLoved it",
            "The p-value is the probability that the sample parameter is at least as extreme as the actual parameter based on the null hypothesis."
        ]
    },
    {
        "id": "1bwreum",
        "datetime": 1712347008.0,
        "flair": "Analysis",
        "title": "How can I address small journey completions/conversions in experimentation ",
        "score": 2,
        "comment counts": 2,
        "content": "I\u2019m running into issues with sample sizing and wondering how folks experiment with low conversion rates. Let say my conversion rate is 0.5%, depending on traffic ( my denominator) a power analysis may suggest  I need to run an experiment for months to achieve statistically significant detectable lift which is outside of an acceptable timeline. \n\nHow does everyone deal with low conversion rate experiments and length of experiments? \n",
        "comments": [
            "There are no real way out of it but there are techniques you can use to improve power depending on your distribution \u2014 CUPED, Sequential Testing etc. You could also structure your test to measure a bigger MDE, essentially saying with this setup we can only measure a 20% or bigger change. \n\nBefore that though see if you can model your metric in a way that is more representative of the effect you\u2019re trying to measure, usually through catching the effect closer to the point of impact. So measure clicks on the campaign rather than conversion of buying the product. It\u2019s not exact science but see.",
            "Measuring a direct action which has more data and that has a correlation with the  required outcome like conversion would be better. For example retention per days during trial often has a pretty good correlation with conversion. Conversions can be low and when spread over months be affected by multiple variables beyond our control including the market, if its fluctuating, and hence hypothesis testing can be unreliant."
        ]
    },
    {
        "id": "1bw933n",
        "datetime": 1712291274.0,
        "flair": "Discussion",
        "title": "What would you consider \"advanced\" seaborn plotting?",
        "score": 14,
        "comment counts": 17,
        "content": "Hey guys, I'm doing this little project on the side where I'm exploring how to do things with seaborn that are usually not covered in most courses and tutorials on seaborn, and have involved basically figure things out by tinkering and adapting code I found online for my purposes, and then finding more elegant ways to ago about the same code I found. So I decided I want to put together a little \"advanced seaborn\" project to collate these things, share them, and also for future personal reference.\n\nSo far I'm covering:\n\n-Labeling some/all points/bars on a plot with data that is not part of the plot (i.e. that is not passed to the containers), especially when using `hue`, `order` and `hue_order`,\n\n-Plotting over background images,\n\n-Centering long axis labels using custom functions,\n\n-Custom legend when the defaults don't work well,\n\n-Styling/modifying xtick and ytick labels on the fly\n\n-Using latex\n\nWhat would you add to the list or, alternatively, what do you think would make you say \"that's cool/handy\" when plotting with seaborn?\n\nThanks!",
        "comments": [
            "Honestly, it's great to know all these but a graph is as good as its comprehensibility.\n\nNone of my stakeholders give a crap about violin, pair plots. In fact, the most basic plots work out great as everyone in the meeting can more or less understand it.\n\nA picture speaks a thousand words, but it shouldn't be crammed with more pictures.",
            "I know you asked about seaborn specifically. But I just want to say that if you like learning things like this, I recommend looking at plotly as well. \n\nIt\u2019s useful especially if you\u2019re interested in interactive visuals or animations (or dashboards). You can also enable editing which allows labels, annotations to be changed, dragged on the charts.\n\nIf you want true freedom look at things you can do in d3.js",
            "I like seaborn because it generates in one line a beautiful figure that would require a lot of code to do from scratch.\n\nI don't use seaborn because I struggle to extend, adapte, modify the figure.\n\nLike, if I want to add the regression parameters of a regplot I need to compute it myself? Well I'll just do my own regplot from scratch then.\n\nI'm looking forward for your project!",
            "Young DS want to use the most sophisticated tools. Old DS want to use the fastest tool.",
            "I just use matplotlib and pyqtgraph, and I wouldn\u2019t worry about trying to do something fancy unless you need to. As long as you understand how to read the documentation and the source code, you can do anything within reason."
        ]
    },
    {
        "id": "1bvj27b",
        "datetime": 1712223753.0,
        "flair": "Career Discussion",
        "title": "Almost 1100 jobs over the past year or so\u2026 zero call back or interviews, is the market really that bad??",
        "score": 486,
        "comment counts": 413,
        "content": "",
        "comments": [
            "The market is rough these days, you have to compete with data scientists who know how to take a screenshot.",
            "need to see your CV to tell if this is the current market problem or is it your marketability problem.\n\nMarket isn\u2019t great now, but my spidey sense is saying this is not markets fault entirely",
            "I think LinkedIn job applications is a sham",
            "Should not be this bad. \n1. You need to close the gap between industry requirements and your skills. If you have them you have to show it someway. If you dont, you need to upskill yourself or atleast able to show some potential\n2. Check if your resume is ATS friendly, there are plenty of websites they will give you tips on how to structure and write your resume\n3. Network: JUST ASK! Expand your network aggressively, everyday and ask them for referrals\n4. Iterate: Reiterate from time to time, making small progressive improvements, write down where/when/on what questions you fumbled during interviews, work on them",
            "I\u2019ve never applied to 700 jobs before, but I see all these post \u201cI\u2019ve applied to 2,000 jobs and 0 calls\u201d. Maybe it\u2019s different now.\n\nThere is no way you\u2019re writing a cover letter for each one. There\u2019s no way you\u2019re tailoring your resume towards each role. There\u2019s no way you\u2019re only picking jobs you\u2019re best suited for. Nobody seems to actually be trying, it\u2019s just click as many apply now buttons as you can and hope for the best."
        ]
    },
    {
        "id": "1bwmh4t",
        "datetime": 1712334753.0,
        "flair": "Analysis",
        "title": "Deduplication with SPLINK",
        "score": 1,
        "comment counts": 2,
        "content": "I'm trying to figure out a way to deduplicate a large-ish dataset (tens of millions) of records, and SPLINK was recommended. It looks very solid as an approach, and some comparisons are already well defined. For example, I have a categorical variable that is unlikely to be wrong (e.g., sex), dates, for which there are some built in date comparisons, and I could define the comparison myself be something like abs(date_l - date_r)<=5 to get the left and right dates within 5 days of each other. This will help with blocking the data into more manageable chunks, but the real comparisons I want are some multi-classification fields. \n\nThese have large dictionaries behind them. An example would be a list of ingredients. There might be 3000 ingredients in the dictionary, and any entry could have 1 or more ingredients. I want to design a comparator that looks at the intersection of the sets of ingredients listed, but I'm having trouble with how to define this in SQL and what format to use. If I can block by \"must have at least one ingredient in common\" and use a Jaccard-like measure of similarity I would be pretty happy, I'm just struggling with how to define it. Anyone have any experience with that kind of task?",
        "comments": [
            "You can use the array intersect Comparison Level for data like that.\n```\nimport splink.duckdb.comparison_level_library as cll\ncll.array_intersect_level(\"name\")\n```\n\nSee here:\nhttps://moj-analytical-services.github.io/splink/comparison_level_library.html?h=array+intersec#splink.comparison_level_library.ArrayIntersectLevelBase.__init__--__tabbed_1_1\n(look at the examples)\n\nIf you want a Comparison which has array intersections of various sizes, there's a function for that:\n```\nimport splink.duckdb.comparison_library as cl\ncl.array_intersect_at_sizes(\"first_name\", [3, 1])\n```\n\nhttps://moj-analytical-services.github.io/splink/comparison_library.html#splink.comparison_library.ArrayIntersectAtSizesBase.__init__--__tabbed_1_1"
        ]
    },
    {
        "id": "1bvwmnu",
        "datetime": 1712259661.0,
        "flair": "Career Discussion",
        "title": "Turning down a job but offering to work as a consultant instead?",
        "score": 33,
        "comment counts": 15,
        "content": "I got a job offer for a position that I'm going to turn down (pay is lower than I make currently and even if that wasn't a factor, the benefits aren't great). I've definitely already decided not to take it. But I wonder if it would be possible to ask to work part time for them if they wanted. The work is infinitely more interesting than what I'm doing now, so it would be nice to be involved.\n\n\nHas anyone done this before? I don't want to insult them or anything by offering such a thing.",
        "comments": [
            "I think this is really just a matter of framing it well. Something like, \u201cI really appreciate this offer, and I really enjoyed meeting the team and learning about the work you\u2019re doing. After considering the offer, unfortunately, I won\u2019t be able to accept the full time position. However, if you\u2019re ever interested in working with someone in a part-time consulting role for this work, I\u2019d be very interested in exploring that option.\u201d\n\nAnd then just be ready for them to say no.",
            "You think highly of them and want to cooperate with them, where is the insult? Go for it.",
            "I think that's awesome and a great idea, but I'm not too confident about this kind of work related decision making, so I'm just commenting to read what others would do too, you sound like you have the enthusiasm to add something to that team and they'd be crazy not to want to work with you!",
            "Well many companies, especially larger companies, don\u2019t hire 1099 (assuming you\u2019re in the US) directly anymore unless it\u2019s highly specialized and very short term. If there\u2019s an established temp / contract agency they already work with then maybe you can try that route. Otherwise I think you might find it harder than you assume it would be even if the hiring manager is on-board.",
            "But they need a full-time person?"
        ]
    },
    {
        "id": "1bvwatd",
        "datetime": 1712258897.0,
        "flair": "Discussion",
        "title": "Small Company vs. Larger Company for a Data Scientist: A Discussion on Generalist vs. Specialis5",
        "score": 11,
        "comment counts": 12,
        "content": "I have experience in machine learning engineering, data science, data engineering, and MLOps, which aligns my profile more with that of a generalist. I'm uncertain if this is beneficial, but I believe being a generalist could offer more opportunities in the future. Am I mistaken? Additionally, due to my ADHD/Autism, I often find myself quickly bored with repetitive tasks.\n\nRecently, I've been participating in the hiring process for a data scientist position and am now a finalist at two companies: a small startup (which I'll refer to as \"S company\") and a large corporation (\"B company\").\n\n\nS company is just beginning to implement AI, and the vacancy is for a senior-level position. As a data scientist, I would need to identify opportunities to apply AI. They lack an MLOps platform but promise freedom to deploy and use technologies as needed, with minimal bureaucracy. This seems appealing for rapid growth within the company. However, the downside is the limited number of experienced data scientists in the team (only one, actually). S company, focusing on package delivery, has many opportunities to apply optimization algorithms for routing. Yet, I've noticed the company seems somewhat disorganized.\n\nB company, is a large bank known for being data-driven, frequently hosts insightful conferences on ML and DS on YouTube. The vacancy relates to credit limits, and I was told that specializing in credit and loans is crucial for advancement within this company. The position is mid-level and offers a higher salary than S company. This could be advantageous, as it allows for learning without excessive pressure. However, I wonder if becoming too specialized in this area might limit my future career options.\n\nWhat you would do in my position?",
        "comments": [
            "Here's my summary:\n\nThere is no faster way to learn than a startup. However, there is zero guarantee that you're learning the right, let alone best, way of doing things. But you will learn how to get it (whatever \"it\" is) done, and you will also learn a lot more about everything else that the business does.\n\n>S company, focusing on package delivery, has many opportunities to apply optimization algorithms for routing. Yet, I've noticed the company seems somewhat disorganized.\n\nAll of this checks out and is expected of a startup. Lots of opportunity, very little maturity.\n\nLarge companies will make it much more likely that you will be exposed to the right way of doing things - although it may often be the case that you a) won't be the one who gets to do it, and b) even if you get to do it and you know the right way to do it, you'll end up doing it the \"okay\" way instead because it's faster and cheaper and no one wants to wait 6 months for that. \n\n>However, I wonder if becoming too specialized in this area might limit my future career options.\n\nIt does in the sense that it will pigeonhole you to that industry to some degree, but it won't in the sense that literally every company in the world has to deal with risk in some capacity, and banks are quite possibly the most experienced risk modelers in the world.\n\nSo yes, once you have 3+ years in risk modeling you might find yourself having an easier time with risk modeling jobs vs. e.g. product analytics roles, but that shouldn't be *terribly* limiting because there will likely be a BUNCH of risk modeling jobs. \n\nNow, is risk modeling the most lucrative sub-area of DS? Probably not. If you want to aim for profitability I think focusing on MLOps or Product Analytics or Causal Inference are probably a faster path to big bucks. But at the same time, sometimes the most growth doesn't come from picking the right area and more from climbing the fastest within an area.",
            "[deleted]",
            "I think it depends on what you want. Both places it seems like you will learn but at the big bank you will be around really smart people and in my experience you learn more from being around people smarter than you. Always nice to have a team and people who can assist you and the extra money won't hurt.",
            ".",
            "Much better work-life balance at a larger company but if you are earlier in your career, probably better off going with the start up and learning more and gaining more skills faster"
        ]
    },
    {
        "id": "1bvv2ks",
        "datetime": 1712256178.0,
        "flair": "Discussion",
        "title": "Does anyone recommend a clustering algorithm that can also update existing clusters?",
        "score": 12,
        "comment counts": 7,
        "content": "For instance say I have 1000 features that I cluster with algorithm A.  I obtain another 500 features, I would like to use the existing cluster information without reclustering everything from the start. \n\nIs there a clustering algorithm (ideally in sklearn and not k-means) that can handle this type of usage?\n\nIn one use case, the distance metric I plan on using will be jaccard since my data will be binary.",
        "comments": [
            "This sounds like a use case for online clustering algorithms. They are designed to process data points sequentially as they arrive, updating the clustering solution on-the-fly without the need to revisit old data. Take a look at StreamKM++, DenStream. [https://riverml.xyz/dev/api/cluster/CluStream/](https://riverml.xyz/dev/api/cluster/CluStream/)",
            "I assume you mean records, not features.\n\nApplying k-means for initial clustering and using knn for the classification of new records based on the previously learned labels is a common method for such use cases. Both algorithms are available in sklearn.",
            "I have my doubts that using 1000+ binary features directly in a clustering algorithm will get you good results due to curse of dimensionality. I hypothesize that doing PCA and bringing the dimensionality to <=5 and then clustering would get you better results (try with number of dimensions \\[2, 3, 4, 5\\]). \n\nSecondly, as mentioned Online clustering algorithms would allow you to update the cluster centroids / assignments on the fly. However, they only work with new data, not with new features. There is no way to add 500 more features on the fly, without re-training.",
            "Following"
        ]
    },
    {
        "id": "1bvsmeb",
        "datetime": 1712250489.0,
        "flair": "Education",
        "title": "Feels like I\u2019m in a grey area education wise. ",
        "score": 13,
        "comment counts": 18,
        "content": "I\u2019ll be graduating in a month with a BsC in Applied Statistics. The 3 most important classes I\u2019ve taken are Regression Models (Poisson, NB, Beta), Multivariate Analysis (PCA\u2019s, Discriminant Analysis, Factor Analysis), and Machine Learning (SVM, Decision Trees, SMOTE). Have a course in data visualization using the tidyverse package in R, a course dedicated to the SAS programming specialist certification, and 2 courses preparing for the actuarial FM exam and P exam amongst other electives. \n\nI don\u2019t know if an undergraduate is enough for competency in the first 3 classes mentioned given each one has a graduate level variation (Ex: I\u2019m taking 410, grad students take 510). Feel like my degree gave me breadth but not depth stats wise. Math wise I got to the Real Analysis sequence but I don\u2019t think I\u2019m cut out for a pure stats approach.\n\nIs this enough for an entry level job or is it gatekept by a post grad level of education?\n\n\n\n\n",
        "comments": [
            "This is definitely enough for entry DA / BA / BI type jobs. Probably not ML or DS unless you were at a prestigious university and/or had great internships. \n\nAs long as you have some personal projects you can talk about in interviews or something like an internship you should be fine. If you're worried why not get a Masters in something quant, do an internship, some research, etc.?",
            "It's hard to be competent in any of these without getting into the weeds, putting them into practice, for years.\n\nEven first and often second year ML/CS graduate students are generally not fully competent with most of these.\n\nJust keep climbing the hill and practice your tools, and most importantly, ask good questions.\n\nSometimes its enough for an analyst role or something very junior.",
            "Your issue will be less about whether you have enough knowledge to contribute to a work environment and a lot more to do with the fact that it's a terrible work market and there's a bunch of people with MS degrees who can't get a job. \n\nSo as u/i_oper8 said, unless you're from like a top 10 school, you're probably going to want to get a MS degree.\n\nAnd I'll take this as an opportunity to recommend you considering a MS in Stats or CS vs. an MS in DS.",
            "In entry level jobs usually they don't care about your knowledge as much as how willing you are to work hard and your ability to learn as you grow.",
            "At my company we typically require a masters, but startups and other smaller companies are more open. Right now the job market is a shit show though so even with the master degree people are struggling to get hired"
        ]
    },
    {
        "id": "1bvyd6m",
        "datetime": 1712263598.0,
        "flair": "Discussion",
        "title": "How much money is there in packaging and selling economic data?",
        "score": 4,
        "comment counts": 22,
        "content": "I know this is a vague question but just looking for a ballpark. My friend's business is sitting on a bunch of real time economic data on traffic volumes at everything from ports to raw materials suppliers and construction companies across a single EU country. How would they even go about sizing this, and is it likely to have a market?",
        "comments": [
            "Pretty hard to put a number on it without more detail or expertise, but the kind of information you describe is possibly worth a lot (as long as it is not already being packaged and sold elsewhere)",
            "Look into S&P Global.",
            "I\u2019d start with a data broker and look for distribution that way. They\u2019ll mark the data up and resell to their clients. Look up experian, acxiom, etc.",
            "For hedge funds, the value depends on the availability of the data:\n\nIf the data are not publicly available, it is potentially worth millions. Some specialized hedge funds pay fortunes for satellite images to predict the market. If this is publicly or easily obtainable data, then most likely it is already \"priced in\", and not worth a lot more than labor and costs.",
            "Does anyone know how such company's like S&P Global  collect their data?"
        ]
    },
    {
        "id": "1bv2zmt",
        "datetime": 1712175348.0,
        "flair": "Discussion",
        "title": "An example of how Linear Programming has helped you on the job",
        "score": 94,
        "comment counts": 84,
        "content": "Hi guys, I\u2019ve been a data scientist for 1.5 years, and I haven\u2019t needed to use linear programming one bit. I\u2019m thinking of changing jobs for a higher pay, and I feel the need to get better at LP beyond the basics, otherwise I\u2019d feel like a fraud in my next job. I\u2019m curious, how actually has that helped with your typical business use cases? I\u2019d love some examples, as I\u2019d like to tie a concept to an actual solution that helps you, either as an unexpected one off case or a regular experience. ",
        "comments": [
            "I've used LP frequently in my jobs, but all my jobs have been as Operations Research analyst, not data scientist per se. One example was a mixed-integer program for medical staff scheduling, solved using branch-and-bound (which uses LP). The solvers will take care of the algorithm for you, so that all you need to know is how to program the model and retrieve/interpret the results. Problems that require more advanced optimization techniques very frequently use LP as part of the solution strategy, but you will have to write your own code for that. \n\nI could write a screed about how under-utilized LP is, but the main idea is this: \n\nWhenever you need to make a decision (which every organization does hundreds of times per day), you need to use the tools of operations research. Optimization modeling (linear, non-linear, stochastic, dynamic, integer, and others) or simulation are basically the only way to do that at an industrial scale. Any large corporation that has an institutionalized decision support system will use one or probably several of those techniques. Dashboards, predictive models, and subject matter expertise are all useful, but they are all going to lead to sub-optimal decisions (edit: if used without an optimization model or simulation).\n\nJust my two cents.",
            "Why on earth would you feel like a fraud not having used this?",
            "What makes you feel like you need LP to succeed and a lack of knowledge in the area makes you a fraud? It\u2019s a specialized area of DS/ML and is therefore not strictly necessary unless you\u2019re tasked with solving those types of problems.",
            "IMO that would seem more applicable to optimization skills in supply-chain and transportation which isn't a small niche but is still a niche.\n\nyou'd be better off developing skills in data or ML engineering if you're chasing higher pay, or other DS areas like casual inference",
            "LP can be useful, but seeing this post trigger my PTSD from college lol. I had to take a linear programming class and I had to solve several problems using the Simplex method by hand.\n\n*By hand*\n\nThis was around 2017-2018. *Shudders*"
        ]
    },
    {
        "id": "1bvat2z",
        "datetime": 1712194610.0,
        "flair": "Analysis",
        "title": "Simpson\u2019s Paradox: which relationship is more \u201ctrue\u201d the aggregate or the groups? ",
        "score": 19,
        "comment counts": 20,
        "content": "Hello, \n\nI am doing an analysis using linear regression where I have 3 variables. I have 6 categories, an independent and dependent variable. There are 120 samples, so I have 6 groups of 20 samples. \n\nWhat I found is when I compute the line of best fit for the groups, they all have a negative relationship. But when I compute the line of best for the aggregate data, the relationship is positive. Also all of the group and the aggregate relationships have a small r^2 value. \n\nMy question is which one is more true the relationship among groups or the aggregate, and how do I determine this? ",
        "comments": [
            "Without more context it's not possible to provide any kind of useful input to your question. We don't know what the X and Y axis are and we don't know what the context the categories carry.\n\nFor example if this was a demand vs price curve where X: price and Y: demand, then yeah logically as your price increases your demand drops, because fewer people are able to buy it at higher prices.\n\nBut, if your categories are all luxury goods of different types, e.g. luxury handbags, luxury cars. Then within each group you'll see something like a positive correlation, because for luxury goods a higher price makes it appear more premium, boosting demand for that category.\n\nBut the average price and demand for each group will be different; cars are generally more expensive and people buy more handbags. This will make your population's correlation negative.\n\nIn summary, we need more context of what problem you're trying to solve and what the variables mean, else no useful analysis can be done.",
            "The question boils down to: is the categorical feature a confounder? If yes, adjust for it. Otherwise, the global aggregate is correct.\u00a0\n\nIn your case, the bosses have different difficulty levels: more health, more damage. This causes both the DPS units to need to do more damage, as well as the healers needing to heal more (even if the healer damage means how much they chip at the boss\u2019s health, the argument still holds). The categorical is a confounder here, so the groups are correct.",
            "Both, maybe. Or maybe neither. It really depends on what your data looks like.\n\nLook at the figures in the wikipedia article for Simpson's Paradox and you'll see immediately why it exists and isn't really much of a paradox.\n\nSo, plot your data and see what's going on in your case.",
            "Contrived interpretation example is: while at the population (assuming here) level, it may appear that taller people are wealthier, this is because men are on average both taller and wealthier. When examining the relationship within men and women separately, shorter men are wealthier than taller men, and shorter women are wealthier than taller women. \n\nSo, neither is more true, truthyness depends on your question / Interpretation goal.",
            "Your Y is categorical?\n\nSounds like you picked the wrong model."
        ]
    },
    {
        "id": "1buvnuw",
        "datetime": 1712158544.0,
        "flair": "Career Discussion",
        "title": "[Need advice]I Want to Leave MAANG a Month After Being Hired",
        "score": 112,
        "comment counts": 101,
        "content": "(Hope this is the right subreddit as many people post job-related/job-seeking questions. If not, I'm sorry, that's an honest mistake.)\n\nHello everyone. I was recently hired by one of the big corporations, and I would appreciate advice from the community.\n\nI was offered a high-level DS manager position at one of the MAANG companies. During the hiring process, I discussed with the hiring manager that I would start as an individual contributor, focusing on helping to improve the recommendation engine. Several months later, I was supposed to transition into managing a small team. I was extremely excited, it\u2019s MAANG, after all.\n\nHowever, on my very first day of the job, my manager\u2019s peer informed me that there had been a reorg. As a result, the tasks we had initially discussed would not be happening. To make matters worse, it turned out that my manager\u2019s planned promotion didn\u2019t materialize, leaving no team for me to manage. I thought, \u201cOkay, I can live with this. Worst-case scenario, I\u2019ll transfer to another team later.\u201d\n\nFast forward two weeks, and I honestly hate it. The tools we use are awful. Simple tasks that would take minutes outside of big tech now take up to an hour because our small databases constantly crash under load. Even building a basic causal model becomes incredibly challenging because we can\u2019t extract data from the database for a sufficiently long period. When I raised these concerns with my manager, their response was, \u201cI\u2019m not a fan of complex modeling. If you want to do that, let\u2019s outsource it to the Data Engineering (DE) team.\u201d\n\nWhat\u2019s more, my daily work involves adding charts to dashboards and handling ad hoc requests from various stakeholders. Nobody can clearly state our long-term goals or what we\u2019re trying to achieve. Our OKRs are DAU and Revenue, which are essentially driven by other company products (lol). The most disconcerting part is that every other Data Scientist in our part of the organization seems to be doing the same \u2014 ad-hocs, reports, and dashboards. Frankly, this doesn\u2019t feel like DS work at all.\n\nIn my previous workplace, I held the responsibility of creating models and introducing new ways to leverage data to increase our KPIs. My team was first to introduce causal modeling and ML-based user segmentation, we\u2019ve also piloted several other things that involved shipping models to production. When I decided to leave, my employer has repeated several times that should anything go wrong, I shouldn\u2019t hesitate and come back to working for them. I\u2019m seriously consider this option now. By the way it wasn't a small shop, that was an international company with petabytes of data and over a billion of revenue per year. Not the MAANG level though.\n\nMy current job offers better compensation, particularly if I can vest my stock options. However, during these several weeks, we\u2019ve already experienced another reorganization (3rd or 4th in 1.5 years), making it highly unlikely that I\u2019ll vest before the next round of layoffs.\n\nWhat would be the right move? To those familiar with MAANG companies, I\u2019d appreciate your insights. Is this situation normal?",
        "comments": [
            "Big Tech companies are huge, it seems like you\u2019re caught in a bad team. I would wait some months and try an internal transfer. Your compensation is probably great  and the company looks great in a CV, so I\u2019d be wary of jumping ship without anything better lined up. Is your old company something that will have better leverage on your career long term?",
            "You're not missing anything. For reasons unclear to me a lot of tech company \"DS\" jobs are really analyst positions and can be surprisingly low on technical skill requirements. I did 18 months at one and it felt like my resume was rotting. Now I lead a team at non-tech company and can say I wouldn't hire any ds I worked with at the tech company not because they weren't good people but because they didn't have the scientific or technical skills ds need.\n\nThat doesn't mean you should leave, but I think it's a better idea than it would have sounded in 2020 or so.",
            "You need to decide what\u2019s most important to you; the clout of MAANG or the personal satisfaction of the other job.",
            "It seems like you are coming in with preconceived notions about the right way to do things. There are probably very valuable things to learn from your peers who are also top notch. Explainable more simple models can be really cool as I am just learning as a junior who wants to throw XGBoost at everything.\u00a0",
            "All sounds pretty standard really. Four re orgs for me already this year"
        ]
    },
    {
        "id": "1bvcjcp",
        "datetime": 1712199521.0,
        "flair": "Career Discussion",
        "title": "(Need Advice) Got a job offer after a long time off, need tips on negotiating. ",
        "score": 12,
        "comment counts": 19,
        "content": "Hi, I would like some advice/perspective on my situation if you would be so kind! \n\nI left my previous position 22 months ago due to a toxic new director. There, I was making 84K with 4.5 weeks of PTO. It went from at office to once a week after the pandemic hit. I have drained all of my savings at this point. I have been searching with little luck, even with 6 years of experience. \n\nA CEOof a local company reached out to me on LinkedIn, and we had some good conversations. He asked what my rate was, and I told him 100K - 110K. He seems genuinely interested in being data driven to expand into new territory. He was impressed with what I did with his test data, and I feel like he would be a good boss. He currently works remotely, but he plans to move to my city in the summer. \n\nHe called an hour ago to extend an offer. The offer is 70K plus 20% bonus. I told him that's within range (probably a mistake). I also brought up that I would like to work remote by default. I could come into the office if needed for a meetinf or task, but that I was healthier and more productive working from home (I lost 40-50 pounds when my previous organization went remote). He said that he felt that would hurt productivity, and that flexibility may be available but he doesn't want to say yes because he can't \"condone\" it. I said that was not a deal breaker (also probably a mistake). We ended the call with him saying that he was going to send the offer letter over and that he wanted to call \"so you wouldn't be surprised.\"\n\nI went from feeling excited to feeling anxious and disappointed. The salary, even with bonus, was not what I was making at my previous position in 2022. Which I think I would be okay with... If I could work remote even 3 days a week. His resistance to the idea of even working remote on days where it would be helpful to be home (like apartment maintenance or vet appointments) was also very disheartening. Add also that there are only 2 weeks of PTO in the offer letter I just received. \n\nAfter talking with a friend, I have decided to at least make a counter offer for the salary. I want to ask for 85K plus the bonus. What I really want is a 2 on/3 out of office schedule. I am thinking about saying I appreciate the offer but the offer is less than I made in the previous position and that I would be happy with either 85 plus the bonus or 2/3 hybrid schedule. My fear is that he becomes resentful of my asking for the hybrid given his stance of it. My lesser fear is that they say no, I take it anyway, and that sets up a weird power dynamic that I am on the losing end of. He knows I've been looking for a while, so I don't really have leverage. I do have Adhd, and I susoect autism as well, but I don't know how to bring that up as an additional reason to allow remote work. \n\nI would love to hear feedback on negotiating offers in general and also negotiating for hybrid specifically. I would also like to hear perspective from people who moved on from their organizations after 4-12 months if full time office affects my mentality that drastically. \n",
        "comments": [
            "This position doesn't seem like a good fit, to be honest. I doubt the guy will budge on the total compensation in this market, especially as it seems like he is trying to get somebody as cheaply as possible. \n\nYour options are limited, though, as you are not currently employed. \n\nIn this situation, I would take the position (maybe try to negotiate to something like 77k, but a lower bonus as a \"compromise\" to keep TC the same, say 10% instead of 20%) and then keep looking for a new position. \n\nIt'll be easier to find a new job when you already have one, and you'll be in a better position to negotiate. \n\nIf this were the 2021/2022 market, I'd say you could be quite a bit bolder on negotiations, especially on the work from home stuff, but things have changed a lot since then.",
            "If you\u2019re unemployed, you take that job. No push back. I get that you may be more productive from home, but starting in a new company they need to trust you. If no one works remote, having one team member remote doesn\u2019t work for the whole team. \n\nYou can\u2019t ask to have remote flexibility and ask for more pay. Or you can, but if you don\u2019t have other offers lined up, you gotta be prepared for them to rescind. \n\nI\u2019d remove the hybrid condition. A lot of older employees see it as a cop out. They think hybrid workers are lazy and ineffective at home. There\u2019s studies on both sides of this argument with conflicting information, so I\u2019ll leave it at that. All I know is all the employers I\u2019ve interviewed with in the past 6 months have stressed the importance of in person working. This is for small market, 2000-3000 person companies.",
            "Being unemployed for too long is not a good look on your resume. Take this offer and keep a lookout for other opportunities and move as soon as you find another one.\u00a0",
            "I was laid off since 2022 and finally got an offer earlier this year. I\u2019m making 20+% less than I did before. I took the offer because I was ready to move on with my life. \n \nI did negotiate, despite not having a job or another offer. I went in mentally prepared to lose the offer. The way I handled it was to avoid any phrasing that sounded like I might decline. I said, \u201cI\u2019m thrilled about the offer and I can\u2019t wait to join the team. I am targeting a higher salary\u2014can you do $x?\u201d I got a little extra money, but honestly, I can\u2019t recommend this unless you\u2019re ready to lose the offer. The rational thing to do is to accept.\n\nI really like my new team so I\u2019m going to stay and try to build a career here. But your new job doesn\u2019t look like a great fit. If I were you I\u2019d take the job, start networking NOW, and then start applying around 8 months in. That\u2019s plenty of time to stay at a job that isn\u2019t a great fit. Just stay longer at the next one.",
            "Take the job, keep looking for another job, if you can\u2019t find another one at least you can put this down on your resume after a year."
        ]
    },
    {
        "id": "1bvcdq5",
        "datetime": 1712199059.0,
        "flair": "Discussion",
        "title": "What is a Data Visualization Grammar?",
        "score": 7,
        "comment counts": 7,
        "content": "There are many ways to create visualizations, between chart choosers, chart wizards, GUI-based tools of various flavors, and of course, many libraries if you\u2019re looking to use code. Many of the latter describe themselves as *grammars* or *grammar-based*. But what does that mean?\n\nThis is a great article written by Robert Kosara, a Data Visualization Developer at Observable. Source here: [https://opendatascience.com/what-is-a-data-visualization-grammar/](https://opendatascience.com/what-is-a-data-visualization-grammar/)",
        "comments": [
            "Thank you for sharing!",
            "Thanks",
            "Look up the book Grammar of Graphics by Leland Wilkinson. \n\nThis is a very influential book that was the inspiration for ggplot in the language R. Also was at the core of visualization in Tableau. And Tableau itself, when launched, was a game changer and forced other BI/ charting software to up their game.\n\nWilkinson is a legend in visualization circles. He VP of Statisrics was with Tableau and later with H2O.ai. before that he was a researcher and academic focusing on statistical graphics and visualization. He developed the Systat package in 1980s, which was then integrated into SPSS in the 1990s.",
            "Following",
            "Thanks. This is neat!"
        ]
    },
    {
        "id": "1bv0t1l",
        "datetime": 1712170392.0,
        "flair": "Discussion",
        "title": "How many ad-hoc data requests do you all get a week?",
        "score": 29,
        "comment counts": 30,
        "content": "Pretty much title, I guess how painful is this whole process for you? \n\nDo you all get as many data reqs as my data team did? If you do, what's your role, company size, and who are your stakeholders? How much percent of your time/overall suffering is spent on them and how do you go about fulfilling them?\n\nCurious if it's as painful for others as well.",
        "comments": [
            "If you work in a role where this is a big % of your time, it might be worth making a ticket tracker that's visible to *at least* your manager but possibly your other internal customers. You want high visibility of the fact that you have to prioritize. If different people want to be prioritized they can fight about it with your manager.",
            "Almost every other day. I work in a mid-sized analytical company as a senior analyst. Most of the times my stakeholders just as for some quick data pulls or mappings, which are just a waste of time. Otherwise there will be some small deep dives into the data",
            "So many that I could spend 40 hours just working on other people\u2019s requests if I don\u2019t say no\u00a0",
            "About 2-3 ad hoc tasks a week. I am a Data Scientist mostly working on ML models and data pipelines, 1,000 - 2,000 size company, mostly for Marketing/Product/Compliance managers.     \n\n\nThe only painful process for me is that we don't had a ticketing system that is uniform for all teams. So there ends up being little visibility for each request outside me and the stakeholder. This causes a lot of similar duplicate requests between teams and sometimes difficult workload management for the DS team since these request are tracked in different locations.",
            "Some days I'm the hammer. Some days I'm the nail.\n\nI think it's 50/50 that other shareholders have data asks of me vrs me needing access to data &/or domain expert time.\n\nTotally agree with other comments: document you time on these tasks and communicate to your manager(s) where you spend time."
        ]
    },
    {
        "id": "1bvdkzl",
        "datetime": 1712202693.0,
        "flair": "Discussion",
        "title": "Any learning resource recommendations about space usage optimizations.",
        "score": 5,
        "comment counts": 3,
        "content": "Anyone here work on optimizing space usage and have any good resources for learning? \n\nI'm working on a problem similar to retail space optimizations where the goal is to find out the best combination of product/promotions to be placed in store that would maximize profit. There are some research papers on store space optimizations but most of them seem quite theoretical. Any leads to applied resources would be much appreciated \ud83d\ude4f",
        "comments": [
            "Optimisation is always theory, and it can be broken down like this \n\n1) cost function - how good your current setup is optimized\n2) search algoritm - how to manipulate variables\n3) ????\n4) profit",
            "You might be looking for Association Rule Mining (https://en.wikipedia.org/wiki/Association\\_rule\\_learning)",
            "Im curious about the research papers you've mentioned here, would you possibly be able to send them my way?"
        ]
    },
    {
        "id": "1bv02xw",
        "datetime": 1712168763.0,
        "flair": "Career Discussion",
        "title": "What do top companies test Data Analysts on (versus Data Science)?",
        "score": 21,
        "comment counts": 16,
        "content": "Unlike data scientist interviews full of stats and possibly machine learning, I feel like I don't know how to prep for data analyst interviews. Before moving on to DS prep (I'm on the verge of what some firms would accept as DS, and what most would be just DA), I wanted to make sure I'm absolutely golden in all aspects of a DA interview.\n\nDatabase wise, I've for the R in CRUD down from every day work. I can work on the C, U, D but not really sure where to focus my efforts. I've also got normalization and keys on my list, but besides that, is there anything more to do here?\n\nThen, I'll do some leetcode as a show of general programming ability. I use Python somewhat frequently, but don't have any major projects under my belt.\n\nBesides those two, what else do top data analysts get tested on during interviews? It's hard to test dashboarding skills so not sure what I need to do in regards to that.",
        "comments": [
            "Depends wildly on the company/role, so base your prep on the JD as much as possible. But for analyst roles, I\u2019ve usually been tested on SQL, how would you join these tables, how would you visualize this data, various group by / ranking questions, binning and building histograms, design and evaluate an AB test, that kind of thing.",
            "Come good points already mentioned. Know your advanced SQL, Python etc. What people often forget is the use case of analytics. Analytics helps a business understand how to maximize value and alleviate pain points. One thing to focus on is how does the data relate to the business. I do quite a bit of testing including A/B, MAB MVT etc so depending on the role I\u2019d be sure you\u2019re comfortable with stats and making recommendations based on experiments.",
            "Give u several datasets to merge and manipulate / clean/transform. You are to create new columns based off of the transformation. \n\nI.e  you\u2019ll be given a sample output of what the new columns will look like.\n\nHouse buyers (population) dataset, house prices dataset, house sales dataset.\n\nIt prolly tests all the basic syntaxes + writing your own functions in python and then applying it to dataframe.\n\nYou have 1-2hr + to do it live.\n\nI.e it\u2019s just like leetcode, where you have the input and sample output, and u just have to code them out",
            "Depends on the job but usually some form of HR screen, hiring manager screen, SQL / python test, or take home assignment (some data munging, build a dashboard or viz) and then an onsite. If the role overlaps with DS analysts, throw some a/b testing and case studies in there related to analytical thinking",
            "Focus on SQL and ask to see some of their projects"
        ]
    },
    {
        "id": "1but6uu",
        "datetime": 1712152440.0,
        "flair": "Analysis",
        "title": "Help with Multiple Linear Regression for product cannibalization.",
        "score": 45,
        "comment counts": 26,
        "content": "I briefly studied this in college, and chat gpt has been very helpful, but I\u2019m completely out of my depth and could really use your help. \n\n We\u2019re a master distributor that sells to all major US retailers. \n\nI\u2019m trying to figure out if a new product is cannibalizing the sales of a very similar product. \n\nI\u2019m using multiple linear regression. \n\nIs this the wrong approach entirely?\n\nData base: Walmart year- Week as integer (higher means more recent), Units Sold Old Product , Avg. Price of old product, Total Points of Sale of Old Product where new product has been introduced to adjust for more/less distribution, and finally, unit sales of new product. \n\nSo everything is aggregated at a weekly level, and at a product level. I\u2019m not sure if I need to create dummy variables for the week of the year.\n\nThe points of sale are also aggregated to show total points of sale per week instead of having the sales per store per week. Should I create dummy variables for this as well?\n\nI\u2019m analyzing only the stores where the new product has been introduced. Is this wrong?\n\nI\u2019m normalizing all of the independent variables, is this wrong? Should I normalize everything? Or nothing?\n\nMy R^2 is about 15-30% which is what\u2019s freaking me out.  I\u2019m about to just admit defeat because the statistical \u201ctests\u201d chatgpt recommended all indicate linear regression just aint it bud. \n\nThe coefficients make sense (more price less sales), more points of sale more sales, more sale of new product less sale of old.\n\nMy understanding is that the tests are measuring how well it\u2019s forecasting sales, but for my case I simply need to analyze the historical relationship between the variables. Is this the right way of looking at it? \n\nEdit: Just ran mode with no normalization and got an R2 of 51%. I think Chat Gpt started smoking something along the process that just ruined the entire code. Product doesn\u2019t seem to be cannibalizing, seems just extremely price sensitive.",
        "comments": [
            "Also not my industry, but I think you should take a step back and consider the specific quantities that need to be isolated to get at this question before even thinking about regression. In particular, my first pass guess is that you need to capture both whether absolute sales of new product A are increasing while old product B is decreasing over time, but this might not be correct. Make plots, tables, descriptive stats. A useful comparison here would also be to compare stores where the product was rolled out against stores where the product was not rolled out, to see whether old product B sales decrease in stores with the new product compared to those without. IMO you could almost answer this question entirely descriptively, and then build a model later down the line once you\u2019ve got a good handle on the fundamental patterns.",
            "You want to model the effect that sales of Product A (Asales) has on the sales of product B (Bsales). \n\nYou could use a simple linear regression, such that: \n\nBsales = Asales + other factors + error term. \n\nA model of this make will be able to give you an idea of the association between your products sales numbers.\n\nIf you happen to have data on a granular enough level, and know the date the product was introduced,  you could specify the following model for a more casual estimate:\n\nBsales = Asales + DateIntroduced + Asales * DateIntroduced + other factors + Errors\n\nWhere your \"Asales * DateIntroduced\" is an interaction term that captures the casual effect of the introduction of your product into a respective market.",
            "You might want to take a look at multivariate time series",
            "I agree with Ocelot. If you wanted to, you could plot the cumulative sale of old and cumulative sale of new on the same plot (two lines overlapping) and consider each individual store to see which prefer old vs new. If it\u2019s local, plot on a map which stores like the old vs new. Maybe there\u2019s regional favorability.\n\nYou could smooth your plots using a moving average \u2014 which points towards autoregression (auto ARIMA) but this seems to be overkill. You could just straight up slap a Gaussian process over it so you get an uncertainty measurement. However, what\u2019s likely more important is communicating the trends and performance of the new and old product rather than actually forecasting.",
            "30% r2 is basically nothing and it's hard to really say what's wrong without seeing the sales (time series). This was my field for 10 years, so I can be your huckleberry. \n\nYou may see 30% because you have two massive peaks of sales that aren't being fit. For this you need to account for trade/promotion/seasonality\n\nYou may see 30% because sales are increasing as you grow shelf presence, but have no variable to account for this (an average item fact will help). Or if due to increased number of stores sold a distribution fact will help. These may require smoothing if unstable week to week. \n\nIt could be a low fit because product level is too granular and it's just causing noise. Sub brand or brand level may be a more stable level of analysis. \n\nAnd there are a handful of additional ideas that may cause the low fit. It's truly hard to say without seeing the data. \n\nWith aggregated data you are not likely to find an interaction effect. Consumers don't just switch products in large numbers without some impetus (trade, lower price, media, etc.). This behavior is more noticable at a store level, but becomes very washed out at aggregate levels. This doesn't mean cannibalization doesn't occur, just that it's hard to see at an aggregate level. \n\nEdit: happy to help if you've got questions. If you DM me a screenshot of your sales I may be able to help more precisely (you can take away the x and y axis to make it more masked)."
        ]
    },
    {
        "id": "1bv21ju",
        "datetime": 1712173168.0,
        "flair": "Discussion",
        "title": "Data science project management tools",
        "score": 9,
        "comment counts": 14,
        "content": "Hi everyone :) I've been tasked with finding a user-friendly tool for project management and I am   \n struggling to choose one. We are a team of 7 data scientists/analysts and I have not formally managed a team before.  \n\n\nI've been considering Jira, Trello and Microsoft teams planner but since I haven't used either before, I'm not sure which one to prioritize. It does not have to be one of these either. \n\n&#x200B;\n\nIdeally, we want something simple, quick and SAFE because we work with a lot of sensitive data. Could you please share your experience on managing projects and what has worked best for you and your team? \n\n&#x200B;\n\nThank you!",
        "comments": [
            "Jira has always been the industry standard for me and I never had any complaints about it",
            "Jira is by far the best I have used, it is very flexible and easy to scale when projects get more complex and boards can easily be customizable based on a teams Scrum/Sprint style or workflow. \n\n I would only consider Microsoft teams planner if your company is fully integrated into the microsoft365 stack, but even then I think it is a far worse product.",
            "Jira is standard at most companies. I work on a small team right now that decided not to use our company\u2019s jira for project management, and are using SmartSheet instead. I don\u2019t love smartsheet, but it\u2019s a lot simpler than Jira for sure. It might be worth looking into if Jira is overkill.",
            "We use Jira, if you use confluence and bitbucket the integrations are great. Plus there\u2019s VSCode extensions available to make tasks more streamlined. Have been looking into linear to plan out school work and personal projects though!",
            "Do you code a lot? Notebooks and such? GitHub/GitLab issues, epics might be enough. Has been for my team."
        ]
    },
    {
        "id": "1bukjw2",
        "datetime": 1712121462.0,
        "flair": "Career Discussion",
        "title": "Student wanting to maximise last year of study",
        "score": 29,
        "comment counts": 31,
        "content": "Hi,\n\nI'm in my final year of my BSc, major is not data analytics but my minor is. I'm learning SQL on the side, once in comfortable with that I'm going to look into python a little. What can I do to maximise my potential? I've seen people comment about portfolios, I would love any suggestions on how to wrangle that. \n\nFor context: used to live in a house truck in the woods. No smart phone or computer. Last six years I've turned life around and taught myself everything, including the tech knowledge I needed before starting university. So I am still new to some things, but I'm working really hard to make myself a decent candidate for jobs. I've got 20 years of workforce experience behind me, up to management level, so I'm not a spring chicken. ",
        "comments": [
            "Projects, projects, projects. Ideally you do projects that interest you. Maybe throw your financial statements into DuckDB and build out some visuals with Python. Grab some baseball (or insert your favorite sports here) stats and analyze them. I find my best learning comes from building projects that interest me. Throw a stats book for good measure, but I would lean 80% projects and 20% books. This would all happen after you finish your BSc, of course.",
            "In this economy the problem to solve is this one: \n\nHow can you leverage your 20 years of experience to differentiate you from fresh grads, leapfrog over internship bullshit and immediately add value to any team or organisation? \n\ni.e. either do your previous job in a data context or use your data expertise to do your previous job better.",
            "I emailed around 50 professors in my university asking if they needed help with any project. I was in my final year as well and had just completed a few python and ds courses online. Got to work on a research project and got an lor out of it. It's your best bet I think, because they'll understand when you have to focus on your exams. Good luck!",
            "Try to get experience where you can. Get an internship. Look for local hack nights or project nights. Check with your department or professors for projects you can contribute to. Look for opportunities online to volunteer or contribute to open source projects. Do your own projects if none of that stuff works out. \n\nBuild your network. Reach out to alumni. Attend local industry events. Join Slack & Discord communities.",
            "There are a ton of intro free python courses online- or small learning projects. Even just going to [python.org](http://python.org) is a good start. Otherwise there are youtube tutorials. Looking into different data visualization libraries once you get comfortable with python would be a good idea too! Plotly, matplotlib, and vaga are some good ones to start with."
        ]
    },
    {
        "id": "1bv52ai",
        "datetime": 1712180128.0,
        "flair": "Projects",
        "title": "Looking for Expertise in A/B Testing and Experimentation for Idea Testing and Collaboration",
        "score": 0,
        "comment counts": 3,
        "content": "Hello all, I'm working on a new A/B testing platform. It's an idea born from over a year in this field. I'm looking for a data scientist interested in joining me.\n\nAbout me - I've been a software developer for 14 years, much of that with startups. Now, I'm at a startup focused on A/B testing. It's been a great learning experience, but I'm branching out on my own.\n\nI've been researching and networking at meetups. Recently, I shared my ideas with a data scientist. They found them interesting. I hope to follow up, though it's uncertain.\n\nMy goal is to create a platform that simplifies advanced experimentation for teams with limited data science knowledge. It uses LLMs for hypothesis generation and more. I've started building and testing these ideas.\n\nI'm seeking a data scientist's expertise in experimentation. Your input would help prioritize development. If interested, you could join me as a co-founder.\n\nI'd greatly appreciate anyone's time or willingness to help ... Thanks",
        "comments": [
            "How do you plan on differentiating yourself from the other experimentation platforms out there - things like Eppo, Measured, and Growthbook?",
            "."
        ]
    },
    {
        "id": "1bvi2wf",
        "datetime": 1712219625.0,
        "flair": "Tools",
        "title": "Does anyone knows how to scrape post on Reddit thread into Python for data analysis?",
        "score": 0,
        "comment counts": 9,
        "content": " \n\nHi does anyone knows how to scrape post on Reddit thread into Python for data analysis? I tried to connect python into the reddit server and this is what i got. Does anyone know how to solve this issue?\n\nAfter the user authorizes the app and Reddit redirects to the specified redirect URI with a code parameter, you need to extract that code from the URL.\n\nFor example, if the redirect URI is http://localhost:65010/authorize\\_callback  \n, and Reddit redirects to a URL like http://localhost:65010/authorize\\_callback?code=example\\_code&state=unique\\_state  \n, you would need to parse the code  \nparameter from the URL, which in this case is 'example\\_code'.\n\nOnce you have extracted the code, you need to use it to obtain the access token by making a POST request to Reddit's API token endpoint. This endpoint is usually something like [https://www.reddit.com/api/v1/access\\_token](https://www.reddit.com/api/v1/access_token).\n\nHere's a general outline of how you can do it:\n\n1. Extract the code parameter from the redirect URI.\n2. Make a POST request to Reddit's API token endpoint with the code, along with your app's client ID, client secret, redirect URI, and grant type (which is typically 'authorization\\_code'  \n).\n3. Reddit's API will respond with an access token.\n4. You can then use this access token to authenticate requests to the Reddit API.\n\nThe specific details of making the POST request, handling the response, and using the access token will depend on the programming language and libraries you are using. You'll need to refer to Reddit's API documentation for the exact endpoints, parameters, and response formats.",
        "comments": [
            "What's the issue? It looks like you just posted some chatgpt response.\u00a0",
            "\u201cHey Reddit, does someone want to build me a scraper to read post threads?\u201d Fixed that for you*",
            "Beautiful soup",
            "https://praw.readthedocs.io/en/stable/getting_started/quick_start.html",
            "Beautiful soup."
        ]
    },
    {
        "id": "1bv13l1",
        "datetime": 1712171038.0,
        "flair": "Education",
        "title": "fundamentals of LLM: A story from history of GPTs to the future",
        "score": 0,
        "comment counts": 0,
        "content": "",
        "comments": []
    },
    {
        "id": "1bu6fvf",
        "datetime": 1712083259.0,
        "flair": "Career Discussion",
        "title": "How do a data scientist should expand into MlOps / Data Engineering?  ",
        "score": 63,
        "comment counts": 26,
        "content": "I have been working in data science in the retail industry for almost 3 years, the first 1.25 years as a data science intern & later 1.25 years as a data scientist. Till now, I have mostly worked on projects from POC to market test / backtest. I have not had a chance to push the model into production. But with the advent of AI automation, I do not wish to just stick to being a notebook data scientist & expand my expertise to ML engineering / MlOps. I am confused about how & from where should I start since it is such a vast ocean. I would be grateful if you guys could provide me with some starting points. Thanks !! ",
        "comments": [
            "Read Machine Learning Engineering in Python by Andrew McMahon. Just finished it and I think it\u2019s great. I\u2019ve been productizing models for a few years and it does a good job of summarizing the stack.\n\nOtherwise, as a first step, try to build a toy model, save it to AWS S3, and use the UI to click around and write up a Lambda Function that reads the model into memory and returns the output. Those are the very basics of deployment, done the quick and dirty way.",
            "There's a well-regarded website/course called [MadeWithML](https://madewithml.com/) by Goku Mohandas that is worth checking out. There's also [MLOps Community](https://mlops.community/), which has a Slack, and Chip Huyen's [ML Ops Discord](https://discord.com/invite/Mw77HPrgjF), which I think is about to do a book club on Chip's book [Designing ML Systems](https://www.amazon.com/Designing-Machine-Learning-Systems-Production-Ready/dp/1098107969). Oh and r/mlops.",
            "I'd split it into a few things:\n\n**Some good resources first that I've used to start:**  \nFundamentals of Data Engineering by Joe Reis, Hands-On Machine Learning with Scikit Learn, Keras and Tensorflow by Geron.\n\n**Key concepts to start:**  \nFocus on understanding the lifecycle of machine learning models (from dev to prod), continuous integration and deployment (CI/CD) pipelines, monitoring and maintenance of models in production, and the DE necessary to prepare and manage data at scale.  \nFor MLOps: Learn about model versioning, experiment tracking, model deployment, and monitoring. Familiarize yourself with platforms like MLflow, Kubeflow, and TensorFlow Extended (TFX).  \nFor Data Engineering: Focus on basic ETL (Extract, Transform, Load) processes, data warehousing, and data pipelines. Tools and frameworks like Apache Spark, Apache Airflow, and cloud data services (AWS Glue, Google Cloud Dataflow) are good to learn for big data.\n\n**Building**  \nPretty much any project that involves the end-to-end lifecycle of a machine learning model. This could include building a model, deploying it using a tool like Docker, and setting up a simple pipeline with Jenkins.",
            "These are just special cases of software engineers.\n\nData engineering is basically SQL, Pyspark and DBA duties with some cloud sprinkled in. The 2000's style writing performant data pipelines age using map reduce in Scala while earning 200k/y are long gone. It's just SQL on top of some data warehouse tool and the pay is shit compared to the good ol' days.\n\nML Engineering is all about turning jupyter notebooks and csv files into a working ML system. Usually scheduling it as a batch job. It can be just learning Airflow and saving that notebook as a .py and scheduling it/rewriting the damn thing in Pyspark or it can be rewriting everything in Kotlin or Swift and making it work on a phone in real-time without eating the battery or burning a hole in your pants. Mostly the former though.\n\nMLOps engineering is basically DevOps but for ML. It can be super simple like installing Kubeflow and MLFlow for data scientists and ML engineers to use or it can be architecting and engineering an internal ML platform with AutoML features so you can make production ML models using SQL or clicking a button in an UI.\n\nThe answer on \"how\" is to get a degree in computer science (or equivalent knowledge on your own) and learn about system design, webdev, DevOps, Cloud etc. while not forgetting the data, stats or ML side. The key is knowing the basics of basically everything related to software systems and going deep on the ML/Data specific stuff.\n\nMost people I know in the field are CS people/software engineers with an interest in ML/Data and they learned all of this in school/at work before getting into ML. Very rarely it's the other way.",
            "Build and deploy an ML model as a side project and make your code available on GitHub. Write good, readable code and have diagrams that explain your AI system and data ingestion. That will be way better than taking courses, a master's, or theoretical knowledge."
        ]
    },
    {
        "id": "1bu7vs6",
        "datetime": 1712086589.0,
        "flair": "Discussion",
        "title": "Daily practice websites/platforms for Stats",
        "score": 28,
        "comment counts": 17,
        "content": "Hi all,\n\nThere are a dozen websites and platforms to learn and practice SQL, R, and Python on a daily basis. SQLZoo, Mode, Hackerrank, etc. I'm looking for similar spots to practice stats, one that comes with their built-in datasets, and a gazillion of challenging questions. I'm guessing some coding is required to do the actual calculation, or even an Excel environment would suffice, but I'm looking for the same type of challenges. For example:\n\n'Here is a table with data about burger prices from our 19 restaurants, we think our customers enjoy the Jacky Jalapeno Burger the most and are willing to pay premium for it, but we really don't know for sure, can you check it out?'\n\nFree would be my first choice but I don't mind a paid sub if the quality is great.\n\nThanks.",
        "comments": [
            "I'm a software engineer who is trying to get started in DS and would be interested in building something like this.",
            "Kaggle might not be the daily type you're looking for but it has datasets that you can hop on competitions with other users (all free)",
            "Not exactly what you've described, but ProjectEuler is awesome. They have lots of challenging math (and probably some stats-oriented) questions that require programming to solve.",
            "Stratascratch",
            "StrataScratch has the questions but it's non-interactive as you mentioned. But they're all free. Try [Brilliant.org](http://Brilliant.org) for an interactive experience."
        ]
    },
    {
        "id": "1bttf9c",
        "datetime": 1712046098.0,
        "flair": "Discussion",
        "title": "How stressed are you all? ",
        "score": 151,
        "comment counts": 132,
        "content": "This is a topic that isn\u2019t talked about much on here. But how stressed are you all? As a grad student I get very minimal sleep and frankly looking forward to working after I graduate so I can evade this hell ish schedule of having to come home at 5, do a 10 question problem set, 4 other homeworks and then grade 120 assignments by the end of the week, when instead, come home, eat dinner, watch tv, and then just get sleep. I\u2019m already beyond stressed and irritable as a grad student and I\u2019m just praying work schedule as a data scientist will bring some ease and less stress into my life. Or do you guys work more than the traditional 9-5, and do you often work a lot after hours. ",
        "comments": [
            "Work is easier than school. Not a rule, but often.\n\n\nWhat is the most important, you can (relatively) easily change employer, which is not so easily with school. If there is difficult professor or subject, youbhave to push through it. At job, you cam just find another one.\n\n\n\nThay said, as others said some people are not made for 9-5.",
            "Data analyst working with data scientists. I work from home, usually 7-15, but that's movable to fit my needs. No overtime, no weekends, no commute",
            "I'm stressed but still much better than being a teacher",
            "It really depends on where you work and who you work with. From a work perspective, I'm not very stressed. I have deadlines, but they're very manageable which means I rarely need to work more than 40 hours a week. The biggest thing I've had to learn is when to say no, and usually people are respectful of your boundaries - within reason of course, you can't say no to everything. \n\nFrom a general life perspective, I'm pretty stressed. I have two kids, 3 and 9 months. My life outside work is a never ending cycle of cleaning up toys, wiping bodily fluids, making sure non-edible items aren't ingested, and generally trying to negotiate with a screaming banshee that she can't eat chocolate for dinner. On top of that there's the looming specter of teething and sleep regression.\n\nEdit: after today I\u2019m pretty stressed at work. Anyone wanna take over building a performance marketing experimentation platform for me?",
            "I have been working as data scientist for almost 3 years now. When I was student, the stress was about getting my foot into the industry, now the stress is about fighting the imposter syndrome and constant feeling of inadequacy. You can\u2019t evade the stress you just leave old stress behind and stress about the new thing"
        ]
    },
    {
        "id": "1bu9lrp",
        "datetime": 1712091050.0,
        "flair": "Projects",
        "title": "Analyzing 40 Million Hacker News Items: User Behavior, Story Scoring, and Content Trends",
        "score": 11,
        "comment counts": 2,
        "content": "Hello, \n\nI've conducted an extensive analysis of \\~40 million items from Hacker News, exploring user behavior, the scoring system, and how content trends have evolved from 2006 to the present. \n\nI've also open-sourced the code for anyone interested in the data or methodology. \n\nHere's the link to the full analysis and code: [https://blog.osm-ai.net/2024/04/01/hn-part-1.html](https://blog.osm-ai.net/2024/04/01/hn-part-1.html)\n\nAny feedback or additional thoughts are welcome!",
        "comments": [
            "Awesome stuff mate.",
            "Thanks for sharing!"
        ]
    },
    {
        "id": "1bub0hm",
        "datetime": 1712094559.0,
        "flair": "Discussion",
        "title": "Categorical Root Cause Analysis",
        "score": 6,
        "comment counts": 5,
        "content": "Hi everyone,\n\nWe're trying to solve a problem that I feel should be rather generic, and for some reason I can't seem to find good material about.\n\nSo the problem goes like this: we have a certain metric that we want to maximize, within a range of 0-100%. This metric is computed as the percentage of 1s in a binary column in our dataset. All of the other columns are categorical features, with varying cardinalities. Some might have 3 possible values, some dozens or even hundreds. \n\nI would like to perform Root Cause Analysis, so that I can tell my stakeholders \"Here's what's causing a decrease in the overall metric, go fix it\". However, all of the RCA content I find online handles numerical features, while mine are all categorical. And there's no trivial way to make them numeric, as in, there's no order applicable to their values.\n\nSo what I try to do currently, is look for 'segments' in the data that are underperforming. A segment means any combination of feature values over perhaps just 1, or some, or all features. I chose to call these 'segments', if there's a well-known term for this I would love to be corrected and educated.\n\nNow, since my features may have a high cardinality, the number of segments to look through is potentially huge, so an exhaustive search is off the table.\n\nQuestion number 1: Is this a common problem that has a known optimal solution? An algorithm or perhaps even a library that implements it that I can just use?\n\nSo here's what we did to solve this problem in our use-case, and I'd love to hear opinions or suggestions:\n\nWe one-hot encode all of our data, train an XGBoost (what else...) model on the whole dataset, then use SHAP values to extract the knowledge that our model gained. We recommend to our users the feature-value pairs with the highest mean SHAP values, as the most likely culprits.\n\nTo allow for some combinations to be looked at, we use our top 5 feature-value pairs as filters (one at a time), then run the same logic on each data subset. This produces 5 new top-SHAP feature-value pairs, and we do the same once again. This gives us at most 3 levels of combination. But, I worry that we might be missing other niche segments that are badly underperforming because we got to the 'niche' segments by drilling down from higher-level feature-values.\n\nAnyway, hopefully this isn't becoming too confusing. I'd love to hear thoughts or perhaps some of you have faced similar challenges in the past and have some recommendations.\n\nThanks!",
        "comments": [
            ">I'd love to hear thoughts or perhaps some of you have faced similar challenges in the past and have some recommendations.\n\nConceptually your approach is not bad and it would probably get you good marks if you were thrown this question in an exam or interview.\n\nIn reality, when you have such a complex dataset and you are looking for very specific types of relationships, encoding everything and just throwing XGBoost over and over at it based on SHAP values is probably not going to magically surface the stuff you're interested in, and puts you at significant risk of overfitting.\n\nIt's a boring suggestion but personally I would go back and do more EDA and feature engineering. You haven't talked at all about initial independence testing or dimensionality reduction and both are really critical in cases like this. Subsequent to that and prior to any complex algorithm, using something like association rule learning or decision trees might help you extract some more relationships.\n\nExtracting complex relationships from catagorical data is never particularly easy but the best way to make it easier is to spend a lot of time on the simple stuff and prep your data as well as you can before you throw it into something like XGBoost.",
            "SHAP is not causal. If you create a model to predict rain, you will find that the presence of umbrellas has a large SHAP value, but causality points in the opposite direction.\n\nYou're talking about causal discovery, which is one of the hardest things you can do in data science. There is a vast literature, but [here is a place to get starte](https://medium.com/causal-data-science/causal-data-science-721ed63a4027)d. Roughly, you need to use conditional independence tests to find a series of connections between variables. You will have to assume you've measured all possible confounders, which is of course nearly always impossible.",
            "Mutual information works really well for categorical variables. It actually underlies SHAP values if I'm not mistaken or at least entropy does. It won't give you a model but it will definitely give you some good insight on feature importance.",
            "I would look at logistic regression and using lasso, ridge or elastic net to select your features. This should trim the fat, give you some insight into the most predictive features where things are very complicated."
        ]
    },
    {
        "id": "1bua71x",
        "datetime": 1712092583.0,
        "flair": "Discussion",
        "title": "Newbie Seeking DS Project Ideas",
        "score": 3,
        "comment counts": 13,
        "content": "Hey everyone,\n\n**Fresh data science learner here!** Looking to jumpstart my portfolio with impactful projects (EDA, ML, anything relevant!). Hit me with your best ideas!\n\nThanks!  \n\n\n`For mods: Apology if this post is against the rules. Let me know, I'd be careful from next time.`",
        "comments": [
            "This is a bit of self-promotion, but here are some sources for you to check out!\n\n* [2024 Ultimate Guide: 90+ Free Datasets for Data Science](https://www.interviewquery.com/p/free-datasets)\n* [Top 39 Data Science Projects with Source Code (2024)](https://www.interviewquery.com/p/data-science-projects-with-source-code)\n* [36 Data Analytics Project Ideas and Datasets (2024 UPDATE)](https://www.interviewquery.com/p/data-analytics-project-ideas-and-datasets)\n\nIf you need more, we have tons on our site\u2014and feel free to DM if you need anything else!",
            "Reposting my response from someone who posted the same thing earlier today:\n\n> I think you'll find more success if you take on a project on a topic that you're passionate about. If you love movies you could predict genres from synopses, or if you love running you could visualize and analyze your running routes. Picking a random project from a list might give you the opportunity to do something technically interesting but I've found that the best \"portfolio projects\" are the ones that candidates pursued out of a genuine interest. You'll likely end up spending more time and going more in depth on something that's genuinely interesting to you.\n\nhttps://www.reddit.com/r/datascience/comments/1bt1ova/what_could_be_some_of_the_projects_that_a_new/kxovuqm/",
            "This is helpful",
            "Make a portfolio management tool using convex optimization and write a REST API / dashboard around it",
            "i wanna start learning with projects as well they're a good help"
        ]
    },
    {
        "id": "1bto1ay",
        "datetime": 1712026611.0,
        "flair": "Career Discussion",
        "title": "Data science manager conundrum",
        "score": 107,
        "comment counts": 76,
        "content": "Hello everyone. As a data science manager, managing a team in a rather complex organizational structure, I find myself in a bit of fix. While I enjoy people management and helping solve non technical problems and suggesting methods in technical aspects for the team to solve, it also leaves me with little to no time to do any data science stuff as most of my days are filled in meetings, responding to emails, team strategy, finances etc. I am worried about my tag as a data scientist as I continue to be involved in the managerial path, forget anything leetcode level. I would massively fail there at this stage of my career. Most companies that are hiring seem to want people who excel at technical stuff while managing teams. While personally this seems impractical when implementing, I would still need to crack interviews that might be heavy on data science stack. \n\nSo the question is, how do I manage to keep my self updated on technical stuff when I barely have the time in my role? I could keep brushing some of my core data stuff in the time outside my work hours but learning continuously on the side is going to eat away into any of my personal time as well. Advice?\n\nPS - I am not in the tech industry nor am I interested in it.",
        "comments": [
            "If you have solved that let me know... I'm right there with you in a nearly identical situation",
            "I struggle with this too. I\u2019m constantly taking courses, reading textbooks, and working on open-source projects to maintain/improve the skills. Although, it\u2019s not the same as building at scale.",
            "I feel you. I'm in a similar situation.\n\nOne thing I do is the following. I assume you have performer's in your team. You can learn by letting them walk you through their solutions / code in detail.\n\nFurthermore, I use my position to connect with a broader group of people. Among which there are more data scientists and learning from them broadens my understanding. I might not be able to technically apply their approaches but if I had to, I'd know where to look tk get the job done.\n\nIMO there is no way around getting rusty but therefore you broaden your knowledge.\n\nAlso, companies have to decide what they want. Either they want a traditional manager like you are right now from what I understood or they want leaders that are heavily contributing indivually too. The latter is what I believe / have heard is the case at Amazon (happy to stand corrected). If your company wants that too, you should be allowed to limit the time you spend with the team. In that scenario team members learn to swim without much help or they don't and that's accepted as collateral demage by superiors. I personally prefer your company's model but that doesn't matter much.\n\nA last thing, I'd avoid to take on tasks if you have limited time. I did that (more our of business need than to up skill) and I learnt that I simply couldn't go as deep as required due to my other duties. Nobody won in these episodes. At least things didn't go as well as they could have if I'd approached it differently.\n\nHence, I started to work closer with team members where I helped them with the concept to get the project started and they'd do the coding work to bring it home. Everyone, including stakeholders, won in these episodes.\n\nSorry for the unstructured writing up but I hope it helps. I really appreciate your question and wanted to participate in some way. Let's keep it going :)",
            "Welcome to middle management.",
            "Ever watch Oppenheimer?  He went from a respected practicing scientist to being a respected manager.  In the movie I think they had a line about him being \"out of the game\" for years  and \"was a politician now.\"\n\nThis is a pretty common career growth pattern.  It's happened to a couple of data scientists that I know.  And, I recall a story I read somewhere on reddit where all the engineers and scientists declined to step forward and suffered when an ignorant, power-hungry micromanager eventually stepped up.\n\nIf it makes you feel better, check out the four stages of career development described in the book \"Novations\" and used by various consultants: dependence, independence, coach/mentor/manager, and visionary.  It sounds like you're about to transition from Stage Two to Stage Three, and it's a different ball game."
        ]
    },
    {
        "id": "1bugbq3",
        "datetime": 1712108359.0,
        "flair": "ML",
        "title": "Interesting Scrapable Publicly available ML database that Can be retrieved via APIs ",
        "score": 1,
        "comment counts": 10,
        "content": "Looking for some tabular data where i can apply ML techniques .  And I need to scrape ot off using API calls or something similar. I cant use static data ..  For a class project. \n\nPS : Dont provide data where Time Series is applicable. I found plenty of  such data. ",
        "comments": [
            "A lot of video games (valorant, league of legends, cs:go) have accessible data. I like kayaking so I did a project pulling weather data from noaa (I think..it\u2019s been a few years) and the USGS on weather and river flows. I think Basketball Reference has an API if you\u2019re into basketball.\n\nWhat are some of your hobbies? I\u2019m positive there\u2019s data available to you as a student in fields you\u2019re interested in.\n\nYour profile, which has some interestingly conflicting posts lol, shows you\u2019re into F1 and football. It took one Google search of \u2018F1 API data\u2019 to pull up a Reddit thread linking an active API. You got this!",
            "https://github.com/public-apis/public-apis\n\nThis repo has list of public apis which can be used to pull in the data. Take a look",
            "Hmmmmm you can maybe use the movie database API - tabular, non-time series data",
            "Check out UCI ML repository, Kaggle,  Data.gov",
            "Kaggle?"
        ]
    },
    {
        "id": "1btl6ei",
        "datetime": 1712018703.0,
        "flair": "Tools",
        "title": "Nature: No installation required: how WebAssembly is changing scientific computing",
        "score": 11,
        "comment counts": 7,
        "content": "WebAssembly is a tool that allows users to run complex code in their web browsers, without needing to install any software. This could revolutionize scientific computing by making it easier for practitioners to share data and collaborate.\n\nPython, R, C, C++, Rust and a few dozen languages can be compiled into the WebAssembly (or Wasm) instruction format, allowing it to run in a software-based environment inside a browser.\n\nThe article explores how this technology is being applied in education, scientific research, industry, and in public policy (at the FDA). \n\nAnd of course, it's early days; let's have reasonable expectations for this technology; \"porting an application to WebAssembly can be a complicated process full of trial and error \u2014 and one that\u2019s right for only select applications.\"\n\n--------\n\nKinda seems like early days (demos I've seen feel a little... janky sometimes, taking a while to load, and not all libraries are ported yr, or portable). But I love that for many good use-cases this is a great way to get analytics into anybody's hands. \n\nJust thought I'd share. \n\n[https://www.nature.com/articles/d41586-024-00725-1](https://www.nature.com/articles/d41586-024-00725-1)",
        "comments": [
            "WASM has been in use for a while for consumer apps that have an emphasis on performance (e.g. Twitch, Figma) but for scientific computing there definitely is an opportunity to port typically hard-to-run or old programs such as the R interpreter very easily within the context of a web browser. And there stands to be potential in building scientific tools for mobile devices as well.",
            "Thanks for sharing. I heard a lot of complaints from developers about wasm. And it hasn't really taken off as it look it would back when it was announced, apparently it has some shortcomings that hinder its growth",
            "Thanks for sharing!"
        ]
    },
    {
        "id": "1btymwi",
        "datetime": 1712064203.0,
        "flair": "ML",
        "title": "Interpreting a low-prevalence Reliability Diagram",
        "score": 0,
        "comment counts": 2,
        "content": "I'm checking to see if my model is calibrated (ie, are my predicted probabilities reasonable given observed probabilities?). When I plot the diagram I see two things:\n\n1. the plot is beneath the ideal line\n2. my observed probabilities are in the set (0, .2) and my predicted probabilities are in the set (0, 1)\n\nHow am I to interpret this? Should my predictions only fall in the same set (0, .2) as observed?\n\nI know that the initial read is that my model is overconfident but feel like I'm missing something that has to do with the range of observed probabilities.",
        "comments": [
            "How are you defining \"observed probabilities?\" Representing as P(Y|X), do you have lots of samples with identical X that you can average?\n\nSome hypotheses:\n\n1. You're over-fitting. What's the out-of-sample performance?\n2. Your model choice is inherently poorly calibrated (SVMs, GBTs are vulnerable to this)\n3. You don't have enough samples to get a good estimate of the model probability"
        ]
    },
    {
        "id": "1bt0fkm",
        "datetime": 1711968240.0,
        "flair": "Discussion",
        "title": "Am I not getting interviews because I dont have a data science portfolio?",
        "score": 79,
        "comment counts": 68,
        "content": "Im working on a bs/ms for data science and havent been having too much success finding internships. Zero interviews at all. Ill look at a posting that lists requirements like: \"knows what python is (not the snake)\", \"used excel before\", \"(preferable) has heard of statistics\" and its really demoralizing. So now im just trying to figure out whats wrong with me as an applicant.\n\nI have interesting projects in my resume: im working on a grad school project as part of a team for US army research where im building visualization dashboards and writing SQL for backend. In undergrad, I worked on finetuning a MRI neuroimages CNN for depression diagnosis.\n\nI have previous internship experience: did GenAI/Rag at a start up, building data pipelines, finetuning embedders, making flask webapp, deploying model. Product saving xyz manhours by helping people find and summarize important documents etc.\n\nProblem is i cant really put up these projects in my portfolio because of NDAs and I dont really do side projects. So because of that i dont have a web portfolio/git portfolio. \n\nIs that the issue? Just really bummed out about the hob search and trying to channel that into improving my profile :/",
        "comments": [
            "I did a Ph.D. in CS with a thesis on ML. I got plenty of replies from interviewers but after the initial phone call, after they realized that my research was theoretical and that I don't have almost any practical experience they says something about the need for a proper experience and that I do seem like a good fit beside this, and ended the process there.  \nIt happened over and over. \n\nI did know how to program (I had a bachelor's degree in CS after all) so I taught myself the practical ML/DS stuff, took some online courses, but it made almost no difference. They wanted proven experience which could assess my abilities for them.\n\nSo I started doing project, no matter what, just created stuff from start to finish which had anything to do with data, in a way that I could tell the story of what I did and what were the challenges etc.. Stopped talking about my thesis on interviews (as it made the interviewers to get a weird unfocused gaze) and started telling about those (which got those interviewers immediately interested and intrigued). They didn't care for the content of the project (most of the times) but the surrounding frameworks.\n\nTwo months later I got my first job.\n\nSo, yes, from my experience having a portofolio is important, but what is more important is your ability to tell about it - what was the type of problem at hand? what did you use? Which tools? What challenges did you face? And so on. In most cases you can talk about those without exposing any specifics. It is annoys sometimes but doable. Oh, and you don't necessarily need a website or a document listing them, just be ready to talk about them in a way that will catch the interviewer attention.\n\nBest of luck :)",
            "Ask your former supervisors to be available as a reference. The project might be NDA but that doesn't mean they can't confirm your skills and involvement. Unless of course, they work in a 'when caught the gov will deny your existence' kinda agency :-)",
            "One of the toughest things right now that aspiring / starting data scientists must contend with is that there seem to be thousands of \"you's\" out there all of a sudden. That is, with the exponential rise in educational programs and students and word of mouth, especially internationally, data science has been flooded with people who study the same things, get the same degrees, have the same types of internships, know the same software and hardware packages, look alike on paper and apply in droves for every role, in every city and state. Breaking through such a logjam is new to analytics folks historically, as the applicant to job ratio was never as extreme as it is today, and even analytics veterans are finding it bleak and nightmarish.\n\nSo, what to do? Heed everything you read in responses here. Networking should be your top activity, as nothing works better. Think of networking as a critical skills set you now need in the work world. Getting both your resume and LinkedIn into good shape is paramount, despite what some say, as they are your self-branding. Applying and applying would have been frowned upon in the past as getting little return, but apparently seems a necessity now. It may also be wise to differentiate yourself from the throngs applying as well, showcasing what you bring to the table that is indeed unique, like the US Army project that you can at least highlight. The \"new\" data scientist has got to have more skills in their arsenal than the \"historical\" data scientist if they want to make it. That's 2024.",
            "You're not getting interviews because there is very little demand for entry-level data scientists. This has always been true, but hiring right now is pretty slow so it's especially true now. \n\nNo matter what stage of your career you're in, the best way to find a job is not by applying to job postings. It's by networking.",
            "Are you an international?"
        ]
    },
    {
        "id": "1btkw62",
        "datetime": 1712017940.0,
        "flair": "ML",
        "title": "CatBoost and hyperparameterisation",
        "score": 5,
        "comment counts": 3,
        "content": "I'm an ecologist starting my first forays into machine learning. Specifically, I'm using CatBoost to predict presence/absence of a threatened species at discrete wetlands. You guys are the experts in this space, so I'm hoping you can help. Firstly, is hyperparameterisation conserved? So for example, if I'm using a grid search for tree depth using low iterations and higher learning rate, will the best tree depth also hold true at higher iterations and smaller learning rates in all cases?  Secondly, when seeking a binary output from the testing set, is there anything that I should be cautious of? It feels more intuitive to use categories to validate the model then to predict probability when applying the model.",
        "comments": [
            "Hyper parameters are not conserved as you put it. The optimal tree depth may change depending on other parameters, etc.\n\nFor number of iterations, I always use more than necessary and apply over fitting detection.\n\nWhether you use the label or probability for predictions is up to you, but should be guided by the problem you are trying to solve. The default threshold for classifying as a positive class is >= 50%. Generally one would use a cost matrix to calculate the gain/loss of true positives, false positives, true negatives and false negatives. \n\nSo for example, if a false positive ends up wasting a lot of time/resources, you may want to have a stricter threshold for classifying as positive, e.g. 80%. \n\nLook up classification ROC curves for more information.",
            "I experimented with XGBoost, LightGBM and CatBoost (a.k.a. the trio from Kaggle) and my conclusion is Catboost pretty much doesn't require any HP optimization\\* to perform. The difference between default parameters and best parameters (for catboost) is so small it's not worth the human attention and time.  \nThat is, I give it test set for automatic overfitting detection, set ROC AUC as control metrics and choose iterations as high as time/memory allows. Learning rate either default or a bit smaller than default. Number of iterations is the number of trees in your model. If learning process is too slow on CPU I move to GPU, it looses precision a bit but runs significantly faster. The only downside of CatBoost is lack of documentation. Technically they do have a website with a documentation but the only info there that makes sense is cookbook recipes page and old youtube video of girl giving a presentation at some conference.\n\nNowadays CatBoost is my goto algorithm for approaching new problems. Allows to focus on features engineering instead of HP optimization. If I ever have extra free time I would rather try to discover a new data augmentation method rather than do HP with catboost.\n\n(\\*) by HP i mean algorithm related HP. Not features related HP.",
            "Two major overall thoughts before answering the Qs asked:\n\n1. Hyperparam optimisation is rarely a huge performance change.  It can fix overfitting, or squeeze an extra percent or three out of a decent model, but it never takes a model from mediocre to fantastic.  I wouldn't worry too much about it; your time is better spend on data prep and model selection\n2. While it is good to understand how hyperparam tuning conceptually works, your question suggests you are doing it by hand and that is rarely a good use of time.  Use a wrapper package like [Optuna](https://optuna.org/) to automate the search space hunt.  \n\nQs asked\n\n>Firstly, is hyperparameterisation conserved?\n\nNo.  The best tree depth will change dependent on other params.\n\nHowever, often those changes are small.  Which is to say that many hyperparameters can be optimised semi-independently without major performance loss.   But again, you should automate the hyperparam fitting and let a tree parzen estimator handle this for you\n\n>Secondly, when seeking a binary output from the testing set, is there anything that I should be cautious of?\n\nStrictly speaking, it is helpful for the tuning to see something more continuous like prediction probabilty (logloss) because itgives us better ability to know if the model is getting closer to the right answer or not.  You may want to use focal loss if trying to classify a rare event.  \n  \nA metric based on literally \"did you pick the correct class\" has a discontinuous reward at a particular certainty and no reward for changes elsewhere on the certainty, so you can't optimise from it very well.\n\nBut a binary metric can be useful for a human sense check of the model results.  Pick what makes sense for your use case."
        ]
    },
    {
        "id": "1bt1ova",
        "datetime": 1711972363.0,
        "flair": "Projects",
        "title": "What could be some of the projects that a new grad should have to showcase my skills to attract a potential hiring manager or recruiter?",
        "score": 34,
        "comment counts": 30,
        "content": "So I am trying to reach out new recruiters at job fairs for securing an interview. I want to showcase some projects that would help to get some traction. I ahve found some projects on youtube which guides you step by step but I don't want to put those on my resume. I thought about doing the kaggle competition as well but  not sure either. Could you please give me some pointers on some projects idea which I can understand and replicate on my own and become more skilled for jobs? I have 2-3 months to spare, so I have enough time do a deep dive into what is happening under the hood. Any other advice is also very welcome! Thank you all in advance!",
        "comments": [
            "Ingest a free data source. Clean and store that data in a Postgres database you host out of docker. Use this data to perform analysis or train a ML model. Part of your pipeline now use the model to make predictions on new data. Use a pipeline orchestrator. Now build a dashboard in python that visualizes all data, analysis, and model predictions. Host this use cloud resources. \n\nYou will find building all the things that aren\u2019t DS will be the hardest, and \u201cnot what you want to be doing\u201d. But you will also find as soon as you can put all the pieces together you will get multiple job offers.",
            "I put up several projects on git. Those included predictive maintenance, some exploratory data analysis, churn prediction and others. It's important to have them well documented so they can get a feeling about your way of work. Of course at end it depends which job you're applying for, if it's in the industry, finance or insurance. Make your mind or simply ask chat GPT for use cases and try to find those on kaggle.",
            "Combine multiple projects into one and make an end to end project with multiple features and functionality. The more the project has usecase potential, the more you get close to icing the hiring team.\nAlso, try on new age technologies like blockchain, large language models etc. Picking up these skills and building something out of it will give a positive impact on your profile that you are a tech savy, with great skills to pick up new stuff. Every hiring manager is looking for these skills desperately.\nChoose your niche and create some advanced projects which will make you stand out of peers\nGo ahead with your own portfolio website which has 3-5 decent projects hosted on free server. Hiring team can be pushed to demonstrate these in your tech round as they are handy and just a click away from end to end implementation. This will also help you to communicate your approach and the challenges faced during the production and will provide a genuine gesture of work from your end.",
            "Pick something up you\u2019re interested in and get data for it and post about that. I say this because it will keep you interested and it\u2019ll be easier to see the project all the way through. Create a repo on github and push your code there. If you want a DS role, focus on those areas (data cleaning, feature engineering, modeling, etc.) and create documentation discussing why you are doing certain things. If I\u2019m looking for a candidate their thought process is important in addition to fundamentals. Complex concepts can be learned but knowing someone can think their way to a solution and explain what/why they are doing it goes a long way.",
            "1 find a project the interests you that you will continuously work on til it\u2019s finished\n2 make sure that the project you pick has a narrative and solves a problem you can write about and talk about during the interview (very important!)\n3 find an easy data source (hopefully with an API, it just sounds better) or if you have to then scrape the data. Also there are some good paid data sources for specific things. I paid for fantasy sports data for example\n4 build out a pipeline to ingest the data into the model. This is key. Try to fit as many buzzword tech things into the stack that the recruiter can also understand (snowflake, docker, etc.)\n5 choose a popular algorithm/model to train that you can learn inside-out cause you will likely get cross examined on an interview. Also make sure the model fits the narrative or you have a good reason why you choose it. It shouldn\u2019t be something new or fancy cause the recruiter/interviewer might not know it and that doesn\u2019t help your case. Also don\u2019t just use basic stat 101 stuff that also makes you look like a noob\n6 make sure you have presentable conclusions to finish off the narrative story of this project to finalize the timeline and effect this project had\n\nImo most people mess up the first part which is having a good \u201cmarket fit\u201d. Meaning that your project needs to be marketable on a resume. So it has to have a clear problem, good narrative, right buzzwords, etc. this all depends on what type of industry and job as weak as seniority your applying to\n\nHopefully this helps"
        ]
    },
    {
        "id": "1bta8md",
        "datetime": 1711993361.0,
        "flair": "Career Discussion",
        "title": "I need advice about how to get a new job and what i\u2019m even qualified or on track to do",
        "score": 7,
        "comment counts": 13,
        "content": "Sorry if this is another redundant post but I\u2019m at a complete loss as to where I fit in. I got hired as a media analyst at a marketing company two years ago but I was on a two man (including me) data science team. I was immediately tasked with doing MMM and in between those we were working on a spot attribution model that wasn\u2019t going anywhere. Then my boss quit and a data scientist from the parent company came to oversee the rest of the MMM we were working on. That was at about my six month mark. After that my new boss took me with him to do \u201creal\u201d data science. \n\nSince moving to that dept, I\u2019ve been using different ML methods to do different predictive tasks mostly relating to Nielsen. I do most of my work in R now but before this I was solely using Python. I have a ton of experience making dynamic SQL queries, creating and cleaning model input files, building pretty complicated programs that do all of the modeling, make useable prototypes (mostly excel) and validation. My SWE friends say I have a natural ability to code and am better at it than a lot of their coworkers. \n\nMy problem is this, I am underpaid. According to people on this subreddit maybe by as much as 60k. I\u2019m still making an analyst salary. The other problem is that I get hit up by recruiters for basic analyst and dashboarding jobs. I\u2019m not opposed to taking those but the pay isn\u2019t better and due to inflation and a few other things I can\u2019t take a pay cut and moving for the same pay seems like i\u2019d just be setting myself back another year salary wise. The data science jobs I\u2019ve applied for have turned me down because I don\u2019t have experience moving data from lakes to warehouses. (I think) I know i\u2019m good at my job. I\u2019ve never missed a deadline and I\u2019ve gotten two shining reviews since I\u2019ve been here. I know if it was up to my boss he\u2019d bring my salary inline with what I should be making. I need a realistic plan on how to fill in the blanks on my resume and what type of jobs to target. Any advice is appreciated.",
        "comments": [
            "Whether you are underpaid is really hard to tell without a number of dollars and size of company and country and more detail about work.  How many YOE do you have?  The type of work you\u2019re doing is really not too far from a low level SWE.  You shouldn\u2019t be going for DS positions tbh. When using the models do you actually understand the underlying reasons and limitations do models or do you just feed data into whichever one performs best?  You also need to learn also a lot about how models are used to solve business problems , or if you even need models at all if you want to be a DS, not just how to implement or interpret them.\n\nEdit: you\u2019re also kind of doing DE work, but modern DEs need a lot more than just data pipelining usually; some computer networking, database management, and cloud computing knowledge is needed, not sure how much you\u2019re doing on those ends. Any dunning Krueger effects here?\n\nEdit2: if you stay while you figure it out it may be easier, especially once the market gets better. You should target data analyst, data scientist, analytics, or data engineer jobs. Everything except data analyst would be junior level, and data analyst level depends on your comfort ability with solving nonobvious problems with data",
            "Your experience is solid. \n\n>  \nI don\u2019t have experience moving data from lakes to warehouses\n\nSounds like nonsense. This is also data engineering stuff that you shouldn't really be expected to know for a DS job. It's just a rough market right now. Your best bet is probably to just keep applying.",
            "Depending on your YOE and education, you could definitely be underpaid at 90k.\n\nSalaries vary so widely in our field. In my last job search I interviewed for jobs as low as 70k and as high as 330k. My interview skills weren\u2019t good enough to get jobs in the higher end of that range, BUT I did get the impression that those jobs were totally attainable. Just need to learn to sell yourself, including (but not exclusively) in technical interviews. My advice is to start applying to new jobs so you can get interview practice and also see where you might fit within data science.",
            "i think some info about ur country would be helpful",
            "I think you not doing too bad man, your experience is definitely good and you have the advantage of applying for new jobs that will pay you better without the pressure of being unemployed"
        ]
    },
    {
        "id": "1bsndqm",
        "datetime": 1711925235.0,
        "flair": "Career Discussion",
        "title": "First job out of undergrad is really boring",
        "score": 157,
        "comment counts": 67,
        "content": "Hey all, im a fresh grad with a background with applied math and econ. I got a job really quickly after graduation as a data analyst at a large bank in my country (anti money laundering & compliance), but the actual responsibility of the role is more like a data entry position with excel. As you can imagine, it\u2019s painfully dull and low paying aside from the advantage of good LSB (9-5). I\u2019ve been working on a way to automate my work with python scripts, but aside from this there is really not much to add to my resume. \n\nMy overall goal is to move to a backoffice positon in risk/investment research unit in my bank where they do something more quantitative like analytics, modelling and statistical analysis. What else could I be doing to get there in the future? \n\n",
        "comments": [
            "This is common. Ten years ago you probably wouldn\u2019t have been hired into a data analytics position until you had 5+ years of business experience under your belt. It\u2019s not really an entry level job. Success in the field usually comes down to deep business knowledge, which takes time to acquire.\n\nThe explosion of data science popularity has created a lot of these junior positions that are being filled with maths grads straight out of university. But you don\u2019t know enough about anything to be useful yet, so you will be kept doing fairly menial work until you learn more about the business.\n\nLong term, don\u2019t expect to use a ton of your maths degree. Every grad has one these days and there aren\u2019t enough jobs with really fancy modelling to employ them all. Most of your job will be applying relatively simple models judiciously based on a deep understanding of the business need.",
            "I honestly wish my job was boring. \n\nI dread Sundays because it's reminder that I have to head into another week of constant pings and a plethora of data analyses with everything on fire for no reason other than just because.\n\nI'd switch with you any day. Be happy, and stay boring.",
            "Completely normal.",
            "Meet people on the team. If local, go to lunch, let them know you're interested, and seek a mentor.",
            "Boring is good - trust me.  You will progress and suddenly someone will need something by tomorrow for an executive presentation.  Oh and it\u2019s 4pm already.  That\u2019s the alternative, very little in the middle early in your career"
        ]
    },
    {
        "id": "1bt4e1j",
        "datetime": 1711979842.0,
        "flair": "Statistics",
        "title": "Univariate K-Means Clustering vs. Fixed Cluster Boundaries",
        "score": 8,
        "comment counts": 1,
        "content": "I am attempting to cluster stores based off their sales. I can either do:\n\n&#x200B;\n\n1. Univariate K-Means clustering by way of the Ckmeans.1d.dp package in R. This works perfectly fine, only 2 cons are figuring out the upper limit on K, and possibly explainability to the client.\n2. Fixed cluster boundaries. In this case, I average the sales of all stores, and create boundaries like: 50% below average, 25% below average, 25% above average, 50% above average. This is the method that has already been put in place and is what our client is familiar with (and can understand).\n\nWhat I am trying to determine is when is #1 more \"preferable\" to #2? ",
        "comments": [
            "You want to see clusters emerging when none are obvious by merely looking at the data and that's the scenario when you'd use clustering techniques such as k-means. When you already have separations like what you mention, I would rather use box plots to see whether there is any overlap between those pre-decided groups. For example, is there a substantial overlap in the data points of each cluster...does the median of one subplot lie well within the 50th percentile of the next group etc. It gives you much more nuanced understanding of the data. You can run some stats to see if the groups are indeed statistically different and so on. If you do not find them to be so, you can easily communicate this to your customer and decide with them on how to reset these boundaries that can better give them distinguishing groups."
        ]
    },
    {
        "id": "1bsuvbx",
        "datetime": 1711946751.0,
        "flair": "Career Discussion",
        "title": "I\u2019m double majoring in mathematics and computer science, considering doing a minor in the business field. Which would be the best for data science jobs?",
        "score": 43,
        "comment counts": 76,
        "content": "Was talking to family members who are currently in data analytic positions and they said a business background would be very beneficial for data science. Which ones would be the best?",
        "comments": [
            "Don\u2019t do a minor in business - keep maths and cs, that\u2019s the most beneficial. \nAs the other comment said, just take 1-2 business project management courses; focused on systems engineering etc.",
            "Take the time you would\u2019ve spent on business stuff to go more in-depth in CS and math.\u00a0",
            "Math(not only math, but also physics, statistics, etc) is the best major. As long as you have a good math background, you can self study the core CS topics(discrete math, data structures, algorithms, theory of computation). Data structures and algorithms are important for LC interviews for data science positions, but if you would like to become a machine learning engineer in the future, you will be more prepared (think that data science LC interviews lie in the spectrum of easy- medium and for MLE in the spectrum of medium- hard with better optimization). Math background will also give you the ability to code machine learning algorithms from scratch, without memorizing anything, as you will have a deep understating of the mathematics behind those stuff",
            "Some project management courses would be good to add on.",
            "> family members who are currently in data analytic positions\n\nNo disrespect to your family members, but those \"data analytic positions\" don't really know much when it comes to data scientist positions. Take their recommendations with a grain of salt. They are probably just projecting.\n\nAs for what you should do, forget about a minor in business, that is very useless for you as a data scientist, it is more of an Analyst thing (like your family). Sure, you are expected to have some business acumen, but you don't get that from a minor in business. Just focus on landing an internship, that is really all you need!"
        ]
    },
    {
        "id": "1bsu2qi",
        "datetime": 1711944090.0,
        "flair": null,
        "title": "Weekly Entering & Transitioning - Thread 01 Apr, 2024 - 08 Apr, 2024",
        "score": 4,
        "comment counts": 91,
        "content": " \n\nWelcome to this week's entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include:\n\n* Learning resources (e.g. books, tutorials, videos)\n* Traditional education (e.g. schools, degrees, electives)\n* Alternative education (e.g. online courses, bootcamps)\n* Job search questions (e.g. resumes, applying, career prospects)\n* Elementary questions (e.g. where to start, what next)\n\nWhile you wait for answers from the community, check out the [FAQ](https://www.reddit.com/r/datascience/wiki/frequently-asked-questions) and Resources pages on our wiki. You can also search for answers in [past weekly threads](https://www.reddit.com/r/datascience/search?q=weekly%20thread&restrict_sr=1&sort=new).",
        "comments": [
            "\n\nData Science for government, public sector, NGOs etc?\n\nI have been studying political science and economic history in Sweden, aiming to become an analyst of some kind. I have found these subjects to be very interesting, they have given me a sturdy base of general knowledge and taught me how to write and think with more precision. However, I do feel that I lack the essential, practical skills to maximize my usefulness in the job market. Many of the more technical Master programs I've been considering, that combine analysis with political subjects, require some kind of statistical or programming background as an entry requirement. With a limited amount of credits I want to make the most of my remaining studies and i have therefore been considering jumping straight into a 2-year degree in data science from a reputable school outside of the University. My only concern is that all these data science educations seem to lead to jobs within business intelligence, where as i am more interested in researching subjects like politics, economic development, health care, etc. Is data science more than just identifying customer behavior from website data? Is it something that employers of other kinds, the ones i'm looking for, government, NGOs, etc. are also interested in hiring?\u00a0\n\nPS. I dont usually beg, but i would appreaciate an upvote so that i can ask this question in a new thread, which requires 10 karma to post.",
            "Sorry for asking a common question.\n\nI work in marketing and Im interested in data science.\n\nI was in software engineering before but left for the work life balance. Ive heard data science is stressful too, but Im trying to be the go to analytics person in my department so maybe I get to set my own pace haha.\n\nAnyway my department has some education budget, and I was wondering if there are some good online courses with instructors?\n\nI worry that I will slack off if I do a self paced learning like dataquest.io\n\nAppreciate any recs I can get!",
            "Can y'all please roast my resume\n\n[https://imgur.com/z2woTKc](https://imgur.com/z2woTKc)",
            "Leaving Education for Data Science\n\nI'm a 48-year old high school math teacher. I've taught every subject from Pre-Algebra through AP Calculus (with the exception of AP Statistics) in my 27-year career. In 2008, I got my Masters in Educational Technology. Being in education has become increasingly difficult, especially these last few years.  \nThis past year I started teaching an Intro to Data Science course, and I've really enjoyed it. We use Google Sheets, CODAP, Python (via Google Colab), and Tableau and complete 8 unit projects using those skills. I was hoping this new course would give me some energy to finish out the last 10 years of my teaching profession. However, after teaching Data Science, I'm thinking I might like a full career change. As an educator, I have access to all of Datacamp's courses, which I've been working through (free is good for me!). I'm going through Excel, then SQL followed by Python courses, followed by their Tableau lessons. I know these will help me be a better teacher, but not sure if they would help with an actual career change.  \nSo is it possible (or even worth it) to pursue a career change? Where would I start? I don't really want to do more schooling since I won't make up that cost in time. I was thinking this could also be a \"post-teaching career\" so I have something to keep me busy in my 60s. If I wait until I retire from teaching, it would be more of a free-lance situation. Any insight would be helpful!",
            "Opinions on sending thank you notes post interview? I read it doesn\u2019t really make a difference. My BF, who conducts many interviews and is a lead, said to not waste my time bc it doesn\u2019t make a difference but I am not sure if he is saying this because he interviews so much."
        ]
    },
    {
        "id": "1brk50d",
        "datetime": 1711811582.0,
        "flair": "Career Discussion",
        "title": "Where are the Junior Level Data Scientist Jobs?",
        "score": 194,
        "comment counts": 133,
        "content": "When I search for data type jobs on Indeed, I see analyst level jobs, and then  senior, lead, mostly director data scientist jobs. I hardly ever see Junior level jobs or even \"Data Scientist\" as a job title without a \"Director\" or \"Vice President\" attached. As you can imagine, this makes jumping from analyst to data scientist very difficult despite being qualified (MS stats, 7 years in various, increasingly senior analyst roles). Where are these roles?",
        "comments": [
            "The level of business understanding required for a lot of data science work kinda makes junior data scientist a difficult role to create. Someone with a few years of experience in an analyst role who has cursory experience building ML models is probably going to be more successful in a \u201cstandard\u201d data scientist role than a recent college grad who\u2019s handy with ML but has very little experience in general.\u00a0",
            "Can I say something controversial?\n\nData Science is not a junior level job.  I know data science means a lot of different things depending on where you work, but generally it involves applying the scientific method + domain knowledge + taking into account business context.  Enough domain knowledge to make meaningful product recommendations is itself probably not something a junior candidate has, let alone good scientific skills (here, I'm not considering fitting models a scientific skill).\n\nAnalyst type jobs are the _de facto_  junior level IMHO, since this is where you're going to cut your teeth on core technologies and sharpen your domain expertise, and in all likelihood develop good scientific practices.\n\nEDIT:  Addressing OP's question more directly\n\nTech -- where the majority of DS roles are -- is in a turbulent time right now. Data science is not a core need for a business like software engineering, so there are fewer \"Data Scientist\" roles (setting aside Senior for a moment).\n\nWhen those roles are being hired for (and they are being hired for, I'm just not sure where you are looking), the roles are senior level (e.g. Senior DS, staff DS) because senior people have a better chance of being independent and not needing extensive hand holding. Because of my first point, companies currently can't afford to train junior people up, so they are hiring for reliable senior talent.\n\nThe roles exist, or at least have existed in the past, but market conditions are currently forcing orgs to reconsider who they hire and for what.",
            "I agree with others that data science doesn\u2019t have much room for junior roles. Usually only see junior DS at bigger / data-focused companies that have a business case for training up DS (massive tech company or very specialized domain w need to upskill talent). scientist very similar to senior analyst\u2014 unfortunately it means there aren\u2019t many junior DS roles BUT fortunately, it means there is a clear (ish) path from analyst to DS. I would encourage you to look for an analyst role that involves doing some DS type of stuff, at a company you like, with room for advancement, establish yourself and then make the case for a transition to DS once you\u2019ve proven your value. If you land an analyst role and convince your employer to give you a DS title, you\u2019ve essentially \u201cmade it\u201d and are forever after a data scientist, and in my experience this is so much easier than finding a DS role from scratch\u00a0",
            "I don\u2019t think junior level data science ever really existed at mass scale. It inherently is a profession that requires deep math/stats knowledge, strong software engineering skills, project management skills, and domain knowledge. All that is needed to simply land impact.\n\nMany who are currently senior/staff/principal in the field never started in junior DS roles, but instead they joined a mid-level role after PhD or transitioned into mid-level roles from software engineering or quant/statistician roles.\n\nWhat is new in the current time is that hordes of universities have started churning out master degrees in data science who are now looking for junior DS roles that don\u2019t exist and never existed in the first place. \n\nI have been warning for years against DS master programs and advocating instead for statistics master or PhD programs that give a stronger mathematical foundation. I have always seen the current scenario as a likely scenario, which is now playing out.",
            "Yeah, because the junior jobs are saturated. It takes a really long time for someone to become a good Data Scientist and even longer for someone to be able to create direction. Most companies are bottom heavy in these roles so the high level ones are the ones that are available."
        ]
    },
    {
        "id": "1bro6ep",
        "datetime": 1711822044.0,
        "flair": "ML",
        "title": "How do I know when to stop hyper parameter tuning and try something else?",
        "score": 49,
        "comment counts": 37,
        "content": "Edit: its for deep learning just to clarify; im referencing stuff like messing around with a CNN's architecture, activation, optimizer, learning rate, regularizers, etc\n\nI feel like i understand the math and algorithm behind model architectures quite well; i take care to preprocess and clean data, but in practice i struggle to get good performance. I always just end up manually tuning hyper parameters or using gridsearch for days or weeks with minimal improvement in erformance. \n\nI guess my question is: how do I know if i just need to keep going until i find some good combination of hyper params or if i just need to be trying something else?",
        "comments": [
            "Assuming this is not for deep learning, if you are hyper parameter tuning for days or weeks expecting some major performance improvement I think there might be some misalignment on exceptions. Usually you want to work on your data quality, feature engineering, model type, or reevaluating the scope of what you are predicting if you are are looking for large improvements to performance.    \n\n\nUsually you want to use hyper  hyper parameter tuning to squeeze out the last bit of juice. Then even if you are using that to get out the last bit of performance, doing this for weeks is probably a waist of time, it always better to have a good model running in production then a slightly better model in development. \n\n&#x200B;\n\nTLDR: if you are unhappy with performance after a few rounds of grid search start looking to improve the model upstream.",
            "So this is very model dependent. In general hyper parameter tuning is going to give you the smallest performance gain of all the modeling steps (assuming your parameters are in the right ballpark). \n\nData processing, feature selection / engineering, and layer architecture are all going to be far more impactful than hyper parameter tuning. \n\nThis doesn't mean you should ignore it, but you should spend more time on the other steps.",
            "If you\u2019re asking Reddit whether you should give up, you probably should. In all seriousness, look at your features and see whether there is noise, collinearity, or too many of them. This might be the root cause of your model fit issues.\u00a0",
            "Hyperparameters are generally not the root cause.",
            "[deleted]"
        ]
    },
    {
        "id": "1brkmsu",
        "datetime": 1711812884.0,
        "flair": "Career Discussion",
        "title": "What to spend company's \u00a31500 annual training budget on?",
        "score": 40,
        "comment counts": 29,
        "content": "I've been working as a data analyst at a fintech for 9 months now, although my master's degree is in data science\\*\n\nCompany offers a \u00a31500 annual budget to be spent on anything related to upskilling. What course would you recommend I spent it on? I am comfortable with data science theory and ML projects in a vacuum (AKA have never deployed into production) but have very little-to-no knowledge in specialised areas (NLP, Generative AI, LLMs etc.)\n\nI'm pushing to introduce some predictive analytics and ML into my role but will probably need to do some sort of proof of concept to sell it to stakeholders because all of them are very non-technical.\n\n\\*(graduated 5 years ago, had an illness which prevented me from working post-graduation).",
        "comments": [
            "If you aspire to be a senior one day, invest in coaching skills. The technical skills can be learned doing the job and on Youtube + GitHub repos.",
            "I think not having experience in deploying models in production will seriously hamper your career. Maybe you can do some azure / aws ML ops certification?",
            "I personally really like Udemy's courses. Took some while working on my thesis which was mainly theoretical and I learned a lot on the practical side of stuff from those courses.   \nSearch for frameworks and concepts you feel are relevant (maybe NLP, GenAI, or even go for PySpark or AWS). The high rated and popular courses will be, with high probability, a good fit.",
            "Use it on a nice ML summer school.",
            "I doubt that many courses would provide you with as much depth as you would have encountered in your Master's. \n\n  \nAlthough, Tensorflow and some cloud computing certificates could be valuable. Books would be great too, nice opportunity to build up that home library."
        ]
    },
    {
        "id": "1brmr5o",
        "datetime": 1711818376.0,
        "flair": "Career Discussion",
        "title": "Are there any DA/DS-adjacent roles (other than dev roles requiring a lot of Leetcode) that are in-demand, preferably low-code ones?",
        "score": 22,
        "comment counts": 46,
        "content": "Hello, I've been working in educational technology for over a decade and have a useless MSDS (graduated December 2022). I desperately need a new career. What can I learn that my MSDS might help me in, something that I could combine it with? Scrum master? I've read that RPA is dead, is that right? Anything along those lines?\n\nI am really good at things like project management, communicating with people, understanding client needs and putting them into writing, making business use case arguments, leading teams, writing and maintaining documentation, and related soft skills. I understand what is going on with different pieces of technology at a higher level but am not great at the nitty-gritty of it. I am fine with basic Python/SQL (and can keep learning programming in general to improve) but I hate Leetcode and really don't want to have to learn it. I do not see myself as a hardcore programmer who's going to learn 800 hard Leetcode questions...and I don't think anyone else does, either. lol\n\nOf course I know viz tools as well like Tableau, PowerBI, Looker etc. \n\nLooking for something for which there is actually a demand, so no entry level data analysis. Am also okay if it's not super related to DA/DS. I just do not want to waste any more time the way I did with the MSDS if at all possible. Thanks!",
        "comments": [
            "It sounds a lot like what you're good at overlaps with Product Mangement. There are definitely PMs who work with or specialize in working with DS teams. I know my own company has struggled at times to find good PMs who have enough DS/ML knowledge to work with DS teams, so it's potentially a good move in that there might not be quite the same level of competition as for other roles.",
            "Business Analyst sounds like a natural fit given your skills.\u00a0\u00a0 \u00a0\u00a0\n\n\nWith an MSDS you probably have more than a\u00a0 basic understanding of what the development process/DA/DS entails enough to gather requirements, write basic pseudocode, and plan analytics projects to gain insights for the business.\u00a0 \u00a0\u00a0\u00a0\u00a0\n\n\n\u00a0The trick would then be to gain domain knowledge to understand the business. It seems you already have that in the education field.\u00a0\n\n\nProject manager works too, but those positions always seemed tenuous. PM's seem to be the first to go when layoffs happen.",
            "I'm curious why you think a MSDS is worthless.  If things go well, we'll probably be looking for DS for customer success roles in a year or so.\n\nAlso, lots of people taking AIML certificate programs right now and a MSDS should both put you a step ahead of the graduates of those programs and put you in position to be a trainer for those programs.",
            "You seem to have a somewhat similar skillset to me in some ways, although I had a technical background. I actually decided to re-engage in technical roles a few years ago for this exact reason\n\nI take it you don\u2019t have a lot of experience/qualifications in BA/PM/SM/PO/agile coach? They are really hard roles to show evidence for, bc they are so contingent. There are agile qualifications and stuff like PRINCE2 which might help? Also some for Tableau/PowerBI\n\nIf you don\u2019t have a decent network to hit up, it\u2019s quite hard. I\u2019d suggest maybe getting into a habit of writing a regular blog about how you overcame specific problems in a PM/BA or similar role and then sharing it on LinkedIn etc. Be as detailed as you can.\n\nYou could try searching for people on LI in these roles at big companies in your city, then messaging them and asking to go for coffee to pick their brains. Obviously do your research so you can ask intelligent questions",
            "Sales Engineering or Support Engineering could be worth looking into. They are customer facing technical roles.\n\nEdit: and I\u2019ll also add - they are technical but it\u2019s not like you have to be a SWE to do them."
        ]
    },
    {
        "id": "1brlkvl",
        "datetime": 1711815310.0,
        "flair": "Analysis",
        "title": "Basic modelling question",
        "score": 8,
        "comment counts": 33,
        "content": "Hi All,\n\nI am working on subscription data and i need to find whether a particular feature has an impact on revenue.\n\n&#x200B;\n\nThe data looks like this (there are more features but for simplicity only a few features are presented):\n\n&#x200B;\n\n|id|year|month|rev|country|age of account (months)|\n|:-|:-|:-|:-|:-|:-|\n|1|2023|1|10|US|6|\n|1|2023|2|10|US|7|\n|2|2023|1|5|CAN|12|\n|2|2023|2|5|CAN|13|\n\n&#x200B;\n\nGiven the above data, can I fit a model with y = rev and x = other features?\n\nI ask because it seems monthly revenue would be the same for the account unless they cancel. Will that be an issue for any model or do I have to engineer a cumulative revenue feature per account and use that as y? or is this approach completely wrong?\n\nThe idea here is that once I have the model, I can then get the feature importance using PDP plots.\n\nThank you\n\n&#x200B;",
        "comments": [
            "Sorry I didn't understand the comment about what you think might be problematic in building such model. Seems to me a normal scenario in which yes of course you can build a predictive model of such kind",
            "Once you have those PDPs of your features then what? What\u2019s the ultimate business problem you\u2019re trying to solve here, is this more about acquisition or retention?",
            "Yes, it would be an issue. \n\n**reason**\n\nIf you use the data as is, the contribution of long term subscribers will be substantially higher than newer subscribers. This would mean that the important features will be biased toward more long term users, who are least likely to represent the current state of play. \n\n**simplest solution**\n\nTo deal with this, depending on how constant rev is within ids, you would want to reduce the data down to one row per id using one of the following methods:\n\n- based on df.groupby('id')[x].first()\n-  the method noted under bullet 1 in 'proxy' below\n- constructing a weight and including it in your model, such as 1/n_obs_of_id to give each id the same importance regardless of length of subscription. To get sum(weight) == n_total_obs you can stabilize it using weight * n_total_obs.\n\n**going further**\n\nHowever, this solution has the issue that features of your data will be largely time invariant within id (country), and you can't see how changes in x have an effect. \n\nAs such, it's important to split this question into two parts:\n\n1.  (between model) understand not only the time invariant factors related to subscription level, and \n2. (within model) the time varying factors related to an up-/down- grade, which determines rev\n\nIf you're using traditional stats, the class of model best suited to this is panel regression (a subset of linear mixed effects modelling). ML versions now exist, but I haven't kept up to date with these developments.\n\n**an easy to implement proxy**\n\nYou can proxy this in standard OLS for:\n\n1. Using y = mean_rev_in_id and using the mean x values for all integer values, and the modal value for all other features.\n2. Using y = (rev - mean_rev_in_id) and fitting (and ignoring) the mean x values for all integer values, and the modal value for all other features, and then fitting (and interpreting) the obs x values. \n\nYou may also want to add the lagged values of each x value (eg 1 month priors x value) to model 2 to get the effect of a difference to the ids typical value the month prior to the subscription change.\n\nOff the top of my head, you would need to include a weight still in both cases.\n\n**extending the proxy to full business case**\n\nNow, for this proxying, if you have cases where an id had a subscription, a break (and this is not in your data, ie no rows with rev =0), and then resubscribed you may want to:\n\n-  split each id into multiple, one per subscription period (id_subscription), and rerun the above. You'll need to recalculate weights (one for number of subscriptions by id, and one for obs per id_subscription and multiply them together). This will give you the factors per subscription rather than id, and is worth using as a cross check to understand the generalizability of analysing users to inferences about subscriptions (can apply to models 1 and 2). While it doesn't capture unobserved heterogeneity across the subscriptions of one id, you could assess the impact by doing a sensitivity test involving randomly selecting only one id_subscription per id for inclusion. Similar results, no issue.\n- create an additional model which extends model 2, where you fill all breaks with a row with rev = 0, and (for simplicity) assign all 'id constant-ish' x variables to their last observed, and set all x variables that are contingent on app use to a unique dummy indicator (lots of other fill methods, but need to use for time constant). Then use y = (rev > 0) in a logistic model. Add one final feature, either the cumcount of y (as an indicator of number of months previously ever subscribed), or the cummean of y (proportion of months subscribed since first subscription). \n\nHowever, if you remove each ids first subscription period, it will tell you the likelihood of resubscription (noting, anyone who doesn't resubscribe needs to be in the model still, albeit with all 0s in y. While each case should start at the time of first subscription, they should all rows for each month through to the present.\n\n**interpreting the proxy/extensions**\n\nTaken together, Model 1 shows you the characteristics of users which are associated with higher rev, model 2 shows you the changes in characteristics of users associated with increasing/decreasing revenue (if you apply the extension above, you see this for both subscriptions and users), and model 3 shows you the characteristics of users who resubscribe after a break.\n\n**doing it properly**\n\nIf you use this kind of approximation, you should be able to generalize from OLS to most other approaches eg RF. \n\nHowever to do it properly, you run this in a panel regression. Depending on terminology, model 1 above approximates what may be described as a fixed-effects model, and model 2 above approximates what may be described as a between-effects model. Using linear mixed effects models makes set up more difficult, but gives you more flexibility over assumptions of the covariance matrix within users, and a few other things.\n\nStata's official documenation and forums, and UCLA stats department both have good conceptual documentation of panel models.\n\nIt's probably a but easier to do this with traditional stats first, then try to find an ML approach.",
            "There at least 3 ways you can approach it, all with different models:\n\nYou can look at increasing revenue by making subscribers stay longer - then you need to look at the length of time they stay with the business. Look into survival analysis for this.\n\nYou can increase revenue by identifying who subscribes to more expensive tiers. Then you ditch the month column and just build a classification model to see what correlates with signing up for more expensive subscription - so monthly payment is your variable of interest\n\nAnd finally, you can look at what helps increase the number of new subscribers overall. There are multiple ways you can spin it - you can look at the impact of location, marketing etc",
            "It seems like the Y (rev) is not independent and may need to be reframed so that the Y is independent (which likely means you may need to aggregate the X's). This in my opinion would be the easiest path. Alternatively, you \\_may\\_ be able to use a hierarchical model (I say may because I'm not entirely sure - in other words, your data is nested).   \n\n\nBut with the limited context, my bet is that the data is poorly framed for what you want. How much data do you currently have and how much would you have if you had single row per id?"
        ]
    },
    {
        "id": "1br7ri1",
        "datetime": 1711768641.0,
        "flair": "Discussion",
        "title": "What's your tooling look like?",
        "score": 37,
        "comment counts": 27,
        "content": "Those of you who work with ton of data i.e. TB of data and do metrics development to ML modeling, what does your tooling look like? I mean what tool/tool combination do you use to retrieve, analyze data, develop models etc. \n\nMy tooling combination has been: Oracle + Python + GitHub +couple of Linux boxes which served well in my last company with no cloud or no TB size data. Recently moved to a new cash rich company that seems like a bit of wild West in terms of tooling: data bricks, big query, AWS, Posit everything is used in some way or other. So, trying to find the best set up. ",
        "comments": [
            "\ndbt\nAirflow\nPython \nBigQuery\nPostgres \nFirestore\nRedis \nGitlab\nKubernetes\nVertex AI",
            "GitHub, Python, Hamilton, Jupyter, various AWS managed services.",
            "Big query, python, jupyter",
            "Custom product built on Airflow, Kubernetes and AWS. Jupyter notebooks within this for development, GitHub and Airflow for production. Occasionally Databricks to test some queries. \n\nMy company works with PB of data, and the tools across teams varied significantly a few years ago. We invested a significant amount of money to build this custom product and enable Data Science to easily do development/production without having to rely on other teams. Integration with our Data Lake allows for easy delivery to any software system connected to the Lake.",
            "Dbt, snowflake, snowpark, python, airflow etc."
        ]
    },
    {
        "id": "1br0kjc",
        "datetime": 1711749271.0,
        "flair": "Discussion",
        "title": "Maybe we're using too much data to train models",
        "score": 72,
        "comment counts": 91,
        "content": "If you make the weak analogy between LLMs and humans / other organisms, we find that the LLMs are trained on vast amounts of data. So much so that a human being may never process that much information over their developmental years. Of course, the brain seems to be a foundation model that has been trained over all of its ascendant's data, which could be argued to be much more than all of the internet. But I don't think the information coming from our ancestors is as detailed as the information we gain from actually experiencing and learning things.\n\nSo, my shower thought was that maybe we are using way too much data to train the models. Clearly, there is enough information in much less data to gain a deep understanding of the world.\n\nI do recognize that it's the trillion dollar question to construct a model that would need less data to develop as good of a world model as gpt4 or the other sota models.\n\nJust a Friday night ponder, ta ta and farewell.",
        "comments": [
            "The problem is that language is a sparse medium for information. According to a tweet from Yann LeCun, the amount of data that SOTA models are trained on is actually much less than the visual data processed by a 4 year old. Here\u2019s the tweet: https://twitter.com/ylecun/status/1766498677751787723",
            "The way in which humans learn and the way machines learn have essentially nothing in common, and one of the ways this becomes apparent is that  human beings are incredibly efficient with training data. A toddler that touches a hot stove once never will again. Good luck getting a machine that can do that.",
            "Doesn\u2019t more data (assuming the majority of the data is correct) help combat noise? For example, if I look up at the sky and see it is blue 99/100 times, I have much more confirmation that it is supposed to be blue vs if I see it is blue once and grey another time because it is foggy.",
            "Someone just read the GPT paper",
            "You are forgetting about the evolution of instincts installed in humans at birth and pretty much every living creature. We are not exactly born with zero knowledge in our genes and we merely warm start ourselves based on that. Also, as other mentioned, we are using so many sensors at the same time to learn instead of just tokens in a model. If there was a baby born with no ability to move, see, smell but only hear, the learning would slow down dramatically."
        ]
    },
    {
        "id": "1br7irg",
        "datetime": 1711767873.0,
        "flair": "Discussion",
        "title": "What was your onboarding like and what should a proper onboarding look like?",
        "score": 25,
        "comment counts": 15,
        "content": "I am the first data analyst at my non-profit and I was pretty much thrown into immediate, priority tasks with very little explanation of what the non-profit's processes are, what fields are used, what these fields actually mean. They also hired me knowing that I was new to their domain but basically left me to play telephone for the first several months as I tried to fill data requests, understand what was where or what was even possible, and verify its quality and accuracy. It didn't help that the person who interviewed me and was suppose to be guiding me quit before I even started, and that person's boss quit a month after I joined. I'm quitting next month and wanted to hear what others onboarding experiences are, what a proper onboarding should look like for a DA/DS and also test to see if these questions make sense to ask when I start interviewing for a new job, to avoid frantic organizations.\n\n* How standardized is your current data process? (as in, is there a database already set up, is there a dictionary/glossary/schema to refer to, how frequently are metrics changing?)\n* How are you currently fulfilling your data needs? (as in, are there dashboards already set in place, is it mostly ad-hoc)\n* Do you have any dedicated engineers/architects or anyone dedicated to actual database and pipeline management?\n* How do you handle tech debt?",
        "comments": [
            "I'd assume majority of fortune 500 companies, on boarding looks like the following, the first three weeks:\n\n\n\n1. Mandatory corporate trainings\n\n2. Sorting out whatever software/it/data access requirements that your manager didn't ahead of time or couldn't before you were on boarded.\n\n3. Learning the regular meeting routines. Your in a group. You have some routine your expected to follow.\n\n4. Your manager has a task for you, that involves you either figuring out what the person before you did from code, data, documentation, or you effectively shadow someone on your team and learn what set of tasks they are responsible for and how to do it, because you'll be expected to take over a portion of that work and they'll be assigned to differnt tasks, or you will be asked to assist on a project that other people are effectively leading.\n\n  \nGenerally smaller companies are shit shows and whether a company is good or not depends whose running it. At a startup or small company (say fewer than 100 employees)  the experience of a person directly reflects the founders. At mid and even large companies, company culture is a mix of organization culture thats entrenched from where the organization comes from, the executive board, the companies and your direct manager. Company cultures are very much real and they do take a life of their own.",
            "I have been doing data and evaluation capacity building for non-profits for the last decade.\n\nThis is not uncommon for non-profits to have little structure for their data. Often I\u2019ve seen analysts come in and need to support a lot of this capacity building - often skills they don\u2019t - to help the organization get to a place where their data is meaningful. However, this has been some of the most meaningful work I have ever done once you start seeing processes working and data driven decisions being made to improve impact.\n\nIf you want to stay in the non-profit field, I would anticipate more of this and encourage you to develop more skills beyond analytics to help build and sustain these data infrastructures.",
            "Gitlab has a very interesting employee onboarding process. Basically you start and you have to make one change inside of their handbook within the first week(?).. I believe.",
            "One thing my current job did well, was to give us the following:\n\n1. A repository of who knows what. Ability to quickly figure out who you should reach out to for information. Ie: /u/znihilist  can answer questions about these sets of tables, etc\n1. A list of what can be and can't be asked from supporting teams. Ie: The engineering team can help with X/Y/Z, etc. \n1. Prior to joining, they extensively prepare access to anything your team and project needs. Other than logging in on your laptop, you don't need to do much. Everything is installed, access to web tools, etc.",
            "I don't think I'm the norm, but I have a 6-8 week onboarding for new employees. Most of it covers domain knowledge because we hire people with no background in my domain. We also talk through how to prepare for and have effective check-ins, how to deal with messy data, getting set up with all the systems and databases, how to name and store documents and code in our document and code repositories, etc. Finally, we put the person on a more well-defined project, so they can get feedback."
        ]
    },
    {
        "id": "1bqrbdm",
        "datetime": 1711724258.0,
        "flair": "Analysis",
        "title": "Causal inference lecture notes ",
        "score": 85,
        "comment counts": 9,
        "content": "Stumbled across these lecture notes by a former student of Rubin. He taught this course at Berkeley:\n\n\n",
        "comments": [
            "Damn causal inference finally getting its due in this sub lol\n\nProbably like 15 threads in the last week haha",
            "Causal Inference AND R code? Gosh I love this community.",
            "WOW thats an entire book you wrote here! Thanks!  \nI'm just getting into causal inference and was looking for more learning material :)",
            "Commenting to get back to this later. Would love to take a course on this, so I'm jealous",
            "Will give it a look"
        ]
    },
    {
        "id": "1br5b8e",
        "datetime": 1711761418.0,
        "flair": "Discussion",
        "title": "SVD Did Not converge in Linear Least Squares Error ",
        "score": 8,
        "comment counts": 18,
        "content": "Was working on a project and was trying to make a scatterplot with a trend line. usually always used R for simplicity in Linear regression problems.  \n\nWorked in Regression extensively for first time in Python. Used numpy polyfit for the problem.  I performed a left join and joined 2 tables here \n\nWhat does it mean when SVD does not converge in Linear Least Squares... \n\nIm aware how Singular Value Decomposition is used in PCA, Linear Regression and how it works. \nWhat does it signify?  why it happens. ? Never faced this issue before. ",
        "comments": [
            "Any chance of sneaky NaNs or Infs in your data?",
            "My guess is you have some NAs in your data. Or one of your variables contains two different data types.",
            "Have you tried using a different SVD algorithm? If you have some small singular values that you can discard, you could use a partial SVD algorithm from Tensorly, for example.",
            "I don\u2019t think I understand the question. What does SVD have to do with linear regression"
        ]
    },
    {
        "id": "1bqyh5x",
        "datetime": 1711743477.0,
        "flair": "Career Discussion",
        "title": "Some new job opportunities is sports and gaming analytics!",
        "score": 13,
        "comment counts": 21,
        "content": " \n\nHey guys,\n\nI'm constantly checking for jobs in the sports and gaming analytics industry. I've [posted recently in this community](https://www.reddit.com/r/datascience/comments/1b8xr3y/zelus_analytics_and_tennessee_titans_are_hiring/) and had some good comments.\n\nThe job board updates daily and as we know, the market is not as dynamic as before so I wanted to share several data science positions that appeared recently.\n\n* [Senior Data Scientist - USA](https://www.sportsjobs.online/job-details/3063senior%2520data%2520scientist/r/recRuJu4maAaDfKC6?utm_source=reddit&utm_medium=post)\n* [ML Engineer/Data Scientist (Creative Research Team)](https://www.sportsjobs.online/job-details/3047ml%2520engineer%252fdata%2520scientist%2520%2528creative%2520research%2520team%2529/r/recfyjrPwuu4SXZUm?utm_source=reddit&utm_medium=post) \\- USA\n* [Data Engineering Summer Intern](https://www.sportsjobs.online/job-details/3068data%2520engineering%2520summer%2520intern/r/recKlr4Fp6ZFXs3tp?utm_source=reddit&utm_medium=post) \\- USA\n* [Director - Machine Learning Engineering](https://www.sportsjobs.online/job-details/3069director%2520-%2520machine%2520learning%2520engineering/r/recC9yE6ql3GxdVQN?utm_source=reddit&utm_medium=post) \\- USA\n* [Data Analyst - Europe](https://www.sportsjobs.online/job-details/3041data%2520analyst/r/recpQdVPzYhMAAo0p?utm_source=reddit&utm_medium=post)\n* [Staff Computer Vision Engineer](https://www.sportsjobs.online/job-details/3060staff%2520computer%2520vision%2520engineer/r/recsru3dRyQnKXTfk?utm_source=reddit&utm_medium=post) \\- USA\n\nThere are multiple more jobs related to data science and hundreds of others jobs in analytics and software.\n\nI've created also a [reddit community](https://www.reddit.com/r/sports_jobs/) where I post recurrently the openings if that's easier to check for you.\n\nDisclaimer: I run the job board.\n\nI hope this helps someone!",
        "comments": [
            "cool",
            "Are there any openly available case study for sports analytics? I\u2019m interested in this domain and would like to understand the type of problems solved using f data science.",
            "Thanks!!",
            "Cool",
            "Do you have anything entry level that does not involve being in school. I'm a recent graduate and looked at some of the internships. But all I've seen require I be in school. FML for trying to finish school quickly. Biggest mistake I think."
        ]
    },
    {
        "id": "1bqxqru",
        "datetime": 1711741220.0,
        "flair": "Statistics",
        "title": "Instrumental Variable validity",
        "score": 12,
        "comment counts": 14,
        "content": "I have a big graph and I used DoWhy to do inference with instrumental variables. I wanted to confirm that the instrumental variables were valid. To my knowledge give the graph below:  \n1- IV should be independent of u (low correlation)  \n2- IV and outcome should be dependent (high correlation)  \n3- IV and outcome should be independent given TREAT (low partial correlation)\n\n  \nTo verify those assumptions I calculated correlations and partial correlations. Surprisingly IV and OUTCOME are strongly correlated (partial correlation using TREAT as covariate). I did some reading and I noticed that assumption 3 is mentioned but often not tested. Assuming my DGP is correct, how would you deal with assumption 3 when validating IVs with graph and data ( I copied the code at the bottom) .   \n\nhttps://preview.redd.it/e11wdxkqsbrc1.png?width=858&format=png&auto=webp&s=d02ef2c13c3783ec1d2f5985fc21a5c8bfabb167\n\n    # Generate data\n    N = 1000\n    u = np.random.normal(1,2, size = N)\n    IV = np.random.normal(1,2, size = N)\n    TREAT = 1 + u*1.5 + IV *2 + np.random.normal(size = N)\n    OUTCOME = 2 + TREAT*1.5  + u * 2\n    \n    print(f\"correlation TREAT - u : {round(np.corrcoef(TREAT,u)[0,1], 3 )}\") \n    print(f\"correlation IV - OUTCOME : {round(np.corrcoef(IV,OUTCOME)[0,1], 3 )}\")\n    print(f\"correlation IV - u : {round(np.corrcoef(IV,u)[0,1], 3 )}\")\n    print()\n    df = pd.DataFrame({\"TREAT\":TREAT, \"IV\":IV, 'u':u, 'OUTCOME': OUTCOME})\n    print(\"Partial correlation IV - OUTCOME given TREAT: \" )\n    \n    pg.partial_corr(data=df, x='IV', y='OUTCOME', covar=['TREAT']).round(3)",
        "comments": [
            "An instrument Z (I use Z because IV's are usually denoted with Z) is an \"instrument\" if it fulfills the following 3 conditions:\n\n1. Z is associated with Treatment A\n2. Z does not affect Outcome Y except through its effect on A\n3. Z and Y do not share causes\n\nWhen the above conditions are fulfilled, Z is an instrument. However, only the first condition is empirically verifiable. You cannot prove definitively in a real observational dataset if 2.) and 3.) is fulfilled, which is the part that makes instrumental variable estimation difficult. Like all observational causal inference models, IV methods rely on their own set of unverifiable assumptions.\n\nIn practice, you will simply have to reason and convince your audience that you have chosen an instrument that is reasonable. One way this is done is by using instruments that other people have also used and are generally agreed to be good instruments.\n\nFurther note - even when you reason that something is an instrument, you still need further assumptions to extract a causal effect of your treatment A on outcome Y.\n\n**EDIT: (a tl;dr of the comment chain conclusion)**\n\nEven in a simulation setting, you cannot \"see\" if assumption 2 holds if you include unobserved confounding U between A and Y, due to a collider effect of Z -> A <- U. This means that you can only test assumption 2 under the setting where you don't add unobserved confounding U, but if you know that there is no unobserved confounding U, then there is no point in using IV methods anyways, so practically speaking it does not make sense either.",
            "If you're asking how to check if assumption three holds, you can regress the outcome on your covariates and the treatment. If the outcome and treatment are independent conditional on the covariates, the IV should have a coefficient near zero and a non-significant p-value.\n\nThat being said, assumption 3 is not the issue- assumption 1 is. By definition, you cannot measure the correlation between the instrument and u, so that assumption can never be checked.",
            "Oh"
        ]
    },
    {
        "id": "1br6iip",
        "datetime": 1711764845.0,
        "flair": "Discussion",
        "title": "Entity based vs aspect based sentiment analysis?",
        "score": 3,
        "comment counts": 12,
        "content": "Say I have a set of product reviews. For example \"the product was too big but the material was good.\" I want an approach that can extract sentiment like: fit=negative, quality=positive. What's a good approach that can do this? I'm seeing entity sentiment on Google cloud. Does it do the same thing as ABSA on hugging face for example? ",
        "comments": [
            "My first thought would be to use an LLM",
            "Following"
        ]
    },
    {
        "id": "1bqyft6",
        "datetime": 1711743350.0,
        "flair": "Discussion",
        "title": "Just realized peft (for LoRA finetuning) does not support tensorflow \ud83e\udd72",
        "score": 6,
        "comment counts": 8,
        "content": "Just realized peft (for LoRA finetuning) does not support tensorflow \ud83e\udd72\n\nNot  finding compatible libraries is one of the most frustrating part of my  work, especially when I've just recently started using linux (though  it's fun). And maybe now I'll have to spend tomorrow some time  converting tensorflow implementation to pytorch implementation.\n\nAnyways,  after spending 12+ hours of working and hitting a dead end, my brain  refuses to deal with this problem now, perhaps I'll look into it  tomorrow. So any suggestions on how I can use LoRA for BERT finetuning  using tensorflow are welcome and will be VERY MUCH APPRECIATED.",
        "comments": [
            "Abandon tensorflow",
            "Abandon tensorflow",
            "Do people still use TensorFlow?",
            "Oh"
        ]
    },
    {
        "id": "1bqhujk",
        "datetime": 1711691536.0,
        "flair": "Education",
        "title": "How do you guys do CI/CD?",
        "score": 41,
        "comment counts": 13,
        "content": "Any good resources on CI/CD for productionalization?",
        "comments": [
            "We use gitlab cicd which has a pretty wide array of tooling you can use. There's basic stuff like running unit tests, running vulnerability scanners, building docker images, linting, etc etc. So when someone merges a change into our main branch after CR we will have a new image from that CI pipeline that has gone through all of these stages. Then we have a gitlab runner instance setup on our dev/stg/prod servers that (depending on our configurations) can pull and redeploy that latest docker image onto those servers. Obviously it's a bit different for prod but that's the gist of it",
            " for CI/CD you need to have tests. I dont like unit tests so i implemented integration tests with mock input data for an API of llm based NER application.  Since i have tests this means after each change in the code base i can run a process of  testing , dockerising, testing again and using the new docker image in prod\n\nBut dont take my word for it. I am just a DS who is learning more about SWE",
            "Is there much of a difference to this from SE and DS? Just curious because this is a good question",
            "Generally will run tests, linters, etc in CI and build + deploy a docker image to production. ci/cd for DS is just ci/cd after all -- it's not all that different from what the engineers are doing",
            "Github Actions. You can look at code examples in open source repos (check the .github folder). Those usually won't have deployment examples if they're for packages instead of services, but most systems like ECS, Google Cloud Run etc. will have examples in docs or you can just find blog posts."
        ]
    },
    {
        "id": "1bpwc6z",
        "datetime": 1711634474.0,
        "flair": "Discussion",
        "title": "What is a Lead Junior Data Analyst?",
        "score": 357,
        "comment counts": 97,
        "content": "",
        "comments": [
            "The blind leading the blind",
            "You will have the responsibilities of a lead data scientist but you will be paid as a junior \ud83d\ude0a",
            "You can do easy projects on your own?",
            "As someone who works at amazon I've never seen this posted on job boards.\n\nMostly cuz the data analyst roles tend to be in the warehouses or through regional data hubs.\n\nT3 is the lowest level data analyst position we have, but it's FC level. T4 data analyst I think is region/ metro, then T5 is when you become a lead data analyst, and iirc anything above T6 gets a data science title (junior ds being the lowest tier there). \n\nI don't know if this is a contracted position or someone wrote a wrong job title tbh.",
            "\u201cWe expect a lot out of you, and are willing to pay you an entry-level salary for it.\u201d"
        ]
    },
    {
        "id": "1bqfzzx",
        "datetime": 1711685096.0,
        "flair": "Analysis",
        "title": "Could you guys provide some suggestions on ways to inspect the model I'm working on?",
        "score": 18,
        "comment counts": 20,
        "content": "My employer has me working on updating and refining a model of rents that my predecessor made. The model is simple OLS for interpretability (which is fine by me) and I've been mostly incorporating exogenous data that I've scratched together. The original model used primarily data related to the homes in our portfolio. My general theory is that people choose to live in certain places for more reasons than the home itself. So including data that describe the neighborhood (math scores at the closest schools for example) should add needed context.\n\nAccording to standard metrics, it's been going gangbusters. I'm not nearly out of ideas on data to draw in and I've gone from an R-Squared of .86 to .91, AIC has decreased by 3.8% and when inspecting visually where there was previously a nasty curve at the low and high ends of the loess on the actual values versus predicted scatterplot, it's now straightened out. Tests for multicollinearity all check out. *However*, my next step is pretty work intensive and when talking to my boss he mentioned it would be a good time to take a deeper dive in inspecting the model. He said the last time they tried to update it they did alright with the typical metrics but that specific communities and regions (it's a large national portfolio) suffered in accuracy and bias and that's why they didn't update it.\n\nI just started this job a month ago and I'm trying to come out of the gate strong. I've got some ideas, but I was hoping you guys could hit me with some innovative ways to do a deeper dive inspecting the model. Plots are good, interactive plots are better. Links to examples would be awesome. Looking for \"wow\" factor. My boss is statistically literate so it doesn't have to be super basic.\n\nThanks in advance!",
        "comments": [
            "Dummy variables for zip code are pretty standard. I\u2019m not sure what the purpose of the mode output is for the business, but zip code can definitely have some ethical and legal implications to keep in mind as far as redlining is concerned for housing.\u00a0\n\nIf for whatever reason you can\u2019t use zip code, think of all the things that it captures: crime rate, school district, population density, commercial density, distance to higher education institution, etc, and try to incorporate those.\u00a0\n\nAs for inspecting, you should probably check on hereroskedasticity as well, as this will indicate clusters where you\u2019re underperforming potentially. Should look at any leverage points too. Could look at added explanation plots to see how each variable contributes to additional reduction in variance. Could plot the residuals of the updated model against the current model, and put a slope=1 line on the plot. Anywhere where the scatter is below the line is where you\u2019re outperforming the current model, and vice versa. Could then investigate why you\u2019re over/underperforming for each data point, or try to get the model to strictly outperform for every data point.\u00a0",
            "You may want to consider a hierarchical model to account for subregions. Things like LMMs are pretty common if you\u2019re dealing with geographic data that varies by area.",
            "Can your manager tell you which subregions looked bad? Basically re-do the analysis they did originally and then start thinking about how you can extend it. That\u2019ll show that you 1) listen, 2) understand previous work, and 3) can improve it. \n\nAt a month in, re-doing and improving previous analyses is a great way to ramp up. I think looking for \u201cinnovative ways\u201d to inspect the model might actually backfire at this point if your manager knows exactly what went wrong last time. \n\nBut as far as visualizations go, maybe geographical heatmaps (not that I know of any) at varying regional levels? Rankings of high and low performing regions would also be useful to you in debugging.",
            "Without seeing the model, it's impossible to critique it, but start by looking at the cases with large residuals and consider whether there are variables that ought to be included but aren't.  You might also want to consider functional form variations.\n\nIf you try a random forest, (SPSSINC RANFOR and SPSSINC RANPRED) extension commandS), you can see whether predictions can be improved in a meaningful way.\n\nThe improvement in fit you are getting is pretty trivial, but the cases with large residuals may suggest some alternatives.",
            "For your neighborhood score you could build a classifier with something like hdbscan. In this, the class of the neighborhood would become a categorical variable (one hot encoded) into the original model. Theoretically, the qualities of a good neighborhood are intrinsic across regions and these similarities will result in similar clusters."
        ]
    },
    {
        "id": "1bqrgm5",
        "datetime": 1711724601.0,
        "flair": "ML",
        "title": "Supervised learning classification model VS anomaly detection model. Has anyone done both and compared results? ",
        "score": 2,
        "comment counts": 4,
        "content": "I was given a small sample of data and tasked with creating a classification model, where the classes were essentially \u201cnormal\u201d and multiple versions of \u201canomaly\u201d. My XGBoost classification model did very well, where I did an 80/20 train/test split with 3-fold cross validation. Realizing that there could be more versions of \u201canomaly\u201d than what I was given, I decided to make an anomaly detection model, training on only the \u201cnormal\u201d observations in the training data set and testing on the entire test data set. \n\nTo my surprise, both my one class support vector machine and my autoencoder results were abysmal. I suspect my issue stems from a low sample size and a high number of features. That\u2019s not the focus of this post though.  \n\nI\u2019m curious if anyone has done something like this. How did your classification model compare to your anomaly detector?",
        "comments": [
            "I feel like it's a rite of passage for prospective data scientists to learn about unsupervised techniques, think \"man, this is fucking sweet\", and then inevitably have it fail when you try it on real data.\n\n\n\nThe reason that it fails is that there are an enormous number of ways in which your data may vary. Things like PCA and autencoders will pick out the axes of maximal variance/features that minimise reconstruction error. But sometimes the thing that makes something anomalous isn't something that messes up the reconstruction error. It's something as simple as an account transacting with an account it isn't supposed to. The impact on the reconstruction error is tiny, but the significance of this behaviour is huge. In the vast majority of cases, most of the variability that you're looking at in your data is not relevant to the thing you're interested in when you want to detect anomalies.\n\n\nIf you have labels, always use supervised learning.",
            "Interesting",
            "yeah seems like weighted classes would be better here"
        ]
    },
    {
        "id": "1bpv01y",
        "datetime": 1711630718.0,
        "flair": "Statistics",
        "title": "New Causal ML book (free! online!)",
        "score": 193,
        "comment counts": 22,
        "content": "Several big names at the intersection of ML and Causal inference, Victor Chernozhukov, Christian Hansen, Nathan Kallus, Martin Spindler, and Vasilis Syrgkanis have put out a new book (free and online) on using ML for causal inference. As you'd expect from the authors, there's a heavy emphasis on Double ML, but it seems like it covers a breadth of material. The best part? There's code in both Python and R.\n\nLink: https://www.causalml-book.org/",
        "comments": [
            "This is the type of content I subscribe to this sub Reddit for.",
            "I\u2019m a novice at causal inference and ML but wanted to express what a time to be alive when stalwarts in the field just outright put things online for free. More power to the democratisation of education.",
            "Thank you for this.\n\nA couple weeks back I asked a question on this sub about moving from a Predicitive framework to a casual one and I was more or less attacked for it. \n\nIt good to see that actual professionals and experts are making good progress in this area.",
            "Oooh, thank you for sharing! Adding this to my reading list right now.",
            "Amazin thanks"
        ]
    },
    {
        "id": "1bprs9x",
        "datetime": 1711620007.0,
        "flair": "Analysis",
        "title": "Top Cities in the US for Data Scientists in terms of Salary vs Cost of Living",
        "score": 159,
        "comment counts": 76,
        "content": "We analyzed 20,000 US Data Science job postings from June 2024 - Jan 2024 with quoted salaries: computed median salaries by City, and compared them to the cost of living.\n\nSource: [Data Scientists Salary article](https://jobs-in-data.com/salary/data-scientist-salary)\n\nHere is the Top 10:\n\n&#x200B;\n\nhttps://preview.redd.it/jigjbhivs1rc1.png?width=1643&format=png&auto=webp&s=de294a1e3b4fdf46cbf30cfa64274aa3ae19a0dc\n\nHere is the full ranking:\n\n|Rank|City|Annual Salary|Annual Cost of Living|Annual Savings|N job offers|\n|:-|:-|:-|:-|:-|:-|\n|1|[Santa Clara](https://jobs-in-data.com/c-santa%20clara)|207125|39408|167717|537|\n|2|[South San Francisco](https://jobs-in-data.com/c-south%20san%20francisco)|198625|37836|160789|95|\n|3|[Palo Alto](https://jobs-in-data.com/c-palo%20alto)|182250|42012|140238|74|\n|4|[Sunnyvale](https://jobs-in-data.com/c-sunnyvale)|175500|39312|136188|185|\n|5|[San Jose](https://jobs-in-data.com/c-san%20jose)|165350|42024|123326|376|\n|6|[San Bruno](https://jobs-in-data.com/c-san%20bruno)|160000|37776|122224|92|\n|7|[Redwood City](https://jobs-in-data.com/c-redwood%20city)|160000|40308|119692|51|\n|8|[Hillsboro](https://jobs-in-data.com/c-hillsboro)|141000|26448|114552|54|\n|9|[Pleasanton](https://jobs-in-data.com/c-pleasanton)|154250|43404|110846|72|\n|10|[Bentonville](https://jobs-in-data.com/c-bentonville)|135000|26184|108816|41|\n|11|[San Francisco](https://jobs-in-data.com/c-san%20francisco)|153550|44748|108802|1034|\n|12|[Birmingham](https://jobs-in-data.com/c-birmingham)|130000|22428|107572|78|\n|13|[Alameda](https://jobs-in-data.com/c-alameda)|147500|40056|107444|48|\n|14|[Seattle](https://jobs-in-data.com/c-seattle)|142500|35688|106812|446|\n|15|[Milwaukee](https://jobs-in-data.com/c-milwaukee)|130815|24792|106023|47|\n|16|[Rahway](https://jobs-in-data.com/c-rahway)|138500|32484|106016|116|\n|17|[Cambridge](https://jobs-in-data.com/c-cambridge)|150110|45528|104582|48|\n|18|[Livermore](https://jobs-in-data.com/c-livermore)|140280|36216|104064|228|\n|19|[Princeton](https://jobs-in-data.com/c-princeton)|135000|31284|103716|67|\n|20|[Austin](https://jobs-in-data.com/c-austin)|128800|26088|102712|369|\n|21|[Columbia](https://jobs-in-data.com/c-columbia)|123188|21816|101372|97|\n|22|[Annapolis Junction](https://jobs-in-data.com/c-annapolis%20junction)|133900|34128|99772|165|\n|23|[Arlington](https://jobs-in-data.com/c-arlington)|118522|21684|96838|476|\n|24|[Bellevue](https://jobs-in-data.com/c-bellevue)|137675|41724|95951|98|\n|25|[Plano](https://jobs-in-data.com/c-plano)|125930|30528|95402|75|\n|26|[Herndon](https://jobs-in-data.com/c-herndon)|125350|30180|95170|88|\n|27|[Ann Arbor](https://jobs-in-data.com/c-ann%20arbor)|120000|25500|94500|64|\n|28|[Folsom](https://jobs-in-data.com/c-folsom)|126000|31668|94332|69|\n|29|[Atlanta](https://jobs-in-data.com/c-atlanta)|125968|31776|94192|384|\n|30|[Charlotte](https://jobs-in-data.com/c-charlotte)|125930|32700|93230|182|\n|31|[Bethesda](https://jobs-in-data.com/c-bethesda)|125000|32220|92780|251|\n|32|[Irving](https://jobs-in-data.com/c-irving)|116500|23772|92728|293|\n|33|[Durham](https://jobs-in-data.com/c-durham)|117500|24900|92600|43|\n|34|[Huntsville](https://jobs-in-data.com/c-huntsville)|112000|20112|91888|134|\n|35|[Dallas](https://jobs-in-data.com/c-dallas)|121445|29880|91565|351|\n|36|[Houston](https://jobs-in-data.com/c-houston)|117500|26508|90992|135|\n|37|[O'Fallon](https://jobs-in-data.com/c-o'fallon)|112000|24480|87520|103|\n|38|[Phoenix](https://jobs-in-data.com/c-phoenix)|114500|28656|85844|121|\n|39|[Boulder](https://jobs-in-data.com/c-boulder)|113725|29268|84457|42|\n|40|[Jersey City](https://jobs-in-data.com/c-jersey%20city)|121000|36852|84148|141|\n|41|[Hampton](https://jobs-in-data.com/c-hampton)|107250|23916|83334|45|\n|42|[Fort Meade](https://jobs-in-data.com/c-fort%20meade)|126800|44676|82124|165|\n|43|[Newport Beach](https://jobs-in-data.com/c-newport%20beach)|127900|46884|81016|67|\n|44|[Harrison](https://jobs-in-data.com/c-harrison)|113000|33072|79928|51|\n|45|[Minneapolis](https://jobs-in-data.com/c-minneapolis)|107000|27144|79856|199|\n|46|[Greenwood Village](https://jobs-in-data.com/c-greenwood%20village)|103850|24264|79586|68|\n|47|[Los Angeles](https://jobs-in-data.com/c-los%20angeles)|117500|37980|79520|411|\n|48|[Rockville](https://jobs-in-data.com/c-rockville)|107450|28032|79418|52|\n|49|[Frederick](https://jobs-in-data.com/c-frederick)|107250|27876|79374|43|\n|50|[Plymouth](https://jobs-in-data.com/c-plymouth)|107000|27972|79028|40|\n|51|[Cincinnati](https://jobs-in-data.com/c-cincinnati)|100000|21144|78856|48|\n|52|[Santa Monica](https://jobs-in-data.com/c-santa%20monica)|121575|42804|78771|71|\n|53|[Springfield](https://jobs-in-data.com/c-springfield)|95700|17568|78132|130|\n|54|[Portland](https://jobs-in-data.com/c-portland)|108300|31152|77148|155|\n|55|[Chantilly](https://jobs-in-data.com/c-chantilly)|133900|56940|76960|150|\n|56|[Anaheim](https://jobs-in-data.com/c-anaheim)|110834|34140|76694|60|\n|57|[Colorado Springs](https://jobs-in-data.com/c-colorado%20springs)|104475|27840|76635|243|\n|58|[Ashburn](https://jobs-in-data.com/c-ashburn)|111000|34476|76524|54|\n|59|[Boston](https://jobs-in-data.com/c-boston)|116250|39780|76470|375|\n|60|[Baltimore](https://jobs-in-data.com/c-baltimore)|103000|26544|76456|89|\n|61|[Hartford](https://jobs-in-data.com/c-hartford)|101250|25068|76182|153|\n|62|[New York](https://jobs-in-data.com/c-new%20york)|115000|39324|75676|2457|\n|63|[Santa Ana](https://jobs-in-data.com/c-santa%20ana)|105000|30216|74784|49|\n|64|[Richmond](https://jobs-in-data.com/c-richmond)|100418|25692|74726|79|\n|65|[Newark](https://jobs-in-data.com/c-newark)|98148|23544|74604|121|\n|66|[Tampa](https://jobs-in-data.com/c-tampa)|105515|31104|74411|476|\n|67|[Salt Lake City](https://jobs-in-data.com/c-salt%20lake%20city)|100550|27492|73058|78|\n|68|[Norfolk](https://jobs-in-data.com/c-norfolk)|104825|32952|71873|76|\n|69|[Indianapolis](https://jobs-in-data.com/c-indianapolis)|97500|25776|71724|101|\n|70|[Eden Prairie](https://jobs-in-data.com/c-eden%20prairie)|100450|29064|71386|62|\n|71|[Chicago](https://jobs-in-data.com/c-chicago)|102500|31356|71144|435|\n|72|[Waltham](https://jobs-in-data.com/c-waltham)|104712|33996|70716|40|\n|73|[New Castle](https://jobs-in-data.com/c-new%20castle)|94325|23784|70541|46|\n|74|[Alexandria](https://jobs-in-data.com/c-alexandria)|107150|36720|70430|105|\n|75|[Aurora](https://jobs-in-data.com/c-aurora)|100000|30396|69604|83|\n|76|[Deerfield](https://jobs-in-data.com/c-deerfield)|96000|26460|69540|75|\n|77|[Reston](https://jobs-in-data.com/c-reston)|101462|32628|68834|273|\n|78|[Miami](https://jobs-in-data.com/c-miami)|105000|36420|68580|52|\n|79|[Washington](https://jobs-in-data.com/c-washington)|105500|36948|68552|731|\n|80|[Suffolk](https://jobs-in-data.com/c-suffolk)|95650|27264|68386|41|\n|81|[Palmdale](https://jobs-in-data.com/c-palmdale)|99950|31800|68150|76|\n|82|[Milpitas](https://jobs-in-data.com/c-milpitas)|105000|36900|68100|72|\n|83|[Roy](https://jobs-in-data.com/c-roy)|93200|25932|67268|110|\n|84|[Golden](https://jobs-in-data.com/c-golden)|94450|27192|67258|63|\n|85|[Melbourne](https://jobs-in-data.com/c-melbourne)|95650|28404|67246|131|\n|86|[Jacksonville](https://jobs-in-data.com/c-jacksonville)|95640|28524|67116|105|\n|87|[San Antonio](https://jobs-in-data.com/c-san%20antonio)|93605|26544|67061|142|\n|88|[McLean](https://jobs-in-data.com/c-mclean)|124000|57048|66952|792|\n|89|[Clearfield](https://jobs-in-data.com/c-clearfield)|93200|26268|66932|53|\n|90|[Portage](https://jobs-in-data.com/c-portage)|98850|32215|66635|43|\n|91|[Odenton](https://jobs-in-data.com/c-odenton)|109500|43200|66300|77|\n|92|[San Diego](https://jobs-in-data.com/c-san%20diego)|107900|41628|66272|503|\n|93|[Manhattan Beach](https://jobs-in-data.com/c-manhattan%20beach)|102240|37644|64596|75|\n|94|[Englewood](https://jobs-in-data.com/c-englewood)|91153|28140|63013|65|\n|95|[Dulles](https://jobs-in-data.com/c-dulles)|107900|45528|62372|47|\n|96|[Denver](https://jobs-in-data.com/c-denver)|95000|33252|61748|433|\n|97|[Charlottesville](https://jobs-in-data.com/c-charlottesville)|95650|34500|61150|75|\n|98|[Redondo Beach](https://jobs-in-data.com/c-redondo%20beach)|106200|45144|61056|121|\n|99|[Scottsdale](https://jobs-in-data.com/c-scottsdale)|90500|29496|61004|82|\n|100|[Linthicum Heights](https://jobs-in-data.com/c-linthicum%20heights)|104000|44676|59324|94|\n|101|[Columbus](https://jobs-in-data.com/c-columbus)|85300|26256|59044|198|\n|102|[Irvine](https://jobs-in-data.com/c-irvine)|96900|37896|59004|175|\n|103|[Madison](https://jobs-in-data.com/c-madison)|86750|27792|58958|43|\n|104|[El Segundo](https://jobs-in-data.com/c-el%20segundo)|101654|42816|58838|121|\n|105|[Quantico](https://jobs-in-data.com/c-quantico)|112000|53436|58564|41|\n|106|[Chandler](https://jobs-in-data.com/c-chandler)|84700|29184|55516|41|\n|107|[Fort Mill](https://jobs-in-data.com/c-fort%20mill)|100050|44736|55314|64|\n|108|[Burlington](https://jobs-in-data.com/c-burlington)|83279|28512|54767|55|\n|109|[Philadelphia](https://jobs-in-data.com/c-philadelphia)|83932|29232|54700|86|\n|110|[Oklahoma City](https://jobs-in-data.com/c-oklahoma%20city)|77725|23556|54169|48|\n|111|[Campbell](https://jobs-in-data.com/c-campbell)|93150|40008|53142|98|\n|112|[St. Louis](https://jobs-in-data.com/c-st.%20louis)|77562|24744|52818|208|\n|113|[Las Vegas](https://jobs-in-data.com/c-las%20vegas)|85000|32400|52600|57|\n|114|[Camden](https://jobs-in-data.com/c-camden)|79800|27816|51984|43|\n|115|[Omaha](https://jobs-in-data.com/c-omaha)|80000|28080|51920|43|\n|116|[Burbank](https://jobs-in-data.com/c-burbank)|89710|38856|50854|63|\n|117|[Hoover](https://jobs-in-data.com/c-hoover)|72551|22836|49715|41|\n|118|[Woonsocket](https://jobs-in-data.com/c-woonsocket)|74400|25596|48804|49|\n|119|[Culver City](https://jobs-in-data.com/c-culver%20city)|82550|34116|48434|45|\n|120|[Louisville](https://jobs-in-data.com/c-louisville)|72500|24216|48284|57|\n|121|[Saint Paul](https://jobs-in-data.com/c-saint%20paul)|73260|25176|48084|45|\n|122|[Fort Belvoir](https://jobs-in-data.com/c-fort%20belvoir)|99000|57048|41952|67|\n|123|[Getzville](https://jobs-in-data.com/c-getzville)|64215|37920|26295|135|\n\n&#x200B;",
        "comments": [
            "This cost of living can't be accurate! These numbers are basically just rent!",
            "Cool list.\u00a0Deducting taxes will make it more accurate.\u00a0\u00a0",
            "Of course the best compensation is up north.\n\nI'm in LA but once I finish this MS in DS, I'm moving to the north ASAP.\n\n&#x200B;\n\nMy Firm has DS positions but they're hard to come by, it's one of THE SMALLEST teams we have on site. Seeing a Data Scientist is like seeing a unicorn.",
            "Lol, the entire top 10 is in the San Francisco Bay Area \ud83d\ude01 Wild!",
            "I'm sorry but I don't think this is accurate at all. I can tell by the fact Princeton, NJ is even on the list at all. $31k cost of living? Brother that's a night out in Princeton."
        ]
    },
    {
        "id": "1bqcci4",
        "datetime": 1711674406.0,
        "flair": "ML",
        "title": "How should structure my data to train GPT4 model to red line contracts?",
        "score": 3,
        "comment counts": 4,
        "content": "Hey guys so I\u2019m a Data Analyst training a GPT4 model at work to red line contracts for our legal team. \n\nI know I have to structure the data in chat completion format, I was thinking of structuring the data something along the lines of this - \n\nUser: Why was this paragraph red lined [insert paragraph]\n\nAssistant: this paragraph was red lined for [xyz reasons]\n\nI collected samples from contracts that have been already red lined and why they were red lined. After the model is trained I planned on giving the \u201cassistant\u201d in playground our red lining checklist, feeding it the contract, and seeing the results. \n\nI have tried a preliminary experiment with some other data to train a model (to get my feet wet) and got a training loss of 0.000 but the model was over fit. Then I retrained it with what it did wrong and got a 0.218. Not the best but definitely better. Was curious if any data scientists had some better methods to my approach. ",
        "comments": [
            "One way is converting the given text into a list of individual sentences.\n\n**For example:**\n\n    These Terms of Service (\"Terms\") govern your access to and use of the products and services (the \"Services\") provided by Example Company. By accessing or using the Services, you agree to be bound by these Terms...\n\nInto:\n\n    [\n        \"These Terms of Service (\\\"Terms\\\") regulate your access to and utilization of the products and services (the \\\"Services\\\") offered by Example Company.\",\n        \"By accessing or using the Services, you consent to adhere to these Terms...\"\n        ...\n    ]\n\nNext, compile a training dataset consisting of a sentence named **prompt** and their corresponding subsequent sequences as **message.**\n\n**For instance:**\n\n    [\n        {\n            \"prompt\": \"These Terms of Service (\\\"Terms\\\") govern your access to and use of the products and services (the \\\"Services\\\") provided by Example Company.\",\n            \"message\": [\"By accessing or using the Services, you agree to be bound by these Terms.\", \"...\"]\n        },\n        {\n            \"prompt\": \"By accessing or using the Services, you agree to be bound by these Terms.\",\n            \"message\": [...]\n        },\n        ...\n    ]",
            "check out [https://github.com/dtflare/GPTparser](https://github.com/dtflare/GPTparser) - it'll scrape and parse your data into Chat Completions format."
        ]
    },
    {
        "id": "1bpkhfj",
        "datetime": 1711592901.0,
        "flair": "Career Discussion",
        "title": "Cant land a job in Data Science",
        "score": 156,
        "comment counts": 187,
        "content": "I quit my job in an unrelated field to pursue my dream and failed. I thought I would make it but I didnt.\n\n\nThis is not a rant. Im looking for advice because I feel pretty lost. I honestly dont feel like going back to my field because I dont have it in me. But I cant stay jobless forever. Im having a mental breakdown accepting I may not get into DS so soon because Ive made so many projections about future me as a data guy. Its not easy to let go of them.",
        "comments": [
            "What is your education and what was the unrelated field? What have you done in order to make the transition? How old are you?",
            "Right now everyone and their mother is trying to get into data science. The field is flooded which makes it hard for highly qualified candidates to get a job. \n\nI don't know your qualifications but I'm guessing they aren't stellar (I'm not trying to insult you here). I have a master in Statistics and a decade of Data Science work including running teams and it took me 6 months and over 100 applications to land a position last time I looked, which was a few years ago. I've heard the market is even tougher now. \n\nIf you just have some online classes you won't land a position as a Data Scientist. You will have to lower your expectations and target a Data or Business analyst position to get your foot in the door.",
            "Shoot for an analyst role and keep trying for a ds job.",
            "The market is heinous right now. Try not to take it personally. Without knowing your qualifications, I can't really give you meaningful advice.  \n\n\nIt took me 500+ applications 2 years ago with a master's degree and some research under my belt.",
            "Therapy might be a good option. \n\nBreaking into a data is not an easy overnight journey, I\u2019ve failed countless times. Fail, learn and adjust."
        ]
    },
    {
        "id": "1bpev7u",
        "datetime": 1711578000.0,
        "flair": "Career Discussion",
        "title": "Found a company asking for high school certificates for a Data Scientist role.",
        "score": 102,
        "comment counts": 28,
        "content": "&#x200B;\n\nhttps://preview.redd.it/2qy68tawbyqc1.png?width=1322&format=png&auto=webp&s=2e9d875eb6fb7d11e14e9e1d7fa91180c6f67eb8",
        "comments": [
            "Going to record a video of my high school grade list brb",
            "Make sure it's not identity theft.",
            "Personal motivation video is crazy too",
            "I\u2019m guessing they ask this for all candidates not just DS roles. \n\nMy high school doesn\u2019t even exist anymore, where would I even get a transcript.",
            "I barely apply if it's a work day resume and I have to create a new account for each employer."
        ]
    },
    {
        "id": "1bpnem1",
        "datetime": 1711602156.0,
        "flair": "Career Discussion",
        "title": "Thinking of dipping out of DS for an MBA. Legit looking for advice.",
        "score": 24,
        "comment counts": 53,
        "content": "Edit: Heads up! I did post recently but I want to know your thoughts on making this switch.\n\nOk so I\u2019m in my late 30s and I feel I\u2019ve hit a wall. I work for a tech company as an IC and I make a decent salary, just over 200k a year, but I look up and I only see roadblocks ahead of me. Before going back to school to get my MS I had  a BS in CS but was serving as an officer in the Army leading division level staff. I loved my time in the army but wanted to progress, hence the transition. \n\nData science has been super interesting but progression seems so difficult to come by and you make no strategy decisions for your organization, at least in my case. \n\nI\u2019m thinking of backing up and going down another path. That said I also have financial obligations including a mortgage and I\u2019m fresh out of GI Bill. \n\nIs this worth the opportunity cost if I go to a good school? Is it worth it at all? Is there another path to take that would make more sense? I appreciate the insight.",
        "comments": [
            "Am in DS, I have an MBA as well as Masters in CS.\n\nMBA in the current day means f*ck all, jack sh!t, diddly squat.\n\nDon't do it.",
            "It depends on your end goal.\n\nDo you want to climb the management ladder and eventually lead an organization (or a company) in the future? If yes, an MBA is a good idea. An MBA could provide leadership opportunities across various business functions like marketing, DS, operations, finance, product etc. \n\nDo you want to remain in DS or move to DS management? If yes, an MBA may not be necessary to achieve that goal. You can switch companies to get new experience or to get a promotion if your current job does not satisfy your needs. But this route may pigeonhole you in DS career path for a long time. \n\nIf you decide to do an MBA, you can consider part-time/online programs to enable you keep your job (you don\u2019t have to go full time). There are a few <$50K options from good schools. If you\u2019re not cashflow positive from your $200K+ salary, you can switch jobs to make more money, maintain your cost of living and use the extra cash to fund the MBA.",
            "I did it at ~45yo. I did a state college for around $17k total. (Work paid for most)\n\nWhat I learned (and practiced) was\u2026\n- an appreciation for strategy\n- supply chain \n\nThere were obviously other things I learned but those are the two that stuck for me. \n\nWere the three letters in my signature line and learning worth 17k? Maybe. Considering I will only get 10ish years from it. I would never have gone for a 100k MBA. \n\nWill work pay for any of it?",
            "Since it's not 1995, the answer to \"should I get an MBA\" is a simple _no_.\n\nHave you discussed with your manager a desire to take on managerial responsibilities?",
            "\\[You might be a great Product Manager\\]\n\nIf you think you are hitting a wall then do go for an MBA. Given your leadership background in the army you have a very strong chance of making it to a top B-school (the MBA programs aims for diversity in their incoming batches to ensure holistic learning for their students). \n\nSince you have a CS and DS background, you would be great at Product Management in a Tech company.  PMs (not to be confused with program managers) also get paid very well, and have a lot of say in the vision and strategic decisions for the product. \n\nIs MBA from a top B-school worth financially? -- Yes (You can also enroll in their executive programs that run during the weekend, allowing you to both work and study)\n\nIs MBA from a top B-school worth financially for you? -- only you can decide based on your financials."
        ]
    },
    {
        "id": "1bp6qfz",
        "datetime": 1711558398.0,
        "flair": "Discussion",
        "title": "Is it just me, or have there been a lot of data science job postings lately that require skills in data engineering?",
        "score": 135,
        "comment counts": 97,
        "content": "Not only with job postings, but I know a few individuals who work as data scientists at reputable companies, and often they are tasked with the responsibilities of a data engineer. I believe the issue stems from a lack of data literacy among companies and data managers.\n\nIn terms of job postings, most of them require extensive experience in SQL, data cleaning, ETL, Pipelines and data quality-related tasks, which I believe fall within the realm of data engineering. I would like to hear your thoughts on this. Have any of you experienced something similar or perhaps dealt with it firsthand?",
        "comments": [
            "IMO SQL and data cleaning/wrangling seems like foundational skills across the family of data analyst/data scientist/data engineer titles\n\nI think complex ETL/ELT data pipeline work would typically fall under the data engineer role",
            "I would expect data engineers to be responsible for the core data warehouses, data ingestion pipelines, infra/data platforms, and BI tools.  But dude, SQL and data cleaning are table stakes for any kind of analyst or data scientist.",
            "Data scientists who can\u2019t do DE tasks are going to be redundant",
            "i think the majority of data science job postings for many years now have been mostly data engineering or data analyst jobs. \n\ncompanies are now just being more honest and also trying to make sure there\u2019s a clear distinction in the description between an ML/AI Scientist and a Data Scientist to attract the right applicant for each role.\n\ntldr: data engineering being a majority of a data scientists work at many companies isn\u2019t really new, companies are just being more clear about it.",
            "Hot take: Very few organizations need \u201cpure\u201d data scientists who don\u2019t perform DEng on the backend or analytics on the frontend."
        ]
    },
    {
        "id": "1bpcc6l",
        "datetime": 1711571877.0,
        "flair": "Challenges",
        "title": "Dumb question but do data scientists make an effort to automate there work?",
        "score": 53,
        "comment counts": 42,
        "content": "Lowly BI person here -- just curious outside of maths, data modeling, and drinking scotch in the library,  do data scientists make an effort to automate their work? Like are there tools or scripts you all are building to be more efficient or is it not really a part of the job? ",
        "comments": [
            "Yes? I'm kind of confused by the question, on some level automation is the entire job. I'm building predictive models and the final product is an automated pipeline to generate new predictions as new input data comes in. Or I'm building automated reports (i.e. accuracy) about said models. Like I'm doing math and modeling and drinking scotch but the thing I'm actually making when I do all that is code.",
            "Sometimes you get questions that warrant exploratory analysis and don\u2019t require being automated. Those are pretty flexible in terms of what you want to use (R or Jupyter Notebooks, your tool of choice). \n\nBut when you need to put something into production and run it regularly, part of the job is working with your data engineers and infrastructure guys to integrate your work into whatever pipeline your company has. That means writing readable code and following SWE best practices to make life easier for everyone else who is going to have to look at your work and, more likely than not, debug when something doesn\u2019t quite run.",
            "The short answer is yes, we definitely automate our work. The reason? We\u2019re essentially trying to be efficient (or maybe just a bit lazy in a smart way). Automating tasks with Python scripts and scheduling tools like Apache Airflow helps us manage the overwhelming amount of data we face daily. It's a survival tactic to prevent drowning in data.\n\nIn fact, as personal rule, I attempt to automate anything I do more than once, and I'm not opposed to paying for a tool if it saves me or my team a significant amount of time. A few automation tactics and tools we keep handy (in addition to the bottle of single malt). \n\n* Python/R for Data Analysis: Essential for any data manipulation, statistical analysis, and machine learning.\n* Jupyter Notebooks: Great for prototyping, exploration, and sharing findings with others.  \nPandas/Numpy/Scipy: Key Python libraries for data manipulation, numerical, and scientific computing.\n* SQL: For data querying from databases. Knowing how to write efficient queries is a huge time-saver.\n* Apache Airflow/ Prefect: For orchestrating and automating data pipelines.\n* Git: For version control, especially when working on projects with a team.\n* Docker: Helps in creating reproducible environments, easing the transition from development to production.\n* Tableau/Power BI: For quick and effective data visualization and dashboards.\n* Rollstack: For report automation\n* Scikit-learn/TensorFlow/PyTorch: Popular libraries for machine learning and deep learning.\n* Dask: For parallel computing in Python, useful for working with large datasets that don\u2019t fit into memory.\n\nI'm sure there are more I missed. If you're wondering how the other half lives, we sure as heck automate, in addition to the scotch drinkin' :)",
            "First, I write the code. Then, I write automation/encapsulation scripts around the code so I hopefully never have to look at the code again.",
            "It really depends on what part of the job we're talking about. If you've got some manual processes that you're doing by hand every day/week, then hell yeah, those should definitely be automated. Everyone I know is doing that these days.\n\nNow, when it comes to answering business questions or actually running the analysis/creating new models where you need to write up some code, automation can come into play in a different way. For instance, you can use ChatGPT to write R and Python code for you. You can also use some text-to-SQL tools to write your SQL queries. I probably save around 5hrs a week just by doing that.\n\nBut, of course, I\u2019m biased \u2014 I'm a co-founder of [datalynx.ai](https://datalynx.ai) and I use it together with Claude to write about 80% of the code I need. But it doesn't really matter what you use. I've found that writing code with an AI co-pilot these days is the biggest time-saver out there."
        ]
    },
    {
        "id": "1bpjsas",
        "datetime": 1711590911.0,
        "flair": "Education",
        "title": "Best Supplementary Texts for Casella and Berger are \"Introduction to Probability\" (first half) and \"The Simple and Infinite Joy of Mathematical Statistics\" (second half)",
        "score": 14,
        "comment counts": 6,
        "content": "Making this post to help current and future grad students. I'd also to help departments choose good supplementary books for non-pure statistics programs, like biostatistics, economics, psychometrics, etc.  I'm working through the tail end of a first year graduate mathematical statistics sequence. I only had an intro probability course using \"Mathematical Statistics with Applications\" by Wackerly beyond the typical math courses in engineering BS programs. Neither is as rigorous as \"Statistical Inference\", but they're close. I've found \"Statistical Inference\" clearer as a second read.\n\nMy two suggestions are \"Introduction to Probability\" by Blitzstein and Hwang for the first half. The STAT 110 lectures are on YouTube. The book is very conversational in tone.\n\n\"The Simple and Infinite Joy of Mathematical Statistics\"  by JN Corcoran is amazing for the statistical inference half  for many of the same reasons as \"Introduction to Probability\". It's expensive, but I think its privately published. My paperback from Amazon came with a stamp saying it was made two days prior. Her writing style is incredible and she genuinely makes the reader excited to learn. She also had a playlist on YouTube for CU Boulder.\n\nedit: grammar ",
        "comments": [
            "I've considered Corcoran's book, but I'm on the fence and I can't even find a table of contents to look at before dropping $80. Glad you like it. Maybe I'll take the plunge. I really need to review (really learn, lol, that was a \"cram and forget\" class for me in college) my math stats.\n\nOh, and I'll second the recommendation for Blitzstein and Hwang's book. It's a well-written and comprehensive tour of non-measure theoretic probability with good exercises. edit: and you can read (but not download/print) a free copy at http://probabilitybook.net"
        ]
    },
    {
        "id": "1bpkfvd",
        "datetime": 1711592774.0,
        "flair": "Career Discussion",
        "title": "What title would you describe this position as?",
        "score": 11,
        "comment counts": 19,
        "content": "I work as a data analyst at a fintech company, but my roles seem different than those of a typical analyst. Would these duties fall under data analyst, data scientist, data engineer, or something else?\n\n- I mainly work in the Unix terminal, I almost never look at a GUI for my job\n- I write a lot of bash code to extract data and clean it\n- when I do work with Python, I only write code using vim. No jupyter notebooks, no anaconda, nothing else. I write Python code mainly to streamline data retrieval, filtering, writing to new directories, etc \n- most of the data I handle is text based. Almost no numerical analysis\n- with that being said, my team is very perl heavy and it is our main tool to automate gathering large text based csv files\n",
        "comments": [
            "All I see as far as a job \u201cfunction\u201d is you extract data and clean it. Sound like a data engineer if that\u2019s the core part of your job.",
            "You didn't tell us much about what you do with the data, you described how you do things but that's pretty much orthogonal to to the job title. You did say extraction and cleaning so data analyst sounds about right to me.",
            "Data and software engineer using tools of hell. I can't imagine using vim as my daily driver for python code \ud83d\udc80",
            "Data Developer/Engineering",
            "data engineer"
        ]
    },
    {
        "id": "1bpe2fn",
        "datetime": 1711576074.0,
        "flair": "Statistics",
        "title": "Causal inference question",
        "score": 25,
        "comment counts": 24,
        "content": "I used DoWhy to create some synthetic data. The causal graph is shown below. Treatment is v0 and y is the outcome. True ATE is 10. I also used the DoWhy package to find ATE (propensity score matching) and I obtained \\~10, which is great. For fun, I fitted a OLS model (y \\~ W1 + W2 + v0 + Z1 + Z2) on the data and, surprisingly the beta for the treatment v0 is 10. I was expecting something different from 10, because of the confounders. What am I missing here? \n\n&#x200B;\n\nhttps://preview.redd.it/ve6753p75yqc1.png?width=458&format=png&auto=webp&s=0935bbb15fba1dc63bdb3f8f445dca73fa2988e9",
        "comments": [
            "My causal inference is super rusty so I don\u2019t have a confident answer for you. \n\nMy recollection is that you did exactly the right thing by controlling for the confounders and that\u2019s why they don\u2019t bias your estimate. This is in contrast to how we might deal with colliders, which is a bit messier. \n\nZ0 and z1 only interact with your dependent variable through v0, so I would expect their effect is already expressed by the coefficient on v0 and they might just be statistically insignificant. \n\nAlso just wanted to say I\u2019m SO glad you posted this question. We need to be doing more causal inference in data science departments!",
            "The reason why your regression model outputted the correct causal treatment effect of 10 is because regression adjustment is in fact a method for adjusting for confounders, alongside methods like matching, weighting, etc.,\n\nIn the causal inference literature, the method of using regression to control for confounders is referred to as \"outcome regression\". However, this is not as popular as other methods like matching because they share similar weaknesses but have an additional weakness of requiring the assumptions of parametric form of the model to be correct, which was not an issue in your case because of how you simulated the data (i assume). A strength of matching is that it promises to reduce (or optimistically, eliminate) model dependence, which you can read about at Ho et. al (2007)\n\nIn practice, matching is actually used together with outcome regression, so nowadays it's less about \"choosing\"",
            "Yes it would be great to see some more resources for this",
            "This is a great demo. OLS effectively controls for everything in the problem, whether or not it's a confounder. That *can* lead to problems if:\n\n\\* You're accidentally conditioning on colliders, or  \n\\* It's a very high dimensional problem that would require regularization",
            "The top comment (at time of writing) is right that your adjustment set shouldn't bias your results. But you shouldn't be conditioning on the z0 and z1 variables because they explain variation in v0, which will make it more difficult to measure the effect on y. \n\nBecause the confounder of W1 and W0 are forks, you should condition on them to remove them as confounders. That was fine."
        ]
    },
    {
        "id": "1bpodhu",
        "datetime": 1711605812.0,
        "flair": "DE",
        "title": "Data for LLMs, navigating the LLM data pipeline",
        "score": 2,
        "comment counts": 13,
        "content": "Tons of articles about LLMs, yet when I wanted to read about the data pipelines, it was hard to find a resource that curated things I wanted to know about LLM data pipelines. As we all know, it\u2019s the huge amount of data that makes LLMs possible, so here\u2019s a blog I wrote after satisfying my curiosity.\n\n&#x200B;\n\n[https://medium.com/@abhijithneilabraham/data-for-llms-navigating-the-llm-data-pipeline-23a449993782](https://medium.com/@abhijithneilabraham/data-for-llms-navigating-the-llm-data-pipeline-23a449993782)",
        "comments": [
            "It was a nice and comprehensive read! Especially in this age where most people and companies are using pretrained llms for their workflows, knowing how to build and optimise one using the data helps a lot.",
            "Enjoyed the read. Student here. Any reason you didn\u2019t choose Azure Datalake or warehouse in your last section \u2018Storing training data\u2019?",
            "Good one",
            "Nice article! I've noticed this throughout my career in tech that whenever I'm tackling a new focus that requires new skills there are barely any available resources that cohesively describe processes and workflows from the ground up.",
            "Me too"
        ]
    }
]