• Cat Hicks

Walking through a survey analysis

A cool tree I saw while taking a break from this analysis

I want to share a little more about very practical experiences in applied research. Despite endless online resources on statistics and data work, I don't know that we talk enough about research design in applied work. Research design is central, but I understand why it’s harder to put it into scalable online teaching. It’s really contextual, and full of decision making. But I believe that you can bring a strategic, evidence science mindset even to the smallest data tasks.

So I thought I'd take you along on a very small survey analysis I did the other afternoon. I'll do my best to make this a think-out-loud. Along the way, I’ll try to do a (haphazard) exploration of different evidence science elements that I think about while I work.

Even though I often work with far more complex models, or more complex data collection situations like behavior in digital products, surveys are a very bread-and-butter type of analysis for me. Everybody does surveys. They’re imperfect, but for a lot of scenarios they’re the only tool you have to gather evidence. Frequently what makes a survey interesting are the people the survey found. I really enjoy analyzing them and thinking about survey design. This one was bonus fun, because it’s a piece of an academic project that I’m excited about. We’re looking at people's experiences with discipline-specific coding, CS education, and computational research careers. It’s bonus bonus fun, because I’m collaborating on it with my wife (!!).

1. Intake

First thing I do is an intake meeting: in this case, I sat down with the survey author (& our dog).

The first part of intake is logistical. This is obvious, but I’m trying to write down obvious things here–you have to make sure you know what the variables in your dataset are. I had the text of the survey (in fact, I helped draft it) but I hadn’t run the survey myself. Along the way, some questions had gotten named peculiar things. Constraints of the platform the survey was run on had changed the format of some questions (this often happens). This is where social science > “data science”: you have to check that you know exactly what participants saw. I knew the text, but needed to map each variable to its original version. I am quite detail-oriented from the moment I get a dataset because this is where tiny silly mistakes can have big consequences. And of course because this was the real world, there were also missing and incomplete responses (more to come on silly mistakes here!).

The second part of intake is conceptual. This is where the fun is for me, where I bring a lot of social science thinking to how we might analyze a survey. This survey dealt with topics that I know quite a lot about (education, learning experiences, metacognitive beliefs about learning, and learning to code specifically), so I didn’t need to do as much ramp-up as I might normally. Of course, I often design surveys from start to finish for clients. But not always; many organizations have surveys they're not quite sure how to understand, or a backlog of old data that no one really explored. One of the things I specialize in is helping organizations learn to maximize the understanding they already have--so I love digging into the history of data at an org. For unfamiliar surveys, here are some questions I might ask:

  • Walk me through how/if you’ve looked at this data before

  • Tell me if anything surprised you

  • Tell me if something confirmed your expectations

  • Are there any “types of participants” or “types of answers” you might expect to see on this survey

  • Are there relationships you expect between these different measures? (e.g., “we think people who say they like x on question 3, probably won't say they like y on question 5”)

  • Is there a subgroup I should be aware of and look at separately (e.g., “we’re really interested in college students, but we’re not sure how many answered this survey”/ “can you tell us if we can say anything about managers”)--> this question often tells me that I need to put some work into making a recommendation about strength of evidence for a client

  • If we found (x), what do you think you would change/do next? If we found (x), can you give me an example of how you'd tell a story about it? --> this is a type of question I ask a lot, not just at this stage. I am constantly reaffirming my own understanding of how partners are going to use the data that we work on together, which is both an ethical and a logistical check.

  • Are there questions on here that you have doubts about? Where you wonder if participants didn’t understand them or are answering a different question than you intended? Is there anything you wish you’d asked about, that you didn’t? Or anything that you thought was impossible to ask about? -->these questions are super useful, and I ask them because I typically consult on people’s long-term research strategy. If I were just turning around results and pressed for time, this is the kind of can of worms you might not want to open. But people usually work with me not just because I can give them some stats, but because I make their next survey better.

Surveys might not feel complex compared to translating over-time logstream data or digging through enormous public datasets, but people analyze them in lots of different ways. When you’re an applied researcher, you have to be constantly thinking about doing something that is fit for use. I have standard approaches and principles of analytical rigor that provide a core checklist across projects, but one of the huge values I bring to partners is empathic research strategy. At Catharsis, this is also key to our research ethics. Care about the context of evidence drives everything else.

It's also just good storytelling. Just as data collection moments are constrained by time and resources, so is your analysis. One person might need a compelling story about the descriptives on a large survey–so you might spend your shared time on polish, data visualization, data cleaning on the right descriptives. One person might need to design a high stakes survey that they want to draw from robust scales and be able to query for future predictive relationships–you might spend your shared time on researching the literature, statistical evaluation and item analysis. And the reality in applied work is that most clients I work with don’t know which they need. That’s strategic direction I provide, matching their goals to the right approach.

If it’s really a novel or complex topic, at this stage I almost always do some digging into research literature to find analytical examples or measurement issues known to experts (please, make more journal articles open for people like me!). This is even more helpful when you are working with partners in different fields from your own training, where there are different expectations for how evidence is presented. In this project, we found a pretty clear comparison from a study with a similar sample size and similar questions.

2. Exploring

A bit of exploration had already happened on this data, enough to know there were potentially interesting trends with group differences in who did computational research in this field. Descriptives are useful in themselves, but we wanted to explore stats behind those descriptives. This was a pilot study. We knew we were going to design another survey, so there was strategy work to figure out which questions were going to be most important to ask about in future rounds of data collection.

I also had an idea about a new question. Some questions were about taking coding courses and level of code experience. But there was also a question about comfort with coding that I thought was potentially pretty interesting. Since participants answered questions about multiple timepoints, that gave us a within-participant story. I made a note to look at how coding comfort changed over time. I often find the most useful variables are the holistic ones we can operationalize from fitting important, related pieces of the puzzle together. I always try to think about change over time holistically. This is also a perspective from being an applied researcher: in real world data you are usually dealing with highly correlated variables, and multiple measures of similar things. You are less likely to be looking at very distinct, lab-designed, separate concepts. :) This means exploring correlations and mapping the ways your variables build each other conceptually is super important.

I always start by summarizing the dataset, and looking at a few of the counts of the more interesting variables. I never change the dataset while I do this (social scientists are very allergic to changing data). We already had descriptives on the variables, so mostly, I did straightforward things like counting up different responses and looking at correlation patterns.

Here was a silly mistake. In order to count some of the variables where there was complete data, I dropped NAs in this particular exploration. That is not a problem, but it made me forget to explore the missing responses, which is a very standard thing that I always do, and simply blanked on at this moment. I will fix it later!

Next I did some data cleaning. Here was not a silly mistake: I re-generated the full descriptives on each variable, ranges, means, etc. This revealed several weird outliers. One wasn’t technically an outlier, just unusually rare–a decimal point answer where no other responses had a decimal point. Was that allowed on this survey? I always note unusually rare responses and check on them. But another one was an outlier of scale: a participant had answered “23” on a 1-5 point scale! I didn’t even understand how that was possible.
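A minimal sketch of that kind of check, in plain Python with made-up data. The item values, the 1-5 bounds, and the function name `audit_item` are all assumptions for illustration, not the project's actual script. The point is that odd answers get flagged with their row position so they can be traced, and nothing is deleted.

```python
# Hypothetical raw answers to one 1-5 scale item, as exported text.
responses = ["4", "3", "23", "5", "2.5", "1", "", "4"]

SCALE_MIN, SCALE_MAX = 1, 5  # assumed bounds for this item


def audit_item(values, lo, hi):
    """Sort raw answers into missing, out-of-range, and non-integer buckets.

    Nothing is changed or dropped; each flag keeps the row index and the
    raw text so the answer can be traced back to its source.
    """
    report = {"missing": [], "out_of_range": [], "non_integer": []}
    for row, raw in enumerate(values):
        if raw.strip() == "":
            report["missing"].append(row)
            continue
        number = float(raw)
        if not lo <= number <= hi:
            report["out_of_range"].append((row, raw))
        elif not number.is_integer():
            report["non_integer"].append((row, raw))
    return report


report = audit_item(responses, SCALE_MIN, SCALE_MAX)
for kind, flagged in report.items():
    print(kind, flagged)
```

Keeping the row index alongside the raw text is what makes the next step possible: going back to the source file and the survey platform to find out how the answer got there.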

Some approaches just delete responses like this without ever tracking their source down. People act like they don't, but they do. Frankly, this leads to a lot of hideous problems in data science. I imagine it happens because people are pressed for time and not properly rewarded for diligent data cleaning. Sometimes it's excusable when you know you're dealing with a super noisy dataset and you expect a certain amount of error from the way the data was instrumented (but when I find myself in that situation, there are some big questions about what kind of data work is even appropriate to do if your system is literally generating garbage). Well, with people data, you always have to be very careful about understanding how outliers happened. A response outside of what should be possible could mean a transcription problem: aka, the accuracy of the whole dataset is thrown into doubt. Even data that looks “right” could be wrong.

3. Analysis….wait, no, more intake.

Ok, so I went to track it down, because that's the kind of work I value. My dataset was a csv subset of a larger csv, which was stored in a drive, which originated from the survey platform. I try to trace back through each “level” of datasharing, when I have collaborators, because if someone else is working through an erroneous dataset, it won’t help them if you only solve it on your local copy.

I know this is wildly obvious stuff to the engineers among us who do a lot of version control, etc. In larger data work, I would be dealing with a more formal system of change management built to be resilient to errors and be able to push corrections to everyone collaborating. But real world again, many applied research projects are one-offs and a surprising number of orgs don't have good processes on this. Especially because of my work partnering with organizations that are just learning to do research, or because of how human behavior research is sometimes treated like an 'afterthought' even in engineering environments, I often live with CSVs sent in one-off emails or folders. Managing this isn’t always something we talk about to junior researchers or junior analysts, so here's me trying to talk about that.

I checked my file, then the file in our drive, and found the same outliers. I documented it in the online file so anyone looking there would see this was WIP I was chasing down, to be corrected. It must have originated with the survey platform, so I had to track down access to the participant-facing view on the survey platform itself. Turns out, the survey question was written with a text box, not a radio button. The “23” participant had put in “2,3”. I think this was their attempt to write 2-3 on a 5 pt scale.

Classic! You want your data collections designed with totally foolproof UX for participants, but they never are. Thankfully this was not a hard thing to fix and our other participants were remarkably talented at selecting the right boxes. Did some QA on whether people used the terrible text boxes accurately, checked in with collaborators, and we agreed 2.5 would be an accurate representation of this person’s answer. This also explained the decimal.
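One way to record a fix like this, sketched in plain Python with made-up participant IDs. The `RECODES` table and `clean_answer` helper are hypothetical names, not the project's code. The idea is that the raw text stays untouched, the agreed value lives in a separate field, and the reason travels with it.

```python
# Hypothetical rows: (participant_id, raw text typed into the text box).
raw_answers = [("p01", "4"), ("p02", "2,3"), ("p03", "5")]

# Recodes agreed with collaborators, keyed by the raw text, each with a reason.
# Only answers the team has actually discussed belong here.
RECODES = {
    "2,3": (2.5, "typed a range of 2 to 3; team agreed on the midpoint"),
}


def clean_answer(raw):
    """Return (value, note). The note is None when no recode was applied."""
    if raw in RECODES:
        return RECODES[raw]
    return float(raw), None


cleaned = []
for participant_id, raw in raw_answers:
    value, note = clean_answer(raw)
    # Keep the raw answer next to the cleaned one so the decision is auditable.
    cleaned.append({"id": participant_id, "raw": raw, "value": value, "note": note})

for row in cleaned:
    print(row)
```

An explicit lookup table is deliberately less clever than a parser that guesses at every odd string. Each recode here is a decision a person made, and the table makes that visible.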

But even this mundane error held some research strategy: I had a thought here about including confidence ratings on our next survey. I love confidence ratings, multidimensional questions, and being able to represent variance and participants' relationships to the questions that you are asking. There is actually a whole can of worms on evaluating whether your surveys are being broadly understood and whether we can generalize that understanding across participants, but suffice it to say it is often super illuminating to be able to divide participants into “very certain about their answer” and “very uncertain.” I thought this would be particularly relevant in light of the metacognitive questions about learning on this project, so I jotted down some recommendations.

Noodling thought: notice how even when dealing with pretty simple methods, evidence decisions have to happen. Those decisions can reveal what we think about the concepts we’re measuring, and being transparent about documenting and moving forward with those decisions is critical to good collaboration. Second noodling thought: imho you are a huge value add to a team when you can build strategy that isn’t divorced from mundane small tasks, but actually arises from the “mundane” work. That’s why I dwell so long on seemingly trivial examples like when participants give unexpected answers because they were trying to force a scale to represent what their answer really was. Participants are always commenting on our methods, and we should be open to learning from that, not just forcing our methods on them.

Ok, remember that bit above about coding comfort over time? Next I created that variable. I also created a measure of how much (and in what direction) the comfort changed. This gave me several things to look at: the level of comfort someone started at (e.g., you could start high or low), the level of change (you could change a lot or a little), and the nature of the change (someone could maintain high comfort, or start high and then drop in comfort, or start low and increase in comfort). All of these are mutually dependent; aka these are entangled measures that help create each other. So we would never put all of these together in, e.g., a regression. But they provide useful descriptives, and going through this process of operationalization can help to create the best, most accurate version of the variable that you are trying to ask about. Yet another research design thought: even surveys that are already written can still have new operationalizations hidden inside them. Actually in applied work, I often find that people gather a ton of multicollinear measures and can benefit a lot from the ways social science can statistically describe, pull apart, and combine those measures into more coherent variables.
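A rough sketch of deriving these descriptives from two ratings, in plain Python. The 1-5 scale, the cutoff for "high," and the names `describe_change` and `high_cutoff` are assumptions for illustration. The post doesn't say how the actual variables were coded.

```python
def describe_change(start, end, high_cutoff=4):
    """Derive descriptive measures from two comfort ratings (assumed 1-5).

    These measures are entangled by construction: start + change == end,
    so they are meant as descriptives, not as joint regression predictors.
    """
    change = end - start
    if change > 0:
        direction = "increased"
    elif change < 0:
        direction = "decreased"
    else:
        direction = "stable"
    return {
        "start_level": "high" if start >= high_cutoff else "low",
        "change": change,
        "magnitude": abs(change),
        "direction": direction,
    }


# Hypothetical participants: comfort at an earlier and a later timepoint.
examples = {"p01": (5, 2), "p02": (2, 4), "p03": (4, 4)}
for participant_id, (earlier, later) in examples.items():
    print(participant_id, describe_change(earlier, later))
```

Splitting "how much" from "which direction" makes it easy to describe patterns like "started high and stayed high" separately from "started low and climbed," even though both come from the same two numbers.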

4. ACTUAL analysis, for real for real

After all that, I copied my statistical templates into our project script. These are lines of code I carry from project to project — these days I rarely start with a blank file. It is also a cognitive tool to give yourself a skeleton to work from. It’s like the plot outlining for fiction writing, breaking the blank page. If you’re a junior analyst or researcher, I urge you to think about developing a personal library of skeletons (that sounds creepy but awesome, very Gideon the Ninth). Every project you do can create not just findings, but methods scripts that you carry with you. This is also a handy cognitive (+emotional) tool to pull value from things that don’t yield the “findings” you want. (here, I will allow that data scientists could teach social scientists, lol)

I immediately found out that I was still missing some key variables, ffs, despite all that intake. Somehow we had missed including a demographic variable that told me what participants’ current jobs were. Pretty important. Even data-savvy people will sometimes drop variables, not keep track of earlier versions of files, and assume you can or want to infer things from other variables. In this case I actually could have inferred people’s jobs based on their responses to certain questions (which were tied to jobs) but I prefer to be a PITA to collaborators in the beginning of a project rather than the end. So my first task was merging in the demographics from a new file. New versions of all the exploration resulted (sigh).
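A small sketch of a merge like this in plain Python, with made-up IDs and job labels. The structure and names are hypothetical. The part worth copying is the bookkeeping: confirm no rows were gained or lost, and list anyone who didn't match instead of letting them silently disappear.

```python
# Hypothetical survey rows and a separate demographics lookup keyed by ID.
survey_rows = [
    {"id": "p01", "comfort": 4},
    {"id": "p02", "comfort": 2},
    {"id": "p03", "comfort": 5},
]
demographics = {
    "p01": {"job": "postdoc"},
    "p02": {"job": "engineer"},
}

merged, unmatched = [], []
for row in survey_rows:
    demo = demographics.get(row["id"])
    if demo is None:
        # Keep the participant, mark the field as missing, and record the gap.
        unmatched.append(row["id"])
        demo = {"job": None}
    merged.append({**row, **demo})

# A left join should never change the number of survey rows.
assert len(merged) == len(survey_rows)
print("participants without demographics:", unmatched)
```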

Now it’s time to remember that I promised a silly mistake about NAs: I blithely ran a logistic regression with some variables that were full of missing data, dropping the sample size unreasonably. After all that talk about assumptions, I assumed that because we had very complete responses for many of the variables, we had it for others. Worse, I assumed this about an academic variable, GPA. Well, turns out many people do not remember their old GPAs (if you are a student perhaps you can find this comforting). But I am used to working with enormous education datasets where variables like “GPA” are pulled from school records, not from self-report. So this was a collision of some of my experience with the reality of this data. I humbly include this just to tell any students reading that the more things change, the more they stay the same etc, and mistakes are inevitable.
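A quick check that would have caught this, sketched in plain Python with made-up rows and variable names. Many modeling functions drop incomplete rows by default, so counting the complete cases for a given set of predictors before fitting shows how much of the sample the model will actually see.

```python
# Hypothetical rows; None marks a skipped question.
rows = [
    {"comfort": 4, "courses": 2, "gpa": 3.6},
    {"comfort": 2, "courses": 0, "gpa": None},
    {"comfort": 5, "courses": 3, "gpa": None},
    {"comfort": 3, "courses": None, "gpa": 3.1},
]


def complete_cases(data, variables):
    """Rows with no missing value on any of the listed variables."""
    return [r for r in data if all(r[v] is not None for v in variables)]


def missing_by_variable(data, variables):
    """Count of missing values per variable."""
    return {v: sum(r[v] is None for r in data) for v in variables}


predictors = ["comfort", "courses", "gpa"]
usable = complete_cases(rows, predictors)
print(f"{len(usable)} of {len(rows)} rows have complete data on {predictors}")
print(missing_by_variable(rows, predictors))
```

The per-variable counts point at which predictor is costing the most rows. In this toy data that is the self-reported GPA.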

I wasted time running these deficient-sample models and glancing at their results (obviously not closely enough to check the sample numbers). But it wasn’t really a waste because I got to practice using a package that was new to me, {gtsummary}, and that was awesome. So I played around with making gorgeous tables and different plots, which will all be useful for the project. Also awesome, at the literal time I was doing this I saw a YouTube video go up and enjoyed spending some time watching that and making loads of formatting changes to my tables. More skeletons generated!

Also, it’s ok and maybe even necessary to build up exploratory modeling on this kind of project even if you do it wrong the first few times around. I was pretty certain that I needed to check in with my co-author about exactly what variables made sense in these models from their research design POV, informed by my checking on the independence and completeness of the measures (once she was done teaching ridiculously complicated stuff online from our dining table during a pandemic). We weren’t dropping variables based on the results of the exploratory analysis or anything like that, I simply needed to get some stats sea legs on the project. Differences between "silly mistakes everyone makes" and "big problems in evidence claims" can be a matter of maintaining consistent principles about how you use data and select variables. This is more stuff that can be hard to communicate to non-technical clients, but is a tremendous value add.

5. Actual for real analysis, dot FINAL final !!

I took a break and came back. Breaks are key. Refreshed, I actually looked at the numbers in my new fancy tables and immediately saw my mistake. I scanned my script to trace back (sidenote: reading and thinking about your previous work--anything reflective that isn't 'production' but IS a key task of learning--is really undertaught in analytics imho). This is also where your conceptual library of skeletons (💀) is great because I barely even had to think about it, I just visually recognized the absence of the NA exploration I usually do.

Missingness is really interesting. Sometimes people not answering your questions is itself a signal. So an interesting aspect to NAs is whether they correlate, especially on a survey where people might be selecting what to answer, and they’re answering about experiences in their past. I wondered whether people had selectivity in their memory about these things. Systematic missingness is something that I find very meaningful, and I like to check for it. Because this was a new survey and a pilot study, we were paying attention to what data we could gather in this format and not just what the data that we did gather said. This is another advantage of bringing research thinking to the forefront.
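A first pass at whether NAs travel together can be as simple as counting, for each pair of questions, how often both were skipped by the same person. The sketch below is plain Python with made-up rows and variable names, and it is only a descriptive starting point, not a formal test of a missingness mechanism.

```python
from itertools import combinations

# Hypothetical rows; None marks a skipped question.
rows = [
    {"gpa": 3.5, "first_course_year": 2010, "comfort_then": 4},
    {"gpa": None, "first_course_year": None, "comfort_then": 2},
    {"gpa": None, "first_course_year": None, "comfort_then": None},
    {"gpa": 3.1, "first_course_year": 2015, "comfort_then": 3},
    {"gpa": None, "first_course_year": 2012, "comfort_then": 5},
]


def co_missing(data, a, b):
    """Counts of rows missing a, missing b, and missing both."""
    miss_a = [r[a] is None for r in data]
    miss_b = [r[b] is None for r in data]
    both = sum(x and y for x, y in zip(miss_a, miss_b))
    return {"a_missing": sum(miss_a), "b_missing": sum(miss_b), "both_missing": both}


variables = ["gpa", "first_course_year", "comfort_then"]
for a, b in combinations(variables, 2):
    print(a, b, co_missing(rows, a, b))
```

If nearly everyone who skipped one question also skipped another, that is worth a conversation about why: memory, relevance, or question wording.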

In this case, however, there were also some more obvious explanations. One was the self-report inaccuracy I mentioned above. Another was that a small group of our participants were from unusual jobs and didn’t really match our research question, which meant they had a lot of NA responses to questions that didn't apply to them. So (going back to fit-for-use decisions) I made some important analytical decisions about who should be included in the models, which, again, I really should have done before dumping all the data into a model. But sometimes iterating your analysis is just nonlinear like that, like building up the layers of a painting and adding the right shadows and depth after you get the lines and ratios in place (but it's important to note that those 'shadows' can change the entire shape and direction of relationships, which is again why you have principles like 'I am not going to leap to treating these as the actual findings until we've got the right variables in place'). I noted these decisions in the script: I make selection into our analytic sample as transparent as I can, so the team could chat about what to do with that small group of different participants. Maybe a different followup project to learn from them?
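One way to keep that selection transparent in a script, sketched in plain Python. The job labels, the `IN_SCOPE_JOBS` set, and the participant IDs are invented for illustration. Excluded participants are kept in their own list with a stated reason instead of being filtered away.

```python
# Hypothetical participants with a job field merged in from demographics.
participants = [
    {"id": "p01", "job": "researcher"},
    {"id": "p02", "job": "student"},
    {"id": "p03", "job": "other"},
    {"id": "p04", "job": "researcher"},
]

# Assumed inclusion rule, written down once so the team can review or change it.
IN_SCOPE_JOBS = {"researcher", "student"}

analytic_sample, excluded = [], []
for person in participants:
    if person["job"] in IN_SCOPE_JOBS:
        analytic_sample.append(person)
    else:
        reason = f"job '{person['job']}' is outside the research question"
        excluded.append({"id": person["id"], "reason": reason})

# Every participant ends up in exactly one of the two lists.
assert len(analytic_sample) + len(excluded) == len(participants)
print(f"analytic sample: {len(analytic_sample)} of {len(participants)}")
print("excluded:", excluded)
```

Keeping the excluded group as data, not as a deleted filter result, leaves the door open for the follow-up question of what those participants could teach a different project.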

I generated an {analysis} dataset with the right subsample, this time, for the project, and re-ran it all. At this point, I was ready to call it a day and present it in a nice dynamic report to the research team. Remember that thought I had about how “comfort with coding” might change over time? It started to look really interesting. So stay tuned on that!
