Home > APIs > YSQL > Functions and operators > Aggregate functions > case study—linear regression on COVID data >

Using the YSQL linear regression analysis functions on the COVIDcast data—introduction

Introduction

Try this query:

select max(symptoms_pct) from covidcast_fb_survey_results;

The result is about 2.7%. This indicates that the signal "symptoms_pct" (characterized on the COVIDcast download page by "Percentage of people with COVID-like symptoms, based on surveys of Facebook users") has little power of discrimination.

Now try this:

select
  (select min(cmnty_symptoms_pct) as "Min" from covidcast_fb_survey_results),
  (select max(cmnty_symptoms_pct) as "Max" from covidcast_fb_survey_results);

The results here are about 7% for the minimum and about 55% for the maximum. This indicates that the signal "cmnty_symptoms_pct" (characterized on the COVIDcast download page by "Percentage of people who know someone in their local community with COVID-like symptoms, based on surveys of Facebook users") will have a reasonable power of discrimination.

None of the YSQL built-in aggregate functions can take account of the "stderr" or "sample_size" values that were carried forward into the final "covidcast_fb_survey_results" table. But you might like to try some ad hoc queries to get an idea of the variability and reliability of the data.

For example, this:

select avg(cmnty_symptoms_stderr) from covidcast_fb_survey_results;

gives a result of about 0.8 for the percentage values in the range 7% through 55%. This suggests that the seven day moving averages are reasonably reliable.

And this:

select
  (select min(cmnty_symptoms_sample_size) as "Min" from covidcast_fb_survey_results),
  (select max(cmnty_symptoms_sample_size) as "Max" from covidcast_fb_survey_results);

results in about 325 for the minimum and about 24.6 thousand for the maximum. This is a rather troublesomely wide range. The result of this query:

select
  round(avg(cmnty_symptoms_sample_size)) as "Avg",
  state
from covidcast_fb_survey_results
group by state
order by 1;

suggests that the sample size is probably correlated with the state's population. For example, the two biggest sample size values are from California and Texas. and the two smallest are from DC and Wyoming. It would be straightforward to find a list of recent values for state populations from the Internet and to join these, using state, into a table together with the average sample sizes from the query above. You could then use the regr_r2() function to see how well-correlated the size of a state's response to the COVIDcast Facebook survey is to its population. This is left as an exercise for the reader.

Create a view to focus your attention on the values that the analysis presented in the remainder of this section uses:

create or replace view covidcast_fb_survey_results_v as
select
  survey_date,
  state,
  mask_wearing_pct,
  cmnty_symptoms_pct as symptoms_pct
from covidcast_fb_survey_results;

This is included in the analysis-queries.sql script that also implements all of the queries that the analysis presented in the remainder of this section uses.

If you want to see how the results come out when you use the "symptoms_pct" column instead of the "cmnty_symptoms_pct" column, just redefine the view, thus:

create or replace view covidcast_fb_survey_results_v as
select
  survey_date,
  state,
  mask_wearing_pct,
  symptoms_pct as symptoms_pct
from covidcast_fb_survey_results;

How the rest of this analysis section is organized

The section Daily values for regr_r2(), regr_slope(), regr_intercept() for symptoms vs mask-wearing describes the actual linear regression analysis code.
The section Select the data for COVID-like symptoms vs mask-wearing by state scatter plot shows the SQL that lists out the 51 individual "(symptoms_pct, mask_wearing_pct)" tuples for the day that was arbitrarily chosen for drawing a scatter-plot on top of which the outcome of the regression analysis for that day is drawn.