How to create aggregate data

We want to run a survey about how many people like hats.

Open a python shell in interactive mode (run python3) and run the following commands in sequence.

First we’ll import the library and make a new store.

from ocdsmetricsanalysis.library import Store
store = Store("how-to-create-aggregates.sqlite")

We’ll add our own metric.

store.add_metric("HATS", "How many people like hats?", "We ran a survey to find out.")
metric = store.get_metric("HATS")

Here are the results of our survey:

survey_results = [
     {"response":"like", "person_height":"tall"},
     {"response":"like", "person_height":"tall"},
     {"response":"dislike", "person_height":"tall"},
     {"response":"neither like or dislike", "person_height":"tall"},
     {"response":"like", "person_height":"tall"},
     {"response":"neither like or dislike", "person_height":"tall"},
     {"response":"like", "person_height":"short"},
     {"response":"like", "person_height":"short"},
     {"response":"neither like or dislike", "person_height":"short"},
     {"response":"like", "person_height":"short"},
     {"response":"dislike", "person_height":"short"},
     {"response":"neither like or dislike", "person_height":"short"},
     {"response":"like", "person_height":"short"},
     {"response":"dislike", "person_height":"short"},
     {"response":"dislike", "person_height":"short"},
     {"response":"like", "person_height":"short"},
]

We don’t want to publish individual survey responses in our metrics, as there may be anonymisation issues with that.

Instead, we’ll publish aggregates - how many people answered a certain way?

Also, because we have additional data on height, we can break down the answers by height too.

There is a function that will calculate the observations for us automatically.

metric.add_aggregate_observations(
    survey_results,
    "response",
    "answer",
    idx_to_dimensions={"person_height": {"dimension_name": "height"}}
)

Lets go through the parameters:

survey_results is the array of our survey responses.
“response” is the key in each survey response we are counting the answers to (normally a survey would have more than one question, so we need to know which one to count!)
“answer” tells us what dimension in the observations we create we should store the answer in.
idx_to_dimensions tells us about additional questions in the results that we can use to make more dimensions. In this case we are saying the “person_height” key in each survey response can be used to make a new dimension called “height”.

Let’s quickly list all the observations for all overall survey:

observation_list = metric.get_observation_list()
observation_list.filter_by_dimension_not_set('height')
observations = observation_list.get_data()
for observation in observations:
    print("OBSERVATION id=" + observation.get_id())
    print(observation.get_measure())
    print(observation.get_dimensions())

You should see something like:

OBSERVATION id=000000001
4
{'answer': 'dislike'}
OBSERVATION id=000000002
8
{'answer': 'like'}
OBSERVATION id=000000003
4
{'answer': 'neither like or dislike'}

Let’s also see our results broken down by height:

observation_list = metric.get_observation_list()
observations_grouped = observation_list.get_data_by_dimension('height')
for height, observations in observations_grouped.items():
    print("HEIGHT IS " + height)
    for observation in observations:
        print("OBSERVATION id=" + observation.get_id())
        print(observation.get_measure())
        print(observation.get_dimensions())
    print()

You should see something like:

HEIGHT IS short
OBSERVATION id=000000004
3
{'answer': 'dislike', 'height': 'short'}
OBSERVATION id=000000006
5
{'answer': 'like', 'height': 'short'}
OBSERVATION id=000000008
2
{'answer': 'neither like or dislike', 'height': 'short'}

HEIGHT IS tall
OBSERVATION id=000000005
1
{'answer': 'dislike', 'height': 'tall'}
OBSERVATION id=000000007
3
{'answer': 'like', 'height': 'tall'}
OBSERVATION id=000000009
2
{'answer': 'neither like or dislike', 'height': 'tall'}