Monday, 17 February 2020

A sort of pivot table

A post preparatory to one about tabulating frames of consciousness.

There are lots of ways of doing cross tabulations and there are lots of ways – some sophisticated and elaborate – of describing them in more or less formal language.

Here we describe a simple variety of cross tabulation, a simple variety which is reasonably well supported by the Microsoft Excel pivot table function – although we have yet to find or work out why this function is so called. In what follows we suppose some familiarity with Excel. Our own position being one of familiarity rather than of expertise.

Data

We start with a worksheet, Sheet C, which contains the data to be tabulated, which we suppose, for the purposes of exposition, to be the results of a survey of 100 households in England and Wales. Each household is represented by a row on this worksheet, while individual properties of those households are represented by columns.

A simple rectangular array of data which has been generated using the Excel random number generator and which, while vaguely realistic, does not describe real households.

Figure 1
Figure 2
Figure 1 shows the top of this worksheet and Figure 2 shows the bottom of the data. Note that, the first row having been used for the headings, the data starts at row 2 and stops at row 101.

The column headed ‘#’ contains the household identifier. This might be needed in the case that some error, anomaly or point of interest in the data meant that one wanted to go back to the raw material, to the survey forms used to collect the data. Assuming here that there were no anonymity protocols restricting such access.

The column headed ‘Region’ contains the statistical region in which the household is to be found. We suppose that there are ten such regions, coded from 1 to 10. An example of a region might be East Anglia.

The column headed ‘Tenure’ is about things like owned outright, owned with a mortgage and rented from a private landlord. We suppose that we have five tenures, coded from 1 to 5.

The column headed ‘Roof’ is about the type of roof; things like hipped, pitched, flat or mixed. We also need ‘not applicable’ to cover those households which do not have their own roof, for example a flat in a block of flats. Coded from 1 to 5. Note that it is often necessary to include ‘not stated’, as odd fields may be missing from odd records for all kinds of reasons. We have not done this here.

The column headed ‘Garage’ is about the number of indoor parking spaces in garages, including here covered parking spaces in blocks of flats. Coded 0 for none, 1 for one and 2 for 2 or more.

The column headed ‘Standing’ is about the number of outdoor parking spaces, excluding the road, available to the household. Coded 0 for none, 1 for one and 2 for 2 or more.

The column headed ‘Garden’ is about the garden attached to the household in question. Coded 0 for none, 1 for front garden only and 2 for other.

The column headed ‘Income’ is the household income over the past year, in pounds. That is to say the total income of all the people usually resident in the household; a property of person aggregated to household. A more complex, a more informative, file might be organised hierarchically with each household record followed by one or more person records – a complexity which would have defeated the early computer tabulators of the early 1970’s.

The column headed ‘Members’ is the number of people in the household, which is required to be one or more, which means that vacant property is excluded. Here coded from 1 to 5, although more would be allowed.

Note the difference between standing and members, with standing being capped at 2 or more while members is open ended. With the choice between the two approaches depending on the data, how one proposes to analyse it and the tools available for that analysis.

The column headed ‘Bathrooms’ is the number of bathrooms in the household, recorded as zero or more. Here coded from 0 to 4, although more would be allowed.

Note that quite a lot of decisions have been taken here which will restrict the subsequent analysis. Deciding, after the event, for example, that one is really interested in households with lots of parking spaces in garages, means going back to the raw data and trying to do that column over.

Note that these seemingly innocuous looking definitions gloss over plenty of potential for controversy at the margins. What about car ports? Does a shower room or a wet room count as a bathroom? What about the income of the people who stayed over the winter? All things which a careful surveyor would think about before setting off. Or would perhaps sort out with a pilot survey.

Data selection

Figure 3
One needs to tell Excel for which data one wants the pivot table. This can either be done by selecting the lot, by selecting a group of columns or by selecting a rectangle of data, with this last being illustrated above. Here, we have selected households H2410 thru H2424 and want to analyse them using some selection from Garage, Standing, Garden, Income and Members. While selecting the lot or selecting columns has the advantage of the selection carrying the column headers with it.

In the examples which follow, we have selected all the columns, which amounts to selecting the lot; all the rows and all the columns.

One dimension

Figure 4
This pivot table simply counts the number of households by region. It includes a data row for every value of ‘Region’ which is to be found in the data. To which is added a header row, using information from the header row in the data, where present, a row for blank and a total row. It is quite possible, if fiddly, to move the table somewhere else and to fiddle with the layout and the formats.

Note that in the case that a region code is not present in the data, that row is simply omitted – except in the case of blank which Excel always adds in. And it may well be that neither action is what one wants in a table for circulation or publication.

Note that it would have been quite possible to replace the numeric region codes with region names, in which case one might have wanted to make the relevant column rather wider. In the past the extra space needed in the data worksheet would have been significant – which might still be in the case that we had a busy worksheet and one wanted to be able to display all the columns, legibly, across the screen, in one go.

Note that the random number generator has not produced a very uniform distribution, at least not for this number of households.

Figure 5
Here we extend things a bit, analysing by tenure this time, but having three accumulator fields for each tenure. The first simply counts the number of households with the tenure in question, while the second counts up the members, sums them in the jargon of Excel, and the third counts up the bathrooms. These last two exploiting the fact that these two columns are just counts, not manipulated in any way, and which can be sensibly added up. Which would not be the case, for example, with the column for garage which has been capped at 2 or more.

Note that the count of members is a simple multiple of the count of households, a consequence of laziness in allocating the random data to columns, visible in Figure 1.

Two dimensions

Figure 6
Here we revert to one accumulator field, but extend to two dimensions, with tenure across the page and region down the page. A not very random looking distribution, despite our attempt at randomisation. Although we would say, with real data, such lack of randomness is apt to be the result of some interesting interaction between the dimensions; just the sort of thing we might be interested in finding out about.

Note how Excel has added in blanks for each dimension and simply left blanks cells which might otherwise have been shown as zero. Indeed, layout is no longer fit for publication: the data has been analysed and it has been put on the page, but it needs to be considerably smartened up. Copying the material into Microsoft Word may be indicated.

More

Figure 7
Excel can do a lot more with pivot tables, but the default presentation can be rather unattractive, forbidding even.

A simple variety of cross tabulation

Figure 8
The data has been recoded with text codes rather than numbers. For the curious, lifehold is an obsolete form of tenure which I first came across in Hardy’s ‘The Woodlanders’.

Figure 9
The raw pivot table.

Figure 10
The lightly edited pivot table.

This last figure is more the sort of thing we want here: a simple two dimensional analysis with counts and totals. But it is not always convenient to code one’s data into Excel in the way suggested in Figure 8 above.

PS 1: there appears to be something odd going on with this supposedly random data, but we do not think it material and have not troubled to track it down.

Summary

Figure 11
What we about is conceptually summarised in the figure above: the transformation of a long, open ended series of things into a compact, two dimensional array of counts. The sort of open ended series which used to be stored as flat files on magnetic tapes. Where flat means the same simple record for every thing; no open ended repeats or anything like that. Probably no text fields.

More exactly, our simple variety of cross tabulation may be thought of as a one or two dimensional analysis of a rectangle of data from a data worksheet, with the selection of a rectangle for analysis by pivot table amounting to the selection of the things (in our example households), that is to say the rows, one wants to include in the analysis and of the properties, that is to say the columns, one wants to use in that analysis.

Leaving aside the totals added by Excel, each row (in our example, each household) is counted into exactly one cell of the analysis.

There is a cell in the pivot table for each and every combination of properties which occurs in the data. And it may well be that this also is not what one wants in a table for circulation or publication.

We allow one or more accumulators to each cell of the analysis. In this example, most of the time we will want to know how many households contributed to each cell: in pivot table speak, value field setting equals count.

It is assumed that the data has been arranged and coded in such a way, that the data worksheet has been prepared, with all this in mind. That preparatory analysis and design has been done before starting to collect and record the data – failure in this last department being all too common. But it is also true that one often does not know where one is going when one starts out – and one just has to accept a certain amount of toing and froing. Of reworking of the data. Or the expense of a pilot survey.

Along the way, we have shown something of the way in which Excel does things.

Lastly, in addition to describing our simple variety of cross tabulation, we have suggested various ways in which things might easily get more complicated.

PS 2: not clear why Microsoft has chosen to render Figure 11 in blue, given that it started off as black biro on white paper.

No comments:

Post a Comment