Rob Minto

Sport, data, ideas


John Terry vs Chris Huhne, Fred Goodwin vs Johann Hari: why it pays to wait

I can’t help thinking about four recent falls from grace. In essence, two are about awards, the other two about pre-emptive punishment. In all cases, we could benefit from being less hasty. I’ll explain why.

Let’s start with pre-emptive punishment. John Terry was stripped of the England captaincy pending an investigation into racist abuse. Chris Huhne quit the cabinet after being charged over his wife taking speeding points for him.

In these cases, the alleged offences are totally different, but the principle is the same. Should someone step down from high office (the cabinet, the England football captaincy) before their case is heard? And in both instances, the MP and the player can remain just that. Why not go further – if they are not acceptable to lead the team, should they even be in it? If Huhne is not fit for the cabinet, should he represent his constituents in Parliament?

Yet it was over the Terry case, the more morally worrisome and noxious of the two, involving a player with prior bad behaviour (violence, infidelity), that England manager Fabio Capello resigned. Capello said it was unfair to pre-judge the case. And surely he has a point? If Terry is innocent, will the FA give him back the captaincy? About as likely as Capello managing England again.

Terry may be an odious person, certainly. But that is all the more reason not to give him the captaincy in the first place.

Which brings me neatly to getting things right in the first place.

Fred Goodwin was stripped of his knighthood. Johann Hari was forced to give back his Orwell prize for journalism.

In both cases, it seems the witch-hunt was hugely enjoyable for the press and public alike. Goodwin is an unrepentant, apparently unpleasant banker. Hari is a delusional journalist, protected by the Independent, which should have sacked him when his dishonesty came to light.

In both cases, their prizes inflated their egos and should not have been given. Neither man can be blamed for accepting. If you are a multi-millionaire banker dealmaker, or a fêted journalist, darling of the left, a gong is exactly what you think you should be getting.

And yes, in both cases, a few checks would have made all the difference. Did Hari’s article stand up to scrutiny? It fell over pretty fast, as soon as a light was shone on his sources. Why give knighthoods to sitting CEOs? Why not wait and see if their deals work out, or if they bring a bank (and the country) to its knees?

In all four cases, it pays to wait, check and not jump in. Should Huhne still be a minister? If Terry was a good choice for captain before (he wasn’t), he still would be now. Hari should not have been awarded the Orwell prize; Goodwin should never have got close to a knighthood in the first place.

A banker, a footballer, a politician, a journalist. Very different crimes or charges. These men are problematic, certainly, but our eagerness to award or judge makes the problem far worse.

Why Gmail’s new look is a usability nightmare

I am absolutely furious with Google’s changes to gmail. I don’t really care about the design – the themes give you enough scope to personalise. The problem is a technical one that has wrecked usability. It’s fundamental: the use of multiple iframes.

These iframes create scrollbars within scrollbars, especially if you use labels and gadgets in your gmail (which I do).

Gadgets used to be an easy way of seeing other things, like your calendar or Google docs, without leaving your email – a nifty productivity bonus. Labels are pretty fundamental to using gmail. Now both have become a nightmare.

Occupy Wall Street: how quick were the media on the uptake?

The Occupy Wall Street movement is spreading and sprawling, into different countries and encompassing many issues.

But how long did it take the news media to catch on? This is possible to quantify using two things: Factiva to show the volume of news, and Google Trends to show how people are searching.

Factiva searches give the volume of news articles by day. Google Trends shows search relevance and volume. Plot them together, and you get an idea of when the public were searching for something, and when the mainstream media wrote about it.

Here’s the chart:

You can see straight away that there is a two-day lag between the Google peak and the Factiva news peak: October 15th for search, October 17th for news.
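Finding the lag is just a matter of comparing the two peaks. A sketch of the calculation, using the figures quoted in this post plus made-up values for the other days (the real daily series come from Factiva and Google Trends):

```python
from datetime import date

# News article counts: the Oct 6th (349) and Oct 17th (792) values are
# quoted in the post; the Oct 15th value is invented for illustration.
news_volume = {
    date(2011, 10, 6): 349,
    date(2011, 10, 15): 610,
    date(2011, 10, 17): 792,
}

# Google Trends scores: 15.7 and 18.1 are quoted in the post;
# the Oct 17th value is invented for illustration.
search_score = {
    date(2011, 10, 6): 15.7,
    date(2011, 10, 15): 18.1,
    date(2011, 10, 17): 9.0,
}

def peak_day(series):
    """Return the day with the highest value."""
    return max(series, key=series.get)

# How many days the news peak trails the search peak.
lag = (peak_day(news_volume) - peak_day(search_score)).days
print(lag)  # → 2
```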

But there was an earlier search peak on October 6th, which Google scored at 15.7, not far below the peak of 18.1. The Factiva volume at that point, though, was 349 – over 50 per cent below the highest single-day news volume of 792.

In fact, up to the peak, there is a news lag, shown by the gap between the pink line and the blue bars. After the peak, the blue bars trend higher than the pink line, suggesting that the news media is playing catch-up while searching has tailed off.

OK, some caveats. Google Trends is good – Google made a big deal about how it could predict outbreaks of flu back in 2008. But it’s not everything, and Twitter data might be even more revealing. Ditto Factiva: an excellent source, but if we looked at its blog results rather than news publications, it would be closer to the Google trend line.

But I think it’s an interesting way to see what we are searching for, and writing about – and where the gaps are.

How Georgia rules the newspaper web fonts

What have the Guardian, Times, Telegraph, FT and Independent got in common (aside from being UK newspapers)? Politically? Not much. Ownership? Couldn’t be more different. Style? Now you are getting somewhere.

If you’ve ever surfed a few news websites and had a sense of deja vu, that’s because you have seen it before. All the papers listed above use Georgia as their main headline font – and most use it for the text as well.

While print editions of newspapers try their best to look different, it seems all broadsheet or quality press outfits online look the same. Georgia everywhere. It’s true of my employer, the FT, which has adopted the font in its last redesign, and it’s true of most US papers too.

Interestingly, the tabloid press are keener on Arial and other sans-serif (ie non-twiddly) fonts.

So why are the newspaper sites gravitating to one font? Georgia is a classy font, but why is it the be-all and end-all?

One reason is web standards. If you want a consistent look for your site, you have to use a font that is compatible with all browsers and devices, so you can be sure of how it renders – and Georgia (along with Arial and a few others) is one of those ‘base’ fonts.

But this is crazy. On the modern web, you can pick any font using CSS (stylesheets) and tell the browser what to do if it doesn’t recognise that font. It’s just a list – you could start with something exotic, and then put Georgia as the backup. I’m baffled as to why sites don’t do this. Spacing isn’t an issue, as headlines change in length all the time. You can even specify different stylesheets for different devices if you need to. The world has moved on, but we are retreating to a handful of fonts.
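Such a fallback list is a one-liner. A sketch (the first font and the selector name are arbitrary examples, not anything a particular paper uses):

```css
/* A font stack: the browser tries each font in turn and uses the
   first one it can render, ending with the generic "serif" family.
   "PT Serif" is just an illustrative non-base first choice. */
h1.headline {
  font-family: "PT Serif", Georgia, "Times New Roman", serif;
}
```

Browsers that don’t have the exotic first choice simply drop down to Georgia, so nothing breaks for anyone.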

And before you point it out, yes, I’ve used Georgia as the font for this blog. I just like it, but maybe that’s the reason – it’s just really, really good. In which case, hats off to Matthew Carter, who designed it (along with loads of other fonts).

Here’s a quick rundown (not comprehensive) of who is using which font:

Georgia (for headlines at least):
– Guardian
– Independent
– FT
– The Times
– Telegraph
– Wall Street Journal
– International Herald Tribune
– NYTimes
– LA Times
– Washington Post
– New Statesman
– Time – Georgia and Arial mixed

Arial:
– Daily Mail
– USA Today
– The Onion
– Reuters and Bloomberg use Arial on their sites (Bloomberg uses a Georgia derivative in its terminals)

Economist uses Verdana. Good for the Economist. A bit different.

Congestion vs population

I’ve seen a few references to a study on big cities and congestion recently, so I thought I’d take a closer look. It’s a survey by IBM – so caveats aplenty are needed. For starters, it’s based on a sample. And that sample is based on perception. (Perception is a good measure for some things, like happiness or success. It’s not so good for things you could actually measure, like travel times or car density or delays.) It also refers to lots of interesting auxiliary questions but gives no data in a usable format. Not very transparent, and weak for a company that you might think is data-savvy.

Anyway, at first glance, it’s quite easy to see where the congestion is: the Bric countries plus Mexico and South Africa. I’ve dumped all the available data into a Google Fusion table. The cities with a score over 75 out of 100 are the red markers. So is this a developing-country issue? Poorer countries don’t have the infrastructure, hence the congestion. QED.

But actually, is this a population issue? Perhaps the bigger the population, the harder it is to move people around, and the more congestion you get.

Without wanting to commit the classic correlation-vs-causation mistake, here’s the data plotted against population. (The population data is from Wolfram Alpha, which uses these sources.)

Although there isn’t a perfect correlation (the score is 0.56), there is a basic grouping in the bottom-left corner (lower population, lower congestion) and the top-right corner (high for both). The outliers are Johannesburg, with a lower population but extremely high congestion, and New York, with a high population but low congestion.
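For anyone wanting to reproduce a correlation score like this, here is a minimal Pearson calculation. The city figures below are invented for illustration (they echo the Johannesburg and New York pattern, but are not the IBM survey data):

```python
import math

# Hypothetical (population in millions, congestion score) pairs.
cities = {
    "Small A": (3, 30),
    "Small B": (5, 40),
    "Big A": (18, 85),
    "Johannesburg-like": (4, 90),   # low population, high congestion
    "New York-like": (19, 45),      # high population, low congestion
}

def pearson(pairs):
    """Pearson correlation coefficient of (x, y) pairs."""
    xs, ys = zip(*pairs)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(list(cities.values()))
print(round(r, 2))  # the two outliers drag the correlation down
```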

Upshot: New York is a good place to live, Johannesburg not so much – assuming that there are benefits to a big city, such as interesting things to see and do.

Omissions: Why did they leave Tokyo out? It’s a) huge and b) hard to navigate. It would have been interesting to see what the congestion perception was there.

The crazy cost of Switzerland

I’ve just got back from a long weekend in Geneva. Lovely place, beautiful lake, painful exchange rate. Switzerland was always quite expensive, but with the Swiss franc a safe haven for investors, hanging out in Geneva suddenly costs a small fortune.

But leave aside the cost of normal stuff like food and hotels for a second. For part of the trip we were staying with friends who live very near the French border, so I got text messages alerting me to what mobile services would cost from my telco (T-Mobile) in either country.

[easychart type="vertbar" height="200" width="350" title="Mobile prices, price(£)" axis="both" groupnames="France, Switzerland" valuenames="Make call, Receive call, Text, Data per mb, Picture msg" group1values="0.366, 0.115, 0.115, 0.333, 0.2" group2values="1,1,0.4,7.5,0.2"]

And what a difference half a kilometre makes – over in France it was 36p to make a call and 11p to receive one, compared to £1 each in Switzerland. A text was 40p in Switzerland against 11p in France. Weirdly, picture messages cost the same in both (20p).

But it was data where the greatest difference lay. In France, I was offered £1 per 3mb. In Switzerland, it was £7.50 for 1mb – over 22 times more expensive.
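The “over 22 times” figure is easy to check from the quoted prices:

```python
# Roaming data prices from the operator's texts, in pounds per MB.
france_per_mb = 1.0 / 3        # the £1-per-3MB offer in France
switzerland_per_mb = 7.50      # £7.50 per MB in Switzerland

ratio = switzerland_per_mb / france_per_mb
print(round(ratio, 1))  # → 22.5
```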

Now I know that EU regulations are bringing down the cost of call and data roaming in Europe, which Switzerland is free to ignore. And this is a sample of one, rather than a proper survey. But data should never, ever cost 22 times more just by walking 500m across a border.

What if cricket counted centuries differently?

Alastair Cook’s 294 against India got me thinking today: why does a double century not count for two in the 100s column of a batsman’s career stats? And if it did, how would the stats look?

Going from 99 to 100 may be just one run, but it’s the milestone. So why not 199 to 200? It’s the same achievement: another hundred runs in the same innings. The chart below shows how the century list would look if scores over 200 counted as two centuries, over 300 as three, and Lara’s 400 as four.
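The counting scheme is just integer division of the score by 100. A quick sketch:

```python
def compound_centuries(score):
    """Count one 'century' per complete hundred runs in an innings:
    294 counts as 2, 400 as 4, 99 as 0."""
    return score // 100

# A few illustrative innings: Lara's 400, Cook's 294, a 375, a 99, a 100.
innings = [400, 294, 375, 99, 100]
print([compound_centuries(s) for s in innings])  # → [4, 2, 3, 0, 1]
```

Summing this over a career, instead of counting every hundred-plus score as one, gives the blue bars in the chart.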

In this chart, the accepted number of centuries is in orange, and the compound counting of 200s, 300s and 400 is in blue.

The first thing you notice is that although Tendulkar is still in top spot, his lead is cut, and he hasn’t got too many “big” scores compared to others.

Second – the big beneficiaries are Lara, who leapfrogs Ponting, and Bradman, who gets a huge boost. Sehwag and Hammond also move ahead of rivals, as do Sangakkara and Jayawardene.

Here’s the best list for data: Cricinfo – double hundreds, triple hundreds. And here’s my big100s spreadsheet.

As ever, it just confirms that Bradman is the best of all time. But it also would reward the effort of getting from 100 to 200. Time to change the counting system, I think.

The perils of comparing the greatest at different sports

It could almost be a sport itself – debating who is the greatest sportsman of their sport / generation / all time. The great names are easy to think of – Pele, Federer, Bradman, Woods. Or is it Maradona, Laver, Tendulkar, Nicklaus?

The arguments will rumble on, but a few statistical caveats should always be kept in mind. One is: You can’t compare between sports very easily.

Here’s an example which has made me furious. In a recent issue of Prospect magazine, Jay Elwes tries to make the case for Indian cricketer Sachin Tendulkar being the best sportsman in the world. Fair enough, a good candidate I’d agree. But just read the following paragraph:

At which point, a question arises: can Federer, perhaps the greatest ever tennis player, be measured alongside Tendulkar? One instructive comparison is the distance by which each leads the trailing pack. Federer has won 16 Grand Slam tennis titles. In second place is Pete Sampras on 14, which makes Federer 14 per cent more successful than his nearest competitor. Tendulkar has scored a total of 32,803 runs for India in Test and one-day internationals combined. Ponting, in second place, has scored 25,769, meaning that Tendulkar has scored 27.3 per cent more again than his nearest rival. His lead is nearly twice that of Federer.

I’d like to say this is a small blip, but it’s not. It seems to be the main data to buttress his argument. What’s wrong with this? In no particular order:

  • Why are total runs so important? Tendulkar is great, but he has also played more matches than anyone else, in both Tests and one-day internationals.
  • How on earth can you make sense of a “percentage lead” when the range is 0 to 16? And compare it to a measurement system with range 0 to 30,000 plus? Idiotic.
  • If Federer wins the US Open next month, that puts him 21 per cent more successful than Sampras, up from 14 per cent. And the point is?
  • Comparing grand slams to runs is just bonkers. You accumulate runs, win or lose. You can’t do that with grand slams.
  • Why not compare total tennis match victories to runs? Or test match wins to tournament wins? It would be a more like-for-like comparison, although similarly meaningless.
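To spell out the arithmetic behind the quoted percentages – and how arbitrarily they move – here it is using the figures from the Prospect paragraph:

```python
def lead_over_second(first, second):
    """Percentage by which the leader exceeds the runner-up."""
    return 100 * (first - second) / second

# Figures quoted in the Prospect paragraph.
federer = lead_over_second(16, 14)          # grand slam titles
tendulkar = lead_over_second(32803, 25769)  # total international runs

print(round(federer, 1), round(tendulkar, 1))  # → 14.3 27.3

# One more slam win would jump Federer's "lead" by 7 points -
# on a 0-16 scale, single events swing the percentage wildly.
federer_plus_one = lead_over_second(17, 14)
print(round(federer_plus_one, 1))  # → 21.4
```

The numbers check out; the comparison still doesn’t, because one scale is 0 to 16 discrete titles and the other is tens of thousands of accumulated runs.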

I could go on, but you get the idea.

Cricket and tennis lend themselves to some fascinating statistical analyses. But this is not an “instructive comparison”. It’s grossly misleading, shows little thought, and does the debate about great sportsmen no favours. Prospect magazine is a superb publication, but this is not one of their better articles.

Big data is underestimating the emerging markets

Consultants and analysts – and bloggers, of course – are keen to tell us how big the world’s data is, and how fast it is growing. We have entered the “zettabyte age”.

But for all the talk of “big data” and how daunting it all is, I think data levels are going to be far bigger than we estimate now. As far as I can tell, most models of data usage look at developed markets and extrapolate the phenomenal growth in data from smartphones, PCs, companies and so on.

But this underestimates the usage of data in the developing world. Many countries are going to run straight through the non-networked, 2G world and join the data-everywhere, cloud-based, streaming world instead. And this has big implications for data.

The EMC Digital Universe infographic (pdf) suggests total world data will grow from 1,227 exabytes in 2010 to 7,910 in 2015. Although this looks like a huge increase compared with 2005 to 2010, when world data was estimated to have gone from 130 exabytes to 1,227, the rate of growth they predict is actually slowing, from a factor of 9.4 to 6.4.

Instead, take a look at the McKinsey report into big data (pdf). On page 103 there is a rough breakdown of data storage by world region. If we take North America as the benchmark, that region uses 6.5 petabytes per million people. Run the rest of the world at that level of data usage, and the world total of 6,750 petabytes goes up more than five-fold to 37,296 petabytes. See table below.

Now, the rest of the world isn’t going to catch the US in data usage in the next five years, but you get an idea of the scale. China is currently on 0.2 petabytes per million people; India is even lower. Working from models of developed countries is fine for now, but the rest of the world will catch up faster, and use far more data. I’d rip up a few of those models and predictions and start again.

Region           Petabytes   Population (m)   PB per million   PB at NA usage   % change
North America    3,500       538              6.5              3,500            0
Latin America    50          589              0.1              3,832            7,564
Europe           2,000       595              3.4              3,871            94
China            250         1,350            0.2              8,783            3,413
Japan            400         127              3.1              826              107
MENA             200         599              0.3              3,897            1,848
India            50          1,210            0.0              7,872            15,643
Rest of APAC *   300         725              0.4              4,717            1,472
Total            6,750                                         37,296

(Population source: Wolfram Alpha.)

* Rest of APAC population taken from Wikipedia, with Japan, China (incl HK and Macau) and India removed.
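The extrapolation behind the table is easy to reproduce from its own figures:

```python
# (Population in millions, current storage in petabytes) per region,
# taken from the table above.
regions = {
    "North America": (538, 3500),
    "Latin America": (589, 50),
    "Europe": (595, 2000),
    "China": (1350, 250),
    "Japan": (127, 400),
    "MENA": (599, 200),
    "India": (1210, 50),
    "Rest of APAC": (725, 300),
}

# North America's usage per million people is the benchmark (~6.5 PB).
na_pop, na_pb = regions["North America"]
pb_per_million = na_pb / na_pop

current_total = sum(pb for _, pb in regions.values())
projected_total = sum(pop * pb_per_million for pop, _ in regions.values())

print(current_total, round(projected_total))  # → 6750 37296
```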

How to live dangerously – a book that does statistics a disservice

Being a statistics junkie, I had the book How to Live Dangerously by Warwick Cairns recommended to me by a couple of people. Normally, I would read it, enjoy it, and move on. But this book has prompted a mini-review (several years late, but who cares…), because it commits several statistical crimes.

The first is that Cairns plays fast and loose with surveys. Surveys here, surveys there – with no mention of how many people were asked, by what method, or from which sources. We can all cherry-pick surveys to prove any point we like. A health warning is needed.

Second, Cairns is too casual in dismissing what we don’t know, and uses little data to back up the main thrust of his argument (which I broadly agree with), peppering his prose with “probably”s and “these days”. Example:

In 1970, eight out of ten elementary schoolchildren used to walk to school. In 2007, less than one out of ten did – and they were probably the ones who lived across the road, or whose dads were the school caretakers. Most children these days are driven to school in cars, even if they live just round the corner.

Really?

Third, and far worse, he actually uses statistics to deceive rather than to prove a point. The worst offence is comparing the data on child abduction and murder with deaths from fires.

It is clear that the media make more of the former than the latter – a child killed in a fire is a tragedy that might get a mention in the local news, while an abduction and murder will quite often make national headlines.

But Cairns breaks down the stats, pointing out that in any one year only 100 or so US children are abducted by strangers, and of those, 46 are killed. He then extrapolates that the average child has a 0.00007 per cent chance of this fate – which equates to it taking 1.4m years for your child to be murdered by a stranger if you left him or her unguarded on the street.

Obviously the idea of living for 1.4m years is nonsense, and a cunning way of pointing out our ridiculous fear of this event. But then he points out the relative danger of keeping a child indoors and the risk of fire, to show how foolish we are at stopping children going out.

Without citing which country (I assume the US again), he says “one child dies of [fire in the home] every ten days.”

So he sums up our fears thus (from p46):

So, they go out, and face the 1-in-1.4 million chance of being abducted and murdered. Or they stay in, where one child gets burned to death every ten days.

This is the worst statistical argument I have ever come across. Comparing a 1-in-1.4m chance (which is not the same as 1-in-1.4m years anyway) with one death every 10 days sounds like a logical slam dunk – why would we care about the one-in-a-million chance when a child dies in a fire every 10 days? Except that these stats are far more similar than the presentation suggests. Using Cairns’ own data, one child is abducted and then murdered every 8 days, compared to one fire death every 10 days. Or, to put it another way, there are 46 abductions and murders every year in the US compared to roughly 37 fire deaths.
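Putting Cairns’ two figures on the same scale takes a couple of lines:

```python
# Cairns' figures, converted to a common scale.
abductions_killed_per_year = 46
days_between_murders = 365 / abductions_killed_per_year   # one every ~8 days

days_between_fire_deaths = 10
fire_deaths_per_year = 365 / days_between_fire_deaths     # ~37 per year

print(round(days_between_murders, 1), round(fire_deaths_per_year, 1))
# → 7.9 36.5
```

Expressed either way – per year or days-between-events – the two risks are of the same order, which is exactly what the book’s framing obscures.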

Either Cairns is being appallingly deceptive, or incredibly sloppy and can’t understand the stats himself. Either is hard to forgive in a book that tries to cut through the froth and present our fears and risk in a rational way.

Overall – for a book that cites statistics and tries to uncover our irrational fears, it is sloppy, prejudiced and patronising. It is poorly sourced, and although entertaining, lacks rigour. This is an important topic. It’s a shame that it is treated so badly.


© 2025 Rob Minto
