Republic of Mathematics blog

Spreadsheets and big data

Posted by: Gary Ernest Davis on: May 8, 2012

Many people use spreadsheets for calculation and for storing data.

The tabular format of spreadsheets, together with the ability to use formulas, search the cells, plot charts, and change parameters and have the plots redraw, are compelling features that have embedded spreadsheets in popular use.

Spreadsheets are widely used for storing and transmitting data: the tabular layout that allows for sorting by columns is very appealing to anyone in an organization who collects and needs to disseminate data.

The widely used data analysis and statistical software R imports files directly from most spreadsheet formats, so it is very tempting for students of statistics and data analysis to store their data in a spreadsheet. For teaching purposes this does no apparent harm in the short term. However, in the longer term, the habit of using spreadsheets to store and disseminate data can be very problematic.

Despite the many rows and columns, a spreadsheet can effectively manipulate only a limited amount of data.

Excel 2007 has 17,179,869,184 cells (1,048,576 rows by 16,384 columns). If each of these cells were filled with data, that would seem to be a large amount of data by anyone’s standards. Imagine that in each of these cells there was only a 0 or a 1, and count each cell as one byte. A terabyte of data is 1,000,000,000,000 (a trillion) bytes, so an Excel 2007 spreadsheet filled with 0’s and 1’s would hold only about 1.7% of a terabyte of information: it would take about 58 such filled Excel spreadsheets to get a single terabyte of data.

A petabyte of data is 1,000 terabytes. To get this much data from spreadsheets filled with 0’s and 1’s we would need about 58,000 spreadsheets.
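The arithmetic above is easy to check with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes each 0 or 1 takes one byte and uses decimal terabytes, and the names `cells_per_sheet` and `sheets_needed` are made up for this illustration.

```python
# Back-of-the-envelope check of the spreadsheet figures.
# Assumption: each cell's 0 or 1 is counted as exactly one byte.

EXCEL_2007_ROWS = 1_048_576      # 2**20 rows per worksheet
EXCEL_2007_COLUMNS = 16_384      # 2**14 columns per worksheet
BYTES_PER_CELL = 1               # assumed cost of storing a single 0 or 1

TERABYTE = 10**12                # decimal terabyte (a trillion bytes)
PETABYTE = 1_000 * TERABYTE


def cells_per_sheet() -> int:
    """Number of cells in one full Excel 2007 worksheet."""
    return EXCEL_2007_ROWS * EXCEL_2007_COLUMNS


def sheets_needed(target_bytes: int) -> float:
    """How many completely filled worksheets add up to target_bytes."""
    return target_bytes / (cells_per_sheet() * BYTES_PER_CELL)


if __name__ == "__main__":
    print(f"cells per sheet:      {cells_per_sheet():,}")
    print(f"share of a terabyte:  {cells_per_sheet() / TERABYTE:.1%}")
    print(f"sheets per terabyte:  {sheets_needed(TERABYTE):.0f}")
    print(f"sheets per petabyte:  {sheets_needed(PETABYTE):,.0f}")
```

Changing `BYTES_PER_CELL` shows how sensitive the counts are to the storage assumption: a real spreadsheet file spends more than one byte per cell, so fewer sheets would be needed.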

So if every single person in the town of Great Yarmouth in the UK had an Excel 2007 spreadsheet filled with 0’s and 1’s we would have about a petabyte of data.

Surely no-one could regularly want to deal with that much data?

But that is just what Big Data sets (and extremely large data sets) contain. In fields such as genomics, meteorology,  internet searching, and finance informatics, petabytes of data are routine.

In fact exabytes of data are not uncommon: an exabyte is 1,000,000 terabytes, about the equivalent of every single person in a country such as Italy having an Excel 2007 spreadsheet full of data.

But wait, you say: a person in a medium size business, producing a list of employees and job descriptions, for example, doesn’t have to worry about exabytes of data. Surely they can keep on using a spreadsheet to store their data?

The answer is: of course they can and of course they will. Spreadsheets are simply too useful in everyday life to abandon.

Now we have  a problem when we want to amalgamate, or consolidate, the data from many, many thousands of spreadsheets.

How do we handle such data, how do we ensure its integrity and fidelity, how and where do we store it, and how do we analyze it?

One suggestion is to store spreadsheet data in a large spreadsheet format in the cloud that is scalable to handle big data sets. Another is to develop a spreadsheet search engine that could extract semantic information from large collections of spreadsheets.

Spreadsheets are probably not going away anytime soon, because of their useful features for handling small-scale data. Yet the demands of Big Data steer us toward thinking of effective ways of managing the accumulation and consolidation of manifold spreadsheet data sets.

Reference

Jacek Becla, Daniel Liwei Wang, Kian-Tat Lim, Report from the 5th Workshop on Extremely Large Databases, Data Science Journal, Volume 11, 23 March 2012.

Is this the sexiest mathematics job ever?

Posted by: Gary Ernest Davis on: May 6, 2012

Alexis Wajsbrot

Alexis Wajsbrot is a technical film director specializing in the simulation of  movement of  fluids and textiles.

Sound sexy so far?

Well, here are some of the projects Alexis has worked on:

Here’s a Vimeo link to some of Alexis’ film effects.

Alexis approximates a moving fluid or textile  by a large number of particles  in a 3-dimensional co-ordinate system.

The software he uses allows him  to choose how the particles move in space – controlling their speed and acceleration.
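To make the particle idea concrete, here is a minimal sketch in Python. It is not the software Alexis uses: the `Particle` class and `step` function are hypothetical names, and the update rule (semi-implicit Euler with one constant acceleration shared by all particles) is just one simple assumption about how speed and acceleration might drive the motion.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Particle:
    position: List[float]  # [x, y, z] in the 3-dimensional co-ordinate system
    velocity: List[float]  # [vx, vy, vz]


def step(particles: List[Particle], acceleration: Sequence[float], dt: float) -> None:
    """Advance every particle by one time step of length dt.

    Semi-implicit Euler: the velocity is updated from the acceleration
    first, and the new velocity is then used to move the position.
    """
    for p in particles:
        for axis in range(3):
            p.velocity[axis] += acceleration[axis] * dt
            p.position[axis] += p.velocity[axis] * dt


if __name__ == "__main__":
    # A tiny "spray" of three particles moving sideways while gravity pulls them down.
    spray = [Particle([0.0, 0.0, 1.0], [0.5 * i, 0.0, 0.0]) for i in range(3)]
    gravity = (0.0, 0.0, -9.81)
    for _ in range(10):
        step(spray, gravity, dt=0.01)
    for p in spray:
        print(p.position)
```

A production simulation would use many thousands of particles, forces that vary from particle to particle, and interactions between neighbours, but the loop of "update velocity, then update position" is the same basic shape.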

Another way Alexis simulates a fluid is to divide the region occupied by the fluid into a 3-dimensional grid of voxels (a blend of "volume" and "pixel"). He then uses the mathematics of fluid dynamics to calculate the velocity in each voxel at each time step, to create a realistic fluid motion.
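The voxel approach can be sketched in the same spirit. The code below is a deliberately tiny illustration, not a fluid solver and not Alexis' method: it stores one velocity component per voxel and applies a single explicit finite-difference step of the diffusion equation, which is only the viscous part of the full fluid equations. The names `make_grid` and `diffuse` and the parameter values are assumptions for the example.

```python
def make_grid(n, value=0.0):
    """Create an n x n x n grid of voxels, each holding one velocity component."""
    return [[[value for _ in range(n)] for _ in range(n)] for _ in range(n)]


def diffuse(u, viscosity, dt, dx=1.0):
    """One explicit time step of du/dt = viscosity * laplacian(u).

    Each interior voxel is nudged toward the average of its six face
    neighbours. Boundary voxels are held fixed. The explicit scheme is
    only stable when viscosity * dt / dx**2 is at most 1/6.
    """
    n = len(u)
    k = viscosity * dt / (dx * dx)
    new = [[[u[i][j][m] for m in range(n)] for j in range(n)] for i in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for m in range(1, n - 1):
                neighbours = (
                    u[i - 1][j][m] + u[i + 1][j][m]
                    + u[i][j - 1][m] + u[i][j + 1][m]
                    + u[i][j][m - 1] + u[i][j][m + 1]
                )
                new[i][j][m] = u[i][j][m] + k * (neighbours - 6.0 * u[i][j][m])
    return new


if __name__ == "__main__":
    grid = make_grid(5)
    grid[2][2][2] = 1.0            # a single fast-moving voxel in still fluid
    for _ in range(3):
        grid = diffuse(grid, viscosity=0.1, dt=1.0)
    print(grid[2][2][2], grid[1][2][2])
```

Running it shows the velocity of the central voxel spreading into its neighbours step by step. A realistic simulation would add the remaining terms of the fluid equations (advection, pressure, external forces) and a far finer grid.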

More details of Alexis’ work and Quicktime movies of his simulations can be found at +plus magazine, from which this information was taken.

Finally, here’s Alexis describing some of his work on Red Balloon (in French):