

May 30 / David Corliss

When it comes to Big Data, size matters – but it’s not the size of the data!

As a physicist, I often come across technology issues that aren’t always easy to see in a world focused on programming and hardware. Take distributed networks, for example. They seem like a great way to handle Big Data: split it up into pieces of just the right size and scatter them across a scalable network, with as many nodes as needed to handle the data. However, one limitation that comes with the bargain is set not by the hardware or software but by the laws of nature.

Maybe you’ve seen the bumper sticker that reads “186,000 miles a second. It’s not just a good idea. It’s the law.” It refers to the speed of light, which always travels at the same speed in a vacuum. Outside of science fiction, nothing ever goes faster – and, what really matters here, no electrical or optical signal can go faster either. So this number really should be called the Speed of Information. Inside a computer, it works out to at most 30 centimeters – about 1 foot – every nanosecond (ns). This creates a natural speed limit for moving data: the farther data needs to be moved, the longer it takes. Size really does matter, and bigger is definitely not better. When it comes to Big Data, the physical size of the network – how far the data sits from where it gets used – is what really makes the difference today.
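
To put the 30 cm per nanosecond figure in concrete terms, here is a minimal back-of-the-envelope sketch in Python. The distances are hypothetical examples, and the numbers are only the one-way, speed-of-light lower bound – real networks add switching, protocol, and medium overhead on top of it.

# Lower bound on data-movement latency set by the speed of light:
# roughly 30 cm (about 1 foot) per nanosecond.

SPEED_OF_LIGHT_CM_PER_NS = 30.0  # approximate; the exact value is ~29.98 cm/ns

def min_latency_ns(distance_cm: float) -> float:
    """One-way light-travel time in nanoseconds for a distance in cm."""
    return distance_cm / SPEED_OF_LIGHT_CM_PER_NS

# Hypothetical distances, from inside a server out to across the globe.
examples = {
    "CPU to local memory (~10 cm)": 10,
    "across a rack (~2 m)": 200,
    "across a data center (~300 m)": 30_000,
    "Detroit to Tokyo (~10,000 km)": 1_000_000_000,
}

for label, cm in examples.items():
    print(f"{label}: at least {min_latency_ns(cm):,.1f} ns one way")

# The Detroit-to-Tokyo case alone is about 33,000,000 ns (roughly 33 ms)
# each way, before any real-world overhead is added.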

The solution is the data equivalent of work-at-home: using the data on the same distributed nodes where it is stored. Speed is gained simply by cutting down the physical distance between data storage and use, reducing the time it takes to move the data to the user’s program. While many of the first applications have been for simple queries and list management, more powerful tools are reaching the market. SAS is a leader in big data, offering analytic systems that operate at the node level on Hadoop servers (www.sas.com/en_us/software/sas-hadoop.html). More and more companies are adding distributed data usage side-by-side with storage as the only way to increase speed. 30 cm per ns: it’s not just a good idea. It’s the law.
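
The work-at-home idea can be sketched in a few lines of Python. This is only an illustration of the general pattern – not of any particular SAS or Hadoop interface – with made-up node names and records: each node scans its own slice of the data and sends back only the small result, so the raw records never cross the network.

# Sketch of a node-local query: each node filters its own slice of the
# data and returns only the matching IDs, so the bulky raw records
# never have to travel to the user's program.

from typing import Dict, List

# Hypothetical customer records, split across three storage nodes.
NODES: Dict[str, List[dict]] = {
    "node_1": [{"id": 101, "state": "MI"}, {"id": 102, "state": "OH"}],
    "node_2": [{"id": 201, "state": "MI"}, {"id": 202, "state": "CA"}],
    "node_3": [{"id": 301, "state": "MI"}],
}

def local_query(records: List[dict], state: str) -> List[int]:
    """Runs where the data lives; returns only a short list of IDs."""
    return [r["id"] for r in records if r["state"] == state]

# The coordinator collects only the small per-node results.
matches: List[int] = []
for records in NODES.values():
    matches.extend(local_query(records, "MI"))

print("Michigan customers:", sorted(matches))  # [101, 201, 301]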

As with every technological advance, distributed processing brings its own challenges. Some things just don’t work very well in a distributed environment. For example, taking an average is easy because the data can be used piecemeal, but finding a median is very hard because all of the data has to be examined together. It’s also essential for the nodes to be physically close together, because the time cost of 1 ns for every 30 cm applies no matter how the data gets there – wire, fiber optic, satellite or anything else. With big data, the work doesn’t happen on our desktop but where the nodes live, so putting half of the nodes in Japan, a bunch in Europe, and the rest in North America just isn’t going to work. Pushing the programs that use the data down to the node level overcomes one huge barrier, but other changes are needed to reap the benefits.
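
A minimal, hypothetical sketch makes the difference concrete: the mean can be assembled from tiny per-node summaries, while the median – at least when computed in the straightforward way shown here – forces every raw value to be gathered in one place first.

# The mean distributes well: each node contributes just (sum, count).
# The median does not: the straightforward computation needs every
# value collected together, which means moving all of the data.

from statistics import median
from typing import List, Tuple

# Hypothetical data slices stored on three separate nodes.
node_slices: List[List[float]] = [
    [2.0, 4.0, 6.0],
    [1.0, 3.0],
    [5.0, 7.0, 9.0, 11.0],
]

# Mean: only two numbers per node ever cross the network.
partials: List[Tuple[float, int]] = [(sum(s), len(s)) for s in node_slices]
mean = sum(total for total, _ in partials) / sum(count for _, count in partials)

# Median: every value has to be shipped to one place and examined together.
all_values = [v for s in node_slices for v in s]

print(f"mean   = {mean:.3f}  (combined from per-node summaries)")
print(f"median = {median(all_values):.3f}  (needed all {len(all_values)} values in one place)")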

Because, with all the emerging technology today to improve processing speed and efficiency, there is one thing we can count on that will never change.

The speed of light.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
David J. Corliss, PhD, consults on big data analytic solutions in industry while continuing research in statistical astrophysics at Wayne State University. His current projects include a book in development on Time Series Analysis for SAS and pro bono work applying advanced statistical methods to projects in social justice. @dcorliss_astro on Twitter