2014-05-20

Co.Labs

Businesses Can Now Use The Same Stats Language As Universities, Thanks To "Pandas"

The Python number-crunching toolkit gives programmers the statistical tools they need in a computer language that is familiar to businesses.



For years, specialized tools like R and SPSS have been standard for anyone working in statistics and analytics, whether in private industry or the academic sector.

But for many in the business world, those languages are unfamiliar, which means companies haven't been able to use them the way that, say, universities have. That could change thanks to an open-source Python data-analysis library called Pandas, which offers many of the same analytics tools as R in a language that businesses' developers are already using.

“One of the reasons we like to use Pandas is because we like to stay in the Python ecosystem,” says Burc Arpat, a quantitative engineering manager at Facebook. “We have a lot of systems inside Facebook, or infrastructure that allows us to either use Python to talk to those systems or integrates with Python very easily or is written in Python.”

Many of the engineers working on analytics projects at Facebook are well acquainted with R, which the company also uses for certain tasks, but Facebook’s existing Python codebase often makes Pandas easier to work with. That’s a common reason for developers to choose Pandas, says the library’s creator, Wes McKinney.

“In companies that have an engineering culture, just because of the general growth of Python, it’s made Python and Pandas an easy choice,” he says.

McKinney began developing the library while at the financial firm AQR Capital Management, where, he says, he was using R for quantitative finance projects and basic “data wrangling.”

“I was frustrated on multiple fronts,” he says. “I felt that R was not strong enough for software engineering—for building big software, R left a lot to be desired as far as the tooling for debugging and building big systems.”

McKinney started building his own toolkit in Python, which he saw as better suited to larger projects, and that evolved into what was open sourced as Pandas.

Like R, Pandas is oriented around manipulating “data frames,” two-dimensional tables of structured data similar to a well-organized spreadsheet or a table in a SQL-style relational database. Pandas has built-in support for quickly reading in data frames from Excel spreadsheets or comma-separated values (CSV) files, filtering rows and columns, and generating aggregate statistics like sums and means.
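To make that concrete, here is a minimal sketch of the read, filter, and aggregate workflow. The sales figures and column names are invented for illustration, and the CSV text is held in a string rather than a file so the snippet is self-contained.

```python
import io

import pandas as pd

# Hypothetical sales data standing in for a CSV file on disk.
csv_text = """region,product,units,price
East,widget,10,2.50
West,widget,4,2.50
East,gadget,3,10.00
West,gadget,7,10.00
"""

# read_csv accepts a file path or any file-like object.
sales = pd.read_csv(io.StringIO(csv_text))

# Add a derived column, then filter rows and select columns.
sales["revenue"] = sales["units"] * sales["price"]
east = sales.loc[sales["region"] == "East", ["product", "revenue"]]

# Aggregate statistics: total and mean revenue per region.
summary = sales.groupby("region")["revenue"].agg(["sum", "mean"])
print(summary)
```

In real use, the `io.StringIO` wrapper would simply be replaced by the path to a CSV file.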

Those data frames can also be passed to other Python number-crunching libraries, like the powerful statsmodels package that handles linear regressions and more esoteric statistical tasks, or the Scikit-Learn machine learning toolkit.
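As a rough sketch of that hand-off, the snippet below pulls columns out of a data frame and fits a straight line to them. To keep the example dependency-light it uses NumPy's least-squares solver as a stand-in rather than statsmodels or Scikit-Learn, and the ad-spend numbers are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical ad-spend data; the numbers are invented for illustration.
df = pd.DataFrame({
    "spend": [1.0, 2.0, 3.0, 4.0],
    "clicks": [3.0, 5.0, 7.0, 9.0],
})

# Build a design matrix (intercept column plus the predictor) from the frame.
X = np.column_stack([np.ones(len(df)), df["spend"].to_numpy()])
y = df["clicks"].to_numpy()

# Ordinary least squares via NumPy, standing in for a fuller stats package.
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef
print(f"clicks = {intercept:.2f} + {slope:.2f} * spend")
```

A dedicated statistics package would add standard errors and diagnostics on top of the coefficients, but the data-frame-to-array step looks much the same.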

And having stats-intensive coders work in Python means their work can more easily be integrated with production code, says Dave Himrod, the director of ad-quality engineering at the Internet advertising exchange AppNexus.

“In a lot of places, your data scientist or your quants or analysts, whatever you call them, they analyze the data and create this model in R or SPSS, and then an engineer has to take that and translate it into whatever your production system is,” he says. With Pandas, that translation step is often unnecessary.

“It’s nice that you can have your production environment and your researchers all using their same tools,” he says. “They just say, ‘Here’s what’s going on in this code,’ and everybody just knows the basic operations of Pandas and some of the [Python math] libraries like SciPy or NumPy.”

A new version of Pandas due out this month also offers better integration with a variety of SQL databases, says Jeff Reback, one of the lead developers on the project. Reback says Pandas’ ability to translate between different structured data layouts, like SQL, Excel, and more obscure formats like HDF5, is a crucial advantage in a world where Python has come to be the intermediate language connecting companies’ various computer systems.
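A small sketch of that kind of translation, using the SQLite database that ships with Python: a data frame is written to a SQL table and a query result is read back as another data frame. The table name, column names, and order amounts are hypothetical.

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database stands in for a company's real SQL system.
conn = sqlite3.connect(":memory:")

# Hypothetical order records held in a data frame.
orders = pd.DataFrame({
    "customer": ["ann", "bob", "ann"],
    "amount": [20.0, 35.0, 15.0],
})

# Write the data frame out as a SQL table.
orders.to_sql("orders", conn, index=False)

# Run a SQL query and get the result back as a data frame.
totals = pd.read_sql_query(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY customer",
    conn,
)
conn.close()
print(totals)
```

From there, `totals` could be written straight to another format with methods such as `to_csv`.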

"Python is now the new glue language,” replacing earlier choices like Perl, says Reback.

Still, Pandas users and developers emphasize that there’s a place in the statistical world for R. McKinney says R’s engineering tools have improved considerably since he began work on Pandas, thanks to the team behind the RStudio IDE and library developers like Hadley Wickham.

“The work that RStudio has done to build a better development environment for R has really changed the game for R programmers,” he says. “R and Python are both rapidly growing, and taking market share away from proprietary tools like SAS, SPSS, and MATLAB.”

[Image: Flickr user Stéfan]



