This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Saturday, December 15, 2012

Data Science, Data Analysis, R and Python

The October 2012 issue of Harvard Business Review prominently features the words “Getting Control of Big Data” on the cover, and the magazine includes these three related articles:

  1. “Big Data: The Management Revolution,” by Andrew McAfee and Erik Brynjolfsson, pages 61 – 68;
  2. “Data Scientist: The Sexiest Job of the 21st Century,” by Thomas H. Davenport and D.J. Patil, pages 70 – 76;
  3. “Making Advanced Analytics Work For You,” by Dominic Barton and David Court, pages 79 – 83.

All three provide food for thought; this post presents a brief summary of some of those thoughts.

One point made in the first article is that the “size” of a dataset – i.e., what constitutes “Big Data” – can be measured in at least three very different ways: volume, velocity, and variety.  All three aspects complicate the problem of characterizing Big Data, but in different ways:

  - For very large data volumes, one fundamental issue is the incomprehensibility of the raw data itself.  Even if you could display a data table with several million, billion, or trillion rows and hundreds or thousands of columns, making any sense of this display would be a hopeless task.
  - For high-velocity datasets – e.g., real-time, Internet-based data sources – the data volume is determined by the observation time: at a fixed rate, the longer you observe, the more you collect.  If you are attempting to generate a real-time characterization that keeps up with this input rate, you face a fundamental trade-off: richer datasets acquired over longer observation periods require longer computation times, making it harder to keep up with the incoming data.
  - For high-variety datasets, a key challenge lies in finding useful ways to combine very different data sources into something amenable to a common analysis (e.g., combining images, text, and numerical data into a single joint analysis framework).

One practical corollary to these observations is the need for a computer-based data reduction process or “data funnel” that matches the volume, velocity, and/or variety of the original data sources with the ultimate needs of the organization.  In large organizations, this data funnel generally involves a mix of different technologies and people.  While it is not a complete characterization, some of these differences are evident from the primary software platforms used in the different stages of this data funnel: languages like HTML for dealing with web-based data sources; typically, some variant of SQL for dealing with large databases; a package like R for complex quantitative analysis; and often something like Microsoft Word, Excel, or PowerPoint for delivering the final results.  In addition, to help coordinate some of these tasks, there are likely to be scripts, written either in an operating system shell like UNIX or in a platform-independent scripting language like Perl or Python.
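
To make the database-to-R end of such a funnel concrete, here is a minimal sketch using the DBI and RSQLite packages, assuming a hypothetical SQLite database of sales transactions (the file, table, and column names are all made up): the aggregation is pushed into the SQL query so that only a small summary ever reaches R, and the result is written out as a CSV file for the reporting end of the funnel.

    ## Minimal sketch of the database-to-R-to-report portion of a "data funnel".
    ## The database file, table, and column names here are hypothetical.
    library(DBI)
    library(RSQLite)

    con <- dbConnect(SQLite(), dbname = "transactions.sqlite")

    ## Let the database do the volume reduction: aggregate via SQL before
    ## pulling anything into an R dataframe.
    dailySummary <- dbGetQuery(con, "
      SELECT saleDate, COUNT(*) AS nSales, SUM(amount) AS totalAmount
      FROM sales
      GROUP BY saleDate")
    dbDisconnect(con)

    ## Hand the reduced dataset to the reporting end of the funnel
    ## (e.g., for further summarization in Excel or PowerPoint).
    write.csv(dailySummary, file = "dailySummary.csv", row.names = FALSE)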

An important point omitted from all three articles is that there are at least two distinct application areas for Big Data:

  1. The class of “production applications” discussed in these articles, illustrated by examples like the unnamed U.S. airline described by McAfee and Brynjolfsson that adopted a vendor-supplied procedure to obtain better estimates of flight arrival times, improving its ability to schedule ground crews and saving several million dollars per year at each airport.  Similarly, the article by Barton and Court described a shipping company (again, unnamed) that used real-time weather forecast data and shipping port status data to develop an automated system that improved the on-time performance of its fleet.  Examples like these describe automated systems put in place to continuously exploit a large but fixed data source.
  2. The exploitation of Big Data for “one-off” analyses: a question is posed, and the data science team scrambles to find an answer.  This use is not represented by any of the examples described in these articles.  In fact, this second type of application overlaps considerably with the development process required to create a production application, although the end results are very different.  In particular, the end result of a one-off analysis is a single set of results, ultimately summarized to address the question originally posed.  In contrast, a production application requires continuing support and often has to meet challenging interface requirements between the IT systems that collect and preprocess the Big Data sources and those already in use by the end users of the tool (e.g., a Hadoop cluster running in a UNIX environment versus periodic reports generated, either automatically or on demand, from a Microsoft Access database of summary information).

A key point of Davenport and Patil’s article is that data science involves more than just the analysis of data: it is also necessary to identify data sources, acquire what is needed from them, re-structure the results into a form amenable to analysis, clean them up, and in the end, present the analytical results in a usable form.  In fact, the subtitle of their article is “Meet the people who can coax treasure out of messy, unstructured data,” and this statement forms the core of the article’s working definition for the term “data scientist.” (The authors indicate that the term was coined in 2008 by D.J. Patil, who holds a position with that title at Greylock Partners.)  Also, two particularly interesting tidbits from this article were the authors’ suggestion that a good place to find data scientists is at R User Groups, and their description of R as “an open-source statistical tool favored by data scientists.”

Davenport and Patil emphasize the difference between structured and unstructured data, a distinction that is especially relevant to the R community since most of R’s procedures are designed to work with the structured data types discussed in Chapter 2 of Exploring Data in Engineering, the Sciences, and Medicine: continuous, integer, nominal, ordinal, and binary.  More specifically, note that these variable types can all be included in dataframes, the data object type that is best supported by R’s vast and expanding collection of add-on packages.  Certainly, there is some support for other data types, and the level of this support is growing – the tm package and a variety of related packages support the analysis of text data, the twitteR package provides support for analyzing Twitter tweets, and the scrapeR package supports web scraping – but the acquisition and reformatting of unstructured data sources is not R’s primary strength.  Yet it is a key component of data science, as Davenport and Patil emphasize:

“A quantitative analyst can be great at analyzing data but not at subduing a mass of unstructured data and getting it into a form in which it can be analyzed.  A data management expert might be great at generating and organizing data in structured form but not at turning unstructured data into structured data – and also not at actually analyzing the data.”
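
To illustrate the structured side of this distinction, here is a minimal R sketch (with entirely made-up variable names and values) showing how all five of the structured variable types listed above can coexist in a single dataframe:

    ## Hypothetical dataframe combining the five structured variable types:
    patientData <- data.frame(
      weight   = c(71.3, 82.5, 64.8),                               # continuous
      nVisits  = c(2L, 5L, 1L),                                     # integer
      clinic   = factor(c("East", "West", "East")),                 # nominal
      severity = ordered(c("mild", "severe", "moderate"),
                         levels = c("mild", "moderate", "severe")), # ordinal
      smoker   = c(TRUE, FALSE, FALSE)                              # binary
    )
    str(patientData)   # confirms the type of each column

Unstructured sources like free text or images have no comparably direct dataframe representation, which is precisely the gap Davenport and Patil are describing.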

To better understand the distinction between the quantitative analyst and the data scientist implied by this quote, consider mathematician George Polya’s book, How To Solve It.  Originally published in 1945 and most recently re-issued in 2009, 24 years after the author’s death, this book is a very useful guide to solving math problems.  Polya’s basic approach consists of these four steps:

  1. Understand the problem;
  2. Formulate a plan for solving the problem;
  3. Carry out this plan;
  4. Check the results.

It is important to note what is not included in the scope of Polya’s four steps: Step 1 assumes a problem has been stated precisely, and Step 4 assumes the final result is well-defined, verifiable, and requires no further explanation.  While quantitative analysis problems are generally neither as precisely formulated as Polya’s method assumes nor as clear in their ultimate objective, the class of “quantitative analyst” problems that Davenport and Patil assume in the previous quote corresponds very roughly to problems of this type: they begin with something like an R dataframe and a reasonably clear idea of what analytical results are desired, and they end with a summary of the problem and a presentation of the results.  In contrast, the class of “data scientist” problems implied in Davenport and Patil’s quote comprises an expanded set of steps:

  1. Formulate the analytical problem: decide what kinds of questions could and should be asked in a way that is likely to yield useful, quantitative answers;
  2. Identify and evaluate potential data sources: what is available in-house, from the Internet, from vendors?  How complete are these data sources?  What would it cost to use them?  Are there significant constraints on how they can be used?  Are some of these data sources strongly incompatible?  If so, does it make sense to try to merge them approximately, or is it more reasonable to omit some of them?
  3. Acquire the data and transform it into a form that is useful for analysis; note that for sufficiently large data collections, part of this data will almost certainly be stored in some form of relational database, probably administered by others, and extracting what is needed for analysis will likely involve writing SQL queries against this database;
  4. Once the relevant collection of data has been acquired and prepared, examine it carefully to make sure it meets analytical expectations: do the formats look right?  Are the ranges consistent with expectations?  Do the relationships seen between key variables seem to make sense?
  5. Do the analysis: by lumping all of the steps of data analysis into this simple statement, I am not attempting to minimize the effort involved, but rather emphasizing the other aspects of the Big Data analysis problem;
  6. After the analysis is complete, develop a concise summary of the results that clearly and succinctly states the motivating problem, highlights what has been assumed, what has been neglected and why, and gives the simplest useful summary of the data analysis results.  (Note that this will often involve several different summaries, with different levels of detail and/or emphases, intended for different audiences.)

Here, Steps 1 and 6 necessarily involve close interaction with the end users of the data analysis results, and they lie mostly outside the domain of R.  (Conversely, knowing what is available in R can be extremely useful in formulating analytical problems that are reasonable to solve, and the graphical procedures available in R can be extremely useful in putting together meaningful summaries of the results.)  The primary domain of R is Step 5: given a dataframe containing what are believed to be the relevant variables, we generate, validate, and refine the analytical results that will form the basis for the summary in Step 6.  Part of Step 4 also lies clearly within the domain of R: examining the data once it has been acquired to make sure it meets expectations.  In particular, once we have a dataset or a collection of datasets that can be converted easily into one or more R dataframes (e.g., csv files or possibly relational databases), a preliminary look at the data is greatly facilitated by the vast array of R procedures available for graphical characterizations (e.g., nonparametric density estimates, quantile-quantile plots, boxplots and variants like beanplots or bagplots, and much more); for constructing simple descriptive statistics (e.g., means, medians, and quantiles for numerical variables, tabulations of level counts for categorical variables, etc.); and for preliminary multivariate characterizations (e.g., scatter plots, classical and robust covariance ellipses, classical and robust principal component plots, etc.).   
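
To give a flavor of what the R portion of Steps 4 and 5 might look like in practice, here is a brief sketch built entirely from base R procedures; the dataframe and variable names are hypothetical, and the specific plots and summaries chosen would naturally depend on the data at hand:

    ## Hypothetical prepared dataset, already converted to a CSV file:
    analysisFrame <- read.csv("analysisFrame.csv")

    ## Simple descriptive statistics and level counts (Step 4 sanity checks):
    summary(analysisFrame)
    table(analysisFrame$clinic)            # level counts for a categorical variable

    ## Graphical characterizations of a numerical variable:
    plot(density(analysisFrame$weight))    # nonparametric density estimate
    qqnorm(analysisFrame$weight)           # normal quantile-quantile plot
    boxplot(weight ~ clinic, data = analysisFrame)

    ## Preliminary multivariate characterizations:
    numVars <- c("weight", "nVisits", "age")
    pairs(analysisFrame[, numVars])                    # scatterplot matrix
    prcomp(analysisFrame[, numVars], scale. = TRUE)    # principal components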

The rest of this post discusses those parts of Steps 2, 3, and 4 above that fall outside the domain of R.  First, however, I have two observations.  My first observation is that because R is evolving fairly rapidly, some tasks which are “outside the domain of R” today may very well move “inside the domain of R” in the near future.  The packages twitteR and scrapeR, mentioned earlier, are cases in point, as are the continued improvements in packages that simplify the use of R with databases.  My second observation is that just because something is possible within a particular software environment doesn’t mean it is a good idea.  A number of years ago, I attended a student talk given at an industry/university consortium.  The speaker set up and solved a simple linear program (i.e., he implemented the simplex algorithm to solve a simple linear optimization problem with linear constraints) using an industrial programmable controller.  At the time, programming those controllers was done via relay ladder logic, a diagrammatic approach used by electricians to configure complicated electrical wiring systems.  I left the talk impressed by the student’s skill, creativity, and persistence, but I felt his efforts were extremely misguided.

Although it does not address every aspect of the “extra-R” components of Steps 2, 3, and 4 defined above – indeed, some of these aspects are so application-specific that no single book possibly could – Paul Murrell’s book Introduction to Data Technologies provides an excellent introduction to many of them.  (This book is also available as a free PDF file under a Creative Commons license.)  A point made in the book’s preface mirrors one in Davenport and Patil’s article:

“Data sets never pop into existence in a fully mature and reliable state; they must be cleaned and massaged into an appropriate form.  Just getting the data ready for analysis often represents a significant component of a research project.”

 Since Murrell is the developer of R’s grid graphics system that I have discussed in previous posts, it is no surprise that his book has an R-centric data analysis focus, but the book’s main emphasis is on the tasks of getting data from the outside world – specifically, from the Internet – into a dataframe suitable for analysis in R.  Murrell therefore gives detailed treatments of topics like HTML and Cascading Style Sheets (CSS) for working with Internet web pages; XML for storing and sharing data; and relational databases and their associated query language SQL for efficiently organizing data collections with complex structures.  Murrell states in his preface that these are things researchers – the target audience of the book – typically aren’t taught, but pick up in bits and pieces as they go along.  He adds:

“A great deal of information on these topics already exists in books and on the internet; the value of this book is in collecting only the important subset of this information that is necessary to begin applying these technologies within a research setting.”
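
To give one small example of the web-to-dataframe path Murrell describes, the following sketch uses the XML package to read the HTML tables on a web page directly into R dataframes; the URL is only a placeholder, and the sketch assumes the XML package is installed and the page is plain HTML:

    ## Read every HTML table on a (placeholder) web page into R dataframes:
    library(XML)
    url <- "http://www.example.com/page-with-a-table.html"
    tableList <- readHTMLTable(url, stringsAsFactors = FALSE)

    length(tableList)            # one dataframe per <table> element on the page
    flightData <- tableList[[1]] # keep the table of interest
    str(flightData)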

My one quibble with Murrell’s book is that he gives Python only a passing mention.  While I greatly prefer R to Python for data analysis, I have found Python to be more suitable than R for a variety of extra-analytical tasks, including preliminary explorations of the contents of weakly structured data sources, as well as certain important reformatting and preprocessing tasks.  Like R, Python is an open-source language, freely available for a wide variety of computing environments.  Also like R, Python has numerous add-on packages that support an enormous variety of computational tasks (over 25,000 at this writing).  In my day job in a SAS-centric environment, I commonly face tasks like the following: I need to create several nearly identical SAS batch jobs, each to read a different SAS dataset that is selected on the basis of information contained in the file name; submit these jobs, each of which creates a CSV file; harvest and merge the resulting CSV files; and run an R batch job to read this combined CSV file and perform computations on its contents.  I can do all of these things with a Python script, which also provides a detailed recipe of what I have done, so when I have to modify the procedure slightly and run it again six months later, I can quickly reconstruct what I did before.  I have found Python to be better suited than R to tasks that involve a combination of automatically generating simple programs in another language, data file management, text processing, simple data manipulation, and batch job scheduling.

Despite my Python quibble, Murrell’s book represents an excellent first step toward filling the knowledge gap that Davenport and Patil note between quantitative analysts and data scientists; in fact, it is the only book I know of that addresses this gap.  If you are an R aficionado interested in positioning yourself for “the sexiest job of the 21st century,” Murrell’s book is an excellent place to start.


  1. How funny, I was given a copy of George Polya’s book 'How to Solve It' when I was 16, almost 24 years ago. I only recently picked it back up and found that it is one of the most relevant books I own. With tricks like finding similar problems and solving those, the book gives avid problem solvers from all walks of life strategies to follow when they come up against a wall. I recommend you get yourself a copy double quick.
