Why Excel can be a nightmare for data science projects

Published by Loïc Lejoly

Excel, the spreadsheet application developed by Microsoft, is a powerful tool that lets you do almost anything you want. You can write programs, build simulators, construct complex formulas, load external data, create analytical charts, and much more.

People who master Excel will have no problem finding a job. Most sectors use Excel intensively to do almost everything with data (finance, banking, insurance, etc.), and some essentially survive thanks to it.

Since Excel places few restrictions on its users, it is not surprising to see presentations built in Excel, with screenshots pasted into spreadsheets. It is worth remembering that Excel is not PowerPoint. Others use Excel to write entire reports; that can be justified in some cases, but most of the time a word processor is more suitable for readability and accessibility.

Excel Data Sharing

Excel was developed to bring calculation features, graphing tools, and pivot tables, and it is still maintained by Microsoft. Although it is a great and powerful tool, it should not be used as storage for transformed data intended to be used outside the Excel environment.

When working on data (science) projects, receiving data stored in Excel files is generally not a good sign. The files mostly come from business experts who use Excel as a local data management system in which they store, transform, and aggregate data, and also perform analytics and dashboarding. The result is a set of complex files, sometimes referring to other Excel files; when one file is missing or a path is broken, everything falls apart.

Moreover, one could question the readability of Excel's formula language. Files containing advanced formulas and cells referencing other spreadsheets or files are not straightforward to comprehend.
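
To make this concrete, here is a hypothetical cross-sheet formula of the kind often found in such files, next to a possible pandas re-implementation; the workbook, sheet, and column names are invented for the example:

```python
import pandas as pd

# A typical cross-sheet Excel formula (hypothetical example):
#   =IF(VLOOKUP(A2,'Rates 2023'!$A$2:$C$500,3,FALSE)>0,
#       B2*VLOOKUP(A2,'Rates 2023'!$A$2:$C$500,3,FALSE),0)
# Nothing in the cell says what the rate table contains or why
# non-positive rates are silenced to zero.

# The same logic written as a documented transformation.
orders = pd.read_excel("sales.xlsx", sheet_name="Orders")
rates = pd.read_excel("sales.xlsx", sheet_name="Rates 2023")

# Business rule: an order is billed at amount * rate; products with a
# non-positive rate are billed at 0.
merged = orders.merge(rates[["product_id", "rate"]], on="product_id", how="left")
merged["billed"] = (merged["amount"] * merged["rate"]).where(merged["rate"] > 0, 0.0)
```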

Reverse Engineering Nightmare

When data workers (e.g. Data Engineers, Data Scientists, Data Architects) receive such files, the nightmare begins: we have to reverse engineer Excel files containing business knowledge. Such files can be the complex outcome of several years of work by business experts, so the reverse engineering step may take months, and that is often hard for the business to understand.

The reason is simple: data experts struggle to

  • create flat files (e.g. CSV, TSV) representing the raw data in an exploitable form, to avoid persisting mistakes and to ease future use of the data;
  • transcribe the transformations from Excel formulas into downstream processing on the database side.

This struggle is mainly due to missing descriptions or documentation and to the cryptic way the logic is encoded in the files.
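
As a minimal sketch of the first point, the raw sheets can at least be dumped to flat files before any re-implementation work starts; the workbook name below is hypothetical, and note that pandas reads the values cached by Excel rather than the formulas themselves:

```python
import pandas as pd

# Read every sheet of the workbook (sheet_name=None returns a dict of
# DataFrames, one per sheet).
sheets = pd.read_excel("business_knowledge.xlsx", sheet_name=None)

# Write one CSV per sheet so the raw data can be versioned and reused
# outside Excel.
for name, df in sheets.items():
    df.to_csv(f"raw_{name}.csv", index=False)
```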

When this step is eventually completed, we are able to import the result, which now contains the business knowledge in a standard form, into the target destination (e.g. a data warehouse, database, application, or machine learning algorithm).
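
As a hedged illustration of that import step, SQLite stands in below for whatever the real destination is, and the file and table names are invented:

```python
import sqlite3
import pandas as pd

# Cleaned flat file produced after the reverse engineering step.
clean = pd.read_csv("orders_clean.csv")

# Load it into the target store (a real project would point this at the
# data warehouse or application database instead of a local SQLite file).
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders", conn, if_exists="replace", index=False)
```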

Flat files instead of Excel files

Excel was certainly the most appropriate tool at the beginning of the 21st century, but with the explosion of data collection and generation, and the multitude of tools created for analytics, dashboarding, storage, and more, putting raw data and all the intelligence into Excel files is no longer the most suitable solution. Another approach should be considered: raw data should be stored in resilient data storage (e.g. SQL/NoSQL databases, a data warehouse, or flat files), and the business knowledge applied to the raw data should be clearly documented. Analyses can then be done with dashboarding, analytical, and BI tools such as Power BI, Tableau, Cumul.io, MicroStrategy, SAP BI/Analytics, Looker, or Qlik.
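
One possible way to implement that split between raw data and documented business knowledge is to leave the raw tables untouched and express each rule as a named, commented view kept under version control; the tables and the rule below are invented, and assume the data was loaded as in the earlier sketch:

```python
import sqlite3

# The raw data stays in its own tables; the business rule lives in a
# separate, documented view (assumes "orders" and "rates" already exist).
BILLED_ORDERS_VIEW = """
CREATE VIEW IF NOT EXISTS billed_orders AS
SELECT o.order_id,
       o.amount * r.rate AS billed   -- rule: bill each order at its product rate
FROM   orders o
JOIN   rates  r ON r.product_id = o.product_id
WHERE  r.rate > 0;                   -- rule: ignore non-positive rates
"""

with sqlite3.connect("warehouse.db") as conn:
    conn.execute(BILLED_ORDERS_VIEW)
```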

Nonetheless, Excel advocates can still use the tool as a data exploitation layer; as long as the raw data are not modified and the formulas applied are documented, it remains easy for data experts to understand their work.