March 6, 2024

How Do We Examine Data?

Every now and then, our clients present us with an opportunity to build a custom software solution that involves managing tabular data from spreadsheets. It goes without saying that a good understanding of the provided data is a necessity. Although numerous user-friendly tools like Excel, Numbers, and Google Sheets are available to assist in data analysis, they often come with limitations and complexities that may hinder our efficiency.

That’s why we prefer to use tools better suited for the job.

CSV Format

First of all, let’s talk about the format. If possible, we generally don’t work with xlsx files directly. Instead, we convert them into CSV and go on from there. This has the benefit of faster data processing, lower memory consumption, and, in general, a more pleasant experience in which we can focus solely on the crux of the problem: the data itself.

For the conversion from xlsx to CSV file format we use csvkit’s in2csv command.

in2csv data.xlsx > data.csv
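
If the workbook contains more than one sheet, in2csv can also pull out a specific one by name with the --sheet option. A quick sketch, assuming a sheet called "Orders" (the name is just an illustration):

in2csv --sheet "Orders" data.xlsx > orders.csv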

And that’s it; we’re now ready to start examining the data.

Examining Data

Now that we have the data converted to CSV format, we can begin examining what’s in there.

The simplest way to begin is to review all the data, which we can do by printing the entire content of the file.

csvcut data.csv | csvlook | less

There are times when datasets contain a huge amount of columns, not all of which may be relevant to our immediate analysis needs. In such cases, listing all available columns to pinpoint the ones of interest is a practical first step.

csvcut -n data.csv

Having identified the necessary columns, we can narrow down our output to include only those.

csvcut -c "First Name","Last Name","Email" data.csv | csvlook

Another very useful tool we use to examine the data is the csvstat command. This command helps us get a broad understanding of the data we’re working with.

csvcut -c "First Name","Last Name","Email" data.csv | csvstat

It’ll print some really useful information about each column: the data type, whether it contains null values, the number of unique values, the most common values, and so on.
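
When we only care about one or two of those statistics, csvstat can be asked for just them. A small sketch, assuming we want the number of unique email addresses and the most frequent last names:

csvstat --unique -c "Email" data.csv
csvstat --freq -c "Last Name" data.csv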

Deeper Analysis of the Data

After narrowing down the columns of interest, getting a high-level view of the data, and better understanding the document’s structure, we’re ready to analyze the data in more depth. Luckily, csvkit, in combination with other command line tools like awk, sed, wc, etc., can cover just about any use case one can think of.
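
As a hypothetical illustration of such a combination, summing a numeric column takes nothing more than a short pipeline (the "Amount" column here is made up for the example; csvstat --sum would give the same answer):

csvcut -c "Amount" data.csv | tail -n +2 | awk '{ sum += $1 } END { print sum }'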

Frequently, while analyzing the data, we need to search columns for specific patterns. That’s when the grep command comes in handy. We could pipe csvkit’s commands through the standard grep command to get the desired results, but luckily csvkit has the csvgrep command built in for this very purpose.

csvgrep -c "Email" -r ".*@gmail\.com" data.csv

The csvsort command is another useful tool we use to quickly sort the data.
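
A quick sketch of what that looks like, reusing the columns from earlier (adding -r would reverse the order):

csvsort -c "Last Name" data.csv | csvlook | less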

Command line tools can get us only so far, and at some point the natural course of action is to store the data in a relational database like SQLite or PostgreSQL and continue our analysis there. On the other hand, there are times when we really just need the power of SQL but don’t want to spend time setting up a database. And of course, csvkit has a tool for that too: csvsql can create an in-memory database for us and let us run SQL queries directly on CSV files. How awesome is that!

csvsql --query 'select "Last Name", count(*) as count from data group by "Last Name" order by count desc' data.csv
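
And when the dataset does call for a real database, csvsql can take care of that step too. A minimal sketch, assuming a local SQLite file is good enough (csvsql creates the table from the CSV header and inserts the rows):

csvsql --db sqlite:///data.db --insert data.csv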

Summing Things Up

So, there you have it. When it comes down to examining tabular data, csvkit is our trusty sidekick. It lets us easily slice, dice, and make sense of data, without ever leaving the comfort zone of our command line.

Sure, every now and then, the data’s complexity or sheer size signals the need for something a bit more robust. That’s when we embrace the power of relational databases. Whether it’s SQLite for something quick and easy, or PostgreSQL for heavier lifting, setting up a database is our next adventure. From there, the skies—or should I say, the depths of data analysis—are the limit.
