Geek Logbook

Tech sea log book

SODA: Check count distinct elements

Categories in Data Quality

When you are doing a quality you’re looking for six levels of knowledge about it. It might change depends on your requirements and what is the usage about it. Because, in general speaking the that not always work in a synchronous way, it’s possible that it could be from outside your organization and you need to discard it in order to achieve the creation of insights. But, talking again about the six points we in general take accounts of: completeness, accuracy, consistency, validity, uniqueness and integrity.

One common case to know if the data is complete or not is seeing if all the categories are in the field we have to check. For example: I want to know if in the database world we have all the continents. If you want to install the database you can follow this link to the MySQL official scripts

Structure of World Database

We are going to use this continents as the correct case. They are the complete case: North America, Asia, Africa, Europe, South America, Oceania, Antarctica. The database now have this records so we are going to take it as the “normal” we want to make a quality check.

The query

If we were asking to create a SQL query we would work around this idea:

SELECT COUNT(DISTINCT continent) AS cont FROM country; 

as the result has to be only 7 you should complete the query filtering the query by 7.

But, if you’re using SODA data quality check you’re going to notice that they don’t have something like count distinct on their Metrics and Checks however you can make this check using one of these: Failed rows checks or User defined checks

In the cecks.yml your quality checks should be like these:

checks for country:
  # FAILED ROW CHECKS -> https://docs.soda.io/soda-cl/failed-rows-checks.html
  - failed rows:
      fail query: |
          WITH continent AS (
            SELECT COUNT(DISTINCT continent) AS quantity FROM country
          )
 
          SELECT * FROM continent WHERE quantity != 7
  # USER DEFINED CHECKS -> https://docs.soda.io/soda-cl/user-defined.html
  - continents = 7:
      continents query: |
        SELECT COUNT(DISTINCT continent) AS quantity FROM country

The code in the repository

Tags:

Leave a Reply

Your email address will not be published. Required fields are marked *.