Geek Logbook

Tech sea log book

Counting Word Frequency in a SQL Column

Sometimes, you may need to analyze text data stored in a database, such as counting the frequency of words in a text column. This blog post demonstrates how to achieve this in SQL using a practical example.

Problem Overview

Let’s assume you have a table named feedback with a column comentarios that contains text data. The goal is to count how many times each word appears in this column.

Example Table: feedback

idcomentarios
1Hadoop map reduce
2map reduce example
3Hadoop map example

The desired output is:

palabratotal
hadoop2
map3
reduce2
example2

SQL Solutions

Approach 1: Using PostgreSQL

PostgreSQL has powerful string and array manipulation functions that make this task easier.

WITH split_words AS (
    SELECT
        unnest(string_to_array(lower(comentarios), ' ')) AS palabra
    FROM feedback
),
word_count AS (
    SELECT
        palabra,
        COUNT(*) AS total
    FROM split_words
    GROUP BY palabra
)
SELECT *
FROM word_count
ORDER BY total DESC;

Explanation:

  1. string_to_array: Splits the text in comentarios into an array of words using spaces as delimiters.
  2. unnest: Expands the array into individual rows, where each row contains a single word.
  3. COUNT(*) and GROUP BY: Groups identical words and counts their occurrences.

Output:

palabratotal
map3
hadoop2
reduce2
example2

Approach 2: Using MySQL

MySQL requires a different approach since it lacks built-in functions to directly split strings into arrays. A helper table is used instead.

Step 1: Create a Helper Table

The helper table generates numbers that represent word positions.

CREATE TEMPORARY TABLE numbers (n INT);
INSERT INTO numbers (n) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);

Step 2: Query to Extract and Count Words

SELECT
    LOWER(SUBSTRING_INDEX(SUBSTRING_INDEX(comentarios, ' ', n), ' ', -1)) AS palabra,
    COUNT(*) AS total
FROM feedback
JOIN numbers ON n <= CHAR_LENGTH(comentarios) - CHAR_LENGTH(REPLACE(comentarios, ' ', '')) + 1
GROUP BY palabra
ORDER BY total DESC;

Explanation:

  1. SUBSTRING_INDEX: Extracts words by position using spaces as delimiters.
  2. CHAR_LENGTH and REPLACE: Calculates the total number of words in each comment.
  3. JOIN numbers: Associates the helper table to iterate through word positions.
  4. GROUP BY: Groups by words to count occurrences.

Output:

palabratotal
map3
hadoop2
reduce2
example2

Considerations

  • Normalization: Convert text to lowercase and remove punctuation to avoid duplicates (e.g., “Hadoop” vs “hadoop”).
  • Performance: Processing large text data in SQL can be resource-intensive. Consider preprocessing text in a programming language like Python for complex cases.

Conclusion

While SQL may not be the first tool you think of for text analysis, it can effectively handle basic tasks like word frequency counts. Using PostgreSQL’s powerful functions or MySQL’s creative solutions, you can extract valuable insights from your text data directly within the database.

Tags: