Counting Word Frequency in a SQL Column

By - Geek Logbook
Posted on 2025-01-03
Posted in Data

Counting Word Frequency in a SQL Column

Sometimes, you may need to analyze text data stored in a database, such as counting the frequency of words in a text column. This blog post demonstrates how to achieve this in SQL using a practical example.

Problem Overview

Let’s assume you have a table named feedback with a column comentarios that contains text data. The goal is to count how many times each word appears in this column.

Example Table: `feedback`

id	comentarios
1	Hadoop map reduce
2	map reduce example
3	Hadoop map example

The desired output is:

palabra	total
hadoop	2
map	3
reduce	2
example	2

SQL Solutions

Approach 1: Using PostgreSQL

PostgreSQL has powerful string and array manipulation functions that make this task easier.

WITH split_words AS (
    SELECT
        unnest(string_to_array(lower(comentarios), ' ')) AS palabra
    FROM feedback
),
word_count AS (
    SELECT
        palabra,
        COUNT(*) AS total
    FROM split_words
    GROUP BY palabra
)
SELECT *
FROM word_count
ORDER BY total DESC;

Explanation:

string_to_array: Splits the text in comentarios into an array of words using spaces as delimiters.
unnest: Expands the array into individual rows, where each row contains a single word.
COUNT(*) and GROUP BY: Groups identical words and counts their occurrences.

Output:

palabra	total
map	3
hadoop	2
reduce	2
example	2

Approach 2: Using MySQL

MySQL requires a different approach since it lacks built-in functions to directly split strings into arrays. A helper table is used instead.

Step 1: Create a Helper Table

The helper table generates numbers that represent word positions.

CREATE TEMPORARY TABLE numbers (n INT);
INSERT INTO numbers (n) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);

Step 2: Query to Extract and Count Words

SELECT
    LOWER(SUBSTRING_INDEX(SUBSTRING_INDEX(comentarios, ' ', n), ' ', -1)) AS palabra,
    COUNT(*) AS total
FROM feedback
JOIN numbers ON n <= CHAR_LENGTH(comentarios) - CHAR_LENGTH(REPLACE(comentarios, ' ', '')) + 1
GROUP BY palabra
ORDER BY total DESC;

Explanation:

SUBSTRING_INDEX: Extracts words by position using spaces as delimiters.
CHAR_LENGTH and REPLACE: Calculates the total number of words in each comment.
JOIN numbers: Associates the helper table to iterate through word positions.
GROUP BY: Groups by words to count occurrences.

Output:

palabra	total
map	3
hadoop	2
reduce	2
example	2

Considerations

Normalization: Convert text to lowercase and remove punctuation to avoid duplicates (e.g., “Hadoop” vs “hadoop”).
Performance: Processing large text data in SQL can be resource-intensive. Consider preprocessing text in a programming language like Python for complex cases.

Conclusion

While SQL may not be the first tool you think of for text analysis, it can effectively handle basic tasks like word frequency counts. Using PostgreSQL’s powerful functions or MySQL’s creative solutions, you can extract valuable insights from your text data directly within the database.

Tags:Hadoop

Geek Logbook

Recent Posts

Categories

Archives

Counting Word Frequency in a SQL Column

Problem Overview

Example Table: `feedback`

SQL Solutions

Approach 1: Using PostgreSQL

Explanation:

Output:

Approach 2: Using MySQL

Step 1: Create a Helper Table

Step 2: Query to Extract and Count Words

Explanation:

Output:

Considerations

Conclusion

Previous Article

Next Article

Geek Logbook

Recent Posts

Categories

Archives

Counting Word Frequency in a SQL Column

Problem Overview

Example Table: feedback

SQL Solutions

Approach 1: Using PostgreSQL

Explanation:

Output:

Approach 2: Using MySQL

Step 1: Create a Helper Table

Step 2: Query to Extract and Count Words

Explanation:

Output:

Considerations

Conclusion

Previous Article

Next Article

Example Table: `feedback`