Counting Word Frequency in a SQL Column
Sometimes, you may need to analyze text data stored in a database, such as counting the frequency of words in a text column. This blog post demonstrates how to achieve this in SQL using a practical example.
Problem Overview
Let’s assume you have a table named feedback with a column comentarios that contains text data. The goal is to count how many times each word appears in this column.
Example Table: feedback
| id | comentarios |
|---|---|
| 1 | Hadoop map reduce |
| 2 | map reduce example |
| 3 | Hadoop map example |
The desired output is:
| palabra | total |
|---|---|
| hadoop | 2 |
| map | 3 |
| reduce | 2 |
| example | 2 |
SQL Solutions
Approach 1: Using PostgreSQL
PostgreSQL has powerful string and array manipulation functions that make this task easier.
WITH split_words AS (
SELECT
unnest(string_to_array(lower(comentarios), ' ')) AS palabra
FROM feedback
),
word_count AS (
SELECT
palabra,
COUNT(*) AS total
FROM split_words
GROUP BY palabra
)
SELECT *
FROM word_count
ORDER BY total DESC;
Explanation:
string_to_array: Splits the text incomentariosinto an array of words using spaces as delimiters.unnest: Expands the array into individual rows, where each row contains a single word.COUNT(*)andGROUP BY: Groups identical words and counts their occurrences.
Output:
| palabra | total |
|---|---|
| map | 3 |
| hadoop | 2 |
| reduce | 2 |
| example | 2 |
Approach 2: Using MySQL
MySQL requires a different approach since it lacks built-in functions to directly split strings into arrays. A helper table is used instead.
Step 1: Create a Helper Table
The helper table generates numbers that represent word positions.
CREATE TEMPORARY TABLE numbers (n INT);
INSERT INTO numbers (n) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
Step 2: Query to Extract and Count Words
SELECT
LOWER(SUBSTRING_INDEX(SUBSTRING_INDEX(comentarios, ' ', n), ' ', -1)) AS palabra,
COUNT(*) AS total
FROM feedback
JOIN numbers ON n <= CHAR_LENGTH(comentarios) - CHAR_LENGTH(REPLACE(comentarios, ' ', '')) + 1
GROUP BY palabra
ORDER BY total DESC;
Explanation:
SUBSTRING_INDEX: Extracts words by position using spaces as delimiters.CHAR_LENGTHandREPLACE: Calculates the total number of words in each comment.JOIN numbers: Associates the helper table to iterate through word positions.GROUP BY: Groups by words to count occurrences.
Output:
| palabra | total |
|---|---|
| map | 3 |
| hadoop | 2 |
| reduce | 2 |
| example | 2 |
Considerations
- Normalization: Convert text to lowercase and remove punctuation to avoid duplicates (e.g., “Hadoop” vs “hadoop”).
- Performance: Processing large text data in SQL can be resource-intensive. Consider preprocessing text in a programming language like Python for complex cases.
Conclusion
While SQL may not be the first tool you think of for text analysis, it can effectively handle basic tasks like word frequency counts. Using PostgreSQL’s powerful functions or MySQL’s creative solutions, you can extract valuable insights from your text data directly within the database.