Geek Logbook

Tech sea log book

Principles of Data Wrangling

Data Wrangling is the process of cleaning and organizing data before any analysis takes place. It typically consumes between 50% and 80% of an analyst’s time. Factors to consider include time, granularity, scope, and structure.

Importance:

1. Understanding the type of available data.

2. Choosing which data and level of detail to focus on.

3. Combining data from multiple sources appropriately to derive meaningful conclusions.

4. Determining if the size of the extracted datasets is suitable for further analysis.

Techniques:

1. Extraction

2. Parsing

3. Joining

4. Standardizing

5. Augmenting

6. Cleansing

7. Consolidating

8. Filtering
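A few of these techniques (parsing, standardizing, cleansing, and filtering) can be sketched in plain Python. This is only a minimal illustration: the records, field names, date formats, and the `parse_date` and `wrangle` helpers are all made up for the example and do not come from the book.

```python
from datetime import datetime

# Hypothetical raw records; field names and values are invented for illustration.
raw_rows = [
    {"customer": "  Alice ", "signup": "2023-01-15", "country": "us"},
    {"customer": "BOB", "signup": "15/02/2023", "country": "US"},
    {"customer": "", "signup": "2023-03-01", "country": "ar"},
]

def parse_date(text):
    """Parsing: try each assumed date format; return a date, or None if none match."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

def wrangle(rows):
    cleaned = []
    for row in rows:
        # Standardizing: trim whitespace and normalize casing.
        name = row["customer"].strip().title()
        # Cleansing: drop rows that have no customer name.
        if not name:
            continue
        cleaned.append({
            "customer": name,
            "signup": parse_date(row["signup"]),
            "country": row["country"].upper(),
        })
    return cleaned

result = wrangle(raw_rows)

# Filtering: keep only the rows relevant to the question at hand.
us_only = [r for r in result if r["country"] == "US"]
```

In a real project the rules (which formats to accept, which rows to drop) would come from knowing the business, not from the code.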

Data Wrangling (or data preparation) versus Classic ETL (Extract, Transform, and Load):

– Data Wrangling is typically handled by individuals who have a deep understanding of the business.

– ETL is aimed at IT users and deals with structured, homogeneous data.

Differences in Data and Users:

– Data Wrangling software emerged out of the need to understand, clean, and organize data of various shapes and sizes so that it can be used in traditional tools such as Excel.

– ETL is designed to handle structured and homogeneous data properly.

Use Cases:

– Data Wrangling tends to be more exploratory and is commonly used by smaller teams. These teams often seek to implement new combinations of data sources.

– ETL complements the information already available in the organization.

Data Wrangling is much more than just “Data Curation.”

Common Situations:

– Splitting data

– Identifying the relevant time frame

– The value of data is determined by three necessary but not sufficient characteristics: speed, precision, and personalization.

Basic Steps in Data Wrangling:

1. Access

2. Transformation

3. Profiling

4. Publishing

Profiling can encompass two slightly different approaches:

1. Examining individual values in your dataset.

2. Examining a summary view across multiple values in your dataset.
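The two profiling approaches can be sketched side by side. The sample values and the validity rule (present and non-negative) are assumptions made for this example only.

```python
from collections import Counter
from statistics import mean

# Hypothetical order amounts; None marks a missing value.
amounts = [12.5, 40.0, None, 7.25, -3.0, 40.0]

# Approach 1: examine individual values against a rule
# (assumed rule: a value must be present and non-negative).
invalid = [(i, v) for i, v in enumerate(amounts) if v is None or v < 0]

# Approach 2: examine a summary view across all values.
present = [v for v in amounts if v is not None]
summary = {
    "count": len(amounts),
    "missing": amounts.count(None),
    "min": min(present),
    "max": max(present),
    "mean": mean(present),
    "most_common": Counter(present).most_common(1)[0][0],
}
```

The first approach points at the exact rows to fix; the second shows whether the dataset as a whole looks plausible.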

Principles of Data Wrangling, p. 43

Transformation involves restructuring data, which includes changing its structure or granularity. This can be achieved by rearranging fields, creating new fields by extracting data from existing ones, or combining multiple fields into a new one.
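As a rough sketch of these restructuring operations, the snippet below combines two fields, extracts new fields from existing ones, and then changes granularity by counting rows per month. The records, field names, and the `transform` helper are hypothetical.

```python
# Hypothetical records; field names are illustrative only.
records = [
    {"first": "Ada", "last": "Lovelace", "email": "ada@example.com", "order_date": "2023-04-02"},
    {"first": "Alan", "last": "Turing", "email": "alan@example.org", "order_date": "2023-04-19"},
]

def transform(rec):
    return {
        # Combining multiple fields into a new one.
        "full_name": f"{rec['first']} {rec['last']}",
        # Creating new fields by extracting data from existing ones.
        "email_domain": rec["email"].split("@", 1)[1],
        "order_month": rec["order_date"][:7],
    }

transformed = [transform(r) for r in records]

# Changing granularity: from one row per order to one count per month.
orders_per_month = {}
for row in transformed:
    month = row["order_month"]
    orders_per_month[month] = orders_per_month.get(month, 0) + 1
```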

There are three primary types of data enrichment transformation [page 71]:

1. Unions

2. Joins

3. Driving new discoveries
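The first two types, unions and joins, can be sketched with plain lists and dictionaries. The datasets and key names below are invented for the example; a real project would more likely use a dataframe library or SQL.

```python
# Hypothetical datasets; names and values are invented for illustration.
sales_q1 = [{"customer_id": 1, "amount": 100}, {"customer_id": 2, "amount": 50}]
sales_q2 = [{"customer_id": 1, "amount": 70}, {"customer_id": 3, "amount": 20}]
customers = [{"customer_id": 1, "region": "North"}, {"customer_id": 2, "region": "South"}]

# Union: append the rows of datasets that share the same fields.
all_sales = sales_q1 + sales_q2

# Join (a left join here): add columns by looking up a shared key.
region_by_id = {c["customer_id"]: c["region"] for c in customers}
enriched = [
    {**sale, "region": region_by_id.get(sale["customer_id"])}
    for sale in all_sales
]
```

A union adds rows and a join adds columns. With a left join, sales whose customer is missing from the lookup keep a `region` of `None` instead of being dropped.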

Data Standardization [page 78]

Organizational best practices for data projects [page 75]
