Python data wrangling cheat sheet

Maggie 184 Published: 06/23/2024

This cheat sheet lists the pandas and NumPy calls used most often for data wrangling in Python:

Importing Libraries

pandas (imported as pd): efficient data manipulation and analysis.
numpy (imported as np): numerical operations and array data structures.

Loading Data

pd.read_csv('file.csv'): Load a CSV file into a pandas DataFrame.
pd.read_excel('file.xlsx'): Load an Excel file into a pandas DataFrame.
pd.read_json('file.json'): Load a JSON file into a pandas DataFrame.
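
As a minimal sketch of the loaders above, the example below reads CSV and JSON from in-memory strings rather than from disk, so it runs without any files. The sample data and variable names are made up for illustration; Excel loading is left out because it needs an extra engine such as openpyxl.

```python
import io

import pandas as pd

# Hypothetical inline data standing in for 'file.csv' and 'file.json'.
csv_text = "name,score\nAda,90\nGrace,85\n"
json_text = '[{"name": "Ada", "score": 90}, {"name": "Grace", "score": 85}]'

# Both readers accept a path or a file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv.shape)
print(df_json.columns.tolist())
```

With real files, the same calls take a path string in place of the StringIO object.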

Data Inspection

df.head(): Display the first five rows of a DataFrame (pass n for a different count).
df.info(): Display column dtypes, non-null counts, and memory usage.
df.describe(): Generate descriptive statistics (count, mean, std, min, quartiles, max) for numeric columns by default.
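
The inspection calls above might be used on a small frame like this. The column names and values are invented for the example.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"city": ["Oslo", "Lima", "Pune"], "temp_c": [4.0, 19.5, 31.0]})

first_two = df.head(2)   # head() returns 5 rows unless told otherwise
df.info()                # prints dtypes and non-null counts to stdout
stats = df.describe()    # summarizes numeric columns only by default

print(first_two)
print(stats.loc["mean", "temp_c"])
```

Note that describe() skips the text column here; passing include="all" would summarize it as well.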

Data Cleaning

pd.isna(df): Return a boolean mask marking the missing values in a DataFrame.
df.fillna(value): Fill missing values with a specified value.
df.dropna(): Drop rows that contain any missing value.
df.replace(to_replace, value): Replace specific values throughout a DataFrame.
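
A short sketch of the cleaning calls above, assuming a toy frame where "?" is a placeholder for an unknown value. The data and the replacement label are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing number and one placeholder string.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "?", "z"]})

missing_per_column = df.isna().sum()             # count of missing values per column
filled = df.fillna({"a": 0.0})                   # fill only column "a"
dropped = df.dropna()                            # keep rows with no missing values
replaced = df.replace({"b": {"?": "unknown"}})   # swap "?" in column "b" only

print(missing_per_column)
print(replaced)
```

Each call returns a new DataFrame and leaves df unchanged, so the steps can be compared side by side.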

Data Transformation

pd.melt(df): Convert wide data to long format.
pd.pivot_table(df, columns='column'): Pivot a DataFrame from long format to wide format, aggregating values (mean by default).
df.groupby('column').sum(): Group by a column and sum each group.
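
To make the reshaping directions concrete, here is a small round trip: melt goes from wide to long, and pivot_table goes back from long to wide. The table and its column names (id, q1, q2, quarter, sales) are assumptions made for the example.

```python
import pandas as pd

# Hypothetical wide table: one row per id, one column per quarter.
wide = pd.DataFrame({"id": [1, 2], "q1": [10, 20], "q2": [30, 40]})

# Wide -> long: one row per (id, quarter) pair.
long = pd.melt(wide, id_vars="id", var_name="quarter", value_name="sales")

# Long -> wide again, summing sales within each cell.
back = long.pivot_table(index="id", columns="quarter", values="sales", aggfunc="sum")

# Total sales per id.
totals = long.groupby("id")["sales"].sum()

print(long)
print(back)
print(totals)
```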

Data Manipulation

pd.concat([df, other_df]): Stack DataFrames on top of one another. (df.append(other_df) was removed in pandas 2.0.)
df.merge(other_df, on='column'): Merge two DataFrames based on a common column.
df.sort_values(by='column'): Sort a DataFrame by one or more columns.
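
The sketch below combines, joins, and sorts three toy frames. It uses pd.concat to stack rows, since DataFrame.append is no longer available in pandas 2.0 and later. All frame and column names are hypothetical.

```python
import pandas as pd

# Hypothetical monthly order tables and a lookup table.
jan = pd.DataFrame({"id": [1, 2], "amount": [50, 20]})
feb = pd.DataFrame({"id": [3], "amount": [70]})
names = pd.DataFrame({"id": [1, 2, 3], "name": ["Ada", "Grace", "Linus"]})

orders = pd.concat([jan, feb], ignore_index=True)           # stack rows, renumber index
merged = orders.merge(names, on="id", how="left")           # add the name for each id
ranked = merged.sort_values(by="amount", ascending=False)   # largest amount first

print(ranked)
```

A left merge keeps every order even if its id has no match in the lookup table; the default inner merge would drop such rows.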

Data Visualization

df.plot() (after import matplotlib.pyplot as plt): Create a line plot of the numeric columns of a DataFrame.
sns.pairplot(df) (after import seaborn as sns): Create a scatter plot matrix for a DataFrame.
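
A minimal plotting sketch, assuming matplotlib is installed. It selects the non-interactive Agg backend so it can run without a display; in a notebook or desktop session that line can be dropped and plt.show() called instead. The data is made up, and seaborn is omitted to keep the dependencies small.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless run, so no window is opened

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: two numeric columns give two lines.
df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 8]})

ax = df.plot()                  # one line per numeric column, index on the x-axis
ax.set_xlabel("row index")
n_lines = len(ax.get_lines())   # number of lines drawn on the axes
print(n_lines)

plt.close(ax.get_figure())      # release the figure
```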

Common Tasks

df.set_index('column'): Return a DataFrame that uses a column as its index.
df.reset_index(drop=True): Reset to the default integer index and discard the old index.
pd.to_datetime(df['date']): Convert a column of date strings or timestamps to datetime dtype.
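
These three calls often appear together when preparing time-ordered data. The following sketch assumes a toy frame with ISO-formatted date strings; the names are illustrative.

```python
import pandas as pd

# Hypothetical data with dates stored as text, out of order.
df = pd.DataFrame({"date": ["2024-01-05", "2024-01-03"], "value": [2, 1]})

df["date"] = pd.to_datetime(df["date"])        # text -> datetime dtype
indexed = df.set_index("date").sort_index()    # date index, oldest first
flat = indexed.reset_index(drop=True)          # back to 0..n-1; the dates are discarded

print(indexed)
print(flat)
```

Calling reset_index() without drop=True would instead move the dates back into an ordinary column.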

Error Handling

try-except statements: Catch specific exceptions, such as FileNotFoundError when a file is missing, so that one bad input does not stop the whole script.
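
One way this might look when loading data is sketched below. The helper name load_csv_or_empty and the file name are hypothetical, and returning an empty DataFrame is just one possible fallback.

```python
import pandas as pd


def load_csv_or_empty(path):
    """Hypothetical helper: read a CSV, falling back to an empty DataFrame."""
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        print(f"{path} not found; returning an empty DataFrame")
        return pd.DataFrame()
    except pd.errors.EmptyDataError:
        print(f"{path} has no data; returning an empty DataFrame")
        return pd.DataFrame()


# Assumes no file with this name exists in the working directory.
result = load_csv_or_empty("definitely_missing_file.csv")
print(result.empty)
```

Catching named exceptions rather than a bare except keeps unrelated bugs visible.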

Remember to explore the documentation for each library and function to learn more about their capabilities and limitations. Happy wrangling!

What is data munging in Python?

Data munging in Python refers to the process of transforming, cleaning, and processing raw data into a more organized, structured, and useful form for analysis or visualization. It involves a range of tasks, including data manipulation, transformation, filtering, aggregation, and merging.

Python's popularity in data science is largely due to its extensive libraries and frameworks that facilitate efficient data munging. The following popular libraries are often used for data munging in Python:

Pandas: A library for data manipulation and analysis. It provides data structures like Series (1-dimensional labeled array) and DataFrame (2-dimensional labeled data structure with columns of potentially different types). Pandas is particularly useful for handling structured data, such as tabular data.
NumPy: Provides support for large, multi-dimensional arrays and matrices, making it an essential tool for scientific computing and data analysis. It also includes tools for efficiently manipulating arrays and performing mathematical operations on them.
Scikit-learn: Focused on machine learning, with algorithms for classification, regression, clustering, and more. Scikit-learn can also be used to preprocess data, perform feature selection, and transform datasets.
Openpyxl: Lets you create, read, and modify Excel (.xlsx) files. It supports spreadsheet operations such as formatting cells, setting font styles, and writing formulas.
SQLAlchemy: A SQL toolkit and object-relational mapper that works with databases such as MySQL, PostgreSQL, SQLite, and Oracle. SQLAlchemy lets you perform common database operations like querying, inserting, updating, and deleting data.

Common tasks involved in data munging include:

Data cleaning: Removing duplicates, handling missing values, correcting errors, and performing other tasks to prepare the data for analysis.
Data transformation: Converting data types, aggregating data, and performing other operations to reshape the data into a more suitable form for analysis or visualization.
Data filtering: Selecting specific rows or columns based on criteria such as conditions, dates, or values.
Data merging: Combining multiple datasets into a single dataset, often based on common keys like IDs or timestamps.
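
The four tasks above can be sketched end to end on a tiny example. The tables, column names, and the price threshold are assumptions chosen for illustration, not a prescribed workflow.

```python
import pandas as pd

# Hypothetical raw data: a duplicated row and a non-numeric price.
raw = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "price": ["10.5", "10.5", "n/a", "7"],
})
regions = pd.DataFrame({"id": [1, 2, 3], "region": ["north", "south", "north"]})

# Cleaning: remove exact duplicate rows.
clean = raw.drop_duplicates().copy()

# Transformation: convert text to numbers; unparseable values become NaN.
clean["price"] = pd.to_numeric(clean["price"], errors="coerce")
clean = clean.dropna(subset=["price"])

# Filtering: keep rows that meet a condition.
cheap = clean[clean["price"] < 10]

# Merging: attach region information by the shared id key.
merged = clean.merge(regions, on="id")

print(merged)
```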

In summary, data munging in Python involves using various libraries and frameworks to transform, clean, and process raw data into a more organized and useful form for analysis or visualization. The goal is to create high-quality data that can be used as input for machine learning models, statistical analyses, or other types of data exploration.