Pandas¶
Info
Pandas is a Python library for data analysis and manipulation. It provides data structures and functions to work with structured data like CSV files, Excel sheets or SQL tables.
Load & Save Data¶
import pandas as pd
# Load data
df = pd.read_csv("data.csv", sep=";", encoding="utf-8")
df = pd.read_excel("data.xlsx", sheet_name="Sheet1", engine="openpyxl")
# Save data
df.to_csv("data.csv", sep=";", encoding="utf-8", index=False)
Note:
sep
is the separator used in the CSV file (default is,
)encoding
is the encoding used in the CSV file (default isutf-8
)utf-8
latin-1
- see here for a list of all encodings
Web
import pandas as pd
url = "https://example.com/data.csv"
df = pd.read_csv(url, sep=";", encoding="utf-8")
SQL
import pandas as pd
import sqlite3
conn = sqlite3.connect("data.db")
df = pd.read_sql_query("SELECT * FROM data", conn)
Column Operations¶
Select Columns
df2 = df[["column1", "column2"]]
Drop Columns
df = df.drop(columns=["column1", "column2"])
Rename Column
df = df.rename(columns={"old": "new"})
Apply to Column
df["column"] = df["column"].apply(lambda x: x + 1)
Add Columns
df["new"] = df["column1"] + df["column2"]
Remove Duplicates
df = df.drop_duplicates(subset="column")
Row Operations¶
Filter Rows
df = df[df["column"] > 10]
Group Rows
df = df.groupby("column").sum()
Sort Rows
df = df.sort_values("column", ascending=False)
Select Rows and Columns
df = df.loc[df["column"] > 10, ["column1", "column2"]]
Iterate over Rows
for index, row in df.iterrows():
print(index, row["column1"], row["column2"])