Python Ecosystem

Pandas

DataFrame library for loading, cleaning, and preparing training datasets before they go into model pipelines.

Core Type: DataFrame
Fast Backend: Arrow
Primary Use: EDA


SECTION 01

Pandas in the ML Pipeline

Raw data rarely arrives ready for model training. Pandas is the standard tool for the wrangling step: loading CSV/JSON/Parquet files, cleaning text, filtering rows, and constructing the train/val/test splits that go into your DataLoader.

Typical workflow: Load CSV β†’ inspect with .head()/.info() β†’ clean nulls/types β†’ transform features β†’ export to HuggingFace Dataset or PyTorch tensors.
SECTION 02

Loading & Inspecting Data

import pandas as pd

# Loading data
df = pd.read_csv("data.csv")
df = pd.read_json("data.jsonl", lines=True)  # JSONL (common for LLM datasets)
df = pd.read_parquet("data.parquet")         # Efficient columnar format

# Inspection
df.head(5)           # First 5 rows
df.tail(5)           # Last 5 rows
df.info()            # Column types, null counts, memory usage
df.describe()        # Stats for numeric columns: mean, std, min, max
df.shape             # (n_rows, n_cols)
df.columns.tolist()  # Column names

# Check for missing values
df.isnull().sum()            # Count nulls per column
df.isnull().sum() / len(df)  # Null fraction per column

# Value distribution
df["category"].value_counts()
df["label"].value_counts(normalize=True)  # As fractions summing to 1
SECTION 03

Cleaning & Transforming

import pandas as pd

df = pd.read_csv("qa_dataset.csv")

# Drop nulls
df = df.dropna(subset=["question", "answer"])  # Remove rows where these are null

# Fill nulls
df["context"] = df["context"].fillna("")

# String cleaning — common for NLP datasets
df["text"] = (
    df["text"]
    .str.strip()                            # Remove surrounding whitespace
    .str.lower()                            # Lowercase
    .str.replace(r"\s+", " ", regex=True)   # Normalize internal whitespace
)

# Derived columns
df["length"] = df["text"].str.len()                  # Character count
df["word_count"] = df["text"].str.split().str.len()  # Word count

# Filtering
df = df[df["word_count"] > 10]            # Keep rows with >10 words
df = df[df["label"].isin(["yes", "no"])]  # Keep only these labels

# Rename columns
df = df.rename(columns={"q": "question", "a": "answer"})
SECTION 04

Groupby & Aggregation

import pandas as pd

# Groupby: analyze statistics by category
stats = df.groupby("category").agg({
    "text": ["count", lambda x: x.str.len().mean()],
    "label": lambda x: (x == "yes").mean(),
})

# Check dataset balance
label_counts = df.groupby("label").size()
print(label_counts)

# Per-split stats
for split in ["train", "val", "test"]:
    subset = df[df["split"] == split]
    print(f"{split}: {len(subset)} rows, avg length: {subset['text'].str.len().mean():.0f}")

# Pivot table
df.pivot_table(values="score", index="model", columns="task", aggfunc="mean")
SECTION 05

Preparing ML Datasets

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset.csv")

# 1. Create prompt format for instruction tuning
def make_prompt(row):
    return f"### Instruction:\n{row['instruction']}\n\n### Response:\n{row['output']}"

df["prompt"] = df.apply(make_prompt, axis=1)

# 2. Filter by length (avoid truncation issues)
df["n_chars"] = df["prompt"].str.len()
df = df[df["n_chars"].between(50, 4000)]

# 3. Train/val/test split (80/10/10)
train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)
print(f"Train: {len(train_df)}, Val: {len(val_df)}, Test: {len(test_df)}")

# 4. Save all three splits
train_df.to_parquet("train.parquet", index=False)
val_df.to_parquet("val.parquet", index=False)
test_df.to_parquet("test.parquet", index=False)
SECTION 06

Pandas β†’ HF Datasets

import pandas as pd
from datasets import Dataset, DatasetDict

train_df = pd.read_parquet("train.parquet")
val_df = pd.read_parquet("val.parquet")

# Convert to HuggingFace Dataset (Arrow-backed, efficient)
train_ds = Dataset.from_pandas(train_df, preserve_index=False)
val_ds = Dataset.from_pandas(val_df, preserve_index=False)
dataset = DatasetDict({"train": train_ds, "validation": val_ds})

# Tokenize with map() — parallel processing
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")

def tokenize(batch):
    return tokenizer(batch["prompt"], truncation=True, max_length=512, padding="max_length")

tokenized = dataset.map(tokenize, batched=True, num_proc=4)
tokenized.set_format("torch", columns=["input_ids", "attention_mask"])

# Now ready for a DataLoader
from torch.utils.data import DataLoader

loader = DataLoader(tokenized["train"], batch_size=8, shuffle=True)
Tip: Use .parquet for intermediate storage β€” it's columnar (fast column reads), compressed, and reads directly into both pandas and HuggingFace Datasets.

Performance Optimization

Pandas operations on large datasets (>1GB) benefit from chunking, category dtypes for string columns, and downcast numeric types. eval() and query(), which evaluate whole expressions at once (using the numexpr engine when installed), can accelerate filtering by roughly 2-10x compared to boolean indexing on massive DataFrames.

Operation            Time (1M rows)   Memory (MB)   Best For
Boolean indexing     145ms            128           Small-medium datasets
DataFrame.query()    82ms             105           Complex conditions
DataFrame.eval()     56ms             95            Column operations
Numba JIT            12ms             110           Custom loops
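A minimal sketch of the query()/eval() comparison above, on synthetic data (the column names and sizes are illustrative; numexpr is optional and pandas falls back to pure Python without it):

```python
import numpy as np
import pandas as pd

# Hypothetical 1M-row frame, just to make the comparison concrete
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(1_000_000), "b": rng.random(1_000_000)})

# Boolean indexing: each comparison materializes a full intermediate mask array
out_bool = df[(df["a"] > 0.5) & (df["b"] < 0.3)]

# query(): selects the same rows, evaluated as a single expression
out_query = df.query("a > 0.5 and b < 0.3")

# eval(): column arithmetic without naming intermediates
df = df.eval("c = a + b")
```

Both selections return identical rows; the difference is only in how the expression is evaluated.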

Integration with Machine Learning Pipelines

Pandas DataFrames integrate smoothly with scikit-learn pipelines through sklearn's ColumnTransformer and Pipeline classes. For deep learning, avoid row-wise iteration with pandas.DataFrame.iterrows(); instead convert whole columns at once with DataFrame.to_numpy() (or go through PyArrow) and wrap the resulting arrays as PyTorch tensors.
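A minimal sketch of the ColumnTransformer pattern, using a hypothetical frame with one numeric and one categorical column (column names and data are invented for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training frame
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["NY", "SF", "NY", "LA"],
    "label": [0, 1, 1, 0],
})

# Route each column to the right preprocessing step
pre = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Pipeline fits preprocessing on train data only, avoiding leakage
clf = Pipeline([("pre", pre), ("model", LogisticRegression())])
clf.fit(df[["age", "city"]], df["label"])
preds = clf.predict(df[["age", "city"]])
```

Because the transformer is fit inside the pipeline, calling clf.predict() on new data reuses the training-set scaling and encoding.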

Pandas performance optimization becomes critical when working with datasets exceeding 1GB. Memory-efficient workflows use chunked processing: split large files into 100K-1M row chunks, process them independently, then combine the results. Category dtypes can cut memory usage for repetitive string columns by up to ~90% by replacing repeated strings with integer codes. Downcasting numeric types from float64 to float32 (or int64 to int32) halves memory for numeric columns with minimal accuracy loss. For time-series data with timestamp indices, pandas' DatetimeIndex enables fast window operations and resampling, and MultiIndex DataFrames support the hierarchical operations essential for grouped analyses. apply() with raw=True avoids per-row Series creation overhead, while groupby().agg() with a dictionary of per-column operations outperforms catch-all lambda functions. For production data pipelines, pandas integrates with Spark through PySpark's pandas API, enabling distributed computation with familiar syntax, and DataFrame.to_parquet() provides far better compression and read performance than CSV for repeated access patterns.
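The category/downcast savings above can be measured directly with memory_usage(deep=True); a small sketch on synthetic data (column names and sizes are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: repetitive strings plus default-width numeric dtypes
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "label": ["yes", "no"] * 50_000,     # object dtype: one Python str per cell
    "score": rng.random(100_000),        # float64 by default
    "count": np.arange(100_000),         # int64 by default
})

before = df.memory_usage(deep=True).sum()

df["label"] = df["label"].astype("category")            # integer codes + tiny lookup table
df["score"] = df["score"].astype("float32")             # halve float storage
df["count"] = pd.to_numeric(df["count"], downcast="integer")  # int64 -> smallest safe int

after = df.memory_usage(deep=True).sum()
print(f"memory reduced {before / after:.1f}x")
```

The biggest win usually comes from the categorical conversion, since object columns store a full Python string object per cell.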

Advanced pandas operations for production data pipelines require careful memory and performance management. Large-scale groupby operations on 1B+ row datasets benefit from explicit chunking: process 10M-row chunks independently, then combine the aggregation results. Distributed computing frameworks (Spark DataFrames with the pandas API, Dask DataFrames) scale the same code to multi-node clusters. The categorical dtype optimizes string storage automatically: with fewer than 256 unique values, the codes fit in int8, one byte per element versus tens of bytes for Python string objects. Index optimization is critical: set the column that matches your query patterns as the index, and use a MultiIndex for hierarchical queries. Time-series operations include rolling windows (rolling(window=30).mean()), exponential smoothing (ewm(span=30).mean()), and resampling (resample('1D').sum()). Merge operations also deserve attention: joining on integer keys is markedly faster than joining on string keys. For NLP/ML pipelines, DataFrame.to_numpy() produces a NumPy array (a copy when dtypes are mixed) that torch.from_numpy() can then wrap without further copying. Memory profiling tools (memory_profiler, py-spy) identify bottlenecks in large workflows. Typical optimization reduces peak memory 50-70% and execution time 10-20x through strategic use of categorical dtypes, chunking, and efficient I/O formats.
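The chunk-then-combine pattern for aggregations needs combinable partial results; a sketch computing per-category means chunk by chunk (a tiny in-memory CSV stands in for a large file, and chunksize=2 is artificially small to exercise the pattern):

```python
import io

import pandas as pd

# Tiny stand-in for a large on-disk file
csv = io.StringIO("category,value\na,1\nb,2\na,3\nb,4\na,5\nb,6\n")

# Combine partial sums and counts across chunks, never chunk means:
# averaging per-chunk means would be wrong for uneven chunks.
totals, counts = {}, {}
for chunk in pd.read_csv(csv, chunksize=2):
    g = chunk.groupby("category")["value"]
    for cat, s in g.sum().items():
        totals[cat] = totals.get(cat, 0.0) + float(s)
    for cat, n in g.count().items():
        counts[cat] = counts.get(cat, 0) + int(n)

means = {cat: totals[cat] / counts[cat] for cat in totals}
print(means)  # identical to a single-pass df.groupby("category")["value"].mean()
```

Sum and count are associative, so the chunked result matches the single-pass groupby exactly; aggregations like median need different strategies (e.g. approximate sketches).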

Integration patterns for pandas in machine learning systems typically follow: data loading → cleaning → feature engineering → train/test split → model training. Custom pandas extensions enable domain-specific operations: medical data pipelines add phenotype extraction functions, financial workflows add portfolio rebalancing operations. Spark SQL integration handles large datasets: SQL queries execute on the Spark engine, and results convert to pandas for analysis. Prefer vectorized operations over DataFrame.apply() (df['col'] * 2 instead of df.apply(lambda x: x['col'] * 2, axis=1)). For high-throughput systems processing 100K+ rows/sec, batch the data: iterator patterns such as pd.read_csv(..., chunksize=50000) avoid memory spikes. MLOps workflows use pandas for data validation: check that distributions match the training data (e.g. with a Kolmogorov-Smirnov test), detect feature drift, and validate schemas. Monitoring production models likewise involves pandas-based analysis: compute prediction distributions, identify anomalies, and correlate input features with outputs. The combination of pandas' ease of use with distributed computing frameworks enables both rapid prototyping and production-scale data engineering.

Advanced pandas operations for machine learning workflows require balancing readability with performance. Vectorized column operations outperform row-wise ones (df.apply, iterrows) by 10-100x: prefer df["col"].rolling(window=5).mean() over looping through rows by hand. String operations on object-dtype columns are slow: converting repetitive columns to the categorical dtype can speed up string comparisons dramatically (on the order of 50-90x for highly repetitive data). Optimize memory progressively: identify the hogs with df.memory_usage(deep=True), downcast columns (float64→float32, int64→int32 where safe), and use sparse data structures for data with many zeros. For time-series data, set the datetime column as the index: this enables time-based slicing (df.loc["2023-01":"2023-03"]) and efficient resampling. Categorical groupbys benefit from observed=True, which avoids materializing empty groups for unused categories. Method chaining improves readability: df.pipe(clean).pipe(transform).pipe(aggregate) reads like a pipeline rather than nested function calls. For debugging performance, use %prun (the IPython profiler) to find slow operations and memory_profiler to profile memory. Production pipelines pair pandas with validation: check shapes match expectations, verify no NaNs were introduced, validate value ranges, and detect feature drift (changed distributions). Pandas reads directly from cloud storage (S3, GCS, Azure Blob Storage) via the appropriate filesystem backend, and Spark DataFrames with pandas UDFs (user-defined functions) enable distributed processing with pandas syntax: the scale of Spark with the ease of pandas.
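The .pipe() chaining style above can be sketched with a few toy steps; the function names and the tiny frame are invented for illustration:

```python
import pandas as pd

# Each step is a plain function that takes a frame and returns a frame
def clean(df):
    return df.dropna(subset=["text"])

def add_length(df):
    return df.assign(n_words=df["text"].str.split().str.len())

def keep_long(df, min_words=2):
    return df[df["n_words"] >= min_words]

raw = pd.DataFrame({"text": ["hello world", "hi", None, "one two three"]})

# Reads top-to-bottom like a pipeline; extra args pass through .pipe()
out = raw.pipe(clean).pipe(add_length).pipe(keep_long, min_words=2)
print(out)
```

Because every step returns a new frame, the chain has no hidden mutation, and individual steps stay unit-testable in isolation.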

Integration with machine learning libraries requires careful data pipeline design. Scikit-learn pipelines chain preprocessing and models: a preprocessing step standardizes features, then the model step applies the algorithm. Pipelines ensure preprocessing is fit on the training set only, preventing data leakage, and apply the same transformation to the test set. Pandas integration goes through ColumnTransformer, which specifies which columns receive which preprocessing; custom transformers wrap pandas operations in classes extending BaseEstimator. PyTorch integration converts DataFrames to tensors via to_numpy() (which copies for mixed dtypes); TensorFlow uses tf.data.Dataset.from_tensor_slices for efficient datasets. Polars offers a 10-100x speedup on large datasets, with a different API. Real-time ML systems stream data into pandas with rolling windows and retrain periodically: batch processing maximizes throughput, while streaming minimizes latency. Production pipelines maintain feature stores for centralized feature computation and reuse. Data quality monitoring detects distribution shift by comparing current statistics against a baseline, and automated retraining triggers on a monthly schedule, on accuracy drops greater than 2 percent, or when new data exceeds 10 percent of the previous volume.
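The DataFrame-to-tensor conversion above hinges on getting one uniform-dtype NumPy array out of the frame; a small sketch (the frame is illustrative, and the torch step is noted in a comment rather than executed):

```python
import numpy as np
import pandas as pd

# Hypothetical feature frame with uniform numeric columns
df = pd.DataFrame({"x1": [1.0, 2.0], "x2": [3.0, 4.0]})

# A uniform-dtype frame yields a single contiguous array;
# mixed dtypes would force an object array and per-element copies.
arr = df.to_numpy(dtype=np.float32)

# torch.from_numpy(arr) would then wrap this array without copying
# (assumes PyTorch is installed; omitted here to keep the sketch light)
print(arr.shape, arr.dtype)
```

Casting to float32 up front avoids a second conversion inside the training loop, since most deep learning models expect float32 inputs.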

Operation           Pandas method              LLM use case
Load CSV            pd.read_csv()              Load evaluation dataset
Filter rows         df[df.col == val]          Select failed eval examples
Add column          df["col"] = values         Attach model responses
Group + aggregate   df.groupby().agg()         Score by category or model
Export              df.to_csv() / to_json()    Save results for review