Back to Skills
    🦞

    pandas-construction-analysis

    Comprehensive Pandas

    By @datadrivenconstruction
    View on GitHub
    SKILL.md
    ---
    slug: "pandas-construction-analysis"
    display_name: "Pandas Construction Analysis"
    description: "Comprehensive Pandas toolkit for construction data analysis. Filter, group, aggregate BIM elements, calculate quantities, merge datasets, and generate reports from structured construction data."
    ---
    
    # Pandas Construction Data Analysis
    
    ## Overview
    
    Based on DDC methodology (Chapter 2.3), this skill provides comprehensive Pandas operations for construction data processing. Pandas is the Swiss Army knife for data analysts - handling everything from simple data filtering to complex aggregations across millions of rows.
    
    **Book Reference:** "Pandas DataFrame и LLM ChatGPT" / "Pandas DataFrame and LLM ChatGPT"
    
    > "Используя Pandas, вы можете управлять и анализировать наборы данных, намного превосходящие возможности Excel. В то время как Excel способен обрабатывать до 1 миллиона строк данных, Pandas может без труда работать с наборами данных, содержащими десятки миллионов строк."
    > — DDC Book, Chapter 2.3
    
    ## Quick Start
    
    ```python
    import pandas as pd
    
    # Read construction data
    df = pd.read_excel("bim_export.xlsx")
    
    # Basic operations
    print(df.head())           # First 5 rows
    print(df.info())           # Column types and memory
    print(df.describe())       # Statistics for numeric columns
    
    # Filter structural elements
    structural = df[df['Category'] == 'Structural']
    
    # Calculate total volume
    total_volume = df['Volume'].sum()
    print(f"Total volume: {total_volume:.2f} m³")
    ```
    
    ## DataFrame Fundamentals
    
    ### Creating DataFrames
    
    ```python
    import pandas as pd
    
    # From dictionary (construction elements)
    elements = pd.DataFrame({
        'ElementId': ['E001', 'E002', 'E003', 'E004'],
        'Category': ['Wall', 'Floor', 'Wall', 'Column'],
        'Material': ['Concrete', 'Concrete', 'Brick', 'Steel'],
        'Volume_m3': [45.5, 120.0, 32.0, 8.5],
        'Level': ['Level 1', 'Level 1', 'Level 2', 'Level 1']
    })
    
    # From CSV
    df_csv = pd.read_csv("construction_data.csv")
    
    # From Excel
    df_excel = pd.read_excel("project_data.xlsx", sheet_name="Elements")
    
    # From multiple Excel sheets
    all_sheets = pd.read_excel("project.xlsx", sheet_name=None)  # Dict of DataFrames
    ```
    
    ### Data Types in Construction
    
    ```python
    # Common data types for construction
    df = pd.DataFrame({
        'element_id': pd.Series(['W001', 'W002'], dtype='string'),
        'quantity': pd.Series([10, 20], dtype='int64'),
        'volume': pd.Series([45.5, 32.0], dtype='float64'),
        'is_structural': pd.Series([True, False], dtype='bool'),
        'created_date': pd.to_datetime(['2024-01-15', '2024-01-16']),
        'category': pd.Categorical(['Wall', 'Slab'])
    })
    
    # Check data types
    print(df.dtypes)
    
    # Convert types
    df['quantity'] = df['quantity'].astype('float64')
    df['volume'] = pd.to_numeric(df['volume'], errors='coerce')
    ```
    
    ## Filtering and Selection
    
    ### Basic Filtering
    
    ```python
    # Single condition
    walls = df[df['Category'] == 'Wall']
    
    # Multiple conditions (AND)
    large_concrete = df[(df['Material'] == 'Concrete') & (df['Volume_m3'] > 50)]
    
    # Multiple conditions (OR)
    walls_or_floors = df[(df['Category'] == 'Wall') | (df['Category'] == 'Floor')]
    
    # Using isin for multiple values
    structural = df[df['Category'].isin(['Wall', 'Column', 'Beam', 'Foundation'])]
    
    # String contains
    insulated = df[df['Description'].str.contains('insulated', case=False, na=False)]
    
    # Null value filtering
    incomplete = df[df['Cost'].isna()]
    complete = df[df['Cost'].notna()]
    ```
    
    ### Advanced Selection
    
    ```python
    # Select columns
    volumes = df[['ElementId', 'Category', 'Volume_m3']]
    
    # Query syntax (SQL-like)
    result = df.query("Category == 'Wall' and Volume_m3 > 30")
    
    # Loc and iloc
    specific_row = df.loc[0]                    # By label
    range_rows = df.iloc[0:10]                  # By position
    specific_cell = df.loc[0, 'Volume_m3']      # Row and column
    subset = df.loc[0:5, ['Category', 'Volume_m3']]  # Range with columns
    ```
    
    ## Grouping and Aggregation
    
    ### GroupBy Operations
    
    ```python
    # Basic groupby
    by_category = df.groupby('Category')['Volume_m3'].sum()
    
    # Multiple aggregations
    summary = df.groupby('Category').agg({
        'Volume_m3': ['sum', 'mean', 'count'],
        'Cost': ['sum', 'mean']
    })
    
    # Named aggregations (cleaner output)
    summary = df.groupby('Category').agg(
        total_volume=('Volume_m3', 'sum'),
        avg_volume=('Volume_m3', 'mean'),
        element_count=('ElementId', 'count'),
        total_cost=('Cost', 'sum')
    ).reset_index()
    
    # Multiple grouping columns
    by_level_cat = df.groupby(['Level', 'Category']).agg({
        'Volume_m3': 'sum',
        'Cost': 'sum'
    }).reset_index()
    ```
    
    ### Pivot Tables
    
    ```python
    # Create pivot table
    pivot = pd.pivot_table(
        df,
        values='Volume_m3',
        index='Level',
        columns='Category',
        aggfunc='sum',
        fill_value=0,
        margins=True,           # Add totals
        margins_name='Total'
    )
    
    # Multiple values
    pivot_detailed = pd.pivot_table(
        df,
        values=['Volume_m3', 'Cost'],
        index='Level',
        columns='Category',
        aggfunc={'Volume_m3': 'sum', 'Cost': 'mean'}
    )
    ```
    
    ## Data Transformation
    
    ### Adding Calculated Columns
    
    ```python
    # Simple calculation
    df['Cost_Total'] = df['Volume_m3'] * df['Unit_Price']
    
    # Conditional column
    df['Size_Category'] = df['Volume_m3'].apply(
        lambda x: 'Large' if x > 50 else ('Medium' if x > 20 else 'Small')
    )
    
    # Using np.where for binary conditions
    import numpy as np
    df['Is_Large'] = np.where(df['Volume_m3'] > 50, True, False)
    
    # Using cut for binning
    df['Volume_Bin'] = pd.cut(
        df['Volume_m3'],
        bins=[0, 10, 50, 100, float('inf')],
        labels=['XS', 'S', 'M', 'L']
    )
    ```
    
    ### String Operations
    
    ```python
    # Extract from strings
    df['Level_Number'] = df['Level'].str.extract(r'(\d+)').astype(int)
    
    # Split and expand
    df[['Building', 'Floor']] = df['Location'].str.split('-', expand=True)
    
    # Clean strings
    df['Category'] = df['Category'].str.strip().str.lower().str.title()
    
    # Replace values
    df['Material'] = df['Material'].str.replace('Reinforced Concrete', 'RC')
    ```
    
    ### Date Operations
    
    ```python
    # Parse dates
    df['Start_Date'] = pd.to_datetime(df['Start_Date'])
    
    # Extract components
    df['Year'] = df['Start_Date'].dt.year
    df['Month'] = df['Start_Date'].dt.month
    df['Week'] = df['Start_Date'].dt.isocalendar().week
    df['DayOfWeek'] = df['Start_Date'].dt.day_name()
    
    # Calculate duration
    df['Duration_Days'] = (df['End_Date'] - df['Start_Date']).dt.days
    
    # Filter by date range
    recent = df[df['Start_Date'] >= '2024-01-01']
    ```
    
    ## Merging and Joining
    
    ### Merge DataFrames
    
    ```python
    # Elements data
    elements = pd.DataFrame({
        'ElementId': ['E001', 'E002', 'E003'],
        'Category': ['Wall', 'Floor', 'Column'],
        'Volume_m3': [45.5, 120.0, 8.5]
    })
    
    # Unit prices
    prices = pd.DataFrame({
        'Category': ['Wall', 'Floor', 'Column', 'Beam'],
        'Unit_Price': [150, 80, 450, 200]
    })
    
    # Inner join (only matching)
    merged = elements.merge(prices, on='Category', how='inner')
    
    # Left join (keep all elements)
    merged = elements.merge(prices, on='Category', how='left')
    
    # Join on different column names
    result = df1.merge(df2, left_on='elem_id', right_on='ElementId')
    ```
    
    ### Concatenating DataFrames
    
    ```python
    # Vertical concatenation (stacking)
    all_floors = pd.concat([floor1_df, floor2_df, floor3_df], ignore_index=True)
    
    # Horizontal concatenation
    combined = pd.concat([quantities, costs, schedule], axis=1)
    
    # Append new rows
    new_elements = pd.DataFrame({'ElementId': ['E004'], 'Category': ['Beam']})
    df = pd.concat([df, new_elements], ignore_index=True)
    ```
    
    ## Construction-Specific Analyses
    
    ### Quantity Take-Off (QTO)
    
    ```python
    def generate_qto_report(df):
        """Generate Quantity Take-Off summary by category"""
        qto = df.groupby(['Category', 'Material']).agg(
            count=('ElementId', 'count'),
            total_volume=('Volume_m3', 'sum'),
            total_area=('Area_m2', 'sum'),
            avg_volume=('Volume_m3', 'mean')
        ).round(2)
    
        # Add percentage column
        qto['volume_pct'] = (qto['total_volume'] /
                              qto['total_volume'].sum() * 100).round(1)
    
        return qto.sort_values('total_volume', ascending=False)
    
    # Usage
    qto_report = generate_qto_report(df)
    qto_report.to_excel("qto_report.xlsx")
    ```
    
    ### Cost Estimation
    
    ```python
    def calculate_project_cost(elements_df, prices_df, markup=0.15):
        """Calculate total project cost with markup"""
        # Merge with prices
        df = elements_df.merge(prices_df, on='Category', how='left')
    
        # Calculate base cost
        df['Base_Cost'] = df['Volume_m3'] * df['Unit_Price']
    
        # Apply markup
        df['Total_Cost'] = df['Base_Cost'] * (1 + markup)
    
        # Summary by category
        summary = df.groupby('Category').agg(
            volume=('Volume_m3', 'sum'),
            base_cost=('Base_Cost', 'sum'),
            total_cost=('Total_Cost', 'sum')
        ).round(2)
    
        return df, summary, summary['total_cost'].sum()
    
    # Usage
    detailed, summary, total = calculate_project_cost(elements, prices)
    print(f"Project Total: ${total:,.2f}")
    ```
    
    ### Material Summary
    
    ```python
    def material_summary(df):
        """Summarize materials across project"""
        summary = df.groupby('Material').agg({
            'Volume_m3': 'sum',
            'Weight_kg': 'sum',
            'ElementId': 'nunique'
        }).rename(columns={'ElementId': 'Element_Count'})
    
        summary['Volume_Pct'] = (summary['Volume_m3'] /
                                  summary['Volume_m3'].sum() * 100).round(1)
    
        return summary.sort_values('Volume_m3', ascending=False)
    ```
    
    ### Level-by-Level Analysis
    
    ```python
    def analyze_by_level(df):
        """Analyze construction quantities by building level"""
        level_summary = df.pivot_table(
            values=['Volume_m3', 'Cost'],
            index='Level',
            columns='Category',
            aggfunc='su
    
    ... (truncated)