Comparing changes

base repository: bishoyh/mbox2eml
base: main
head repository: kylebarlow/mbox2eml
compare: main
  • 4 commits
  • 2 files changed
  • 1 contributor

Commits on Sep 24, 2025

  1. feat: Convert mbox2eml to process chunked files with Maildir output and compression

    - Modified input handling to process directory of chunked mbox files (chunk_0.mbox, chunk_1.mbox, etc.) instead of single file
    - Added regex-based chunk file discovery with proper numerical sorting (handles non-zero-padded filenames)
    - Implemented gzip compression for all output files (.eml.gz format)
    - Added Maildir-compatible directory structure creation (cur/new/tmp subdirectories)
    - Updated file output to save compressed emails in cur/ subdirectory
    - Changed email numbering to zero-indexed with 9-digit zero padding (email_000000000.eml.gz)
    - Maintained continuous numbering across all chunks during sequential processing
    - Updated multithreading to work with global counter for consistent numbering
    - Added proper error handling for compression and Maildir structure creation
    - Updated build system to link with zlib (-lz flag)
    - Added required headers: <cstring>, <iomanip>, <sstream> for new functionality
    - Updated documentation and usage messages to reflect new Maildir output format
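
The chunk discovery and output naming described above can be sketched as follows. This is a minimal illustration, not the code from the commit: `sortChunkNames` and `emailFileName` are hypothetical names, and the regex assumes chunk files are named exactly `chunk_<N>.mbox`. Sorting on the parsed number rather than the string is what keeps `chunk_10.mbox` after `chunk_2.mbox` when names are not zero-padded.

```cpp
#include <algorithm>
#include <iomanip>
#include <regex>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: keep only names of the form chunk_<N>.mbox and
// return them ordered by the numeric value of N, not lexicographically.
std::vector<std::string> sortChunkNames(const std::vector<std::string>& names) {
    static const std::regex pattern(R"(chunk_(\d+)\.mbox)");
    std::vector<std::pair<unsigned long long, std::string>> matched;
    for (const auto& name : names) {
        std::smatch m;
        if (std::regex_match(name, m, pattern)) {
            // Assumes N fits in unsigned long long; std::stoull throws otherwise.
            matched.emplace_back(std::stoull(m[1].str()), name);
        }
    }
    std::sort(matched.begin(), matched.end());  // compares the number first
    std::vector<std::string> sorted;
    for (const auto& entry : matched) sorted.push_back(entry.second);
    return sorted;
}

// Hypothetical helper: zero-indexed, 9-digit zero-padded output name.
std::string emailFileName(unsigned long long index) {
    std::ostringstream out;
    out << "email_" << std::setw(9) << std::setfill('0') << index << ".eml.gz";
    return out.str();
}
```

Names that do not match the pattern are dropped, so stray files in the input directory would simply be ignored in this sketch.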
    
    Breaking changes:
    - Command line now expects input directory instead of single mbox file
    - Output format changed from .eml to compressed .eml.gz in Maildir structure
    - File numbering now starts from 0 instead of 1
    kylebarlow committed Sep 24, 2025
    Commit ab2af3c
  2. feat: Extract email timestamps and optimize multithreading performance

    - Added timestamp extraction from email Date headers using RFC 2822 parsing
    - Enhanced Email struct to store both content and parsed timestamp
    - Implemented parseEmailDate() with support for multiple date formats and timezone handling
    - Updated generateMaildirFilename() to use actual email timestamps instead of current time
    - Added extractEmailTimestamp() to parse Date headers from email content with fallback handling
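
As a rough illustration of turning a Date header value into a Unix timestamp, the sketch below handles only the most common RFC 2822 shape, `[Day, ]DD Mon YYYY HH:MM:SS +ZZZZ`. It is not the commit's `parseEmailDate()`: the function names here are hypothetical, seconds and a numeric zone are assumed to be present, and named zones such as `GMT` are rejected so that a caller could fall back to another strategy.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Days since 1970-01-01 for a Gregorian calendar date (month is 1..12).
long long daysFromCivil(long long y, int m, int d) {
    y -= m <= 2;
    const long long era = (y >= 0 ? y : y - 399) / 400;
    const long long yoe = y - era * 400;
    const long long doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;
    const long long doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097 + doe - 719468;
}

// Hypothetical parser: returns true and sets epochOut (seconds, UTC) on success.
bool parseRfc2822Date(const std::string& value, long long& epochOut) {
    // Skip the optional "Wed," day-of-week prefix.
    const std::size_t comma = value.find(',');
    const std::string rest =
        (comma == std::string::npos) ? value : value.substr(comma + 1);

    int day = 0, year = 0, hour = 0, minute = 0, second = 0;
    char mon[4] = {0};
    char zone[6] = {0};
    if (std::sscanf(rest.c_str(), "%d %3s %d %d:%d:%d %5s",
                    &day, mon, &year, &hour, &minute, &second, zone) != 7) {
        return false;
    }

    static const char* const kMonths = "JanFebMarAprMayJunJulAugSepOctNovDec";
    const char* hit = std::strstr(kMonths, mon);
    if (std::strlen(mon) != 3 || hit == nullptr || (hit - kMonths) % 3 != 0) {
        return false;
    }
    const int month = static_cast<int>((hit - kMonths) / 3) + 1;

    // Only numeric zones of the form +HHMM or -HHMM are accepted here.
    if (std::strlen(zone) != 5 || (zone[0] != '+' && zone[0] != '-')) return false;
    for (int i = 1; i < 5; ++i) {
        if (!std::isdigit(static_cast<unsigned char>(zone[i]))) return false;
    }
    long long offset = ((zone[1] - '0') * 10 + (zone[2] - '0')) * 3600LL +
                       ((zone[3] - '0') * 10 + (zone[4] - '0')) * 60LL;
    if (zone[0] == '-') offset = -offset;

    if (day < 1 || day > 31 || hour < 0 || hour > 23 ||
        minute < 0 || minute > 59 || second < 0 || second > 60) {
        return false;
    }
    // Local wall-clock time minus the zone offset gives UTC.
    epochOut = daysFromCivil(year, month, day) * 86400LL +
               hour * 3600LL + minute * 60LL + second - offset;
    return true;
}
```

For example, `Wed, 24 Sep 2025 14:30:00 -0700` corresponds to 21:30:00 UTC on the same day.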
    
    Performance optimizations:
    - Fixed critical threading bottleneck by moving heavy operations outside mutex lock
    - Reduced mutex scope to only protect counter increment (microsecond lock time)
    - Changed gzip compression from Z_DEFAULT_COMPRESSION to Z_BEST_SPEED for better throughput
    - Eliminated serialized processing - threads now truly run in parallel
    - Removed console output from critical section to reduce lock contention
    - Renamed output_mutex to counter_mutex to reflect actual purpose
    
    Breaking changes:
    - Maildir filenames now use actual email timestamps instead of processing time
    - Slightly larger compressed files due to faster compression level
    
    Performance improvements:
    - Multi-core CPU utilization instead of single-core bottleneck
    - Parallel compression and file I/O operations
    - Significantly reduced processing time on multi-core systems
    kylebarlow committed Sep 24, 2025
    Commit 8c9b875
  3. Add .gz extension

    kylebarlow committed Sep 24, 2025
    Commit dfcd592
  4. feat: Fix nested MIME boundary parsing and enhance attachment extraction

    Major fixes:
    - Fixed critical bug where nested multipart boundaries were not detected
    - Completely rewrote boundary extraction to handle Gmail's complex nested structure
    - Enhanced attachment detection to catch inline images and base64 content more aggressively
    
    Boundary detection improvements:
    - Added two-pass boundary extraction (headers + content scanning)
    - Now finds boundaries buried in nested multipart/related and multipart/alternative structures
    - Better boundary parsing with semicolon/whitespace handling and deduplication
    - Fixed issue where only first boundary was processed, missing attachments in subsequent boundaries
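
One way to collect every boundary, including those declared by nested parts, is sketched below. Note the simplification: instead of the two separate passes the commit describes, this hypothetical `extractBoundaries` makes a single case-insensitive scan of the whole message for `boundary=` parameters, which covers both top-level headers and nested part headers. It handles quoted and unquoted values, stops unquoted values at a semicolon or whitespace, and drops duplicates.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: return each distinct boundary value in order of
// first appearance anywhere in the message (headers or nested parts).
std::vector<std::string> extractBoundaries(const std::string& message) {
    std::string lower = message;
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string key = "boundary=";
    std::vector<std::string> found;
    std::size_t pos = 0;
    while ((pos = lower.find(key, pos)) != std::string::npos) {
        const std::size_t start = pos + key.size();
        std::string value;
        if (start < message.size() && message[start] == '"') {
            // Quoted form: boundary="value"
            const std::size_t end = message.find('"', start + 1);
            if (end == std::string::npos) break;  // unterminated quote
            value = message.substr(start + 1, end - start - 1);
            pos = end + 1;
        } else {
            // Unquoted form: ends at ';' or any whitespace.
            std::size_t end = start;
            while (end < message.size() && message[end] != ';' &&
                   !std::isspace(static_cast<unsigned char>(message[end]))) {
                ++end;
            }
            value = message.substr(start, end - start);
            pos = end;
        }
        if (!value.empty() &&
            std::find(found.begin(), found.end(), value) == found.end()) {
            found.push_back(value);
        }
    }
    return found;
}
```

A scan like this can also pick up the text `boundary=` inside a message body that merely quotes MIME headers, so a real implementation would likely want to restrict matches to header lines.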
    
    Attachment detection enhancements:
    - Moved image/ content-type detection higher in priority (before base64 size check)
    - Lowered base64 detection threshold from 10KB to 100 bytes for better coverage
    - Added aggressive filename-based detection (any part with filename= becomes attachment)
    - Enhanced Content-ID detection for inline attachments/images
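
The detection rules listed above could be ordered along these lines. This is only a sketch: `PartInfo` and `looksLikeAttachment` are hypothetical, the content type is assumed to be lowercased already, and the relative order of the filename and Content-ID checks, as well as whether the 100-byte threshold is inclusive, are assumptions. The one ordering taken from the commit message is that the `image/` check comes before the base64 size check.

```cpp
#include <cstddef>
#include <string>

// Hypothetical summary of one MIME part's headers and body size.
struct PartInfo {
    std::string contentType;     // lowercased, e.g. "image/png"
    bool hasFilename = false;    // a filename= parameter was present
    bool hasContentId = false;   // a Content-ID header was present
    bool isBase64 = false;       // Content-Transfer-Encoding: base64
    std::size_t bodyBytes = 0;   // size of the encoded body
};

// Threshold from the commit message; treating it as inclusive is an assumption.
const std::size_t kBase64Threshold = 100;

bool looksLikeAttachment(const PartInfo& part) {
    // image/ types are checked first, before any size-based rule.
    if (part.contentType.rfind("image/", 0) == 0) return true;
    // Any part carrying a filename= parameter is treated as an attachment.
    if (part.hasFilename) return true;
    // Content-ID usually marks inline attachments such as embedded images.
    if (part.hasContentId) return true;
    // Otherwise fall back to the base64 size check.
    return part.isBase64 && part.bodyBytes >= kBase64Threshold;
}
```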
    
    Smart compression handling:
    - Avoid double-compressing already compressed formats (JPEG, PNG, ZIP, etc.)
    - Save compressed formats directly without .gz extension for easy viewing
    - Added comprehensive format detection by both filename and content-type
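
A check for already-compressed formats, by filename and by content type, might be sketched as below. The helper names are hypothetical and the extension and MIME-type lists are illustrative only: the commit names JPEG, PNG, and ZIP explicitly, and the remaining entries are assumptions.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

bool endsWith(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical check: true when gzip would gain little on this part.
bool isAlreadyCompressed(const std::string& filename, const std::string& contentType) {
    static const char* const kExtensions[] = {".jpg", ".jpeg", ".png", ".zip", ".gz"};
    static const char* const kTypes[] = {"image/jpeg", "image/png",
                                         "application/zip", "application/gzip"};
    const std::string name = toLower(filename);
    const std::string type = toLower(contentType);
    for (const char* ext : kExtensions) {
        if (endsWith(name, ext)) return true;
    }
    for (const char* prefix : kTypes) {
        // Prefix match so parameters like "; name=x" are tolerated.
        if (type.rfind(prefix, 0) == 0) return true;
    }
    return false;
}

// Hypothetical naming rule: append .gz only when the data will be gzipped.
std::string storedName(const std::string& baseName, const std::string& contentType) {
    return isAlreadyCompressed(baseName, contentType) ? baseName : baseName + ".gz";
}
```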
    
    User experience improvements:
    - Enhanced attachment markers to show actual saved filename in filesystem
    - Format: "[Attachment extracted: original.jpg (12345 bytes) -> saved as: email_000012345_attachment_0_original.jpg]"
    - Shows compression status (.gz suffix for compressed, none for direct formats)
    - Makes it easy to locate specific attachments in attachments/ directory
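
The marker format quoted above can be reproduced with a small formatter. This is a sketch with hypothetical function names; the filename layout (`email_<9-digit index>_attachment_<n>_<original name>`, plus `.gz` when gzipped) is inferred from the single example in the commit message rather than taken from the code.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: build the on-disk attachment filename.
std::string savedAttachmentName(std::size_t emailIndex, std::size_t attachmentIndex,
                                const std::string& originalName, bool gzipped) {
    std::ostringstream out;
    // std::setw applies only to the next field, so only emailIndex is padded.
    out << "email_" << std::setw(9) << std::setfill('0') << emailIndex
        << "_attachment_" << attachmentIndex << '_' << originalName;
    if (gzipped) out << ".gz";
    return out.str();
}

// Hypothetical helper: text left in the email body in place of the attachment.
std::string attachmentMarker(const std::string& originalName, std::size_t sizeBytes,
                             const std::string& savedName) {
    return "[Attachment extracted: " + originalName + " (" +
           std::to_string(sizeBytes) + " bytes) -> saved as: " + savedName + "]";
}
```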
    
    This fixes the major issue where large base64 blocks (like JPEGs) were not being
    extracted from Gmail Takeout emails due to Gmail's nested multipart structure.
    kylebarlow committed Sep 24, 2025
    Commit 69ce770