CSV Splitter

A simple command-line tool to split large CSV files into multiple smaller parts with customizable options.

Features

  • Split CSV files into any number of parts
  • Automatically generates output filenames with _part1, _part2, etc. suffixes
  • Preserves original file extension and directory
  • Optional header inclusion in all output files
  • Support for custom column separators
  • Even distribution of rows across parts
  • Handles files with any number of rows efficiently
  • Smart newline handling: Replaces line breaks inside quoted CSV fields with spaces and leaves line breaks outside quotes untouched

Installation

go build -o splitcsv main.go

Usage

./splitcsv -in <input-file> [options]

Required Flags

  • -in - Input CSV file path (required)

Optional Flags

  • -parts - Number of parts to split into (default: 2)
  • -header - Include header row in all output files (default: true)
  • -comma - Column separator character (default: ",")
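The flags above map naturally onto Go's standard flag package. The following is a minimal sketch, assuming the tool uses that package; the options struct and the parseOptions function are hypothetical names chosen for illustration, and only the flag names and defaults come from the lists above.

```go
package main

import (
	"flag"
	"fmt"
)

// options mirrors the documented flags. The struct itself is hypothetical.
type options struct {
	in     string
	parts  int
	header bool
	comma  string
}

// parseOptions parses args (without the program name) using the flag names
// and defaults listed in this README. It rejects a missing -in flag.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("splitcsv", flag.ContinueOnError)
	fs.StringVar(&o.in, "in", "", "input CSV file path (required)")
	fs.IntVar(&o.parts, "parts", 2, "number of parts to split into")
	fs.BoolVar(&o.header, "header", true, "include header row in all output files")
	fs.StringVar(&o.comma, "comma", ",", "column separator character")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	if o.in == "" {
		return o, fmt.Errorf("-in is required")
	}
	return o, nil
}

func main() {
	// Fixed arguments stand in for os.Args[1:] so the sketch is deterministic.
	o, err := parseOptions([]string{"-in", "data.csv", "-parts", "3", "-header=false"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", o)
}
```

With the flag package, a boolean flag is switched off using the -header=false form; a space-separated "-header false" is not accepted for boolean flags.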

Examples

Basic Usage

Split a CSV file into 2 parts (default):

./splitcsv -in data.csv

Output: data_part1.csv, data_part2.csv

Split into Multiple Parts

Split into 5 parts:

./splitcsv -in sales_data.csv -parts 5

Output: sales_data_part1.csv, sales_data_part2.csv, ..., sales_data_part5.csv

Without Headers

Split without including headers in output files:

./splitcsv -in data.csv -parts 3 -header=false

Custom Separator

Split a semicolon-separated file:

./splitcsv -in european_data.csv -parts 4 -comma ";"

Complex Example

Split a large file with tab separator into 10 parts without headers:

./splitcsv -in huge_dataset.tsv -parts 10 -comma "\t" -header=false

How It Works

  1. Row Counting: First pass counts total data rows (excluding header)
  2. Distribution: Computes how many rows each part receives
  3. File Generation: Creates output files with _partN suffix
  4. Data Writing: Second pass writes rows to the parts in order; when the row count does not divide evenly, the first parts each receive one extra row

Row Distribution Logic

For a file with 100 data rows split into 3 parts:

  • Part 1: 34 rows
  • Part 2: 33 rows
  • Part 3: 33 rows

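When the row count is not evenly divisible by the number of parts, the remainder is handed out one row at a time to the first parts, so part sizes differ by at most one row.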
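The arithmetic behind this distribution can be sketched in a few lines of Go. This is an illustration rather than the tool's actual code; the partSizes function is a hypothetical name, and it assumes parts is at least 1.

```go
package main

import "fmt"

// partSizes returns how many rows each part receives when total rows are
// split as evenly as possible. The first total%parts parts get one extra row.
// It assumes parts >= 1.
func partSizes(total, parts int) []int {
	sizes := make([]int, parts)
	base, extra := total/parts, total%parts
	for i := range sizes {
		sizes[i] = base
		if i < extra {
			sizes[i]++
		}
	}
	return sizes
}

func main() {
	fmt.Println(partSizes(100, 3)) // prints [34 33 33]
}
```

Integer division gives the base size (33) and the remainder (1) tells how many leading parts receive one additional row.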

Smart Newline Handling

The tool automatically handles CSV fields that contain line breaks:

  • Quoted fields: Line breaks (\n and \r\n) inside quoted fields are automatically replaced with spaces
  • Unquoted text: Line breaks outside quoted fields are preserved as-is (in CSV they mark the end of a row)
  • Escaped quotes: Properly handles escaped quotes ("") within quoted fields

Example:

name,description,price
"Product A","This is a long
description with line breaks",100
Product B,Simple description,200

After processing:

name,description,price
"Product A","This is a long description with line breaks",100
Product B,Simple description,200

This ensures that CSV files with multi-line content in quoted fields remain properly formatted and compatible with standard CSV parsers.
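One way to get the behavior described above is a small state machine that tracks whether the current position is inside a quoted field. The sketch below is an assumption about how such a step could look, not the tool's actual implementation; flattenQuotedNewlines is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// flattenQuotedNewlines replaces line breaks (\n and \r\n) that occur inside
// double-quoted regions with a single space. An escaped quote ("") toggles
// the state twice, so it leaves the in-quotes state unchanged.
func flattenQuotedNewlines(s string) string {
	var b strings.Builder
	inQuotes := false
	for i := 0; i < len(s); i++ {
		c := s[i]
		switch {
		case c == '"':
			inQuotes = !inQuotes
			b.WriteByte(c)
		case inQuotes && c == '\r' && i+1 < len(s) && s[i+1] == '\n':
			b.WriteByte(' ')
			i++ // skip the '\n' of the CRLF pair
		case inQuotes && c == '\n':
			b.WriteByte(' ')
		default:
			b.WriteByte(c)
		}
	}
	return b.String()
}

func main() {
	input := "name,description,price\n" +
		"\"Product A\",\"This is a long\ndescription with line breaks\",100\n" +
		"Product B,Simple description,200\n"
	fmt.Print(flattenQuotedNewlines(input))
}
```

Working on bytes is safe for UTF-8 input here because the quote, carriage return, and line feed characters are single ASCII bytes that never appear inside a multi-byte sequence.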

File Naming Convention

Output files follow this pattern:

{original_name}_part{N}{original_extension}

Examples:

  • data.csv → data_part1.csv, data_part2.csv
  • sales_2024.csv → sales_2024_part1.csv, sales_2024_part2.csv
  • export.tsv → export_part1.tsv, export_part2.tsv
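The naming pattern can be reproduced with the standard path/filepath package. This is a minimal sketch under the assumption that the extension is whatever filepath.Ext reports; partName is a hypothetical helper, not necessarily what the tool uses.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// partName builds {original_name}_part{N}{original_extension} for the given
// input path. The directory portion of the path is kept unchanged.
func partName(input string, n int) string {
	ext := filepath.Ext(input)             // e.g. ".csv"
	base := strings.TrimSuffix(input, ext) // path without the extension
	return fmt.Sprintf("%s_part%d%s", base, n, ext)
}

func main() {
	fmt.Println(partName("data.csv", 1))           // prints data_part1.csv
	fmt.Println(partName("exports/export.tsv", 2)) // prints exports/export_part2.tsv
}
```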

Error Handling

The tool will exit with an error message if:

  • Input file doesn't exist or can't be read
  • Input file has no data rows
  • Number of parts is less than 1
  • Separator is not a single character
  • Output files can't be created
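Two of these checks, the part count and the single-character separator, can be sketched as a small validation function. The function name validateOptions and the error messages are assumptions for illustration; the real messages may differ.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// validateOptions checks the part count and the separator. On success it
// returns the separator as a rune, which is the type Go's encoding/csv
// package expects for its Comma field.
func validateOptions(parts int, sep string) (rune, error) {
	if parts < 1 {
		return 0, errors.New("number of parts must be at least 1")
	}
	// Count characters, not bytes, so a multi-byte separator is still valid.
	if utf8.RuneCountInString(sep) != 1 {
		return 0, fmt.Errorf("separator must be a single character, got %q", sep)
	}
	r, _ := utf8.DecodeRuneInString(sep)
	return r, nil
}

func main() {
	if _, err := validateOptions(0, ","); err != nil {
		fmt.Println("error:", err)
	}
	if _, err := validateOptions(2, ";;"); err != nil {
		fmt.Println("error:", err)
	}
	if r, err := validateOptions(2, ";"); err == nil {
		fmt.Printf("separator accepted: %q\n", r)
	}
}
```

Under this sketch the two-character string made of a backslash and the letter t would be rejected, so a tool that accepts -comma "\t" has to translate that escape into a tab before such a check; how the tool does this is not shown here.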

Performance

  • Memory efficient: processes files row by row
  • Two-pass reading: first for counting, second for splitting
  • Supports files of any size (limited only by available disk space)

Requirements

  • Go 1.16 or later
  • Read permission for input file
  • Write permission for output directory

License

This project is open source and available under the MIT License.