Network Working Group                                    Y. Shafranovich
Internet-Draft                                 Amazon Web Services (AWS)
Intended status: Informational                             2 August 2024
Expires: 3 February 2025


   Common Format and MIME Type for Comma-Separated Values (CSV) Files
                   draft-shafranovich-rfc4180-bis-07

Abstract

   This RFC documents the common format used for Comma-Separated Values
   (CSV) files and updates the associated MIME type "text/csv".

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 3 February 2025.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.

Table of Contents

   1.  Introduction
     1.1.  Terminology
     1.2.  Motivation For and Status of This Document
   2.  Definition of the CSV Format
     2.1.  High level description
     2.2.  Default charset, binary content and line break values
     2.3.  ABNF Grammar
   3.  Common implementation concerns
     3.1.  Null values
     3.2.  Empty files
     3.3.  Empty lines
     3.4.  Fields spanning multiple lines
     3.5.  Unique header names
     3.6.  Whitespace outside quoted fields
     3.7.  Other field separators
     3.8.  Escaping double quotes
     3.9.  BOM header
     3.10. Bidirectional text
     3.11. Comments
     3.12. IANA Considerations
   4.  Update to MIME Type Registration of text/csv
   5.  Security Considerations
   6.  Acknowledgments
   7.  References
     7.1.  Normative References
     7.2.  Informative References
   Appendix A.  Major changes since RFC4180
   Appendix B.  Changes since the -00 draft
   Appendix C.  Changes since the -01 draft
   Appendix D.  Changes since the -02 draft
   Appendix E.  Changes since the -03 draft
   Appendix F.  Changes since the -04 draft
   Appendix G.  Changes since the -05 draft
   Appendix H.  Note to Readers
   Author's Address

1.  Introduction

   The comma separated values format (CSV) has been used as a common way
   to exchange data between disparate systems and applications for many
   years.  Surprisingly, while this format is very popular, it has never
   been formally documented and didn't have a media type registered.
   This was addressed in 2005 via publication of [RFC4180] and the
   concurrent registration of the "text/csv" media type.

   Since the publication of [RFC4180], the CSV format has evolved and
   this specification seeks to reflect these changes as well as update
   the "text/csv" media type registration.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

1.2.  Motivation For and Status of This Document

   The original motivation of [RFC4180] was to provide a reference in
   order to register the media type "text/csv".  It tried to document
   existing practices at the time based on the approaches used by most
   implementations.  This document continues to do the same, and updates
   the original document to reflect current practices for generation and
   parsing of CSV files.

   Both [RFC4180] and this document are published as informational RFCs
   for the benefit of the Internet community and are not intended to be
   used as formal standards.  Implementers should consult [RFC1796] and
   [RFC2026] for differences between IETF standards and informational
   RFCs.

2.  Definition of the CSV Format

   While there had been various specifications and implementations for
   the CSV format (for ex.  [CREATIVYST], [EDOCEO], [CSVW] and [ART])),
   prior to publication of [RFC4180] there was no attempt to provide a
   common specification.  This section documents the format that seems
   to be followed by most implementations (incorporating changes since
   the publication of [RFC4180] and listing common implementation
   concerns).

2.1.  High level description

   The CSV format uses line breaks to separate records, and commas to
   separate fields within a given record.  The format is described as
   follows:

   1.  Each record is located on a separate line, ended by a line break
       (CR, LF or CRLF) indicating the end of this record.  For example:

       aaa,bbb,cccCRLF
       zzz,yyy,xxxCRLF

   2.  The last record in the file MUST have an ending line break
       indicating the end of a record.  For example:

       aaa,bbb,cccCRLF
       zzz,yyy,xxxCRLF

   3.  The first record in the file MAY be an optional header and MUST
       follow the same format as normal records.  This header contains
       names corresponding to the fields in the file and SHOULD contain
       the same number of fields as the records in the rest of the file.
       For example:

       field_name_1,field_name_2,field_name_3CRLF
       aaa,bbb,cccCRLF
       zzz,yyy,xxxCRLF

   4.  Within each record, there MAY be zero or more fields, separated
       by commas.  Each record SHOULD contain the same number of fields
       throughout the file.  Spaces are considered part of a field and
       SHOULD NOT be ignored.  The last field in the record MUST NOT be
       followed by a comma (since this will indicate an empty field
       following the comma).  For example:

       aaa,bbb,cccCRLF

   5.  Each field MAY be enclosed in double quotes.  If fields are not
       enclosed with double quotes, then double quotes MUST NOT appear
       inside the fields.  For example:

       "aaa","bbb","ccc"CRLF
       zzz,yyy,xxxCRLF

   6.  Fields containing line breaks (CR, LF or CRLF), double quotes, or
       commas MUST be enclosed in double quotes.  For example:

       "aaa","b CRLF
       bb","c,c,c"CRLF
       zzz,yyy,xxxCRLF

   7.  A double quote appearing inside a field MUST be escaped by
       preceding it with another double quote.  For example:

       "aaa","b""bb","ccc"CRLF

2.2.  Default charset, binary content and line break values

   Since the initial publication of [RFC4180], the default charset for
   "text/*" media types has been changed to UTF-8 (as per [RFC6657]) and
   [RFC7111].  This document reflects this change and the default
   charset for CSV files is now UTF-8.

   As per section 4.2.1 of [RFC6838], the "text/*" media types are
   defined as those reasonable to present to the user.  While [RFC4180]
   restricted CSV contents to printable ASCII only, [RFC7111] updated
   the MIME registration to allow binary content in CSV entities.
   Therefore, this document has been updated to allow binary content
   within CSV files.

   Although section 4.1.1. of [RFC2046] defines CRLF to denote line
   breaks, implementers MAY recognize a single CR or LF as a line break
   (similar to section 3.1.1.3 of [RFC7231]).  However, some
   implementations MAY use other values.

2.3.  ABNF Grammar

   The ABNF grammar (as per [RFC5234]) appears as follows:

  file = [header] *(record)

  header = [field] *(COMMA field) linebreak

  record = [field] *(COMMA field) linebreak

  field = (escaped / non-escaped)

  escaped = DQUOTE *(textdata / COMMA / CR / LF / 2DQUOTE) DQUOTE

  non-escaped = *(textdata)

  textdata = %x00-09 / %x0B-0C / %x0E-21 / %x23-2B / %x2D-7F / UTF8-data
           ; all characters except CR, LF, DQUOTE and COMMA

  linebreak = CR / LF / CRLF

  COMMA = %x2C

  CR = %x0D ; as per section B.1 of [RFC5234]

  CRLF = CR LF ; as per section B.1 of [RFC5234]

  DQUOTE = %x22 ; as per section B.1 of [RFC5234]

  LF = %x0A ; as per section B.1 of [RFC5234]

  UTF8-data = UTF8-2 / UTF8-3 / UTF8-4 ; as per section 4 of [RFC3629]

   Note that the authoritative definition of UTF-8 is in section 2.5 of
   [UNICODE].

3.  Common implementation concerns

   This section describes some common concerns that may arise during
   generation or parsing CSV files.  These are not part of the formal
   definition of CSV and are included for awareness.  Implementers may
   also use other means to handle these use cases including approaches
   like [CSVW].

3.1.  Null values

   Some implementations (such as databases) treat empty fields and null
   values differently.  For these implementations, there is a need to
   define a special value representing a null.  However, this
   specification does not attempt to define a default value for nulls.

   Example of a CSV file with nulls (if "NULL" is used to mark nulls):

   field_name_1,field_name_2,field_name_3CRLF
   aaa,bbb,cccCRLF
   zzz,NULL,xxxCRLF

3.2.  Empty files

   Implementers should be aware that in accordance to this specification
   a file does not need to contain any comments or records.  Therefore,
   an empty file with zero bytes is considered valid.

3.3.  Empty lines

   This specification recommends but doesn't require having the same
   number of fields in every line.  This allows CSV files to have empty
   lines without any fields at all.  Implementors may choose to skip
   empty lines instead of parsing them but this specification does not
   dictate such behavior.

   Example of a CSV file with empty lines:

   field_name_1,field_name_2,field_name_3CRLF
   aaa,bbb,cccCRLF
   CRLF
   zzz,yyy,xxxCRLF

   However, if the records are only made up of one field, it is not
   possible to differentiate between an empty line, and an empty
   unquoted field.  This differentiation might play an important role in
   some implementations such as database exports/imports.

   Example of a CSV file with empty lines and only one field per record:

   aaaCRLF
   CRLF
   bbbCRLF

   Note that some implementations may interpret the presence of a line
   break after the last record in the file as a start of a new but empty
   record.

3.4.  Fields spanning multiple lines

   When quoted fields are used, it is possible for a field to span
   multiple lines, even when line breaks appear within such field.
   Implementers should be aware that line breaks appearing in such
   fields are considered part of the data content of those records, may
   differ from the line breaks used in the rest of the CSV file and
   should not be altered.

   Example of a CSV file with a quoted field spanning multiple lines:

   "aaa","b CRLF
   bb","c,c,c"CRLF
   zzz,yyy,xxxCRLF

3.5.  Unique header names

   Implementers should be aware that some applications may treat header
   values as unique (either case-sensitive or case-insensitive).

   Example of a CSV file with non-unique header names:

   field_name_1,field_name_2,field_name_1CRLF
   aaa,bbb,cccCRLF
   zzz,yyy,xxxCRLF

3.6.  Whitespace outside quoted fields

   When quoted fields are used, this document does not allow whitespace
   between double quotes and commas.  Implementers should be aware that
   some applications may be more lenient and allow whitespace outside
   the double quotes.

3.7.  Other field separators

   This document defines a comma as a field separator but implementers
   should be aware that some applications may use different values,
   especially with non-English languages.  Those are outside the scope
   of this document and implementers should consult other efforts such
   as [CSVW].

3.8.  Escaping double quotes

   This document prescribes that a double quote appearing inside a field
   must be escaped by preceding it with another double quote.
   Implementers should be aware that some applications may choose to use
   a different escaping mechanism.

3.9.  BOM header

   Applications that create text files with unicode character encoding
   might write a BOM (byte order mark) header in order to support
   multiple unicode encodings (like UTF-16 and UTF-32).  Some
   applications might be able to read and properly interpret such a
   header, others could break.  Implementors should review section 6 of
   [RFC3629] and section 23.8 of [UNICODE].

3.10.  Bidirectional text

   While most of the world's written languages are displayed left-to-
   right, many languages such as ones based on Hebrew or Arabic scripts
   are displayed primarily right-to-left.  Implementers should consult
   the "bidirectional display" part in section 5 of [RFC6365] for
   further guidance.

   One example of how bidirectional text can be handled in CSV files can
   be found in section 6.5.1 of [CSVW].

3.11.  Comments

   Some implementations may use the hash sign ("#") to mark lines that
   are meant to be commented lines.  Such lines may contain any
   character until terminated by a line break (CR, LF or CRLF) and might
   appear in any line of the file (before or after the header).
   Comments should not be confused with a subsequent line of a multi-
   line field.  If a first field of a record starts with a hash, it
   should be surrounded with double quotes to avoid being mistaken for a
   comment as per Section 2.1.

   Example of a CSV file containing comments:

   #commentCRLF<br/>
   aaa,bbb,cccCRLF<br/>
   #comment 2CRLF<br/>
   "aaa","this is CRLF<br/>
   # not a comment","ccc"CRLF<br/>
   "#aaa",bbb,cccCRLF

3.12.  IANA Considerations

   As per [RFC6838], IANA is directed to update the MIME type
   registration for "text/csv" with the content in Section 4 and add a
   reference to this document within the registration.

   The update to the media type registration is copied from the current
   one which consists of the original registration from [RFC4180] as
   updated by [RFC7111] and updated based on this document.

4.  Update to MIME Type Registration of text/csv

   Type name: text

   Subtype name: csv

   Required parameters: none

   Optional parameters: charset

      The "charset" parameter specifies the charset employed by the CSV
      content.  In accordance with RFC 6657 [RFC6657], the charset
      parameter SHOULD be used, and if it is not present, UTF-8 SHOULD
      be assumed as the default (this implies that US- ASCII CSV will
      work, even when not specifying the "charset" parameter).  Any
      charset defined by IANA for the "text" tree may be used in
      conjunction with the "charset" parameter.

      The "header" parameter defined in [RFC4180] is deprecated and
      SHOULD NOT be used.

   Encoding considerations:

      CSV files and CSV MIME entities can consist of binary data as per
      section 4.8 of [RFC6838].  Although section 4.1.1. of [RFC2046]
      defines CRLF to denote line breaks, implementers MAY also
      recognize a single CR or LF as a line break (similar to section
      3.1.1.3 of [RFC7231]).  However, some implementations may use
      other values.

   Security considerations:

      Text/csv consists of nothing but passive text data that should not
      pose any direct risks.  However, it is possible that malicious
      data may be included in order to exploit buffer overruns or other
      bugs in the program processing the text/csv data.

      Implementers and users should also be aware that some software
      applications may interpret certain characters in the beginning of
      CSV fields as referring to code or formulas, thus resulting in
      malicious code execution.  This is known as "CSV injection" and
      users parsing CSV files should filter out such characters.

      The text/csv format provides no confidentiality or integrity
      protection, so if such protections are needed they must be
      supplied externally.

      The fact that software implementing fragment identifiers for CSV
      and software not implementing them differs in behavior, and the
      fact that different software may show documents or fragments to
      users in different ways, can lead to misunderstandings on the part
      of users.  Such misunderstandings might be exploited in a way
      similar to spoofing or phishing.

      Implementers and users of fragment identifiers for CSV text should
      also be aware of the security considerations in RFC 3986 [RFC3986]
      and RFC 3987 [RFC3987].

   Interoperability considerations:

      Due to lack of a single specification, there are considerable
      differences among implementations.  Implementers should "be
      conservative in what you do, be liberal in what you accept from
      others" ([RFC0793]) when processing CSV files.  An attempt at a
      common definition can be found in section 2 of (to be replaced
      with the RFC number).

      There are numerous differences between different CSV
      implementations, many of which are discussed in (to be replaced
      with the RFC number of this document).

   Published specification:

      While numerous private specifications exist for various programs
      and systems, there is no single "master" specification for this
      format.  An attempt at a common definition can be found in (to be
      replaced with the RFC number).

   Applications that use this media type:

      Spreadsheet programs and various data conversion utilities.

   Fragment identifier considerations:

      Fragment identification for text/csv is supported by using
      fragment identifiers as specified by [RFC7111].

   Additional information:

   Magic number(s): none

   File extension(s): CSV

   Macintosh file type code(s): TEXT

   Person & email address to contact for further information:

      Yakov Shafranovich (ietf@shaftek.org) and Erik Wilde
      (dret@berkeley.edu)

   Intended usage: COMMON

   Restrictions on usage: none

   Author:

      Yakov Shafranovich (ietf@shaftek.org) and Erik Wilde
      (dret@berkeley.edu)

   Change controller: IESG

5.  Security Considerations

   All security considerations discussed in Section 4 still apply.

6.  Acknowledgments

   In addition to everyone thanked previously in [RFC4180], the author
   would like to thank acknowledge the contributions of the following
   people to this document: Alperen Belgic, Abed BenBrahim, Damon Koach,
   Barry Leiba, Oliver Siegmar, Marco Diniz Sousa and Greg Skinner.

   A special thank you to L.T.S.

7.  References

7.1.  Normative References

   [RFC0793]  Postel, J., "Transmission Control Protocol", RFC 793,
              DOI 10.17487/RFC0793, September 1981,
              <https://www.rfc-editor.org/info/rfc793>.

   [RFC2046]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
              Extensions (MIME) Part Two: Media Types", RFC 2046,
              DOI 10.17487/RFC2046, November 1996,
              <https://www.rfc-editor.org/info/rfc2046>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC4180]  Shafranovich, Y., "Common Format and MIME Type for Comma-
              Separated Values (CSV) Files", RFC 4180,
              DOI 10.17487/RFC4180, October 2005,
              <https://www.rfc-editor.org/info/rfc4180>.

   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", STD 68, RFC 5234,
              DOI 10.17487/RFC5234, January 2008,
              <https://www.rfc-editor.org/info/rfc5234>.

   [RFC6657]  Melnikov, A. and J. Reschke, "Update to MIME regarding
              "charset" Parameter Handling in Textual Media Types",
              RFC 6657, DOI 10.17487/RFC6657, July 2012,
              <https://www.rfc-editor.org/info/rfc6657>.

   [RFC6838]  Freed, N., Klensin, J., and T. Hansen, "Media Type
              Specifications and Registration Procedures", BCP 13,
              RFC 6838, DOI 10.17487/RFC6838, January 2013,
              <https://www.rfc-editor.org/info/rfc6838>.

   [RFC7111]  Hausenblas, M., Wilde, E., and J. Tennison, "URI Fragment
              Identifiers for the text/csv Media Type", RFC 7111,
              DOI 10.17487/RFC7111, January 2014,
              <https://www.rfc-editor.org/info/rfc7111>.

   [RFC7231]  Fielding, R., Ed. and J. Reschke, Ed., "Hypertext Transfer
              Protocol (HTTP/1.1): Semantics and Content", RFC 7231,
              DOI 10.17487/RFC7231, June 2014,
              <https://www.rfc-editor.org/info/rfc7231>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

7.2.  Informative References

   [ART]      Raymond, E., "The Art of Unix Programming, Chapter 5",
              September 2003,
              <http://www.catb.org/~esr/writings/taoup/html/
              ch05s02.html>.

   [CREATIVYST]
              Repici, J., "HOW-TO: The Comma Separated Value (CSV) File
              Format", 2022,
              <https://www.creativyst.com/Doc/Articles/CSV/CSV01.htm>.

   [CSVW]     W3C, "Model for Tabular Data and Metadata on the Web",
              December 2015,
              <https://www.w3.org/TR/tabular-data-model/>.

   [EDOCEO]   Edoceo, Inc., "Comma Separated Values (CSV) Standard File
              Format", 2020, <https://edoceo.com/dev/csv-file-format>.

   [RFC1796]  Huitema, C., Postel, J., and S. Crocker, "Not All RFCs are
              Standards", RFC 1796, DOI 10.17487/RFC1796, April 1995,
              <https://www.rfc-editor.org/info/rfc1796>.

   [RFC2026]  Bradner, S., "The Internet Standards Process -- Revision
              3", BCP 9, RFC 2026, DOI 10.17487/RFC2026, October 1996,
              <https://www.rfc-editor.org/info/rfc2026>.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
              2003, <https://www.rfc-editor.org/info/rfc3629>.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, DOI 10.17487/RFC3986, January 2005,
              <https://www.rfc-editor.org/info/rfc3986>.

   [RFC3987]  Duerst, M. and M. Suignard, "Internationalized Resource
              Identifiers (IRIs)", RFC 3987, DOI 10.17487/RFC3987,
              January 2005, <https://www.rfc-editor.org/info/rfc3987>.

   [RFC6365]  Hoffman, P. and J. Klensin, "Terminology Used in
              Internationalization in the IETF", BCP 166, RFC 6365,
              DOI 10.17487/RFC6365, September 2011,
              <https://www.rfc-editor.org/info/rfc6365>.

   [UNICODE]  The Unicode Consortium, "The Unicode Standard, Version
              15.1.0", September 2023,
              <https://www.unicode.org/versions/Unicode15.1.0/>.

Appendix A.  Major changes since [RFC4180]

   *  Added a section clarifying motivation for this document and
      standards status

   *  Changing default encoding to UTF-8 and adding Unicode to the ABNF
      grammar

   *  Allowing CR, LF and CRLF for line breaks

   *  Allowing binary content including HTAB in CSV files

   *  Mandating a line break at the end of the last line in the file

   *  Making records and headers optional, thus allowing for an empty
      file

   *  Adding a section on common implementation concerns

   *  Removed "header" parameter for the MIME type since it is not used

Appendix B.  Changes since the -00 draft

   *  Added CSV injection to security considerations (#30)

   *  Added a reference to RFC 7111 (#27)

Appendix C.  Changes since the -01 draft

   *  No changes yet, refreshed to keep draft alive

Appendix D.  Changes since the -02 draft

   *  Refreshed to keep draft alive

   *  Contact information and GitHub link changes

   *  Minor updates on language

   *  Added a section on bidi handling

Appendix E.  Changes since the -03 draft

   *  Moved comments to the common practices section and removed from
      the ABNF grammar (#32)

   *  Added more clarifications to the format section

   *  Made ABNF grammar match the document

   *  Added a note about text content to the format section

Appendix F.  Changes since the -04 draft

   *  No changes yet, refreshed to keep draft alive

Appendix G.  Changes since the -05 draft

   *  Grammar changes

   *  Added another reference for section 3.10 (Bidi)

   *  Added clarification language on linebreaks (#36)

Appendix H.  Note to Readers

      *Note to the RFC Editor:* Please remove this section prior to
      publication.

   Development of this draft takes place on Github at:
   https://github.com/yakovsh/rfc4180-bis

   Comments can also be sent to the ART mailing list at:
   https://www.ietf.org/mailman/listinfo/art

   Full list of changes can be viewed via the IETF document tracker:
   https://tools.ietf.org/html/draft-shafranovich-rfc4180-bis

Author's Address

   Yakov Shafranovich
   Amazon Web Services (AWS)
   Email: yakovsh@amazon.com or ietf@shaftek.org