Internet Engineering Task Force                                 P. THOMY
Internet-Draft                                                 Loco-labs
Intended status: Informational                          19 December 2023
Expires: 21 June 2024


                      NTV tabular format (NTV-TAB)
                         draft-thomy-ntv-tab-00

Abstract

   This document describes a set of simple rules for unambiguously and
   concisely encoding semantic tabular and multidimensional data (NTV-
   TAB format).  These rules are based on the NTV structure and its JSON
   representation (JSON-NTV format).

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 21 June 2024.

Copyright Notice

   Copyright (c) 2023 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.


THOMY                     Expires 21 June 2024                  [Page 1]

Internet-Draft        NTV tabular format (NTV-TAB)         December 2023


Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
     1.1.  Presentation  . . . . . . . . . . . . . . . . . . . . . .   3
     1.2.  Key design features . . . . . . . . . . . . . . . . . . .   3
     1.3.  Conventions Used in This Document . . . . . . . . . . . .   4
   2.  Tabular data  . . . . . . . . . . . . . . . . . . . . . . . .   4
     2.1.  Principles  . . . . . . . . . . . . . . . . . . . . . . .   4
     2.2.  Tabular structure . . . . . . . . . . . . . . . . . . . .   5
     2.3.  Field structure . . . . . . . . . . . . . . . . . . . . .   8
     2.4.  Representation  . . . . . . . . . . . . . . . . . . . . .   8
   3.  NTV-TAB format  . . . . . . . . . . . . . . . . . . . . . . .   8
     3.1.  NTV structure . . . . . . . . . . . . . . . . . . . . . .   8
     3.2.  simple NTVfield formats . . . . . . . . . . . . . . . . .   9
     3.3.  default NTVfield formats  . . . . . . . . . . . . . . . .   9
     3.4.  Optimized NTVfield formats  . . . . . . . . . . . . . . .  12
     3.5.  Synthesis . . . . . . . . . . . . . . . . . . . . . . . .  13
   4.  Examples  . . . . . . . . . . . . . . . . . . . . . . . . . .  15
     4.1.  Field examples  . . . . . . . . . . . . . . . . . . . . .  15
     4.2.  Dataset examples  . . . . . . . . . . . . . . . . . . . .  17
   5.  Properties  . . . . . . . . . . . . . . . . . . . . . . . . .  19
     5.1.  JSON representation . . . . . . . . . . . . . . . . . . .  19
     5.2.  Dataset size  . . . . . . . . . . . . . . . . . . . . . .  19
     5.3.  Nested NTV-TAB structure  . . . . . . . . . . . . . . . .  20
   6.  Parsing a JSON-value  . . . . . . . . . . . . . . . . . . . .  20
   7.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  21
   8.  Security Considerations . . . . . . . . . . . . . . . . . . .  22
   9.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  22
     9.1.  Normative References  . . . . . . . . . . . . . . . . . .  22
     9.2.  Informative References  . . . . . . . . . . . . . . . . .  22
   Appendix A.  Dataset sizing . . . . . . . . . . . . . . . . . . .  23
     A.1.  Methodology . . . . . . . . . . . . . . . . . . . . . . .  23
     A.2.  Formats . . . . . . . . . . . . . . . . . . . . . . . . .  25
   Appendix B.  Table schema compatibility . . . . . . . . . . . . .  27
     B.1.  Table schema  . . . . . . . . . . . . . . . . . . . . . .  27
     B.2.  Compatibility . . . . . . . . . . . . . . . . . . . . . .  28
     B.3.  Example . . . . . . . . . . . . . . . . . . . . . . . . .  29
   Acknowledgements  . . . . . . . . . . . . . . . . . . . . . . . .  30
   Contributors  . . . . . . . . . . . . . . . . . . . . . . . . . .  30
   Author's Address  . . . . . . . . . . . . . . . . . . . . . . . .  30

1.  Introduction


1.1.  Presentation

   The most widely used format for exchanging textual tabular data is
   CSV [RFC4180].  Unfortunately, the CSV specification is dated (its
   last revision was in 2005) and many current CSV tools do not comply
   with it.

   It is therefore important to define an alternative format that meets
   the expectations of tabular and multidimensional data exchanges.  The
   NTV-TAB format proposed here is a response to this need.

1.2.  Key design features

   The format's focus is on simplicity, lightness and web usage.

   The key features of this format are the following:

   *  JSON as the base format

         JSON is simple and readable as simple text

         JSON supports rich structure including nesting and basic types

         JSON is web-native and very widely used and supported

         JSON has binary representations (e.g., CBOR)

   *  optimized representations

         several representations are available, from the simplest to
         the most optimized

         they avoid data duplication

         they reduce the size of data

         they allow strict and unambiguous reversibility (lossless
         round-trip)

   *  high semantic level of data (JSON-NTV as a grammar)

         takes into account the most common data formats used in
         Internet standards

         supports a wide variety of data types

         metadata (header or schema) can be integrated

         provides a common format for tabular and multidimensional data


   *  simple, compact, extensible and self-describing

1.3.  Conventions Used in This Document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   This document also uses the following terms:

   *JsonText, JsonValue, JsonObject, JsonMember, JsonElement,
   JsonArray, JsonNumber, JsonString, JsonFalse, JsonNull, JsonTrue:*
      These terms are defined in [JSON-NTV].

   *NTV, NTVlist, NVlist, Vlist, TVlist, NTVsingle, NVsingle,
   TVsingle, Vsingle, NTVname, NTVtype, NTVvalue, JsonNTVtype,
   JsonNTVname, JsonPrimitive, JsonUnnamed, JsonNamed:*
      These terms are defined in [JSON-NTV].

   *Row, Column, Table, Cell:*
      These terms are defined in [W3C_TAB].

   *Dataset:*
      A Dataset is equivalent to a Table.

   *Field:*
      A Field is equivalent to a Column.

2.  Tabular data

2.1.  Principles

   _*Tabular data* is data that is structured into rows, each of which
   contains information about some things.  Each row contains the same
   number of cells (although some of these cells may be empty), which
   provide values of properties of the thing described by the row.  In
   tabular data, cells within the same column provide values for the
   same property of the things described by each row.  This is what
   differentiates tabular data from other line-oriented formats._
   [W3C_TAB]

   Two main uses are identified for tabular data:


   *  a flow-oriented use for which each row is independent of the
      others.  The dataset is then seen as a list of rows whose number
      can be variable.  This is for example the case of a list of
      measurements from a sensor.

   *  a structure-oriented use for which the rows are not independent
      and contribute to describing the same object.  This is for example
      the case of a grade table for a class which integrates the
      students, courses, periods, etc.

   This document deals with this second use.

2.2.  Tabular structure

   In structure-oriented use, columns and rows are not equivalent: the
   columns (or Fields) represent the 'semantics' of the data, and each
   row represents a specific combination of Field values according to
   the structure defined by the tabular data (Dataset).  The nature of
   the rows is often implicit.

   Two basic patterns are present in Datasets:

   *  *Tree pattern*: A tree is represented in tabular form by the list
      of paths from the root to each leaf.  The columns then represent
      the levels of the tree.

   *  *Matrix pattern*: A matrix (or multidimensional data) is
      represented in tabular form by one column holding the values of
      the matrix and additional columns holding the coordinates of each
      value.

   Table 1 and Table 2 present an example of each pattern.

   +======+=========+=========+
   | Root | level 1 | level 2 |
   +======+=========+=========+
   | A    | B       | D       |
   +------+---------+---------+
   | A    | B       | E       |
   +------+---------+---------+
   | A    | C       | F       |
   +------+---------+---------+
   | A    | C       | G       |
   +------+---------+---------+

      Table 1: Tree pattern


    +=======+=====+=====+
    | Value | row | col |
    +=======+=====+=====+
    | 1     | A   | C   |
    +-------+-----+-----+
    | 2     | A   | D   |
    +-------+-----+-----+
    | 3     | B   | C   |
    +-------+-----+-----+
    | 4     | B   | D   |
    +-------+-----+-----+

   Table 2: Matrix pattern

   Representing these structures in tabular form leads to significant
   duplication of data (in Table 1, for example, the root "A" appears in
   every row).  In the general case, Datasets mix these different
   structures.
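
   As an illustration only, the following Python sketch rebuilds the
   rows of Table 1 and Table 2 from a tree and a matrix.  The function
   names and the nested-dict encoding of the tree are assumptions made
   for this example; they are not defined by NTV-TAB.

```python
from itertools import product


def matrix_to_rows(values, row_labels, col_labels):
    """Flatten a 2-D matrix into (value, row, col) tuples, one per cell."""
    return [
        (values[i][j], row_labels[i], col_labels[j])
        for i, j in product(range(len(row_labels)), range(len(col_labels)))
    ]


def tree_to_rows(tree, path=()):
    """List the root-to-leaf paths of a tree.

    Assumption: the tree is a nested dict and a leaf is an empty dict.
    """
    rows = []
    for node, children in tree.items():
        if children:
            rows.extend(tree_to_rows(children, path + (node,)))
        else:
            rows.append(path + (node,))
    return rows


# Matrix pattern (Table 2): one value column plus two coordinate columns.
matrix_rows = matrix_to_rows([[1, 2], [3, 4]], ["A", "B"], ["C", "D"])

# Tree pattern (Table 1): one column per level of the tree.
tree_rows = tree_to_rows({"A": {"B": {"D": {}, "E": {}}, "C": {"F": {}, "G": {}}}})
```

   In both results the coordinates and the upper tree levels are
   repeated from row to row, which is the duplication discussed above.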

   If we now observe the relationships between Fields [TAB-ANA], we can
   identify four main uses:

   *  *association*: each value of a Field is coupled to a single value
      of another Field ("coupled" relationship between two Fields),

   *  *classification*: the data is grouped by category, for example to
      allow statistical use ("derived" relationship between two Fields),

   *  *crossing*: all the combinations between two Fields are
      represented, as in matrix representations ("crossed" relationship
      between two Fields),

   *  *characterization*: defined properties are documented (no specific
      relationship).

   _Example: Price list of different foods based on packaging for the
   year 2022 (Table 3)._


   +==+=======+=========+=========+======+=====+========+==============+
   |Id|Product|Food     |Packaging|Weight|Price| Period | Availability |
   +==+=======+=========+=========+======+=====+========+==============+
   |11|apple  |fruit    |bag      |1 kg  |1    | 2nd    | Yes          |
   |  |       |         |         |      |     | half   |              |
   |  |       |         |         |      |     | 2022   |              |
   +--+-------+---------+---------+------+-----+--------+--------------+
   |12|apple  |fruit    |cardboard|10 kg |9    | 2nd    | Yes          |
   |  |       |         |         |      |     | half   |              |
   |  |       |         |         |      |     | 2022   |              |
   +--+-------+---------+---------+------+-----+--------+--------------+
   |13|orange |fruit    |bag      |1 kg  |2    | 2nd    | end of 2022  |
   |  |       |         |         |      |     | half   |              |
   |  |       |         |         |      |     | 2022   |              |
   +--+-------+---------+---------+------+-----+--------+--------------+
   |14|orange |fruit    |cardboard|10 kg |18   | 2nd    | end of 2022  |
   |  |       |         |         |      |     | half   |              |
   |  |       |         |         |      |     | 2022   |              |
   +--+-------+---------+---------+------+-----+--------+--------------+
   |15|pepper |vegetable|bag      |1 kg  |1.5  | 2nd    | end of 2022  |
   |  |       |         |         |      |     | half   |              |
   |  |       |         |         |      |     | 2022   |              |
   +--+-------+---------+---------+------+-----+--------+--------------+
   |16|pepper |vegetable|cardboard|10 kg |13   | 2nd    | end of 2022  |
   |  |       |         |         |      |     | half   |              |
   |  |       |         |         |      |     | 2022   |              |
   +--+-------+---------+---------+------+-----+--------+--------------+
   |17|banana |fruit    |bag      |1 kg  |0.5  | 2nd    | Yes          |
   |  |       |         |         |      |     | half   |              |
   |  |       |         |         |      |     | 2022   |              |
   +--+-------+---------+---------+------+-----+--------+--------------+
   |18|banana |fruit    |cardboard|10 kg |4    | 2nd    | Yes          |
   |  |       |         |         |      |     | half   |              |
   |  |       |         |         |      |     | 2022   |              |
   +--+-------+---------+---------+------+-----+--------+--------------+

                            Table 3: Price list

   _We find here:_

   *  _association: between "Packaging" and "Weight",_

   *  _classification: between "Product" and "Food",_

   *  _crossing: between "Product" and "Weight",_

   *  _characterization: between "Product" and "Availability"_
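
   The coupled, derived and crossed relationships can be checked on the
   Fields of Table 3 with a short Python sketch.  The counting criteria
   used here (numbers of distinct values and of distinct value pairs)
   are an assumption of this example, not a definition taken from this
   document or from [TAB-ANA], and the function name is hypothetical.

```python
def relationship(field_a, field_b):
    """Classify the relationship between two Fields of equal length.

    Assumed criteria (not normative):
      coupled: as many distinct pairs as distinct values in each Field
      derived: as many distinct pairs as distinct values in the larger Field
      crossed: every combination of distinct values is present
    """
    n_a, n_b = len(set(field_a)), len(set(field_b))
    n_pairs = len(set(zip(field_a, field_b)))
    if n_pairs == n_a == n_b:
        return "coupled"
    if n_pairs == max(n_a, n_b):
        return "derived"
    if n_pairs == n_a * n_b:
        return "crossed"
    return "none"


# Columns of Table 3, in row order (Id 11 to 18).
product = ["apple", "apple", "orange", "orange",
           "pepper", "pepper", "banana", "banana"]
food = ["fruit"] * 4 + ["vegetable"] * 2 + ["fruit"] * 2
packaging = ["bag", "cardboard"] * 4
weight = ["1 kg", "10 kg"] * 4

association = relationship(packaging, weight)
classification = relationship(product, food)
crossing = relationship(product, weight)
```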


2.3.  Field structure

   A Field is an ordered set of Cells.

   To represent this structure, several representations are possible
   depending on the nature of the data:

   *  the simplest format represents a Field by the list of its Cells,
      with the same order for all Fields.  This format is appropriate
      when the data contains few duplicates,

   *  when the data is repetitive, a second option is to represent, on
      the one hand, the list of distinct values and, on the other hand,
      the position in that list of the value of each Cell (i.e.
      categorical data),

   *  another special case concerns repetitive data for which one value
      is highly predominant (sparse data).  In this case, it is
      sufficient to provide the values that differ from the predominant
      one together with their positions in the Field,

   *  a last option consists in representing the Field according to its
      dependence on another Field (coupled, derived or crossed
      relationship).  This leads to an optimized data volume.

2.4.  Representation

   Three representations are available for a tabular object: row-
   oriented (list of Rows), cell-oriented (list of Cells), and field-
   oriented (list of Fields).

   The field-oriented representation is retained because it takes into
   account the semantics carried by the Fields as well as the inter-
   Field analysis presented above.

   A Dataset is then seen as a set of Fields representing the properties
   of the entire Dataset.

   The order of Fields or Rows is not relevant.

3.  NTV-TAB format

3.1.  NTV structure

   A Dataset is represented by the following NTV entities:

   *  NTVcell represents a Cell.  NTVcell is an NTVsingle.


   *  NTVfield represents a Field.  NTVfield is an NTV entity that
      depends on the format chosen to represent the Field (simple
      format, default format, optimized format).  An NTVfield contains
      an NTVlist of some or all of the NTVcells (Codec) and optionally
      coding data.

   *  NTVdataset represents the Dataset.  NTVdataset is an NVlist where
      the NTVname is the name of the Dataset and the NTVvalue is the
      list of NTVfields.

   The JSON format of an NTVdataset is its JSON-NTV format.

3.2.  simple NTVfield formats

   This category is the usual representation of a Field whose values
   differ (Full format) or whose values are all identical (Unique
   format).

   *Full format* :

   The Full format is the format that does not use any coding.  Codec
   and NTVfield are identical.  The NTVfield is therefore an NTVlist
   where the NTVname is the name of the Field, the NTVtype is the
   default type of the NTVcells and the NTVvalue is the list of
   NTVcells.

      _Example JsonNTVvalue ( "price" Field)_ :

         _[ 1, 9, 2, 18, 1.5, 13, 0.5, 4 ]_

   *Unique format* :

   The Unique format is used when all NTVcells are identical.  The Codec
   is the NTVcell.  Codec and NTVfield are identical (coding is
   implicit).  The NTVfield is therefore the NTVcell.

      _Example JsonNTVvalue ( "period" Field)_ :

         _"2nd half 2022"_

   Note:

      The Unique format also makes it possible to represent tabular
      metadata.
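
   A minimal Python sketch of how the two simple formats could be
   expanded into a list of Cell values is given below.  The function
   names are hypothetical, and the 'length' argument is an assumption:
   a Unique Field does not carry the number of Rows, so this sketch
   takes it from another Field of the same Dataset.

```python
def decode_full(ntv_value):
    """Full format: the JSON array already holds one value per Row."""
    return list(ntv_value)


def decode_unique(ntv_value, length):
    """Unique format: the single value is repeated for every Row.

    Assumption: `length` comes from another Field of the same Dataset.
    """
    return [ntv_value] * length


# "price" and "period" Fields of the price list example.
price = decode_full([1, 9, 2, 18, 1.5, 13, 0.5, 4])
period = decode_unique("2nd half 2022", len(price))
```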

3.3.  default NTVfield formats

   This category complements the simple formats with the other most
   common representations of a Field:


   *  Categorical Field (Complete format)

   *  Periodic Field (Primary format)

   *  Sparse Field (Sparse format)

   In these formats, the Codec is explicit: it is the TVlist of the
   distinct NTVcells of the Field.  The NTVfield is an NVlist where the
   NTVname is the name of the Field.

   *Complete format* :

   The Complete format is equivalent to the format used to store
   categorical variables.

   The NTVfield is an NVlist composed of two NTV entities:

   *  Codec: TVlist of the distinct NTVcells of the Field,

   *  Coding: Vlist of the indexes of the NTVcells in the Codec (Keys).

   The list of NTVcells is reconstituted by replacing each integer of
   the Coding Vlist with the NTVcell found at that index in the Codec
   (this is comparable to pandas categories and codes).

      _Example JsonNTVvalue ( "product" Field)_ :

         _[ [ "orange" , "pepper" , "apple" , "banana" ], [ 2, 2, 0, 0,
         1, 1, 3, 3 ] ]_
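
   The reconstitution rule of the Complete format can be sketched in
   Python as follows.  The function names are hypothetical.  The
   encoder lists the Codec in order of first appearance, which is only
   one possible choice: any Codec order is acceptable as long as the
   Keys are consistent with it.

```python
def decode_complete(codec, keys):
    """Rebuild the Cell values: each key is an index into the Codec."""
    return [codec[key] for key in keys]


def encode_complete(cells):
    """Build [Codec, Keys] from a list of hashable Cell values.

    Assumption: the Codec keeps the order of first appearance.
    """
    codec = list(dict.fromkeys(cells))
    index = {value: i for i, value in enumerate(codec)}
    return [codec, [index[cell] for cell in cells]]


# "product" Field of the price list example.
product = decode_complete(
    ["orange", "pepper", "apple", "banana"], [2, 2, 0, 0, 1, 1, 3, 3]
)
codec, keys = encode_complete(product)
```

   Decoding the re-encoded Field returns the same list of values, even
   though its Codec order differs from the one in the example.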

   *Sparse format* :

   A specific format (one-dimensional sparse LIL format) is used for
   sparse data.  It is defined by:

   *  'fill_value': it should be the most common value,

   *  'sp_value': the list of the values that differ from the
      'fill_value',

   *  'sp_index': the list of the indexes of the 'sp_value' entries in
      the sparse data list.

   The NTVfield is an NVlist composed of three NTV entities:

   *  Codec: TVlist of the distinct NTVcells of the Field,

   *  Ref: Vlist of the Codec indexes of the 'sp_value' entries,


   *  Coding: Vlist of 'sp_index'.

   The list of NTVcells is reconstituted by starting from a list filled
   with the 'fill_value' and replacing each value whose index appears in
   the Coding Vlist with the Codec value designated by the corresponding
   Ref index.

      _Example JsonNTVvalue ( "food" Field)_ :

         _[ [ "vegetable" , "fruit" ], [ 0, 0 ], [ 4, 5 ] ]_

         _( 'fruit' is the 'fill_value' - 0 is the index of "vegetable"
         in the Codec )_
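
   The Sparse reconstitution can be sketched in Python as follows.  Two
   points are assumptions of this sketch and are not stated by the text
   above: the 'fill_value' is taken to be the last entry of the Codec
   (as in the "food" example), and the length of the Field is supplied
   by the caller.  The function name is hypothetical.

```python
def decode_sparse(codec, ref, coding, length, fill_index=-1):
    """Rebuild the Cell values of a Sparse Field.

    codec:      distinct values of the Field
    ref:        for each sparse entry, the index of its value in codec
    coding:     for each sparse entry, its position in the Field (sp_index)
    length:     number of Rows (assumption: provided by the caller)
    fill_index: position of the fill value in codec (assumption: last)
    """
    cells = [codec[fill_index]] * length
    for position, codec_index in zip(coding, ref):
        cells[position] = codec[codec_index]
    return cells


# "food" Field of the price list example: rows 4 and 5 are "vegetable".
food = decode_sparse(["vegetable", "fruit"], [0, 0], [4, 5], length=8)
```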

   *Primary format* :

   This format is equivalent to the Complete format where the Keys Vlist
   is calculated from the Repetition coefficient.

   The NTVfield is an NVlist composed of two NTV entities:

   *  Codec: TVlist of the distinct NTVcells of the Field,

   *  Coding: Vlist with a single integer (Repetition coefficient).

   The Keys Vlist is generated with the formula:

      keys[ikey] = ( ikey % ( coef * period ) ) // coef

      where:

         keys: is the Keys Vlist

         ikey: is the index of a key value

         coef: is the Repetition coefficient

         period: is the length of Codec

      _Example: coef = 2, period = 3, Keys length = 12_

         _keys = [0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2]_

   The Repetition coefficient is the number of adjacent identical values
   in the Keys list.

      _Example "packaging"_ :

         _Codec: [ "bag" , "cardboard" ]_

         _Coefficient: 1_


         _(implicit Keys : [ 0, 1, 0, 1, 0, 1, 0, 1 ] )_

      _Example "product"_ :

         _Codec: [ "apple" , "orange" , "pepper" , "banana" ]_

         _Coefficient: 2_

         _(implicit Keys : [ 0, 0, 1, 1, 2, 2, 3, 3 ] )_
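
   The Keys formula of the Primary format translates directly into
   Python, where '%' is the modulo operator and '//' the integer
   division.  This is a sketch only: the function names are
   hypothetical, and the length of the Field is assumed to be known
   from the rest of the Dataset.

```python
def primary_keys(coef, period, length):
    """Keys Vlist generated from the Repetition coefficient."""
    return [(ikey % (coef * period)) // coef for ikey in range(length)]


def decode_primary(codec, coef, length):
    """Rebuild the Cell values of a Primary Field."""
    keys = primary_keys(coef, len(codec), length)
    return [codec[key] for key in keys]


# Generic example: coef = 2, period = 3, Keys length = 12.
generic_keys = primary_keys(2, 3, 12)

# "packaging" and "product" Fields of the price list example (8 Rows).
packaging = decode_primary(["bag", "cardboard"], 1, 8)
product = decode_primary(["apple", "orange", "pepper", "banana"], 2, 8)
```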

3.4.  Optimized NTVfield formats

   This category of formats reduces the size of the Complete format by
   optimizing the Keys.  The length of the Keys is reduced by using the
   derived (Relative format) or coupled (Implicit format) relationships
   between two Fields.

   In these formats, the Codec is explicit: it is the TVlist of the
   distinct NTVcells of the Field.  The NTVfield is an NVlist where the
   NTVname is the name of the Field.

   *Implicit format* :

   This representation is associated with "coupled" Fields.  These
   Fields have a one-to-one correspondence.

   The NTVfield is an NVlist composed of two NTV entities:

   *  Codec: TVlist of the distinct NTVcells of the Field,

   *  Ref: Vsingle holding the index or the name of the coupled Field.

   This format is equivalent to the Complete format where the Keys are
   the Keys of the Field designated by Ref (expressed in Complete
   format).

      _Example JsonNTVvalue ( "weight" Field is associated with
      "packaging" Field )_ :

         _[ [ "1 kg" , "10 kg" ], "packaging"]_

         _( implicit Keys : [ 0, 1, 0, 1, 0, 1, 0, 1 ] )_
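
   A minimal Python sketch of the Implicit reconstitution follows.  It
   assumes that the Keys of the coupled Field have already been
   computed and are available in a dictionary indexed by Field name;
   this dictionary and the function name are hypothetical, and a Ref
   given as an index rather than a name is not handled here.

```python
def decode_implicit(ntv_value, keys_by_field):
    """Rebuild the Cell values of an Implicit Field.

    ntv_value:     [codec, ref], where ref is the name of the coupled Field
    keys_by_field: Keys (Complete format) of the Fields already decoded
    """
    codec, ref = ntv_value
    return [codec[key] for key in keys_by_field[ref]]


# "weight" Field, coupled with the "packaging" Field.
keys_by_field = {"packaging": [0, 1, 0, 1, 0, 1, 0, 1]}
weight = decode_implicit([["1 kg", "10 kg"], "packaging"], keys_by_field)
```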

   *Relative format* :

   This representation is associated with "derived" Fields.  These
   Fields have a one-to-many correspondence.


   The values of a "derived" Field are inferred from the values of the
   parent Field.

   The NTVfield is an NVlist composed of three NTV entities:

   *  Codec: TVlist of the distinct NTVcells of the Field,

   *  Ref: Vsingle holding the index or the name of the parent Field,

   *  Coding: Vlist of the relative indexes of the NTVcells in the Codec
      (Relative Keys).

   This format is equivalent to the Complete format where the Keys Vlist
   is obtained by replacing each value of the Keys Vlist of the parent
   Field with the corresponding value in the Relative Keys (the length
   of the Relative Keys is the length of the Codec of the parent Field).

      _Example JsonNTVvalue ( "food" Field - "product" Field is the
      parent Field of "food" Field)_ :

         _[ [ "fruit" , "vegetable" ], "product", [ 0, 1, 0, 0 ] ]_

         _(the Vlist Keys is obtained by replacing the values 0, 1, 2, 3
         of the Vlist Keys of the "product" Field in Complete format,
         [ 2, 2, 0, 0, 1, 1, 3, 3 ], by the values 0, 1, 0, 0 of the
         Relative Keys, i.e.: [ 0, 0, 0, 0, 1, 1, 0, 0 ] )_
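
   The substitution described for the Relative format can be sketched
   in Python as follows.  The function names are hypothetical, and the
   Keys of the parent Field are assumed to be already available in
   Complete format.

```python
def relative_to_keys(parent_keys, relative_keys):
    """Replace each parent key by the matching Relative Keys entry."""
    return [relative_keys[key] for key in parent_keys]


def decode_relative(codec, parent_keys, relative_keys):
    """Rebuild the Cell values of a Relative (derived) Field."""
    keys = relative_to_keys(parent_keys, relative_keys)
    return [codec[key] for key in keys]


# "food" Field derived from the "product" Field, whose Complete format
# is [["orange", "pepper", "apple", "banana"], [2, 2, 0, 0, 1, 1, 3, 3]].
product_keys = [2, 2, 0, 0, 1, 1, 3, 3]
food_keys = relative_to_keys(product_keys, [0, 1, 0, 0])
food = decode_relative(["fruit", "vegetable"], product_keys, [0, 1, 0, 0])
```

   The Relative Keys hold one entry per value of the parent Codec (four
   here) instead of one entry per Row (eight here).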

3.5.  Synthesis

   The NTVfield structures corresponding to the formats defined above
   are summarized in Table 4:


    +====================+========+===============+==================+
    |     Structure      | Codec  |      Ref      |      Coding      |
    +==========+=========+========+===============+==================+
    |  format  |   NTV   | TVlist |    Vsingle    |      Vlist       |
    +==========+=========+========+===============+==================+
    | Relative | NTVlist | x      | index         | Relative Keys    |
    |          |         |        |               |                  |
    |          | len = 3 |        | or name       | len < len(Field) |
    +----------+---------+--------+---------------+------------------+
    | Complete | NTVlist | x      |               | Keys             |
    |          |         |        |               |                  |
    |          | len = 2 |        |               | len = len(Field) |
    +----------+---------+--------+---------------+------------------+
    | Sparse   | NTVlist | x      | list of index | sp_index         |
    |          |         |        |               |                  |
    |          | len = 3 |        | sp_value      | 1<len<len(Field) |
    +----------+---------+--------+---------------+------------------+
    | Implicit | NTVlist | x      | index         |                  |
    |          |         |        |               |                  |
    |          | len = 2 |        | or name       |                  |
    +----------+---------+--------+---------------+------------------+
    | Primary  | NTVlist | x      |               | coef             |
    |          |         |        |               |                  |
    |          | len = 2 |        |               | len = 1          |
    +----------+---------+--------+---------------+------------------+
    | Unique   |                      NTVsingle                      |
    +----------+-----------------------------------------------------+
    | Full     |               NTVlist len = len(Field)              |
    +----------+-----------------------------------------------------+

                        Table 4: NTVfield formats

   Three levels are available to convert tabular data into a JSON
   structure (Table 5):

   *  *Level 0: "simple"* is the usual representation of tabular data.

      Fields are converted with the Full or Unique format.

   *  *Level 1: "default"* avoids duplication of information by adding
      simple encoding.

      Fields are converted according to their own structure (simple,
      unique, categorical, sparse, periodic).

   *  *Level 2: "optimize"* avoids duplication of information and
      minimizes encoding.  It is the usual representation of
      multidimensional data.


      This level requires an analysis of the relationships between
      Fields ("partition").

                +===============+=========================+
                |     Level     |        Structure        |
                +====+==========+==============+==========+
                |    |   mode   |  Type Field  |  format  |
                +====+==========+==============+==========+
                | 0  | simple   | Unique       | Unique   |
                |    |          +--------------+----------+
                |    |          | Simple       | Full     |
                +----+----------+--------------+----------+
                | 1  | default  | Unique       | Unique   |
                |    |          +--------------+----------+
                |    |          | Simple       | Full     |
                |    |          +--------------+----------+
                |    |          | Sparse       | Sparse   |
                |    |          +--------------+----------+
                |    |          | Categorical  | Complete |
                |    |          +--------------+----------+
                |    |          | Periodic     | Primary  |
                +----+----------+--------------+----------+
                | 2  | optimize | Unique       | Unique   |
                |    |          +--------------+----------+
                |    |          | Root coupled | Full     |
                |    |          +--------------+----------+
                |    |          | Root derived | Complete |
                |    |          +--------------+----------+
                |    |          | Primary      | Primary  |
                |    |          +--------------+----------+
                |    |          | Derived      | Relative |
                |    |          +--------------+----------+
                |    |          | Coupled      | Implicit |
                +----+----------+--------------+----------+

                          Table 5: NTVfield levels

4.  Examples

4.1.  Field examples

   The Fields of the example in Section 2.2 (Table 3) have the following
   JSON representations (Table 6):


    +========+========================================================+
    | Format |                JsonNTV Representations                 |
    +========+========================================================+
    |Full    |{ "price::float": [ 1, 9, 2, 18, 1.5, 13, 0.5, 4 ] }    |
    |        |                                                        |
    |        |{ "price": [ 1, 9, 2, 18, 1.5, 13, 0.5, 4 ] }           |
    |        |                                                        |
    |        |[ 1, 9, 2, 18, 1.5, 13, 0.5, 4 ]                        |
    +--------+--------------------------------------------------------+
    |Complete|{"product":[["orange","pepper","apple","banana"],       |
    |        |                                                        |
    |        |[2,2,0,0,1,1,3,3]]}                                     |
    |        |                                                        |
    |        |{"product": [ ["orange","pepper","apple","banana"],     |
    |        |                                                        |
    |        |[2, 2, 0, 0, 1, 1, 3, 3] ]}                             |
    |        |                                                        |
    |        |[ ["orange","pepper","apple","banana"],                 |
    |        |                                                        |
    |        |[2, 2, 0, 0, 1, 1, 3, 3] ]                              |
    +--------+--------------------------------------------------------+
    |Unique  |{ "period": "2nd half 2022" }                           |
    |        |                                                        |
    |        |"2nd half 2022"                                         |
    +--------+--------------------------------------------------------+
    |Implicit|{"weight":[{"::string":["1 kg","10 kg"]},"packaging"]}  |
    |        |                                                        |
    |        |[["1 kg","10 kg"],3]                                    |
    +--------+--------------------------------------------------------+
    |Relative|{"food": [ {"::string": [ "fruit" , "vegetable" ]},     |
    |        |                                                        |
    |        |"product", [ 0, 1, 0, 0 ]] }                            |
    |        |                                                        |
    |        |[ [ "fruit" , "vegetable" ], 1, [ 0, 1, 0, 0 ] ]        |
    +--------+--------------------------------------------------------+
    |Sparse  |{"food":[{"::string":["vegetable","vegetable","fruit"]},|
    |        |                                                        |
    |        |[4,5,-1]]}                                              |
    |        |                                                        |
    |        |[["vegetable","vegetable","fruit"],[4,5,-1]]            |
    +--------+--------------------------------------------------------+
    |Primary |{"packaging":[{"::string":["cardboard","bag"]},[1]]}    |
    |        |                                                        |
    |        |[["cardboard","bag"],[1]]                               |
    |        |                                                        |
    |        |{"product":[["apple","orange","peppers","banana"],[2]]} |
    |        |                                                        |
    |        |[["apple","orange","peppers","banana"],[2]]             |
    +--------+--------------------------------------------------------+

                         Table 6: NTVfield examples

4.2.  Dataset examples

   The examples in Table 7 below illustrate the optimize level:


   +====================================+==============================+
   |                Data                |        Optimize level        |
   +========+===========================+==============================+
   |  type  |        Full format        |           JsonNTV            |
   +========+===========================+==============================+
   |matrix  |[['a','a','b','b','c','c'],|[[['a','b','c'],[2]],         |
   |        |                           |                              |
   |        |[10,20,10,20,10,20],       |[[10,20],[1]],                |
   |        |                           |                              |
   |        |[1,2,3,4,5,6]]             |[1,2,3,4,5,6]]                |
   +--------+---------------------------+------------------------------+
   |single  |[[1,2,3,4,5,6],            |[[1,2,3,4,5,6],               |
   |        |                           |                              |
   |        |['a','a','a','a','a','a']] |'a']                          |
   +--------+---------------------------+------------------------------+
   |complete|[[1,2,3,3,5,5]]            |[[[1,2,3,5],[0,1,2,2,3,3]]]   |
   +--------+---------------------------+------------------------------+
   |coupled |[[1,2,3,3,5,5],            |[[[1,2,3,5],[0,1,2,2,3,3]],   |
   |        |                           |                              |
   |        |['a','b','c','c','e','e']] |[['a','b','c','e'],0]]        |
   +--------+---------------------------+------------------------------+
   |derived |[[1,2,3,4,5,6],            |[[1,2,3,4,5,6],               |
   |        |                           |                              |
   |        |['a','a','b','b','c','c'], |[['a','b','c'],[0,0,1,1,2,2]],|
   |        |                           |                              |
   |        |[10,10,10,10,20,20]]       |[[10,20],1,[0,0,1]]]          |
   +--------+---------------------------+------------------------------+
   |matrix  |[[6,6,7,7,8,8,9,9],        |[[[6,7,8,9],[2]],             |
   |        |                           |                              |
   |+       |[10,20,10,20,10,20,10,20], |[[10,20],[1]],                |
   |        |                           |                              |
   |coupled |[1,1,2,2,3,3,4,4],         |[[1,2,3,4],0],                |
   |        |                           |                              |
   |        |[1,2,3,4,5,6,7,8]]         |[1,2,3,4,5,6,7,8]]            |
   +--------+---------------------------+------------------------------+
   |matrix  |[[6,6,7,7,8,8,9,9],        |[[[6,7,8,9],[2]],             |
   |        |                           |                              |
   |+       |[10,20,10,20,10,20,10,20], |[[10,20],[1]],                |
   |        |                           |                              |
   |coupled |[1,1,2,2,3,3,4,4],         |[[1,2,3,4],0],                |
   |        |                           |                              |
   |+       |[11,11,22,22,22,22,22,22], |[[11,22],0,[0,1,1,1]],        |
   |        |                           |                              |
   |derived |[1,2,3,4,5,6,7,8]]         |[1,2,3,4,5,6,7,8]]            |
   +--------+---------------------------+------------------------------+

                      Table 7: optimize level examples
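
   As a non-normative illustration, the Python sketch below expands the
   JsonNTV column of Table 7 back to the Full format.  The function
   names 'field_shape' and 'decode_dataset', the shape tests used to
   recognize each format and the reading of the Primary coefficient as
   a repetition count are assumptions inferred from the examples in
   Table 7; they are not definitions taken from this document.  The
   sketch also assumes that a parent field is Full, Complete or Primary
   and that at least one field is Full or Complete.

```python
def field_shape(value):
    """Guess the format of an unnamed NTVfield value from its JSON shape."""
    if not isinstance(value, list):
        return "unique"
    if len(value) == 2 and isinstance(value[0], list):
        if isinstance(value[1], int):
            return "implicit"            # 'coupled' row of Table 7
        if (isinstance(value[1], list)
                and all(isinstance(k, int) for k in value[1])):
            # Assumption: a one-element list is a Primary coefficient.
            return "primary" if len(value[1]) == 1 else "complete"
    if (len(value) == 3 and isinstance(value[0], list)
            and isinstance(value[1], int) and isinstance(value[2], list)):
        return "relative"                # 'derived' row of Table 7
    return "full"


def decode_dataset(fields):
    """Expand unnamed NTVfield values into Full-format lists."""
    shapes = [field_shape(v) for v in fields]
    # The dataset length is read from the first Full or Complete field.
    lengths = [len(v) if s == "full" else len(v[1])
               for v, s in zip(fields, shapes) if s in ("full", "complete")]
    if not lengths:
        raise ValueError("this sketch needs one Full or Complete field")
    length = lengths[0]
    codecs, keys = {}, {}
    # First pass: formats that do not refer to another field.
    for i, (v, s) in enumerate(zip(fields, shapes)):
        if s == "full":
            codecs[i], keys[i] = v, list(range(length))
        elif s == "unique":
            codecs[i], keys[i] = [v], [0] * length
        elif s == "complete":
            codecs[i], keys[i] = v
        elif s == "primary":
            n = v[1][0]  # assumed: each codec value is repeated n times
            codecs[i] = v[0]
            keys[i] = [(row // n) % len(v[0]) for row in range(length)]
    # Second pass: formats defined relative to a parent field.
    for i, (v, s) in enumerate(zip(fields, shapes)):
        if s == "implicit":
            codecs[i], keys[i] = v[0], keys[v[1]]
        elif s == "relative":
            codecs[i] = v[0]
            keys[i] = [v[2][k] for k in keys[v[1]]]
    return [[codecs[i][k] for k in keys[i]] for i in range(len(fields))]


# 'matrix' row of Table 7
matrix = [[['a', 'b', 'c'], [2]], [[10, 20], [1]], [1, 2, 3, 4, 5, 6]]
full = decode_dataset(matrix)
```

   A Full field made of two or three lists is misread by this sketch;
   this is the ambiguity discussed in Section 6.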


   The examples in Table 8 below illustrate NTVdatasets with a length
   equal to 0, 1 or 2:

     +------------------+-------------------------------------------+
     | [ ] or { }       | _Empty NTVdataset_                        |
     +------------------+-------------------------------------------+
     | [25] or [[25]]   | _NTVdataset with 1 NTVfield and length 1_ |
     +------------------+-------------------------------------------+
     | [2, 1] or [[2],  | _NTVdataset with 2 NTVfield and length 1_ |
     | [1]] or [2, [1]] |                                           |
     +------------------+-------------------------------------------+
     | [[2, 1]]         | _NTVdataset with 1 NTVfield and length 2_ |
     +------------------+-------------------------------------------+
     | [[2, 1], [4, 3]] | _NTVdataset with 2 NTVfield and length 2_ |
     +------------------+-------------------------------------------+

                Table 8: NTVdataset with length 0, 1 or 2
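
   The reading of Table 8 can be sketched in Python as follows.  The
   function name 'dataset_shape' is hypothetical, and the sketch assumes
   that every NTVfield is in Full format (a list of values) or holds a
   single value; the other formats are not handled.

```python
def dataset_shape(value):
    """Return (number of NTVfields, length) for a JSON array or object."""
    if isinstance(value, dict):
        fields = list(value.values())
    else:
        fields = list(value)
    if not fields:
        return (0, 0)
    # A single value counts for one cell; a list counts len() cells.
    length = max(len(f) if isinstance(f, list) else 1 for f in fields)
    return (len(fields), length)


shape = dataset_shape([[2, 1], [4, 3]])
```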

5.  Properties

5.1.  JSON representation

   The NTV-TAB format defines the representation of a Dataset in the NTV
   format.  This conversion is reversible (lossless).

   Furthermore, the NTV format defines the conversion into JSON format.
   This conversion is also reversible.

   The exchange format (JsonText) of a Dataset is therefore obtained by
   a representation in NTV-TAB format then a conversion to JSON format
   and finally a conversion to text format (or binary format with CBOR
   conversion).  The data is reconstituted identically by reverse
   conversions.

5.2.  Dataset size

   As explained in Section 2.2, cells are often duplicated in a Field.
   The principle of the NTV-TAB format is to replace duplicated data
   with an encoding based on integers.

   This optimization considerably reduces the size of a representation
   of a Dataset.  Appendix A details the methodology to optimize this
   size.


5.3.  Nested NTV-TAB structure

   An NTVcell in an NTVdataset can be any NTVsingle.  An NTVdataset can
   therefore include the data associated with the types defined in the
   NTV format.

   The 'tab' and 'field' NTVtypes are associated with NTVdataset and
   NTVfield.  An NTVcell can therefore also include an NTVdataset or an
   NTVfield.

   Figure 1 is an example of a nested Dataset.  The 'nested' JsonNTV is
   the representation of a Dataset of length 2 composed of two Fields,
   'field1' and 'field2'.

   *  'field1' is a Field with two Cells, 'dataset1' and 'dataset2',
      which are Datasets. 'field1' is represented with a Full format
      NTVfield.

   *  'field2' is a Field with two Cells, 'field2_1' and 'field2_2',
      which are Fields. 'field2' is represented with a Full format
      NTVfield.

   nested = {
     "field1": {
       "dataset1:tab":{
         "dts1_field1": [1,2,3],
         "dts1_field2": [4,5,6]
       },
       "dataset2:tab":{
         "dts2_field1": [10,20,30],
         "dts2_field2": [40,50,60],
         "dts2_field3": [70,80,90]
       }
     },
     "field2":{
       "field2_1:field": [1,2,3],
       "field2_2:field": [4,5,6]
     }
   }

                          Figure 1: Nested Dataset

6.  Parsing a JSON-value

   An NTV parser generates an NTV entity from a JSON-value.

   The decoded NTV entity is converted directly into an NTVdataset and
   a list of NTVfields.

   For each NTVfield, the format is deduced from the structure defined
   in the table xxx.


   For each format, a decoder converts the NTVvalue of the NTVfield into
   the chosen object.

   _Note :_

      _The Field format cannot be deduced unambiguously from several
      NTVvalues:_

      -  _[ list-data, integer, list-integer ] : Full or Relative format
         ?_

      -  _[ list-data, string, list-integer ] : Full or Relative format
         ?_

      -  _[ list-data, integer ] : Full or Implicit format ?_

      -  _[ list-data, string ] : Full or Implicit format ?_

      -  _[ list-data, list-integer ] : Full or Complete/Primary format
         ?_

      -  _[ list-data, list-integer, list-integer ] : Full or Sparse
         format ?_

      _The Full format is not retained for those NTVvalues._

      _To avoid this ambiguity, precautions can be taken for a Dataset
      with length = 2 or 3 and with a Full format:_

      -  _a name can be added to the list-data (e.g. { "data": list-
         data}),_

      -  _the order of data can be changed (e.g. [ integer, list-data
         ]),_

      -  _a type can be added (e.g. { "::json": [ list-data, list-
         integer ] } )_

      -  _an additional field can be added (e.g. [ list-data, integer,
         list-integer, {"format": "full"} ] )._
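
   The rule stated in the note above can be sketched in Python as
   follows.  The function name 'deduce_format' and the returned labels
   are hypothetical.  The six shapes are those listed in the note; the
   treatment of a non-list value as Unique follows Table 6.  This is an
   illustration under these assumptions, not a reference parser.

```python
def _is_int(x):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(x, int) and not isinstance(x, bool)


def _is_int_list(x):
    return isinstance(x, list) and all(_is_int(i) for i in x)


def deduce_format(value):
    """Name the format retained for the JSON value of an NTVfield."""
    if not isinstance(value, list):
        return "unique"
    if len(value) in (2, 3) and isinstance(value[0], list):
        second = value[1]
        is_ref = _is_int(second) or isinstance(second, str)
        if len(value) == 2 and is_ref:
            return "implicit"
        if len(value) == 2 and _is_int_list(second):
            return "complete/primary"
        if len(value) == 3 and _is_int_list(value[2]):
            if is_ref:
                return "relative"
            if _is_int_list(second):
                return "sparse"
    # Any other shape is kept as a plain list of values.
    return "full"


ambiguous = [[1, 2], [3, 4]]         # intended as two Full cells
named = [{"data": [1, 2]}, [3, 4]]   # precaution: name the list-data
formats = (deduce_format(ambiguous), deduce_format(named))
```

   With this sketch the 'ambiguous' value is not read as Full, whereas
   the 'named' value is, which is the effect of the first precaution.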

7.  IANA Considerations

   Any JsonValue is a JsonNTVValue and conversely, any JsonNTVvalue is a
   JsonValue.

   Thus, any JSON data may or may not be treated as JsonNTV data, so
   there is no need to create a specific MIME media type for JsonNTV.


   All properties of the MIME media type "application/json" are
   applicable.

8.  Security Considerations

   The format used for NTV data exchanges is the JSON format.  So, all
   the security considerations of [RFC8259] apply.

   The NTV structure provides no cryptographic integrity protection of
   any kind.

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC4180]  Shafranovich, Y., "Common Format and MIME Type for Comma-
              Separated Values (CSV) Files", RFC 4180,
              DOI 10.17487/RFC4180, October 2005,
              <https://www.rfc-editor.org/info/rfc4180>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

   [RFC8259]  Bray, T., Ed., "The JavaScript Object Notation (JSON) Data
              Interchange Format", STD 90, RFC 8259,
              DOI 10.17487/RFC8259, December 2017,
              <https://www.rfc-editor.org/info/rfc8259>.

9.2.  Informative References

   [TABLE]    "FrictionLess", "Table Schema", 2021,
              <https://specs.frictionlessdata.io/table-
              schema/#language>.

   [JSON-NTV] Thomy, P., "JSON semantic format (JSON-NTV)", 2023,
              <https://datatracker.ietf.org/doc/draft-thomy-json-ntv/>.

   [TAB-ANA]  Thomy, P., "Tabular dataset analysis", 2022,
              <https://github.com/loco-philippe/tab-
              analysis/blob/main/docs/tabular_analysis.pdf>.


   [W3C_TAB]  "W3C", "Recommendation : Model for Tabular Data and
              Metadata on the Web", 17 December 2015,
              <https://www.w3.org/TR/2015/REC-tabular-data-model-
              20151217/>.

Appendix A.  Dataset sizing

   This appendix presents an analysis of NTVdataset size optimization
   with the defined formats.

A.1.  Methodology

   The principle of defined formats is to replace duplicated data with
   encoding based on integers.

   We define the size of a Dataset representation (SZ) as the sum of the
   encoding size and the size of unencoded unduplicated values.  The
   coding is modeled as being the product of the values remaining to be
   represented (nv - nc) with an average coding size (sc):

      SZ = nc * sv + (nv - nc) * sc

      where :

         nv : number of values

         sv : mean value size

         nc : number of different values

         sc : mean coding size

      example :

         Full format : {"product":
         ["orange","apple","apple","apple","orange","orange"]}

         Complete format : {"product": [ ["apple","orange"], [1, 0, 0,
         0, 1, 1] ]}

         SZ = 9 + 8 + 7 + 6 * 1 = 30 (including double quotes),

         nv = 7 (including the Field name),

         sv = (9 + 8 * 3 + 7 * 3) / 7 = 7.71

         nc = 3 (including the Field name)

         sc = (30 - 3 * 7.71) / (7 - 3) = 1.71 (sc = (SZ - nc * sv) /
         (nv - nc))

      In this example, the JSON overhead (commas, spaces, curly
      brackets, square brackets) is not included.

   SZ is maximal when there is no coding (sc = sv) and minimal when the
   coding is perfect (sc = 0):

      SZmax = nv * sv

      SZmin = nc * sv

   We then define the following indicators:

   *  unicity level UL = nc / nv

         UL = SZmin / SZmax

         1 - UL = (SZmax - SZmin) / SZmax

         UL characterizes the nature of the data independently of the
         coding and represents the maximum achievable gain (1-UL).

         maximum UL = 1 (unduplicated data)

         minimum UL = 0 (fully duplicated data = empty data)

   *  object lightness OL = sc / sv

         OL = (SZ - SZmin) / (SZmax - SZmin)

         1 - OL = (SZmax - SZ) / (SZmax - SZmin)

         OL characterizes coding efficiency

         maximum OL = 1 (no coding)

         minimum OL = 0 (perfect coding)

   The optimization of the size of the representation is then evaluated
   by comparing the size obtained without coding and that obtained with
   coding:

      G = (SZmax - SZ) / SZmax = (1 - UL) * (1 - OL)

      R = 1 - G = SZ / SZmax = UL + OL - UL * OL


      The maximum G gain is 1 - UL, the minimum G gain is 0.

      If the data is empty, UL = OL = 0 and the gain is equal to 1.

   The indicators are deduced from the following four measurable values:

   *  number of cells in the dataset (nv)

   *  number of different cells in the dataset (nc)

   *  size of the dataset with the format to study (SZ)

   *  size of the dataset with Full format (SZmax)

   We then deduce sv = SZmax / nv as well as sc = (SZ - nc * sv) / (nv -
   nc)

   In the example above (nv = 7, nc = 3, sv = 7.71, sc = 1.71), the
   indicators are:

      UL = 3 / 7 = 0.43

      OL = 1.71 / 7.71 = 0.22

      G = 0.57 * 0.78 = 0.44

      SZmax = 7 * 7.71 = 54

      SZmin = 3 * 7.71 = 23.1

      The Complete format (SZ = 30) is close to the minimum size and
      its size is 56 % of the size of the Full format.
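
   The computation of the indicators from the four measurable values
   can be sketched in Python as follows.  The function names
   'json_size' and 'indicators' are hypothetical.  As in the example
   above, sizes are measured on the JSON text of each value (double
   quotes included) and the JSON overhead is ignored; the codec order
   and the keys are computed by the sketch itself.

```python
import json


def json_size(value):
    """Length of the JSON text of a value (strings keep their quotes)."""
    return len(json.dumps(value))


def indicators(nv, nc, sz, sz_max):
    """Derive sv, sc, UL, OL, G and R from the four measured values."""
    if nv <= nc:
        raise ValueError("no duplicated value: sc is undefined")
    sv = sz_max / nv
    sc = (sz - nc * sv) / (nv - nc)
    ul, ol = nc / nv, sc / sv
    gain = (1 - ul) * (1 - ol)
    return {"sv": sv, "sc": sc, "UL": ul, "OL": ol,
            "G": gain, "R": 1 - gain}


name = "product"
values = ["orange", "apple", "apple", "apple", "orange", "orange"]
cells = [name] + values                   # the Field name is counted
codec = sorted(set(values))               # distinct values
keys = [codec.index(v) for v in values]   # one integer per value

nv = len(cells)
nc = len(set(cells))
sz_max = sum(json_size(c) for c in cells)                        # Full
sz = json_size(name) + sum(json_size(c) for c in codec + keys)   # Complete
result = indicators(nv, nc, sz, sz_max)
```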

A.2.  Formats

   The formats used to represent an NTVfield have the general form:

   *  a list of some of the NTVcells

   *  a list of integers used to encode the other NTVcells

   The size of this format can then be written (without taking into
   account the overhead linked to the format):

      SZ = nc * sv + k * nv * si

      where :

         nv : number of values


         sv : mean value size

         nc : number of different Field values

         si : integer size

         k: specific coefficient of the coding used

   Comparison with the structure defined in the previous section allows
   us to deduce the parameters:

      UL = nc/nv

      sc = si * k * nv/(nv-nc)

      OL = si/sv * k * nv/(nv-nc)

      G = 1- nc/nv - si/sv * k

      R = nc/nv + si/sv * k

   The gain G is therefore equal to the maximum gain 1-UL reduced by the
   weight of the coding, that is, the parameter k weighted by the size
   of an integer relative to the average size of the values (si/sv).

   Table 9 below specifies the values of k for the different formats:

         +==========+===============+===========================+
         |  Format  | k coefficient |          comments         |
         +==========+===============+===========================+
         | Full     | 0             | R = 1                     |
         +----------+---------------+---------------------------+
         | Unique   | 0             | R = 1/nv (nc = 1)         |
         +----------+---------------+---------------------------+
         | Complete | 1             | R = nc/nv + si/sv         |
         +----------+---------------+---------------------------+
         | Primary  | 1 / nv        | R = nc/nv + si/sv/nv      |
         +----------+---------------+---------------------------+
         | Coupled  | 1 / nv        | R = nc/nv + si/sv/nv      |
         +----------+---------------+---------------------------+
         | Sparse   | 2 * ns / nv   | R = nc/nv + 2*si/sv*ns/nv |
         +----------+---------------+---------------------------+
         | Derived  | nd / nv       | R = nc/nv + si/sv*nd/nv   |
         +----------+---------------+---------------------------+

                       Table 9: coding coefficient

         _ns: number of values distinct from the 'fill_value'_


         _nd: number of different values in the parent Field_
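
   Table 9 and the expression of R can be sketched in Python as
   follows.  The function names 'coding_coefficient' and 'size_ratio'
   and the lower-case format labels are hypothetical, and the default
   integer size si = 1 (a single digit) is an assumption.

```python
def coding_coefficient(fmt, nv, ns=0, nd=0):
    """Coefficient k of Table 9 for a format label (lower case)."""
    table = {
        "full": 0,
        "unique": 0,
        "complete": 1,
        "primary": 1 / nv,
        "coupled": 1 / nv,
        "sparse": 2 * ns / nv,   # ns: values distinct from fill_value
        "derived": nd / nv,      # nd: different values in the parent
    }
    return table[fmt]


def size_ratio(fmt, nv, nc, sv, si=1.0, ns=0, nd=0):
    """R = SZ / SZmax = nc/nv + si/sv * k (R = 1 for the Full format)."""
    if fmt == "full":
        return 1.0
    k = coding_coefficient(fmt, nv, ns, nd)
    return nc / nv + si / sv * k


# Hypothetical field: 6 values, 2 different values, mean size 7.5
r_complete = size_ratio("complete", nv=6, nc=2, sv=7.5)
r_primary = size_ratio("primary", nv=6, nc=2, sv=7.5)
```

   With the same data, the Primary format gives a smaller ratio than the
   Complete format because its coding weight is divided by nv.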

Appendix B.  Table schema compatibility

   This appendix presents the compatibility between Table Schema [TABLE]
   and the NTV-TAB format.

B.1.  Table schema

   Table Schema is a simple language- and implementation-agnostic way to
   declare a schema for tabular data.  A Table Schema is represented by
   a descriptor.  The descriptor MUST be a JSON object with defined
   properties (JsonMember).

   Table Schema defines the following descriptors and properties:

   *  Fields property

         Fields property MUST be an array where each entry in the array
         is a field descriptor (as defined below).

   *  Field descriptor

         A field descriptor MUST be a JSON object that describes a
         single field.

   *  Field properties

         The field descriptor object MAY contain any number of other
         properties.  Some specific properties are defined below.  Of
         these, only the name property is REQUIRED.

         Defined Properties:

         o  name

         o  title

         o  description

         o  example

         o  type / format

         o  constraints

         The constraints property on Table Schema Fields can be used by
         consumers to list constraints for validating field values.


   *  Table properties

         In addition to field descriptors, there are the following
         "table level" properties:

         o  missingValues

         o  primaryKey

         o  foreignKeys

B.2.  Compatibility

   Three levels of compatibility are addressed :

   *  Concepts

         The concepts are equivalent between Table Schema and NTV-TAB :

         o  Table is equivalent to Dataset

         o  Field in Table Schema is equivalent to the NTVfield

         o  Name in Table Schema is equivalent to the NTVname of the
            NTVfield

         o  Type / Format in Table Schema is equivalent to the NTVtype
            of the NTVfield

   *  Type / Format

         NTVtype combines the concepts of type and format.  The
         correspondence table in the NTV specification [JSON-NTV] gives
         the link between an NTVtype and the corresponding type/format.

   *  Constraints

         Constraints are applicable to each value in a Table Field.
         Validating constraints for all values in a Table Field is
         equivalent to validating a constraint for all values in the
         Codec list.

   These compatibility levels are reached, which makes it possible to
   validate an NTVdataset with a schema defined according to the
   Table Schema format.

   The following principles should then be considered to validate an
   NTVdataset:


   *  The NTVfield names must be identical to the Schema names,

   *  If the NTVtype of Codec data is not 'json', it must match the
      Type/Format defined in the Schema,

   *  If the constraints are valid with the Codec data, they are valid
      with the Field.
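
   Two of these principles (names and constraints) can be sketched in
   Python as follows.  The function names 'codec_values' and
   'check_dataset' are hypothetical, only the 'minimum' constraint is
   handled, the NTVtype check is left out, and the Codec is assumed to
   be the first list of any format other than Full or Unique.

```python
def codec_values(field_value):
    """Return the values that carry the data of a field (its Codec)."""
    v = field_value
    # Unwrap a typed value such as {"::date": [...]}.
    if isinstance(v, dict) and len(v) == 1:
        v = next(iter(v.values()))
    if not isinstance(v, list):
        return [v]        # single value
    if len(v) in (2, 3) and isinstance(v[0], list):
        return v[0]       # codec followed by keys or parent
    return v              # Full format


def check_dataset(dataset, schema):
    """Return a list of error messages (empty when the checks pass)."""
    errors = []
    # The NTVname is the part of the JSON name before the first ':'.
    fields = {key.split(":", 1)[0]: val for key, val in dataset.items()}
    expected = [f["name"] for f in schema["fields"]]
    if sorted(fields) != sorted(expected):
        errors.append(f"names {sorted(fields)} != {sorted(expected)}")
    for descriptor in schema["fields"]:
        name = descriptor["name"]
        minimum = descriptor.get("constraints", {}).get("minimum")
        if minimum is None or name not in fields:
            continue
        # The constraint is checked once per Codec value.
        for value in codec_values(fields[name]):
            if value < minimum:
                errors.append(f"{name}: {value} < {minimum}")
    return errors


dataset = {
    "index": [100, 200, 300, 400, 500, 600],
    "value": [[10, 20, 30], [2]],
    "names::string": ["john", "eric", "judith", "mila", "hector", "maria"],
}
schema = {"fields": [
    {"name": "index", "constraints": {"minimum": 50}},
    {"name": "value", "constraints": {"minimum": 15}},
    {"name": "names"},
]}
errors = check_dataset(dataset, schema)
```

   In this hypothetical data the value 10 appears twice in the 'value'
   Field but is reported once, because the check runs on the Codec.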

B.3.  Example

   Figure 2 is an example of Dataset with Full format ('tab_data1'
   without NTVtypes) and with other formats ('tab_data2').

   tab_data1 = {
     "index":  [100, 200, 300, 400, 500, 600],
     "dates":  ["1964-01-01", "1985-02-05", "2022-01-21", "1964-01-01",
                "1985-02-05", "2022-01-21"],
     "value":  [10, 10, 20, 20, 30, 30],
     "coord":  [[1,2], [3,4], [5,6], [7,8], [3,4], [5,6]],
     "names":  ["john", "eric", "judith", "mila", "hector", "maria"],
     "unique": ["true", "true", "true", "true", "true", "true"]
   }
   tab_data2 = {
     "index": [100, 200, 300, 400, 500, 600],
     "dates": {"::date":[["1964-01-01","1985-02-05","2022-01-21"],[1]]},
     "value": [[10, 20, 30], [2]],
     "coord::point": [[1,2], [3,4], [5,6], [7,8], [3,4], [5,6]],
     "names::string":["john", "eric", "judith", "mila", "hector",
                      "maria"],
     "unique":        true
   }

                         Figure 2: Dataset example

   The schema in Figure 3 is valid for both the 'tab_data1' and
   'tab_data2' formats.

   tab_schema = {
     "fields": [
       {"name":"index", "type":"integer", "constraints":{"minimum":50}},
       {"name":"dates", "type":"date"},
       {"name":"value", "type":"integer"},
       {"name":"coord", "type":"geopoint", "format":"array"},
       {"name":"names"},
       {"name":"unique", "type":"boolean"}
     ]
   }


                          Figure 3: Schema example

Acknowledgements

   TBD

Contributors

   TBD

Author's Address

   Philippe THOMY
   Loco-labs
   476 chemin du gaf de Famian
   84 500 BOLLENE
   France
   Email: philippe@loco-labs.io
   URI:   https://github.com/loco-philippe/NTV/blob/main/README.md

