Type Detection Library
Over the last few weeks and months we have been working on a Type Detection Library. It is not used in DataTables, but it is used in Cloudtables to automatically detect types within an array of data. In CloudTables the arrays of data represent columns of data which is really useful for automatic type detection when importing data from a CSV, JSON or Excel file. We have released this on npm as an open source library so that hopefully it can be useful to the web development community.
In this blog post I will walk through the different types and formats that can be identified using this new library.
Detectable Types
The type detection library can detect the following types.
- Numbers
- HTML
- Dates and Times
- Strings
- Mixed
This covers the majority of formats that data can be presented in.
Numbers
The Type Detection library can detect numbers in many forms. These include
- JS numbers (
100
)[100, 200, 300]
- Numeric Strings (
'100'
)['100', '200', '300']
- Numeric Strings with a prefix, such ascurrency (
'$100'
)['$100', '$200', '$300']
- Numeric Strings with a postfix, such as a unit (
'100 cm'
)['100 cm', '200 cm', '300 cm']
- Numeric Strings with both a prefix and a postfix (
'$100 per month'
)['$100 per month', '$200 per month', '$300 per month']
- All of the above with decmial places,
'.'
by default (100.5, '100.5', '$100.50', '100.5 cm', '$100.50 per month'
) - All of the above (except JS Numbers) with a thousands separator,
','
by default ('100,000.5', '$100,000.50', '100,000.5 cm', '$100,000.50 per month'
)
Any prefix or postfix that is detected will be common across all data points, except null points. If there is an inconsistency then the returned type will be a string.
HTML
HTML can be detected using the Type Detection library. Any dataset that is made up of strings can be identified as HTML if there is at least one valid HTML tag present within it. Without a valid HTML tag, a string type is returned.
Some examples of HTML data are included below
['<span>I</span>', '<span>am</span>', '<span>HTML</span>']
['I', '<span>am</span>', 'HTML']
Dates and Times
Dates and Times were the most challenging part to implement within this library. The format of the dates is identified using the same tokens as moment. There isn't a concrete Detection library for date formats, and there likely never will be.
Consider the date 10/10/10
. This could be one of any number of possible formats because there is no way to differentiate between day, month or year. Fortunately, we are normally considering large sets of data rather than a single value. It's unlikely that in a whole set it will be impossible to derive any information from the values that are passed in. This library will always return a format that is valid for every element within the array that is passed in, a best guess. Most of the time however, it will be possible to identify with a good degree of accuracy the format of the data.
There is a list of supported tokens within the ReadMe on Github. These tokens can be arranged in any order, however there are some conditions to make detection feasible. Firstly, no token can be used twice - there is no use case for this in real world dates. Secondly, if a token like MM
is used, the M
token is immediately ruled out. The only difference between these two is the leading '0' at the start for values less than 10 - again it makes no sense for these to be included twice. Beyond these conditions a series of logical steps occur to attempt to accurately identify a valid moment format for the entire data set.
If there is no time element to the data provided then a date
type will be returned. Data with both dates and times will be given the type datetime
. Date with just times present will be given the type time
,
If a data set is provided that has multiple different formats, a mixed type will be returned. This can be as simple as an extra column or 0
somewhere so it is important to make sure that data is formatted properly.
Some examples are provided below
- Date
['2021-03-11','2020-12-25','2021-01-01']
-YYYY-MM-DD
['12-25-2020', '03-1-2021', '01-11-1999']
-MM-D-YYYY
['March 5th, 2021', 'April 16th, 2022', 'May 27th, 2023']
-MMMM Do, YYYY
- DateTime
['2021-03-11 08:17','2020-12-25 09:47','2021-01-01 15:04']
-YYYY-MM-DD HH:mm
['12-25-2020 08:17 AM', '03-1-2021 09:47 AM', '01-11-1999 03:04 PM']
-MM-D-YYYY hh:mm A
['March 5th, 2021 08:17:33', 'April 16th, 2022 09:47:01', 'May 27th, 2023 15:04:25']
-MMMM Do, YYYY HH:mm:ss
- Time
['08:17','09:47','15:04']
-HH:mm
['08:17 AM', '09:47 AM', '03:04 PM']
-hh:mm A
['08:17:33', '09:47:01', '15:04:25']
-HH:mm:ss
- Mixed
['2021-03-11 08:17','2020/12/25 09:47','2021-01-01 15:04']
- Slashes in one, dashes in another['12-25-2020 08:17 AM', '03-1-2021 09:47 AM', '01-11-1999 15:04']
- Inconsistent AM/PM and 24 hourHH
['March 5th, 2021 08:17:33', 'April 16th, 2022 09:47:01', 'May 27th 2023 15:04:25']
- Missing comma in last value
Mixed
A mixed type is returned if multiple different types, or formats are identified within the data set. Some examples are included below.
['2021-03-11 08:17','2020/12/25 09:47','2021-01-01 15:04']
Different formatting
[17, 'Some string', '2021-05-17']
Mixture of number and string and date
Strings
Data that cannot be categorised into any of the above types will be given a type of string.
Excel
Cloudtables allows the import of data from excel spreadsheets. Because of this the library needs to be able to identify types within excel data. All of the above types can be identified from data that is passed in as follows.
[{ value: '1', excel: '#' }, { value: '2', excel: '#' }, { value: '3', excel: '#' }]
This would return a number type with no decimal places, prefixes or postfixes. The property value pairs within the object represent the value of a cell and the format that is being applied to that cell. Both of these contain key pieces of information when it comes to identifying both type and format.
Usage
If after reading this blog post you can think of a way to make use of the Type Detection Library, you need to take the following steps to use it in your projects.
Install
The library is available on npm under the MIT license. You can install it by running
npm install @datatables/type-detector
Initialisation
Usage is really straightforward and only involves importing, initialising and calling a single api method. The initialisation code for both Javascript and Typescript is included below. When the typeDetect()
function is called a single dimensional array of data is passed in. The library will then return on object that represents the typing and formatting that is being used within that array.
ES3
var TypeDetect = require('@datatables/type-detector');
var detector = new TypeDetect["default"]();
var type = detector.typeDetect(dataArray);
ES6
import TypeDetect from '@datatables/type-detector';
let detector = new TypeDetect();
let type = detector.typeDetect(dataArray);
Contributions
Contributions to add and improve our open source software is always welcome. Just make a pull request on Github, or get in touch via the forum!