Should you prefer 1970 Jan 1 over 1 Jan 1970?

Hey Fellow Earthly Souls.

Thou may refer to me as Abhay Agarwala, a Final Year Undergraduate Student of Jadavpur University.

Recently, I found something interesting while trespassing in the Machine Learning alley of the AI world. I was working on a simple task: classifying tweets from the past year by the sentiment they convey. The dataset had a separate file for each day, and a plot of how the sentiment score varied over time was required. (One might accurately predict negative sentiments to explode in the last year, yet we were stubborn enough to find the silver lining.)

Now, here comes the interesting part. The various date formats used around the world are:

• Little Endian (DD-MM-YYYY) (Most Used)
• Middle Endian (MM-DD-YYYY) (Least Used)
• Big Endian (YYYY-MM-DD).

In the dataset used, the dates were in the format YYYY-MM-DD. However, I asked the creator to reformat them to DD-MM-YYYY, i.e. Little Endian. (Don’t blame me later on, as it is a widely used format globally.)

Here, a Python dictionary was used to store the list of (tweet, classification) pairs, with the date as the key.

A Python dictionary maps each key to a value. Take a dictionary of countries as an example.

Here, “Country name” is the key and (Capital, Continent) is the value assigned to the key.
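A minimal sketch of such a dictionary is shown here. The entries and the variable name `countries` are illustrative assumptions, not taken from the dataset.

```python
# Illustrative dictionary: country name -> (capital, continent).
countries = {
    "India": ("New Delhi", "Asia"),
    "France": ("Paris", "Europe"),
    "Kenya": ("Nairobi", "Africa"),
}

# Looking up a key returns the value assigned to it.
capital, continent = countries["India"]
print(capital, continent)
```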

Now, for those who don’t know, a dictionary does not sort its keys as it is populated with data (in Python 3.7 and later it simply keeps them in insertion order). However, to have a perfect plot with time only increasing in the positive X direction, I had to sort them.

The keys of the dictionary were sorted and a new dictionary was created.
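This step could look roughly like the sketch below. The name `scores_by_date` and the sentiment values are made up for illustration, and the real dictionary held lists of (tweet, classification) pairs rather than single numbers.

```python
# Hypothetical data: date string -> sentiment score (values are invented).
scores_by_date = {
    "03-01-2021": 0.12,
    "01-01-2021": -0.40,
    "02-01-2021": 0.05,
}

# sorted() on a dict iterates over its keys and returns them as a sorted list.
# Building a new dict from that list fixes the iteration order.
sorted_scores = {date: scores_by_date[date] for date in sorted(scores_by_date)}

print(list(sorted_scores))
```

Note that `sorted` compares the keys as plain strings, since that is what they are.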

Here comes the big revelation.

The dates were not increasing in the true monotonic sense of time.

Let’s take an example here. Consider two dates, 31st January 2021 (A) and 14th February 2021 (B).

In Little Endian format

• A is 31-01-2021.
• B is 14-02-2021.
• Let’s make an array: [A,B] = [“31-01-2021”,“14-02-2021”].

Now the elements are strings, so each character matters. If we sort them by comparing characters from left to right, using the natural ordering of digits [0, 1, 2, …, 9], which is how string-based sorting takes place by default:

The sorted array looks like [“14-02-2021”,“31-01-2021”].
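The same thing can be reproduced in a couple of lines of Python. This is only a sketch of the effect, with the variable names chosen here for illustration.

```python
# A = 31st January 2021, B = 14th February 2021, both as DD-MM-YYYY strings.
a = "31-01-2021"
b = "14-02-2021"

# Strings are compared character by character, so "1" < "3" decides the order
# before the month is ever looked at.
little_endian_sorted = sorted([a, b])
print(little_endian_sorted)  # ['14-02-2021', '31-01-2021']
```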

Welcome to a world of new dates, where there is ordering even in chaos. Well, if today is 31-01-2021, then Valentine’s Day is over, my friend. (Excuse Me.)

(A fun exercise: make a fake calendar with the dates arranged the way DD-MM-YYYY strings sort.)

Don’t worry. We are not going to let down Betty. Let’s try sorting on the basis of the YMD format.

In Big Endian format

• A is 2021-01-31.
• B is 2021-02-14.
• To make it a fair test, let’s start from the reversed array: [B,A] = [“2021-02-14”,“2021-01-31”].

The result is [“2021-01-31”,“2021-02-14”]. It’s perfect. The plot was fine, and there were days throughout 2020 when the silver lining of positive sentiments appeared beneath the dark clouds of despair and worry created by the pandemic.
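Here is the Big Endian case as a quick sketch, again with illustrative variable names. The input list is deliberately given in the wrong order so that the sort has something to do.

```python
# A = 31st January 2021, B = 14th February 2021, both as YYYY-MM-DD strings.
a = "2021-01-31"
b = "2021-02-14"

# The year and the leading "0" of the month tie, then "1" < "2" puts January first.
big_endian_sorted = sorted([b, a])
print(big_endian_sorted)  # ['2021-01-31', '2021-02-14']
```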

What a sigh of relief. Phew.

This is lexicographic sorting at work: strings are compared character by character from the left, so the natural (chronological) ordering is lost. In the DMY example, the first key is the day. Only if the days are equal is the month compared, and then the year. If we stick with this format, then time is going to flow in a turbulent fashion, as 31st December 1999 is going to occur after New Year’s Day of 2021.

However, in the case of the Big Endian format, the sort takes place on the year first, then the month and then the day of the month. The fields run from the most significant unit of time to the least significant, so string order and chronological order agree (as long as every field is zero-padded to a fixed width).
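One way to see the difference in key order is to sort the same DD-MM-YYYY strings twice, once as plain strings and once with an explicit (year, month, day) key. The helper `ymd_key` below is a hypothetical name introduced for this sketch.

```python
def ymd_key(dmy):
    """Turn a DD-MM-YYYY string into a (year, month, day) tuple of ints."""
    day, month, year = (int(part) for part in dmy.split("-"))
    return (year, month, day)

dates = ["31-12-1999", "01-01-2021", "14-02-2021", "31-01-2021"]

# Plain string sort: day first, then month, then year.
by_string = sorted(dates)

# Sort on the most significant field first: year, then month, then day.
by_time = sorted(dates, key=ymd_key)

print(by_string)  # ['01-01-2021', '14-02-2021', '31-01-2021', '31-12-1999']
print(by_time)    # ['31-12-1999', '01-01-2021', '31-01-2021', '14-02-2021']
```

The second result is the chronological one, but it needs an extra parsing step on every key, which Big Endian strings avoid.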

Why is it such a big deal?

Now, one might argue that the format can always be changed from one to another. In small datasets, the difference in processing time isn’t perceptible. However, it starts to matter with the explosion of data from so many smart electronic devices being used around the world. Imagine billions and billions of logs, tweets and other similar records where the timestamp is an important source of information. A human who leaves the conversion for whosoever is going to process them should be fined and asked to pay 100 Dogecoins (:P).

So, the next time you are going to store a date, try to make an effort to store it in the Big Endian format (YMD).
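For reference, the conversion itself might look like the sketch below, using the standard `datetime` module. The function name `dmy_to_iso` is an assumption made for this example, and the point of the paragraph above is that this parsing has to run once per record.

```python
from datetime import datetime

def dmy_to_iso(dmy):
    """Convert a DD-MM-YYYY string to a YYYY-MM-DD string."""
    # strptime parses the Little Endian text; isoformat() writes Big Endian.
    return datetime.strptime(dmy, "%d-%m-%Y").date().isoformat()

little_endian = ["31-01-2021", "14-02-2021", "31-12-1999"]
big_endian = sorted(dmy_to_iso(d) for d in little_endian)

print(big_endian)  # ['1999-12-31', '2021-01-31', '2021-02-14']
```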

I hope that you might have learnt something new. Until next time, share it with your friends while I walk along ‘A’ alley of ‘W’ world.

Thank You. :)
