
Test Quality vs. Bashing TailwindCSS

"Tailwind vs. Semantic CSS"

There are great and really nuanced articles, for example by Bastian Allgaier, discussing utility classes, TailwindCSS, and so on. Another article is currently making the rounds: Tailwind vs. Semantic CSS, which claims to contain a "study [that] compares two websites with identical design: the commercial Spotlight template from developers of Tailwind vs the same site with semantic CSS from the article's author".

When someone uses the word "study" and a lot of people pass that phrase along, I feel a strong urge to take a deeper look into it.

Test Quality

Before I became a full-time web developer, I earned my teaching degree and a master's degree in "Media and education". My university emphasized quantitative and qualitative methodology. A lot. In my first year, I learned the three main criteria for measuring the quality of quantitative tests (translated from Uni Leipzig):

  1. Reliability: "A measurement instrument is considered reliable if it produces the same results when used repeatedly under the same conditions."
  2. Objectivity: "This criterion describes the degree to which the measurement results are independent of the person collecting the data."
  3. Validity: "This criterion asks [...] whether and to what extent the measurement measures what it is supposed to measure."

I thought a lot about writing this blog post, but with so many people sharing the article, I really feel the urge to write down my thoughts – and measure the test quality of the "study".

Disclaimer: I use TailwindCSS a lot, but I understand anyone who doesn't. I have concerns about going all-in on utility classes and about the marketing strategies of the Tailwind team. I use TailwindUI, but have never used their templates because I strongly avoid React.

Reliability

A measurement instrument is considered reliable if it produces the same results when used repeatedly under the same conditions.

Let's start with the simplest criterion: reliability is given. If we compare visits to both landing pages with exactly the same setup, the results will be the same. There will be variations due to caching, repeated page loads, switching between pages, the internet connection, etc. – but overall the measurements are repeatable.

Objectivity

This criterion describes the degree to which the measurement results are independent of the person collecting the data.

The gist of the article is that the "Semantic version [...] does not require JavaScript bundlers / tooling". Later, the article states that "[with] the semantic version, we can use Markdown in place of the custom JSX component because the generated HTML is semantic".

The article doesn't mention that it uses Nuemark, "a standalone library that works under Bun, Node, and Deno", written by the author himself. It "comes with a set of built-in components aimed at addressing the most common content management use cases" and is part of Nue, which the author markets as "A perfect framework".

  1. IMO there is no such thing as a perfect framework.
  2. IMO there is no such thing as "the most common content management use cases".
  3. Nuemark is... JavaScript tooling?
  4. Instead of JSX components we use... Nuemark components?

Back to objectivity: A study by an author building a JS framework that focuses on a "semantic first" approach and has a "design system" on the roadmap feels biased to me – and leaving out information to make a point definitely reduces objectivity.

Validity

This criterion asks [...] whether and to what extent the measurement measures what it is supposed to measure.

Let's focus on the introduction: "This study compares two websites of identical design: the commercial Spotlight template by the developers of Tailwind vs. the same site using semantic CSS".

The most problematic phrase here is "identical design", which might be the reason why the article has been shared so many times. Let me give a short list of things I found during a 10-minute analysis that are not "identical":

  1. Theming:
    1. No automatic dark mode.
    2. Theme switch doesn't change icon.
  2. Navigation:
    1. Navigation does not switch to a hamburger menu on smaller screens.
    2. Navigation contains fewer items.
    3. The top left avatar disappears on smaller screens (making it impossible to return to the main page).
    4. Navigation doesn't appear when scrolling up.
    5. The currently active navigation items don't have the "gradient" border.
  3. Landing page:
    1. Incorrect icon and text in the "Download CV" button.
    2. Poorly designed newsletter input.
  4. Blog:
    1. No columns in article view.
    2. No code preview with syntax highlighting.
  5. Projects:
    1. Projects are not scrollable and clickable as a whole.
    2. Project icons have shadows baked into the SVG and look completely different.
  6. Other details:
    1. Worse focus styles.
    2. No transitions on hover.

Everything here is fixable with semantic CSS, a few sprinkles of JS – and maybe the occasional wrapper div. But for now, the pages are not "identical", so the comparison doesn't measure "what it's supposed to measure" and is therefore invalid.
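
To show how small those sprinkles can be, here is a minimal vanilla JS sketch for the two theming gaps from the list above (automatic dark mode, a switch icon that follows the theme) – the #theme-toggle id, the dark class, and the icons are my own placeholder assumptions, not code from either site:

```js
// A minimal sketch, assuming a hypothetical #theme-toggle element and a
// `dark` class on <html> – neither is taken from the compared sites.
const toggle = document.querySelector('#theme-toggle');
const prefersDark = window.matchMedia('(prefers-color-scheme: dark)');

function applyTheme(dark) {
  document.documentElement.classList.toggle('dark', dark);
  toggle.textContent = dark ? '🌙' : '☀️'; // the switch changes its icon
}

applyTheme(prefersDark.matches); // automatic dark mode on first load
prefersDark.addEventListener('change', (e) => applyTheme(e.matches)); // follow OS changes
toggle.addEventListener('click', () =>
  applyTheme(!document.documentElement.classList.contains('dark'))
);
```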

Summary

Building a website that does less, is less polished, and contains glitches is of course smaller and more performant. While the reliability of the comparison is given, this violates its validity. In addition, omitting information (like the usage of Nuemark) reduces objectivity.

What bothers me most: I'm 100% sure we can build a smaller, faster site without TailwindCSS and especially without Next.js. Maybe not "8x smaller", maybe not without additional "bundlers / tooling". But for that we would need a more objective and valid test environment. This could be, for example, a GitHub repo that contains a lot (!) of visual regression and performance tests, with the original website as the baseline.
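
To make that concrete, here is a hypothetical sketch of what one such visual regression test could look like, using Playwright's built-in screenshot comparison – the page paths, the baseURL setup, and the diff threshold are my assumptions, not part of either project:

```js
// A hypothetical sketch using Playwright's screenshot comparison; the paths
// and the diff threshold are assumptions, not taken from either project.
import { test, expect } from '@playwright/test';

const pages = [
  { path: '/', name: 'home' },
  { path: '/articles', name: 'blog' },
];

for (const { path, name } of pages) {
  test(`matches the original Spotlight baseline: ${name}`, async ({ page }) => {
    await page.goto(path); // baseURL (the rebuilt site) set in playwright.config
    // Baseline screenshots would be generated once from the original template.
    await expect(page).toHaveScreenshot(`${name}.png`, {
      fullPage: true,
      maxDiffPixelRatio: 0.01, // tolerate tiny anti-aliasing differences
    });
  });
}
```

With the baselines generated once from the original template, every deviation – a missing hamburger menu, a wrong icon, an absent hover transition – would show up as a failing test instead of a debatable claim.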

I would love to see that. That would be something I would like to share, discuss and learn from. But until that happens, I will at least stick to reliable, neutral facts and try to openly discuss things that could lead to misinformation. There's far too much of that already. 🙏🏻
