tl;dr: The old /favicon.ico was 15KB and, due to bad caching, was downloaded 24M times in the last month, totaling ~350GB of server-to-client traffic, almost all of which can be avoided.
How to save the planet? Well, do something you can do, they say. Ok, what I can do is to reduce the amount of electricity consumed to browse the web. Mozilla MDN Web Docs, which I work on, has a lot of traffic from all over the world. In the last 30 days, we have roughly 70M pageviews across roughly 15M unique users.
A lot of these people come back to MDN more than once per month so good assets and good asset-caching matter.
I found out that somehow we had failed to optimize the /favicon.ico asset! It was 15,086 bytes when, with Optimage, I was quickly able to turn it down to 1,153 bytes. That's a 13x improvement! Here's what that looks like when zoomed in 4x:
The next challenge was the Cache-Control. Our CDN is AWS CloudFront and it respects whatever Cache-Control headers we set on the assets. Because favicon.ico doesn't have a unique hash in its name, the Cache-Control falls back to the default of 24 hours (max-age=86400), which isn't much, especially for an asset that almost never changes. Besides, if we do decide to change the image (but not the name), we'd have to wait a minimum of 24 hours until it's fully rolled out.
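To make the fallback rule concrete, here is a minimal sketch in Python of how a Cache-Control value could be chosen from the file name alone. The regular expression, the function name, and the one-year lifetime for hashed files are assumptions for illustration; only the 24-hour default (max-age=86400) comes from the text above.

```python
import re

# Assumption: a "hashed" file name carries a hex digest between the stem and
# the extension, e.g. favicon.323ad90c.ico. The pattern is illustrative only.
HASHED_NAME = re.compile(r"\.[0-9a-f]{8,}\.[a-z0-9]+$")

ONE_DAY = 60 * 60 * 24      # 86400 seconds, the default mentioned in the text
ONE_YEAR = ONE_DAY * 365    # assumed lifetime for hashed assets


def cache_control_for(path: str) -> str:
    """Return a Cache-Control value for a static asset path."""
    if HASHED_NAME.search(path):
        # The URL changes whenever the content changes, so a long lifetime is safe.
        return f"max-age={ONE_YEAR}"
    # No hash in the name: fall back to the short default.
    return f"max-age={ONE_DAY}"
```

With this rule, /favicon.ico gets the 24-hour default, while a name such as /favicon.323ad90c.ico gets the long lifetime.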
Another thing I did as part of this was to stop assuming the default URL of /favicon.ico and instead control it with the <link href='/favicon.323ad90c.ico' type='image/x-icon'> HTML tag. Now I can control the URL of the image that will be downloaded.
Our client-side code is based on create-react-app and it can't optimize the files in the
So I wrote a script that post-processes the files in client/build/. In particular, it looks through the index.html template and replaces...
Plus it makes a copy of the file with this hash in it so that the old URL still resolves. But now we can cache it much more aggressively: for 1 year, in fact.
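A post-processing step of this kind could look roughly like the following Python sketch. It is not the actual script: the function name, the use of SHA-256, the 8-character digest, and the plain string replacement in index.html are all assumptions, and the text above does not spell out exactly what gets replaced.

```python
import hashlib
import shutil
from pathlib import Path


def hash_asset(build_dir: Path, name: str = "favicon.ico") -> str:
    """Copy build_dir/name to a content-hashed file name and update index.html.

    Returns the new file name. The original file is left in place so that
    the old URL still resolves.
    """
    source = build_dir / name
    # Short content digest; any change to the file yields a new name.
    digest = hashlib.sha256(source.read_bytes()).hexdigest()[:8]
    hashed_name = f"{source.stem}.{digest}{source.suffix}"
    shutil.copyfile(source, build_dir / hashed_name)

    # Assumed behaviour: point every reference in index.html at the hashed URL.
    index = build_dir / "index.html"
    html = index.read_text(encoding="utf-8")
    index.write_text(html.replace(f"/{name}", f"/{hashed_name}"), encoding="utf-8")
    return hashed_name
```

Because the digest is derived from the file's bytes, replacing the image produces a new URL, which is what makes the long cache lifetime safe.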
Combined, we used to have ~350GB worth of data sent from our CDN(s) to people's browsers every month.
Just changing the image itself would turn that number to ~25GB instead.
The new Cache-Control hopefully means that all those returning users can skip the download on a daily basis which will reduce the amount of network usage even more, but it's hard to predict in advance.
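The numbers can be sanity-checked with a few lines of Python. The download count and file sizes come from the text; reading "GB" as binary gigabytes (1024³ bytes) is an assumption, under which the results come out at roughly 337 and 26, in line with the rounded ~350GB and ~25GB figures.

```python
DOWNLOADS_PER_MONTH = 24_000_000  # from the text
OLD_SIZE_BYTES = 15_086           # from the text
NEW_SIZE_BYTES = 1_153            # from the text
GIB = 1024 ** 3                   # assumption: "GB" read as binary gigabytes


def monthly_transfer_gib(size_bytes: int, downloads: int = DOWNLOADS_PER_MONTH) -> float:
    """Total monthly traffic for one asset, in GiB."""
    return size_bytes * downloads / GIB


old_traffic = monthly_transfer_gib(OLD_SIZE_BYTES)
new_traffic = monthly_transfer_gib(NEW_SIZE_BYTES)
size_ratio = OLD_SIZE_BYTES / NEW_SIZE_BYTES  # the "13x improvement"
```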
tl;dr: Periodically, the whole of MDN is built, by our Node code, in a GitHub Action. A Python script bulk-publishes this to Elasticsearch. Our Django server queries the same Elasticsearch via /api/v1/search. The site-search page is a static single-page app that sends XHR requests to the /api/v1/search endpoint. Search results’ sort-order is determined by match and “popularity”.
The challenge with “Jamstack” websites is with data that is so vast and dynamic that it doesn’t make sense to build statically. Search is one of those. For the record, as of Feb 2021, MDN consists of 11,619 documents (aka articles) in English, plus roughly another 40,000 translated documents. In English alone, there are 5.3 million words. So to build a good search experience we need to, as a static site build side-effect, index all of this in a full-text search database. Elasticsearch is one such database and it’s good. In particular, Elasticsearch is something MDN is already quite familiar with because it’s what was used from within the Django app when MDN was a wiki.
Note: MDN gets about 20k site-searches per day from within the site.
When we build the whole site, it’s a script that basically loops over all the raw content, applies macros and fixes, and dumps one index.html (via React server-side rendering) and one index.json. The index.json contains all the fully rendered text (as HTML!) in blocks of “prose”. It looks something like this:
You can see one here:
Next, after all the index.json files have been produced, a Python script takes over. It traverses all the index.json files and, based on that structure, figures out the title, summary, and the whole body (as HTML).
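As a rough illustration of that traversal, the Python sketch below walks a build directory and flattens each index.json into a title, summary, and body. The JSON layout it expects (a top-level "doc" object whose "body" is a list of sections, with prose HTML under value.content) is an assumption made for the example, as are the function and field names.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_documents(build_root: Path) -> Iterator[dict]:
    """Yield one flat record per index.json found under build_root."""
    for path in sorted(build_root.rglob("index.json")):
        doc = json.loads(path.read_text(encoding="utf-8"))["doc"]
        # Assumed layout: "body" is a list of sections, and prose sections
        # carry their rendered HTML under section["value"]["content"].
        body_html = "\n".join(
            section["value"]["content"]
            for section in doc.get("body", [])
            if section.get("type") == "prose"
        )
        yield {
            "title": doc["title"],
            "summary": doc.get("summary", ""),
            "body": body_html,
        }
```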
Next up, before sending this into the bulk-publisher in Elasticsearch it strips the HTML. It’s a bit more than just turning <p>Some <em>cool</em> text.</p> to Some cool text. because it also cleans up things like <div> and certain
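The basic tag-stripping step can be sketched with Python's standard html.parser module. This covers only the simple case from the example above; which elements get dropped entirely (here script and style) and which count as block-level are assumptions for the sketch, and the extra cleanup the real script does is not reproduced.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    # Hypothetical choices, for illustration only.
    SKIPPED_TAGS = {"script", "style"}            # content dropped entirely
    BLOCK_TAGS = {"p", "div", "li", "pre", "h1", "h2", "h3", "h4"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            # Keep adjacent blocks from running together.
            self._chunks.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse all runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def strip_html(html: str) -> str:
    """Return the visible text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```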
One thing worth noting is that this whole thing runs roughly every 24 hours and then it builds everything. But what if, between two runs, a certain page has been removed (or moved), how do you remove what was previously added to Elasticsearch? The solution is simple: it deletes and re-creates the index from scratch every day. The whole bulk-publish takes a while so right after the index has been deleted, the searches won’t be that great. Someone could be unlucky in that they’re searching MDN a couple of seconds after the index was deleted and now waiting for it to build up again.
It’s an unfortunate reality but it’s a risk worth taking for the sake of simplicity. Also, most people are searching for things in English and specifically the Web/ tree, so the bulk-publishing is done in a way that the most popular content is bulk-published first and the rest is done after. Here’s what the build output logs:
So, yes, for 3m 35s there’s stuff missing from the index and some unlucky few will get fewer search results than they should. But we can optimize this in the future.
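One way to get that "popular content first" ordering is to sort the documents before handing them to the bulk-publisher. The Python sketch below is an assumption about how this could be done, not the actual script; the field names locale, slug, and popularity are hypothetical.

```python
def publish_order(documents: list) -> list:
    """Sort so en-US documents under Web/ come first, most popular first."""

    def key(doc: dict) -> tuple:
        is_priority = doc["locale"] == "en-US" and doc["slug"].startswith("Web/")
        # False sorts before True, so priority documents lead; within each
        # group, higher popularity comes first.
        return (not is_priority, -doc["popularity"])

    return sorted(documents, key=key)
```

Publishing in this order means the window of incomplete results mostly affects the long tail rather than the pages people search for most.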
The way you connect to Elasticsearch is simply by a URL. It looks something like this:
It’s an Elasticsearch cluster managed by Elastic running inside AWS. Our job is to make sure that we put the exact same URL in our GitHub Action (“the writer”) as we put into our Django server (“the reader”).
In fact, we have 3 Elastic clusters: Prod, Stage, Dev.
And we have 2 Django servers: Prod, Stage.
So we just need to carefully make sure the secrets are set correctly to match the right environment.
Now, in the Django server, we just need to convert a request like GET /api/v1/search?q=foo&locale=fr (for example) to a query to send to Elasticsearch. We have a simple Django view function that validates the query string parameters, does some rate-limiting, creates a query (using elasticsearch-dsl) and packages the Elasticsearch results back to JSON.
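The validation step might look something like this framework-free Python sketch. It is not the Django view itself: the set of accepted locales, the length limit, the default locale, and the exception name are hypothetical, and rate-limiting and the Elasticsearch call are left out.

```python
from urllib.parse import parse_qs

# Hypothetical limits, for illustration only.
VALID_LOCALES = {"en-us", "fr", "ja"}
MAX_QUERY_LENGTH = 200


class InvalidSearch(ValueError):
    """Raised when the query string cannot be turned into a search."""


def parse_search_params(query_string: str) -> dict:
    """Validate a query string such as 'q=foo&locale=fr'."""
    params = parse_qs(query_string)  # blank values are dropped by default
    q = params.get("q", [""])[0].strip()
    if not q:
        raise InvalidSearch("'q' is required")
    if len(q) > MAX_QUERY_LENGTH:
        raise InvalidSearch("'q' is too long")
    locale = params.get("locale", ["en-US"])[0].lower()
    if locale not in VALID_LOCALES:
        raise InvalidSearch(f"unknown locale {locale!r}")
    return {"q": q, "locale": locale}
```

Rejecting bad input before building a query keeps malformed requests from ever reaching the search cluster.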
How we make that query is important. In here lies the most important feature of the search: how it sorts results.
In one simple explanation, the sort order is a combination of popularity and “matchness”. The assumption is that most people want the popular content. I.e. they search for foreach and mean to go to /en-US/docs/Web/API/NodeList/forEach both of which contain forEach in the title. The “popularity” is based on Google Analytics pageviews, which we download periodically and normalize into a floating-point number between 0 and 1. At the time of writing the scoring function does something like this:
This seems to produce pretty reasonable results.
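To illustrate the idea of blending popularity with the match score, here is a minimal Python sketch. Both the normalization (dividing by the largest pageview count) and the linear blend with a fixed weight are assumptions chosen for the example; the weight value in particular is hypothetical and not taken from the text.

```python
POPULARITY_WEIGHT = 10.0  # hypothetical weight, not taken from the text


def normalize_popularity(pageviews: dict) -> dict:
    """Scale raw pageview counts into the range 0..1 (assumed: divide by the max)."""
    top = max(pageviews.values(), default=0)
    if top <= 0:
        return {url: 0.0 for url in pageviews}
    return {url: views / top for url, views in pageviews.items()}


def combined_score(match_score: float, popularity: float) -> float:
    """Blend the text-match score with the normalized popularity."""
    return match_score + POPULARITY_WEIGHT * popularity
```

With a blend like this, two documents that match the query equally well are ordered by popularity, while a much stronger text match can still outrank a popular page.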
But there’s more to the “matchness” too. Elasticsearch has its own API for defining boosting and the way we apply it is:
- match phrase in the title: Boost = 10.0
- match phrase in the body: Boost = 5.0
- match in title: Boost = 2.0
- match in body: Boost = 1.0
This is then applied on top of whatever else Elasticsearch does such as “Term Frequency” and “Inverse Document Frequency” (tf and idf). This article is a helpful introduction.
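Expressed in Elasticsearch's query DSL, the four boosts listed above could be written as a plain Python dict like the one below. The boost values come from the list; combining the clauses in a bool query with should and minimum_should_match is an assumption about the structure, and the function name is hypothetical.

```python
def build_match_query(q: str) -> dict:
    """Build an Elasticsearch query body applying the four boosts."""
    return {
        "bool": {
            "should": [
                # Exact phrase matches count the most.
                {"match_phrase": {"title": {"query": q, "boost": 10.0}}},
                {"match_phrase": {"body": {"query": q, "boost": 5.0}}},
                # Plain term matches count less.
                {"match": {"title": {"query": q, "boost": 2.0}}},
                {"match": {"body": {"query": q, "boost": 1.0}}},
            ],
            # Assumed: a document must satisfy at least one clause to match.
            "minimum_should_match": 1,
        }
    }
```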
We’re most likely not done with this. There’s probably a lot more we can do to tune this myriad of knobs and sliders to get the best possible ranking of documents that match.
The last piece of the puzzle is how we display all of this to the user. The way it works is that
A lot of interesting details are omitted from this code snippet. You have to check it out for yourself to get a more up-to-date insight into how it actually works. But basically, the query string (kept in sync via pushState) drives the fetch() call and then all the component has to do is display the search results with some highlighting.
The /api/v1/search endpoint also runs a suggestion query as part of the main search query. This extracts interesting alternative search queries. These are filtered and scored and we issue “sub-queries” just to get a count for each. Now we can do one of those “Did you mean…”. For example: search for
There are a lot of interesting, important, and careful details that are glossed over here in this blog post. It’s a constantly evolving system and we’re constantly trying to improve and perfect the system in a way that it fits what users expect.
A lot of people reach MDN via a Google search (e.g. mdn array foreach) but despite that, nearly 5% of all traffic on MDN is the site-search functionality. The /$locale/search?... endpoint is the most frequently viewed page of all of MDN, so having a good search engine that’s reliable is important. Owning and controlling the whole pipeline allows us to do specific things that are unique to MDN that other websites don’t need. For example, we index a lot of raw HTML (e.g. <video>) and we have code snippets that need to be searchable.
Hopefully, the MDN site-search will go from being known as very limited to something that can genuinely help people get to the exact page better than Google can. Yes, it’s worth aiming high!
(Originally posted on personal blog)
About Peter Bengtsson
Peter is a senior web developer at Mozilla working on MDN Web Docs. He writes more opinionated nerdery on www.peterbe.com