Convert HTML to Markdown

After a discussion with a fellow developer on Slack one evening (Cheers Nik!), I found I had a side project to play with. I have been trying to do some non-Umbraco coding, and this side project was going to be exactly that.

I'm planning to move this blog off Umbraco 7 on to Umbraco 8. I could do a migration, but I'm looking to set up a vanilla Umbraco 8 website. Just now, this blog runs off a uSkinned setup. If you've never used or heard of uSkinned, check them out. It's a great way to spin up an Umbraco site with a frontend built for you, the backend is packed full of features, the guys are super friendly and the support is great.

I'm just wanting to play a bit more with a clean Umbraco 8 setup hosted on Umbraco Cloud. 

Being honest, I'm nervous about releasing my code to the public, but I also see it as a way I can learn how to improve my coding. If others see better ways of doing things and they share that with me, then great! If they don't, well, the code works for me so nothing lost. I want to give a shoutout to Candid Contributions, a great podcast; I made the decision to share more of my code after listening to a number of their shows. The code doesn't need to be perfect and it's nice to share with the open source community.

So, what have I built?

It's a console application that reads the RSS feed of this site, takes each individual blog link, reads the HTML for that blog and looks for a specific div Id. Once it finds that Id, the app takes the contents of that div and converts it into a Markdown file, which is saved locally on my computer.
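To make the "look for a specific div Id" step a bit more concrete, here is a minimal sketch of the idea. It is written in Python with only the standard library, so it is not the code from the .NET console app, and the Id `post` and the sample page are made up for illustration. It also only tracks nested div tags, which is enough for a sketch but not a full HTML parser.

```python
from html.parser import HTMLParser


class DivExtractor(HTMLParser):
    """Collects the inner HTML of the first <div> whose id matches target_id."""

    def __init__(self, target_id):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.target_id = target_id
        self.depth = 0        # div nesting depth inside the target (0 = outside)
        self.found = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            if not self.found and tag == "div" and dict(attrs).get("id") == self.target_id:
                self.found = True
                self.depth = 1
            return
        self.parts.append(self.get_starttag_text())
        if tag == "div":
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the depth.
        if self.depth > 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                return  # this was the closing tag of the target div itself
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth > 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth > 0:
            self.parts.append(f"&#{name};")


def extract_div(html, target_id):
    """Return the inner HTML of the div with the given id, or None if absent."""
    parser = DivExtractor(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip() if parser.found else None


# Hypothetical page: only the contents of the "post" div are kept.
page = '<div id="nav">menu</div><div id="post"><h1>Title</h1><p>Hello</p></div>'
print(extract_div(page, "post"))
```

The string that comes back is what would then be handed to an HTML to Markdown converter.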

Once I have the Umbraco 8 site set up, I'll copy all these Markdown files in and I'm up and running again.

I like the Markdown format because it's fast: I can write a blog quickly, and I can also write it on my phone. No need to be online or logged in to my CMS. Perfect.

Blog to Markdown parser.

Note: This parser is still a work in progress, so any code samples in this blog may be out of date. For the latest code, check out my GitHub repo.

I'm using two NuGet packages for this application: ReverseMarkdown and FeedReader.

Here is something I've never admitted before: I have normally looked for other ways to do things, trying to find the 'code' way rather than using NuGet packages. Which is crazy, I was just making life harder for myself! I was trying to reinvent the wheel and wondering why I never got anywhere!

ReverseMarkdown is a great package that does the conversion from HTML to Markdown. FeedReader makes it really easy to read my RSS feed and get the data I need out of it, in this case the link item.

Here is a snip of the RSS feed that I will be reading into my console app. As you can see, I have multiple options for where to get the URL from, but I decided to stay with the link element as that makes the most sense.

Below are some of the configuration settings that are available via ReverseMarkdown.

It would have taken me months to try and clean up the HTML and make a markdown file from it! This has been really useful and I found it by chance! I'm so glad I did. 

This next snip of code is where I read in the RSS feed. Once I have the feed, I get the `Link` element from each item, remove empty spaces to clean up the link, then add it to a list.

If you look at the RSS feed example, you will see the link item had some random spacing in it, so the Trim() and Replace() clean that up.
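As a rough illustration of that clean-up step, here is a Python sketch using only the standard library. It is not the FeedReader-based code from the app, the feed and URLs are invented, and it assumes the stray spacing is plain whitespace (spaces, tabs and newlines) that can simply be dropped from the link.

```python
import xml.etree.ElementTree as ET


def clean_links(rss_xml):
    """Return the whitespace-free <link> of every <item> in an RSS document."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        raw = item.findtext("link", default="")
        # split() with no arguments breaks on any run of whitespace, so joining
        # the pieces removes padding around the URL and any spacing inside it.
        link = "".join(raw.split())
        if link:
            links.append(link)
    return links


# Invented feed with the kind of padding described above.
sample = """<rss version="2.0"><channel>
<title>Blog</title>
<link>https://example.com/</link>
<item><title>One</title><link>
    https://example.com/blog/one/
</link></item>
<item><title>Two</title><link> https://example.com/blog/two/ </link></item>
</channel></rss>"""

for link in clean_links(sample):
    print(link)
```

Only the links inside item elements are collected, so the channel's own link to the site home page is ignored.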

Once the list has been created with clean links, I start to create the Markdown files in a local folder. A file is only created if it doesn't already exist. The reason I did this is that, realistically, once a Markdown file is created it won't have changed since the last time I created it, as they are all published blogs. I could have maybe named the files a bit better, maybe using the blog title from the RSS feed, but just now they are saved as blog1, blog2, blog3 and so on.
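The "only make the file if it isn't there yet" rule can be sketched like this. Again this is Python rather than the app's own code, and the function name, the `.md` extension and the folder are assumptions made for the example.

```python
from pathlib import Path


def save_new_posts(markdown_posts, out_dir):
    """Write blog1.md, blog2.md, ... and skip any file that already exists."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for number, markdown in enumerate(markdown_posts, start=1):
        target = out / f"blog{number}.md"
        if target.exists():
            continue  # converted on an earlier run, so leave it untouched
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

One thing to be aware of with numbered names: the number depends on the order of the links in the feed, so if that order shifts between runs, a post can line up with a different file name than before. Naming the files after the blog title would avoid that.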

Here is a demo of it actually running locally on my laptop:

Further development ideas

Some ideas I've still got for this are:

  • Add some commands to the console so that I can either read an RSS feed or a single HTML page.
  • Remove all the hardcoded bits, e.g. the RSS feed and the save location, and give the user an option from the console.
  • Save all the Markdown files to a private repo on Git.

As I mentioned earlier, all my code is open source. It's on my GitHub repo and you're welcome to use it, change it or, if you want to improve it, make a pull request.

Happy coding! 

About the author

Owain

Owain is an Umbraco MVP and an Umbraco Certified Master, and works on Umbraco projects on a daily basis. When not coding, he enjoys running, spending time with his wife and building Lego!

He is also a GitKraken ambassador and helps look after the H5YR.com website.

Where to find Owain

Twitter: @ScottishCoder
LinkedIn: Owain Williams
Our: Our Umbraco