This week we got an interesting question on Twitter. If a CSV file is really big, how can a user convert the data to a .txt file? With Majestic, you could quite easily download a backlinks file with 100 million links in it. What do you do if it is not in the format that you need it?
I started by foolishly not reading the Tweet properly and without reading the 120 Million part… I blithely suggested just saving as .txt via Excel.
I doubt Excel will cope for most computers. I apologize to Bartosz and the general Internet community at large for not reading the question! My team will tell me it is one of my more common failings. Thank goodness I’m not a programmer.
Lukas kindly rephrased the question for me:
Yes… that is indeed the point. If you extract data from our system using the Search Explorer interface only, then you are only getting the top level of data, and we make it really easy for you to export the data in several formats… but as this only has the top links – maybe 20,000 depending on your subscription – its a very different story when you try to download every link. Here the computational work our end is often too much to make the download instant, so you may have to wait a few seconds or minutes whilst we build the report for you to download.
This means the data only comes in a compacted CSV format. The Compacted file at the moment is in a .gz zip file, which can itself be be extracted once downloaded using many free unzip programs such as WinRar. But this will still leave you with a 20 GB CSV file! If you want a text file, what do you do? And how much will it cost?
Fortunately for me, Lukas kindly followed up with a simple (and free) solution…. with no software required.
So just in case you are using a text only browser, or need to cut and paste, Lukas’s answer is to go into your command line, navigate to the right directory and type the command:
“cut -d, -f3 report_source.csv | sed ‘s/”//g > report_links.txt” (without the quotes).
Does this Work?
I decided to get a second opinion on whether this would work, so asked our systems manager. The problem is (and I quote his brains, not mine) “cut and sed are both GNU software which is installed on Linux, Freebsd and so probably Mac, but not Windows”. So firstly, thanks ever so much Lukas. That helps a load. But Windows remains (I think) the most common platform. So I decided to go the extra mile.
I’m no techie, so how do I do it in Windows?
Here at Majestic we like to over deliver – so our systems manager has also created a small batch file which will do this for you in Windows. Here is a step by step set of instructions:
Go into Command mode. (Type cmd in your search box on a Windows computer, that should do it if that stupid tiles layout in Windows 8 hasn’t hidden it beyond the wit of man to find).
Navigate to the directory with the file in. For kids born after Windows 3.11 you change directory by typing “cd directory” (replace “directory” with “myDocuments” or something. You can check the files in the directory you are in by typing “dir *.*”. Make sure the csv file is there. To go up one directory level type “cd..” (Those double dots are not a typo).
Type “Notepad csvtotxt.bat”. This opens Notepad in DOS mode, with a new blank file called csvtotxt.bat. Yep… you are about to write a program! Don’t panic… it is only cut and paste.
Type or Cut and Paste this code into the Notepad.
—cut below this line—
for /f “usebackq tokens=1-13 delims=,” %%A in (“%1%”) do (
—cut above this line—
Save the file and close Notepad. You’ve finished your program.
Now run the program: In the dos prompt type “csvtotxt report_source.csv > report_links.txt” where report_source.csv is the name of your massive .CSV file. report_links.txt will be the name of the file after you have converted it to .txt.
I’m on a Mac… can someone try this and tell me if it works?
Latest posts by Dixon Jones (see all)
- Crawling Smarter on the Infinite Web - March 16, 2017
- Biggest Link Index Hits New Highs with Historic Update - March 14, 2017
- A New Approach to Blogging. Expect a Better Standard. - February 7, 2017