This week we got an interesting question on Twitter. If a CSV file is really big, how can a user convert the data to a .txt file? With Majestic, you could quite easily download a backlinks file with 100 million links in it. What do you do if it is not in the format that you need it?
I started by foolishly not reading the Tweet properly and without reading the 120 Million part… I blithely suggested just saving as .txt via Excel.
I doubt Excel will cope for most computers. I apologize to Bartosz and the general Internet community at large for not reading the question! My team will tell me it is one of my more common failings. Thank goodness I’m not a programmer.
Lukas kindly rephrased the question for me:
Yes… that is indeed the point. If you extract data from our system using the Search Explorer interface only, then you are only getting the top level of data, and we make it really easy for you to export the data in several formats… but as this only has the top links – maybe 20,000 depending on your subscription – its a very different story when you try to download every link. Here the computational work our end is often too much to make the download instant, so you may have to wait a few seconds or minutes whilst we build the report for you to download.
This means the data only comes in a compacted CSV format. The Compacted file at the moment is in a .gz zip file, which can itself be be extracted once downloaded using many free unzip programs such as WinRar. But this will still leave you with a 20 GB CSV file! If you want a text file, what do you do? And how much will it cost?
Fortunately for me, Lukas kindly followed up with a simple (and free) solution…. with no software required.
So just in case you are using a text only browser, or need to cut and paste, Lukas’s answer is to go into your command line, navigate to the right directory and type the command:
“cut -d, -f3 report_source.csv | sed ‘s/”//g > report_links.txt” (without the quotes).
Does this Work?
I decided to get a second opinion on whether this would work, so asked our systems manager. The problem is (and I quote his brains, not mine) “cut and sed are both GNU software which is installed on Linux, Freebsd and so probably Mac, but not Windows”. So firstly, thanks ever so much Lukas. That helps a load. But Windows remains (I think) the most common platform. So I decided to go the extra mile.
I’m no techie, so how do I do it in Windows?
Here at Majestic we like to over deliver – so our systems manager has also created a small batch file which will do this for you in Windows. Here is a step by step set of instructions:
Go into Command mode. (Type cmd in your search box on a Windows computer, that should do it if that stupid tiles layout in Windows 8 hasn’t hidden it beyond the wit of man to find).
Navigate to the directory with the file in. For kids born after Windows 3.11 you change directory by typing “cd directory” (replace “directory” with “myDocuments” or something. You can check the files in the directory you are in by typing “dir *.*”. Make sure the csv file is there. To go up one directory level type “cd..” (Those double dots are not a typo).
Type “Notepad csvtotxt.bat”. This opens Notepad in DOS mode, with a new blank file called csvtotxt.bat. Yep… you are about to write a program! Don’t panic… it is only cut and paste.
Type or Cut and Paste this code into the Notepad.
—cut below this line—
@echo off
for /f “usebackq tokens=1-13 delims=,” %%A in (“%1%”) do (
echo %%~C
)
—cut above this line—
Save the file and close Notepad. You’ve finished your program.
Now run the program: In the dos prompt type “csvtotxt report_source.csv > report_links.txt” where report_source.csv is the name of your massive .CSV file. report_links.txt will be the name of the file after you have converted it to .txt.
I’m on a Mac… can someone try this and tell me if it works?
- How Important will Backlinks be in 2023? - February 20, 2023
- What is in a Link? - October 25, 2022
- An Interview with… Ash Nallawalla - August 23, 2022
I’m on Windows 7… After following all of the steps, the message I received once running the command line for the .bat file was “tokens was unexpected at this time.”
However, I did end up with an empty .txt file with the name that I chose (episodeVIreturnofthecommandline.txt)
January 31, 2014 at 6:11 pm> Hi Nicholas,
February 2, 2014 at 8:15 amIt would appear that wordpress has reformatted the double quotation marks to more aesthetic double quotes, if you edit the file again and replace the double quotes with normal double quotes. I’m on a UK keyboard so it is shift-2 on a US keyboard I think it is shift single quote on the right of the keyboard. Then this should work.
One other method is to use the various “notepad on steroids” text editor products out there. I’ve used UltraEdit 32 for 10+ years for tasks like this… but there are any number of others you could use. Open the file in the editor of your choice and use the search/replace functionality to replace the commas with a space, tab, line break or whatever you want. I’ve stretched UE32 to at least 140m lines.
January 31, 2014 at 8:35 pmI follow your steps but did not get anything.
February 3, 2014 at 7:32 amI found another way around the huge file issue. There are freeware programs that can split csv files without having to open them in Excel first. That makes it easier to look at the data as well.
February 11, 2014 at 6:18 pmWe have a product called Ron’s Editor that can open large csv files. Its got a number of data cleaning and formatting tools. Maybe this could help you out.
February 13, 2014 at 10:06 amThanks for the pointer Sylvie.
February 14, 2014 at 7:24 am