A fixed width file is similar to a csv file, but rather than using a delimiter, each field has a set number of characters. This creates files with all the data tidily lined up with an appearance similar to a spreadsheet when opened in a text editor. This is convenient if you’re looking at raw data files in a text editor, but less ideal when you need to programmatically work with the data.

Fixed width files have a few common quirks to keep in mind:

  • When values don’t consume the total character count for a field, a padding character is used to bring the character count up to the total for that field.
  • Any character can be used as a padding character as long as it is consistent throughout the file. White space is a common padding character.
  • Values can be left or right aligned in a field and alignment must be consistent for all fields in the file.
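
To make padding and alignment concrete, here is a minimal Python sketch that pads a value out to a fixed field width. The helper name `pad_field` and the sample values are made up for illustration. `str.ljust` and `str.rjust` are standard Python string methods.

```python
def pad_field(value, width, pad_char=" ", align="left"):
    """Pad value to exactly `width` characters (hypothetical helper)."""
    text = str(value)
    if len(text) > width:
        raise ValueError(f"{text!r} does not fit in {width} characters")
    # Left-aligned values get their padding on the right, and vice versa.
    if align == "left":
        return text.ljust(width, pad_char)
    return text.rjust(width, pad_char)


# Illustrative values only: a left-aligned code and a zero-padded number.
left = pad_field("abc", 6)
right = pad_field(42, 6, pad_char="0", align="right")
```

Here `left` is `abc` followed by three spaces and `right` is `000042`. Both are exactly 6 characters long, which is what keeps the columns lined up.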

A thorough description of a fixed width file is available here.

Note: Fields in a fixed width file do not all need to have the same character count. For example, in a file with three fields, the first field could be 6 characters wide, the second 20, and the last 9.
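
As a rough sketch of how the 6, 20, and 9 character layout from the note could be read, the Python below slices one line by cumulative widths and strips the padding. The function name `split_fixed_width` and the sample row are assumptions for illustration, not part of any real dataset.

```python
from itertools import accumulate

WIDTHS = [6, 20, 9]  # field widths taken from the note above


def split_fixed_width(line, widths, pad_char=" "):
    """Slice one line into fields and strip the padding from each."""
    ends = list(accumulate(widths))   # end offset of each field
    starts = [0] + ends[:-1]          # start offset of each field
    return [line[s:e].strip(pad_char) for s, e in zip(starts, ends)]


# Made-up sample row: 6 + 20 + 9 = 35 characters in total.
row = "A001".ljust(6) + "Jane Doe".ljust(20) + "1234.50".rjust(9)
fields = split_fixed_width(row, WIDTHS)
```

Stripping from both ends means the same function works for left and right aligned values. It would also remove meaningful characters if the padding character can legitimately appear at the edge of a value, such as zero padding on numbers that end in zero, so that case needs alignment-aware stripping instead.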

Upon initial examination, a fixed width file can look like a tab separated file when white space is used as the padding character. If you’re trying to read a fixed width file as a csv or tsv and getting mangled results, try opening it in a text editor. If the data all line up tidily, it’s probably a fixed width file. Many text editors also give character counts for cursor placement, which makes it easier to spot a pattern in the character counts.
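
The "does it all line up?" check can also be roughed out in code. The sketch below is only a heuristic under assumptions not stated in the text: it reports the character positions that hold the padding character on every sampled line, and it checks for tab characters. The name `shared_padding_columns` and the sample records are hypothetical.

```python
def shared_padding_columns(lines, pad_char=" "):
    """Return the character positions that hold pad_char on every line."""
    # Drop line endings and skip blank lines before comparing positions.
    lines = [line.rstrip("\r\n") for line in lines if line.strip()]
    if not lines:
        return []
    shortest = min(len(line) for line in lines)
    return [
        pos for pos in range(shortest)
        if all(line[pos] == pad_char for line in lines)
    ]


# Made-up records written with field widths of 6, 16, and 8 characters.
records = [
    ("A001", "Jane Doe", "1234.50"),
    ("B20", "Bob", "7.25"),
    ("C3333", "Alexandra Smith", "980.00"),
]
sample = [f"{a:<6}{b:<16}{c:>8}" for a, b, c in records]

gaps = shared_padding_columns(sample)
has_tabs = any("\t" in line for line in sample)
```

Positions that are padding on every line, combined with the absence of tabs, hint at a fixed width layout rather than a tab separated one. Treat the positions as hints only: a space inside a value can line up with padding in other rows by chance, and a small sample may not include the longest value for each field.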

If your file is too large to easily open in a text editor, there are various ways to sample portions of it into a separate, smaller file on the command line. An easy method on a Unix/Linux system is the head command. The example below uses head with -n 50 to read the first 50 lines of large_file.txt and then copy them into a new file called first_50_rows.txt.

head -n 50 large_file.txt > first_50_rows.txt
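
On systems without head, roughly the same sampling can be done in Python. This is a minimal sketch, not a drop-in replacement for the command. The function name `copy_first_lines` is made up, and the files are opened in binary mode so that no text encoding has to be assumed.

```python
from itertools import islice


def copy_first_lines(src_path, dst_path, n=50):
    """Copy the first n lines of src_path into dst_path, like `head -n`."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # islice stops after n lines, so the rest of the file is never read.
        dst.writelines(islice(src, n))
```

Calling `copy_first_lines("large_file.txt", "first_50_rows.txt", n=50)` would mirror the shell example above.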

UniProtKB Database

The UniProt Knowledgebase (UniProtKB) is a freely accessible and comprehensive database of protein sequence and annotation data, available under a CC BY 4.0 license. The Swiss-Prot branch of UniProtKB contains manually annotated and reviewed information about proteins for various organisms. Complete UniProt datasets can be downloaded from ftp.uniprot.org. The data for human proteins are contained in a set of fixed width text files: humchr01.txt through humchr22.txt, humchrx.txt, and humchry.txt.


Source: towardsdatascience.com