What could cause the file command in Linux to report a text file as data?

I have a couple of C++ source files (one .cpp and one .h) that are being reported as type data by the file command in Linux. When I run the file -bi command against these files, I'm given this output (same output for each file):

application/octet-stream; charset=binary

Each file is clearly plain-text (I can view them in vi). What's causing file to misreport the type of these files? Could it be some sort of Unicode thing? Both of these files were created in Windows-land (using Visual Studio 2005), but they're being compiled in Linux (it's a cross-platform application).

Any ideas would be appreciated.

Update: I don't see any null characters in either file. I found some extended characters in the .cpp file (in a comment block), removed them, but file still reports the same encoding. I've tried forcing the encoding in SlickEdit, but that didn't seem to have an effect. When I open the file in vim, I see a [converted] line as soon as I open the file. Perhaps I can get vim to force the encoding?
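
(One way to do that, as a minimal sketch — this assumes the offending bytes are Windows-1252/Latin-1, which is only a guess:

:e ++enc=latin1
:set fileencoding=utf-8
:w

The first command reopens the current file, interpreting it as Latin-1; the second marks the buffer to be written as UTF-8, and :w writes it back out. Whether latin1 is the right source encoding is the part that would need checking.)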

Vim tries very hard to make sense of whatever you throw at it without complaining. That makes it a relatively poor tool for diagnosing file's output.

Vim's "[converted]" notice indicates there was something in the file that vim wouldn't expect to see in the text encoding suggested by your locale settings (LANG etc).

Others have already suggested

  • cat -v
  • xxd

You could try grepping for non-ASCII characters.

  • grep -P '[\x7f-\xff]' filename

The other possibility is non-standard line endings for the platform (i.e. CRLF or CR), but I'd expect file to cope with that and report "DOS text file" or similar.
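
If line endings did turn out to be the culprit, a quick way to check and fix them (a sketch assuming GNU sed; file.cpp stands in for the actual file):

file file.cpp                # should mention "CRLF line terminators" if DOS endings are present
sed -i 's/\r$//' file.cpp    # strip the trailing carriage return from each line, in place

dos2unix file.cpp does the same job if that utility is installed.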

If you run file -D filename, file displays debugging information, including the tests it performs. Near the end, it will show what test was successful in determining the file type.

For a regular text file, it looks like this:

[31> 0 regex,=^package[ \t]+[0-9A-Za-z_:]+ *;,""]  1 == 0 = 0  ascmagic 1  filename.txt: ISO-8859 text, with CRLF line terminators

This tells you what file found in order to decide on that MIME type.

I found the issue using binary search to locate the problematic lines.

head -n {1/2 line count} file.cpp > a.txt
tail -n {1/2 line count} file.cpp > b.txt

Running file against each half, and repeating the process, helped me locate the offending line. I found a Control+P (^P) character embedded in it; removing it solved the problem. I'll write myself a Perl script to search for these (and other extended characters) in the future.
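
For anyone who wants to skip the binary search, a rough one-liner along the lines of that Perl script (the character class is my own guess at "anything outside printable ASCII plus tab/CR/LF"):

perl -ne 'print "$.: $_" if /[^\t\r\n\x20-\x7e]/' file.cpp

It prints the line number and content of every line containing a control or extended character, so a stray ^P shows up immediately.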

A big thanks to everyone who provided an answer for all the tips!

It could be that the files have been saved with a BOM at the beginning of them, although I would have thought a recent-ish version of the file binary should recognise that too.

Have you tried dumping them through something like "head -2 | xxd" and seeing if there's a BOM present?

*BOM = Byte Order Mark, sometimes present in Unicode text files. http://en.wikipedia.org/wiki/Byte_order_mark
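
For reference, a UTF-8 BOM would show up at the very start of the dump as the bytes ef bb bf (a UTF-16 BOM as ff fe or fe ff), something like:

head -c 3 file.cpp | xxd
00000000: efbb bf                                  ...

If an unwanted UTF-8 BOM is found, GNU sed can strip it in place (this relies on GNU sed's \xNN escapes):

sed -i '1s/^\xEF\xBB\xBF//' file.cpp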

It probably is a non-ASCII character from Unicode or some other character set. Since you're using vi, which in most Linux distributions is some version of vim, you can search for that character by typing

/[<Ctrl-V>x80-<Ctrl-V>xff]

and hitting Enter, where <Ctrl-V> means typing v while pressing the Ctrl key. Similarly, you can search for nulls (as Mehrdad suggested) with this:

/<Ctrl-V>x00
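
Outside of vi, a couple of shell equivalents (just a convenience sketch, not something from the thread):

tr -d '\0' < file.cpp | cmp -s - file.cpp || echo "file.cpp contains NUL bytes"
LC_ALL=C grep -n '[^[:print:][:space:]]' file.cpp

The first line reports whether the file contains any NUL bytes; the second lists lines containing bytes outside the printable/whitespace range in the C locale.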

Which charset/encoding/(codepage) are the files in?
Perhaps the files have stray character(s), typically from bad cross-encoding between different platforms. Invalid data in your files may be causing file to report as you have described. You can test the validity of a file for a particular charset encoding by running it through recode (or iconv).

Follow the link for a list of Common character encodings

This script lists charset encodings (from $my_csets) which aren't valid for your file(s). You can list all charsets via: recode -l

file="$1"      my_csets="UTF-16 UTF-8 windows-1250 ASCII"    # Use the next lines to test all charsets  # =======================================  # all_csets=$(recode -l |sed -ne "/^[^:/]/p" | awk '{print $1}')  # my_csets=$all_csets    for cset in $my_csets ;do     <"$1" recode $cset.. &>/dev/null || echo  "$cset  ERROR: $?"  done 

Source: https://superuser.com/questions/411214/what-could-cause-the-file-command-in-linux-to-report-a-text-file-as-data
