Closed Captioning (CC) is more for video than it is for audio, but it's available for both.
An audio or video podcaster would want to include CC either because they are required to by law (some government and all educational institutions must have CC), or because they serve an audience who would use it (such as listeners who are hearing impaired).
Adding text to your media does not add much to the overall file size, so its impact on the cost of hosting the file should be minimal.
Just to put things into perspective: every letter or space is one byte, the average word length is 5-7 letters (we'll use 5 for easy math), and on average you can speak about 50 words a minute. So a 1-hour recording could hold 5*50*60 = 15,000 bytes of text (we don't need to worry about spaces, thanks to compression). Converting that to something readable, that's about 15 KB, or 0.015 MB. We'll use 0.015 MB as a reference, but keep in mind that the text will be compressed when it's saved into the media file, which will shrink it by another 50-75%. Considering that an hour-long recording saved at 96 kbps joint stereo, 44.1 kHz, will be at least 30 MB in size, a 30.015 MB media file does not make much of a difference. Folks usually save their audio at a higher quality than that as well, and their ID3 poster artwork alone is usually 200-300 KB. So I wouldn't sweat the overhead of adding CC to a media file at all.
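If you want to rerun that math with your own numbers, here is a rough Python sketch. The function name and its parameters are made up for illustration, and it assumes one byte per character and constant-bitrate audio. With the figures above, an hour at 96 kbps works out to about 43 MB, which fits the "at least 30 MB" estimate.

```python
def caption_overhead(words_per_minute=50, chars_per_word=5, minutes=60, audio_kbps=96):
    """Estimate uncompressed caption text size and audio size, both in bytes.

    Assumptions (illustrative only): one byte per character, spaces ignored,
    and constant-bitrate audio.
    """
    text_bytes = words_per_minute * chars_per_word * minutes
    # kilobits per second -> bytes per second -> bytes for the whole recording
    audio_bytes = audio_kbps * 1000 // 8 * minutes * 60
    return text_bytes, audio_bytes


text_bytes, audio_bytes = caption_overhead()
print(f"caption text: {text_bytes / 1_000_000:.3f} MB")
print(f"audio:        {audio_bytes / 1_000_000:.1f} MB")
print(f"overhead:     {100 * text_bytes / audio_bytes:.3f} %")
```

Try a faster speaking rate, for example `caption_overhead(words_per_minute=150)`, and the text is still a tiny fraction of the audio.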
I would add that CC is not the same as a text transcript of a recording, particularly with video. CC is designed to accompany the recording and be displayed at specific times as the media plays. It even describes actions in the video, details that a transcript may not include. CC should be written into the content by the creator (not a transcript service) to ensure that the text and action descriptions appear at the right moments and are interpreted correctly for the media content.
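To make the difference from a plain transcript concrete, here is a small Python sketch that turns timed cues into SubRip (SRT) text. SRT is only my choice for illustration, since nothing above depends on a particular caption format, and the cue text and helper names are hypothetical. Each cue carries a start and end time, and the bracketed cue describes an action rather than speech.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(cues):
    """Render (start_seconds, end_seconds, text) tuples as SRT caption blocks."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # A blank line separates consecutive caption blocks.
    return "\n".join(blocks)


# Hypothetical cues: the second one describes an action, not spoken words.
cues = [
    (0.0, 2.5, "Welcome back to the show."),
    (2.5, 4.0, "[door slams]"),
]
print(to_srt(cues))
```

A transcript would keep only the first cue's words; the timing and the action description are what make it captioning.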